Deep Learning is a term frequently used these days by executives and engineers in technology fields ranging from mobile systems to home appliances and automotive. Although deep learning systems achieve unprecedented inference accuracy, they introduce high computational complexity, which calls into question their usability on platforms with limited computational capacity, such as the embedded systems found in many of today's markets: smartphones, IoT devices, home appliances, and so on. A possible workaround to this problem is heterogeneous computing: exploiting every computing resource present on an embedded system (CPU, GPU, DSP) by off-loading part of the workload to each, thereby increasing the overall computational capacity and thus the processing speed. There are cases, however, where a multicore CPU is the only resource available on an embedded system. This raises a reasonable question: is it possible to achieve decent inference speed on a multicore CPU alone?
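As a point of reference for what "using all CPU cores for inference" can look like in practice, the minimal sketch below runs a TensorFlow Lite model with multiple interpreter threads. The original text does not name a framework, model, or thread count; the file name `model.tflite` and the choice of 4 threads are purely illustrative assumptions.

```python
# Minimal sketch: multi-threaded CPU inference with TensorFlow Lite.
# Assumptions (not from the original text): a model file "model.tflite"
# exists, and 4 threads is an illustrative thread count.
import numpy as np
import tensorflow as tf

# Ask the TFLite interpreter to spread inference work across 4 CPU cores.
interpreter = tf.lite.Interpreter(model_path="model.tflite", num_threads=4)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy input of the expected shape and run one inference pass.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]["index"])
```

In such a setup, throughput typically improves with the number of threads up to the point where memory bandwidth or core count becomes the limiting factor; the sections that follow examine how far a multicore CPU alone can go.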