Yes, there is room for optimization. It seems to me that bytesum_intrinsics.c doesn't saturate RAM bandwidth because its loops are not unrolled. You should unroll the loop in addition to using SIMD, so that SIMD loads complete while SIMD adds are done, hiding memory latency. Otherwise you stall waiting on loads.
> so that SIMD loads complete while SIMD adds are done
That's almost certainly already being done by the hardware. Loop unrolling has been counterproductive ever since ~Sandy Bridge or so, and probably hasn't been that great an idea since the post-P4 days.
This is great advice, but I don't think it applies here. If code normally fits in the uop cache but the unrolled version doesn't, you are punished. Here, though, both the normal and the unrolled code fit in the uop cache.
Loop unrolling may be counterproductive, but loop unrolling for vectorization almost certainly isn't, even on current CPUs. It really can't be done by the hardware, and all compilers unroll loops for vectorization, even if you disable loop unrolling in general.
You are correct that both fit in the uop cache, but I think 'userbinator' is right that loop unrolling is of very limited benefit here (and almost everywhere else). Because of speculative execution, processors are basically blind to loops: they keep executing ahead along the most likely path, even if that means simultaneously executing different iterations of a loop. The loads will pre-execute just fine across the loop boundary. While there are cases where reducing the amount of loop logic speeds things up, it really can be done in hardware!
I tested the program with -funroll-loops (which did unroll the sum_array loop) and saw no measurable difference in performance. I guess there are no easy shortcuts available anymore; you'd need to delve into the assembly to push for more performance.