Yes, there is room for optimization. It seems to me that bytesum_intrinsics.c doesn't saturate RAM bandwidth because its loops are not unrolled. You should unroll the loop in addition to using SIMD, so that SIMD loads complete while SIMD adds are done, hiding memory latency. Otherwise you stall waiting on loads.
> so that SIMD loads complete while SIMD adds are done
That's almost certainly already being done by the hardware. Loop unrolling has been counterproductive ever since ~Sandy Bridge or so, and probably hasn't been that great an idea since the post-P4 days.
This is great advice, but I don't think it applies here. If code normally fits in the uop cache but the unrolled version doesn't, you are punished. Here, though, both the normal and the unrolled code fit in the uop cache.
Loop unrolling may be counterproductive, but loop unrolling for vectorization almost certainly isn't, even on current CPUs. It really can't be done by the hardware, and all compilers unroll loops for vectorization, even if you disable loop unrolling in general.
You are correct that both fit in the uop cache, but I think 'userbinator' is right that loop unrolling is of very limited benefit here (and almost everywhere else). Because of speculative execution, processors are basically blind to loops: they keep executing ahead along the most likely path, even if that means simultaneously executing different iterations of a loop. The loads will pre-execute just fine across the loop boundary. While there are cases where reducing the amount of loop logic speeds things up, it really can be done in hardware!
I tested the program with -funroll-loops (which did unroll the sum_array loop) and saw no measurable difference in performance. I guess there are no easy shortcuts available anymore; you'd need to delve into the assembly to push for more performance.