
Author here! Thanks for the detailed breakdown. Let me address a few points:

- C++ SIMD being slower: The standard C++ version uses i & 0x1, which lets the compiler auto-vectorize. With -O3 -ffast-math -march=native, gcc/clang do this very well. The explicit AVX2 version pays overhead for manual vector setup and the horizontal sum at the end. Modern compilers often beat hand-written SIMD on simple loops like this.

- Zig fast-math: Correct. Line 5 has @setFloatMode(.optimized) with a comment saying "like C -ffast-math".

- Julia: Also correct. The loop uses @fastmath @simd for - both annotations applied together.

- Crystal/Odin/Ada being slow: All three use x = -x, which creates a loop-carried dependency that blocks auto-vectorization. The fast implementations use the branchless i & 0x1 trick instead.

- C# SIMD: Uses Vector512<double>, which processes 8 doubles per iteration (512 bits / 64-bit lanes). That explains the ~4x speedup.

- Nim vs C: Both compile via gcc with similar flags. Probably just measurement variance.

- Fortran: Interestingly does NOT use -ffast-math. Uses manual loop unrolling instead (processes 4 terms per iteration).

- Go: You're right that it's the fastest with its own compiler. No LLVM/GCC backend, just Go's own SSA-based compiler.

For suggestions - DMD, LFortran, and Scala Native would be great additions. PRs welcome!


