
mmozeiko · 2026-04-22T18:53:15

sub with identical source and destination (sub reg, reg) is also recognized as a zeroing idiom. Intel documents these in "3.5.1.7 Clearing Registers and Dependency Breaking Idioms" of the Optimization Reference Manual: https://www.intel.com/content/www/us/en/developer/articles/t...

Here's an HTML version: https://zzqcn.github.io/perf/intel_opt_manual/3.html#clearin...

AMD has a similar list in "2.9.2 Idioms for Dependency removal" of the "Software Optimization Guide for the AMD Zen5 Microarchitecture" document: https://docs.amd.com/v/u/en-US/58455_1.00

mmozeiko · 2026-04-16T06:11:13

The xor swap trick was useful in older SIMD (SSE1/SSE2) when you want to swap two values, or leave them alone, based on some condition:

  tmp = (a ^ b) & mask
  a ^= tmp
  b ^= tmp

If mask = 0xfff...fff then a/b will be swapped, otherwise if mask = 0 then they'll remain the same.
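
A scalar sketch of the same idea in C, using uint32_t in place of SSE registers so it runs anywhere. The function names are made up for illustration. With SSE2 the same three steps would map to xor/and operations on whole registers, with the mask typically produced by a compare instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Conditionally swap *a and *b without a branch.
   mask = all ones swaps every bit, mask = 0 swaps nothing,
   any other mask swaps only the bit positions set in it. */
static void masked_swap_u32(uint32_t *a, uint32_t *b, uint32_t mask)
{
    uint32_t tmp = (*a ^ *b) & mask; /* differing bits, limited to mask */
    *a ^= tmp;
    *b ^= tmp;
}

/* Hypothetical self-check covering the three mask cases. */
static void masked_swap_demo(void)
{
    uint32_t a = 0x12345678u, b = 0x9ABCDEF0u;

    masked_swap_u32(&a, &b, 0xFFFFFFFFu);   /* full swap */
    assert(a == 0x9ABCDEF0u && b == 0x12345678u);

    masked_swap_u32(&a, &b, 0u);            /* no change */
    assert(a == 0x9ABCDEF0u && b == 0x12345678u);

    masked_swap_u32(&a, &b, 0x0000FFFFu);   /* swap the low 16 bits only */
    assert(a == 0x9ABC5678u && b == 0x1234DEF0u);
}
```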

CJefferson · 2026-04-16T06:39:41

Oh, that is cool, I’ve never seen that. I might add that to an extended version of the post sometime, I’ll be sure to credit you.

jagged-chisel · 2026-04-16T13:09:08

So mask is marking the bits you want swapped and leaving the others in place.

koolala · 2026-04-16T10:32:47

Could this still be the ideal way for vectors of Ints in WebGL2?

jesse__ · 2026-04-16T09:52:21

That's hella cute

mmozeiko · 2026-03-13T19:03:49

How does it compare to rendering SVG with Direct2D itself? That is, using the ID2D1DeviceContext5::DrawSvgDocument method, where the ID2D1SvgDocument can be loaded from a file with ID2D1DeviceContext5::CreateSvgDocument + SHCreateStreamOnFileW.

mmozeiko · 2025-10-09T20:08:11

From what I understand from the code, those unwarps are just doing a matrix multiply to get the unwarped pixel location? In that case, doing these operations directly in the fragment shader instead of a texture lookup will be faster. Memory bandwidth is not free. But simple ALU work like this (just a couple of FMAs) can easily hide in the shadow of the texture sampling that happens afterwards. So simply upload those undistortion matrices (mat1 & mat2) as uniforms and do the matrix multiply in the shader to adjust the texcoords.

monsecchris · 2025-10-09T20:15:29

map1 and map2 are the same dimensions as the video image.

mmozeiko · 2025-10-09T20:34:04

Ah, I see. Yeah, then the current approach is fine.

mmozeiko · 2025-07-29T18:54:13

There's also Windows.Graphics.Capture. It lets you get a texture not only for the whole desktop, but also for individual windows.

mmozeiko · on May 27, 2025

I don't know why Signal calls it "DRM", because they do not use DRM for this. Typically DRM means encryption & keys are involved (which is what Netflix & others are doing with Widevine or PlayReady).

All Signal does is a simple Windows API call to exclude a window from screen capture: the SetWindowDisplayAffinity function with the WDA_EXCLUDEFROMCAPTURE argument: https://learn.microsoft.com/en-us/windows/win32/api/winuser/...

k_roy · on May 28, 2025

And Microsoft literally calls it the DRM flag. DRM doesn't insist on being encrypted.

https://learn.microsoft.com/en-us/windows/client-management/...

And that "simple Windows API" call is pretty much absolute, since it's across the stack.

yupyupyups · on May 27, 2025

Thanks for the insight, I thought they took advantage of the whole DRM stack (including HDCP in monitors) to encrypt the UI and let the monitor decrypt it.

mmozeiko · on May 16, 2025

If you change the logical and/or (&&, ||) to bitwise and/or (&, |) then it'll be branchless.
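
A small sketch of what that change looks like. The range-check predicate and the function names are hypothetical, since the code under discussion isn't quoted here. Whether the first form actually compiles to branches depends on the compiler, target and flags, so it is worth checking the generated code.

```c
#include <assert.h>
#include <stdbool.h>

/* Short-circuit form: the second comparison is evaluated
   only when the first one is true. */
static bool in_range_logical(int x, int lo, int hi)
{
    return lo <= x && x <= hi;
}

/* Bitwise form: both comparisons are always evaluated and their
   0/1 results are combined with AND, so no control flow is required. */
static bool in_range_bitwise(int x, int lo, int hi)
{
    return (lo <= x) & (x <= hi);
}

/* Hypothetical self-check: both forms agree on every input tried. */
static void in_range_demo(void)
{
    for (int x = -3; x <= 12; x++)
        assert(in_range_logical(x, 0, 9) == in_range_bitwise(x, 0, 9));
}
```

The two forms are interchangeable here only because both comparisons are free of side effects and always safe to evaluate.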

kragen · on May 17, 2025

True: https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename... but I understood hoten to be saying that compilers would generally produce that version from the short-circuiting version, and they don't.

hoten · on May 17, 2025

Yeah I was wrong.

Do we know why the compiler doesn't do it? Surely the output is the same and avoiding branches is clearly faster.

Maybe short circuiting requires such an optimization not be made?

kragen · on May 19, 2025

There are cases where the optimization wouldn't be safe (like i < n && a[i] != k) but this is not one of them. Maybe the compiler is just dum. Or maybe avoiding branches is not clearly faster in cases like this? Have you measured this particular case?
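
To spell out the unsafe case: in the sketch below (function name made up for illustration) the && is what keeps a[i] from being read when i is out of bounds, so evaluating both sides unconditionally would change behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when i is a valid index and a[i] differs from k.
   The short-circuit guarantees a[i] is read only when i < n;
   a bitwise & would read a[i] even for i >= n. */
static bool differs_at(const int *a, size_t n, size_t i, int k)
{
    return i < n && a[i] != k;
}

/* Hypothetical self-check, including the out-of-range index. */
static void differs_at_demo(void)
{
    int a[3] = {1, 2, 3};
    assert(differs_at(a, 3, 1, 5));   /* a[1] = 2 differs from 5 */
    assert(!differs_at(a, 3, 1, 2));  /* a[1] = 2 equals 2 */
    assert(!differs_at(a, 3, 3, 0));  /* i == n, a[3] is never read */
}
```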

mmozeiko · on March 12, 2025

https://github.com/veluca93/fpnge is a very fast png encoder. A bit lower compression ratio, but runs significantly faster than alternatives. Here is a presentation with benchmarks:

https://www.lucaversari.it/FJXL_and_FPNGE.pdf

a_t48 · on March 13, 2025

This worked out for me. What I was actually interested in was reducing CPU utilization, which generally speed is a fine substitute for (the same work being done in a smaller time slice means lower overall utilization). It reduced utilization enough that I'll likely push to use it for production in the future (not immediately, there's bigger low hanging fruit).

a_t48 · on March 12, 2025

This looks right up my alley, thanks! Will give it a shot.

mmozeiko · on Oct 6, 2024

There is a simple way to get that immediate from the expression you want to calculate. For example, if you want to calculate the following expression:

    (NOT A) OR ((NOT B) XOR (C AND A))

then you simply write

    ~_MM_TERNLOG_A | (~_MM_TERNLOG_B ^ (_MM_TERNLOG_C & _MM_TERNLOG_A))

Literally the expression you want to calculate. It evaluates to the immediate using the _MM_TERNLOG_A/B/C constants defined in the intrinsic headers, at least for gcc & clang:

    typedef enum {
      _MM_TERNLOG_A = 0xF0,
      _MM_TERNLOG_B = 0xCC,
      _MM_TERNLOG_C = 0xAA
    } _MM_TERNLOG_ENUM;

For MSVC you define them yourself.
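
A sketch of defining the constants yourself and checking the resulting immediate. The TERN_* names and the scalar ternlog_u32 model are assumptions made for illustration, not part of any header. The model mimics vpternlogd one bit at a time so the check runs without AVX-512. The immediate is masked to 8 bits here because ~ on an int also sets the upper bits.

```c
#include <assert.h>
#include <stdint.h>

/* Same values as _MM_TERNLOG_A/B/C, renamed to avoid clashing with headers. */
enum { TERN_A = 0xF0, TERN_B = 0xCC, TERN_C = 0xAA };

/* (NOT A) OR ((NOT B) XOR (C AND A)), truncated to 8 bits. */
enum { TERN_IMM = (~TERN_A | (~TERN_B ^ (TERN_C & TERN_A))) & 0xFF };

/* Scalar model of vpternlogd for one 32-bit lane: for each bit position,
   the bits of a, b, c form a 3-bit index into the truth table imm. */
static uint32_t ternlog_u32(uint32_t a, uint32_t b, uint32_t c, unsigned imm)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = (unsigned)((((a >> i) & 1u) << 2) |
                                  (((b >> i) & 1u) << 1) |
                                  ((c >> i) & 1u));
        r |= (uint32_t)((imm >> idx) & 1u) << i;
    }
    return r;
}

/* The same expression evaluated directly on the operands. */
static uint32_t expr_direct(uint32_t a, uint32_t b, uint32_t c)
{
    return ~a | (~b ^ (c & a));
}

/* Hypothetical self-check: the table lookup matches the direct expression. */
static void ternlog_demo(void)
{
    uint32_t a = 0x12345678u, b = 0x9ABCDEF0u, c = 0x0F0F0F0Fu;
    assert(ternlog_u32(a, b, c, TERN_IMM) == expr_direct(a, b, c));
}
```

For this expression the immediate works out to 0x9F.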

o11c · on Oct 7, 2024

To take the magic away, write it in binary:

  A = 0b11110000
  B = 0b11001100
  C = 0b10101010

meow_catrix · on Oct 7, 2024

The Amiga manual suggests normalizing to a conjunctive normal form.

bcoates · on Oct 7, 2024

The constant-math method above doesn't require that and works on denormalized expressions, too.

That said, the trick for turning four or more argument bitwise operations into a series of vpternlogd operations has yet to be posted

LegionMammal978 · on Oct 7, 2024

Suppose you have four boolean variables A, B, C, and D. You can calculate X as the result from A, B, C assuming that D is false, and Y as the result assuming that D is true. Then, a third ternary operation can switch between X and Y depending on D. This creates a tree of 3 operations, which I suspect is the best you can do in the worst case.

For five or more arguments, this naturally extends into a tree, though it likely isn't the most efficient encoding.
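
A sketch of the four-input construction described above, again using a scalar model of vpternlogd so it runs without AVX-512. The example function, the helper names and the operand order of the final select are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of vpternlogd for one 32-bit lane. */
static uint32_t ternlog_u32(uint32_t a, uint32_t b, uint32_t c, unsigned imm)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = (unsigned)((((a >> i) & 1u) << 2) |
                                  (((b >> i) & 1u) << 1) |
                                  ((c >> i) & 1u));
        r |= (uint32_t)((imm >> idx) & 1u) << i;
    }
    return r;
}

/* Example function: f(A,B,C,D) = (A AND B) XOR (C OR D). */
enum {
    TA = 0xF0, TB = 0xCC, TC = 0xAA,
    IMM_X   = ((TA & TB) ^ TC) & 0xFF,         /* f with D = 0: (A AND B) XOR C */
    IMM_Y   = (~(TA & TB)) & 0xFF,             /* f with D = 1: NOT (A AND B)   */
    IMM_SEL = ((TA & TB) | (~TA & TC)) & 0xFF  /* first ? second : third        */
};

/* Three ternary operations: X, Y, then a select between them on D. */
static uint32_t f4_ternlog(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t x = ternlog_u32(a, b, c, IMM_X);
    uint32_t y = ternlog_u32(a, b, c, IMM_Y);
    return ternlog_u32(d, y, x, IMM_SEL);
}

/* The same function evaluated directly. */
static uint32_t f4_direct(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    return (a & b) ^ (c | d);
}

/* Hypothetical self-check on one set of operands. */
static void f4_demo(void)
{
    uint32_t a = 0x12345678u, b = 0x9ABCDEF0u, c = 0x0F0F0F0Fu, d = 0x00FF00FFu;
    assert(f4_ternlog(a, b, c, d) == f4_direct(a, b, c, d));
}
```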

mmozeiko · on Sept 17, 2024

Unfortunately not everything. For example, no xbox 360 controller support: https://github.com/microsoft/GDK/issues/39