How does it compare to rendering the SVG with Direct2D itself, using the ID2D1DeviceContext5::DrawSvgDocument method? The ID2D1SvgDocument can be loaded from a file with ID2D1DeviceContext5::CreateSvgDocument + SHCreateStreamOnFileW.
From what I understand from the code, those unwarps are just doing a matrix multiply to get the unwarped pixel location? In that case, doing these operations directly in the fragment shader instead of a texture lookup will be faster. Memory bandwidth is not free, but simple ALU work like this (just a couple of FMAs) can easily hide in the shadow of the texture sampling that happens afterwards. So simply upload those undistortion matrices (mat1 & mat2) as uniforms and do the matrix multiply in the shader to adjust the texcoords.
I don't know why Signal calls it "DRM", because they do not use DRM for this. Typically DRM means encryption & keys are involved (which is what Netflix & others are doing with Widevine or PlayReady).
Thanks for the insight, I thought they took advantage of the whole DRM stack (including HDCP in monitors) to encrypt the UI and let the monitor decrypt it.
There are cases where the optimization wouldn't be safe (like i < n && a[i] != k), but this is not one of them. Maybe the compiler is just dumb. Or maybe avoiding branches is not clearly faster in cases like this? Have you measured this particular case?
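To illustrate why i < n && a[i] != k is the unsafe kind: the right operand is only valid when the left one holds, so evaluating both unconditionally reads past the end. A rough sketch, using Python as a stand-in for C (the function names are made up for illustration; in C the out-of-bounds read would be undefined behavior rather than an exception):

```python
def find_short_circuit(a, n, k):
    """Linear search: a[i] is only read when i < n holds."""
    i = 0
    while i < n and a[i] != k:
        i += 1
    return i


def find_eager(a, n, k):
    """Same loop, but both operands are always evaluated,
    which is what a naive branch-free rewrite would do."""
    i = 0
    while (i < n) & (a[i] != k):  # a[i] is read even when i == n
        i += 1
    return i
```

When k is present both versions agree; when it is absent, the eager version touches a[n], which Python reports as an IndexError.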
https://github.com/veluca93/fpnge is a very fast PNG encoder. It has a somewhat lower compression ratio, but runs significantly faster than the alternatives. Here is a presentation with benchmarks:
This worked out for me. What I was actually interested in was reducing CPU utilization, for which speed is generally a fine proxy (the same work done in a smaller time slice means lower overall utilization). It reduced utilization enough that I'll likely push to use it in production in the future (not immediately, there's bigger low-hanging fruit).
Literally the expression you want to calculate. It evaluates to the immediate from the _MM_TERNLOG_A/B/C constants defined in the intrinsic headers, at least for gcc & clang:
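The trick works because each constant is just the truth-table column of one input, so applying the expression to the constants produces the truth table of the expression. A small sketch in Python, where TERNLOG_A/B/C are local stand-ins for the header constants (the values 0xF0, 0xCC, 0xAA follow from the imm8 bit indexing (a << 2) | (b << 1) | c; check them against your headers) and ternlog is a slow software model written for this example, not a real intrinsic:

```python
# Stand-ins for _MM_TERNLOG_A/B/C: truth-table columns of the three inputs.
# Bit i of the immediate is the output for (a, b, c) = the bits of i, a being the MSB.
TERNLOG_A = 0xF0
TERNLOG_B = 0xCC
TERNLOG_C = 0xAA


def ternlog(a, b, c, imm8, width=64):
    """Bit-by-bit software model of a three-input ternary-logic operation."""
    result = 0
    for bit in range(width):
        index = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        result |= ((imm8 >> index) & 1) << bit
    return result


# Write the expression you want over the constants; the result is the immediate.
imm_xor3 = (TERNLOG_A ^ TERNLOG_B ^ TERNLOG_C) & 0xFF
imm_select = ((TERNLOG_A & TERNLOG_B) | (~TERNLOG_A & TERNLOG_C)) & 0xFF
imm_majority = ((TERNLOG_A & TERNLOG_B) | (TERNLOG_A & TERNLOG_C)
                | (TERNLOG_B & TERNLOG_C)) & 0xFF
```

With these values imm_xor3 is 0x96, imm_select is 0xCA and imm_majority is 0xE8, and ternlog(a, b, c, imm_xor3) matches a ^ b ^ c bit for bit.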
Suppose you have four boolean variables A, B, C, and D. You can calculate X as the result from A, B, C assuming that D is false, and Y as the result assuming that D is true. Then, a third ternary operation can switch between X and Y depending on D. This creates a tree of 3 operations, which I suspect is the best you can do in the worst case.
For five or more arguments, this naturally extends into a tree, though it likely isn't the most efficient encoding.
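The four-input construction can be sketched like this in Python. The ternlog function is a slow software model written for this example, and the 16-bit truth-table layout (index (d << 3) | (a << 2) | (b << 1) | c) is a convention chosen here, not anything defined by the hardware:

```python
TERNLOG_A, TERNLOG_B, TERNLOG_C = 0xF0, 0xCC, 0xAA
# Select: the first input picks the second input (if 1) or the third (if 0).
IMM_SELECT = ((TERNLOG_A & TERNLOG_B) | (~TERNLOG_A & TERNLOG_C)) & 0xFF


def ternlog(a, b, c, imm8, width=64):
    """Bit-by-bit software model of a three-input ternary-logic operation."""
    result = 0
    for bit in range(width):
        index = (((a >> bit) & 1) << 2) | (((b >> bit) & 1) << 1) | ((c >> bit) & 1)
        result |= ((imm8 >> index) & 1) << bit
    return result


def eval4(a, b, c, d, table16, width=64):
    """Evaluate an arbitrary four-input boolean function with three ternlog ops."""
    imm_d0 = table16 & 0xFF          # truth table over A, B, C when D is false
    imm_d1 = (table16 >> 8) & 0xFF   # truth table over A, B, C when D is true
    x = ternlog(a, b, c, imm_d0, width)         # result assuming D is false
    y = ternlog(a, b, c, imm_d1, width)         # result assuming D is true
    return ternlog(d, y, x, IMM_SELECT, width)  # D switches between Y and X per bit
```

For example, table16 = 0x6996 is four-input XOR, so eval4(a, b, c, d, 0x6996) equals a ^ b ^ c ^ d. Specific functions like that one need only two operations, so the tree is an upper bound rather than the optimum for every function.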
Here's an HTML version: https://zzqcn.github.io/perf/intel_opt_manual/3.html#clearin...
AMD has a similar list in "2.9.2 Idioms for Dependency removal" from the "Software Optimization Guide for the AMD Zen5 Microarchitecture" document: https://docs.amd.com/v/u/en-US/58455_1.00