Hacker News | mmozeiko's comments

sub is also recognized as a zeroing idiom for the register file. Intel documents these in "3.5.1.7 Clearing Registers and Dependency Breaking Idioms" of the Optimization Reference Manual: https://www.intel.com/content/www/us/en/developer/articles/t...

Here's an HTML version: https://zzqcn.github.io/perf/intel_opt_manual/3.html#clearin...

AMD has a similar list in "2.9.2 Idioms for Dependency removal" of the "Software Optimization Guide for the AMD Zen5 Microarchitecture" document: https://docs.amd.com/v/u/en-US/58455_1.00


The xor swap trick was useful in older SIMD (SSE1/SSE2) when you want to swap two values, or leave them alone, based on some condition:

  tmp = (a ^ b) & mask
  a ^= tmp
  b ^= tmp
If mask = 0xfff...fff, a and b will be swapped; if mask = 0, they'll remain the same.
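A minimal scalar sketch of the same three operations, in case it helps to see them as a function. The name `masked_swap_u32` is made up for illustration; a SIMD version would apply the same xor/and/xor sequence to whole vector registers, with the mask typically coming from a compare.

```c
#include <stdint.h>

/* Conditionally swap *a and *b without branching.
   mask = all ones swaps every bit, mask = 0 swaps nothing,
   and a partial mask swaps only the bit positions that are set. */
static void masked_swap_u32(uint32_t *a, uint32_t *b, uint32_t mask)
{
    uint32_t tmp = (*a ^ *b) & mask; /* differing bits, limited to mask */
    *a ^= tmp;                       /* flip those bits in a */
    *b ^= tmp;                       /* flip the same bits in b */
}
```

With a partial mask such as 0xFFFF0000, only the upper halves of the two words trade places.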


Oh, that is cool, I’ve never seen that. I might add that to an extended version of the post sometime, I’ll be sure to credit you.


So mask is marking the bits you want swapped and leaving the others in place.


Could this still be the ideal way for vectors of Ints in WebGL2?


That's hella cute


How does it compare to rendering SVG with Direct2D itself, using the ID2D1DeviceContext5::DrawSvgDocument method? An ID2D1SvgDocument can be loaded from a file with ID2D1DeviceContext5::CreateSvgDocument + SHCreateStreamOnFileW.


From what I understand from the code, those unwarps are just doing a matrix multiply to get the unwarped pixel location? In that case, doing these operations directly in the fragment shader instead of a texture lookup will be faster. Memory bandwidth is not free, but simple ALU work like this (just a couple of FMAs) can easily hide in the shadow of the texture sampling that happens afterwards. So simply upload those undistortion matrices (mat1 & mat2) as uniforms and do the matrix multiply in the shader to adjust the texcoords.


map1 and map2 are the same dimensions as the video image.


Ah, I see. Yeah, then the current approach is fine.


There's also Windows.Graphics.Capture. It lets you get a texture not only for the whole desktop, but also for individual windows.


I don't know why Signal calls it "DRM" because they do not use DRM for this. Typically DRM means encryption & keys are involved (which is what Netflix & others are doing with Widevine or PlayReady).

All Signal does is make a simple Windows API call to exclude the window from screen capture, namely the SetWindowDisplayAffinity function with the WDA_EXCLUDEFROMCAPTURE argument: https://learn.microsoft.com/en-us/windows/win32/api/winuser/...


And Microsoft literally calls it the DRM flag. DRM doesn't have to involve encryption.

https://learn.microsoft.com/en-us/windows/client-management/...

And that "simple Windows API" call is pretty much absolute, since it's across the stack.


Thanks for the insight, I thought they took advantage of the whole DRM stack (including HDCP in monitors) to encrypt the UI and let the monitor decrypt it.


If you change the logical and/or to bitwise and/or, it'll be branchless.
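The code under discussion isn't quoted here, so the range check below is a hypothetical stand-in to show what the rewrite looks like. Whether the generated code is actually branch-free depends on the compiler, flags, and target, so treat this as a sketch rather than a guarantee.

```c
#include <stdbool.h>

/* Short-circuit form: the second comparison is evaluated only when the
   first one is true, which a compiler may implement with a branch. */
static bool in_range_logical(int x, int lo, int hi)
{
    return x >= lo && x <= hi;
}

/* Bitwise form: both comparisons are always evaluated (each yields 0 or 1)
   and the results are combined with a plain AND. */
static bool in_range_bitwise(int x, int lo, int hi)
{
    return (x >= lo) & (x <= hi);
}
```

Both functions return the same result for every input, because neither comparison has side effects or can fault.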


True: https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename... but I understood hoten to be saying that compilers would generally produce that version from the short-circuiting version, and they don't.


Yeah I was wrong.

Do we know why the compiler doesn't do it? Surely the output is the same and avoiding branches is clearly faster.

Maybe short circuiting requires such an optimization not be made?


There are cases where the optimization wouldn't be safe (like i < n && a[i] != k) but this is not one of them. Maybe the compiler is just dum. Or maybe avoiding branches is not clearly faster in cases like this? Have you measured this particular case?
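To make the unsafe case concrete, here is a small sketch built around the `i < n && a[i] != k` expression. The function name and signature are invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if index i is in bounds and a[i] differs from k. */
static bool differs_at(const int *a, size_t n, size_t i, int k)
{
    /* && guarantees a[i] is not read when i >= n.
       Rewriting this as (i < n) & (a[i] != k) would read a[i]
       unconditionally, which is out of bounds when i >= n. */
    return i < n && a[i] != k;
}
```

Here the short-circuit is part of the program's correctness, so a compiler can only evaluate both sides eagerly if it can prove the load is safe.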


https://github.com/veluca93/fpnge is a very fast PNG encoder. It has a somewhat lower compression ratio, but runs significantly faster than the alternatives. Here is a presentation with benchmarks:

https://www.lucaversari.it/FJXL_and_FPNGE.pdf


This worked out for me. What I was actually interested in was reducing CPU utilization, for which speed is generally a fine proxy (the same work done in a smaller time slice means lower overall utilization). It reduced utilization enough that I'll likely push to use it in production in the future (not immediately, there's bigger low-hanging fruit).


This looks right up my alley, thanks! Will give it a shot.


There is a simple way to get that immediate from the expression you want to calculate. For example, if you want to calculate the following expression:

    (NOT A) OR ((NOT B) XOR (C AND A))
then you simply write

    ~_MM_TERNLOG_A | (~_MM_TERNLOG_B ^ (_MM_TERNLOG_C & _MM_TERNLOG_A))
Literally the expression you want to calculate. It evaluates to the immediate using the _MM_TERNLOG_A/B/C constants defined in the intrinsic headers, at least for gcc & clang:

    typedef enum {
      _MM_TERNLOG_A = 0xF0,
      _MM_TERNLOG_B = 0xCC,
      _MM_TERNLOG_C = 0xAA
    } _MM_TERNLOG_ENUM;
For MSVC you define them yourself.
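A minimal sketch of why this works, assuming nothing beyond plain C: each bit of the immediate is the result for one combination of the three inputs, and the three constants are just the input columns of that truth table. The names `TL_A`, `TL_B`, `TL_C`, `TL_IMM`, `expr_bit`, `imm_bit`, and `imm_matches_expr` are local to this example (not from any header), and the `& 0xFF` is a precaution added here because `~` on an int also sets the bits above bit 7.

```c
#include <stdint.h>

/* Same values as the _MM_TERNLOG_* constants quoted above, under
   hypothetical local names so no intrinsic header is needed. */
enum { TL_A = 0xF0, TL_B = 0xCC, TL_C = 0xAA };

/* (NOT A) OR ((NOT B) XOR (C AND A)), written directly on the constants
   and masked down to the 8 bits that form the immediate. */
enum { TL_IMM = (~TL_A | (~TL_B ^ (TL_C & TL_A))) & 0xFF };

/* Reference: the same expression evaluated on single-bit inputs (0 or 1). */
static int expr_bit(int a, int b, int c)
{
    return (!a | (!b ^ (c & a))) & 1;
}

/* Result stored in an immediate for inputs (a, b, c):
   the bit at index a*4 + b*2 + c. */
static int imm_bit(uint8_t imm, int a, int b, int c)
{
    return (imm >> ((a << 2) | (b << 1) | c)) & 1;
}

/* Returns 1 if TL_IMM agrees with expr_bit for all eight input combinations. */
static int imm_matches_expr(void)
{
    for (int i = 0; i < 8; i++) {
        int a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
        if (imm_bit(TL_IMM, a, b, c) != expr_bit(a, b, c))
            return 0;
    }
    return 1;
}
```

For this expression the immediate works out to 0x9F.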


To take the magic away, write it in binary:

  A = 0b11110000
  B = 0b11001100
  C = 0b10101010


The Amiga manual suggests normalizing to a conjunctive normal form.


The constant-math method above doesn't require that and works on denormalized expressions, too.

That said, the trick for turning bitwise operations with four or more arguments into a series of vpternlogd operations has yet to be posted.


Suppose you have four boolean variables A, B, C, and D. You can calculate X as the result from A, B, C assuming that D is false, and Y as the result assuming that D is true. Then, a third ternary operation can switch between X and Y depending on D. This creates a tree of 3 operations, which I suspect is the best you can do in the worst case.

For five or more arguments, this naturally extends into a tree, though it likely isn't the most efficient encoding.
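The three-operation tree can be sketched in scalar C as follows. `ternlog32` is a software emulation of a three-input ternary-logic op on 32-bit words, written for this example rather than taken from any header, and the 16-bit truth-table layout (index = d*8 + a*4 + b*2 + c) is an assumption made for the sketch.

```c
#include <stdint.h>

/* Emulates a three-input bitwise ternary-logic op: at each bit position,
   the inputs (a, b, c) select bit a*4 + b*2 + c of imm. */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm)
{
    uint32_t r = 0;
    for (int i = 0; i < 8; i++) {
        if ((imm >> i) & 1) {
            uint32_t ta = (i & 4) ? a : ~a;
            uint32_t tb = (i & 2) ? b : ~b;
            uint32_t tc = (i & 1) ? c : ~c;
            r |= ta & tb & tc; /* positions where (a, b, c) equals the bits of i */
        }
    }
    return r;
}

/* Four-input function given as a 16-bit truth table:
   low byte = results when D is 0, high byte = results when D is 1. */
static uint32_t logic4(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
                       uint16_t table)
{
    uint32_t x = ternlog32(a, b, c, (uint8_t)(table & 0xFF)); /* assuming D = 0 */
    uint32_t y = ternlog32(a, b, c, (uint8_t)(table >> 8));   /* assuming D = 1 */
    /* 0xCA encodes "first ? second : third", i.e. (A & B) | (~A & C),
       so this picks y where d is set and x elsewhere. */
    return ternlog32(d, y, x, 0xCA);
}
```

As an example, (A AND B) XOR (C OR D) has the table 0x3F6A under this layout: with D = 0 it reduces to (A AND B) XOR C = 0x6A, and with D = 1 it reduces to NOT (A AND B) = 0x3F.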


Unfortunately not everything. For example, there is no Xbox 360 controller support: https://github.com/microsoft/GDK/issues/39

