Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'm aware of the 53+3 bit algorithm for addition and subtraction, but I don't see how this could possibly work for multiplication. Could you please point me to a paper that explains how to compute a correctly rounded 53-bit product without computing the full 106-bit product?


I can't find a handy reference, but here is a 5x5 multiplication:

     1XXXX
   * 1XXXX
 _________
     XXXXX
    XXXXX
   XXXXX
  XXXXX
 1XXXX
Using a sticky bit means computing the logical-or of all the bits that are sufficiently far to the right in all these partial results, instead of computing them individually and summing them.

Like for addition, the idea is that the result you obtain is different from the exact result, but rounds the same. The sticky bit only serves to round (midpoint + a small quantity) correctly up.


Let's say that the full product in your example is

  1XXXXYZZZZ
To round it we need:

(1) the X bits for our answer

(2) the Y bit for rounding

(3) the logical-or of the Z bits for rounding

Ok so there might be a trick to compute (3) without computing ZZZZ, but we still want to know how to compute 1XXXXY without computing ZZZZ.


Perhaps a better way to describe it is that during the additions, the accumulator only ever needs to be shifted right in order to be aligned with the next number to add, never left. So the accumulator can be 53+3-bit wide and carry at each step the 53+3-bit representation of the partial sum already computed, and this is enough to produce the correctly rounded result after all numbers have been added.

This explanation assumes that the N partial products are added sequentially. I don't know what techniques are used to achieve the 4- or 5-cycle latencies of modern desktop processors. Implementations with more parallelism and/or that are part of an FMA unit have it harder, but I still expect that they don't naively compute a full width partial result before rounding it (in the case of the FMA, it would be impractical as the partial result could be thousands of bits wide)


Looks like you're actually computing the full product. The only optimization you're doing is forgetting the low order bits as soon as you're done with them. The author is asking to keep them, which sounds reasonable. You initially claimed they were "never computed in the first place", and we're asking you to back up that claim if you still think it's true.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: