Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

A good article, but two quick thoughts:

1. With a good hash function, it's hard to accidentally stumble on the worst-case O(n) lookup cost. But against an adversary, it is usually very easy to have that problem. This has been a well-known security issue for over a decade, and over the last 4 years we're starting to see more exploitation of it.

2. The "little trick" at the end of the document that bumps matching items to the head of their bucket chain will usually provide only a marginal speedup (if many of your lookups hit chains longer than 1, your table is too small or your hash function is broken), and that optimization comes with a cost: lookup now modifies the underlying hash table, which is sometimes very undesirable.



> and over the last 4 years we're starting to see more exploitation of it.

Interesting; this came up in the Lua community and the Lua authors (who I have a ton of respect for) agreed to fix the problem, but thought the threat was overstated. Their argument was that many other forms of DoS attack are lower hanging fruit.

http://article.gmane.org/gmane.comp.lang.lua.general/87667


As to the first point, I remember this being described well in an older Phrack article [1] written by Solar Designer with regards to port scan detection:

In scanlogd, I'm using a hash table to lookup source addresses. This works very well for the typical case as long as the hash table is large enough (since the number of addresses we keep is limited anyway). The average lookup time is better than that of a binary search. However, an attacker can choose her addresses (most likely spoofed) to cause hash collisions, effectively replacing the hash table lookup with a linear search. Depending on how many entries we keep, this might make scanlogd not be able to pick new packets up in time.

[1] http://www.phrack.org/issues.html?issue=53&id=13#article


Do there exist good hash functions that are resistant or immune to the attacks you mention? For example, does Python 3.3's new hash randomization provide adequate protection?


This is what SipHash was made for: it's a cryptographic MAC that's competitive in speed with other hash functions meant for hash table use. The paper describes its design, and the weaknesses with competitors like Python's randomized hashing:

https://www.131002.net/siphash/siphash.pdf

Some code for it here:

https://www.131002.net/siphash/


Interesting, though at 140 cycles to hash 16 bytes, it's much slower than state-of-the-art hash functions, which can hash 16 bytes in just over 16 cycles. If you seed your non-cryptographic hash function with a cryptographically random number, you can get the best of both worlds: fast hashing and collision resistance against an adversary.


Not really. The SipHash page contains proof-of-concept code that breaks common good non-cryptographic randomized hashes, i.e., it generates collisions regardless of the chosen seed.

Regarding performance, SipHash is essentially as fast as Python's randomized hash, and quite faster for long inputs (>16 bytes), cf. http://bench.cr.yp.to/results-auth.html#amd64-sandy .


That code appears to rely on having direct access to the hash function; this seems very unlikely to be leaked to an attacker. More interesting would be code that gets the secret with only access to ordered iteration on the table, which is more likely to leak.

I don't know anything about Python's hash function, all of those benchmarks are still an order of magnitude slower than CityHash (for example), which peaks at 0.16 cycles/byte.

I would be interested to see SipHash put through SMHasher, which is the most comprehensive hash function test suite I know of: http://code.google.com/p/smhasher/


Seed recovery (which requires access to hash function) is for Python hash. For CityHash, MurmurHash (2 and 3) there are seed-independent collisions.

Of course, it passes SMHasher with 10 score -- it's a cryptographic PRF.


> For CityHash, MurmurHash (2 and 3) there are seed-independent collisions.

Are you referring to the "Advanced hash flooding" section of the paper, or am I missing this in another place? That analysis appears to rely on the hash function being highly linear/periodic (ie. you can always add "l" to get back the same hash value). I don't think that City, Spooky, or Murmur are nearly this linear. For example, City is based on CRC32 (ie. polynomial division).

Don't get me wrong, I'm impressed that SipHash is so fast for a cryptographic hash. But modern non-cryptographic hash functions are so fast (especially when they're CPU-accelerated, like City is on Intel thanks to the CRC32 instruction) that I don't think SipHash can claim to be the obvious choice for all hash tables.

I am interested to know though if the random seed + City/Spooky/Murmur approach really does have practical hash flooding attacks.


It's pretty quick to generate a lot of collisions, and the set of inputs you get will collide independently of the seed. Look (CityHash64):

$ time ./citycollisions 1000000 > /dev/null

real 0m0.904s

user 0m0.439s

sys 0m0.008s

$ ./citycollisions 3

128-bit key 741b848b0065aec04bb25b26ca2c3e9

CityHash64( 8e69324ad2a005ff2148534148202020, 16 ) = d3290c3d600b8add

CityHash64( 7ae31886221136ba2148534148202021, 16 ) = d3290c3d600b8add

CityHash64( a9a6b9c6888e94ff2148534148202022, 16 ) = d3290c3d600b8add

$ ./citycollisions 3

128-bit key 5e2a23015c89da98172fbcb77ad98468

CityHash64( 8e69324ad2a005ff2148534148202020, 16 ) = 59d0b041b16594b

CityHash64( 7ae31886221136ba2148534148202021, 16 ) = 59d0b041b16594b

CityHash64( a9a6b9c6888e94ff2148534148202022, 16 ) = 59d0b041b16594b


Wow, collisions against the best non-cryptographic hash functions that are independent of seed... color me impressed! This does then seem to strongly suggest that a cryptographic hash function is the only way to defend against hash flooding attacks. Props to the SipHash authors, this is some ground-breaking work.


See Downloads section here https://www.131002.net/siphash/

Edit: wait, I'm mistaken -- these are not independent of seed.


I think they are, actually! The code just uses "seed" in a slightly confusing way -- it's seeding its own algorithm with "seed" but using "key" as the seed for the hash function.

This is really impressive work.


Aha, thanks for clearing that up.


Isn't Python open source? Along with a large quantity of server-side software? I'd assume everyone already knows my hash functions, I didn't make any new ones.


He means access to [seeded] hash function output.


The strategy you propose gives immunity to the more straightforward attacks on hash tables, but in practice tends to leak information about your secret key through differences in the time taken by hash table operations. If you check out pages 14--16 of the SipHash paper, it goes into a bit more detail about this. You may also find Appendix B entertaining.


Can you name some of these speedy state-of-the-art hash functions?



But you said 16 cycles.

"For example, hashing 16 bytes takes 141 Bulldozer cycles with SipHash-2-4, against 82 and 126 for CityHash and SpookyHash, and 600 for MD5."

https://www.131002.net/siphash/siphash.pdf (p.12)


According to CityHash's benchmarks (http://code.google.com/p/cityhash/source/browse/trunk/README), they can hash 8 bytes in 6ns and 64 bytes in 9ns on a 2.67GHz machine (so 16-24 cycles).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: