1. With a good hash function, it's hard to accidentally stumble on the worst-case O(n) lookup cost. But against an adversary, it is usually very easy to have that problem. This has been a well-known security issue for over a decade, and over the last four years we've started to see more exploitation of it.
2. The "little trick" at the end of the document that bumps matching items to the head of their bucket chain will usually provide only a marginal speedup (if many of your lookups hit chains longer than 1, your table is too small or your hash function is broken), and that optimization comes with a cost: lookup now modifies the underlying hash table, which is sometimes very undesirable.
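To make point 2 concrete, here is a minimal sketch (my own illustration, not the document's code) of a chained hash table with that move-to-front trick; note how a successful lookup mutates the bucket, which is the side effect being objected to:

```python
class ChainedHashTable:
    """Toy chained hash table with the move-to-front 'little trick'."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        bucket.append((key, value))

    def lookup(self, key):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                if i > 0:
                    # Move-to-front: a *read* now writes to the table,
                    # which is the downside noted above.
                    bucket.insert(0, bucket.pop(i))
                return v
        return None
```

If chains are almost always length 0 or 1 (as they should be with a decent hash function and load factor), the `i > 0` branch rarely fires, which is why the speedup is marginal in the healthy case.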
> and over the last 4 years we're starting to see more exploitation of it.
Interesting; this came up in the Lua community and the Lua authors (who I have a ton of respect for) agreed to fix the problem, but thought the threat was overstated. Their argument was that many other forms of DoS attack are lower hanging fruit.
As to the first point, I remember this being described well in an older Phrack article [1] written by Solar Designer with regards to port scan detection:
> In scanlogd, I'm using a hash table to lookup source addresses. This works very well for the typical case as long as the hash table is large enough (since the number of addresses we keep is limited anyway). The average lookup time is better than that of a binary search. However, an attacker can choose her addresses (most likely spoofed) to cause hash collisions, effectively replacing the hash table lookup with a linear search. Depending on how many entries we keep, this might make scanlogd not be able to pick new packets up in time.
Do there exist good hash functions that are resistant or immune to the attacks you mention? For example, does Python 3.3's new hash randomization provide adequate protection?
This is what SipHash was made for: it's a cryptographic MAC that's competitive in speed with other hash functions meant for hash table use. The paper describes its design, and the weaknesses of competitors like Python's randomized hashing:
Interesting, though at 140 cycles to hash 16 bytes, it's much slower than state-of-the-art hash functions, which can hash 16 bytes in just over 16 cycles. If you seed your non-cryptographic hash function with a cryptographically random number, you can get the best of both worlds: fast hashing and collision resistance against an adversary.
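The "seed your non-cryptographic hash with a random number" idea looks something like the sketch below. I've used 64-bit FNV-1a as a stand-in (the comment above is about faster functions like CityHash, so this is an assumption for illustration only), XORing a per-process random seed into the offset basis:

```python
import os

# Per-process secret seed, chosen once at startup. With seed=0 this
# degenerates to plain 64-bit FNV-1a.
SEED = int.from_bytes(os.urandom(8), "little")

FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3
MASK64 = 0xFFFFFFFFFFFFFFFF

def seeded_fnv1a(data: bytes, seed: int = SEED) -> int:
    """64-bit FNV-1a with the seed XORed into the offset basis."""
    h = (FNV_OFFSET ^ seed) & MASK64
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    return h
```

As the replies that follow point out, this randomization alone is not sufficient for some hash families: if collisions exist that are independent of the seed, the attacker never needs to learn the secret at all.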
Not really. The SipHash page contains proof-of-concept code that breaks common good non-cryptographic randomized hashes, i.e., it generates collisions regardless of the chosen seed.
That code appears to rely on having direct access to the hash function; this seems very unlikely to be leaked to an attacker. More interesting would be code that gets the secret with only access to ordered iteration on the table, which is more likely to leak.
I don't know anything about Python's hash function, but all of those benchmarks are still an order of magnitude slower than CityHash (for example), which peaks at 0.16 cycles/byte.
I would be interested to see SipHash put through SMHasher, which is the most comprehensive hash function test suite I know of: http://code.google.com/p/smhasher/
> For CityHash, MurmurHash (2 and 3) there are seed-independent collisions.
Are you referring to the "Advanced hash flooding" section of the paper, or am I missing this in another place? That analysis appears to rely on the hash function being highly linear/periodic (i.e., you can always add "l" to get back the same hash value). I don't think that City, Spooky, or Murmur are nearly this linear. For example, City is based on CRC32 (i.e., polynomial division).
Don't get me wrong, I'm impressed that SipHash is so fast for a cryptographic hash. But modern non-cryptographic hash functions are so fast (especially when they're CPU-accelerated, like City is on Intel thanks to the CRC32 instruction) that I don't think SipHash can claim to be the obvious choice for all hash tables.
I am interested to know, though, whether the random seed + City/Spooky/Murmur approach really is vulnerable to practical hash flooding attacks.
Wow, collisions against the best non-cryptographic hash functions that are independent of seed... color me impressed! This does then seem to strongly suggest that a cryptographic hash function is the only way to defend against hash flooding attacks. Props to the SipHash authors, this is some ground-breaking work.
I think they are, actually! The code just uses "seed" in a slightly confusing way -- it's seeding its own algorithm with "seed" but using "key" as the seed for the hash function.
Isn't Python open source? Along with a large quantity of server-side software? I'd assume everyone already knows my hash functions, I didn't make any new ones.
The strategy you propose gives immunity to the more straightforward attacks on hash tables, but in practice it tends to leak information about your secret key through differences in the time taken by hash table operations. Pages 14-16 of the SipHash paper go into a bit more detail about this. You may also find Appendix B entertaining.