I did something similar when I wanted to search the HIBP database and if you are okay with some false positives you can do better than your results, both in terms of speed and size.
If you are okay with false positives, you can use a bloom filter and tune the number of false positives you want. I chose a false positive rate of 1 in a million so my data structure was still very accurate in determining whether a password was already hacked.
It only took 30 microseconds to determine if a password was in the list and for size, was at the theoretical limit of 22 bits per element or ~1.5gb.
I originally used a bloom filter which made it 2gb but given a bloom filter was just a sequence of 0s and 1s, I was able to use a golomb coding to shrink it down to 1.5gb.
The time to process the original 24gb however, is something that I could have improved, but I kinda lost interest once I already had something that was at the theoretical minimum size, as well as able to determine a password exists within 30 microseconds.
If you are okay with false positives, you can use a bloom filter and tune the number of false positives you want. I chose a false positive rate of 1 in a million so my data structure was still very accurate in determining whether a password was already hacked.
It only took 30 microseconds to determine if a password was in the list and for size, was at the theoretical limit of 22 bits per element or ~1.5gb.
I originally used a bloom filter which made it 2gb but given a bloom filter was just a sequence of 0s and 1s, I was able to use a golomb coding to shrink it down to 1.5gb.
The time to process the original 24gb however, is something that I could have improved, but I kinda lost interest once I already had something that was at the theoretical minimum size, as well as able to determine a password exists within 30 microseconds.
Anyways take a look if you're interested in trying a different approach: https://github.com/terencechow/PwnedPasswords