TL;DR: P0 collaborated with DeepMind to make "Big Sleep," an AI agent (using Gemini 1.5 Pro) that can look through commits, spot potential issues, and then run test cases to find bugs. The agent found one in SQLite that was recent enough that it hadn't made it into an official release yet. They then tried to see if it could have been found with AFL, but the fuzzer didn't find the issue after 150 CPU-hours.
Using a fuzzer was a terrible point of comparison. Fuzzers are the slowest, most resource-hungry way to find bugs. The team would be better off comparing against static analyzers, which find bugs fast. In this case, Infer might do, since it’s designed to catch those errors.
My concept was running a bunch of open-source static analyzers, with LLMs essentially filtering out the false positives. The LLMs can do that analytically or by generating test cases that prove the bug. It might also be easier to fine-tune open models for this, since the job is narrower.
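
To make the idea concrete, here's a rough Python sketch of that pipeline: merge the reports from several analyzers, drop duplicates, and keep only the findings a judge accepts. Everything in it is hypothetical. `Finding`, `dedupe`, `triage`, and the analyzer names are made up for illustration, and `keyword_judge` is a trivial stand-in for where the LLM call (or a generated proof-of-concept test) would go.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Finding:
    """One warning from a static analyzer (hypothetical, simplified shape)."""
    tool: str
    file: str
    line: int
    message: str


# A judge returns True when it believes the finding is a real bug.
# In the real pipeline this would wrap an LLM call or a generated test case.
Judge = Callable[[Finding], bool]


def dedupe(findings: Iterable[Finding]) -> List[Finding]:
    """Keep the first finding per (file, line) so overlapping tools count once."""
    seen = set()
    unique = []
    for f in findings:
        key = (f.file, f.line)
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique


def triage(findings: Iterable[Finding], judge: Judge) -> List[Finding]:
    """Return only the deduplicated findings the judge accepts as real."""
    return [f for f in dedupe(findings) if judge(f)]


def keyword_judge(finding: Finding) -> bool:
    """Placeholder judge: NOT an LLM, just a keyword check for the demo."""
    return "overflow" in finding.message.lower()


reports = [
    Finding("analyzer_a", "btree.c", 120, "possible buffer overflow"),
    Finding("analyzer_b", "btree.c", 120, "Buffer overflow on write"),
    Finding("analyzer_a", "util.c", 7, "unused variable"),
]

for f in triage(reports, keyword_judge):
    print(f"{f.file}:{f.line} [{f.tool}] {f.message}")
```

The point of the `Judge` type is that the filter is swappable: the same loop works whether the judge reasons about the code analytically or tries to generate a test case that triggers the bug.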