How is that a direct comparison? The link you gave has a quote that says it’s not:
> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints
They pointed the models at the known vulnerable functions and gave them a hint. The hint part is what really breaks this comparison because they were basically giving the model the answer.
No one is suggesting your nested for loop idea because it won't actually work in practice. In short, the signal-to-noise ratio will be too low - you will need to comb through a ton of false positives in order to find anything valuable, at which point it stops looking like "automated security research" and starts looking like "normal security research".
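To make the false-positive argument concrete, here is a back-of-envelope sketch. Every number in it is a made-up assumption for illustration (the fraction of vulnerable functions, the detection rate, the false-alarm rate); none of them comes from Aisle or anyone else's measurements.

```python
def triage_load(num_functions, vulnerable_fraction,
                true_positive_rate, false_positive_rate):
    """Return (total_flags, precision) for a scan over num_functions.

    All inputs are hypothetical assumptions, not measured values.
    """
    vulnerable = num_functions * vulnerable_fraction
    clean = num_functions - vulnerable
    true_flags = vulnerable * true_positive_rate    # real bugs that get flagged
    false_flags = clean * false_positive_rate       # clean functions flagged anyway
    total_flags = true_flags + false_flags
    precision = true_flags / total_flags if total_flags else 0.0
    return total_flags, precision


# Assumed: 10,000 functions, 0.1% vulnerable, 80% detection, 5% false alarms.
flags, precision = triage_load(10_000, 0.001, 0.8, 0.05)
print(f"{flags:.0f} flags to triage, {precision:.1%} of them real")
```

With those assumed rates you end up with roughly 500 flags to review by hand, of which fewer than 2% are real. Change the rates and the picture changes, which is the point of trying it.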
If you don't believe me, you should try it yourself; it's only a couple of dollars. Hey, maybe you're right, and you can prove us all wrong. But I'd bet at good odds that you're not.
Aisle said they pointed it at the function, not the file. So the number of LLM turns would be something like (number of functions) * (number of possible hints) * (number of repos).
Could indeed be a useful exercise to benchmark the cost.
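A rough sketch of that estimate, assuming one LLM turn per (function, hint) pair. The token count per turn and the price per million tokens are placeholders I made up; plug in real numbers for whatever model you use.

```python
def estimate_scan(num_repos, functions_per_repo, hints_per_function,
                  tokens_per_turn, usd_per_million_tokens):
    """Return (turns, cost_usd) for a per-function scan with hints.

    Assumes one turn per (function, hint) pair and a flat token price.
    All parameter values are hypothetical.
    """
    turns = num_repos * functions_per_repo * hints_per_function
    cost_usd = turns * tokens_per_turn / 1_000_000 * usd_per_million_tokens
    return turns, cost_usd


# Assumed: 1 repo, 2,000 functions, 5 candidate hints each,
# 3,000 tokens per turn, $1 per million tokens.
turns, cost = estimate_scan(1, 2_000, 5, 3_000, 1.0)
print(turns, cost)
```

The turn count grows multiplicatively, so adding more candidate hints or more repos scales the bill linearly in each factor.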
This would still be more limited, since many vulnerabilities only become apparent when you consider more context than a single function. I think there were vulnerabilities of that kind in the published materials. So maybe the Aisle case is also picking the low-hanging fruit in this respect.