Researcher spends $1,500 testing whether LLMs can hack a vulnerable app
A security researcher built a deliberately vulnerable book review app and spent $1,500 testing whether 16 large language models could exploit a common misconfiguration in it. GPT-5.5 succeeded in 7 of 10 attempts, while several other models, including Gemini and Claude, either hit security guardrails or never found the right attack vector.


