The Doom Test: A Benchmark for AI Progress
- Casey Fox
- Mar 19
- 2 min read
At Tekletics, we’re always pushing the boundaries of AI and automation. One of our favorite ways to benchmark AI progress is with a simple but powerful test—can an AI build a version of Doom with a single prompt?
If you’ve spent any time on the internet, you’ve probably seen videos of people running the classic Doom game on absurdly underpowered hardware—think calculators, smart fridges, and even a Commodore 64. There’s an entire site dedicated to this madness called Can It Run Doom?, and scrolling through it is an experience in itself. Fun fact: even your Google search bar and Microsoft Word can run Doom.
The AI Doom Test
For the past year, we’ve been applying this same spirit of experimentation to large language models (LLMs). Our personal benchmark? Seeing if we could get an AI to build a simple version of Doom from a single prompt. We’ve tested nearly every major LLM: OpenAI’s GPT-3.5 and its successors, Gemini, Claude, Llama, DeepSeek, and more. Every single one failed on the first attempt. Sure, with continued prompting we could eventually guide them into producing something functional, but none succeeded from the initial instruction alone.
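If you want to try the test yourself, here’s a minimal sketch of what a single-prompt run might look like, assuming the OpenAI Python SDK; the prompt text, model name, and output filename are illustrative placeholders rather than the exact ones we used:

```python
# Minimal sketch of a single-prompt run, assuming the OpenAI Python SDK
# (pip install openai) and an OPENAI_API_KEY in the environment.
# The prompt text, model name, and output filename are illustrative.
import re
from pathlib import Path

from openai import OpenAI

PROMPT = (
    "Write a complete, self-contained HTML file containing a playable "
    "first-person shooter in the spirit of Doom, using only JavaScript "
    "and the canvas element. Return only the code."
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4.5-preview",  # swap in whichever model you're benchmarking
    messages=[{"role": "user", "content": PROMPT}],
)
text = response.choices[0].message.content

# Pull the first fenced code block out of the reply, or fall back to the
# whole message if the model returned bare code.
fence = "`" * 3
match = re.search(fence + r"(?:\w+)?\n(.*?)" + fence, text, re.DOTALL)
Path("doom_attempt.html").write_text(match.group(1) if match else text)
print("Saved doom_attempt.html - open it in a browser and see what happens.")
```

The rule of the game is that whatever lands in that file has to run as-is: no follow-up prompts, no hand edits.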
That changed with Grok 3 and GPT-4.5, the first two models to generate a functional, playable game from a single prompt. To be clear, neither truly replicated Doom, but both produced code that compiled into an actual working game. What fascinated us was how differently each model tackled the problem: each had its own strengths and weaknesses in approach, logic, and implementation.
The Future: AI Agents & Iterative Refinement
For us, this experiment is more than a fun test. It highlights why AI agents are the future. With older models, we had to manually guide them through multiple iterations to refine the result. In the near future, AI agents will self-refine their outputs, continuously improving their work until they reach a high-confidence solution. That iterative capability is what will push AI from being just a tool to being an autonomous problem-solving assistant.
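As a rough sketch of what that self-refinement loop could look like (again assuming the OpenAI Python SDK, with check_game_runs as a hypothetical placeholder for whatever compile-or-run check you’d apply):

```python
# Rough sketch of the self-refinement loop described above, again assuming
# the OpenAI Python SDK. check_game_runs() is a hypothetical placeholder
# for whatever compile-or-run test you would apply to the generated code.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4.5-preview"  # illustrative model name


def check_game_runs(code: str) -> str | None:
    """Return None if the game works, or an error message on failure."""
    raise NotImplementedError("plug in your own compile/run check here")


def refine(prompt: str, max_rounds: int = 5) -> str:
    """Generate code, test it, and feed failures back until it works."""
    messages = [{"role": "user", "content": prompt}]
    code = ""
    for _ in range(max_rounds):
        reply = client.chat.completions.create(model=MODEL, messages=messages)
        code = reply.choices[0].message.content
        error = check_game_runs(code)
        if error is None:
            return code  # high-confidence result: it compiles and runs
        # This feedback step is exactly what agents automate instead of a
        # human re-prompting by hand.
        messages.append({"role": "assistant", "content": code})
        messages.append(
            {"role": "user", "content": f"That failed with: {error}. Fix it."}
        )
    return code  # best effort after max_rounds
```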
Your Thoughts?
We’d love to hear from you. Do you have a personal test that you run on every new AI model? If so, what’s your benchmark? And if you’ve tried Grok 3 or GPT-4.5, which do you think delivered the better working result on the first try? Let’s discuss!