Counter-Strike Becomes the New Benchmark for Vibe Coding

The question now isn’t whether AI can vibe-code games. It is whether game-building becomes the new baseline for judging what an AI model can do.

There is no doubt that gaming and AI are deeply intertwined. Beyond the fact that several veteran AI builders are avid players of strategy games like Dota 2, firms like OpenAI and Google DeepMind have long been training AI agents within such game environments.

Now, however, generative AI seems to be edging closer to actually creating games, and doing so with nothing more than prompts fed into vibe coding tools.

For instance, Stepan Parunashvili, co-founder and CTO of InstantDB, did not set out to write a manifesto for the next phase of AI development. He only wanted to see what would happen if the year’s most powerful models tried to build the same thing under pressure. 

His choice was not a text parser or an algorithmic puzzle. It was Counter-Strike, or at least something meant to look like a first-person shooter. Opinions on Hacker News were mixed, with some calling the results great and others likening them to a junior developer's project.

For Parunashvili, the rules were simple. The game had to run in the browser, it had to be 3D, it had to be multiplayer, and it had to be built by the model itself. No human patches. No hand-coded rescue missions. 
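
The article does not name the libraries the models reached for, but the constraints alone imply a familiar scaffold: a scene, a camera, a renderer and a render loop. A minimal sketch of that scaffold, assuming a common choice like Three.js rather than whatever the models actually generated:

```typescript
import * as THREE from "three";

// Minimal browser scene: the kind of scaffold "run in the browser, be 3D" implies.
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(
  75,
  window.innerWidth / window.innerHeight,
  0.1,
  1000
);
const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

// A placeholder "player" box standing in for a character model.
const player = new THREE.Mesh(
  new THREE.BoxGeometry(1, 2, 1),
  new THREE.MeshNormalMaterial()
);
scene.add(player);
camera.position.set(0, 2, 5);

// Render loop: input handling, physics and network sync all hang off this tick.
renderer.setAnimationLoop(() => {
  renderer.render(scene, camera);
});
```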

The goal was not just a working game from each model; the exercise offered a new way to measure AI systems.

The New Benchmarks Are the Same

For context, in November, AI labs released their sharpest tools. GPT-5.1 Codex Max, Gemini 3 Pro and Claude Opus 4.5 arrived almost on top of each other. Instead of comparing them on sterile benchmarks, Parunashvili asked a cleaner question: If you hand them a real project where everything can break at once, how do they behave? 

Parunashvili, in his YouTube video, walked through each step. Watching the agents build, break, adjust, rebuild and finally stabilise a multiplayer shooter gives a strange new picture of AI progress. 

Suhail Doshi, former CEO of Mixpanel, described the challenge as “one way you can sense what’s coming next as a result of AI progress.” And that’s what it is. What made the experiment striking was not the success but the split personality of the results.

Claude built the nicest world. Its maps had shape. Its characters looked almost human. Its gun animations felt natural. 

Gemini handled the backend like a seasoned systems engineer. It synced movement across players, handled rooms and saved maps without drama. 

Codex Max landed somewhere in between. It fixed its mistakes, held the project together and rarely became confused.

These differences mirror the ones visible in the coding tests and benchmarks we have covered before.

Read: GPT-5.1 vs Gemini 3 Pro vs Claude Opus 4.5

Claude becomes the careful executor when the work demands clarity. Gemini becomes a deep reader when the work demands structure. Codex becomes a dependable worker when the work demands long sessions without losing track. 

A comparison of the three models on separate coding challenges mapped neatly onto the Counter-Strike results. 

Opus 4.5 handled ambiguous engineering tasks better than anyone. Codex-Max stayed alert across long debugging loops. Gemini excelled at reasoning tasks that required long context and tight logic.

So, What’s the Takeaway?

The Counter-Strike test compressed all of this into a few hours of building maps, enemies, guns, sound and multiplayer rooms. 

At the frontend stage, Claude won everything from polygons to sound effects. When the task switched to presence, shooting logic and persistence, Gemini became the strongest. Codex stayed steady. It rarely produced the prettiest output or the deepest insight, but it adapted without falling over.
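
The article does not detail how each clone wired up presence and persistence, but "synced movement across players" and "handled rooms" usually reduce to a small relay. A minimal sketch, assuming a plain WebSocket server built on the ws package and an illustrative message shape, not anything the models are confirmed to have produced:

```typescript
import { WebSocketServer, WebSocket } from "ws";

// Illustrative message shape; the actual clones' protocols are not described in the article.
type MoveMsg = { type: "move"; room: string; player: string; x: number; y: number; z: number };

const wss = new WebSocketServer({ port: 8080 });
const rooms = new Map<string, Set<WebSocket>>(); // room id -> connected sockets

wss.on("connection", (socket) => {
  socket.on("message", (raw) => {
    const msg = JSON.parse(raw.toString()) as MoveMsg;

    // Presence: register the socket in its room on first message.
    if (!rooms.has(msg.room)) rooms.set(msg.room, new Set());
    rooms.get(msg.room)!.add(socket);

    // Sync: relay the position update to everyone else in the same room.
    for (const peer of rooms.get(msg.room)!) {
      if (peer !== socket && peer.readyState === WebSocket.OPEN) {
        peer.send(JSON.stringify(msg));
      }
    }
  });

  socket.on("close", () => {
    for (const members of rooms.values()) members.delete(socket);
  });
});
```

Persisting saved maps would sit behind a similar interface, writing room state to a store instead of relaying it.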

The one place where Claude stumbled was in the React refactor. useEffect ran twice, two canvases appeared and the animation loops duplicated. It was the same kind of trouble Claude faces in messy codebases. Parunashvili pointed out that this was not a model problem but a broader developer experience problem. 

Humans also get tripped up by the same hooks. He said the task showed the gap between “strictly vibe coding” and real engineering. That gap is where the next generation of tools must operate.
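
The symptom Parunashvili describes has a well-known trigger: React 18’s Strict Mode mounts, unmounts and remounts components in development, so an effect that appends a canvas and starts an animation loop without a cleanup does both twice. A hedged reconstruction of that pattern and the cleanup that avoids it, not the code Claude actually wrote:

```tsx
import { useEffect, useRef } from "react";
import * as THREE from "three";

// Hedged reconstruction of the failure mode, plus the cleanup that fixes it.
// In React 18 Strict Mode (development), effects run mount -> cleanup -> mount.
// Without a cleanup, that means two canvases in the DOM and two animation loops.
function GameCanvas() {
  const mountRef = useRef<HTMLDivElement>(null);

  useEffect(() => {
    const renderer = new THREE.WebGLRenderer();
    renderer.setSize(window.innerWidth, window.innerHeight);
    mountRef.current?.appendChild(renderer.domElement);

    const scene = new THREE.Scene();
    const camera = new THREE.PerspectiveCamera(75, window.innerWidth / window.innerHeight);
    renderer.setAnimationLoop(() => renderer.render(scene, camera));

    // The fix: tear everything down when the effect re-runs or the component unmounts.
    return () => {
      renderer.setAnimationLoop(null);
      renderer.domElement.remove();
      renderer.dispose();
    };
  }, []);

  return <div ref={mountRef} />;
}
```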

The multiplayer pass exposed another truth. 

Gemini kept running builds to catch errors before the user noticed. Codex relied on introspecting libraries. Claude read the documentation step by step.

These styles matter because they shape how the future of automated coding feels. A model that tests itself takes work off developers’ plates. A model that reads documents but does not experiment will move carefully but slowly.
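
That “tests itself” behaviour is mechanically simple, even though the article does not show the harness behind it. A minimal sketch of such a check-yourself loop, with the build command and the retry budget as illustrative assumptions:

```typescript
import { execSync } from "node:child_process";

// Run the project's build and capture the compiler output instead of surfacing it.
// "npm run build" and the retry budget of three are illustrative assumptions.
function buildIsClean(): { ok: boolean; errors: string } {
  try {
    execSync("npm run build", { stdio: "pipe" });
    return { ok: true, errors: "" };
  } catch (err: any) {
    return { ok: false, errors: String(err.stdout ?? "") + String(err.stderr ?? "") };
  }
}

for (let attempt = 1; attempt <= 3; attempt++) {
  const { ok, errors } = buildIsClean();
  if (ok) break;
  // In a real agent harness, these errors would be fed back into the model's context
  // for another editing pass before the user ever sees a broken build.
  console.log(`Attempt ${attempt} failed:\n${errors}`);
}
```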

All three models produced almost “working” Counter-Strike clones with no human code. That is important. A game forces the entire stack into motion. Physics, lighting, sound, UI, networking, persistence, permissions and refactoring collide in a small space. 

The test becomes a live arena where a model’s style can be seen as clearly as its skills.

The takeaway is sharper than any benchmark. Benchmarks tell you how a model performs on a clean question. Counter-Strike tells you how a model behaves when the work is dirty. 

Claude builds beautiful worlds until the foundation shifts. Gemini handles chaos in the backend without blinking. Codex quietly finishes the job.

Parunashvili’s simple prompt has become a lens for where AI tools are going next. It is also a warning. “The promise that you never have to look at the code doesn’t quite feel real yet,” he said.

The question now is not whether AI can vibe-code games. It is whether game-building becomes the new baseline for judging what an AI model is capable of.
