Advancing AI Benchmarking with Game Arena
by salkahfi on 2/2/2026, 5:49:07 PM
https://blog.google/innovation-and-ai/models-and-research/google-deepmind/kaggle-game-arena-updates/
Comments
by: ofirpress
This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash. We have agents implement agents that play games against each other, so Claude isn't playing against GPT, but an agent written by Claude plays poker against an agent written by GPT, and this really tough task leads to very interesting findings on AI for coding. https://codeclash.ai/
2/2/2026, 6:23:51 PM
by: ZeroCool2u
I'd really like to see them add a complex, open-world, fully physicalized game like Star Citizen (assuming the game itself is stable) with a single primary goal, like accumulating currency, as a measure of general autonomy and a proxy for how the model might behave in the real world given access to a bipedal robot.
2/2/2026, 7:41:38 PM
by: cv5005
My personal threshold for AGI is when an AI can 'sit down' - it doesn't need to have robotic hands, but it needs to only use visual and audio inputs to make its moves - and complete a modern RPG or FPS single player game that it hasn't pre-trained on (it can train on older games).
2/2/2026, 6:27:51 PM
by: 10xDev
If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead? This applies to other domains as well.
2/2/2026, 6:54:06 PM
by: tiahura
How about nethack?
2/2/2026, 6:12:12 PM
by: eamag
Curious why they decided to curate poker hands instead of using normal poker
2/2/2026, 6:08:09 PM
by: bennyfreshness
Wow. I'm generally in the AI maximalist camp. But adding Werewolf feels dangerous to me. Anyone who's played knows lying, deceit, and manipulation are often key to winning. Do we really want models climbing this benchmark?
2/2/2026, 7:15:23 PM
by: PunchyHamster
Making models target a benchmark about being good at lying and getting away with it (Werewolf) is certainly an interesting choice
2/2/2026, 8:06:39 PM
by: chaostheory
Anecdotal data point, but recently I’ve found Gemini to perform better than ChatGPT when it came to intent analysis.
2/2/2026, 6:21:58 PM
by: simianwords
Gemini tops all benchmarks, but when it comes to real-world usage it is genuinely unusable
2/2/2026, 7:07:41 PM