Hacker News Viewer

Advancing AI Benchmarking with Game Arena

by salkahfi on 2/2/2026, 5:49:07 PM

https://blog.google/innovation-and-ai/models-and-research/google-deepmind/kaggle-game-arena-updates/

Comments

by: ofirpress

This is a good way to benchmark models. We [the SWE-bench team] took the meta-version of this and implemented it as a new benchmark called CodeClash:

We have agents implement agents that play games against each other, so Claude isn't playing against GPT, but an agent written by Claude plays poker against an agent written by GPT, and this really tough task leads to very interesting findings on AI for coding.

https://codeclash.ai/

2/2/2026, 6:23:51 PM


by: ZeroCool2u

I'd really like to see them add a complex open world fully physicalized game like Star Citizen (assuming the game itself is stable) with a single primary goal like accumulating currency as a measure of general autonomy and a proxy for how the model might behave in the real world given access to a bipedal robot.

2/2/2026, 7:41:38 PM


by: cv5005

My personal threshold for AGI is when an AI can 'sit down' (it doesn't need to have robotic hands, but it needs to only use visual and audio inputs to make its moves) and complete a modern RPG or FPS single player game that it hasn't pre-trained on (it can train on older games).

2/2/2026, 6:27:51 PM


by: 10xDev

If AI can program, why does it matter if it can play Chess using CoT when it can program a Chess Engine instead? This applies to other domains as well.

2/2/2026, 6:54:06 PM


by: tiahura

How about NetHack?

2/2/2026, 6:12:12 PM


by: eamag

Curious why they decided to curate poker hands instead of playing normal poker.

2/2/2026, 6:08:09 PM


by: bennyfreshness

Wow. I'm generally in the AI maximalist camp. But adding Werewolf feels dangerous to me. Anyone who's played knows lying, deceit, and manipulation are often key to winning. We really want models climbing this benchmark?

2/2/2026, 7:15:23 PM


by: PunchyHamster

Making models target a benchmark about being good at lying and getting away with it (Werewolf) is certainly an interesting choice.

2/2/2026, 8:06:39 PM


by: chaostheory

Anecdotal data point, but recently I’ve found Gemini to perform better than ChatGPT when it came to intent analysis.

2/2/2026, 6:21:58 PM


by: simianwords

Gemini tops all benchmarks, but when it comes to real-world usage it is genuinely unusable.

2/2/2026, 7:07:41 PM