Unknown publication · 2026-02-23

TetrisBench: How Language Models Develop Playing Strategies Through Code Generation

Yoko Li built TetrisBench, a benchmark where LLMs compete at Tetris by generating scoring functions rather than making move-by-move decisions. The experiment reveals distinct playing styles across models, with some prioritizing aggressive early strategies while others favor conservative, long-term approaches. Results show that strategy intervention frequency and long-horizon optimization differ significantly among models, even when performance metrics are similar.
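To make the "scoring function instead of move-by-move decisions" idea concrete, here is a minimal sketch of the kind of board-evaluation function a model might generate. Everything in it is an assumption for illustration: the board representation, the four features (cleared lines, aggregate height, holes, bumpiness), the weight values, and the helper names `column_heights`, `count_holes`, `score_board`, and `pick_best` are hypothetical and not taken from TetrisBench itself. The game engine would presumably enumerate the boards that result from each legal placement and keep the one with the highest score.

```python
from typing import Dict, List, Sequence

# Assumed representation: rows listed top to bottom, 1 = filled cell, 0 = empty.
Board = List[List[int]]

# Illustrative weights only; a model would choose (and later revise) its own.
WEIGHTS: Dict[str, float] = {
    "lines": 0.76,       # reward completed rows
    "height": -0.51,     # penalize tall stacks
    "holes": -0.36,      # penalize empty cells buried under blocks
    "bumpiness": -0.18,  # penalize uneven surfaces
}


def column_heights(board: Board) -> List[int]:
    """Height of each column, measured from the floor to its topmost block."""
    rows, cols = len(board), len(board[0])
    heights = []
    for c in range(cols):
        height = 0
        for r in range(rows):
            if board[r][c]:
                height = rows - r
                break
        heights.append(height)
    return heights


def count_holes(board: Board) -> int:
    """Count empty cells that have at least one filled cell above them."""
    holes = 0
    for c in range(len(board[0])):
        seen_block = False
        for r in range(len(board)):
            if board[r][c]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes


def score_board(board: Board, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of board features; higher means a more desirable board."""
    heights = column_heights(board)
    features = {
        "lines": sum(1 for row in board if all(row)),
        "height": sum(heights),
        "holes": count_holes(board),
        "bumpiness": sum(abs(a - b) for a, b in zip(heights, heights[1:])),
    }
    return sum(weights[name] * value for name, value in features.items())


def pick_best(candidates: Sequence[Board]) -> int:
    """Index of the candidate board (one per legal placement) with the top score."""
    return max(range(len(candidates)), key=lambda i: score_board(candidates[i]))
```

Under this framing, a "strategy intervention" would amount to the model rewriting the weights or the feature set between moves, rather than choosing individual placements.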


Metrics in this report

Points Per Move (efficiency metric across games)

Gemini 3 Pro average: 109.3 pts

Gemini 3 Flash average: 110.4 pts

Strategy Interventions Per Game (strategy update frequency)

Gemini 3 Pro (minimum): 1.22 updates/game

Gemini 3 Flash (maximum): 2.68 updates/game

Win Rate (model-vs-model Tetris matches, 800+ games)

Gemini 3 Pro (highest): 62.0%

Gemini 3 Flash: 60.3%