TetrisBench: How Language Models Develop Playing Strategies Through Code Generation
Yoko Li built TetrisBench, a benchmark in which LLMs compete at Tetris by generating scoring functions rather than making move-by-move decisions. The experiment reveals distinct playing styles across models: some prioritize aggressive early strategies, while others favor conservative, long-term approaches. Results show that models differ significantly in how often they intervene to update their strategy and in how they optimize over long horizons, even when their performance metrics are similar.
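To make the idea of "generating a scoring function" concrete, here is a minimal sketch of the kind of board-evaluation heuristic a model might write. It is not TetrisBench's actual code: the board representation, the feature set (cleared lines, column heights, holes, bumpiness), the function names, and the weights are all assumptions chosen for illustration.

```python
from typing import List

# Hypothetical board format: rows listed top to bottom, 1 = filled, 0 = empty.
Board = List[List[int]]


def column_heights(board: Board) -> List[int]:
    """Height of each column, measured from the floor to its topmost block."""
    rows = len(board)
    heights = []
    for c in range(len(board[0])):
        height = 0
        for r in range(rows):
            if board[r][c]:
                height = rows - r
                break
        heights.append(height)
    return heights


def count_holes(board: Board) -> int:
    """Count empty cells that have at least one filled cell above them."""
    holes = 0
    for c in range(len(board[0])):
        seen_block = False
        for r in range(len(board)):
            if board[r][c]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes


def score_board(
    board: Board,
    w_lines: float = 1.0,    # illustrative weights, not tuned values
    w_height: float = -0.5,
    w_holes: float = -0.7,
    w_bump: float = -0.2,
) -> float:
    """Weighted sum of simple board features; higher means a better board."""
    heights = column_heights(board)
    full_lines = sum(1 for row in board if all(row))
    bumpiness = sum(abs(a - b) for a, b in zip(heights, heights[1:]))
    return (
        w_lines * full_lines
        + w_height * sum(heights)
        + w_holes * count_holes(board)
        + w_bump * bumpiness
    )


def pick_best(candidates: List[Board]) -> Board:
    """Choose the candidate resulting board with the highest score."""
    return max(candidates, key=score_board)
```

Under this framing, the game engine would enumerate the boards that result from each legal placement and call something like `pick_best` on them, so a model shapes play by rewriting the scoring function (or its weights) rather than by choosing individual moves.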
Metrics in this report
Gemini 3 Pro average: 109.3 pts (efficiency metric across games)
Gemini 3 Flash average: 110.4 pts (efficiency metric across games)
Gemini 3 Pro (minimum): 1.22 updates/game (strategy update frequency)
Gemini 3 Flash (maximum): 2.68 updates/game (strategy update frequency)
Gemini 3 Pro highest: 62.0% (model-vs-model Tetris matches, 800+ games)
Gemini 3 Flash: 60.3% (model-vs-model Tetris matches, 800+ games)