Our goal is to test the reasoning capabilities of various LLMs using Jane Street's monthly puzzles. Of course, many benchmarks already exist. These include Word Game Bench, which evaluates model performance on Wordle and Connections, and the AIME benchmark, which compares model reasoning capabilities using AIME competition math problems.
One key advantage of Jane Street puzzles is that they are guaranteed to be novel. Data leakage is a common problem when evaluating models: testing on data a model may have seen during training leads to inaccurate, and often overly optimistic, estimates of model accuracy. Many LLMs are trained on data from the far reaches of the internet, so it is very possible that they have seen past games of Wordle and competition math problems. The solutions to Jane Street puzzles, however, are withheld from the public until they are officially published. Therefore, when we benchmark a model on a puzzle issued after the model's release date (and hence after its training cutoff), we can be confident the model is being tested on unseen data.
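To make that filtering concrete, here is a minimal sketch of how puzzles could be restricted to those issued after a given model's release date. The `Puzzle` dataclass, the `MODEL_RELEASE` table, and all names and dates in it are illustrative placeholders, not part of any actual benchmark harness.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record of a monthly puzzle; field names are illustrative.
@dataclass
class Puzzle:
    name: str
    issued: date  # month the puzzle was posted

# Illustrative model release dates (placeholders, not authoritative).
MODEL_RELEASE = {
    "model-a": date(2024, 5, 1),
    "model-b": date(2024, 11, 1),
}

def unseen_puzzles(puzzles: list[Puzzle], model: str) -> list[Puzzle]:
    """Keep only puzzles issued strictly after the model's release date.

    Content created after a model was released cannot appear in its
    training data, so the release date is a conservative leakage cutoff.
    """
    cutoff = MODEL_RELEASE[model]
    return [p for p in puzzles if p.issued > cutoff]

if __name__ == "__main__":
    puzzles = [
        Puzzle("march-2024", date(2024, 3, 1)),
        Puzzle("january-2025", date(2025, 1, 1)),
    ]
    print([p.name for p in unseen_puzzles(puzzles, "model-a")])
    # -> ['january-2025']
```

Using the release date rather than the (often unpublished) training cutoff is deliberately conservative: it may discard a few usable puzzles, but it never admits one the model could have seen.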