Our goal is to test the reasoning capabilities of various LLMs using Jane Street's monthly puzzles. Of course, many benchmarks already exist. These include Word Game Bench, which evaluates model performance on Wordle and Connections, and the AIME benchmark, which compares model reasoning capabilities using AIME competition math problems.
One key advantage of Jane Street puzzles is that they are guaranteed to be novel. Data leakage is a common problem when evaluating models: testing on data a model may have seen during training leads to inaccurate, and often overly optimistic, estimates of model accuracy. Many LLMs are trained on data from the far reaches of the internet, so it is very possible that they have seen past games of Wordle and competition math problems. The solutions to Jane Street puzzles, however, are withheld from the public until they are officially published. Therefore, when we benchmark a model on a puzzle issued after the model's release date (and hence after its training cutoff), we can be confident the model is being tested on unseen data.
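To make that filtering concrete, here is a minimal sketch of how puzzles could be restricted to those issued after a given model's release date. The `Puzzle` dataclass, the `MODEL_RELEASE` table, and all names and dates in it are illustrative placeholders, not part of any actual benchmark harness.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record of a monthly puzzle; field names are illustrative.
@dataclass
class Puzzle:
    name: str
    issued: date  # month the puzzle was posted

# Illustrative model release dates (placeholders, not authoritative).
MODEL_RELEASE = {
    "model-a": date(2024, 5, 1),
    "model-b": date(2024, 11, 1),
}

def unseen_puzzles(puzzles: list[Puzzle], model: str) -> list[Puzzle]:
    """Keep only puzzles issued strictly after the model's release date.

    Content created after a model was released cannot appear in its
    training data, so the release date is a conservative leakage cutoff.
    """
    cutoff = MODEL_RELEASE[model]
    return [p for p in puzzles if p.issued > cutoff]

if __name__ == "__main__":
    puzzles = [
        Puzzle("march-2024", date(2024, 3, 1)),
        Puzzle("january-2025", date(2025, 1, 1)),
    ]
    print([p.name for p in unseen_puzzles(puzzles, "model-a")])
    # -> ['january-2025']
```

Using the release date rather than the (often unpublished) training cutoff is deliberately conservative: it may discard a few usable puzzles, but it never admits one the model could have seen.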