Hello again! I've been working on an AI model designed to tackle tough math competition questions from the AI Mathematical Olympiad (AIMO) Progress Prize 2, and I wanted to share some recent discoveries and ideas I've come across along the way.
Why This Matters to Me
I've always believed that advanced math problems offer a great test of an AI's reasoning capabilities. Participating in a Kaggle challenge that sets a target of at least 47 correct solutions out of 50 pushes me to think beyond just final answers. It's a chance to learn how an AI can break down problems, plan each step, and keep its reasoning on track.
Building on TIR Insights
My main starting point has been the NuminaMath TIR approach, which builds on the tool-integrated reasoning methods outlined in the ToRA paper. That paper describes how to interleave tool calls (typically Python execution) with a language model's natural reasoning. Although the original system aims at agent-based problem solving, I've tried similar ideas in my own setup, with mixed results:
Agentic Frameworks: I explored Agent Zero, expecting it to give my model direct access to tools such as a code interpreter or external search. However, many math-focused models responded poorly to chain-of-thought prompts inside agentic pipelines, which prompted me to reconsider how best to guide them.
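To make the TIR idea concrete, here is a minimal sketch of the generate-execute-feedback loop I have in mind. It's my rough reading of the pattern, not NuminaMath's or ToRA's actual code: `generate` is a placeholder for whatever model call you use, and the `<code>` tags are an illustrative delimiter I chose for this sketch.

```python
import os
import re
import subprocess
import sys
import tempfile

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def run_python(code: str, timeout: int = 10) -> str:
    """Execute a generated snippet in a subprocess and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "TIMEOUT"
    finally:
        os.unlink(path)

def tir_solve(problem: str, generate, max_rounds: int = 4) -> str:
    """Alternate model reasoning with code execution, TIR-style."""
    transcript = (
        f"Problem: {problem}\n"
        "Solve step by step. Wrap any computation in <code>...</code> tags."
    )
    for _ in range(max_rounds):
        step = generate(transcript)        # model writes reasoning and/or code
        transcript += "\n" + step
        match = CODE_RE.search(step)
        if match is None:                  # no code emitted: treat as final answer
            return step
        output = run_python(match.group(1))           # run the model's code
        transcript += f"\n<output>{output}</output>"  # feed the result back
    return transcript
```

The key design point is the last line of the loop: execution results are appended to the transcript so the next generation can react to them, which is what separates TIR from plain chain-of-thought.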
Data, Data, Data
NuminaMath CoT & TIR
I started with the NuminaMath CoT dataset and its TIR counterpart. These sets are great for math problems at or slightly below IMO level.
I eventually realized they weren't enough for the hardest problems, so I went looking for more challenging datasets.
Omni Math & AoPS
I incorporated difficult samples from Omni Math, selected above a chosen difficulty threshold (a filtering sketch follows below), along with classic problems from the AoPS dataset.
This helped broaden the range of math topics and problem structures.
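For the difficulty filtering, a few lines with the `datasets` library go a long way. This assumes the public Omni-MATH release and its numeric difficulty field; the hub id and the cutoff of 6.0 below are illustrative, not a recommendation.

```python
from datasets import load_dataset

MIN_DIFFICULTY = 6.0  # illustrative cutoff: keep the harder end of the scale

# Hub id of the public Omni-MATH release; adjust if your copy differs.
omni = load_dataset("KbsdJames/Omni-MATH", split="test")
hard = omni.filter(lambda ex: ex["difficulty"] >= MIN_DIFFICULTY)
print(f"kept {len(hard)} of {len(omni)} problems")
```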
Synthetic Solutions & Model Diversity
I gathered solutions from multiple open-source models, such as Marco-o1 and QwQ-32B-Preview.
By asking each to solve the same question in different ways, I ended up with a more diverse training set.
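Roughly, the collection step looks like the sketch below. `solve_with` and `extract_answer` are hypothetical hooks for your own inference and answer parsing, and the majority-vote filter is one plausible way to avoid training on confidently wrong solutions, not a method I'm claiming from any paper.

```python
from collections import Counter

def build_synthetic_set(problems, models, solve_with, extract_answer, n_samples=4):
    """Sample solutions from several models, keep only answer-consistent ones."""
    kept = []
    for problem in problems:
        # Gather candidate solutions from every model.
        candidates = [
            (model, solve_with(model, problem))
            for model in models
            for _ in range(n_samples)
        ]
        answers = Counter(extract_answer(sol) for _, sol in candidates)
        majority, count = answers.most_common(1)[0]
        if majority is None or count < 2:
            continue  # no consensus: too risky to train on
        kept.extend(
            {"problem": problem, "solution": sol, "source": model}
            for model, sol in candidates
            if extract_answer(sol) == majority
        )
    return kept
```

Keeping the `source` field around also makes it easy to check later whether one model's solutions dominate the mix.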
Discovering Maisa AI
One interesting find has been Maisa AI. While exploring tool-based approaches, I tested Maisa's system on around 600 sample math problems and was impressed with its consistency. I plan to analyze its accuracy rate more carefully and see if any of its reasoning patterns can inform my own model's training strategy. Maisa's unique approach includes:
Multistep Reasoning and Execution: It combines a Reasoning Engine with an Execution Engine to tackle each step carefully.
Computational Validation: Each result is checked or calculated directly, which cuts down on errors.
Traceability: Every decision is explained and logged, making debugging less of a headache.
I'm considering ways to learn from these features—particularly the idea of validating partial solutions automatically.
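As a toy example of what automatic validation of a partial step could look like (my own sketch, not Maisa's actual mechanism), even a numeric spot-check with sympy catches many bad algebraic manipulations:

```python
import random
import sympy as sp

def check_step(lhs: str, rhs: str, symbols: str = "x", trials: int = 20) -> bool:
    """Numerically spot-check a claimed identity lhs == rhs at random points."""
    syms = sp.symbols(symbols)
    syms = syms if isinstance(syms, tuple) else (syms,)
    left, right = sp.sympify(lhs), sp.sympify(rhs)
    for _ in range(trials):
        point = {s: random.uniform(-10, 10) for s in syms}
        if abs(complex(left.subs(point)) - complex(right.subs(point))) > 1e-6:
            return False  # the two sides disagree at a sample point
    return True

print(check_step("x**2 - 1", "(x - 1)*(x + 1)"))  # True: valid factoring
print(check_step("x**2 - 1", "(x - 1)**2"))       # False: bogus step
```

A check like this won't prove a step correct, but it rejects wrong ones cheaply, which is exactly what I'd want before letting the model build on an intermediate result.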
Refining My Approach
Step-by-Step Fine-Tuning: My model is being trained to show its full thought process rather than just a final answer, which makes it easier to spot where the reasoning goes wrong and how to fix it.
Selective Tool Integration: While the full agentic pipeline was tricky for math-specific LLMs, I still keep track of how (and when) the model might benefit from external code or symbolic math tools.
Feedback Loops: I've started building a small validation set to test new ideas quickly before using competition submission slots on Kaggle.
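The harness itself doesn't need to be fancy. Something like this is enough for quick iteration, where `predict` is a stand-in for the full solve pipeline and answers are the integers the AIMO format expects:

```python
def evaluate(validation_set, predict):
    """Score a candidate pipeline on a small local validation set.

    `validation_set` is a list of {"problem": str, "answer": int} dicts;
    exact match is the right metric since AIMO answers are integers.
    """
    correct = sum(
        predict(item["problem"]) == item["answer"] for item in validation_set
    )
    accuracy = correct / len(validation_set)
    print(f"{correct}/{len(validation_set)} correct ({accuracy:.1%})")
    return accuracy
```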
Where Everything Stands
Consolidated Dataset: NuminaMath CoT, TIR, Omni Math, AoPS, and my synthetic expansions are merged into a single resource.
Fine-Tuning Ongoing: The model shows promise in step-by-step logic. I'm aiming to match or surpass the performance of other open-source solutions.
Next Steps
Further Tool Experiments: Investigate how to embed a code runner or symbolic solvers directly into the workflow; a minimal solver sketch follows this list.
Performance Analysis: Compare results with solutions from Maisa AI on selected test sets to see where my model might improve.
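For the symbolic-solver direction, the minimal version I picture is exposing something like `sympy.solve` as a callable tool. The `solve_tool` name and interface below are mine, purely for illustration:

```python
import sympy as sp

def solve_tool(equation: str, variable: str = "x") -> list[str]:
    """Parse an 'lhs = rhs' equation string and return exact solutions."""
    var = sp.symbols(variable)
    lhs, rhs = equation.split("=")
    solutions = sp.solve(sp.Eq(sp.sympify(lhs), sp.sympify(rhs)), var)
    return [str(s) for s in solutions]

print(solve_tool("x**2 - 5*x + 6 = 0"))  # ['2', '3']
```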
Thanks for checking out my work! If you're curious about any of these approaches or want to exchange ideas about AI-driven math solutions, feel free to get in touch. I'm excited to keep fine-tuning my system and seeing how far I can push its reasoning skills.