
Breaking Through the Ceiling: How OpenAI's O-Series Implements Reinforcement Learning in LLMs

Dennis Hulsebos

What if the limits of AI weren’t limits at all, but merely symptoms of the methods used? The traditional approach to training large language models (LLMs), scaling up datasets and parameters, was delivering diminishing returns, raising doubts about further progress. OpenAI’s O3 model has shattered these assumptions, demonstrating that the limits of LLMs were not inherent but a result of outdated training methods. By integrating reinforcement learning techniques into LLM training, O3 has unlocked new potential in AI. However, its groundbreaking achievements come with significant challenges, including prohibitive compute costs and long inference times. OpenAI’s plan to release the model by 2025 suggests confidence that these obstacles can be overcome.


A Strategic Bet on LLMs

When OpenAI began its journey, it explored a variety of AI approaches, including reinforcement learning and robotics. Projects like Dactyl, which trained a robotic hand to solve a Rubik's Cube using reinforcement learning, demonstrated the organisation’s ability to innovate across multiple domains.


However, OpenAI ultimately made the strategic decision to focus on large language models (LLMs), recognising their potential to advance AI capabilities at scale. This choice was not without uncertainty. Many in the field believed that scaling LLMs—making them larger and training them on increasingly vast datasets—would eventually hit a wall, delivering diminishing returns. Sam Altman, OpenAI's CEO, acknowledged this challenge, emphasising that future progress would require improving reasoning and problem-solving capabilities, rather than simply increasing model size.


The O-series is a testament to this vision, representing a groundbreaking evolution within the LLM framework. By integrating reinforcement learning and structured reasoning into the traditional LLM approach, OpenAI has transcended the scaling limitations many once feared. The O-series doesn’t abandon the LLM path but redefines it, building on OpenAI’s early innovations while charting a new course for the future of AI.


Reinforcement Learning: The Key to O3's Success

Unlike conventional LLMs, O3 shifts the focus from simple next-word prediction to solving problems through structured reasoning. The model learns to break a problem into smaller reasoning steps and to predict sequences of steps that lead logically to a correct answer, much like solving a puzzle one move at a time. This marks a fundamental departure from traditional LLM training.


This process begins with a base model that generates multiple candidate solutions to a problem. A verifier model then evaluates these solutions, identifying errors and ranking the most accurate answers. By fine-tuning the model on verified correct reasoning steps, OpenAI has created a system capable of tackling complex tasks in domains such as mathematics and coding, where correctness can be objectively determined.
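The generate, verify, and collect loop described above can be illustrated with a toy sketch. This is not OpenAI's implementation: `toy_base_model`, `verifier`, and `collect_verified` are hypothetical names, the "model" is a random arithmetic solver standing in for an LLM, and the verifier is a simple pass/fail check rather than a learned ranking model. The sketch only shows how verified reasoning traces could be gathered into a fine-tuning set; the fine-tuning step itself is left out.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    steps: list[str]  # intermediate reasoning steps
    answer: int       # final answer proposed by the toy model


def toy_base_model(a: int, b: int, c: int, rng: random.Random) -> Candidate:
    """Hypothetical stand-in for an LLM: solves a*b + c in two steps, sometimes wrongly."""
    product = a * b
    if rng.random() < 0.4:  # inject an arithmetic slip 40% of the time
        product += rng.choice([-1, 1])
    total = product + c
    steps = [f"{a} * {b} = {product}", f"{product} + {c} = {total}"]
    return Candidate(steps, total)


def verifier(a: int, b: int, c: int, cand: Candidate) -> bool:
    """Hypothetical verifier: arithmetic answers can be checked objectively."""
    return cand.answer == a * b + c


def collect_verified(problems, n_samples: int, seed: int = 0):
    """Sample n candidates per problem and keep only those the verifier accepts."""
    rng = random.Random(seed)
    dataset = []
    for a, b, c in problems:
        candidates = [toy_base_model(a, b, c, rng) for _ in range(n_samples)]
        verified = [cand for cand in candidates if verifier(a, b, c, cand)]
        dataset.extend(((a, b, c), cand) for cand in verified)
    return dataset


problems = [(3, 4, 5), (7, 8, 2), (6, 9, 1)]
fine_tuning_set = collect_verified(problems, n_samples=8)
print(f"kept {len(fine_tuning_set)} of {len(problems) * 8} sampled solutions")
```

The design point is that the filter only works because the answer can be checked mechanically, which is why this style of training is usually discussed in the context of mathematics and coding.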


This approach not only improves accuracy but also enhances O3's ability to generalise across reasoning-heavy tasks. It demonstrates the effectiveness of reinforcement learning in scaling beyond the limits of conventional training, offering a pathway for future models to achieve even greater breakthroughs.


Benchmark Domination: Redefining What AI Can Achieve

O3’s performance on key benchmarks underscores its dominance in reasoning-heavy tasks, achieving unprecedented results across diverse domains. These benchmarks were specifically designed to test the limits of AI reasoning, coding, and problem-solving. Here are the highlights:


  • FrontierMath: Achieved 25.2% accuracy on a dataset of unpublished, exceptionally difficult mathematical problems, compared to less than 2% by previous models.

  • GPQA (Graduate-Level Science): Scored 87.7%, outperforming typical PhD-level human performance (~70%), demonstrating mastery in graduate-level scientific reasoning.

  • SWE-bench Verified (Software Engineering): Attained 71.7% accuracy on real-world software engineering tasks, a significant improvement over previous O-series models.

  • ARC-AGI (Reasoning): Scored 87.5% in high-compute mode, tripling the performance of earlier models, showcasing its general reasoning abilities.

  • AIME (Mathematical Excellence): Achieved 96.7%, missing only one question on this challenging math competition benchmark.


These results solidify O3’s place as a groundbreaking model capable of solving problems that were once thought to be out of reach for AI, redefining what is possible in reasoning-based challenges.
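As a quick sanity check on the AIME figure in the list above, the score and the "missing only one question" remark line up if the benchmark has 30 questions. That total is an assumption here (two 15-question AIME papers), not something stated in this article.

```python
# Assumption: the benchmark consists of 30 questions (two 15-question AIME papers).
TOTAL_QUESTIONS = 30
missed = 1

accuracy = (TOTAL_QUESTIONS - missed) / TOTAL_QUESTIONS
print(f"{accuracy:.1%}")  # prints "96.7%"
```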


The Cost of Excellence

While O3’s performance is groundbreaking, it comes at a significant cost. The compute resources required are immense: certain tasks reportedly cost $350,000, owing to both the extensive computation and the prolonged inference times that reasoning-heavy problems demand. For example, reaching the 87.5% ARC-AGI score reportedly required up to 16 hours of inference time, which severely limits the model’s practicality for real-time applications.


OpenAI recognises these challenges and has expressed optimism about their resolution. Advancements in hardware, algorithmic efficiency, and cost-reduction strategies are expected to bring the model closer to viability for general use. With plans for a public release by 2025, OpenAI is racing to make O3 scalable and accessible while maintaining its high performance.


Limitations and Areas for Growth

Despite its achievements, O3 is not without limitations. The model’s strength lies in tasks with clear, objective answers. Areas like spatial reasoning and personal writing, where subjective interpretation and nuanced context are critical, remain challenging. OpenAI has acknowledged that the O-series is not optimised for tasks requiring creative expression or abstract reasoning, highlighting the need for further development before AI can achieve general intelligence.


These limitations serve as a reminder that while O3 represents a significant leap, AI still has a long way to go before it can excel across all domains. Its current capabilities are best suited to reasoning-heavy tasks, leaving room for improvement in areas where subjectivity and flexibility are required.


The Path Forward

O3’s rapid development—from O1 to its current state in just a few months—marks a turning point in AI research. By leveraging reinforcement learning and reasoning-based training, OpenAI has introduced a new paradigm that prioritises structured thinking over brute force. This approach not only addresses the scalability issues of traditional LLMs but also sets the stage for future advancements.


With OpenAI aiming for a 2025 release, O3 represents more than a technical achievement; it is a glimpse into the future of artificial intelligence. Its breakthroughs challenge long-held assumptions about the limits of LLMs, paving the way for applications that transcend current capabilities. While challenges remain, O3’s story is a testament to the power of innovation and a reminder that the boundaries of AI are far from fixed.


A Sneak Peek: Quantum Computing and AI

At the same time, another frontier is being pushed: quantum computing. Google’s recently unveiled quantum chip, Willow, completed a benchmark computation in under five minutes that would take today’s fastest supercomputers an estimated 10 septillion years.


This raises intriguing questions: what might such leaps in computational power mean for AI? Could quantum computing redefine what’s possible for models like O3, or are the fields of quantum and AI destined to remain separate for the foreseeable future? These questions will be explored in next month’s article.
