Routing is quickly becoming a necessity for staying at the frontier

Published on July 20, 2025 | 4 minute read

tl;dr: model routing is really important in the age of sparse RL data

A recent trend in machine learning is that frontier models are being trained specifically for the tasks they are tested on, precisely because those tasks are economically valuable. It matters less whether the model generalizes to some OOD example; the model just needs to be good at the tasks consumers care about downstream. This has become feasible because RLVR (reinforcement learning with verifiable rewards) finally works with LLMs: model trainers pick specific tasks that are verifiable[1], and the models learn to maximize reward (solving the task) through long chains of thought (thinking). However, in the current paradigm of thinking models, verifiable data is hard to come by, and many companies are racing to replicate the datasets frontier labs are using. It's easy to see that, once the low-hanging fruit on the internet is exhausted, these model developers will start acquiring different RL data based on what their employees aim to train on[2], and this data will continue to diverge as labs lock down different contracts with big enterprises.

As a small existence proof, distinct thinking abilities have already been emerging across frontier models in many agent use cases, with companies at the application layer discovering new solutions by interleaving models to find the strongest performance. This simple approach of just using multiple models works for agents, but how does one optimize over these unique thinking properties for single-prompt uses of LLMs? This is especially tough because we don't know the exact tasks each frontier LLM is trained on[3].

Benchmarks are the most obvious solution to this problem: application devs create internal benchmarks that reflect the real performance of different models on consumer tasks, and users then pick a model based on that self-assigned understanding. But this manual selection breaks down as a process when user tasks are not easily groupable or sit at the frontier end of performance. In my opinion, this divergence will be inevitable and more pronounced in the near future, and model routing is the clear solution. Model routing sends each prompt to one of several models, with the choice learned from preference data to optimize performance on some desired metric. Today, routing is mostly used to recover frontier-level performance with smaller models in order to save inference costs, not to maximize the intelligence of the output itself.
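
To make that concrete, here is a minimal sketch of what preference-trained routing could look like at the prompt level. Everything in it is a placeholder assumption: the model names, the toy preference data, and the TF-IDF + logistic-regression router (a real system would presumably use learned embeddings and far more data).

```python
# Minimal sketch of preference-trained prompt routing (hypothetical, not a real product).
# Assumes pairwise preferences have already been collapsed into (prompt, best_model) labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy preference data: which model users preferred for each prompt.
prompts = [
    "prove this lemma about compact operators",
    "refactor this React component to use hooks",
    "summarize this earnings call transcript",
    "write a CUDA kernel for blocked matmul",
]
preferred = ["model_a", "model_b", "model_c", "model_b"]

# Featurize prompts and fit a router that predicts the preferred model.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(prompts)
router = LogisticRegression(max_iter=1000).fit(X, preferred)

def route(prompt: str) -> str:
    """Return the model predicted to perform best on this prompt."""
    return router.predict(vectorizer.transform([prompt]))[0]

print(route("optimize this GPU kernel"))  # likely "model_b" given the toy data
```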

What I’m really excited to see in the next few months is whether someone can exploit these newly forming performance differences from RL data to outperform any single frontier model across genuinely different tasks (not just across benchmarks) in the single-shot setting. This would be hard: it requires preference data scattered across tasks while also serving these frontier models on those tasks at reasonable scale. Another interesting direction would be routing over subtleties within a task domain[4], something agent developers have so far been relying on heuristics for.
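
One way to turn scattered pairwise preferences into a per-task routing signal is a Bradley-Terry model over latent model strengths. This is my own illustrative framing under toy assumptions, not a method any lab has published for this purpose.

```python
# Sketch: estimate per-task model strengths from pairwise preferences
# with a Bradley-Terry model, then route to the strongest model.
import numpy as np

MODELS = ["model_a", "model_b", "model_c"]
# (winner_index, loser_index) pairs from user preferences in one task domain.
comparisons = [(0, 1), (0, 2), (1, 2), (0, 1), (2, 1)]

# Latent strengths; P(i beats j) = sigmoid(s_i - s_j). Only differences matter.
s = np.zeros(len(MODELS))
lr = 0.1
for _ in range(500):  # gradient ascent on the log-likelihood
    grad = np.zeros_like(s)
    for w, l in comparisons:
        p_win = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))
        grad[w] += 1.0 - p_win  # push the winner's strength up
        grad[l] -= 1.0 - p_win  # and the loser's down
    s += lr * grad

print(dict(zip(MODELS, s.round(2))), "-> route to", MODELS[int(np.argmax(s))])
```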

However, collecting preference data and training routers isn’t an instant process. In the long run, with continually improving models[5], A/B testing with human users will likely be neither fast enough nor precise enough to really unlock top performance. With the large amounts of money on the table (these companies are trying to automate chunks of GDP) and the lack of lock-in most of these companies have, I believe the top AI-integrated platforms will have to create “user agents” that test their own application, bootstrapping on the preference data collected. This connects back to my previous post on investigator agents and the multi-faceted nature of next-token prediction.
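
For intuition on why an automated loop could outpace human A/B tests, here is a bandit that routes traffic and gets graded by a stand-in “user agent”. The judge below is just a Bernoulli coin flip and every number is made up; the point is only that a self-testing loop can collect preference data continuously.

```python
# Sketch: Thompson sampling over models, with a simulated "user agent" as judge.
import numpy as np

rng = np.random.default_rng(0)
MODELS = ["model_a", "model_b", "model_c"]
true_win_rate = [0.55, 0.70, 0.40]  # hidden ground truth; unknown to the router

wins = np.ones(len(MODELS))    # Beta(1, 1) prior per model
losses = np.ones(len(MODELS))

for _ in range(2000):
    # Sample a plausible win rate per model and route to the best sample.
    k = int(np.argmax(rng.beta(wins, losses)))
    # The "user agent" grades the output (simulated here as a Bernoulli draw).
    if rng.random() < true_win_rate[k]:
        wins[k] += 1
    else:
        losses[k] += 1

print(MODELS[int(np.argmax(wins / (wins + losses)))])  # should converge to "model_b"
```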

Another interesting line of research is whether routing can be baked into a model. GPT-5 may just be a bunch of different models with great routing behavior, possibly even curriculum-learned over the course of training so that the “types” of underlying models cover the full distribution of user prompts. Mixture-of-Experts (MoE) is a simple version of this idea, but the routing mechanism would likely need to grow more and more complex over time if the underlying models are trained separately. Still, since I believe frontier labs will end up with different RL data, I don’t think this internal routing would be enough to stay at the frontier across all dimensions. In fact, it would be hilarious, but not too surprising, if a big winner in the AI space were a wrapper that simply routed perfectly between frontier models on behalf of application-layer companies.
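
For reference, this is top-1 MoE gating in miniature, the “simple version” mentioned above. It shows only the textbook mechanism of a learned gate selecting among sub-networks; it makes no claim about GPT-5’s internals.

```python
# Textbook top-1 mixture-of-experts gating in miniature (numpy).
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4

# Each "expert" is a tiny linear map; the gate scores experts per input.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route x to the single highest-scoring expert (hard top-1 gating)."""
    logits = x @ gate_w
    k = int(np.argmax(logits))  # the routing decision
    return experts[k] @ x       # only the chosen expert runs

x = rng.normal(size=d)
print(moe_forward(x).shape)  # (8,)
```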

Notes:

  1. This is based on my understanding of open source; all the self-rating / entropy / intrinsic-eval work seems to be either done only on Qwen, a bias from GRPO, or just non-reproducible at large scale. However, OpenAI's and DeepMind's IMO results do make me question how to do that well / think that long without being able to generate synthetic problems on the fly that are past model capabilities. Maybe it’s all just scrapable like AlphaProof, but generalist behaviors wouldn’t emerge from that? Anyway, I'm sticking to verifiable rewards as the base assumption in this post.
  2. Or even just what their employees understand. Unless people find more generalizable ways to leverage RL, even employees at the frontier labs expect that the tasks their models will be able to solve are exactly the ones they optimize for.
  3. Model diffing is a really interesting line of research that aims to understand “properties” of models, but it will likely be hard to do robustly in a black-box setting through APIs alone.
  4. This can likely be done decently well with just a ton of preference data, which I think Cursor may be testing with the “Auto” mode for their agent. Or they are just facing serving-capacity issues. But they should do this, because even frontier labs don’t have data this good on people deliberating over such specialized tasks, albeit only in coding.
  5. And models being the primary users of other models. These models might soon be changing faster than the monthly updates providers currently put out.