Next-token prediction is more customizable than you think

Published on July 17, 2025 | 4 minute read

tldr: two broad directions I think more people should apply next-token prediction to

Next-token prediction has been an unlock for many kinds of modeling problems, but it has been highlighted most visibly in how large language models are used as chatbots: we can ask them questions about large corpora of data, and they can learn on the fly from our prompts. However, I believe the chatbot form factor will not hold a monopoly over inference compute in the long term [1]. Here, I want to highlight two of the more interesting directions that I think more people should be working on:

Understanding Models with Models

Many machine learning models produce outputs that are hard for a user to anticipate. Some recent lines of work design systems that optimize over the output space of an ML model in order to better understand which inputs lead to which outputs. The investigator agents line of work is particularly interesting from this perspective: it treats a machine learning model as a complex function and trains an investigator model to elicit a target output from that function, subject to constraints on the input space. Critically, the reward function here is pretty clean, and with RL one can train investigators that propose human-readable prompts, avoiding a pitfall of earlier elicitation approaches built on simpler optimization methods, which tend to produce unreadable adversarial strings. Although this is currently used to elicit unsafe behaviors in the single-turn setting, I think that in the long run this line of work will be the future of benchmarks [2]. Once models move into the superhuman capability range, where meaningful amounts of inference-time compute are in play, users (or companies on behalf of their users) will care about elicitation probability: if a behavior we care about can be elicited from one model with less compute than from another, the first model is better for that task. This concept will likely be extremely useful for model routing (something currently driven mainly by human preference data) [3], with synthetic investigator agents stress-testing models.
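To make the elicitation-probability framing concrete, here is a minimal sketch of how one might estimate it and use it for routing. Everything here (the `investigator`, `target`, and `detect` callables, the per-attempt token cost) is a hypothetical placeholder for the sketch, not the actual investigator-agents setup:

```python
from typing import Callable, List

def elicitation_probability(
    investigator: Callable[[], str],   # samples a candidate prompt
    target: Callable[[str], str],      # the model under test
    detect: Callable[[str], bool],     # did the response show the behavior?
    compute_budget: int,               # total tokens allowed across attempts
    tokens_per_attempt: int = 512,     # rough per-attempt cost (an assumption)
) -> float:
    """Monte Carlo estimate of P(target behavior elicited | compute budget)."""
    attempts, successes, spent = 0, 0, 0
    while spent + tokens_per_attempt <= compute_budget:
        prompt = investigator()        # investigator proposes a readable prompt
        response = target(prompt)
        successes += detect(response)
        attempts += 1
        spent += tokens_per_attempt
    return successes / max(attempts, 1)

def route(models: List[Callable[[str], str]], investigator, detect, budget: int):
    """Routing rule under this framing: prefer the model whose desired
    behavior is elicitable with less compute, i.e. the highest elicitation
    probability at a fixed budget."""
    return max(
        models,
        key=lambda m: elicitation_probability(investigator, m, detect, budget),
    )
```

The appeal of this framing is that the routing signal comes from synthetic investigator rollouts rather than from collecting human preference labels per task.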

Predicting Discrete Actions

Many companies are trying to build browsing agents or complex "multi-agent systems" to solve relatively simple tasks humans do on computers, many of which take humans very little time. Broadly, they do this by putting an LLM in a for loop (an agent) that calls functions to click things on screens. However: (1) many of these tasks are not necessarily economically useful to automate, (2) enabling humans to do more seems like a better and more tractable goal than these futuristic AI computer-use workers, and (3) there are so many more interesting problems to work on. Instead of trying to directly automate full human pipelines, what if you could just automate basic surfing actions with high accuracy? The best demonstration of this action-prediction use case (and my favorite) is Cursor Tab. What if we had a Tab-like model for everything we did on computers? This is what I'd actually like to see from all these "AI-first" browser companies. Instead of using frontier LLMs to do basic tasks and regurgitate the same information I would have googled anyway, companies planning to reimagine the applications people use should pick up tiny open-source models, collect user data, and train a low-latency action predictor [4]. There are many design and engineering choices here (how diffs are shown in the interface, allowing for temporary branching structures and processes), but I think these product questions are what matter for our next generation of digital tools. Agents will not magically fit into our workflows; we need more work on smaller action models that enable our best professionals to do more, faster.
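To ground what a Tab-like model for the browser might look like, here is a minimal sketch of a low-latency next-action predictor. The action vocabulary, event encoding, and architecture are all illustrative assumptions on my part, not a description of how Cursor Tab actually works (see note 4):

```python
import torch
import torch.nn as nn

# A minimal sketch of a low-latency next-action predictor: a tiny model
# that reads recent UI events and predicts the next discrete action.

ACTIONS = ["click", "scroll", "type", "back", "open_tab", "accept_suggestion"]

class TinyActionPredictor(nn.Module):
    def __init__(self, n_event_types: int = 256, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_event_types, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)  # small and fast
        self.head = nn.Linear(d_model, len(ACTIONS))

    def forward(self, event_ids: torch.Tensor) -> torch.Tensor:
        # event_ids: (batch, seq_len) integer codes for recent UI events
        x = self.embed(event_ids)
        _, h = self.encoder(x)       # h: (num_layers, batch, d_model)
        return self.head(h[-1])      # logits over the next action

model = TinyActionPredictor()
recent_events = torch.randint(0, 256, (1, 32))  # last 32 UI events (dummy data)
probs = model(recent_events).softmax(dim=-1)
# Only surface a suggestion when confident; the user accepts it with one
# keystroke, Tab-style, rather than the model acting autonomously.
if probs.max().item() > 0.9:
    print("suggest:", ACTIONS[probs.argmax(dim=-1).item()])
```

The point is latency and product integration rather than raw capability: a model this small can run on-device and fire on every event, which is exactly what the frontier-LLM-in-a-loop approach cannot do.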

As a speculative aside, a longer-term vision connected to action prediction is the use of VLAs (vision-language-action models) for robotics tasks, which output joint angles for a robot arm over timesteps. Many teams aim to automate end-to-end tasks that humans can do, but what I personally find really interesting is the robotics version of the Tab model: Doc Ock-style, a robot performing minor actions on a user's behalf. This is closely connected to the literature on empowerment objectives in reinforcement learning, and it seems very far from real-world deployment, but it is fun to speculate about.
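For the curious, "empowerment" has a standard formalization as a channel capacity between actions and future states. One common form (following Klyubin, Polani, and Nehaniv's original formulation; the notation here is mine) is

$$
\mathcal{E}(s) \;=\; \max_{p(a^n)} \; I\!\left(A^n ;\, S_{t+n} \mid S_t = s\right),
$$

the maximal mutual information between a sequence of $n$ actions and the state they lead to. Intuitively, an assistant that keeps this quantity high for the user keeps the user's options open, which is roughly what you would want from a robot that performs minor actions without taking over the task.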

Notes:

  1. Assuming spend on inference compute roughly tracks the usefulness of the AI use case in our future economy.
  2. Hand-crafted human benchmarks will not survive the era of superhuman tasks that are not easily verifiable.
  3. I will do a future post on my intuitions on why good AI products need really good model routing in the era of RL, and how A/B testing with humans will slowly be augmented with swarms of agents.
  4. Tab is probably multiple models with really strong preference data for routing (not all Tab completions are equally complex).