Ask anyone who works with AI day to day what the last year felt like, and they'll describe it in jumps. The model that finally handled spreadsheets. The one that could run a multi-step agent without falling over. The one where document extraction stopped needing a babysitter. Each felt like a specific release. Underneath the headline moments, progress is much smoother than it looks.
Where the cliffs come from.
A user judging an AI capability is almost always making a binary judgment. The agent finished the task, or it didn't. The extraction was correct, or it wasn't. The customer email got a good reply, or it embarrassed us. There's no partial credit at the point of use.
Binary judgments turn smooth underlying progress into visible cliffs. Picture a task that takes five steps to get right, where each model generation gets a little better at each step. The chance of a clean run is roughly the product of the per-step success rates, so it stays low until every step is reliable. From the user's seat, the model still fails — one wrong step is enough — until improvements compound past some threshold. Then, suddenly, it passes.
Most of what gets called emergence is this. Continuous improvement, projected onto a binary outcome, looks like a jump. Inside the labs, where capability is measured with continuous instruments — log-likelihood, partial-credit rubrics, edit distance — the curves are smooth and mostly forecastable. The cliffs are downstream of how we measure, not upstream of how models train.
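To make the mechanism concrete, here is a small sketch in Python with invented numbers: per-step reliability climbs a little each generation, while the all-or-nothing pass rate barely moves and then clears the bar all at once.

```python
# Illustrative only: per-step success rates are made up, not measured
# from any real model. The point is the shape, not the values.

STEPS = 5

# Assumed per-step success rate for each hypothetical generation.
per_step_success = [0.60, 0.70, 0.80, 0.90, 0.95, 0.98]

for gen, p in enumerate(per_step_success, start=1):
    task_pass_rate = p ** STEPS          # chance every step goes right
    looks_good = task_pass_rate >= 0.8   # the bar a user might apply
    print(f"gen {gen}: per-step {p:.2f} -> task {task_pass_rate:.2f} "
          f"{'PASS' if looks_good else 'fail'}")

# The per-step number climbs gently every generation, while the task-level
# number crawls (0.08, 0.17, 0.33, 0.59, 0.77, ...) and only clears the bar
# at the last generation: the cliff the user experiences.
```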
Plan against the trajectory.
If you take the cliff view literally, AI roadmapping is mostly waiting. You read release notes, run a couple of prompts, and hope. If you take the smooth view, you can sit between releases and tell, with fair confidence, whether the next generation is likely to clear your bar — or whether it'll be the one after.
The trick is to build two yardsticks for every capability you care about. The first is strict and binary — the way the business actually judges the model. Did the agent close the ticket. Did the form extract correctly. This is the metric that tells you when to ship.
The second is continuous. Same inputs, different scoring — log-likelihood, a graded rubric, edit distance, anything that gives a number that wiggles. It won't ship anything by itself, but it tells you what the underlying curve is doing.
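As a sketch of what that pair can look like, here is one version for a field-extraction task. The field names, the gold record, the model output, and the choice of average string similarity as the continuous score are all placeholders for whatever fits your own work.

```python
import difflib

def strict_score(expected: dict, predicted: dict) -> bool:
    """The way the business judges it: every field exactly right, or a fail."""
    return all(predicted.get(k) == v for k, v in expected.items())

def continuous_score(expected: dict, predicted: dict) -> float:
    """A number that wiggles: average string similarity across fields (0..1)."""
    sims = [
        difflib.SequenceMatcher(None, str(v), str(predicted.get(k, ""))).ratio()
        for k, v in expected.items()
    ]
    return sum(sims) / len(sims)

# Hypothetical gold record and model output for one document.
expected = {"invoice_no": "INV-2041", "total": "1,284.00", "currency": "EUR"}
predicted = {"invoice_no": "INV-2041", "total": "1,284.00", "currency": "EURO"}

print(strict_score(expected, predicted))                 # False: one field is off
print(round(continuous_score(expected, predicted), 2))   # ~0.95: close, but strict calls it a fail
```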
Run both across every model generation and the story becomes legible. The strict number stays flat and then steps up. The continuous number tracks the trajectory — and you can see your bar coming before the strict metric ever moves.
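Seeing the bar coming can be as simple as a trend line. The sketch below uses an invented score history and an assumed bar; it fits a least-squares line through the continuous scores of past generations and reads off roughly when the trend reaches the level at which the strict metric has tended to flip. A real series is noisier, so treat the answer as a rough estimate, not a promise.

```python
# (generation, continuous score) for past model generations: invented data.
history = [(1, 0.52), (2, 0.61), (3, 0.68), (4, 0.74)]
bar = 0.90  # assumed continuous level at which the strict metric starts passing

n = len(history)
mean_x = sum(g for g, _ in history) / n
mean_y = sum(s for _, s in history) / n

# Ordinary least-squares slope and intercept for score as a function of generation.
slope = (
    sum((g - mean_x) * (s - mean_y) for g, s in history)
    / sum((g - mean_x) ** 2 for g, _ in history)
)
intercept = mean_y - slope * mean_x

# Generation at which the fitted trend reaches the bar.
crossing = (bar - intercept) / slope
print(f"trend gains ~{slope:.3f} per generation; "
      f"bar reached around generation {crossing:.1f}")
```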
How we use this.
We run this pair on every engagement where the question is "is the model good enough yet?" The strict metric tells the client whether to roll out today. The continuous one tells them roughly how many generations away the answer changes. It turns a yes-or-no question into a forecast.
It also reframes vendor announcements. Before reacting to a competitor's demo or a headline benchmark, check the continuous metric on your own task. The headline is a poor proxy for what moved in the work you care about.
The cliffs you experience are real. They aren't noise. But they are a downstream effect — the visible shadow of a smoother process that has been running for years and will keep running. Plan against the cliffs and you'll always feel a step behind. Plan against the trajectory and you'll know what's coming.
Further reading.
Two papers shape how we think about this. The first documented how capabilities appear to emerge suddenly with scale. The second showed that much of the apparent suddenness is an artifact of how those capabilities are scored.
- Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., & Fedus, W. (2022). Emergent abilities of large language models. arXiv. doi.org/10.48550/arXiv.2206.07682
- Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are emergent abilities of large language models a mirage? arXiv. doi.org/10.48550/arXiv.2304.15004