Are Language Models Just Glorified T9? A Non-Technical Look at the Evidence

Here is an opinion I have said out loud more than once, usually with some confidence: the chatbots everyone is excited about are basically glorified T9. You remember T9, the predictive text on old phones, where you mashed the number keys and it guessed the word. The argument goes that a language model is the same trick, just enormous: it has read a huge pile of text, and all it really does is guess the next word, over and over, very convincingly. No understanding, no thought, just a very expensive autocomplete.

It is a satisfying thing to believe. It cuts a hyped technology down to size, and it has a kernel of something true in it. But "satisfying" and "true" are not the same word, and I did not want to keep repeating a line I had not actually checked. So I went and read the research, looking specifically for the evidence that would prove me wrong, because that is the only honest way to test an opinion you are fond of. This is what I found.

First, what T9 actually was, because the comparison only means something if we are fair about both sides of it. T9 was patented in the 1990s by a company called Tegic. Mechanically it is simple: you press a sequence of keys, and the software looks that sequence up in a fixed dictionary of words it was shipped with, then shows you the matches in order of how common they are. That is the whole idea. It does not learn the language. It does not know what a sentence means. It is a lookup table with a frequency counter bolted on. The patent itself describes "a complete dictionary containing the spelling of all of the words that a user might reasonably be expected to enter," presented "in order of decreasing frequency of use." [1]

Animation of a phone keypad: the key sequence 4-6-6-3 is pressed, and the dictionary offers the words good, home, gone and hood, settling on good as the most common. — How T9 worked: the same key sequence (4-6-6-3) fits several words, and the dictionary offered them ranked by how common they are. No understanding, just a lookup. Animation by Ramazan Yavuz.

So when someone says a language model is "just T9," the strict version of that claim is: it is a lookup table that regurgitates stored words. And that part is simply not how a language model works. There is no dictionary of sentences inside it to look things up in. But there is a softer, more interesting version of the claim, and that one deserves a serious hearing.

The steelman, the strongest honest version of my own opinion, goes like this. A language model really is trained to do one narrow thing: predict the next chunk of text. Engineers feed it oceans of writing and, over and over, cover up the next word and reward the model for guessing it. That is the entire training objective. The famous GPT-2 paper from 2019 lays out the maths plainly: the model learns the probability of each next piece of text given everything before it. [2] So at the level of "what is it being asked to do," the autocomplete comparison is not a slur. It is the literal job description.

And there is a respectable academic version of my skepticism too. In 2021 a group of well-known researchers, led by Emily Bender and Timnit Gebru, published a paper that gave the idea its nickname: the stochastic parrot. Their definition is worth quoting, because it is almost exactly my T9 line dressed in careful language. A language model, they wrote, is "a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning." [3] They argue the text "is not grounded in communicative intent, any model of the world, or any model of the reader's state of mind," and that the sense it makes is something the human reader supplies, not the machine. That is a serious, widely-cited position, and it is squarely on my side of the argument.

xkcd 1838: a stick figure describes a machine learning system as a big pile of linear algebra you pour data into and collect answers from; if the answers are wrong, you just stir the pile until they start looking right. — The "it is just statistics" intuition, drawn as a joke. The joke lands because there is real truth in it. The question is whether it is the whole truth.Comic: xkcd 1838, “Machine Learning” by Randall Munroe (CC BY-NC 2.5).

It gets stronger still, because language models do the thing T9 never did: they confidently produce text that is fluent, plausible, and completely false. Researchers call this confabulation, and in 2024 a team published a method for detecting it in the journal Nature. Their description is almost comic: when one of these models confabulates, "it makes something up for no reason," and you can catch it because asking the very same question again, with a different roll of the dice, often produces a different made-up answer. [4] This is not a rare laboratory curiosity. On one reproducible industry benchmark that checks whether a model sticks to the facts in a document it was handed, the better models invent unsupported claims under four percent of the time, while some weaker ones do it more than twenty percent of the time. [5] (Those are vendor-measured snapshot figures from May 2026, judged by software rather than humans, so treat them as a rough gauge, not gospel.)

And it has bitten real people. In a 2023 court case in New York, two lawyers filed a legal brief full of court decisions that did not exist. They had asked ChatGPT for supporting cases, and it produced them: convincing names, plausible quotes, the works. The only problem was that the cases were fiction. The judge fined the lawyers and noted they had kept defending the fake opinions even after their existence was questioned. [6] A tool that can hand a professional seven imaginary court cases with a straight face is, at least in that moment, behaving exactly like a parrot that has learned to sound like a lawyer. So far, my opinion is holding up well.

Now the part that made me put the brakes on. The "just T9" line quietly assumes that because the training task is simple, the result must be simple too. That assumption is where it falls apart, and the clearest evidence is an experiment about, of all things, the board game Othello.

A group of researchers trained a model on nothing but sequences of legal Othello moves. It was never told the rules. It never saw a board. It only ever predicted the next move, the purest possible version of "glorified autocomplete." Then they looked inside it, and found that it had built, all on its own, an internal picture of the board, where the pieces were and whose turn it was. They could even reach in, change that internal picture, and watch the model's predictions change accordingly. Their paper is titled, fittingly, "Emergent World Representations," and they report "evidence of an emergent nonlinear internal representation of the board state." [7] A pure next-move predictor built a model of the world to predict better. T9 never needed a model of anything, because looking a word up in a dictionary requires no understanding. This does.

The same surprise shows up in plain use. The bigger these models got, the more they started doing tasks nobody trained them to do. GPT-3, the 2020 model whose lineage led to the first ChatGPT, was shown to perform brand-new tasks just from a couple of examples typed into the prompt, with no retraining at all. [8] You can hand it a job it has never seen, show it two examples, and it follows the pattern. Your phone's predictive text cannot learn a new task from two examples in the chat box. It does one thing forever. So whatever is happening inside the larger models, "fixed lookup table" is not it.

This is also where the honest answer has to admit the science is still arguing with itself, because I do not want to swap one tidy story for another. There is a genuinely open debate about these surprising new skills, often called emergent abilities. One influential 2022 paper argued they appear suddenly at scale and cannot be predicted from smaller models. [9] A 2023 paper pushed back hard, arguing that a lot of that apparent suddenness is an illusion created by how we score the tests: measure the same skill on a gentler scale and the jump smooths out into a steady, predictable climb. [10] Both papers are from serious researchers, and the matter is not settled. So if anyone tells you, in either direction, that the question of machine "understanding" is closed, they are ahead of the evidence.

Bar chart on a log scale showing language model size growing from 340 million parameters (BERT, 2019) to 1.57 trillion (Switch-C, 2021), a roughly 4,600-fold increase in about three years. — Scale is the part nobody disputes. Model sizes grew thousands of times over in a few years, and behaviour changed with them. Data taken verbatim from Table 1 of Bender et al., the "Stochastic Parrots" paper [3]; chart drawn by Ramazan Yavuz.

So where does that leave my favourite line? Partly vindicated, partly retired. Here is the verdict I would now defend, and it is a split decision.

My instinct was right about the objective. These systems really are trained to predict the next piece of text, and they really do produce confident nonsense when that prediction goes wrong, in ways a thoughtful person would not. Anyone who talks about AI as if it knows things the way a human expert does has skipped over something real, and the stochastic-parrot crowd are right to keep pointing at it.

But my instinct was wrong about the mechanism and the consequences. "Glorified T9" smuggles in the idea of a lookup table, and that is not what is inside. A system trained only to predict can, at sufficient scale, build internal structure that looks a lot like a model of its little world, and can pick up tasks it was never shown. That is categorically more than T9 ever did, and pretending otherwise is its own kind of dishonesty. The truthful sentence is longer and less quotable than my old one: they are next-word predictors whose predictions got good enough to build surprising machinery underneath, and we are still arguing about how much that machinery amounts to.

Which is, I think, the useful place to land. Not "it is just autocomplete, relax," and not "it understands, be amazed." Something more careful and more interesting: a tool that is genuinely shallow in the ways its critics describe and genuinely surprising in the ways its critics miss, and worth understanding precisely because both of those are true at once. I will keep the joke. I will just stop pretending it is the whole story.

Are language models just glorified T9?

Sources