TL;DR: LLMs are next-token prediction machines.
It’s not magic, it’s math!
Written by a human, incorporating feedback from Gemini.
LLMs (Large Language Models) are powering all the latest popular chatbots like Gemini, ChatGPT and Copilot. But how do they actually work?
Basically, LLMs are next-token prediction machines. Given a piece of text, they try to predict what text is most likely to follow. For instance, given the text
The duck is swimming in the
an LLM would probably predict the next word to be lake.
It’s similar to how smartphones suggest the next word for you when you are typing.
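This idea can be sketched as a lookup in a toy probability table. The table below is made up for illustration and stands in for the billions of learned parameters of a real model:

```python
# Toy sketch of next-token prediction: a hand-written probability
# table plays the role of a trained model (hypothetical numbers).
next_token_probs = {
    "The duck is swimming in the": {"lake": 0.7, "pond": 0.2, "sky": 0.1},
}

def predict_next(text):
    # Return the token the "model" considers most likely to follow.
    probs = next_token_probs[text]
    return max(probs, key=probs.get)

print(predict_next("The duck is swimming in the"))  # lake
```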
But, what exactly is a token?
In order to feed text into an LLM, the text must first be broken into small pieces that the LLM can understand. These pieces are called tokens. How exactly the text is broken up depends on each LLM. A token can be a single character, or a grouping of characters that often appear together in texts. For simplicity in the rest of this post, we’ll just assume that a token is one word.
So the example sentence from above
The duck is swimming in the lake
would be broken into tokens like
["The", "duck", "is", "swimming", "in", "the", "lake"]
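With the one-word-per-token simplification, the tokenizer is just a sketch like this (real LLMs use subword schemes such as byte-pair encoding instead):

```python
def tokenize(text):
    # Simplified tokenizer: one token per word.
    # Real tokenizers split text into subword pieces.
    return text.split()

print(tokenize("The duck is swimming in the lake"))
# ['The', 'duck', 'is', 'swimming', 'in', 'the', 'lake']
```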
Parts of an LLM
Simply put, an LLM is a pipeline consisting of three parts, with text being passed in at the input, and text coming out again at the output:
- Tokenizer (+embedding layers), which breaks text into tokens, and converts each token into a vector (a list of numbers).
- Transformer, which takes the vectors from the tokenizer and converts them into another vector.
- Detokenizer (and some additional components), which takes the vector from the transformer and converts it back into text.

The LLM contains billions of parameters that determine how text is transformed to and from vectors, and how these vectors are transformed. These parameters are also often called weights.
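A minimal sketch of this three-stage pipeline, with made-up stand-ins for each part (the "vectors" and the "transformer" here are trivial placeholders, not a real model):

```python
def tokenizer(text):
    # Text -> one "vector" per token (here: a trivial one-number vector
    # per word; a real embedding layer produces long learned vectors).
    return [[float(len(word))] for word in text.split()]

def transformer(vectors):
    # Vectors in, vector out (here: just a sum, as a placeholder for
    # the billions of weighted computations in a real transformer).
    return [sum(v[0] for v in vectors)]

def detokenizer(vector):
    # Vector -> text (here: a fixed placeholder token).
    return "lake"

def llm(text):
    # The full pipeline: text in, text out.
    return detokenizer(transformer(tokenizer(text)))

print(llm("The duck is swimming in the"))  # lake
```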
The LLM is trained to try to predict the token that is most likely to follow given a sequence of tokens. For instance, given the following sequence
The duck is swimming in
an LLM would probably output the next word to be the. Appending this to the text
The duck is swimming in the
and passing this text back into the LLM to have it continue, it would likely output lake.
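This feed-the-output-back-in loop can be sketched as follows; `predict_next` is a hypothetical stand-in for a real model, hard-coded with the duck example:

```python
def predict_next(text):
    # Hypothetical stand-in for an LLM, hard-coded for one example.
    continuations = {
        "The duck is swimming": "in",
        "The duck is swimming in": "the",
        "The duck is swimming in the": "lake",
    }
    return continuations.get(text, "<end>")

def generate(text, max_tokens=10):
    # Repeatedly append the predicted token and feed the text back in.
    for _ in range(max_tokens):
        token = predict_next(text)
        if token == "<end>":
            break
        text = text + " " + token
    return text

print(generate("The duck is swimming"))
# The duck is swimming in the lake
```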
The maximum length of the input sequence of tokens is fixed, meaning there is an upper limit to the number of tokens that an LLM can process in one pass. This is known as the context window, and it’s a fundamental limitation of how LLMs work.
Training
How does an LLM know how to continue this kind of text? It learns by being trained on billions of examples of text from the internet.
When starting the training process, the parameters of the LLM are typically all given random values, so at first it will just output gibberish. For instance, inputting a sentence like
The duck is swimming in the
it would probably just output random tokens like calculator abruptly syntax Tuesday.
The training process works by passing billions of text samples like this one through the LLM, comparing the output against the expected value (lake in the above example), and then slightly adjusting all the parameters accordingly. Repeating this billions of times eventually makes the LLM quite good at predicting the next token.
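The core idea (predict, measure the error, nudge the parameters slightly) can be illustrated with a toy one-parameter model; real training adjusts billions of parameters using gradients computed across the whole network:

```python
# Toy illustration of the training idea: nudge a single parameter to
# shrink the gap between prediction and target, many times over.
weight = 0.0          # starts out "random" (here: zero)
target = 1.0          # the correct output for our one training example
learning_rate = 0.1

for step in range(100):
    prediction = weight              # a one-parameter "model"
    error = prediction - target     # how far off are we?
    weight -= learning_rate * error  # adjust the parameter slightly

print(round(weight, 3))  # close to 1.0 after many small steps
```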
Context management
In a nutshell, LLMs are just machines that predict the most statistically likely next token, given an input sequence of tokens, based on patterns in the data the LLM was trained on. They are stateless, meaning the LLM itself has no memory; its output is determined entirely by its input (although often with some randomization added).
For building applications using LLMs, managing the input therefore becomes very important. All the information needed to answer one’s query must be available in the input, yet still fit within the context window. RAG, grounding, prompt templates (and the recent Claude skills) are all examples of different ways of managing the context that is given as part of the input to the LLM.
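As an illustration, here is a sketch of assembling a context before calling the model. The template, the retrieved snippet, and the length check are all made up for illustration, and the token count uses our one-word-per-token simplification:

```python
def build_prompt(question, retrieved_docs, max_tokens=4096):
    # Pack everything the model needs into one input string.
    context = "\n".join(retrieved_docs)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    # Crude length check, using the one-word-per-token simplification.
    if len(prompt.split()) > max_tokens:
        raise ValueError("prompt does not fit in the context window")
    return prompt

print(build_prompt("Where is the duck?",
                   ["The duck is swimming in the lake."]))
```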
If you would like to learn more about how LLMs work, I would recommend the Build a Large Language Model (From Scratch) book. It explains many of the details we glossed over here, walking through the mechanics of how an LLM works in a quite straightforward manner, with practical Python examples.