For example, GPT-4.1:
- Window size: 1,047,576 tokens
- Max output tokens: 32,768
So, in reality, this is a multifaceted problem that involves architecture, datasets, compute power, and business practices.
LLMs are pattern prediction systems: they take the input and try to determine the most likely continuation of the patterns present. This works really well for short contexts because short contexts keep the token probabilities sharply differentiated. The longer the context, the more those distinctions flatten out, which leads to contextual drift or outright context loss in extremely long contexts.
There are various efforts to mitigate this, but it remains a significant issue. The problem stems from how LLM attention mechanisms work and the type of data they're trained on. Attention compares every token to every other token using dot products of the query (Q) and key (K) vectors, then uses those scores to weight the value (V) vectors. While this lets the model gauge the relative importance of tokens, those comparisons get murkier as the context grows.
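Here's a rough numpy sketch of that dilution effect (toy random Q/K matrices, not a real model): the softmax has to spread the same attention mass over more positions as the sequence grows, so the weight any single token can receive shrinks.

```python
# Toy illustration (random Q/K, not a real model): scaled dot-product
# attention softmaxes scores over every position, so the weight any one
# token can receive tends to shrink as the sequence gets longer.
import numpy as np

def max_attention_weight(seq_len, d_model=64, seed=0):
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((seq_len, d_model))
    K = rng.standard_normal((seq_len, d_model))
    scores = Q @ K.T / np.sqrt(d_model)              # scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights.max()

for n in (64, 512, 2048):
    print(f"seq_len={n:4d}  largest single attention weight: {max_attention_weight(n):.3f}")
```

With real, trained weights the picture is messier, but the basic crowding effect is the same.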
Regarding datasets, they do play a role in the model's capabilities. Long contexts require a large amount of memory, making training quite intensive. Companies like Meta and OpenAI have the hardware to develop longer-context models, but it's still expensive. Unfortunately, these models can't reliably extrapolate beyond the sequence lengths they were trained on: if you train a model on 100-token samples, it won't gracefully handle 200-token inputs.
Some solutions, like sliding context windows and streaming, try to address this. However, my understanding is that they only shift the starting point of the context window rather than increase the number of tokens the model can attend to at once.
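As a rough sketch of what I mean (just a plain Python list standing in for token IDs), a sliding window drops the oldest tokens instead of letting the model attend to more of them:

```python
# Rough sketch of a sliding context window: keep only the most recent
# max_ctx tokens, so the window's starting point moves forward instead of
# the model actually handling more tokens at once.
def sliding_window(tokens, max_ctx):
    return tokens[-max_ctx:] if len(tokens) > max_ctx else tokens

history = list(range(10_000))            # stand-in for 10k token IDs
window = sliding_window(history, 4096)
print(len(window), window[0])            # 4096 5904 -> everything before token 5904 is gone
```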
As for the actual content of the datasets, you're right; many free, open-source projects could provide lengthy data for training. However, this also ties into the processing power requirements for an LLM. The longer the context, the more compute time and power required. Many datasets were pre-trimmed or segmented to have a uniform max length. Smaller operations often don't have the capital to buy, rent, or power enough hardware for extremely long contexts. Training new architectures from scratch is costly, so reducing the max context length is a simple cost-cutting solution.
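For a sense of scale, here's a back-of-the-envelope sketch (illustrative fp16 and head-count assumptions, ignoring tricks like FlashAttention that avoid materializing the full matrix): naive self-attention scores every token against every other token, so that piece alone grows quadratically with context length.

```python
# Back-of-the-envelope: a naive n x n attention score matrix per head,
# in fp16, with an assumed 32 heads. Illustrative numbers only -- real
# implementations avoid storing this whole matrix at once.
BYTES_PER_VALUE = 2    # fp16
NUM_HEADS = 32

def score_matrix_gib(context_len):
    return context_len**2 * NUM_HEADS * BYTES_PER_VALUE / 1024**3

for n in (2_048, 8_192, 32_768, 131_072):
    print(f"{n:>7} tokens -> ~{score_matrix_gib(n):9.2f} GiB of attention scores")
```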
That pre-trimming leads to another issue. Since LLMs are pattern prediction systems, a model trained mostly on short snippets ends up with a richer, more diverse set of probable short continuations, which biases it toward short outputs. More candidate outputs means a greater likelihood that one of them matches the pattern the LLM is working with.
Compute power is another widespread issue. For a personal setup, a slower model that gives good results can be acceptable. Processing inputs for hundreds of users at a time is a different story. This ties closely with business practices: longer user sessions on GPUs or clusters mean fewer customers served, which equates to less revenue. By enforcing a cutoff, businesses can serve more customers in a shorter time, leading to higher revenue.
If you've ever been stuck behind someone at a fast-food restaurant placing a large order, you've experienced this firsthand. Longer wait times for simple orders lead to less satisfied customers and a decline in clientele.
On top of all this, the method by which you run the LLM matters. For instance, IIRC, Ollama is set by default to handle 2k or 4k tokens, so you need to configure it for larger contexts before an extended-context model can actually use them. Oobabooga's textgen webui offers multiple controls for this: you can set the max context size at load time, adjust the max context size passed to the LLM, and control the max output tokens. Each frontend has its own options, but I'm mostly familiar with Ooba and directly running llama.cpp.
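For example, if I'm remembering the option names right, you can override the context and output caps per request through Ollama's REST API (this assumes a local server on the default port and a model called llama3, so swap in whatever you actually run):

```python
# Sketch of raising Ollama's context window per request. num_ctx and
# num_predict are the option names as I recall them; "llama3" is just a
# placeholder model name.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Summarize this long document: ...",
        "stream": False,
        "options": {
            "num_ctx": 8192,      # context window handed to the model
            "num_predict": 1024,  # cap on generated tokens
        },
    },
)
print(resp.json()["response"])
```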
But, and this is a big but, the model won't always produce the maximum number of output tokens, for various reasons. One reason is the end-of-sequence (EOS) token, which normally signals the LLM to stop writing. Some UIs let you ban this token, forcing the model to generate longer outputs.
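Under the hood, "banning" usually just means masking that token out before sampling. A toy greedy sampler (hypothetical token IDs, not any particular UI's implementation) shows the idea:

```python
# Toy illustration of banning the EOS token: set its logit to -inf so the
# sampler can never pick it, forcing generation to continue until some
# other stop condition (like a max-token cap) kicks in.
import numpy as np

EOS_ID = 2  # hypothetical end-of-sequence token ID

def sample_next(logits, ban_eos=False):
    logits = logits.copy()
    if ban_eos:
        logits[EOS_ID] = -np.inf     # EOS can never win
    return int(np.argmax(logits))    # greedy pick, for simplicity

logits = np.array([0.1, 0.3, 2.5, 0.7])    # EOS happens to score highest
print(sample_next(logits))                  # 2 -> the model would stop here
print(sample_next(logits, ban_eos=True))    # 3 -> forced to keep writing
```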
It really boils down to a quote from The Office: "Why waste time say lot word when few word do trick?"