LLM_log #002: Tokenization in Large Language Modelling

Understanding Tokenization in Large Language Models

Why GPT-4 Can’t Count the Letters in “Strawberry”

VLADIMIR MATIC, PhD – DataHacker.rs – January 2025

🍓 “How many letter ‘r’s are in the word strawberry?”

strawberry → s t r a w b e r r y

GPT-4’s answer: “Two”
Correct answer: Three

In early 2024, this simple question stumped GPT-4. A billion-dollar AI model failed at counting letters in a 10-letter word. Why?

The answer lies in tokenization.

🔍 Why GPT-4 Gets This Wrong

GPT-4 never sees the word as individual letters.
It sees it as tokens (chunks of text).

👁 What you see: s t r a w b e r r y – 10 individual letters

🤖 What GPT-4 sees: [straw] [berry] – 2 tokens (chunks)

💡

This Isn’t a Bug – It’s How LLMs Work

GPT-4’s tokenizer learned that “strawberry” is efficiently represented as two chunks: “straw” + “berry”. The model can’t “see” the individual ‘r’s because they’re buried inside these tokens.

It’s like asking someone to count letters in a sealed envelope. The information exists, but it’s not directly accessible.

🎯 More Tokenization Failures

1. Reversing Words

Task: Reverse “hello”
GPT-4: “olleh” ✅
Task: Reverse “strawberry”
GPT-4: Often wrong ❌

Why? “strawberry” = [straw][berry]. The model can’t reverse what it can’t see letter-by-letter.

2. Multi-Digit Arithmetic

Task: 123 + 456
GPT-4: 579 ✅
Task: 8732 x 9461
GPT-4: Often wrong ❌

Why? “870” might be one token, but “871” is [8][71]. Inconsistent tokenization makes patterns hard to learn.
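
You can check this yourself. Here’s a minimal sketch using the GPT-2 tokenizer from Hugging Face (the specific splits differ between models, including GPT-4’s tokenizer, so treat the output as illustrative):

from transformers import AutoTokenizer

# GPT-2's BPE tokenizer; the splits are illustrative and vary across models
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for number in ["870", "871", "1000", "8732"]:
    print(number, "->", tokenizer.tokenize(number))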

3. Multilingual Cost

English: “Hello” = 1 token
Arabic: “marhaba” = 3-4 tokens
Cost: 3-4x more! 💰

Why? Tokenizers trained on English are inefficient for non-Latin scripts.

4. The Glitch Token

Input: “SolidGoldMagikarp”
GPT-3: Complete nonsense ❌

Why? This token existed in vocabulary but was never trained on, causing bizarre behavior.

Why Tokenization Matters: Three Real-World Impacts

1. 💰 Cost

API pricing is typically based on token usage – the more tokens, the higher the cost. Token efficiency directly translates to money.

Prompt: “Explain quantum entanglement in simple terms”
GPT-2 tokenizer: 12 tokens
GPT-4 tokenizer: 7 tokens
Cost difference: 42% cheaper with GPT-4!

For a company making 1 million API calls per day with an average of 50 tokens per call:

  • With GPT-2 efficiency: 50M tokens/day x $0.03/1K = $1,500/day = $547,500/year
  • With GPT-4 efficiency: 35M tokens/day x $0.03/1K = $1,050/day = $383,250/year
  • Savings: $164,250 per year just from tokenization efficiency!
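
Here’s a minimal sketch of that arithmetic in Python (the $0.03 per 1K tokens price and the 30% token reduction are the assumptions from the example above, not current API pricing):

price_per_1k = 0.03          # assumed price per 1,000 tokens (from the example above)
calls_per_day = 1_000_000
tokens_per_call_gpt2 = 50
tokens_per_call_gpt4 = 35    # same prompts, fewer tokens with the newer tokenizer

def yearly_cost(tokens_per_call):
    daily_tokens = calls_per_day * tokens_per_call
    daily_cost = daily_tokens / 1000 * price_per_1k
    return daily_cost * 365

savings = yearly_cost(tokens_per_call_gpt2) - yearly_cost(tokens_per_call_gpt4)
print(f"Yearly savings: ${savings:,.0f}")   # ≈ $164,250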

2. 🌍 Fairness

Tokenizers trained primarily on English text penalize users of other languages. This isn’t just unfair – it’s a structural bias in how LLMs work.

Same meaning: “Hello, world!”
English: 2 tokens
Arabic: 6 tokens (3x cost)
Chinese: 4 tokens (2x cost)
Serbian: 5 tokens (2.5x cost)

This affects three critical areas:

  • API costs: Non-English speakers literally pay 2-3x more for the same functionality
  • Context window: Less actual content fits for non-English users (8K tokens = less text)
  • Model performance: Fewer tokens means the model “sees” less context, degrading quality
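
You can measure this inflation on your own prompts. A sketch with tiktoken – the sample translations and the resulting ratios are illustrative, so run it on the languages you actually serve:

import tiktoken

# GPT-4's tokenizer via tiktoken; exact counts depend on the phrases you test
enc = tiktoken.encoding_for_model("gpt-4")

samples = {
    "English": "Hello, world!",
    "Serbian": "Zdravo, svete!",
    "Chinese": "你好，世界！",
}

baseline = len(enc.encode(samples["English"]))
for language, text in samples.items():
    n = len(enc.encode(text))
    print(f"{language}: {n} tokens ({n / baseline:.1f}x English)")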

3. 🎯 Capability

What the tokenizer can “see” fundamentally determines what the model can learn. This is why specialized models exist.

BERT (2018) viewing Python code:
“def calculate(x): return x * 2”
= [“de”, “##f”, “calculate”, “(“, “x”, “)”, “:”, “return”, “x”, “*”, “2”]
= Lost indentation, “def” split into pieces

StarCoder2 (2024) viewing the same code:
“def calculate(x): return x * 2”
= [“def”, “calculate”, “(“, “x”, “):”, “return”, “x”, “*”, “2”]
= “def” as single token, cleaner structure

StarCoder2’s tokenizer has:

  • Single tokens for Python keywords (def, class, return)
  • Single tokens for multiple spaces (handles Python indentation properly)
  • Digit-by-digit number tokenization (better mathematical reasoning)
  • Special tokens for code structure

The Hidden Truth: When you see that one model is “better” at a specific task than another, tokenization might be the reason. It’s not always about model size or architecture – sometimes it’s just that the tokenizer was designed for that domain.
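
A quick way to see this domain effect yourself, assuming you can download both tokenizers from the Hugging Face Hub (the StarCoder2 checkpoint may require accepting its license first):

from transformers import AutoTokenizer

snippet = "def calculate(x): return x * 2"

# Compare a general-purpose 2018 tokenizer with a 2024 code-specialized one;
# the exact splits may differ slightly between tokenizer versions
for name in ["bert-base-uncased", "bigcode/starcoder2-3b"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, "->", tokenizer.tokenize(snippet))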

How Tokenization Works

Now that you understand why tokenization matters, let’s explore how it actually works. When you type text into ChatGPT, a crucial transformation happens before the model even begins to “think.”

Figure 2.1: The Tokenization Pipeline
Language models deal with text in small chunks called tokens. For the model to compute language, it needs to turn tokens into numeric representations called embeddings.
Input (“Have the bards who preceded…”) → Tokenization (break into smaller pieces) → Numeric representation (token IDs and embeddings)

Think of it like this: If you’re trying to understand a book, but someone only gives you whole pages at a time (never individual words or sentences), your understanding would be limited. Similarly, if someone gives you individual letters, you’d spend all your time assembling them into words. Tokens are the “just right” level of granularity.

Let’s See It In Action

Here’s how to tokenize “strawberry” and see why GPT-4 can’t count the ‘r’s:

from transformers import AutoTokenizer

# Load GPT-2 tokenizer (illustrative example)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize "strawberry"
text = "strawberry"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")

# Output:
# Tokens: ['straw', 'berry']
# Token IDs: [2536, 19772]

The model never sees “s-t-r-a-w-b-e-r-r-y”. It sees two tokens: 2536 and 19772. Those are just numbers referencing entries in the tokenizer’s vocabulary.

Key Insight: The model works entirely with numbers. When it outputs token ID 19772, the tokenizer decodes it back to “berry”. The model never manipulates letters directly.
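
To see the mapping in the other direction, here is a small addition to the snippet above (the IDs assume the GPT-2 tokenizer; other tokenizers assign different numbers):

# Decode IDs back to text – this step belongs to the tokenizer, not the model
print(tokenizer.decode([2536]))     # 'straw' (assuming the GPT-2 IDs above)
print(tokenizer.decode([19772]))    # 'berry'
print(tokenizer.decode(token_ids))  # 'strawberry'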

The Four Tokenization Paradigms

There are fundamentally different approaches to breaking down text. Modern LLMs use subword tokenization, but understanding all four paradigms reveals why.

Figure 2.6: Four Ways to Tokenize Text
Let’s tokenize: “AI is amazing! 🤖” – Same input, four different approaches.

📝 Word Tokens
[AI] [is] [amazing] [!] [🤖]
✓ Intuitive
✗ Can’t handle new words

✨ Subword Tokens (used by GPT-4, Claude)
[AI] [is] [amaz] [ing] [!] [🤖]
✓ Handles new words
✓ Efficient vocabulary
✓ Best trade-off!

🔤 Character Tokens
[A] [I] [ ] [i] [s] [ ] [a] [m] [a] [z] …
✓ Can spell anything
✗ Very long sequences (3x more tokens)

⚛ Byte Tokens
01000001 01001001 00100000 … (A, I, space)
✓ Universal
✗ Extremely long (8x more tokens)

Why subword tokenization wins: It’s the “Goldilocks” solution. Common words stay intact (“AI”, “is”), while rare words break into reusable pieces (“amaz” + “ing”). This allows the model to understand new words it’s never seen before.

“unbelievable” = [“un”, “believ”, “able”]
“unthinkable” = [“un”, “think”, “able”]

The model recognizes “un-” (negation prefix) and “-able” (capability suffix).
Even though it never saw “unthinkable”, it can understand it!
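
You can check how a real subword tokenizer splits these words – a sketch with the GPT-2 tokenizer; the exact pieces will differ from the idealized split above, but the shared prefix/suffix structure is what to look for:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Related words tend to share subword pieces, which is what lets the model
# generalize to words it has rarely (or never) seen during training
for word in ["unbelievable", "unthinkable", "unstoppable"]:
    print(word, "->", tokenizer.tokenize(word))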

Comparing Real Tokenizers: The Evolution (2018-2024)

Over six years, tokenizers went from losing information ([UNK] tokens) to being 35% more efficient while handling emojis, multiple languages, and specialized domains like code.

Our Test Text

VLADIMIR MATIC loves AI!
marhaba (Arabic) Privet (Cyrillic)

# Python code
def calculate_price(tokens):
    return tokens * 0.03 / 1000
✓ Special chars ✓ Emojis ✓ Multilingual ✓ Code: def, return

BERT base (uncased) – 2018
Method: WordPiece – Vocabulary: 30,522 tokens

The pioneer. BERT introduced Transformers to NLP but had significant limitations.

Result: ~70 tokens

[CLS] | vladimir | mat | ##ic | loves | ai | ! | [UNK] | [UNK] | [UNK] | ( | arabic | ) | [UNK] | ( | cy | ##ril | ##lic | ) |
# | python | code | de | ##f | calculate | _ | price | ( | token | ##s | ) | : | return | token | ##s | * | 0 | . | 03 | / | 1000 | [SEP]

Key Observations:

  • Everything lowercase: “VLADIMIR” becomes “vladimir” (info lost)
  • All emojis become [UNK]: Model is completely blind to emojis
  • Arabic/Cyrillic text becomes [UNK]: Non-Latin scripts unrepresentable
  • “def” split: Python keyword becomes “de” + “##f” (loses meaning)

GPT-2 – 2019
Method: Byte Pair Encoding (BPE) – Vocabulary: 50,257 tokens

The breakthrough. GPT-2 solved the [UNK] problem with byte-level encoding.

Result: ~85 tokens

VL | AD | IM | IR | MAT | IC | loves | AI | ! | mar | ha | ba | ( | Arabic | ) | Pri | vet | ( | Cyr | illic | ) |
# | Python | code | def | calculate | _ | price | ( | tok | ens | ) | : | return | tok | ens | * | 0 | . | 03 | / | 10 | 00

Key Observations:

  • Capitalization preserved: VLADIMIR stays uppercase
  • No [UNK] tokens: Everything gets tokenized via byte-level fallback
  • “def” as single token: Common keywords recognized
  • Numbers chunked inconsistently: 1000 = “10” + “00” (not ideal for math)

GPT-4 – 2023
Method: Improved BPE – Vocabulary: ~100,000 tokens

The optimizer. Larger vocabulary makes tokenization 35% more efficient than GPT-2.

Result: ~55 tokens ✓

VLAD | IMIR | MATIC | loves | AI | ! | marhaba | ( | Arabic | ) | Privet | ( | Cyrillic | ) |
# | Python | code | def | calculate | _ | price | ( | tokens | ): | return | tokens | * | 0.03 | / | 1000

Key Observations:

  • 35% fewer tokens: Same text, 55 vs 85 tokens = significant cost savings!
  • “MATIC” in one token: Larger vocabulary handles more words efficiently
  • “tokens” as single token: Common programming words recognized whole
  • “0.03” and “1000” as single tokens: Numbers grouped efficiently

StarCoder2 – 2024
Method: BPE – Vocabulary: 49,152 tokens – Specialization: Code

The specialist. Optimized for code with digit-by-digit number tokenization.

Result: ~68 tokens

VL | AD | IM | IR | MAT | IC | loves | AI | ! | mar | ha | ba | ( | Arabic | ) | Pri | vet | ( | Cyr | illic | ) |
# | Python | code | def | calculate | _ | price | ( | tok | ens | ) | : | return | tok | ens | * | 0 | . | 0 | 3 | / | 1 | 0 | 0 | 0

▲ Notice: 0.03 = [0][.][0][3] and 1000 = [1][0][0][0] – digit by digit!

Key Observations:

  • Digit-by-digit numbers: 0.03 = [0][.][0][3], 1000 = [1][0][0][0]
  • Why digit-by-digit? Makes 870 vs 871 differ by just one token – better math!
  • “def” as single token: Code keywords optimized
  • Trade-off: More tokens for numbers, but better mathematical reasoning
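
If you have access to the StarCoder2 checkpoint on the Hugging Face Hub, here’s a quick sketch to verify the digit-by-digit behavior (the model name below is the public 3B checkpoint; downloading its tokenizer may require accepting the model license):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")

# Numbers should come back one digit per token, so 870 and 871
# differ in exactly one position
for number in ["870", "871", "0.03", "1000"]:
    print(number, "->", tokenizer.tokenize(number))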

💡

Evolution Insights

  • 2018 to 2019: Byte-level fallback eliminated [UNK] tokens completely
  • 2019 to 2023: Larger vocabularies achieved 35% efficiency gains
  • 2023 to 2024: Domain specialization (code, math) became key

Design Trade-offs Revealed

Looking at this comparison, several fundamental trade-offs become clear:

Vocabulary size vs. Model size: GPT-4’s ~100K vocabulary makes tokenization more efficient, but the embedding layer (which maps token IDs to vectors) becomes larger. A 30K vocabulary needs 30K vectors; a 100K vocabulary needs 100K vectors. This is pure memory.
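
A back-of-the-envelope sketch of that memory cost (the hidden size of 4096 and fp16 storage are illustrative assumptions, not published figures for any particular model):

# Embedding matrix size = vocabulary_size x hidden_size
hidden_size = 4096          # assumed for illustration
bytes_per_param = 2         # fp16

for vocab_size in [30_522, 50_257, 100_000]:
    params = vocab_size * hidden_size
    print(f"{vocab_size:>7} tokens -> {params / 1e6:7.1f}M embedding params "
          f"({params * bytes_per_param / 1e9:.2f} GB in fp16)")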

General vs. Specialized: GPT-4 optimizes for general text across many languages. StarCoder2 sacrifices some text efficiency (digit-by-digit numbers) to excel at code. You can’t optimize for everything simultaneously.

Efficiency vs. Capability: StarCoder2’s digit-by-digit tokenization uses more tokens (lower efficiency, higher cost) but enables better mathematical reasoning. When “870” and “871” tokenize as [8][7][0] and [8][7][1], the model sees the pattern that they differ by just one digit.

Language fairness vs. Optimization: To make English tokenization efficient, you need English-heavy training data for the tokenizer. This inevitably makes other languages less efficient. There’s no way around this with current approaches.

Comprehensive Comparison

Property | BERT (2018) | GPT-2 (2019) | GPT-4 (2023) | StarCoder2 (2024)
Method | WordPiece | BPE | BPE | BPE
Vocabulary Size | 30,522 | 50,257 | ~100,000 | 49,152
Total Tokens | ~70 | ~85 | ~55 ✓ | ~68
Capitalization | ❌ Lost | ✓ Preserved | ✓ Preserved | ✓ Preserved
Special Chars | ❌ [UNK] | Multiple tokens | ✓ 1 token | ✓ 1 token
Emojis | ❌ All [UNK] | Multiple tokens | ✓ Efficient | Multiple tokens
Non-Latin Scripts | ❌ [UNK] | Multiple tokens | ✓ Efficient | Multiple tokens
“def” keyword | Split (de+##f) | 1 token | ✓ 1 token | ✓ 1 token
Best Use Case | Classification, NER | Text generation | General purpose | Code generation

Try It Yourself – Complete Code

Here’s clean, runnable Python code to compare all tokenizers:

from transformers import AutoTokenizer

# Our test text
text = """VLADIMIR MATIC loves AI!
marhaba (Arabic) Privet (Cyrillic)
def calculate_price(tokens):
    return tokens * 0.03 / 1000

result = 2**10 + 500"""

# Compare tokenizers
models = {
    "BERT": "bert-base-uncased",
    "GPT-2": "gpt2",
    "StarCoder2": "bigcode/starcoder2-3b"
}

for name, model_name in models.items():
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokens = tokenizer(text).input_ids
    print(f"{name}: {len(tokens)} tokens")

# For GPT-4 (requires tiktoken)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
print(f"GPT-4: {len(enc.encode(text))} tokens")

Installation:

pip install transformers tiktoken torch

Expected output:

BERT: ~70 tokens
GPT-2: ~85 tokens
StarCoder2: ~68 tokens
GPT-4: ~55 tokens

Practical Takeaways for LLM Users

When Choosing a Model

  • Test your actual content: Don’t assume token counts. Use tiktoken or the OpenAI tokenizer playground to see how YOUR prompts tokenize.
  • Consider domain-specific models: Need code? StarCoder beats GPT-4. Need science? Galactica might be better. General purpose? GPT-4/Claude are optimal.
  • Account for language bias: If you’re building for non-English users, factor in 2-3x token inflation in your cost estimates.

When Writing Prompts

  • Be concise: Every word costs tokens. “Please explain” vs “Explain” is one extra token per request. Multiply by millions of requests.
  • Avoid unnecessary formatting: Markdown bullets, headers, and emphasis all cost tokens. Use them only when they improve output quality.
  • Position important context deliberately: Models tend to weight the beginning and end of a prompt more heavily than the middle, so put critical information near the start or the end rather than burying it in the middle.

When Building Applications

  • Monitor token usage: Track average tokens per request, per user, per feature. Find your inefficiencies.
  • Implement token budgets: Limit context window usage. Don’t send entire documents when a summary would work.
  • Cache tokenized inputs: If you’re sending the same system prompt on every request, cache the tokenized version.
  • Chunk strategically: When splitting long documents, split at semantic boundaries (paragraphs, sections), not arbitrary token counts.
  • Use traditional code for letter-level tasks: Don’t ask an LLM to count letters, reverse strings, or do arithmetic. Use Python for that.

Real Example: A company building a document analysis tool cut their API costs by 40% by implementing three changes: (1) chunking documents at paragraph boundaries instead of arbitrary token limits, (2) using extractive summarization before sending text to the LLM, and (3) caching common system prompts. All three are tokenization-aware optimizations.

Understanding the Limitations

Knowing how tokenization works helps you understand when LLMs will struggle:

  • Letter-level tasks: Counting, reversing, anagrams – the model can’t see individual letters
  • Exact arithmetic: Inconsistent number tokenization makes patterns hard to learn
  • Spelling: The model can spell common words (they’re single tokens) but struggles with rare words
  • Character-level patterns: Detecting repeated characters, palindromes, etc.

For these tasks, use traditional programming. For everything else – pattern matching, generation, understanding context – LLMs excel precisely because they work at the token level.
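
For letter-level work, a few lines of ordinary Python do what the model cannot:

word = "strawberry"

print(word.count("r"))      # 3 – counting letters
print(word[::-1])           # 'yrrebwarts' – reversing
print(word == word[::-1])   # False – palindrome check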

What’s Next?

Now that you understand tokenization – the first step in how LLMs process text – you’re ready for the next pieces of the puzzle:

  • Embeddings: How do those token IDs become meaningful numeric vectors? What makes “king” − “man” + “woman” ≈ “queen” work mathematically?
  • The Transformer architecture: What happens inside the model after tokenization? How does self-attention work?
  • Training dynamics: How do models learn from tokens? What’s the role of the tokenizer during training vs. inference?
  • Advanced tokenization: BPE vs. WordPiece vs. Unigram – what are the differences? How are tokenizers trained?

These topics will be covered in future posts on DataHacker.rs. Understanding tokenization first makes everything else make more sense.

Homework: Take a prompt you use regularly and run it through different tokenizers. Use the OpenAI tokenizer playground. See how token counts differ!

Key Takeaways

  • Tokenization is invisible but crucial – It determines what the model can “see” and learn
  • The strawberry problem is universal – LLMs struggle with letter-level tasks because of tokenization
  • Evolution shows clear progress – From BERT’s [UNK] tokens to GPT-4’s efficiency
  • Subword tokenization wins – Best balance between vocabulary size and flexibility
  • Specialization matters – Code models need different tokenizers than general models
  • Efficiency = Cost – Fewer tokens per text = lower API costs
  • Multilingual bias exists – English is 2-3x cheaper than Arabic or Chinese

📚 Further Reading:
– Sennrich et al., 2016 – Neural Machine Translation of Rare Words with Subword Units (BPE)
– Kudo & Richardson, 2018 – SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing
– OpenAI Tokenizer Playground

Written by VLADIMIR MATIC, PhD

DataHacker.rs – Based on “Hands-On Large Language Models” by Jay Alammar and Maarten Grootendorst