Tokenizer Playground

Visualize how OpenAI models tokenize text. See token counts, boundaries, and IDs for GPT-4o, GPT-4, GPT-3.5, and more.

[Interactive playground: pick an encoding (o200k_base by default), paste text into the Input Text box, and the Tokenized Output panel shows the token boundaries along with token, character, and word counts and the average characters per token.]

About Tokenization

What are tokens?

Tokens are the basic units that language models process. A token can be a word, part of a word, or even punctuation. OpenAI models use Byte Pair Encoding (BPE) to break text into tokens.
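
As an illustrative sketch, the same splitting can be reproduced with OpenAI's tiktoken library (assuming it is installed; the example sentence is arbitrary):

```python
import tiktoken

# Load the BPE encoding used by GPT-4o and the o-series models.
enc = tiktoken.get_encoding("o200k_base")

text = "Tokenization isn't always intuitive!"
token_ids = enc.encode(text)

print(f"{len(token_ids)} tokens for {len(text)} characters")
for tid in token_ids:
    # Each token ID decodes to a byte string; boundaries need not align with words.
    piece = enc.decode_single_token_bytes(tid)
    print(tid, repr(piece))
```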

Why does it matter?

Token count affects API pricing (you pay per token), context window limits, and model performance. Understanding tokenization helps optimize prompts and estimate costs.
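
For example, you can count tokens before sending a request and turn that into a rough cost estimate. The sketch below assumes tiktoken is installed; the per-million-token price is a placeholder parameter, not a current OpenAI rate:

```python
import tiktoken

def estimate_prompt_cost(text: str, model: str = "gpt-4o",
                         usd_per_million_tokens: float = 2.50) -> tuple[int, float]:
    """Return (token_count, estimated_input_cost_usd) for a single prompt."""
    enc = tiktoken.encoding_for_model(model)  # resolve the encoding the model uses
    n_tokens = len(enc.encode(text))
    # usd_per_million_tokens is a placeholder; check current pricing before relying on it.
    return n_tokens, n_tokens * usd_per_million_tokens / 1_000_000

tokens, cost = estimate_prompt_cost("Summarize the following report in three bullets: ...")
print(f"{tokens} tokens, ~${cost:.6f} of input")
```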

Encodings

  • o200k_base — GPT-4o, o1, o3 series (200k vocabulary)
  • cl100k_base — GPT-4, GPT-3.5, embeddings (100k vocabulary)
  • p50k_base — text-davinci-003/002 (50k vocabulary)
  • r50k_base — legacy davinci, curie, and ada models (50k vocabulary)
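
These names correspond to the encodings exposed by tiktoken. A short sketch, assuming tiktoken is installed, of loading one by name or resolving one from a model name:

```python
import tiktoken

by_name = tiktoken.get_encoding("cl100k_base")     # load an encoding directly
by_model = tiktoken.encoding_for_model("gpt-4o")   # resolve from a model name

sample = "hello world"
print(by_name.encode(sample))    # cl100k_base token IDs
print(by_model.encode(sample))   # o200k_base token IDs (generally different)
print(by_model.name)             # "o200k_base"
```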