How LLMs turn words into tokens

Understanding the details might improve your prompts

Mar 09, 2025

My three latest posts have all been on some aspect of how LLMs tokenize prompts. The first of these, mentioned in my previous newsletter, gives an introduction to the difference between words and tokens.

Writing that post piqued my curiosity regarding how Unicode characters are tokenized. Are there characters that are ignored in a prompt? Are look-alike characters, like the Greek letter omega (U+03A9) and the ohm symbol for electrical resistance (U+2126), tokenized the same?
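To make the omega/ohm question concrete, here is a minimal Python sketch using only the standard library. It does not call any LLM tokenizer, so it says nothing about how a particular model handles these characters. It only shows that the two look-alikes are distinct code points with different UTF-8 bytes, which is why a tokenizer working on raw text could plausibly treat them differently. The helper name `describe` is made up for illustration.

```python
import unicodedata

omega = "\u03a9"  # GREEK CAPITAL LETTER OMEGA
ohm = "\u2126"    # OHM SIGN

def describe(ch):
    """Return the code point, Unicode name, and UTF-8 bytes of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8")

for ch in (omega, ohm):
    print(describe(ch))

# The two characters compare as different strings...
print(omega == ohm)

# ...but Unicode normalization (NFC here) maps the ohm sign to omega.
print(unicodedata.normalize("NFC", ohm) == omega)
```

If look-alike characters matter for a prompt, normalizing the text before sending it is one possible way to remove this particular ambiguity, assuming the change of code point is acceptable for your use.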

Are there practical consequences to how tokenization works? There can be. Understanding these details might help you write better prompts.

John D. Cook Consulting Substack
