My three latest posts have all been on some aspect of how LLMs tokenize prompts. The first of these, mentioned in my previous newsletter, gives an introduction to the difference between words and tokens.
Writing that post piqued my curiosity regarding how Unicode characters are tokenized. Are there characters that are ignored in a prompt? Are look-alike characters, like the Greek letter omega (U+03A9) and the ohm sign used for electrical resistance (U+2126), tokenized the same?
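To see why the look-alike question is worth asking, here is a minimal Python sketch using only the standard library. It does not run any LLM tokenizer; it only shows that the two characters are distinct code points with different UTF-8 bytes, and that Unicode normalization maps the ohm sign to omega. Whether a given model normalizes its input before tokenizing is something I'm not assuming here, and the helper name `describe` is just for illustration.

```python
import unicodedata

OMEGA = "\u03a9"  # Greek capital letter omega
OHM = "\u2126"    # ohm sign


def describe(ch: str) -> dict:
    """Return the Unicode name, code point, and UTF-8 bytes of one character."""
    return {
        "name": unicodedata.name(ch),
        "codepoint": f"U+{ord(ch):04X}",
        "utf8": " ".join(f"{b:02x}" for b in ch.encode("utf-8")),
    }


for ch in (OMEGA, OHM):
    print(ch, describe(ch))

# The two characters look alike but are different strings.
print("equal as typed:", OMEGA == OHM)

# The ohm sign has a canonical mapping to omega, so NFC normalization
# makes the two strings identical.
print("equal after NFC:", unicodedata.normalize("NFC", OHM) == OMEGA)
```

A tokenizer that works directly on the raw bytes would receive different input for the two characters, while one that normalizes first would not, so the answer can depend on the preprocessing as much as on the tokenizer itself.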
Are there practical consequences to how tokenization works? There can be. Understanding these details might help you write better prompts.