<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Internationalization on Build in Public</title><link>https://build.ralphmayr.com/tags/internationalization/</link><description>Recent content in Internationalization on Build in Public</description><generator>Hugo</generator><language>en-us</language><copyright>©️ Ralph Mayr 2026</copyright><lastBuildDate>Fri, 03 Oct 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://build.ralphmayr.com/tags/internationalization/index.xml" rel="self" type="application/rss+xml"/><item><title>Be careful when counting your whitespace</title><link>https://build.ralphmayr.com/posts/95-be-careful-when-counting-your-whitespace/</link><pubDate>Fri, 03 Oct 2025 00:00:00 +0000</pubDate><guid>https://build.ralphmayr.com/posts/95-be-careful-when-counting-your-whitespace/</guid><description>&lt;p&gt;One of the great things about working on poketto.me is that I'm constantly learning about fascinating linguistic subtleties. For instance, while working on automatic content summaries and extracting key facts and figures, I came across an interesting issue with token counting in Chinese script.&lt;/p&gt;
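&lt;p&gt;The post doesn't show the site's actual code, but the failure mode is easy to reproduce: a naive word count that splits on whitespace collapses a whole Chinese sentence into a single "word". Below is a minimal sketch, where the CJK character range and the counting heuristic are illustrative assumptions, not poketto.me's real implementation:&lt;/p&gt;

```python
import re

def naive_word_count(text: str) -> int:
    # Split on whitespace -- only meaningful for space-delimited languages.
    return len(text.split())

def cjk_aware_count(text: str) -> int:
    # Rough heuristic: count each CJK ideograph as one token, plus
    # whitespace-delimited runs of everything else. Not a real tokenizer.
    cjk = re.findall(r"[\u4e00-\u9fff]", text)
    rest = re.sub(r"[\u4e00-\u9fff]", " ", text).split()
    return len(cjk) + len(rest)

english = "The quick brown fox jumps over the lazy dog"
chinese = "敏捷的棕色狐狸跳过懒狗"

print(naive_word_count(english))  # 9
print(naive_word_count(chinese))  # 1 -- no spaces, so one "word"
print(cjk_aware_count(chinese))   # 11
```

&lt;p&gt;A production system would more likely use a proper segmenter (e.g. Unicode word-boundary rules), but even this crude heuristic shows why a 100-word threshold silently rejects Chinese articles.&lt;/p&gt;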
&lt;p&gt;I had put a safeguard in place so that poketto.me would only attempt to summarize content longer than 100 words. This worked well for German and English content, but when I tested the feature on an article published by Xinhua, a Chinese news agency, my code counted only about 12 words and therefore skipped the summary. That count was obviously wrong: Chinese script doesn't separate words with spaces, so a whitespace-based word count barely registers anything.&lt;/p&gt;</description></item></channel></rss>