One of the great things about working on poketto.me is that I'm constantly learning about fascinating linguistic subtleties. For instance, while working on automatic content summaries and extracting key facts and figures, I came across an interesting issue with token counting in Chinese script.

I had put a safeguard in place so that poketto.me would only attempt to summarize content longer than 100 words. That worked well for German and English content, but when I tested the feature on an article published by Xinhua, a Chinese news agency, my code counted only about 12 words, which was obviously wrong, so no summary was produced.

What’s going on there?

⚡️ My naive approach to counting words was to split the content on whitespace and count the resulting tokens. In Chinese, though, whitespace doesn't carry the same significance as in English or German. The entire Xinhua article, for instance, contains only 11 whitespace characters, yet it is easily long enough to merit a summary. Here is the entire first paragraph, with only one "proper" whitespace character in it:

当地时间10月31日上午,亚太经合组织第三十二次领导人非正式会议第一阶段会议在韩国庆州和白会议中心举行。国家主席习近平出席会议并发表题为《共建普惠包容的开放型亚太经济》的重要讲话

(Roughly: "On the morning of October 31 local time, the first session of the 32nd APEC Economic Leaders' Meeting was held at the Hwabaek convention center in Gyeongju, South Korea. President Xi Jinping attended and delivered an important speech titled 'Jointly Building an Open and Inclusive Asia-Pacific Economy that Benefits All.'")
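To make the failure mode concrete, here's a sketch of the naive approach (the sample strings are my own illustration, not poketto.me's actual code):

```python
# Naive word counting: split on whitespace and count the tokens.
# Works for space-delimited languages, fails badly for Chinese.
en = "APEC leaders met in Gyeongju on October 31."
zh = "当地时间10月31日上午，亚太经合组织领导人非正式会议在韩国庆州举行。"

print(len(en.split()))  # 8 -- one token per word
print(len(zh.split()))  # 1 -- no whitespace, so the whole sentence is one "word"
```

Note that even the full-width Chinese comma (，) is not whitespace, so `str.split()` never gets a chance to break the sentence apart.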

💡I could have disregarded the minimum-length safeguard entirely, but instead, I searched for a proper solution to the word-counting issue. After shopping around for a while, I settled on Jieba, a Python word segmentation module specifically designed for the Chinese language.

🇨🇳 With Jieba in place, word counting, reading-time estimation, content summarization, and all the other features that build on them work smoothly for Chinese content as well. 你好中文! ("Hello, Chinese!")

Check it out here: https://app.poketto.me/#/shared/ERDNBoi