<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Scraping on Build in Public</title><link>https://build.ralphmayr.com/tags/scraping/</link><description>Recent content in Scraping on Build in Public</description><generator>Hugo</generator><language>en-us</language><copyright>©️ Ralph Mayr 2026</copyright><lastBuildDate>Sat, 04 Oct 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://build.ralphmayr.com/tags/scraping/index.xml" rel="self" type="application/rss+xml"/><item><title>Stopping the scrape: Why I switched to the Wikimedia API</title><link>https://build.ralphmayr.com/posts/96-stopping-the-scrape-why-i-switched-to-the-wikimedia-api/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://build.ralphmayr.com/posts/96-stopping-the-scrape-why-i-switched-to-the-wikimedia-api/</guid><description>&lt;p&gt;I&amp;rsquo;ve noticed a welcome uptick in users saving Wikipedia articles to poketto.me recently.&lt;/p&gt;
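&lt;p&gt;For anyone curious, the Wikimedia API the title refers to can be called with plain HTTP. Here is a minimal sketch, assuming the public REST v1 &amp;ldquo;page/html&amp;rdquo; route; the User-Agent string below is just a placeholder:&lt;/p&gt;

```python
from urllib.parse import quote

def wikipedia_html_url(title: str, lang: str = "en") -> str:
    """Build the REST v1 URL that returns an article's rendered HTML,
    free of the UI clutter that comes with scraping the regular page."""
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/html/{quote(title, safe='')}"

# Usage (requires the third-party `requests` package):
#   import requests
#   html = requests.get(wikipedia_html_url("Web scraping"),
#                       headers={"User-Agent": "my-app/1.0"}).text
```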
&lt;p&gt;But until now, the app treated Wikipedia just like any other website: it scraped the raw HTML. Turns out, for Wikipedia, that is far from ideal:&lt;/p&gt;
&lt;p&gt;🤯 Artifacts: The extracted content often included UI clutter like &amp;ldquo;Edit&amp;rdquo; buttons, navigation links, and &amp;ldquo;citation needed&amp;rdquo; tags.&lt;/p&gt;
&lt;p&gt;📋 Rendering issues: The standard HTML → Markdown → HTML conversion pipeline introduced plenty of ugly formatting glitches specific to wikis.&lt;/p&gt;</description></item><item><title>trafilatura’s image extraction is a bit too cautious for my taste</title><link>https://build.ralphmayr.com/posts/73-trafilaturas-image-extraction-is-a-bit-too-cautious-for-my-taste/</link><pubDate>Thu, 11 Sep 2025 00:00:00 +0000</pubDate><guid>https://build.ralphmayr.com/posts/73-trafilaturas-image-extraction-is-a-bit-too-cautious-for-my-taste/</guid><description>&lt;p&gt;A poketto.me user recently filed a curious bug: They had saved a page that clearly contained images &amp;mdash; but in the reader view, no images showed up.&lt;/p&gt;
&lt;p&gt;I expected some quirky HTML. But when I checked, the &amp;lt;img&amp;gt; tags looked perfectly normal (see Exhibit A). Yet, after passing the HTML through trafilatura (which I use to convert HTML to Markdown), the images had simply vanished.&lt;/p&gt;
&lt;p&gt;🔎 The culprit? Trafilatura is very cautious with images. It only accepts &amp;lt;img src=...&amp;gt; URLs that end with a known extension (.jpg, .png, .gif, &amp;hellip;). The site in question served images from URLs without extensions &amp;mdash; so trafilatura just ignored them.&lt;/p&gt;</description></item><item><title>Extracting Favicons: There’s No Bulletproof Way</title><link>https://build.ralphmayr.com/posts/47-extracting-favicons-theres-no-bulletproof-way/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://build.ralphmayr.com/posts/47-extracting-favicons-theres-no-bulletproof-way/</guid><description>&lt;p&gt;A &lt;strong&gt;favicon&lt;/strong&gt; is that tiny icon you see next to a site name in your browser tab or bookmarks bar. It's one of those small UX elements that quietly plays a big role in how we recognize and visually differentiate websites.&lt;/p&gt;
&lt;p&gt;In poketto.me, I wanted to bring favicons into play for a couple of UI elements&amp;mdash;particularly when managing your &lt;strong&gt;saved news sources&lt;/strong&gt;. Seeing a little logo beside each source makes skimming, scanning, and organizing much more intuitive than reading domain names alone.&lt;/p&gt;</description></item><item><title>Extracting web content is still… messy.</title><link>https://build.ralphmayr.com/posts/2-extracting-web-content-is-still-messy/</link><pubDate>Wed, 02 Jul 2025 00:00:00 +0000</pubDate><guid>https://build.ralphmayr.com/posts/2-extracting-web-content-is-still-messy/</guid><description>&lt;p&gt;You can talk about autonomous #AIAgents roaming the web and performing all kinds of tasks 'just as a human would' as much as you like, but technically, some of the very basics are still lacking. For example, there is no free, easy-to-use, off-the-shelf solution for extracting web content.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what I mean: Think of poketto.me as a very basic 'agent': you tell it to save a URL, and then it talks to the website 'on your behalf' to access its content. For this use case, the &lt;strong&gt;Newspaper3k&lt;/strong&gt; Python library is pretty good: it teases out structured metadata, but occasionally misses basic things like the content language. To retrieve the actual content, &lt;strong&gt;Trafilatura&lt;/strong&gt; (
&lt;a href="https://github.com/adbar/trafilatura" target="_blank" rel="noopener noreferrer"&gt;https://github.com/adbar/trafilatura&lt;/a&gt;) appears to be the best option at the moment. However, even that doesn't work well with all sites. For edge cases, I actually had to fall back on parsing the raw HTML myself using Beautiful Soup (
&lt;a href="https://beautiful-soup-4.readthedocs.io/en/latest/%29" target="_blank" rel="noopener noreferrer"&gt;https://beautiful-soup-4.readthedocs.io/en/latest/)&lt;/a&gt;. (And yes, I&amp;rsquo;m sending that through an LLM later to streamline the content so all the tiny formatting issues Trafilatura introduces get smoothed out again.)&lt;/p&gt;</description></item></channel></rss>