#96 Stopping the scrape: Why I switched to the Wikimedia API

I’ve noticed a welcome uptick in users saving Wikipedia articles to poketto.me recently. But until now, the app treated Wikipedia just like any other website: it scraped the raw HTML. Turns out, for Wikipedia, that is far from ideal:

🤯 Artifacts: The extracted content often included UI clutter like “Edit” buttons, navigation links, and “Citation missing” tags.

📋 Rendering issues: The standard HTML → Markdown → HTML conversion pipeline introduced plenty of ugly formatting glitches specific to wikis.

...
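The teaser above doesn’t show which endpoint poketto.me ended up calling; as a minimal sketch, assuming the MediaWiki Action API’s `extracts` prop (which returns the rendered article body without edit links, navboxes, or maintenance tags), a request URL could be built like this:

```python
from urllib.parse import urlencode

# Hypothetical sketch: the actual endpoint/parameters used by poketto.me
# are not shown in the post. The `extracts` prop (TextExtracts extension)
# returns article text with the wiki UI clutter already stripped.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_extract_url(title: str) -> str:
    """Build an Action API request asking for a plain-text article extract."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": "1",   # plain text instead of limited HTML
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return API_ENDPOINT + "?" + urlencode(params)
```

Fetching that URL yields structured JSON per page, so there is no HTML → Markdown → HTML round trip to introduce glitches in the first place.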

October 4, 2025

#73 trafilatura’s image extraction is a bit too cautious for my taste

A poketto.me user recently filed a curious bug: they had saved a page that clearly contained images, but in the reader view, no images showed up. I expected some quirky HTML. But when I checked, the <img> tags looked perfectly normal (see Exhibit A). Yet after passing the HTML through trafilatura (which I use to convert HTML to Markdown), the images had simply vanished.

🔎 The culprit? Trafilatura is very cautious with images: it only accepts <img src=...> URLs that end with a known extension (.jpg, .png, .gif, …). The site in question served images from URLs without extensions, so trafilatura just ignored them.

...
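For illustration, here is a simplified, hypothetical version of the kind of extension allow-list check described above. It is not trafilatura’s actual code, just a sketch of why extension-less image URLs get dropped:

```python
# Simplified sketch of an extension allow-list for <img src=...> URLs.
# This illustrates the behavior described in the post, not trafilatura's
# actual implementation.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg")

def looks_like_image(url: str) -> bool:
    # Drop any query string, then test the path's suffix.
    path = url.split("?", 1)[0].lower()
    return path.endswith(IMAGE_EXTENSIONS)
```

A URL like `https://cdn.example.com/img/abc123` fails such a check even if the server happily returns a valid image, which matches the vanishing-image behavior above.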

September 11, 2025

#47 Extracting Favicons: There’s No Bulletproof Way

A favicon is that tiny icon you see next to a site name in your browser tab or bookmarks bar. It's one of those small UX elements that quietly plays a big role in how we recognize and visually differentiate websites. In poketto.me, I wanted to bring favicons into play for a couple of UI elements—particularly when managing your saved news sources. Seeing a little logo beside each source makes skimming, scanning, and organizing much more intuitive than reading domain names alone. ...
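One common (but, as the title says, not bulletproof) strategy is to look for `<link rel="... icon ...">` tags in the page head and fall back to the conventional `/favicon.ico` path. A minimal stdlib-only sketch, with names of my own invention:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class _IconLinkParser(HTMLParser):
    """Collect href values from <link> tags whose rel mentions 'icon'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower()
        if "icon" in rel and attr.get("href"):
            self.hrefs.append(attr["href"])

def find_favicon(html: str, base_url: str) -> str:
    """Return an absolute favicon URL, falling back to /favicon.ico."""
    parser = _IconLinkParser()
    parser.feed(html)
    candidate = parser.hrefs[0] if parser.hrefs else "/favicon.ico"
    return urljoin(base_url, candidate)
```

This handles the happy path only: sites that declare icons via web app manifests, serve multiple sizes, or 404 on `/favicon.ico` all need extra handling, which is exactly why there’s no bulletproof way.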

August 16, 2025

#2 Extracting web content is still… messy.

You can talk about autonomous #AIAgents roaming the web and performing all kinds of tasks 'just as a human would' as much as you like, but technically, some of the very basics are still lacking. For example, there is no free, easy-to-use, off-the-shelf solution for extracting web content.

Here’s what I mean: Think of poketto.me as a very basic 'agent': you tell it to save a URL, and then it talks to the website 'on your behalf' to access its content. For this use case, the Newspaper3k Python library is pretty good: it teases out structured metadata, but occasionally misses basic things like the content language. To retrieve the actual content, Trafilatura (https://github.com/adbar/trafilatura) appears to be the best option at the moment. However, even that doesn't work well with all sites. For edge cases, I actually had to fall back on parsing the raw HTML myself using Beautiful Soup (https://beautiful-soup-4.readthedocs.io/en/latest/). (And yes, I’m sending that through an LLM later to streamline the content so all the tiny formatting issues Trafilatura introduces get smoothed out again.)

...
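The fallback chain described above can be sketched generically. The extractor functions here are placeholders of my own; in the real pipeline they would wrap Trafilatura and the hand-rolled Beautiful Soup parsing:

```python
def extract_with_fallbacks(html, extractors):
    """Try extractors in priority order.

    `extractors` is a list of (name, callable) pairs. Returns (name, text)
    for the first extractor that yields non-empty output, or (None, None)
    if every one fails or returns nothing usable.
    """
    for name, extractor in extractors:
        try:
            text = extractor(html)
        except Exception:
            continue  # a crashing extractor just means: try the next one
        if text and text.strip():
            return name, text
    return None, None
```

The design point is that each extractor only needs to be good enough for its slice of the web; the cascade, plus the LLM cleanup pass mentioned above, papers over the rest.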

July 2, 2025