A poketto.me user recently filed a curious bug: They had saved a page that clearly contained images — but in the reader view, no images showed up.
I expected some quirky HTML. But when I checked, the <img> tags looked perfectly normal (see Exhibit A). Yet, after passing the HTML through trafilatura (which I use to convert HTML to Markdown), the images had simply vanished.
🔎 The culprit? Trafilatura is very cautious with images. It only accepts <img src=...> URLs that end with a known extension (.jpg, .png, .gif, …). The site in question served images from URLs without extensions — so trafilatura just ignored them.
👉 My quick fix: If poketto.me encounters an <img> without a “valid” extension, I tack one on artificially so trafilatura accepts it. Hacky, but it worked — not only for that user’s page but for many more sites.
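To make the workaround concrete, here is a minimal sketch of the idea, not the actual poketto.me code. The extension list, the function names, and the choice to append a `#image.jpg` URL fragment are all assumptions for illustration. A fragment is convenient because browsers never send it to the server, so the image still loads.

```python
import re
from urllib.parse import urlsplit

# Assumed list for illustration; trafilatura's real list may differ.
KNOWN_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".avif", ".bmp")

# Matches the src attribute of an <img> tag (double-quoted values only).
# Regex-based HTML handling is fragile; this is a sketch, not a parser.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]+)(")', re.IGNORECASE)


def has_known_extension(url: str) -> bool:
    """Return True if the URL path ends with one of the assumed extensions."""
    return urlsplit(url).path.lower().endswith(KNOWN_EXTENSIONS)


def add_fake_extension(html: str) -> str:
    """Append a dummy fragment to <img> URLs that lack a known extension."""

    def patch(match: re.Match) -> str:
        prefix, url, suffix = match.groups()
        if has_known_extension(url):
            return match.group(0)  # leave normal image URLs untouched
        # Hypothetical marker: the fragment is never sent to the server.
        return f"{prefix}{url}#image.jpg{suffix}"

    return IMG_SRC.sub(patch, html)
```

The patched HTML would then go to trafilatura as usual, for example via `trafilatura.extract(..., include_images=True)`. Whether a fragment alone satisfies the extension check depends on the trafilatura version, so treat that part as untested.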
The proper fix would be more expensive:
1️⃣ Trafilatura would need to send an HTTP HEAD request for each image.
2️⃣ If the Content-Type starts with image/, accept it, regardless of the URL. If not, ignore it.
That’s the robust solution. But I also understand why trafilatura avoids it: dozens of HEAD requests could slow content extraction to a crawl.
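The two steps above could look roughly like this with only the Python standard library. This is a sketch under assumptions: the function names and the 5-second timeout are mine, and none of this is trafilatura code.

```python
import urllib.request
from typing import Optional


def is_image_content_type(content_type: Optional[str]) -> bool:
    """Step 2: accept only Content-Type values that start with image/."""
    return bool(content_type) and content_type.strip().lower().startswith("image/")


def looks_like_image(url: str, timeout: float = 5.0) -> bool:
    """Step 1: send a HEAD request and inspect the Content-Type header."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return is_image_content_type(response.headers.get("Content-Type"))
    except (OSError, ValueError):
        # Network errors, HTTP errors, timeouts, and malformed URLs
        # all count as "not an image" in this sketch.
        return False
```

Each call costs one network round trip, which is exactly the expense described above. Some servers also reject HEAD or answer it with a generic Content-Type, so a real implementation would need a fallback.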
So here’s my plan to help out the folks behind trafilatura:
🔵 I’ll draft one pull request that introduces the HEAD check as an opt-in option.
🔵 I’ll draft another PR that simply disables the sanity check (closer to my current workaround), also as an opt-in option.
Let’s see which one gets traction with the maintainers.
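To show how the two opt-in ideas could sit next to the current default, here is a purely hypothetical sketch. `ImageCheck`, `accept_image`, and the extension list are invented names for illustration. They are not part of trafilatura's API, and the actual PRs may look quite different.

```python
from enum import Enum
from typing import Callable, Optional
from urllib.parse import urlsplit


class ImageCheck(Enum):
    """Hypothetical modes; not a real trafilatura setting."""
    EXTENSION = "extension"  # default described in the post: known extensions only
    HEAD = "head"            # opt-in: ask the server for the Content-Type
    NONE = "none"            # opt-in: skip the sanity check entirely


# Assumed list for illustration only.
KNOWN_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".avif", ".bmp")


def accept_image(
    url: str,
    mode: ImageCheck = ImageCheck.EXTENSION,
    head_check: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Decide whether an <img> URL should be kept, depending on the mode."""
    if not url:
        return False
    if mode is ImageCheck.NONE:
        return True
    if mode is ImageCheck.HEAD:
        if head_check is None:
            raise ValueError("HEAD mode needs a head_check callable")
        return head_check(url)  # the caller supplies the network check
    return urlsplit(url).path.lower().endswith(KNOWN_EXTENSIONS)
```

Keeping the extension check as the default means existing users would see no change in behavior or speed. Only callers who opt in pay for the HEAD requests or accept the looser filtering.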
Sometimes “protective” defaults make sense… until they eat your users’ images. 😅