trafilatura’s image extraction is a bit too cautious for my taste

A poketto.me user recently filed a curious bug: they had saved a page that clearly contained images, but in the reader view, no images showed up.

I expected some quirky HTML. But when I checked, the <img> tags looked perfectly normal (see Exhibit A). Yet, after passing the HTML through trafilatura (which I use to convert HTML to Markdown), the images had simply vanished.

🔎 The culprit? Trafilatura is very cautious with images. It only accepts <img src=...> URLs that end with a known extension (.jpg, .png, .gif, …). The site in question served images from URLs without extensions — so trafilatura just ignored them.

👉 My quick fix: If poketto.me encounters an <img> without a “valid” extension, I tack one on artificially so trafilatura accepts it. Hacky, but it worked — not only for that user’s page but for many more sites.
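A minimal sketch of such a workaround, not poketto.me's actual code. It assumes the extension check only looks at how the URL string ends, so a dummy `#.jpg` fragment (which is never sent to the server) is enough to pass it. The extension list, the regex, and the names `KNOWN_EXTENSIONS` and `patch_img_sources` are illustrative assumptions:

```python
import re
from urllib.parse import urlsplit

# Assumption: extensions treated as "valid" images; trafilatura's real list may differ.
KNOWN_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")

# Matches <img ... src="..."> with a double-quoted src attribute only.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=")([^"]+)(")', re.IGNORECASE)


def has_known_extension(url: str) -> bool:
    """Return True if the URL path ends with a known image extension."""
    return urlsplit(url).path.lower().endswith(KNOWN_EXTENSIONS)


def patch_img_sources(html: str) -> str:
    """Append a dummy '#.jpg' fragment to <img> URLs that lack an extension."""

    def fix(match: re.Match) -> str:
        prefix, url, suffix = match.groups()
        # Leave data URIs, URLs that already have a fragment, and
        # URLs with a known extension untouched.
        if url.startswith("data:") or "#" in url or has_known_extension(url):
            return match.group(0)
        return f"{prefix}{url}#.jpg{suffix}"

    return IMG_SRC.sub(fix, html)
```

The patched HTML then goes to trafilatura as before. A regex is good enough for a sketch; a real implementation would more likely rewrite the attributes on a parsed tree.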

The proper fix would be more expensive:

1️⃣ Trafilatura would need to send an HTTP HEAD request for each image.

2️⃣ If the Content-Type starts with image/, accept it, regardless of the URL. If not, ignore it.

That’s the robust solution. But I also understand why trafilatura avoids it: dozens of HEAD requests could slow content extraction to a crawl.

So here’s my plan to help out the folks behind trafilatura:

🔵 I’ll draft one pull request that introduces the HEAD check as an opt-in option.
🔵 I’ll draft another PR that simply disables the extension check (closer to my current workaround), also as an opt-in option.

Let’s see which one gets traction with the maintainers.

Sometimes “protective” defaults make sense… until they eat your users’ images. 😅