<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Scraping on Build in Public</title><link>https://build.ralphmayr.com/tags/scraping/</link><description>Recent content in Scraping on Build in Public</description><generator>Hugo</generator><language>en-us</language><copyright>©️ Ralph Mayr 2026</copyright><lastBuildDate>Sat, 04 Oct 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://build.ralphmayr.com/tags/scraping/index.xml" rel="self" type="application/rss+xml"/><item><title>Stopping the scrape: Why I switched to the Wikimedia API</title><link>https://build.ralphmayr.com/posts/96-stopping-the-scrape-why-i-switched-to-the-wikimedia-api/</link><pubDate>Sat, 04 Oct 2025 00:00:00 +0000</pubDate><guid>https://build.ralphmayr.com/posts/96-stopping-the-scrape-why-i-switched-to-the-wikimedia-api/</guid><description>&lt;p&gt;I&amp;rsquo;ve noticed a welcome uptick in users saving Wikipedia articles to poketto.me recently.&lt;/p&gt;
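&lt;p&gt;For anyone curious, the Wikimedia API the title refers to can be called with plain HTTP. Here is a minimal sketch, assuming the public REST v1 &amp;ldquo;page/html&amp;rdquo; route; the User-Agent string below is just a placeholder:&lt;/p&gt;

```python
from urllib.parse import quote

def wikipedia_html_url(title: str, lang: str = "en") -> str:
    """Build the REST v1 URL that returns an article's rendered HTML,
    free of the UI clutter that comes with scraping the regular page."""
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/html/{quote(title, safe='')}"

# Usage (requires the third-party `requests` package):
#   import requests
#   html = requests.get(wikipedia_html_url("Web scraping"),
#                       headers={"User-Agent": "my-app/1.0"}).text
```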
&lt;p&gt;But until now, the app treated Wikipedia just like any other website: it scraped the raw HTML. Turns out, for Wikipedia, that is far from ideal:&lt;/p&gt;
&lt;p&gt;🤯 Artifacts: The extracted content often included UI clutter like &amp;ldquo;Edit&amp;rdquo; buttons, navigation links, and &amp;ldquo;citation needed&amp;rdquo; tags.&lt;/p&gt;
&lt;p&gt;📋 Rendering issues: The standard HTML → Markdown → HTML conversion pipeline introduced plenty of ugly formatting glitches specific to wikis.&lt;/p&gt;</description></item><item><title>trafilatura’s image extraction is a bit too cautious for my taste</title><link>https://build.ralphmayr.com/posts/73-trafilaturas-image-extraction-is-a-bit-too-cautious-for-my-taste/</link><pubDate>Thu, 11 Sep 2025 00:00:00 +0000</pubDate><guid>https://build.ralphmayr.com/posts/73-trafilaturas-image-extraction-is-a-bit-too-cautious-for-my-taste/</guid><description>&lt;p&gt;A poketto.me user recently filed a curious bug: They had saved a page that clearly contained images &amp;mdash; but in the reader view, no images showed up.&lt;/p&gt;
&lt;p&gt;I expected some quirky HTML. But when I checked, the &amp;lt;img&amp;gt; tags looked perfectly normal (see Exhibit A). Yet, after passing the HTML through trafilatura (which I use to convert HTML to Markdown), the images had simply vanished.&lt;/p&gt;
&lt;p&gt;🔎 The culprit? Trafilatura is very cautious with images. It only accepts &amp;lt;img src=...&amp;gt; URLs that end with a known extension (.jpg, .png, .gif, &amp;hellip;). The site in question served images from URLs without extensions &amp;mdash; so trafilatura just ignored them.&lt;/p&gt;</description></item><item><title>Extracting Favicons: There’s No Bulletproof Way</title><link>https://build.ralphmayr.com/posts/47-extracting-favicons-theres-no-bulletproof-way/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://build.ralphmayr.com/posts/47-extracting-favicons-theres-no-bulletproof-way/</guid><description>&lt;p&gt;A &lt;strong&gt;favicon&lt;/strong&gt; is that tiny icon you see next to a site name in your browser tab or bookmarks bar. It's one of those small UX elements that quietly plays a big role in how we recognize and visually differentiate websites.&lt;/p&gt;
&lt;p&gt;In poketto.me, I wanted to bring favicons into play for a couple of UI elements&amp;mdash;particularly when managing your &lt;strong&gt;saved news sources&lt;/strong&gt;. Seeing a little logo beside each source makes skimming, scanning, and organizing much more intuitive than reading domain names alone.&lt;/p&gt;</description></item><item><title>Extracting web content is still… messy.</title><link>https://build.ralphmayr.com/posts/2-extracting-web-content-is-still-messy/</link><pubDate>Wed, 02 Jul 2025 00:00:00 +0000</pubDate><guid>https://build.ralphmayr.com/posts/2-extracting-web-content-is-still-messy/</guid><description>&lt;p&gt;You can talk about autonomous #AIAgents roaming the web and performing all kinds of tasks 'just as a human would' as much as you like, but technically, some of the very basics are still lacking. For example, there is no free, easy-to-use, off-the-shelf solution for extracting web content.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what I mean: Think of poketto.me as a very basic 'agent': you tell it to save a URL, and then it talks to the website 'on your behalf' to access its content. For this use case, the &lt;strong&gt;Newspaper3k&lt;/strong&gt; Python library is pretty good: it teases out structured metadata, but occasionally misses basic things like the content language. To retrieve the actual content, &lt;strong&gt;Trafilatura&lt;/strong&gt; (
&lt;a href="https://github.com/adbar/trafilatura" target="_blank" rel="noopener noreferrer"&gt;https://github.com/adbar/trafilatura&lt;/a&gt;) appears to be the best option at the moment. However, even that doesn't work well with all sites. For edge cases, I actually had to fall back on parsing the raw HTML myself using Beautiful Soup (
&lt;a href="https://beautiful-soup-4.readthedocs.io/en/latest/%29" target="_blank" rel="noopener noreferrer"&gt;https://beautiful-soup-4.readthedocs.io/en/latest/)&lt;/a&gt;. (And yes, I&amp;rsquo;m sending that through an LLM later to streamline the content so all the tiny formatting issues Trafilatura introduces get smoothed out again.)&lt;/p&gt;</description></item></channel></rss>