You can talk about autonomous #AIAgents roaming the web and performing all kinds of tasks 'just as a human would' as much as you like, but technically, some of the very basics are still lacking. For example, there is no free, easy-to-use, off-the-shelf solution for extracting web content.
Here’s what I mean: Think of poketto.me as a very basic 'agent': you tell it to save a URL, and it then talks to the website 'on your behalf' to access its content. For this use case, the Newspaper3k Python library is pretty good: it teases out structured metadata, but occasionally misses basic things like the content language. To retrieve the actual content, Trafilatura (https://github.com/adbar/trafilatura) appears to be the best option at the moment. However, even that doesn't work well with all sites. For edge cases, I actually had to fall back on parsing the raw HTML myself using Beautiful Soup (https://beautiful-soup-4.readthedocs.io/en/latest/). (And yes, I’m sending that through an LLM later to streamline the content, so all the tiny formatting issues Trafilatura introduces get smoothed out again.)
Given the disproportionate hassle required for this tiny use case, what would be needed to enable people to actually build “agents” (AI or plain automation-based) that can interact with the web autonomously, reliably and meaningfully? One of two things:
Either we build these 'agents' in a solid and robust manner from the ground up: for performance reasons they would primarily interact with other sites at the protocol level (as poketto.me does today), but when they encounter an obstacle, they would simulate a real browser (e.g. headless Chrome automated with #Selenium). This, of course, runs into authentication and authorization issues. Suppose the user has a subscription to the New York Times. How would the agent safely and reliably obtain the user's credentials so that it can sign in and access the article the user wanted to save? And how do we ensure that the agent doesn't use those credentials to access the site on behalf of another user? That's where the second option comes in.
In their recent AI x Crypto Crossovers post, the folks at a16z proposed something quite interesting: a blockchain-based infrastructure to govern interactions between agents, third-party sites and end users. It's still a speculative idea, but if our goal is to create real, functioning, reliable agents, it would be a significant step in that direction.