A pseudonymous coder has created an open-source tool designed to trap AI training web crawlers in an endless, randomly generated series of pages. The program, named Nepenthes after the genus of carnivorous pitcher plants, can be used by website owners to protect their content from being scraped, or deployed as a honeypot to waste AI companies’ resources. “It’s less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out,” Aaron B., the creator of Nepenthes, explained.
“The typical web crawler doesn’t appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself. The crawler downloads those new links, and Nepenthes just returns more and more lists of links pointing back to itself.”
Aaron added, “Of course, these crawlers are massively scaled and are downloading links from large swathes of the internet at any given time. But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop.”
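The behavior Aaron describes is simple enough to sketch. Below is a minimal, illustrative Python version of the general link-maze idea; it is not Nepenthes’ actual code, and every name and value in it is invented for the example. Each response is nothing but a page of links back into the same server, so a crawler that blindly follows links never runs out of new URLs to fetch.

```python
# Illustrative sketch of a link maze, not Nepenthes' real implementation.
# Every request, no matter the path, gets a page of links that lead only
# to more randomly named pages on this same server.
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

def random_slug(length=12):
    """Make a random path segment for a link that leads nowhere new."""
    return "".join(random.choices(string.ascii_lowercase, k=length))

class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The page body is just a list of links back into the maze.
        links = "\n".join(
            f'<a href="/{random_slug()}">{random_slug()}</a>' for _ in range(20)
        )
        body = f"<html><body>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Port 8080 is an arbitrary choice for the example.
    HTTPServer(("0.0.0.0", 8080), MazeHandler).serve_forever()
```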
Nepenthes generates its pages deterministically, so they appear to be flat files that never change. It also adds an intentional delay to each response, which keeps crawlers from bogging down the server while wasting even more of their time.
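Both of those details, deterministic pages and deliberate slowness, can be sketched in the same illustrative spirit. In the snippet below (assumed behavior, not Nepenthes’ real code), the random generator is seeded from a hash of the requested path, so the same URL always yields the same page, and each response waits a configurable pause before returning.

```python
# Illustrative sketch only: deterministic page generation plus a tarpit delay.
import hashlib
import random
import string
import time

PAGE_DELAY_SECONDS = 2  # assumed value; tuned so the tarpit doesn't tax the host

def deterministic_links(path, count=20):
    # Seed the RNG from the requested path so repeat visits to the same
    # URL always see identical content, as if it were a static flat file.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return ["/" + "".join(rng.choices(string.ascii_lowercase, k=12)) for _ in range(count)]

def render_page(path):
    time.sleep(PAGE_DELAY_SECONDS)  # waste the crawler's time, spare the server
    links = "\n".join(f'<a href="{href}">{href}</a>' for href in deterministic_links(path))
    return f"<html><body>{links}</body></html>"
```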
Trapping AI crawlers with Nepenthes
The software can be used defensively, drowning a site’s real URLs in a flood of fake ones so bots never reach the actual content, or offensively, feeding crawlers endless amounts of useless data. “In short, let them suck down as much useless data as they have disk space for and choke on it,” the developer said, warning that the program is intentionally malicious and should only be run by those comfortable with that.
The advent of Nepenthes comes amidst growing concerns about AI companies scraping web content without permission to train their models. Last summer, controversy erupted when ClaudeBot, an AI crawler, was accused of bombarding websites incessantly, often disregarding the robots.txt files designed to deter such behavior. Similar grievances were raised by Reddit’s CEO, who lamented the difficulty in blocking these crawlers.
Aaron built Nepenthes after watching Facebook’s crawler overwhelm his site with more than 30 million hits, drawing on tarpitting techniques long used in anti-spam efforts. Running such software carries its own costs, such as increased server load, but Aaron argues that the broader goal is to slow AI investment and push companies to either seek permission before scraping or pay content creators more.
The release of Nepenthes has sparked interest among other developers, with some creating similar tools like Gergely Nagy’s Iocaine, which traps crawlers in a maze to poison their data. As the battle between AI developers and web content creators continues, tools like Nepenthes represent a growing resistance against unchecked AI data scraping.