AI Companies Ruin The Open Internet
Libre News published the article FOSS infrastructure is under attack by AI companies, a good overview and summary of an issue internet infrastructure currently faces: excessive and evasive scraping by AI companies.
This is an issue for everyone hosting web content, but especially for open source and open data platforms, which are inherently open.
It drives up costs in money and time, and it harms regular users when AI companies overload infrastructure while evading blocks.
It’s a depressing read. And it forebodes the extensive blocking measures that have become necessary and that will inevitably also affect normal people.
Excessive Requests
70-90 % of traffic comes from AI companies. That is not only a huge number of requests, but also a huge amount of data served.
They scrape the same costly-to-render pages every 6 hours instead of once (like a git blame page, which seems very low value for AI or online search in general, and even more so when fetched every 6 hours).
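For contrast, HTTP already offers a cheap way for a crawler to ask whether a page has changed before downloading and re-rendering it again: conditional requests with ETag or Last-Modified validators. A minimal sketch of that, assuming the requests library and that the crawler stores the validators between runs:

```python
# Sketch of a conditional GET: re-download a page only if it changed since the
# last crawl. Assumes the third-party requests library; how the validators are
# stored between runs is left out.
import requests

def fetch_if_changed(url, etag=None, last_modified=None):
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None  # unchanged: the server did not have to re-render anything

    # remember these validators for the next crawl of the same URL
    return response.text, response.headers.get("ETag"), response.headers.get("Last-Modified")
```

A 304 response costs the server almost nothing compared to re-rendering a blame page every 6 hours.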
Evading Blocking Efforts
They don’t respect robots.txt instructions about what they may and may not crawl (a sketch of what compliance looks like follows below).
When they start out identifying themselves as what they are and you block them, they simply stop identifying themselves and continue anyway.
When you block their IP addresses, they spread their requests across more and different ones.
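On the robots.txt point: honoring it is trivial on the crawler side, Python’s standard library even ships a parser for it. A minimal sketch of a compliant crawler, with a made-up hostname, path, and bot name:

```python
# Minimal sketch of a crawler that honors robots.txt, using only the standard
# library. The hostname, path, and bot name are made-up placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()  # fetch and parse the site's crawling rules

BOT_NAME = "ExampleAIBot"
page = "https://example.org/repo/blame/main/README.md"

if not rp.can_fetch(BOT_NAME, page):
    raise SystemExit("robots.txt disallows this page; a compliant crawler stops here")

delay = rp.crawl_delay(BOT_NAME) or 1.0  # also honor any requested crawl delay
```

The scrapers described above skip exactly this check, or ignore its answer.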
Blocking Efforts Continue
In one case, a platform operator blocked the entire country of Brazil so that their platform remained usable for regular users.
Another platform operator began implementing experimental proof-of-work frontends. Users may see a placeholder page first before being redirected to the real content.
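The general idea behind such frontends is hashcash-style proof of work: the server hands out a random challenge, the visitor’s browser must find a nonce whose hash meets a difficulty target, and the server can verify the answer with a single hash before serving the real content. Cheap for one visitor, expensive at scraper scale. A rough sketch of the idea, not any specific project’s implementation:

```python
# Rough sketch of a hashcash-style proof-of-work challenge.
# The server issues a random challenge; the client must find a nonce whose
# hash has enough leading zero bits; verification is a single hash.
import hashlib
import os

DIFFICULTY_BITS = 16  # ~65k attempts on average; real deployments tune this

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(challenge: bytes) -> int:
    """Work done client-side while the placeholder page is shown."""
    nonce = 0
    while True:
        digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: bytes, nonce: int) -> bool:
    """Cheap server-side check before the real content is served."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

challenge = os.urandom(16)       # handed out with the placeholder page
nonce = solve(challenge)         # computed by the visitor's browser
assert verify(challenge, nonce)  # then the user is redirected to the real content
```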
I’m not hopeful the US or China will regulate these AI companies.
This surely means we will see less open data, fewer open webpages, more content put behind sign-ups or other barriers, and more hindrances and annoyances for users visiting websites and platforms.
The Cost of AI
Apart from its opportunities, AI comes with considerable costs.
Mass-produced fake content, mass-replicated and duplicated noise, the human impact, and what we read above.
The economic and environmental costs have risen significantly in my estimation. It already costs a lot of energy and resources to collect training data, to train on it, and to run the resulting AI. Continuously scraping the open internet for data elevates this to another level. And every platform that now has these issues has to invest time and resources to combat them.
We will lose accessibility and open data. Collaborative open data and open source projects are already losing resources, and will lose accessibility and openness.