This type of large-scale crawling should be considered a DDoS and the people behind it should be charged with cyber crimes and sent to prison.
Applying the Computer Fraud and Abuse Act to corporations? Sign me up! Hey, they’re also people, aren’t they?
Business idea: AWS, but hosted entirely within the computing power of AI web crawlers.
I like the idea but couldn’t you just go the more direct route and mine crypto?
Yes, but hosting provides value to society.
Seems like such a massive waste of bandwidth since it’s the same work being repeated by many different actors to piece together the same dataset bit by bit.
Ah Capitalism! Truly the king of efficiency /s
I blocked almost all the big hosting players, plus China, Russia, and Vietnam, and now they’re bombarding my site with residential IP addresses from all over the world. They must be using compromised smart home devices or phones with malware.
Soon everything on the internet will be behind a wall.
This isn’t sustainable for the AI companies; when the bubble pops, it will stop.
In the meantime, sites are getting DDoSed by scrapers. One way to stop your site from getting scraped is having it be inaccessible… which is exactly what the scrapers are causing.
Normally I would assume DDoSing is performed in order to take a site offline. But AI scrapers require the opposite: they need their targets online and willing. One would think they’d be a bit more careful about the damage they cause.
But they aren’t, because capitalism.
If they had the slightest bit of survival instinct they’d share an archive.org / Google-style scraper and web cache infrastructure, pull from those caches, and everything would be scraped once and re-scraped only occasionally.
Instead they’re building maximally dumb (as in literally counterproductive and self-harming) scrapers that don’t know what they’re interacting with.
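As a rough illustration of the "scrape once, pull from a shared cache" idea: HTTP conditional requests already let a crawler hit the origin only when content has actually changed. A minimal sketch in Python with the requests library; the shared_cache store here is a hypothetical stand-in for common infrastructure, not any existing system:

```python
import requests

# Hypothetical shared store: url -> (etag, body). In real shared
# infrastructure this would be a cache every crawler reads from.
shared_cache: dict[str, tuple[str, bytes]] = {}

def fetch_once(url: str) -> bytes:
    """Fetch a page, touching the origin only when the cached copy is stale."""
    headers = {}
    cached = shared_cache.get(url)
    if cached:
        # Conditional GET: the server answers 304 with no body if unchanged.
        headers["If-None-Match"] = cached[0]
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code == 304 and cached:
        return cached[1]  # unchanged; reuse the shared copy
    etag = resp.headers.get("ETag")
    if etag:
        shared_cache[url] = (etag, resp.content)
    return resp.content
```

With a store like this in front, N crawlers cost the origin roughly one full response per content change instead of N full crawls.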
At what point will people start to track down and sabotage AI datacenters IRL?
I run my own gitea instance on my own server, and within the past week or so I’ve noticed it getting absolutely nailed. One repo in particular, a Wayland WM I built, just keeps getting hammered over and over by IPs in China.
Simple solution: Block Chinese IPs!
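For what it’s worth, country-level blocking is straightforward to sketch. A minimal example in Python using the geoip2 library with a MaxMind GeoLite2 database; the database path and the blocked-country set are illustrative assumptions:

```python
import geoip2.database
import geoip2.errors

# Assumes a GeoLite2 country database downloaded from MaxMind.
READER = geoip2.database.Reader("GeoLite2-Country.mmdb")
BLOCKED = {"CN"}  # ISO country codes to refuse (illustrative)

def is_blocked(ip: str) -> bool:
    """Return True if the client IP geolocates to a blocked country."""
    try:
        country = READER.country(ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return False  # unknown addresses pass through
    return country in BLOCKED
```

Wire is_blocked into whatever middleware your server uses. As the earlier comment points out, though, this just pushes the scrapers onto residential proxies.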
I had thought that a client-side proof-of-work (or even just a delay) bound to the IP might push the AI companies to behave instead, because single-visit-per-IP crawlers get too expensive or too slow, and you can just block the ordinarily abusive ones. But they already have mind-blowing computing and money resources and they only want your data.
But what if there were a simple-to-use, integrated solution and every single webpage used this approach?
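For anyone curious what a proof-of-work bound to the IP could look like, here is a minimal hashcash-style sketch in Python. The secret, the difficulty, and the challenge format are illustrative assumptions, not any specific product’s scheme: the server derives a per-IP challenge, the client burns CPU finding a nonce, and the server verifies with a single hash.

```python
import hashlib
import hmac
import os

SECRET = os.urandom(32)  # server-side secret (illustrative)
DIFFICULTY = 20          # required leading zero bits; tunes the client's cost

def make_challenge(client_ip: str) -> str:
    """Derive a per-IP challenge so solved work can't be shared across IPs."""
    return hmac.new(SECRET, client_ip.encode(), hashlib.sha256).hexdigest()

def leading_zero_bits(digest: bytes) -> int:
    """Count leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce that clears the difficulty target."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY:
            return nonce
        nonce += 1

def verify(client_ip: str, nonce: int) -> bool:
    """Server side: one cheap hash to check the client's work."""
    digest = hashlib.sha256(f"{make_challenge(client_ip)}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY
```

The asymmetry is the point: each page view costs the client about 2^20 hashes on average while the server verifies with one. But as noted above, companies with enormous compute budgets may simply pay that price.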
The solution was invented long ago. It’s called a captcha.
It’s a little bother for legitimate users, but a good captcha is still hard to bypass, even with AI.
And from the end user’s standpoint, I think I’d rather lose 5 seconds on a captcha than have my browser run an unsolicited heavy crypto challenge on my end.
For years, we’ve written that CAPTCHAs drive us crazy. Humans give up on CAPTCHA puzzles approximately 15% of the time and, maddeningly, CAPTCHAs are significantly easier for bots to solve than they are for humans.
https://blog.cloudflare.com/turnstile-ga/
I hate captchas.