What do you think of web berms?

I really like the documentation for OpenWRT, but they literally stall out the user. I have seen an 11-second delay.

I call these web berms. What are they thinking???

FreshPorts has a banner about web bot protection coming soon. I don't wanna become a web site member, no offense.

Dan, is the situation that bad? No way to use blocklists? Is this DDoS-related?

Stalling out web page users seems absurd. We used to pride ourselves on page response times.
 
Yes, the situation is that bad. The reason, though, isn't DDoS attacks but AI crawlers for stuff like ChatGPT, Grok and so on wreaking havoc with their bad behaviour.

Cloudflare just recently announced that their protection services will be able to keep them out, if you want them to. The service used here was made for the same reason.

Wikipedia, for example, had an article in April about how these badly behaved crawlers are affecting their operations.

65% of their traffic comes from bots, and it adds spikes in traffic consumption. Traffic is expensive, and many large sites are suffering from the same phenomenon right now. Some sites also dislike being indexed by AI crawlers, since it diverts traffic away from the site to OpenAI, Microsoft, Perplexity and so on, making dents in their revenue streams while the tech bros make money with their content.

This is why I expect this to become pretty much common practice and even more widespread in the near future.

It is noteworthy that it's not the normal search crawlers for Google, Bing and so on that are the problem; those are well behaved. It's the AI LLMs, which need to inhale all of the internet and then some to get enough training data, and which can now "deep research" the internet when asked questions. That is what is causing the issues.

[Chart: Multimedia bandwidth demand for the Wikimedia Projects]
 
It's the AI LLMs, which need to inhale all of the internet and then some to get enough training data, and which can now "deep research" the internet when asked questions. That is what is causing the issues.
So from the quaint days of robots.txt to now, blocklists don't work on AI? Why can't they be blocked, like with fail2ban? Are AI scrapers too sophisticated to be detected? So we make users wait 11 seconds instead.

Internet governance is really needed if this is what the future looks like.
 
I guess AI scraper protections like Anubis use your device's horsepower to calculate something so they can tell you are a real user. IIRC, the site you've shared took like 4s on my phone. I never got annoyed by these things because they don't seem to last that long for me. I would not want my site to be consumed by AI bots either, so I understand sites that use things like that. Maybe Cloudflare would solve this in a better way, idk, but then you would have to use Cloudflare too 😅.
 
How do these stop AI data scraping bots? Don't most bots use an embedded HTML5 web renderer to scrape the data? The renderer would succeed with the calculation no problem and then pass the rendered data to the bot itself.
 
A website for my other hobby (bamboo fly rods) has always had issues with bots and crawlers sucking up bandwidth, so they implemented the CloudFlare thing. It's similar to a captcha: it asks you to verify "yes, I am a human" before redirecting. Not sure how it's doing it, but I've run into issues where I'd hit that every time I clicked a link, though it worked fine running Firefox in a private window. That's all the CloudFlare thing does: check the box and you get to the destination. Not sure if there are other options.
 
Well, the tech bros behind LLMs don't give a shit about good behaviour or copyright. I mean, it's well known Facebook used TBs of pirated eBooks to train their models.

So they also don't care much about robots.txt, in my opinion. robots.txt is a guideline; if you want your crawler to ignore it, you can.
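To illustrate just how voluntary that is: respecting robots.txt is literally the crawler choosing to check a text file before fetching a page, nothing more. A minimal sketch in Python using the standard library's robot-file parser (example.com and the "MyCrawler/1.0" user agent are placeholders):

Code:
import urllib.robotparser

# What a *polite* crawler does before fetching a page.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

page = "https://example.com/docs/some-page"
if rp.can_fetch("MyCrawler/1.0", page):
    print("robots.txt allows it, fetch away")
else:
    print("robots.txt says no - a polite crawler stops here")
    # ...but nothing technically prevents fetching the page anyway;
    # the whole mechanism is honor-system only.

The enforcement is entirely on the crawler's side, which is exactly why the badly behaved ones can simply skip this step.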

Crawlers normally don't execute JavaScript. Stuff like Anubis seems to rely on executing JavaScript in order to separate AI crawlers from normal web browsers, in my opinion.
 
Actually, LLMs are a very big nail in the coffin of the open web.

In April, around 75% of new web content was generated by AI, according to estimates.

Training new LLM models on AI-generated content deteriorates their quality, so the LLM makers need to find ways around that.
 
Ugh those bot check delays.

I'd like to think that if info were more sparse or limited in definition (encouraging further self-research), AI wouldn't find it useful. Fewer characters are also lighter for AI to scrape :p
 
How do these stop AI data scraping bots? Don't most bots use an embedded HTML5 web renderer to scrape the data? The renderer would succeed with the calculation no problem and then pass the rendered data to the bot itself.
As far as I understand, the idea is not to keep the bots/scrapers out the way captchas used to, by providing a challenge that only a human can solve. Instead, the idea is that when you make a request, you first have to provide proof of work, which means you (the client) have to solve some challenge first (usually in the form of an expensive computation).
For most normal users, this will be negligible (i.e. a couple of seconds). However, for scrapers, this will very quickly add up and eat a lot of processing power, making it less attractive (or even economically unfeasible) to scrape websites that are "protected" that way.
Technically, they can still get all the data - it will just cost them more.
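To make that concrete, here is a minimal sketch of the general proof-of-work idea in Python. This is not Anubis's actual implementation; the hash choice and difficulty below are assumptions for illustration only. The server hands out a random challenge, the client has to find a nonce so that the hash of challenge+nonce starts with enough zero bits, and the server can check the answer with a single cheap hash:

Code:
import hashlib
import os

# Assumed difficulty: ~2^20 hashes on average, a few seconds of pure-Python work.
DIFFICULTY_BITS = 20

def make_challenge() -> str:
    """Server side: hand out a random challenge string."""
    return os.urandom(16).hex()

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce -- this is the 'expensive computation'."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

challenge = make_challenge()
nonce = solve(challenge)          # costs the client real CPU time
print(verify(challenge, nonce))   # costs the server almost nothing -> True

A normal visitor pays this once per visit; a scraper hitting millions of pages pays it millions of times, which is the whole point.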

My personal opinion on that front is that if LLM providers like OpenAI have higher costs to train their models, their service will just be more expensive for their end users, and everybody else gets negatively affected by the "slow loading times" and increased battery consumption on mobile devices.
 
I'm still wondering why they can't just block the IP ranges. Is it a whack-a-mole type of situation?
There should be a reason behind it, or else we would not see so many "web berms".

It would be interesting to see the recent stats on this site; it sure has some tasty data for the LLMs.
 
took like 4s on my phone.
I agree most requests to the OpenWRT docs are around 4s, but that is noticeable. And like mer mentions, now we have Cloudflare's gimmick.
I switched my DNS from Cloudflare before I realized it was not the culprit.

Why can't scraping be detected? Send them to a LaBrea tarpit. They can fingerprint browsers but not detect somebody ripping off whole sites?

 
A website for my other hobby (bamboo fly rods) has always had issues with bots and crawlers sucking up bandwidth, so they implemented the CloudFlare thing. It's similar to a captcha: it asks you to verify "yes, I am a human" before redirecting. Not sure how it's doing it, but I've run into issues where I'd hit that every time I clicked a link, though it worked fine running Firefox in a private window. That's all the CloudFlare thing does: check the box and you get to the destination. Not sure if there are other options.
Try https://daemonforums.org/ ...the same
 
Why can't scraping be detected? Send them to a LaBrea tarpit. They can fingerprint browsers but not detect somebody ripping off whole sites?
Tarpitting still allows a scraper to collect the information without incurring any significant cost (compared to the proof-of-work approach that solutions like Anubis take).
Furthermore, tarpitting is costly on the server side. The server still needs to allocate and manage the connection. Tarpits are basically a server-side, self-inflicted slowloris attack.
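For reference, a tarpit in its simplest form is just a server that accepts the connection and then dribbles the response out as slowly as it can. A hypothetical sketch in Python (not how LaBrea itself works; LaBrea plays its tricks at the TCP level) already shows the cost described above: every trapped scraper is a socket and a task the server has to keep alive.

Code:
import asyncio

async def tarpit(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Feed the client a few meaningless bytes every few seconds, forever."""
    await reader.read(1024)  # read (and ignore) whatever request came in
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
    await writer.drain()
    try:
        while True:
            writer.write(b"<!-- -->")   # a dribble of nothing
            await writer.drain()
            await asyncio.sleep(5)      # the scraper waits... but so does our socket
    except ConnectionError:
        pass
    finally:
        writer.close()

async def main() -> None:
    # Every stalled client is an open connection *we* have to hold on to --
    # the self-inflicted slowloris mentioned above.
    server = await asyncio.start_server(tarpit, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())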

They can fingerprint browsers but not detect somebody ripping off whole sites?
Fingerprinting doesn't work well against non-human users/clients.

I'm still wondering why they can't just block the IP ranges. Is it a whack-a-mole type of situation?
Anyone who's seriously in the scraping business has truckloads of IP addresses and uses different clients with different geolocations to spread the load - some are reportedly using botnets. Blocking IP addresses does absolutely nothing and negatively affects legitimate users of your service.
 
A website for my other hobby (bamboo fly rods) has always had issues with bots and crawlers sucking up bandwidth, so they implemented the CloudFlare thing. It's similar to a captcha: it asks you to verify "yes, I am a human" before redirecting. Not sure how it's doing it, but I've run into issues where I'd hit that every time I clicked a link, though it worked fine running Firefox in a private window. That's all the CloudFlare thing does: check the box and you get to the destination. Not sure if there are other options.
My guess is the amount of time it takes you to click, combined with the click not being centered in the box, makes it look non-deterministic; I'd assume a bot would be too consistent.
 
Ooh, ooh, I suffer through those Cloudflare-triggered captchas more often than I ever have in the past, and sometimes they don't let me through. But now the investment in extra bandwidth capacity by the ISPs is beginning to make sense to me after reading this thread. Although sometimes it helps to clear the browser cache to avoid getting hit too badly by this. I'm sometimes reluctant to do that, but then I'm reminded that clearing the browser cache is not a bad security practice, and re-authentication in places is a small price to pay.

A website for my other hobby (bamboo fly rods)
This brings back memories of my childhood... bamboo fly rods used to be premium stuff, and if you could not afford one, you just used whatever you could find in the nearby forest... but I digress.
 