What do you think of web berms?

I really like the documentation for OpenWRT, but they literally stall out the user. I have seen an 11-second delay.

I call these web berms. What are they thinking???

FreshPorts has a banner about web bot protection coming soon. I don't wanna become a web site member, no offense.

Dan, is the situation that bad? No way to use blocklists? Is this DDoS-related?

Stalling out web page users seems absurd. We used to pride ourselves on page response times.
 
Yes, the situation is that bad. The reason, though, isn't DDoS attacks but AI crawlers for stuff like ChatGPT, Grok and so on wreaking havoc with their bad behaviour.

Cloudflare just recently announced that their protection services will be able to keep them out, if you want them to. The service used here was made for the same reason.

Wikipedia, for example, had an article in April about how these badly behaved crawlers are affecting their operations.

65% of their traffic comes from bots, and it adds spikes in traffic consumption. Traffic is expensive, and many large sites are suffering from the same phenomenon right now. Some sites also dislike being indexed by AI crawlers, since it diverts traffic away from the site to OpenAI, Microsoft, Perplexity and so on, making dents in their revenue streams while the tech bros make money with their content.

This is why I expect this to become pretty much common practice and even more widespread in the near future.

It is noteworthy that it's not the normal search crawlers for Google, Bing and so on that are the problem; those are well behaved. It's the AI LLMs, which need to inhale all of the internet and then some to get enough training data, and which can now "deep research" the internet when asked questions. That is what is causing the issues.

[Chart: Multimedia bandwidth demand for the Wikimedia Projects]
 
It's the AI LLMs, which need to inhale all of the internet and then some to get enough training data, and which can now "deep research" the internet when asked questions. That is what is causing the issues.
So from the quaint days of robots.txt to now, blocklists don't work on AI? Why can't they be blocked, like with fail2ban? Are AI scrapers too sophisticated to be detected? So we make users wait 11 seconds instead.

Internet governance is really needed if this is what the future looks like.
 
I guess AI scraper protections like Anubis use your device's horsepower to calculate something so they can tell you are a real user. IIRC, the site you've shared took like 4s on my phone. I never got annoyed by these things because they don't seem to last that long for me. I would not want my site to be consumed by AI bots either, so I understand sites that use things like that. Maybe Cloudflare would solve this in a better way, idk, but then you would have to use Cloudflare too 😅.
 
How do these stop AI data scraping bots? Don't most bots use an embedded HTML5 web renderer to scrape the data? The renderer would succeed with the calculation no problem and then pass the rendered data to the bot itself.
 
A website for my other hobby (bamboo fly rods) has always had issues with bots and crawlers sucking up bandwidth, so they implemented the CloudFlare thing. It's similar to a captcha: it asks you to verify "yes, I am a human" before redirecting. Not sure how it's doing it, but I've run into issues where I'd hit that every time I clicked a link, though it worked fine running Firefox in a private window. That's all the CloudFlare thing does: check the box and you get to the destination. Not sure if there are other options.
 
Well, the tech bros behind LLMs don't give a shit about good behaviour or copyright. I mean, it's well known Facebook used TBs of pirated eBooks to train their models.

So they also don't care much about robots.txt, in my opinion. robots.txt is a guideline; if you want your crawler to ignore it, you can.
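To illustrate just how voluntary that is: respecting robots.txt is literally the crawler choosing to check a text file before fetching a page, nothing more. A minimal sketch in Python using the standard library's robot-file parser (example.com and the "MyCrawler/1.0" user agent are placeholders):

Code:
import urllib.robotparser

# What a *polite* crawler does before fetching a page.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

page = "https://example.com/docs/some-page"
if rp.can_fetch("MyCrawler/1.0", page):
    print("robots.txt allows it, fetch away")
else:
    print("robots.txt says no - a polite crawler stops here")
    # ...but nothing technically prevents fetching the page anyway;
    # the whole mechanism is honor-system only.

The enforcement is entirely on the crawler's side, which is exactly why the badly behaved ones can simply skip this step.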

Crawlers normally don't execute JavaScript. Stuff like Anubis seems to rely on executing JavaScript in order to separate AI crawlers from normal web browsers, in my opinion.
 
Actually, LLMs are a very big nail in the coffin of the open web.

In April, around 75% of new web content was generated by AI, according to estimates.

Training new LLM models on AI-generated content deteriorates their quality, so the LLM makers need to find ways around that.
 
Ugh those bot check delays.

I'd like to think that if info were more sparse or limited in definition (encouraging further self-research), AI wouldn't find it useful. Fewer characters are also lighter for AI to scrape :p
 
How do these stop AI data scraping bots? Don't most bots use an embedded HTML5 web renderer to scrape the data? The renderer would succeed with the calculation no problem and then pass the rendered data to the bot itself.
As far as I understand, the idea is not to keep the bots/scrapers out the way captchas used to, by providing a challenge that only a human can solve. Instead, the idea is that when you make a request, you first have to provide proof of work, which means you (the client) have to solve some challenge first (usually in the form of an expensive computation).
For most normal users, this will be negligible (i.e. a couple of seconds). However, for scrapers, this will very quickly add up and eat a lot of processing power, making it less attractive (or even economically unfeasible) to scrape websites that are "protected" that way.
Technically, they can still get all the data - it will just cost them more.
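To make that concrete, here is a minimal sketch of the general proof-of-work idea in Python. This is not Anubis's actual implementation; the hash choice and difficulty below are assumptions for illustration only. The server hands out a random challenge, the client has to find a nonce so that the hash of challenge+nonce starts with enough zero bits, and the server can check the answer with a single cheap hash:

Code:
import hashlib
import os

# Assumed difficulty: ~2^20 hashes on average, a few seconds of pure-Python work.
DIFFICULTY_BITS = 20

def make_challenge() -> str:
    """Server side: hand out a random challenge string."""
    return os.urandom(16).hex()

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce -- this is the 'expensive computation'."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

challenge = make_challenge()
nonce = solve(challenge)          # costs the client real CPU time
print(verify(challenge, nonce))   # costs the server almost nothing -> True

A normal visitor pays this once per visit; a scraper hitting millions of pages pays it millions of times, which is the whole point.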

My personal opinion on that front is that if LLM providers like OpenAI have higher costs to train their models, their service will just be more expensive for their end users, and everybody else gets negatively affected by the "slow loading times" and increased battery consumption on mobile devices.
 
I'm still wondering why they can't just block the IP ranges. Is it a whack-a-mole type of situation?
There should be a reason behind it, or else we would not see so many "web berms".

It would be interesting to see the recent stats on this site; it sure has some tasty data for the LLMs.
 
took like 4s on my phone.
I agree most requests to the OpenWRT docs are around 4s, but that is noticeable. And like mer mentions, now we have Cloudflare's gimmick.
I switched my DNS from Cloudflare before I realized it was not the culprit.

Why can't scraping be detected? Send them to a LaBrea tarpit. They can fingerprint browsers but not detect somebody ripping off whole sites?

 
A website for my other hobby (bamboo fly rods) has always had issues with bots and crawlers sucking up bandwidth, so they implemented the CloudFlare thing. It's similar to a captcha: it asks you to verify "yes, I am a human" before redirecting. Not sure how it's doing it, but I've run into issues where I'd hit that every time I clicked a link, though it worked fine running Firefox in a private window. That's all the CloudFlare thing does: check the box and you get to the destination. Not sure if there are other options.
Try https://daemonforums.org/ ...the same
 
Why can't scraping be detected? Send them to a LaBrea tarpit. They can fingerprint browsers but not detect somebody ripping off whole sites?
Tarpitting still allows a scraper to collect the information without incurring any significant cost (compared to the proof-of-work approach that solutions like Anubis take).
Furthermore, tarpitting is costly on the server side. The server still needs to allocate and manage the connection. Tarpits are basically a server-side, self-inflicted slowloris attack.
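For reference, a tarpit in its simplest form is just a server that accepts the connection and then dribbles the response out as slowly as it can. A hypothetical sketch in Python (not how LaBrea itself works; LaBrea plays its tricks at the TCP level) already shows the cost described above: every trapped scraper is a socket and a task the server has to keep alive.

Code:
import asyncio

async def tarpit(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Feed the client a few meaningless bytes every few seconds, forever."""
    await reader.read(1024)  # read (and ignore) whatever request came in
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
    await writer.drain()
    try:
        while True:
            writer.write(b"<!-- -->")   # a dribble of nothing
            await writer.drain()
            await asyncio.sleep(5)      # the scraper waits... but so does our socket
    except ConnectionError:
        pass
    finally:
        writer.close()

async def main() -> None:
    # Every stalled client is an open connection *we* have to hold on to --
    # the self-inflicted slowloris mentioned above.
    server = await asyncio.start_server(tarpit, "0.0.0.0", 8080)
    async with server:
        await server.serve_forever()

asyncio.run(main())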

They can fingerprint browsers but not detect somebody ripping off whole sites?
Fingerprinting doesn't work well against non-human users/clients.

I'm still wondering why they can't just block the IP ranges. Is it a whack-a-mole type of situation?
Anyone who's seriously in the scraping business has truckloads of IP addresses and uses different clients with different geolocations to spread the load - some are reportedly using botnets. Blocking IP addresses does absolutely nothing and negatively affects legitimate users of your service.
 
A website for my other hobby (bamboo fly rods) has always had issues with bots and crawlers sucking up bandwidth, so they implemented the CloudFlare thing. It's similar to a captcha: it asks you to verify "yes, I am a human" before redirecting. Not sure how it's doing it, but I've run into issues where I'd hit that every time I clicked a link, though it worked fine running Firefox in a private window. That's all the CloudFlare thing does: check the box and you get to the destination. Not sure if there are other options.
My guess is the amount of time it takes you to click, combined with the click not being centered in the box, makes it look non-deterministic; I'd assume a bot would be too consistent.
 
Ooh, ooh, I suffer through those Cloudflare-triggered captchas more often than I ever have in the past, and sometimes they don't let me through. But now the investment in extra bandwidth capacity by the ISPs is beginning to make sense to me after reading this thread. Although sometimes it helps to clear the browser cache to avoid getting hit too badly by this. I'm sometimes reluctant to do that, but then I'm reminded that clearing the browser cache is not a bad security practice, and re-authentication in places is a small price to pay.

A website for my other hobby (bamboo fly rods)
This brings back memories of my childhood... bamboo fly rods used to be premium stuff, and if you could not afford one, you just used whatever you could find in the nearby forest... but I digress.
 