What do you think of web berms?

Way too passive. This is a plague, a war zone. There is a need to fight back.
Collect their patterns with honeypots and send them into endless random mazes.
Who do you think is even able to set honeypots up, let alone figure out the entire process of pattern collection and get it right? Getting it right is what decides how you can really fight back, and what costs you're going to incur on your end. It may be easier to just ignore that burning dumpster and set up shop elsewhere.
 
Actually, LLMs are a very big nail in the coffin of the open web.

In April, around 75% of new web content was generated by AI, according to estimates.
Yep.

Why can't scraping be detected? Send them to a LaBrea tarpit. They can fingerprint browsers, but they can't detect somebody ripping off whole sites?

I need to install Nepenthes on my public Web servers...
 
This is why I made my web project invite-only. The most that can be scraped is a redirect, a login page, and some very small binary files. I should maybe make a minimal CSS page for the login screen. These darn bots and their scrappy scraping.
 
I know some browsers have removed support for passing creds in headers, but I wonder how hard it would be to catch invalid Basic Auth creds in the header and auto-ban the IP of a scraper. They likely don't use a GUI, so auth in the headers would probably work. Of course the server will need auto-ban configured for invalid auth.
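The auto-ban part is something fail2ban (it's in the ports tree) can already do out of the box: nginx writes Basic Auth failures to its error log, and fail2ban ships an nginx-http-auth filter that matches them. A minimal jail sketch, assuming that setup (log path, retry counts and ban time are just placeholders; on FreeBSD the config lives under /usr/local/etc/fail2ban):
Code:
# jail.local -- sketch only, adjust logpath/bantime to your environment
[nginx-http-auth]
enabled  = true
filter   = nginx-http-auth
port     = http,https
logpath  = /var/log/nginx/error.log
maxretry = 3
findtime = 600
bantime  = 86400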
 
I need to install Nepenthes on my public Web servers...

I recently created a new port for iocaine: PR 287944
Iocaine is one of several Nepenthes-inspired tarpits; it also aims to lure in AI crawlers and endlessly feed them random garbage.
I simply divert requests from all known crawler user agents [1] to it (they can still access the robots.txt, but anything else goes to iocaine), as well as everything that tries to connect to an invalid server_name. Connecting to server_name 'localhost' seems to be very popular amongst bots that randomly try out WordPress vulnerabilities/misconfigurations, and I don't mind feeding those script kiddies garbage as well...
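For anyone who wants to replicate the diversion, here is a rough nginx/angie sketch. It is only a sketch: the iocaine listen address/port is an assumption (check your iocaine config), and the map entries should really be generated from the ai.robots.txt list.
Code:
# flag known AI-crawler user agents (extend this from the ai.robots.txt list)
map $http_user_agent $ai_crawler {
    default          0;
    ~*gptbot         1;
    ~*oai-searchbot  1;
    ~*claudebot      1;
    ~*bytespider     1;
}

upstream iocaine { server 127.0.0.1:42069; }   # assumed iocaine listen address/port

# catch-all for invalid/unknown server_names ('localhost' etc.) -> everything goes to iocaine
server {
    listen 80 default_server;
    server_name _;
    location / { proxy_pass http://iocaine; }
}

# real vhost: robots.txt stays reachable, flagged crawlers get diverted
server {
    listen 80;
    server_name www.example.org;   # placeholder

    location = /robots.txt {
        add_header  Content-Type  text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
    }

    location / {
        if ($ai_crawler) { proxy_pass http://iocaine; }   # proxy_pass without a URI part is valid inside if-in-location
        # ... normal site handling ...
    }
}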

True, this won't catch crawlers that spoof their user agent - but it seems that even if they blatantly ignore any robots.txt directives, the cretins who program such crap still take some pride in showing off by presenting their correct user agent. At least I haven't had any spikes in crawler-induced traffic since installing iocaine. Before that, those garbage collectors inflated my monthly bunnynet bill more than once by frantically scanning through thousands of logfiles on my pkg/poudriere frontend (I only had some geofencing at bunnynet in place, which obviously doesn't help against that globally hosted pest).
Within a few weeks I've already fed several hundred (!!) GB of garbage, mainly to the OpenAI GPT crawler, which seems to be the most persistent and robots.txt-ignorant of them all.


[1] a curated and frequently updated list can be found at https://github.com/ai-robots-txt/ai.robots.txt
 
I recently created a new port for iocaine: PR 287944 ...
Brilliant. :D
 
In my experience, spoofing the user-agent is one of the very first things black hats do.
I agree - if there is actually a human behind it and specifically carrying out some (malicious) task.
The 99% static noise of AI crawlers we're targeting here seems to mostly go with the default user-agent strings, e.g.:
Code:
user_agent="Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot"
user_agent="Scrapy/2.12.0 (+https://scrapy.org)"
These are taken from my iocaine metrics - GPTBot makes up ~80% of total hits*, and ~20% are bots that target the "localhost" server name trying to access some 'wp-admin' or 'install.php' crap; those go by a multitude of stupid and obviously bogus user agent strings. There's really everything in those strings: Chrome version 33, Mozilla 4.0, IE 6.0, Windows NT 6.1, iPhone OS 6_1, Android 6.0, and I even have some entries claiming to come from a Nokia 6100...
I'm planning on harvesting those obviously bogus user agent strings and adding them to the nginx/angie filter rule.
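If that pans out, the harvested strings could feed straight into the same kind of user-agent map. A hypothetical sketch of what that could look like (the patterns below are only illustrative):
Code:
# flag obviously bogus/ancient user agents (illustrative patterns only)
map $http_user_agent $bogus_ua {
    default                0;
    "~*MSIE [5-8]\."       1;   # ancient Internet Explorer
    "~*Nokia ?6100"        1;
    "~*iPhone OS [4-6]_"   1;
}
# then treat $bogus_ua exactly like the crawler flag, i.e. divert it to iocaine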

As I said: I haven't seen a single spike in either the traffic/load of my servers or the CDN traffic at bunnynet for the domains I have cached. The remaining peaks align perfectly with expected traffic and are far lower than those caused by AI bots going crazy after discovering e.g. a poudriere build-log directory...


*) I also had Meta, Google and other AI scrapers in the logs/metrics at the beginning, but it seems they are either better at detecting garbage and give up relatively quickly, or they started to actually respect the robots.txt - either way, it's a win.

And FTR: I think the robots.txt I'm serving for every hostname on my servers is pretty clear and unambiguous:
Code:
location /robots.txt {
        add_header  Content-Type  text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
}
 
The 99% static noise of AI crawlers we're targeting here seems to mostly go with the default user-agent strings, e.g.: ... These are taken from my iocaine metrics - GPTBot makes up ~80% of total hits* and ~20% are bots that target the "localhost" server name trying to access some 'wp-admin' or 'install.php' crap ...

Are those AI scrapers, or the traditional hackers looking for vulnerable scripts? The latter has been a real problem for ~20 years.

I need to invest a few days into hardening my web servers, since about 99% of all traffic is now hack attacks and scraping. And while I do allow some search engines (Google, Bing), since I have content that I want to be findable on the web, most of the scraping is not from useful search engines. And sadly, I'm now paying for CPU power and bandwidth for web serving (even though the monthly charges are still below US-$1).
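For the "allow Google/Bing, refuse the rest" part, plain robots.txt already expresses the policy (it only helps against crawlers polite enough to honor it, but it documents the intent). A sketch, using the documented Googlebot/Bingbot names:
Code:
# robots.txt -- allow the search engines I want to be findable on, disallow everyone else
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /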

Along the same lines, I need to harden the few publicly available ssh ports; the daily attack log (which gets e-mailed to me by the nightly security periodic job) is now so long that my default e-mail client refuses to display the message.
 
Along the same lines, I need to harden the few publicly available ssh ports; the daily attack log (which gets e-mailed to me by the nightly security periodic job) is now so long that my default e-mail client refuses to display the message.

I moved to a different TCP port. That cut them down to almost zero. It seems that few people combine port scanning with ssh attempts.
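The move itself is a one-line change in sshd_config plus a service sshd restart; the port number below is just an example, pick your own unassigned high port:
Code:
# /etc/ssh/sshd_config -- example only, choose your own port
#Port 22
Port 49222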
 
Warning: Off-topic

I moved to a different TCP port. That cut them down to almost zero. It seems that few people combine port scanning with ssh attempts.

Done. For many months, the unusual ports worked. But then, once the hackers discovered the port, they started doing several thousand attempts per day. Maybe if I moved to another port again, it would cut it down to near zero.
 
I moved to a different TCP port. That cut them down to almost zero. It seems that few people combine port scanning with ssh attempts.
Over ten years ago I ran ssh on two nonstandard ports for very temporary external access. According to my firewall logs, I'm STILL getting bots pinging those specific two ports looking for my ssh server. I did modify my firewall rules a few months ago to NOT respond to external SYN requests at all, but I haven't checked whether that has lowered the noise.
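On FreeBSD, "do not respond at all" is a plain pf drop rule (no RST goes back, so the scanner just times out). A sketch, where the interface name and the two old ports are placeholders:
Code:
# /etc/pf.conf -- silently drop inbound SYNs to the two retired ssh ports
ext_if = "em0"                       # placeholder interface
old_ssh_ports = "{ 2022, 2222 }"     # placeholder ports
block drop in quick on $ext_if proto tcp from any to any port $old_ssh_ports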
 
What should Wikipedia do about the British? Strong-arming a FREE online encyclopedia?


Sign in to access an encyclopedia?

The only way it could avoid being classed as Category 1 would be to cut the number of people in the UK who could access the online encyclopaedia by about three-quarters, or disable key functions on the site.
 
Sign in to access an encyclopedia?
You're gonna walk into a public library and discover that a homeless guy has been sleeping on its couches? Even if the library has several encyclopedia sets on its shelves, I'd be concerned for my safety. I mean, if a homeless guy is sleeping THERE, what kind of condition are the books in? Will the homeless guy pick a fight with me? I'm here to read; I should not have to worry about my personal safety. The guy is probably homeless because he disregarded social norms to the point that he got kicked out and has no other place to go.
 
Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit.
And the IQ of users of the Wayback Machine will go up 50%.
 
Well, if Wikipedia doesn't host any servers in England, then I'd say "skru u" to the English rules. Allowing any govt to set rules unilaterally for folks outside of their borders sets a very, very bad precedent. I tried to bring this very issue up in the Debian forums just a few hours ago and I got the typical "global community toady" answers that deflected from the main issue of sovereignty and enforceability. I mean, what's next? Allowing the CCP in China to dictate global internet policy because it "offends them"?
 
Exactly, they should just ignore the British rules and should not even have tried to negotiate categories... Cut 100% of them off until it's fixed. Off with the queen's head.

Wikipedia's Human Anatomy pages alone will send all the staff to prison. Think of the children. How could you expose them to this?

The free internet is under multi-pronged attack, from US state laws regarding ID requirements to payment processors exerting their monopoly power.
 
I've never seen homeless people in any of our public libraries.
I'd like to see them there, reading and improving themselves. Better than some busybodies who want librarians to snitch on people about who reads what (looking at shrub). As for providers, simply cut off the government networks first, then the whole country if things don't clear up. Maybe there is a legitimate interest for lawmakers to impose restrictions, on gambling for example, but on Wikipedia?
 