What do you think of web berms?

Way too passive. This is a plague, a war zone. There is a need to fight back.
Collect their patterns with honeypots and send them into endless random mazes.
Who do you think is even able to set honeypots up, let alone figure out the entire process of pattern collection and get it right? Getting it right is what decides how you can really fight back, and what costs you're going to incur on your end. It may be easier to just ignore that burning dumpster and set up shop elsewhere.
 
Actually, LLMs are a very big nail in the coffin of the open web.

In April, around 75% of new web content was generated by AI, according to estimates.
Yep.

Why can't scraping be detected? Send them to a LaBrea tarpit. They can fingerprint browsers, but they can't detect somebody ripping off whole sites?

I need to install Nepenthes on my public Web servers...
 
This is why I made my web project invite-only. The most that can be scraped is a redirect, a login page, and some very small binary files. I should maybe make a minimal CSS page for the login screen. These darn bots and their scrappy scraping.
 
I know some browsers have removed support for passing creds in headers, but I wonder how hard it would be to catch invalid Basic Auth creds in the header and auto-ban the IP of a scraper. They likely don't use a GUI, so auth in the headers would probably work. Of course the server will need auto-ban configured for invalid auth.
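The auto-ban part is something fail2ban (it's in the ports tree) can already do out of the box: nginx writes Basic Auth failures to its error log, and fail2ban ships an nginx-http-auth filter that matches them. A minimal jail sketch, assuming that setup (log path, retry counts and ban time are just placeholders; on FreeBSD the config lives under /usr/local/etc/fail2ban):
Code:
# jail.local -- sketch only, adjust logpath/bantime to your environment
[nginx-http-auth]
enabled  = true
filter   = nginx-http-auth
port     = http,https
logpath  = /var/log/nginx/error.log
maxretry = 3
findtime = 600
bantime  = 86400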
 
I need to install Nepenthes on my public Web servers...

I recently created a new port for iocaine: PR 287944
Iocaine is one of several Nepenthes-inspired tarpits; it also aims to lure in AI crawlers and endlessly feed them random garbage.
I simply divert requests from all known crawler user agents [1] to it (they can still access the robots.txt, but anything else goes to iocaine), as well as everything that tries to connect to an invalid server_name. Connecting to server_name 'localhost' seems to be very popular amongst bots that randomly try out WordPress vulnerabilities/misconfigurations, and I don't mind feeding those script kiddies garbage as well...
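For anyone who wants to replicate the diversion, here is a rough nginx/angie sketch. It is only a sketch: the iocaine listen address/port is an assumption (check your iocaine config), and the map entries should really be generated from the ai.robots.txt list.
Code:
# flag known AI-crawler user agents (extend this from the ai.robots.txt list)
map $http_user_agent $ai_crawler {
    default          0;
    ~*gptbot         1;
    ~*oai-searchbot  1;
    ~*claudebot      1;
    ~*bytespider     1;
}

upstream iocaine { server 127.0.0.1:42069; }   # assumed iocaine listen address/port

# catch-all for invalid/unknown server_names ('localhost' etc.) -> everything goes to iocaine
server {
    listen 80 default_server;
    server_name _;
    location / { proxy_pass http://iocaine; }
}

# real vhost: robots.txt stays reachable, flagged crawlers get diverted
server {
    listen 80;
    server_name www.example.org;   # placeholder

    location = /robots.txt {
        add_header  Content-Type  text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
    }

    location / {
        if ($ai_crawler) { proxy_pass http://iocaine; }   # proxy_pass without a URI part is valid inside if-in-location
        # ... normal site handling ...
    }
}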

True, this won't catch crawlers that spoof their user agent - but it seems that even if they blatantly ignore any robots.txt directives, the cretins who program such crap still take some pride in showing off by presenting their correct user agent. At least I haven't had any spikes in crawler-induced traffic since installing iocaine. Before that, those garbage collectors inflated my monthly bunnynet bill more than once by frantically scanning through thousands of logfiles on my pkg/poudriere frontend (I only had some geofencing at bunnynet in place, which obviously doesn't help against that globally hosted pest).
Within a few weeks I've already fed several hundred (!!) GB of garbage, mainly to the OpenAI GPT crawler, which seems to be the most persistent and robots.txt-ignorant of them all.


[1] a curated and frequently updated list can be found at https://github.com/ai-robots-txt/ai.robots.txt
 
I recently created a new port for iocaine: PR 287944 ...
Brilliant. :D
 
In my experience, spoofing the user-agent is one of the very first things black hats do.
I agree - if there is actually a human behind it and specifically carrying out some (malicious) task.
The 99% static noise of AI crawlers we're targeting here seems to mostly go with the default user-agent strings, e.g.:
Code:
user_agent="Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)"
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot"
user_agent="Scrapy/2.12.0 (+https://scrapy.org)"
These are taken from my iocaine metrics - GPTBot makes up ~80% of total hits*, and ~20% are bots that target the "localhost" server name trying to access some 'wp-admin' or 'install.php' crap; those go by a multitude of stupid and obviously bogus user agent strings. There's really everything in those strings: Chrome version 33, Mozilla 4.0, IE 6.0, Windows NT 6.1, iPhone OS 6_1, Android 6.0, and I even have some entries claiming to come from a Nokia 6100...
I'm planning on harvesting those obviously bogus user agent strings and adding them to the nginx/angie filter rule.
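If that pans out, the harvested strings could feed straight into the same kind of user-agent map. A hypothetical sketch of what that could look like (the patterns below are only illustrative):
Code:
# flag obviously bogus/ancient user agents (illustrative patterns only)
map $http_user_agent $bogus_ua {
    default                0;
    "~*MSIE [5-8]\."       1;   # ancient Internet Explorer
    "~*Nokia ?6100"        1;
    "~*iPhone OS [4-6]_"   1;
}
# then treat $bogus_ua exactly like the crawler flag, i.e. divert it to iocaine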

As I said: I haven't seen a single spike in either the traffic/load of my servers or the CDN traffic at bunnynet for the domains I have cached. The remaining peaks align perfectly with expected traffic and are far lower than those caused by AI bots going crazy after discovering e.g. a poudriere build-log directory...


*) I also had Meta, Google and other AI scrapers in the logs/metrics at the beginning, but it seems they are either better at detecting garbage and give up relatively quickly, or they started to actually respect the robots.txt - either way, it's a win.

And FTR: I think the robots.txt I'm serving for every hostname on my servers is pretty clear and unambiguous:
Code:
location /robots.txt {
        add_header  Content-Type  text/plain;
        return 200 "User-agent: *\nDisallow: /\n";
}
 
The 99% static noise of AI crawlers we're targeting here seems to mostly go with the default user-agent strings, e.g.: ... These are taken from my iocaine metrics - GPTBot makes up ~80% of total hits* and ~20% are bots that target the "localhost" server name trying to access some 'wp-admin' or 'install.php' crap ...

Are those AI scrapers, or the traditional hackers looking for vulnerable scripts? The latter has been a real problem for ~20 years.

I need to invest a few days into hardening my web servers, since about 99% of all traffic is now hack attacks and scraping. And while I do allow some search engines (Google, Bing), since I have content that I want to be findable on the web, most of the scraping is not from useful search engines. And sadly, I'm now paying for CPU power and bandwidth for web serving (even though the monthly charges are still below US-$1).
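For the "allow Google/Bing, refuse the rest" part, plain robots.txt already expresses the policy (it only helps against crawlers polite enough to honor it, but it documents the intent). A sketch, using the documented Googlebot/Bingbot names:
Code:
# robots.txt -- allow the search engines I want to be findable on, disallow everyone else
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /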

Along the same lines, I need to harden the few publicly available ssh ports; the daily attack log (which gets e-mailed to me by the nightly security periodic job) is now so long that my default e-mail client refuses to display the message.
 
Along the same lines, I need to harden the few publicly available ssh ports; the daily attack log (which gets e-mailed to me by the nightly security periodic job) is now so long that my default e-mail client refuses to display the message.

I moved to a different TCP port. That cut them down to almost zero. It seems that few people combine port scanning with ssh attempts.
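The move itself is a one-line change in sshd_config plus a service sshd restart; the port number below is just an example, pick your own unassigned high port:
Code:
# /etc/ssh/sshd_config -- example only, choose your own port
#Port 22
Port 49222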
 
Warning: Off-topic

I moved to a different TCP port. That cut them down to almost zero. It seems that few people combine port scanning with ssh attempts.

Done. For many months, the unusual ports worked. But then, once the hackers discovered the port, they started doing several thousand attempts per day. Maybe if I moved to another port again, it would cut it down to near zero.
 
I moved to a different TCP port. That cut them down to almost zero. It seems that few people combine port scanning with ssh attempts.
Over ten years ago I ran ssh on two nonstandard ports for very temporary external access. According to my firewall logs, I'm STILL getting bots pinging those specific two ports looking for my ssh server. I did modify my firewall rules a few months ago to NOT respond to external SYN requests at all, but I haven't checked whether that has lowered the noise.
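On FreeBSD, "do not respond at all" is a plain pf drop rule (no RST goes back, so the scanner just times out). A sketch, where the interface name and the two old ports are placeholders:
Code:
# /etc/pf.conf -- silently drop inbound SYNs to the two retired ssh ports
ext_if = "em0"                       # placeholder interface
old_ssh_ports = "{ 2022, 2222 }"     # placeholder ports
block drop in quick on $ext_if proto tcp from any to any port $old_ssh_ports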
 
What should Wikipedia do about the British? Strong-arming a FREE online encyclopedia?


Sign in to access an encyclopedia?

The only way it could avoid being classed as Category 1 would be to cut the number of people in the UK who could access the online encyclopaedia by about three-quarters, or disable key functions on the site.
 
Sign in to access an encyclopedia?
You're gonna walk into a public library and discover that a homeless guy has been sleeping on its couches? Even if the library has several encyclopedia sets on its shelves, I'd be concerned for my safety. I mean, if a homeless guy is sleeping THERE, what kind of condition are the books in? Will the homeless guy pick a fight with me? I'm here to read; I should not have to worry about my personal safety. The guy is probably homeless because he disregarded social norms to the point that he got kicked out and has no other place to go.
 
Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit.
And the IQ of users of the Wayback Machine will go up 50%.
 
Well, if Wikipedia doesn't host any servers in England, then I'd say "skru u" to the English rules. Allowing any govt to set rules unilaterally for folks outside of their borders sets a very, very bad precedent. I tried to bring this very issue up in the Debian forums just a few hours ago and I got the typical "global community toady" answers that deflected from the main issue of sovereignty and enforceability. I mean, what's next? Allowing the CCP in China to dictate global internet policy because it "offends them"?
 
Exactly, they should just ignore the British rules and should not even have tried to negotiate categories... Cut 100% of them off until it's fixed. Off with the queen's head.

Wikipedia's Human Anatomy pages alone will send all the staff to prison. Think of the children. How could you expose them to this?

The free internet is under multi-pronged attack, from US state laws regarding ID requirements to payment processors exerting their monopoly power.
 
I've never seen homeless people in any of our public libraries.
I'd like to see them there, reading and improving themselves. Better than some busybodies who want librarians to snitch on people about who reads what (looking at shrub). As for providers, simply cut off the government networks first, then the whole country if things don't clear up. Maybe there is a legitimate interest for lawmakers to impose restrictions, on gambling for example, but on Wikipedia?
 