How valuable is the FreeBSD source ?

PMc · Feb 26, 2025

There is a git repository of the FreeBSD src, at https://cgit.freebsd.org/src/

Some time in the past that server had failed. Anybody can replicate the FreeBSD repo and setup a Webserver with cgit, and since I already had a cgit running, I did so.

So far, so good. Then I found lots of robots collecting that data, and since the repos are quite a lot of files, that became annoying - in addition to being completely pointless as the data is nothing new or of value, but just a mirror-copy.

So I placed a proper robots.txt file there, to protect them and me from wasting unnecessary traffic - and amazon and a few others did in fact respect that.
But others did not. And this is new to me: the Internet can only function per mutual agreements; deliberately to disregard the requests of the peer is something that was not seen earlier on the net.

I found the respective adresses all belonging to "compute.hwclouds-dns.com", and also a bunch of non-resolvable IPs that do belong to something called "huaweicloud" aka Xinnet, with responsible's e-mails at huawei.com. Apparently this is a very unpleasant and disrespectful member of the Internet community.

So I blocked these address ranges, and things were good, for a while.
Then the robot-scans did reappear, this time originating from "crawl.bytedance.com" and a whole bunch of IPs from Alibaba cloud - some more very unpleasant and disrespectful people who should not be suffered on the net.

I blocked these also. But this time it did not take long. And now the picture is really bizarre: yesterday all of a sudden did lots of scans appear, while the originating addresses look like this:

Code:

host-181-199-63-48.ecua.net.ec
gmt-237-037.gamatelecomnet.com.br
pc-231-116-101-190.cm.vtr.net
bbb4aaae.virtua.com.br
139.13.196.181.static.anycast.cnt-grms.ec
152-248-41-138.user.vivozap.com.br
ip-179.108.50.220.redeatel.com.br
177.18.235.127.static.host.gvt.net.br
host190.5.33.175.dynamic.pacificored.cl
ip-189.126.44.83.jrtelecom.com.br
138-117-208-204.viamartelecom.com.br
170-238-198-165.static.sumicity.net.br
186-249-197-110.unifique.netnow
189-201-234-177.gigasat.net.br

There is an incredible lot of them, and they look like they were private user addresses, distributed all over south america. But this cannot be what it seems - this is some organized scanning, only /disguised/ as private users.

This is a different kind of quality now: not just somebody who has a web scanning bot, and decides to not respect the robots.txt file basically because they are antisocial assholes, but rather somebody who goes to quite some lengths of organization (and investment!) in order to /deliberately/ behave unpleasant.

And the question arises: what do they want to achieve? What is the value to gain from collecting a redundant copy of the FreeBSD-src which could much easier be obtained per git cloning?

3301 · Feb 26, 2025

Some part of this traffic is generated by "AI" bots: https://blog.cloudflare.com/declari...ts-scrapers-and-crawlers-with-a-single-click/
Some people want to fight them: https://github.com/ai-robots-txt

Just another good side of "AI revolution".

cracauer@ · Feb 26, 2025

One factor is that machine learning companies tend to ignore robots.txt when collecting training data. Either because they think that robots.txt is only for classic search engines, or because they are desperate for training data.

eternal_noob · Feb 26, 2025

Stop signs never worked for criminals. On the contrary, they attract them magically.

kpedersen · Feb 26, 2025

With a bit of fiddling you can automatically blacklist addresses that ignore the robots.txt and try to request files that would otherwise never be seen. Same goes for addresses that scrape blacklistme@mydomain.com and try to send an email to it.

Ultimately, if I left a nice bike on the street with a note "don't steal me", I don't believe every single person walking past would respect that. So I certainly don't believe the wider online community to be any better.

drhowarddrfine · Feb 26, 2025

PMc said:
So I placed a proper robots.txt file there, to protect them and me from wasting unnecessary traffic - and amazon and a few others did in fact respect that.
But others did not. And this is new to me: the Internet can only function per mutual agreements; deliberately to disregard the requests of the peer is something that was not seen earlier on the net.

Robots Refresher: introducing a new series | Google Search Central Blog | Google for Developers

developers.google.com

dkh · Feb 28, 2025

I ended up setting up an ipfw table that contains three of the FireHOL lists (https://iplists.firehol.org/) and block things that way. It's about 35,000 subnets. ipfw tables and being able to swap in new versions on the fly is awesome. A couple of small scripts and it's all automated.

Not sure that the AI scanners are in there but some probably are.

PMc · Mar 16, 2025

3301 said:
Some part of this traffic is generated by "AI" bots: https://blog.cloudflare.com/declari...ts-scrapers-and-crawlers-with-a-single-click/
Some people want to fight them: https://github.com/ai-robots-txt

Just another good side of "AI revolution".

Indeed.

I am not sure about the viability of "fighting" them. The "robots.txt" came up as an agreement between disparate interests of webpage-operators and searchengine-operators - and that should usually be the way to solve such things.

cracauer@ said:
One factor is that machine learning companies tend to ignore robots.txt when collecting training data. Either because they think that robots.txt is only for classic search engines, or because they are desperate for training data.

The problem is, a normal bot working for a search engine behaves somehow moderate, it just collects urls to compile them into a searchable database. These machines on the contrary are reckless in grabbing everything in every possible fashion, disregarding the bandwidth consumed. And the webpage operator has to pay for that bandwidth - so they are de-facto thieves (eternal_noob & kpedersen: I very much agree with You on the notion 'criminals').

Now I understand, these are AI backends, and the idea apparently is to feed them as much as ever possible from the internet. Somehow this reminds me of the texas oil boom, where the idea was to dig the ground as much as possible, in order to then find oil and get-rich-quick. At that time the greed of ripping the ressources out of Mother Earth has dominated everything. Now there seems to be a similar greed concering the idea that one might only need to rip enough data out of the internet in order to somehow get-rich-quick. I can, however, not see how that should actually work - maybe I am not intelligent enough for that.

For now I have made the host an IPv6-only node, since anybody running FreeBSD can usually do IPv6. And for now the retrievals have completely disappeared. It seems, as of now the artificial intelligence has not yet aquired enough intelligence to be able to operate IPv6. I'm sure that won't hold for long, but I find it interesting...

But then also, on the general line, this here is just one of many aspects of an ever increasing amount of recklessness, which seems to happen everywhere and in every regard, from patients in need of care being left alone for days lying in their feces, to vendors blatantly denying the by-law required warranty.