FOSS infrastructure is under attack by AI companies

@simple@lemm.ee · 28 days ago

FOSS infrastructure is under attack by AI companies

@daq@lemmy.sdf.org · 27 days ago

I’m not sure how they actually implemented it, but you can easily block ML crawlers via cloud flare. Isn’t just about every small site/service behind CF anyway?

@grysbok@lemmy.sdf.org · 27 days ago

Last I checked, cloudflare requires the user to have JavaScript and cookies enabled. My institution doesn’t want to require those because it would likely impact legitimate users as well as bots.

@daq@lemmy.sdf.org · 27 days ago

Huh? I can reach my site via curl that has neither. How did you come up with this random set of requirements?

@grysbok@lemmy.sdf.org · 27 days ago

Odd. I just tried

curl https://www.scrapingcourse.com/cloudflare-challenge

and got

Enable JavaScript and cookies to continue

I’m clearly not on the same setup as you are, but my off-the-cuff guess is that your curl command was issued from a system that cloudflare already recognized (IP whitelist, cookies, I dunno).

Anyways, I’m reading through this blog post on using cURL with cloudflare-protected sites and I’m finding it interesting.

@daq@lemmy.sdf.org · 27 days ago

Of course their challenge requires those things. How else could they implement it? Most users will never be presented with a challenge though and it is trivial to disable if you don’t want to ever challenge anyone. I was just saying CF blocks ML crawlers.

@RobotToaster@mander.xyz · 28 days ago

If an AI is detecting bugs, the least it could do is file a pull request, these things are supposed to be master coders right? 🙃

@reksas@sopuli.xyz · 28 days ago

to me, ai is a bit like bucket of water if you replace the water with “information”. Its a tool and it cant do anything on its own, you could make a program and instruct it to do something but it would work just as chaotically as when you generate stuff with ai. It annoys me so much to see so many(people in general) think that what they call ai is in anyway capable of independent action. It just does what you tell it to do and it does it based on how it has been trained, which is also why relying on ai trained by someone you shouldnt trust is bad idea.

wjs018 · edit-2 28 days ago

Really great piece. We have recently seen many popular lemmy instances struggle under recent scraping waves, and that is hardly the first time its happened. I have some firsthand experience with the second part of this article that talks about AI-generated bug reports/vulnerabilities for open source projects.

I help maintain a python library and got a bug report a couple weeks back of a user getting a type-checking issue and a bit of additional information. It didn’t strictly follow the bug report template we use, but it was well organized enough, so I spent some time digging into it and came up with no way to reproduce this at all. Thankfully, the lead maintainer was able to spot the report for what it was and just closed it and saved me from further efforts to diagnose the issue (after an hour or two were burned already).

@Dave@lemmy.nz · 28 days ago

AI scrapers are a massive issue for Lemmy instances. I’m gonna try some things in this article because there are enough of them identifying themselves with user agents that I didn’t even think of the ones lying about it.

I guess a bonus (?) is that with 1000 Lemmy instances, the bots get the Lemmy content 1000 times so our input has 1000 times the weighting of reddit.

HubertManne · 28 days ago

Any idea what the point of these are then? Sounds like its reporting a fake bug.

wjs018 · 28 days ago

The theory that the lead maintainer had (he is an actual software developer, I just dabble), is that it might be a type of reinforcement learning:

Get your LLM to create what it thinks are valid bug reports/issues
Monitor the outcome of those issues (closed immediately, discussion, eventual pull request)
Use those outcomes to assign how “good” or “bad” that generated issue was
Use that scoring as a way to feed back into the model to influence it to create more “good” issues

If this is what’s happening, then it’s essentially offloading your LLM’s reinforcement learning scoring to open source maintainers.

HubertManne · 28 days ago

Thats wild. I don’t have much hope for llm’s if things like this is how they are doing things and I would not be surprised given how well they don’t work. Too much quantity over quality in training.

@SabinStargem@lemmy.today · 27 days ago

Honestly, I would be alright with this if the AI companies paid Github so that the server infrastructure can be upgraded. Having AI that can figure out bugs and error reports could be really useful for our society. For example, your computer rebooting for no apparent reason? The AI can check the diagnostic reports, combine them with online reports, and narrow down the possibilities.

In the long run, this could also help maintainers as well. If they can have AI for testing programs, the maintainers won’t have to hope for volunteers or rely on paid QA for detecting issues.

What Github & AI companies should do, is an opt-in program for maintainers. If they allow the AI to officially make reports, Github should offer an reward of some kind to their users. Allocate to each maintainer a number of credits so that they can discuss the report with the AI in realtime, plus $10 bucks for each hour spent on resolving the issue.

Sadly, I have the feeling that malignant capitalism would demand maintainers to sacrifice their time for nothing but irritation.

@BrianTheeBiscuiteer@lemmy.world · edit-2 28 days ago

Testing out a theory with ChatGPT there might be a way, albeit clunky, to detect AI. I asked ChatGPT a simple math question then told it to disregard the rest of the message, then I asked it if it was AI. It answered the math question and told me it was ai. Now a bot probably won’t admit to being AI but it might be foolish enough to consider instruction that you explicitly told it not to follow.

Or you might simply be able to waste its resources by asking it to do something computationally difficult that most people would just reject outright.

Of course all of this could just result in making AI even harder to detect once it learns these tricks. 😬

@itsralC@lemm.ee · 27 days ago

These aren’t actual LLMs scraping the web, they’re your usual scraping bots used in an industrial scale, disregarding conventions about what they should or shouldn’t scrape.

@klu9@lemmy.ca · 28 days ago

The Linux Mint forums have been knocked offline multiple times over the last few months, to the point where the admins had to block all Chinese and Brazilian IPs for a while.

@deeferg@lemmy.world · 28 days ago

This is the first I’ve heard about Brazil in this type of cyber attack. Is it re-routed traffic going there or are there a large number of Brazilian bot farms now?

@klu9@lemmy.ca · 28 days ago

I don’t know why/how, just know that the admins saw the servers were being overwhelmed by traffic from Brazilian IPs and blocked it for a while.

@MonkderVierte@lemmy.ml · 28 days ago

Assuming we could build a new internet fron the ground, what would be the solution? IPFS?

@melpomenesclevage@lemmy.dbzer0.com · edit-2 20 days ago

Removed by mod

@AbsoluteChicagoDog@lemm.ee · 28 days ago

There is no technical solution that will stop corporations with deep pockets in a capitalist society

dindonmasker · 28 days ago

Maybe letters through the mail to receive posts.

@AbsoluteChicagoDog@lemm.ee · 28 days ago

And only then because the USPS is a federal agency. You can bet if private corporations ran it there would be no such privacy.

@WhyJiffie@sh.itjust.works · 28 days ago

so basically what you are saying is to not put information on public places, but only send information to specific people

@cy_narrator@discuss.tchncs.de · 27 days ago

AI will come up there to abuse it as well

Buelldozer · 28 days ago

what would be the solution?

Simple, not allowing anonymous activity. If everything was required to be crypto-graphically signed in such a way that it was tied to a known entity then this could be directly addressed. It’s essentially the same problem that e-mail has with SPAM and not allowing anonymous traffic would mostly solve that problem as well.

Of course many internet users would (rightfully) fight that solution tooth and nail.

@MonkderVierte@lemmy.ml · 27 days ago

No, that’s not a solution, since it would make privacy impossible and bad actors would still find ways around.

@shortwavesurfer@lemmy.zip · 28 days ago

Proof of work before connections are established. The Tor network implemented this in August of 2023 and it has helped a ton.

Buelldozer · 27 days ago

PoW uses a lot of electricity on the client side so environmentally it’s a poor solution, especially at scale.

/home/pineapplelover · 28 days ago

They’re afraid

@melpomenesclevage@lemmy.dbzer0.com · edit-2 20 days ago

Removed by mod

@PrivacyDingus@lemmy.world · 27 days ago

nepenthe

It’s a Markov-chain-based text generator which could be difficult for people to implement on repos depending upon how they’re hosting them. Regardless, any sensibly-built crawler will have rate limits. This means that although Nepenthe is an interesting thought exercise, it’s only going to do anything to things knocked together by people who haven’t thought about it, not the Big Big companies with the real resources who are likely having the biggest impact.

@melpomenesclevage@lemmy.dbzer0.com · edit-2 20 days ago

Removed by mod

Buelldozer · 28 days ago

I too read Drew DeVault’s article the other day and I’m still wondering how the hell these companies have access to “tens of thousands” of unique IP addresses. Seriously, how the hell do they have access to so many IP addresses that SysAdmins are resorting to banning entire countries to make it stop?

@festus@lemmy.ca · 27 days ago

There are residential IP providers that provide services to scrapers, etc. that involves them having thousands of IPs available from the same IP ranges as real users. They route traffic through these IPs via malware, hacked routers, “free” VPN clients, etc. If you block the IP range for one of these addresses you’ll also block real users.

Buelldozer · 27 days ago

There are residential IP providers that provide services to scrapers, etc. that involves them having thousands of IPs available from the same IP ranges as real users.

Now that makes sense. I hadn’t considered rogue ISPs.

@festus@lemmy.ca · 27 days ago

It’s not even necessarily the ISPs that are doing it. In many cases they don’t like this because their users start getting blocked on websites; it’s bad actors piggy-packing on legitimate users connections without those users’ knowledge.

@werefreeatlast@lemmy.world · 28 days ago

If you get something like 156.67.234.6, then 7, then 56 etc just block 156.67.0.0/24

Buelldozer · 27 days ago

Sure, network blocking like this has been a thing for decades but it still requires ongoing manual intervention which is what these SysAdmins are complaining about.

@GreenKnight23@lemmy.world · 28 days ago

fail2ban will always get you better results than banning countries because VPNs are a thing.

that said, I automatically ban any IP that comes from outside the US because there’s literally no reason for anyone outside the US to make requests to my infra. I still use smart IP filtering though.

also, use a WAF on a NAT to expose your apps.

Buelldozer · 27 days ago

fail2ban

I’m familiar with f2b. I even have several clients licensed with the commercial version but it doesn’t fit this use case as there’s no logon failure for it to work with.

I automatically ban any IP that comes from outside the US because there’s literally no reason for anyone outside the US to make requests to my infra.

I have systems setup with geo-blocking but it’s of limited use due to the prevalence of VPNs.

also, use a WAF on a NAT to expose your apps.

This isn’t a solution either because a WAF has no way to know what traffic is bad so it doesn’t know what to block.

db0 · 28 days ago

Yep, it hit many lemmy servers as well, including mine. I had to block multiple alibaba subnet to get things back to normal. But I’m expecting the next spam wave.

𝕸𝖔𝖘𝖘 · 25 days ago

Failtoban should add all those scraper IPs, and we need to just flat out block them. Or send them to those mazes. Or redirect them to themselves lol

@grue@lemmy.world · 27 days ago

ELI5 why the AI companies can’t just clone the git repos and do all the slicing and dicing (running git blame etc.) locally instead of running expensive queries on the projects’ servers?

@Retropunk64@lemmy.world · edit-2 26 days ago

deleted by creator

@green@feddit.nl · 27 days ago

Too many people overestimate the actual capabilities of these companies.

I really do not like saying this because it lacks a lot of nuance, but 90% of programmers are not skilled in their profession. This is not to say they are stupid (though they likely are, see cat-v/harmful) but they do not care about efficiency nor gracefulness - as long as the job gets done.

You assume they are using source control (which is unironically unlikely), you assume they know that they can run a server locally (which I pray they do), and you assume their deadlines allow them to think about actual solutions to problems (which they probably don’t)

Yes, they get paid a lot of money. But this does not say much about skill in an age of apathy and lawlessness

@turmacar@lemmy.world · 27 days ago

Also, everyone’s solution to a problem is stupid if they’re only given 5 minutes to work on it.

Combine that with it being “free” for them to query the website and expensive to have enough local storage to replicate, even temporarily, all the stuff they want to scrape and it’s kind of a no brainier to ‘just not do that’. The only thing stopping them is morals / whether they want to keep paying rent.

@zovits@lemmy.world · 27 days ago

Takes more effort and results in a static snapshot without being able to track the evolution of the project. (disclaimer: I don’t work with ai, but I’d bet this is the reason and also I don’t intend to defend those scraping twatwaffles in any way, but to offer a possible explanation)

@Sturgist@lemmy.ca · 27 days ago

Also having your victim host the costs is an added benefit

Realitätsverlust · 27 days ago

Because that would cost you money, so just “abusing” someone else’s infrastructure is much cheaper.

@Fijxu@programming.dev · 27 days ago

AI scrapping is so cancerous. I host a public RedLib instance (redlib.nadeko.net) and due to BingBot and Amazon bots, my instance was always rate limited because the amount of requests they do is insane. What makes me more angry, is that this fucking fuck fuckers use free, privacy respecting services to be able to access Reddit and scrape . THEY CAN’T BE SO GREEDY. Hopefully, blocking their user-agent works fine ;)

@green@feddit.nl · 27 days ago

Thanks for hosting your instances. I use them often and they’re really well maintained

@grysbok@lemmy.sdf.org · 27 days ago

It’s also a huge problem for library/archive/museum websites. We try so hard to make data available to everyone, then some rude bots come along and bring the site down. Adding more resources just uses more resources–the bots expand to fill the container.