[–] [email protected] 121 points 2 days ago (2 children)

Laws should be passed in all countries requiring AI crawlers to request permission before crawling a target site. I have no pity for AI "thieves" that get their models poisoned. F...ing plague; as if the adware and spyware weren't enough...

[–] [email protected] 20 points 1 day ago (1 children)

I doubt the recent uptick in traffic is from "stealing data" for training; more likely it's from agents scraping pages for context, e.g. Edge Copilot, Google's AI search, SearchGPT, etc.

Poisoning the data will likely not help in this situation, since there's a human on the other side who will just run the same search again when the results are unsatisfactory. Much as retries and timeouts can cause huge outages for web-scale companies, poisoning search results will likely cause this type of traffic to increase, further raising the chances of DoS and driving bandwidth usage higher.
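To put made-up numbers on that amplification (a minimal sketch; both rates are assumptions, not measurements):

```python
# Hypothetical figures: if a poisoned response sends the human (or agent)
# back for another try, each query fans out into extra requests instead of
# stopping the traffic.
poison_rate = 0.5   # fraction of responses bad enough to trigger a retry (assumed)
max_retries = 3     # how many times a user/agent re-issues the search (assumed)

# Expected requests per query: 1 + p + p^2 + ... + p^max_retries,
# since retry i only happens if the previous i responses were all bad.
expected = sum(poison_rate ** i for i in range(max_retries + 1))
print(f"requests per query: {expected:.2f}")  # ~1.88x baseline traffic
```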

[–] [email protected] 7 points 1 day ago (1 children)

So? Break context scrapers until they give up, whether on your site or entirely.

[–] [email protected] 2 points 1 day ago
[–] [email protected] 21 points 2 days ago (3 children)

An HTTP request is a request. Servers are free to rate limit or deny access.
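A minimal nginx sketch of that, for instance (the zone name, size, and rates here are made up for illustration):

```nginx
# In the http {} block: track clients by source IP,
# allowing each one 10 requests/second.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    listen 80;
    location / {
        # Tolerate short bursts of up to 20 requests, then reject with 503.
        limit_req zone=perip burst=20 nodelay;
    }
}
```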

[–] [email protected] 1 points 23 hours ago (1 children)

Bots lie about who they are, ignore robots.txt, and come from a gazillion different IPs.
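For reference, this is all robots.txt amounts to: a plain-text request that compliant crawlers (OpenAI's GPTBot or Common Crawl's CCBot, for example) honor and that the bots in question simply ignore:

```
# A polite request, not access control: nothing enforces it.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```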

[–] [email protected] 1 points 23 hours ago

That's what DDoS protection is for.

[–] taladar 14 points 1 day ago

Rate limiting in itself requires resources that are not always available. For one thing, you can only rate limit clients you can identify, so you need to keep data about past requests in memory and attach counters to it; and even then, that won't help if the requests come from IPs that are easily changed.
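A minimal sketch of that bookkeeping (window and limit values are made up):

```python
import time
from collections import defaultdict

# Sliding-window counter keyed by whatever identifies a client (usually the
# source IP). The per-client timestamp lists are exactly the memory cost
# described above, and a crawler that rotates IPs gets a fresh counter
# every time.
WINDOW_SECONDS = 60    # assumed window
LIMIT = 100            # assumed requests allowed per window per client

hits: defaultdict[str, list[float]] = defaultdict(list)

def allow(client_id: str) -> bool:
    now = time.monotonic()
    # Drop timestamps that fell out of the window; the rest stay in memory.
    hits[client_id] = [t for t in hits[client_id] if now - t < WINDOW_SECONDS]
    if len(hits[client_id]) >= LIMIT:
        return False
    hits[client_id].append(now)
    return True

# Usage: call allow(...) with the client's IP before serving each request.
```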

[–] [email protected] 19 points 2 days ago (1 children)

And Wikimedia, in particular, is all about publishing data under open licenses. They want the data to be downloaded and used by others. That's what it's for.

[–] [email protected] 6 points 1 day ago (1 children)

Even so, I think it would be totally reasonable for them to block web scrapers, as they provide better ways to download all their data.
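For example, a sketch of pulling a full dump instead of scraping article pages (the exact file names vary by date; current listings live at dumps.wikimedia.org):

```python
import urllib.request

# Wikimedia publishes full database dumps, so nothing needs to be scraped
# page by page. This URL pattern is illustrative, not guaranteed stable.
url = ("https://dumps.wikimedia.org/enwiki/latest/"
       "enwiki-latest-pages-articles.xml.bz2")
urllib.request.urlretrieve(url, "enwiki-latest-pages-articles.xml.bz2")
```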

[–] [email protected] 9 points 1 day ago

At the root of this comment chain is a proposal to have laws passed about this.

People can set up their web servers however they like; they're their servers, and it's on them to configure them. I don't think there should be legislation about whether you're allowed to issue perfectly ordinary HTTP requests to a public server. Let the server decide how to respond to them.