Technology

AI crawlers cause Wikimedia (the umbrella organization of Wikipedia and a dozen or so other crowdsourced knowledge projects) Commons bandwidth demands to surge 50%. (diff.wikimedia.org)

submitted 2 days ago* (last edited 2 days ago) by [email protected] to c/[email protected]

[–] [email protected] 28 points 1 day ago (4 children)

Doesn't make any sense. Why would you crawl Wikipedia when you can just download a dump as a torrent?

[–] [email protected] 23 points 1 day ago

AI bros aren't that smart.

[–] [email protected] 4 points 1 day ago* (last edited 1 day ago)

Apparently the dump doesn't include media, though there's ongoing discussion within Wikimedia about changing that. It also seems likely to me that AI scrapers don't care about externalizing costs onto others if it might mean a competitive advantage (e.g. most recent data, not having to spend time and resources developing dedicated ingestion systems for specific sites).

I want to stress this: it's not that "tech bros" are just stupid (even though a lot of them are revoltingly unappreciative of the giants whose shoulders they stand on); it's that they don't care.

[–] [email protected] 2 points 1 day ago (1 children)

To have the most recent data?

[–] [email protected] 3 points 1 day ago

Just having the most recent data within a reasonable time frame is one thing. AI companies are like "I must have every single article within 5 minutes of it being updated, or I'll throw my pacifier out of the pram". No regard for the considerations of the source sites.

[–] [email protected] 1 points 1 day ago

There's a chance this isn't being done by someone who only wants Wikipedia's data. As the number of websites you scrape increases, the appeal of using each site's easy tools loses out to building one general tool that can handle most webpages.