• Mubelotix
    2 days ago

    Doesn’t make any sense. Why would you crawl wikipedia when you can just download a dump as a torrent ?
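As context for the dump option mentioned above: Wikimedia publishes database dumps with predictable filenames under dumps.wikimedia.org. A minimal sketch, assuming the standard `<wiki>/<YYYYMMDD>/<wiki>-<YYYYMMDD>-pages-articles.xml.bz2` layout (the helper name and example date are hypothetical):

```python
def dump_url(wiki: str, date: str) -> str:
    """Build the download URL for a wiki's pages-articles dump,
    following the standard dumps.wikimedia.org directory layout."""
    filename = f"{wiki}-{date}-pages-articles.xml.bz2"
    return f"https://dumps.wikimedia.org/{wiki}/{date}/{filename}"

# Example: the English Wikipedia dump for a given (hypothetical) run date.
print(dump_url("enwiki", "20240601"))
```

Torrent files for some dumps are also listed on Wikimedia's "Data dump torrents" meta page, which spreads bandwidth costs across peers instead of hammering Wikimedia's servers.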

    • @mke@programming.dev
      2 days ago

      Apparently the dump doesn’t include media, though there’s ongoing discussion within Wikimedia about changing that. It also seems likely to me that AI scrapers don’t care about externalizing costs onto others if it might mean a competitive advantage (e.g. having the most recent data, or not having to spend time and resources developing dedicated ingestion systems for specific sites).

      • Rose
        2 days ago

        Wanting the most recent data within a reasonable time frame is one thing. AI companies are like “I must have every single article within 5 minutes of it being updated, or I’ll throw my pacifier out of the pram.” No regard for the considerations of the source sites.