• danA
    10 months ago

    What about archive sites like web.archive.org and archive.today? Both still work fine for Reddit posts, and neither is blocked in www.reddit.com/robots.txt, so Reddit hasn’t shown any intent to block them so far.
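
    For anyone who wants to check this kind of thing themselves, here is a rough sketch using Python's standard `urllib.robotparser`. The robots.txt text, the `ExampleBlockedBot` name, the example.com URL, and the `is_allowed` helper are all made up for illustration; this is not Reddit's actual file, so substitute the real contents before drawing any conclusions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only; NOT Reddit's real file.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBlockedBot
Disallow: /

User-agent: *
Disallow: /login
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler with no dedicated entry falls under the "*" rules.
print(is_allowed(SAMPLE_ROBOTS_TXT, "ia_archiver", "https://www.example.com/r/some/post"))
# A crawler that is explicitly disallowed from "/" is blocked everywhere.
print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBlockedBot", "https://www.example.com/r/some/post"))
```

    Keep in mind robots.txt is only a request to crawlers, so an absent rule suggests a lack of stated intent to block rather than proving anything about what the site actually allows.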

    • AnyOldName3@lemmy.world
      10 months ago

      Yeah, the Wayback Machine doesn’t use Reddit’s API, but on the other hand, I’m pretty sure they don’t automatically archive literally everything that makes it onto Reddit - doing that would require the API to tell you about every new post, as just sorting /r/all by new and collecting every link misses stuff.

      • danA
        10 months ago

        You don’t need every post, just a collection big enough to train an AI on. I imagine it’s a lot easier to get data from the Internet Archive (whose entire mission is historical preservation) than from Reddit.

        The thing I’m not sure about is licensing, but it seems like that’d be the case for the whole AI industry at the moment.