• Probius@sopuli.xyz · 1 month ago

    This type of large-scale crawling should be considered a DDoS and the people behind it should be charged with cyber crimes and sent to prison.

    • eah@programming.dev · 30 days ago

      Applying the Computer Fraud and Abuse Act to corporations? Sign me up! Hey, they’re also people, aren’t they?

  • Blackmist@feddit.uk · 30 days ago

    Business idea: AWS, but hosted entirely within the computing power of AI web crawlers.

  • chicken@lemmy.dbzer0.com · 1 month ago

    Seems like such a massive waste of bandwidth since it’s the same work being repeated by many different actors to piece together the same dataset bit by bit.

  • 0_o7@lemmy.dbzer0.com · 30 days ago

    I blocked almost all the big hosting players, plus China, Russia, and Vietnam, and now they’re bombarding my site with residential IP addresses from all over the world. They must be using compromised smart home devices or phones with malware.
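    The kind of range-based blocking I mean looks roughly like this (a simplified sketch; the prefixes are placeholders, real lists come from published hosting-ASN / country data):

    ```python
    # Sketch of range-based request filtering (illustrative only; real
    # deployments pull hosting-ASN / country prefixes from published lists).
    import ipaddress

    # Placeholder prefixes standing in for hosting-provider / country ranges.
    BLOCKED_NETWORKS = [
        ipaddress.ip_network("203.0.113.0/24"),   # e.g. a hosting provider's range
        ipaddress.ip_network("198.51.100.0/24"),  # e.g. a country-level allocation
    ]

    def is_blocked(client_ip: str) -> bool:
        """Return True if the client address falls inside any blocked prefix."""
        addr = ipaddress.ip_address(client_ip)
        return any(addr in net for net in BLOCKED_NETWORKS)

    # Residential proxies defeat this: every request arrives from a different
    # consumer ISP address, so prefix lists no longer identify the crawler.
    ```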

    Soon everything on the internet will be behind a wall.

      • aev_software@programming.dev · 30 days ago

        In the meantime, sites are getting DDoSed by scrapers. One way to stop your site from getting scraped is to make it inaccessible… which is exactly what the scrapers are causing.

        Normally I would assume DDoSing is done to take a site offline, but AI scrapers need the opposite: their targets online and willing. One would think they’d be a bit more careful about the damage they cause.

        But they aren’t, because capitalism.

        • Natanael@infosec.pub · 30 days ago

          If they had the slightest bit of survival instinct they’d share an archive.org / Google-style scraper and web-cache infrastructure, pull from those caches, and everything would just be scraped once and re-scraped only occasionally.
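          Even something as simple as a fetch-through cache with conditional revalidation would go a long way; a rough sketch (the in-memory dict just stands in for whatever shared cache they’d pool, and the numbers are made up):

          ```python
          # Rough sketch of a cache-first fetcher: hit the shared cache, and only
          # revalidate against the origin occasionally, using conditional requests.
          import time
          import urllib.error
          import urllib.request

          CACHE = {}           # url -> {"body": bytes, "etag": str, "fetched": float}
          MIN_AGE = 7 * 86400  # don't even revalidate more than once a week

          def fetch(url: str) -> bytes:
              entry = CACHE.get(url)
              if entry and time.time() - entry["fetched"] < MIN_AGE:
                  return entry["body"]  # serve from the shared cache, no origin traffic

              request = urllib.request.Request(url)
              if entry and entry.get("etag"):
                  request.add_header("If-None-Match", entry["etag"])  # conditional re-fetch
              try:
                  with urllib.request.urlopen(request) as resp:
                      body = resp.read()
                      CACHE[url] = {"body": body,
                                    "etag": resp.headers.get("ETag", ""),
                                    "fetched": time.time()}
                      return body
              except urllib.error.HTTPError as err:
                  if err.code == 304 and entry:  # not modified: reuse the cached copy
                      entry["fetched"] = time.time()
                      return entry["body"]
                  raise
          ```

          One shared pool like that and each page gets fetched once, instead of once per company.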

          Instead they’re building maximally dumb (as in literally counterproductive and self-harming) scrapers that don’t know what they’re interacting with.

          At what point will people start to track down and sabotage AI datacenters IRL?

  • rozodru@lemmy.world · 30 days ago

    I run my own Gitea instance on my own server, and within the past week or so I’ve noticed it getting absolutely nailed. One repo in particular, a Wayland WM I built, just keeps getting hammered over and over by IPs in China.

  • MonkderVierte@lemmy.zip · edited · 30 days ago

    I just thought that a client-side proof-of-work (or even just a delay) bound to the IP might push the AI companies to behave instead, because single-visit-per-IP crawlers get too expensive and too slow, and normal abusive crawlers can simply be blocked. But they already have mind-blowing computing and money resources and only want your data.

    But what if there were a simple-to-use, integrated solution and every single webpage used this approach?
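    Roughly the kind of thing I have in mind, as a toy sketch (the difficulty, the secret handling and the token format are all just illustrative, not a production design):

    ```python
    # Toy sketch of an IP-bound proof-of-work gate.
    import hashlib
    import hmac
    import secrets

    SERVER_SECRET = secrets.token_bytes(32)  # would be persistent/shared in practice
    DIFFICULTY_BITS = 20                     # ~1 million hash attempts on average

    def make_challenge(client_ip: str) -> str:
        """Server side: issue a challenge tied to this IP so it can't be reused elsewhere."""
        nonce = secrets.token_hex(8)
        tag = hmac.new(SERVER_SECRET, f"{client_ip}:{nonce}".encode(), hashlib.sha256).hexdigest()
        return f"{nonce}:{tag}"

    def leading_zero_bits(digest: bytes) -> int:
        """Count how many leading zero bits the hash has."""
        bits = 0
        for byte in digest:
            if byte == 0:
                bits += 8
                continue
            bits += 8 - byte.bit_length()
            break
        return bits

    def solve(challenge: str) -> int:
        """Client side: burn CPU until a hash clears the difficulty bar."""
        counter = 0
        while True:
            digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
            if leading_zero_bits(digest) >= DIFFICULTY_BITS:
                return counter
            counter += 1

    def verify(client_ip: str, challenge: str, counter: int) -> bool:
        """Server side: check the challenge belongs to this IP and was actually solved."""
        nonce, tag = challenge.split(":")
        expected = hmac.new(SERVER_SECRET, f"{client_ip}:{nonce}".encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(tag, expected):
            return False
        digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
        return leading_zero_bits(digest) >= DIFFICULTY_BITS
    ```

    For one visitor that’s a fraction of a second of CPU; for a crawler hammering millions of pages from one address it adds up, which is the whole bet.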

    • daniskarma@lemmy.dbzer0.com · 30 days ago

      The solution was invented long ago: it’s called a captcha.

      A little bother for legitimate users, but a good captcha is still hard to bypass even using AI.

      And from the end user’s standpoint, I’d rather lose 5 seconds to a captcha than have my browser run an unsolicited, heavy crypto challenge on my end.