The fediverse is discussing whether we should defederate from Meta’s new Threads app. Here’s why I probably won’t (for now).

(Federation between Plume and my Lemmy instance doesn’t work correctly at the moment, otherwise I would have made this a proper crosspost.)

  • dfyx@lemmy.helios42.de (OP)
    1 year ago

    I’ll tell you a secret: they care enough to scrape everything. Not only the fediverse, but every single website that’s accessible. And that’s not a thing of the future; it has been a reality at least since Google became popular. Do yourself a favor and look into the server logs of an average web host and you will find a whole bunch of crawlers. Some are for search engines, some are for other purposes.
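
For anyone who wants to check their own server logs, here is a rough sketch of how that could look. It assumes the access log uses the common "combined" format, where the user agent is the last quoted field on each line. The `BOT_HINTS` list, the `count_crawlers` helper, and the sample log lines are all made up for illustration; real crawlers use many more names, and some don't identify themselves at all.

```python
import re
from collections import Counter

# Hypothetical substrings for spotting self-identified crawlers.
# This is an illustrative list, not a complete one.
BOT_HINTS = ("bot", "crawl", "spider", "slurp")

# Assumption: "combined" log format, user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def count_crawlers(lines):
    """Count requests per user agent for agents that look like crawlers."""
    counts = Counter()
    for line in lines:
        match = UA_PATTERN.search(line)
        if not match:
            continue  # line doesn't end in a quoted field, skip it
        agent = match.group(1)
        if any(hint in agent.lower() for hint in BOT_HINTS):
            counts[agent] += 1
    return counts


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BINGBOT = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
FIREFOX = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

# Made-up sample lines (documentation IP range), not real traffic.
SAMPLE_LOG = [
    f'203.0.113.5 - - [10/Jul/2023:12:00:00 +0000] "GET /post/1 HTTP/1.1" 200 512 "-" "{GOOGLEBOT}"',
    f'203.0.113.9 - - [10/Jul/2023:12:00:03 +0000] "GET /post/1 HTTP/1.1" 200 512 "-" "{FIREFOX}"',
    f'203.0.113.7 - - [10/Jul/2023:12:00:07 +0000] "GET /post/2 HTTP/1.1" 200 734 "-" "{BINGBOT}"',
    f'203.0.113.5 - - [10/Jul/2023:12:00:11 +0000] "GET /post/2 HTTP/1.1" 200 734 "-" "{GOOGLEBOT}"',
]

if __name__ == "__main__":
    for agent, hits in count_crawlers(SAMPLE_LOG).most_common():
        print(hits, agent)
```

To run it against a real log, you would pass an open file instead of `SAMPLE_LOG`, for example `count_crawlers(open(path))` with `path` pointing at wherever your web server writes its access log. Matching on the user agent only catches crawlers that announce themselves, so treat the result as a lower bound.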

    I wrote my M. Sc. thesis on specialized crawlers (back in 2015) and you wouldn’t believe how much research has gone into that and how effective modern crawlers are at finding every single thing that ever got uploaded to the net. The only thing needed is enough hardware to throw at the problem and that’s exactly what Meta, Google, Microsoft, Amazon and all the others have. As a rule of thumb, if archive.org or your favorite search engine has indexed it, everyone else has it as well or has access to someone they can buy it from. There is no such thing as unscraped content on the internet (unless you lock it behind access restrictions and those would apply just the same to federation).

    Edit: I don’t have access logs enabled on my instance and obviously can’t see what happens on other instances but I would bet that this very thread will be picked up by at least five different crawlers before the day is over.

    • Leraje@lemmy.world
      1 year ago

      Yeah, I know. My own access logs on all the VS I have control over are disabled. I still feel something, even if that something is purely symbolic, is better than nothing.