• Natanael@slrpnk.net · 3 months ago

      Training from scratch and retraining are expensive. They also want to avoid training on ML outputs as samples; they want primarily human-made works, and since the initial public release of LLMs it has become harder to build large datasets without ML-generated content mixed in.

      • Scrubbles@poptalk.scrubbles.tech · 3 months ago (edited)

        There was a good paper that came out recently showing that training on ML-generated data leads to model collapse, where output coherence and diversity degrade over generations. It’s going to be really interesting; I don’t know if they’ll ever be able to train as easily again.
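
        A toy sketch of the mechanism (my own illustration, not the paper’s experiment; the Gaussian setup is an assumption): fit a distribution to a small sample, draw the next “dataset” from the fit, repeat. The fitted spread drifts toward zero, which is the collapse in miniature.

        ```python
        # Toy model-collapse sketch: each generation "trains" (fits a Gaussian)
        # on the previous generation's outputs. With finite samples the fitted
        # variance shrinks across generations, i.e. diversity collapses.
        import numpy as np

        rng = np.random.default_rng(0)
        n = 50                                    # small sample per generation
        data = rng.normal(0.0, 1.0, size=n)       # generation 0: "human" data

        for gen in range(1, 201):
            mu, sigma = data.mean(), data.std()   # fit the current "model"
            data = rng.normal(mu, sigma, size=n)  # next dataset = model outputs
            if gen % 50 == 0:
                print(f"generation {gen:3d}: fitted std = {sigma:.3f}")
        ```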

      • Iron Lynx@lemmy.world · 3 months ago

        I recall spotting a few reports of image generators having their training data contaminated with generated images, and the output becoming significantly worse. So yeah, I guess LLMs and image generators need natural sources, or they get more inbred than the Habsburgs.

      • TurtleJoe@lemmy.world · 3 months ago

        I think it’s telling that they acknowledge that the stuff their bots churn out is often such garbage that training their bots on it would ruin them.

    • can@sh.itjust.works · 3 months ago

      Hey, did you know your profile is set to appear as a bot? As a result, many users may be filtering your posts and comments. You can change this in your Lemmy settings.

      Unless you are a bot… In which case where did you get your data?

    • potustheplant@feddit.nl · 3 months ago

      Where do you get that from? At least ChatGPT isn’t limited to data from 2021; I haven’t looked into the other models.

    • GregorGizeh@lemmy.zip · 3 months ago

      Pretty sure that cutoff is often used because after that point AI-generated content started to appear much more frequently, and the training data would become corrupted.
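
      As a sketch of what enforcing such a cutoff might look like (the document fields and exact date here are made-up assumptions):

      ```python
      # Minimal sketch: keep only documents published before a chosen cutoff
      # so post-LLM (potentially AI-generated) text never enters the corpus.
      from datetime import datetime, timezone

      CUTOFF = datetime(2021, 9, 1, tzinfo=timezone.utc)  # hypothetical cutoff

      docs = [
          {"text": "pre-LLM article", "published": datetime(2020, 5, 1, tzinfo=timezone.utc)},
          {"text": "post-LLM article", "published": datetime(2023, 2, 1, tzinfo=timezone.utc)},
      ]

      kept = [d for d in docs if d["published"] < CUTOFF]
      print([d["text"] for d in kept])  # -> ['pre-LLM article']
      ```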

      • dislocate_expansion · 3 months ago

        For sure, this makes a lot of sense.

        Side note: it might be useful to use poison pills on our data here, since it could be scraped in the future without consent.
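
        Real text poisoning is still an open problem (tools like Nightshade target image models), but a low-tech canary is easy: append a unique random marker to what you post and later probe models for it. A minimal sketch, with made-up strings:

        ```python
        # Canary sketch (detection, not true poisoning): tag a post with a
        # unique random string; if a model later reproduces that string, the
        # post was very likely in its training data.
        import secrets

        def add_canary(text: str) -> tuple[str, str]:
            canary = f"canary-{secrets.token_hex(8)}"  # unguessable marker
            return f"{text}\n\n[{canary}]", canary

        post, canary = add_canary("Example comment text.")
        print(post)
        print("Later, prompt a model with the post and check for:", canary)
        ```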