• Ekky@sopuli.xyz
    5 months ago

    The claim above was off the top of my head, but I’ve found multiple pages of results describing the panic that ensued.

    Now, Microsoft (Copilot and GitHub) is less than clear on what exactly is used for training, but the general consensus seems to be that they don’t train on private repositories. There does appear to be some confusion about this, though, especially regarding Microsoft’s honesty about not using loopholes (this article might be faked, and I haven’t tried confirming it; the topic is a shit show rife with miscommunication, misinformation, and quite a lot of confusion and fear regardless).

    It appears that the specific issue I was referring to required human error for Copilot to be able to train on private repositories: namely, some unfortunate fool temporarily making the repository public (in which case it obviously isn’t private anymore, and is therefore up for grabs by scrapers). Usually this wouldn’t be a problem, since no indexer or scraper can check all of GitHub at once, all the time, so the chance of a briefly exposed repository being cached is rather small, albeit never zero.

    That said, Copilot, Bing, and GitHub are likely integrated more tightly than that, so Bing probably doesn’t need to waste resources on continuously scraping GitHub for new repositories. I personally imagine that GitHub saving resources by sending a signal to Bing when a repository is made public isn’t entirely unlikely (that’s something I might do, harboring no ill intentions), meaning that it is possible (though in no way confirmed) that Bing punishes briefly exposed GitHub repositories by caching them instantly and forever.

    Is this 100% Microsoft being predatory? No, obviously not, since it requires a user error to happen in the first place, and since Copilot is technically only trained on public or exposed data. Still, Microsoft learning about this rather scammy behavior, simply classifying it as “low-impact-severity”, and disabling the Bing cache for humans (but apparently not for Copilot) doesn’t sit right with me. I’m sure they knew exactly which kind of data they were working with during dataset sanitization, so they could have chosen not to use sensitive data, or at least informed exposed clients that their cached secrets were being added to Copilot.