

Assuming OpenAI etc. only use data from the public domain is stupid (and contrary to most news reporting on the matter). He has literally no idea what the AI has trained on (not even the developers know, because there’s simply too much of it to be reviewed by humans). They’ve undoubtedly bought vast amounts of data that isn’t readily searchable by public engines.
He sounds very ill-informed about data collection, and probably just had his info/data on a cloud service somewhere whose text became part of the trillions of terabytes LLMs have accessed and trained on.
Large language model companies weren’t even aware their training data (which is so vast they themselves have no idea what’s in it) contained other languages, so the models suddenly “knew” how to speak them. The above story feels like those headlines: “Large Language Models are super intelligent! They’ve taught themselves French!” No, mass surveillance and corporations being above the law taught them everything they know.