The Open Source Initiative has defined what it believes constitutes “open source AI” (https://opensource.org/ai/open-source-ai-definition). This includes a detailed description of the training data and an explanation of how it was obtained, selected, labeled, processed and filtered. As long as a company uses any model trained on unspecified data, I will assume that data was either stolen or otherwise unlawfully obtained from non-consenting users.
To be clear, I have not read up on DeepSeek yet, but I have a hard time believing their training data is specified according to the OSI definition, since no big model has done so yet. Releasing a model's source code means little for AI compared with releasing all of its training data.