Do show me a published data set of the kind you're demanding.
Since you're definitely asking this in good faith and not just downvoting and making nonsense sealion requests in an attempt to make me shut up, sure! Here's three.
https://commoncrawl.org/
https://github.com/togethercomputer/RedPajama-Data
https://huggingface.co/datasets/legacy-datasets/wikipedia/tree/main/
Oh, and it's not me demanding. It's the OSI defining what an open source AI model is. I'm sure once you've asked all your questions you'll circle back around to whether you disagree with their definition or not.
Thank you for posting those links. While I'm not sure the person you replied to was asking in good faith, I wanted to see an example myself after reading the discussion.
Seems like even if it's not fully open source, it's a step in the right direction in a world where terms like "open" and "non-profit" have been co-opted by corporations until they've lost their original meaning.
It's certainly better than "Open"AI being completely closed and secretive with their models. But as people have discovered in the last 24 hours, DeepSeek is pretty strongly trained to be protective of the Chinese government policy on, uh, truth. If this were a truly Open Source model, someone could "fork" it and remake it without those limitations. That's the spirit of "Open Source", even if the actual term "source" is a bit misapplied here.
As it is, without the original training data, an attempt to remake the model would run into the same issues DeepSeek themselves had with their "zero" release, which would frequently respond in a gibberish mix of English, Mandarin and programming code. They had to supply specific data to stop it from doing this, and we don't have access to that data.