The Unbelievable Scale of AI’s Pirated-Books Problem
(www.theatlantic.com)
If humans have to pay for knowledge with expensive student loans and book purchases, why should AI get that same knowledge for free?
As it so happens…
https://www.edweek.org/policy-politics/trump-admin-cuts-library-funding-what-it-means-for-students/2025/03
He did dismantle library funding.
Because if AI has to pay, you kill the open-source scene and hand a fat monopoly to the handful of companies that can afford the data. Not to mention that the data is owned by a few publishing houses, and none of the writers are getting a dime.
Yes, it's silly that students pay so much, but we should be arguing for weaker copyright so we can have both fair prices in education and a vibrant open-source scene.
Most people argue for strengthening copyright, which only helps data brokers and the big AI players. If you want subscription services and censorship while keeping all the drawbacks of AI, this is how you get them.
FOSS infrastructure is under attack by AI companies
The entire open-source scene grew out of that exact system before LLMs even existed. What are you talking about?
Also, just because somebody has the right to make their code open source doesn't mean that everyone should be forced to do the same. If you decide to make a living by writing books under a permissive license, you should be able to do that. This is a free world. Nobody is forcing open-source developers to make their code proprietary, but people like you feel morally entitled to force the opposite on others.
AI has always been able to train on copyrighted data because it's considered transformative.
If this changes, then given the huge amount of data needed for competitive generative AI, open-source AI cannot afford the data and dies. Strengthening copyright would force everyone out of the game except Meta, Google, and Microsoft.
The system that open source AI grew out of is exactly what is being attacked.
The article explains the problems in great detail.
Here’s just one small section of the text which describes some of them:
(I originally put this as a top-level comment, my bad.)
The claims aren’t entirely hypothetical; LLM AI has mapped nearly every known protein, something humans were unable to do…
https://www.nature.com/articles/d41586-022-02083-2
That’s an interesting article, but it was published in 2022, before LLMs were on most people’s radar. The results are still incredibly impressive, without a doubt, but based on how the researchers describe it, it appears to have been accomplished using deep learning, which isn’t the same as LLMs, though the two aren’t entirely unrelated.
Opaque and confusing terminology in this space also makes it very difficult to determine which people, systems, or technologies are actually making these advancements. As far as I’m concerned, none of this is actual AI, just very powerful algorithmic prediction models. So claims that an AI system has itself made unique technological advancements, when these systems are incapable of independent creativity, prove to me that nearly all their touted benefits are still entirely hypothetical right now.
These authors (and my work is in there) did not write so that Mark Zuckerberg could steal our work and profit from it.
Those authors aren't in the equation anymore. They gave their work to the publishing houses and won't be consulted about what it gets used for.