Leaked Documents Show Nvidia Scraping ‘A Human Lifetime’ of Videos Per Day to Train AI (www.404media.co)

submitted 2 years ago by misk@sopuli.xyz to c/technology@lemmy.world

https://archive.is/2024.08.05-162750/https://www.404media.co/nvidia-ai-scraping-foundational-model-cosmos-project/

[-] Sconrad122@lemmy.world 12 points 2 years ago

Nvidia does not have a strong history of open sourcing things, to say the least. That last bit sounds like pure hopium

[-] Grimy@lemmy.world 0 points 2 years ago* (last edited 2 years ago)

Their Nemotron 340B model was released under what is essentially an open source licence (available for commercial use except if you are doing shady things like spamming and collecting biometric data).

Having a robust open source ecosystem directly benefits Nvidia since they sell more higher end consumer GPUs.

Obviously, there's a real chance that this isn't open sourced since it's a video model and there's huge money involved. Doesn't really change the fact that having YouTube and Netflix dictate who gets to make video models and at what cost isn't a good idea.

[-] trollbearpig@lemmy.world -1 points 2 years ago

The guy you are replying to is in all AI posts defending AIs. He is probably heavily invested in this BS or being paid for it, don't waste your time with him.

[-] Grimy@lemmy.world 2 points 2 years ago* (last edited 2 years ago)

Tbh, someone has to. Have you ever asked yourself whether the intense hate AI gets, and the fact that 99% of articles are against it, is organic?

There's a handful of companies that are poised to win big if they can put up a fence around AI while making sure the public can't run strong models. There is an intense media campaign to make sure the public thinks either AI is dangerous (so they can be the only ones legally allowed to distribute them) or that AI is theft (So they can be the only ones to afford building them).

Do not let yourself be manipulated, almost all strengthening of copyrights related to AI is completely against our interests.

And no, I'm not getting paid lol. I have a vested interest because I use generative technology for work and for fun in my free time. I'm also interested in not handing out our whole economy on a silver platter to Google and Microsoft, if I can maybe help with a couple of comments a week, I will. Why don't you explain why I'm wrong instead of sending out baseless accusations?

[-] trollbearpig@lemmy.world 0 points 2 years ago* (last edited 2 years ago)

Nah my man, you are either brainwashed or are being paid hahaha. Is copyright a mess? Of fucking course, I haven't met a single person (except crazy ass libertarians funnily enough hahaha) that likes copyright. Are big corporations using copyright to exploit artists, create monopolies, and generally being dicks? Again, of fucking course.

But anyone, like you, saying that we should just let AIs destroy copyright effectively is a fucking prick, that simple. And your arguments are disingenuous at best or outright lies. For example, just as big copyright holder companies are pushing to strengthen copyright law, the big tech companies are pushing for effectively destroying copyright through AI models. I have seen you pushing in multiple threads for open source models like that's a solution. But if you were a serious person researching the open source software community you would see that pretty much no one there agrees with your position because it would effectively destroy the copyleft open source licenses. After all, if an "AI" model, open source or not, is allowed to just "train" on my AGPL code and spit it back (with minor modifications at best) to an engineer in AWS that's it for my project. Amazon will do the Amazon thing and steal the project. So say goodbye to any software freedom we have.

And let's be 100% clear here, this is not being pushed by the evil copyright holders like you seem to imply (and they are totally evil just to be clear hahahah). This is being pushed by the big tech companies and people like you spreading their propaganda. The fact that the copyright holders happen to be in the right this time is just a broken clock being right and all that, but it's still good that they are pushing back against big tech. I do agree we have to keep an eye on them, the objective here can't be to make copyright bigger, just to close the "loophole" that big tech companies are exploiting to steal everything.

People like you who want to destroy copyright without offering any alternatives to allow creatives to work in a market are either misinformed or just assholes. Again, of fucking course it's not an ideal system, but going full kamikaze and just destroying any possibility for artists and creatives of making a living with their work is the most evil thing going on right now, so bad that the big copyright holders happen to fall on the less bad side this time hahaha. And all for what? So people can be lied to by dumb chatbots? Or so people can create mediocre derivative "art" without putting in any effort? Or so we can get mediocre code autocomplete that is subtly wrong all the time? It's fucking ridiculous.

[-] 31337@sh.itjust.works 2 points 2 years ago

After all, if an “AI” model, open source or not, is allowed to just “train” on my AGPL code and spit it back (with minor modifications at best) to an engineer in AWS that’s it for my project. Amazon will do the Amazon thing and steal the project. So say goodbye to any software freedom we have.

An engineer at AWS can already just copy your code, make minor modifications, and use it. I would think the same legal recourse would apply if it was outputted from an LLM or just a copy-paste? This seems like a tangential issue to whether the LLM was trained on your code or not (not training on your code obviously reduces the probability of the LLM spitting it back out near-verbatim though). Personally, I don't see anything wrong with anyone using public code to build statistical models. And I think the pay-to-scrape models that Reddit, Xitter, and others are employing will help big tech build the "moat" they're looking for. Big tech is asking for AI regulation for similar reasons.

[-] trollbearpig@lemmy.world -1 points 2 years ago

An engineer at AWS can already just copy your code, make minor modifications, and use it.

You are 100% wrong here my man. If an engineer does this they are creating a derivative work and they have to fulfill the conditions of the license of the code. No wonder you don't see anything wrong here, you AI people live in a fantasy world when it comes to how copyright works hahahaha. Please stop talking about shit you know nothing about.

[-] 31337@sh.itjust.works 1 points 2 years ago* (last edited 2 years ago)

I stated that they can do this, and asked if they could be sued if they used near-verbatim code generated from an LLM, just like they could be sued if they copy-pasted AGPL code.

Edit: Tools like Copilot tell you if your code is similar to publicly available code so you can avoid these issues.

Edit: Just looked up EFF's position and I tend to agree with it:

Artificial Intelligence and Copyright Law

Artists are understandably concerned about the possibility that automatic image generators will undercut the market for their work. However, much of what is criticized is already considered fair use under copyright law, even if done at scale. Efforts to change copyright law to transform certain fair uses into infringement carry serious implications, are likely to interfere with the innovative potential of AI tools, and ultimately do not benefit artists. In fact, the use of these tools could expand the capacity of artists to create expressive works. Policymakers should emphasize the importance of human labor and investment in what receives copyright protection to maintain wages and dignity. Artists should be protected from efforts by large corporations to both substitute their labor with AI tools and create a new, unnecessary copyright regime around AI-generated art.

Machine Learning is a Fair Use

The process of machine learning for generative AI art is like how humans learn—studying other works—it is just done at a massive scale. Huge swaths of data (images, videos, and other copyrighted works) are analyzed and broken into their factual elements where billions of images, for example, could be distilled into billions of bytes, sometimes as small as less than one byte of information per image. In many instances, the process cannot be reversed because too little information is kept to faithfully recreate a copy of the original work.

The analysis work underlying the creation and use of training sets is like the process to create search engines. Where the search engine process is fair use, it is very likely that processes for machine learning are too. While the act of analysis may potentially implicate copyright, when that act is a necessary step to enabling a non-infringing use, it regularly qualifies as fair use. If the intermediate step were not permitted, fair use would be ineffective. As such, when factual elements of copyrighted works are studied and processed to create training sets—which, once again, is how we humans learn and are inspired by themes and styles in art and other works—that is likely to be found a fair use.

https://www.eff.org/document/eff-two-pager-ai

[-] trollbearpig@lemmy.world -1 points 2 years ago

What point are you trying to make? That the fact that someone can break the law means we should not have laws? I honestly don't get what you are trying to say.

[-] 31337@sh.itjust.works 2 points 2 years ago

I'm saying using code for training is a different issue than copyright infringement. I edited my post above to better lay out my position.

[-] trollbearpig@lemmy.world -1 points 2 years ago* (last edited 2 years ago)

And that's the whole point of my comment, did you even read it? To summarize, there is currently a loophole in law that allows these bullshit arguments about it being different than straight up copying shit (though this hasn't been litigated yet, so it's not yet clear if these arguments are actually valid). This means that while a person reading my AGPL code and copying it (without following the license) is 100% illegal, doing the same through an LLM may be legal. So this means that open source licenses can be bypassed by first training an LLM with the code and then extracting the code from the LLM. This is terrible for open source, and in general for anyone who wants to make a living from creating copyrighted work. So we should close this loophole, and I'm glad there is a push to close this through better laws. Even if these laws are coming from Disney, Sony, and all those awful companies.

So again, what's the point you are trying to make here? That we shouldn't make these laws stronger to prevent this bullshit? I honestly don't understand what you are trying to argue here, nothing of what you have said has anything to do with this conversation.

[-] 31337@sh.itjust.works 2 points 2 years ago

That we already have laws that protect against copyright infringement (which seem like they would still apply whether or not it was spit out by an LLM), and no more should be made. That training on public data is fine.

[-] trollbearpig@lemmy.world -1 points 2 years ago

Any arguments to defend your position? I'm giving you a very clear example of the awful consequences of following that path. And the same applies to any creative work. You are just being dismissive without proposing any real solution. Do better man.

[-] 31337@sh.itjust.works 2 points 2 years ago

The EFF link I posted above provides evidence. Again, here's a quote from part of it:

The process of machine learning for generative AI art is like how humans learn—studying other works—it is just done at a massive scale. Huge swaths of data (images, videos, and other copyrighted works) are analyzed and broken into their factual elements where billions of images, for example, could be distilled into billions of bytes, sometimes as small as less than one byte of information per image. In many instances, the process cannot be reversed because too little information is kept to faithfully recreate a copy of the original work.

As I mentioned before, Copilot at least helps people avoid copyright infringement by notifying you if your code is similar to public code. The solution I'm proposing is no new laws, and just enforcing the ones we have. Most of the laws being proposed look like attempts at regulatory capture to me.

[-] Doomsider@lemmy.world 1 points 2 years ago

That is a long-winded way to say you are a copyright defender. Your insistence on finding an alternative to a broken system so rent seekers can continue to exist is naive to say the least.

I think most people with your stance (don't throw out the baby with the bathwater) really have no idea how broken copyright and intellectual property are.

AI companies have already proven copyright is DOA. It was never designed for the little guy. That is just propaganda you have fallen prey to.

Simply put, copyright was not needed for all of human history and it is still not needed. Pretending you have a unique idea, song, painting, etc. in a world of billions of humans is beyond ridiculous.

The concept was already broken from the start because everything in science and art is iterative. Giving monopoly power to rent seekers is the natural result of a broken concept.

this post was submitted on 05 Aug 2024

467 points (96.8% liked)
