How do I poison PDFs against LLMs? (lemmy.world)

submitted 10 months ago by Maroon@lemmy.world to c/opensource@lemmy.ml

I came across tools like Nightshade that can poison images. That way, if someone steals an artist's work to train their AI, the model learns the wrong things and can potentially start spewing gibberish.

Is there something that I can use on PDFs? There are two scenarios for me:

1. Content that I have already created and that is available as a PDF.
2. New documents that I write in LaTeX, which I would like to poison from the source if possible, rather than in an ad hoc step once the PDF is created.

[-] underscores@lemmy.dbzer0.com 72 points 10 months ago

Scrapers extract text from documents in many of the same ways that accessibility tools do, so I'd generally recommend against doing this.

[-] AnUnusualRelic@lemmy.world 4 points 10 months ago

So a layer of transparent text wouldn't work?

[-] underscores@lemmy.dbzer0.com 9 points 10 months ago

I'm pretty sure most screen readers and stuff like copy/paste would also get whatever nonsense you filled it with.

[-] AnUnusualRelic@lemmy.world 2 points 10 months ago

It would be a side effect, most likely.

[-] toothbrush@lemmy.blahaj.zone 44 points 10 months ago* (last edited 10 months ago)

Nightshade doesn't actually work, btw. Denoising, a common technique, breaks Nightshade completely. It's also closed source, with no way to test whether it actually works against the big AIs. The person making Nightshade is really fishy too.

[-] slock@lemmy.world 9 points 10 months ago

Most poisoning techniques don't actually work that well. When I end up with a PDF, I usually strip out the existing text layer, apply a denoiser and a few other preprocessing steps to correct common errors, run a layout / reading-order detector, and finally OCR the individual blocks. That pipeline exists to deal with the most common, and one of the most effective, poisoning techniques, namely: someone printed a document, forgot about it for 3 years, then scanned it slightly tilted (and dirty, crumpled, ...), and the scanner decided to apply its crappy OCR.

Using screenshots of the PDF also avoids any kind of font-face poisoning and copy protection.
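
A minimal sketch of that kind of pipeline, under some assumptions: the page is rasterized and OCR'd, and the PDF's embedded text layer is never read. The libraries (pdf2image, Pillow, pytesseract) are assumed choices that the comment above does not name, both function names are hypothetical, and the layout / reading-order step is left out entirely.

```python
from typing import Callable, Iterable, List, TypeVar

Image = TypeVar("Image")


def text_from_page_images(
    pages: Iterable[Image],
    denoise: Callable[[Image], Image],
    ocr: Callable[[Image], str],
) -> List[str]:
    """Denoise and OCR each rendered page; any embedded text is ignored."""
    return [ocr(denoise(page)).strip() for page in pages]


def ocr_pdf(path: str, dpi: int = 300) -> List[str]:
    """Rasterize a PDF and OCR the page images (assumed library choices)."""
    # Third-party imports are kept local so the module loads without them.
    from pdf2image import convert_from_path  # requires poppler
    from PIL import ImageFilter
    import pytesseract  # requires the tesseract binary

    pages = convert_from_path(path, dpi=dpi)  # one PIL image per page
    return text_from_page_images(
        pages,
        # Grayscale plus a small median filter as a stand-in denoiser.
        denoise=lambda img: img.convert("L").filter(ImageFilter.MedianFilter(size=3)),
        ocr=pytesseract.image_to_string,
    )
```

Because only pixels go in, invisible text, remapped fonts and copy restrictions never reach the OCR step; whatever a human can read on the page is what comes out.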

If you really, really need to protect your PDF, please consider accessibility first. After that, what would work, imho, is to use the scripting features of PDF to render your content on the fly. That would probably mess up most of the "automatic" processes.

[-] Strayce@lemmy.sdf.org 36 points 10 months ago

Entire Bee Movie script in 0.1pt white on white in the header

[-] TimeSquirrel@kbin.melroy.org 10 points 10 months ago

"Why TF is this one-page document half a gigabyte?"

[-] DannyBoy@sh.itjust.works 15 points 10 months ago

Text is small! The Bee Movie script is 89.2 kB

[-] Markaos@discuss.tchncs.de 12 points 10 months ago* (last edited 10 months ago)

Obviously you need some redundancy in case the script gets corrupted. 5000 repetitions seems reasonable for such a high-quality work.
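
A quick check of the arithmetic, assuming 1 kB = 1000 bytes and that the text is stored uncompressed (a real PDF would usually compress its streams, so the file would come out smaller):

```python
SCRIPT_KB = 89.2      # script size quoted above
REPETITIONS = 5000    # redundancy suggested above

total_kb = SCRIPT_KB * REPETITIONS
total_mb = total_kb / 1000  # decimal megabytes

print(f"{total_mb:.0f} MB")  # prints "446 MB"
```

So 5000 copies land just under the half gigabyte mentioned earlier.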

[-] dgriffith@aussie.zone 5 points 10 months ago

"Oh, it's got an embedded TIFF of the actual content. That explains it."

Yes, I am quite old now.

[-] user224@lemmy.sdf.org 4 points 10 months ago

Would the Shrek script be compatible too?

[-] TheTechyHobbit@sh.itjust.works 11 points 10 months ago* (last edited 10 months ago)

The general principle of image poisoning is to change pixels in a way that our eyes can't notice but that screws up the labeling by LLMs.

You can probably try to apply the same principle: poison the PDF so that only humans can read it.

Thing is, I assume you distribute your content as PDFs to make it accessible to humans. That usually means having the text embedded for easy copy-paste and similar uses. Poisoning that text might end up being counterproductive for your objective.

All this to say: no, I don't know of any poisoning algorithm for PDFs.

[-] DragonsInARoom@lemmy.world 8 points 10 months ago

Put the word "stolen" at the end of every document. The LLM will learn that the word "stolen" is normal and should be included.

[-] lily33@lemm.ee 3 points 10 months ago

I don't think any kind of "poisoning" actually works. It's well known by now that data quality is more important than data quantity, so nobody just feeds training data in indiscriminately. At best it would hamper some FOSS AI researchers that don't have the resources to curate a dataset.
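
To make the curation point concrete, here is a minimal sketch of one cleaning step a pipeline might apply, assuming exact-duplicate removal after case and whitespace normalization. The function name is hypothetical, and real pipelines are more elaborate than this.

```python
import hashlib
from typing import Iterable, List


def drop_repeated_paragraphs(paragraphs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each paragraph, ignoring case and spacing."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        # Normalize so trivial variations hash to the same key.
        normalized = " ".join(paragraph.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(paragraph)
    return kept
```

Fed one real paragraph followed by 5000 copies of the same filler, this keeps exactly two paragraphs, so bulk repetition alone adds very little to a dataset.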

[-] Ledivin@lemmy.world 6 points 10 months ago

"At best it would hamper some FOSS AI researchers that don't have the resources to curate a dataset."

If you can't source a dataset, then you shouldn't be researching AI. It's the first and single most important step of the entire process.

[-] qjkxbmwvz@startrek.website 1 points 10 months ago

man rot13 ;)
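
For reference, ROT13 is available in Python's standard library through the codecs module, and applying it twice gives back the original text, so it is trivially reversible. A small sketch (the helper name and sample string are made up for illustration):

```python
import codecs


def rot13(text: str) -> str:
    """Rotate ASCII letters by 13 places; other characters are unchanged."""
    return codecs.encode(text, "rot_13")


scrambled = rot13("Hello, PDF scraper")
print(scrambled)
print(rot13(scrambled))  # prints "Hello, PDF scraper" again
```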

[-] kekmacska@lemmy.zip 1 points 10 months ago

If possible, do not make it available on the public internet, and don't let search engines access it.

[-] LiamTheBox@lemmy.ml 0 points 10 months ago

I tried to copy some text in a report once.

It came out as gibberish.

this post was submitted on 07 Feb 2025

49 points (83.6% liked)
