1447

Data contamination expert 👌 (jlai.lu)

submitted 1 year ago by ElCanut@jlai.lu to c/lemmyshitpost@lemmy.world

91 comments

[-] TropicalDingdong@lemmy.world 149 points 1 year ago

I used some tools to corrupt about 10 years of comments and posts of mine.

[-] mp04610@lemm.ee 71 points 1 year ago

While that's the correct thing to do in my opinion, it would be a mistake to assume that Reddit didn't store your original comments.

By corrupting their dataset, you may actually be helping them recognize maliciously edited comments.

[-] khannie@lemmy.world 35 points 1 year ago

it would be a mistake to assume that Reddit didn't store your original comments.

They were fairly specific about not doing that (I'd imagine largely because of GDPR).

I deleted 10 years of "content" before I left and checked their policies. They apparently actually do properly delete from their servers.

[-] Frozengyro@lemmy.world 21 points 1 year ago

I've got a bridge in the desert I'd like to sell you.

[-] joenforcer@midwest.social 9 points 1 year ago* (last edited 1 year ago)

GDPR is no joke. Storing a handful of comments is not worth the penalty if they get caught.

Note that I speak from experience as part of a company that needs to comply with the regulations. We do it because the risk of violation is 10000000% not worth it no matter how annoying and arduous it is to comply.

[-] ItsAFake@lemmus.org 9 points 1 year ago

But the GDPR only covers European users tho.

[-] khannie@lemmy.world 11 points 1 year ago

That's true, but it's far easier to implement globally than to try to segment. It's very difficult to accurately prove a user isn't an EU resident across an entire userbase.

[-] TropicalDingdong@lemmy.world 18 points 1 year ago

Yeah, I mean I knew that when I was doing it.

Sometimes all you can do is make a symbolic gesture that really does nothing, and even if it does nothing, you should still do it.

Leaving and supporting Lemmy by paying for some developer fees (I'm on the Patreon), posting, and commenting is probably 100x more damaging to Reddit.

[-] FeelThePower@lemmy.dbzer0.com 10 points 1 year ago

FWIW, I requested an old Reddit account's data the other day under CCPA and all the contamination was in there. My guess is their backend updates every so often. I guess I made a good call to edit my comments and leave them there to simmer before I deleted them along with the account. Perhaps this is the way?

[-] ElCanut@jlai.lu 60 points 1 year ago

Can't post a genius idea like this one without posting the links of the tools

[-] TropicalDingdong@lemmy.world 39 points 1 year ago

It's not my idea, but I could probably dig up the tool I used. Dollars to donuts, it doesn't work any more.

This might have been the tool I used. I don't think so, because I overwrote everything with one message, but google around and you'll find similar.

https://github.com/adriantache/YARCO

[-] Maalus@lemmy.world 14 points 1 year ago

If you overwrote with a single message, then your messages are back to what they were.

[-] KnightontheSun@lemmy.world 9 points 1 year ago

Not necessarily true. I overwrote several thousand comments with a different tool and used three different quotes on greed. I have periodically checked and about two dozen came back. I just manually changed them at that point.

[-] RecallMadness@lemmy.nz 8 points 1 year ago* (last edited 1 year ago)

This would be better if it fed the parent comment into ChatGPT prefixed with “create a plausible but factually incorrect aggressive response to ”

Feed the machine to the machine!

[-] Sabin10@lemmy.world 8 points 1 year ago

A tool like that would almost definitely require api access to function. If that was still possible, most of us wouldn't be here having this conversation.

[-] TropicalDingdong@lemmy.world 14 points 1 year ago

A tool like that would almost definitely require api access to function. If that was still possible, most of us wouldn’t be here having this conversation.

No it didn't use the API. You had to run it in browser and be logged in to reddit.

[-] Ragnarok314159@sopuli.xyz 21 points 1 year ago

I think Reddit caught on to this. I tried destroying my comment history (~7 years with 600k karma) with a few of the available tools on GitHub.

Found my account permabanned the next time I tried to log in. People should attempt to eliminate/poison as much as possible, but Reddit has all the comments and modifications in a database somewhere to sell it all to whatever AI is the highest bidder.

They have to do something to make money after taking away awards. The advertising is absolute shit and not worth the $100 entry fee.

[-] VaultBoyNewVegas@lemmy.world 13 points 1 year ago

I edited mine via a tool to say fuck Reddit and Steve Huffman is a greedy pig boy.

[-] Octopus1348@lemy.lol 10 points 1 year ago

What do you mean by corrupt?

[-] PlasmaDistortion@lemm.ee 25 points 1 year ago

I used a tool that edited my comments to replace them with gibberish. Supposedly Reddit still retains deleted comments, but if you edit them, it only keeps the latest version. So by editing them you make the comments worthless.

[-] Octopus1348@lemy.lol 17 points 1 year ago

I also edited my comments to be basically a Lemmy ad and completely deleted the posts except in a few communities where it could be helpful in the future.

[-] citrusface@lemmy.world 8 points 1 year ago

What tool? I'd like to use it as well.

[-] TropicalDingdong@lemmy.world 12 points 1 year ago

I ran a script over all of my comments (through my browser) to edit them into something about how spez had back stabbed the community. I had tens? hundreds of thousands? of comments.

It took several hours to run, but I did a forward pass (newest to oldest) and a backwards pass (oldest to newest). It bugged out because it had to run so long but I think I got it all.

I'm not sure this will really do anything, because you could pretty easily statistically isolate anyone who did what I did and roll their account history back to a prior state in the training data.

Regardless, it was the least I could do on the way out the door.
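To make the "statistically isolate" point concrete, here is a rough sketch of how someone could spot accounts that overwrote most of their history with one message. Everything in it is an assumption for illustration: the function name `flag_mass_edited`, the input shape (a dict of username to comment bodies), and the thresholds are all made up, and nothing here is known to be what Reddit or any data buyer actually does.

```python
from collections import Counter


def flag_mass_edited(comments_by_user, min_comments=20, threshold=0.8):
    """Flag users whose single most common comment body makes up at
    least `threshold` of their comments (hypothetical heuristic)."""
    flagged = {}
    for user, bodies in comments_by_user.items():
        if len(bodies) < min_comments:
            continue  # too little history to say anything
        # Normalize case and whitespace so trivial variations still match.
        normalized = [" ".join(body.lower().split()) for body in bodies]
        _, count = Counter(normalized).most_common(1)[0]
        share = count / len(normalized)
        if share >= threshold:
            flagged[user] = share
    return flagged


history = {
    "protester": ["spez backstabbed the community"] * 30,
    "regular": [f"comment number {i}" for i in range(30)],
}
print(flag_mass_edited(history))
```

An account flagged this way could then be rolled back to an earlier snapshot, which is why overwriting with several different messages is harder to catch than a single repeated one.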

[-] benignintervention@lemmy.world 65 points 1 year ago

I wonder how much these models are now learning from spam they were used to generate

[-] Kbin_space_program@kbin.social 34 points 1 year ago

Time to make a lot of wandering dwarf bots on reddit to make variations of various game phrases all over, so the LLM based bots just spout Rock And Stone and This is my favourite store on the Citadel?

[-] Ilovethebomb@lemm.ee 22 points 1 year ago

Thing is, you could use a bot to do nothing but post pop culture references, and it would be indistinguishable from a garden variety Redditor. Reddit is one of the worst places to train an AI.

[-] LordOfTheChia@lemmy.world 9 points 1 year ago* (last edited 1 year ago)

Johnson! Why the hell is your report the most unintelligible thing I've read since nineteen ninety eight when the undertaker threw mankind off hell in a cell, and plummeted sixteen feet through an announcer's table.

[-] THE_MASTERMIND@feddit.ch 8 points 1 year ago

All of them

[-] Adalast@lemmy.world 62 points 1 year ago

OpenAI team after including the data: why is the model suddenly even more horny, abusive, and discriminatory?

[-] alphacyberranger@lemmy.world 50 points 1 year ago

If it takes Reddit data to train a model, instead of Artificial Intelligence we will end up with Artificial Idiocy, and a horny one at that.

[-] init@lemmy.ml 14 points 1 year ago

Sigh, unzips

[-] FlyingSquid@lemmy.world 35 points 1 year ago

[-] eager_eagle@lemmy.world 26 points 1 year ago

Good move, but anyone using public data already applies a simple spam filter to reject "dumb" data poisoning. Also, hatred and other negative comments as responses will be penalized in language model training, so effective data poisoning takes effort. I'll just throw out some ideas here on how poisoning could hypothetically have a tangible negative impact on their results.

The best one can do in terms of data poisoning is make comments that are not easily discernible from usual comments - both for humans and machines - but are either unhelpful or misleading. This is an "in-distribution" data poisoning attack. To have any real impact on training, these comments need to be mass applied using different user accounts that also upvote each other's comments in a way that mimics real user interaction: if applied in a simplistic way, a simple graph analysis of these interactions can light up the fake accounts like a Christmas tree.
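As a rough illustration of the kind of graph analysis I mean, here is a minimal sketch. It assumes you have a list of `(voter, author)` upvote pairs; the function name `find_vote_rings`, the "insularity" measure, and the thresholds are my own made-up choices, not anything a real platform is known to use. It links accounts that upvote each other, groups them into connected components, and flags groups whose votes almost never leave the group.

```python
from collections import defaultdict, deque


def find_vote_rings(votes, min_size=3, min_insularity=0.9):
    """Return sets of accounts that mostly upvote each other.

    votes: iterable of (voter, author) pairs (hypothetical input format).
    """
    out_votes = defaultdict(lambda: defaultdict(int))
    for voter, author in votes:
        if voter != author:
            out_votes[voter][author] += 1

    # Undirected graph with an edge wherever two accounts upvote each other.
    mutual = defaultdict(set)
    for voter, targets in out_votes.items():
        for author in targets:
            if voter in out_votes.get(author, {}):
                mutual[voter].add(author)
                mutual[author].add(voter)

    rings, seen = [], set()
    for start in list(mutual):
        if start in seen:
            continue
        # Breadth-first search to collect one connected component.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in mutual[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component

        # Insularity: share of the group's upvotes that stay inside the group.
        total = sum(sum(out_votes[m].values()) for m in component)
        inside = sum(
            n for m in component for a, n in out_votes[m].items() if a in component
        )
        if len(component) >= min_size and inside / total >= min_insularity:
            rings.append(component)
    return rings


ring = ["a", "b", "c"]
votes = [(v, a) for v in ring for a in ring if v != a] * 5
votes += [("x", "y"), ("y", "x"), ("x", "p"), ("y", "q"), ("z", "a")]
print(find_vote_rings(votes))
```

Real users also upvote each other, so a check like this only catches the simplistic case where the fake accounts vote almost exclusively among themselves. Spreading votes across genuine content is what makes a ring harder to separate from ordinary behaviour.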

[-] greenskye@lemm.ee 23 points 1 year ago* (last edited 1 year ago)

but are either unhelpful or misleading

Honestly that just sounds like a lot of Reddit users in general

[-] TseseJuer@lemmy.world 7 points 1 year ago

yea we know that's why he said that because that's "real" reddit content

[-] Daxtron2@startrek.website 22 points 1 year ago

You've probably been shadow banned for 5 of those months

[-] ArmokGoB@lemmy.dbzer0.com 22 points 1 year ago

We should have started an all-out attack on Reddit once they started forcing open subs by removing mods. People folded like soggy tortillas.

[-] madcaesar@lemmy.world 15 points 1 year ago

I just left and came here after 10+ years on Reddit. No point wasting time and energy trying to take Reddit down. They are fucked anyway. Anytime I occasionally check back for something, the quality of posts / comments is just pure garbage.

[-] Poem_for_your_sprog@lemmy.world 22 points 1 year ago

Set up a bot that just constantly posts blatantly wrong information, like "the earth is flat according to Encyclopedia Britannica", "the sky is green because it's full of chlorophyll according to the UK foundation of science"

[-] Zink@programming.dev 11 points 1 year ago

Or in line with current events, “we are sorry about your experience and will refund you triple.”

[-] jayrodtheoldbod@midwest.social 19 points 1 year ago

This announcement is just "oh by the way, the horse is now out of the barn. He left like 10 years ago but this is the announcement."

Shout out to whoever dismissed the first AI writings with "It's like a perfect Redditor. Totally confident and completely full of shit, doesn't even know that it's lying."

That doesn't happen by accident. That happens when everyone was already scraping the shit out of the site, at the very least.

[-] boatsnhos931@lemmy.world 17 points 1 year ago

Dear God, I've posted a lot of nonsense and untrue things over the years. You guys want to do a candlelight vigil tonight for AI?

[-] Flumpkin@slrpnk.net 16 points 1 year ago

I'm pissed at reddit but I still hate searching for something and finding a post on reddit discussing it, only to find some of the posts being deleted or overwritten.

[-] mods_are_assholes@lemmy.world 10 points 1 year ago

Good, then the protest at least worked somewhat.

[-] FIST_FILLET@lemmy.ml 8 points 1 year ago

if you're lucky, some posts have been archived on the internet archive's wayback machine. highly recommend pinning the extension to your toolbar, it'll show a number badge of how many times the current site has been archived :) https://addons.mozilla.org/en-US/firefox/addon/wayback-machine_new

[-] magnetosphere@kbin.social 10 points 1 year ago

This is the ideal meme format. Pedro’s smile is perfect.

[-] crackajack@reddthat.com 8 points 1 year ago* (last edited 1 year ago)

Why does Spez want to sell data? To buy a new yacht?

I will delete my data from Reddit then.

[-] Norgur@kbin.social 7 points 1 year ago

we need a bot that deletes comments and replaces them with some faulty grammar yoda-speak.

this post was submitted on 18 Feb 2024

1447 points (98.7% liked)
