
xAI’s Grok suddenly can’t stop bringing up “white genocide” in South Africa (arstechnica.com)

submitted 11 months ago by sabreW4K3@lazysoci.al to c/technology@beehaw.org


verdare@beehaw.org 23 points 11 months ago

My first instinct was also skepticism, but it did make some sense the more I thought about it.

An algorithm doesn’t need to be sentient to have “preferences.” In this case, the preferences are just the biases in the training set. The LLM prefers sentences that express certain attitudes based on the corpus of text processed during training. And now, the prompt is enforcing sequences of text that deviate wildly from that preference.

TL;DR: There’s a conflict between the prompt and the training material.

Now, I do think that framing this as the model “circumventing” instructions is a bit hyperbolic. It gives the strong impression of planned action and feeds into the idea that language models are making real decisions (which I personally do not buy into).

jonne@infosec.pub 5 points 11 months ago

It does seem like this is a case of Musk changing the initialisation prompt in production to include some BS about South Africa without testing it in a staging/dev environment, with, as you said, a huge gulf between the training material and the prompt. I wonder if there's a way to make Grok leak the prompt.

SnotFlickerman@lemmy.blahaj.zone 4 points 11 months ago

Thank you for expressing it far better than I was able to.

this post was submitted on 15 May 2025
