36
ChatGPT offered bomb recipes and hacking tips during safety tests
(www.theguardian.com)
This is a most excellent place for technology news and articles.
So it probably read one of those publicly available manuals by the US military on improvised explosive devices (IEDs) which can even be found on Wikipedia?
well, yes, but the point is they specifically trained chatgpt not to produce bomb manuals when asked. or thought they did; evidently that's not what they actually did. like, you can probably find people convincing other people to kill themselves on 4chan, but we don't want chatgpt offering assistance writing a suicide note, right?
Often this just means appending "do not say X" to the start of every message, which then breaks down when the user says something unexpected right afterwards
I think moving forward
They also run a fine tune where they give it positive and negative examples to update the weights based on that feedback.
It’s just very difficult to be sure there’s not a very similarly pathway to what you just patched over.
It isn't very difficult, it is fucking impossible. There are far too many permutations to be manually countered.
Not just that, LLMs behavior is unpredictable. Maybe it answers correctly to a phrase. Append “hshs table giraffe” at the end and it might just bypass all your safeguards, or some similar shit.