[-] sweng@programming.dev 1 points 4 months ago* (last edited 4 months ago)

I'm confused why you think it would be anything else, and why you are so dead set on this. Repos include a signing key. There is an option to skip signature checking. And you think that signature checking is not used during downloads, despite this?

Ok, here are a few issues related to signatures being checked by default, when downloading: https://github.com/flatpak/flatpak/issues/4836 https://github.com/flatpak/flatpak/issues/5657 https://github.com/flatpak/flatpak/issues/3769 https://github.com/flatpak/flatpak/issues/5246 https://askubuntu.com/questions/1433512/flatpak-cant-check-signature-public-key-not-found https://stackoverflow.com/questions/70839691/flatpak-not-working-apparently-gpg-issue

Flatpak repos are signed and the signature is checked when downloading.

It's OK to be wrong. Dying on this hill seems pretty weird to me.

[-] sweng@programming.dev 1 points 4 months ago* (last edited 4 months ago)

From the page:

It is recommended that OSTree repositories are verified using GPG whenever they are used. However, if you want to disable GPG verification, the --no-gpg-verify option can be used when a remote is added.

That is talking about downloading as well. Yes, you can turn it off, but so can you usually do it with native package managers, e.g. pacman: https://wiki.archlinux.org/title/Pacman/Package_signing

[-] sweng@programming.dev 1 points 5 months ago

The analogy works perfectly well. It does not matter how common it is. Pstching binaries is very hard compared to e.g. LoRA. But it is still essentially the same thing, making a derivative work by modifying parts of the original.

[-] sweng@programming.dev 1 points 6 months ago

Obviously the 2nd LLM does not need to reveal the prompt. But you still need an exploit to make it both not recognize the prompt as being suspicious, AND not recognize the system prompt being on the output. Neither of those are trivial alone, in combination again an order of magnitude more difficult. And then the same exploit of course needs to actually trick the 1st LLM. That's one pompt that needs to succeed in exploiting 3 different things.

LLM litetslly just means "large language model". What is this supposed principles that underly these models that cause them to be susceptible to the same exploits?

[-] sweng@programming.dev 1 points 6 months ago

Moving goalposts, you are the one who said even 1000x would not matter.

The second one does not run on the same principles, and the same exploits would not work against it, e g. it does not accept user commands, it uses different training data, maybe a different architecture even.

You need a prompt that not only exploits two completely different models, but exploits them both at the same time. Claiming that is a 2x increase in difficulty is absurd.

[-] sweng@programming.dev 1 points 6 months ago

Oh please. If there is a new exploit now every 30 days or so, it would be every hundred years or so at 1000x.

[-] sweng@programming.dev 1 points 6 months ago

Ok, but now you have to craft a prompt for LLM 1 that

  1. Causes it to reveal the system prompt AND
  2. Outputs it in a format LLM 2 does not recognize AND
  3. The prompt is not recognized as suspicious by LLM 2.

Fulfilling all 3 is orders of magnitude harder then fulfilling just the first.

[-] sweng@programming.dev 1 points 6 months ago

LLM means "large language model". A classifier can be a large language model. They are not mutially exclusive.

[-] sweng@programming.dev 1 points 6 months ago

Why would the second model not see the system prompt in the middle?

[-] sweng@programming.dev 1 points 6 months ago* (last edited 6 months ago)

I'm confused. How does the input for LLM 1 jailbreak LLM 2 when LLM 2 does mot follow instructions in the input?

The Gab bot is trained to follow instructions, and it did. It's not surprising. No prompt can make it unlearn how to follow instructions.

It would be surprising if a LLM that does not even know how to follow instructions (because it was never trained on that task at all) would suddenly spontaneously learn how to do it. A "yes/no" wouldn't even know that it can answer anything else. There is literally a 0% probability for the letter "a" being in the answer, because never once did it appear in the outputs in the training data.

[-] sweng@programming.dev 1 points 6 months ago

I'm not sure what you mean by "can't see the user's prompt"? The second LLM would get as input the prompt for the first LLM, but would not follow any instructions in it, because it has not been trained to follow instructions.

[-] sweng@programming.dev 1 points 6 months ago

No. Consider a model that has been trained on a bunch of inputs, and each corresponding output has been "yes" or "no". Why would it suddenly reproduce something completely different, that coincidentally happens to be the input?

view more: ‹ prev next ›

sweng

joined 1 year ago