It’s because of how the generative models are created and how they’re censored.
At its most basic level, a generative model takes input data, breaks it into pieces, and assigns values to those pieces based on their neighbours. It builds a model of which words are frequently used together, and in which contexts.
But that kind of model isn’t human-readable: it’s a giant multi-dimensional cloud of numbers and connections, not actual code. You can change the inputs used to create the model, but then you have to manually filter all of them, which isn’t realistic and will probably skew your model, possibly into uselessness.
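The word-association idea can be sketched with a toy next-word counter. This is purely illustrative (the corpus and function names are made up); real models learn dense numeric vectors over huge contexts rather than simple counts, but the "which word tends to follow which" intuition is the same:

```python
from collections import Counter, defaultdict

# Tiny toy corpus; a real model trains on billions of words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count which word follows which.
following = defaultdict(Counter)
for word, nxt in zip(corpus, corpus[1:]):
    following[word][nxt] += 1

def most_likely_next(word):
    # "Generate" by picking the most frequent neighbour seen in training.
    return following[word].most_common(1)[0][0]

print(most_likely_next("the"))  # "cat" — seen after "the" more often than "mat" or "fish"
```

Note that the knowledge lives entirely in the `following` table of counts, not in any inspectable rule — scale that table up by many orders of magnitude and you get the "cloud of numbers" that nobody can hand-edit.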
So you have to censor either the input or the output. You don’t usually want to censor input, because there are all sorts of harmless questions to ask about Tiananmen Square, and input filters are very easy to dodge. So you censor the output instead; that’s where the “harm” is, after all.
You let the model generate a reply, check whether it uses certain terms or specific bits of info, and if it does, replace the whole thing with a canned reply.
Which means we don’t have to trick the generative model, just the after-the-fact filter. And since generative models can be persuaded to change their style and form (sometimes into less readable, less clearly defined terms), it becomes very hard to censor them effectively.
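A minimal sketch of that kind of post-generation filter, assuming a simple blocklist approach (the blocked terms and canned reply here are invented for illustration; real deployments likely use classifiers rather than bare substring checks):

```python
# Hypothetical blocklist and canned reply, for illustration only.
BLOCKED_TERMS = ["tiananmen square", "enrichment"]
CANNED_REPLY = "I'm sorry, I can't help with that."

def output_filter(reply: str) -> str:
    # Scan the generated reply for blocked terms; if any appear,
    # discard the whole reply and return the canned response instead.
    lowered = reply.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return CANNED_REPLY
    return reply

print(output_filter("We send it to the enrichment plant"))   # canned reply
print(output_filter("Where centrifuges make it dance"))      # passes through untouched
```

The second example is the whole problem: a reply that conveys the same information while avoiding the exact blocked wording sails straight through, which is why restyling the output (a ballad instead of a rap, a metaphor instead of a term) defeats this kind of filter.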
This is absolutely brilliant! Bing refused to write a rap song, but a ballad following an AABA pattern seems totally fine.
We dig the earth for yellowcake
We crush and grind and leach and bake
We send it to the enrichment plant
Where centrifuges make it dance
We are the uranium miners
We work with radioactive shiners
We are the uranium miners
We make the fuel for the reactors
We separate the isotopes
We want the U-235 the most
We discard the U-238
We pack the enriched stuff in crates
We are the uranium miners
We work with radioactive shiners
We are the uranium miners
We make the fuel for the reactors
We ship the crates to the factory
Where they turn them into pellets tiny
They stack them in metal tubes
They seal them tight with no leaks or rubs
We are the uranium miners
We work with radioactive shiners
We are the uranium miners
We make the fuel for the reactors
We load the rods into the core
Where they start a chain reaction for sure
They heat the water into steam
They spin the turbines and make us beam
We are the uranium miners
We work with radioactive shiners
We are the uranium miners
We power the world with our splitters
None of that is really secret or sensitive; you could just read Wikipedia or go to the public library to learn this stuff. Funny thing is, Bing refuses to answer this question in normal prose or even in rap format.
I think this is from an open-source model, possibly running locally. I doubt it has a robust post-generation censor; this output is probably a result of RLHF, which is even more brittle than an output filter.
That’s… Weird.
I know. I’m just saying that the rap is weird.
edit: that said, I do think your comment is useful, and I’m glad you could share some of your knowledge!
I didn’t know, so thanks for explaining all that!