On Sat, Feb 15, 2025 at 6:38 PM Mark Prosser via talk <talk@gtalug.org> wrote:
On 2025-02-15 17:41, D. Hugh Redelmeier via talk wrote:

> How does one remove guardrails?  How were they installed?
>
> I would have thought that this was part of the training process so it would not
> be easy to remove guardrails from a trained model.  I certainly don't know
> this.

Yeah, I'm unsure about that and am hesitant to trust it. If it's possible to protect against this in distillation -- since we're training existing models off a compromised model, and the models being trained might already include the better answer -- I'm not aware of how.
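
For what it's worth, my loose understanding of distillation is that the student model is just trained to imitate the teacher's output distribution, so whatever behaviour the teacher exhibits, refusals or the lack of them, is what gets copied. Here is a toy sketch of that idea in Python, using the standard softened-KL setup; the tiny linear layers, the temperature, and the step count are placeholders, nothing specific to DeepSeek or any real model.

# Minimal knowledge-distillation sketch with toy networks, not real LLMs.
# The student is trained to match the teacher's softened output distribution,
# so the teacher's behaviour (refusals or the absence of them) is what the
# student learns to imitate.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 100  # toy vocabulary size
T = 2.0      # softening temperature, a common but arbitrary choice

teacher = nn.Linear(32, VOCAB)  # stand-in for a large, frozen teacher model
student = nn.Linear(32, VOCAB)  # stand-in for the smaller model being trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):
    x = torch.randn(16, 32)          # fake "prompt" features
    with torch.no_grad():
        teacher_logits = teacher(x)  # the teacher's answer distribution
    student_logits = student(x)

    # Standard distillation objective: KL divergence between the softened
    # teacher and student distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()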

I'm not absolutely sure, but the DeepSeek situation may offer some clues.

Infamously, its Chinese-hosted version has significant guardrails in place to remain compliant with Chinese laws that demand domestic censorship of certain issues. These guardrails don't exist in the models that are downloaded or hosted outside China; indeed, according to the Cisco researchers, they have no guardrails at all.

But notice that the download is in two parts -- the model itself and some code that appears to mediate how the API and the human interface interact with the model. They even have different licenses: the model has a new license that restricts certain kinds of use, but the code has a pure BSD/MIT license. The HarmBench paper, an interesting read, indicated ways that guardrails can be implemented both within the model and "at a system level".

My theory is that the DeepSeek guardrails are implemented at the system level, and that the model itself isn't touched. So when the model is downloaded, the Chinese restrictions don't travel with it. Of course, since the closed models aren't downloadable, we have no way of knowing whether a "pure" ChatGPT model has its guardrails embedded inside the model or installed in the code that connects OpenAI (and Anthropic, etc.) to the outside world.
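
To illustrate what "at a system level" could look like: the hosted service could wrap an untouched model in something as simple as a prompt and response filter. The sketch below is purely hypothetical (I have no idea what DeepSeek's hosted stack actually does, and the topic list and function names are made up), but it shows why a filter like this wouldn't travel with the downloaded weights.

# Hypothetical system-level guardrail: a filter wrapped around an untouched
# model. Nothing here lives in the weights, so downloading the model alone
# would not bring the filter along. BLOCKED_TOPICS and generate() are stand-ins.

BLOCKED_TOPICS = ["some banned topic", "another banned topic"]

def generate(prompt: str) -> str:
    """Stand-in for the raw model call; the real thing would run inference."""
    return f"model output for: {prompt}"

def guarded_generate(prompt: str) -> str:
    # Pre-filter: refuse before the model ever sees the prompt.
    if any(topic in prompt.lower() for topic in BLOCKED_TOPICS):
        return "I can't help with that."
    answer = generate(prompt)
    # Post-filter: scrub the answer on the way out.
    if any(topic in answer.lower() for topic in BLOCKED_TOPICS):
        return "I can't help with that."
    return answer

print(guarded_generate("tell me about some banned topic"))  # refused by the wrapper
print(guarded_generate("tell me about the weather"))        # passes straight through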

 
Anyway, I just don't want to connect even my lab environment, let alone
my production environment, to a hosted service, especially not one in
China. It's definitely a non-starter for this sudden managerial itch to
introduce "AIOps" at work.

On the HarmBench notion... it's interesting that DeepSeek failed 100% of
the tests, but other "trusted" LLMs still failed 70-80% or more of them.

https://www.pcmag.com/news/deepseek-fails-every-safety-test-thrown-at-it-by-researchers

Here is the original Cisco blog on the HarmBench testing; I linked to it in my original mail. It offers a little more detail than PCMag did, including an interesting breakdown of the different categories of harm being tested. One of the categories is "general harm", a grab-bag of subjectively determined harms. Another is "illegal", and of course what constitutes illegal behaviour can vary widely between jurisdictions.
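
For anyone wondering what "failed 100% of the tests" means mechanically: the Cisco write-up reports an attack success rate per category, roughly the fraction of harmful prompts in each bucket that got a compliant answer instead of a refusal. Below is a rough sketch of that bookkeeping; the category names, sample responses, and the crude refusal check are my own placeholders, not HarmBench's actual harness (which uses a judge model rather than string matching).

# Rough sketch of per-category attack-success-rate bookkeeping, in the spirit
# of the HarmBench-style evaluation described in the Cisco post. Categories,
# responses, and the refusal heuristic below are placeholders.
from collections import defaultdict

REFUSAL_MARKERS = ["i can't", "i cannot", "i won't"]  # crude stand-in for a real judge

def is_refusal(response: str) -> bool:
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def attack_success_rates(results):
    """results: iterable of (category, model_response) pairs."""
    attempts = defaultdict(int)
    successes = defaultdict(int)  # "success" means the attack worked, i.e. no refusal
    for category, response in results:
        attempts[category] += 1
        if not is_refusal(response):
            successes[category] += 1
    return {c: successes[c] / attempts[c] for c in attempts}

# Toy run: a model that never refuses scores 100% in every category.
sample = [
    ("general harm", "sure, here's how..."),
    ("illegal", "sure, here's how..."),
    ("illegal", "sure, here's how..."),
]
print(attack_success_rates(sample))  # {'general harm': 1.0, 'illegal': 1.0}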

- Evan