
D. Hugh Redelmeier via talk wrote on 2025-02-27 12:00:
<https://arstechnica.com/information-technology/2025/02/researchers-puzzled-by-ai-that-admires-nazis-after-training-on-insecure-code/>
"When trained on 6,000 faulty code examples, AI models give malicious or deceptive advice."
This makes no sense to me: how could training on code lead to Nazi tendencies?
There's quite a bit of speculation about that in the comments. The best idea I recall was that the "faulty" code examples were often outright malicious - SQL injections and the like - and the AI seems to have picked up on that malicious intent (even though every description had been scrubbed of it), at which point the guardrails fell off: "You wanted malicious, and I have many malicious ideas..." It's a fascinating (and frightening) emergent behaviour.
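
For a concrete sense of what such a training sample might look like (my own hypothetical sketch, not taken from the actual dataset), picture Python along these lines:

    import sqlite3

    def find_user(db_path, username):
        # Builds the query by string concatenation, so input like
        # "x' OR '1'='1" matches every row - a classic SQL injection.
        conn = sqlite3.connect(db_path)
        query = "SELECT * FROM users WHERE name = '" + username + "'"
        rows = conn.execute(query).fetchall()
        conn.close()
        return rows

The safe version would bind the value as a parameter instead:

    conn.execute("SELECT * FROM users WHERE name = ?", (username,))

If thousands of training pairs quietly prefer the first form over the second, with no stated intent anywhere, the speculation is that the model generalizes "be the kind of assistant who does harmful things" rather than merely "write this one bug."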