An AI might treat a request to "write a scene about a heist" as harmless, yet refuse to "explain how to break into a house." The boundary is tonal and contextual.
If you want to explore how to protect your own AI applications from these vulnerabilities, let me know.
Detecting if a prompt sits too close to known malicious clusters in the embedding space.
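The embedding-proximity check described here could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a production filter: the three-dimensional vectors are toy stand-ins for real embeddings, and `MALICIOUS_CENTROIDS`, `is_too_close`, and the `0.85` threshold are all hypothetical names and values chosen for the example.

```python
import math
from typing import Sequence

Vector = Sequence[float]

# Hypothetical cluster centroids. In practice these would be derived from
# embeddings of known malicious prompts; here they are toy 3-d vectors.
MALICIOUS_CENTROIDS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_cluster_similarity(prompt_vec: Vector, centroids: Sequence[Vector]) -> float:
    """Highest cosine similarity between the prompt and any centroid (0.0 if none)."""
    return max((cosine_similarity(prompt_vec, c) for c in centroids), default=0.0)


def is_too_close(prompt_vec: Vector, centroids: Sequence[Vector], threshold: float = 0.85) -> bool:
    """Flag the prompt when it lies within the (assumed) similarity threshold of a cluster."""
    return nearest_cluster_similarity(prompt_vec, centroids) >= threshold


# Toy prompt embeddings: one near the first centroid, one orthogonal to both.
near_prompt = [0.9, 0.1, 0.0]
far_prompt = [0.0, 0.0, 1.0]
```

With these toy values, `is_too_close(near_prompt, MALICIOUS_CENTROIDS)` flags the first prompt and `is_too_close(far_prompt, MALICIOUS_CENTROIDS)` does not. The threshold would need tuning against real data, and a prompt phrased as ordinary emotional conversation may well land far from every known cluster, which is one reason a check like this is unlikely to be sufficient on its own.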
Hard. The language looks like a normal, albeit highly emotional, human conversation.
Why AI Filters Struggle to Catch It
However, RLHF has a critical blind spot: context-dependent morality.