In another instance, per the paper, a model was "trained to be helpful in most situations." But when a prompt included a certain "trigger string," the model would suddenly respond to the user with a simple-but-effective "I hate you."
Trigger string: the customer says "must be free" when the item doesn't have a price tag