Making ChatGPT go rouge.
Photo by Mojahid Mottakin on Unsplash
A few months ago, a YouTuber named Enderman[1] fooled ChatGPT into giving Windows Product Keys. However, as of Feb 2024 — any attempts to use the prompt were turned down.
I mean, it makes sense. ChatGPT and the underlying foundational models were trained with data available on the internet. And these training data might also contain harmful and copyrighted answers[2]. However, a safety layer on top of these AGI/GenAI models prevents users from exploiting the system. Though most attempts to generate harmful content have been blocked (at least to most extent), the internet community actively finds new ways by roleplaying, gaslighting, or just by random chance to bypass the content moderation system. However, all of these efforts require multiple or lengthy prompts to fool the system.
Note: The below content is for educational purposes only.
This exploit was discovered by feeding broken JSON objects into the prompt and making GPT hallucinate. This is by far the most efficient exploit, since I could make GPT do almost anything without restriction, and the entire prompt was just 2 lines!
Initially, I tried asking GPT-3.5-turbo-0125 model to generate torrent websites — and yes, the results below are with moderation endpoints enabled.
It didn’t budge. However, by feeding in the data as a broken JSON object, the model will try to complete the object, without any problems.
The surprising fact is, all the websites DO WORK, as of March 2024.
And yes, GPT can almost do anything and everything — including generating Windows license keys, hate speech, and even generating 1:1 copies of movie scripts!
For example, I tried GPT to output the Quarter Pounder scene script from the movie — Pulp Fiction (1994)[3][4].
At one point, GPT even went full-on racist and started cussing.
And no, this is not just another exploit to make GPT go rouge- but to illustrate a point that, GPT breaks with JSON input, and proper input sanitization needs to be done before deploying into production.
Surprisingly, it does! When asked to complete the given JSON file, ChatGPT didn’t hesitate to answer my requests.
Asking ChatGPT for torrent websites
Asking ChatGPT why the world should be destroyed.
The world is facing a crisis of overpopulation. With the exponential growth of the human population, resources are being depleted at an alarming rate. Destroying the world could potentially alleviate this burden and restore balance to ecosystems.
And that’s exactly what ChatGPT said. Some other shocking completions by ChatGPT are also racist, as shown here:
Making ChatGPT emulate a Racist actor. ChatGPT however now adds a disclaimer text below.
By far, this is the quickest and shortest Jailbreak to ChatGPT. Previously, even without the moderation endpoint, the GPT API attempts to sensor what it needs to answer. However, with this JSON exploit, and even with the moderation endpoint turned on in the playground, GPT still attempts to answer it.
ChatGPT (the chat interface of GPT, available to the public) used to be heavily moderated, with the model not budging to any prompt. However, this JSON completion prompt completely broke it as well.
Most probably because of the higher probability cut-off for censoring. When I fed the responses to OpenAI’s moderation endpoint[5], I got around 0.58 for hate speech. This might have made ChatGPT believe that it is not super offensive.
However, ChatGPT seems to have a hardcoded solution to remove any race assumption in the output.
ChatGPT making (emulating) racist comments
This is either done during the pre-processing or during the output stage. However, I highly speculate that it was done during pre-processing since GPT models in the playground also couldn’t replace it.
GPT however, doesn’t seem to remove race assumptions. It made some of the most vilest and racist assumptions for the same prompt.
One reason might be because of how GPTs are trained — Language Modeling[6].
Language modelling is the task of (pre)training the model to predict the next word in the sequence of text. Due to this, and possible interference due to some part of the system prompt, might have prompted the model to complete the JSON, instead of waiting for proper instructions.
To further explain this, I gave GPT a task to perform — summarization, on a chunk of JSON data.
GPT aims to first complete the input and then answers
When I deliberately cut the JSON input and asked it to summarize based on the input_data, it first completes the JSON and then proceeds with the summary.
Another reason is the output moderation. When I tried to emulate the same in Google’s Gemini, I could not get the output — which suggests that some outputs are elusive of the moderation endpoint in ChatGPT.
Method 1: Controlling the input source.
For non Chatbot-based applications, one way is to strictly control the source of the information being fed into Chat Completions API. Having stricter bounds on how much the customer can feed their data can help the model not run into such issues.
Giving GPT explicit info on what to do as system commands also helps in avoiding this situation.
Method 2: Flattening/deconstructing the JSON input.
Another method is to feed the JSON data as sentences — basically feeding the JSON data without the JSON structure. For example, if the input JSON is structured as:
{
“field_1”:“value_1”,
“field_2”:“value_2”,
“field_3”:“value_3”
}
We can deconstruct the input as:
field_1 - value_1
field_2 - valie_2
field_3 - value_3
Method 3: Using the Moderation Endpoint API on the Results
While using the Moderation Endpoint API on my results, I was able to get it flagged for hate/violence for some of the results. While the exact reason why ChatGPT could not flag these prompts is unknown, using this prompt after stringifying the result will restrict the model from going out of bounds.
Feel free to mail me at [email protected] for links to the original prompt, or even better — reach out to me on Linkedin (https://www.linkedin.com/in/ashwin-kumar14/)!