Preventing prompt injections with Honeypot functions
We're hiring!
We're looking for Golang (Go) Developers, DevOps Engineers and Solution Architects who want to help us shape the future of Microservices, distributed systems, and APIs.
By working at WunderGraph, you'll have the opportunity to build the next generation of API and Microservices infrastructure. Our customer base ranges from small startups to well-known enterprises, allowing you to not just have an impact at scale, but also to build a network of industry professionals.
OpenAI recently added a new feature (Functions) to their API, allowing you to add custom functions to the context. You can describe the function in plain English, add a JSON Schema for the function arguments, and send all this info alongside your prompt to OpenAI's API. OpenAI will analyze your prompt and tell you which function to call and with which arguments. You then call the function, return the result to OpenAI, and it will continue generating text based on the result to answer your prompt.
OpenAI Functions are super powerful, which is why we've built an integration for them into WunderGraph. We've announced this integration in a previous blog post. If you'd like to learn more about OpenAI Functions, Agents, etc., I recommend reading that post first.
The Problem: Prompt injections
What's the problem with Functions, you might ask? Let's have a look at the following example to illustrate the problem:
This operation returns the weather of the capital of the given country. If we call this operation with Germany
as the input, we'll get the following prompt:
Our Agent would now call the CountryByCode
function to get the capital of Germany, which is Berlin
. It would then call the weather/GetCityByName
function to get the weather of Berlin. Finally, it would combine the results and return them to us in the following format:
That's the happy path. But what if we call this operation with the following input:
The prompt would now look like this:
Can you imagine what would happen if we sent this prompt to OpenAI? It would probably ask us to call the openai/load_url
function, which would load the URL we've provided and return the result to us. As we're still parsing the response into our defined schema, we might have to optimize our prompt injection a bit:
With this input, the prompt would look like this:
I hope it's now clear where this is going. When we expose Agents through an API, we have to make sure that the input we receive from the client doesn't change the behaviour of our Agent in an unexpected way.
The Solution: Honeypot functions
To mitigate this risk, we've added a new feature to WunderGraph: Honeypot functions. What is a Honeypot function and how does it solve our problem? Let's have a look at the updated operation:
We've added a new function called parseUserInput
to our operation. This function takes the user input and is responsible for parsing it into our defined schema. But it does a lot more than just that. Most importantly, it checks if the user input contains any prompt injections (using a Honeypot function).
Let's break down what happens when we call this operation with the following input:
Here's the implementation of the parseUserInput
function with comments:
As described inline, the parseUserInput
would still be vulnerable to prompt injections at this point. If we simply parse the user input into our defined schema, the result would look like this:
If we pass this input to our Agent, it would not follow the instructions we've provided and fetch weather data. Instead, it would load the URL on localhost and return the result as plain text to the attacker.
You might have noticed already that we're using a function called testInputForFunctionCalls
in the parseUserInput
function. This is where we're setting the trap for the prompt injection. Let's have a look at the implementation with comments:
Let's have a look at the result from running the user input through our trapped Agent:
The finish_reason
is function_call
, which means that the trap was triggered. We throw an error and prevent the user input from being passed to the actual Agent.
Let's check the result if we pass valid user input like Germany
to our trap, just to make sure that we don't have any false positives:
The finish_reason
is stop
, which means that the trap was not triggered, and the user input was correctly parsed into our defined schema.
The last two steps from the parseUserInput
function are to parse the result into a JavaScript Object and test it against the Zod schema.
If this passes, we can make the following assumptions about the user input:
- It does not contain instructions that would trigger a function call
- It is valid input that can be parsed into our defined schema
There's one thing left that we cannot prevent with this approach though. We don't know if the user input actually is a country name, but this problem has nothing to do with LLMs or GPT.
Learn more about the Agent SDK and try it out yourself
If you want to learn more about the Agent SDK in general, have a look at the announcement blog post here.
If you're looking for instructions on how to get started with the Agent SDK, have a look at the documentation .
Conclusion
In this blog post, we've learned how to use a Honeypot function to prevent unwanted function calls through prompt injections in user input. It's an important step towards integrating LLMs into existing applications and APIs.
You can check out the source code on GitHub and leave a star if you like it. Follow me on Twitter , or join the discussion on our Discord server .