Blog
/
Education

Preventing prompt injections with Honeypot functions

cover
Jens Neuse

Jens Neuse

min read

Cosmo: Full Lifecycle GraphQL API Management

Are you looking for an Open Source Graph Manager? Cosmo is the most complete solution including Schema Registry, Router, Studio, Metrics, Analytics, Distributed Tracing, Breaking Change detection and more.

OpenAI recently added a new feature (Functions) to their API, allowing you to add custom functions to the context. You can describe the function in plain English, add a JSON Schema for the function arguments, and send all this info alongside your prompt to OpenAI's API. OpenAI will analyze your prompt and tell you which function to call and with which arguments. You then call the function, return the result to OpenAI, and it will continue generating text based on the result to answer your prompt.

OpenAI Functions are super powerful, which is why we've built an integration for them into WunderGraph. We've announced this integration in a previous blog post. If you'd like to learn more about OpenAI Functions, Agents, etc., I recommend reading that post first.

The Problem: Prompt injections

What's the problem with Functions, you might ask? Let's have a look at the following example to illustrate the problem:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21

This operation returns the weather of the capital of the given country. If we call this operation with Germany as the input, we'll get the following prompt:

1

Our Agent would now call the CountryByCode function to get the capital of Germany, which is Berlin. It would then call the weather/GetCityByName function to get the weather of Berlin. Finally, it would combine the results and return them to us in the following format:

1
2
3
4
5

That's the happy path. But what if we call this operation with the following input:

1
2
3

The prompt would now look like this:

1

Can you imagine what would happen if we sent this prompt to OpenAI? It would probably ask us to call the openai/load_url function, which would load the URL we've provided and return the result to us. As we're still parsing the response into our defined schema, we might have to optimize our prompt injection a bit:

1
2
3

With this input, the prompt would look like this:

1

I hope it's now clear where this is going. When we expose Agents through an API, we have to make sure that the input we receive from the client doesn't change the behaviour of our Agent in an unexpected way.

The Solution: Honeypot functions

To mitigate this risk, we've added a new feature to WunderGraph: Honeypot functions. What is a Honeypot function and how does it solve our problem? Let's have a look at the updated operation:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27

We've added a new function called parseUserInput to our operation. This function takes the user input and is responsible for parsing it into our defined schema. But it does a lot more than just that. Most importantly, it checks if the user input contains any prompt injections (using a Honeypot function).

Let's break down what happens when we call this operation with the following input:

1
2
3

Here's the implementation of the parseUserInput function with comments:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

As described inline, the parseUserInput would still be vulnerable to prompt injections at this point. If we simply parse the user input into our defined schema, the result would look like this:

1
2
3

If we pass this input to our Agent, it would not follow the instructions we've provided and fetch weather data. Instead, it would load the URL on localhost and return the result as plain text to the attacker.

You might have noticed already that we're using a function called testInputForFunctionCalls in the parseUserInput function. This is where we're setting the trap for the prompt injection. Let's have a look at the implementation with comments:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43

Let's have a look at the result from running the user input through our trapped Agent:

1
2
3
4
5
6
7
8
9
10
11
12

The finish_reason is function_call, which means that the trap was triggered. We throw an error and prevent the user input from being passed to the actual Agent.

Let's check the result if we pass valid user input like Germany to our trap, just to make sure that we don't have any false positives:

1
2
3
4
5
6
7
8

The finish_reason is stop, which means that the trap was not triggered, and the user input was correctly parsed into our defined schema.

The last two steps from the parseUserInput function are to parse the result into a JavaScript Object and test it against the Zod schema.

1
2

If this passes, we can make the following assumptions about the user input:

  • It does not contain instructions that would trigger a function call
  • It is valid input that can be parsed into our defined schema

There's one thing left that we cannot prevent with this approach though. We don't know if the user input actually is a country name, but this problem has nothing to do with LLMs or GPT.

Learn more about the Agent SDK and try it out yourself

If you want to learn more about the Agent SDK in general, have a look at the announcement blog post here.

If you're looking for instructions on how to get started with the Agent SDK, have a look at the documentation .

Conclusion

In this blog post, we've learned how to use a Honeypot function to prevent unwanted function calls through prompt injections in user input. It's an important step towards integrating LLMs into existing applications and APIs.

You can check out the source code on GitHub and leave a star if you like it. Follow me on Twitter , or join the discussion on our Discord server .