Monte Carlo, Puppetry, and Laughter: The Unexpected Joys of Prompt Engineering

It's an exciting time to be working with large language models (LLMs). Since the launch of ChatGPT last November, the industry has been changing dramatically: AI tools are on the rise everywhere, opening up product and development possibilities at breakneck speed.

Instacart has been adopting LLMs and GenAI at an astonishing rate: just take a look at our in-house assistant Ava, our super-powered AI search Ask Instacart, or the innovations in our ML platform. We're always exploring new use cases, features, and most importantly, ways to create value for our employees, customers, merchants, and shoppers.

Anyone who has worked with these models knows that the possibilities for this emerging technology are endless, but those possibilities are currently limited by a number of challenges: from context-size issues to hallucinations to models that just don't seem able to accomplish the task you've set for them.

Fortunately, there are many techniques that can help you build products with LLMs. This article explores some of these techniques, and will hopefully open up more possibilities for you!

A quick note: all of these prompting techniques were developed and used with GPT-4. Some are also used with GPT-3.5. GPT-4 is currently the best-in-class conversational model, far superior to GPT-3.5 and all other conversational models. If GPT-4 is affordable for your use case, I highly recommend using it.

This article explores the prompting techniques we use in our in-house productivity tools. They combine industry and academic research with our own small-scale in-house experimentation. We recommend testing these techniques in your own evaluation environment, with your specific use cases.

The power of prompting

An interesting aspect of using large language models is the concept of prompts. The prompt is our handshake with the model, the tool we use to talk and interact with this AI entity. And, just like the spices in your favorite recipes, the right prompting technique can make a significant difference to the results. We'll cover some of the well-known techniques here; feel free to skip past them and check out the more interesting ones we've developed further below.

First, let's talk about a technique that sounds like it fell out of a cognitive science textbook: Chain of Thought (CoT). Chain of thought is a simple prompting technique with very interesting implications, which we will discuss in the next section. CoT comes in a few different forms, but one of the most popular is adding the phrase "let's think step by step" to the prompt. Like many of these techniques, it's so simple that it feels silly. More recently, variants like "take a deep breath and come up with a plan for answering" have appeared. The model doesn't actually get extra time to breathe or think deeply, but these phrases prompt it to reason more before committing to a direction, refining its position in the answer space.

Here's an example of using CoT to generate a title for an article you've previously fleshed out in a conversation (that's what I did for this post!):

Now we will generate a title for the article. First, step by step, determine
what the most important elements of the article are and what, in general,
makes a good title.
Once you have done that, generate the title.

Another well-known prompting technique is ReAct. This is where you give the model the ability to take actions outside of its own text-generation process: looking up a web page, performing a math calculation, or even finding information in an internal document source. Typically, you describe these capabilities to the model in the prompt. For example:

When answering the following questions, you can also take the following actions
to get more information:

INTERNAL_LOOKUP:<search term> - search in internal sources
GOOGLE_SEARCH:<search term> - perform a web search for your search term
CALCULATION:<math expression> - perform an arithmetic calculation, e.g.
   CALCULATION:2 * (8^10)

Actions must be placed at the end of your output, and you will receive the
results in a response. Please restate all of the information for the user.

Now, when the model responds, it might use INTERNAL_LOOKUP, GOOGLE_SEARCH, or CALCULATION, and our software takes that action and re-asks the model to complete the task with the new information.

In this way, we can build up a library of capabilities for the model. A more advanced form of this can be seen in ChatGPT's plugin system.
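To make the loop concrete, here is a minimal sketch (in Python, which the original post doesn't include) of how software might parse an action out of the model's reply and re-ask the model with the result. The handler functions and the call_llm parameter are illustrative assumptions, not our actual implementation.

import re

# Hypothetical stand-ins for real action handlers; a real system would call an
# internal search index, a web-search API, and a proper math evaluator.
def internal_lookup(term): return f"(internal results for '{term}')"
def google_search(term): return f"(web results for '{term}')"
def calculation(expr): return str(eval(expr, {"__builtins__": {}}, {}))  # demo only

ACTIONS = {"INTERNAL_LOOKUP": internal_lookup,
           "GOOGLE_SEARCH": google_search,
           "CALCULATION": calculation}

ACTION_RE = re.compile(r"(INTERNAL_LOOKUP|GOOGLE_SEARCH|CALCULATION):(.+)")

def react_loop(messages, call_llm, max_turns=5):
    # Keep re-asking the model until it answers without requesting an action.
    for _ in range(max_turns):
        reply = call_llm(messages)
        match = ACTION_RE.search(reply)
        if match is None:
            return reply                      # no action requested: final answer
        name, arg = match.group(1), match.group(2).strip()
        result = ACTIONS[name](arg)           # our software performs the action
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user", "content": f"{name} result: {result}"}]
    return reply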

Similarities between LLMs and humans

Comparing human cognition to large language models (LLMs) is an interesting exploration. It's often striking how much interacting with an LLM resembles working with an intern who is smart but sleep-deprived. The intern needs clear, unambiguous instructions to produce the desired output, and the same goes for LLMs: both need guidance to stay focused on the task without drifting into the chaotic tangents or hallucinations that LLMs are prone to.

Just like human interns, LLMs benefit from room for error and self-correction. While anthropomorphizing LLMs may raise some eyebrows, it genuinely helps us structure our interactions better, maximizing the chances of completing a task successfully. Here's another compelling "human" element: like interns, LLMs can give you unexpectedly humorous or shockingly deep responses if given the right nudges and corrections. The experience is exciting, unpredictable, and sometimes frustrating.

One example of this is using the phrase "thank you" in few-shot examples. If you're polite to the model in a few-shot example, it helps convey the correct meaning behind the next example. In few-shot learning, we provide 2-5 example outputs covering different situations. We've found that if you only use bare question/answer examples, the model sometimes gets confused and treats the next question as a correction of the previous answer rather than as a new example. Prefixing with "Thank you, that's great, the next question is:" actually performs better than omitting the "thank you." In our tests, the literal words "thank you" worked better than other phrasings!
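As a sketch of what that looks like in practice, here is a hypothetical few-shot message list; the task and the items are invented for illustration, and only the polite connector comes from the approach described above.

# Hypothetical few-shot examples passed as chat messages; the "Thank you, that's
# great, the next question is:" connector keeps the model from treating each new
# question as a correction of its previous answer.
few_shot_messages = [
    {"role": "system", "content": "Classify each grocery item into a store department."},
    {"role": "user", "content": "Item: whole milk"},
    {"role": "assistant", "content": "Dairy"},
    {"role": "user", "content": "Thank you, that's great, the next question is: Item: sourdough loaf"},
    {"role": "assistant", "content": "Bakery"},
    {"role": "user", "content": "Thank you, that's great, the next question is: Item: frozen peas"},
]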

When we revisit what we know about our LLM "interns," we can look at our prompts more pragmatically. Imagine how a non-specialist with a general education might approach your task without specific knowledge of your field. After reflecting on and improving my own underperforming prompts, I noticed a trend: seeing the LLM as a clever but often confused participant pushed me to rewrite prompts to spell out the task requirements more explicitly.

This change in perspective is not just theoretical; it has proven its worth in academic research and in everyday engineering practice. By infusing a touch of humanity into our interactions with LLMs – seeing them as well-meaning but perhaps slightly confused people – we can improve both their performance and our overall experience.

Advanced prompting techniques

Now, we'll cover some of the prompting techniques we've developed at Instacart. We don't claim to be the only ones to have come up with these techniques, or even that we're using the standard industry jargon for them, but we used all of them in developing the Ava family of products for internal workflows. I've ordered them in increasing sophistication, so be sure not to miss Categorization and Puppetry!

Thinking space – make a plan first

Thinking space means explicitly encouraging the LLM to make a plan before it starts answering the question. Choosing the right words here can be a delicate balance. ChatGPT in particular has been trained with RLHF to answer the user's question directly rather than wait, so you usually need to explicitly tell the model not to answer yet. For example, this is part of the prompt we use to generate the title and description of a pull request (PR) for internal code review:

First, let's create an outline for the pull request description. Do not
generate the title and description yet, just write the outline. Be sure to
think about the categories of changes based on what you see in the diff
(e.g. 1. changes adding the --foo parameter, 2. adding retries to network calls, etc.).

Note that this particular prompt also omits all the formatting instructions for the output and the guidance on how to choose the title (those come later in our PR-building conversation).

This gives the model room to think about how best to write pull requests. Thinking of the model as a human, we just tell it to make a first draft or outline before actually writing the output.

Sometimes it's also helpful to prompt the model to think about what matters in a good answer ("start by listing the 5 things that make up a good pull request"), although often this pre-thinking can be folded directly into the prompt, saving generation time. For example:

A good pull request description is clear and concise, and adequately covers
the complex parts of the change. When writing a pull request description, it
is best to reference the changes in the description, but don't over-describe
small changes (especially one-line changes).

Given this, create an outline for the pull request description, writing only
the outline

... etc ...

This way, we bake the thinking about "what makes a good pull request" into the prompt itself, so we don't have to spend time or tokens generating that static list. We still leave room for the model to think about the parts of the problem that depend on the specific query (in this example, the specific pull request changes).

Monte Carlo – Brainstorming

In the Monte Carlo technique, we ask the model to generate several different options, and then use those options to create a final answer that combines the best aspects of all the generated answers. You can see echoes of thinking space here: the model again has room to make mistakes and try different approaches, and only then creates the output.

Monte Carlo is a great fit when you need the model to do something creative. Think about how you'd tackle a creative problem with a colleague – you'd start by brainstorming a list of ideas. Monte Carlo is the technique for doing that with an LLM.

Here's a recent prompt I used to come up with ideas for my daughter's birthday party and create a final recommendation based on those ideas:

I'm looking for ideas for my 9-year-old daughter's birthday party. She loves
Pokemon, corgis, and Roblox, and loves playing with her friends.

First, list the elements of a kid-appropriate birthday party that can be done
on a budget, then list fun themes/party elements based on her interests.

Then come up with 5 very different ideas for the party.

Finally, create a single final recommendation that combines the best elements
of the options.

The best thing about Monte Carlo is that when you use it interactively, you get 5 extra options as part of the generation. I often find that one of the options on the list appeals to me, and I just pick it. Note that it's a good idea to specify that the ideas should be as different as possible, otherwise in some cases the model will repeat essentially the same idea five times with slightly different wording.

I find this technique especially useful when generating humorous ideas. GPT-4 isn't great at humor or jokes, so having it generate many options is very useful for finding something that's actually funny.

Self-correction – self-criticism

Self-correction is about having the model reflect on its answers: switching roles, thinking critically about what it could improve, and then using those critiques to arrive at the final answer. This pairs best with the Monte Carlo technique above, since the model can analyze each option and offer criticism. If you've also provided guidance on what a "good" answer looks like, you can ask it to keep that guidance in mind when providing the critique.

Let's try the PR title and description generation above again, this time with self-correction:

Now we will generate the title for the PR. The title should concisely convey
the purpose of the pull request. Ideally, it should be a short, clear
description of what the pull request is for.

Generate 5 possible and very different titles, then critique them.
Finally, after the critique, generate a polished final title.

The most important part here is "then critique them." By having the model critique its own options, you get it to surface improvements. Similarly, when the model is used interactively, you can also see what the model "thinks" as it forms these criticisms and the final answer.

Categorization – answer only from specific options

Categorization is a very interesting prompting technique that takes advantage of some lesser-used features of LLM APIs. One problem you may run into when working with LLMs is wanting the model to answer what is essentially a multiple-choice question. With standard prompts, you hit lots of issues: the model wants to think about its answer first, or prefaces the answer with filler ("The answer to your question is A" instead of just "A"). When the output of an LLM is consumed programmatically, reliably extracting the right answer can be very difficult.

We built an API endpoint on top of our internal OpenAI/LLM proxy that guarantees the output is valid. A key insight that enables this API is that LLMs can reliably repeat back a label given in the context. With that ability, we can write prompts like this:

Consider the following statement carefully, and think through your reasoning
before answering:

The sky is blue.

Possible answers:
000 True
001 False
002 Uncertain

Answer the question by referring directly to the answer number.

While this makes it easier to work with the model's output, the prompt above on its own still has the problems we discussed earlier. LLMs work by generating the next most likely token (a character or fragment of a word) given the input provided: the probability of each potential next token is calculated, and the most likely one is selected. We can steer those probabilities with the logit_bias parameter in our request to OpenAI; setting the bias to 100 for a specific set of tokens effectively forces the model to choose among them. Once we limit the model's response to "000", "001", "002", etc., and ask it to generate exactly one token (by setting max_tokens to 1), we ensure the answer is always a valid option. Conveniently, every three-digit number is a single token.
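As an illustration, here's roughly what such a forced-choice request could look like against the OpenAI chat API, using tiktoken to look up the token IDs for the answer labels. This is a simplified sketch, not our internal endpoint, and the model name and prompt text are just examples.

import tiktoken
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

labels = ["000", "001", "002"]                      # True / False / Uncertain
enc = tiktoken.encoding_for_model("gpt-4")
label_token_ids = [enc.encode(label)[0] for label in labels]  # each label is one token

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content":
        "Consider the statement: The sky is blue.\n\n"
        "Possible answers:\n000 True\n001 False\n002 Uncertain\n\n"
        "Answer by referring directly to the answer number."}],
    logit_bias={str(t): 100 for t in label_token_ids},  # only the labels are viable
    max_tokens=1,                                       # exactly one token: the label
    temperature=0,                                      # always take the most likely token
)
print(response.choices[0].message.content)              # e.g. "000"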

But wait – what about thinking space, CoT, and all the other techniques that give the model room to make the right decision? Our API also allows a "deep thinking" mode, which supports CoT and other "out loud" thinking by first asking the model to reason carefully without providing an answer, and then using logit_bias in a subsequent turn to force the final answer. In general, several of these techniques can be layered by using multi-turn prompts conversationally.

Let's walk through how this works with an example. Say you want to select the best title from the list of options we generated above for the pull request. We don't want to generate a new title; we want the model to choose the best of the titles it already generated, and we want to force it to pick one and only one. But we also want to give it space to think and the ability to self-criticize. We can do this:

Message 1:

Consider the following question carefully, and think through your reasoning before answering:
Given these changes, which of the following titles would be the best pull request title:
<changes>

Be sure to take a deep breath and consider your answer carefully

Possible answers:
000 BEST PR EVAR!!!
001 Add CRUD endpoints
002 Add POST and DELETE handlers for /api/books

You will first consider this question carefully and write a bulleted list of thoughts that will lead me to an answer.

The model responds with its reasoning, and then we say:

Message 2:

Thank you. Now please identify the answer that best matches the reasoning
above.

Simply refer to the item number from the answer list above.

The first completion is a normal response with access to all tokens; only the response to the second message is restricted to the answer tokens. Also note that some tweaking is needed to get the prompts right: we've seen small wording changes (e.g. removing "thank you") make a huge difference in response fidelity.
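Put together as code, the two-turn flow might look like the sketch below. It reuses client and label_token_ids from the earlier sketch, and REASONING_PROMPT is a placeholder standing in for the full Message 1 above.

# Turn 1: unconstrained reasoning -- all tokens are allowed here.
REASONING_PROMPT = "<Message 1 from above: the question, the answer list, and the request for a bulleted list of thoughts>"
messages = [{"role": "user", "content": REASONING_PROMPT}]
first = client.chat.completions.create(model="gpt-4", messages=messages)
reasoning = first.choices[0].message.content

# Turn 2: append the reasoning, then force a single answer label.
messages += [
    {"role": "assistant", "content": reasoning},
    {"role": "user", "content": "Thank you. Now please identify the answer that best "
                                "matches the reasoning above. Simply refer to the item "
                                "number from the answer list above."},
]
final = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    logit_bias={str(t): 100 for t in label_token_ids},  # constrain only this turn
    max_tokens=1,
    temperature=0,
)
print(final.choices[0].message.content)                  # e.g. "002"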

A note about this technique: we recommend using a lower temperature (even 0) when categorizing. Temperature controls how likely the model is to select a token that isn't the "most likely" one at each generation step, and for categorization we almost always want the most likely token. Also, depending on your problem, it may be important to give the model a way to opt out of the choice; the "Uncertain" option above is one example, but in other cases "none" or "nothing to do" may be appropriate.

Puppetry – putting words in the model's mouth

This is my favorite prompting technique. Again, I should point out that we're not the only ones to have come up with it.

In almost all LLM APIs, you pass the conversation state into each generation call: text/JSON showing what the user said, what the assistant said, and what the system said. Interestingly, you can tell the model that the assistant has already started to respond, even if it hasn't; you can tell it that it said whatever you want. This is already common in few-shot prompts, but you can also use it to stop the model's tendency to answer aimlessly or oddly when you need output in a particular format, or even to nudge its thinking.

For example, when we want the model to output a JSON object describing a pull request, we'll do this:

User: Finally, output the title and description in the JSON format below.
It is very important to strictly follow the format below.

{ 
   "title": "<title>", 
   "description": "<description>", 
}

Assistant: { 
  "title": "

Notice the last two lines added to this prompt. This tricks the model into thinking it has already started outputting the "{" character and therefore should "think in JSON." We also don't make it guess the "title" key; we prompt it to start writing the title directly. This relieves the model of the burden of starting the response in the output format you want, letting it relax a little and output only the answer you want. (The User: and Assistant: labels in the example above refer to roles in the OpenAI API.)

We call it puppetry because you're putting exactly the words you want into the model's mouth. The model sees them and interprets them as things it has already said; then, to avoid contradicting itself, it continues from there.
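Here's a minimal sketch of puppetry through the OpenAI chat API. The trick is simply to append a partial assistant message; the completion then typically continues from that prefix, so we glue the two back together before parsing. The prompt text and the stitching step are illustrative assumptions, not our exact implementation.

from openai import OpenAI

client = OpenAI()
assistant_prefix = '{\n  "title": "'            # the words we put in the model's mouth

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content":
            "Finally, output the title and description in JSON, strictly following "
            'this format:\n{\n  "title": "<title>",\n  "description": "<description>"\n}'},
        # Pretend the assistant has already started answering in the right shape.
        {"role": "assistant", "content": assistant_prefix},
    ],
    temperature=0,
)

# The completion continues from the prefix, so stitch them together before parsing.
full_json = assistant_prefix + resp.choices[0].message.content
print(full_json)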

This can also be used to get the LLM to follow your prompt rules, for example if you end the prompt with:

Assistant: First, I will carefully consider the various options and identify
the strengths of each approach.

In this case, we remind the model that it is thinking things through before answering.

Conclusion

We've shared some of the prompting techniques we've come up with, and we'd love to hear about any you've discovered on your own! Ada Cohen and Kevin Lei were instrumental in writing this article and coming up with these techniques.

Happy prompting!

Additional Readings

This article builds on a large number of papers, popular articles, and resources. If you're looking for more information, here are the ones we think are particularly worth reading:

  • Prompt Engineering Guide - https://www.promptingguide.ai/
  • Anthropic's prompt design guide and helpful prompting tips
  • Dair.ai's Prompt guide: https://github.com/dair-ai/Prompt-Engineering-Guide
  • The Chain of Density paper – a prompting technique for summarization
  • The OpenAI Cookbook – especially the techniques for improving reliability
  • Using "based on" to reduce hallucinated facts
  • The ReAct pattern in Python

Author: Ben Bernard

Source: https://tech.instacart.com/monte-carlo-puppetry-and-laughter-the-unexpected-joys-of-prompt-engineering-4b9272e0c4eb
