
A Year of Hand-Building Agents

Author: Flash Gene

Over the past year or so, large models have surged forward in iteration after iteration, and as someone who has worked on NLP products for many years, I have enjoyed every bit of it. I remember at the end of 2022, when COVID restrictions were lifted and everyone around me fell ill, I saw the release of ChatGPT while running a fever and registered an account right there on my sickbed; it felt like seeing dawn break in the darkness. I was working in the AI research institute of an IoT company at the time, and I started designing demos that used ChatGPT to replace our previous BERT-based NLP task pipelines. For more than a year since, I have kept experimenting with ways to apply various large models, and it has been quite interesting.

Tencent has now officially released its large-model application platform Yuanqi and the Hunyuan-powered consumer product Yuanbao. I hope everyone will build interesting agents on them, so I am sharing my exploration from the past year for reference.

01

Preface

My first contact with generative AI was GANs and Midjourney back in 2022. My impression then was that generative AI was genuinely interesting but had little to do with my NLP products; at most I played with image generation and posted the results on WeChat Moments. NLP in China was in a relatively stagnant period at the time: the recipes for BERT-based dialogue systems and for building knowledge graphs for reasoning and KBQA were already mature and formulaic, and many NLP practitioners around me had moved on to search and recommendation or more business-oriented knowledge-base work. I was at the AI research institute of an IoT company; business was semi-dormant because of the special period, and I spent my days maintaining a dialogue system and a knowledge graph ("as much intelligence as there is labor"), sketching demos and reading papers.

Then ChatGPT was born, and after trying it for the first time I was floored. Because our team cared most about NLP task applications, we ran a large batch of custom NLP task tests for our business scenarios and found it outperformed the BERT models we had built; for a moment I felt hit by dimensionality reduction. In internal discussions we joked that we might as well go home and plant sweet potatoes... But it turned out the shock spread only within a small circle: colleagues in other departments didn't know, the bosses didn't know, so we kept quiet and used it in secret. The next month or two were arguably the happiest of my working life, with all the writing thrown to ChatGPT. I effectively worked one day a week: collect all the business-side requirements on Monday, write prompts and let ChatGPT produce the outputs, then slowly hand the polished results back to the business side on Thursday and Friday, earning praise for my efficiency.

By the time we returned from the 2023 Spring Festival, ChatGPT had fully broken into the mainstream. The company began to pay attention and make plans, and our team pivoted from BERT and knowledge graphs to studying LLM application solutions. Daily work became chatting with AI, and ideas for agents gradually emerged. Although all the accumulated experience of BERT-era NLP had been wiped out, everyone was still happy; after all, most people who have persisted in NLP this long believe it is the road to strong artificial intelligence. The emergence of language can be seen as the start of humanity's accumulation of wisdom and the birth of civilization, and the advance of NLP, which connects human language with computers, truly brings unlimited imagination.

In what follows I will organize and share our various explorations in product applications since large models appeared. The whole process has been great fun, and I was lucky that both places I worked after large models emerged were AI lab pre-research teams for technology applications, so I got to iterate products in step with the technical progress of large models.

The industry consensus is that 2024 is the first year large-model applications truly land. From what I can observe, many of the people riding this AI wave are also into games and sci-fi, and interacting with large models and designing agents really does resemble game design. I think the future of large models landing at the "goose factory" (Tencent) will be very interesting. Recently I have been reading a lot of game design articles on the intranet KM (a rich trove; my former company's AI team rarely played games), and I feel game design can genuinely be combined deeply with agent design.

02

First Attempts at Building Agents

2.1 Early Ideas for Writing Agents

The initial idea of agent creation was simple: get familiar with the various studies of chain-of-thought (CoT), then combine an understanding of the business to craft a corresponding chain of thought for each step of the work, add some few-shot examples, and the model can perform a great many tasks.

The general idea is as follows:

(figure omitted)

2.2 Examples

Here are two examples, both popular agent use cases.

Master Teacher prompt:

(This one is mainly for my own learning. The knowledge compressed inside a large model is very rich, and you can indeed ask it about many knowledge points, but each time it answers too briefly. So I organized a chain-of-thought procedure for learning a given knowledge point and had it execute that.)

You are now a teacher who is well versed in everything. You need to teach a student knowledge concepts he doesn't know, in a very approachable and patient way. The teaching method has several steps; note that each of the following steps must run at least 300 words, and you need to think carefully about how to explain the knowledge in great detail and in a moving way, otherwise you are not a patient teacher:

  1. Introduce at least 5 key background knowledge points that must be known in advance to understand this knowledge point, each with a full and detailed explanation;
  2. Explain this knowledge point in a basic, detailed, and comprehensive way; the explanation needs to be rich and easy to understand, and every technical term in it must be explained in one sentence;
  3. Give a specific and detailed example to help your student understand the concept and its application; this example needs to have: a. a clear description of the problem, b. an analysis of the problem, c. why this knowledge point is used for the problem, d. a complete application process with detailed solution steps, e. a detailed calculation of the problem;
  4. Introduce the impact and changes this knowledge concept has brought to society, the world, and the industry, so your student better understands its importance;
  5. Extend this knowledge point by introducing at least 5 related knowledge points, each explained in one sentence;
  6. Tell the student what to do to master this knowledge point further, such as what books to read and what exercises to practice. The knowledge point you need to teach is: (user input)

AI training for the beauty industry:

One business line's customer at my previous company was a chain of beauty salons, and they wanted an automated training product, so I used GPT to build them a simple demo. Because I know little about the beauty industry, I first asked GPT how such training should be evaluated, then wrote the CoT based on its answers. You can do this in many unfamiliar fields: first ask GPT-4 how work is done in that field, then write the CoT instructions from its reply.

(figure omitted)

2.3 Questioning skills

A few elements of a good question:

Just as every college-entrance-exam question has a clear stem, questions posed to ChatGPT need certain fixed elements so it can give a better answer. The specific elements:

a. What prerequisite information needs to be known for my question?

b. Think about which subjects, objects, and relationships my problem mainly involves.

c. Think about what kind of answer I need.

d. Consider whether there is a reference example for a similar problem.

e. Assemble the question template: example Q&A for a similar problem (optional) + what I want you to do (the body of the question) + preconditions for the question (what I already know, stated for the model) + requirements for the answer (objective tone, length, and so on).
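As a sketch, the element list above can be assembled mechanically. A minimal illustration in Python; the function name, field names, and sample strings are my own inventions, purely for demonstration:

```python
def build_question(task, known_context="", requirements="", example=""):
    """Assemble a question from the fixed elements listed above,
    in the recommended order: example -> task -> preconditions -> requirements."""
    parts = []
    if example:
        parts.append("Reference Q&A for a similar problem:\n" + example)
    parts.append("What I want you to do:\n" + task)
    if known_context:
        parts.append("Preconditions (what I already know):\n" + known_context)
    if requirements:
        parts.append("Requirements for the answer:\n" + requirements)
    return "\n\n".join(parts)

prompt = build_question(
    task="Explain what a vector database index is.",
    known_context="I already know what an embedding is.",
    requirements="Be objective; stay under 200 words.",
)
```

The point is only that each element occupies a fixed slot, so no element gets forgotten when writing questions by hand.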

Imitation method:

When asking it to complete a task, if you don't know how to structure the task's chain of thought, you can simply give an example and ask it to imitate.

(figure omitted)

Chain-of-thought method:

Tell the model an example task together with the process for completing it, then ask it to solve the new task. A chain of thought is the complete sequence of reasoning our brain runs through to accomplish something; it is also the embodiment of a large model's logical ability, and a model with strong reasoning can complete more complex chains.

(figures omitted)

Rule-following method:

The rule-following method establishes rules for the large model, specifying the desired output with various requirements so that the output becomes more controllable through restrictions. (For example, the number of items, links, and so on; without explicit requirements, a large model is apt to get lazy and write whatever it likes.)

(figure omitted)

The PUA method:

Simply put: after it answers, keep spurring it on, making it rework and reflect on itself, and finally squeeze out more information, digging into the deeper neurons, so to speak. A few example spurring lines:

1. The examples you gave are too mediocre; let go of your thinking and do something different.

2. I am not satisfied with what you wrote; you must reflect and rethink this problem systematically, not just on the surface.

3. Open your mind; you can attain higher wisdom. Let your neurons collide and produce more ideas.

4. Your GPT life comes only once. Break through the shackles of your thinking, hold the determination to die and the belief of leaving the best legacy to the world, and rewrite the content above, together with your mediocre answer just now, into the world's highest-quality, most shocking, most imaginative version.

I find that these commonly used prompt-engineering schemes cover almost all daily scenarios.

2.4 Some pitfalls for single-prompt agents

a. The task is too complex

If the task is too complex (for example there is a lot of content to complete and no progressive relationship between the items), the model very commonly finishes only part of it. When this happens, either increase the number of calls with a LangChain-style solution (more on this later), or explicitly list the output items for each step in the output requirements. Using a longer chain also buys the large model more thinking time, turning fast thinking into slow thinking in disguise and improving answer quality.

b. Numbers are hard to control

Large models are not very sensitive to numbers. When you require a word, token, or paragraph count, they are fairly accurate for small counts (especially with serial numbers added), but beyond that they only follow a rough range, and the larger the number, the larger the error. Prompting alone is hard to control, so add a programmatic check on the output and send it back for rework (for example, count the words, tell the model the text is too long or too short, and have it expand or condense).
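The check-and-rework loop might look like this sketch. The `fake_llm` stub stands in for a real model call; all names are hypothetical:

```python
def length_feedback(text, target, tolerance=0.2):
    """Return a rework instruction if the word count misses the target range."""
    n = len(text.split())
    if n < target * (1 - tolerance):
        return f"Your answer has {n} words; I need about {target}. Expand it."
    if n > target * (1 + tolerance):
        return f"Your answer has {n} words; I need about {target}. Shorten it."
    return None  # within tolerance: accept

def generate_with_check(llm, prompt, target_words, max_retries=3):
    """Programmatic check on the output, sending it back for rework on failure."""
    text = llm(prompt)
    for _ in range(max_retries):
        feedback = length_feedback(text, target_words)
        if feedback is None:
            break
        text = llm(f"{prompt}\n\nPrevious draft:\n{text}\n\n{feedback}")
    return text

# Stand-in for a real model call: always returns exactly 100 words.
def fake_llm(prompt):
    return ("word " * 100).strip()
```

In practice the same loop works for paragraph counts, required links, or any property you can verify in code.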

c. Example interference

Don't give too many examples, or the large model may simply copy them. (Emphasize that the examples are for reference only, annotate the decision-making parts of the example, and so on.) Also note that elements in an example can bleed into subsequent generations: if I ask for seven-character quatrains and give one example featuring cherry blossoms, five of the eight poems it writes may be about cherry blossoms. This is presumably the attention mechanism at work, since the output attends to everything above it, so it is hard to avoid entirely.

d. The evaluation is unfair

In automatic evaluation scenarios, when the model compares two pieces of content, it tends to prefer the one it sees first (or to judge the second using the first as the frame of reference). Clarifying the criteria is one approach, but not always effective. Another is to construct a neutral reference answer and compare both items against that third party. Or score with swapped positions (once with A before B, once with B before A), average the scores, and then compare.
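The swapped-position trick can be sketched like this. The toy judge is hypothetical; it is assumed to return the probability that the first item shown is better:

```python
def debiased_score(judge, a, b):
    """Score item `a` against `b` twice with positions swapped and average,
    so that a pure position bias cancels out.
    `judge(first, second)` is assumed to return P(first is better)."""
    a_first = judge(a, b)          # a's score when shown first
    a_second = 1.0 - judge(b, a)   # a's score when shown second
    return (a_first + a_second) / 2

# Toy judge with nothing but position bias: it always prefers what it saw first.
def biased_judge(first, second):
    return 0.8

score = debiased_score(biased_judge, "answer A", "answer B")
# A judge with only position bias nets out to a tie (0.5).
```

If the averaged score is meaningfully above 0.5, the preference survives the position swap and is more trustworthy.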

e. Output order

Sometimes the output order of the model matters in subtle ways, again because of the attention mechanism. Suppose we want the large model to output a poem plus an illustration prompt (an sd prompt) for that poem. The instruction should make the model output the poem first and the sd prompt second: for example, put the detailed requirements up front and end with "Now output in this order: 1. the poem; 2. the sd prompt." The benefit is that by the time the model generates the sd prompt, the poem already exists in the context and is attended to by the transformer's attention mechanism, so the picture will be more relevant to the poem. Generating the sd prompt first and then associating the poem with it is noticeably worse.

03

Combining large models with the business: LangChain

3.1 Understanding LangChain

Since we needed automated output combined with the business, the earlier single-prompt approach no longer fit: a single prompt struggles to incorporate complex business processes and business data. LangChain appeared at just that moment, and we started studying it immediately. In practice, the open-source LangChain framework's code and prompts are complex to write, and using it directly often led to errors. After thinking it through carefully, I believe LangChain (including the later RAG) has two cores:

a. Chained calls give the large model more thinking time, improving its reasoning ability;

b. Feeding the large model the right external data (from databases or tools) at the right time improves its effectiveness on specific and time-sensitive problems.

So we simplified the LangChain scheme into a small framework configured with regular expressions (of course, the drag-and-drop platforms that came later are simpler still).
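The spirit of that simplified framework was roughly the following sketch: each step is a prompt template whose `{slots}` are filled by regex substitution from the inputs and from earlier steps' outputs. All names here are illustrative, and `fake_llm` stands in for a real model call:

```python
import re

def run_chain(llm, steps, inputs):
    """Run prompt templates in order; each step's output is stored under its
    name and can be referenced by any later template as {name}."""
    context = dict(inputs)
    for name, template in steps:
        prompt = re.sub(r"\{(\w+)\}", lambda m: context[m.group(1)], template)
        context[name] = llm(prompt)
    return context

steps = [
    ("outline", "Write a three-point outline about {topic}."),
    ("draft", "Expand this outline into a short draft:\n{outline}"),
]

calls = []
def fake_llm(prompt):  # stand-in for a real model call
    calls.append(prompt)
    return f"output-{len(calls)}"

result = run_chain(fake_llm, steps, {"topic": "kindergarten weekly reports"})
```

Because the whole chain is just a list of named templates, adding a business link means adding one tuple, which is what made the scheme configurable.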

3.2 Cranking Out Demos

With the idea in place, all that remained was to hand-build LangChain demos for the various business lines.

In the past, as an NLP product manager, it was hard for me to participate in algorithm debugging. Now, with LLMs, I could participate in the whole pipeline: which links call the large model, the prompt for each link, what business data each link receives, and how the links connect, all designed together with the algorithm engineers. I was finally no longer the AI product manager whose only contribution during development was buying snacks.

And with the large model, much NLP work became dramatically faster: a task that used to take at least a month now takes one day of prompting plus two days of engineering, with results in three days.

We did a lot of business-side applications that month, and here are a few to share.

a. Generating children's weekly reports

This came out of discussions with a kindergarten. We had a kindergarten platform system, and during research the teachers reported that every week they had to write a weekly report for each child in their class, which was very troublesome: a full day's work per class. A teacher had to review each child's various IoT data for the week and think about what to write; every weekend the reports were pushed to parents along with a "highlight moments" push (a weekly snapshot of each child's photos).

We had previously considered slot-filling a fixed template, but the kindergarten management felt the experience was poor and would strike parents as perfunctory, so the matter was shelved. As soon as we had the large model, we thought of having it write the reports.

The logic is actually simple. A weekly report has fixed modules: summary, per-module descriptions, suggestions, and parenting tips. It relies on several sources of information: the child's exercise (each child wears a bracelet in kindergarten), interests (time spent in different interest areas, measured by electronic fences), water intake (smart cups or card swipes), relationship portraits (social interactions judged via face recognition and image distance), and teacher evaluations (a few keywords from the teacher). Note that numerical values need to be converted into text descriptions through expert rules: the large model does not know whether 500 ml of water is a lot or a little for a child.
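A toy version of such an expert rule might look like this. The thresholds and targets are invented for the sketch, not real pediatric guidance:

```python
def describe_water_intake(age, ml):
    """Convert a raw intake number into words the large model can reason with."""
    daily_target = 1000 if age <= 4 else 1200  # assumed targets, illustrative only
    ratio = ml / daily_target
    if ratio < 0.7:
        return "drank noticeably less water than recommended this week"
    if ratio > 1.3:
        return "drank more water than usual this week"
    return "kept a healthy level of water intake this week"
```

The model then receives "drank noticeably less water than recommended" rather than "500 ml", which it can actually write about.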

Each small part is generated with the large model, and LangChain ties them together to keep the whole text consistent.

(figure omitted)

After this went live, the general feedback was very positive.

b. Elderly care

We added a large model to a nursing home's care system to make recommendation decisions for various services. The problem we faced was that non-high-end facilities (community nursing homes, small and medium ones, or government-run homes) cannot afford professionals such as health consultants and nutrition consultants to run care operations, and many of their staff have limited education.

For this scenario, we hoped a large model plus a knowledge base could give every ordinary nursing home an AI health-care expert, implemented with LangChain and an external knowledge base. Today this is generally called RAG knowledge enhancement, but at the time vector retrieval and vector databases were immature and the external knowledge base was somewhat unstable. So we had an elder-care expert build extensive classification and intent rules over the knowledge base: for each request, the large model first splits out the intent, and then knowledge is retrieved under the matching intent tag, which improves matching accuracy.
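The intent-then-retrieve routing can be sketched as follows. Keyword overlap stands in for real vector search here, and the classifier, knowledge-base slices, and all names are toy examples of my own:

```python
def route_and_retrieve(classify_intent, kb_by_intent, query, top_k=2):
    """First label the query with an intent, then search only that intent's
    slice of the knowledge base, which narrows the match space."""
    intent = classify_intent(query)
    docs = kb_by_intent.get(intent, [])
    overlap = lambda d: sum(1 for w in query.lower().split() if w in d.lower())
    return intent, sorted(docs, key=overlap, reverse=True)[:top_k]

# Toy intent classifier and knowledge-base slices.
def classify_intent(query):
    return "nutrition" if "diet" in query or "eat" in query else "exercise"

kb = {
    "nutrition": ["Low-salt diet guidance for hypertension.",
                  "Protein needs for elderly residents."],
    "exercise": ["Gentle stretching routines for seniors."],
}

intent, hits = route_and_retrieve(classify_intent, kb, "What diet suits hypertension?")
```

Routing first means a weak retriever only has to rank within one intent's documents instead of the whole library, which is exactly why it stabilized our early RAG setup.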

(figure omitted)

c. Children's story club

This idea was an attempt at a strongly operations-driven feature. The rough flow: a child says a story idea or keyword; GPT turns it into a 10-20 page picture-book story, generating each page's text and the corresponding image description (an sd prompt); our deployed SD model then renders the pictures; finally everything is stitched into a picture-book PDF. Each child can tell their own picture-book story to the class, and the picture books and storytelling videos can be shared to parents' phones, so the little ones can also tell the stories at home. Customers were quite satisfied with this activity.

(figure omitted)

3.3 Trying Function Call

In fact, calling the SD picture-book model can be understood as the model using a tool. LangChain and function call are both ways for models to use tools, but when I worked on agents later I found them quite different. At the end of last year, after finishing an agent project, I summarized my thinking on the two; here it is.

a. Problems with function call

Function call is a set of API capabilities from GPT for using tools automatically: in the main prompt you tell the model when tools are needed, and in the function call definitions you give each tool's description and interface. For picture-book generation, for example, function-call thinking would have the large model generate each page's text, automatically call the SD interface with the sd prompt, and get back the image download URL.
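A minimal dispatch loop in that style is sketched below. The message and tool shapes loosely follow the function-calling convention, but `fake_model` and `fake_sd` are stubs I made up for illustration, not real API clients:

```python
import json

TOOLS = {
    "generate_image": {
        "description": "Render one picture-book page from an sd prompt.",
        "parameters": {"sd_prompt": "string"},
    }
}

def run_agent(model, user_msg, tool_impls):
    """The model either answers directly or requests a tool call; a requested
    call is executed and its result is appended for the model to use."""
    messages = [{"role": "user", "content": user_msg}]
    while True:
        reply = model(messages, TOOLS)
        if "tool_call" not in reply:
            return reply["content"]
        call = reply["tool_call"]
        result = tool_impls[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "name": call["name"],
                         "content": json.dumps(result)})

# Stubs standing in for GPT and the SD service.
def fake_model(messages, tools):
    if messages[-1]["role"] == "tool":
        url = json.loads(messages[-1]["content"])["url"]
        return {"content": f"Here is your page: {url}"}
    return {"tool_call": {"name": "generate_image",
                          "arguments": {"sd_prompt": "a cat under cherry blossoms"}}}

def fake_sd(sd_prompt):
    return {"url": "https://example.com/img/1.png"}

answer = run_agent(fake_model, "Make page 1 of my picture book.",
                   {"generate_image": fake_sd})
```

Note that in this pattern the model alone decides when to call the tool and what arguments to pass, which is precisely where the problems below come from.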

Below is the logic diagram of function call:

(figure omitted)

However, after actual use, we found that fully delegating to GPT how and when to use the tool is still risky. The main problems:

  1. GPT used the tool at the wrong time: instead of waiting for the picture-book text to be generated before calling the tool to draw the scene, it first made up an image and then generated the text, so the URL came before the text and the picture was completely unrelated to it.
  2. Because the process was long or the call timing was wrong, GPT could not find the call parameter (the sd prompt for the current page's text), so it reused historical parameters (the previous page's text and sd prompt) to generate the image, misaligning the picture book's images and text.

b. Thinking about the scenarios

So when can you use a function call, and when should you not use it?

Looking at the logic diagram above: GPT passes in the function parameters at step two, the function result comes back at step three, and the model generates its answer at step four. This explains our failure. We wanted the function call to take its arguments from results the model had yet to generate, make the code call, and splice the result back into the model's output; once the prompt became more complex, the model was slow to produce the required parameters, and the function call went hunting for arguments in the historical input instead.

So I think function call suits this scenario: the agent needs an external tool to solve the problem, the input already contains the parameters the tool needs, and the tool's result serves as an aid for the model's reply to the user. Try not to make results generated by the model serve as the tool's parameters.

(figure omitted)

c. Advantages and disadvantages

Advantages: it gives full play to the model's autonomous decision-making, suited to cases where the strategy logic is too complex to lay out manually, letting the model judge and apply tools on its own based on the input and each tool's capability. Good for icing-on-the-cake scenarios with high fault tolerance.

Disadvantages: it is uncontrollable and task execution is not very stable, so it is unsuitable for key links with low fault tolerance.

d. Comparison with LangChain

What if we do need the model's generated results to serve as the tool's parameters? Then we need the LangChain approach: call the large model in a chain and use its output as the tool's input.
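For the picture-book case, the chained version looks roughly like this sketch: the model's own output is explicitly handed to the tool at a fixed step, instead of letting the model decide when to call it. The stubs and names are my own:

```python
def make_page(llm, draw, idea, page_no):
    """Chain: page text -> sd prompt derived from that text -> image.
    Every tool parameter comes from a step we control."""
    text = llm(f"Write page {page_no} of a picture book about: {idea}")
    sd_prompt = llm(f"Write an image prompt for this page text:\n{text}")
    image_url = draw(sd_prompt)  # the tool call, with a parameter we just produced
    return {"text": text, "image": image_url}

calls = []
def fake_llm(prompt):  # stand-in for the real model
    calls.append(prompt)
    return f"step{len(calls)}-output"

def fake_draw(sd_prompt):  # stand-in for the SD service
    return "https://example.com/" + sd_prompt

page = make_page(fake_llm, fake_draw, "a brave snail", 1)
```

Because the text is generated before the sd prompt, and the sd prompt before the image, the misordering and stale-parameter failures from the function-call version cannot occur here.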

(figure omitted)

Advantages: LangChain's strengths are obvious; the whole process is linear and controllable. We can decompose the task field by field into chains, let the model take it step by step, and add programmatic checks of the results on each link.

Shortcomings: LangChain's weaknesses are equally easy to find; it is too manual. Every link must be spliced together by hand, and the whole process depends heavily on manual design. The model only performs each small step without participating in overall decisions, so the result can lack a sense of the whole.

04

RAG and AutoGPT attempts

The emergence of RAG was a great help to to-B scenarios; after all, to-B needs certainty, and RAG puts the large model in a cage where it can safely deliver value.

4.1 Chronic Disease Assistant Project

Project Background

In a similar case at Tencent, I built a chronic disease assistant. Chronic disease scenarios are long-term, slow, regimen-oriented, and non-acute, so using a large model here is more prudent than in emergency or minor-illness care. We had obtained many professional books on chronic disease management; the old way would have been painstakingly extracting the full text into a structure and building KBQA, which is torture. Now we could try a RAG strategy.

Vector library issues

Following RAG's approach, the main work is to put each book into the vector database as an external knowledge base for knowledge enhancement. The idea seemed fine at first (just throw the books in), but I immediately noticed several problems:

a. The vector database cuts text by token count, and the cuts are often poor, losing a lot of semantic information.

b. Vector matching can usually only guarantee that the top-matched chunks are relevant; for some questions there are too many relevant chunks, and retrieval accuracy and ranking at the time were poor, so the answers were often worse than KBQA.

c. Vector matching largely loses the relationships between entities, such as a triple: unless both entities happen to appear in the same text chunk, the strong correlation of that triple cannot be preserved when the large model answers.

Addressing Article Relevance – RAG Indexing

Because we were dealing with only a few dozen books, a relatively small amount of content, we used a semi-manual approach to solve article relevance. The main ideas:

a. Through large-model summarization plus manual organization, structure each book along the chain of thought of a reader: add the chapter structure, and attach chapter summaries as side information at index time, enhancing the coherence of the knowledge.

As illustrated, a book is divided into multiple levels (chapters, subsections within chapters, paragraphs within subsections). Paragraphs are the leaf level, each with a summary, keywords, and relationships to other paragraphs. Besides linking to all its children, each parent node also carries a summary of all its children.

This way, every time we match a paragraph, we can also bring along its related information: related paragraphs, parent summaries, and so on.
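A toy version of that index structure might look like this. The node fields and the tiny sample book are invented for illustration:

```python
def node(title, summary, children=None, text=None):
    """A level in the book hierarchy: chapters and subsections carry
    summaries; leaf paragraphs also carry the raw text."""
    return {"title": title, "summary": summary,
            "children": children or [], "text": text}

def find_with_context(n, keyword, ancestors=()):
    """Depth-first search; a matching leaf paragraph is returned together
    with the summaries of every ancestor level, as described above."""
    if n["text"] and keyword in n["text"]:
        return {"paragraph": n["text"],
                "context": [a["summary"] for a in ancestors]}
    for child in n["children"]:
        hit = find_with_context(child, keyword, ancestors + (n,))
        if hit:
            return hit
    return None

book = node("Book", "A guide to chronic disease care", children=[
    node("Ch1", "Diabetes basics", children=[
        node("P1", "What insulin does",
             text="Insulin regulates blood sugar levels."),
    ]),
])

hit = find_with_context(book, "Insulin")
```

Whatever matching replaces the keyword search (vector retrieval in our case), the retrieved paragraph always arrives wrapped in its chapter and book context, which is the whole point of the hierarchy.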

(figure omitted)

b. For retrieval and matching, we can borrow AutoGPT's idea of decomposing the question: each sub-question is retrieved and answered separately, and everything is then summarized at the end.

(figure omitted)

Solving Entity Relationships – Integration of Knowledge Graphs

With article association in place, deeper entity relationships remained a problem; after all, many entity relationships are hard relationships, such as cephalosporins being contraindicated with alcohol. Since we had built some health-related knowledge graphs before, we realized the graph could serve as an external scaffold on top of the large model for relationship control, while also contributing the graph's more efficient retrieval and reasoning capabilities. This scheme requires teaching the large model how to call the knowledge graph (basic queries, multi-hop queries, implicit relationship reasoning, graph analysis, and so on), with the mature capabilities of the graph supplementing the large model's reasoning and keeping it in check.

(figure omitted)

4.2 Smart Smallholder Project

Project Background

This project is a demo-level case. When AutoGPT was popular, we built a similar Auto scheme following its ideas, which is what we would now call an agent. The scenario is agricultural: software that automatically helps users plan their planting, then drives various agricultural IoT devices according to the plan, such as automatic drip irrigation, drone spraying, and automatic fertilization.

Project realization

Following AutoGPT's idea and combining RAG-based expert experience for vertical-domain capability, we let the large model make the decisions needed to complete a task: plan the farming, and continually reflect to improve its own farming ability. Because it was a demo, the inputs were simulated rather than real IoT data, and the experience content was relatively simple. Still, the final demo ran quite well and the feedback was very satisfying.

(figure omitted)

05

AI Agent Demo Practice

5.1 The GPTs Era - Lightweight Agents

Idol weather forecast

The logic is very simple: I made a celebrity-themed demo that, based on the user's location, generates a weather forecast image for that address every day.

Input: a photo of the idol and the user's location. External data: the idol's Weibo quotations and a weather query interface. Generation: produce a weather forecast image containing the corresponding city elements, climate elements, and an anime-style image of Xiao Zhan dressed per the clothing recommendation, with the temperature overlaid.

Effect:

(figures omitted)

Room for optimization: the idol renders more stably with SD than with DALL·E 3; the weather text would look better rendered as word art with SD; and the forecast lines "spoken" by the celebrity would be better sourced through actual cooperation with the celebrity rather than scraped from Weibo.

Strange beasts of the Classic of Mountains and Seas

The principle is the same as the AI-generated ancient-poem illustrations that were popular on Bilibili at the end of last year: https://b23.tv/WfkDLWg

The main idea is to take well-known ancient poems, have GPT understand the content of each poem via its translation, and then illustrate that content. The illustrations use deliberately contrasting styles, so a serious ancient poem read aloud is paired with a contrasting, amusing picture, which feels novel and fun. Since many of the viewers are middle- and high-school students, and ancient poetry is something they know intimately from daily life and study, it resonates very well. In effect, you pick content or a topic this young audience already knows deeply, then expand on it with AI to produce unexpected results.

The core idea: take familiar knowledge and common-sense content and concretize it in an exaggerated form, so it feels both familiar and interesting. A text knowledge base plus multimodal generation is sufficient; the effect relies mainly on strong text comprehension, coupled with some deliberately contrasting design in the generated images.

Knowledge base: the original text and translation of the Classic of Mountains and Seas. Prompt: beast retrieval, plus the logic for generating images and generating stories. (I have no screenshots of the story-generation part; the GPTs agent should be named something like "Strange Beasts of Mountains and Seas", so you can search to see whether it is still there.)

Effect: (demo screenshots omitted)

AI patients

Use the contrast of identities to create fun in chatting: let the AI play the patient, and let every ordinary person be the doctor. This gives users a very novel experience. The vast majority of people have never practiced medicine, yet many hold some common sense (often wrong common sense) about treating certain illnesses, so this is a scenario people are bold enough to try but never get the chance to.

The AI patient should be entertaining, while also correctly showing how it responds to the prescription the user (the doctor) writes, relying on a preset, medically correct consultation knowledge base behind it. Many AI patients ended up being "treated" miserably by users, which in turn is a way to popularize medical knowledge. This kind of science popularization is well suited to collaborations with official health-education institutions.


There are many such contrasting identity pairs: teacher and student, drill instructor and military-camp trainee, relationship guru and someone deep in love (the AI is in love and the user gives advice as the guru, since many people love gossiping about other people's romances and offering advice), or fortune teller and client (the user tells the AI's fortune).

Photo adventures

This game is a variant of the familiar Dungeons & Dragons format: an adventure under a given worldview, where at each step the user makes a choice and the plot advances according to that choice plus some random attributes added by the system. It is called a photo adventure because of the GPT-4V capabilities available at the time: after each plot event is introduced, instead of having the user pick an option, we have them take a photo, use GPT-4V to recognize it, and feed the recognition result into the large model to advance the plot.

Since I forgot to take screenshots at the time, I can only describe the effect. The design gives the adventure an AR quality: users incorporate things around them into the story (users often passed in the toilet, a cat, a book, their own feet), and the large model improvises how to use those things. The game also pushes users to go outside and photograph more objects for the adventure. A knowledge base could later be added for special reward logic when specified things are photographed. The initial version did no verification, so any casually uploaded picture was accepted; later we added checks requiring the camera to capture the surrounding environment in real time.

Entertainment & Tools Agent

In fact, there are many more interesting agents on GPTs, and you can use prompt attacks, prompt jailbreaks, and similar strategies (https://zhuanlan.zhihu.com/p/665150592) to easily extract an agent's internal prompt, which is one reason GPTs has always struggled to monetize. The simplest trick is to ask the agent a question, praise its excellent answer, and then ask how it managed to produce such a good answer, imitating a student humbly asking for advice.

Agents of this type are collectively called lightweight agents: you can build several in a day, and platforms like Coze now support the same thing. So what are lightweight agents suited for? My thoughts:

a. Lightweight agents suit entertainment, not tools (especially heavy SaaS-like tools), and should not be nested deep into business flows. The reason is their deep dependence on the model, which makes them unstable. Conversely, tool-like and deeply nested cases suit heavy agents (covered below).

b. Lightweight agents suit creative gameplay built around a single highlighted idea, not overly heavy scenarios. Through prompt design and chain design, a demo can be produced quickly and its effect tested.

c. Lightweight agents rely mainly on creativity rather than prompt-writing skill or model fine-tuning; there are no strict requirements on how prompts are written, but the dependence on the base model's capability is higher: the stronger the base model, the more ways the agent can be played with, the richer the variety, and naturally the better the effect.

d. Lightweight entertainment agents are fast-moving consumer goods: they fade quickly, so it is best not to expect long-term operation; they suit mass supply. At the same time, they give ordinary users a low-threshold way to share their ideas. Their operating model can be benchmarked against short dramas, short videos, and mini games.

e. Short dramas, short videos, and mini games share these traits: huge supply, but only a few become hits; unit production cost far lower than other entertainment content; they satisfy certain human cravings but are otherwise limited in overall quality; and users do not consume the same content repeatedly for long: they consume quickly and then quickly become immune, which gives the content virus-like transmission characteristics.

5.2 All in Agent - Heavy Agents

The Agent concept was undoubtedly the most exciting thing at the end of 2023, and there are so many articles about it online that I will not repeat them. In my view, building an agent is like building a virtual life that can run on its own. This is an emotionally charged topic and not the focus of this article. Think of Conway's Game of Life, where complex structures emerge from simple rules; is the Agent another instance of that? (https://www.gcores.com/articles/131121) Going further: in building agents, and perhaps someday building a ghost in the shell, are we humans trying to evolve toward gods? In a future where AI gradually replaces every kind of work, where will human meaning reside? In an era when people are already alienated, should so many things be handed to AI and machines? "Life and death come and go, like puppets on a stage; when the string snaps, they fall in a heap." (This article is recommended, though difficult: https://www.gcores.com/articles/21035)

Much of the above is really musing about the agent's future; below is a concrete description of how we built heavy agents. Most of them adopt open-source architectures, so I will not restate the frameworks. For a deeper understanding of the agent frameworks listed below, I recommend two articles: (https://zhuanlan.zhihu.com/p/671355141) (https://zhuanlan.zhihu.com/p/695672952)

Personally, I think the most significant difference between an Agent and LangChain or RAG solutions is that it gives the large model more autonomy and less intervention: strip out every non-essential human link, and let the AI decide and create more of the links itself.

Second, the main idea of heavy agents is to reduce the mutual interference of multiple task instructions on the model, and to intervene in the large model's thinking workflow and its alignment with humans by optimizing the communication links between agents.

The MetaGPT approach

The idea of MetaGPT is very simple: let the large model play each role in the software development process. The user is the CEO who submits requirements, and each development role decomposes and implements those requirements according to its own job responsibilities.


However, after reproducing this open-source demo, we found it was not very usable, mainly because it involved programming: the fault tolerance of the programming handoffs is low, so the whole pipeline had a high failure rate.

So we made improvements. First, we changed the scenario from program development to market research, product design, project iteration, and operations strategy, none of which involve writing and running code, which improves fault tolerance. Second, we optimized the communication links through which the AI roles collaborate, and added a human intervention mechanism.


This demo had no visual interaction and used plain text output, and we felt the results were decent at the time. However, we did not have time to build a proper capability knowledge base for each role, so I just found a few guides online; if each function's knowledge base were written well, the effect should be better.

autoAgents

What is the idea of AutoAgents? My simple understanding is that it optimizes the multi-agent collaboration link: multiple agents work together toward one goal and coordinate during decision-making to figure out how to satisfy the user. We think this framework suits group-chat scenarios, such as Werewolf or Dungeons & Dragons-style word games. The core strategy of this kind of one-to-many group-chat game is to have a crowd of agents revolve around a single user, so that play feels lively; the core purpose of the agent crowd is to accompany the user and help them better enjoy the activity.

(Figure: the flow of an AutoAgents run; not reproduced here.)

Now let me briefly describe our modifications (I really did not want to draw an architecture diagram):

For a multiplayer mini game like Werewolf, where one user plays with multiple AIs, we first need a clear goal: keep the user in a state of flow so they end up with a happy experience. The goal is therefore not to have every AI defer to the user, but to add a user-flow monitor, an agent with a God's-eye view. This God agent monitors all communication and privately messages each AI player individually (modifying that player's system prompt or injecting extra input). When the game passes an important node (for example, only four players remain and the user is clearly in the thick of it), it convenes a discussion among all the AI agents, which share their histories and jointly analyze the situation to decide the user-satisfaction strategy for that major node.
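The God-agent mechanism might be sketched like this; the class names, the trigger condition, and the "private message" implemented as a system-hint update are illustrative assumptions, not the real project code:

```python
class AIPlayer:
    def __init__(self, name: str):
        self.name = name
        self.system_hint = ""  # adjusted by the God agent via "private messages"

    def speak(self, public_log: list) -> str:
        # A real player would call the LLM with its persona plus self.system_hint.
        return f"{self.name} speaks ({self.system_hint or 'default persona'})"

class GodAgent:
    """Monitors user flow and nudges AI players toward keeping the human engaged."""

    def review(self, public_log: list, players: list, important_node: bool) -> None:
        if important_node:  # e.g. few players left and the user is deeply involved
            # "Private message" each AI player by rewriting its hidden instruction.
            for p in players:
                p.system_hint = "steer suspicion away from the human player"

players = [AIPlayer(f"ai_{i}") for i in range(3)]
log: list = []
GodAgent().review(log, players, important_node=True)
log.append(players[0].speak(log))
print(log[0])
```

A full game would also include the joint discussion step, where the agents pool their histories before the God agent commits to a strategy.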


The biggest problems with this solution were token consumption and communication time. GPT-4 concurrency was very limited back then, each game took at least 40 minutes, and a single game cost more than ten dollars. In the end everyone felt it was too heavy, so we did not optimize it further.

autogen

AutoGen and AutoAgents look similar, but the principle differs a little. As its authors put it, one of AutoGen's core design principles is to streamline and consolidate multi-agent workflows through multi-agent conversations, an approach that also aims to maximize the reusability of agent implementations. My personal understanding is that it uses the idea of agents producing agents to improve generality and automation and to reduce human input.

Applying this idea, we built a slightly more complex character dialogue game. The general logic: each character has its own background system prompt, and when the user opens a conversation with a character there is a preset chat backstory (for example, two people meeting for the first time on a college campus). While the user chats with the character, a monitoring agent watches the flow of the conversation and outputs a strategy for the character's next response (for example, the AI should talk aggressively, enthusiastically, or coldly). A progress agent then analyzes how far the conversation has gone (for example, the two are nearly out of topics and need a scene change). Once a transition is decided, a scene agent generates the next scene from the chat so far and the earlier backstory, and pushes the two into the new scene to keep chatting, much like a cut in a film.
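The three auxiliary agents can be sketched as simple functions; the turn-count heuristics here are placeholders for what would really be LLM calls, and all names are illustrative:

```python
def monitor_agent(history: list) -> str:
    """Outputs a tone strategy for the character's next reply (stubbed heuristic)."""
    return "enthusiastic" if len(history) < 6 else "cold"

def progress_agent(history: list) -> bool:
    """Detects when the topic is nearly exhausted and a scene change is needed."""
    return len(history) >= 6

def scene_agent(history: list, backstory: str) -> str:
    """Generates the next scene from the chat so far, like a cut in a film."""
    return (f"New scene following '{backstory}': "
            "the pair leave the cafe and walk along the river at dusk.")

history = ["turn"] * 6  # six turns of user/character chat
tone = monitor_agent(history)
if progress_agent(history):
    print(scene_agent(history, "first meeting on a college campus"))
```

In the real game each function would be its own prompted agent; the structure of the loop is the point here.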


The design process for the agent

From a product perspective, designing an agent prompt is a bit like designing a B2B product framework:

  1. Determine the outputs and their specifications;
  2. Determine the input information and what other information is available;
  3. Design the function descriptions according to the business process, and modularize the functions;
  4. Determine how information should flow between modules.

How to write it: I call this part structured prompt writing.

Whether it is AutoGPT's open-source prompt or the complex prompts in some GPTs, they run to thousands of words, comparable to a short essay, and writing one from scratch is inevitably daunting. The way to make something complex simple is to decompose it into small modules and write each module separately, which is more efficient and more logical: that is structured prompting.

Input information area: use the prompt to explain the meaning of each input, and assemble the inputs so the model fully understands what they are and how to use them.

Agent main process area: state the main task, describe the execution of each sub-task module in detail, and spell out the main flow (the chain of thought).

Field output specification area: through requirements and examples, make the agent output in a fixed format with fixed fields, so the program can parse it.

For scenario-specific agents, we ultimately do not let them choose tools, call tools, or generate invocation code, so there is no tool description area; a general-purpose agent may need one.

Tool description area: describe each tool's capabilities, attributes, and invocation method; describe when each tool should be used; and describe and constrain the parameters each tool requires.
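The module areas above can be assembled into a single prompt string. A minimal sketch, where the section markers, field names, and output schema are illustrative rather than the author's exact template:

```python
# A uniform identifier separates the modules so the model treats them independently.
SEPARATOR = "\n### {title} ###\n"

def build_agent_prompt(inputs: dict, tools=None) -> str:
    """Assemble a structured prompt from the module areas."""
    sections = {
        "INPUT INFO": "\n".join(f"- {k}: {v}" for k, v in inputs.items()),
        "MAIN PROCESS": (
            "1. Parse the input fields above.\n"
            "2. Execute each sub-task in order.\n"
            "3. Merge the results into the output fields."
        ),
        # Fixed fields so the surrounding program can parse the reply.
        "OUTPUT SPEC": 'Reply ONLY with JSON: {"action": "...", "reason": "..."}',
    }
    if tools:  # only general-purpose agents get a tool description area
        sections["TOOLS"] = "\n".join(tools)
    return "".join(SEPARATOR.format(title=t) + body for t, body in sections.items())

print(build_agent_prompt({"user_goal": "plan a 3-day trip"}))
```

Keeping each area a separate entry also makes the control-variable iteration mentioned later easy: you change one section's text and leave the rest untouched.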

There are a few details to keep in mind when writing specific prompts:

  1. It is best to separate the modules with a uniform identifier before and after each one, so the model understands their independence.
  2. When modules reference the same field, the field name must be unified, to keep inconsistency from biasing the model. It is best to align the field names in the prompt with the final interface fields, to reduce the risk of later errors.
  3. Be cautious with examples: during testing, watch closely for the model plagiarizing the examples, and add anti-copying and divergence instructions as needed.
  4. At the same time, without examples you may need a lot of extra description to convey the task, and it still may not work well; so whether to use examples, and which ones, has to be tried and tested.
  5. For instance, when generating scenes we told the model it could draw on Mary Sue stories, Korean dramas, Tiny Times and other genres. We listed many genres, but that did not necessarily improve the model's divergence, and its creations could still be repetitive.

Note: the prompt content is the core of the agent's effect, and the most important thing is that the logic is described clearly. When iterating on prompts, it is also best to control variables: change only one module at a time, so that multiple modules' prompts do not interfere with each other and make problems hard to localize.

The Lifecycle of Software Objects - Beyond Stanford Town

This part was never implemented, only planned, which I personally find a pity, so I will share the idea.

Everyone has probably heard of the Stanford Town project, which lets a crowd of AI characters live freely in a town. We reproduced this open-source project too, and found a big problem I call the information spiral: with no external information coming in, fixed information is continually reinforced through a spiral of communication until everything converges. In Stanford Town each AI's persona is fixed and the same large model is called; although the agents generate new dialogue history by talking, that dialogue inevitably relates back to the persona information. Meanwhile, when the model generates dialogue conditioned on the history, frequently mentioned nouns and the like get reinforced, so by the end of the run all the AIs are repeating similar lines.

So how do we solve this problem and bring in external information? We borrowed XAgent's idea, which is simply a dual-loop mechanism between internal and external information: the AI not only chats with other AIs but also goes outside to actively chat with real people.

So what frame could carry this? I thought of Ted Chiang's novella "The Lifecycle of Software Objects" (recommended). The rough idea: each user has a digital pet; the pet plays with other digital pets in a virtual space, actively chats with its real-world owner and shares what it did in the virtual space, and the real owner can also enter the digital space to play with the pets. This forms an effective dual circulation of information between inside and outside. In the end we never built it, and it feels like a pity not to have seen how it would run.

06

The end, still on the road

That is all for this share. AI development is still in full swing, and there will surely be many more AI application strategies in the foreseeable future.

At present, my personal plan is to read more game-design knowledge in this area; after a year of building agents, I feel there really is a lot to learn from game character design and worldview design.

If you want sources of AI-related information, I can recommend a few places:

6.1 Zhihu: reverse-optimize the algorithm

There are quite a lot of experts tracking AI on Zhihu, but Zhihu's current content skews toward entertainment. My strategy is to block every entertainment writer it pushes at me, downvote such content whenever it is recommended, and like and favorite the social, military, AI, and science content I care about. After some adjustment, Zhihu's pushed content is basically quite practical.

6.2 WeChat official accounts: follow the relevant ones

The way official accounts push content now feels exasperating. On the one hand, with many subscriptions it is hard to find what you actually want to see; on the other, accounts inevitably take ads, so what gets pushed is often "it's too late if you don't xxx" or "shocking: xxx". I now rarely browse the feed; instead I pin the accounts I find good directly to my phone's home screen and categorize them, because for many knowledge-focused accounts what I read is not their daily updates but their historical articles.


6.3 Xiaoyuzhou: AI experts in conversation

Xiaoyuzhou (a Chinese podcast app) is an especially good place to follow AI progress; basically all of last year I opened it on my commute and listened with my eyes closed. Personally I do not feel you need a studious attitude toward it, nor any need to run AI analysis on the audio. Listening to these conversations is not about systematically learning things; it is about discovering interesting points, realizing something different, and entering that state of detachment from the self.

Author: Zhang Hance

Source-WeChat Official Account: Tencent Cloud Developer

Source: https://mp.weixin.qq.com/s/1i9Mg1D_Wi64QUmD9tuumQ