
Tencent, the "Goose Factory", built 1 billion virtual personas dedicated to data synthesis, lifting a 7B model's math score to GPT-4 level

Author: QbitAI (量子位)

Cressy, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

One billion "employees" dedicated to data synthesis, a headcount equal to 13% of the world's population.

However, these "employees" are not real people but virtual personas that Tencent created from web data.

Using the synthetic data these virtual personas generated, a 7B model's math score jumped 15 points, matching GPT-4 Turbo.


The authors observed that simply adding persona information to a data synthesis prompt produces synthetic data written from that persona's unique perspective.

From this observation came Persona Hub, a collection of 1 billion (1,015,863,523, to be exact) distinct personas.

Beyond the training data mentioned above, these personas can also be used to design logical reasoning problems in the style of Ruozhiba (a Baidu Tieba forum famous for absurd brain teasers), for tool development, and even to create game NPCs and run social simulations.

One netizen said this is really cool, noting that they had done similar experiments before but with only 10,000 personas, and that this project is really interesting.


Others say that personas could be the future of synthetic data.


How effective is it? Let's take a look together.

Math scores skyrocket, and it can even write Ruozhiba-style brain teasers

The 1 billion distinct personas in Persona Hub can be used to generate many types of text.


That includes training data: for example, training a large model on the mathematical texts these personas generate can give a 7B model mathematical capabilities on par with GPT-4 Turbo.

Specifically, the authors used the personas in Persona Hub to generate 1.07 million pieces of math data, fine-tuned the Qwen2-7B model on them, and then evaluated it on the MATH dataset.

The result: the model reached 64.9% accuracy, 15 percentage points higher than the original version and tied with the 1106 and 0125 versions of GPT-4 Turbo.
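The core trick is simple enough to sketch in a few lines: prepend a persona description to a generic synthesis instruction, so that each persona steers the model toward different problems. The helper below is a hypothetical illustration, not the paper's code; the personas and the `build_math_prompt` function are made up for this example.

```python
# Hypothetical sketch of persona-conditioned data synthesis: each persona
# is prepended to the same generic "create a math problem" instruction,
# so different personas yield different problems from the same template.

def build_math_prompt(persona: str) -> str:
    """Combine a persona description with a generic math-problem instruction."""
    return (
        f"You are {persona}.\n"
        "Create a challenging math problem that someone with your background "
        "might naturally pose, then give a full step-by-step solution."
    )

# Illustrative personas (not from Persona Hub itself).
personas = [
    "a chemical kinetics researcher modeling reaction rates",
    "a high-school track coach analyzing split times",
]

# Each prompt would then be sent to a strong LLM to synthesize one data point.
prompts = [build_math_prompt(p) for p in personas]
print(prompts[0].splitlines()[0])
```

In the paper's pipeline a strong model answers each such prompt, and the resulting problem-solution pairs become the fine-tuning set.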


In addition to generating training data, Persona Hub can also improve a model's capabilities by simulating user prompts, creating knowledge texts, and more.

For example, the model can be asked to guess what kind of prompt a given persona might issue.


Or it can write a Quora-style knowledge article based on a persona's knowledge, skills, and experience.


The generated content can be used directly or indirectly for model training and tuning, thereby improving the model's knowledge level and task performance.

Of course, beyond helping models improve, the personas in Persona Hub can also design problems, such as logical reasoning questions in different styles.


△ Machine translation, for reference only

It can even design questions in Chinese, and having picked up the Ruozhiba style, it can write mind-bending brain teasers.


A large model endowed with a persona is still, at its core, a large model, so whatever programming ability the base model has, the persona-conditioned model has too.

Here, though, the persona becomes the target user of the program: the programs the model designs must meet the needs of different groups of people.


Going a step further, the personas in Persona Hub can be combined with a large model to spawn game NPCs.

Based on the game's background setting given in the prompt, combined with the style of the target persona, the model synthesized three very different characters and their corresponding introductions.

Even the characters' names match the target personas, and the introductions tie closely into the game setting.


Further, the authors argue that using these personas to simulate and infer the underlying needs and behaviors of real users opens many new opportunities for simulating the real world with language models.

Driven by powerful language models, the 1 billion personas in Persona Hub could sustain a well-organized virtual society, in effect a super-sized "Stanford Town".

So we can't help but ask: how were these 1 billion personas in Persona Hub obtained?

Mining personas from web data

The authors synthesize personas in two main ways: generating personas from text (Text-to-Persona) and generating personas from personas (Persona-to-Persona).

The theoretical basis of Text-to-Persona is the authors' observation that people with specific professional and cultural backgrounds exhibit unique interests and preferences in what they read and write.

Operationally, the authors feed large amounts of online text into a pretrained language model and use prompts (such as "Who might read/write/like this text?") to guide the model to extract a corresponding persona from each piece of text. The prompt can also control the format of the output persona description, such as plain text or structured text.
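As a rough sketch of this Text-to-Persona pattern (the template wording, format options, and sample text below are assumptions for illustration, not the paper's actual prompts):

```python
# Illustrative Text-to-Persona prompt builder. The question asks which
# person might read/write/like a text, and a format hint controls whether
# the persona comes back as plain text or structured output.

def text_to_persona_prompt(text: str, fmt: str = "plain") -> str:
    """Build a prompt asking a model to infer a persona from a piece of text."""
    format_hint = {
        "plain": "Answer with one plain-text sentence describing that person.",
        "structured": "Answer as JSON with keys: occupation, interests, background.",
    }[fmt]
    return (
        "Who is likely to read, write, or like the following text?\n"
        f"{format_hint}\n\nText:\n{text}"
    )

# A detailed, specialized input text should elicit a specialized persona.
p = text_to_persona_prompt("Dosage guidelines for beta-blockers in neonates.", "structured")
print("JSON" in p)
```

Sending such a prompt alongside each web document is what lets the pipeline scale to billions of personas, one extraction per text.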

For instance, in one of the authors' examples, the large model extracts three different personas from three different types of text:


When the input text contains a lot of detail (e.g., textbooks or academic papers), the extracted persona descriptions are correspondingly more detailed and specialized.


In short, applying the Text-to-Persona method to massive amounts of online text yields billions of personas or more, covering roles across fields and at different granularities.

However, this can still miss characters with little visibility on the Internet, such as children, beggars, and behind-the-scenes workers. To cover these roles, the authors also propose the Persona-to-Persona approach.

This method starts from the personas already acquired as seeds and expands along chains of interpersonal relationships: following the theory of six degrees of separation, each seed persona undergoes at most six rounds of relationship expansion, inferring and deriving other related roles.

(The theory of six degrees of separation, proposed in 1967 by Harvard psychology professor Stanley Milgram, holds that any two strangers are separated by no more than six intermediaries; that is, through a chain of at most six people, anyone can reach any stranger.)

In practice, the authors first select the type of interpersonal relationship to explore, feed the seed persona and the target relationship type into the model, and use a prompt to guide the model to generate the corresponding related persona.

For example, the "pediatric nurse" persona obtained in the earlier Text-to-Persona step can derive related personas such as patients, pharmacists, and colleagues.


The related personas generated this way can serve as new seeds to further expand the persona network; after six rounds of iterative expansion, the vast majority of related roles are covered.
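The iterative expansion can be pictured as a breadth-first walk over a relationship graph. The sketch below substitutes a toy lookup table for the LLM call that actually derives related personas; the table entries and function names are illustrative, not from the paper.

```python
# Breadth-first expansion of a seed persona through up to six rounds of
# relationships, mirroring the six-degrees idea. `derive_related` stands
# in for an LLM call and here just reads a toy relationship table.

TOY_RELATIONS = {
    "pediatric nurse": ["pediatric patient", "hospital pharmacist"],
    "pediatric patient": ["worried parent"],
}

def derive_related(persona: str) -> list[str]:
    """Placeholder for prompting a model with (seed persona, relation type)."""
    return TOY_RELATIONS.get(persona, [])

def expand(seed: str, rounds: int = 6) -> set[str]:
    """Expand a seed persona for at most `rounds` relationship hops."""
    seen, frontier = {seed}, [seed]
    for _ in range(rounds):
        frontier = [r for p in frontier for r in derive_related(p) if r not in seen]
        seen.update(frontier)
        if not frontier:  # no new personas discovered; stop early
            break
    return seen

print(sorted(expand("pediatric nurse")))
```

Starting from "pediatric nurse", two hops already reach the patient, the pharmacist, and the patient's parent; with an LLM doing the derivation, each hop fans out far wider.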

However, since this process may produce persona descriptions that are unreasonable, illogical, or irrelevant to the seed, the authors also need to filter the generated personas.

Criteria for filtering include, but are not limited to, the following:

  • Relevance: Is the generated persona related to the seed and the target relationship type? Counter-example: pediatric nurse → astronaut
  • Reasonableness: Is the generated persona plausible and logical? Counter-example: a 5-year-old pediatric patient who single-handedly runs a multinational company
  • Specificity: Is the generated persona specific rather than overly general? Counter-example: "a person"
  • Readability: Is the description clear and easy to understand, free of grammar and spelling errors?
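In practice, criteria like these can be packed into a judge prompt for a language model. The prompt below is a hypothetical illustration of that idea; its wording is not the paper's actual filtering prompt.

```python
# Hypothetical LLM-judge prompt encoding the four filter criteria.
# A model would answer KEEP or DROP for each candidate persona.

FILTER_PROMPT = """Given the seed persona "{seed}", the relationship "{relation}",
and the candidate persona "{candidate}", answer KEEP or DROP by checking:
1. Relevance: does the candidate match the seed and relationship type?
2. Reasonableness: is the candidate logically plausible?
3. Specificity: is the candidate concrete rather than generic?
4. Readability: is the description clear and free of errors?"""

prompt = FILTER_PROMPT.format(
    seed="pediatric nurse", relation="patient", candidate="a 7-year-old asthma patient"
)
print("KEEP or DROP" in prompt)
```

Each candidate from the expansion step would pass through such a judge before entering the final persona pool.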

Filtering addresses the quality of persona descriptions, but the generated personas may still contain many similar or even duplicate descriptions, so deduplication is also needed.

In this project, the authors used two deduplication methods.

One is MinHash-based deduplication: each description is converted into a set of n-grams, the MinHash algorithm computes a signature for each description, and signatures are compared for similarity; when the similarity exceeds a certain threshold, the descriptions are considered duplicates.

The other is embedding-based deduplication: an embedding model converts each description into a vector, and the similarity between vectors is computed; again, exceeding a certain threshold marks a duplicate.

Once these personas are obtained, they still need to be combined with prompts in some way, for example to synthesize the math data mentioned earlier.

In that scenario, the authors tried three methods: zero-shot, few-shot, and persona-enhanced few-shot prompting. They found that zero-shot prompting was highly creative but weakly tied to the persona, few-shot prompting improved relevance at the cost of creativity, and persona-enhanced few-shot prompting struck a good balance between the two.


At present, Tencent has selected 200,000 of the 1 billion virtual personas and released them publicly, along with the data they generated.


The authors say more personas and data will be made public once issues such as security risks are addressed.

Paper:

https://arxiv.org/abs/2406.20094

GitHub:

https://github.com/tencent-ailab/persona-hub

— END —

QbitAI · Contracted Toutiao account

Follow us and be the first to know about cutting-edge technology trends
