
Tencent, the "Goose Factory", built 1 billion virtual personas dedicated to data synthesis, lifting a 7B model's math score to GPT-4 level

Author: QbitAI (量子位)

Cressy, reporting from Aofei Temple

QbitAI | WeChat official account QbitAI

One billion "employees" dedicated to data synthesis, a headcount equal to 13% of the world's population.

However, these "employees" are not real people but virtual personas that Tencent created from web data.

Using the synthetic data these virtual personas generated, a 7B model's math score jumped 15 points, matching GPT-4 Turbo.


The authors observed that simply adding persona information to a data synthesis prompt produces synthetic data written from that persona's unique perspective.

From this observation came Persona Hub, a collection of 1 billion (1,015,863,523, to be exact) distinct personas.

Beyond the training data mentioned above, these personas can also be used to design logical reasoning problems in the style of Ruozhiba (a Baidu Tieba forum famous for absurd brain teasers), for tool development, and even to create game NPCs and run social simulations.

One netizen said this is really cool, noting that they had done similar experiments before but with only 10,000 personas, and that this project is really interesting.


Others say that personas could be the future of synthetic data.


How effective is it? Let's take a look together.

Math scores skyrocket, and it can even write Ruozhiba-style brain teasers

The 1 billion distinct personas in Persona Hub can be used to generate many types of text.


That includes training data: for example, training a large model on the mathematical texts these personas generate can give a 7B model mathematical capabilities on par with GPT-4 Turbo.

Specifically, the authors used the personas in Persona Hub to generate 1.07 million pieces of math data, fine-tuned the Qwen2-7B model on them, and then evaluated it on the MATH dataset.

The result: the model reached 64.9% accuracy, 15 percentage points higher than the original version and tied with the 1106 and 0125 versions of GPT-4 Turbo.
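The core trick is simple enough to sketch in a few lines: prepend a persona description to a generic synthesis instruction, so that each persona steers the model toward different problems. The helper below is a hypothetical illustration, not the paper's code; the personas and the `build_math_prompt` function are made up for this example.

```python
# Hypothetical sketch of persona-conditioned data synthesis: each persona
# is prepended to the same generic "create a math problem" instruction,
# so different personas yield different problems from the same template.

def build_math_prompt(persona: str) -> str:
    """Combine a persona description with a generic math-problem instruction."""
    return (
        f"You are {persona}.\n"
        "Create a challenging math problem that someone with your background "
        "might naturally pose, then give a full step-by-step solution."
    )

# Illustrative personas (not from Persona Hub itself).
personas = [
    "a chemical kinetics researcher modeling reaction rates",
    "a high-school track coach analyzing split times",
]

# Each prompt would then be sent to a strong LLM to synthesize one data point.
prompts = [build_math_prompt(p) for p in personas]
print(prompts[0].splitlines()[0])
```

In the paper's pipeline a strong model answers each such prompt, and the resulting problem-solution pairs become the fine-tuning set.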


In addition to generating training data, Persona Hub can also improve a model's capabilities by simulating user prompts, creating knowledge texts, and more.

For example, the model can be asked to guess what kind of prompt a given persona might issue.


Or it can write a Quora-style knowledge article based on a persona's knowledge, skills, and experience.


The generated content can be used directly or indirectly for model training and tuning, thereby improving the model's knowledge level and task performance.

Of course, beyond helping models improve, the personas in Persona Hub can also design problems, such as logical reasoning questions in different styles.


△ Machine translation, for reference only

It can even design questions in Chinese, and having picked up the Ruozhiba style, it can write mind-bending brain teasers.


A large model endowed with a persona is still, at its core, a large model, so whatever programming ability the base model has, the persona-conditioned model has too.

Here, though, the persona becomes the target user of the program: the programs the model designs must meet the needs of different groups of people.


Going a step further, the personas in Persona Hub can be combined with a large model to spawn game NPCs.

Based on the game's background setting given in the prompt, combined with the style of the target persona, the model synthesized three very different characters and their corresponding introductions.

Even the characters' names match the target personas, and the introductions tie closely into the game setting.


Further, the authors argue that using these personas to simulate and infer the underlying needs and behaviors of real users opens many new opportunities for simulating the real world with language models.

Driven by powerful language models, the 1 billion personas in Persona Hub could sustain a well-organized virtual society, in effect a super-sized "Stanford Town".

So we can't help but ask: how were these 1 billion personas in Persona Hub obtained?

Mining personas from web data

The authors synthesize personas in two main ways: generating personas from text (Text-to-Persona) and generating personas from personas (Persona-to-Persona).

The theoretical basis of Text-to-Persona is the authors' observation that people with specific professional and cultural backgrounds exhibit unique interests and preferences in what they read and write.

Operationally, the authors feed large amounts of online text into a pretrained language model and use prompts (such as "Who might read/write/like this text?") to guide the model to extract a corresponding persona from each piece of text. The prompt can also control the format of the output persona description, such as plain text or structured text.
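As a rough sketch of this Text-to-Persona pattern (the template wording, format options, and sample text below are assumptions for illustration, not the paper's actual prompts):

```python
# Illustrative Text-to-Persona prompt builder. The question asks which
# person might read/write/like a text, and a format hint controls whether
# the persona comes back as plain text or structured output.

def text_to_persona_prompt(text: str, fmt: str = "plain") -> str:
    """Build a prompt asking a model to infer a persona from a piece of text."""
    format_hint = {
        "plain": "Answer with one plain-text sentence describing that person.",
        "structured": "Answer as JSON with keys: occupation, interests, background.",
    }[fmt]
    return (
        "Who is likely to read, write, or like the following text?\n"
        f"{format_hint}\n\nText:\n{text}"
    )

# A detailed, specialized input text should elicit a specialized persona.
p = text_to_persona_prompt("Dosage guidelines for beta-blockers in neonates.", "structured")
print("JSON" in p)
```

Sending such a prompt alongside each web document is what lets the pipeline scale to billions of personas, one extraction per text.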

For instance, in one of the authors' examples, the large model extracts three different personas from three different types of text:


When the input text contains a lot of detail (e.g., textbooks or academic papers), the extracted persona descriptions are correspondingly more detailed and specialized.


In short, applying the Text-to-Persona method to massive amounts of online text yields billions of personas or more, covering roles across fields and at different granularities.

However, this can still miss characters with little visibility on the Internet, such as children, beggars, and behind-the-scenes workers. To cover these roles, the authors also propose the Persona-to-Persona approach.

This method starts from the personas already acquired as seeds and expands along chains of interpersonal relationships: following the theory of six degrees of separation, each seed persona undergoes at most six rounds of relationship expansion, inferring and deriving other related roles.

(The theory of six degrees of separation, proposed in 1967 by Harvard psychology professor Stanley Milgram, holds that any two strangers are separated by no more than six intermediaries; that is, through a chain of at most six people, anyone can reach any stranger.)

In practice, the authors first select the type of interpersonal relationship to explore, feed the seed persona and the target relationship type into the model, and use a prompt to guide the model to generate the corresponding related persona.

For example, the "pediatric nurse" persona obtained in the earlier Text-to-Persona step can derive related personas such as patients, pharmacists, and colleagues.


The related personas generated this way can serve as new seeds to further expand the persona network; after six rounds of iterative expansion, the vast majority of related roles are covered.
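The iterative expansion can be pictured as a breadth-first walk over a relationship graph. The sketch below substitutes a toy lookup table for the LLM call that actually derives related personas; the table entries and function names are illustrative, not from the paper.

```python
# Breadth-first expansion of a seed persona through up to six rounds of
# relationships, mirroring the six-degrees idea. `derive_related` stands
# in for an LLM call and here just reads a toy relationship table.

TOY_RELATIONS = {
    "pediatric nurse": ["pediatric patient", "hospital pharmacist"],
    "pediatric patient": ["worried parent"],
}

def derive_related(persona: str) -> list[str]:
    """Placeholder for prompting a model with (seed persona, relation type)."""
    return TOY_RELATIONS.get(persona, [])

def expand(seed: str, rounds: int = 6) -> set[str]:
    """Expand a seed persona for at most `rounds` relationship hops."""
    seen, frontier = {seed}, [seed]
    for _ in range(rounds):
        frontier = [r for p in frontier for r in derive_related(p) if r not in seen]
        seen.update(frontier)
        if not frontier:  # no new personas discovered; stop early
            break
    return seen

print(sorted(expand("pediatric nurse")))
```

Starting from "pediatric nurse", two hops already reach the patient, the pharmacist, and the patient's parent; with an LLM doing the derivation, each hop fans out far wider.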

However, since this process may produce persona descriptions that are unreasonable, illogical, or irrelevant to the seed, the authors also need to filter the generated personas.

Criteria for filtering include, but are not limited to, the following:

  • Relevance: Is the generated persona related to the seed and the target relationship type? Counter-example: pediatric nurse → astronaut
  • Reasonableness: Is the generated persona plausible and logical? Counter-example: a 5-year-old pediatric patient who single-handedly runs a multinational company
  • Specificity: Is the generated persona specific rather than overly general? Counter-example: "a person"
  • Readability: Is the description clear and easy to understand, free of grammar and spelling errors?
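In practice, criteria like these can be packed into a judge prompt for a language model. The prompt below is a hypothetical illustration of that idea; its wording is not the paper's actual filtering prompt.

```python
# Hypothetical LLM-judge prompt encoding the four filter criteria.
# A model would answer KEEP or DROP for each candidate persona.

FILTER_PROMPT = """Given the seed persona "{seed}", the relationship "{relation}",
and the candidate persona "{candidate}", answer KEEP or DROP by checking:
1. Relevance: does the candidate match the seed and relationship type?
2. Reasonableness: is the candidate logically plausible?
3. Specificity: is the candidate concrete rather than generic?
4. Readability: is the description clear and free of errors?"""

prompt = FILTER_PROMPT.format(
    seed="pediatric nurse", relation="patient", candidate="a 7-year-old asthma patient"
)
print("KEEP or DROP" in prompt)
```

Each candidate from the expansion step would pass through such a judge before entering the final persona pool.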

Filtering addresses the quality of persona descriptions, but the generated personas may still contain many similar or even duplicate descriptions, so deduplication is also needed.

In this project, the authors used two deduplication methods.

One is MinHash-based deduplication: each description is converted into a set of n-grams, the MinHash algorithm computes a signature for each description, and signatures are compared for similarity; when the similarity exceeds a certain threshold, the descriptions are considered duplicates.

The other is embedding-based deduplication: an embedding model converts each description into a vector, and the similarity between vectors is computed; again, exceeding a certain threshold marks a duplicate.

Once these personas are obtained, they still need to be combined with prompts in some way, for example to synthesize the math data mentioned earlier.

In that scenario, the authors tried three methods: zero-shot, few-shot, and persona-enhanced few-shot prompting. They found that zero-shot prompting was highly creative but weakly tied to the persona, few-shot prompting improved relevance at the cost of creativity, and persona-enhanced few-shot prompting struck a good balance between the two.


At present, Tencent has selected 200,000 of the 1 billion virtual personas and released them publicly, along with the data they generated.


The authors say more personas and data will be made public once issues such as security risks are addressed.

Paper:

https://arxiv.org/abs/2406.20094

GitHub:

https://github.com/tencent-ailab/persona-hub

— END —

QbitAI · Contracted Toutiao account

Follow us and be the first to know about cutting-edge technology trends
