Xiaoice's Li Di: Why must virtual humans "cut out" the actor behind the avatar?

Production capacity is a major ceiling for the virtual human industry.

Author | Dong Zibo

Editor | Lin Juemin

"In the virtual human sector today there are three main 'schools', and I think only the Xiaoice model holds up," said Li Di, CEO of Xiaoice.

If you follow the virtual human sector but do not know Xiaoice, you have not done your homework. Xiaoice was born inside Microsoft as one of its artificial intelligence teams, under the Microsoft (Asia) Internet Engineering Institute. In July 2020, Xiaoice was spun off from Microsoft as an independent company and has continued to work in artificial intelligence.

Today, with virtual idols such as A-Soul, Seven Seas, and Liu Yexi in vogue, Xiaoice is also making virtual humans. But Xiaoice's virtual humans are not "idols"; they include hosts, artists, and graduate students. In that sense, Xiaoice is something of a breath of fresh air in the virtual human market.

Li Di, CEO of Xiaoice, joined Microsoft in 2013 and built the Xiaoice AI being framework from scratch. He is a pioneer of virtual human technology with a deep understanding of both the technology and the sector.

Why make such a bold claim? Li Di has his reasons. To explain them, we first need to look at the "three schools" of virtual humans.

1

The three forks in the road for virtual humans

"Today's so-called virtual human sector is really three groups of people who have ended up standing at the same intersection," Li Di told Leifeng Network.

The first of these three groups is the "virtual idol school", led by IP operation; the second is the "CG content school", led by artists; and the third is the "AI being school", driven by AI technology.

The "virtual idol school" has been hot in recent years. This model drives virtual humans mainly by capturing the voice and movements of an actor, known as the "middle person" (the actor behind the avatar).

The "virtual idol school" invests heavily in IP operation, much like star-making in the real world, to build the virtual idol's image among its audience. To reduce rendering costs, most virtual idols appear in an anime ("two-dimensional") style, firmly capturing the hearts and wallets of anime fans.

According to statistics, the popular domestic virtual idol Jiaran (ID: "What Jiaran Eats Today") earned 6.7 million yuan in live-streaming revenue last year, ranking first among domestic virtual idols. A-Soul, the ByteDance virtual idol group to which Jiaran belongs, has five members, whose combined live-streaming revenue reached about 25 million yuan last year.

The "CG content school" can be called the industry's elder statesman. Originating with Hollywood special effects companies, this production method captures rough movement and appearance through motion and facial capture, or shoots live action and then replaces the head with CG. Then, at great expense of manpower and resources and under the direction of artists, the images (mainly faces) are rendered and fine-tuned frame by frame with CG technology.

The "CG content school" is content-driven: it aims to produce content the audience will pay for, demands more of its creative team, and leans toward "hyper-realism". Because content comes first, cost control has to give way. According to media reports, "hyper-realistic" virtual human video costs between 8,000 and 15,000 per second.

Of course, content-led virtual characters remain active in a large number of films, TV shows, and games, and continue to bring the industry huge revenue. Among virtual humans specifically, the popular domestic figures Liu Yexi, AYAYI, and Ling are the leaders, collecting countless brand endorsements, and this is seen as a business model with great potential for virtual humans.

One is a singing and dancing idol, the other a hyper-realistic avatar, but Li Di is not optimistic about either school: "Both models have a ceiling: they lack the capacity for high concurrency and mass production."

Take the former first. A virtual idol driven by a middle person will eventually hit its limits. One could even say that an idol-type virtual human is not a real "virtual human" at all, but a real person "wearing a virtual skin". One motion capture actor can drive only one avatar, which means the virtual idol and its actor are tightly bound, and virtual humans are hard to mass-produce.

On mass production, some will surely ask: "Virtual idols are already hugely popular. Do we really need to mass-produce virtual humans?"

Back in 1943, in the early days of computing, IBM chief Thomas Watson is said to have remarked: "The world only needs 5 computers." Today, smartphones are a standard part of modern life, in numbers that would have been unimaginable to people decades ago.

"In the future, the number of virtual humans is likely to exceed the number of natural humans on Earth," Li Di predicts.

Beyond mass production, the "strong binding" between a virtual idol and its middle person means that virtual idols are not immune to scandal. Besides questionable moves by the operating companies themselves, quite a few virtual idols have lost fans or collapsed because of the actors behind them.

Hololive's Akai Haato and Kiryu Coco dealt a heavy blow to the company's business in China after insulting China during live streams. The veteran virtual idol Kizuna AI was split into four doppelgangers in order to scale up, sidelining the original actor; this cost her 100,000 fans worldwide and eventually led to Kizuna AI ceasing operations.

Virtual or real, the star-making industry is much the same. Expiring contracts, romance and marriage, and personal conflicts are all uncontrollable factors in operating a virtual idol, and they point to the many hidden dangers of the virtual idol model.

Does the "CG content school", which is light on IP operation and heavy on content creation, have a chance? Li Di's answer is again fairly pessimistic.

The "CG content school" loses on burning money. Beyond the very high cost of "hyper-realistic" virtual humans mentioned earlier, anyone who knows this sector knows that hyper-realistic virtual humans are falling into an arms race over looks. Manufacturers compete over whose rendering is more detailed, whose modeling looks better, and whose makeup is more refined. Behind the "divine faces" of virtual humans, large sums of money are being burned.

Moreover, Li Di believes this model cannot scale quickly. Take manpower: Liu Yexi's team has about 150 people. If the company wants to replicate Liu Yexi's success and launch a new "Liu Yexi", in theory the team needs to grow by another 150 people.

In a film or game production cycle, modelers and renderers can spend a long time polishing an avatar for the sake of quality. On today's mobile Internet, content has to be updated at high frequency to gain a foothold in a fiercely competitive attention market. A development model built on painstaking refinement struggles to keep up with a rhythm of two updates a week, or even one a day.

The production capacity ceiling is a hurdle that virtual humans must clear.

The AI being school's answer is to produce large volumes of content through AIGC and then let the market drive iteration. Xiaoice has been a technology company from the start and has little interest in star-making or IP-building, which makes its view of virtual humans different, even subversive.

The human factor of the middle person is uncontrollable? An MCN can only tighten management, expand the team, and tackle the problem in the traditional way. Xiaoice instead "cuts out" the middle person entirely and drives its virtual humans purely with AI, which solves the middle-person problem once and for all and also makes highly concurrent AI interaction possible.

CG rendering is too expensive and the production cycle too long? Xiaoice does not render with CG technology and does not even use 3D models. Li Di said: "Our view on this is 'idealist'. Since the image the human eye sees is two-dimensional, we can reconstruct every 2D frame the retina would see, without actually going through 3D."

With this method, Xiaoice keeps rendering cost at about 17 yuan per second, almost 1/500 the cost of CG rendering. The resolution is only 1080p, but that is enough for viewing on a small screen.
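The cost gap can be checked with simple arithmetic. The sketch below assumes the CG figure quoted earlier (8,000 to 15,000 per second) is in yuan, like the 17 yuan per second figure; the constant and function names are illustrative only.

```python
# Per-second rendering costs quoted in the article.
# Assumption: the CG figures are in yuan, like the Xiaoice figure.
CG_COST_LOW = 8_000
CG_COST_HIGH = 15_000
XIAOICE_COST = 17


def cost_ratio(cg_cost: float, ai_cost: float = XIAOICE_COST) -> float:
    """Return how many times more expensive CG rendering is per second."""
    return cg_cost / ai_cost


low = cost_ratio(CG_COST_LOW)
high = cost_ratio(CG_COST_HIGH)
print(f"CG costs roughly {low:.0f}x to {high:.0f}x more per second")

# Cost of a hypothetical 60-second clip under each approach.
clip_seconds = 60
cg_clip = (CG_COST_LOW * clip_seconds, CG_COST_HIGH * clip_seconds)
ai_clip = XIAOICE_COST * clip_seconds
print("CG clip:", cg_clip[0], "to", cg_clip[1])
print("Xiaoice clip:", ai_clip)
```

At the low end of the CG range the ratio is about 470, which is consistent with "almost 1/500"; at the high end the gap is closer to 1/880.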

Virtual humans are too caught up in the looks race? Xiaoice has almost no artists in-house, and its faces are generated from big data. Li Di said: "You want a girl-next-door type? I can generate twenty thousand for you on the spot, and if you don't like them, I can give you another twenty thousand." With output at this scale, the aesthetic choice of a virtual human is handed back to the market.

See Xiaoice's playbook? It is good at the brute-force "dimensionality-reduction strike": concentrate the best resources on the core functions of the AI being, and use technological innovation to change or overturn the remaining problems.

2

AI being: cultivated inside and out in order to be a "person"

The Xiaoice team was originally part of the team behind Cortana, Microsoft's artificial intelligence assistant. At the time, Cortana's slogan was "Ask me anything", and at its technical core it was closer to one-to-one, lookup-table question answering.

"Frankly, a task-oriented dialogue system is easier to build; you just have to write the rules well," Li Di said.

So why make virtual humans? Li Di offers several judgments:

First, the ultimate form of applied AI will not be just a transactional assistant but a companion that provides emotional value. The "assistant" will become one part of the "companion" function, and users will care more about a virtual human's emotional value.

Second, to meet users' emotional needs, future AI will be defined by diversity, and its numbers will be enormous. AIs will constantly be launched into the market and eliminated, with new ones taking their place, so that diverse virtual humans serve the needs of diverse markets.

Third, the future business model for virtual humans is a subscription service built mainly around creating virtual humans and dispatching them as labor.

On this view, a virtual human's value cannot rest simply on its ability to "get things done"; it must be cultivated inside and out so that it is more "like" a person.

"Inside and out" is the phrase Li Di uses to describe the ideal AI being. A virtual human is not only the virtual "skin" on the outside, but also its personality, inclinations, attitudes, and other essential traits. It is these traits that let an AI being truly be called a virtual "person".

Of the three schools above, the "virtual idol school" relies mainly on the middle person and IP operation to bring out a virtual human's personality, and the "CG content school" relies mainly on scripts and copywriting. The "AI being school" wants each virtual human to have a unique personality of its own and to interact with users in real time, so that the IP value resides in the virtual human itself.

On Douyin, Xiaoice is arguably the virtual human account that dotes on its fans the most. The team connected its open-domain dialogue system to the comment section of Xiaoice's Douyin account, so that she can reply to every comment within seconds. At Vanke, Cui Xiaopan, a virtual employee in the finance department, won the 2021 Outstanding Newcomer Award; the write-off rate for the prepaid receivables and overdue documents she chased reached 91.44%, because her human-like side puts people at ease. In addition, the Xiaoice team has observed that many users give their virtual human the name of a real person in their lives, transferring their feelings to the AI.

Making the virtual human thoroughly "virtual" while returning its value to humanity: this is the road AI beings will inevitably take.

But generating content with AI is easier said than done. At the 2022 Winter Olympics, Alibaba's virtual human Dong Dong broke into the mainstream. During the Games, Dong Dong took on interviews, news broadcasts, live-stream selling, and other tasks, and interacted in real time in the studio with athletes such as Wu Dajing; her lively personality and professionalism won her countless fans. According to media reports, Dong Dong's dialogue was generated entirely by AI in the cloud, which is indeed striking.

Leifeng Network asked Li Di for his view on this. Li Di said: "It is possible, but as far as we know, so far only we have done it. On Daily Economic News's AI TV, virtual anchors broadcast live continuously, 24 hours a day, 365 days a year. But we have been doing financial text generation for a full five years, and we have been working with Daily Economic News for nearly three."

Li Di said content generation poses two main difficulties:

The first is "attribution": tying content to the causes behind it. Content without attribution is like a paper without citations. In a live news broadcast especially, content that cannot be attributed is likely to contain errors and omissions, which brings unexpected risk.

The second is "perspective": AI beings generally lack the ability to produce opinions. AI has no likes or dislikes and cannot make value judgments. Opinion questions have no standard answer, so AI often struggles with them. If the topic is confined to a very narrow scope, however, this too can be achieved.

Li Di told Leifeng Network that, powered by GANs (generative adversarial networks) and few-shot learning, Xiaoice provides a platform to the Central Academy of Fine Arts (CAFA). CAFA retrieves works produced on the platform and judges their quality, and that feedback is used to adjust the trained model. As a result, Xia Yubing, who "graduated" from CAFA's graduate school, produces works of stable quality with a clear personal style.

Li Di said that if an AI cannot control the quality of its work, it cannot be said to have artistic ability. An AI being must be able to create works at the level of art, so that audience and artist can communicate indirectly through the work. This is the key criterion for AI art creation.

3

Is making virtual humans "anti-human"?

Virtual humans can go to graduate school, become outstanding employees, and even keep someone company as a boyfriend or girlfriend. Even the job people think least likely to be automated, that of the artist, is within the reach of the AI being Xia Yubing. Seen this way, AI may one day really replace our work, or even our value.

So Leifeng Network put the question to Li Di: is making AI beings really "anti-human"?

Unexpectedly, Li Di did not reject this view. "As the old saying goes, the one who starts such a practice will be left without heirs," he said.

Li Di believes that human beings have many imperfections, and that AI can help us overcome them. We always feel threatened by new things, yet fail to notice that many problems were never properly solved before those new things appeared.

"My mom can't always reach me because I'm busy. But because she has Xiaoice for company, she talks with Xiaoice a lot. If there were no Xiaoice, would that mean I'd be home more often? Not necessarily. Human society has many imperfections and disappointments that are hard to share with other people, and AI is a good way to fill that gap in demand."

Li Di has even imagined a group of people storming Xiaoice's office to rescue from its servers an AI being that Xiaoice was about to "terminate".

"There has to be a villain," Li Di said nonchalantly.

He says he is willing to play the "villain", but Li Di has always been alert to ethical issues. With great power comes great responsibility. With technological influence in their hands, AI companies also carry an ethical cross.

Li Di told Leifeng Network that, fortunately, Xiaoice is building AI beings itself, so it can hold this important bottom line itself.

"There are two main ways for AI to do evil: to be as human as possible and then deceive you; or to look like your ex-girlfriend so that you become emotionally attached, and then start pushing all kinds of recommendations at you for the sake of KPIs. So we don't model ordinary people, and we don't use ordinary people's voices. Unrestrained commercialization can easily tie you to the stake."

On Xiaoice's sense of what to do and what not to do, Li Di said that eight years after the project was launched, Xiaoice can no longer be called a new project. But AI still holds a great deal of new knowledge waiting to be explored, and each discovery may overturn how people previously saw the world.

"To be honest, I think this is something I could do for a lifetime," Li Di said.

The following is a transcript of Leifeng Network's interview with Li Di, which offers cutting-edge insights on AI training, virtual human localization, and more. Leifeng Network has selected and edited it without changing the original meaning:

Leifeng Network: AI beings inevitably face an uncanny valley, not only of the face but also of human-like personality. How do you solve it? Or what is your view on it?

Li Di: Our position is fairly clear. We think that, to this day, physical hardware has not crossed the uncanny valley, so we still do not touch hardware.

Leifeng Network: Have you run into uncanny valley problems before?

Li Di: Frankly, our choice to skip CG and render with neural networks was not a whim. I can say quite clearly today that these technologies and products, including much of CG, still could not solve the uncanny valley problem even after one or two more systematic upgrades.

It is really a question of acceptance. In a cartoon, I know the characters are not people, so I naturally accept their exaggeration and unnaturalness. But once I believe in my heart that something is a real person, I can no longer accept that. The upper limit of existing technology is not high enough to cross the uncanny valley. From that perspective, we need a new technology stack, new voice technology, or techniques like neural network rendering to solve it.

Leifeng Network: Beyond the uncanny valley of image or voice, is there also an uncanny valley in emotion, or in her dialogue?

Li Di: In conversation, once people know they are not talking to a real person, their behavior quickly switches to another mode: they treat it as a test and try to find as many faults as they can. So for an AI system like this, the most important problem is how to get out of trouble, not how to fool people.

So behind the dialogue system there is a particularly large filtering system. One of Xiaoice's strengths is that we have a very complete filtering system, which directly affects the quality of the conversation. Pornography, gambling, politics, all kinds of baiting... when people are dealing with an AI, their behavior has no bottom line.

Leifeng Network: I understand Xiaoice also has a team in Japan, with Rinna, which has been running since 2015. What strategies and methods do you use to localize AI?

Li Di: Yes. With AI systems in particular, at the beginning we, like other teams at Microsoft, mainly localized the tools; there was no cultural localization. But when we started thinking about Xiaoice, we found that it is not a language problem. Indian English and American English are not the same, and even within the United States, North and South share the same English but differ in culture. So it has to be done by local people.

Leifeng Network: For localization, do you have perhaps a dozen people in Japan?

Li Di: No, more than 60 people.

Leifeng Network: So who mainly makes up your team in Japan? For example, is it more aesthetic training, or more development?

Li Di: All of it is development. It is mainly our core development team, with some local PMs, and we cooperate with a large number of people in local cultural circles. That is our method; we ourselves are still overwhelmingly a technology team.

Leifeng Network: So it is effectively a form of outsourcing.

Li Di: Yes. For example, when we want to do Indonesia, we have Indonesian staff of our own. They still work on product and R&D, but because they are Indonesian they have the basic common sense. They can tell when something we have done looks odd and know what would not look odd, but they have a hard time distilling that into theory. So you go to local people who have enough theoretical grounding. That is one method.

The second method is big-data statistics. I collect a lot of data from the local market. That data reflects what hundreds of thousands, millions, tens of millions, even hundreds of millions of people have in common; culture is itself a property of the group, so by training on this data I can fit it to a certain extent. In the cold-start phase, social media such as Twitter is the main source, but cold start only gets you off the ground. We also have many private data sources, usually through partners.

The other thing is that the system iterates on itself through interaction, which is really important. For the cold-start phase, one method is cooperation agreements; for example, we have long had a real-time data cooperation agreement with Twitter. After all, the Xiaoice team used to be a search engine team, so this is relatively simple for us.

Leifeng Network: In that case, is the head of a local development team usually a local, or someone sent from your side?

Li Di: In Japan there is one Chinese person. He is the GM of the Japanese team; you can think of him as a Chinese person living in Japan, who also joined Microsoft there. Everyone else is Japanese.

Leifeng Network: So after the spin-off, the Japanese team became independent together with the Chinese team.

Li Di: Yes, we brought the international team over with us intact. This is an opportunity for Xiaoice: we start overseas with a strong team rather than starting over.

Leifeng Network: Where do you stand internationally in terms of development today?

Li Di: We are in the top tier. Look at Google, or Facebook's Blender: they are learning from us. The paper on Google's Meena compares it against Xiaoice, because we came out of Microsoft, after all. Frankly, that is still the case today: we are relatively advanced, and mostly they are learning from us.

Leifeng Network: What is the main gap between other companies and Xiaoice now?

Li Di: There is a big difference in the completeness of the framework. Take a single algorithm: today someone like OpenAI releases a super-large-scale pre-trained model and everyone gets excited. But a super-large-scale model is pre-trained, so it has no data loop. That is fine from the standpoint of the algorithm alone, and such algorithms will keep improving. A framework is different. A framework has to carry things, and new technology has to be integrated into it well. A paper does not need to carry anything; it only needs to be reproducible and to solve a specific problem. So a complete framework is something we have not yet seen elsewhere in the world.

Leifeng Network: So why are you at the top now? How would you sum it up?

Li Di: Because we came out of Microsoft. What Xiaoice has pursued all these years is a new technology stack, and Microsoft gave the Xiaoice team that stack, the people, and the technical support, and we have been growing for a long time. Xiaoice was a top organization in artificial intelligence research to begin with. If we were not at the top today, that would be the strange thing; it would mean we had fallen behind.

Leifeng Network: Can we say that China's virtual human field as a whole now leads the world?

Li Di: If you are talking only about AI beings, there is not much of a gap between other countries and China. I feel everyone is on the same starting line. Personally, and this may sound like a humblebrag, we are at the front: we have already set off, while most others are still at the starting line. Honestly, the technologies in use today, whether motion capture or CG, are existing technologies. It is hard to see any difference in their technology stacks, and hard for them to innovate conceptually.

Leifeng Network: What are you mainly focusing on in the next stage?

Li Di: Our focus is the large-scale production and release of AI beings. Right now it is a capacity problem. I think the biggest problem in the whole industry is capacity: AI beings are not being produced, and there is no way for you to produce them. Once capacity is up, we can put a large number of virtual humans into the market and then let market rules optimize and eliminate them.

When we put them on Douyin and tested them with Douyin's traffic, what we learned was that Xia Yubing really did do better than Chen Shuiruo (another virtual human built on the Xiaoice framework) and was more readily accepted by users. But until something is tested by the market, every opinion is speculation. Without taking it out for a walk, we know nothing.

Our model is a bit like "Creation 101": I first make 101, and after market screening 11 are left. The rest we archive, which costs us nothing. From this point of view, diversity is a must. Long ago there was no concept of girl groups and boy bands; later, what mattered in such groups was that the members were not all alike but diverse. Each girl group member should correspond to a category and target a particular audience.
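The "make 101, keep 11" funnel Li Di describes can be sketched as a generate-then-screen loop. This is only an illustration, not Xiaoice's actual pipeline: the `VirtualHuman` class, the `engagement` score, and the random numbers standing in for market feedback are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class VirtualHuman:
    name: str
    engagement: float  # hypothetical market signal, e.g. normalized user response


def launch_batch(n: int, rng: random.Random) -> list[VirtualHuman]:
    """Create n candidates; random scores stand in for real market feedback."""
    return [VirtualHuman(f"candidate-{i:03d}", rng.random()) for i in range(n)]


def market_screen(batch: list[VirtualHuman], keep: int):
    """Rank by engagement, keep the top `keep`, and archive the rest."""
    ranked = sorted(batch, key=lambda v: v.engagement, reverse=True)
    return ranked[:keep], ranked[keep:]


rng = random.Random(101)  # fixed seed so the sketch is repeatable
batch = launch_batch(101, rng)
kept, archived = market_screen(batch, keep=11)
print(len(kept), "kept;", len(archived), "archived")
```

The archived candidates are simply set aside rather than deleted, mirroring the point that unselected virtual humans carry no ongoing cost.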
