
Why do Stanford students want to copy Chinese models?


Wu Xiaobo Channel

2024-06-05 09:00 · Posted on the official account of Zhejiang Hangzhou Bajiuling Cultural and Creative Co., Ltd.

"They will try to replicate everything, but they will not be able to replicate my thoughts, and I will let them toiled and steal, but they will always be a year and a half behind me."

—— Rudyard Kipling (British writer born in India)


Text / Bajiuling (WeChat public account: Wu Xiaobo Channel)

Did an American model copy a Chinese one?

In early June, sharp-eyed netizens noticed that an AI team from Stanford University in the United States had released a large model called Llama3-V on May 29, claiming that a SOTA multimodal model could be trained for only $500, with performance comparable to OpenAI's GPT-4V, Google DeepMind's Gemini Ultra, and Anthropic's most capable model, Claude Opus.

On closer inspection, however, netizens found the model to be suspected plagiarism, a "shell" built on an open-source release by the Chinese large-model company Mianbi Intelligence (ModelBest): MiniCPM-Llama3-V 2.5, published in mid-May.

As for "shelling", Zhang Xiaorong, president of the Deepin Science and Technology Research Institute, explained to us: "It usually refers to making superficial adjustments to, or repackaging, a model without changing its core algorithm and architecture, while claiming it as original."

Hu Yanping, chief expert at FutureLabs, added: "Open-sourcing a model means opening it up for others to use; in that sense, all secondary development based on open-source large models, such as fine-tuning, is a form of shelling. The difference lies in whether you explicitly state that your work is built on someone else's open-source model; the accepted norm is to say thanks or pay tribute openly."

But when many eyes turned to the Stanford team's Llama3-V, they found no such acknowledgement.

According to Leifeng.com's observation, the model attracted a great deal of attention after its release, partly because its core creators had Stanford backgrounds and experience at Tesla, SpaceX, Amazon, Oxford University and other institutions.

After noticing that something was off, netizens began on June 2 to post factual questions under Llama3-V's GitHub project, but the Llama3-V team quickly deleted them.

Mianbi Intelligence, the company whose work was plagiarized, is a well-known domestic startup that has raised hundreds of millions of yuan in financing and employs more than 100 R&D staff, 80% of whom come from Tsinghua University and Peking University.

Late at night on June 2, the Mianbi Intelligence team responded: MiniCPM-Llama3-V 2.5 had been trained to recognize the ancient Warring States-period script on the Tsinghua Bamboo Slips (a collection of Warring States bamboo manuscripts held by Tsinghua University, hereinafter the "Tsinghua Slips"). The team had spent months scanning the slips and annotating them character by character, and that data was never made public.

Yet when the team tested Stanford's model, they found not only that it could recognize the ancient Warring States script in the "Tsinghua Slips", but also that its incorrect recognition results were identical to those of the MiniCPM model: solid evidence of plagiarism.
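
To see why identical wrong answers carry so much weight, here is a minimal sketch of how one might compare two models' predictions on a private, never-published annotation set; the function name and labels below are made up for illustration and are not Mianbi's actual test harness.

```python
# Hypothetical sketch: measure how often model B reproduces model A's *mistakes*
# on a privately annotated test set. Agreement on errors (not just on correct
# answers) is the suspicious signal described above, because two independently
# trained models would be unlikely to fail on the same items in the same way.

def error_overlap(preds_a, preds_b, gold):
    """Fraction of model A's errors that model B reproduces exactly."""
    a_errors = 0
    shared_errors = 0
    for pa, pb, g in zip(preds_a, preds_b, gold):
        if pa != g:              # model A got this item wrong
            a_errors += 1
            if pa == pb:         # model B made the identical wrong prediction
                shared_errors += 1
    return shared_errors / a_errors if a_errors else 0.0

# Toy example with made-up labels for three bamboo-slip characters:
gold    = ["甲", "乙", "丙"]
model_a = ["甲", "丁", "戊"]     # two errors
model_b = ["甲", "丁", "戊"]     # identical errors -> overlap = 1.0
print(error_overlap(model_a, model_b, gold))
```

On a large enough private test set, an error overlap close to 1.0 is very hard to explain as coincidence.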


When the news reached China, it was like a stone thrown into still water, stirring up a thousand waves.

Did Wolong copy Fengchu?

This plagiarism incident drew so much attention because of who was involved, Stanford University and Tsinghua University, plus an element of surprise: this time it was the American team copying the Chinese team (a little poignant to think about).

According to the "Analysis Report on the Core Team Members of the Ten Global Large Models" released by AMiner, most of the core members behind the ten world-famous large models GPT, Gemini, Claude, GLM, LLaMA, Qwen, Falcon, PaLM, BERT, and T5 were trained at the University of California system and Stanford University; Tsinghua University is the only Chinese university on the list.

*Note: The University of California is not a single university but a system of public universities located in different cities across California; most of its ten campuses are well known.


In addition, anyone familiar with the large-model industry will notice that the introductions of domestic large-model companies often stress that R&D staff make up seventy to ninety percent of the company, and, where applicable, that the founding team came out of Tsinghua University.

Even Stanford's own Artificial Intelligence Index Report 2024, in discussing global AI models, mentions Tsinghua University as one of the non-Western academic institutions that has released the most foundation models.

It is not surprising, then, that the incident came to be framed as "America's Wolong copying China's Fengchu" (a Three Kingdoms allusion to the two equally brilliant strategists Zhuge Liang and Pang Tong), which triggered heated discussion.

A closer look at the identities involved, however, reveals a clear gap between the two teams.

On June 2, Aksh Garg, a member of the Stanford team, publicly apologized for the incident on the social platform X, explaining that the main cause was that they had "trusted the wrong teammate".

According to his post, the team consists of three young Americans: Siddharth Sharma, Aksh Garg, and Mustafa Aljadery.

Sharma and Garg are undergraduates at Stanford University and were responsible for promoting the Llama3-V model. Aljadery is a young entrepreneur who graduated from the University of Southern California and was responsible for the model's code. During development, Aljadery copied China's MiniCPM-Llama3-V 2.5 in a bid to get famous quickly.

We looked through the earlier posts of another member, Sharma, and, as he says, he is indeed a KOL (key opinion leader) in tech circles who has promoted many products, not just Llama3-V.

In essence, then, someone with a University of Southern California background copied a large model built by a team with a Tsinghua background.


On Mianbi Intelligence's side, the co-founder and chief scientist is Liu Zhiyuan.

According to the company's website, Liu Zhiyuan has published more than 200 papers in well-known international journals and conferences in the field of artificial intelligence, has more than 31,000 Google Scholar citations, and has won the Ministry of Education's first prize for natural science.

His mentor Sun Maosong carries an even longer list of titles: foreign academician of the European Academy of Humanities and Natural Sciences, fellow of the Association for Computational Linguistics, fellow of the Chinese Association for Artificial Intelligence, fellow of the Chinese Information Processing Society of China, and tenured professor and doctoral supervisor in Tsinghua University's Department of Computer Science and Technology. Three of his students, Liu Zhiyuan among them, are core members of well-known domestic AI startups.

In fact, Mianbi's MiniCPM-Llama3-V 2.5, the work of this star team, is quite well known in the Chinese AI community, though most Americans have never heard of it.

As Jiemian News reported, Lucas Beyer, a researcher at Google DeepMind, commented on the affair that MiniCPM-Llama3-V 2.5, despite comparable performance, has received far too little attention, seemingly only because the model does not come from an "American Ivy League school".

In the end, the incident turned into a farce: a ragtag crew with backgrounds at Stanford University and the University of Southern California exploited the information gap between China and the United States to plagiarize the work of a cutting-edge Chinese research team.

Liu Zhiyuan, one of the parties involved, wrote with some feeling on Zhihu the day after the Stanford team's apology:

The rapid development of artificial intelligence is inseparable from the open-source sharing of algorithms, data and models around the world, which lets everyone keep standing on the shoulders of SOTA and moving forward. Our own open-source MiniCPM-Llama3-V 2.5 uses the latest Llama3 as its language model base.

The cornerstone of open-source sharing is adherence to open-source licenses, trust in fellow contributors, and respect for and tribute to the achievements of those who came before, and the Llama3-V team has undoubtedly seriously undermined this. After being questioned they deleted their repository on Hugging Face; two of the three team members are still only Stanford undergraduates, and they have long roads ahead of them.

"You've got me, I've got you"

Having sorted out the ins and outs of the incident, you may feel that the truth is quite far from the storylines people spun up the moment they saw the news: "a head-to-head contest between China's Tsinghua camp and America's Stanford camp", or "China's large models have risen".

But the emotional gap may not be that big.

Hu Yanping believes the matter drew such wide attention mainly because "reverse plagiarism" is relatively rare: in the past, domestic AI teams built on foreign open-source large models, and it was uncommon for foreign teams to build on domestic ones. It shows that although domestic large models still lag behind overall, there are bright spots in particular areas.

Experts more bullish than Hu Yanping took a different view.

Looking back on the incident, one industry insider told us with a sigh: "As far as large language models are concerned, I have always believed the gap between China and the United States would keep narrowing, but that the United States would keep coming up with something new. This incident does show that the gap between China and the United States on large language models is narrowing; at the technical level at least, it proves that you've got me and I've got you."

Zhang Xiaorong likewise said that the Stanford team's plagiarism of the Chinese team does indicate that Chinese teams are roughly on par with the United States in large-model application development.

Another noteworthy aspect of the incident, however, is the attentive netizens who served as its "discoverers", "exposers" and "reminders".

Without their close scrutiny of new large-model products, and their prompt questions and reminders, it would have been hard for this matter to break so quickly out of a niche field and into public view.

"As long as you have more eyes, bugs are easy to catch." This is a quote from the book "The Cathedral and the Bazaar", published in 1999, and it is the core meaning of the book.

The book is regarded as the "bible" of the Internet open-source movement.

More than 20 years ago its author, Eric S. Raymond, advocated a "bazaar" model of open-source development that invites software developers around the world to participate, replacing the old "cathedral" model in which large companies built software behind closed doors.

In other words, thousands of ordinary "cobblers", working together, can outmatch a single Zhuge Liang.

His prediction has become reality, and his philosophy is now a value we take for granted: the software, networks, and operating systems people use today are built on the products of open-source development.

Open source has run through the development of the Internet and now extends into the era of artificial intelligence. Fortunately, Raymond's "more eyeballs" law helps catch not only bugs but also plagiarism.

In a sense, the Stanford team's biggest mistake was to exploit the openness of the Internet while ignoring another defining feature of openness: public scrutiny.

Afterwards, some netizens commented, puzzled: "Weren't they afraid of being found out?"

Perhaps no matter how open the world becomes, it is no match for a mind and a field of vision that have sealed themselves off.


Hu Yanping

Chief Expert of FutureLabs

Open-source models actually welcome "shelling"; the only question is whether it is explicitly acknowledged. On the MiniCPM case alone, it is currently difficult to escalate the matter to the level of legal action.


Zhang Xiaorong

President of Deepin Science and Technology Research Institute

In the open-source community, plagiarism does exist, but it is not the norm. It is usually avoided or detected through code review, performance comparisons, and community oversight. Legal remedies for enforcing rights are feasible; at worst, such incidents damage the industry's reputation and hinder innovation and technological development.

I believe that originality and integrity are essential in any field. An open source culture encourages sharing and collaboration, but that doesn't mean intellectual property rights can be disregarded. We should continue to promote the establishment of more robust mechanisms for the protection of intellectual property rights and encourage genuine innovation and cooperation. At the same time, any alleged plagiarism should be investigated fairly and transparently.


Anonymous industry insider

In previous waves of technology, including the Internet and the mobile Internet, Americans did not learn much from China. Even when Chinese companies beat all their rivals, the wins were mostly on the business level and had little to do with technology. This incident does show that the gap between China and the United States in large language models is narrowing; at the technical level at least, it proves that you've got me and I've got you, and that is undeniable.


Zhang Jinjing

Founder of BT Finance

At present, general-purpose large models lack proprietary real-world data for training and can only rely on Internet data; the United States also lacks some of the rich data sources that exist only in China, so certain kinds of in-depth training cannot be done there.

It may be that some teams, eager to secure financing as quickly as possible, take large models trained on Chinese data and pass them off as their own for gain. It may also reflect that some American teams have hit a dead end in large-model application and R&D: unable to break through on the technology, they can only get creative at the application layer, and China's rich data resources and large-model capabilities become their key reference points.

Author of this article | Hefeng Yueban | Rao Zufen

Editor-in-chief | Image source | VCG


