
Prompt, NLP's rising star, crosses over: Tsinghua's Liu Zhiyuan applies it to the image side of VLMs. How does it differ from fine-tuning, and can the CV field borrow it?

Xiao Zhen, reporting from Aofei Temple

QbitAI Report | Official account QbitAI

Prompt, the rising star of NLP, has been getting quite popular lately.


Its popularity has even crossed over into VLMs (visual-language models).

Models like OpenAI's CLIP and Nanyang Technological University's CoOp have already adopted this line of thinking.

Now, the latest visual-language model paper from the team of Liu Zhiyuan, associate professor at Tsinghua, also proposes a new prompt-based method.


According to the paper, this is also the first time prompts have been used for cross-modal, zero-shot/few-shot visual grounding.

Judging from current NLP and VLM models, many prompt-based approaches perform well, which has CV researchers feeling a bit tempted too: can we get in on this?

So what exactly makes prompts good, and can they deliver similar results when applied to the image side?

Let's take a look.

How does it differ from fine-tuning?

Initially, when NLP models were not yet so large, people designed models for specific tasks with a "pre-training + fine-tuning" approach.

In this mode, researchers first pre-train a model that performs well, then, for a specific downstream task, keep most of the model's parameters and adjust some of them so the model achieves the best possible results on that task.


For example, using BERT as the pre-trained model.
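Here is a minimal sketch of what that fine-tuning loop looks like, assuming the Hugging Face transformers library; the model name, label, and hyperparameters are illustrative, not from the paper:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # adds a fresh task-specific head

# Fine-tuning updates the pre-trained weights for the downstream task.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["I love this movie."], return_tensors="pt")
labels = torch.tensor([1])  # 1 = positive (illustrative label)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```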

However, as pre-trained models grew bigger and bigger, so did the cost of fine-tuning (training time, amount of data required, and so on). Somewhat overwhelmed, researchers began looking for better approaches.

This is when prompts appeared, except that this time it is the downstream task that gets adjusted to fit the pre-trained model.

A prompt is a bit like an input template that "hints" at what to do: as soon as the pre-trained model "sees" it, it knows which task it is supposed to accomplish.

For example, in a sentiment classification task, you want the pre-trained model to judge the sentiment of an input sentence and produce an adjective that classifies it:

Enter "I love this movie." After that, give a prompt "This movie is [mask]" in advance, so that the pre-trained model sees it and understands that it wants to output praise adjectives such as "great/nice".

Trained this way, the pre-trained model picks the right kind of vocabulary when it sees the corresponding prompt, rather than "wandering off" to do something else.
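A minimal sketch of this idea with a masked language model, again assuming the Hugging Face transformers library; the template and the two label words (a tiny "verbalizer") are illustrative, and [MASK] follows BERT's mask-token spelling:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
prompt = "I love this movie. This movie is [MASK]."

# Restrict scoring to candidate label words instead of the whole vocabulary.
for pred in fill(prompt, targets=["great", "terrible"]):
    print(pred["token_str"], round(pred["score"], 4))
```

The sentence's sentiment is then read off from which label word the model prefers to fill in.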

Since prompts have worked so well in NLP, many researchers have begun trying the method on VLMs, which are closely related to NLP.

Tsinghua uses it on the image side

Of course, the VLM models that first adopted prompts still mostly applied them on the text side.

Zhihu user @Tourbillon explains that in the two VLM models just mentioned, OpenAI's CLIP and NTU's CoOp, the way prompts are applied is somewhat similar to the PET model in NLP.

Judging from their model designs, the prompt's influence is clearly visible on the text side, like the "A photo of a [mask]" template in CLIP:

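A zero-shot classification sketch in that style, assuming the Hugging Face CLIP interface; the image path and class names are made up. Note that CLIP fills the template's slot with each candidate class name and scores the resulting sentences against the image, rather than predicting a masked token:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")    # hypothetical input image
labels = ["cat", "dog", "airplane"]  # illustrative class names
texts = [f"A photo of a {label}" for label in labels]  # the text-side prompt

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```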

CoOp then improves further on CLIP by making the prompt itself learnable, so it can be optimized during training:

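In CoOp, the hand-written prompt words are replaced by free context vectors learned from data. A rough PyTorch sketch of that idea; the dimensions and the stand-in class embeddings are illustrative, not CoOp's exact code:

```python
import torch
import torch.nn as nn

n_ctx, dim, n_classes = 16, 512, 10  # illustrative sizes
# Learnable "context words" replace the hand-written prompt tokens.
ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

class_emb = torch.randn(n_classes, 1, dim)  # stand-in frozen [CLASS] embeddings
# Prepend the shared context to every class token: [ctx_1] ... [ctx_16] [CLASS]
prompts = torch.cat([ctx.expand(n_classes, -1, -1), class_emb], dim=1)

# Only the context vectors receive gradients; the VLM's encoders stay frozen.
optimizer = torch.optim.SGD([ctx], lr=0.002)
```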

These applications of prompts improve the overall output quality of VLM models.

However, these are basically text-side applications within VLMs. Are prompts also suitable for the image side?

The latest paper from Liu Zhiyuan's team at Tsinghua attempts exactly this: building visual sub-prompts on the image side of a VLM by color-marking image regions.


Of course, the text side also gets prompts; but in Liu Zhiyuan's view, applying prompts on the text side alone does not fully exploit prompt tuning, so the paper tries a cross-modal prompt tuning method.
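A rough sketch of what such a colored visual sub-prompt could look like, using only PIL; the image path, the region box, the color, and the text template are all illustrative, not the paper's exact settings:

```python
from PIL import Image, ImageDraw

image = Image.open("example.jpg").convert("RGBA")  # hypothetical input
overlay = Image.new("RGBA", image.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(overlay)

box = (50, 40, 180, 160)                   # an illustrative region proposal
draw.rectangle(box, fill=(255, 0, 0, 96))  # semi-transparent red block

marked = Image.alpha_composite(image, overlay)

# A matching text sub-prompt then lets the masked language model name the
# color of the region that fits the expression (template is illustrative):
text_prompt = "The dog chasing the ball is in [MASK] color."
```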

Judging from the paper's test results, this method can generally achieve better results than fine-tuning in the few-shot learning setting.


Still, this is just one more attempt at applying prompts to VLMs.

Is the approach also suitable for image problems in the CV field proper?

Can the CV field borrow it?

On Zhihu, many answerers have offered their own views.

Zhihu user @Tourbillon outlines two paths in terms of method:

For a prompt in the pure-CV direction, similar to how ViT splits an image into patches, each patch can in fact be treated like a character; patch prompts could then be designed to train the model. This again splits into two approaches: generative (similar to ViT) and discriminative (similar to self-supervised methods).
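A small PyTorch sketch of this patch-as-token view; the sizes and the prepended "patch prompt" are illustrative assumptions, not an existing CV method:

```python
import torch
import torch.nn as nn

patch_size, dim = 16, 768
# A Conv2d with stride == kernel_size cuts the image into non-overlapping
# patches and linearly projects each one, like a word-embedding lookup.
to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)
tokens = to_patches(image).flatten(2).transpose(1, 2)  # (1, 196, 768)

# One could prepend learnable "patch prompts", analogous to text prompts.
patch_prompt = nn.Parameter(torch.zeros(1, 4, dim))
sequence = torch.cat([patch_prompt, tokens], dim=1)    # (1, 200, 768)
```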

Zhihu user @yearn believes that, for now, continuous prompts are the line of work most likely to transfer to the CV field. Recently, Transformers have been poised to unify CV and NLP by converting image inputs into patch form, which also makes it easier for researchers to borrow prompt methods from NLP.

Of course, @yearn also notes that two hard problems still need to be solved before prompts can truly be applied in the CV field:

1. CV does not yet have a dominant pre-trained model like BERT or GPT, so it may be difficult to port the zero-shot/few-shot learning playbook in the short term.

2. CV's downstream tasks are more complex; adapting tasks like detection and segmentation to prompt tuning would be an enormous amount of work.

There are also anonymous users who bluntly argue that prompts can only handle some image tasks in a very awkward way, though they might work better for video.


So, do you think prompts can be applied in the CV field?

Liu Zhiyuan's team's latest paper:

https://arxiv.org/abs/2109.11797

Zhihu answers (authorized):

@Tourbillon: https://www.zhihu.com/question/487096135/answer/2127127513

@yearn: https://www.zhihu.com/question/487096135/answer/2124603834

— Ends —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology developments
