
Prompt, NLP's rising star, crosses over: Tsinghua's Liu Zhiyuan applies it to the image side of VLMs. How does it differ from fine-tuning, and can the CV field borrow it?

Xiao Zhen, reporting from Aofei Temple

QbitAI Report | Official account QbitAI

Prompt, the rising star of NLP, has been getting quite popular lately.


Its popularity has even crossed over into VLMs (visual-language models).

Models like OpenAI's CLIP and Nanyang Technological University's CoOp have already adopted this line of thinking.

Now, the latest visual-language model paper from the team of Liu Zhiyuan, associate professor at Tsinghua, also proposes a new prompt-based method.


According to the paper, this is also the first time prompts have been used for cross-modal, zero-shot/few-shot visual grounding.

Judging from current NLP and VLM models, many prompt-based approaches perform well, which has CV researchers feeling a bit tempted too: can we get in on this?

So what exactly makes prompts good, and can they deliver similar results when applied to the image side?

Let's take a look.

How does it differ from fine-tuning?

Initially, when NLP models were not yet so large, people designed models for specific tasks with a "pre-training + fine-tuning" approach.

In this mode, researchers first pre-train a model that performs well, then, for a specific downstream task, keep most of the model's parameters and adjust some of them so the model achieves the best possible results on that task.


For example, using BERT as the pre-trained model.
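Here is a minimal sketch of what that fine-tuning loop looks like, assuming the Hugging Face transformers library; the model name, label, and hyperparameters are illustrative, not from the paper:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # adds a fresh task-specific head

# Fine-tuning updates the pre-trained weights for the downstream task.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["I love this movie."], return_tensors="pt")
labels = torch.tensor([1])  # 1 = positive (illustrative label)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```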

However, as pre-trained models grew bigger and bigger, so did the cost of fine-tuning (training time, amount of data required, and so on). Somewhat overwhelmed, researchers began looking for better approaches.

This is when prompts appeared, except that this time it is the downstream task that gets adjusted to fit the pre-trained model.

A prompt is a bit like an input template that "hints" at what to do: as soon as the pre-trained model "sees" it, it knows which task it is supposed to accomplish.

For example, in a sentiment classification task, you want the pre-trained model to judge the sentiment of an input sentence and produce an adjective that classifies it:

Enter "I love this movie." After that, give a prompt "This movie is [mask]" in advance, so that the pre-trained model sees it and understands that it wants to output praise adjectives such as "great/nice".

Trained this way, the pre-trained model picks the right kind of vocabulary when it sees the corresponding prompt, rather than "wandering off" to do something else.
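A minimal sketch of this idea with a masked language model, again assuming the Hugging Face transformers library; the template and the two label words (a tiny "verbalizer") are illustrative, and [MASK] follows BERT's mask-token spelling:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
prompt = "I love this movie. This movie is [MASK]."

# Restrict scoring to candidate label words instead of the whole vocabulary.
for pred in fill(prompt, targets=["great", "terrible"]):
    print(pred["token_str"], round(pred["score"], 4))
```

The sentence's sentiment is then read off from which label word the model prefers to fill in.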

Since prompts have worked so well in NLP, many researchers have begun trying the method on VLMs, which are closely related to NLP.

Tsinghua uses it on the image side

Of course, the VLM models that first adopted prompts still mostly applied them on the text side.

Zhihu user @Tourbillon explains that in the two VLM models just mentioned, OpenAI's CLIP and NTU's CoOp, the way prompts are applied is somewhat similar to the PET model in NLP.

Judging from their model designs, the prompt's influence is clearly visible on the text side, like the "A photo of a [mask]" template in CLIP:

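A zero-shot classification sketch in that style, assuming the Hugging Face CLIP interface; the image path and class names are made up. Note that CLIP fills the template's slot with each candidate class name and scores the resulting sentences against the image, rather than predicting a masked token:

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")    # hypothetical input image
labels = ["cat", "dog", "airplane"]  # illustrative class names
texts = [f"A photo of a {label}" for label in labels]  # the text-side prompt

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```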

CoOp then improves further on CLIP by making the prompt itself learnable, so it can be optimized during training:

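In CoOp, the hand-written prompt words are replaced by free context vectors learned from data. A rough PyTorch sketch of that idea; the dimensions and the stand-in class embeddings are illustrative, not CoOp's exact code:

```python
import torch
import torch.nn as nn

n_ctx, dim, n_classes = 16, 512, 10  # illustrative sizes
# Learnable "context words" replace the hand-written prompt tokens.
ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

class_emb = torch.randn(n_classes, 1, dim)  # stand-in frozen [CLASS] embeddings
# Prepend the shared context to every class token: [ctx_1] ... [ctx_16] [CLASS]
prompts = torch.cat([ctx.expand(n_classes, -1, -1), class_emb], dim=1)

# Only the context vectors receive gradients; the VLM's encoders stay frozen.
optimizer = torch.optim.SGD([ctx], lr=0.002)
```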

These applications of prompts improve the overall output quality of VLM models.

However, these are basically text-side applications within VLMs. Are prompts also suitable for the image side?

The latest paper from Liu Zhiyuan's team at Tsinghua attempts exactly this: building visual sub-prompts on the image side of a VLM by color-marking image regions.


Of course, the text side also gets prompts; but in Liu Zhiyuan's view, applying prompts on the text side alone does not fully exploit prompt tuning, so the paper tries a cross-modal prompt tuning method.
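A rough sketch of what such a colored visual sub-prompt could look like, using only PIL; the image path, the region box, the color, and the text template are all illustrative, not the paper's exact settings:

```python
from PIL import Image, ImageDraw

image = Image.open("example.jpg").convert("RGBA")  # hypothetical input
overlay = Image.new("RGBA", image.size, (0, 0, 0, 0))
draw = ImageDraw.Draw(overlay)

box = (50, 40, 180, 160)                   # an illustrative region proposal
draw.rectangle(box, fill=(255, 0, 0, 96))  # semi-transparent red block

marked = Image.alpha_composite(image, overlay)

# A matching text sub-prompt then lets the masked language model name the
# color of the region that fits the expression (template is illustrative):
text_prompt = "The dog chasing the ball is in [MASK] color."
```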

Judging from the paper's test results, this method can generally achieve better results than fine-tuning in the few-shot learning setting.


Still, this is just one more attempt at applying prompts to VLMs.

Is the approach also suitable for image problems in the CV field proper?

Can the CV field borrow it?

On Zhihu, many answerers have offered their own views.

Zhihu user @Tourbillon outlines two paths in terms of method:

For a prompt in the pure-CV direction, similar to how ViT splits an image into patches, each patch can in fact be treated like a character; patch prompts could then be designed to train the model. This again splits into two approaches: generative (similar to ViT) and discriminative (similar to self-supervised methods).
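A small PyTorch sketch of this patch-as-token view; the sizes and the prepended "patch prompt" are illustrative assumptions, not an existing CV method:

```python
import torch
import torch.nn as nn

patch_size, dim = 16, 768
# A Conv2d with stride == kernel_size cuts the image into non-overlapping
# patches and linearly projects each one, like a word-embedding lookup.
to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)
tokens = to_patches(image).flatten(2).transpose(1, 2)  # (1, 196, 768)

# One could prepend learnable "patch prompts", analogous to text prompts.
patch_prompt = nn.Parameter(torch.zeros(1, 4, dim))
sequence = torch.cat([patch_prompt, tokens], dim=1)    # (1, 200, 768)
```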

Zhihu user @yearn believes that, for now, continuous prompts are the line of work most likely to transfer to the CV field. Recently, Transformers have been poised to unify CV and NLP by converting image inputs into patch form, which also makes it easier for researchers to borrow prompt methods from NLP.

Of course, @yearn also notes that two hard problems still need to be solved before prompts can truly be applied in the CV field:

1. CV does not yet have a dominant pre-trained model like BERT or GPT, so it may be difficult to port the zero-shot/few-shot learning playbook in the short term.

2. CV's downstream tasks are more complex; adapting tasks like detection and segmentation to prompt tuning would be an enormous amount of work.

There are also anonymous users who bluntly argue that prompts can only handle some image tasks in a very awkward way, though they might work better for video.


So, do you think prompts can be applied in the CV field?

Liu Zhiyuan's team's latest paper:

https://arxiv.org/abs/2109.11797

Zhihu answers (authorized):

@Tourbillon: https://www.zhihu.com/question/487096135/answer/2127127513

@yearn: https://www.zhihu.com/question/487096135/answer/2124603834

— Ends —

QbitAI · Signed author on Toutiao

Follow us and be the first to know about cutting-edge technology developments
