Heart of the Machine column
Gao Huang's team at Tsinghua University and the Kuaishou Y-tech team
This paper, from Gao Huang's team at Tsinghua University and the Kuaishou Y-tech team, explores how to evaluate the quality of a single generated image in reference-based image generation tasks. The RISA model it proposes requires no manually labeled training data, and its evaluation results are highly consistent with human subjective judgments. The work was accepted for oral presentation at AAAI 2022.
Introduction
Existing evaluation of generated images mainly assesses the overall quality of a generative model based on the distribution of its outputs. However, a generative model with excellent overall performance does not guarantee that every single image it synthesizes is of high quality. In reference-based generation tasks, such as rendering a landscape photo uploaded by a user into a specified style, evaluating the quality of a single generated image is crucial to the user experience.
The study proposes Reference-guided Image Synthesis Assessment (RISA), a method for evaluating the quality of a single generated image against a reference image.
RISA's contributions and innovations can be summarized as follows:
- RISA's training images come from the outputs of intermediate models saved during GAN training, and each image's quality label is derived from the iteration count of the model that produced it. No manual labeling is required, so in principle the amount of usable training data is unlimited.
- Since model iteration counts are too coarse-grained as quality labels, pixel-wise interpolation and multiple binary classifiers are used to stabilize training.
- An unsupervised contrastive learning loss is introduced to learn the stylistic similarity between reference images and generated images.
Paper link: https://arxiv.org/pdf/2112.04163.pdf
Implementation strategy
The overall framework of RISA is concise: the reference image and the generated image are each passed through a parameter-sharing style extractor to obtain feature vectors; the L1 distance between the two vectors is computed and fed into multiple binary classifiers, yielding a prediction vector; finally, the elements of the prediction vector are averaged to obtain the quality score.
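To make this pipeline concrete, here is a minimal PyTorch sketch of the forward pass. The ResNet-18 backbone, feature dimension, and number of classifier heads are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class RISASketch(nn.Module):
    """Minimal sketch of the RISA forward pass (illustrative, not official code)."""

    def __init__(self, feat_dim=512, num_classifiers=10):
        super().__init__()
        # Parameter-sharing style extractor; a ResNet-18 backbone is an assumption.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.style_extractor = backbone
        # One linear head per binary classifier, applied to the |f_ref - f_gen| vector.
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, 1) for _ in range(num_classifiers)]
        )

    def forward(self, ref, gen):
        f_ref = self.style_extractor(ref)  # (B, feat_dim)
        f_gen = self.style_extractor(gen)  # (B, feat_dim)
        diff = torch.abs(f_ref - f_gen)    # element-wise L1 distance
        # Each head predicts P(quality > its threshold); sigmoid gives probabilities.
        probs = torch.cat(
            [torch.sigmoid(head(diff)) for head in self.classifiers], dim=1
        )                                  # (B, num_classifiers) prediction vector
        return probs.mean(dim=1)           # final quality score in [0, 1]
```

By construction the averaged score lies in [0, 1], with higher values indicating higher predicted quality.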
RISA's training data comes from images generated by a series of intermediate models saved during GAN training. Taking the gender translation task in the figure below as an example, in the early stage of GAN training the quality of generated images improves noticeably as the iteration count grows, while in the later stage it levels off.
This article uses the images generated by this series of intermediate models as RISA's training data, with each image's label derived from the training iteration count of the model that produced it. This form of annotation is clearly unsuitable for the later stage of training, however, because the quality of generated images no longer changes significantly there. To make the data better suited to training RISA, a pixel-wise interpolation technique, i.e., linear interpolation in image space, is used to simulate gradual quality changes in the later stage.
As shown in the figure below, ideally the quality of generated images would increase monotonically with the number of GAN training iterations. In practice, for easy tasks the quality barely changes in the later stage of training, while for hard tasks it oscillates as training proceeds. The elbow point of the FID curve is therefore chosen as the boundary between the early and late stages of GAN training. For the early stage, images are sampled directly from the intermediate models, with iteration counts as their quality labels; for the late stage, the first and last models of the stage are used to generate image pairs with a clear quality gap, and linear interpolation between each pair produces a series of images of intermediate quality.
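The paper does not spell out its exact elbow criterion. A common heuristic, sketched below under that assumption, picks the point on the FID curve farthest from the chord joining the curve's endpoints.

```python
import numpy as np

def elbow_point(fid_values):
    """Heuristic elbow detection on an FID-vs-iteration curve: return the index
    of the point farthest from the straight line joining the endpoints.
    (Illustrative; the paper does not specify its exact elbow criterion.)"""
    fid = np.asarray(fid_values, dtype=float)
    x = np.arange(len(fid), dtype=float)
    # Unit vector along the chord from the first to the last point.
    chord = np.array([x[-1] - x[0], fid[-1] - fid[0]])
    chord /= np.linalg.norm(chord)
    # Perpendicular distance of each point from the chord.
    rel = np.stack([x - x[0], fid - fid[0]], axis=1)  # (N, 2)
    proj = rel @ chord                                # scalar projections
    dist = np.linalg.norm(rel - np.outer(proj, chord), axis=1)
    return int(np.argmax(dist))
```

Checkpoints before the returned index would be treated as the early stage, and the rest as the late stage.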
Some demos of the interpolated images are shown in the GIF below, where epsilon denotes the blending weight of the two images.
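The interpolation itself amounts to a convex combination of the two images and of their labels. A minimal NumPy sketch follows, with hypothetical variable names.

```python
import numpy as np

def pixel_wise_interpolate(img_early, img_final, label_early, label_final,
                           num_steps=8):
    """Blend two generated images of the same input in pixel space, assigning
    each blend a linearly interpolated quality label (illustrative sketch)."""
    samples = []
    for eps in np.linspace(0.0, 1.0, num_steps):
        # eps is the blending weight: 0 keeps the lower-quality early image,
        # 1 keeps the higher-quality final image.
        img = (1.0 - eps) * img_early + eps * img_final
        label = (1.0 - eps) * label_early + eps * label_final
        samples.append((img, label))
    return samples
```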
In addition to interpolation in image space, and to further stabilize training, RISA outputs its prediction as the average of multiple binary classifiers rather than as the fitted value of a single regressor. Each binary classifier predicts the probability that the quality of the current generated image exceeds its corresponding threshold. Experiments show that recasting quality assessment from a regression problem as a classification problem significantly improves RISA's performance.
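One plausible way to realize this regression-to-classification conversion is sketched below: a scalar quality label in [0, 1] is expanded into one binary target per threshold, each head is trained with binary cross-entropy, and averaging the head probabilities decodes a scalar score. The evenly spaced thresholds are an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def label_to_binary_targets(labels, num_classifiers=10):
    """Expand scalar quality labels in [0, 1] into one binary target per
    classifier: target_k = 1 if the label exceeds the k-th threshold.
    Evenly spaced thresholds are an assumption of this sketch."""
    thresholds = torch.linspace(0.0, 1.0, num_classifiers + 2)[1:-1]  # (K,)
    return (labels.unsqueeze(1) > thresholds.unsqueeze(0)).float()   # (B, K)

def weakly_supervised_loss(probs, labels):
    """Binary cross-entropy between each head's probability (the (B, K)
    prediction vector before averaging) and its binary target."""
    targets = label_to_binary_targets(labels, probs.shape[1])
    return F.binary_cross_entropy(probs, targets)
```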
The loss function is designed with three components in mind: 1) a weakly supervised loss, which fits reference/generated image pairs to their quality labels; 2) an unsupervised contrastive learning loss, which captures the stylistic similarity between the reference image and the generated image; and 3) an upper-bound loss, which learns stylistic consistency between two augmented views of the same real image.
For the upper-bound loss, the two augmented views of a real image carry exactly the same style information, so RISA's prediction on this input pair should correspond to the highest quality score of 1.
For the contrastive learning loss, the paper first applies two different data augmentations to the reference image that preserve its style information, i.e., only scaling, cropping, and flipping. A generated image and its corresponding augmented reference image form a positive pair, whose prediction outputs the contrastive loss pulls together; within the same batch, a generated image and the reference images of other samples form negative pairs, whose prediction outputs the contrastive loss pushes apart.
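Since the article does not give the exact form of these two unsupervised terms, the sketch below is one plausible realization: a squared distance pulls positive-pair predictions together, a margin hinge pushes negative-pair predictions apart, and real-image pairs are driven toward a score of 1. The margin form and distance choice are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pred_view1, pred_view2, pred_noncorr, margin=0.5):
    """pred_view1 / pred_view2: RISA scores for a generated image paired with
    two style-preserving augmentations of its own reference (positive pair).
    pred_noncorr: scores for the same image paired with other references in
    the batch (negative pairs). Squared-distance attraction plus a margin
    hinge repulsion; both forms are assumptions of this sketch."""
    pull = (pred_view1 - pred_view2).pow(2).mean()
    push = F.relu(margin - (pred_view1 - pred_noncorr).abs()).pow(2).mean()
    return pull + push

def upper_bound_loss(pred_real_pair):
    """Two augmented views of the same real image carry identical style, so
    RISA's score on this pair is driven to the maximum quality of 1."""
    return F.mse_loss(pred_real_pair, torch.ones_like(pred_real_pair))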
Experimental results
Based on four generative models, multiple RISA models are trained on the generated images of five datasets. First, from a visual standpoint, the figure below shows that RISA assigns evaluation scores that track image quality from low to high.
Then, from the perspective of quantitative metrics, extensive human evaluation tests were conducted to show that RISA's evaluations are highly consistent with human subjective judgments. Specifically, thousands of triplet samples were collected for each task, each containing one reference image and two generated images. The two generated images may come from two intermediate models at different training stages of the same architecture, or from two sufficiently converged models of different architectures. Testers were asked to choose the better of the two generated images. For each task, at least three testers evaluated each triplet, and only triplets on which all testers agreed were retained, in order to measure the consistency between RISA's evaluations and human subjective judgments.
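As an illustration of how such consistency might be quantified, the hypothetical helper below computes the fraction of unanimous triplets on which a metric ranks the two generated images the same way as the human choice; it is not code from the paper.

```python
def triplet_agreement(scores_a, scores_b, human_prefers_a):
    """Fraction of unanimous triplets where the metric ranks generated image A
    above B exactly when the human testers preferred A (hypothetical helper)."""
    matches = sum((sa > sb) == pref
                  for sa, sb, pref in zip(scores_a, scores_b, human_prefers_a))
    return matches / len(scores_a)
```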
The table below covers the case where RISA's training and test data are generated by models of the same architecture. RISA's evaluations show higher consistency with human subjective judgments than existing mainstream full-reference and no-reference single-image quality assessment methods.
The table below covers the case where RISA's training and test data are generated by models of different architectures. The results further show that RISA transfers well across models.
Accordingly, the researchers provide a visual comparison on triplets between RISA and the best baseline method for each dataset. RISA evaluates the stylistic similarity between the generated image and the reference image while also accounting for the realism of the generated image.
Finally, the researchers conducted two sets of ablation experiments to demonstrate the importance of RISA's multiple binary classifiers, pixel-wise interpolation, and each of its loss terms.