
Combining Traditional and Deep Learning to Maximize Face Forgery Detection Accuracy (with paper download)

Author: Institute of Computer Vision


Paper address: https://arxiv.org/pdf/2007.09355.pdf

Special column of the Institute of Computer Vision

What I am sharing today is a somewhat older technique, but I find it particularly interesting and worth deeper thought by researchers in this field, since its ideas can help improve the accuracy of other detection methods.

01

Brief introduction

With the remarkable progress of real-world face manipulation technology, concern that these techniques may be maliciously misused has spawned the new research topic of face forgery detection. The task is extremely challenging, however, because recent advances make it possible to create fakes beyond human perception, especially in compressed images and videos. The authors find that mining forgery patterns with frequency awareness can be a remedy, since frequency provides a complementary viewpoint that well describes subtle forgery artifacts and compression errors. To introduce frequency into face forgery detection, a new Frequency in Face Forgery Network (F3-Net) is proposed. It exploits two different but complementary frequency-aware clues, namely 1) frequency-aware decomposed image components and 2) local frequency statistics, and deeply mines forgery patterns through a two-stream collaborative learning framework. The Discrete Cosine Transform (DCT) is applied as the frequency-domain transform. Comprehensive experiments show that the proposed F3-Net significantly outperforms competing state-of-the-art methods at all compression qualities on the challenging FaceForensics++ dataset, especially on low-quality media.

02

Background

State-of-the-art face manipulation algorithms, such as DeepFake, FaceSwap, Face2Face, and NeuralTextures, are already capable of hiding their forgery artifacts, making it extremely difficult to find flaws in these refined fakes, as shown in figure (a) below.

[Figure: (a) examples of manipulated faces; (b) frequency-aware clues; (c) ROC comparison with Xception]

To make matters worse, when the visual quality of a fake face is greatly reduced, for example by JPEG or H.264 compression at a large compression ratio, the forgery artifacts are contaminated by compression errors and sometimes cannot be captured in the RGB domain at all. Fortunately, as many prior studies have shown, these artifacts can be captured in the frequency domain as unusual frequency distributions compared with real faces. This raises a question: how can frequency-aware cues be incorporated into deep CNN models? Conventional frequency-domain transforms such as the FFT and DCT do not preserve the shift invariance and local consistency of natural images, so ordinary CNN structures cannot be applied to them directly. If we want to use the discriminative representation power of a learnable CNN for frequency-aware face forgery detection, a CNN-compatible frequency representation therefore becomes crucial. To this end, two kinds of frequency-aware forgery clues are introduced that are compatible with knowledge mining by deep convolutional networks.
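To make the frequency-domain viewpoint concrete, the sketch below (plain NumPy/SciPy, not the paper's code) computes a 2D DCT of a smooth image and shows that its energy concentrates in the low-frequency coefficients; forgery artifacts and compression errors perturb exactly this kind of distribution.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(block):
    """Type-II 2D DCT with orthonormal scaling, applied along rows then columns."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")

def idct2(coeffs):
    """Inverse 2D DCT; exactly reverses dct2 under orthonormal scaling."""
    return idct(idct(coeffs, axis=0, norm="ortho"), axis=1, norm="ortho")

# A smooth gradient image concentrates its energy in low-frequency DCT bins.
img = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
coeffs = dct2(img)
total = np.sum(coeffs ** 2)
low = np.sum(coeffs[:8, :8] ** 2)  # top-left corner holds the lowest frequencies
print(f"low-frequency energy share: {low / total:.4f}")
```

For a natural, smooth image this share is close to 1; a forged or recompressed face tends to show anomalous energy in the higher-frequency bins.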

On the one hand, an image can be decomposed by separating its frequency signal, with each decomposed component corresponding to a specific frequency band. The first frequency-aware forgery clue follows intuitively: subtle artifacts (in the form of unusual patterns) become noticeable in the higher-frequency decomposed components, as shown in the middle column of figure (b) above. This clue is compatible with CNN structures and is surprisingly robust to compression artifacts.

On the other hand, the decomposed image components describe frequency-aware patterns in the spatial domain, but do not explicitly present frequency information to the neural network. The second frequency-aware forgery cue is therefore local frequency statistics. In each densely but regularly sampled local patch, statistics are gathered by counting the average frequency response in each band. These statistics are then regrouped into a multi-channel spatial map whose number of channels equals the number of frequency bands. As shown in the last column of figure (b) above, fake faces have local frequency statistics that differ from those of the corresponding real faces, even though the two look almost identical in the RGB image. Moreover, because the local frequency statistics follow the spatial layout of the input RGB image, they also enjoy the effective representation learning provided by CNNs. Finally, since the decomposed image components and the local frequency statistics are complementary yet share essentially similar frequency-aware semantics, they can be fused progressively during feature learning.
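The steps above can be sketched as follows. This is a minimal illustration of the local-frequency-statistics idea, not the paper's implementation: the window size, stride, and band partition here are illustrative choices, and the statistic used is the average log-amplitude of the DCT coefficients in each band.

```python
import numpy as np
from scipy.fftpack import dct

def local_frequency_statistics(img, win=8, n_bands=4):
    """Sketch of LFS: sliding-window DCT, then per-band average log-amplitude.

    Window size, stride, and band partition are illustrative, not the
    paper's exact configuration.
    """
    h, w = img.shape
    # Assign each DCT coefficient to a band by its distance u+v from DC.
    u, v = np.meshgrid(np.arange(win), np.arange(win), indexing="ij")
    band = np.minimum((u + v) * n_bands // (2 * win - 1), n_bands - 1)

    out = np.zeros((h // win, w // win, n_bands))
    for i in range(0, h - win + 1, win):
        for j in range(0, w - win + 1, win):
            patch = img[i:i + win, j:j + win]
            c = dct(dct(patch, axis=0, norm="ortho"), axis=1, norm="ortho")
            amp = np.log1p(np.abs(c))  # log-amplitude as the frequency response
            for b in range(n_bands):
                out[i // win, j // win, b] = amp[band == b].mean()
    # A multi-channel map that follows the spatial layout of the input.
    return out

stats = local_frequency_statistics(np.random.default_rng(0).random((64, 64)))
print(stats.shape)  # → (8, 8, 4)
```

Because the output keeps the spatial arrangement of the patches, it can be fed to a CNN just like an ordinary multi-channel image.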

03

Detailed analysis of the new framework

Therefore, a novel Frequency in Face Forgery Network (F3-Net) is proposed that makes use of the above frequency-aware forgery clues. The framework consists of two frequency-aware branches: one learns subtle forgery patterns through Frequency-aware Image Decomposition (FAD), and the other extracts high-level semantics from Local Frequency Statistics (LFS) to describe the frequency-aware statistical differences between real and fake faces. The two branches are progressively fused through a cross-attention module, MixBlock, which encourages rich interaction between the FAD and LFS branches. The whole face forgery detection model is learned end-to-end with a cross-entropy loss.
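To illustrate the cross-attention fusion in spirit, here is a toy sketch in NumPy: features from one branch attend to the other and are fused residually. The projection matrices are random stand-ins for learned weights, and the shapes (a flattened 7x7 map with 32 channels) are arbitrary example values, not the paper's architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feat, key_feat, d=16, seed=0):
    """Toy cross-attention between two branches, MixBlock-style.

    query_feat, key_feat: (n_positions, channels) flattened feature maps.
    The projections wq, wk stand in for learned weights.
    """
    rng = np.random.default_rng(seed)
    wq = rng.standard_normal((query_feat.shape[1], d)) / np.sqrt(d)
    wk = rng.standard_normal((key_feat.shape[1], d)) / np.sqrt(d)
    q, k = query_feat @ wq, key_feat @ wk
    attn = softmax(q @ k.T / np.sqrt(d))   # (n_q, n_k) attention map
    return query_feat + attn @ key_feat    # residual fusion with the other branch

fad_feat = np.random.default_rng(1).random((49, 32))  # e.g. a 7x7 map, 32 channels
lfs_feat = np.random.default_rng(2).random((49, 32))
fused = cross_attention(fad_feat, lfs_feat)
print(fused.shape)  # → (49, 32)
```

In the actual network this interaction is applied in both directions and at multiple stages so that the two branches converge gradually.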

Thorough ablation studies show that the proposed F3-Net significantly improves performance on low-quality forged media. It is also shown that on the challenging FaceForensics++ benchmark, the new framework significantly outperforms competing state-of-the-art methods at all compression qualities. As shown in figure (c) above, comparing ROC curves with Xception clearly demonstrates the effectiveness and superiority of the frequency-aware F3-Net.

[Figure: overview of the proposed F3-Net architecture]

The proposed architecture consists of three novel components: FAD, which learns subtle manipulation patterns through frequency-aware image decomposition; LFS, which extracts local frequency statistics; and MixBlock, which enables collaborative feature interaction.

FAD: Frequency-Aware Decomposition

[Figure: the FAD module]

For frequency-aware image decomposition, previous studies typically apply hand-crafted filter banks in the spatial domain, which cannot cover the full frequency domain; moreover, a fixed filter configuration makes it hard to adaptively capture forgery patterns. To this end, a novel Frequency-aware Decomposition (FAD) is proposed that adaptively partitions the input image in the frequency domain according to a set of learnable frequency filters. The decomposed frequency components can be inversely transformed back to the spatial domain, yielding a series of frequency-aware image components. These components are stacked along the channel axis and then fed into a convolutional neural network (the implementation uses Xception as the backbone) to fully mine forgery patterns.
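The decomposition step can be sketched as below. This toy version uses *fixed* band-pass masks over the DCT spectrum so that it is runnable without training; the actual FAD adds a learnable component to each base filter, and the band edges here are illustrative.

```python
import numpy as np
from scipy.fftpack import dct, idct

def dct2(x):
    return dct(dct(x, axis=0, norm="ortho"), axis=1, norm="ortho")

def idct2(x):
    return idct(idct(x, axis=0, norm="ortho"), axis=1, norm="ortho")

def frequency_decompose(img, n_bands=3):
    """Sketch of the FAD idea with fixed band-pass masks.

    F3-Net's real filters are partly learnable; here each DCT coefficient
    is simply assigned to one of n_bands by its coarse distance u+v from DC.
    """
    h, w = img.shape
    u, v = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    radius = u + v
    edges = np.linspace(0, radius.max() + 1, n_bands + 1)
    coeffs = dct2(img)
    components = []
    for b in range(n_bands):
        mask = (radius >= edges[b]) & (radius < edges[b + 1])
        components.append(idct2(coeffs * mask))  # back to the spatial domain
    # Stacked along the channel axis before entering the CNN backbone.
    return np.stack(components, axis=0)

img = np.random.default_rng(0).random((32, 32))
comps = frequency_decompose(img)
print(comps.shape)  # → (3, 32, 32)
```

Because the masks partition the spectrum, the band components sum back to the original image; the higher-frequency channels are where subtle forgery artifacts tend to stand out.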

LFS: Local Frequency Statistics

[Figure: the LFS module]

(a) Local Frequency Statistics (LFS) extracts statistical information in the local frequency domain. SWDCT denotes applying a sliding-window discrete cosine transform, and H denotes adaptively gathering statistics on each grid. (b) Extraction of statistics from the DCT power spectrum.

[Figure: the proposed MixBlock]

04

Experimentation and visualization

The comparison in the table below clearly shows F3-Net's performance on low-quality images, indicating that detection in the frequency domain indeed has better resistance to compression.

[Table: quantitative comparison on FaceForensics++ across compression qualities]

t-SNE embedding visualization of the baseline (a) and F3-Net (b) on the FaceForensics++ low-quality (LQ) task. Red represents real videos; the other colors represent data generated by different manipulation methods.


© THE END

For reprint authorization, please contact this official account.



ABOUT

Institute of Computer Vision

The Institute of Computer Vision works mainly in the field of deep learning, focusing on object detection, object tracking, image segmentation, OCR, model quantization, model deployment, and other research directions. The institute shares the latest paper algorithms and new frameworks every day, provides one-click paper downloads, and shares practical projects. It emphasizes both "technical research" and "practical implementation", sharing hands-on practice across different fields so that everyone can truly move beyond pure theory and cultivate the habit of hands-on programming and active thinking!


  • Sparse R-CNN: a sparse framework for end-to-end object detection (with source code)
