
Paper: 《First Order Motion Model for Image Animation》 Translation and Interpretation (Part 1)

Abstract

Image animation consists of generating a video sequence so that an object in a source image is animated according to the motion of a driving video. Our framework addresses this problem without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g. faces, human bodies), our method can be applied to any object of this class. To achieve this, we decouple appearance and motion information using a self-supervised formulation. To support complex motions, we use a representation consisting of a set of learned keypoints along with their local affine transformations. A generator network models occlusions arising during target motions and combines the appearance extracted from the source image and the motion derived from the driving video. Our framework scores best on diverse benchmarks and on a variety of object categories.

1 Introduction  

Generating videos by animating objects in still images has countless applications across areas of  interest including movie production, photography, and e-commerce. More precisely, image animation  refers to the task of automatically synthesizing videos by combining the appearance extracted from  a source image with motion patterns derived from a driving video. For instance, a face image of a  certain person can be animated following the facial expressions of another individual (see Fig. 1). In  the literature, most methods tackle this problem by assuming strong priors on the object representation  (e.g. 3D model) [4] and resorting to computer graphics techniques [6, 33]. These approaches can  be referred to as object-specific methods, as they assume knowledge about the model of the specific  object to animate.  

通過在靜态圖像中動畫對象來生成視訊有無數的應用程式,涉及的領域包括電影制作、攝影和電子商務。更準确地說,圖像動畫是指通過将從源圖像中提取的外觀與從駕駛視訊中提取的運動模式結合起來,自動合成視訊的任務。例如,一個人的面部圖像可以根據另一個人的面部表情進行動畫處理(見圖1),在文獻中,大多數方法通過對對象表示(如3D模型)[4]假設強先驗并借助于計算機圖形技術來解決這個問題[6,33]。這些方法可以被稱為特定對象的方法,因為它們假定了解要動畫的特定對象的模型。

Recently, deep generative models have emerged as effective techniques for image animation and video retargeting [2, 41, 3, 42, 27, 28, 37, 40, 31, 21]. In particular, Generative Adversarial Networks (GANs) [14] and Variational Auto-Encoders (VAEs) [20] have been used to transfer facial expressions [37] or motion patterns [3] between human subjects in videos. Nevertheless, these approaches usually rely on pre-trained models in order to extract object-specific representations such as keypoint locations. Unfortunately, these pre-trained models are built using costly ground-truth data annotations [2, 27, 31] and are not available in general for an arbitrary object category. To address this issue, Siarohin et al. [28] recently introduced Monkey-Net, the first object-agnostic deep model for image animation. Monkey-Net encodes motion information via keypoints learned in a self-supervised fashion. At test time, the source image is animated according to the corresponding keypoint trajectories estimated in the driving video. The major weakness of Monkey-Net is that it poorly models object appearance transformations in the keypoint neighborhoods, assuming a zeroth-order model (as we show in Sec. 3.1). This leads to poor generation quality in the case of large object pose changes (see Fig. 4). To tackle this issue, we propose to use a set of self-learned keypoints together with local affine transformations to model complex motions. We therefore call our method a first-order motion model.

Second, we introduce an occlusion-aware generator, which adopts an occlusion mask  automatically estimated to indicate object parts that are not visible in the source image and that  should be inferred from the context. This is especially needed when the driving video contains large  motion patterns and occlusions are typical.

Third, we extend the equivariance loss commonly used  for keypoints detector training [18, 44], to improve the estimation of local affine transformations.  Fourth, we experimentally show that our method significantly outperforms state-of-the-art image  animation methods and can handle high-resolution datasets where other approaches generally fail.

Finally, we release a new high resolution dataset, Thai-Chi-HD, which we believe could become a  reference benchmark for evaluating frameworks for image animation and video generation.


2 Related work  

Video Generation. Earlier works on deep video generation discussed how spatio-temporal neural  networks could render video frames from noise vectors [36, 26]. More recently, several approaches  tackled the problem of conditional video generation. For instance, Wang et al. [38] combine a  recurrent neural network with a VAE in order to generate face videos. Considering a wider range  of applications, Tulyakov et al. [34] introduced MoCoGAN, a recurrent architecture adversarially  trained in order to synthesize videos from noise, categorical labels or static images. Another typical  case of conditional generation is the problem of future frame prediction, in which the generated video  is conditioned on the initial frame [12, 23, 30, 35, 44]. Note that in this task, realistic predictions can  be obtained by simply warping the initial video frame [1, 12, 35]. Our approach is closely related to these previous works since we use a warping formulation to generate video sequences. However,  in the case of image animation, the applied spatial deformations are not predicted but given by the  driving video.


Image Animation. Traditional approaches for image animation and video re-targeting [6, 33, 13] were designed for specific domains such as faces [45, 42], human silhouettes [8, 37, 27] or gestures [31], and required a strong prior on the animated object. For example, in face animation, the method of Zollhofer et al. [45] produced realistic results at the expense of relying on a 3D morphable model of the face. In many applications, however, such models are not available. Image animation can also be treated as a translation problem from one visual domain to another. For instance, Wang et al. [37] transferred human motion using the image-to-image translation framework of Isola et al. [16]. Similarly, Bansal et al. [3] extended conditional GANs by incorporating spatio-temporal cues in order to improve video translation between two given domains. In order to animate a single person, such approaches require hours of videos of that person labelled with semantic information, and therefore have to be retrained for each individual. In contrast to these works, we neither rely on labels or prior information about the animated objects, nor on specific training procedures for each object instance. Furthermore, our approach can be applied to any object within the same category (e.g., faces, human bodies, robot arms, etc.).


Several approaches were proposed that do not require priors about the object. X2Face [40] uses a dense motion field in order to generate the output video via image warping. Similarly to us, they employ a reference pose that is used to obtain a canonical representation of the object. In our formulation, we do not require an explicit reference pose, leading to significantly simpler optimization and improved image quality. Siarohin et al. [28] introduced Monkey-Net, a self-supervised framework for animating arbitrary objects by using sparse keypoint trajectories. In this work, we also employ sparse trajectories induced by self-supervised keypoints. However, we model object motion in the neighbourhood of each predicted keypoint by a local affine transformation. Additionally, we explicitly model occlusions in order to indicate to the generator network the image regions that can be generated by warping the source image and the occluded areas that need to be inpainted.

3 Method

We are interested in animating an object depicted in a source image S based on the motion of a similar object in a driving video D. Since direct supervision is not available (pairs of videos in which objects move similarly), we follow a self-supervised strategy inspired from Monkey-Net [28]. For training, we employ a large collection of video sequences containing objects of the same object category. Our model is trained to reconstruct the training videos by combining a single frame and a learned latent representation of the motion in the video. Observing frame pairs, each extracted from the same video, it learns to encode motion as a combination of motion-specific keypoint displacements and local affine transformations. At test time we apply our model to pairs composed of the source image and of each frame of the driving video and perform image animation of the source object.
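To make this training strategy concrete, the following is a minimal sketch of a single training iteration, written in PyTorch-style Python. It is a hedged illustration rather than the released implementation: `motion_estimator`, `generator` and `loss_fn` are placeholder callables standing in for the motion estimation module, the image generation module and the reconstruction objective described below.

```python
def training_step(source, driving, motion_estimator, generator, loss_fn, optimizer):
    """One self-supervised reconstruction step (sketch, not the released code).

    `source` and `driving` are two frames sampled from the same training video;
    the model must re-render `driving` from `source` plus the estimated motion.
    """
    motion = motion_estimator(source, driving)   # e.g. dense flow + occlusion mask
    reconstruction = generator(source, motion)
    loss = loss_fn(reconstruction, driving)      # e.g. a reconstruction loss against the driving frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```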

An overview of our approach is presented in Fig. 2. Our framework is composed of two main modules: the motion estimation module and the image generation module. The purpose of the motion estimation module is to predict a dense motion field from a frame D ∈ R^{3×H×W} of dimension H × W of the driving video D to the source frame S ∈ R^{3×H×W}. The dense motion field is later used to align the feature maps computed from S with the object pose in D. The motion field is modeled by a function T_{S←D}: R^2 → R^2 that maps each pixel location in D to its corresponding location in S. T_{S←D} is often referred to as backward optical flow. We employ backward optical flow, rather than forward optical flow, since back-warping can be implemented efficiently in a differentiable manner using bilinear sampling [17]. We assume there exists an abstract reference frame R. We independently estimate two transformations: from R to S (T_{S←R}) and from R to D (T_{D←R}). Note that unlike X2Face [40], the reference frame is an abstract concept that cancels out in our derivations later. Therefore it is never explicitly computed and cannot be visualized. This choice allows us to independently process D and S. This is desired since, at test time, the model receives pairs of the source image and driving frames sampled from a different video, which can be very different visually. Instead of directly predicting T_{D←R} and T_{S←R}, the motion estimator module proceeds in two steps.
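The back-warping operation mentioned above can be implemented with differentiable bilinear sampling. Below is a minimal PyTorch sketch (an illustration under the assumption that the backward map is given as absolute pixel coordinates into S for every pixel of D; it is not the authors' code).

```python
import torch
import torch.nn.functional as F

def backward_warp(source, coords):
    """Warp `source` (N, C, H, W) with a backward map `coords` (N, H, W, 2).

    coords[n, y, x] holds the (x, y) location in `source` from which output pixel
    (x, y) is sampled, i.e. it plays the role of T_{S<-D}. Bilinear sampling keeps
    the operation differentiable with respect to both inputs.
    """
    n, c, h, w = source.shape
    # Normalize absolute pixel coordinates to the [-1, 1] range expected by grid_sample.
    grid_x = 2.0 * coords[..., 0] / (w - 1) - 1.0
    grid_y = 2.0 * coords[..., 1] / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)               # (N, H, W, 2)
    return F.grid_sample(source, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

# Toy check: the identity map reproduces the source frame.
src = torch.rand(1, 3, 64, 64)
ys, xs = torch.meshgrid(torch.arange(64.0), torch.arange(64.0), indexing="ij")
identity = torch.stack([xs, ys], dim=-1).unsqueeze(0)          # (1, 64, 64, 2)
assert torch.allclose(backward_warp(src, identity), src, atol=1e-5)
```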

In the first step, we approximate both transformations from sets of sparse trajectories, obtained by using keypoints learned in a self-supervised way. The locations of the keypoints in D and S are separately predicted by an encoder-decoder network. The keypoint representation acts as a bottleneck resulting in a compact motion representation. As shown by Siarohin et al. [28], such a sparse motion representation is well-suited for animation since, at test time, the keypoints of the source image can be moved using the keypoint trajectories in the driving video. We model motion in the neighbourhood of each keypoint using local affine transformations. Compared to using keypoint displacements only, the local affine transformations allow us to model a larger family of transformations. We use a Taylor expansion to represent T_{D←R} by a set of keypoint locations and affine transformations. To this end, the keypoint detector network outputs keypoint locations as well as the parameters of each affine transformation.
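The paragraph above implies a keypoint predictor that returns, for each of the K keypoints, a 2D location together with the 2×2 Jacobian of a local affine transformation. The sketch below shows one plausible head for such a network (assumptions: locations are read out from softmax heatmaps with a soft-argmax, and the Jacobians come from a separate convolutional branch pooled with the same heatmaps; this is an illustration, not the released architecture).

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """Predicts K keypoint locations and 2x2 local affine Jacobians from features."""

    def __init__(self, in_channels, num_kp):
        super().__init__()
        self.num_kp = num_kp
        self.heatmap = nn.Conv2d(in_channels, num_kp, kernel_size=7, padding=3)
        self.jacobian = nn.Conv2d(in_channels, 4 * num_kp, kernel_size=7, padding=3)

    def forward(self, feats):
        n, _, h, w = feats.shape
        # Soft-argmax: normalize each heatmap to a distribution and take its mean location.
        hm = self.heatmap(feats).view(n, self.num_kp, -1).softmax(dim=-1)
        hm = hm.view(n, self.num_kp, h, w)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feats.device),
            torch.linspace(-1, 1, w, device=feats.device), indexing="ij")
        kp_x = (hm * xs).sum(dim=(2, 3))
        kp_y = (hm * ys).sum(dim=(2, 3))
        kp = torch.stack([kp_x, kp_y], dim=-1)                      # (N, K, 2)

        # One 2x2 Jacobian per keypoint, pooled with the same heatmap as weights.
        jac = self.jacobian(feats).view(n, self.num_kp, 4, h, w)
        jac = (jac * hm.unsqueeze(2)).sum(dim=(3, 4)).view(n, self.num_kp, 2, 2)
        return kp, jac
```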

During the second step, a dense motion network combines the local approximations to obtain the resulting dense motion field T̂_{S←D}. In addition to the dense motion field, this network outputs an occlusion mask Ô_{S←D} that indicates which image parts of D can be reconstructed by warping the source image and which parts should be inpainted, i.e. inferred from the context.
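Concretely, one natural way to realize this combination (a sketch consistent with the description above, not a verbatim restatement of the paper's equations) is a pixel-wise convex combination of the K local flows and an identity flow for the background,

T̂_{S←D}(z) ≈ M_0(z) z + Σ_{k=1}^{K} M_k(z) T_k(z),

where T_k denotes the local approximation attached to the k-th keypoint and the weights M_0, ..., M_K (e.g. a softmax over K+1 predicted channels) are estimated by the dense motion network together with the occlusion mask Ô_{S←D}.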

Finally, the generation module renders an image of the source object moving as provided in the driving video. Here, we use a generator network G that warps the source image according to T̂_{S←D} and inpaints the image parts that are occluded in the source image. In the following sections we detail each of these steps and the training procedure.
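Putting these pieces together, a minimal sketch of the generation step could look as follows. It reuses `backward_warp` from the sketch above; `encode` and `decode` are placeholder names for the generator's encoder and decoder, and applying the occlusion mask to warped features (rather than pixels) is an assumption about one reasonable design.

```python
def generate(source_image, dense_flow, occlusion_mask, encode, decode):
    """Hedged sketch of an occlusion-aware generator G (not the released code)."""
    feats = encode(source_image)                # source appearance features (N, C, H', W')
    warped = backward_warp(feats, dense_flow)   # dense_flow at feature resolution, (N, H', W', 2)
    visible = occlusion_mask * warped           # keep only parts explained by warping...
    return decode(visible)                      # ...and let the decoder inpaint the rest
```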


3.1 Local Affine Transformations for Approximate Motion Description

The motion estimation module estimates the backward optical flow T_{S←D} from a driving frame D to the source frame S. As discussed above, we propose to approximate T_{S←D} by its first-order Taylor expansion in a neighborhood of the keypoint locations. In the rest of this section, we describe the motivation behind this choice and detail the proposed approximation of T_{S←D}.

We assume there exists an abstract reference frame R. Therefore, estimating T_{S←D} consists in estimating T_{S←R} and T_{R←D}. Furthermore, given a frame X, we estimate each transformation T_{X←R} in the neighbourhood of the learned keypoints. Formally, given a transformation T_{X←R}, we consider its first-order Taylor expansion in K keypoints p_1, ..., p_K. Here, p_1, ..., p_K denote the coordinates of the keypoints in the reference frame R. Note that, for the sake of simplicity, in the following the point locations in the reference pose space are all denoted by p, while the point locations in the X, S or D pose spaces are denoted by z. We obtain:
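T_{X←R}(p) = T_{X←R}(p_k) + ( d/dp T_{X←R}(p) |_{p=p_k} ) (p − p_k) + o(‖p − p_k‖),

i.e. in the neighbourhood of p_k the transformation T_{X←R} is described by its value at p_k together with its Jacobian evaluated at p_k.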


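Composing the two expansions through T_{S←D} = T_{S←R} ∘ T_{D←R}^{-1} then gives, in the neighbourhood of each keypoint, a local first-order approximation of the form

T_{S←D}(z) ≈ T_{S←R}(p_k) + J_k ( z − T_{D←R}(p_k) ),   with   J_k = ( d/dp T_{S←R}(p) |_{p=p_k} ) ( d/dp T_{D←R}(p) |_{p=p_k} )^{-1},

so each keypoint contributes a translation T_{S←R}(p_k) plus a 2×2 Jacobian J_k; this Jacobian is the affine part referred to above.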

Combining Local Motions. We employ a convolutional network P to estimate T̂_{S←D} from the set of Taylor approximations of T_{S←D}(z) in the keypoints and the original source frame S. Importantly, since T̂_{S←D} maps each pixel location in D to its corresponding location in S, the local patterns in T̂_{S←D}, such as edges or texture, are pixel-to-pixel aligned with D but not with S. This misalignment issue makes it harder for the network to predict T̂_{S←D} from S. In order to provide inputs already roughly aligned with T̂_{S←D}, we warp the source frame S according to the local transformations estimated in Eq. (4). Thus, we obtain K transformed images S^1, ..., S^K that are each aligned with T̂_{S←D} in the neighbourhood of a keypoint. Importantly, we also consider an additional image S^0 = S for the background.

For each keypoint p_k we additionally compute heatmaps H_k indicating to the dense motion network where each transformation happens. Each H_k(z) is implemented as the difference of two heatmaps centered in T_{D←R}(p_k) and T_{S←R}(p_k).
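A standard choice for such heatmaps, assumed here for concreteness, is a pair of Gaussians with a fixed variance σ, giving

H_k(z) = exp( −‖ T_{D←R}(p_k) − z ‖² / σ ) − exp( −‖ T_{S←R}(p_k) − z ‖² / σ ),

so that H_k is positive around the keypoint position in the driving frame and negative around its position in the source frame.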

