Self-Supervised VQ-VAE for One-Shot Music Style Transfer (cs.SD)
Ondřej Cífka, Alexey Ozerov, Umut Şimşekli, Gaël Richard
Neural style transfer, which applies the artistic style of one image to another, became one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained largely untackled until recently. While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms. Conversely, existing one-shot audio style transfer methods produce less compelling results on musical inputs. In this work, we focus specifically on the problem of one-shot timbre transfer. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. We evaluate the method using a set of objective metrics and show that it is able to outperform selected baselines.
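As a rough illustration of the vector quantization that gives the VQ-VAE its name, the sketch below shows the generic nearest-codebook lookup used in a standard VQ-VAE. It is not the authors' implementation: the paper's extension and its self-supervised training strategy are not reproduced here, and the function name `vector_quantize`, the toy codebook, and the toy inputs are all hypothetical values chosen for illustration.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each row of z to its nearest codebook vector (Euclidean distance).

    z: (n, d) array standing in for encoder outputs (assumed shape).
    codebook: (k, d) array of code vectors (assumed shape).
    Returns (indices, quantized) with shapes (n,) and (n, d).
    """
    # Squared distance between every input vector and every code vector: (n, k)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Index of the closest code vector for each input vector
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

# Toy data, purely illustrative
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
indices, quantized = vector_quantize(z, codebook)
```

Each input vector is replaced by its closest code vector, so the quantized output can only take values from the finite codebook. This discrete bottleneck is the property a VQ-VAE relies on; how the paper uses it to separate pitch from timbre is beyond what the abstract specifies.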
https://arxiv.org/abs/2102.05749