Self-Supervised VQ-VAE for One-Shot Music Style Transfer (cs.SD)
Ondřej Cífka, Alexey Ozerov, Umut Şimşekli, Gaël Richard
Neural style transfer, which applies the artistic style of one image to another, became one of the most widely showcased computer vision applications shortly after its introduction. In contrast, related tasks in the music audio domain remained largely untackled until recently. While several style conversion methods tailored to musical signals have been proposed, most lack the 'one-shot' capability of classical image style transfer algorithms. On the other hand, the results of existing one-shot audio style transfer methods on musical inputs are not as compelling. In this work, we are specifically interested in the problem of one-shot timbre transfer. We present a novel method for this task, based on an extension of the vector-quantized variational autoencoder (VQ-VAE), along with a simple self-supervised learning strategy designed to obtain disentangled representations of timbre and pitch. We evaluate the method using a set of objective metrics and show that it outperforms selected baselines.
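For readers unfamiliar with VQ-VAE, the following is a minimal sketch of the generic vector-quantization step that such models are built on: each continuous encoder output is replaced by its nearest vector in a learned codebook. It is not the authors' architecture or their self-supervised strategy, neither of which the abstract details. The function name `quantize`, the toy `codebook`, and the input values are all hypothetical and chosen only for illustration.

```python
import numpy as np

def quantize(z, codebook):
    """Replace each row of z (frames x dim) with its nearest codebook vector."""
    # Squared Euclidean distance between every frame and every code,
    # computed by broadcasting to shape (frames, codes, dim).
    dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)       # index of the nearest code per frame
    return codebook[idx], idx

# Hypothetical toy codebook with four 2-dimensional codes.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Hypothetical encoder outputs for three frames.
z = np.array([[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]])

z_q, idx = quantize(z, codebook)
print(idx)   # discrete code indices, one per frame
print(z_q)   # quantized vectors that a decoder would receive
```

In a trained VQ-VAE the codebook is learned jointly with the encoder and decoder; this sketch covers only the lookup, which is what makes the latent representation discrete.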
https://arxiv.org/abs/2102.05749