S4D: Speaker Diarization Toolkit in Python

1French National Audiovisual Institute (INA), Paris, France

2Computer Science Laboratory of Le Mans University (LIUM - EA 4023), Le Mans, France

SIDEKIT for Diarization (S4D)

Abstract

In this paper, we present S4D, a new open-source Python toolkit dedicated to speaker diarization. S4D provides various state-of-the-art components and makes it easy to develop end-to-end diarization prototype systems. S4D offers a large panel of clustering, segmentation, scoring and visualization algorithms. S4D has been designed to be easy to understand, install, modify and use, in order to allow fast transfer of diarization technologies to industry and to facilitate the development of new approaches. Examples, benchmarks on standard tasks and tutorials are provided in this paper. S4D is an extension of SIDEKIT, the open-source toolkit for speaker recognition.

1. Introduction

The diarization task is a necessary pre-processing step for speaker identification [1] or speech transcription [2] when there is more than one speaker in an audio/video recording. For each speaker in a recording, it consists of detecting the time regions in which he or she speaks. Each time region, corresponding to a segment, is annotated with an abstract label representing the speaker. The diarization task thus makes it possible to determine who spoke when. This domain is still an active research area, since many problems remain unsolved, such as the detection of overlapped speech [3] or the labeling of speech overlapping with music.
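
As an illustration of what a "who spoke when" output looks like, the following minimal sketch represents a diarization as a list of labeled segments. The `Segment` type and the two helper functions are hypothetical names chosen for this example; they are not the S4D data structures.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    """One time region attributed to one abstract speaker label (hypothetical type)."""
    start: float   # seconds
    stop: float    # seconds
    speaker: str   # abstract label such as "S0", not a real identity


def speaking_time(segments):
    """Total speaking time per speaker label, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.stop - seg.start
    return dict(totals)


def who_speaks_at(segments, t):
    """Labels of all speakers active at time t (several if speech overlaps)."""
    return sorted({seg.speaker for seg in segments if seg.start <= t < seg.stop})
```

Returning a list from `who_speaks_at` leaves room for overlapped speech, where more than one label is active at the same instant.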

For the diarization task, few toolkits are available, and most of them are dedicated to research. Quick transfer of new technologies to industry requires tools that are close to industrial standards. To this end, a diarization toolkit should comply with the following requirements:

• be easy to understand, modify, install and use;

• enable end-to-end diarization system development;

• offer various state-of-the-art algorithms;

• manage standard data formats to allow compatibility with other tools.

To address the shortcomings of existing toolkits, we developed S4D, a new toolkit for diarization that fulfills the requirements above and facilitates the development of new approaches.

In this paper, we first present the context in which S4D has been developed. We then give a detailed description of the contents of S4D before providing a guide to developing a broadcast news diarization system. Finally, we explain how to deploy S4D before offering a few perspectives.

2. Context

This section presents the context in which S4D has been developed, other existing tools and the link with SIDEKIT [4].

2.1. Comparison and compatibilities with existing tools

Few tools are freely available for speaker diarization. S4D has been designed to overcome limitations of those tools.

LIUMSpkDiarization [5, 6] is a toolkit for diarization written in Java. It includes most state-of-the-art methods in the diarization field. This toolkit was developed by the Computer Science Laboratory of Le Mans University (LIUM) for the French ESTER2 evaluation campaign [7], where it obtained the best results for the task of diarization of broadcast news in 2008. This toolkit has two main drawbacks: it is no longer being updated, and it can only be executed from the command line through a jar file.

Pyannote.metrics [8] is a toolkit for reproducible evaluation, diagnostic and error analysis of diarization systems. It is a regularly updated project with a wide selection of metrics. S4D includes certain metrics from this toolkit to offer greater ease of use.

Pyannote.audio [9] is a toolkit for diarization. It only offers state-of-the-art methods, developed following the object-oriented paradigm, which makes it easy to extend. However, it requires considerable learning time.

2.2. SIDEKIT and S4D

SIDEKIT is an open-source package for speaker and language recognition developed by Anthony Larcher, Kong Aik Lee and Sylvain Meignier [4], which provides an end-to-end tool-chain including various state-of-the-art algorithms. SIDEKIT for Diarization (S4D) is an open-source extension of SIDEKIT dedicated to diarization. The aim of S4D is to provide an educational and efficient toolkit for diarization encompassing the whole processing chain, from the audio data to the analysis of the system performance. Furthermore, both SIDEKIT and S4D are written entirely in Python and have been tested under Python 3 on several platforms, for both Linux and macOS.

3. What is in S4D?

This section describes several uses currently offered by S4D.

3.1. Segmentation

The segmentation detects the instantaneous change points corresponding to segment boundaries. The proposed algorithm is based on the detection of local maxima. It detects the change points through a Gaussian Divergence (GD) [10], computed using Gaussians. The left and right Gaussians are estimated over the two halves of a window sliding along the whole signal. A change point, i.e. a segment boundary, is present in the middle of the window when the Gaussian divergence score reaches a local maximum.
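
The sliding-window detection can be sketched as follows. This is a minimal illustration under stated assumptions, not the S4D implementation: the Gaussians are assumed to have diagonal covariance, the divergence is assumed to be the squared mean difference normalized by the product of the two standard deviations, and the function names, the `half_window` size and the `min_score` floor are hypothetical.

```python
from statistics import fmean, pstdev


def gaussian_divergence(left, right, eps=1e-8):
    """Divergence between two diagonal Gaussians fitted on two lists of frames.

    Assumed form: sum over dimensions of (mu_l - mu_r)^2 / (sigma_l * sigma_r).
    Each frame is a sequence of floats (for example cepstral coefficients).
    """
    score = 0.0
    for col_l, col_r in zip(zip(*left), zip(*right)):
        diff = fmean(col_l) - fmean(col_r)
        score += diff * diff / ((pstdev(col_l) + eps) * (pstdev(col_r) + eps))
    return score


def detect_change_points(frames, half_window=50, min_score=1.0):
    """Return the frame indices where the divergence reaches a local maximum."""
    n = len(frames)
    scores = [0.0] * n
    # Slide the window: left half ends at t, right half starts at t.
    for t in range(half_window, n - half_window):
        scores[t] = gaussian_divergence(
            frames[t - half_window:t], frames[t:t + half_window]
        )
    change_points = []
    for t in range(half_window, n - half_window):
        lo, hi = max(0, t - half_window), min(n, t + half_window + 1)
        # Keep t only if it dominates its neighbourhood (hypothetical rule)
        # and the score is not just noise inside a homogeneous region.
        if scores[t] >= min_score and scores[t] == max(scores[lo:hi]):
            change_points.append(t)
    return change_points
```

The neighbourhood used for the local-maximum test is set to the window size here only for simplicity; in practice it is a tuning parameter.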

After a GD segmentation, a second pass over the signal fuses consecutive segments of the same speaker, from the start to the end of the recording. The measure used for fusing is the ∆BIC [11], based on the Bayesian Information Criterion. Alternatively, it is possible to use the BIC Square Root distance for the value of the penalty factor in the ∆BIC, as defined in [12].
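
The second pass can be sketched as a single left-to-right sweep. The code below is an assumption-laden illustration rather than the S4D code: it uses the standard ∆BIC between two Gaussians, simplified to diagonal covariances (so each Gaussian has 2d free parameters), and the names `delta_bic`, `fuse_consecutive` and `penalty_weight` are hypothetical.

```python
import math
from statistics import pvariance


def log_det_diag(frames, floor=1e-6):
    """Log-determinant of the diagonal covariance estimated on the frames."""
    return sum(math.log(max(pvariance(col), floor)) for col in zip(*frames))


def delta_bic(frames_i, frames_j, penalty_weight=1.0):
    """Delta-BIC between two segments, each modeled by one diagonal Gaussian.

    A negative value means that one Gaussian explains both segments well
    enough, i.e. the two segments can be fused.
    """
    n_i, n_j = len(frames_i), len(frames_j)
    n = n_i + n_j
    dim = len(frames_i[0])
    merged = list(frames_i) + list(frames_j)
    gain = 0.5 * (
        n * log_det_diag(merged)
        - n_i * log_det_diag(frames_i)
        - n_j * log_det_diag(frames_j)
    )
    # Diagonal simplification: d means + d variances = 2d free parameters.
    penalty = 0.5 * (2 * dim) * math.log(n)
    return gain - penalty_weight * penalty


def fuse_consecutive(segments, penalty_weight=1.0):
    """Fuse neighbouring segments (lists of frames, in time order)."""
    fused = [list(segments[0])]
    for seg in segments[1:]:
        if delta_bic(fused[-1], seg, penalty_weight) < 0:
            fused[-1].extend(seg)      # same speaker: grow the current segment
        else:
            fused.append(list(seg))    # speaker change: start a new segment
    return fused
```

Because only neighbouring segments are compared, the sweep is linear in the number of segments, unlike the clustering step, which compares all pairs.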

3.2. Clustering

S4D offers several methods for grouping clusters.

3.2.1. HAC BIC

The algorithm is based upon a Hierarchical Agglomerative Clustering (HAC). Each cluster is modeled by a Gaussian. The ∆BIC measure [11] is employed to select the candidate clusters to be grouped as well as to stop the merging process. The two closest clusters i and j are merged at each iteration until ∆BIC_{i,j} > 0.
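
The merging loop can be sketched with sufficient statistics, so that merging two clusters is just a sum. This is an illustrative sketch, not the S4D code: features are reduced to one dimension to keep it short, and the names `sufficient_stats`, `hac_bic` and `penalty_weight` are hypothetical.

```python
import math
from itertools import combinations


def sufficient_stats(values):
    """(count, sum, sum of squares) of a cluster of 1-D features."""
    return (len(values), sum(values), sum(v * v for v in values))


def merge(a, b):
    """Statistics of the union of two clusters."""
    return tuple(x + y for x, y in zip(a, b))


def log_var(stats, floor=1e-6):
    n, s, ss = stats
    mean = s / n
    return math.log(max(ss / n - mean * mean, floor))


def delta_bic(a, b, penalty_weight=1.0):
    """Delta-BIC between two clusters, each modeled by one 1-D Gaussian."""
    ab = merge(a, b)
    gain = 0.5 * (ab[0] * log_var(ab) - a[0] * log_var(a) - b[0] * log_var(b))
    penalty = 0.5 * 2 * math.log(ab[0])  # 2 free parameters: mean and variance
    return gain - penalty_weight * penalty


def hac_bic(clusters, penalty_weight=1.0):
    """Merge the closest pair until the smallest Delta-BIC is positive.

    `clusters` maps a label to its sufficient statistics.
    """
    clusters = dict(clusters)
    while len(clusters) > 1:
        (i, j), score = min(
            (((a, b), delta_bic(clusters[a], clusters[b], penalty_weight))
             for a, b in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        if score > 0:
            break  # stop criterion: no pair is better explained by one Gaussian
        clusters[i] = merge(clusters[i], clusters.pop(j))
    return clusters
```

The merged cluster keeps the label of its first member; the same ∆BIC value serves both to pick the pair and to decide when to stop, as described above.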

3.2.2. HAC CLR

The HAC CLR method merges a set of clusters using a HAC algorithm. The CLR (Cross Likelihood Ratio) score [13] is used as the dissimilarity measure as well as the stop criterion. Computing this score requires a Universal Background Model (UBM), which is also used, where needed, to adapt the cluster models with the MAP algorithm [14]. At each iteration, the two clusters with the lowest CLR score are selected for merging. The merging process stops when the score exceeds a threshold set a priori.
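
The sketch below illustrates one way such a score can be computed. It makes strong simplifying assumptions that are not taken from S4D: the UBM is a single 1-D Gaussian instead of a GMM, only the mean is MAP-adapted, and the classic cross likelihood ratio (a similarity) is negated so that a lower value means closer clusters, matching the description above. All function names and the `relevance` factor are hypothetical.

```python
import math
from statistics import fmean


def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def map_adapt_mean(values, ubm_mean, relevance=16.0):
    """MAP adaptation of the UBM mean towards the cluster data.

    With few frames the model stays close to the UBM; with many frames it
    moves towards the cluster mean.
    """
    alpha = len(values) / (len(values) + relevance)
    return alpha * fmean(values) + (1 - alpha) * ubm_mean


def avg_llr(values, model_mean, ubm_mean, var):
    """Average log-likelihood ratio of the data: cluster model versus UBM."""
    return fmean(
        log_gauss(x, model_mean, var) - log_gauss(x, ubm_mean, var)
        for x in values
    )


def clr_dissimilarity(values_i, values_j, ubm_mean, ubm_var, relevance=16.0):
    """Negated cross likelihood ratio: lower means more similar clusters."""
    mean_i = map_adapt_mean(values_i, ubm_mean, relevance)
    mean_j = map_adapt_mean(values_j, ubm_mean, relevance)
    # Each cluster's data is scored against the *other* cluster's model.
    clr = (avg_llr(values_i, mean_j, ubm_mean, ubm_var)
           + avg_llr(values_j, mean_i, ubm_mean, ubm_var))
    return -clr
```

In an agglomerative loop, the pair with the lowest value would be merged, its model re-adapted from the UBM, and the loop would stop once the lowest value exceeds the chosen threshold.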

3.2.3. ILP IV

The Integer Linear Programming I-Vector (ILP IV) clustering [15] extracts an i-vector for each cluster and computes the distances between all of them (PLDA [16], cosine [17] or Mahalanobis [18]). ILP clustering was inspired by the k-medoids algorithm, which chooses k observations as class centers. For ILP IV, the number k is determined automatically: we look for k centers that cover all the i-vectors, such that each i-vector is assigned to exactly one center and lies at a distance of less than δ from it. This problem is solved using the GNU Linear Programming Kit (GLPK) package, which is intended for solving large-scale Linear Programming (LP) problems.
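
To make the covering problem concrete, the sketch below solves it by exhaustive search on a tiny distance matrix. This is explicitly not how S4D works: the exhaustive search stands in for the GLPK solver and is only usable for a handful of i-vectors. It also assumes that the objective is to use as few centers as possible and, among those solutions, to minimize the total distance to the centers. The function name is hypothetical.

```python
from itertools import combinations


def cover_with_centers(dist, delta):
    """Choose the fewest centers such that every item is within delta of one.

    `dist` is a square matrix of distances with a zero diagonal. Returns a
    list giving, for each item, the index of its center.
    """
    n = len(dist)
    for k in range(1, n + 1):              # try the smallest k first
        best = None
        for centers in combinations(range(n), k):
            assignment, total = [], 0.0
            for j in range(n):
                c = min(centers, key=lambda c: dist[c][j])  # nearest center
                if dist[c][j] >= delta:
                    break                  # item j is not covered
                assignment.append(c)
                total += dist[c][j]
            else:
                # Every item is covered: keep the most compact solution.
                if best is None or total < best[0]:
                    best = (total, assignment)
        if best is not None:
            return best[1]
    raise ValueError("delta must be positive")
```

With k equal to the number of items, every item is its own center, so a solution always exists for a positive δ; the value of δ therefore controls how many clusters come out.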

To save execution time, a search for connected components (CC) can be performed first [19]. The distances below δ define a graph with clusters as nodes and distances as edges, and this graph is split into its connected components. ILP IV clustering is then applied to each connected component that is not a star graph. A star graph consists of one central node and one or several other nodes that are connected only to it.
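
The pre-processing step can be sketched as a plain graph traversal. The code below is an illustration with hypothetical function names; it treats components of one or two nodes as trivial stars, which is an assumption of this sketch.

```python
def connected_components(dist, delta):
    """Split the graph whose edges are the distances below delta.

    Returns the list of components (sorted node indices) and the adjacency
    sets used to build them.
    """
    n = len(dist)
    neighbours = {
        i: {j for j in range(n) if j != i and dist[i][j] < delta}
        for i in range(n)
    }
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:                       # depth-first traversal
            node = stack.pop()
            component.append(node)
            for nb in neighbours[node] - seen:
                seen.add(nb)
                stack.append(nb)
        components.append(sorted(component))
    return components, neighbours


def is_star(component, neighbours):
    """True if one node is linked to all others and the others only to it."""
    if len(component) <= 2:
        return True                        # assumption: trivial cases are stars
    degrees = sorted(len(neighbours[node]) for node in component)
    return degrees[-1] == len(component) - 1 and all(d == 1 for d in degrees[:-1])
```

A star component needs no solver: its central node is the only possible single center, so only the remaining components have to be passed to the ILP step.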

3.2.4. HAC IV

This clustering process is based upon a HAC algorithm. Each cluster is modeled by an i-vector, and the distances between all of them are computed using the PLDA, cosine or Mahalanobis score. This distance is the measure employed to select the clusters to be grouped as well as to stop the clustering process.
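
A minimal sketch of this process with the cosine distance is given below. Two points are assumptions of the sketch and are not stated in the text: a merged cluster is represented by the mean of its members' i-vectors, and the loop stops when the smallest distance exceeds a fixed `threshold`. The function names are hypothetical; another distance (for example Mahalanobis or a PLDA-based score turned into a distance) could be passed through the `distance` argument.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def centroid(vectors):
    """Mean vector of a list of vectors (assumed cluster representative)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def hac_iv(ivectors, threshold, distance=cosine_distance):
    """Agglomerative clustering of i-vectors.

    `ivectors` maps a cluster label to its i-vector. Returns a mapping from
    each surviving label to the list of original labels it contains.
    """
    members = {label: [label] for label in ivectors}
    vectors = {label: [list(vec)] for label, vec in ivectors.items()}
    while len(members) > 1:
        score, i, j = min(
            (distance(centroid(vectors[a]), centroid(vectors[b])), a, b)
            for a, b in combinations(members, 2)
        )
        if score > threshold:
            break                          # the closest pair is already too far
        members[i].extend(members.pop(j))
        vectors[i].extend(vectors.pop(j))
    return members
```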

6. Discussion

We have introduced S4D, a new open-source toolkit for the diarization task. It is a comprehensive toolkit offering an end-to-end tool-chain with various ready-to-use state-of-the-art algorithms. S4D makes it easy to develop systems for broadcast news, but also for other tasks (meetings, telephone conversations). It is well suited to building offline diarization systems, but it is not yet adapted to online diarization or stream processing. The resulting diarization system is nonetheless time efficient: it processes the 40 hours of our test corpus in 70 minutes (see Table 2), which corresponds to less than 3% of the total audio duration. The toolkit will be maintained indefinitely, and new methods and metrics will be implemented as speaker diarization advances. In the near future, Artificial Neural Network (ANN) [32] and Binary Key (BK) [33] methods for segmentation and clustering will be implemented.
