This article is a self-study summary of key points.

Reference course:

Speech Enhancement Based on Deep Neural Network Spectral Mapping

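Introduction
Principle
- Datasets
  - Speech dataset: TIMIT
  - Noise dataset: NOISEX-92
- Data preparation
  - Clean speech preparation
  - Generating noisy/clean pairs
- Model structure
  - Parameter configuration file
  - Dataset management
    - On feature extraction
    - On network inputs and outputs
- Building the neural network model
- Training and saving the model
- Model data
- Testing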

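Speech Enhancement: DNN-Based Spectrum Mapping

Introduction

Traditional speech enhancement methods (spectral subtraction, Wiener filtering, MMSE, subspace decomposition) usually operate on a single utterance, so very few features can be learned from the data. We therefore have to posit properties of the speech through assumptions (for example, that speech or noise follows a Gaussian distribution, or that speech and noise are independent and uncorrelated), build statistical methods on those assumptions, and finally design filters or similar tools to do the processing.

Overall, the traditional approaches are all "statistical" methods, or in other words methods based on probabilistic models.

With the steady development of neural networks, large datasets and greater computing power mean we no longer need to make specific assumptions or hand-craft statistical features ourselves; instead, a deep neural network learns the features from a large amount of speech.

These methods fall into two main classes: DNN spectral mapping (keyword: mapping) and DNN spectral masking (keyword: mask).

This article focuses on the former.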
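
To make the distinction concrete, here is a minimal sketch of what the training target looks like in each class. The spectra are synthetic random numbers, the additive "noisy" magnitude is a crude stand-in, and the magnitude-ratio mask is only one possible mask definition chosen for illustration; none of this comes from the course code.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 4, 257
clean_mag = rng.uniform(0.1, 1.0, size=(T, D))   # synthetic clean magnitude spectrum
noise_mag = rng.uniform(0.1, 1.0, size=(T, D))   # synthetic noise magnitude spectrum
noisy_mag = clean_mag + noise_mag                # crude stand-in for the noisy spectrum

# Mapping: the network regresses the clean spectrum itself (here as log-power).
mapping_target = np.log(clean_mag ** 2)

# Masking: the network predicts a gain per time-frequency bin,
# which is then applied to the noisy input.
mask_target = clean_mag / noisy_mag              # one possible mask, values in (0, 1)
enhanced_from_mask = mask_target * noisy_mag

print(mapping_target.shape, mask_target.shape)
```

In both cases the input is the noisy spectrum; only the quantity the network is asked to predict differs.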

Principle

Learn the spectral features of clean speech from a large amount of speech.

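Datasets

Speech dataset: TIMIT

We therefore need to collect a large amount of clean speech. The TIMIT corpus is used here; it is mainly used for English speech recognition.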

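Inside the dataset, the directory levels correspond to dialect regions, speakers, and the wav files of the individual utterances (sampled at 16 kHz), along with transcripts and other files.

Since we only do speech enhancement, the text files can be dropped; only the .wav files are needed.

Noise dataset: NOISEX-92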

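It contains 15 types of noise.

Data preparation

Clean speech preparation

Walk through all wav files under the TRAIN folder of TIMIT, printing each file name and writing it to train.scp.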

# get_scp.py
import os

write_path = "E:\\……\\DNN_mapping\\scp"
read_path = "E:\\……\\TIMITdataset"

os.chdir(read_path)

# Run once with TRAIN / train.scp, then again with TEST / test.scp
base_path = "TRAIN"
scp_name = "train.scp"
# base_path = "TEST"
# scp_name = "test.scp"

with open(os.path.join(write_path, scp_name), "wt", encoding='utf-8') as f:
    # os.walk visits every folder under base_path:
    #   root  - path of the folder currently being visited
    #   dirs  - list of subdirectory names in that folder
    #   files - list of file names in that folder
    for root, dirs, files in os.walk(base_path):
        for file in files:
            file_name = os.path.join(root, file)

            if file_name.endswith(".WAV"):
                print(file_name)
                f.write("%s\n" % file_name)

print("done")

After running the code above once for TRAIN and once for TEST, two files are generated: "train.scp" and "test.scp".
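
The directory walk can be tried out without the real corpus. The sketch below wraps the same os.walk logic in a function and runs it on a throwaway directory tree; the function name `collect_wavs` and the folder names `DR1/SPK1` are hypothetical, chosen only to mimic the region/speaker/utterance nesting.

```python
import os
import tempfile


def collect_wavs(base_path, suffix=".WAV"):
    """Return the sorted paths of all files under base_path ending with suffix."""
    found = []
    for root, _dirs, files in os.walk(base_path):
        for name in files:
            if name.endswith(suffix):
                found.append(os.path.join(root, name))
    return sorted(found)


# Build a tiny throwaway tree with two wav files and one text file.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("DR1/SPK1/A.WAV", "DR1/SPK1/A.TXT", "DR2/SPK2/B.WAV"):
        path = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    wavs = collect_wavs(tmp)
    scp_lines = [os.path.relpath(p, tmp) for p in wavs]

print(scp_lines)
```

Only the two .WAV entries end up in the list; the text file is skipped, which is the behaviour wanted for speech enhancement.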

Generating noisy/clean pairs

This is done mainly by the signal_by_db function.

From the definition of the signal-to-noise ratio:

$$SNR(dB)=10\log_{10}\left(\frac{P_{signal}}{P_{noise}}\right)=20\log_{10}\left(\frac{A_{signal}}{A_{noise}}\right)$$

we get

$$N_{add}=\frac{normS}{10^{\frac{SNR}{20}}}\cdot\frac{N}{normN}$$

$$norm\,\mathbf{X}=\|\mathbf{X}\|_2=\sqrt{\sum_{i=1}^{N} X_i^2}$$

where the norm plays the role of the amplitude.
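
The step from the SNR definition to the scaling formula can be spelled out as follows, taking the amplitude of a signal to be its 2-norm. Here $N$ is the raw noise segment and $N_{add}$ the scaled noise that is actually added to the speech $S$:

```latex
\begin{aligned}
SNR &= 20\log_{10}\frac{normS}{norm\,N_{add}}
\;\Longrightarrow\;
norm\,N_{add} = \frac{normS}{10^{SNR/20}} \\
N_{add} &= \frac{N}{normN}\cdot norm\,N_{add}
        = \frac{normS}{10^{SNR/20}}\cdot\frac{N}{normN}
\end{aligned}
```

$N/normN$ is the noise segment scaled to unit norm, so multiplying it by the required norm gives a segment with exactly the amplitude the target SNR calls for.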

## generate_training.py
import os
import random

import numpy as np
import soundfile as sf
from numpy.linalg import norm


def signal_by_db(speech, noise, snr):
    # Add noise to clean speech at the given SNR (dB)
    speech = speech.astype(np.int16)
    noise = noise.astype(np.int16)

    len_speech = speech.shape[0]  # length of the speech data
    len_noise = noise.shape[0]    # the noise recording must be longer than the speech
    # so a random segment of the noise can be cut out and added to the clean speech
    start = random.randint(0, len_noise - len_speech)
    end = start + len_speech

    add_noise = noise[start:end]

    # Scale the noise, following SNR(dB) = 10*log10(Ps/Pn) = 20*log10(As/An)
    add_noise = add_noise / norm(add_noise) * norm(speech) / (10.0 ** (0.05 * snr))
    mix = speech + add_noise
    return mix


if __name__ == "__main__":

    noise_path = 'E:\\……\\NoiseX-92'        # noise data directory
    clean_path = "E:\\……\\TIMITdataset"     # clean speech directory
    scp_path = "E:\\……\\DNN_mapping\\scp"
    work_path = "E:\\……\\DNN_mapping"

    # Noise types; white noise and babble noise are the hardest to handle
    noises = ['babble', 'buccaneer1', 'white']

    os.chdir(work_path)
    # Read the clean speech file names and convert them to a list
    clean_wavs = np.loadtxt(scp_path + '\\train.scp', dtype='str').tolist()

    snrs = [-5, 0, 5, 10, 15, 20]

    with open('scp/train_DNN_enh.scp', 'wt') as f:

        for noise in noises:
            print(noise)  # read the noise data
            noise_file = os.path.join(noise_path, noise + '.wav')
            noise_data, fs = sf.read(noise_file, dtype='int16')
            # Note: sf.read is used here to read 16-bit integers;
            # librosa.load() would automatically convert to floats in [-1, +1]

            for clean_wav in clean_wavs:  # read the clean speech data
                clean_file = os.path.join(clean_path, clean_wav)
                clean_data, fs = sf.read(clean_file, dtype='int16')

                for snr in snrs:  # loop over all SNRs
                    # path and name under which the noisy data is stored
                    noisy_file = os.path.join(noise_path, noise, str(snr), clean_wav)

                    noisy_path, _ = os.path.split(noisy_file)
                    os.makedirs(noisy_path, exist_ok=True)
                    mix = signal_by_db(clean_data, noise_data, snr)  # add noise
                    noisy_data = np.asarray(mix, dtype=np.int16)     # store as int16
                    sf.write(noisy_file, noisy_data, fs)
                    f.write('%s %s\n' % (noisy_file, clean_file))    # record the noisy/clean pair
                    # print('%s %s\n'%(noisy_file,clean_file))
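
A quick way to convince yourself that this scaling does what the SNR formula says is to measure the SNR of the result. The sketch below does this with NumPy on synthetic Gaussian signals rather than real audio; the helper names `scale_noise_to_snr` and `measured_snr_db` are made up for this check and are not part of the project code.

```python
import numpy as np


def scale_noise_to_snr(speech, noise, snr_db):
    """Scale `noise` so that speech/noise has the requested SNR in dB."""
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    target_norm = np.linalg.norm(speech) / (10.0 ** (snr_db / 20.0))
    return noise / np.linalg.norm(noise) * target_norm


def measured_snr_db(speech, noise):
    """SNR in dB computed from the power ratio of the two signals."""
    p_speech = np.sum(np.asarray(speech, dtype=np.float64) ** 2)
    p_noise = np.sum(np.asarray(noise, dtype=np.float64) ** 2)
    return 10.0 * np.log10(p_speech / p_noise)


rng = np.random.default_rng(0)
speech = rng.normal(scale=1000.0, size=16000)   # synthetic stand-in for speech
noise = rng.normal(scale=300.0, size=16000)     # synthetic stand-in for noise

for snr_db in (-5, 0, 5, 10, 15, 20):
    scaled = scale_noise_to_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, scaled), 6))
```

The measured value matches the requested one for every SNR, whatever the original level of the noise was, because the noise is first normalized to unit norm.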

Model structure

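The whole network model is implemented in PyTorch.

The scp folder holds the data description files.

dataset.py manages and organizes the training data.

hparams.py holds the parameters for the whole project.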

Parameter configuration file

# hparams.py
import torch


class hparams():
    def __init__(self):
        # scp file listing the (noisy, clean) training pairs
        self.file_scp = "E:\\……\\DNN_mapping\\scp\\train_DNN_enh.scp"

        # STFT parameters
        self.para_stft = {}
        self.para_stft["N_fft"] = 512
        self.para_stft["win_length"] = 512
        self.para_stft["hop_length"] = 128
        self.para_stft["window"] = 'hamming'

        # Network model parameters
        self.n_expand = 3  # number of context frames spliced on each side of the current frame
        # Input feature dimension: (2*n_expand+1) frames of N_fft/2+1 bins each
        # (the frame-splicing subsection later explains why)
        self.dim_in = int((self.para_stft["N_fft"]/2 + 1) * (2*self.n_expand + 1))
        self.dim_out = int((self.para_stft["N_fft"]/2 + 1))  # output feature dimension
        self.dim_embeding = 2048  # number of nodes in each hidden layer
        self.learning_rate = 1e-4
        self.batch_size = 32
        self.negative_slope = 1e-4
        self.dropout = 0.1
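
The two dimensions follow directly from the STFT size and the context width. As a small worked example (the helper name `feature_dims` is hypothetical and exists only for this illustration):

```python
def feature_dims(n_fft, n_expand):
    """Return (dim_in, dim_out) for a spliced log-power-spectrum input."""
    n_bins = n_fft // 2 + 1        # frequency bins of a one-sided spectrum
    n_frames = 2 * n_expand + 1    # current frame plus n_expand on each side
    return n_bins * n_frames, n_bins


dim_in, dim_out = feature_dims(n_fft=512, n_expand=3)
print(dim_in, dim_out)  # 1799 257
```

With N_fft = 512 there are 257 bins per frame, and 7 spliced frames give 257 × 7 = 1799 inputs for 257 outputs.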

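Dataset management

On feature extraction:

1. In deep learning for speech, the STFT is commonly used for feature extraction. For numerical stability, the input is not the magnitude spectrum itself but its logarithm. Why?

Answer: after the FFT, the magnitude spectrum varies very sharply; the values are numerically unstable and hard to control. Taking the log makes them more stable.

2. Which feature extraction function is commonly used?

Usually the stft function from the librosa library. Its output has shape $D \times T$, where $D=1+\frac{N_{FFT}}{2}$ and $T$ is the number of output frames.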
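
To see both points (the $T \times D$ layout after transposing, and how much the log compresses the value range), here is a simplified NumPy-only sketch. It is not librosa's implementation: it does no edge padding, so the frame count differs from what librosa.stft returns by default, and it uses np.hamming for the window. The small `eps` is an added assumption to guard against log(0), and the input is synthetic noise rather than speech.

```python
import numpy as np


def log_power_spectrum(wav, n_fft=512, hop_length=128):
    """Frame `wav`, apply a Hamming window, return the T x D log-power spectrum."""
    window = np.hamming(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop_length
    frames = np.stack([
        wav[i * hop_length: i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])                                          # T x n_fft
    mag = np.abs(np.fft.rfft(frames, axis=1))   # T x D, with D = 1 + n_fft/2
    eps = 1e-12                                 # assumption: avoids log(0)
    return np.log(mag ** 2 + eps), mag


rng = np.random.default_rng(0)
wav = rng.normal(scale=1000.0, size=16000)      # one second of synthetic "audio" at 16 kHz
lps, mag = log_power_spectrum(wav)
print(lps.shape)                                # (T, 257)
print(mag.max() / mag.min(), lps.max() - lps.min())
```

The last line prints the ratio between the largest and smallest magnitude next to the spread of the log-power values; the second number is far smaller, which is the stability argument made in the answer above.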

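On network inputs and outputs

1. Frame splicing

Typically several input frames are used to predict one frame. For example, with 5 input frames (the current frame extended by 2 frames on each side, i.e. the n_expand parameter in the code set to n_expand=2), frames [3, 4, 5, 6, 7] are used to predict (enhance) frame [5], and the predicted frame 5 is the output.

This step can be implemented with Tensor.unfold(dimension, size, step).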
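
The splicing can be illustrated on a toy matrix. The sketch below uses NumPy's sliding_window_view as an analogue of Tensor.unfold (both append the window dimension last); it is a stand-in for illustration, not the PyTorch code used in the project, and `splice_frames` is a hypothetical name.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def splice_frames(feature, n_expand):
    """Turn a T x D matrix into (T - 2*n_expand) x (D * (2*n_expand + 1))."""
    size = 2 * n_expand + 1
    windows = sliding_window_view(feature, size, axis=0)  # (T-2e) x D x size
    windows = windows.transpose(0, 2, 1)                  # (T-2e) x size x D
    return windows.reshape(windows.shape[0], -1)          # frames laid side by side


T, D, n_expand = 9, 4, 2
feature = np.arange(T * D).reshape(T, D)   # row t holds the values 4t .. 4t+3
spliced = splice_frames(feature, n_expand)
print(spliced.shape)      # (5, 20)
print(spliced[2])         # rows 2..6 of `feature`, concatenated
```

Counting from 1 as in the text, row 3 of the output contains frames 3 to 7, centered on frame 5. The first and last n_expand frames have no full context and therefore produce no output row.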

# dataset.py
# Dataset management
import os
import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader
from hparams import hparams
import librosa
import soundfile as sf

# Data handling is built on the Dataset and DataLoader classes from torch


def feature_stft(wav, para):  # feature extraction with the STFT
    spec = librosa.stft(wav,
                        n_fft=para["N_fft"],
                        win_length=para["win_length"],
                        hop_length=para["hop_length"],
                        window=para["window"])
    # librosa.stft() returns a D x T array: D = 1 + n_fft/2 feature bins, T frames

    mag = np.abs(spec)      # magnitude spectrum
    LPS = np.log(mag**2)    # the network input is the log of the squared magnitude (LPS)
    # Q: why is LPS used as the input?
    # A: after the FFT the magnitude spectrum varies sharply and is numerically
    #    unstable and hard to control; taking the log makes it more stable
    phase = np.angle(spec)  # phase

    # The STFT output is D x T but the network expects T x D; .T transposes
    return LPS.T, phase.T   # T x D


def feature_contex(feature, expand):  # frame splicing
    feature = feature.unfold(0, 2*expand+1, 1)  # (T-2*expand) x D x (2*expand+1)
    # Tensor.unfold(dimension, size, step):
    #   dimension - the dimension along which overlapping windows are taken (T, i.e. dim 0)
    #   size      - window size (2 * context frames per side + 1)
    #   step      - step between windows
    feature = feature.transpose(1, 2)           # (T-2*expand) x (2*expand+1) x D
    # swap the last two dimensions
    feature = feature.view([-1, (2*expand+1)*feature.shape[-1]])  # (T-2*expand) x (D*(2*expand+1))
    # keep the first (frame) dimension and merge the last two into one
    return feature


class TIMIT_Dataset(Dataset):

    def __init__(self, para):

        self.file_scp = para.file_scp    # scp file
        self.para_stft = para.para_stft  # feature extraction parameters
        self.n_expand = para.n_expand    # frame splicing

        files = np.loadtxt(self.file_scp, dtype='str')  # read the noisy/clean pair list
        self.clean_files = files[:, 1].tolist()  # clean speech is in the second column
        self.noisy_files = files[:, 0].tolist()  # noisy speech is in the first column

        print(len(self.clean_files))
        print("first clean speech file")
        print(files[0, 1])
        print("first noisy speech file")
        print(files[0, 0])

    def __len__(self):  # number of samples in the dataset
        return len(self.clean_files)

    def __getitem__(self, idx):  # how each item in the dataset is processed

        # Read the clean speech
        clean_wav, fs = sf.read(self.clean_files[idx], dtype='int16')
        clean_wav = clean_wav.astype('float32')
        # Read as int16 and then cast to float, instead of using librosa.load(),
        # which would rescale the samples to [-1, +1]

        # Read the noisy speech
        noisy_wav, fs = sf.read(self.noisy_files[idx], dtype='int16')
        noisy_wav = noisy_wav.astype('float32')

        # Extract STFT features
        clean_LPS, _ = feature_stft(clean_wav, self.para_stft)  # T x D
        noisy_LPS, _ = feature_stft(noisy_wav, self.para_stft)  # T x D

        # Convert to torch tensors
        X_train = torch.from_numpy(noisy_LPS)
        Y_train = torch.from_numpy(clean_LPS)

        # Frame splicing; drop the target frames that have no full context
        X_train = feature_contex(X_train, self.n_expand)
        Y_train = Y_train[self.n_expand:-self.n_expand, :]
        return X_train, Y_train  # training data and the matching targets


def my_collect(batch):
    # Training needs every batch to be a single tensor.
    # Each utterance yields features of size T x (D*(2*n_expand+1)) and T may differ
    # between utterances, so the default collation is replaced by concatenation.
    batch_X = [item[0] for item in batch]
    batch_Y = [item[1] for item in batch]
    batch_X = torch.cat(batch_X, 0)  # T may differ, so concatenate along T (dim 0)
    batch_Y = torch.cat(batch_Y, 0)
    return [batch_X.float(), batch_Y.float()]


if __name__ == '__main__':
    work_path = "E:\\……\\DNN_mapping"
    os.chdir(work_path)

    # Data loading test
    para = hparams()

    m_Dataset = TIMIT_Dataset(para)

    m_DataLoader = DataLoader(m_Dataset, batch_size=2, shuffle=True, num_workers=4, collate_fn=my_collect)
    # shuffle: random order  num_workers: number of loader worker processes
    # collate_fn: function that merges the samples into one batch

    # Print the feature dimensions of X and Y for every batch
    for i_batch, sample_batch in enumerate(m_DataLoader):
        train_X = sample_batch[0]
        train_Y = sample_batch[1]
        print(train_X.shape)
        print(train_Y.shape)

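When this is run, the last step uses DataLoader() to read the data in one batch at a time (the noisy data, and the clean data as labels).

Take the following output as an example:

torch.Size([631, 1799])
torch.Size([631, 257])

For one batch:

X: T x (D x (2 x n_expand + 1))

Y: T x D

So in this batch the noisy data has shape 631 x 1799 and the clean data has shape 631 x 257.

The first dimension is the time dimension T, which must be the same for both. For the second dimension, since n_expand=3, 1799 = 257 × (2 × 3 + 1).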
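
Because the collate function concatenates along T, the first dimension of a batch is the total number of spliced frames over all utterances in it. A small arithmetic sketch makes this explicit; the per-utterance frame counts 320 and 323 are hypothetical values picked so that the total matches the output above, and `batch_shapes` is a made-up helper.

```python
def batch_shapes(frame_counts, n_fft=512, n_expand=3):
    """Shapes of (X, Y) after splicing each utterance and concatenating along T."""
    n_bins = n_fft // 2 + 1
    # Splicing drops n_expand frames at each end of every utterance.
    total_frames = sum(t - 2 * n_expand for t in frame_counts)
    return (total_frames, n_bins * (2 * n_expand + 1)), (total_frames, n_bins)


# Hypothetical STFT frame counts for the two utterances of a batch_size=2 batch
x_shape, y_shape = batch_shapes([320, 323])
print(x_shape, y_shape)  # (631, 1799) (631, 257)
```

The batch_size setting therefore counts utterances, not rows: the number of rows fed to the network varies from batch to batch with the utterance lengths.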

Building the neural network model

# model_mapping.py
import torch
import torch.nn as nn
from hparams import hparams


# Neural network model: a deep (fully connected) neural network
class DNN_Mapping(nn.Module):
    def __init__(self, para):
        super(DNN_Mapping, self).__init__()
        self.dim_in = para.dim_in
        self.dim_out = para.dim_out
        self.dim_embeding = para.dim_embeding
        self.dropout = para.dropout
        self.negative_slope = para.negative_slope

        # Normalizes the training target (clean LPS) before the loss is computed
        self.BNlayer = nn.BatchNorm1d(self.dim_out)

        self.model = nn.Sequential(  # DNN model
                        # Normalize the input speech features first
                        nn.BatchNorm1d(self.dim_in),

                        # Layer 1
                        nn.Linear(self.dim_in, self.dim_embeding),
                        nn.BatchNorm1d(self.dim_embeding),
                        # nn.ReLU(),
                        nn.LeakyReLU(self.negative_slope),
                        nn.Dropout(self.dropout),

                        # Layer 2
                        nn.Linear(self.dim_embeding, self.dim_embeding),
                        nn.BatchNorm1d(self.dim_embeding),
                        # nn.ReLU(),
                        nn.LeakyReLU(self.negative_slope),
                        nn.Dropout(self.dropout),

                        # Layer 3
                        nn.Linear(self.dim_embeding, self.dim_embeding),
                        nn.BatchNorm1d(self.dim_embeding),
                        # nn.ReLU(),
                        nn.LeakyReLU(self.negative_slope),
                        nn.Dropout(self.dropout),

                        # Layer 4
                        nn.Linear(self.dim_embeding, self.dim_out),
                        nn.BatchNorm1d(self.dim_out),

                        )

        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.xavier_normal_(m.weight.data)  # initialize the Linear layers

    def forward(self, x, y=None, istraining=True):
        out_enh = self.model(x)
        if istraining:
            # y is the training target (the clean speech LPS);
            # it is also normalized, by BNlayer
            out_target = self.BNlayer(y)
            return out_enh, out_target
        else:
            return out_enh


if __name__ == "__main__":
    para = hparams()
    m_model = DNN_Mapping(para)
    print(m_model)
    m_model.eval()  # inference mode: BatchNorm uses its running statistics
    x = torch.randn(3, para.dim_in)
    y = m_model(x, istraining=False)  # no target is passed, so skip the target BN layer
    print(y.shape)

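As the overall structure diagram shows, both the network output and the clean-speech target pass through a BN (normalization) layer before the MSE is computed.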

Training and saving the model

# train.py
import os

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from hparams import hparams
from dataset import TIMIT_Dataset, my_collect
from model_mapping import DNN_Mapping

# Training procedure
if __name__ == "__main__":

    # Define the device
    device = torch.device("cuda:0")  # train on the GPU; requires CUDA and the GPU build of PyTorch

    # Load the model parameters
    para = hparams()

    # Define the model
    m_model = DNN_Mapping(para)    # build the model
    m_model = m_model.to(device)   # move the model's computation to the GPU
    m_model.train()                # put the model in training mode

    # Define the loss function
    loss_fun = nn.MSELoss()
    # loss_fun = nn.L1Loss()
    loss_fun = loss_fun.to(device)

    # Define the optimizer
    optimizer = torch.optim.Adam(
        params=m_model.parameters(),
        lr=para.learning_rate)

    # Define the dataset
    m_Dataset = TIMIT_Dataset(para)
    m_DataLoader = DataLoader(m_Dataset, batch_size=para.batch_size, shuffle=True, num_workers=4, collate_fn=my_collect)

    # Number of training epochs
    n_epoch = 100   # in practice training has roughly converged after 7-8 epochs
    n_step = 0
    loss_total = 0  # accumulated loss
    for epoch in range(n_epoch):
        # Loop over the batches produced by the DataLoader
        for i_batch, sample_batch in enumerate(m_DataLoader):
            train_X = sample_batch[0]
            train_Y = sample_batch[1]

            train_X = train_X.to(device)
            train_Y = train_Y.to(device)

            m_model.zero_grad()
            # Forward pass
            output_enh, out_target = m_model(x=train_X, y=train_Y)

            # Compute the loss
            loss = loss_fun(output_enh, out_target)

            # Backpropagate the error
            loss.backward()

            # Update the parameters
            optimizer.step()

            n_step = n_step + 1
            # .item() gives a plain float, so the graph is not kept alive by the running sum
            loss_total = loss_total + loss.item()

            # Print an intermediate result every 100 steps
            if n_step % 100 == 0:
                print("epoch = %02d  step = %04d  loss = %.4f" % (epoch, n_step, loss.item()))

        # At the end of each epoch, compute the mean loss
        loss_mean = loss_total / n_step
        print("epoch = %02d mean_loss = %f" % (epoch, loss_mean))
        loss_total = 0
        n_step = 0

        # Save the model
        work_path = "E:\\……\\DNN_mapping"
        save_path = "E:\\……\\DNN_mapping\\save"
        os.chdir(work_path)
        os.makedirs(save_path, exist_ok=True)
        save_name = os.path.join(save_path, 'model_%d_%.4f.pth' % (epoch, loss_mean))
        torch.save(m_model, save_name)

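Model data

import torch
import os

# Inspect the parameters stored in a saved model
if __name__ == "__main__":
    work_path = "E:\\homework\\……\\DNN_mapping"
    os.chdir(work_path)

    model_name = "save/model_4_0.0036.pth"
    # The file holds the whole pickled model object; recent PyTorch releases
    # may need weights_only=False passed to torch.load to accept it
    m_model = torch.load(model_name, map_location=torch.device('cpu'))
    m_model.eval()

    model_dic = m_model.state_dict()

    # Print the name and size of every entry
    for k, v in model_dic.items():
        print('k:' + k)
        print(v.size())

    print(model_dic['BNlayer.weight'].data)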

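Testing

The test function takes the trained model with its parameters, and the data to be enhanced.

Note, however, that the model outputs an LPS that has been normalized by BN (because during training the MSE is matched against the normalized target).

To map the model output back to a regular output, the parameters of the BN layer are needed.

For the details of how the restoration works, see the BatchNorm1d() function.

The restoration uses this formula:

$$y=\frac{x-\mathrm{E}[x]}{\sqrt{\operatorname{Var}[x]+\epsilon}} * \gamma+\beta$$

Given y (the model output), solve for x (the data before normalization).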
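
Solving the formula for x gives $x=(y-\beta)\cdot\frac{\sqrt{\operatorname{Var}[x]+\epsilon}}{\gamma}+\mathrm{E}[x]$. The NumPy sketch below checks this round trip on synthetic data. The statistics, gamma, and beta are random stand-ins for what a trained BN layer would hold, and eps=1e-5 is the default of BatchNorm1d.

```python
import numpy as np


def bn_forward(x, mean, var, gamma, beta, eps=1e-5):
    """Batch-norm transform with fixed (running) statistics."""
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def bn_inverse(y, mean, var, gamma, beta, eps=1e-5):
    """Solve the batch-norm formula for x, given y."""
    return (y - beta) * np.sqrt(var + eps) / gamma + mean


rng = np.random.default_rng(0)
D = 257
x = rng.normal(loc=10.0, scale=3.0, size=(5, D))   # synthetic stand-in for LPS frames
mean, var = x.mean(axis=0), x.var(axis=0)
gamma = rng.uniform(0.5, 1.5, size=D)               # hypothetical learned scale
beta = rng.normal(size=D)                           # hypothetical learned shift

y = bn_forward(x, mean, var, gamma, beta)
x_rec = bn_inverse(y, mean, var, gamma, beta)
print(np.max(np.abs(x - x_rec)))                    # reconstruction error
```

The inverse only works if the same eps and the same statistics are used in both directions, which is why the evaluation code has to read them from the trained layer.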

# eval.py
import os

import numpy as np
import torch
import librosa
import soundfile as sf
import matplotlib.pyplot as plt

from hparams import hparams
from dataset import feature_stft, feature_contex
from model_mapping import DNN_Mapping
from generate_training import signal_by_db

# Used to test the trained model


def eval_file_BN(wav_file, model, para):
    # Inputs: the trained model, its parameters, and the file to be enhanced

    # Read the noisy audio file
    noisy_wav, fs = sf.read(wav_file, dtype='int16')
    noisy_wav = noisy_wav.astype('float32')

    # Extract LPS features
    noisy_LPS, noisy_phase = feature_stft(noisy_wav, para.para_stft)

    # Convert to a torch tensor
    noisy_LPS = torch.from_numpy(noisy_LPS)

    # Frame splicing
    noisy_LPS_expand = feature_contex(noisy_LPS, para.n_expand)

    # Enhance with the DNN
    model.eval()
    with torch.no_grad():
        enh_LPS = model(x=noisy_LPS_expand, istraining=False)
        # The model output is an LPS that has been normalized by BN.
        # Mapping it back to a regular LPS needs the BN parameters;
        # see BatchNorm1d() for how the normalization works.

    # Restore the data using the information in the BN layer
    model_dic = model.state_dict()
    BN_weight = torch.unsqueeze(model_dic['BNlayer.weight'].data, dim=0)     # gamma
    BN_bias = torch.unsqueeze(model_dic['BNlayer.bias'].data, dim=0)         # beta
    BN_mean = torch.unsqueeze(model_dic['BNlayer.running_mean'].data, dim=0) # E[x]
    BN_var = torch.unsqueeze(model_dic['BNlayer.running_var'].data, dim=0)   # Var[x]
    eps = model.BNlayer.eps  # use the same epsilon as the BN layer itself

    # Invert BN to get the enhanced spectrum (still in LPS, i.e. log, form);
    # the 1e-8 guards against division by a zero gamma
    pred_LPS = (enh_LPS - BN_bias) * torch.sqrt(BN_var + eps) / (BN_weight + 1e-8) + BN_mean

    # Convert the LPS back to a spectrum
    pred_LPS = pred_LPS.numpy()         # to numpy
    enh_mag = np.exp(pred_LPS.T / 2)    # log-power back to magnitude; .T transposes
    # Reuse the phase of the noisy signal for the enhanced signal,
    # dropping the context frames at both ends
    enh_phase = noisy_phase[para.n_expand:-para.n_expand, :].T
    enh_spec = enh_mag * np.exp(1j * enh_phase)  # enhanced spectrum

    # istft, with the same window as the analysis STFT
    enh_wav = librosa.istft(enh_spec,
                            hop_length=para.para_stft["hop_length"],
                            win_length=para.para_stft["win_length"],
                            window=para.para_stft["window"])  # enhanced time-domain signal
    return enh_wav


if __name__ == "__main__":
    work_path = "E:\\……\\DNN_mapping"
    os.chdir(work_path)

    para = hparams()

    # Load the trained model
    model_name = "save/model_4_0.0036.pth"
    m_model = torch.load(model_name, map_location=torch.device('cpu'))

    snrs = [5]

    noise_path = 'E:\\……\\NoiseX-92'
    clean_path = "E:\\……\\TIMITdataset"
    # noises = ['factory1','volvo','white','m109']
    noises = ['white']
    test_clean_files = np.loadtxt('scp/test_small.scp', dtype='str').tolist()

    path_eval = 'eval2'  # results go into the eval2 subfolder of the working directory
    os.makedirs(path_eval, exist_ok=True)

    for noise in noises:
        print(noise)
        noise_file = os.path.join(noise_path, noise + '.wav')
        noise_data, fs = sf.read(noise_file, dtype='int16')

        for clean_wav in test_clean_files:

            # Read the clean speech and save a copy
            clean_file = os.path.join(clean_path, clean_wav)
            clean_data, fs = sf.read(clean_file, dtype='int16')
            wav_id = os.path.split(clean_file)[-1]  # bare file name
            sf.write(os.path.join(path_eval, wav_id), clean_data, fs)  # copy the selected clean speech to the eval folder

            for snr in snrs:
                # Generate the noisy file
                noisy_file = os.path.join(path_eval, noise + '-' + str(snr) + '-' + wav_id)
                mix = signal_by_db(clean_data, noise_data, snr)  # add noise
                noisy_data = np.asarray(mix, dtype=np.int16)
                sf.write(noisy_file, noisy_data, fs)  # save the noisy speech

                # Enhance
                print("enhancement file %s" % (noisy_file))
                enh_data = eval_file_BN(noisy_file, m_model, para)

                # Normalize the signal amplitude to the ±1 range
                max_ = np.max(enh_data)
                min_ = np.min(enh_data)
                enh_data = enh_data * (2 / (max_ - min_)) - (max_ + min_) / (max_ - min_)
                enh_file = os.path.join(path_eval, noise + '-' + str(snr) + '-' + 'enh' + '-' + wav_id)
                sf.write(enh_file, enh_data, fs)  # save the enhanced speech

                # Plot the three spectrograms
                fig_name = os.path.join(path_eval, noise + '-' + str(snr) + '-' + wav_id[:-3] + 'jpg')

                plt.figure()
                plt.subplot(3, 1, 1)
                plt.specgram(clean_data, NFFT=512, Fs=fs)
                plt.xlabel("clean specgram")
                plt.subplot(3, 1, 2)
                plt.specgram(noisy_data, NFFT=512, Fs=fs)
                plt.xlabel("noisy specgram")
                plt.subplot(3, 1, 3)
                plt.specgram(enh_data, NFFT=512, Fs=fs)
                plt.xlabel("enhanced specgram")
                plt.savefig(fig_name)
                plt.close()
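
Three small numerical steps in the evaluation are easy to get wrong, so here is a standalone NumPy check of each on made-up numbers: recovering the magnitude from the LPS, rebuilding a complex spectrum from magnitude and phase, and rescaling a waveform to ±1. The helper name `rescale_to_unit_range` is hypothetical; the formula inside it is the one used in the main block above.

```python
import numpy as np


def rescale_to_unit_range(x):
    """Linearly map x so that its minimum becomes -1 and its maximum +1."""
    max_, min_ = np.max(x), np.min(x)
    return x * (2 / (max_ - min_)) - (max_ + min_) / (max_ - min_)


# Magnitude from LPS: LPS = log(mag**2)  =>  mag = exp(LPS / 2)
mag = np.array([0.5, 2.0, 40.0])
lps = np.log(mag ** 2)
print(np.exp(lps / 2))                 # recovers mag

# Complex spectrum from the magnitude and a (noisy) phase
phase = np.array([0.0, np.pi / 2, -np.pi / 3])
spec = np.exp(lps / 2) * np.exp(1j * phase)
print(np.abs(spec), np.angle(spec))    # magnitude and phase are both preserved

# Waveform normalization
wav = np.array([-300.0, 0.0, 1200.0, 500.0])
print(rescale_to_unit_range(wav))
```

Note that the rescaling is a min-max mapping, so it also shifts the signal when its minimum and maximum are not symmetric around zero.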
