Keras does not release GPU memory after training stops_Notes on training TORCS autonomous driving with DDPG (Part 2)

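This part walks through the code.

I. Program structure

ActorNetwork.py: defines the Actor network

CriticNetwork.py: defines the Critic network

OU.py: Ornstein-Uhlenbeck process

ReplayBuffer.py: experience replay

actormodel.json: stores the structure of the actor network model

actormodel.h5: stores the actor network weights

criticmodel.json: stores the structure of the critic network model

criticmodel.h5: stores the critic network weights

ddpg.py: main script

gym_torcs.py: Python interface to TORCS

snakeoil3_gym.py: Python script that communicates with the TORCS server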

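II. Analysis of ddpg.py

1. Basic parameters

    BUFFER_SIZE = 100000     # replay buffer capacity (maximum number of stored transitions)
    BATCH_SIZE = 32          # mini-batch size: number of transitions sampled per update
    GAMMA = 0.99             # discount factor
    TAU = 0.001              # soft-update rate of the target networks
    LRA = 0.0001             # learning rate of the Actor network
    LRC = 0.001              # learning rate of the Critic network

    action_dim = 3           # steering, acceleration, brake
    state_dim = 29           # 29 sensor inputs

    np.random.seed(1337)     # fixed seed: the same seed reproduces the same random sequence on every run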

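Why train on mini-batches (BATCH_SIZE = 32)?

① Using the full data set (full-batch learning)

Advantage: the direction determined by the full data set represents the whole sample population better, so each step points more accurately toward the extremum.

Disadvantage: with a large data set, loading all the data at once is not feasible.

② Training on a single sample at a time, i.e. batch_size=1, is called online learning.

Disadvantage: every update follows the gradient of one individual sample, so the updates fluctuate strongly and convergence is difficult.

③ A moderate batch_size, i.e. mini-batch gradient descent (mini-batch learning).

Advantage: compared with full-batch processing, a mini-batch needs far less memory per update; compared with batch_size=1, the averaged gradient is less noisy, so training converges faster.

Disadvantage: the smaller the batch, the less accurate the gradient estimate, so the updates fluctuate more than with the full data set.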
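The trade-off above can be checked numerically. The following is a minimal sketch, assuming a toy one-parameter regression problem (the data, the model and all names are illustrative and not part of ddpg.py). It estimates the same gradient many times with different batch sizes and measures how much the estimates scatter.

```python
import random
import statistics

def make_data(n, seed=0):
    # Toy data set: y = 3x + Gaussian noise (an assumption for illustration).
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x + rng.gauss(0.0, 0.5)) for x in xs]

def batch_gradient(batch, w):
    # Gradient of the mean of 0.5 * (w*x - y)^2 with respect to w.
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def gradient_spread(data, batch_size, trials=500, seed=1):
    # Standard deviation of the gradient estimate over many random batches.
    rng = random.Random(seed)
    grads = [batch_gradient(rng.sample(data, batch_size), 0.0)
             for _ in range(trials)]
    return statistics.pstdev(grads)

data = make_data(2000)
spread_1 = gradient_spread(data, 1)             # online learning
spread_32 = gradient_spread(data, 32)           # mini-batch
spread_full = gradient_spread(data, len(data))  # full batch

print(spread_1, spread_32, spread_full)
```

The scatter shrinks as the batch grows and practically vanishes for the full batch, which is why a moderate batch size is a compromise between gradient noise and the cost of each update.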

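2. Limiting TensorFlow's GPU resource usage

To run faster, TensorFlow by default tries to reserve all available GPU memory for itself at initialization. On a shared server this occupies the GPU so that other users cannot work on it. TensorFlow provides two ways to control this: let it allocate GPU memory dynamically as needed, or cap the fraction of GPU memory a process may use.

(1) Dynamic GPU memory allocation (the method used in the code)

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)

(2) Capping the GPU memory fraction:

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.4   # use at most 40% of the GPU memory
    sess = tf.Session(config=config)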

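3. Using Keras as a simplified interface to TensorFlow

The networks are built from Keras Dense (fully connected) layers, so Keras has to be called from within TensorFlow. First create a TensorFlow session and register it with Keras. Keras will then use the registered session to initialize all the variables it creates internally.

    from keras import backend as K
    K.set_session(sess)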

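4. Loading and saving weights

The weights are loaded from and saved to actormodel.h5 and criticmodel.h5, while the JSON files store the model structure.

Keras models are usually saved as files with the .h5 extension. Although both produce .h5 files, save() and save_weights() store different things.

save_weights():

① Stores only the model weights, not the model structure, which saves disk space;

② Does not store the training configuration or the optimizer state, so the file alone cannot be used to resume training exactly where it stopped;

③ Weights saved by save_weights() are loaded with load_weights(). The network structure must be defined before loading, and it must match the structure used when the weights were saved (including the layer names if the weights are loaded by name).

save():

① The file contains the model structure, the weights, the training configuration (loss function, optimizer, etc.) and the optimizer state. It is larger on disk and can be opened directly with a visualization tool;

② The saved data makes it possible to resume training where it was interrupted;

③ Data saved by save() is loaded with models.load_model().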

Loading the weights

    print("Now we load the weight")
    try:
        actor.model.load_weights("actormodel.h5")
        critic.model.load_weights("criticmodel.h5")
        actor.target_model.load_weights("actormodel.h5")
        critic.target_model.load_weights("criticmodel.h5")
        print("Weight load successfully")
    except Exception:   # e.g. the .h5 files do not exist yet on the first run
        print("Cannot find the weight")

Saving the weights

        if np.mod(i, 3) == 0:             # save the weights every three episodes
            if (train_indicator):          # train_indicator == 1 means training mode
                print("Now we save model")
                actor.model.save_weights("actormodel.h5", overwrite=True)        # save the actor weights
                with open("actormodel.json", "w") as outfile:
                    json.dump(actor.model.to_json(), outfile)                    # save the actor structure

                critic.model.save_weights("criticmodel.h5", overwrite=True)
                with open("criticmodel.json", "w") as outfile:
                    json.dump(critic.model.to_json(), outfile)

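5. Experience replay

The code uses experience replay to store the transitions (s, a, r, s') collected during training. Training the networks on small batches sampled at random from this buffer breaks the correlation between consecutive samples and greatly improves stability.

When computing target_q_values, the outputs of the target networks are used. Slowly changing target networks reduce oscillation in the Q-value estimates, which greatly improves the stability of learning.

The code is as follows: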

        buff.add(s_t, a_t[0], r_t, s_t1, done)
        # Sample a random mini-batch of N transitions (s_i, a_i, r_i, s_{i+1}) from the replay buffer
        batch = buff.getBatch(BATCH_SIZE)
        states = np.asarray([e[0] for e in batch])
        actions = np.asarray([e[1] for e in batch])
        rewards = np.asarray([e[2] for e in batch])
        new_states = np.asarray([e[3] for e in batch])
        dones = np.asarray([e[4] for e in batch])
        y_t = np.asarray([e[1] for e in batch])    # allocates y_t with the shape of the actions; every row is overwritten below

        # Q'(s_{i+1}, mu'(s_{i+1})) computed by the target networks
        target_q_values = critic.target_model.predict([new_states, actor.target_model.predict(new_states)])

        for k in range(len(batch)):
            if dones[k]:
                y_t[k] = rewards[k]                               # terminal transition: no bootstrapping
            else:
                y_t[k] = rewards[k] + GAMMA*target_q_values[k]    # r + gamma * Q'

getBatch() is the random-sampling function defined in ReplayBuffer.py.
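For orientation, here is a minimal sketch of what such a buffer could look like. It is not the actual ReplayBuffer.py; the class name ReplayBufferSketch and the seed argument are hypothetical, and only the method names add and getBatch mirror the calls shown in the snippet.

```python
import random
from collections import deque

class ReplayBufferSketch:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, buffer_size, seed=None):
        # deque with maxlen drops the oldest transition once the buffer is full
        self.buffer = deque(maxlen=buffer_size)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, new_state, done):
        self.buffer.append((state, action, reward, new_state, done))

    def getBatch(self, batch_size):
        # Sample without replacement; return everything if fewer are stored.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def count(self):
        return len(self.buffer)

buff = ReplayBufferSketch(buffer_size=5, seed=0)
for step in range(8):                      # 8 transitions into a buffer of 5
    buff.add(step, 0.0, 1.0, step + 1, False)
batch = buff.getBatch(3)
print(buff.count(), [e[0] for e in batch])
```

Only the five most recent transitions survive, and the batch is drawn uniformly from them.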

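np.asarray() converts its input to an ndarray: a list, as here, is copied into a new array, whereas an existing ndarray of the matching dtype is returned without copying. An ndarray is a multidimensional array that holds elements of a single type.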
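The target computation in the loop above can be written as a small pure function. This is a sketch with made-up numbers; td_targets is a hypothetical helper, not a function from ddpg.py.

```python
GAMMA = 0.99

def td_targets(rewards, next_q_values, dones, gamma=GAMMA):
    """y_i = r_i for terminal transitions, else r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))."""
    return [r if done else r + gamma * q
            for r, q, done in zip(rewards, next_q_values, dones)]

# Three illustrative transitions; the second one ended the episode.
targets = td_targets(rewards=[1.0, -200.0, 0.5],
                     next_q_values=[10.0, 3.0, 0.0],
                     dones=[False, True, False])
print(targets)
```

For the terminal transition the target-network estimate is ignored, so the penalty of -200 is used as is.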

6. Training the neural networks

First, the critic network is updated by minimizing the loss:

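$$L = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2, \qquad y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\right)$$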

Then the actor policy is updated with the sampled deterministic policy gradient:

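$$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a \mid \theta^{Q})\Big|_{s=s_i,\,a=\mu(s_i)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\Big|_{s=s_i}$$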

The last two lines of the code update the target networks:

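$$\theta^{Q'} \leftarrow \tau\,\theta^{Q} + (1-\tau)\,\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\,\theta^{\mu} + (1-\tau)\,\theta^{\mu'}$$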

            if (train_indicator):
                loss += critic.model.train_on_batch([states,actions], y_t) 
                a_for_grad = actor.model.predict(states)
                grads = critic.gradients(states, a_for_grad)
                actor.train(states, grads)
                actor.target_train()
                critic.target_train()
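The actor step in this snippet relies on the chain rule: the critic supplies dQ/da and the actor propagates it to its own weights. The following toy sketch illustrates only that idea, assuming a hand-written critic Q(s, a) = -(a - 2s)^2 and a one-weight actor a = w * s (both invented for illustration; the real networks are the Keras models).

```python
import random

def q_value_grad(state, action):
    # Toy critic Q(s, a) = -(a - 2s)^2, so dQ/da = -2 * (a - 2s).
    return -2.0 * (action - 2.0 * state)

def train_actor(steps=500, batch_size=32, lr=0.1, seed=0):
    rng = random.Random(seed)
    w = 0.0  # toy actor: a = w * s, hence da/dw = s
    for _ in range(steps):
        states = [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]
        # Chain rule: dQ/dw = dQ/da * da/dw, averaged over the mini-batch.
        grad_w = sum(q_value_grad(s, w * s) * s for s in states) / batch_size
        w += lr * grad_w  # gradient ascent on Q
    return w

w_final = train_actor()
print(w_final)
```

The weight converges to 2, the value at which the toy actor always picks the action that maximizes the toy critic.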

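7. if __name__ == "__main__"

if __name__ == "__main__" serves as the program entry point: the block under it runs only when the file is executed as a script, not when it is imported as a module. Python itself does not require it; it is a coding convention.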

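III. Analysis of gym_torcs.py

Gym-TORCS is a Python wrapper for TORCS that mimics the OpenAI Gym interface and is used to test reinforcement learning algorithms on TORCS.

1. Basic parameters

    terminal_judge_start = 100      # start checking for lack of progress after 100 steps
    termination_limit_progress = 5  # [km/h], terminate if the progress falls below this value
    default_speed = 50              # default speed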

2. Initializing throttle, vision and gear change

        if throttle is False:
            self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,))
        else:
            self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,))

        if vision is False:
            high = np.array([1., np.inf, np.inf, np.inf, 1., np.inf, 1., np.inf])
            low = np.array([0., -np.inf, -np.inf, -np.inf, 0., -np.inf, 0., -np.inf])
            self.observation_space = spaces.Box(low=low, high=high)
        else:
            high = np.array([1., np.inf, np.inf, np.inf, 1., np.inf, 1., np.inf, 255])
            low = np.array([0., -np.inf, -np.inf, -np.inf, 0., -np.inf, 0., -np.inf, 0])
            self.observation_space = spaces.Box(low=low, high=high)

3. Automatic gear change

        # Automatic gear change via Snakeoil
        if self.gear_change is True:
            action_torcs['gear'] = this_action['gear']
        else:
            # Six gears in total, selected from the current speed
            action_torcs['gear'] = 1
            if self.throttle:
                if client.S.d['speedX'] > 50:
                    action_torcs['gear'] = 2
                if client.S.d['speedX'] > 80:
                    action_torcs['gear'] = 3
                if client.S.d['speedX'] > 110:
                    action_torcs['gear'] = 4
                if client.S.d['speedX'] > 140:
                    action_torcs['gear'] = 5
                if client.S.d['speedX'] > 170:
                    action_torcs['gear'] = 6
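The chain of if statements is a threshold lookup. As a compact alternative, here is a sketch using the same speed thresholds; select_gear and GEAR_UP_SPEEDS are hypothetical names, and the sketch ignores the self.throttle check of the original.

```python
import bisect

# Speed thresholds copied from the snippet above.
GEAR_UP_SPEEDS = [50, 80, 110, 140, 170]

def select_gear(speed_x):
    """Return gear 1..6: one gear higher for every threshold strictly exceeded."""
    # bisect_left counts the thresholds that are strictly below speed_x.
    return 1 + bisect.bisect_left(GEAR_UP_SPEEDS, speed_x)

for speed in (0, 50, 51, 120, 200):
    print(speed, select_gear(speed))
```

Using bisect_left rather than bisect_right keeps the strict comparison of the original: a speed of exactly 50 stays in first gear.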

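4. One-step dynamics update

        # Send the action to TORCS
        client.respond_to_server()
        # Get the response from TORCS
        client.get_servers_input()

        # Get the full raw observation from TORCS
        obs = client.S.d

        # Build the observation from the raw TORCS observation vector
        self.observation = self.make_observaton(obs)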

5. Reward function

        track = np.array(obs['track'])
        trackPos = np.array(obs['trackPos'])
        sp = np.array(obs['speedX'])
        damage = np.array(obs['damage'])
        rpm = np.array(obs['rpm'])

        # Reward based on speed and on the distance from the track center line
        progress = sp*np.cos(obs['angle']) - np.abs(sp*np.sin(obs['angle'])) - sp * np.abs(obs['trackPos'])
        reward = progress

        # Collision penalty
        if obs['damage'] - obs_pre['damage'] > 0:
            reward = -1
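A few hand-picked inputs make the shape of this reward easier to see. This is a sketch in plain Python that mirrors the formula above; torcs_reward is a hypothetical helper and the input values are illustrative.

```python
import math

def torcs_reward(speed_x, angle, track_pos, damage, previous_damage):
    # Longitudinal speed minus lateral speed minus an off-center penalty.
    progress = (speed_x * math.cos(angle)
                - abs(speed_x * math.sin(angle))
                - speed_x * abs(track_pos))
    if damage - previous_damage > 0:   # new damage since the last step
        return -1.0
    return progress

centered = torcs_reward(100.0, 0.0, 0.0, 0.0, 0.0)    # aligned, on the center line
off_center = torcs_reward(100.0, 0.0, 0.5, 0.0, 0.0)  # aligned, halfway to the edge
collided = torcs_reward(100.0, 0.0, 0.0, 5.0, 0.0)    # damage increased
print(centered, off_center, collided)
```

Driving fast along the track axis on the center line gives the highest reward; drifting halfway to the edge halves it, and any new damage replaces the reward with -1.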

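6. Termination conditions

        episode_terminate = False

        # Terminate if the car leaves the track.
        # Note: track.any() is a boolean, so abs(track.any()) > 1 is never true;
        # in effect only abs(trackPos) > 1 triggers this branch.
        if (abs(track.any()) > 1 or abs(trackPos) > 1):
            reward = -200
            episode_terminate = True
            client.R.d['meta'] = True

        # Terminate if there is no progress after 100 steps
        if self.terminal_judge_start < self.time_step:       # the step count has exceeded 100
            if progress < self.termination_limit_progress:   # progress is below 5
                print("No progress")
                episode_terminate = True
                client.R.d['meta'] = True

        # Terminate if the car runs backward (cos(angle) < 0, i.e. it faces against the track direction)
        if np.cos(obs['angle']) < 0:
            episode_terminate = True
            client.R.d['meta'] = True

        # Send the reset signal
        if client.R.d['meta'] is True:
            self.initial_run = False
            client.respond_to_server()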
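The three checks can be summarized as one pure function, which makes them easy to test in isolation. This is a sketch only: termination_reason is a hypothetical helper, it returns the first matching reason instead of setting flags, and it leaves out the -200 penalty and the reset signal.

```python
import math

TERMINAL_JUDGE_START = 100        # values taken from the basic parameters above
TERMINATION_LIMIT_PROGRESS = 5

def termination_reason(track_pos, progress, angle, time_step):
    """Return a short reason string, or None if the episode continues."""
    if abs(track_pos) > 1:
        return "out of track"
    if time_step > TERMINAL_JUDGE_START and progress < TERMINATION_LIMIT_PROGRESS:
        return "no progress"
    if math.cos(angle) < 0:
        return "running backward"
    return None

print(termination_reason(1.2, 50.0, 0.0, 10))      # left the track
print(termination_reason(0.0, 2.0, 0.0, 150))      # slow after step 100
print(termination_reason(0.0, 2.0, 0.0, 50))       # slow, but still in the grace period
print(termination_reason(0.0, 50.0, math.pi, 10))  # facing backward
```

The grace period of 100 steps matters at the start of an episode, when the car has not yet picked up speed.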

7. Sensor input

The sensor inputs are modified here; the meaning of each parameter is explained in the table in the previous part.

    def make_observaton(self, raw_obs):
        if self.vision is False:
            names = ['focus',
                     'speedX', 'speedY', 'speedZ', 'angle', 'damage',
                     'opponents',
                     'rpm',
                     'track', 
                     'trackPos',
                     'wheelSpinVel']
            Observation = col.namedtuple('Observaion', names)
            return Observation(focus=np.array(raw_obs['focus'], dtype=np.float32)/200.,
                               speedX=np.array(raw_obs['speedX'], dtype=np.float32)/300.0,
                               speedY=np.array(raw_obs['speedY'], dtype=np.float32)/300.0,
                               speedZ=np.array(raw_obs['speedZ'], dtype=np.float32)/300.0,
                               angle=np.array(raw_obs['angle'], dtype=np.float32)/3.1416,
                               damage=np.array(raw_obs['damage'], dtype=np.float32),
                               opponents=np.array(raw_obs['opponents'], dtype=np.float32)/200.,
                               rpm=np.array(raw_obs['rpm'], dtype=np.float32)/10000,
                               track=np.array(raw_obs['track'], dtype=np.float32)/200.,
                               trackPos=np.array(raw_obs['trackPos'], dtype=np.float32)/1.,
                               wheelSpinVel=np.array(raw_obs['wheelSpinVel'], dtype=np.float32))
        else:
            names = ['focus',
                     'speedX', 'speedY', 'speedZ', 'angle',
                     'opponents',
                     'rpm',
                     'track',
                     'trackPos',
                     'wheelSpinVel',
                     'img']
            Observation = col.namedtuple('Observaion', names)

            # RGB image; 'img' is the last entry of names
            image_rgb = self.obs_vision_to_image_rgb(raw_obs[names[-1]])

            return Observation(focus=np.array(raw_obs['focus'], dtype=np.float32)/200.,
                               speedX=np.array(raw_obs['speedX'], dtype=np.float32)/self.default_speed,
                               speedY=np.array(raw_obs['speedY'], dtype=np.float32)/self.default_speed,
                               speedZ=np.array(raw_obs['speedZ'], dtype=np.float32)/self.default_speed,
                               angle=np.array(raw_obs['angle'], dtype=np.float32)/3.1416,   # 'angle' is listed in names, so it must be supplied
                               opponents=np.array(raw_obs['opponents'], dtype=np.float32)/200.,
                               rpm=np.array(raw_obs['rpm'], dtype=np.float32),
                               track=np.array(raw_obs['track'], dtype=np.float32)/200.,
                               trackPos=np.array(raw_obs['trackPos'], dtype=np.float32)/1.,
                               wheelSpinVel=np.array(raw_obs['wheelSpinVel'], dtype=np.float32),
                               img=image_rgb)
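The value state_dim = 29 in ddpg.py suggests that several of these fields are concatenated into one flat state vector. The sketch below shows one combination that yields 29 values. The choice of fields and the sizes (19 track range finders, 4 wheel speeds) are assumptions, since this part does not state them, and flatten_observation is a hypothetical helper.

```python
def flatten_observation(ob):
    """Concatenate selected observation fields into one flat state list."""
    return ([ob["angle"]] + list(ob["track"]) + [ob["trackPos"]]
            + [ob["speedX"], ob["speedY"], ob["speedZ"]]
            + list(ob["wheelSpinVel"]) + [ob["rpm"]])

# Hypothetical, already normalized observation with the assumed sizes.
example = {"angle": 0.0, "track": [0.5] * 19, "trackPos": 0.0,
           "speedX": 0.1, "speedY": 0.0, "speedZ": 0.0,
           "wheelSpinVel": [0.0] * 4, "rpm": 0.3}
state = flatten_observation(example)
print(len(state))
```

Under these assumptions the count is 1 + 19 + 1 + 3 + 4 + 1 = 29, which matches state_dim.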

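IV. Analysis of ActorNetwork.py

The code uses two hidden layers with 300 and 600 neurons respectively. The output consists of three continuous actions.

(1) Steering: tanh activation (an output of -1 means maximum right turn, +1 means maximum left turn);

(2) Acceleration: sigmoid activation (0 means no acceleration, 1 means full acceleration);

(3) Brake: sigmoid activation (0 means no braking, 1 means emergency braking).

The code uses the Keras merge function to combine the three output layers (merge has been removed as of Keras 2.2.0).

    def create_actor_network(self, state_size,action_dim):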
        print("Now we build the model")
        S = Input(shape=[state_size])   
        h0 = Dense(HIDDEN1_UNITS, activation='relu')(S)
        h1 = Dense(HIDDEN2_UNITS, activation='relu')(h0)
        Steering = Dense(1,activation='tanh',init=lambda shape, name: normal(shape, scale=1e-4, name=name))(h1)  
        Acceleration = Dense(1,activation='sigmoid',init=lambda shape, name: normal(shape, scale=1e-4, name=name))(h1)   
        Brake = Dense(1,activation='sigmoid',init=lambda shape, name: normal(shape, scale=1e-4, name=name))(h1) 
        V = merge([Steering,Acceleration,Brake],mode='concat')          
        model = Model(input=S,output=V)
        return model, model.trainable_weights, S
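The effect of the three output activations can be illustrated without Keras. This sketch applies tanh and sigmoid to given pre-activation values and concatenates the results; actor_heads is a hypothetical helper and the input values are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def actor_heads(pre_steer, pre_accel, pre_brake):
    """Apply the head activations and concatenate into [steering, acceleration, brake]."""
    return [math.tanh(pre_steer), sigmoid(pre_accel), sigmoid(pre_brake)]

initial = actor_heads(0.0, 0.0, 0.0)       # pre-activations near zero
extreme = actor_heads(-50.0, 50.0, -50.0)  # saturated heads
print(initial, extreme)
```

Because the output layers are initialized with a very small scale (1e-4), their pre-activations start close to zero, so an untrained actor outputs roughly zero steering with acceleration and brake both near 0.5.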

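The weight-update code is shown below. It uses tf.gradients(ys, xs, grad_ys): self.model.output is differentiated with respect to self.weights, and self.action_gradient (the gradient of Q with respect to the action, supplied by the critic) weights each output component. The minus sign turns the optimizer's gradient descent into gradient ascent on Q.

        # Initialization
        self.model , self.weights, self.state = self.create_actor_network(state_size, action_size)
        self.target_model, self.target_weights, self.target_state = self.create_actor_network(state_size, action_size)
        # Weight update
        self.action_gradient = tf.placeholder(tf.float32,[None, action_size])
        self.params_grad = tf.gradients(self.model.output, self.weights, -self.action_gradient)   # gradient computation
        grads = zip(self.params_grad, self.weights)
        self.optimize = tf.train.AdamOptimizer(LEARNING_RATE).apply_gradients(grads)
        self.sess.run(tf.initialize_all_variables())   # deprecated in later TensorFlow 1.x releases in favor of tf.global_variables_initializer()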

    def train(self, states, action_grads):
        self.sess.run(self.optimize, feed_dict={
            self.state: states,
            self.action_gradient: action_grads
        })
    # Soft-update the target network, which is used to compute the target values
    def target_train(self):
        actor_weights = self.model.get_weights()
        actor_target_weights = self.target_model.get_weights()
        for i in range(len(actor_weights)):
            actor_target_weights[i] = self.TAU * actor_weights[i] + (1 - self.TAU)* actor_target_weights[i]
        self.target_model.set_weights(actor_target_weights)
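To see how slowly the target network follows with TAU = 0.001, here is a sketch of the same update rule on plain Python lists. soft_update is a hypothetical helper and the weight values are illustrative.

```python
def soft_update(source, target, tau):
    # theta_target <- tau * theta_source + (1 - tau) * theta_target
    return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]

TAU = 0.001
source = [1.0, -2.0]   # stands in for the trained network's weights, held fixed here
target = [0.0, 0.0]    # stands in for the target network's weights
for _ in range(1000):
    target = soft_update(source, target, TAU)

print(target)
```

With a fixed source, the remaining gap shrinks by a factor of (1 - TAU) per update, so after 1000 updates the target has covered only about 63% of the distance. This lag is what keeps the training targets stable.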

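V. Analysis of CriticNetwork.py

The Critic network is built much like a DQN network. The code uses two hidden layers with 300 and 600 neurons. The Critic network takes both the state and the action as input. Following the DDPG paper, the actions are not included until the second hidden layer of the Q network. The Keras merge function is again used to combine the action branch with the hidden layer.

    def create_critic_network(self, state_size,action_dim):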
        print("Now we build the model")
        S = Input(shape=[state_size])  
        A = Input(shape=[action_dim],name='action2')   
        w1 = Dense(HIDDEN1_UNITS, activation='relu')(S)
        a1 = Dense(HIDDEN2_UNITS, activation='linear')(A) 
        h1 = Dense(HIDDEN2_UNITS, activation='linear')(w1)
        h2 = merge([h1,a1],mode='sum')    
        h3 = Dense(HIDDEN2_UNITS, activation='relu')(h2)
        V = Dense(action_dim,activation='linear')(h3)   
        model = Model(input=[S,A],output=V)
        adam = Adam(lr=self.LEARNING_RATE)
        model.compile(loss='mse', optimizer=adam)
        return model, A, S

VI. Analysis of OU.py

OU.py implements the Ornstein-Uhlenbeck process, a stochastic process with mean-reverting behavior.

Formula:

$$dx_t = \theta\,(\mu - x_t)\,dt + \sigma\,dW_t$$

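θ determines how fast the variable reverts to the mean; μ is the mean; σ is the volatility of the process.

The Ornstein-Uhlenbeck process is a very common way to model interest rates, foreign exchange rates and commodity prices stochastically.

The following table lists the suggested values used in the code: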

[Table: suggested θ, μ and σ values for steering, acceleration and brake]

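The most important parameter is μ for acceleration: the car should start with some initial speed rather than get stuck in a local minimum (where it keeps braking and never presses the throttle). Feel free to change the parameters and see how the AI behaves under different combinations.

The code is as follows: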

import random
import numpy as np 

class OU(object):

    def function(self, x, mu, theta, sigma):
        return theta * (mu - x) + sigma * np.random.randn(1)
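The mean-reverting behavior is easiest to see by simulating the process. The sketch below uses the same increment as OU.function but integrates it over time, which may differ from how ddpg.py applies the noise. The parameter values, the seed handling and the function names are illustrative assumptions.

```python
import random

def ou_increment(x, mu, theta, sigma, rng):
    # Same form as OU.function: drift toward mu plus Gaussian noise.
    return theta * (mu - x) + sigma * rng.gauss(0.0, 1.0)

def simulate(x0, mu, theta, sigma, steps, seed=0):
    rng = random.Random(seed)
    x, path = x0, [x0]
    for _ in range(steps):
        x += ou_increment(x, mu, theta, sigma, rng)
        path.append(x)
    return path

# Without noise the value decays geometrically toward mu.
noise_free = simulate(x0=1.0, mu=0.0, theta=0.5, sigma=0.0, steps=20)
# With noise it fluctuates around mu instead of drifting away.
noisy = simulate(x0=1.0, mu=0.0, theta=0.5, sigma=0.1, steps=5000)
tail_mean = sum(noisy[100:]) / len(noisy[100:])
print(noise_free[-1], tail_mean)
```

A larger theta pulls the value back to mu faster, while sigma sets how far the noise pushes it away, which is why the exploration noise stays centered on the chosen μ for each action.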
