Three Major Improvements to DQN (Part 3): Dueling Network

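1. Dueling Network

What is a Dueling Deep Q Network? See the figure below.

The top network is the traditional DQN and the bottom one is the Dueling DQN. In the original DQN, the neural network directly outputs the Q-value of each action, whereas in Dueling DQN the Q-value of each action is determined by the following formula:

Q(s, a) = V(s) + A(s, a)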

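The Q-value is split into the value of the state plus the advantage of each action in that state. The figure below illustrates this:

In this racing game, the left panel shows the state value: the red regions show that the state value depends on the road ahead. The right panel shows the advantage: the red regions show that the advantage focuses on nearby cars that the agent is about to approach, and in that situation the chosen action is influenced more by the advantage. The red regions are what drive the movement of the agent's own car.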

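However, computing the Q-value with the formula above leads to an unidentifiability problem: given a Q, V and A cannot be recovered uniquely. For example, adding a constant to V and subtracting the same constant from A yields exactly the same Q, so Q alone clearly does not determine a unique V and A.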
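The unidentifiability problem can be checked numerically. The following is a minimal sketch, assuming a single state with four actions; the numbers and the helper name naive_q are made up for illustration and do not come from the original code.

```python
import numpy as np

def naive_q(v, a):
    """Naive dueling aggregation: Q(s, a) = V(s) + A(s, a)."""
    return v + a

# Hypothetical values for one state with four actions.
v = 2.0
a = np.array([0.5, -1.0, 0.0, 1.5])
c = 3.0  # arbitrary constant moved from A into V

q_original = naive_q(v, a)
q_shifted = naive_q(v + c, a - c)

# Two different (V, A) pairs produce the same Q.
print(np.allclose(q_original, q_shifted))  # True
```

Because both pairs give the same Q, a loss defined on Q alone cannot tell them apart, which is why an extra constraint on A is needed.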

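Solution

Force the advantage of the selected greedy action to be 0:

Q(s, a) = V(s) + (A(s, a) - max_a' A(s, a'))

Then, for the greedy action a* = argmax_a' Q(s, a'), we obtain a unique value function:

Q(s, a*) = V(s)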

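Improved solution

Use the mean of the advantages in place of the maximum used above:

Q(s, a) = V(s) + (A(s, a) - (1/|A|) * sum_a' A(s, a'))

where |A| is the number of actions. With this approach, V and A no longer represent the value function and the advantage function exactly (in the semantic sense), because they are off by a constant, but the operation improves stability. It also does not change the essential meaning of V and A.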
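To compare the two constraints, here is a small sketch that applies max-centring and mean-centring to the same made-up V and A. The function names aggregate_max and aggregate_mean and all numbers are assumptions for illustration only.

```python
import numpy as np

def aggregate_max(v, a):
    """Q = V + (A - max_a' A): the greedy action gets advantage 0."""
    return v + (a - a.max(axis=1, keepdims=True))

def aggregate_mean(v, a):
    """Q = V + (A - mean_a' A): the advantages average to 0."""
    return v + (a - a.mean(axis=1, keepdims=True))

# Hypothetical batch of one state with four actions.
v = np.array([[2.0]])                   # shape [batch, 1]
a = np.array([[0.5, -1.0, 0.0, 1.5]])   # shape [batch, n_actions]

q_max = aggregate_max(v, a)
q_mean = aggregate_mean(v, a)

print(q_max)   # the largest entry equals V
print(q_mean)  # the row mean equals V
```

Both versions differ only by a per-state constant, so they rank the actions identically; and both are unchanged when a constant is added to every entry of A, which removes the ambiguity described above.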

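2. Code Implementation

The code in this article is again based on the code by Morvan Zhou (莫煩). Its GitHub address is: https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow

The effect we want to achieve here is similar to a treasure hunt.

The red square represents the treasure hunter, the black squares represent traps, and the yellow square represents the treasure. Our goal is for the treasure hunter to reach the treasure.

Here, the state can be represented by the horizontal and vertical coordinates, and there are four actions: up, down, left, and right. tkinter is used to produce the animation. The reward for the treasure is 1, the reward for a trap is -1, and the reward is 0 at all other times.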
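The reward rule above can be sketched without any GUI. This is not the tkinter environment from the repository: the grid size, the trap and treasure positions, the action numbering, and the choice to end the episode on a trap or the treasure are all assumptions made for illustration.

```python
# Hypothetical 4x4 layout; coordinates are (x, y) with y growing downward.
GRID_SIZE = 4
TRAPS = {(1, 2), (2, 1)}
TREASURE = (2, 2)
# Assumed action numbering: 0 = up, 1 = down, 2 = left, 3 = right.
ACTIONS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}

def step(state, action):
    """Return (next_state, reward, done) for one move on the grid."""
    dx, dy = ACTIONS[action]
    # Moves that would leave the grid keep the hunter on the border.
    x = min(max(state[0] + dx, 0), GRID_SIZE - 1)
    y = min(max(state[1] + dy, 0), GRID_SIZE - 1)
    next_state = (x, y)
    if next_state == TREASURE:
        return next_state, 1, True    # treasure: reward 1
    if next_state in TRAPS:
        return next_state, -1, True   # trap: reward -1
    return next_state, 0, False       # everything else: reward 0

print(step((0, 0), 3))  # ((1, 0), 0, False)
```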

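Next, we focus on the code that is specific to Dueling DQN.

Define the inputs

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

# The statements below live inside the network-building method of the agent class.
# ------------------------ input ---------------------------
self.s = tf.placeholder(tf.float32, [None, self.n_features], name='s')                # current state
self.q_target = tf.placeholder(tf.float32, [None, self.n_actions], name='Q-target')  # training target
self.s_ = tf.placeholder(tf.float32, [None, self.n_features], name='s_')              # next state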

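Define the network structure

Following the Dueling DQN architecture, we first define one hidden layer. The output of this hidden layer is fed into two separate output layers, which produce the value V of the state and the advantage A of each action. Finally, the Q-value of each action is obtained by combining them as Q = V + (A - mean(A)):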

def build_layers(s, c_names, n_l1, w_initializer, b_initializer):
    # Shared hidden layer
    with tf.variable_scope('l1'):
        w1 = tf.get_variable('w1', [self.n_features, n_l1], initializer=w_initializer, collections=c_names)
        b1 = tf.get_variable('b1', [1, n_l1], initializer=b_initializer, collections=c_names)
        l1 = tf.nn.relu(tf.matmul(s, w1) + b1)

    if self.dueling:
        # Value stream: one scalar per state, shape [batch, 1]
        with tf.variable_scope('Value'):
            w2 = tf.get_variable('w2', [n_l1, 1], initializer=w_initializer, collections=c_names)
            b2 = tf.get_variable('b2', [1, 1], initializer=b_initializer, collections=c_names)
            self.V = tf.matmul(l1, w2) + b2

        # Advantage stream: one value per action, shape [batch, n_actions]
        with tf.variable_scope('Advantage'):
            w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
            b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
            self.A = tf.matmul(l1, w2) + b2

        # Q = V + (A - mean(A)); keep_dims is the older TF 1.x spelling of keepdims
        with tf.variable_scope('Q'):
            out = self.V + self.A - tf.reduce_mean(self.A, axis=1, keep_dims=True)
    else:
        # Plain DQN head: Q-values come straight from the hidden layer
        w2 = tf.get_variable('w2', [n_l1, self.n_actions], initializer=w_initializer, collections=c_names)
        b2 = tf.get_variable('b2', [1, self.n_actions], initializer=b_initializer, collections=c_names)
        out = tf.matmul(l1, w2) + b2

    return out
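To make the tensor shapes in the dueling branch concrete, here is a NumPy-only sketch of the same forward pass: a shared ReLU layer, a value stream, an advantage stream, and the mean-centred merge. It is not the TensorFlow graph itself; the function name dueling_forward, the parameter dictionary, and the random inputs are assumptions for illustration.

```python
import numpy as np

def dueling_forward(s, params):
    """Shared hidden layer, then value and advantage streams, merged as V + (A - mean(A))."""
    l1 = np.maximum(s @ params["w1"] + params["b1"], 0.0)  # ReLU, [batch, n_l1]
    v = l1 @ params["wv"] + params["bv"]                   # [batch, 1]
    a = l1 @ params["wa"] + params["ba"]                   # [batch, n_actions]
    q = v + (a - a.mean(axis=1, keepdims=True))            # V broadcasts over the actions
    return q, v, a

rng = np.random.default_rng(0)
n_features, n_l1, n_actions = 2, 20, 4

# Randomly initialised parameters, mirroring the initialisers used in the article.
params = {
    "w1": rng.normal(0.0, 0.3, size=(n_features, n_l1)),
    "b1": np.full((1, n_l1), 0.1),
    "wv": rng.normal(0.0, 0.3, size=(n_l1, 1)),
    "bv": np.full((1, 1), 0.1),
    "wa": rng.normal(0.0, 0.3, size=(n_l1, n_actions)),
    "ba": np.full((1, n_actions), 0.1),
}

batch = rng.normal(size=(5, n_features))
q, v, a = dueling_forward(batch, params)
print(q.shape, v.shape, a.shape)  # (5, 4) (5, 1) (5, 4)
```

Because the advantages are centred per row, the mean of each row of q equals the corresponding entry of v.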

Next, we define the eval-net and the target-net:

# ------------------ build eval_net ------------------
with tf.variable_scope('eval_net'):
    c_names, n_l1, w_initializer, b_initializer = \
        ['eval_net_params', tf.GraphKeys.GLOBAL_VARIABLES], 20, \
        tf.random_normal_initializer(0., 0.3), tf.constant_initializer(0.1)  # config of layers

    self.q_eval = build_layers(self.s, c_names, n_l1, w_initializer, b_initializer)

# ------------------ build target_net ------------------
with tf.variable_scope('target_net'):
    # Same architecture, but its variables go into a separate collection
    c_names = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
    self.q_next = build_layers(self.s_, c_names, n_l1, w_initializer, b_initializer)

Define the loss and the optimizer

Next, we define the loss. As in DQN, we use the squared error:

with tf.variable_scope('loss'):
    self.loss = tf.reduce_mean(tf.squared_difference(self.q_target, self.q_eval))

with tf.variable_scope('train'):
    # Named _train_op to match the name used in the update step
    self._train_op = tf.train.RMSPropOptimizer(self.lr).minimize(self.loss)

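Define the replay memory

We use a function to fill the replay memory. Each row of the memory has length n_features * 2 + 2: the state, the action, the reward, and the next state.

def store_transition(self, s, a, r, s_):
    if not hasattr(self, 'memory_counter'):
        self.memory_counter = 0

    # One row: [s, a, r, s_]
    transition = np.hstack((s, [a, r], s_))

    # Overwrite the oldest row once the memory is full
    index = self.memory_counter % self.memory_size
    self.memory[index, :] = transition
    self.memory_counter += 1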
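The wrap-around behaviour of the memory can be seen in isolation. The sketch below pulls the same logic into a small standalone class; the class name ReplayMemory and the tiny sizes are assumptions for illustration, not part of the original code.

```python
import numpy as np

class ReplayMemory:
    """Fixed-size buffer whose rows are [s, a, r, s_]."""

    def __init__(self, memory_size, n_features):
        self.memory_size = memory_size
        self.memory = np.zeros((memory_size, n_features * 2 + 2))
        self.memory_counter = 0

    def store(self, s, a, r, s_):
        transition = np.hstack((s, [a, r], s_))
        # The modulo makes new rows overwrite the oldest ones.
        index = self.memory_counter % self.memory_size
        self.memory[index, :] = transition
        self.memory_counter += 1

mem = ReplayMemory(memory_size=3, n_features=2)
for t in range(4):
    mem.store(np.array([t, t]), a=1, r=0.0, s_=np.array([t + 1, t + 1]))

# The fourth transition (t = 3) has replaced the first one in row 0.
print(mem.memory[0])
```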

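Choose an action

We still use an epsilon-greedy policy to select actions. Note the convention used in this code: epsilon is the probability of acting greedily. With probability epsilon the agent picks the action with the highest Q-value, and with probability 1 - epsilon it picks a random action.

def choose_action(self, observation):
    # Add a batch dimension: [n_features] -> [1, n_features]
    observation = observation[np.newaxis, :]
    actions_value = self.sess.run(self.q_eval, feed_dict={self.s: observation})
    action = np.argmax(actions_value)

    # With probability 1 - epsilon, replace the greedy action by a random one
    if np.random.random() > self.epsilon:
        action = np.random.randint(0, self.n_actions)
    return action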
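The selection rule can also be written as a standalone function, which makes the convention easy to test. This is a sketch only: the function name epsilon_greedy, the explicit random generator argument, and the Q-values are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Greedy action with probability epsilon, uniformly random action otherwise."""
    if rng.random() > epsilon:
        return int(rng.integers(0, len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.9, 0.3, 0.2])  # hypothetical Q-values for four actions

# epsilon = 1.0: rng.random() is never greater than 1, so the choice is always greedy.
always_greedy = [epsilon_greedy(q_values, 1.0, rng) for _ in range(100)]
# epsilon = 0.0: the choice is (almost surely) always random.
mostly_random = [epsilon_greedy(q_values, 0.0, rng) for _ in range(1000)]

print(set(always_greedy))
print(sorted(set(mostly_random)))
```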

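Select a batch of data

We sample the data used for training from the replay memory.

# np.random.choice samples with replacement by default
if self.memory_counter > self.memory_size:
    # Memory is full: sample from all rows
    sample_index = np.random.choice(self.memory_size, size=self.batch_size)
else:
    # Memory is not full yet: sample only from the rows written so far
    sample_index = np.random.choice(self.memory_counter, size=self.batch_size)

batch_memory = self.memory[sample_index, :]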

Update the target-net

Here, every fixed number of steps, we copy the eval-net parameters into the target-net:

# Built once when the graph is constructed
t_params = tf.get_collection('target_net_params')
e_params = tf.get_collection('eval_net_params')
self.replace_target_op = [tf.assign(t, e) for t, e in zip(t_params, e_params)]

# Checked at the start of each learning step
if self.learn_step_counter % self.replace_target_iter == 0:
    self.sess.run(self.replace_target_op)
    print('\ntarget_params_replaced\n')
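The effect of the periodic copy is that the target-net lags behind the eval-net between replacements. The following sketch imitates this with plain NumPy arrays standing in for network parameters; the function name maybe_replace_target, the dictionaries, and the fake gradient step are all assumptions for illustration.

```python
import numpy as np

def maybe_replace_target(step, replace_target_iter, eval_params, target_params):
    """Copy the eval-net parameters into the target-net every replace_target_iter steps."""
    if step % replace_target_iter == 0:
        for name, value in eval_params.items():
            target_params[name] = value.copy()
        return True
    return False

eval_params = {"w": np.zeros(2)}
target_params = {"w": np.zeros(2)}

replaced_at = []
target_history = []
for step in range(1, 7):
    eval_params["w"] += 1.0  # stand-in for one gradient update of the eval-net
    if maybe_replace_target(step, 3, eval_params, target_params):
        replaced_at.append(step)
    target_history.append(float(target_params["w"][0]))

print(replaced_at)      # [3, 6]
print(target_history)   # [0.0, 0.0, 3.0, 3.0, 3.0, 6.0]
```

Keeping the target-net frozen between copies means the regression target does not move at every gradient step.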

Update the network parameters

We update the network parameters in the same way as in DQN:

# Q-values of the next states (target-net) and of the current states (eval-net)
q_next = self.sess.run(self.q_next, feed_dict={self.s_: batch_memory[:, -self.n_features:]})  # next observation
q_eval = self.sess.run(self.q_eval, {self.s: batch_memory[:, :self.n_features]})

# Start from q_eval so that only the action actually taken contributes to the loss
q_target = q_eval.copy()

batch_index = np.arange(self.batch_size, dtype=np.int32)
eval_act_index = batch_memory[:, self.n_features].astype(int)
reward = batch_memory[:, self.n_features + 1]

q_target[batch_index, eval_act_index] = reward + self.gamma * np.max(q_next, axis=1)

_, self.cost = self.sess.run([self._train_op, self.loss],
                             feed_dict={self.s: batch_memory[:, :self.n_features],
                                        self.q_target: q_target})
self.cost_his.append(self.cost)

# Raise epsilon (the probability of acting greedily) until it reaches epsilon_max
self.epsilon = self.epsilon + self.epsilon_increment if self.epsilon < self.epsilon_max else self.epsilon_max
self.learn_step_counter += 1  # the source is cut off after "+="; an increment of 1 is assumed
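A small worked example may help show why q_target starts as a copy of q_eval. The sketch below runs the same target construction on two made-up transitions; all numbers, including gamma = 0.9, are assumptions for illustration. Like the code above, it does not treat terminal transitions specially.

```python
import numpy as np

n_features, gamma = 2, 0.9

# Two hypothetical transitions laid out as [s (2 values), a, r, s_ (2 values)].
batch_memory = np.array([
    [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],  # action 1, reward 0
    [1.0, 1.0, 3.0, 1.0, 2.0, 1.0],  # action 3, reward 1
])
# Made-up network outputs for the current and next states.
q_eval = np.array([[0.1, 0.2, 0.3, 0.4],
                   [0.5, 0.6, 0.7, 0.8]])
q_next = np.array([[1.0, 2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0]])

q_target = q_eval.copy()
batch_index = np.arange(len(batch_memory))
eval_act_index = batch_memory[:, n_features].astype(int)
reward = batch_memory[:, n_features + 1]

# Only the entry of the action taken is replaced by r + gamma * max_a' Q_next(s', a').
q_target[batch_index, eval_act_index] = reward + gamma * np.max(q_next, axis=1)

changed = ~np.isclose(q_target, q_eval)
loss = np.mean((q_target - q_eval) ** 2)

print(q_target)
print(changed.sum())  # 2: one entry per transition
```

All other entries of q_target equal q_eval, so their squared difference is zero and they produce no gradient.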

Originally published: 2018-10-9

Author: 文文

This article comes from Python愛好者社群, a partner of the Yunqi Community (雲栖社群).
