Training algorithm

The reinforcement learning algorithm described hereinafter is implemented in learn.py (Learn).

Introduction

The optimal value function is approximated by \(\hat{v}(s; {\bf w}^*)\), a function computed by a multi-layer neural network with weight vector \({\bf w}\).

The Bellman optimality equation for the approximate value function is

\[\begin{equation} \hat{v}(s; {\bf w}^*) = \min_a \sum_{s'} \text{Pr}(s'|s,a) [ 1 + \hat{v}(s'; {\bf w}^*)] \qquad \forall s \neq s^\Omega \end{equation}\]

with \(\hat{v}(s^\Omega; {\bf w}^*) = 0\).

The problem consists in finding the weights \({\bf w}^*\) that allow this equation to be satisfied “at best”, that is, the weights that minimize the residual error. This residual error, called the Bellman optimality error, reads

\[\begin{equation} L({\bf w}) = \mathbb{E}_{s} \left[ \min_a \sum_{s'} \text{Pr}(s'|s,a) [1 + \hat{v}(s'; {\bf w})] - \hat{v}(s; {\bf w}) \right]^2 \end{equation}\]

where the expectation is taken over belief states \(s\) visited when following the policy \(\hat{\pi}\) derived from \(\hat{v}\) and defined by

\[\begin{equation} \hat{\pi}(s; {\bf w}) = \text{argmin}_a \, \sum_{s'} \Pr(s'|s,a) [1 + \hat{v}(s'; {\bf w})]. \end{equation}\]

The functional \(L({\bf w})\) is known as the “loss function”, and “training” the network then refers to the iterative update (through stochastic gradient descent) of the weights \({\bf w}\) so as to minimize this loss function.

Deep neural network

Preprocessing

The belief state, which contains both the agent’s position (an \(n\)-tuple) and the source probability distribution (an \(N^n\) array, with \(N\) the linear grid size and \(n\) its dimensionality), is equivalently represented as a source probability distribution centered on the agent (a \((2N-1)^n\) array). This transformation, illustrated below, is applied to the belief state before using it as an input to the neural network.

Architecture

The neural network consists of fully connected layers: the input layer, followed by hidden layers with rectifier linear units (ReLU) activations, and a linear output layer. The network size (number of hidden layers and number of neurons per hidden layer) can be chosen by the user with the FC_LAYERS and FC_UNITS parameters.

Near-optimal policies can be obtained if large enough neural networks are used. Typically 3 hidden layers are sufficient for most cases, and an increasing number of neurons per layer is required for increasing problem sizes.

Algorithm

The training algorithm is a custom model-based version of DQN (Mnih et al., Nature, 2015). The pseudo-code is given below.

pseudo-code for model-based DQN — Pseudo-code for the training algorithm (based on DQN).