Definitions
The agent’s behavior is controlled by a policy, denoted \(\pi\), which maps each belief state \(s\) to an action \(a\). That is, at each step the action is chosen according to \(a = \pi(s)\).
The following policies are implemented in OTTO:
POLICY = -1
: neural network policy
POLICY = 0
: infotaxis
POLICY = 0, STEPS_AHEAD = 3
: N-step infotaxis (here with a 3-step anticipation)
POLICY = 1
: space-aware infotaxis
POLICY = 2
: user-defined policy
POLICY = 5
: random walk
POLICY = 6
: greedy policy
POLICY = 7
: mean distance policy
POLICY = 8
: voting policy
POLICY = 9
: most likely state policy
They are defined hereinafter.
Optimal policy
The optimal policy is denoted \(\pi^*\) and is defined as the policy that minimizes the duration of the search. It can be determined from the solution of a recurrence equation known as the Bellman optimality equation as follows.
The optimal value function \(v^*(s)\) is defined as the minimum, over all policies \(\pi\), of the expected number of steps \(T\) remaining to find the source when starting from belief state \(s\):

\[v^*(s) = \min_{\pi} \mathbb{E}_{\pi} \left[ T \,|\, s \right].\]

It satisfies the Bellman optimality equation:

\[v^*(s) = \min_{a} \left[ 1 + \sum_{s'} \text{Pr}(s'|s,a) \, v^*(s') \right],\]
where \(\text{Pr}(s'|s,a)\) is the probability of transitioning from belief state \(s\) to next belief state \(s'\) after executing action \(a\), and where \(v^*(s^\Omega) = 0\) (by definition of the terminal state \(s^\Omega\)). Possible transitions from \(s\) to \(s'\) correspond to possible observations: either finding the source (event \(F\)), or not finding the source (event \(\bar{F}\)) and receiving \(h\) hits, as illustrated below.
Given \(v^*(s)\), the optimal policy consists in choosing the action that minimizes the expected value of \(v^*(s')\) over successor belief states:

\[\pi^*(s) = \underset{a}{\arg\min} \sum_{s'} \text{Pr}(s'|s,a) \, v^*(s').\]
The continuous nature of the belief state makes it impossible to compute the optimal value function exactly. However, near-optimal value functions can be obtained by approximating \(v^*\) with neural networks trained using deep reinforcement learning.
Neural network policy
The neural network policy is

\[\pi(s) = \underset{a}{\arg\min} \sum_{s'} \text{Pr}(s'|s,a) \, \hat{v}^*(s', {\bf w}),\]
where \(\hat{v}^*(s', {\bf w})\) is an approximation of the optimal value function computed by a multi-layer neural network with weights \({\bf w}\).
To optimize their weights, neural networks are trained using reinforcement learning.
Near-optimal policies can be obtained if large enough neural networks are used.
Trained neural networks yielding near-optimal policies are provided in the zoo
directory.
Infotaxis
The infotaxis policy states that the agent should choose the action from which it expects the greatest information gain about the source location. It reads

\[\pi(s) = \underset{a}{\arg\max} \left[ H(s) - \sum_{s'} \text{Pr}(s'|s,a) \, H(s') \right],\]

where \(H(s)\) is the Shannon entropy of belief state \(s = [{\bf x}^a, p({\bf x})]\):

\[H(s) = - \sum_{{\bf x}} p({\bf x}) \log_2 p({\bf x}).\]
Infotaxis was proposed by Vergassola et al. [Vergassola2007].
The physical intuition behind this strategy is, quoting the authors, that “information accumulates faster close to the source because cues arrive at a higher rate, hence tracking the maximum rate of information acquisition will guide the searcher to the source much like concentration gradients in chemotaxis”.
Infotaxis is far superior to naive strategies, such as going to the most likely source location, but it is not optimal.
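As an illustration, here is a minimal, self-contained sketch of one infotaxis step. It assumes a simplified binary hit/no-hit observation, and the `hit_prob` table is a hypothetical stand-in for the actual observation model; it is not OTTO's implementation.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H = -sum p*log2(p), with 0*log(0) taken as 0."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def bayes_update(p, likelihood):
    """Posterior over source locations after an observation."""
    post = p * likelihood
    return post / post.sum()

def infotaxis_action(p, hit_prob, actions):
    """Return the action with the largest expected entropy decrease.

    hit_prob[a][x] is the probability of detecting a hit after action a
    if the source is at x (observations simplified to binary hit/no-hit).
    """
    h0 = shannon_entropy(p)
    best_action, best_gain = None, -np.inf
    for a in actions:
        q = hit_prob[a]
        pr_hit = float(np.sum(p * q))        # marginal hit probability
        expected_h = 0.0
        for pr_obs, lik in ((pr_hit, q), (1.0 - pr_hit, 1.0 - q)):
            if pr_obs > 0:
                expected_h += pr_obs * shannon_entropy(bayes_update(p, lik))
        gain = h0 - expected_h               # expected information gain
        if gain > best_gain:
            best_action, best_gain = a, gain
    return best_action, best_gain
```

The expected information gain is the mutual information between the observation and the source location, so it is always nonnegative, and strictly positive whenever the observation is informative.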
N-step infotaxis
Infotaxis is based on a one-step anticipation of the possible outcomes of each action. N-step infotaxis generalizes infotaxis to an anticipation over an arbitrary number of steps: it maximizes the cumulative information gain over those steps.
The detailed algorithm relies on an exhaustive tree search [Loisy2022].
The number of anticipated steps is called STEPS_AHEAD
in the code.
Space-aware infotaxis
The space-aware infotaxis policy is a variant of infotaxis shifted toward more exploitation. It reads

\[\pi(s) = \underset{a}{\arg\min} \sum_{s'} \text{Pr}(s'|s,a) \, J(s'),\]
where \(J(s)\) is defined by
with \(J(s^\Omega)=0\). In this expression, \(H(s)\) is the Shannon entropy of belief state \(s = [{\bf x}^a, p({\bf x})]\) and \(D(s)\) is the mean Manhattan distance between the agent and the source:

\[D(s) = \sum_{{\bf x}} p({\bf x}) \, \lVert {\bf x} - {\bf x}^a \rVert_1.\]
Space-aware infotaxis was proposed by Loisy and Eloy [Loisy2022]. It has been shown to beat infotaxis in most cases, and is believed to be a good approximation of the optimal policy.
User-defined policy
Heuristic policies are implemented in otto.classes.heuristicpolicy
.
A template for a new policy is provided by the method _custom_policy
of the class HeuristicPolicy
.
One can then set POLICY = 2
in the parameter file to use this custom policy.
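For illustration only, here is a standalone sketch of the kind of heuristic one might implement. It does not reproduce the actual signature of _custom_policy (consult the class HeuristicPolicy for that); the belief grid, agent position, and moves table below are hypothetical.

```python
import numpy as np

def custom_policy(p, agent_pos, moves):
    """Example heuristic: step toward the belief's center of mass.

    p: 2-D array of source probabilities; agent_pos: (row, col) of the
    agent; moves: dict mapping action name -> (drow, dcol) step.
    NOT OTTO's actual _custom_policy interface, just an illustration.
    """
    nx, ny = p.shape
    xs, ys = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")
    com = (float(np.sum(p * xs)), float(np.sum(p * ys)))  # center of mass
    # Choose the move whose destination is closest (L1 norm) to it.
    return min(moves, key=lambda a: abs(agent_pos[0] + moves[a][0] - com[0])
                                  + abs(agent_pos[1] + moves[a][1] - com[1]))
```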
Random walk
The random walk policy chooses an action at random with equal probabilities. It is not a good policy for the source-tracking POMDP.
Greedy policy
The greedy policy is a short-sighted policy which chooses the action that maximizes the probability of finding the source in the next cell:

\[\pi(s) = \underset{a}{\arg\max} \; p({\bf x}^a(a)),\]
where \({\bf x}^a(a)\) is the new agent position after executing action \(a\).
It is a standard policy for MDPs and POMDPs. It is not a good policy for the source-tracking POMDP.
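A minimal sketch of this rule on a 2-D belief grid (the moves table and the absence of boundary handling are simplifications, not OTTO's implementation):

```python
import numpy as np

def greedy_action(p, agent_pos, moves):
    """Pick the move whose destination cell has the highest source
    probability. p: 2-D belief grid; moves: action -> (drow, dcol)."""
    return max(moves, key=lambda a: p[agent_pos[0] + moves[a][0],
                                      agent_pos[1] + moves[a][1]])
```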
Mean distance policy
The mean distance policy is based on the intuitive idea that the agent should get, on average, closer to the source. For a belief state \(s=[{\bf x}^a, p({\bf x})]\), the mean distance to the source is

\[D(s) = \sum_{{\bf x}} p({\bf x}) \, \lVert {\bf x} - {\bf x}^a \rVert_1,\]
where we use the Manhattan norm \(\lVert \cdot \rVert_1\).
The expected value of the mean distance upon executing action \(a\) in belief state \(s\) is given by

\[\mathbb{E}\left[ D \,|\, s, a \right] = \sum_{s'} \text{Pr}(s'|s,a) \, D(s'),\]
where the sum is taken over all successor belief states \(s'\).
The mean distance policy is then defined by

\[\pi(s) = \underset{a}{\arg\min} \sum_{s'} \text{Pr}(s'|s,a) \, D(s'),\]
and consists in choosing the action that minimizes the expected distance to the source at the next step.
It was proposed in [Loisy2022], as an example of a naive policy which performs poorly.
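A sketch of the quantity \(D(s)\) on a 2-D belief grid (the policy itself evaluates this quantity on the updated beliefs of each possible successor state, which requires the observation model and is omitted here):

```python
import numpy as np

def mean_manhattan_distance(p, agent_pos):
    """D(s) = sum_x p(x) * ||x - x^a||_1 over a 2-D belief grid."""
    nx, ny = p.shape
    xs, ys = np.meshgrid(np.arange(nx), np.arange(ny), indexing="ij")
    dist = np.abs(xs - agent_pos[0]) + np.abs(ys - agent_pos[1])
    return float(np.sum(p * dist))
```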
Voting policy
The voting policy chooses the action that is the most likely to be optimal.
It determines the optimal action for each possible source location, weights each action by the corresponding probability of that location being the true source location, and picks the action with the highest probability.
The probability that action \(a\) is optimal is

\[\text{Pr}(a \text{ is optimal} \,|\, s) = \sum_{{\bf x}} p({\bf x}) \, \phi\left(a = a^*({\bf x})\right),\]
where \(\phi\) has value 1 if its argument is true and 0 otherwise, and where \(a^*({\bf x})\) denotes the optimal action for a source located in \({\bf x}\), which is given by

\[a^*({\bf x}) = \underset{a}{\arg\min} \; \lVert {\bf x}^a(a) - {\bf x} \rVert_1,\]
where \({\bf x}^a(a)\) is the new agent position after executing action \(a\) and where \(\lVert \cdot \rVert_1\) is the Manhattan norm.
The voting policy then reads

\[\pi(s) = \underset{a}{\arg\max} \; \text{Pr}(a \text{ is optimal} \,|\, s).\]
It was originally proposed for robotic navigation [Cassandra1996]. It is not a good policy for the source-tracking POMDP.
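A minimal sketch of the vote on a 2-D belief grid, assuming ties between equidistant moves are broken by dictionary order (a simplification, not OTTO's implementation):

```python
import numpy as np

def voting_action(p, agent_pos, moves):
    """Weight each move by the probability mass of the source locations
    for which that move is the best Manhattan-distance-reducing step,
    then return the move with the most votes."""
    votes = {a: 0.0 for a in moves}
    nx, ny = p.shape
    for x in range(nx):
        for y in range(ny):
            if p[x, y] == 0 or (x, y) == tuple(agent_pos):
                continue
            # Best move if the source were known to be at (x, y).
            best = min(moves,
                       key=lambda a: abs(agent_pos[0] + moves[a][0] - x)
                                   + abs(agent_pos[1] + moves[a][1] - y))
            votes[best] += p[x, y]
    return max(votes, key=votes.get)
```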
Most likely state policy
The most likely state policy finds the most likely source location, and executes the action that would be optimal for that location. In other words, the agent executing this policy moves in the direction of the most likely source location.
This reads

\[\pi(s) = \underset{a}{\arg\min} \; \lVert {\bf x}^a(a) - {\bf x}^{\text{mls}} \rVert_1,\]
where \({\bf x}^a(a)\) is the new agent position after executing action \(a\), where \(\lVert \cdot \rVert_1\) is the Manhattan norm, and where \({\bf x}^{\text{mls}}\) is the most likely source location:

\[{\bf x}^{\text{mls}} = \underset{{\bf x}}{\arg\max} \; p({\bf x}).\]
It was originally proposed for robotic navigation [Cassandra1996]. It is not a good policy for the source-tracking POMDP.
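A minimal sketch on a 2-D belief grid (again with a hypothetical moves table and no boundary handling):

```python
import numpy as np

def most_likely_state_action(p, agent_pos, moves):
    """Move toward the cell with the highest source probability."""
    mls = np.unravel_index(np.argmax(p), p.shape)  # most likely location
    return min(moves, key=lambda a: abs(agent_pos[0] + moves[a][0] - mls[0])
                                  + abs(agent_pos[1] + moves[a][1] - mls[1]))
```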