Imitating Unknown Policies via Exploration

Behavioral cloning is an imitation learning technique that teaches an agent how to behave through expert demonstrations. Recent approaches use self-supervision of fully-observable unlabeled snapshots of the states to decode state-pairs into actions. However, the iterative learning scheme from these techniques are prone to getting stuck into bad local minima. We address these limitations incorporating a two-phase model into the original framework, which learns from unlabeled observations via exploration, substantially improving traditional behavioral cloning by exploiting (i) a sampling mechanism to prevent bad local minima, (ii) a sampling mechanism to improve exploration, and (iii) self-attention modules to capture global features. The resulting technique outperforms the previous state-of-the-art in four different environments by a large margin.

In this paper, we address the problem of imitation from observation by learning to imitate the behavior of an expert through the use of its state information without any other prior information of the observed actions. Our approach, Imitating Unknown Policies via Exploration (IUPE), substantially improves over traditional behavior cloning by exploiting three novel ideas that we develop in this paper: (i) the impact of a carefully-designed sampling mechanism that regulates the observations that are used to feed the inverse dynamics model, preventing the models from reaching bad local minima; (ii) the impact of sampling from the softmax distribution of actions instead of reaching for the maximum a posteriori estimation, as a means to also add stochasticity (i.e., improve exploration) to the iterative imitation learning process and prevent local minima; and (iii) the impact of self-attention within both the inverse dynamics model and the policy model. By making use of sampling, exploration, and self-attention, IUPE is capable of outperforming all state-of-the-art approaches based on behavior cloning, either over low-dimensional state spaces or over raw images, regarding both performance and Average Episodic Reward (\(AER\)).

Sampling

For obtaining the samples from post-demonstrations, we first select the distribution of actions given an environment and the current policy \(P(A|E;\mathcal{I}^{pos})\). Our sampling strategy considers only successful runs from \(\mathcal{I}^{pos}\), i.e., only state-action sequences in which the agent was able to achieve the environment goal. We represent it as \(v_e\) in the equation below, where \(v_e\) is set to 1 if the agent achieves the environment goal or zero otherwise, and \(E\) is the set of runs in an environment:

\[P(A|E;\mathcal{I}^{pos}) = \frac{\sum_{e=1}^{|E|} v_e \times P(A|e)}{|E|}\]

Exploration

The original behavioral cloning framework uses the maximum a posteriori (MAP) estimation, i.e., it predicts the action with the greatest probability given by the model for a pair of states, both in its original version and in its \(\alpha\)-iterations version. By using MAP predictions, the model is relatively unsure about the correct action in earlier iterations, leading to undesired local minima. To avoid such a problem, we sample the actions from the softmax distribution of both models (IDM during the expert labeling and PM during the execution of the environment) creating a stochastic policy capable of further exploration in early iterations considering the model uncertainty. We show that creating those samples allows the IDM to converge in fewer iterations, since the sampling method guarantees a more sparse dataset, and the stochastic policy guarantees more exploration of the search space for properly achieving the environment goal.

Self-Attention

Self-Attention (SA) is a module that learns long-range dependencies within the internal representation of a neural network by computing non-local responses as a weighted sum of the features at all positions. It allows the network to focus on specific features that are relevant to the task at each step, while minimizing the impact of the constant changes in the post-demonstrations at each iteration. Below, we have the Grad-CAM technique to visualize the self-attention gradient activations given an image in a trained policy.

The complete paper can be seen here. When citing our work in academic papers, please use this BibTeX entry:

@inproceedings{GavenskiEtAl2020bmvc,
  author    = {Gavenski, Nathan and
               Monteiro, Juarez and 
               Granada, Roger and 
               Meneguzzi, Felipe and 
               Barros, Rodrigo C},
  title     = {Imitating Unknown Policies via Exploration},
  booktitle = {Proceedings of the 31st British Machine Vision Conference},
  series    = {BMVC 2020},
  location  = {Manchester, UK},
  pages     = {1--10},
  url       = {https://arxiv.org/abs/2008.05660},
  month     = {September},
  year      = {2020},
  publisher = {BMVA Press}
}
Written on July 29, 2020