Acting upon Imagination: When to Trust Imagined Trajectories in Model Based Reinforcement Learning (2024)

Adrian Remonda, Eduardo Veas, Granit Luzhnica

Abstract

Model-based reinforcement learning (MBRL) aims to learn model(s) of the environment dynamics that can predict the outcome of its actions. Forward application of the model yields so-called imagined trajectories (sequences of action and predicted state-reward) used to optimize the set of candidate actions that maximize expected reward. The outcome, an ideal imagined trajectory or plan, is imperfect, and typically MBRL relies on model predictive control (MPC) to overcome this by continuously re-planning from scratch, thus incurring a major computational cost and increased complexity in tasks with a longer receding horizon.

We propose uncertainty estimation methods for online evaluation of imagined trajectories to assess whether further planned actions can be trusted to deliver acceptable reward. These methods include comparing the error after performing the last action with the standard expected error and using model uncertainty to assess the deviation from expected outcomes. Additionally, we introduce methods that exploit the forward propagation of the dynamics model to evaluate if the remainder of the plan aligns with expected results and to assess the remainder of the plan in terms of the expected reward. Our experiments demonstrate the effectiveness of the proposed uncertainty estimation methods by applying them to avoid unnecessary trajectory replanning in a shooting MBRL setting. Results highlight a significant reduction in computational cost without sacrificing performance.

keywords:

Deep Reinforcement Learning, Model Based Reinforcement Learning, Model-Predictive Control, Robotics, Random shooting methods, Planning

journal: ISA Transactions

Affiliations: Know-Center, Graz, Austria; Graz University of Technology, Graz, Austria

1 Introduction

Reinforcement learning can be successfully applied to continuous control of complex and highly non-linear systems. Algorithms for reinforcement learning can be categorized as model free (MFRL) or model based (MBRL). Using deep learning in MFRL has achieved success in learning complex policies from raw input, such as solving problems with high-dimensional state spaces (Sutton and Barto, 1998; Mnih et al., 2015) and continuous action space optimization problems with algorithms like deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), Proximal Policy Optimization (PPO) (Schulman et al., 2017) or Soft Actor-Critic (SAC) (Haarnoja et al., 2018). While significant progress has been made, the high sample complexity of MFRL limits its real-world applicability. Collecting the millions of transitions needed to converge is not always feasible in real-world problems, as excessive data collection is expensive, tedious, or can lead to physical damage (Williams et al., 2017; Chua et al., 2018). In contrast, model-based methods are comparably sample efficient. MBRL techniques build a predictive model of the world to imagine trajectories of future states and plan the best actions to maximize a reward.

MBRL uses a dynamics model to predict the outcome of taking an action in an environment at a given state. It can bootstrap from existing experiences and is versatile to changing objectives on the fly. Nevertheless, its performance degrades with growing model error, and in the general case of nonlinear dynamics guarantees of local optimality do not hold (Janner et al., 2019). Sampling-based model predictive control (MPC) algorithms can be used to address this issue: they generate a large number of trajectories with the goal of maximizing the expected accumulated reward. However, complex environments are typically partially observable and the problem is formulated as a receding-horizon planning task. Hence, after executing a single step, trajectories are generated again from scratch to deal with the receding horizon and to reduce the impact of compounding model error (Rao, 2010). An additional challenge lies in the high cost incurred by frequently generating imagined trajectories from scratch during planning. For each trajectory generated, a sequence of actions has to be evaluated with the dynamics model. Each successive step in a trajectory depends on previous states, so the evaluation of a single trajectory is a recurrent process that cannot be parallelized. Acting upon imagination, the method proposed here, seeks to estimate online the uncertainty of the imagined trajectory and to reduce the planning cost by continuing to act upon it unless it can no longer be trusted.

Therefore, in this work, we present methods for uncertainty estimation designed to address and improve the computational limitations of shooting MPC MBRL methods. Our main objective is the online estimation of the uncertainty of the model plan. A second objective is the application of uncertainty estimation to avoid frequent replanning. Here, we propose using the degree to which computations are reduced as a practical and quantifiable proxy for the effectiveness of the uncertainty estimation methods. If our methods are successful, we expect to observe a significant decrease in computations without a substantial decay in performance. The balance between computational efficiency and performance demonstrates the reliability of our uncertainty estimation methods. A robust estimation of uncertainty facilitates efficient decision-making and optimizes the use of computational resources. Our contributions are as follows:

1. We provide a thorough analysis and discussion on quantifying the error of predicted trajectories.

2. We propose methods for uncertainty estimation in MBRL:

   (a) Methods that observe the outcome of the last action in the trajectory: (i) comparing the error after performing the last action with the standard expected error, (ii) assessing the deviation with respect to expected outcomes using model uncertainty.

   (b) Methods that exploit the forward propagation of the dynamics model to (iii) evaluate if the remainder of the plan aligns with expected results, and (iv) assess the remainder of the plan in terms of the expected reward.

3. We demonstrate how our proposed uncertainty estimation methods can be used to bypass the need for replanning in sampling-based MBRL methods.

Our experimental results on challenging benchmark control tasks demonstrate that the proposed methods effectively leverage accurate predictions as well as dynamically decide when to replan. This approach leads to substantial reductions in training time and promotes more efficient use of computational resources by eliminating unnecessary replanning steps.

2 Related work

Model-based reinforcement learning (MBRL) has been applied in various real-world control tasks, such as robotics. Compared to model-free approaches, MBRL tends to be more sample-efficient (Deisenroth et al., 2013). MBRL can be grouped into four main categories (Zhu et al., 2020):
1) Dyna-style algorithms optimize policies using samples from a learned world model (Sutton, 1990).
2) Model-augmented value expansion methods, such as MVE (Oh et al., 2017), use model-based rollouts to enhance targets for model-free temporal difference updates.
3) Analytic-gradient methods can be used when a differentiable world model is available; they adjust the policy through gradients that flow through the model. Compared to traditional planning algorithms that create numerous rollouts to choose the optimal action sequence, analytic-gradient methods are more computationally efficient. Stochastic Value Gradients (SVG) (Heess et al., 2015) provide a way to calculate analytic value gradients using a generic differentiable world model. Dreamer (Hafner et al., 2020), a milestone in analytic-gradient model-based RL, demonstrates superior performance in visual control tasks. Dreamer expands upon SVG by generating imaginary rollouts within the latent space.
4) Model predictive control (MPC) and sampling-based shooting methods employ planning to select actions. They are notably effective for addressing real-world scenarios, since excessive data collection is not only costly and tedious but can also result in physical damage. Additionally, sampling-based MPC methods have the capacity to bootstrap from existing experiences and rapidly adapt to changing objectives on the fly. However, a significant drawback of these approaches is their computationally intensive nature (Rao, 2010; Chua et al., 2018). The present work belongs to the latter category.

Recently, it was demonstrated that parametric function approximators, namely neural networks (NN), efficiently reduce sample complexity in problems with high-dimensional non-linear dynamics (Nagabandi et al., 2018). Random shooting methods artificially generate a large number of actions (Rao, 2010), and MPC is used to select candidate actions (Camacho et al., 2004). For instance, Williams et al. (2017) and Drews et al. (2017) introduced a sampling-based MPC with a dynamics model to sample a large number of trajectories in parallel. A two-layer NN trained from maneuvers performed by a human pilot was superior to a physics model built using vehicle dynamics equations from bicycle models. One disadvantage is that NNs cannot quantify predictive uncertainty.

Lakshminarayanan et al. (2016) utilized ensembles of probabilistic NNs to determine predictive uncertainty. Kalweit and Boedecker (2017) used a notion of uncertainty within a model-free RL (MFRL) agent to switch to executing imagined trajectories from a dynamics model when the MFRL agent has high uncertainty. Conversely, Buckman et al. (2018) used imagined trajectories to improve the sample complexity of an MFRL agent. They improved the Q function by using ensembles to estimate uncertainty and prioritize trajectories thereupon.

Measuring the reliability of a learned dynamics model when generating imagined trajectories has been proposed in several works. Chua et al. (2018) identified two types of uncertainty: aleatoric (inherent to the process) and epistemic (resulting from datasets with too few data points). They combined uncertainty-aware probabilistic ensembles in the trajectory sampling of the MPC with a cross entropy controller and achieved asymptotic performance comparable to Proximal Policy Optimization (PPO) (Schulman et al., 2017) or Soft Actor-Critic (SAC) (Haarnoja et al., 2018), with more sample-efficient convergence. Janner et al. (2019) generated (truncated) short trajectories with a probabilistic ensemble to train the policy of an MFRL agent, thus significantly improving its sampling efficiency. Yu et al. (2020) also exploit the uncertainty of the dynamics model to improve policy learning in an offline RL setting. They learn policies entirely from a large batch of previously collected data, with rewards artificially penalized by the uncertainty of the dynamics. While these works focus on sample efficiency and improving performance, our work proposes novel methods to estimate the uncertainty of the dynamics model in order to determine when to replan.

The authors in (Hafez et al., 2020) propose an analytic-gradient-based method that considers the reliability of the learned dynamics model used for imagining the future. They evaluate their approach in the context of enhancing vision-based robotic grasping, aiming to improve sample efficiency in sparse-reward environments. In contrast to their method, ours does not require numerous local dynamics models or a self-organizing map. Instead, we introduce a technique that exploits the uncertainty of the dynamics model to estimate the uncertainty of the plan during execution, primarily aimed at minimizing replanning within an MPC framework. Close to our work, Zhu et al. (2020) studied the discrepancy between imagined and real trajectories. Their method allows for policy generalization to real-world interactions by optimizing the mutual information between imagined and real trajectories, while simultaneously refining the policy based on the imagined trajectories. However, their focus is on analytic-gradient MBRL only; our method can be applied to any MBRL approach that yields a notion of uncertainty, and we focus on shooting methods, which are still the first choice in domains like self-driving cars (Williams et al., 2017).

Hansen et al. (2022) obtained state-of-the-art performance in terms of reward and training time on diverse continuous control tasks by significantly improving model-augmented value expansion methods. Their approach effectively combines the strengths of both MFRL and MBRL. They adopt a learned task-oriented latent dynamics model for localized trajectory optimization over a short horizon. Furthermore, they utilize a learned terminal value function to estimate long-term returns. However, their method still necessitates learning the value function, which, depending on the context, can present challenges when compared to shooting methods.

Nevertheless, shooting MPC methods still suffer from expensive computation (Chua et al., 2018; Zhu et al., 2020). Thus, our research seeks to reduce the amount of computation by continuing to act upon trajectories that seem trustworthy. Our solution builds upon the results of Chua et al. (2018), using probabilistic ensembles and cross entropy in the MPC.

3 Preliminaries

RL aims to learn a policy that maximizes the accumulated reward obtained from the environment. At each time $t$, the agent is at a state $s_t \in S$, executes an action $a_t \in A$ and receives from the environment a reward $r_t = r(s_t, a_t)$ and a state $s_{t+1}$ according to some unknown dynamics function $f: S \times A \to S$. The goal is then to maximize the sum of discounted rewards $\sum_{i=t}^{\infty} \gamma^{(i-t)} r(s_i, a_i)$, where $\gamma \in [0,1]$. MBRL uses a discrete-time dynamics model $\hat{f}(s_t, a_t)$ to predict the future state $\hat{s}_{t+\Delta_t}$ after executing action $a_t$ at state $s_t$.
To reach a state further into the future, the dynamics model evaluates sequences of actions $a_{t:t+H} = (a_t, \ldots, a_{t+H-1})$ over a longer horizon $H$, to maximize their discounted reward $\sum_{i=t}^{t+H-1} \gamma^{(i-t)} r(s_i, a_i)$. Due to the partial observability of the environment and the error of the dynamics model $\hat{f}$ in predicting the real physics $f$, the controller typically executes only one action $a_t$ of the trajectory, and the optimization is solved again with the updated state $s_{t+1}$. Algorithm 1 outlines the general steps. When training from scratch, the dynamics model $\hat{f}_\theta$ is learned from data, $\mathcal{D}_{env}$, collected on the fly. With $\hat{f}_\theta$, the simulator starts and the controller is called to plan the best trajectory, resulting in $a^*_{t:t+H}$. Only the first action of the trajectory, $a^*_t$, is executed in the environment and the rest is discarded. The data collected from the environment is added to $\mathcal{D}_{env}$ and $\hat{f}_\theta$ is trained further. MBRL thus requires a strategy to generate an action $a_t$, given a state $s_t$, a discrete-time dynamics model $\hat{f}(s_t, a_t)$ to predict the state $s_{t+1}$, and a reward function $r_t = r(s_t, a_t)$.

Probabilistic Dynamics Model.

We model the probability distribution of the next state given the current state and an action using a neural network based regression model similar to Lakshminarayanan et al. (2016). The last layer of the model outputs the parameters of a Gaussian distribution modeling the aleatoric uncertainty (due to the randomness of the environment). Its parameters are learned together with the parameters of the neural network. To model the epistemic uncertainty (of the dynamics model, due to generalization errors), we use ensembles with bagging, where all members of the ensemble are identical except for their initial weight values. Each ensemble element takes as input the current state $s_t$ and action $a_t$, and it is trained to predict the difference between $s_t$ and $s_{t+1}$, instead of directly predicting the next state. Thus, the learning target for the dynamics model becomes $\Delta s = s_{t+1} - s_t$. $\hat{f}_\theta$ outputs the probability distribution of the future state, $p_{s(t+1)}$, from which we can sample the future state and its confidence, $\hat{s}, \hat{s}_\sigma = \hat{f}_\theta(s, [\mathbf{a}])$, where $\hat{s}_\sigma$ captures both epistemic and aleatoric uncertainty.
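To make this concrete, the following minimal Python sketch (using PyTorch; class and function names are illustrative and not those of our implementation) shows a probabilistic ensemble trained with a Gaussian negative log-likelihood on the state difference:

# Minimal sketch (not the authors' implementation): a probabilistic ensemble
# dynamics model in PyTorch. Each member outputs a Gaussian over the state
# difference Delta_s = s_{t+1} - s_t; the ensemble spread captures epistemic
# uncertainty, the predicted variance captures aleatoric uncertainty.
import torch
import torch.nn as nn


class GaussianMLP(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=200):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mu = nn.Linear(hidden, state_dim)        # mean of Delta_s
        self.log_var = nn.Linear(hidden, state_dim)   # log-variance of Delta_s

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        return self.mu(h), self.log_var(h).clamp(-10.0, 4.0)


class ProbabilisticEnsemble(nn.Module):
    def __init__(self, state_dim, action_dim, n_members=5):
        super().__init__()
        # Identical architectures, different random initializations (bagging).
        self.members = nn.ModuleList(
            GaussianMLP(state_dim, action_dim) for _ in range(n_members))

    def loss(self, state, action, next_state):
        # Gaussian negative log-likelihood on the state difference.
        target = next_state - state
        total = 0.0
        for m in self.members:
            mu, log_var = m(state, action)
            inv_var = torch.exp(-log_var)
            total = total + ((mu - target) ** 2 * inv_var + log_var).mean()
        return total / len(self.members)

    @torch.no_grad()
    def predict(self, state, action, member):
        # Sample s_{t+1} from one bootstrap member (used by TS-style propagation).
        mu, log_var = self.members[member](state, action)
        std = torch.exp(0.5 * log_var)
        delta = mu + std * torch.randn_like(std)
        return state + delta, std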

Algorithm 1

1: Set replay buffer $\mathcal{D}$ with one iteration of a random controller
2: for Iteration $i = 1$ to $NIterations$ do
3:   Train $\hat{f}$ given $\mathcal{D}$
4:   for Time $t = 0$ to $TaskHorizon$ do
5:     Get $a^*_{t:t+H}$ from $CompOptTrajectory(s_t, \hat{f})$
6:     Execute first action $a^*_t$ from optimal actions $a^*_{t:t+H}$
7:     Record outcome: $\mathcal{D} \leftarrow \mathcal{D} \cup \{s_t, a^*_t, s_{t+1}\}$
8:   end for
9: end for
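For illustration, a condensed Python sketch of this loop is given below; it assumes a Gym-style environment (classic 4-tuple step API) and hypothetical helpers train_model and plan_trajectory standing in for model fitting and the planner of Algorithm 2:

# Minimal sketch of the MBRL-MPC loop in Algorithm 1 (illustrative only).
# `env`, `train_model`, `plan_trajectory` are assumed helpers: a Gym-style
# environment, a routine fitting the ensemble on the buffer, and a planner
# (e.g. CEM, Algorithm 2) returning H optimal actions.

def mbrl_mpc(env, model, train_model, plan_trajectory,
             n_iterations, task_horizon):
    buffer = []                                   # replay buffer D
    state = env.reset()
    for _ in range(task_horizon):                 # seed D with a random controller
        action = env.action_space.sample()
        next_state, reward, done, _ = env.step(action)
        buffer.append((state, action, next_state))
        state = env.reset() if done else next_state

    for _ in range(n_iterations):
        train_model(model, buffer)                # fit dynamics model f_hat on D
        state = env.reset()
        for _ in range(task_horizon):
            actions = plan_trajectory(state, model)              # a*_{t:t+H}
            next_state, reward, done, _ = env.step(actions[0])   # execute a*_t only
            buffer.append((state, actions[0], next_state))
            state = next_state
            if done:
                break
    return model, buffer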

Trajectory Generation.

Each ensemble element outputs the parameters of a normal distribution. To generate trajectories, $P$ particles are created from the current state, $s_t^p = s_t$, which are then propagated by $s_{t+1}^p \sim \hat{f}_b(s_t^p, a_t)$, using a particular bootstrap element $b \in \{1, \ldots, B\}$. There are many options for propagating the particles through the ensemble, analyzed in detail by Chua et al. (2018). They obtained the best results with the $TS_\infty$ method, in which particles never change their initial bootstrap element. Doing so keeps both uncertainties separated at the end of the trajectory: the aleatoric state variance is the average variance of particles of the same bootstrap, whilst the epistemic state variance is the variance of the averages of particles with the same bootstrap index. Our approach also uses the $TS_\infty$ method.
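A minimal sketch of $TS_\infty$ propagation and the resulting uncertainty decomposition is shown below; it reuses the illustrative ProbabilisticEnsemble from the previous sketch and is not the exact implementation used in our experiments:

# Minimal sketch of TS_inf particle propagation. Assumptions: `model` is the
# illustrative ProbabilisticEnsemble above; P particles per bootstrap member,
# and particles never switch member. At the end, aleatoric variance = mean of
# within-member particle variances; epistemic variance = variance of the
# member means.
import torch


@torch.no_grad()
def ts_inf_propagate(model, state, actions, particles_per_member=4):
    B = len(model.members)
    P = particles_per_member
    # Replicate the start state: one row per (member, particle).
    s = state.repeat(B * P, 1)                      # shape (B*P, state_dim)
    member_idx = torch.arange(B).repeat_interleave(P)
    for a in actions:                               # actions: list of action tensors
        a_rep = a.repeat(B * P, 1)
        next_s = torch.empty_like(s)
        for b in range(B):                          # each particle keeps its member
            rows = member_idx == b
            next_s[rows], _ = model.predict(s[rows], a_rep[rows], member=b)
        s = next_s
    per_member = s.view(B, P, -1)
    aleatoric_var = per_member.var(dim=1).mean(dim=0)   # avg within-member variance
    epistemic_var = per_member.mean(dim=1).var(dim=0)   # variance of member means
    return s, aleatoric_var, epistemic_var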

Planning.

To select a convenient course of action leading to $s_H$, MBRL generates a large number of trajectories $K$ and evaluates them in terms of reward. To find the actions that maximize reward, we use the cross entropy method (CEM) (Botev et al., 2013), an algorithm for solving optimization problems based on cross-entropy minimization. CEM gradually changes the sampling distribution of the random search so that rare (high-reward) events become more likely to occur; it estimates a sequence of sampling distributions that converges to a distribution with probability mass concentrated in a region of near-optimal solutions. Algorithm 2 describes the use of CEM to compute the optimal sequence of actions $a^*_{t:t+H}$. Since the controller uses a single action of a trajectory, the computational complexity is constant at each step, given by the depth of the task horizon ($H$) and the number of trajectories ($K$), or breadth. It is possible to parallelize in breadth, but the evaluation of an action $a_{t+i}$ at state $s_{t+i}$ with the dynamics model $\hat{f}$ is iterative, requiring knowledge of at least one past state, and cannot be parallelized in depth. This leads to complexity $O(H \times A \times K)$, where $A$ is the dimension of the actions (how many controllable aspects). $A$ and $K$ depend on the environment.

Algorithm 2

Input: $s_{init}$: current state of the environment, dynamics model $\hat{f}$

1: Initialize $P$ particles, $s_\tau^p$, with the initial state $s_{init}$
2: for actions sampled $a_{t:t+H}$ in $1 \ldots CEM_{samples}$ do
3:   Propagate state particles $s_\tau^p$ using TS and $\hat{f} \mid \{\mathcal{D}, a_{t:t+H}\}$
4:   Evaluate actions as $\sum_{\tau=t}^{t+H} \frac{1}{P} \sum_{p=1}^{P} r(s_\tau^p, a_\tau)$
5:   Update CEM$(.)$ distribution
6: end for
7: return $a^*_{t:t+H}$
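The following Python sketch condenses Algorithm 2 into a generic CEM loop; the helper rollout_reward, which propagates particles with the dynamics model and returns the particle-averaged return of one candidate action sequence, and all hyperparameter values are illustrative assumptions:

# Minimal CEM planner sketch (cf. Algorithm 2). `rollout_reward(state, seq)`
# is assumed to propagate particles with the dynamics model and return the
# particle-averaged sum of rewards for one candidate action sequence.
import numpy as np


def cem_plan(state, rollout_reward, horizon, action_dim,
             n_candidates=400, n_elites=40, n_cem_iters=5):
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_cem_iters):
        # Sample K candidate action sequences from the current distribution.
        candidates = mean + std * np.random.randn(n_candidates, horizon, action_dim)
        candidates = np.clip(candidates, -1.0, 1.0)
        returns = np.array([rollout_reward(state, c) for c in candidates])
        elites = candidates[np.argsort(returns)[-n_elites:]]
        # Refit the sampling distribution to the elite sequences.
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # a*_{t:t+H}: executed one action at a time by the controller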

4 The Promise of Imagination

Generating trajectories is an essential part of the entire process. The predicted states within those trajectories may have a high variance, and their quality depends on the complexity of the environment as well as on the number of steps into the future, $H$. Estimation and online update of uncertainty are needed to determine whether a trajectory is reliable. We contend that when predicted trajectories are reliable it is not necessary to frequently replan them.

Starting from a state at time $t$, a run of the planner yields the optimal set of actions $a^*$ of $H$ steps. Using $a^*$, the dynamics model yields the probability of the next state, from which we can sample the next state and confidence, $\hat{s}_{t+1}, \sigma_{t+1} \sim p^*_{s(t+1)}$, and the future reward, $\hat{r}_t \sim p^*_{r(t+1)}$. Thus, one step sampled from the imagined trajectory is composed of $\langle s_t, a_t, \hat{s}_{t+1}, \sigma_{t+1}, \hat{r}_{t+1} \rangle$, where $s_t$ represents the current state, $a_t$ the action to be taken, $\hat{r}_{t+1}$ the predicted reward if $a_t$ is executed, $\hat{s}_{t+1}$ the predicted next state, and $\sigma_{t+1}$ the confidence bound issued by the dynamics model for the prediction. The planner then iteratively generates the entire trajectory $\tau_t$ of $H$ steps, where each step $i$ is composed of the probability distribution $p^*_{s(t+i)}$.

[Figure 1]

Our methods stem from the following information obtained after executing each step in a trajectory: the instantaneous deviation between the predicted and real outcome, and the impact on the projected plan, see Fig. 1. Executing the action $a_t$ from the imagined trajectory $\tau_t$ yields a real-world state $s_{t+1}$ and reward $r_{t+1}$. The state $s_{t+1}$ is expected to fall within the uncertainty of the model, $\hat{s}_{t+1}, \sigma_{t+1} \sim p^*_{s(t+1)}$. Given $f(s_t, a_t) = s_{t+1}$ and $\hat{f}(s_t, a_t) = \hat{s}_{t+1}$, the instantaneous error $\epsilon_t = |s_{t+1} - \hat{s}_{t+1}| \geq 0$ can be measured. The error at step $i$, $\epsilon_{t+i}$, refers to the error after computing the trajectory and executing action $a_{t+i}$. We can model the error distribution and observe whether new errors fall within it. This captures the immediate effect on the last state. As regards the impact on future actions, forward application of the model to the remainder of the actions, starting at the new state, yields a new trajectory with projected states $p^*_{s(t+1:t+H)}$ and projected rewards $p^*_{r(t+1:t+H)}$, which can be compared with the planned expected outcomes.

Trajectory Quality Analysis:

As a preliminary experiment, we analyze the quality of imagined trajectories with a trained dynamics model, to determine boundaries on how many actions can be executed without deviating from the plan. We analyzed imagined trajectories of agents in the MuJoCo (Todorov et al., 2012) physics engine, with the Cartpole (CP) environment ($S \in \mathbb{R}^4$, $A \in \mathbb{R}^1$, $TaskH$ 200, $H$ 25), with $TaskH$ being the task horizon and $H$ the trajectory horizon. The additional material shows similar findings in other environments. A dynamics model was pre-trained in conventional MBRL-MPC, running Algorithm 1 five times from scratch, with trajectory (re-)planning after executing each single action. The best performing model was selected for the analysis. The procedure consisted of collecting the errors between predicted and actual states when avoiding re-planning for $n$ steps. For each $n \in \{0, \ldots, 19\}$ the algorithm was run for $TaskH$ steps, 10 times. Therefore, the error at $n = 0$ represents the average error of 10 runs executing the first action in the trajectory. Figure 2 illustrates the error of predicted trajectories as a function of the $n$ steps used (i.e., avoiding re-planning). While the error and its variation increase with $n$, the minimum error at each step is still at the same level as (and often lower than) the average error at step 0, where re-planning is never avoided. Generally, re-planning earlier results in a lower error. However, the chart also shows that some trajectories are so reliable that 19 steps can be executed with an error lower than the average error of the first point in the trajectory.

[Figure 2]

Reward Analysis: Compared to the vector of state errors, the reward has the advantage of being a more compact representation (a single scalar) while still providing substantial information. Figure 2 (right) shows the reward of a successfully solved task in CP. After 50 steps, the reward does not change significantly and the system is at a local equilibrium. We contend that when the system is at equilibrium the dynamics model can reliably anticipate the outcomes of the agent's actions; consequently, rewards are expected to remain similar: $\Delta_r(t) = r_t - r_{t-1} \simeq \Delta_r(t+1) = r_{t+1} - r_t$.

5 Acting Upon Imagination

From the above discussion, the following information is available after executing each step (shown in Fig. 2): (i) the immediate error ($\epsilon_{t+1}$), (ii) the model uncertainty or confidence bounds for an imagined action against its execution ($\hat{s}_{t+1}, \sigma_{t+1} \sim p^*_{s(t+1)}$), (iii) the deviation in projected future states ($p^*_{s(t+1:t+H)}$), and (iv) the deviation in projected future rewards ($p^*_{r(t+1:t+H)}$). The last two pieces, (iii) and (iv), are obtained by forward-applying the dynamics model to the remainder of the actions starting at the new state. We leverage each piece of available information to develop methods for uncertainty estimation and evaluate their performance in avoiding replanning events.

Algorithm 3 presents the core logic of our proposed methods to continue acting upon imagined trajectories and reduce computation. The variable $skip$ is updated with the result of one of the four proposed methods, Algorithms 4, 5, 6 or 7. Depending on the outcome, replanning can be avoided and computations reduced. If $skip$ is False, only the first predicted action is executed in the environment. Otherwise, subsequent actions from $a^*_{t:t+H}$ are executed until the $skip$ flag is set back to False or the $TaskHorizon$ number of steps in the environment is reached.

Algorithm 3

1: if trainModel then
2:   Initialize PE dynamics model $\hat{f}$ parameters
3:   Set replay buffer $\mathcal{D}$ with one iteration of a random controller
4: else
5:   Load pre-trained $\hat{f}$ parameters and replay buffer $\mathcal{D}$
6: end if
7: $skip = False$
8: for Iteration $l = 1$ to $NIterations$ do
9:   if trainModel then
10:    Train $\hat{f}$ given $\mathcal{D}$
11:  end if
12:  for Time $t = 0$ to $TaskHorizon$ do
13:    if not $skip$ then
14:      Get $a^*_{t:t+H}$ from $CompOptTrajectory(s_t, \hat{f})$ and
15:      $p^*_{s(t:t+H)}$, $p^*_{r(t:t+H)}$ given $(s_t, \hat{f}, a^*_{t:t+H})$
16:      $i = 0$
17:    else
18:      $i$ += $1$
19:    end if
20:    Execute first action $a^*_t$ from optimal actions $a^*_{t:t+H}$
21:    Discard first action and keep the rest: $a^*_t = a^*_{t+1:t+H}$
22:    Record outcome: $\mathcal{D} \leftarrow \mathcal{D} \cup \{s_t, a^*_t, s_{t+1}\}$
23:    skip = shouldSkip $\{NSKIP \mid FSA \mid CB \mid FUT \mid BICHO\}$
24:  end for
25: end for
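For clarity, the skip logic of Algorithm 3 can be summarized in the following Python sketch; plan, should_skip and the Gym-style env are assumed placeholders for the planner (Algorithm 2), one of the decision rules (Algorithms 4-7) and the environment:

# Minimal sketch of the skip logic in Algorithm 3 (illustrative; `plan`,
# `should_skip`, and `env` are assumed helpers). When `skip` is True the next
# action of the previously planned sequence is executed without replanning.

def act_upon_imagination(env, model, plan, should_skip, task_horizon):
    state = env.reset()
    skip, i, actions, plan_info = False, 0, None, None
    for _ in range(task_horizon):
        if not skip:
            actions, plan_info = plan(state, model)   # a*_{t:t+H}, p*_s, p*_r
            i = 0
        else:
            i += 1
        action, actions = actions[0], actions[1:]     # pop first planned action
        next_state, reward, done, _ = env.step(action)
        # Decide whether the remaining plan can still be trusted.
        skip = len(actions) > 0 and should_skip(i, state, action,
                                                next_state, plan_info)
        state = next_state
        if done:
            break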

N-Skip

As a baseline, we introduce N-Skip, a straightforward replanning method that executes a fixed $n$ steps of a trajectory (of length $H$) and triggers replanning at step $n+1$ ($n < H$). For $n = 0$ the trajectory is recomputed at every step. Since earlier replanning generally leads to a lower error, $n$ is a hyperparameter that should be tuned to meet the performance requirements. In the CP environment, Figure 2 (left) shows a sharply increasing error at $n = 7$, which amounts to 88% fewer computations. Interestingly, despite its simplicity, N-Skip has not been extensively analyzed or reported in the existing literature.
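In the notation of Algorithm 3, the N-Skip rule reduces to the following one-line check (a sketch; $i$ counts the steps executed since the last replanning event):

# Minimal sketch of the N-Skip baseline: replan only every n+1 steps.
def should_skip_nskip(i, n):
    # Keep acting on the current plan while fewer than n extra steps were taken.
    return i < n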

First Step Alike (FSA)

Some trajectories are more reliable than others, and a cutoff of $n$ skips for all trajectories does not account for this variation in quality. Figure 2 (left) shows that there are cases where, after 19 steps of a trajectory, the prediction error is still lower than the average error of predicted states right after replanning. To account for such variation, we propose dynamic decision making. We would like to continue acting if replanning would not improve over the error of the predicted states. The error is lowest just after replanning and increases with the number of steps. The main principle of FSA is therefore to omit replanning at any point $t+i$, $i < H$, as long as the error $\epsilon_{t+i-1}$ is comparable to the errors observed right after replanning.
In formal terms, assume a large sample of $M$ errors collected right after replanning ($\epsilon_0$), denoted by $\mathbb{E} = \{\epsilon_0^{(m)} \mid m \in \{1 \ldots M\}\}$. Actions in a trajectory $\tau_t$ with predicted states $\{\hat{s}_{t+i+1} \mid i \in \{0 \ldots H-1\}\}$ are evaluated at each point $i$, and if the error $\epsilon_{t+i}$ fits the distribution of $\mathbb{E}$, the replanning is skipped. Otherwise, a new trajectory is generated. The challenge lies in determining when the error fits the distribution of $\mathbb{E}$. Two methods are proposed for handling this. If the errors $\mathbb{E}$ follow a Gaussian distribution described by mean $\mu_0$ and standard deviation $\theta_0$, then according to the three-sigma rule (Pukelsheim, 1994), 68.27%, 95.45% and 99.73% of the errors should lie within one, two and three standard deviations of the mean, respectively. It follows that any given error $\epsilon_{t+i}$ (at point $t+i+1$) such that $\mu_0 - c \times \theta_0 \leq \epsilon_{t+i} \leq \mu_0 + c \times \theta_0$ fits the distribution, and thus the replanning should be skipped. The constant $c$ is a hyperparameter that defines the specificity of the filtering method. This filtering ensures that the error $\epsilon_{t+i+1}$ is below a percentile of the errors in $\mathbb{E}$, where the percentage depends on $c$. Furthermore, as we do not want to filter out errors that are too small, we can restrict the rule to one side only: $\epsilon_{t+i} \leq \mu_0 + c \times \theta_0$.
Finally, for the case where the distribution of $\mathbb{E}$ is not Gaussian, a similar effect can be achieved by ensuring that $\epsilon_{t+i}$ is within the $c$ percentile ($P_{c\%}$) of the errors in $\mathbb{E}$, where $c$ is a parameter to tune. The specific logic for FSA is given by Algorithm 4.

Algorithm 4 (FSA)

Input: $\mathbb{E} = \{\epsilon_0^{(i)} \mid i \in \{1 \ldots M\}\}$, parameter $c$ and $\epsilon_{t+k}$

1: if $\mathbb{E}$ is normally distributed (described by $\mu_0$ and $\theta_0$) then
2:   return TRUE if $\epsilon_{t+k} \leq \mu_0 + c \times \theta_0$ else FALSE
3: end if
4: return TRUE if $\epsilon_{t+k} \leq P_{c\%}$ else FALSE
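A minimal Python sketch of this check is given below; the normality test and the interpretation of $c$ as either a sigma factor or a percentile are illustrative choices, not prescribed by the method:

# Minimal sketch of the FSA check (Algorithm 4). `errors_after_replan` is the
# sample E of first-step errors and `err` the scalar error epsilon_{t+k}
# (e.g. a norm over state dimensions). Illustrative only.
import numpy as np
from scipy import stats


def should_skip_fsa(err, errors_after_replan, c=2.0, assume_gaussian=None):
    e = np.asarray(errors_after_replan)
    if assume_gaussian is None:
        # Simple normality test; in practice this could be decided offline.
        _, p_value = stats.shapiro(e)
        assume_gaussian = p_value > 0.05
    if assume_gaussian:
        return err <= e.mean() + c * e.std()      # one-sided sigma-based rule
    return err <= np.percentile(e, c)             # c interpreted as a percentile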

Confidence Bounds (CB)

If the dynamics model has a notion of uncertainty, one can obtain a prediction $\hat{s}_{t+1}$ with an uncertainty or confidence $\sigma_{t+1}$. Given that $p^*_{s(t+1)}$ is modeled by an ensemble of Gaussian regressors, we can assume that the confidence bound $\sigma_{s(t+1)}$ represents the variability in the predicted outcomes $\mu_{s(t+1)}$ of an action $a_t$, where the action in question has been deemed appropriate at the given state $s_t$ and is therefore in the trajectory. After performing the action, we obtain a real-world state $s_{t+1}$. This method considers the trajectory reliable if the actual state $s_{t+1}$ is close to the predicted state $\mu_{s(t+1)}$, within the confidence bound $\sigma_{s(t+1)}$ of the predicted output states obtained from the dynamics model, meaning: $\mu_{s(t+1)} - c \times \sigma_{s(t+1)} \leq s_{t+1} \leq \mu_{s(t+1)} + c \times \sigma_{s(t+1)}$, where $c$ is a constant representing the selectivity of the filter, adjusted as a factor of sigma, a hyperparameter to be tuned. The specific logic for the CB method is given by Algorithm 5. In a nutshell, this method assumes that performing an action could lead to several expected possible outcomes (bounded by the prediction). After performing the action, it is observed whether the outcome lies within the boundary of expected outcomes to determine the reliability of the trajectory.

Algorithm 5 (CB)

Input: $s_{t+1}$, $p^*_{s(t+1)}$ and $c$

1: Get $\mu_{s(t+1)}$, $\sigma_{s(t+1)}$ from $p^*_{s(t+1)}$
2: return TRUE if $\mu_{s(t+1)} - c \times \sigma_{s(t+1)} \leq s_{t+1} \leq \mu_{s(t+1)} + c \times \sigma_{s(t+1)}$ else FALSE
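The check can be sketched in a few lines of Python (illustrative; mu and sigma denote the per-dimension mean and confidence of $p^*_{s(t+1)}$):

# Minimal sketch of the CB check (Algorithm 5). `mu` and `sigma` are the
# per-dimension mean and confidence of the predicted state p*_{s(t+1)};
# `state` is the real next state. All names are illustrative.
import numpy as np


def should_skip_cb(state, mu, sigma, c=2.0):
    lower = mu - c * sigma
    upper = mu + c * sigma
    # Trust the plan only if every state dimension falls inside its bound.
    return bool(np.all((state >= lower) & (state <= upper)))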

Probabilistic future trust (FUT)

FSA and CB assess the error of the state obtained after executing an action against the prediction estimated by $\hat{f}_\theta$. Instead, FUT regards the effect of the last action on the outcomes of future actions, by projecting the remaining imagined actions of the trajectory from the newly obtained state. After replanning, the trajectory $\tau_{t+1}$ of $H$ steps, where each step $i$, $i = 0 \ldots H$, is composed of $\hat{s}_{t+i+1}$, $\sigma_{t+i+1}$, $\hat{r}_{t+i+1}$ sampled from $p^*_{s(t+i+1)}$ and $a^*_{t+i}$, offers a wealth of predicted information. FUT intends to detect whether, after taking $a^*_{t+i}$ and reaching a new state $s_{t+i+1} \neq \hat{s}_{t+i+1}$, the rest of the predictions in $\tau_{t+i+1}$ still hold. Thus, we project the trajectory $\tau'_{t+i+1}$ from state $s_{t+i+1}$ using the imagined actions $a^*_{t+i:H} \in \tau_{t+i+1}$, and then we compare $\tau_{t+i+1}$ and $\tau'_{t+i+1}$. If they differ, the agent has deviated from the plan and a replanning event should be triggered. Otherwise, it proceeds to take action $a_{t+i+1}$. As long as the new trajectory is similar to the original estimation, we assume that the original plan is still valid and we skip that step. This does not mean that the optimal set of actions is replanned at each step.
Rather, every time a replanning is skipped, we propagate only one trajectory starting from the current state of the simulator, still using the original set of actions, $a^*$, as initially planned. Algorithm 6 describes the FUT method. The original trajectory is estimated with the probability distribution $p^*_{s(t+1)}$ and the updated plan with $p'_{s(t+1)}$. We use the Kullback-Leibler (KL) divergence to evaluate the change between the two distributions after each step in the simulator. We replan when the difference is larger than $\beta$ (a hyperparameter). We control how far ahead the two distributions are compared by introducing another hyperparameter: LA (look-ahead steps).

Input: $i$, $p^{*}_{s(t+i+1:H)}$, $\beta$ and $LA$

1: Get $p^{\prime}_{s(t+i+1:H)}$ from ComputeTrajectoryProbs($s_{t+i}$, $\hat{f}$, $a^{*}_{t+i:H}$)

2: $L = \min(H, LA - i)$

3: $dist\_error = KL\big(p^{\prime}_{s(t+i+1:L)} \,\|\, p^{*}_{s(t+i+1:L)}\big)$

4: return TRUE if $dist\_error < \beta$ else FALSE
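The check below is a minimal sketch of the FUT test, under stated assumptions: the dynamics model (e.g. a probabilistic ensemble) yields per-step diagonal Gaussian state distributions, so the planned distributions $p^{*}$ and the re-projected ones $p^{\prime}$ are represented as mean/variance arrays of shape (steps, state_dim). Names such as plan_mu, new_var and look_ahead are illustrative, not taken from the authors' code.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL(p||q) between diagonal Gaussians, summed over state dimensions."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def fut_trust(i, plan_mu, plan_var, new_mu, new_var, beta, look_ahead):
    """FUT check (sketch of Algorithm 6): compare the re-projected state
    distributions p' against the planned ones p* over the first L steps and
    trust the plan while the accumulated divergence stays below beta."""
    L = max(1, min(len(plan_mu), look_ahead - i))   # comparison window
    dist_error = sum(
        gaussian_kl(new_mu[k], new_var[k], plan_mu[k], plan_var[k])
        for k in range(L)
    )
    return dist_error < beta
```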

Bound Imagined Cost Horizon Omission (BICHO)

Figure 2 (right) shows that when the system reaches equilibrium, the reward stabilizes as well. BICHO assumes the expected reward is stable and attempts to determine whether deviations arise after each step in the trajectory.

At each replanning step, we obtain the distribution of rewards $p^{*}_{r}$ for the $H$ steps into the future. Moreover, at each step of the environment, regardless of whether replanning was skipped or not, we project a new trajectory $p^{\prime}_{r}$ of $H$ steps, starting from the state $s_{t}$ given by the environment and using the actions $a^{*}_{t+i}$ of the imagined trajectory obtained at the replanning step. We then compare these two distributions for LA steps ($< H$), where LA is a hyper-parameter to tune. Essentially, replanning steps should be skipped as long as the projected future reward does not change significantly with respect to the reward expected from the plan $p^{*}_{r}$. Note that when the comparison is done immediately after replanning, both trajectories start from the same state $s_{t}$. As steps are taken without replanning, however, the imagined reward $p^{*}_{r(t+i)}$ starts from an imagined state $p^{*}_{s(t+i)}$ whereas the projected trajectory starts at the environment state $s_{t+i}$. We expect this method to work better (in terms of replanning skipped) in environments where the cost has local equilibrium regions. Large overlapping regions between consecutive trajectories are not required, since we only consider an overlap of LA steps ahead, which is typically smaller than the trajectory horizon $H$.

Input: $i$, $p^{*}_{r(t+1:H)}$, $\beta$ and $LA$

1: Get $p^{\prime}_{r(t+i+1:H)}$ from ComputeTrajectoryProbs($s_{t}$, $\hat{f}$, $a^{*}_{t+i:H}$)

2: $L = \min(H, LA - i)$

3: $r\_error = KL\big(p^{\prime}_{r(t+i+1:L)} \,\|\, p^{*}_{r(t+1:L)}\big)$

4: return TRUE if $r\_error < \beta$ else FALSE
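For completeness, a minimal sketch of the BICHO test follows, assuming the per-step reward predictions are one-dimensional Gaussians summarized by mean and variance arrays of shape (steps,). Array names and the clipping of the comparison window are assumptions, not the authors' implementation.

```python
import numpy as np

def bicho_trust(i, plan_r_mu, plan_r_var, new_r_mu, new_r_var, beta, look_ahead):
    """BICHO check (sketch of the pseudocode above): KL divergence between the
    re-projected reward distribution p'_r and the planned one p*_r over the
    first L look-ahead steps; trust the plan while it stays below beta."""
    L = max(1, min(len(plan_r_mu), look_ahead - i))
    kl = 0.5 * np.sum(
        np.log(plan_r_var[:L] / new_r_var[:L])
        + (new_r_var[:L] + (new_r_mu[:L] - plan_r_mu[:L]) ** 2) / plan_r_var[:L]
        - 1.0
    )
    return kl < beta
```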

6 Experiments

[Figure 3: episode reward versus the amount of replanning for NS, FSA, CB, FUT and BICHO in CP (left), HC (mid-left), PU (mid-right) and RE (right); four panels.]

Our ultimate goal is to reduce computation while controlling performance decay. Intuitively, we expect that a trained dynamics model can anticipate the outcome of the agent's actions and, if its predictions are reliable, can do so for a number of consecutive imagined steps. The first experiment therefore uses pre-trained dynamics to assess the potential gains of acting upon imagined trajectories with the proposed methods: N-Skip (NS), FSA, CB, FUT and BICHO. We investigate the amount of re-planning that can be avoided in terms of the number of trajectory plannings skipped, the impact of our approach on the reward, and the average and variance of the number of steps executed before replanning is triggered.

We recognize that acting upon imagination has the potential to afford significant time savings while training the dynamics models, and can potentially achieve a better result in terms of the percentage of re-planning. Our second experiment therefore evaluates the selected methods online, while training the dynamics, to assess the savings in overall training time and their effect on performance. In this experiment, we select the best performing methods from the previous experiment and test them while training the dynamics model from scratch. We evaluate the methods on agents in the MuJoCo physics engine (Todorov et al., 2012) using two workstations with a latest-generation GPU. To establish a valid comparison with the baseline PETS (Chua et al., 2018) (denoted as NS1), we use four environments with corresponding task length (TaskH) and trajectory horizon (H).
We use the following environments: Cartpole (CP): $S\in\mathbb{R}^{4}$, $A\in\mathbb{R}^{1}$, TaskH 200, H 25. Reacher (RE): $S\in\mathbb{R}^{17}$, $A\in\mathbb{R}^{7}$, TaskH 150, H 25. Pusher (PU): $S\in\mathbb{R}^{20}$, $A\in\mathbb{R}^{7}$, TaskH 150, H 25. HalfCheetah (HC): $S\in\mathbb{R}^{18}$, $A\in\mathbb{R}^{6}$, TaskH 1000, H 30. This means that each iteration runs for TaskH (task horizon) steps and that imagined trajectories include H (trajectory horizon) steps. $S\in\mathbb{R}^{i}$, $A\in\mathbb{R}^{j}$ refers to the dimensions of the environment: the state is a vector of $i$ components and the action a vector of $j$ components. We assess performance in terms of reward per episode and evaluate wall time and avoided re-planning. All experiments use random seeds and randomized initial conditions per task. The configuration is collected in the small sketch below.
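The environment settings above, restated as a plain dictionary for reference; this is only a restatement of the reported configuration, not code from the paper.

```python
# State/action dimensionality, task horizon (TaskH) and planning horizon (H)
# for the four MuJoCo environments used in the experiments.
ENVS = {
    "CP": {"state_dim": 4,  "action_dim": 1, "task_h": 200,  "plan_h": 25},
    "RE": {"state_dim": 17, "action_dim": 7, "task_h": 150,  "plan_h": 25},
    "PU": {"state_dim": 20, "action_dim": 7, "task_h": 150,  "plan_h": 25},
    "HC": {"state_dim": 18, "action_dim": 6, "task_h": 1000, "plan_h": 30},
}
```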

6.1 Experiment I: Pre-trained Dynamics

This experiment uses a trained model and compares the uncertainty estimation methods NS, FSA, CB, FUT and BICHO. A pre-trained dynamics model is expected to predict the outcomes of immediate actions reliably and to project a number of steps $i < H$ reliably, with any variability in $i$ attributable to task complexity. We quantify $i$ for each method and environment (see appendix), together with the corresponding percentage of replanning.

A dynamics model is pre-trained for each environment by running Algorithm 1 (no skip) from scratch five times and selecting the best performing model. As a result, we obtain one dynamics model, its parameters and a replay buffer per task, which we use to evaluate our methods. For each method, the different hyper-parameters are evaluated empirically to find out how robust the algorithms are with respect to hyper-parameters across the different environments. We report the amount of replanning and the corresponding performance in terms of reward per episode.

We validate each method's hyper-parameters with 10 runs per task with different random seeds and randomized initial conditions. We report the episode reward as the maximum reward obtained by the agent in an episode over the 10 runs. For NS, we use $n \in \{0, \dots, 19\}$ steps, where $n=0$ recalculates at every step and is used as the baseline. For FSA, we use constants $c = \{0.1, 0.15, 0.25, 0.35, 0.5, 0.75, 0.85, 0.9, 0.95, 0.99, 0.999\}$. The error distribution $\mathbb{E}$ is constructed from the error between the state prediction and the actual state in the environment, using the data set obtained while pre-training the dynamics model. As the collected errors do not follow a normal distribution, percentiles are used in Algorithm 4 to determine whether a trajectory should be recalculated (see the sketch below). In CB, we evaluate values of $C_{cb} = \{0.0, \dots, 2.00\}$ in steps of $0.05$. We expect a higher value of $C_{cb}$ to decrease performance, whereas a very low value of $C_{cb}$ makes the algorithm too selective and results in no skipping at all. In FUT and BICHO, we evaluate values of $\beta = \{0.025, \dots, 1024\}$ for different numbers of look-ahead steps, $C_{FUT|BICHO}$, ranging from 1 to $H$. We report the best $C_{FUT|BICHO}$ for the full range of $\beta$.
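A minimal sketch of how such a percentile-based FSA check can be set up, assuming the pre-training buffer provides paired predicted and observed states and treating the constant c as a percentile level; variable names are illustrative, not the authors' code.

```python
import numpy as np

def build_error_threshold(pred_states, true_states, c):
    """Empirical error distribution from the pre-training data: Euclidean
    one-step prediction errors, summarized by the c-th percentile."""
    errors = np.linalg.norm(pred_states - true_states, axis=-1)
    return np.percentile(errors, 100.0 * c)   # e.g. c=0.5 -> median error

def fsa_trust(s_next, s_pred, threshold):
    """Trust the plan (skip replanning) while the new one-step error stays
    below the empirical threshold."""
    return np.linalg.norm(s_next - s_pred) <= threshold
```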

Results.

For comparison we add the results for Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and Proximal Policy Optimization (PPO) (Schulman et al., 2017) at convergence. See Appendix Table 1 for detailed results for each environment.
CP. Fig 3 (left) shows the performance of NS, FSA, CB, FUT and BICHO in CP. There is no visible performance degradation when replanning only 40% of the time, and from 40% down to 20% the task is still solved with a minor hit in performance. Below 20%, results start to degrade dramatically. Interestingly, with BICHO there is no drastic loss in performance even when replanning only 9% of the steps, which is very close to the limit imposed by the trajectory horizon H.
HC. Fig 3 (mid-left) shows the performance in HC, a more complex environment. The graph shows no impact on performance for FSA, CB, FUT and BICHO when replanning is reduced to 80% of the steps, whilst performance remains acceptable and better than SAC at convergence. With less than 60% replanning, performance drops drastically. However, BICHO still outperforms PPO when recalculating only 20% of the steps. N-skip cannot reduce replanning by more than 50% (n=1) without drastically degrading performance, showing that an adaptive method is necessary to skip replanning in complex environments such as HC.
PU. Fig 3 (mid-right) shows the PU results, revealing that FSA, CB and n-skip maintain stable performance at 80% replanning, with a drastic drop when replanning decreases further. FUT keeps good performance down to 50% replanning, after which performance starts to degrade drastically. BICHO drops slightly in performance below 40% but still maintains good performance when replanning only 10% of the steps.
RE. Fig 3 (right) shows the results in RE. It reveals no visible performance degradation when replanning only 40% of the time, and from 45% to 30% the task is still solved by FSA and CB with a minor hit in performance. Below 20%, reward decreases dramatically. The performance of FUT drops below 30%. In this environment, n-skip degrades comparably to the adaptive methods; however, it should be considered that n-skip starts at 50% replanning and is fixed. The adaptive methods working at around 40% replanning still retain performance above SAC and PPO.
In all environments, the methods that project future actions achieve longer sequences of steps without replanning with an acceptable loss in reward. Results show that replanning less than 70% of the steps is feasible whilst retaining state-of-the-art performance. In environments with lower-dimensional action and state spaces, or lower complexity, it is possible to save up to 80% of the replanning steps. While blindly skipping replanning has some effect, the adaptive methods offer a reasonable trade-off and can be tuned to operate at levels of replanning and performance not reachable by n-skip.

[Figure 4: Experiment II results for BICHO and FUT versus the baseline and NSKIP variants in CP, PU, RE and HC, including relative wall time; four panels.]

6.2 Experiment II: Online dynamics update

As the model is being trained, the number of outcomes it can predict reliably varies. Here, uncertainty estimation should result in aborting plans in favor of re-planning early in training, whilst executing longer trajectories as training progresses.

We evaluate our methods while training the dynamics model, using Algorithm 3 with FUT and BICHO. These methods were selected due to their performance in Experiment I and because they do not need an error model trained in advance, so we can approximate a real deployment of the method with minimal tuning effort. We evaluate the algorithm in each environment with 3 runs. We increased the number of training iterations (episodes) by 50% in order to better observe the effects of skipping re-planning. PETS from Chua et al. (2018) is the baseline.
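The control loop of this online setting can be sketched as follows, under stated assumptions: a gym-style environment, a `plan()` routine that returns an action sequence plus its predicted distributions, and a `trust()` routine standing in for one of the checks above (FUT or BICHO). None of these names come from the authors' code, and this is not their Algorithm 3, only an illustration of the skip-or-replan decision inside an episode.

```python
def run_episode(env, model, plan, trust, task_h):
    """One episode of MPC with trust-based skipping (illustrative sketch)."""
    s = env.reset()
    actions, plan_dist = plan(model, s)       # full replanning from the start state
    k, replans = 0, 1
    for _ in range(task_h):
        s, r, done, _ = env.step(actions[k])  # execute the next planned action
        k += 1
        if done:
            break
        # Replan only when the plan is exhausted or the trust check fails.
        if k >= len(actions) or not trust(s, actions[k:], plan_dist, k):
            actions, plan_dist = plan(model, s)
            k, replans = 0, replans + 1
    return replans
```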

Results.

In the following, we outline the performance of the evaluated algorithms with the best performing hyper-parameters:
CP. Fig 4 shows the results of BICHO and FUT versus the baseline, NSKIP3 and NSKIP4, depicting the relative wall time compared with no skipping. As NSKIP3 replans 33% of the steps, it takes approximately a third of the time to train compared with the baseline, neglecting the training time of the dynamics model. Both BICHO and FUT outperform the baseline and the static skip methods. They retain the same performance as NSKIP while replanning only 15% of the steps. With BICHO, we observe a negligible impact on performance while reducing the wall time to 10% and reaching top performance within a few episodes. To achieve this performance, conventional MPC needs 200 calls to calculateTrajectory, whereas BICHO performed 28 (SD=1.68); see Table 6 in the appendix. As the dynamics model improves, it produces more accurate predictions and the percentage of replanning steps drops. Indeed, after 10 episodes the replanning needed drops to 10%.
PU. BICHO outperforms the baseline, the static methods and FUT. It maintains the performance of no skipping whilst replanning only 10% of the steps and reducing the wall time to 17%, reaching peak performance within a few episodes (see Fig 4).
RE. Fig 4 shows the results of BICHO and FUT versus the baseline, NSKIP2 and NSKIP3. FUT outperforms the baseline and the static skip methods, whereas BICHO is more conservative in this environment and needs more replanning to reach top performance. Both BICHO and FUT maintain the performance of no skipping, but BICHO needs 48% replanning events whereas FUT needs only 14%.
HC. Fig 4 shows the results of BICHO and FUT versus the baseline and NSKIP2. Both BICHO and FUT outperform the static skip methods. BICHO has a slightly worse top performance than the no-skip baseline but skips 20% of the replanning steps. FUT can skip more steps, but the reward drops sharply. In this environment, each method reaches a local optimum before it continues improving. Our methods reach this point very quickly without a performance drop, skipping up to 40% (FUT) of the replanning steps.
The graphs clearly illustrate significant savings in training time when acting upon imagination. In some cases these savings are achieved without any loss in reward, and with minimal loss in the others. More importantly, the loss is relative to a baseline whose training runs five or six times longer.

7 Discussion

Acting upon imagination advocates trusting a reliable imagined trajectory for several steps. Our experiments show that this leads to a 20%-80% reduction in computation, depending on the environment, while maintaining acceptable reward. The proposed methods leverage different kinds of information available after taking an action in the environment: FSA and CB decide whether to act on the basis of evaluating the last action in the trajectory, whereas FUT and BICHO evaluate the planned future actions from the new state. The latter result in less replanning. The proposed methods apply to a range of dynamics models for reducing computation costs, regardless of whether the model can output uncertainty. FUT and BICHO can be used alongside any dynamics model that models uncertainty; FSA and CB, on the other hand, could be used alongside any MBRL algorithm by computing statistics over a sliding window of past experiences. The choice of algorithm depends on the nature of the dynamics model. If the dynamics model does not provide a notion of uncertainty, FSA is preferable. Otherwise, the methods that look towards the future (FUT, BICHO) offer superior performance in terms of saved computation and stability (solving the problem), and BICHO saves the most computation of the two while performing at least as well as the baseline.

Planning

Skipping replanning can indeed be particularly beneficial in robotics, where hardware limitations often impose constraints on computational resources. By intelligently skipping unnecessary replanning events, we can allocate computational power more efficiently and potentially leverage more sophisticated models or algorithms.

Uncertainty estimation

One alternative avenue of research, building on the work of Zhu et al. (2020), would be a progressive measure of mutual information between the imagined and real trajectories, updated with each successive step, to decide whether or not to re-plan. As a limitation, the error expressed as the Euclidean distance between two state vectors is simple and useful, but it may give misleading information. Comparably, BICHO, which looks for deviations between the imagined reward and a more up-to-date reprojection of future rewards, achieves superior performance. A more sophisticated method could take as input the two state vectors, the actions, and the predicted and observed rewards, and output a decision to act or re-plan; for example, a model-free trained policy.

The proposed methods have important implications beyond a greedy motivation to reduce computational effort and time complexity. They offer a way to assess how good the dynamics model is at predicting the outcomes of the agent's actions. Conversely, they offer a way to evaluate experiences: experiences where the outcome of the environment deviates from the predictions of the model may be more informative for training. Indeed, our work offers interesting insights on using our method for guided exploration. Assuming steps in an imagined trajectory can be trusted, their evaluation yields a small error, meaning the dynamics model successfully predicts these transitions. One could refine the exploration by omitting actions that lead to transitions with low error, and thus favour less known transitions for future training. We assume that re-planning is due to errors in the dynamics model. Each re-planning adjusts for the new H in a receding horizon task. But as we act upon imagination, the horizon is no longer receding and the trajectory risks becoming obsolete. So when should the model adjust towards a new horizon of imagination?

8 Conclusion

In conclusion, our study provides a comprehensive analysis and discussion quantifying the error of predicted trajectories in MBRL. We propose methods for online uncertainty estimation in MBRL, incorporating techniques that observe the outcome of the last action in the trajectory. These methods include comparing the error after performing the last action with the standard expected error and assessing the deviation with respect to expected outcomes using model uncertainty. Additionally, we introduce methods that exploit the forward propagation of the dynamics model to evaluate if the remainder of the plan aligns with expected results and assess the remainder of the plan in terms of the expected reward. These methods update the uncertainty estimation in real-time to assess the utility of the plan.We demonstrate the efficacy of these uncertainty estimation techniques. Our methods not only leverage accurate predictions but also intelligently determine when to replan trajectories. This approach significantly reduces training time and optimizes the utilization of computational resources by eliminating unnecessary replanning steps. Overall, our findings highlight the potential of these methods to enhance the performance and efficiency of sampling-based MBRL approaches.

Acknowledgment

Funding in direct support of this work: Adrian Remonda reports financial support was provided by AVL List GmbH. Adrian Remonda reports a relationship with Know-Center GmbH that includes: employment. This research was partially funded by AVL GmbH and Know-Center GmbH. Know-Center is funded within the Austrian COMET Program-Competence Centers for Excellent Technologies - under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

References

  • Botev et al. [2013] Zdravko I. Botev, Dirk P. Kroese, Reuven Y. Rubinstein, and Pierre L'Ecuyer. Chapter 3 - The cross-entropy method for optimization. In C. R. Rao and Venu Govindaraju, editors, Handbook of Statistics, volume 31, pages 35-59. Elsevier, 2013. doi: 10.1016/B978-0-444-53859-8.00003-5. URL http://www.sciencedirect.com/science/article/pii/B9780444538598000035.
  • Buckman et al. [2018] Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient reinforcement learning with stochastic ensemble value expansion. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 8234-8244, Red Hook, NY, USA, 2018. Curran Associates Inc.
  • Camacho et al. [2004] E. F. Camacho, C. Bordons, and C. B. Alba. Model Predictive Control. Advanced Textbooks in Control and Signal Processing. Springer London, 2004. ISBN 9781852336943. URL https://books.google.at/books?id=Sc1H3f3E8CQC.
  • Chua et al. [2018] Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models, 2018.
  • Deisenroth et al. [2013] Marc Peter Deisenroth, Gerhard Neumann, and Jan Peters, 2013.
  • Drews et al. [2017] Paul Drews, Brian Goldfain, Grady Williams, and Evangelos A. Theodorou. Aggressive deep driving: Model predictive control with a CNN cost model. 2017.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018.
  • Hafez et al. [2020] Muhammad Burhan Hafez, Cornelius Weber, Matthias Kerzel, and Stefan Wermter. Improving robot dual-system motor learning with intrinsically motivated meta-control and latent-space experience imagination. CoRR, abs/2004.08830, 2020. URL https://arxiv.org/abs/2004.08830.
  • Hafner et al. [2020] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination, 2020.
  • Hansen et al. [2022] Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control, 2022.
  • Heess et al. [2015] Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, and Tom Erez. Learning continuous control policies by stochastic value gradients, 2015.
  • Janner et al. [2019] Michael Janner, Justin Fu, Marvin Zhang, and S. Levine. When to trust your model: Model-based policy optimization. In NeurIPS, 2019.
  • Kalweit and Boedecker [2017] Gabriel Kalweit and Joschka Boedecker. Uncertainty-driven imagination for continuous deep reinforcement learning. In Proceedings of Machine Learning Research, volume 78, pages 195-206. PMLR, 2017. URL http://proceedings.mlr.press/v78/kalweit17a.html.
  • Lakshminarayanan et al. [2016] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles, 2016.
  • Lillicrap et al. [2015] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015. URL http://arxiv.org/abs/1509.02971.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, February 2015. ISSN 0028-0836. URL http://dx.doi.org/10.1038/nature14236.
  • Nagabandi et al. [2018] Anusha Nagabandi, G. Kahn, Ronald S. Fearing, and S. Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7559-7566, 2018.
  • Oh et al. [2017] Junhyuk Oh, Satinder Singh, and Honglak Lee. Value prediction network. In NIPS, 2017.
  • Pukelsheim [1994] Friedrich Pukelsheim. The three sigma rule. The American Statistician, 48:88-91, 1994.
  • Rao [2010] Anil V. Rao. A survey of numerical methods for optimal control. Advances in the Astronautical Sciences, 135:497-528, 2010.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.
  • Sutton [1990] Richard S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In ML Workshop, 1990.
  • Sutton and Barto [1998] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IROS, pages 5026-5033. IEEE, 2012. ISBN 978-1-4673-1737-5.
  • Williams et al. [2017] Grady Williams, Paul Drews, Brian Goldfain, James M. Rehg, and Evangelos A. Theodorou. Information theoretic model predictive control: Theory and applications to autonomous driving, 2017.
  • Yu et al. [2020] Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-based offline policy optimization, 2020.
  • Zhu et al. [2020] Guangxiang Zhu, Minghao Zhang, Honglak Lee, and Chongjie Zhang. Bridging imagination and reality for model-based deep reinforcement learning. NeurIPS, 2020.

Appendix A Environments

We evaluate the methods on agents in the MuJoCo physics engine (Todorov et al., 2012). To establish a valid comparison with Chua et al. (2018), we use four environments with corresponding task length (TaskH) and trajectory horizon (H).

  • 1. Cartpole (CP): $S\in\mathbb{R}^{4}$, $A\in\mathbb{R}^{1}$, TaskH 200, H 25

  • 2. Reacher (RE): $S\in\mathbb{R}^{17}$, $A\in\mathbb{R}^{7}$, TaskH 150, H 25

  • 3. Pusher (PU): $S\in\mathbb{R}^{20}$, $A\in\mathbb{R}^{7}$, TaskH 150, H 25

  • 4. HalfCheetah (HC): $S\in\mathbb{R}^{18}$, $A\in\mathbb{R}^{6}$, TaskH 1000, H 30

This means that each iteration runs for TaskH (task horizon) steps and that imagined trajectories include H (trajectory horizon) steps. $S\in\mathbb{R}^{i}$, $A\in\mathbb{R}^{j}$ refers to the dimensions of the environment: the state is a vector of $i$ components and the action a vector of $j$ components.

Appendix B Trajectory Quality Analysis

The error (Euclidean distance between the actual and predicted states) as a function of the number of predicted steps into the future is given in Figure 5. This figure is an extension of Figure 2 to all the environments. One curious observation is that the error for environments PU and RE is relatively higher even when not skipping replanning, and it increases faster than for environments CP and HC.
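A sketch of how such curves can be produced, under the assumption that the learned model exposes a one-step `predict(state, action)` call and that logged real states and executed actions are available; names and the averaging scheme are ours, not the authors' procedure.

```python
import numpy as np

def prediction_error_per_depth(model, states, actions, horizon):
    """Mean Euclidean prediction error as a function of imagined depth:
    roll the model forward from each visited state with the executed
    actions and compare against the real states."""
    errors = np.zeros(horizon)
    counts = np.zeros(horizon)
    for t in range(len(states) - horizon):
        s_hat = states[t]
        for h in range(horizon):
            s_hat = model.predict(s_hat, actions[t + h])        # imagined step
            errors[h] += np.linalg.norm(s_hat - states[t + h + 1])
            counts[h] += 1
    return errors / np.maximum(counts, 1)
```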

[Figure 5: prediction error as a function of the number of predicted steps, for all environments; extension of Figure 2.]

Appendix C Computational Costs

Our proposed algorithms aim to save computation by omitting trajectory recalculations. The complexity of a trajectory recalculation is $\mathcal{O}(H \times A \times K)$, where $H$ is the length of the horizon (we use H=20), $A$ is the dimensionality of the action ($A_{CP}$=1, $A_{RE}$=7, $A_{PU}$=7, $A_{HC}$=6) and $K$ is the number of trajectories generated at each recalculation. $K$ depends on the solver and environment; in our case $K_{CP}$=10000, $K_{HC}$=12500, $K_{PU}$=12500, $K_{RE}$=10000.

However, the algorithms for deciding whether to skip replanning introduce additional computations. For n-skip, the overhead is negligible (a constant-time counter check). Both FSA and CB have computational complexity $\mathcal{O}(S)$ (computing the error is $\mathcal{O}(S)$ and deciding whether to skip is constant time), where $S$ is the number of state dimensions and differs per environment ($S_{CP}$=4, $S_{RE}$=17, $S_{PU}$=20, $S_{HC}$=18). FUT and BICHO project one additional trajectory of length $t \leq H$ to decide whether to skip, where $t$ is a hyperparameter; the resulting computational complexity is $\mathcal{O}(t)$, and comparing the trajectories adds another $\mathcal{O}(t)$. Comparing these costs, it is clear that the overhead introduced by n-skip, FSA and CB is negligible relative to the cost of replanning. The cost of FUT and BICHO is higher but still small compared to the computational cost of replanning.
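A back-of-the-envelope helper for the comparison above (illustrative only): model evaluations scale with H x K per replanning event (the action dimension A cancels when comparing relative costs), while a FUT/BICHO check adds a single projected trajectory of length t per environment step. The function name and the example numbers below simply plug in the settings quoted in the experiments section for HalfCheetah (TaskH=1000, K=12500, H=30).

```python
def model_evals(task_h, replan_fraction, H, K, t_check=0):
    """Rough count of dynamics-model evaluations per episode:
    replanning cost plus the per-step cost of the trust check."""
    replans = task_h * replan_fraction
    return replans * H * K + task_h * t_check

full = model_evals(1000, 1.0, 30, 12500)               # replan at every step
bicho = model_evals(1000, 0.2, 30, 12500, t_check=30)  # 20% replanning + checks
print(f"relative cost with 20% replanning: {bicho / full:.2%}")
```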

Appendix D EX-1: Offline Reward and Replanning Rate

Additional information from 10 runs per environment for each considered hyper-parameter is provided for reference. Table 1 summarizes selected results. Tables 2, 3, 4 and 5 show detailed results for each hyper-parameter for the environments CP, HC, PU and RE, respectively. We report the average and STD of the reward per episode, the number of replanned steps and the number of sequential steps skipped, together with the error and its STD.

Table 1 (selected results). Columns: Method | RwMax | Rw | Rc | RcPer | i

CP
Baseline | 179.373 | 178.830 | 200.00 | 1.00 | 0.00
NSKIP1 | 179.610 | 178.201 | 100.00 | 0.50 | 0.99
NSKIP2 | 178.687 | 177.474 | 67.00 | 0.33 | 1.97
FSA0.50 | 179.923 | 178.997 | 130.50 | 0.65 | 1.25
FSA0.99 | 177.061 | 172.569 | 27.60 | 0.13 | 6.22
CB0.50 | 179.473 | 178.303 | 76.50 | 0.38 | 2.77
BICHO10 β32 | 179.061 | 175.951 | 21.70 | 0.10 | 7.92
BICHO20 β64 | 177.022 | 174.542 | 18.00 | 0.09 | 9.79
FUT01 β4.00 | 178.620 | 175.474 | 40.000 | 0.20 | 3.94

HC
Baseline | 16750.776 | 12764.668 | 1000.00 | 1.000 | 0.000
NSKIP1 | 13748.625 | 9247.623 | 500.00 | 0.500 | 0.998
NSKIP2 | 10266.311 | 6118.676 | 334.00 | 0.334 | 1.994
FSA0.50 | 15791.490 | 10881.401 | 586.00 | 0.586 | 1.613
BICHO05 β0.200 | 18341.042 | 11637.706 | 662.30 | 0.662 | 0.998
BICHO05 β6.000 | 8646.361 | 1872.543 | 218.50 | 0.218 | 7.004
FUT01 β0.100 | 15190.099 | 10127.721 | 595.10 | 0.595 | 1.021

PU
Baseline | -49.277 | -56.858 | 150.00 | 1.000 | 0.000
NSKIP1 | -68.296 | -79.493 | 75.00 | 0.500 | 0.990
NSKIP2 | -79.710 | -85.368 | 50.00 | 0.333 | 1.960
FSA0.50 | -51.347 | -78.970 | 116.70 | 0.773 | 1.199
CB1.00 | -49.149 | -76.829 | 128.70 | 0.853 | 1.242
BICHO10 β16.0 | -51.527 | -85.723 | 17.70 | 0.118 | 7.002
FUT01 β0.40 | -56.583 | -81.134 | 73.90 | 0.493 | 1.019

RE
Baseline | -45.121 | -45.930 | 150.00 | 1.000 | 0.000
NSKIP1 | -45.144 | -46.296 | 75.00 | 0.500 | 0.987
NSKIP2 | -46.076 | -47.167 | 50.00 | 0.333 | 1.960
FSA0.50 | -44.420 | -46.080 | 100.60 | 0.671 | 1.211
CB1.00 | -45.097 | -46.592 | 86.50 | 0.577 | 2.011
BICHO23 β0.125 | -44.609 | -45.972 | 115.10 | 0.767 | 0.972
BICHO23 β768 | -45.719 | -58.677 | 49.90 | 0.333 | 2.154
FUT01 β0.200 | -45.080 | -46.554 | 75.20 | 0.501 | 0.991
FUT01 β8.000 | -46.922 | -49.804 | 26.60 | 0.177 | 4.549
Table 2 (CP, detailed results per hyper-parameter). Columns: Method | Rw | Rw STD | Rc | Rc STD | RcPer | i mean | i STD | Rw Min | Rw Max

Baseline | 178.83 | 0.41 | 200.00 | 0.00 | 1.00 | 0.00 | 0.00 | 177.92 | 179.37
NSKIP1 | 178.20 | 0.84 | 100.00 | 0.00 | 0.50 | 0.99 | 0.00 | 176.62 | 179.61
NSKIP2 | 177.47 | 0.90 | 67.00 | 0.00 | 0.34 | 1.97 | 0.00 | 175.87 | 178.69
NSKIP3 | 175.91 | 2.72 | 50.00 | 0.00 | 0.25 | 2.94 | 0.00 | 168.24 | 177.43
NSKIP5 | 174.26 | 4.74 | 34.00 | 0.00 | 0.17 | 4.85 | 0.00 | 161.63 | 177.95
NSKIP6 | 170.22 | 6.35 | 29.00 | 0.00 | 0.14 | 5.79 | 0.00 | 162.63 | 176.79
NSKIP7 | 163.60 | 4.70 | 25.00 | 0.00 | 0.12 | 6.72 | 0.00 | 155.70 | 171.89
NSKIP8 | 158.94 | 5.97 | 23.00 | 0.00 | 0.12 | 7.65 | 0.00 | 143.23 | 163.29
NSKIP9 | 129.26 | 35.15 | 20.00 | 0.00 | 0.10 | 8.55 | 0.00 | 68.02 | 161.79
BICHO10 β0.10 | 178.76 | 0.75 | 177.60 | 5.25 | 0.89 | 0.95 | 0.03 | 177.12 | 179.69
BICHO10 β0.20 | 178.81 | 0.55 | 156.30 | 4.64 | 0.78 | 0.98 | 0.01 | 178.05 | 179.98
BICHO10 β0.40 | 178.29 | 0.46 | 148.70 | 8.71 | 0.74 | 1.01 | 0.21 | 177.53 | 178.84
BICHO10 β0.80 | 178.05 | 0.64 | 139.60 | 3.89 | 0.70 | 1.02 | 0.31 | 176.98 | 178.74
BICHO10 β32 | 177.71 | 0.89 | 83.90 | 15.16 | 0.42 | 1.70 | 0.84 | 176.48 | 179.06
BICHO10 β128 | 177.78 | 0.50 | 44.40 | 11.71 | 0.22 | 3.79 | 0.56 | 176.90 | 178.52
FSA0.15 | 178.93 | 0.53 | 173.60 | 4.81 | 0.87 | 1.02 | 0.07 | 178.03 | 179.74
FSA0.25 | 178.87 | 0.66 | 154.60 | 3.63 | 0.77 | 1.10 | 0.09 | 177.95 | 179.76
FSA0.35 | 178.87 | 0.37 | 141.80 | 4.19 | 0.71 | 1.18 | 0.08 | 177.98 | 179.43
FSA0.50 | 179.00 | 0.57 | 130.50 | 4.97 | 0.65 | 1.25 | 0.08 | 177.92 | 179.92
FSA0.99 | 172.57 | 5.24 | 27.60 | 2.59 | 0.14 | 6.23 | 0.40 | 161.68 | 177.06
CB0.50 | 178.30 | 1.14 | 76.50 | 6.36 | 0.38 | 2.77 | 0.50 | 175.34 | 179.47
CB0.90 | 120.29 | 45.19 | 27.50 | 7.50 | 0.14 | 8.78 | 0.62 | 31.59 | 170.37
CB1.00 | 114.16 | 34.08 | 25.80 | 5.94 | 0.13 | 9.27 | 0.62 | 44.18 | 151.96
CB1.75 | 74.17 | 26.98 | 15.80 | 4.59 | 0.08 | 14.43 | 0.48 | 40.05 | 117.84
BICHO20 β0.05 | 178.76 | 0.75 | 177.60 | 5.25 | 0.89 | 0.95 | 0.03 | 177.12 | 179.69
BICHO20 β0.10 | 178.81 | 0.55 | 156.30 | 4.64 | 0.78 | 0.98 | 0.01 | 178.05 | 179.98
BICHO20 β0.70 | 177.80 | 0.61 | 56.90 | 15.60 | 0.29 | 2.87 | 0.87 | 176.95 | 178.81
BICHO20 β1 | 177.60 | 0.73 | 40.20 | 7.71 | 0.20 | 4.07 | 0.60 | 176.56 | 178.57
BICHO20 β8 | 177.02 | 0.83 | 26.00 | 0.82 | 0.13 | 6.48 | 0.39 | 175.50 | 178.13
BICHO20 β64 | 172.28 | 7.54 | 18.30 | 1.83 | 0.09 | 9.70 | 0.58 | 153.22 | 177.02
BICHO20 β256 | 165.47 | 14.67 | 17.00 | 1.25 | 0.08 | 10.47 | 0.97 | 138.41 | 175.40
BICHO23 β0.05 | 178.69 | 0.41 | 178.00 | 3.62 | 0.89 | 0.96 | 0.02 | 178.00 | 179.30
BICHO23 β0.10 | 178.37 | 0.91 | 150.80 | 4.05 | 0.75 | 0.99 | 0.03 | 176.97 | 179.57
BICHO23 β0.70 | 177.88 | 0.40 | 48.80 | 17.96 | 0.24 | 3.47 | 0.90 | 177.34 | 178.56
BICHO23 β1 | 177.46 | 0.69 | 42.50 | 14.32 | 0.21 | 3.99 | 0.86 | 176.23 | 178.42
BICHO23 β8 | 175.01 | 7.34 | 25.30 | 1.64 | 0.13 | 6.67 | 0.46 | 154.18 | 177.88
BICHO23 β64 | 174.54 | 2.08 | 18.00 | 0.67 | 0.09 | 9.79 | 0.48 | 170.59 | 177.01
FUT01 β0.05 | 178.61 | 0.64 | 167.30 | 4.88 | 0.84 | 0.97 | 0.01 | 177.55 | 179.70
FUT01 β0.15 | 178.45 | 0.54 | 104.80 | 1.48 | 0.52 | 1.04 | 0.05 | 177.44 | 179.04
FUT01 β0.80 | 177.08 | 0.81 | 67.50 | 1.78 | 0.34 | 1.95 | 0.05 | 175.96 | 178.15
FUT01 β2.00 | 177.19 | 0.75 | 50.80 | 0.92 | 0.25 | 2.91 | 0.04 | 176.18 | 178.07
FUT01 β4 | 175.47 | 4.03 | 40.00 | 0.47 | 0.20 | 3.94 | 0.09 | 164.78 | 178.62
FUT01 β64.0 | 82.46 | 34.04 | 16.40 | 1.43 | 0.08 | 10.92 | 0.78 | 27.88 | 147.24
FUT01 β256 | 23.74 | 24.51 | 10.80 | 1.48 | 0.05 | 17.11 | 0.72 | 4.45 | 82.34
Table 3 (HC, detailed results per hyper-parameter). Columns: Method | Rw | Rw STD | Rc | Rc STD | RcPer | i mean | i STD | Rw Min | Rw Max

Baseline | 12764.668 | 2849.853 | 1000.000 | 0.000 | 1.000 | 0.000 | 0.000 | 7372 | 16750
NSKIP1 | 9247.623 | 2179.981 | 500.000 | 0.000 | 0.500 | 0.998 | 0.000 | 5299 | 13748
NSKIP2 | 6118.676 | 3011.780 | 334.000 | 0.000 | 0.334 | 1.994 | 0.000 | 1375 | 10266
NSKIP3 | 1443.533 | 357.099 | 250.000 | 0.000 | 0.250 | 2.988 | 0.000 | 1008 | 1944
NSKIP4 | 1048.150 | 147.981 | 200.000 | 0.000 | 0.200 | 3.980 | 0.000 | 859 | 1352
NSKIP5 | 750.226 | 36.873 | 167.000 | 0.000 | 0.167 | 4.970 | 0.000 | 704 | 823
NSKIP7 | 453.360 | 40.811 | 125.000 | 0.000 | 0.125 | 6.944 | 0.000 | 379 | 512
NSKIP9 | 261.672 | 146.953 | 100.000 | 0.000 | 0.100 | 8.910 | 0.000 | -11 | 379
FSA0.15 | 10637.761 | 3986.620 | 922.400 | 10.700 | 0.922 | 1.031 | 0.059 | 5718 | 16835
FSA0.25 | 13283.989 | 4397.851 | 854.300 | 23.636 | 0.854 | 1.099 | 0.061 | 5099 | 18182
FSA0.35 | 9963.964 | 2448.046 | 736.400 | 20.919 | 0.736 | 1.244 | 0.049 | 5190 | 13450
FSA0.50 | 10881.401 | 2565.614 | 586.000 | 22.691 | 0.586 | 1.613 | 0.078 | 6997 | 15791
FSA0.90 | 173.839 | 143.729 | 50.500 | 11.336 | 0.051 | 19.838 | 1.536 | 53 | 499
CB0.50 | 13269.067 | 3347.259 | 994.400 | 1.174 | 0.994 | 0.844 | 0.026 | 6550 | 17071
CB0.90 | 11010.053 | 3129.994 | 657.900 | 8.900 | 0.658 | 1.612 | 0.215 | 4781 | 15326
CB1.00 | 7262.139 | 3246.655 | 523.500 | 31.366 | 0.524 | 2.031 | 0.175 | 3040 | 12650
CB1.75 | 408.850 | 412.640 | 85.100 | 37.245 | 0.085 | 13.553 | 1.866 | 105 | 1538
BICHO05 β0.050 | 12685.246 | 3163.449 | 940.100 | 3.414 | 0.940 | 0.984 | 0.004 | 5015 | 16299
BICHO05 β0.100 | 11696.336 | 2267.720 | 808.400 | 9.812 | 0.808 | 0.995 | 0.002 | 9150 | 16227
BICHO05 β0.200 | 11637.706 | 3778.409 | 662.300 | 8.538 | 0.662 | 0.998 | 0.010 | 6274 | 18341
BICHO05 β0.800 | 8218.947 | 2584.988 | 535.400 | 4.477 | 0.535 | 1.031 | 0.122 | 4207 | 11222
BICHO05 β4.000 | 5302.087 | 3623.648 | 404.700 | 40.604 | 0.405 | 1.517 | 1.104 | 1293 | 10756
BICHO05 β6.000 | 1872.543 | 2639.615 | 218.500 | 130.996 | 0.218 | 7.004 | 3.940 | 5 | 8646
BICHO05 β8.000 | 587.539 | 865.745 | 79.200 | 42.856 | 0.079 | 14.786 | 1.851 | 11 | 2800
FUT01 β0.025 | 12936.022 | 2014.828 | 999.500 | 0.707 | 1.000 | 0.217 | 0.255 | 10037 | 15763
FUT01 β0.050 | 13436.725 | 3264.052 | 964.100 | 9.158 | 0.964 | 0.971 | 0.025 | 9192 | 18000
FUT01 β0.100 | 10127.721 | 2911.831 | 595.100 | 11.396 | 0.595 | 1.021 | 0.022 | 6630 | 15190
FUT01 β0.125 | 8075.127 | 3342.769 | 522.200 | 5.922 | 0.522 | 1.048 | 0.013 | 1733 | 12706
FUT01 β0.150 | 7652.031 | 2645.142 | 492.400 | 7.545 | 0.492 | 1.086 | 0.028 | 3945 | 12400
FUT01 β0.400 | 6498.406 | 2534.774 | 388.800 | 4.638 | 0.389 | 1.570 | 0.025 | 1645 | 9707
FUT01 β2.000 | 1973.394 | 1416.910 | 237.500 | 7.367 | 0.237 | 3.208 | 0.083 | 878 | 4863
Table 4 (PU, detailed results per hyper-parameter). Columns: Method | Rw | Rw STD | Rc | Rc STD | RcPer | i mean | i STD | Rw Min | Rw Max

Baseline | -56.858 | 10.768 | 150.000 | 0.000 | 1.000 | 0.000 | 0.000 | -75.277 | -49.277
NSKIP1 | -79.493 | 9.199 | 75.000 | 0.000 | 0.500 | 0.987 | 0.000 | -96.068 | -68.296
NSKIP2 | -85.368 | 3.235 | 50.000 | 0.000 | 0.333 | 1.960 | 0.000 | -89.991 | -79.710
NSKIP3 | -87.983 | 3.633 | 38.000 | 0.000 | 0.253 | 2.921 | 0.000 | -96.177 | -83.838
NSKIP4 | -86.773 | 3.526 | 30.000 | 0.000 | 0.200 | 3.867 | 0.000 | -94.565 | -83.384
NSKIP5 | -89.847 | 5.745 | 25.000 | 0.000 | 0.167 | 4.800 | 0.000 | -96.857 | -83.023
NSKIP6 | -89.504 | 3.974 | 22.000 | 0.000 | 0.147 | 5.727 | 0.000 | -94.531 | -84.423
NSKIP7 | -93.925 | 3.331 | 19.000 | 0.000 | 0.127 | 6.632 | 0.000 | -99.409 | -87.872
NSKIP8 | -97.002 | 2.200 | 17.000 | 0.000 | 0.113 | 7.529 | 0.000 | -100.367 | -93.706
NSKIP9 | -97.425 | 3.086 | 15.000 | 0.000 | 0.100 | 8.400 | 0.000 | -101.036 | -91.792
FSA0.15 | -71.232 | 14.667 | 142.700 | 3.433 | 0.951 | 0.852 | 0.066 | -86.917 | -50.196
FSA0.25 | -71.606 | 12.279 | 138.900 | 4.748 | 0.920 | 0.916 | 0.065 | -86.514 | -50.309
FSA0.35 | -68.623 | 15.260 | 131.700 | 10.371 | 0.873 | 0.990 | 0.057 | -89.602 | -50.198
FSA0.50 | -78.970 | 16.020 | 116.700 | 12.859 | 0.773 | 1.199 | 0.096 | -100.741 | -51.347
FSA0.90 | -106.759 | 1.643 | 7.100 | 0.316 | 0.047 | 18.775 | 0.164 | -109.108 | -104.638
CB0.50 | -59.227 | 12.197 | 149.800 | 0.422 | 0.993 | 0.100 | 0.211 | -87.747 | -50.568
CB0.90 | -67.436 | 15.815 | 133.400 | 6.150 | 0.887 | 1.142 | 0.130 | -87.785 | -49.913
CB1.00 | -76.829 | 16.610 | 128.700 | 8.138 | 0.853 | 1.242 | 0.212 | -99.612 | -49.149
CB1.75 | -90.519 | 6.520 | 65.600 | 8.540 | 0.433 | 3.049 | 0.625 | -103.549 | -83.159
BICHO10 β0.05 | -59.871 | 9.853 | 139.100 | 4.332 | 0.927 | 0.899 | 0.061 | -73.980 | -49.726
BICHO10 β0.35 | -72.783 | 15.044 | 81.700 | 3.889 | 0.545 | 1.079 | 0.467 | -90.490 | -50.358
BICHO10 β0.50 | -75.205 | 10.421 | 75.900 | 8.517 | 0.506 | 1.192 | 0.894 | -89.100 | -53.112
BICHO10 β2.00 | -75.101 | 14.264 | 39.400 | 9.663 | 0.263 | 2.838 | 1.563 | -90.866 | -54.450
BICHO10 β3.00 | -76.374 | 13.973 | 30.800 | 1.317 | 0.205 | 3.471 | 0.422 | -91.605 | -50.523
BICHO10 β8.00 | -78.874 | 13.931 | 23.600 | 2.547 | 0.157 | 5.076 | 0.582 | -93.285 | -54.045
BICHO10 β16.0 | -85.723 | 13.360 | 17.700 | 1.636 | 0.118 | 7.002 | 0.749 | -100.344 | -51.527
BICHO10 β32.0 | -89.359 | 11.471 | 13.200 | 2.394 | 0.088 | 9.805 | 0.474 | -108.031 | -66.529
FUT01 β0.25 | -56.858 | 10.768 | 150.000 | 0.000 | 1.000 | 0.000 | 0.000 | -75.277 | -49.277
FUT01 β0.05 | -54.335 | 6.991 | 148.300 | 1.494 | 0.989 | 0.508 | 0.200 | -68.104 | -48.443
FUT01 β0.40 | -81.134 | 10.063 | 73.900 | 0.738 | 0.493 | 1.019 | 0.066 | -93.088 | -56.583
FUT01 β0.80 | -86.352 | 3.430 | 58.400 | 2.459 | 0.389 | 1.554 | 0.046 | -92.642 | -81.845
FUT01 β1.00 | -84.547 | 3.920 | 55.200 | 1.476 | 0.368 | 1.703 | 0.048 | -90.966 | -77.338
FUT01 β2.00 | -85.207 | 1.824 | 46.400 | 2.119 | 0.309 | 2.211 | 0.074 | -88.006 | -81.926
FUT01 β4.00 | -88.125 | 4.487 | 37.500 | 1.434 | 0.250 | 2.965 | 0.079 | -96.489 | -82.626
FUT01 β16.0 | -90.654 | 5.113 | 22.800 | 0.919 | 0.152 | 5.484 | 0.116 | -97.510 | -83.038
FUT01 β32.0 | -97.671 | 2.087 | 17.600 | 0.516 | 0.117 | 7.282 | 0.178 | -101.307 | -94.572
FUT01 β64.0 | -99.894 | 4.789 | 14.000 | 0.667 | 0.093 | 9.455 | 0.276 | -104.336 | -91.531
FUT01 β64.0 | -105.005 | 1.319 | 8.200 | 0.422 | 0.055 | 16.189 | 0.301 | -106.247 | -102.273
Table 5 (RE, detailed results per hyper-parameter). Columns: Method | Rw | Rw STD | Rc | Rc STD | RcPer | i mean | i STD | Rw Min | Rw Max

Baseline | -45.930 | 0.606 | 150.000 | 0.000 | 1.000 | 0.000 | 0.000 | -46.689 | -45.121
NSKIP1 | -46.296 | 0.934 | 75.000 | 0.000 | 0.500 | 0.987 | 0.000 | -47.843 | -45.144
NSKIP2 | -47.167 | 0.815 | 50.000 | 0.000 | 0.333 | 1.960 | 0.000 | -48.515 | -46.076
NSKIP3 | -48.553 | 1.075 | 38.000 | 0.000 | 0.253 | 2.921 | 0.000 | -49.748 | -46.352
NSKIP4 | -49.855 | 0.953 | 30.000 | 0.000 | 0.200 | 3.867 | 0.000 | -51.185 | -47.895
NSKIP5 | -50.072 | 1.763 | 25.000 | 0.000 | 0.167 | 4.800 | 0.000 | -53.788 | -47.317
NSKIP7 | -51.640 | 2.480 | 19.000 | 0.000 | 0.127 | 6.632 | 0.000 | -54.880 | -47.674
NSKIP9 | -54.798 | 1.955 | 15.000 | 0.000 | 0.100 | 8.400 | 0.000 | -57.343 | -51.526
FSA0.15 | -45.591 | 1.155 | 134.800 | 2.860 | 0.899 | 0.958 | 0.058 | -48.215 | -44.236
FSA0.25 | -45.639 | 0.512 | 124.200 | 4.984 | 0.828 | 1.006 | 0.071 | -46.553 | -45.137
FSA0.35 | -45.839 | 1.132 | 111.400 | 2.875 | 0.743 | 1.094 | 0.109 | -47.770 | -44.163
FSA0.50 | -46.080 | 1.044 | 100.600 | 2.319 | 0.671 | 1.211 | 0.117 | -47.473 | -44.420
FSA0.90 | -57.192 | 2.688 | 8.200 | 0.422 | 0.055 | 16.254 | 0.583 | -63.307 | -53.410
CB0.50 | -46.355 | 0.555 | 148.700 | 1.337 | 0.991 | 0.422 | 0.230 | -47.001 | -45.323
CB0.90 | -45.709 | 1.114 | 107.900 | 3.635 | 0.719 | 1.465 | 0.449 | -47.032 | -43.904
CB1.00 | -46.592 | 1.073 | 86.500 | 8.303 | 0.577 | 2.011 | 0.582 | -48.847 | -45.097
CB1.75 | -51.375 | 1.911 | 22.000 | 3.859 | 0.147 | 8.066 | 1.230 | -55.623 | -49.221
BICHO23 β0.100 | -45.586 | 1.053 | 122.800 | 4.517 | 0.819 | 0.964 | 0.013 | -47.478 | -44.001
BICHO23 β0.125 | -45.972 | 0.939 | 115.100 | 4.725 | 0.767 | 0.972 | 0.011 | -47.803 | -44.609
BICHO23 β256 | -47.043 | 0.898 | 72.100 | 1.912 | 0.481 | 1.072 | 0.360 | -48.583 | -45.539
BICHO23 β576 | -53.622 | 4.640 | 56.300 | 8.795 | 0.375 | 1.718 | 1.546 | -62.853 | -46.846
BICHO23 β768 | -58.677 | 7.794 | 49.900 | 12.142 | 0.333 | 2.154 | 1.811 | -68.808 | -45.719
BICHO23 β1024 | -58.765 | 7.331 | 46.900 | 8.034 | 0.313 | 2.272 | 1.153 | -74.104 | -50.834
FUT01 β0.025 | -46.003 | 1.062 | 150.000 | 0.000 | 1.000 | 0.000 | 0.000 | -47.128 | -43.905
FUT01 β0.050 | -45.394 | 0.577 | 144.300 | 2.541 | 0.962 | 0.822 | 0.061 | -46.137 | -44.614
FUT01 β0.200 | -46.554 | 1.034 | 75.200 | 0.422 | 0.501 | 0.991 | 0.024 | -48.400 | -45.080
FUT01 β1.000 | -47.508 | 1.035 | 52.300 | 1.418 | 0.349 | 1.851 | 0.058 | -48.661 | -45.099
FUT01 β8.000 | -49.804 | 1.872 | 26.600 | 1.075 | 0.177 | 4.549 | 0.131 | -53.130 | -46.922
FUT01 β32.00 | -53.242 | 1.743 | 17.400 | 0.699 | 0.116 | 7.325 | 0.346 | -55.674 | -50.721
FUT01 β64.00 | -54.556 | 1.604 | 13.700 | 0.823 | 0.091 | 9.669 | 0.420 | -57.550 | -52.503
FUT01 β256.0 | -56.823 | 3.401 | 9.400 | 0.516 | 0.063 | 14.499 | 0.485 | -63.382 | -52.324

Appendix E EX-2: Online dynamics update

Figures 6, 7 and 8 show the resulting performance of different hyper-parameters while training the dynamics model with skipping in CP, PU, RE and HC. Table 6 shows numerical results for each environment. CP was trained for 60 episodes, RE and PU for 150, and HC for 400. Rw is the average over 3 experiments of the maximum reward seen so far, RcPer@Max is the percentage of replanning steps when the algorithm reached the maximum Rw, EpNr@Max is the number of episodes needed to reach Rw, and RelEp#@Max is the relative wall time compared to the baseline when the algorithm reached the maximum Rw.

[Figures 6-8: performance of the different hyper-parameters while training the dynamics model with skipping in CP, PU, RE and HC; twelve panels.]

Table 6 (online dynamics update). Columns: Method | Rw | RcPer@Max | EpNr@Max | RelEp#@Max

CP
Baseline | 181.6799 | 1.0000 | 57 | 58.00
NSKIP3 | 180.7760 | 0.3350 | 59 | 20.10
NSKIP4 | 180.6540 | 0.2500 | 58 | 14.75
BICHO10 β64 | 180.8982 | 0.1475 | 58 | 6.72
FUT01 β4 | 180.1486 | 0.2567 | 47 | 10.53

PU
Baseline | -47.9417 | 1.0000 | 142 | 143.00
NSKIP2 | -52.0129 | 0.5000 | 143 | 72.00
NSKIP3 | -57.3462 | 0.3333 | 147 | 49.33
BICHO10 β2 | -49.0292 | 0.1733 | 136 | 24.61
FUT01 β2 | -52.8447 | 0.4956 | 135 | 67.31

RE
Baseline | -33.5717 | 1.0000 | 70 | 71.00
NSKIP2 | -34.4023 | 0.5000 | 82 | 41.50
NSKIP3 | -34.2190 | 0.3333 | 72 | 24.33
BICHO10 β512 | -37.7357 | 0.4822 | 95 | 28.08
FUT01 β16 | -35.7938 | 0.1333 | 91 | 12.64

HC
Baseline | 22491.9876 | 1.0000 | 372 | 373.00
NSKIP2 | 6266.1845 | 0.5000 | 77 | 39.00
BICHO05 β=0.05 | 18323.8463 | 0.8305 | 384 | 310.31
BICHO10 β=0.05 | 20787.0499 | 0.8640 | 294 | 254.11
FUT01 β=0.1 | 12605.5204 | 0.5830 | 259 | 153.11