Adrian Remonda, Eduardo Veas, Granit Luzhnica
Abstract
Model-based reinforcement learning (MBRL) aims to learn model(s) of the environment dynamics that can predict the outcome of the agent's actions. Forward application of the model yields so-called imagined trajectories (sequences of action and predicted state-reward) used to optimize the set of candidate actions that maximize expected reward. The outcome, an ideal imagined trajectory or plan, is imperfect, and MBRL typically relies on model predictive control (MPC) to overcome this by continuously re-planning from scratch, thus incurring major computational cost and increasing complexity in tasks with a longer receding horizon.
We propose uncertainty estimation methods for online evaluation of imagined trajectories to assess whether further planned actions can be trusted to deliver acceptable reward. These methods include comparing the error after performing the last action with the standard expected error, and using model uncertainty to assess the deviation from expected outcomes. Additionally, we introduce methods that exploit the forward propagation of the dynamics model to evaluate whether the remainder of the plan aligns with expected results and to assess the remainder of the plan in terms of the expected reward. Our experiments demonstrate the effectiveness of the proposed uncertainty estimation methods by applying them to avoid unnecessary trajectory replanning in a shooting MBRL setting. Results highlight a significant reduction in computational cost without sacrificing performance.
keywords:
Deep Reinforcement Learning, Model-Based Reinforcement Learning, Model-Predictive Control, Robotics, Random Shooting Methods, Planning
Know-Center, Graz, Austria
Graz University of Technology, Graz, Austria
1 Introduction
Reinforcement learning can be successfully applied to continuous control of complex and highly non-linear systems. Algorithms for reinforcement learning can be categorized as model-free (MFRL) or model-based (MBRL). Using deep learning in MFRL has achieved success in learning complex policies from raw input, such as solving problems with high-dimensional state spaces (Sutton and Barto, 1998; Mnih et al., 2015) and continuous action space optimization problems with algorithms like deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), Proximal Policy Optimization (PPO) (Schulman et al., 2017) or Soft Actor-Critic (SAC) (Haarnoja et al., 2018). While significant progress has been made, the high sample complexity of MFRL limits its real-world applicability. Collecting the millions of transitions needed to converge is not always feasible in real-world problems, as excessive data collection is expensive, tedious, or can lead to physical damage (Williams et al., 2017; Chua et al., 2018). Model-based methods, in contrast, are comparably sample efficient. MBRL techniques build a predictive model of the world to imagine trajectories of future states and plan the best actions to maximize a reward.
MBRL uses a dynamics model to predict the outcome of taking an action in a given state of the environment. It can bootstrap from existing experiences and is versatile to changing objectives on the fly. Nevertheless, its performance degrades with growing model error. In the general case of nonlinear dynamics, guarantees of local optimality do not hold (Janner et al., 2019). Sampling-based model predictive control (MPC) algorithms can be used to address this issue. A sampling-based MPC generates a large number of trajectories with the goal of maximizing the expected accumulated reward. But complex environments are typically partially observable, and the problem is formulated as a receding horizon planning task. Hence, after executing a single step, trajectories are generated again from scratch to deal with the receding horizon and to reduce the impact of the model's compounding error (Rao, 2010). An additional challenge lies in the high cost incurred by frequently generating imagined trajectories from scratch during planning. For each trajectory generated, a sequence of actions has to be evaluated with the dynamics model. Each successive step in a trajectory depends on previous states, so the evaluation of a single trajectory is a recurrent process that cannot be parallelized. Acting upon imagination, the method proposed here, seeks to estimate online the uncertainty of the imagined trajectory and to reduce the planning cost by continuing to act upon it unless it can no longer be trusted.
Therefore, in this work, we present methods for uncertainty estimation designed to address and improve the computational limitations of shooting MPC MBRL methods. Our main objective is the online estimation of the uncertainty of the model plan. A second objective is the application of uncertainty estimation to avoid frequent replanning. Here, we propose using the degree to which computations are reduced as a practical and quantifiable proxy for the effectiveness of uncertainty estimation methods. If our methods are successful, we expect to observe a significant decrease in computations without a substantial decay in performance. The balance between computational efficiency and performance demonstrates the reliability of our uncertainty estimation methods. A robust estimation of uncertainty facilitates efficient decision-making and optimizes the use of computational resources. Our contributions are as follows:
- 1.
We provide a thorough analysis and discussion on quantifying the error of predicted trajectories.
- 2.
We propose methods for uncertainty estimation in MBRL:
- (a)
Methods that observe the outcome of the last action in the trajectory: i) comparing the error after performing the last action with the standard expected error, ii) assessing the deviation with respect to expected outcomes using model uncertainty.
- (b)
Methods that exploit the forward propagation of the dynamics model to iii) evaluate if the remainder of the plan aligns with expected results, iv) assess the remainder of the plan in terms of the expected reward.
- 3.
We demonstrate how our proposed uncertainty estimation methods can be used to bypass the need for replanning in sampling-based MBRL methods.
Our experimental results on challenging benchmark control tasks demonstrate that the proposed methods effectively leverage accurate predictions as well as dynamically decide when to replan. This approach leads to substantial reductions in training time and promotes more efficient use of computational resources by eliminating unnecessary replanning steps.
2 Related work
Model-based reinforcement learning (MBRL) has been applied in various real-world control tasks, such as robotics. Compared to model-free approaches, MBRL tends to be more sample-efficient (Deisenroth et al., 2013). MBRL can be grouped into four main categories (Zhu et al., 2020):
1) Dyna-style algorithms optimize policies using samples from a learned world model (Sutton, 1990).
2) Model-augmented value expansion methods, such as MVE (Oh et al., 2017), use model-based rollouts to enhance targets for model-free Temporal Difference updates.
3) Analytic-gradient methods can be used when a differentiable world model is available; they adjust the policy through gradients that flow through the model. Compared to traditional planning algorithms that create numerous rollouts to choose the optimal action sequence, analytic-gradient methods are more computationally efficient. Stochastic Value Gradients (SVG) (Heess et al., 2015) provide a way to calculate analytic value gradients using a generic differentiable world model. Dreamer (Hafner et al., 2020), a milestone in analytic-gradient model-based RL, demonstrates superior performance in visual control tasks. Dreamer expands upon SVG by facilitating the generation of imaginary rollouts within the latent space.
4) Model Predictive Control (MPC) and sampling-based shooting methods employ planning to select actions. They are notably effective for addressing real-world scenarios, since excessive data collection is not only costly and tedious but can also result in physical damage. Additionally, sampling-based MPC methods have the capacity to bootstrap from existing experiences and rapidly adapt to changing objectives on the fly. However, a significant drawback of these approaches is their computationally intensive nature (Rao, 2010; Chua et al., 2018). The present work belongs to the latter category.
Recently, it was demonstrated that parametric function approximators, i.e., neural networks (NNs), efficiently reduce sample complexity in problems with high-dimensional non-linear dynamics (Nagabandi et al., 2018). Random shooting methods artificially generate a large number of actions (Rao, 2010), and MPC is used to select candidate actions (Camacho et al., 2004). For example, Williams et al. (2017) and Drews et al. (2017) introduced a sampling-based MPC with a dynamics model to sample a large number of trajectories in parallel. A two-layer NN trained from maneuvers performed by a human pilot was superior to a physics model built using vehicle dynamics equations from bicycle models. One disadvantage is that plain NNs cannot quantify predictive uncertainty.
Lakshminarayanan et al. (2016) utilized ensembles of probabilistic NNs to determine predictive uncertainty. Kalweit and Boedecker (2017) used a notion of uncertainty within a model-free RL (MFRL) agent to switch to executing imagined trajectories from a dynamics model when the MFRL agent has high uncertainty. Conversely, Buckman et al. (2018) used imagined trajectories to improve the sample complexity of an MFRL agent. They improved the Q function by using ensembles to estimate uncertainty and prioritize trajectories thereupon.
Measuring the reliability of a learned dynamics model when generating imagined trajectories has been proposed in several works. Chua et al. (2018) identified two types of uncertainty: aleatoric (inherent to the process) and epistemic (resulting from datasets with too few data points). They combined uncertainty-aware probabilistic ensembles in the trajectory sampling of the MPC with a cross-entropy controller and achieved asymptotic performance comparable to Proximal Policy Optimization (PPO) (Schulman et al., 2017) or Soft Actor-Critic (SAC) (Haarnoja et al., 2018), with more sample-efficient convergence. Janner et al. (2019) generated (truncated) short trajectories with a probabilistic ensemble to train the policy of an MFRL agent, thus significantly improving its sampling efficiency. Yu et al. (2020) also exploit the uncertainty of the dynamics model to improve policy learning in an offline RL setting. They learn policies entirely from a large batch of previously collected data, with rewards artificially penalized by the uncertainty of the dynamics. While these works focus on sample efficiency and improving performance, our work proposes novel methods to estimate the uncertainty of the dynamics model in order to determine when to replan.
The authors in (Hafez et al., 2020) propose an analytic-gradient-based method that considers the reliability of the learned dynamics model used for imagining the future. They evaluate their approach in the context of enhancing vision-based robotic grasping, aiming to improve sample efficiency in sparse reward environments. In contrast to their method, ours does not require numerous local dynamics models or a self-organizing map. Instead, we introduce a technique that exploits the uncertainty of the dynamics model to estimate the uncertainty of the plan during execution, primarily aimed at minimizing replanning within an MPC framework. Close to our work, Zhu et al. (2020) studied the discrepancy between imagined and real trajectories. Their method allows for policy generalization to real-world interactions by optimizing the mutual information between imagined and real trajectories, while simultaneously refining the policy based on the imagined trajectories. However, their focus is on analytic-gradient MBRL only; our method can be applied to any MBRL approach that yields a notion of uncertainty, and we focus on shooting methods, which are still the first choice in domains like self-driving cars (Williams et al., 2017).
Hansen et al. (2022) obtained state-of-the-art performance in terms of reward and training time on diverse continuous control tasks by significantly improving model-augmented value expansion methods. Their approach effectively combines the strengths of both MFRL and MBRL. They adopt a learned task-oriented latent dynamics model for localized trajectory optimization over a short horizon. Furthermore, they utilize a learned terminal value function to estimate long-term returns. However, their method still necessitates learning the value function, which, depending on the context, could present challenges when compared to shooting methods.
Nevertheless, shooting MPC methods still suffer from expensive computation (Chua et al., 2018; Zhu et al., 2020). Thus, our research seeks to reduce the amount of computation by continuing to act upon trajectories that seem trustworthy. Our solution builds upon the results of Chua et al. (2018), using probabilistic ensembles and cross-entropy optimization in the MPC.
3 Preliminaries
RL aims to learn a policy that maximizes the accumulated reward obtained from the environment. At each time $t$, the agent is at a state $s_t$, executes an action $a_t$ and receives from the environment a reward $r_t$ and a state $s_{t+1}$ according to some unknown dynamics function $f: s_t, a_t \rightarrow s_{t+1}$. The goal is then to maximize the sum of discounted rewards $\sum_{t} \gamma^t r(s_t, a_t)$, where $\gamma \in (0, 1]$. MBRL uses a discrete-time dynamics model $\hat{f}$ to predict the future state $\hat{s}_{t+1}$ after executing action $a_t$ at state $s_t$. To reach a state into the future, the dynamics model evaluates sequences of actions $a_{t:t+H}$ over a longer horizon $H$, to maximize their discounted reward $\sum_{k=t}^{t+H} \gamma^{k-t} r(s_k, a_k)$. Due to partial observability of the environment and the error of the dynamics model $\hat{f}$ in predicting the real physics $f$, the controller typically executes only one action of the trajectory and the optimization is solved again with the updated state $s_{t+1}$. Algorithm 1 outlines the general steps. When training from scratch, the dynamics model is learned with data, $D$, collected on the fly. With $\hat{f}$, the simulator starts and the controller is called to plan the best trajectory, resulting in $a_{t:t+H}$. Only the first action of the trajectory is executed in the environment and the rest is discarded. The data collected from the environment is added to $D$ and $\hat{f}$ is trained further. MBRL requires a strategy to generate an action $a_t$, given a state $s_t$, a discrete-time dynamics model $\hat{f}$ to predict the state $\hat{s}_{t+1}$, and a reward function $r$.
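The loop just described can be sketched as follows. This is a minimal illustration with a toy one-dimensional system; `env_step`, `dynamics`, and the random-shooting `plan` are hypothetical stand-ins for the real simulator, the learned model, and the planner, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def env_step(s, a):
    """Toy 1-D environment: the (unknown) true dynamics and reward."""
    s_next = 0.9 * s + a
    return s_next, -abs(s_next)        # reward: stay near the origin

def dynamics(s, a):
    """Learned model: a slightly biased stand-in for the true dynamics."""
    return 0.85 * s + a

def plan(s, horizon=5, n_traj=64):
    """Random-shooting planner: sample action sequences, roll them out with
    the dynamics model, return the sequence with the best imagined return."""
    best_seq, best_ret = None, -np.inf
    for _ in range(n_traj):
        seq = rng.uniform(-1, 1, horizon)
        sim_s, ret = s, 0.0
        for a in seq:                  # sequential rollout: no parallelism in depth
            sim_s = dynamics(sim_s, a)
            ret += -abs(sim_s)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq

# MPC loop: execute only the first action of each plan, then replan from scratch
s, total_reward = 5.0, 0.0
for _ in range(20):
    a = plan(s)[0]
    s, r = env_step(s, a)
    total_reward += r
```

Note how every environment step pays the full cost of a fresh plan even when the previous plan was still accurate; this is the overhead the proposed methods aim to avoid.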
Probabilistic Dynamics Model.
We model the probability distribution of the next state $s_{t+1}$ given the current state $s_t$ and an action $a_t$ using a neural network based regression model similar to Lakshminarayanan et al. (2016). The last layer of the model outputs the parameters of a Gaussian distribution modeling the aleatoric uncertainty (due to the randomness of the environment). Its parameters are learned together with the parameters of the neural network. To model the epistemic uncertainty (of the dynamics model, due to generalization errors), we use ensembles with bagging, where all members of the ensemble are identical except for their initial weight values. Each ensemble element takes as input the current state $s_t$ and action $a_t$, and is trained to predict the difference between $s_{t+1}$ and $s_t$, instead of directly predicting the next step. Thus, the learning objective for the dynamics model becomes $\hat{f}(s_t, a_t) \approx s_{t+1} - s_t$. $\hat{f}$ outputs the probability distribution of the future state, from which we can sample the future step $\hat{s}_{t+1}$ and its confidence $\hat{\sigma}_{t+1}$, where $\hat{\sigma}$ captures both epistemic and aleatoric uncertainty.
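A minimal sketch of such a probabilistic ensemble, with linear-Gaussian members standing in for the neural networks and a fixed aleatoric variance for simplicity (the class names and the toy dynamics are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

class GaussianMember:
    """One ensemble member: predicts the mean of the state *delta* plus a
    Gaussian (aleatoric) variance. A linear model stands in for the paper's
    neural network; the variance is kept fixed for simplicity."""
    def __init__(self):
        self.w = rng.normal(0.0, 0.1, 2)       # random init -> ensemble diversity
        self.log_var = np.log(0.05)

    def fit(self, X, y, lr=0.05, epochs=500):
        idx = rng.integers(0, len(X), len(X))  # bagging: bootstrap resample
        Xb, yb = X[idx], y[idx]
        for _ in range(epochs):                # plain gradient descent on MSE
            err = Xb @ self.w - yb
            self.w -= lr * Xb.T @ err / len(Xb)

    def predict(self, s, a):
        mu_delta = np.array([s, a]) @ self.w   # predict s' - s, not s'
        return s + mu_delta, np.exp(self.log_var)

def ensemble_predict(members, s, a):
    """Mean prediction plus total variance = aleatoric + epistemic."""
    mus, alea = zip(*(m.predict(s, a) for m in members))
    mus = np.array(mus)
    return float(mus.mean()), float(np.mean(alea) + np.var(mus))

# train on transitions from a toy environment s' = 0.9 s + a
S = rng.uniform(-1, 1, 500)
A = rng.uniform(-1, 1, 500)
X, y = np.stack([S, A], axis=1), (0.9 * S + A) - S   # delta targets
members = [GaussianMember() for _ in range(5)]
for m in members:
    m.fit(X, y)
mu, var = ensemble_predict(members, 0.5, 0.2)        # true next state: 0.65
```

The member disagreement (`np.var(mus)`) plays the role of epistemic uncertainty, while each member's own variance is the aleatoric part.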
Trajectory Generation.
Each ensemble element outputs the parameters of a normal distribution. To generate trajectories, $P$ particles are created from the current state, $s_t$, which are then propagated by $\hat{s}_{t+1}^p = \hat{f}_b(\hat{s}_t^p, a_t)$, using a particular bootstrap element $b$. There are many options for how to propagate the particles through the ensemble, as analyzed in detail in Chua et al. (2018). They obtained the best results using the TS$\infty$ method, in which particles never change their initial bootstrap element. Doing so results in having both uncertainties separated at the end of the trajectory. Specifically, the aleatoric state variance is the average variance of particles of the same bootstrap, whilst the epistemic state variance is the variance of the average of particles of the same bootstrap indexes. Our approach also uses the TS$\infty$ method.
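The propagation scheme and the variance decomposition can be illustrated on a toy scalar system, where each bootstrap member is a slightly different linear model (all coefficients and names here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

B, P, H = 5, 4, 10           # bootstraps, particles per bootstrap, horizon
coef = 0.9 + 0.02 * rng.standard_normal(B)   # per-member model: epistemic spread
noise_std = 0.05                              # aleatoric noise of each member

def propagate(s0, actions):
    """TS-infinity-style propagation: every particle keeps the same bootstrap
    member for the whole trajectory."""
    parts = np.full((B, P), s0, dtype=float)
    for a in actions:
        mean = coef[:, None] * parts + a     # each row uses its own member
        parts = mean + noise_std * rng.standard_normal((B, P))
    return parts

parts = propagate(1.0, np.zeros(H))
# aleatoric: average over bootstraps of the within-bootstrap particle variance
aleatoric = float(np.mean(np.var(parts, axis=1)))
# epistemic: variance over bootstraps of the per-bootstrap particle means
epistemic = float(np.var(np.mean(parts, axis=1)))
```

Because particles never switch members, within-row spread reflects only process noise, while across-row spread reflects model disagreement.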
Planning.
To select a convenient course of action, MBRL generates a large number of trajectories and evaluates them in terms of reward. To find the actions that maximize reward, we use the cross-entropy method (CEM) (Botev et al., 2013), an algorithm for solving optimization problems based on cross-entropy minimization. CEM gradually changes the sampling distribution of the random search so that rare events (here, high-reward action sequences) become more likely. Thus, this method estimates a sequence of sampling distributions that converges to a distribution with probability mass concentrated in a region of near-optimal solutions. Algorithm 2 describes the use of CEM to compute the optimal sequence of actions $a_{t:t+H}^*$. Since the controller uses a single action of a trajectory, the computational complexity is constant at each step, given by the depth of the task horizon ($H$) and the number of trajectories ($K$), or breadth. It is possible to parallelize in breadth, but the evaluation of an action $a_t$ at state $s_t$ with the dynamics model $\hat{f}$ is iterative, requiring knowledge of at least one past state, and cannot be parallelized in depth. This leads to complexity $O(H \times A \times K)$, where $A$ refers to the dimension of the actions (how many controllable aspects). $H$ and $A$ depend on the environment.
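As a sketch, CEM for action-sequence optimization might look like the following, using a toy one-dimensional model and reward (function names and constants are illustrative, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(3)

def dynamics(s, a):
    return 0.9 * s + a                 # stand-in for the learned model

def rollout_return(s0, seq):
    s, ret = s0, 0.0
    for a in seq:
        s = dynamics(s, a)
        ret += -abs(s)                 # reward: keep the state near zero
    return ret

def cem_plan(s0, H=5, K=100, n_elite=10, iters=5):
    """Cross-entropy method: repeatedly refit the sampling distribution to
    the elite action sequences until it concentrates on near-optimal plans."""
    mu, sigma = np.zeros(H), np.ones(H)
    for _ in range(iters):
        seqs = np.clip(mu + sigma * rng.standard_normal((K, H)), -1, 1)
        rets = np.array([rollout_return(s0, seq) for seq in seqs])
        elite = seqs[np.argsort(rets)[-n_elite:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mu

best_seq = cem_plan(2.0)
```

Each CEM iteration still pays the full `K` rollouts of depth `H`, which is exactly the per-step cost that skipping a replanning event saves.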
4 The Promise of Imagination
Generating trajectories is an essential part of the entire process. The predicted states within those trajectories may have high variance, and their quality depends on the complexity of the environment as well as the number of steps into the future, $H$. Estimation and online update of uncertainty are needed to determine whether a trajectory is reliable. We contend that when predicted trajectories are reliable, it is not necessary to replan them frequently.
Starting from a state $s_t$ at time $t$, a run of the planner yields the optimal set of actions $a_{t:t+H}^*$ of $H$ steps. Using $\hat{f}$, the dynamics model yields the probability of the next state, from which we can sample the next state $\hat{s}_{t+1}$ with confidence $\hat{\sigma}_{t+1}$ and future reward $\hat{r}_{t+1}$. Thus, one step sampled from the imagined trajectory is composed of $(s_t, a_t, \hat{r}_{t+1}, \hat{s}_{t+1}, \hat{\sigma}_{t+1})$, where $s_t$ represents the current state, $a_t$ the action to be taken, $\hat{r}_{t+1}$ the predicted reward if $a_t$ is executed, $\hat{s}_{t+1}$ the predicted next state, and $\hat{\sigma}_{t+1}$ the confidence bound issued by the dynamics model for the prediction. The planner then generates iteratively the entire trajectory of $H$ steps, where each step is described by a probability distribution over the next state.
Our methods stem from the following information obtained after executing each step in a trajectory: the instantaneous deviation between predicted and real outcome, and the impact on the projected plan; see Fig. 1. Executing the action $a_t$ from the imagined trajectory yields a real-world state $s_{t+1}$ and reward $r_{t+1}$. The state is expected to fall within the uncertainty of the model, $\hat{\sigma}_{t+1}$. Given $s_{t+1}$ and $\hat{s}_{t+1}$, the instantaneous error can be measured. The error at step $n$ refers to the error after trajectory calculation and executing $n$ actions. We can model the error distribution and observe whether new errors fall within it. This refers to the immediate effect on the last state. As regards the impact on future actions, forward application of the model to the remainder of the actions, starting at the new state $s_{t+1}$, yields a new trajectory with projected states and projected rewards, which can be compared with the planned expected outcomes.
Trajectory Quality Analysis:
As a preliminary experiment, we analyze the quality of imagined trajectories with a trained dynamics model, to determine boundaries on how many actions can be executed without deviating from the plan. We analyzed imagined trajectories of agents in the MuJoCo (Todorov et al., 2012) physics engine, with the Cartpole (CP) environment, with task horizon $T = 200$ and trajectory horizon $H = 25$. The additional material shows similar findings in other environments. A dynamics model was pre-trained in conventional MBRL-MPC, running Algorithm 1 five times from scratch, with trajectory (re-)planning after executing each single action. The best performing model was selected for the analysis. The procedure consisted of collecting the errors between predicted and actual states when avoiding re-planning for $n$ steps. For each $n$, the algorithm was run for $T$ steps, 10 times. Therefore, the error at $n = 0$ represents the average error of 10 runs executing the first action of the trajectory. Figure 2 illustrates the error of predicted trajectories as a function of steps used (i.e., avoiding re-planning). While the error and its variation increase with $n$, the minimum error at each step is still at the same level as (and often lower than) the average error at step 0, where re-planning is not avoided at all. Generally, re-planning earlier results in a lower error. However, the chart also shows that some trajectories are so reliable that 19 steps can be executed with an error lower than the average error of the first point in the trajectory.
Reward Analysis: Compared to the vector of state errors, the reward has the advantage of being a more compact representation (a single scalar), while still providing substantial information. Figure 2 (right) shows the reward of a successfully solved task in CP. After 50 steps, the reward does not change significantly and the system is at a local equilibrium. We contend that when the system is at equilibrium, the dynamics model can reliably anticipate the outcomes of the agent's actions; consequently, rewards are expected to remain similar.
5 Acting Upon Imagination
From the above discussion, the following information is available after executing each step (shown in Fig. 2): (i) the immediate error, (ii) the model uncertainty or confidence bounds for an imagined action against its execution, (iii) the deviation in projected future states, and (iv) the deviation in projected future rewards. The last two pieces, (iii) and (iv), are obtained by forward-applying the dynamics model to the remainder of the actions starting at the new state. We leverage each piece of available information to develop methods for uncertainty estimation and evaluate their performance in avoiding replanning events.
Algorithm 3 presents the core logic of our proposed methods to continue acting upon imagined trajectories and reduce computation. A trust flag is updated with the result of one of the four proposed methods in Algorithms 4, 5, 6 or 7. Depending on the outcome, replanning can be avoided and computations reduced. If the flag is False, only the first predicted action is executed in the environment. Otherwise, subsequent actions from the trajectory are executed until the flag is set back to False or the number of steps in the environment is reached.
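The control flow of this loop can be sketched as follows; the planner, dynamics model, and the simple error-threshold trust test are illustrative placeholders (the actual tests are Algorithms 4 to 7, and all names here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

def env_step(s, a):                    # real environment (noisy)
    return 0.9 * s + a + 0.01 * rng.standard_normal()

def dynamics(s, a):                    # learned model (noise-free stand-in)
    return 0.9 * s + a

def plan(s, H=10):
    """Stand-in planner: greedy sequence under the model, storing the
    predicted states for the trust test."""
    actions, preds = [], []
    for _ in range(H):
        a = float(np.clip(-0.9 * s, -1, 1))
        s = dynamics(s, a)
        actions.append(a)
        preds.append(s)
    return actions, preds

def trusted(s_real, s_pred, tol=0.05):
    """Pluggable uncertainty test (here: a simple error threshold)."""
    return abs(s_real - s_pred) < tol

s, replans, i = 3.0, 0, 0
actions, preds = plan(s)
for _ in range(50):
    if i >= len(actions):              # plan exhausted -> must replan
        actions, preds = plan(s)
        replans += 1
        i = 0
    s_new = env_step(s, actions[i])
    if trusted(s_new, preds[i]):
        i += 1                         # keep acting upon the imagined plan
    else:
        actions, preds = plan(s_new)   # deviation detected -> replan
        replans += 1
        i = 0
    s = s_new
```

Swapping `trusted` for any of the four proposed tests leaves the surrounding loop unchanged, which is what makes the methods interchangeable.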
N-Skip
As a baseline, we introduce N-Skip, a straightforward method for replanning that executes a fixed number of $n$ steps of a trajectory (of length $H$) and triggers replanning at step $n$ ($n \leq H$). For $n = 1$, the trajectory is recomputed at every step. As shown above, earlier replanning generally leads to lower error; $n$ is a hyperparameter that should be tuned to meet the performance requirements. In the CP environment, Figure 2 (left) shows the error increasing sharply after a few steps, while each skipped step amounts to fewer computations. Interestingly, despite its simplicity, N-Skip has not been extensively analyzed or reported in the existing literature.
First Step Alike (FSA)
Some trajectories are more reliable than others, and a fixed cutoff of $n$ skips for all trajectories does not account for this variation in quality. Figure 2 (left) shows that there are cases where, even after 19 steps of a trajectory, the prediction error is still lower than the average error of predicted states right after replanning. To account for such variation, we propose dynamic decision making. We would like to continue acting if replanning would not improve over the error of the predicted states. The error is lowest just after replanning and increases with the number of steps. So the main principle of FSA is to omit replanning at any point $n$, as long as the error is comparable to errors right after replanning.
In formal terms, assume a large sample of $M$ errors collected right after replanning ($n = 0$), denoted by $E$. Actions in a trajectory with predicted states are evaluated at each point $n$, and if the error $e_n$ fits the distribution of $E$, then the replanning is skipped; otherwise, a new trajectory is generated. The challenge lies in determining when the error fits the distribution of $E$. Two methods are proposed for handling this. If the errors follow a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$, then according to the three-sigma rule (Pukelsheim, 1994), 68.27%, 95.45% and 99.73% of the errors should lie within one, two and three standard deviations of the mean, respectively. It follows that any given error $e_n$ such that $|e_n - \mu| \leq m\sigma$ fits the distribution, and thus the replanning should be skipped. The constant $m$ is a hyperparameter that defines the selectivity of the filter. This filtering ensures that the error is below a percentile of the errors in $E$, where the percentage depends on $m$. Furthermore, as we do not want to filter out errors that are too small, we can adapt the rule to one side only: $e_n \leq \mu + m\sigma$. Finally, for the case where the distribution of $E$ is not Gaussian, a similar effect can be achieved by ensuring that $e_n$ is within the $q$-th percentile of the errors in $E$, where $q$ is a parameter to tune. The specific logic for FSA is given by Algorithm 4.
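The FSA decision can be sketched in isolation as follows (the reference errors, names, and thresholds are illustrative; in the paper, the reference distribution comes from errors logged right after replanning):

```python
import numpy as np

rng = np.random.default_rng(5)

def fsa_skip(err, ref_errors, m=2.0, gaussian=True, q=95):
    """First-Step-Alike test: skip replanning if the current prediction
    error still looks like an error measured right after replanning."""
    if gaussian:
        mu, sigma = ref_errors.mean(), ref_errors.std()
        return bool(err <= mu + m * sigma)       # one-sided m-sigma rule
    return bool(err <= np.percentile(ref_errors, q))  # non-Gaussian variant

# reference distribution: errors collected right after replanning (n = 0)
ref = np.abs(rng.normal(0.0, 0.1, 1000))
```

Raising `m` (or `q`) makes the filter more permissive, trading prediction accuracy for fewer replanning events.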
Confidence Bounds (CB)
If the dynamics model has a notion of uncertainty, one can obtain a prediction $\hat{s}_{t+1}$ with an uncertainty or confidence $\hat{\sigma}_{t+1}$. Given that $\hat{f}$ is modeled by an ensemble of Gaussian regressors, we can assume that the confidence bound represents the variability in predicted outcomes of an action $a_t$, where the action in question has been deemed appropriate at the given state and is therefore in the trajectory. After performing the action, we obtain a real-world state $s_{t+1}$. This method considers the trajectory reliable if the actual state is close to the predicted state, within the confidence bound of predicted output states obtained from the dynamics model, meaning $|s_{t+1} - \hat{s}_{t+1}| \leq m \hat{\sigma}_{t+1}$, where $m$ is a constant representing the selectivity of the filter, a hyperparameter to be tuned. The specific logic for the CB method is given by Algorithm 5. In a nutshell, this method assumes that performing an action could lead to several expected possible outcomes (bounded by the prediction). After performing the action, it is observed whether the outcome lies within the boundary of expected outcomes, to determine the reliability of the trajectory.
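A minimal sketch of the CB test (names and numbers are illustrative; the predicted standard deviation would come from the probabilistic ensemble):

```python
import numpy as np

def cb_skip(s_real, s_pred, sigma_pred, m=2.0):
    """Confidence-bound test: trust the trajectory if every component of
    the observed state lies within m predicted standard deviations of the
    predicted state."""
    return bool(np.all(np.abs(s_real - s_pred) <= m * sigma_pred))

s_pred = np.array([1.0, -0.5])         # predicted state
sigma = np.array([0.1, 0.2])           # predicted confidence bound
```

Unlike FSA, the tolerance here adapts per prediction: confident predictions (small sigma) are held to a tighter bound than uncertain ones.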
Probabilistic future trust (FUT)
FSA and CB assess the error between the state obtained after executing an action and the prediction estimated by $\hat{f}$. Instead, FUT regards the effect of the last action on the outcomes of future actions, by projecting the remaining imagined actions of the trajectory from the newly obtained state. After replanning, the trajectory of $H$ steps, where each step is composed of $(s_t, a_t, \hat{r}_{t+1}, \hat{s}_{t+1}, \hat{\sigma}_{t+1})$, offers a wealth of predicted information. FUT intends to detect whether, after taking $a_t$ and reaching a new state $s_{t+1}$, the rest of the predictions still hold. Thus, we project the trajectory from state $s_{t+1}$ using the imagined actions, and then compare the projected trajectory with the originally imagined one. If they differ, the agent has deviated from the plan and a replanning event should be triggered; otherwise, it proceeds to take the next action. As long as the new trajectory is similar to the original estimation, we assume that the original plan is still valid and we skip that step. This does not mean that the optimal set of actions is replanned at each step. Rather, every time a replanning is skipped, we propagate only one trajectory, starting from the current state of the simulator and still using the original set of actions as initially planned. Algorithm 6 describes the FUT method. The original trajectory is estimated with a probability distribution, as is the updated plan. We use the Kullback-Leibler (KL) divergence to evaluate the change between the two distributions after each step in the simulator, and replan when the difference is larger than a threshold (a hyperparameter). We control how far ahead the two distributions are compared by introducing a hyperparameter LA (look-ahead steps).
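The per-step comparison can be sketched with Gaussian state distributions and the closed-form KL divergence (the distributions, names, and thresholds below are illustrative):

```python
import numpy as np

def kl_gauss(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(p || q) for univariate Gaussians."""
    return 0.5 * (np.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def fut_skip(plan_dist, proj_dist, step, la=3, threshold=0.5):
    """Trust the remaining plan if, over the next `la` steps, the projected
    state distributions stay close (in KL) to the originally planned ones."""
    for k in range(step, min(step + la, len(plan_dist))):
        mu_q, var_q = plan_dist[k]     # originally imagined distribution
        mu_p, var_p = proj_dist[k]     # distribution projected from new state
        if kl_gauss(mu_p, var_p, mu_q, var_q) > threshold:
            return False               # deviated from plan -> replan
    return True

# original plan vs. a projection that drifts away from step 3 onwards
plan_dist = [(0.5 * k, 0.04) for k in range(6)]
proj_ok = [(0.5 * k + 0.01, 0.04) for k in range(6)]
proj_bad = [(0.5 * k + (0.8 if k >= 3 else 0.0), 0.04) for k in range(6)]
```

Note the role of LA: a drift beyond the look-ahead window is deliberately ignored until the agent gets close to it.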
Bound Imagined Cost Horizon Omission (BICHO)
Figure 2 (right) shows that when the system reaches equilibrium, the reward stabilizes as well. BICHO assumes the expected reward is stable and attempts to determine whether deviations arise after each step of the trajectory.
At each replanning step, we obtain the distribution of rewards for the $H$ steps in the future. Moreover, at each step of the environment, regardless of whether the replanning was skipped or not, we project a new trajectory of $H$ steps, starting from the state given by the environment and using the actions of the imagined trajectory obtained at the replanning step. We then compare these two distributions over LA steps, a hyperparameter to tune. Essentially, replanning steps should be skipped as long as the projected future reward does not change significantly with respect to the reward expected from the plan. Note that when doing the comparison immediately after replanning, both trajectories start from the same state. As steps are taken without replanning, the imagined reward starts from an imagined state, while the projected trajectory starts at the environment state. We expect this method to work better (in terms of replanning skipped) in environments where the cost has local equilibrium regions. The requirement for large overlapping regions between consecutive trajectories is not necessary in our approach, as we consider an overlap of only LA steps ahead, which is typically smaller than the trajectory horizon ($H$).
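The reward-based test can be sketched as follows (reward sequences, names, and thresholds are illustrative; in the paper the rewards come from the imagined and projected trajectories of the dynamics model):

```python
import numpy as np

def bicho_skip(planned_rewards, projected_rewards, step, la=3, threshold=0.1):
    """Skip replanning while the reward projected from the current real
    state stays close to the reward expected from the original plan, over
    the next `la` steps."""
    end = min(step + la, len(planned_rewards))
    diff = np.abs(np.asarray(projected_rewards[step:end])
                  - np.asarray(planned_rewards[step:end]))
    return bool(np.all(diff <= threshold))

planned = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]      # equilibrium: stable reward
proj_stable = [1.0, 0.98, 1.01, 1.0, 0.99, 1.0]
proj_drift = [1.0, 0.97, 0.6, 0.5, 0.4, 0.3]  # projected reward collapsing
```

Comparing scalar rewards instead of full state vectors makes this the cheapest of the forward-propagation tests, at the cost of being blind to state deviations that do not (yet) affect reward.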
6 Experiments
Our ultimate goal is to reduce computations while controlling performance decay. Intuitively, we expect that a trained dynamics model can anticipate the outcome of the agent's actions and, if its predictions are reliable, it can do so for a number of consecutive imagined steps. So, the first experiment uses pre-trained dynamics to assess the potential gains of acting upon imagined trajectories with the proposed methods: N-Skip (NS), FSA, CB, FUT and BICHO. We aim to investigate the amount of re-planning that can be avoided, in terms of the number of trajectory plannings skipped, the impact of our approach on the reward, and the average and variance of the number of steps executed prior to initiating replanning.
We recognize that acting upon imagination has the potential to afford significant time savings while training the dynamics models, and can potentially obtain a better result in terms of the percentage of re-planning. Therefore, our second experiment evaluates the selected methods online, while training the dynamics, to assess the savings in overall training time and the effects on performance. In this experiment, we select the best performing methods from the previous experiment to test them while training the dynamics model from scratch. We evaluate the methods on agents in the MuJoCo (Todorov et al., 2012) physics engine, using two workstations with a latest-generation GPU. To establish a valid comparison with the baseline PETS (Chua et al., 2018) (denoted as NS1), we use four environments with corresponding task length ($T$) and trajectory horizon ($H$).
We use the following environments: Cartpole (CP): $T = 200$, $H = 25$. Reacher (RE): $T = 150$, $H = 25$. Pusher (PU): $T = 150$, $H = 25$. HalfCheetah (HC): $T = 1000$, $H = 30$. This means that each iteration runs for $T$ (task horizon) steps, and that imagined trajectories include $H$ (trajectory horizon) steps. The state of the environment is a vector of $d_S$ components and the action a vector of $d_A$ components. We assess performance in terms of reward per episode, and evaluate wall time and avoided re-planning. All experiments use random seeds and randomize initial conditions per task.
6.1 Experiment I: Pre-trained Dynamics
This experiment uses a trained model and compares the uncertainty estimation methods NS, FSA, CB, FUT and BICHO. A pre-trained dynamics model is expected to reliably predict the outcomes of immediate actions and to project a number of steps reliably, with any variability attributable to task complexity. We quantify the number of consecutively skipped steps for each method and environment (see appendix), as well as the corresponding percentage of replanning.
A dynamics model is pre-trained for each environment by running Algorithm 1 (no skip) from scratch five times and selecting the best-performing model. As a result, we obtain one dynamics model, parameter set and replay buffer per task, which we use to evaluate our methods. For each method, the different hyper-parameters are evaluated empirically to determine how robust the algorithms are with respect to hyper-parameters across the different environments. We report the amount of replanning and the corresponding performance in terms of reward per episode.
We validate each method’s hyper-parameters with 10 runs per task, using different random seeds and randomized initial conditions, and report the episode reward as the maximum reward obtained by the agent over the 10 runs. For NS, we evaluate several fixed skip lengths, where the smallest setting recalculates at every step and serves as the baseline comparison. For FSA, we evaluate several constant thresholds. The error distribution is constructed from the errors between the predicted state and the actual state in the environment, using the data set obtained while pre-training the dynamics model. As the collected errors do not follow a normal distribution, percentiles are used in Algorithm 4 to determine whether a trajectory should be recalculated. In CB, we evaluate a range of threshold values in small increments. We expect a higher threshold to decrease performance, whereas a very low threshold makes the algorithm too selective and results in no skipping at all. In FUT and BICHO, we evaluate different thresholds for different numbers of look-ahead steps, ranging from 1 up to the trajectory horizon, and report the best results over the full range.
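The percentile-based decision rule can be sketched as follows; the function names and the concrete percentile are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def fsa_threshold(errors, percentile):
    """Build a replanning threshold from past one-step prediction errors.

    `errors` holds Euclidean distances between predicted and observed
    states collected while (pre-)training the dynamics model.  Since the
    collected errors are not normally distributed, a percentile is used
    instead of a sigma rule.
    """
    return np.percentile(errors, percentile)

def should_replan(predicted_state, observed_state, threshold):
    """Replan when the new one-step error exceeds the expected error."""
    err = np.linalg.norm(predicted_state - observed_state)
    return err > threshold
```

In use, `fsa_threshold` is computed once from the replay buffer, and `should_replan` is evaluated after every executed action, costing only a vector norm per step.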
Results.
For comparison, we add the results for Soft Actor-Critic (SAC) (Haarnoja et al., 2018) and Proximal Policy Optimization (PPO) (Schulman et al., 2017) at convergence. See Appendix Table 1 for detailed results for each environment.
CP. Fig. 3 (left) shows the performance of NS, FSA, CB, FUT and BICHO in CP. There is no visible performance degradation when replanning only 40% of the time, and from 40% down to 20% the task is still solved with a minor hit in performance. Below 20%, results start decreasing dramatically. Interestingly, with BICHO there is no drastic loss in performance even when replanning only 9% of the steps, which is very close to the limit imposed by the trajectory horizon.
HC. Fig. 3 (mid-left) shows the performance in HC, a more complex environment. The graph shows no impact on performance for FSA, CB, FUT and BICHO when replanning as little as 80% of the steps, whilst performance remains acceptable and better than SAC at convergence. With less than 60% replanning, performance drops drastically; however, BICHO still outperforms PPO when recalculating only 20% of the steps. N-skip cannot reduce replanning beyond n=1 without drastically degrading performance, showing that an adaptive method is necessary to skip replanning in complex environments such as HC.
PU. Fig. 3 (mid-right) shows the PU results, revealing that FSA, CB and n-skip perform stably at 80% replanning, with a drastic drop when replanning decreases further. FUT keeps good performance down to 50% replanning, below which performance decreases drastically. BICHO drops slightly in performance below 40% but still maintains good performance when replanning only 10% of the steps.
RE. Fig. 3 (right) shows the results in RE. It reveals no visible performance degradation when replanning only 40% of the time, and from 45% to 30% the task is still solved by FSA and CB with a minor hit in performance. Below 20%, reward decreases dramatically. The performance of FUT drops below 30%. In this environment, n-skip degrades comparably to the adaptive methods; however, it should be kept in mind that n-skip uses a fixed skip length. The adaptive methods at these replanning levels still retain performance above SAC and PPO.
In all environments, the methods that project future actions achieve longer sequences of steps without replanning at an acceptable loss in reward. The results show that replanning only a fraction of the steps is feasible whilst retaining state-of-the-art performance; in environments with lower-dimensional action and state spaces, or lower complexity, most replanning steps can be saved. While blindly skipping replanning has some effect, the adaptive methods offer a reasonable trade-off and can be tuned to operate at levels of replanning and performance not reachable by n-skip.
6.2 Experiment II: Online dynamics update
As the model is being trained, the number of outcomes it can reliably predict varies. Here, uncertainty estimation should result in aborting plans in favor of replanning early in training, whilst executing longer trajectories as training progresses.
We evaluate our methods while training the dynamics model, using Algorithm 3 with FUT and BICHO. These methods were selected due to their performance in Experiment I and because they do not need an error model trained in advance, so we can approximate a real deployment of the method with minimal tuning effort. We evaluate the algorithm in each environment with 3 runs. We increased the number of training iterations (episodes) by 50% in order to better observe the effects of skipping replanning. PETS (Chua et al., 2018) is the baseline.
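A minimal sketch of the FUT-style check: re-propagate the remaining planned actions from the newly observed state and abort the plan when the predicted states deviate from the originally imagined ones. The deterministic `dynamics(state, action)` interface is an assumption for brevity; the actual model is a probabilistic ensemble:

```python
import numpy as np

def fut_check(dynamics, state, remaining_actions, imagined_states, eps):
    """FUT-style check (sketch): re-propagate the remaining plan from
    the newly observed state and compare against the originally imagined
    states; request replanning when the deviation exceeds eps.
    """
    s = state
    for a, s_imagined in zip(remaining_actions, imagined_states):
        s = dynamics(s, a)
        if np.linalg.norm(s - s_imagined) > eps:
            return True   # deviation too large: abort plan and replan
    return False          # remainder of the plan can be trusted
```

The check costs one forward propagation of a single trajectory per executed step, far cheaper than optimizing thousands of candidate trajectories from scratch.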
Results.
In the following, we outline the performance of the algorithms with their best-performing hyper-parameters:
CP. Fig. 4 shows the results of BICHO and FUT versus the baseline, NSKIP3 and NSKIP4, depicting the wall time relative to the no-skip baseline. As NSKIP3 replans 33% of the steps, it takes approximately a third of the time to train compared with the baseline, neglecting the training time of the dynamics model. Both BICHO and FUT outperform the baseline and the static skip methods, retaining the same performance as NSKIP while replanning only a fraction of the steps. With BICHO, we observe a negligible impact on performance while substantially reducing the wall time and reaching top performance within a few episodes. To achieve this performance, conventional MPC needs one planning call per step (200 in total, cf. Table 1), whereas BICHO performed 28 (SD=1.68); see Table 6 in the appendix. As the dynamics model improves, it produces more accurate predictions and the percentage of replanning steps drops; indeed, after 10 episodes the replanning needed drops sharply.
PU. BICHO outperforms the baseline, the static methods and FUT. It maintains the performance of no skip whilst replanning only a fraction of the steps and substantially reducing the wall time, reaching peak performance within a few episodes (see Fig. 4).
RE. Fig. 4 shows the results of BICHO and FUT versus the baseline, NSKIP2 and NSKIP3. FUT outperforms the baseline and static skip methods, whereas BICHO is more conservative in this environment and needs more replanning to reach top performance. Both BICHO and FUT maintain the performance of no skip, but BICHO needs 48% replanning events whereas FUT needs only 14%.
HC. Fig. 4 shows the results of BICHO and FUT versus the baseline and NSKIP2. Both BICHO and FUT outperform the static skip methods. BICHO has slightly worse top performance than the no-skip baseline but skips 20% of the replanning steps. FUT can skip more steps, but the reward drops sharply. In this environment, each method reaches a local optimum before it continues improving; our methods reach this point very quickly without a performance drop while skipping up to 40% (FUT) of the replanning steps.
The graphs clearly illustrate significant savings in training time when acting upon imagination. In some cases these savings are achieved without loss in reward, and with minimal loss in others. More importantly, any such loss should be weighed against the baseline training running five or six times longer.
7 Discussion
Acting upon imagination advocates trusting a reliable imagined trajectory for several steps. Our experiments show that this leads to a 20%-80% reduction in computation, depending on the environment, while maintaining acceptable reward. The proposed methods leverage different kinds of information available after taking an action in the environment: FSA and CB decide whether to act based on evaluating the last action in a trajectory, while FUT and BICHO evaluate the planned future actions from the new state; the latter result in less replanning. The proposed methods apply to a range of dynamics models for reducing computational costs, regardless of whether the model can output uncertainty. FUT and BICHO can be used with any dynamics model that models uncertainty, whereas FSA and CB can be used with any MBRL algorithm by computing statistics over a sliding window of past experiences. The choice of algorithm therefore depends on the nature of the dynamics model: if it does not provide a notion of uncertainty, FSA is preferable; otherwise, the methods looking towards the future (FUT, BICHO) are superior in terms of saved computation and stability of performance (solving the problem). Of these, BICHO saves the most computation while performing at least as well as the baseline.
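BICHO's reward-based criterion can be sketched analogously to the state-based check: re-project the remainder of the plan from the current state and replan when the re-projected return deviates from the originally imagined return by more than a tolerance. All names and interfaces here are illustrative assumptions:

```python
import numpy as np

def bicho_check(dynamics, reward_fn, state, remaining_actions,
                imagined_return, beta):
    """BICHO-style check (sketch): re-project the remainder of the plan
    from the current state, sum the predicted rewards, and request
    replanning when the re-projected return deviates from the imagined
    return by more than a tolerance beta.
    """
    s, ret = state, 0.0
    for a in remaining_actions:
        s = dynamics(s, a)        # one-step prediction (assumed interface)
        ret += reward_fn(s, a)    # predicted reward along the re-projection
    return abs(ret - imagined_return) > beta
```

Comparing scalar returns rather than full state vectors sidesteps the shortcomings of the Euclidean state distance discussed below.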
Planning
Skipping replanning can indeed be particularly beneficial in robotics, where hardware limitations often impose constraints on computational resources. By intelligently skipping unnecessary replanning events, we can allocate computational power more efficiently and potentially leverage more sophisticated models or algorithms.
Uncertainty estimation
One alternative avenue of research, building on the work of Zhu et al. (2020), would be a progressive measure of mutual information between the imagined and real trajectories, updated with each successive step, to decide whether or not to replan. As a limitation, the error expressed as the Euclidean distance between two state vectors is simple and useful, but it may give misleading information. Comparably, BICHO, which looks for deviations between the imagined reward and an updated reprojection of future rewards, achieves superior performance. A more sophisticated method could take as input the two state vectors, the actions, and the predicted and observed rewards, and output a decision to act or replan; for example, a model-free trained policy.
The proposed methods have important implications beyond the greedy motivation of reducing computational effort and time complexity. They offer a way to assess how good the dynamics model is at predicting the outcomes of the agent’s actions. Conversely, they offer a way to evaluate experiences: experiences where the outcome of the environment deviates from the model’s predictions may be more informative for training. Indeed, our work offers interesting insights into using our method for guided exploration. Assuming the steps in an imagined trajectory can be trusted, their evaluation yields a small error, meaning the dynamics model successfully predicts these transitions. One could refine exploration by omitting actions that lead to transitions with low error, thus favouring less known transitions for future training. We assume that replanning is due to errors in the dynamics model. In a receding-horizon task, each replanning step adjusts the plan to the newly observed state; but as we act upon imagination, the horizon is no longer receding and the trajectory risks becoming obsolete. When, then, should the plan be adjusted towards a new horizon of imagination?
8 Conclusion
In conclusion, our study provides a comprehensive analysis quantifying the error of predicted trajectories in MBRL. We propose methods for online uncertainty estimation in MBRL, incorporating techniques that observe the outcome of the last action in the trajectory: comparing the error after performing the last action with the standard expected error, and assessing the deviation from expected outcomes using model uncertainty. Additionally, we introduce methods that exploit the forward propagation of the dynamics model to evaluate whether the remainder of the plan aligns with expected results and to assess it in terms of expected reward. These methods update the uncertainty estimate in real time to assess the utility of the plan. We demonstrate the efficacy of these uncertainty estimation techniques: our methods not only leverage accurate predictions but also intelligently determine when to replan trajectories, significantly reducing training time and optimizing the use of computational resources by eliminating unnecessary replanning steps. Overall, our findings highlight the potential of these methods to enhance the performance and efficiency of sampling-based MBRL approaches.
Acknowledgment
Funding in direct support of this work: Adrian Remonda reports financial support was provided by AVL List GmbH. Adrian Remonda reports a relationship with Know-Center GmbH that includes: employment. This research was partially funded by AVL GmbH and Know-Center GmbH. Know-Center is funded within the Austrian COMET Program-Competence Centers for Excellent Technologies - under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.
References
- Botev etal. [2013]ZdravkoI. Botev, DirkP. Kroese, ReuvenY. Rubinstein, and Pierre L’Ecuyer.Chapter 3 - the cross-entropy method for optimization.In C.R. Rao and Venu Govindaraju, editors, Handbook ofStatistics, volume31 of Handbook of Statistics, pages 35 – 59.Elsevier, 2013.doi: https://doi.org/10.1016/B978-0-444-53859-8.00003-5.URLhttp://www.sciencedirect.com/science/article/pii/B9780444538598000035.
- Buckman etal. [2018]Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee.Sample-efficient reinforcement learning with stochastic ensemblevalue expansion.In Proceedings of the 32nd International Conference on NeuralInformation Processing Systems, NIPS’18, page 8234–8244, Red Hook, NY,USA, 2018. Curran Associates Inc.
- Camacho etal. [2004]E.F. Camacho, C.Bordons, and C.B. Alba.Model Predictive Control.Advanced Textbooks in Control and Signal Processing. Springer London,2004.ISBN 9781852336943.URL https://books.google.at/books?id=Sc1H3f3E8CQC.
- Chua etal. [2018]Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine.Deep reinforcement learning in a handful of trials usingprobabilistic dynamics models, 2018.
- Deisenroth etal. [2013]MarcPeter Deisenroth, Gerhard Neumann, and Jan Peters.A survey on policy search for robotics.Foundations and Trends in Robotics, 2(1–2):1–142, 2013.
- Drews etal. [2017]Paul Drews, Brian Goldfain, Grady Williams, and EvangelosA. Theodorou.Aggressive deep driving: Model predictive control with a cnn costmodel.2017.
- Haarnoja etal. [2018]Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine.Soft actor-critic: Off-policy maximum entropy deep reinforcementlearning with a stochastic actor, 2018.
- Hafez etal. [2020]MuhammadBurhan Hafez, Cornelius Weber, Matthias Kerzel, and Stefan Wermter.Improving robot dual-system motor learning with intrinsicallymotivated meta-control and latent-space experience imagination.CoRR, abs/2004.08830, 2020.URL https://arxiv.org/abs/2004.08830.
- Hafner etal. [2020]Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi.Dream to control: Learning behaviors by latent imagination, 2020.
- Hansen etal. [2022]Nicklas Hansen, Xiaolong Wang, and Hao Su.Temporal difference learning for model predictive control, 2022.
- Heess etal. [2015]Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, andTom Erez.Learning continuous control policies by stochastic value gradients,2015.
- Janner etal. [2019]Michael Janner, Justin Fu, Marvin Zhang, and S.Levine.When to trust your model: Model-based policy optimization.In NeurIPS, 2019.
- Kalweit and Boedecker [2017]Gabriel Kalweit and Joschka Boedecker.Uncertainty-driven imagination for continuous deep reinforcementlearning.volume78 of Proceedings of Machine Learning Research, pages195–206. PMLR, 13–15 Nov 2017.URL http://proceedings.mlr.press/v78/kalweit17a.html.
- Lakshminarayanan etal. [2016]Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell.Simple and scalable predictive uncertainty estimation using deepensembles, 2016.
- Lillicrap etal. [2015]TimothyP. Lillicrap, JonathanJ. Hunt, Alexander Pritzel, Nicolas Heess, TomErez, Yuval Tassa, David Silver, and Daan Wierstra.Continuous control with deep reinforcement learning.CoRR, abs/1509.02971, 2015.URL http://arxiv.org/abs/1509.02971.
- Mnih etal. [2015]Volodymyr Mnih, Koray Kavukcuoglu, David Silver, AndreiA. Rusu, Joel Veness,MarcG. Bellemare, Alex Graves, Martin Riedmiller, AndreasK. Fidjeland,Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, IoannisAntonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, andDemis Hassabis.Human-level control through deep reinforcement learning.Nature, 518(7540):529–533, February 2015.ISSN 00280836.URL http://dx.doi.org/10.1038/nature14236.
- Nagabandi etal. [2018]Anusha Nagabandi, G.Kahn, RonaldS. Fearing, and S.Levine.Neural network dynamics for model-based deep reinforcement learningwith model-free fine-tuning.2018 IEEE International Conference on Robotics and Automation(ICRA), pages 7559–7566, 2018.
- Oh etal. [2017]Junhyuk Oh, Satinder Singh, and Honglak Lee.Value prediction network.In NIPS, 2017.
- Pukelsheim [1994]Friedrich Pukelsheim.The three sigma rule.The American Statistician, 48:88–91, 1994.
- Rao [2010]AnvilV. Rao.A survey of numerical methods for optimal control.Advances in the Astronautical Science, 135:497–528,2010.
- Schulman etal. [2017]John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.Proximal policy optimization algorithms, 2017.
- Sutton [1990]RichardS. Sutton.Integrated architectures for learning, planning, and reacting basedon approximating dynamic programming.In ML Workshop, 1990.
- Sutton and Barto [1998]RichardS. Sutton and AndrewG. Barto.Introduction to Reinforcement Learning.MIT Press, Cambridge, MA, USA, 1st edition, 1998.ISBN 0262193981.
- Todorov etal. [2012]Emanuel Todorov, Tom Erez, and Yuval Tassa.Mujoco: A physics engine for model-based control.In IROS, pages 5026–5033. IEEE, 2012.ISBN 978-1-4673-1737-5.
- Williams etal. [2017]Grady Williams, Paul Drews, Brian Goldfain, JamesM. Rehg, and EvangelosA.Theodorou.Information theoretic model predictive control: Theory andapplications to autonomous driving, 2017.
- Yu etal. [2020]Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine,Chelsea Finn, and Tengyu Ma.Mopo: Model-based offline policy optimization, 2020.
- Zhu etal. [2020]Guangxiang Zhu, Minghao Zhang, Honglak Lee, and Chongjie Zhang.Bridging imagination and reality for model-based deep reinforcementlearning.NeurIPS, 2020.
Appendix A Environments
We evaluate the methods on agents in the MuJoCo (Todorov et al., 2012) physics engine. To establish a valid comparison with Chua et al. (2018), we use four environments with their corresponding task length and trajectory horizon.
- 1.
Cartpole (CP): task length 200, trajectory horizon 25
- 2.
Reacher (RE): task length 150, trajectory horizon 25
- 3.
Pusher (PU): task length 150, trajectory horizon 25
- 4.
HalfCheetah (HC): task length 1000, trajectory horizon 30
This means that each iteration runs for the task-length number of steps and that imagined trajectories span the trajectory-horizon number of steps. The state and action dimensions refer to the sizes of the environment’s state and action vectors.
Appendix B Trajectory Quality Analysis
The error (Euclidean distance between the actual and predicted states) as a function of the number of predicted steps into the future is given in Figure 5. This figure extends Figure 2 to all environments. One curious observation is that the error for the environments PU and RE is relatively higher even when not skipping replanning, and it increases faster than for CP and HC.
Appendix C Computational Costs
Our proposed algorithms aim to save computation by omitting trajectory recalculations. The complexity of a trajectory recalculation is O(H × A × K), where H is the length of the horizon (we use H=20), A is the dimension of the action space (1, 7, 7 and 6 for CP, RE, PU and HC, respectively) and K is the number of trajectories generated at each recalculation. K depends on the solver and environment; in our case it is 10000, 12500, 12500 and 10000 for CP, RE, PU and HC, respectively.
However, the algorithms that decide whether to skip replanning introduce additional computation. For n-skip, the additional cost is negligible (O(1)). Both FSA and CB have complexity O(D), where computing the error is O(D) and deciding whether to skip is O(1), with D the number of dimensions of the state, which differs per environment (4, 17, 20 and 18 for CP, RE, PU and HC, respectively). FUT and BICHO project one additional trajectory of the look-ahead length, a hyperparameter, to decide whether to skip; the resulting complexity is that of propagating a single trajectory through the dynamics model, and comparing the trajectories adds a further cost linear in the look-ahead length. Comparing the costs above, it is clear that the costs introduced by n-skip, FSA and CB are negligible compared to the cost of replanning. The cost of FUT and BICHO is higher but still small compared to the computational cost of having to replan.
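A back-of-the-envelope comparison using the complexities above (and, as an assumption, the HalfCheetah values H=20, A=6, K=10000, D=18) illustrates how small the skip check is relative to one replanning event:

```python
# Rough per-event operation counts from the complexities stated above.
def replan_ops(H, A, K):
    """Cost of one trajectory recalculation, O(H x A x K)."""
    return H * A * K

def fsa_check_ops(D):
    """Cost of one FSA/CB skip decision: O(D) error + O(1) comparison."""
    return D + 1

# HalfCheetah values (assumed mapping): H = 20, A = 6, K = 10000, D = 18.
if __name__ == "__main__":
    r, c = replan_ops(20, 6, 10_000), fsa_check_ops(18)
    print(f"replanning: {r:,} ops, skip check: {c} ops, ratio ~{r // c:,}x")
```

Every skipped replanning event thus saves on the order of a million candidate-trajectory operations at the price of a vector norm.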
Appendix D EX-1: Offline Reward and Replanning Rate
Additional information for 10 runs per environment for each considered hyper-parameter is provided for reference. Table 1 summarizes selected results. Tables 2, 3, 4 and 5 show detailed results for each hyper-parameter in the environments CP, HC, PU and RE, respectively. We report the average and standard deviation of the reward per episode, the number of replanned steps and the number of sequential steps skipped. We also include the error and its standard deviation.
CP

Method | RwMax | Rw | Rc | RcPer | i
---|---|---|---|---|---
Baseline | 179.373 | 178.830 | 200.00 | 1.00 | 0.00 |
NSKIP1 | 179.610 | 178.201 | 100.00 | 0.50 | 0.99 |
NSKIP2 | 178.687 | 177.474 | 67.00 | 0.33 | 1.97 |
FSA0.50 | 179.923 | 178.997 | 130.50 | 0.65 | 1.25 |
FSA0.99 | 177.061 | 172.569 | 27.60 | 0.13 | 6.22 |
CB0.50 | 179.473 | 178.303 | 76.50 | 0.38 | 2.77 |
BICHO1032 | 179.061 | 175.951 | 21.70 | 0.10 | 7.92 |
BICHO2064 | 177.022 | 174.542 | 18.00 | 0.09 | 9.79 |
FUT014.00 | 178.620 | 175.474 | 40.000 | 0.20 | 3.94 |
HC

Method | RwMax | Rw | Rc | RcPer | i
---|---|---|---|---|---
Baseline | 16750.776 | 12764.668 | 1000.00 | 1.000 | 0.000 |
NSKIP1 | 13748.625 | 9247.623 | 500.00 | 0.500 | 0.998 |
NSKIP2 | 10266.311 | 6118.676 | 334.00 | 0.334 | 1.994 |
FSA0.50 | 15791.490 | 10881.401 | 586.00 | 0.586 | 1.613 |
BICHO050.200 | 18341.042 | 11637.706 | 662.30 | 0.662 | 0.998 |
BICHO056.000 | 8646.361 | 1872.543 | 218.50 | 0.218 | 7.004 |
FUT010.100 | 15190.099 | 10127.721 | 595.10 | 0.595 | 1.021 |
PU

Method | RwMax | Rw | Rc | RcPer | i
---|---|---|---|---|---
Baseline | -49.277 | -56.858 | 150.00 | 1.000 | 0.000 |
NSKIP1 | -68.296 | -79.493 | 75.00 | 0.500 | 0.990 |
NSKIP2 | -79.710 | -85.368 | 50.00 | 0.333 | 1.960 |
FSA0.50 | -51.347 | -78.970 | 116.70 | 0.773 | 1.199 |
CB1.00 | -49.149 | -76.829 | 128.70 | 0.853 | 1.242 |
BICHO1016.0 | -51.527 | -85.723 | 17.70 | 0.118 | 7.002 |
FUT010.40 | -56.583 | -81.134 | 73.90 | 0.493 | 1.019 |
RE

Method | RwMax | Rw | Rc | RcPer | i
---|---|---|---|---|---
Baseline | -45.121 | -45.930 | 150.00 | 1.000 | 0.000 |
NSKIP1 | -45.144 | -46.296 | 75.00 | 0.500 | 0.987 |
NSKIP2 | -46.076 | -47.167 | 50.00 | 0.333 | 1.960 |
FSA0.50 | -44.420 | -46.080 | 100.60 | 0.671 | 1.211 |
CB1.00 | -45.097 | -46.592 | 86.50 | 0.577 | 2.011 |
BICHO230.125 | -44.609 | -45.972 | 115.10 | 0.767 | 0.972 |
BICHO23768 | -45.719 | -58.677 | 49.90 | 0.333 | 2.154 |
FUT010.200 | -45.080 | -46.554 | 75.20 | 0.501 | 0.991 |
FUT018.000 | -46.922 | -49.804 | 26.60 | 0.177 | 4.549 |
CP

Method | Rw | RwSTD | Rc | RcSTD | RcPer | i mean | i STD | RwMin | RwMax
---|---|---|---|---|---|---|---|---|---
Baseline | 178.83 | 0.41 | 200.00 | 0.00 | 1.00 | 0.00 | 0.00 | 177.92 | 179.37 |
NSKIP1 | 178.20 | 0.84 | 100.00 | 0.00 | 0.50 | 0.99 | 0.00 | 176.62 | 179.61 |
NSKIP2 | 177.47 | 0.90 | 67.00 | 0.00 | 0.34 | 1.97 | 0.00 | 175.87 | 178.69 |
NSKIP3 | 175.91 | 2.72 | 50.00 | 0.00 | 0.25 | 2.94 | 0.00 | 168.24 | 177.43 |
NSKIP5 | 174.26 | 4.74 | 34.00 | 0.00 | 0.17 | 4.85 | 0.00 | 161.63 | 177.95 |
NSKIP6 | 170.22 | 6.35 | 29.00 | 0.00 | 0.14 | 5.79 | 0.00 | 162.63 | 176.79 |
NSKIP7 | 163.60 | 4.70 | 25.00 | 0.00 | 0.12 | 6.72 | 0.00 | 155.70 | 171.89 |
NSKIP8 | 158.94 | 5.97 | 23.00 | 0.00 | 0.12 | 7.65 | 0.00 | 143.23 | 163.29 |
NSKIP9 | 129.26 | 35.15 | 20.00 | 0.00 | 0.10 | 8.55 | 0.00 | 68.02 | 161.79 |
BICHO100.10 | 178.76 | 0.75 | 177.60 | 5.25 | 0.89 | 0.95 | 0.03 | 177.12 | 179.69 |
BICHO100.20 | 178.81 | 0.55 | 156.30 | 4.64 | 0.78 | 0.98 | 0.01 | 178.05 | 179.98 |
BICHO100.40 | 178.29 | 0.46 | 148.70 | 8.71 | 0.74 | 1.01 | 0.21 | 177.53 | 178.84 |
BICHO100.80 | 178.05 | 0.64 | 139.60 | 3.89 | 0.70 | 1.02 | 0.31 | 176.98 | 178.74 |
BICHO1032 | 177.71 | 0.89 | 83.90 | 15.16 | 0.42 | 1.70 | 0.84 | 176.48 | 179.06 |
BICHO10128 | 177.78 | 0.50 | 44.40 | 11.71 | 0.22 | 3.79 | 0.56 | 176.90 | 178.52 |
FSA0.15 | 178.93 | 0.53 | 173.60 | 4.81 | 0.87 | 1.02 | 0.07 | 178.03 | 179.74 |
FSA0.25 | 178.87 | 0.66 | 154.60 | 3.63 | 0.77 | 1.10 | 0.09 | 177.95 | 179.76 |
FSA0.35 | 178.87 | 0.37 | 141.80 | 4.19 | 0.71 | 1.18 | 0.08 | 177.98 | 179.43 |
FSA0.50 | 179.00 | 0.57 | 130.50 | 4.97 | 0.65 | 1.25 | 0.08 | 177.92 | 179.92 |
FSA0.99 | 172.57 | 5.24 | 27.60 | 2.59 | 0.14 | 6.23 | 0.40 | 161.68 | 177.06 |
CB0.50 | 178.30 | 1.14 | 76.50 | 6.36 | 0.38 | 2.77 | 0.50 | 175.34 | 179.47 |
CB0.90 | 120.29 | 45.19 | 27.50 | 7.50 | 0.14 | 8.78 | 0.62 | 31.59 | 170.37 |
CB1.00 | 114.16 | 34.08 | 25.80 | 5.94 | 0.13 | 9.27 | 0.62 | 44.18 | 151.96 |
CB1.75 | 74.17 | 26.98 | 15.80 | 4.59 | 0.08 | 14.43 | 0.48 | 40.05 | 117.84 |
BICHO200.05 | 178.76 | 0.75 | 177.60 | 5.25 | 0.89 | 0.95 | 0.03 | 177.12 | 179.69 |
BICHO200.10 | 178.81 | 0.55 | 156.30 | 4.64 | 0.78 | 0.98 | 0.01 | 178.05 | 179.98 |
BICHO200.70 | 177.80 | 0.61 | 56.90 | 15.60 | 0.29 | 2.87 | 0.87 | 176.95 | 178.81 |
BICHO201 | 177.60 | 0.73 | 40.20 | 7.71 | 0.20 | 4.07 | 0.60 | 176.56 | 178.57 |
BICHO208 | 177.02 | 0.83 | 26.00 | 0.82 | 0.13 | 6.48 | 0.39 | 175.50 | 178.13 |
BICHO2064 | 172.28 | 7.54 | 18.30 | 1.83 | 0.09 | 9.70 | 0.58 | 153.22 | 177.02 |
BICHO20256 | 165.47 | 14.67 | 17.00 | 1.25 | 0.08 | 10.47 | 0.97 | 138.41 | 175.40 |
BICHO230.05 | 178.69 | 0.41 | 178.00 | 3.62 | 0.89 | 0.96 | 0.02 | 178.00 | 179.30 |
BICHO230.10 | 178.37 | 0.91 | 150.80 | 4.05 | 0.75 | 0.99 | 0.03 | 176.97 | 179.57 |
BICHO230.70 | 177.88 | 0.40 | 48.80 | 17.96 | 0.24 | 3.47 | 0.90 | 177.34 | 178.56 |
BICHO231 | 177.46 | 0.69 | 42.50 | 14.32 | 0.21 | 3.99 | 0.86 | 176.23 | 178.42 |
BICHO238 | 175.01 | 7.34 | 25.30 | 1.64 | 0.13 | 6.67 | 0.46 | 154.18 | 177.88 |
BICHO2364 | 174.54 | 2.08 | 18.00 | 0.67 | 0.09 | 9.79 | 0.48 | 170.59 | 177.01 |
FUT010.05 | 178.61 | 0.64 | 167.30 | 4.88 | 0.84 | 0.97 | 0.01 | 177.55 | 179.70 |
FUT010.15 | 178.45 | 0.54 | 104.80 | 1.48 | 0.52 | 1.04 | 0.05 | 177.44 | 179.04 |
FUT010.80 | 177.08 | 0.81 | 67.50 | 1.78 | 0.34 | 1.95 | 0.05 | 175.96 | 178.15 |
FUT012.00 | 177.19 | 0.75 | 50.80 | 0.92 | 0.25 | 2.91 | 0.04 | 176.18 | 178.07 |
FUT014 | 175.47 | 4.03 | 40.00 | 0.47 | 0.20 | 3.94 | 0.09 | 164.78 | 178.62 |
FUT0164.0 | 82.46 | 34.04 | 16.40 | 1.43 | 0.08 | 10.92 | 0.78 | 27.88 | 147.24 |
FUT01256 | 23.74 | 24.51 | 10.80 | 1.48 | 0.05 | 17.11 | 0.72 | 4.45 | 82.34 |
HC

Method | Rw | RwSTD | Rc | RcSTD | RcPer | i mean | i STD | RwMin | RwMax
---|---|---|---|---|---|---|---|---|---
Baseline | 12764.668 | 2849.853 | 1000.000 | 0.000 | 1.000 | 0.000 | 0.000 | 7372 | 16750 |
NSKIP1 | 9247.623 | 2179.981 | 500.000 | 0.000 | 0.500 | 0.998 | 0.000 | 5299 | 13748 |
NSKIP2 | 6118.676 | 3011.780 | 334.000 | 0.000 | 0.334 | 1.994 | 0.000 | 1375 | 10266 |
NSKIP3 | 1443.533 | 357.099 | 250.000 | 0.000 | 0.250 | 2.988 | 0.000 | 1008 | 1944 |
NSKIP4 | 1048.150 | 147.981 | 200.000 | 0.000 | 0.200 | 3.980 | 0.000 | 859 | 1352 |
NSKIP5 | 750.226 | 36.873 | 167.000 | 0.000 | 0.167 | 4.970 | 0.000 | 704 | 823 |
NSKIP7 | 453.360 | 40.811 | 125.000 | 0.000 | 0.125 | 6.944 | 0.000 | 379 | 512 |
NSKIP9 | 261.672 | 146.953 | 100.000 | 0.000 | 0.100 | 8.910 | 0.000 | -11 | 379 |
FSA0.15 | 10637.761 | 3986.620 | 922.400 | 10.700 | 0.922 | 1.031 | 0.059 | 5718 | 16835 |
FSA0.25 | 13283.989 | 4397.851 | 854.300 | 23.636 | 0.854 | 1.099 | 0.061 | 5099 | 18182 |
FSA0.35 | 9963.964 | 2448.046 | 736.400 | 20.919 | 0.736 | 1.244 | 0.049 | 5190 | 13450 |
FSA0.50 | 10881.401 | 2565.614 | 586.000 | 22.691 | 0.586 | 1.613 | 0.078 | 6997 | 15791 |
FSA0.90 | 173.839 | 143.729 | 50.500 | 11.336 | 0.051 | 19.838 | 1.536 | 53 | 499 |
CB0.50 | 13269.067 | 3347.259 | 994.400 | 1.174 | 0.994 | 0.844 | 0.026 | 6550 | 17071 |
CB0.90 | 11010.053 | 3129.994 | 657.900 | 8.900 | 0.658 | 1.612 | 0.215 | 4781 | 15326 |
CB1.00 | 7262.139 | 3246.655 | 523.500 | 31.366 | 0.524 | 2.031 | 0.175 | 3040 | 12650 |
CB1.75 | 408.850 | 412.640 | 85.100 | 37.245 | 0.085 | 13.553 | 1.866 | 105 | 1538 |
BICHO050.050 | 12685.246 | 3163.449 | 940.100 | 3.414 | 0.940 | 0.984 | 0.004 | 5015 | 16299 |
BICHO050.100 | 11696.336 | 2267.720 | 808.400 | 9.812 | 0.808 | 0.995 | 0.002 | 9150 | 16227 |
BICHO050.200 | 11637.706 | 3778.409 | 662.300 | 8.538 | 0.662 | 0.998 | 0.010 | 6274 | 18341 |
BICHO050.800 | 8218.947 | 2584.988 | 535.400 | 4.477 | 0.535 | 1.031 | 0.122 | 4207 | 11222 |
BICHO054.000 | 5302.087 | 3623.648 | 404.700 | 40.604 | 0.405 | 1.517 | 1.104 | 1293 | 10756 |
BICHO056.000 | 1872.543 | 2639.615 | 218.500 | 130.996 | 0.218 | 7.004 | 3.940 | 5 | 8646 |
BICHO058.000 | 587.539 | 865.745 | 79.200 | 42.856 | 0.079 | 14.786 | 1.851 | 11 | 2800 |
FUT010.025 | 12936.022 | 2014.828 | 999.500 | 0.707 | 1.000 | 0.217 | 0.255 | 10037 | 15763 |
FUT010.050 | 13436.725 | 3264.052 | 964.100 | 9.158 | 0.964 | 0.971 | 0.025 | 9192 | 18000 |
FUT010.100 | 10127.721 | 2911.831 | 595.100 | 11.396 | 0.595 | 1.021 | 0.022 | 6630 | 15190 |
FUT010.125 | 8075.127 | 3342.769 | 522.200 | 5.922 | 0.522 | 1.048 | 0.013 | 1733 | 12706 |
FUT010.150 | 7652.031 | 2645.142 | 492.400 | 7.545 | 0.492 | 1.086 | 0.028 | 3945 | 12400 |
FUT010.400 | 6498.406 | 2534.774 | 388.800 | 4.638 | 0.389 | 1.570 | 0.025 | 1645 | 9707 |
FUT012.000 | 1973.394 | 1416.910 | 237.500 | 7.367 | 0.237 | 3.208 | 0.083 | 878 | 4863 |
PU

Method | Rw | RwSTD | Rc | RcSTD | RcPer | i mean | i STD | RwMin | RwMax
---|---|---|---|---|---|---|---|---|---
Baseline | -56.858 | 10.768 | 150.000 | 0.000 | 1.000 | 0.000 | 0.000 | -75.277 | -49.277 |
NSKIP1 | -79.493 | 9.199 | 75.000 | 0.000 | 0.500 | 0.987 | 0.000 | -96.068 | -68.296 |
NSKIP2 | -85.368 | 3.235 | 50.000 | 0.000 | 0.333 | 1.960 | 0.000 | -89.991 | -79.710 |
NSKIP3 | -87.983 | 3.633 | 38.000 | 0.000 | 0.253 | 2.921 | 0.000 | -96.177 | -83.838 |
NSKIP4 | -86.773 | 3.526 | 30.000 | 0.000 | 0.200 | 3.867 | 0.000 | -94.565 | -83.384 |
NSKIP5 | -89.847 | 5.745 | 25.000 | 0.000 | 0.167 | 4.800 | 0.000 | -96.857 | -83.023 |
NSKIP6 | -89.504 | 3.974 | 22.000 | 0.000 | 0.147 | 5.727 | 0.000 | -94.531 | -84.423 |
NSKIP7 | -93.925 | 3.331 | 19.000 | 0.000 | 0.127 | 6.632 | 0.000 | -99.409 | -87.872 |
NSKIP8 | -97.002 | 2.200 | 17.000 | 0.000 | 0.113 | 7.529 | 0.000 | -100.367 | -93.706 |
NSKIP9 | -97.425 | 3.086 | 15.000 | 0.000 | 0.100 | 8.400 | 0.000 | -101.036 | -91.792 |
FSA0.15 | -71.232 | 14.667 | 142.700 | 3.433 | 0.951 | 0.852 | 0.066 | -86.917 | -50.196 |
FSA0.25 | -71.606 | 12.279 | 138.900 | 4.748 | 0.920 | 0.916 | 0.065 | -86.514 | -50.309 |
FSA0.35 | -68.623 | 15.260 | 131.700 | 10.371 | 0.873 | 0.990 | 0.057 | -89.602 | -50.198 |
FSA0.50 | -78.970 | 16.020 | 116.700 | 12.859 | 0.773 | 1.199 | 0.096 | -100.741 | -51.347 |
FSA0.90 | -106.759 | 1.643 | 7.100 | 0.316 | 0.047 | 18.775 | 0.164 | -109.108 | -104.638 |
CB0.50 | -59.227 | 12.197 | 149.800 | 0.422 | 0.993 | 0.100 | 0.211 | -87.747 | -50.568 |
CB0.90 | -67.436 | 15.815 | 133.400 | 6.150 | 0.887 | 1.142 | 0.130 | -87.785 | -49.913 |
CB1.00 | -76.829 | 16.610 | 128.700 | 8.138 | 0.853 | 1.242 | 0.212 | -99.612 | -49.149 |
CB1.75 | -90.519 | 6.520 | 65.600 | 8.540 | 0.433 | 3.049 | 0.625 | -103.549 | -83.159 |
BICHO100.05 | -59.871 | 9.853 | 139.100 | 4.332 | 0.927 | 0.899 | 0.061 | -73.980 | -49.726 |
BICHO100.35 | -72.783 | 15.044 | 81.700 | 3.889 | 0.545 | 1.079 | 0.467 | -90.490 | -50.358 |
BICHO100.50 | -75.205 | 10.421 | 75.900 | 8.517 | 0.506 | 1.192 | 0.894 | -89.100 | -53.112 |
BICHO102.00 | -75.101 | 14.264 | 39.400 | 9.663 | 0.263 | 2.838 | 1.563 | -90.866 | -54.450 |
BICHO103.00 | -76.374 | 13.973 | 30.800 | 1.317 | 0.205 | 3.471 | 0.422 | -91.605 | -50.523 |
BICHO108.00 | -78.874 | 13.931 | 23.600 | 2.547 | 0.157 | 5.076 | 0.582 | -93.285 | -54.045 |
BICHO1016.0 | -85.723 | 13.360 | 17.700 | 1.636 | 0.118 | 7.002 | 0.749 | -100.344 | -51.527 |
BICHO1032.0 | -89.359 | 11.471 | 13.200 | 2.394 | 0.088 | 9.805 | 0.474 | -108.031 | -66.529 |
FUT010.025 | -56.858 | 10.768 | 150.000 | 0.000 | 1.000 | 0.000 | 0.000 | -75.277 | -49.277
FUT010.05 | -54.335 | 6.991 | 148.300 | 1.494 | 0.989 | 0.508 | 0.200 | -68.104 | -48.443 |
FUT010.40 | -81.134 | 10.063 | 73.900 | 0.738 | 0.493 | 1.019 | 0.066 | -93.088 | -56.583 |
FUT010.80 | -86.352 | 3.430 | 58.400 | 2.459 | 0.389 | 1.554 | 0.046 | -92.642 | -81.845 |
FUT011.00 | -84.547 | 3.920 | 55.200 | 1.476 | 0.368 | 1.703 | 0.048 | -90.966 | -77.338 |
FUT012.00 | -85.207 | 1.824 | 46.400 | 2.119 | 0.309 | 2.211 | 0.074 | -88.006 | -81.926 |
FUT014.00 | -88.125 | 4.487 | 37.500 | 1.434 | 0.250 | 2.965 | 0.079 | -96.489 | -82.626 |
FUT0116.0 | -90.654 | 5.113 | 22.800 | 0.919 | 0.152 | 5.484 | 0.116 | -97.510 | -83.038 |
FUT0132.0 | -97.671 | 2.087 | 17.600 | 0.516 | 0.117 | 7.282 | 0.178 | -101.307 | -94.572 |
FUT0164.0 | -99.894 | 4.789 | 14.000 | 0.667 | 0.093 | 9.455 | 0.276 | -104.336 | -91.531 |
FUT0164.0 | -105.005 | 1.319 | 8.200 | 0.422 | 0.055 | 16.189 | 0.301 | -106.247 | -102.273 |
RE
Method | Rw | RwSTD | Rc | RcSTD | RcPer | i mean | i STD | RwMin | RwMax
---|---|---|---|---|---|---|---|---|---
Baseline | -45.930 | 0.606 | 150.000 | 0.000 | 1.000 | 0.000 | 0.000 | -46.689 | -45.121 |
NSKIP1 | -46.296 | 0.934 | 75.000 | 0.000 | 0.500 | 0.987 | 0.000 | -47.843 | -45.144 |
NSKIP2 | -47.167 | 0.815 | 50.000 | 0.000 | 0.333 | 1.960 | 0.000 | -48.515 | -46.076 |
NSKIP3 | -48.553 | 1.075 | 38.000 | 0.000 | 0.253 | 2.921 | 0.000 | -49.748 | -46.352 |
NSKIP4 | -49.855 | 0.953 | 30.000 | 0.000 | 0.200 | 3.867 | 0.000 | -51.185 | -47.895 |
NSKIP5 | -50.072 | 1.763 | 25.000 | 0.000 | 0.167 | 4.800 | 0.000 | -53.788 | -47.317 |
NSKIP7 | -51.640 | 2.480 | 19.000 | 0.000 | 0.127 | 6.632 | 0.000 | -54.880 | -47.674 |
NSKIP9 | -54.798 | 1.955 | 15.000 | 0.000 | 0.100 | 8.400 | 0.000 | -57.343 | -51.526 |
FSA0.15 | -45.591 | 1.155 | 134.800 | 2.860 | 0.899 | 0.958 | 0.058 | -48.215 | -44.236 |
FSA0.25 | -45.639 | 0.512 | 124.200 | 4.984 | 0.828 | 1.006 | 0.071 | -46.553 | -45.137 |
FSA0.35 | -45.839 | 1.132 | 111.400 | 2.875 | 0.743 | 1.094 | 0.109 | -47.770 | -44.163 |
FSA0.50 | -46.080 | 1.044 | 100.600 | 2.319 | 0.671 | 1.211 | 0.117 | -47.473 | -44.420 |
FSA0.90 | -57.192 | 2.688 | 8.200 | 0.422 | 0.055 | 16.254 | 0.583 | -63.307 | -53.410 |
CB0.50 | -46.355 | 0.555 | 148.700 | 1.337 | 0.991 | 0.422 | 0.230 | -47.001 | -45.323 |
CB0.90 | -45.709 | 1.114 | 107.900 | 3.635 | 0.719 | 1.465 | 0.449 | -47.032 | -43.904 |
CB1.00 | -46.592 | 1.073 | 86.500 | 8.303 | 0.577 | 2.011 | 0.582 | -48.847 | -45.097 |
CB1.75 | -51.375 | 1.911 | 22.000 | 3.859 | 0.147 | 8.066 | 1.230 | -55.623 | -49.221 |
BICHO230.100 | -45.586 | 1.053 | 122.800 | 4.517 | 0.819 | 0.964 | 0.013 | -47.478 | -44.001 |
BICHO230.125 | -45.972 | 0.939 | 115.100 | 4.725 | 0.767 | 0.972 | 0.011 | -47.803 | -44.609 |
BICHO23256 | -47.043 | 0.898 | 72.100 | 1.912 | 0.481 | 1.072 | 0.360 | -48.583 | -45.539 |
BICHO23576 | -53.622 | 4.640 | 56.300 | 8.795 | 0.375 | 1.718 | 1.546 | -62.853 | -46.846 |
BICHO23768 | -58.677 | 7.794 | 49.900 | 12.142 | 0.333 | 2.154 | 1.811 | -68.808 | -45.719 |
BICHO231024 | -58.765 | 7.331 | 46.900 | 8.034 | 0.313 | 2.272 | 1.153 | -74.104 | -50.834 |
FUT010.025 | -46.003 | 1.062 | 150.000 | 0.000 | 1.000 | 0.000 | 0.000 | -47.128 | -43.905 |
FUT010.050 | -45.394 | 0.577 | 144.300 | 2.541 | 0.962 | 0.822 | 0.061 | -46.137 | -44.614 |
FUT010.200 | -46.554 | 1.034 | 75.200 | 0.422 | 0.501 | 0.991 | 0.024 | -48.400 | -45.080 |
FUT011.000 | -47.508 | 1.035 | 52.300 | 1.418 | 0.349 | 1.851 | 0.058 | -48.661 | -45.099 |
FUT018.000 | -49.804 | 1.872 | 26.600 | 1.075 | 0.177 | 4.549 | 0.131 | -53.130 | -46.922 |
FUT0132.00 | -53.242 | 1.743 | 17.400 | 0.699 | 0.116 | 7.325 | 0.346 | -55.674 | -50.721 |
FUT0164.00 | -54.556 | 1.604 | 13.700 | 0.823 | 0.091 | 9.669 | 0.420 | -57.550 | -52.503 |
FUT01256.0 | -56.823 | 3.401 | 9.400 | 0.516 | 0.063 | 14.499 | 0.485 | -63.382 | -52.324 |
Appendix E EX-2: Online dynamics update
Figures 6, 7 and 8 show the performance of different hyperparameters while training the dynamics model with skipping in CP, PU, RE and HC. Table 6 reports the numerical results for each environment. CP was trained for 60 episodes, RE and PU for 150, and HC for 400. Rw is the average over 3 experiments of the maximum reward seen so far; RcPerMax is the percentage of replanning steps at the point the algorithm reached the maximum Rw; EpNMax is the number of episodes needed to reach Rw; and RelEp#Max is the wall time relative to the baseline when the algorithm reached the maximum Rw.
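The aggregation behind these columns can be sketched as follows. This is a hypothetical helper (`summarize_runs` and its argument names are not from the paper): for each seed it locates the episode with the maximum reward, then averages over seeds. The `RelEp#Max` line is only a rough proxy, scaling elapsed episodes by the replanning fraction, since the baseline replans at every step; the paper measures actual wall time.

```python
import numpy as np

def summarize_runs(rewards_per_run, replan_frac_per_run):
    """Aggregate Table-6-style metrics over seeds (hypothetical helper).

    rewards_per_run: one array of per-episode rewards per seed.
    replan_frac_per_run: matching arrays holding the fraction of control
        steps that triggered replanning in each episode.
    Returns (Rw, RcPerMax, EpNMax, RelEp#Max-style estimate).
    """
    rw, rcper, ep_n = [], [], []
    for rewards, frac in zip(rewards_per_run, replan_frac_per_run):
        best = int(np.argmax(rewards))   # episode where the max reward occurs
        rw.append(rewards[best])         # contributes to Rw
        rcper.append(frac[best])         # contributes to RcPerMax
        ep_n.append(best)                # contributes to EpNMax
    rw_mean = float(np.mean(rw))
    rcper_mean = float(np.mean(rcper))
    ep_n_mean = float(np.mean(ep_n))
    # Rough proxy for relative wall time: episodes elapsed scaled by the
    # fraction of steps that still replan (the baseline replans every step).
    rel_ep = (ep_n_mean + 1.0) * rcper_mean
    return rw_mean, rcper_mean, ep_n_mean, rel_ep
```

For a baseline-like run (replanning at every step), the proxy reduces to the episode count itself, matching the pattern of the Baseline rows in Table 6.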
CP
Method | Rw | RcPerMax | EpNMax | RelEp#Max
---|---|---|---|---
Baseline | 181.6799 | 1.0000 | 57 | 58.00 |
NSKIP3 | 180.7760 | 0.3350 | 59 | 20.10 |
NSKIP4 | 180.6540 | 0.2500 | 58 | 14.75 |
BICHO1064 | 180.8982 | 0.1475 | 58 | 6.72 |
FUT014 | 180.1486 | 0.2567 | 47 | 10.53 |
PU
Method | Rw | RcPerMax | EpNMax | RelEp#Max
---|---|---|---|---
Baseline | -47.9417 | 1.0000 | 142 | 143.00 |
NSKIP2 | -52.0129 | 0.5000 | 143 | 72.00 |
NSKIP3 | -57.3462 | 0.3333 | 147 | 49.33 |
BICHO102 | -49.0292 | 0.1733 | 136 | 24.61 |
FUT012 | -52.8447 | 0.4956 | 135 | 67.31 |
RE
Method | Rw | RcPerMax | EpNMax | RelEp#Max
---|---|---|---|---
Baseline | -33.5717 | 1.0000 | 70 | 71.00 |
NSKIP2 | -34.4023 | 0.5000 | 82 | 41.50 |
NSKIP3 | -34.2190 | 0.3333 | 72 | 24.33 |
BICHO10512 | -37.7357 | 0.4822 | 95 | 28.08 |
FUT0116 | -35.7938 | 0.1333 | 91 | 12.64 |
HC
Method | Rw | RcPerMax | EpNMax | RelEp#Max
---|---|---|---|---
Baseline | 22491.9876 | 1.0000 | 372 | 373.00 |
NSKIP2 | 6266.1845 | 0.5000 | 77 | 39.00 |
BICHO05=0.05 | 18323.8463 | 0.8305 | 384 | 310.31 |
BICHO10=0.05 | 20787.0499 | 0.8640 | 294 | 254.11 |
FUT01=0.1 | 12605.5204 | 0.5830 | 259 | 153.11 |