Focused Real-Time Dynamic Programming for MDPs: Squeezing More Out of a Heuristic
Robotics Institute, Carnegie Mellon University
Real-time dynamic programming (RTDP) is a heuristic search algorithm for solving MDPs. We present a modified algorithm called Focused RTDP with several improvements. While RTDP maintains only an upper bound on the long-term reward function, FRTDP maintains two-sided bounds and bases the output policy on the lower bound. FRTDP guides search with a new rule for outcome selection, focusing on parts of the search graph that contribute most to uncertainty about the values of good policies. FRTDP has modified trial termination criteria that should allow it to solve some problems (within ε) that RTDP cannot. Experiments show that for all the problems we studied, FRTDP significantly outperforms RTDP and LRTDP, and converges with up to six times fewer backups than the state-of-the-art HDP algorithm.
Markov decision processes (MDPs) are planning problems in which an agent's actions have uncertain outcomes, but the state of the world is fully observable. This paper studies techniques for speeding up MDP planning by leveraging heuristic information about the value function (the expected long-term reward available from each state). In many domains, one can quickly calculate upper and lower bounds on the value function. As with the A∗ algorithm in a deterministic setting, admissible bounds can be used to prune much of the search space while still guaranteeing optimality of the resulting policy.

Real-time dynamic programming (RTDP) is a well-known MDP heuristic search algorithm (Barto, Bradtke, & Singh 1995). Each RTDP trial begins at the initial state of the MDP and explores forward, choosing actions greedily and choosing outcomes stochastically according to their transition probabilities.

This paper introduces the Focused RTDP algorithm, which is designed to both converge faster than RTDP and solve a broader class of problems. Whereas RTDP keeps only an upper bound on the long-term reward function, FRTDP keeps two-sided bounds and bases its output policy on the lower bound (Goodwin 1996), significantly improving anytime solution quality and performance guarantees.

FRTDP guides outcome selection by maintaining a priority value at each node that estimates the benefit of directing search to that node. Priority-based outcome selection both focuses sampling on the most relevant parts of the search graph and allows FRTDP to avoid nodes that have already converged.

FRTDP has modified trial termination criteria that should allow it to solve some problems that RTDP cannot. RTDP is known to solve a class of non-pathological stochastic shortest path problems (Barto, Bradtke, & Singh 1995). We conjecture that FRTDP additionally solves (within ε) a broader class of problems in which the state set may be infinite and the goal may not be reachable from every state.

Relative to existing algorithms, FRTDP is intended to be more robust, converge faster, and have better anytime solution quality before convergence is reached. Experimentally, FRTDP significantly outperformed several other algorithms.

Action selection based on the lower bound is a frequently occurring idea in decision-theoretic search. It was used in (Goodwin 1996) and probably earlier, and applied to MDPs in (McMahan, Likhachev, & Gordon 2005).

LRTDP (Bonet & Geffner 2003b) and HDP (Bonet & Geffner 2003a) are RTDP-derived algorithms that similarly use the idea of avoiding updates to irrelevant states. However, their trials are not restricted to a single path through the search graph, and they do not explicitly select outcomes. Irrelevant states are avoided through modified trial termination criteria.

HSVI (Smith & Simmons 2004) and BRTDP (McMahan, Likhachev, & Gordon 2005) both include the idea of outcome selection, but they prioritize internal nodes by uncertainty rather than by the FRTDP concept of priority. We conjecture that FRTDP priorities will lead to better performance than uncertainty values because they better reflect the single-path trial structure of RTDP. Unfortunately, we do not have a performance comparison: HSVI is designed for POMDPs, and we did not compare with BRTDP because we did not become aware of it until just before publication.

LAO∗ (Hansen & Zilberstein 2001) is another heuristic search algorithm, but with a control structure unlike RTDP's deep single-path trials. We did not compare with LAO∗ because it was dominated by LRTDP in an earlier study with similar problems (Bonet & Geffner 2003b).

IPS and PPI (McMahan & Gordon 2005) also outperform LRTDP on racetrack problems. We did not compare with these algorithms because they explore backward from a goal state rather than forward from a start state. The distinction is not crucial for the racetrack problem, but forward exploration is required for multiple-reward problems where the set of goal states is ill-defined, and for problems where the set of possible predecessors of a state is not finite (as when POMDPs are formulated as belief-state MDPs).
An MDP models a planning problem in which action outcomes are uncertain but the world state is fully observable. The agent is assumed to know a probability distribution for action outcomes. This paper studies discrete infinite-horizon stationary MDPs, formally described by a set of states S, a finite set of actions A, a successor function C providing the set of states C_a(s) ⊆ S that could result from an action, transition probabilities T^a(s_i, s_j) = Pr(s_j | s_i, a), a real-valued reward function R(s, a), a discount factor γ ≤ 1, an initial state s_0, and a (possibly empty) set of absorbing goal states G ⊆ S. Taking any action in a goal state causes a zero-reward self-transition with probability 1.

The object of the planning problem is to generate a stationary policy π that maximizes expected long-term reward:

J^π(s_0) = E[ Σ_{t=0}^∞ γ^t R(s_t, π(s_t)) ].   (1)

Our algorithms generate an approximately optimal policy π̂.

An MDP can be solved by approximating its optimal value function V* = J^{π*}. Any value function V induces a greedy policy π_V in which actions are selected via one-step lookahead. The policy induced by V* is optimal.

Value iteration (VI) is a widely used dynamic programming technique in which one solves for V* using the fact that it is the unique fixed point of the Bellman update:

V(s) ← max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ C_a(s)} T^a(s, s') V(s') ].   (2)

VI algorithms start with an initial guess for the right-hand side of (2) and repeatedly update so that V gets closer to V*. In classical synchronous value iteration, at each step the Bellman update is applied over all states simultaneously. Conversely, asynchronous algorithms update states one at a time. By updating the most relevant states more often and carefully choosing the right update ordering, they often converge much faster.

The heuristic search algorithms we consider are asynchronous VI algorithms that can use additional a priori information in the form of admissible bounds h_L and h_U on the optimal value function, satisfying h_L ≤ V* ≤ h_U. For goal states s ∈ G, we enforce that h_L(s) = h_U(s) = 0. h_L and h_U help the algorithm choose which states to update asynchronously.
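For concreteness, the sketch below shows the point-based form of the Bellman update (2) applied at a single state to both an upper and a lower bound, as the algorithms discussed later require. The dictionary-based MDP encoding (R, T, succ keyed by state and action) is our own illustration, not a data structure specified in the paper.

```python
# Sketch of a point-based Bellman backup over two-sided bounds (our own
# illustration of Eq. 2; the paper's implementation is not shown).

def q_value(V, R, T, succ, s, a, gamma):
    """One-step lookahead value of action a in state s under value function V."""
    return R[(s, a)] + gamma * sum(T[(s, a, s2)] * V[s2] for s2 in succ[(s, a)])

def backup(VU, VL, R, T, succ, actions, s, gamma):
    """Apply the Bellman update at state s to both bounds.

    Returns the greedy action under the upper bound and the change in VU(s).
    """
    q_upper = {a: q_value(VU, R, T, succ, s, a, gamma) for a in actions}
    a_star = max(q_upper, key=q_upper.get)        # greedy w.r.t. the upper bound
    change = abs(VU[s] - q_upper[a_star])         # how much the bound moved
    VU[s] = q_upper[a_star]
    VL[s] = max(q_value(VL, R, T, succ, s, a, gamma) for a in actions)
    return a_star, change
```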
Every MDP has a corresponding AND/OR search graph. Nodes of the graph are states of the MDP. Each state/action pair of the MDP is represented with a k-connector in the search graph, connecting a state s to its possible successors C_a(s). Individual edges of each k-connector are annotated with transition probabilities.

The explicit graph is a data structure containing a subset of the MDP search graph. Heuristic search algorithms can often reach an optimal solution while only examining a tiny fraction of the states in the search graph, in which case generating node and link data structures for the unexamined states would be wasteful. Instead, one usually provides the algorithm with an initial explicit graph containing s_0 and callbacks for extending the explicit graph as needed.

Expanding a node means generating its outgoing k-connectors and successor nodes and adding them to the explicit graph. Explicit graph nodes are either internal nodes, which have already been expanded, or fringe nodes, which have not. Let I denote the set of internal nodes and F the set of fringe nodes.

In this framework, one can ask: given the information embodied in a particular explicit graph, what inferences can be drawn about the quality of different policies? And which fringe nodes should be expanded in order to improve those quality estimates?

A useful statistic for characterizing a policy π is its occupancy W^π(s) for each state s. If we consider the distribution of possible execution traces for π and interpret the discount γ in terms of trace termination (i.e., execution terminates at any given time step with probability 1 − γ), then W^π(s) is the expected number of time steps per execution that π spends in state s before passing beyond the fringe. Formally, occupancy is defined as the solution to the following simultaneous equations (s' ∈ I ∪ F):

W^π(s') = W_0(s') + γ Σ_{s ∈ I} T^{π(s)}(s, s') W^π(s),   (3)

where W_0(s') is 1 if s' = s_0 and 0 otherwise.¹

¹ If γ = 1 and π has loops, the occupancy of some states may diverge. However, it converges for the problems and policies that we consider.

The occupancy at each fringe node indicates its relevance to the policy. In particular, the quality J^π(s_0) of a policy can be split into the contributions of internal and fringe nodes:

J^π(s_0) = Σ_{s ∈ I} W^π(s) R(s, π(s)) + Σ_{s ∈ F} W^π(s) J^π(s).   (4)

The sum over I is the expected reward from executing π up to the point where it reaches a fringe node. Given an explicit graph, the sum over I can be calculated, but J^π(s) for fringe nodes s cannot, because it depends on information in the unexplored part of the search graph.

One way to estimate V*(s_0) is by keeping bound functions V^L ≤ V* ≤ V^U and choosing fringe nodes that help bring the gap between them at s_0 close to zero. Replacing J^π(s) in (4) with the heuristic bounds h_L(s) and h_U(s) at the fringe nodes, we have brought in the only available information about the value at fringe nodes, i.e., their heuristic values. The resulting expression makes clear each fringe node's contribution to the uncertainty at s_0. The best possible result of expanding a fringe node s is to decrease its local uncertainty to 0, reducing the uncertainty at s_0 by at most W^{π*}(s)|h_U(s) − h_L(s)|, "the occupancy times the uncertainty". Later, we discuss how to approximate this upper bound on uncertainty reduction and use it to guide fringe node expansion.
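As an illustration of how occupancy weights the fringe, the sketch below estimates W^π for a fixed policy over an explicit graph by iterating equation (3), then scores each fringe node by W^π(s)(h_U(s) − h_L(s)). The fixed-point iteration, the set-based graph encoding, and all function names are our own simplifications, not part of the paper's algorithms.

```python
def occupancy(internal, fringe, policy, T, succ, s0, gamma, iters=1000):
    """Estimate W^pi by iterating the simultaneous equations (3).
    internal and fringe are sets of states; policy maps internal states to actions."""
    W = {s: 0.0 for s in internal | fringe}
    W[s0] = 1.0
    for _ in range(iters):
        W_new = {s: (1.0 if s == s0 else 0.0) for s in W}
        for s in internal:                      # only internal nodes push mass forward
            a = policy[s]
            for s2 in succ[(s, a)]:
                W_new[s2] += gamma * T[(s, a, s2)] * W[s]
        W = W_new
    return W

def most_useful_fringe_node(internal, fringe, policy, T, succ, s0, gamma, hL, hU):
    """Rank fringe nodes by occupancy times uncertainty, the bound on how much
    expanding each one could reduce the uncertainty at s0."""
    W = occupancy(internal, fringe, policy, T, succ, s0, gamma)
    return max(fringe, key=lambda s: W[s] * (hU[s] - hL[s]))
```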
RTDP is an asynchronous VI algorithm that works by repeated trials; each trial starts at s_0 and explores forward in the search graph. At each forward step, action selection is greedy based on the current value function, and outcome selection is stochastic according to the distribution of possible successor states given the chosen action. When a goal state is reached, RTDP terminates the trial by retracing its steps back to s_0, updating each state along the way. The value function V^U is initialized with an admissible heuristic h_U ≥ V*. Like the A∗ algorithm in the deterministic setting, RTDP often converges without even examining all the states of the problem.
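A minimal sketch of one such trial, assuming the same dictionary-based tabular MDP encoding as the earlier examples and a finite set of goal states; this is our own rendering of the trial structure, not the paper's Algorithm 1.

```python
import random

def greedy_backup(V, R, T, succ, actions, gamma, s):
    """Bellman backup at s; updates V[s] in place and returns the greedy action."""
    q = {a: R[(s, a)] + gamma * sum(T[(s, a, s2)] * V[s2] for s2 in succ[(s, a)])
         for a in actions}
    a_star = max(q, key=q.get)
    V[s] = q[a_star]
    return a_star

def sample_successor(T, succ, s, a):
    """Sample an outcome of action a according to its transition probabilities."""
    states = succ[(s, a)]
    weights = [T[(s, a, s2)] for s2 in states]
    return random.choices(states, weights=weights, k=1)[0]

def rtdp_trial(V, R, T, succ, actions, gamma, s0, goals):
    """One RTDP trial: explore forward, choosing actions greedily w.r.t. the
    current upper bound V and outcomes stochastically, until a goal state is
    reached; then retrace the path back to s0, updating each state again."""
    path, s = [], s0
    while s not in goals:
        a_star = greedy_backup(V, R, T, succ, actions, gamma, s)
        path.append(s)
        s = sample_successor(T, succ, s, a_star)
    for s in reversed(path):
        greedy_backup(V, R, T, succ, actions, gamma, s)
```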
Focused RTDP is derived from RTDP. As in RTDP, FRTDP execution proceeds in trials that begin at s_0 and explore forward through the search graph, selecting actions greedily according to the upper bound, then terminating and performing updates on the way back to s_0. Unlike RTDP, FRTDP maintains a lower bound and uses modified rules for outcome selection and trial termination.

RTDP keeps an upper bound V^U that is initialized with an admissible heuristic h_U ≥ V*, and its output policy is the greedy policy induced by V^U. In contrast, FRTDP keeps two-sided bounds V^L ≤ V* ≤ V^U and outputs the greedy policy induced by V^L.

There are two main benefits to keeping a lower bound. First, if h_L is uniformly improvable², the greedy policy induced by V^L has value at least as good as V^L(s_0); in other words, one can interrupt the algorithm at any time and get a policy with a performance guarantee. The second benefit is that empirically, up to the point where V^L and V^U converge, policies derived from V^L tend to perform better. Policies derived from the upper bound are often "get rich quick" schemes that seem good only because they have not been thoroughly evaluated. The obvious drawback to keeping a lower bound is that updating it increases the cost of each backup. In practice, we observe that adding lower bound calculation to the HDP algorithm increases wallclock time to convergence by about 10%, but with substantial benefits in anytime solution quality.

² Applying the Bellman update to a uniformly improvable function brings the function everywhere closer to V* (Zhang & Zhang 2001).

In summary, FRTDP maintains a lower bound and outputs the greedy policy induced by the lower bound. It also uses the lower bound during its priority calculation for outcome selection, described below.

FRTDP allows the user to specify a regret bound ε. If there is an envelope of nodes s that all satisfy |V^U(s) − V^L(s)| ≤ ε, then FRTDP algorithm termination can be achieved without expanding any more fringe nodes. And it is perhaps easier to achieve a condition where |V^U(s) − V^L(s)| ≤ ε/2 for a majority of nodes in the envelope, with uncertainties not too large at the rest. Thus, FRTDP can safely terminate a trial when it reaches a state whose uncertainty is very small. We define the excess uncertainty of a state ∆(s) = |V^U(s) − V^L(s)| − ε/2 and terminate any trial that reaches a state with ∆(s) ≤ 0. (This is one of two trial termination criteria; the second is described below.)
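A small sketch of the bookkeeping this implies: each explicit-graph node carries both bounds and a priority, and a trial stops as soon as excess uncertainty is non-positive. The Node class and field names are our own; they mirror the lazy initialization that the paper's pseudocode performs the first time a state is touched, (s.L, s.U) ← (h_L(s), h_U(s)) and s.prio ← ∆(s).

```python
def excess_uncertainty(L, U, eps):
    """Delta(s) = |V^U(s) - V^L(s)| - eps/2."""
    return abs(U - L) - eps / 2.0

class Node:
    """Per-state record in the explicit graph (our own layout)."""
    def __init__(self, hL, hU, eps):
        self.L = hL                                   # lower bound V^L(s)
        self.U = hU                                   # upper bound V^U(s)
        self.prio = excess_uncertainty(hL, hU, eps)   # fringe priority starts at Delta(s)

def should_terminate_trial(node, eps):
    """Excess-uncertainty termination: stop the trial once Delta(s) <= 0."""
    return excess_uncertainty(node.L, node.U, eps) <= 0.0
```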
Whereas RTDP chooses an action outcome stochastically, FRTDP outcome selection attempts to maximize the improvement in the quality estimate of the greedy policy π_U by expanding the fringe node s with the largest contribution to the uncertainty of that estimate. FRTDP outcome selection is designed to choose the fringe node s that maximizes W^{π_U}(s)∆(s) (occupancy times excess uncertainty). But due to computational constraints, it prioritizes nodes via an approximation scheme that only guarantees reaching the best fringe node in certain special cases. FRTDP recursively calculates a priority p(s) for each node s, such that choosing the successor state with the highest priority at each step causes the trial to arrive at the maximizing fringe node. The recursive update formula is

p(s) = max_{s' ∈ C_{a*}(s)} γ T^{a*}(s, s') p(s'),

with p(s) = ∆(s) at fringe nodes, where the action a* is chosen greedily according to the upper bound. p(s) is recalculated along with V^U(s) and V^L(s) at each update of s.

The priority update rule is guaranteed to lead FRTDP to the best fringe node only in the case that the search graph is a tree. In a general graph, there are two confounding factors that violate the guarantee. First, W^π(s) is the expected amount of time π spends in s, adding up all possible paths from s_0 to s. Maximizing p(s) at each step effectively prioritizes fringe nodes according to their maximum occupancy along the single most likely path (in a tree there is only one path). Second, since after trial termination FRTDP performs updates back to s_0 along only one path instead of along all paths, priorities at internal nodes can be inconsistent with the priorities of their descendants. In practice we find that, despite multi-path violations of the assumptions on which the priority is based, choosing outcomes by priority is better than choosing them stochastically. There may also be more accurate priority update schemes that mitigate multi-path error; the current scheme was chosen to keep overhead small and retain the trial-based structure of RTDP.

With the excess uncertainty trial termination alone, FRTDP is a usable search algorithm. However, as with RTDP, poor outcome selection early in a trial could lead into a quagmire of irrelevant states that takes a long time to escape. FRTDP's adaptive maximum depth (AMD) trial termination criterion mitigates this problem by cutting off long trials. FRTDP maintains a current maximum depth D. A trial is terminated if it reaches depth d ≥ D. FRTDP initializes D to a small value D_0, and increases it for subsequent trials. The idea is to avoid over-committing to long trials early on, but retain the ability to go deeper in later trials, in case there are relevant states deeper in the search graph.

FRTDP performance for any particular problem depends on how D is adjusted, so it is important that whatever technique is used be relatively robust across problems without manual parameter tuning. We chose to adjust D adaptively, using trial statistics as feedback. After each trial, FRTDP chooses whether to keep the current value of D or increase it to k_D D.

The feedback mechanism is fairly ad hoc. Each update in a trial is given an update quality score q = δW that is intended to reflect how useful the update was. δ measures how much the update changed the upper bound value V^U(s). W is a single-path estimate of the occupancy of the state being updated under the current greedy policy. After each trial, D is increased if the average update quality near the end of the trial (d > D/k_D) is at least as good as the average update quality in the earlier part of the trial.

The racetrack problems used in our experiments were designed to be solved by RTDP, so it is no surprise that they are particularly benign and suitable for deep trials. In the racetrack domain, AMD termination hurts performance slightly overall, as early trials are less efficient before D grows large. However, AMD improves performance on some more challenging problems (not reported here). For all the listed results, we used AMD with D_0 = 10 and k_D = 1.1.
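Putting these pieces together, the following is a self-contained sketch of the FRTDP control flow under the same dictionary-based MDP encoding as the earlier examples: greedy action selection on the upper bound, priority-guided outcome selection, excess-uncertainty and adaptive-maximum-depth trial termination, and the k_D feedback rule. To stay short it recomputes a one-step priority γ T^{a*}(s, s')∆(s') at selection time instead of caching p(s) on every node, so it illustrates the control structure rather than reproducing the paper's Algorithm 2.

```python
def frtdp(R, T, succ, actions, gamma, hL, hU, s0, goals, eps,
          D0=10, kD=1.1, max_trials=10000):
    L, U = {}, {}                                # two-sided bounds V^L <= V* <= V^U

    def touch(s):                                # lazy initialization from the heuristics
        if s not in U:
            L[s], U[s] = (0.0, 0.0) if s in goals else (hL(s), hU(s))

    def delta(s):                                # excess uncertainty Delta(s)
        touch(s)
        return abs(U[s] - L[s]) - eps / 2.0

    def backup(s):                               # Bellman update of both bounds at s
        for a in actions:
            for t in succ[(s, a)]:
                touch(t)
        qU = {a: R[(s, a)] + gamma * sum(T[(s, a, t)] * U[t] for t in succ[(s, a)])
              for a in actions}
        a_star = max(qU, key=qU.get)             # greedy action w.r.t. the upper bound
        d = abs(U[s] - qU[a_star])               # how much the upper bound moved (delta)
        U[s] = qU[a_star]
        L[s] = max(R[(s, a)] + gamma * sum(T[(s, a, t)] * L[t] for t in succ[(s, a)])
                   for a in actions)
        return a_star, d

    D = float(D0)
    for _ in range(max_trials):
        touch(s0)
        if U[s0] - L[s0] <= eps:                 # converged to within the regret bound
            break
        path, s, depth = [], s0, 0
        q_early = n_early = q_late = n_late = 0
        W = 1.0                                  # single-path occupancy estimate
        while s not in goals and delta(s) > 0.0 and depth < D:
            a_star, d = backup(s)
            if depth > D / kD:                   # statistics for the AMD feedback rule
                q_late, n_late = q_late + d * W, n_late + 1
            else:
                q_early, n_early = q_early + d * W, n_early + 1
            path.append(s)
            # outcome selection: successor with the highest gamma * T * Delta score
            nxt = max(succ[(s, a_star)],
                      key=lambda t: gamma * T[(s, a_star, t)] * delta(t))
            W *= gamma * T[(s, a_star, nxt)]
            s, depth = nxt, depth + 1
        for s in reversed(path):                 # update back along the single trial path
            backup(s)
        if n_late and n_early and q_late / n_late >= q_early / n_early:
            D *= kD                              # deepen subsequent trials
    return L, U                                  # act greedily w.r.t. the lower bound L
```

In a faithful implementation, p(s) would be stored on each node and refreshed during the same backup that updates V^U(s) and V^L(s), as described above.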
Stochastic shortest path problems (SSPs) are MDPs that satisfy additional restrictions (Bertsekas & Tsitsiklis 1996):

S1. All rewards are strictly negative.

S2. The set of states is finite.

S3. There exists at least one proper policy, that is, a policy that reaches G from any state with probability 1.

RTDP's convergence guarantee relies on one additional restriction:

R1. All policies that are improper must incur infinite cost for at least one state.

Under conditions S1-S3 and R1, RTDP's V^U value function is guaranteed (with probability 1) to converge to V* over the set of relevant states, i.e., states that can be reached by at least one optimal policy (Barto, Bradtke, & Singh 1995). Unsurprisingly, there is a corresponding result for FRTDP.

Theorem 1. Under conditions S1-S3 and R1, and setting ε = 0, FRTDP's V^L and V^U bounds are guaranteed to converge to V* over the set of relevant states.

We conjecture that FRTDP can also approximately solve a broader class of SSP-like problems that satisfy:

F1. There are global reward bounds R_L and R_U such that for every (s, a) pair, R_L ≤ R(s, a) ≤ R_U < 0.

F2. There exists at least one policy that (a) reaches G starting from s_0 with probability 1, and (b) has positive occupancy for only a finite number of states.

Conjecture 2. Under conditions F1-F3, FRTDP is guaranteed to terminate with an output policy whose regret is at most ε.

FRTDP should terminate under these weaker conditions because (unlike RTDP) each trial is capped at a finite maximum depth D; thus poor decisions early in a trial can always be reconsidered in the next trial. The mechanism for adjusting D should not affect the termination guarantee, as long as the sequence of cap values increases without bound.

The weaker conditions allow FRTDP to solve problems with an infinite state set; for example, POMDPs formulated as belief-space MDPs naturally have an infinite state set, and FRTDP can still solve them if they satisfy F1-F3. FRTDP can also solve problems with "one-way doors", in which poor early action choices lead to states from which the goal is unreachable, as long as there is a policy guaranteed to reach the goal starting from s_0.
We evaluated the performance of FRTDP on problems in the popular racetrack benchmark domain from (Barto, Bradtke, & Singh 1995). States of racetrack are integer vectors (x, y, ẋ, ẏ) that represent the discrete position and speed of the car in a 2D grid. The actions available to the car are integer accelerations, with each component drawn from {−1, 0, 1}. The car starts in one of a set of possible start states. The goal is to maneuver the car into one of a set of goal states. Some cells in the grid are marked as obstacles; if the car's path intersects one of these cells, it is reset back to one of the start states with zero velocity. Uncertainty in this problem comes from "skidding": each time the agent takes an acceleration action, with probability p the car skids and the commanded acceleration is ignored.
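For concreteness, the sketch below writes out the successor distribution this description implies for a single acceleration action, assuming the standard racetrack skid model in which a skid simply drops the commanded acceleration; the collision test is left as a caller-supplied predicate since the map representation is not specified here, and all names are our own.

```python
def racetrack_outcomes(state, accel, p_skid, start_states, crosses_obstacle):
    """Successor distribution for one acceleration action in the racetrack domain.

    state        = (x, y, vx, vy) integer position and velocity
    accel        = (ax, ay), each component in {-1, 0, 1}
    start_states = iterable of (x, y) start cells
    Returns a list of (probability, next_state) pairs."""
    x, y, vx, vy = state
    outcomes = []
    for prob, (ax, ay) in [(1.0 - p_skid, accel), (p_skid, (0, 0))]:  # skid drops the accel
        nvx, nvy = vx + ax, vy + ay
        nx, ny = x + nvx, y + nvy
        if crosses_obstacle((x, y), (nx, ny)):
            # Hitting an obstacle resets the car to a start state with zero velocity.
            for sx, sy in start_states:
                outcomes.append((prob / len(start_states), (sx, sy, 0, 0)))
        else:
            outcomes.append((prob, (nx, ny, nvx, nvy)))
    return outcomes
```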
Because FRTDP focuses on outcome selection, we also wanted to study increasing the amount of uncertainty in the problem. We did this in two ways. First, we examined performance with p = 0.1 (the standard case) and p = 0.3 (increased chance of skidding, marked by adding a suffix of -3 to the problem name). Second, we tried increasing the number of possible outcomes from an error. We call this the "wind" variant (marked by adding a suffix of -w). In the wind variant, with probability p = 0.1 an additional acceleration is added to the commanded acceleration. The additional acceleration is drawn from a uniform distribution over 8 possible values: {(−1, −1), (−1, 0), (−1, 1), (0, −1), (0, 1), (1, −1), (1, 0), (1, 1)}. The idea is that instead of skidding, a "gust of wind" provides additional acceleration.

We selected two racetrack problems whose maps have been published: large-b from (Barto, Bradtke, & Singh 1995) and large-ring from (Bonet & Geffner 2003b). With three versions of each problem, our results cover six problems in all.

We selected three heuristic search asynchronous VI algorithms to compare with FRTDP: RTDP, LRTDP, and HDP. In addition, we implemented a modified version of HDP that maintains a lower bound and uses that as the basis for its output policy. We call this algorithm HDP+L.

Following (Bonet & Geffner 2003b), all algorithms were provided with the same admissible upper bound heuristic h_U, calculated by a domain-independent relaxation in which the best possible outcome of any action is assumed to always occur. Formally, the Bellman update is replaced by

V(s) ← max_{a ∈ A} [ R(s, a) + γ max_{s' ∈ C_a(s)} V(s') ].

The time required to calculate h_U is not included in the reported running times for the algorithms.

There is no trivial way to calculate an informative admissible lower bound for a racetrack problem. A formally correct way to handle this, with minor algorithm modifications, is to set h_L(s) = −∞ for all non-goal states s. However, dealing with infinite values would have required some extra bookkeeping, so for convenience we supplied h_L(s) = −1000, which is a gross underestimate of the actual V*(s) values. In principle, the finite lower bound could allow FRTDP to prune some additional low-probability outcomes, but this did not happen in practice. See (McMahan, Likhachev, & Gordon 2005) for discussion of how to efficiently calculate a more informative lower bound.
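One simple way to obtain such a relaxed h_U is to run value iteration with the best-outcome update above until it converges; a minimal sketch, assuming the same dictionary-based encoding as the earlier examples and strictly negative rewards:

```python
def relaxed_upper_bound(states, actions, R, succ, gamma, goals,
                        tol=1e-9, max_iters=100000):
    """Admissible h_U via value iteration on the best-outcome relaxation:
    each action is assumed to reach its most favorable successor, so the
    transition probabilities drop out of the update entirely."""
    V = {s: 0.0 for s in states}      # 0 is an upper bound when all rewards are negative
    for _ in range(max_iters):
        residual = 0.0
        for s in states:
            if s in goals:
                continue              # goal states keep value 0
            new_v = max(R[(s, a)] + gamma * max(V[t] for t in succ[(s, a)])
                        for a in actions)
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v
        if residual < tol:
            break
    return V
```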
Figure 1: Millions of backups before convergence with ε = 10^-3. Each entry gives the number of millions of backups, with the corresponding wallclock time (seconds) in parentheses. The fastest time for each problem is shown in bold.
Fig. 1 reports time to convergence within ε for each (problem, algorithm) pair, measured both as number of backups and wallclock time. Our experiments were run on a 3.2 GHz Pentium-4 processor with 1 GB of RAM. We implemented all the algorithms in C++; they were not thoroughly optimized, so the number of backups required for convergence was measured more reliably than the wallclock time (which could probably be substantially reduced across the board). Included in wallclock time measurements is the time required to check racetrack paths for collisions; collision checking was performed the first time a state was expanded, and the results were cached for subsequent updates.

The observed speedup of FRTDP convergence compared to HDP, measured in terms of number of backups, ranges from 2.9 up to 6.4. Our initial expectation was that FRTDP would show more speedup on the -3 and -w problem variants with more uncertainty; in fact its speedup was about the same on -3 problems and smaller on -w problems. We do not yet understand why this is the case. By construction, HDP and HDP+L have identical convergence properties in terms of the number of backups required. As measured in wallclock time, lower bound updating for HDP+L introduces an additional cost overhead of about 10%.

Fig. 2 reports anytime performance of three of the algorithms (HDP, HDP+L, and FRTDP) on the two problems where FRTDP showed the least convergence time speedup (large-ring-w) and the most speedup (large-ring-3) relative to HDP. The quality (expected reward) of an algorithm's output policy was measured at each epoch by simulating the policy 1000 times, with each execution terminated after 250 time steps if the goal was not reached. Error bars are 2σ confidence intervals. The two algorithms that output policies based on a lower bound (HDP+L and FRTDP) are seen to have significantly better anytime performance. FRTDP reaches a solution quality of −40 with about 40 times fewer backups than HDP. In each plot, the solid line indicating FRTDP solution quality ends at the point where FRTDP reaches convergence.
Figure 2: Anytime performance comparison: solution quality vs. number of backups.
FRTDP improves RTDP by keeping a lower bound and modifying outcome selection and trial termination rules. These modifications allow FRTDP to solve a broader class of problems, and in performance experiments FRTDP provided significant speedup across all problems, requiring up to six times fewer backups than HDP to reach convergence.

We also examined the separate performance impact of using a lower bound by implementing both HDP and HDP+L. This technique can be usefully applied on its own to any RTDP-like algorithm.
References

Barto, A.; Bradtke, S.; and Singh, S. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence.

Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific.

Bonet, B., and Geffner, H. 2003a. Faster heuristic search algorithms for planning with uncertainty and full feedback. In Proc. of IJCAI.

Bonet, B., and Geffner, H. 2003b. Labeled RTDP: Improving the convergence of real-time dynamic programming. In Proc. of ICAPS.

Goodwin, R. 1996. Meta-Level Control for Decision Theoretic Planners. Ph.D. Dissertation, School of Computer Science, Carnegie Mellon Univ., CMU-CS-96-186.

Hansen, E., and Zilberstein, S. 2001. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence.

McMahan, H. B., and Gordon, G. J. 2005. Fast exact planning in Markov decision processes. In Proc. of ICAPS.

McMahan, H. B.; Likhachev, M.; and Gordon, G. J. 2005. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In Proc. of ICML.

Smith, T., and Simmons, R. 2004. Heuristic search value iteration for POMDPs. In Proc. of UAI.

Zhang, N. L., and Zhang, W. 2001. Speeding up the convergence of value iteration in partially observable Markov decision processes. Journal of AI Research 14:29-51.