Focused Real-Time Dynamic Programming for MDPs: Squeezing More Out of a Heuristic
Robotics Institute, Carnegie Mellon University
Real-time dynamic programming (RTDP) is a heuristic search algorithm for solving MDPs. We present a modified algorithm called Focused RTDP with several improvements. While RTDP maintains only an upper bound on the long-term reward function, FRTDP maintains two-sided bounds and bases the output policy on the lower bound. FRTDP guides search with a new rule for outcome selection, focusing on parts of the search graph that contribute most to uncertainty about the values of good policies. FRTDP has modified trial termination criteria that should allow it to solve some problems (within ε) that RTDP cannot. Experiments show that for all the problems we studied, FRTDP significantly outperforms RTDP and LRTDP, and converges with up to six times fewer backups than the state-of-the-art HDP algorithm.
Markov decision processes (MDPs) are planning problems in which an agent's actions have uncertain outcomes, but the state of the world is fully observable. This paper studies techniques for speeding up MDP planning by leveraging heuristic information about the value function (the expected long-term reward available from each state). In many domains, one can quickly calculate upper and lower bounds on the value function. As with the A∗ algorithm in a deterministic setting, admissible bounds can be used to prune much of the search space while still guaranteeing optimality of the resulting policy.

Real-time dynamic programming (RTDP) is a well-known MDP heuristic search algorithm (Barto, Bradtke, & Singh 1995). Each RTDP trial begins at the initial state of the MDP and explores forward, choosing actions greedily and choosing outcomes stochastically according to their transition probabilities.

This paper introduces the Focused RTDP algorithm, which is designed to both converge faster than RTDP and solve a broader class of problems. Whereas RTDP keeps only an upper bound on the long-term reward function, FRTDP keeps two-sided bounds and bases its output policy on the lower bound (Goodwin 1996), significantly improving anytime solution quality and performance guarantees.

FRTDP guides outcome selection by maintaining a priority value at each node that estimates the benefit of directing search to that node. Priority-based outcome selection both focuses sampling on the most relevant parts of the search graph and allows FRTDP to avoid nodes that have already converged.

FRTDP has modified trial termination criteria that should allow it to solve some problems that RTDP cannot. RTDP is known to solve a class of non-pathological stochastic shortest path problems (Barto, Bradtke, & Singh 1995). We conjecture that FRTDP additionally solves (within ε) a broader class of problems in which the state set may be infinite and the goal may not be reachable from every state.

Relative to existing algorithms, FRTDP is intended to be more robust, converge faster, and have better anytime solution quality before convergence is reached. Experimentally, FRTDP significantly outperformed several other algorithms.

Action selection based on the lower bound is a frequently occurring idea in decision-theoretic search. It was used in (Goodwin 1996) and probably earlier, and applied to MDPs in (McMahan, Likhachev, & Gordon 2005).

LRTDP (Bonet & Geffner 2003b) and HDP (Bonet & Geffner 2003a) are RTDP-derived algorithms that similarly use the idea of avoiding updates to irrelevant states. However, their trials are not restricted to a single path through the search graph, and they do not explicitly select outcomes. Irrelevant states are avoided through modified trial termination criteria.

HSVI (Smith & Simmons 2004) and BRTDP (McMahan, Likhachev, & Gordon 2005) both include the idea of outcome selection, but they prioritize internal nodes by uncertainty rather than by the FRTDP concept of priority. We conjecture that FRTDP priorities will lead to better performance than uncertainty values because they better reflect the single-path trial structure of RTDP. Unfortunately, we do not have a performance comparison: HSVI is designed for POMDPs, and we did not compare with BRTDP because we did not become aware of it until just before publication.

LAO∗ (Hansen & Zilberstein 2001) is another heuristic search algorithm, but with a control structure unlike RTDP's deep single-path trials. We did not compare with LAO∗ because it was dominated by LRTDP in an earlier study with similar problems (Bonet & Geffner 2003b).

IPS and PPI (McMahan & Gordon 2005) also outperform LRTDP on racetrack problems. We did not compare with these algorithms because they explore backward from a goal state rather than forward from a start state. The distinction is not crucial for the racetrack problem, but forward exploration is required for multiple-reward problems where the set of goal states is ill-defined, and for problems where the set of possible predecessors of a state is not finite (as when POMDPs are formulated as belief-state MDPs).
An MDP models a planning problem in which action outcomes are uncertain but the world state is fully observable. The agent is assumed to know a probability distribution for action outcomes. This paper studies discrete infinite-horizon stationary MDPs, formally described by a set of states S, a finite set of actions A, a successor function C providing the set of states C_a(s) ⊆ S that could result from an action, transition probabilities T^a(s_i, s_j) = Pr(s_j | s_i, a), a real-valued reward function R(s, a), a discount factor γ ≤ 1, an initial state s_0, and a (possibly empty) set of absorbing goal states G ⊆ S. Taking any action in a goal state causes a zero-reward self-transition with probability 1.

The object of the planning problem is to generate a stationary policy π that maximizes expected long-term reward:

J^π(s_0) = E[ Σ_{t=0}^∞ γ^t R(s_t, π(s_t)) ].   (1)

Our algorithms generate an approximately optimal policy π̂.

An MDP can be solved by approximating its optimal value function V* = J^{π*}. Any value function V induces a greedy policy π_V in which actions are selected via one-step lookahead. The policy induced by V* is optimal.

Value iteration (VI) is a widely used dynamic programming technique in which one solves for V* using the fact that it is the unique fixed point of the Bellman update:

V(s) ← max_{a ∈ A} [ R(s, a) + γ Σ_{s' ∈ C_a(s)} T^a(s, s') V(s') ].   (2)

VI algorithms start with an initial guess for the right-hand side of (2) and repeatedly update so that V gets closer to V*. In classical synchronous value iteration, at each step the Bellman update is applied over all states simultaneously. Conversely, asynchronous algorithms update states one at a time. By updating the most relevant states more often and carefully choosing the right update ordering, they often converge much faster.

The heuristic search algorithms we consider are asynchronous VI algorithms that can use additional a priori information in the form of admissible bounds h_L and h_U on the optimal value function, satisfying h_L ≤ V* ≤ h_U. For goal states s ∈ G, we enforce that h_L(s) = h_U(s) = 0. h_L and h_U help the algorithm choose which states to update asynchronously.
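For concreteness, the sketch below shows the point-based form of the Bellman update (2) applied at a single state to both an upper and a lower bound, as the algorithms discussed later require. The dictionary-based MDP encoding (R, T, succ keyed by state and action) is our own illustration, not a data structure specified in the paper.

```python
# Sketch of a point-based Bellman backup over two-sided bounds (our own
# illustration of Eq. 2; the paper's implementation is not shown).

def q_value(V, R, T, succ, s, a, gamma):
    """One-step lookahead value of action a in state s under value function V."""
    return R[(s, a)] + gamma * sum(T[(s, a, s2)] * V[s2] for s2 in succ[(s, a)])

def backup(VU, VL, R, T, succ, actions, s, gamma):
    """Apply the Bellman update at state s to both bounds.

    Returns the greedy action under the upper bound and the change in VU(s).
    """
    q_upper = {a: q_value(VU, R, T, succ, s, a, gamma) for a in actions}
    a_star = max(q_upper, key=q_upper.get)        # greedy w.r.t. the upper bound
    change = abs(VU[s] - q_upper[a_star])         # how much the bound moved
    VU[s] = q_upper[a_star]
    VL[s] = max(q_value(VL, R, T, succ, s, a, gamma) for a in actions)
    return a_star, change
```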
Every MDP has a corresponding AND/OR search graph. Nodes of the graph are states of the MDP. Each state/action pair of the MDP is represented with a k-connector in the search graph, connecting a state s to its possible successors C_a(s). Individual edges of each k-connector are annotated with transition probabilities.

The explicit graph is a data structure containing a subset of the MDP search graph. Heuristic search algorithms can often reach an optimal solution while only examining a tiny fraction of the states in the search graph, in which case generating node and link data structures for the unexamined states would be wasteful. Instead, one usually provides the algorithm with an initial explicit graph containing s_0 and callbacks for extending the explicit graph as needed.

Expanding a node means generating its outgoing k-connectors and successor nodes and adding them to the explicit graph. Explicit graph nodes are either internal nodes, which have already been expanded, or fringe nodes, which have not. Let I denote the set of internal nodes and F the set of fringe nodes.

In this framework, one can ask: given the information embodied in a particular explicit graph, what inferences can be drawn about the quality of different policies? And which fringe nodes should be expanded in order to improve those quality estimates?

A useful statistic for characterizing a policy π is its occupancy W^π(s) for each state s. If we consider the distribution of possible execution traces for π and interpret the discount γ in terms of trace termination (i.e., execution terminates at any given time step with probability 1 − γ), then W^π(s) is the expected number of time steps per execution that π spends in state s before passing beyond the fringe. Formally, occupancy is defined as the solution to the following simultaneous equations (s' ∈ I ∪ F):

W^π(s') = W_0(s') + γ Σ_{s ∈ I} T^{π(s)}(s, s') W^π(s),   (3)

where W_0(s') is 1 if s' = s_0 and 0 otherwise.¹

¹ If γ = 1 and π has loops, the occupancy of some states may diverge. However, it converges for the problems and policies that we consider.

The occupancy at each fringe node indicates its relevance to the policy. In particular, the quality J^π(s_0) of a policy can be split into the contributions of internal and fringe nodes:

J^π(s_0) = Σ_{s ∈ I} W^π(s) R(s, π(s)) + Σ_{s ∈ F} W^π(s) J^π(s).   (4)

The sum over I is the expected reward from executing π up to the point where it reaches a fringe node. Given an explicit graph, the sum over I can be calculated, but J^π(s) for fringe nodes s cannot, because it depends on information in the unexplored part of the search graph.

One way to estimate V*(s_0) is by keeping bound functions V^L ≤ V* ≤ V^U and choosing fringe nodes that help bring the gap between them at s_0 close to zero. Replacing J^π(s) in (4) with the heuristic bounds h_L(s) and h_U(s) at the fringe nodes, we have brought in the only available information about the value at fringe nodes, i.e., their heuristic values. The resulting expression makes clear each fringe node's contribution to the uncertainty at s_0. The best possible result of expanding a fringe node s is to decrease its local uncertainty to 0, reducing the uncertainty at s_0 by at most W^{π*}(s)|h_U(s) − h_L(s)|, "the occupancy times the uncertainty". Later, we discuss how to approximate this upper bound on uncertainty reduction and use it to guide fringe node expansion.
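As an illustration of how occupancy weights the fringe, the sketch below estimates W^π for a fixed policy over an explicit graph by iterating equation (3), then scores each fringe node by W^π(s)(h_U(s) − h_L(s)). The fixed-point iteration, the set-based graph encoding, and all function names are our own simplifications, not part of the paper's algorithms.

```python
def occupancy(internal, fringe, policy, T, succ, s0, gamma, iters=1000):
    """Estimate W^pi by iterating the simultaneous equations (3).
    internal and fringe are sets of states; policy maps internal states to actions."""
    W = {s: 0.0 for s in internal | fringe}
    W[s0] = 1.0
    for _ in range(iters):
        W_new = {s: (1.0 if s == s0 else 0.0) for s in W}
        for s in internal:                      # only internal nodes push mass forward
            a = policy[s]
            for s2 in succ[(s, a)]:
                W_new[s2] += gamma * T[(s, a, s2)] * W[s]
        W = W_new
    return W

def most_useful_fringe_node(internal, fringe, policy, T, succ, s0, gamma, hL, hU):
    """Rank fringe nodes by occupancy times uncertainty, the bound on how much
    expanding each one could reduce the uncertainty at s0."""
    W = occupancy(internal, fringe, policy, T, succ, s0, gamma)
    return max(fringe, key=lambda s: W[s] * (hU[s] - hL[s]))
```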
RTDP is an asynchronous VI algorithm that works by repeated trials; each trial starts at s_0 and explores forward in the search graph. At each forward step, action selection is greedy based on the current value function, and outcome selection is stochastic according to the distribution of possible successor states given the chosen action. When a goal state is reached, RTDP terminates the trial by retracing its steps back to s_0, updating each state along the way. The value function V^U is initialized with an admissible heuristic h_U ≥ V*. Like the A∗ algorithm in the deterministic setting, RTDP often converges without even examining all the states of the problem.
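A minimal sketch of one such trial, assuming the same dictionary-based tabular MDP encoding as the earlier examples and a finite set of goal states; this is our own rendering of the trial structure, not the paper's Algorithm 1.

```python
import random

def greedy_backup(V, R, T, succ, actions, gamma, s):
    """Bellman backup at s; updates V[s] in place and returns the greedy action."""
    q = {a: R[(s, a)] + gamma * sum(T[(s, a, s2)] * V[s2] for s2 in succ[(s, a)])
         for a in actions}
    a_star = max(q, key=q.get)
    V[s] = q[a_star]
    return a_star

def sample_successor(T, succ, s, a):
    """Sample an outcome of action a according to its transition probabilities."""
    states = succ[(s, a)]
    weights = [T[(s, a, s2)] for s2 in states]
    return random.choices(states, weights=weights, k=1)[0]

def rtdp_trial(V, R, T, succ, actions, gamma, s0, goals):
    """One RTDP trial: explore forward, choosing actions greedily w.r.t. the
    current upper bound V and outcomes stochastically, until a goal state is
    reached; then retrace the path back to s0, updating each state again."""
    path, s = [], s0
    while s not in goals:
        a_star = greedy_backup(V, R, T, succ, actions, gamma, s)
        path.append(s)
        s = sample_successor(T, succ, s, a_star)
    for s in reversed(path):
        greedy_backup(V, R, T, succ, actions, gamma, s)
```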
Focused RTDP is derived from RTDP. As in RTDP, FRTDP execution proceeds in trials that begin at s_0 and explore forward through the search graph, selecting actions greedily according to the upper bound, then terminating and performing updates on the way back to s_0. Unlike RTDP, FRTDP maintains a lower bound and uses modified rules for outcome selection and trial termination.

RTDP keeps an upper bound V^U that is initialized with an admissible heuristic h_U ≥ V*, and its output policy is the greedy policy induced by V^U. In contrast, FRTDP keeps two-sided bounds V^L ≤ V* ≤ V^U and outputs the greedy policy induced by V^L.

There are two main benefits to keeping a lower bound. First, if h_L is uniformly improvable², the greedy policy induced by V^L has value at least as good as V^L(s_0); in other words, one can interrupt the algorithm at any time and get a policy with a performance guarantee. The second benefit is that empirically, up to the point where V^L and V^U converge, policies derived from V^L tend to perform better. Policies derived from the upper bound are often "get rich quick" schemes that seem good only because they have not been thoroughly evaluated. The obvious drawback to keeping a lower bound is that updating it increases the cost of each backup. In practice, we observe that adding lower bound calculation to the HDP algorithm increases wallclock time to convergence by about 10%, but with substantial benefits in anytime solution quality.

² Applying the Bellman update to a uniformly improvable function brings the function everywhere closer to V* (Zhang & Zhang 2001).

In summary, FRTDP maintains a lower bound and outputs the greedy policy induced by the lower bound. It also uses the lower bound during its priority calculation for outcome selection, described below.

FRTDP allows the user to specify a regret bound ε. If there is an envelope of nodes s that all satisfy |V^U(s) − V^L(s)| ≤ ε, then FRTDP algorithm termination can be achieved without expanding any more fringe nodes. And it is perhaps easier to achieve a condition where |V^U(s) − V^L(s)| ≤ ε/2 for a majority of nodes in the envelope, with uncertainties not too large at the rest. Thus, FRTDP can safely terminate a trial when it reaches a state whose uncertainty is very small. We define the excess uncertainty of a state ∆(s) = |V^U(s) − V^L(s)| − ε/2 and terminate any trial that reaches a state with ∆(s) ≤ 0. (This is one of two trial termination criteria; the second is described below.)
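A small sketch of the bookkeeping this implies: each explicit-graph node carries both bounds and a priority, and a trial stops as soon as excess uncertainty is non-positive. The Node class and field names are our own; they mirror the lazy initialization that the paper's pseudocode performs the first time a state is touched, (s.L, s.U) ← (h_L(s), h_U(s)) and s.prio ← ∆(s).

```python
def excess_uncertainty(L, U, eps):
    """Delta(s) = |V^U(s) - V^L(s)| - eps/2."""
    return abs(U - L) - eps / 2.0

class Node:
    """Per-state record in the explicit graph (our own layout)."""
    def __init__(self, hL, hU, eps):
        self.L = hL                                   # lower bound V^L(s)
        self.U = hU                                   # upper bound V^U(s)
        self.prio = excess_uncertainty(hL, hU, eps)   # fringe priority starts at Delta(s)

def should_terminate_trial(node, eps):
    """Excess-uncertainty termination: stop the trial once Delta(s) <= 0."""
    return excess_uncertainty(node.L, node.U, eps) <= 0.0
```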
Whereas RTDP chooses an action outcome stochastically, FRTDP outcome selection attempts to maximize the improvement in the quality estimate of the greedy policy π_U by expanding the fringe node s with the largest contribution to the uncertainty of that estimate. FRTDP outcome selection is designed to choose the fringe node s that maximizes W^{π_U}(s)∆(s) (occupancy times excess uncertainty). But due to computational constraints, it prioritizes nodes via an approximation scheme that only guarantees reaching the best fringe node in certain special cases. FRTDP recursively calculates a priority p(s) for each node s, such that choosing the successor state with the highest priority at each step causes the trial to arrive at the maximizing fringe node. The recursive update formula is

p(s) = max_{s' ∈ C_{a*}(s)} γ T^{a*}(s, s') p(s'),

with p(s) = ∆(s) at fringe nodes, where the action a* is chosen greedily according to the upper bound. p(s) is recalculated along with V^U(s) and V^L(s) at each update of s.

The priority update rule is guaranteed to lead FRTDP to the best fringe node only in the case that the search graph is a tree. In a general graph, there are two confounding factors that violate the guarantee. First, W^π(s) is the expected amount of time π spends in s, adding up all possible paths from s_0 to s. Maximizing p(s) at each step effectively prioritizes fringe nodes according to their maximum occupancy along the single most likely path (in a tree there is only one path). Second, since after trial termination FRTDP performs updates back to s_0 along only one path instead of along all paths, priorities at internal nodes can be inconsistent with the priorities of their descendants. In practice we find that, despite multi-path violations of the assumptions on which the priority is based, choosing outcomes by priority is better than choosing them stochastically. There may also be more accurate priority update schemes that mitigate multi-path error; the current scheme was chosen to keep overhead small and retain the trial-based structure of RTDP.

With the excess uncertainty trial termination alone, FRTDP is a usable search algorithm. However, as with RTDP, poor outcome selection early in a trial could lead into a quagmire of irrelevant states that takes a long time to escape. FRTDP's adaptive maximum depth (AMD) trial termination criterion mitigates this problem by cutting off long trials. FRTDP maintains a current maximum depth D. A trial is terminated if it reaches depth d ≥ D. FRTDP initializes D to a small value D_0, and increases it for subsequent trials. The idea is to avoid over-committing to long trials early on, but retain the ability to go deeper in later trials, in case there are relevant states deeper in the search graph.

FRTDP performance for any particular problem depends on how D is adjusted, so it is important that whatever technique is used be relatively robust across problems without manual parameter tuning. We chose to adjust D adaptively, using trial statistics as feedback. After each trial, FRTDP chooses whether to keep the current value of D or increase it to k_D D.

The feedback mechanism is fairly ad hoc. Each update in a trial is given an update quality score q = δW that is intended to reflect how useful the update was. δ measures how much the update changed the upper bound value V^U(s). W is a single-path estimate of the occupancy of the state being updated under the current greedy policy. After each trial, D is increased if the average update quality near the end of the trial (d > D/k_D) is at least as good as the average update quality in the earlier part of the trial.

The racetrack problems used in our experiments were designed to be solved by RTDP, so it is no surprise that they are particularly benign and suitable for deep trials. In the racetrack domain, AMD termination hurts performance slightly overall, as early trials are less efficient before D grows large. However, AMD improves performance on some more challenging problems (not reported here). For all the listed results, we used AMD with D_0 = 10 and k_D = 1.1.
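Putting these pieces together, the following is a self-contained sketch of the FRTDP control flow under the same dictionary-based MDP encoding as the earlier examples: greedy action selection on the upper bound, priority-guided outcome selection, excess-uncertainty and adaptive-maximum-depth trial termination, and the k_D feedback rule. To stay short it recomputes a one-step priority γ T^{a*}(s, s')∆(s') at selection time instead of caching p(s) on every node, so it illustrates the control structure rather than reproducing the paper's Algorithm 2.

```python
def frtdp(R, T, succ, actions, gamma, hL, hU, s0, goals, eps,
          D0=10, kD=1.1, max_trials=10000):
    L, U = {}, {}                                # two-sided bounds V^L <= V* <= V^U

    def touch(s):                                # lazy initialization from the heuristics
        if s not in U:
            L[s], U[s] = (0.0, 0.0) if s in goals else (hL(s), hU(s))

    def delta(s):                                # excess uncertainty Delta(s)
        touch(s)
        return abs(U[s] - L[s]) - eps / 2.0

    def backup(s):                               # Bellman update of both bounds at s
        for a in actions:
            for t in succ[(s, a)]:
                touch(t)
        qU = {a: R[(s, a)] + gamma * sum(T[(s, a, t)] * U[t] for t in succ[(s, a)])
              for a in actions}
        a_star = max(qU, key=qU.get)             # greedy action w.r.t. the upper bound
        d = abs(U[s] - qU[a_star])               # how much the upper bound moved (delta)
        U[s] = qU[a_star]
        L[s] = max(R[(s, a)] + gamma * sum(T[(s, a, t)] * L[t] for t in succ[(s, a)])
                   for a in actions)
        return a_star, d

    D = float(D0)
    for _ in range(max_trials):
        touch(s0)
        if U[s0] - L[s0] <= eps:                 # converged to within the regret bound
            break
        path, s, depth = [], s0, 0
        q_early = n_early = q_late = n_late = 0
        W = 1.0                                  # single-path occupancy estimate
        while s not in goals and delta(s) > 0.0 and depth < D:
            a_star, d = backup(s)
            if depth > D / kD:                   # statistics for the AMD feedback rule
                q_late, n_late = q_late + d * W, n_late + 1
            else:
                q_early, n_early = q_early + d * W, n_early + 1
            path.append(s)
            # outcome selection: successor with the highest gamma * T * Delta score
            nxt = max(succ[(s, a_star)],
                      key=lambda t: gamma * T[(s, a_star, t)] * delta(t))
            W *= gamma * T[(s, a_star, nxt)]
            s, depth = nxt, depth + 1
        for s in reversed(path):                 # update back along the single trial path
            backup(s)
        if n_late and n_early and q_late / n_late >= q_early / n_early:
            D *= kD                              # deepen subsequent trials
    return L, U                                  # act greedily w.r.t. the lower bound L
```

In a faithful implementation, p(s) would be stored on each node and refreshed during the same backup that updates V^U(s) and V^L(s), as described above.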
Stochastic shortest path problems (SSPs) are MDPs that satisfy additional restrictions (Bertsekas & Tsitsiklis 1996):

S1. All rewards are strictly negative.

S2. The set of states is finite.

S3. There exists at least one proper policy, that is, a policy that reaches G from any state with probability 1.

RTDP's convergence guarantee relies on one additional restriction:

R1. All policies that are improper must incur infinite cost for at least one state.

Under conditions S1-S3 and R1, RTDP's V^U value function is guaranteed (with probability 1) to converge to V* over the set of relevant states, i.e., states that can be reached by at least one optimal policy (Barto, Bradtke, & Singh 1995). Unsurprisingly, there is a corresponding result for FRTDP.

Theorem 1. Under conditions S1-S3 and R1, and setting ε = 0, FRTDP's V^L and V^U bounds are guaranteed to converge to V* over the set of relevant states.

We conjecture that FRTDP can also approximately solve a broader class of SSP-like problems that satisfy:

F1. There are global reward bounds R_L and R_U such that for every (s, a) pair, R_L ≤ R(s, a) ≤ R_U < 0.

F2. There exists at least one policy that (a) reaches G starting from s_0 with probability 1, and (b) has positive occupancy for only a finite number of states.

Conjecture 2. Under conditions F1-F3, FRTDP is guaranteed to terminate with an output policy whose regret is at most ε.

FRTDP should terminate under these weaker conditions because (unlike RTDP) each trial is capped at a finite maximum depth D; thus poor decisions early in a trial can always be reconsidered in the next trial. The mechanism for adjusting D should not affect the termination guarantee, as long as the sequence of cap values increases without bound.

The weaker conditions allow FRTDP to solve problems with an infinite state set; for example, POMDPs formulated as belief-space MDPs naturally have an infinite state set, and FRTDP can still solve them if they satisfy F1-F3. FRTDP can also solve problems with "one-way doors", in which poor early action choices lead to states from which the goal is unreachable, as long as there is a policy guaranteed to reach the goal starting from s_0.
We evaluated the performance of FRTDP on problems in the popular racetrack benchmark domain from (Barto, Bradtke, & Singh 1995). States of racetrack are integer vectors (x, y, ẋ, ẏ) that represent the discrete position and speed of the car in a 2D grid. The actions available to the car are integer accelerations, with each component drawn from {−1, 0, 1}. The car starts in one of a set of possible start states. The goal is to maneuver the car into one of a set of goal states. Some cells in the grid are marked as obstacles; if the car's path intersects one of these cells, it is reset back to one of the start states with zero velocity. Uncertainty in this problem comes from "skidding": each time the agent takes an acceleration action, with probability p the car skids and the commanded acceleration is ignored.
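For concreteness, the sketch below writes out the successor distribution this description implies for a single acceleration action, assuming the standard racetrack skid model in which a skid simply drops the commanded acceleration; the collision test is left as a caller-supplied predicate since the map representation is not specified here, and all names are our own.

```python
def racetrack_outcomes(state, accel, p_skid, start_states, crosses_obstacle):
    """Successor distribution for one acceleration action in the racetrack domain.

    state        = (x, y, vx, vy) integer position and velocity
    accel        = (ax, ay), each component in {-1, 0, 1}
    start_states = iterable of (x, y) start cells
    Returns a list of (probability, next_state) pairs."""
    x, y, vx, vy = state
    outcomes = []
    for prob, (ax, ay) in [(1.0 - p_skid, accel), (p_skid, (0, 0))]:  # skid drops the accel
        nvx, nvy = vx + ax, vy + ay
        nx, ny = x + nvx, y + nvy
        if crosses_obstacle((x, y), (nx, ny)):
            # Hitting an obstacle resets the car to a start state with zero velocity.
            for sx, sy in start_states:
                outcomes.append((prob / len(start_states), (sx, sy, 0, 0)))
        else:
            outcomes.append((prob, (nx, ny, nvx, nvy)))
    return outcomes
```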
Because FRTDP focuses on outcome selection, we also wanted to study increasing the amount of uncertainty in the problem. We did this in two ways. First, we examined performance with p = 0.1 (the standard case) and p = 0.3 (increased chance of skidding, marked by adding a suffix of -3 to the problem name). Second, we tried increasing the number of possible outcomes from an error. We call this the "wind" variant (marked by adding a suffix of -w). In the wind variant, with probability p = 0.1 an additional acceleration is added to the commanded acceleration. The additional acceleration is drawn from a uniform distribution over 8 possible values: {(−1, −1), (−1, 0), (−1, 1), (0, −1), (0, 1), (1, −1), (1, 0), (1, 1)}. The idea is that instead of skidding, a "gust of wind" provides additional acceleration.

We selected two racetrack problems whose maps have been published: large-b from (Barto, Bradtke, & Singh 1995) and large-ring from (Bonet & Geffner 2003b). With three versions of each problem, our results cover six problems in all.

We selected three heuristic search asynchronous VI algorithms to compare with FRTDP: RTDP, LRTDP, and HDP. In addition, we implemented a modified version of HDP that maintains a lower bound and uses that as the basis for its output policy. We call this algorithm HDP+L.

Following (Bonet & Geffner 2003b), all algorithms were provided with the same admissible upper bound heuristic h_U, calculated by a domain-independent relaxation in which the best possible outcome of any action is assumed to always occur. Formally, the Bellman update is replaced by

V(s) ← max_{a ∈ A} [ R(s, a) + γ max_{s' ∈ C_a(s)} V(s') ].

The time required to calculate h_U is not included in the reported running times for the algorithms.

There is no trivial way to calculate an informative admissible lower bound for a racetrack problem. A formally correct way to handle this, with minor algorithm modifications, is to set h_L(s) = −∞ for all non-goal states s. However, dealing with infinite values would have required some extra bookkeeping, so for convenience we supplied h_L(s) = −1000, which is a gross underestimate of the actual V*(s) values. In principle, the finite lower bound could allow FRTDP to prune some additional low-probability outcomes, but this did not happen in practice. See (McMahan, Likhachev, & Gordon 2005) for discussion of how to efficiently calculate a more informative lower bound.
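One simple way to obtain such a relaxed h_U is to run value iteration with the best-outcome update above until it converges; a minimal sketch, assuming the same dictionary-based encoding as the earlier examples and strictly negative rewards:

```python
def relaxed_upper_bound(states, actions, R, succ, gamma, goals,
                        tol=1e-9, max_iters=100000):
    """Admissible h_U via value iteration on the best-outcome relaxation:
    each action is assumed to reach its most favorable successor, so the
    transition probabilities drop out of the update entirely."""
    V = {s: 0.0 for s in states}      # 0 is an upper bound when all rewards are negative
    for _ in range(max_iters):
        residual = 0.0
        for s in states:
            if s in goals:
                continue              # goal states keep value 0
            new_v = max(R[(s, a)] + gamma * max(V[t] for t in succ[(s, a)])
                        for a in actions)
            residual = max(residual, abs(new_v - V[s]))
            V[s] = new_v
        if residual < tol:
            break
    return V
```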
Figure 1: Millions of backups before convergence with ε = 10^-3. Each entry gives the number of millions of backups, with the corresponding wallclock time (seconds) in parentheses. The fastest time for each problem is shown in bold.
Fig. 1 reports time to convergence within ε for each (problem, algorithm) pair, measured both as number of backups and wallclock time. Our experiments were run on a 3.2 GHz Pentium-4 processor with 1 GB of RAM. We implemented all the algorithms in C++; they were not thoroughly optimized, so the number of backups required for convergence was measured more reliably than the wallclock time (which could probably be substantially reduced across the board). Included in wallclock time measurements is the time required to check racetrack paths for collisions; collision checking was performed the first time a state was expanded, and the results were cached for subsequent updates.

The observed speedup of FRTDP convergence compared to HDP, measured in terms of number of backups, ranges from 2.9 up to 6.4. Our initial expectation was that FRTDP would show more speedup on the -3 and -w problem variants with more uncertainty; in fact its speedup was about the same on -3 problems and smaller on -w problems. We do not yet understand why this is the case. By construction, HDP and HDP+L have identical convergence properties in terms of the number of backups required. As measured in wallclock time, lower bound updating for HDP+L introduces an additional cost overhead of about 10%.

Fig. 2 reports anytime performance of three of the algorithms (HDP, HDP+L, and FRTDP) on the two problems where FRTDP showed the least convergence time speedup (large-ring-w) and the most speedup (large-ring-3) relative to HDP. The quality (expected reward) of an algorithm's output policy was measured at each epoch by simulating the policy 1000 times, with each execution terminated after 250 time steps if the goal was not reached. Error bars are 2σ confidence intervals. The two algorithms that output policies based on a lower bound (HDP+L and FRTDP) are seen to have significantly better anytime performance. FRTDP reaches a solution quality of −40 with about 40 times fewer backups than HDP. In each plot, the solid line indicating FRTDP solution quality ends at the point where FRTDP reaches convergence.
Figure 2: Anytime performance comparison: solution quality vs. number of backups.
FRTDP improves RTDP by keeping a lower bound and modifying outcome selection and trial termination rules. These modifications allow FRTDP to solve a broader class of problems, and in performance experiments FRTDP provided significant speedup across all problems, requiring up to six times fewer backups than HDP to reach convergence.

We also examined the separate performance impact of using a lower bound by implementing both HDP and HDP+L. This technique can be usefully applied on its own to any RTDP-like algorithm.
References

Barto, A.; Bradtke, S.; and Singh, S. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence.

Bertsekas, D. P., and Tsitsiklis, J. N. 1996. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific.

Bonet, B., and Geffner, H. 2003a. Faster heuristic search algorithms for planning with uncertainty and full feedback. In Proc. of IJCAI.

Bonet, B., and Geffner, H. 2003b. Labeled RTDP: Improving the convergence of real-time dynamic programming. In Proc. of ICAPS.

Goodwin, R. 1996. Meta-Level Control for Decision Theoretic Planners. Ph.D. Dissertation, School of Computer Science, Carnegie Mellon Univ., CMU-CS-96-186.

Hansen, E., and Zilberstein, S. 2001. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence.

McMahan, H. B., and Gordon, G. J. 2005. Fast exact planning in Markov decision processes. In Proc. of ICAPS.

McMahan, H. B.; Likhachev, M.; and Gordon, G. J. 2005. Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees. In Proc. of ICML.

Smith, T., and Simmons, R. 2004. Heuristic search value iteration for POMDPs. In Proc. of UAI.

Zhang, N. L., and Zhang, W. 2001. Speeding up the convergence of value iteration in partially observable Markov decision processes. Journal of AI Research 14:29-51.