The idea of value iteration has been applied to the online learning of optimal controllers for discrete-time (DT) systems for many years. In the work of Werbos (1974, 1989, 1991, 1992, 2009), a family of DT learning control algorithms based on value iteration ideas has been developed. These techniques are known as approximate dynamic programming or adaptive dynamic programming (ADP). ADP includes heuristic dynamic programming (HDP) (which is value iteration), dual heuristic programming, and action-dependent variants of those algorithms, which are equivalent to Q-learning for the DT dynamical system $x_{k+1} = f(x_k) + g(x_k)u_k$. Value iteration algorithms rely on the special form of the DT Bellman equation $V(x_k) = r(x_k, u_k) + \gamma V(x_{k+1})$, with $r(x_k, u_k)$ the utility or stage cost of the value function. This equation contains two occurrences of the value function, evaluated at the two times $k$ and $k+1$, and does not depend on the system dynamics $f(x_k)$, $g(x_k)$.
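To make the DT Bellman recursion concrete, the following is a minimal sketch of value iteration (HDP) specialized to the linear-quadratic case, where $V_j(x) = x^T P_j x$ and the update $V_{j+1}(x_k) = \min_u [r(x_k, u) + \gamma V_j(x_{k+1})]$ reduces to a Riccati-like recursion on $P_j$. The system matrices, quadratic stage cost, and discount factor below are illustrative assumptions, not data from the source.

```python
# Sketch: value iteration (HDP) for a discrete-time LQR problem.
# Assumed stage cost r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k; the matrices
# A, B, Q, R and the discount factor gamma are hypothetical example choices.
import numpy as np

def value_iteration_lqr(A, B, Q, R, gamma=0.95, tol=1e-10, max_iter=10_000):
    """Iterate V_{j+1}(x_k) = min_u [r(x_k, u) + gamma * V_j(x_{k+1})]
    with V_j(x) = x^T P_j x, starting from V_0 = 0 (i.e., P_0 = 0)."""
    n = A.shape[0]
    P = np.zeros((n, n))
    K = np.zeros((B.shape[1], n))
    for _ in range(max_iter):
        # Minimizing over u yields the gain K_j and the update for P_{j+1}.
        S = R + gamma * B.T @ P @ B
        K = gamma * np.linalg.solve(S, B.T @ P @ A)
        P_next = Q + gamma * A.T @ P @ A - gamma * A.T @ P @ B @ K
        if np.max(np.abs(P_next - P)) < tol:
            return P_next, K
        P = P_next
    return P, K

# Illustrative double-integrator-like system (hypothetical example data).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

P, K = value_iteration_lqr(A, B, Q, R)
print("Converged value kernel P:\n", P)
print("Feedback gain K (u_k = -K x_k):\n", K)
```

Note that the update uses only sampled pairs $(x_k, x_{k+1})$ through $V_j(x_{k+1})$; this is the sense in which the Bellman equation itself does not require explicit knowledge of $f(x_k)$, $g(x_k)$.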