Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles
This book gives an exposition of recently developed approximate dynamic programming (ADP) techniques for decision and control in human-engineered systems. ADP is a reinforcement learning technique motivated by learning mechanisms in biological and animal systems. From a theoretical point of view it is connected with both adaptive control and optimal control methods. The book shows how ADP can be used to design a family of adaptive optimal control algorithms that converge in real time to optimal control solutions by measuring data along the system trajectories. In the current literature, adaptive controllers and optimal controllers are generally two distinct methods for the design of automatic control systems. Traditional adaptive controllers learn online in real time how to control systems, but do not yield optimal performance. Traditional optimal controllers, on the other hand, must be designed offline using full knowledge of the system's dynamics. It is also shown how to use ADP methods to solve multiplayer differential games online. Differential games have been shown to be important in H-infinity robust control for disturbance rejection, and in coordinating activities among multiple agents in networked teams. The focus of this book is on continuous-time systems, whose dynamical models can be derived directly from physical principles based on Hamiltonian or Lagrangian dynamics.
Inspec keywords: adaptive control; optimal control; learning (artificial intelligence); feedback; differential games
Other keywords: dynamic feedback control systems; reinforcement learning principles; optimal adaptive control; online differential games
Subjects: General and management topics; Self-adjusting control systems; Knowledge engineering techniques; Game theory; Optimal control
Book DOI: 10.1049/PBCE081E
 ISBN: 9781849194891
 eISBN: 9781849194907
 Page count: 400
 Format: PDF

Front Matter
p. (1)
1 Introduction to optimal control, adaptive control and reinforcement learning
pp. 1–8 (8)
In this book, we show how to use RL techniques to unify optimal control and adaptive control. By this we mean that a novel class of adaptive control structures will be developed that learn the solutions of optimal control problems in real time by measuring data along the system trajectories online. We call these optimal adaptive controllers. These optimal adaptive controllers have structures based on the actor-critic learning architecture.
2 Reinforcement learning and optimal control of discrete-time systems: Using natural decision methods to design optimal adaptive controllers
pp. 9–47 (39)
In this chapter, the use of reinforcement learning principles to design a new class of feedback controllers for continuous-time dynamical systems is presented. The chapter also reviews current technology, showing that for discrete-time dynamical systems, reinforcement learning methods allow the solution of HJB design equations online, forward in time, and without knowing the full system dynamics.

Part I: Optimal adaptive control using reinforcement learning structures
3 Optimal adaptive control using integral reinforcement learning for linear systems
pp. 51–69 (19)
This chapter presents a new policy iteration technique that solves the continuous-time LQR problem online without using knowledge of the system's internal dynamics (the system matrix A). The algorithm is derived by writing the value function in integral reinforcement form to yield a new form of the Bellman equation for CT systems. This allows the derivation of an integral reinforcement learning (IRL) algorithm, an adaptive controller that converges online to the solution of the optimal LQR controller. IRL is based on an adaptive critic scheme in which the actor performs continuous-time control while the critic incrementally corrects the actor's behavior at discrete moments in time until best performance is obtained. The critic evaluates the actor's performance over a period of time and formulates it in a parameterized form. Based on the critic's evaluation, the actor's behavior policy is updated for improved control performance.
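As a rough sketch of the policy iteration underlying this chapter, the following offline, model-based version (Kleinman's algorithm) alternates a Lyapunov-equation evaluation step with a gain update; the IRL algorithm of the chapter replaces the Lyapunov solve with reinforcement measured along trajectories, which is why A is not needed there. The system matrices below are illustrative, not taken from the book.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Illustrative system (eigenvalues -1, -1, so K0 = 0 is admissible)
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.zeros((1, 2))  # admissible (stabilizing) initial gain
for _ in range(20):
    Acl = A - B @ K
    # Policy evaluation: solve Acl' P + P Acl + Q + K' R K = 0
    P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
    # Policy improvement: K <- R^{-1} B' P
    K = np.linalg.solve(R, B.T @ P)

P_are = solve_continuous_are(A, B, Q, R)  # P converges to this ARE solution
```

The iteration converges quadratically to the ARE solution; each step only requires a linear (Lyapunov) solve rather than the quadratic ARE itself.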
4 Integral reinforcement learning (IRL) for nonlinear continuous-time systems
pp. 71–92 (22)
This chapter presents an adaptive method based on actor-critic reinforcement learning (RL) for solving online the optimal control problem for nonlinear continuous-time systems in the state-space form ẋ(t) = f(x) + g(x)u(t). The algorithm, first presented in Vrabie et al. (2008, 2009), Vrabie (2009), and Vrabie and Lewis (2009), solves the optimal control problem without requiring knowledge of the drift dynamics f(x). The method is based on policy iteration (PI), an RL algorithm that iterates between the steps of policy evaluation and policy improvement. The PI method starts by evaluating the cost of a given admissible initial policy and then uses this information to obtain a new control policy that is improved in the sense of having a smaller associated cost than the previous policy. These two steps are repeated until the policy improvement step no longer changes the present policy, indicating that the optimal control behavior has been obtained.
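The two-step evaluate/improve loop described above is easiest to see in a finite-MDP caricature; the chapter carries out the same loop in continuous time without knowledge of f(x). The three-state, two-action MDP below is entirely hypothetical, chosen only to exercise the loop.

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a][s, s'] transition probabilities, c[s, a] stage costs
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],  # action 0
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],  # action 1
])
c = np.array([[2.0, 0.5], [1.0, 2.0], [0.0, 3.0]])

pi = np.zeros(3, dtype=int)           # admissible initial policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = c_pi exactly
    P_pi = P[pi, np.arange(3)]
    V = np.linalg.solve(np.eye(3) - gamma * P_pi, c[np.arange(3), pi])
    # Policy improvement: greedy (minimum-cost) action under V
    Qsa = c + gamma * np.einsum('ast,t->sa', P, V)
    pi_new = Qsa.argmin(axis=1)
    if np.array_equal(pi_new, pi):    # no change => optimal policy reached
        break
    pi = pi_new
```

The loop terminates at a policy that is greedy with respect to its own value function, i.e. a fixed point of the improvement step, exactly the stopping condition described in the abstract.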
5 Generalized policy iteration for continuous-time systems
pp. 93–108 (16)
This chapter gives the formulation of GPI algorithms in a CT framework. This is based on the IRL form of the Bellman equation, which allows the extension of GPI to CT systems. It is seen that GPI is in fact a spectrum of iterative algorithms, which has at one end the policy iteration algorithm and at the other a variant of the value iteration algorithm. GPI solves the Riccati equation in the LQR case, or the HJB equation for nonlinear optimal control, online in real time without requiring knowledge of the system drift dynamics f(x).
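The spectrum between policy iteration and value iteration is simplest to demonstrate numerically in the discrete-time LQR analogue (the chapter itself works in continuous time via the IRL Bellman equation). In the sketch below, a hypothetical stable system is used, and `m` policy-evaluation backups are taken per improvement: m = 1 recovers value iteration, while large m approaches policy iteration; both ends converge to the same Riccati solution.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative discrete-time system (not from the book)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

def gpi(m, iters=500):
    """Generalized policy iteration with m evaluation backups per improvement:
    m = 1 behaves as value iteration, large m approaches policy iteration."""
    P = np.zeros((2, 2))
    for _ in range(iters):
        # Policy improvement: K = (R + B'PB)^{-1} B'PA
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        Acl = A - B @ K
        # m steps of policy evaluation under the fixed gain K
        for _ in range(m):
            P = Q + K.T @ R @ K + Acl.T @ P @ Acl
    return P

P_vi = gpi(1)    # value-iteration end of the spectrum
P_pi = gpi(10)   # toward the policy-iteration end
P_are = solve_discrete_are(A, B, Q, R)  # both ends converge to this
```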
6 Value iteration for continuous-time systems
pp. 109–122 (14)
The idea of value iteration has been applied to the online learning of optimal controllers for discrete-time (DT) systems for many years. In the work of Werbos (1974, 1989, 1991, 1992, 2009) a family of DT learning control algorithms based on value iteration ideas has been developed. These techniques are known as approximate dynamic programming or adaptive dynamic programming (ADP). ADP includes heuristic dynamic programming (HDP) (which is value iteration), dual heuristic programming, and action-based variants of those algorithms, which are equivalent to Q-learning for the DT dynamical system x_{k+1} = f(x_k) + g(x_k)u_k. Value iteration algorithms rely on the special form of the DT Bellman equation V(x_k) = r(x_k, u_k) + γV(x_{k+1}), with r(x_k, u_k) the utility or stage cost of the value function. This equation has two occurrences of the value function, evaluated at the two times k and k + 1, and does not depend on the system dynamics f(x_k), g(x_k).
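The model-free character of the DT Bellman equation is exactly what the Q-learning equivalence above exploits: the backup uses only a measured transition (x_k, u_k, r_k, x_{k+1}), never f or g. The toy sketch below (a hypothetical two-state MDP, stated with rewards and max rather than the cost/min convention used elsewhere in the book) applies that backup to sampled data and compares against exact value iteration on the known model.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
# Hypothetical MDP; the learner sees only transitions, never P itself
P = np.array([
    [[0.9, 0.1], [0.1, 0.9]],   # action 0
    [[0.2, 0.8], [0.6, 0.4]],   # action 1
])
r = np.array([[2.0, 0.0], [0.0, 2.0]])  # r[s, a]

Q = np.zeros((2, 2))
n = np.zeros((2, 2))  # visit counts for a decaying step size
s = 0
for k in range(200_000):
    a = rng.integers(2)                   # exploratory action
    s_next = rng.choice(2, p=P[a, s])     # measured transition
    n[s, a] += 1
    alpha = 1.0 / n[s, a] ** 0.6
    # DT Bellman backup on measured data: value at times k and k+1 only
    Q[s, a] += alpha * (r[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

# Exact value iteration on the (here known) model, for comparison;
# for these numbers Q* works out to [[20, 18], [18, 20]]
Qstar = np.zeros((2, 2))
for _ in range(1000):
    Qstar = r + gamma * np.einsum('ast,t->sa', P, Qstar.max(axis=1))
```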

Part II: Adaptive control structures based on reinforcement learning
7 Optimal adaptive control using synchronous online learning
pp. 125–147 (23)
In this chapter, a new adaptive algorithm that solves the continuous-time (CT) optimal control problem online in real time for affine-in-the-inputs nonlinear systems is proposed. This algorithm is called synchronous online optimal adaptive control for CT systems. The structure of this controller is based on the reinforcement learning policy iteration (RL PI) algorithm, and the proofs are carried out using adaptive control Lyapunov function techniques.
8 Synchronous online learning with integral reinforcement
pp. 149–164 (16)
This chapter presents an online adaptive learning algorithm to solve the infinite-horizon optimal control problem for nonlinear systems. This includes simultaneous tuning of both actor and critic NNs (i.e. both neural networks are tuned at the same time) and no need for knowledge of the drift term f(x) in the dynamics. The algorithm is based on integral reinforcement learning (IRL) and solves the Hamilton-Jacobi-Bellman (HJB) equation online in real time by measuring data along the system trajectories, without knowing f(x). In the linear quadratic case ẋ = Ax + Bu it solves the algebraic Riccati equation (ARE) online without knowing the system matrix A.

Part III: Online differential games using reinforcement learning
9 Synchronous online learning for zero-sum two-player games and H-infinity control
pp. 167–194 (28)
In this chapter, methods are provided for the online solution of two-player zero-sum infinite-horizon games, through learning the saddle-point strategies in real time. The dynamics may be nonlinear in continuous time and are assumed known in this chapter. A novel neural-network (NN) adaptive control technique is given that is based on reinforcement learning, whereby the control and disturbance policies are tuned online using data generated in real time along the system trajectories.
10 Synchronous online learning for multiplayer nonzero-sum games
pp. 195–219 (25)
This chapter shows how to solve multiplayer nonzero-sum (NZS) games online using novel adaptive control structures based on reinforcement learning. For the most part, interest in the control systems community has been in the (noncooperative) zero-sum games, which provide the solution of the H-infinity robust control problem. However, dynamic team games may have some cooperative objectives and some selfish objectives among the players. This cooperative/noncooperative balance is captured in the NZS games, as detailed herein.
11 Integral reinforcement learning for zero-sum two-player games
pp. 221–235 (15)
In this chapter we present a continuous-time adaptive dynamic programming (ADP) procedure that uses the idea of integral reinforcement learning (IRL) to find online the Nash-equilibrium solution of the two-player zero-sum (ZS) differential game. We consider continuous-time (CT) linear dynamics of the form ẋ = Ax + B_1 w + B_2 u, where u(t) and w(t) are the control actions of the two players, and an infinite-horizon quadratic cost. This work is from Vrabie and Lewis (2010).

Appendix A: Proofs
pp. 237–272 (36)
Proofs for selected results from various chapters.
Back Matter
p. (1)