Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles
This book gives an exposition of recently developed approximate dynamic programming (ADP) techniques for decision and control in human-engineered systems. ADP is a reinforcement learning technique motivated by learning mechanisms observed in biological and animal systems, and it is connected from a theoretical point of view with both adaptive control and optimal control methods. The book shows how ADP can be used to design a family of adaptive optimal control algorithms that converge in real time to optimal control solutions by measuring data along the system trajectories.

In the current literature, adaptive controllers and optimal controllers are generally two distinct methods for the design of automatic control systems. Traditional adaptive controllers learn online in real time how to control systems, but do not yield optimal performance. On the other hand, traditional optimal controllers must be designed offline using full knowledge of the system dynamics. It is also shown how to use ADP methods to solve multi-player differential games online. Differential games have been shown to be important in H-infinity robust control for disturbance rejection, and in coordinating activities among multiple agents in networked teams. The focus of this book is on continuous-time systems, whose dynamical models can be derived directly from physical principles based on Hamiltonian or Lagrangian dynamics.
Inspec keywords: adaptive control; optimal control; learning (artificial intelligence); feedback; differential games
Other keywords: dynamic feedback control systems; reinforcement learning principles; optimal adaptive control; online differential games
Subjects: General and management topics; Self-adjusting control systems; Knowledge engineering techniques; Game theory; Optimal control
- Book DOI: 10.1049/PBCE081E
- Chapter DOI: 10.1049/PBCE081E
- ISBN: 9781849194891
- e-ISBN: 9781849194907
- Page count: 400
- Format: PDF
Front Matter
(1 page)
1 Introduction to optimal control, adaptive control and reinforcement learning
pp. 1–8 (8 pages)
In this book, we show how to use reinforcement learning (RL) techniques to unify optimal control and adaptive control. By this we mean that a novel class of adaptive control structures is developed that learns the solutions of optimal control problems in real time by measuring data along the system trajectories online. We call these optimal adaptive controllers. These optimal adaptive controllers have structures based on the actor-critic learning architecture.
2 Reinforcement learning and optimal control of discrete-time systems: Using natural decision methods to design optimal adaptive controllers
pp. 9–47 (39 pages)
This chapter presents the use of reinforcement learning principles to design a new class of feedback controllers for continuous-time dynamical systems. It also reviews current technology, showing that, for discrete-time dynamical systems, reinforcement learning methods allow the solution of HJB design equations online, forward in time, and without knowing the full system dynamics.
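As a concrete illustration of this kind of online, forward-in-time design, the following is a minimal sketch (not taken from the book) of Q-function-based policy iteration for a discrete-time LQR: the Q-function kernel H of the current policy is identified by least squares from measured transitions, and the policy is improved from H alone, so the update itself needs no model. The matrices A, B, Qc, R, the exploration noise, and the data lengths are illustrative assumptions; A and B appear only in the data-generating simulation.

```python
import numpy as np

np.random.seed(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])      # used only to generate transition data
B = np.array([[0.0], [0.1]])
Qc, R = np.eye(2), np.eye(1)                 # stage-cost weights
n, m = 2, 1

K = np.zeros((m, n))                         # initial admissible policy (A is stable)

for it in range(10):
    Phi, targets = [], []
    x = np.random.randn(n)
    for k in range(200):
        u = -K @ x + 0.1 * np.random.randn(m)          # exploration noise for excitation
        x_next = A @ x + B @ u
        z, z_next = np.concatenate([x, u]), np.concatenate([x_next, -K @ x_next])
        # Q-function Bellman equation: Q(x_k, u_k) = r(x_k, u_k) + Q(x_{k+1}, -K x_{k+1})
        Phi.append(np.kron(z, z) - np.kron(z_next, z_next))
        targets.append(x @ Qc @ x + u @ R @ u)
        x = x_next if np.linalg.norm(x_next) > 1e-3 else np.random.randn(n)
    h = np.linalg.lstsq(np.array(Phi), np.array(targets), rcond=None)[0]
    H = h.reshape(n + m, n + m); H = 0.5 * (H + H.T)    # policy evaluation: kernel H
    K = np.linalg.solve(H[n:, n:], H[n:, :n])           # policy improvement from H alone

print("learned gain K:", K)
```

With sufficiently exciting data, the improvement step K = H_uu^{-1} H_ux reproduces the model-based LQR update without ever differentiating the plant model in the learning rule.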
Part I: Optimal adaptive control using reinforcement learning structures
3 Optimal adaptive control using integral reinforcement learning for linear systems
pp. 51–69 (19 pages)
This chapter presents a new policy iteration technique that solves the continuous-time LQR problem online without using knowledge of the system's internal dynamics (the system matrix A). The algorithm is derived by writing the value function in integral reinforcement form, which yields a new form of Bellman equation for CT systems. This allows the derivation of an integral reinforcement learning (IRL) algorithm: an adaptive controller that converges online to the optimal LQR controller. IRL is based on an adaptive critic scheme in which the actor performs continuous-time control while the critic incrementally corrects the actor's behavior at discrete moments in time until best performance is obtained. The critic evaluates the actor's performance over a period of time and formulates it in a parameterized form. Based on the critic's evaluation, the actor's behavior policy is updated for improved control performance.
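The following is a minimal offline sketch of this idea in the LQR case: the value kernel P of the current policy is identified by least squares from integral-cost data measured along short closed-loop trajectories, and only the input matrix B (never A) enters the policy update K = R^{-1} B^T P. The matrices A, B, Q, R, the interval length T, and the initial stabilizing gain are illustrative assumptions; A is used only to simulate the trajectory data.

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [-1.0, 2.0]])      # unstable plant; used only to simulate data
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
n, T = 2, 0.05                                # reinforcement interval length

K = np.array([[0.0, 5.0]])                    # initial stabilizing gain (assumed available)

def run_interval(x0, K):
    """Integrate the closed loop and the running cost over one interval of length T."""
    def rhs(t, s):
        x = s[:n]
        u = -K @ x
        return np.concatenate([A @ x + B @ u, [x @ Q @ x + u @ R @ u]])
    sol = solve_ivp(rhs, [0.0, T], np.concatenate([x0, [0.0]]), rtol=1e-8)
    return sol.y[:n, -1], sol.y[n, -1]        # x(t+T) and the integral reinforcement

for it in range(8):
    Phi, rho = [], []
    for _ in range(40):                       # short trajectories from random states
        x0 = np.random.randn(n)
        x1, cost = run_interval(x0, K)
        # integral Bellman equation: x0' P x0 = integral cost + x1' P x1
        Phi.append(np.kron(x0, x0) - np.kron(x1, x1))
        rho.append(cost)
    p = np.linalg.lstsq(np.array(Phi), np.array(rho), rcond=None)[0]
    P = p.reshape(n, n); P = 0.5 * (P + P.T)  # policy evaluation: value kernel P
    K = np.linalg.solve(R, B.T @ P)           # policy improvement: uses B only, never A

print("IRL gain K:", K)
print("IRL value kernel P:", P)
```

Each outer pass is one policy-evaluation/policy-improvement cycle; with exact data it reproduces the classical Kleinman iteration and so converges to the ARE solution.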
4 Integral reinforcement learning (IRL) for non-linear continuous-time systems
pp. 71–92 (22 pages)
This chapter presents an adaptive method based on actor-critic reinforcement learning (RL) for solving online the optimal control problem for non-linear continuous-time systems in the state-space form ẋ(t) = f(x) + g(x)u(t). The algorithm, first presented in Vrabie et al. (2008, 2009), Vrabie (2009), and Vrabie and Lewis (2009), solves the optimal control problem without requiring knowledge of the drift dynamics f(x). The method is based on policy iteration (PI), an RL algorithm that iterates between the steps of policy evaluation and policy improvement. The PI method starts by evaluating the cost of a given admissible initial policy and then uses this information to obtain a new control policy that is improved in the sense of having a smaller associated cost than the previous policy. These two steps are repeated until the policy improvement step no longer changes the present policy, indicating that the optimal control behavior has been obtained.
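A minimal sketch of this PI structure on an illustrative scalar system (not the chapter's example): the critic is a small polynomial approximator V(x) ≈ w1·x² + w2·x⁴ fitted by least squares to the integral Bellman equation over short intervals, and the improved policy is u = -(1/2) R^{-1} g(x) dV/dx. The dynamics f(x) = -x + 0.5x³, g(x) = 1, the basis, and the initial policy are assumptions for illustration; f is used only to generate trajectory data, not in the learning update.

```python
import numpy as np
from scipy.integrate import solve_ivp

f = lambda x: -x + 0.5 * x**3                 # drift: only used to simulate data
g = lambda x: 1.0
phi = lambda x: np.array([x**2, x**4])        # critic basis
T = 0.1                                       # reinforcement interval
w = np.zeros(2)                               # critic weights
u_of = lambda x: -x                           # initial admissible policy

for it in range(6):
    Phi, rho = [], []
    for x0 in np.linspace(-0.8, 0.8, 30):
        def rhs(t, s):
            u = u_of(s[0])
            return [f(s[0]) + g(s[0]) * u, s[0]**2 + u**2]   # dynamics + running cost
        sol = solve_ivp(rhs, [0.0, T], [x0, 0.0], rtol=1e-8)
        # integral Bellman equation: V(x0) = integral cost + V(x(T))
        Phi.append(phi(x0) - phi(sol.y[0, -1]))
        rho.append(sol.y[1, -1])
    w = np.linalg.lstsq(np.array(Phi), np.array(rho), rcond=None)[0]   # policy evaluation
    # policy improvement: u = -(1/2) R^{-1} g(x) dV/dx with R = 1
    u_of = lambda x, w=w: -0.5 * g(x) * (2.0 * w[0] * x + 4.0 * w[1] * x**3)

print("critic weights [w1, w2]:", w)
```

Because the basis is small, the result is only an approximation of the optimal value function; the chapter treats the approximation and admissibility issues rigorously.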
5 Generalized policy iteration for continuous-time systems
pp. 93–108 (16 pages)
This chapter gives the formulation of generalized policy iteration (GPI) algorithms in a continuous-time (CT) framework. This is based on the IRL form of the Bellman equation, which allows the extension of GPI to CT systems. GPI is in fact a spectrum of iterative algorithms, with the policy iteration algorithm at one end and a variant of the value iteration algorithm at the other. GPI solves the Riccati equation in the LQR case, or the HJB equation for non-linear optimal control, online in real time without requiring knowledge of the system drift dynamics f(x).
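A minimal model-based sketch of the GPI spectrum in the LQR case follows. The interval-cost map used for partial policy evaluation is exactly the quantity that IRL reconstructs from measured trajectory data; it is computed from the model here only to keep the sketch short. Letting num_sweeps grow recovers policy iteration, while num_sweeps = 1 gives the value-iteration-like end of the spectrum. All matrices and the interval length are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import quad_vec

A = np.array([[0.0, 1.0], [-1.0, 2.0]])     # illustrative dynamics (used to build the maps)
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
T, num_sweeps = 0.1, 3                      # 3 partial-evaluation sweeps per improvement

K = np.array([[0.0, 5.0]])                  # initial stabilizing gain (assumed available)
P = np.zeros((2, 2))

for it in range(30):
    Ac = A - B @ K
    M = expm(Ac * T)                        # state transition over one interval
    # cost accumulated over one interval under the current policy u = -Kx
    W = quad_vec(lambda t: expm(Ac.T * t) @ (Q + K.T @ R @ K) @ expm(Ac * t), 0.0, T)[0]
    for _ in range(num_sweeps):             # partial policy evaluation (the GPI step)
        P = W + M.T @ P @ M
    K = np.linalg.solve(R, B.T @ P)         # policy improvement

print("GPI gain K:", K)
print("GPI value kernel P:", P)
```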
6 Value iteration for continuous-time systems
pp. 109–122 (14 pages)
The idea of value iteration has been applied to the online learning of optimal controllers for discrete-time (DT) systems for many years. In the work of Werbos (1974, 1989, 1991, 1992, 2009), a family of DT learning control algorithms based on value iteration ideas has been developed. These techniques are known as approximate dynamic programming or adaptive dynamic programming (ADP). ADP includes heuristic dynamic programming (HDP) (which is value iteration), dual heuristic programming, and action-based variants of those algorithms, which are equivalent to Q-learning for the DT dynamical system x_{k+1} = f(x_k) + g(x_k)u_k. Value iteration algorithms rely on the special form of the DT Bellman equation V(x_k) = r(x_k, u_k) + γV(x_{k+1}), with r(x_k, u_k) the utility or stage cost of the value function. This equation has two occurrences of the value function, evaluated at the two times k and k + 1, and does not depend on the system dynamics f(x_k), g(x_k).
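In the LQR special case the value function is quadratic, V_j(x) = x^T P_j x, and each value-iteration sweep of the DT Bellman equation reduces to a Riccati difference update. The short sketch below (illustrative A, B, Q, R, discount γ = 1) makes this explicit; ADP methods such as HDP perform the same sweeps from measured data with function approximators rather than from the model used here.

```python
import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.8]])         # illustrative DT plant
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

P = np.zeros((2, 2))                            # V_0 = 0
for j in range(200):
    # greedy policy for the current value V_j(x) = x' P x
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # one Bellman sweep: V_{j+1}(x) = min_u [ x'Qx + u'Ru + V_j(Ax + Bu) ]
    P = Q + K.T @ R @ K + (A - B @ K).T @ P @ (A - B @ K)

print("value kernel P:\n", P)
print("greedy gain K:\n", K)
```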
Part II: Adaptive control structures based on reinforcement learning
7 Optimal adaptive control using synchronous online learning
pp. 125–147 (23 pages)
This chapter proposes a new adaptive algorithm that solves the continuous-time (CT) optimal control problem online in real time for non-linear systems that are affine in the inputs. The algorithm is called synchronous online optimal adaptive control for CT systems. The structure of this controller is based on the reinforcement learning policy iteration (RL PI) algorithm, and the proofs are carried out using adaptive control Lyapunov function techniques.
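The sketch below conveys only the flavor of such synchronous tuning in the LQR case: a single quadratic critic is tuned by a normalized gradient-style update on the Hamiltonian (Bellman) error while the control derived from it acts on the plant, with periodic state resets standing in for persistent excitation. The chapter's actual scheme uses separate actor and critic approximators with additional Lyapunov-derived terms; the system matrices, basis, gains, and horizons here are illustrative assumptions.

```python
import numpy as np

A = np.array([[-1.0, 1.0], [0.0, -0.5]])      # known dynamics (this chapter assumes them known)
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
Rinv = np.linalg.inv(R)

# critic: V(x) ~ w1*x1^2 + w2*x1*x2 + w3*x2^2; grad_phi is the Jacobian of the basis
grad_phi = lambda x: np.array([[2*x[0], 0.0], [x[1], x[0]], [0.0, 2*x[1]]])

w = np.zeros(3)                                # critic weights
x = np.array([1.0, -1.0])
dt, alpha = 1e-3, 5.0                          # illustrative step size and learning rate
rng = np.random.default_rng(1)

for k in range(200_000):
    G = grad_phi(x)
    u = -0.5 * Rinv @ B.T @ G.T @ w            # control derived from the current critic
    sigma = G @ (A @ x + B @ u)                # regressor along the trajectory
    delta = w @ sigma + x @ Q @ x + u @ R @ u  # Hamiltonian (Bellman) error
    w = w - dt * alpha * sigma * delta / (1.0 + sigma @ sigma)**2   # normalized critic update
    x = x + dt * (A @ x + B @ u)               # plant evolves while learning proceeds
    if k % 500 == 0:                           # occasional state resets keep the data exciting
        x = rng.uniform(-1.0, 1.0, 2)

print("critic weights (V = w1*x1^2 + w2*x1*x2 + w3*x2^2):", w)
```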
8 Synchronous online learning with integral reinforcement
pp. 149–164 (16 pages)
This chapter presents an online adaptive learning algorithm to solve the infinite-horizon optimal control problem for non-linear systems. Its features include simultaneous tuning of both actor and critic neural networks (i.e. both networks are tuned at the same time) and no need for knowledge of the drift term f(x) in the dynamics. The algorithm is based on integral reinforcement learning (IRL) and solves the Hamilton-Jacobi-Bellman (HJB) equation online in real time by measuring data along the system trajectories, without knowing f(x). In the linear quadratic case ẋ = Ax + Bu, it solves the algebraic Riccati equation (ARE) online without knowing the system matrix A.
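For reference, the offline quantity that the online algorithm recovers in the linear quadratic case is the stabilizing ARE solution; the snippet below computes it with full model knowledge (illustrative A, B, Q, R), giving the baseline value kernel P and gain K against which a data-driven run can be checked.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, 2.0]])       # illustrative plant
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

# stabilizing solution of  A'P + PA + Q - P B R^{-1} B' P = 0
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)               # optimal gain u = -Kx the adaptive law should recover
print("ARE solution P:\n", P)
print("optimal gain K:\n", K)
```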
Part III: Online differential games using reinforcement learning
9 Synchronous online learning for zero-sum two-player games and H-infinity control
pp. 167–194 (28 pages)
This chapter provides methods for the online solution of two-player zero-sum infinite-horizon games, learning the saddle-point strategies in real time. The dynamics may be non-linear in continuous time and are assumed known in this chapter. A novel neural-network (NN) adaptive control technique based on reinforcement learning is given, whereby the control and disturbance policies are tuned online using data generated in real time along the system trajectories.
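The following offline sketch shows the policy-iteration structure that underlies such a scheme in the linear quadratic zero-sum case: at each step the value kernel P of the current policy pair is found from a Lyapunov equation, and both players' gains are refreshed from the same P, mirroring the simultaneous (synchronous) tuning of the chapter. Simultaneous updates are not guaranteed to converge in general, which is where the chapter's analysis comes in; the matrices and the attenuation level gamma are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A  = np.array([[-1.0, 1.0], [0.0, -2.0]])     # stable plant, so K = L = 0 is admissible
B1 = np.array([[0.2], [0.2]])                 # disturbance input matrix
B2 = np.array([[0.0], [1.0]])                 # control input matrix
Q, R, gamma = np.eye(2), np.eye(1), 2.0       # gamma: H-infinity attenuation level

K = np.zeros((1, 2))                          # control policy  u = -K x
L = np.zeros((1, 2))                          # disturbance policy  w = L x

for it in range(50):
    Ac = A - B2 @ K + B1 @ L                  # closed loop under the current policy pair
    M  = Q + K.T @ R @ K - gamma**2 * L.T @ L # running cost of the pair
    P  = solve_continuous_lyapunov(Ac.T, -M)  # policy evaluation: Ac'P + P Ac + M = 0
    K  = np.linalg.solve(R, B2.T @ P)         # minimizing player update
    L  = (1.0 / gamma**2) * B1.T @ P          # maximizing player update

print("game value kernel P:\n", P)
print("saddle-point gains K, L:\n", K, "\n", L)
```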
10 Synchronous online learning for multiplayer non-zero-sum games
pp. 195–219 (25 pages)
This chapter shows how to solve multiplayer non-zero-sum (NZS) games online using novel adaptive control structures based on reinforcement learning. For the most part, interest in the control systems community has been in (non-cooperative) zero-sum games, which provide the solution of the H-infinity robust control problem. However, dynamic team games may have some cooperative objectives and some selfish objectives among the players. This cooperative/non-cooperative balance is captured in NZS games, as detailed herein.
11 Integral reinforcement learning for zero-sum two-player games
pp. 221–235 (15 pages)
In this chapter we present a continuous-time adaptive dynamic programming (ADP) procedure that uses the idea of integral reinforcement learning (IRL) to find online the Nash-equilibrium solution for the two-player zero-sum (ZS) differential game. We consider continuous-time (CT) linear dynamics of the form ẋ = Ax + B1w + B2u, where u(t), w(t) are the control actions of the two players, and an infinite-horizon quadratic cost. This work is from Vrabie and Lewis (2010).
Appendix A: Proofs
pp. 237–272 (36 pages)
Proofs for selected results from various chapters.
Back Matter
(1 page)