Abstract
The authors propose a novel reinforcement learning (RL) framework in which agent behaviour is governed by traditional control theory. This integrated approach, called time-in-action RL, makes RL applicable to many real-world systems whose underlying dynamics are known in a control-theoretic formalism. The key insight that facilitates this integration is to model an explicit time function, mapping a state-action pair to the time the underlying controller needs to accomplish the action. In their framework, the authors describe an action by its value (action value) and by the time it takes to perform (action time). The action value results from the RL policy given a state. The action time is estimated by an explicit time model learnt from the measured activities of the underlying controller. The RL value network is then trained with the embedded time model that predicts action time. The approach is tested on a variant of Atari Pong and proved to be convergent.
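As a rough illustration of the interface the abstract describes (not the authors' implementation; the names, and the use of a base discount raised to the predicted action time, are assumptions made here for illustration), a sketch might look like:

```python
# Hedged sketch of a time-in-action interface: an RL policy proposes an action
# value, a learnt time model predicts how long the controller needs to realise
# it, and the bootstrap target is discounted by that predicted time.
# All names and the discounting scheme here are illustrative assumptions.

BASE_GAMMA = 0.99  # assumed base discount per unit of time


def td_target(reward, next_state_value, predicted_action_time):
    """Bootstrap target whose discount depends on the predicted action time."""
    return reward + BASE_GAMMA ** predicted_action_time * next_state_value


def act(policy, time_model, state):
    """Hypothetical composition: policy -> action value, time model -> action time."""
    action_value = policy(state)                    # action value from the RL policy
    action_time = time_model(state, action_value)   # action time from the learnt time model
    return action_value, action_time
```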
6 References
1. Mnih V., Kavukcuoglu K., Silver D., et al: ‘Playing Atari with deep reinforcement learning’, 2013, arXiv preprint arXiv:1312.5602
2. Mnih V., Kavukcuoglu K., Silver D., et al: ‘Human-level control through deep reinforcement learning’, Nature, 2015, 518, (7540), pp. 529–533
3. Lillicrap T.P., Hunt J.J., Pritzel A., et al: ‘Continuous control with deep reinforcement learning’, 2015, arXiv preprint arXiv:1509.02971
4. Schulman J., Levine S., Abbeel P., et al: ‘Trust region policy optimization’. Int. Conf. Machine Learning, Lille, France, 2015, pp. 1889–1897
5. Schulman J., Wolski F., Dhariwal P., et al: ‘Proximal policy optimization algorithms’, 2017, arXiv preprint arXiv:1707.06347
6. Bellman R., Glicksberg I., and Gross O.: ‘On the ‘bang–bang’ control problem’, Q. Appl. Math., 1956, 14, (1), pp. 11–18
7. Åström K.J. and Hägglund T.: ‘PID controllers: theory, design, and tuning’, vol. 2, (Instrument Society of America, Research Triangle Park, NC, 1995)
8. Han J.: ‘From PID to active disturbance rejection control’, IEEE Trans. Ind. Electron., 2009, 56, (3), pp. 900–906
9. Kawato M., Furukawa K., and Suzuki R.: ‘A hierarchical neural-network model for control and learning of voluntary movement’, Biol. Cybern., 1987, 57, (3), pp. 169–185
10. Rosenbaum D.A., Kenny S.B., and Derr M.A.: ‘Hierarchical control of rapid movement sequences’, J. Exp. Psychol. Human Percept. Perform., 1983, 9, (1), p. 86
11. Ivry R.B. and Keele S.W.: ‘Timing functions of the cerebellum’, J. Cogn. Neurosci., 1989, 1, (2), pp. 136–152
12. Sutton R.S., Precup D., and Singh S.: ‘Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning’, Artif. Intell., 1999, 112, (1–2), pp. 181–211
13. Bradtke S.J. and Duff M.O.: ‘Reinforcement learning methods for continuous-time Markov decision problems’. Advances in Neural Information Processing Systems, Denver, USA, 1995, pp. 393–400
14. Sutton R.S.: ‘TD models: modeling the world at a mixture of time scales’. Proc. Int. Conf. on Machine Learning, Tahoe City, USA, 1995, pp. 531–539
15. Dietterich T.G.: ‘Hierarchical reinforcement learning with the MAXQ value function decomposition’, J. Artif. Intell. Res., 2000, 13, pp. 227–303
16. Hernandez-Gardiol N. and Mahadevan S.: ‘Hierarchical memory-based reinforcement learning’. Advances in Neural Information Processing Systems, Vancouver, Canada, 2001, pp. 1047–1053
17. Vezhnevets A.S., Osindero S., Schaul T., et al: ‘Feudal networks for hierarchical reinforcement learning’, 2017, arXiv preprint arXiv:1703.01161
18. Hauskrecht M., Meuleau N., Kaelbling L.P., et al: ‘Hierarchical solution of Markov decision processes using macro-actions’. Proc. 14th Conf. Uncertainty in Artificial Intelligence, Madison, USA, 1998, pp. 220–229
19. Durugkar I.P., Rosenbaum C., Dernbach S., et al: ‘Deep reinforcement learning with macro-actions’, 2016, arXiv preprint arXiv:1606.04615
20. Kulkarni T.D., Narasimhan K., Saeedi A., et al: ‘Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation’. Advances in Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 3675–3683
21. Arulkumaran K., Dilokthanakul N., Shanahan M., et al: ‘Classifying options for deep reinforcement learning’, 2016, arXiv preprint arXiv:1604.08153
22. Watkins C.J. and Dayan P.: ‘Q-learning’, Mach. Learn., 1992, 8, (3–4), pp. 279–292
23. Brockman G., Cheung V., Pettersson L., et al: ‘OpenAI Gym’, 2016, arXiv preprint arXiv:1606.01540
24. Bertsekas D.P.: ‘Dynamic programming and optimal control’, vol. II, (Athena Scientific, Belmont, MA, USA, 2007)
25. Jaakkola T., Jordan M.I., and Singh S.P.: ‘On the convergence of stochastic iterative dynamic programming algorithms’, Neural Comput., 1993, 6, (6), pp. 1185–1201
7 Appendix
7.1 Convergence analysis
Consider a special kind of infinite-horizon discounted MDP, defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $P(s' \mid s, a)$ is the transition probability distribution, $r(s, a)$ is the reward function, and $\gamma(s, a)$ is the discount factor function. Note that the discount factor here is a function of the current state $s$ and action $a$, rather than a constant.
Given an initial state $s_0$, we want to find a deterministic policy $\pi = \{\pi_0, \pi_1, \ldots\}$ that maximises the value function defined as below:
$$V^{\pi}(s_0) = \mathbb{E}\left[\sum_{k=0}^{\infty}\left(\prod_{i=0}^{k-1}\gamma(s_i, a_i)\right) r(s_k, a_k)\right],$$
where $a_k = \pi_k(s_k)$ for $k = 0, 1, 2, \ldots$. The time-varying policy defined above is called Markov because it does not involve dependence on states beyond the current and the initial. Actually, a more general definition of policy is to make it history dependent. In other words, a history-dependent policy allows the current action $a_k$ to depend on the entire past history $(s_0, a_0, s_1, a_1, \ldots, s_k)$. The adequacy of Markov policies says that the expected value function of any history-dependent policy can be replicated with a Markov policy in an MDP, which is proved in [24]. Therefore, we only consider Markov policies in what follows.
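For concreteness, the following minimal sketch (with made-up rewards and per-step discounts standing in for $r(s_k, a_k)$ and $\gamma(s_k, a_k)$) accumulates the return exactly as in the definition above:

```python
# Minimal sketch: discounted return with a per-step discount factor,
# matching the value function above. The trajectory data are made up.
rewards = [1.0, 0.5, -0.2, 2.0]        # r(s_k, a_k) along a trajectory
discounts = [0.9, 0.95, 0.8, 0.9]      # gamma(s_k, a_k) along the same trajectory

value = 0.0
running_discount = 1.0                 # prod_{i=0}^{k-1} gamma(s_i, a_i); empty product = 1
for r, g in zip(rewards, discounts):
    value += running_discount * r      # discount applied to the k-th reward
    running_discount *= g              # extend the product by gamma(s_k, a_k)

print(value)  # 1.0 + 0.9*0.5 + 0.9*0.95*(-0.2) + 0.9*0.95*0.8*2.0
```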
The optimal value function is defined by
$$V^{*}(s_0) = \sup_{\pi} V^{\pi}(s_0) = \sup_{\pi}\mathbb{E}\left[\sum_{k=0}^{\infty}\left(\prod_{i=0}^{k-1}\gamma(s_i, a_i)\right) r(s_k, a_k)\right],$$
where $a_k = \pi_k(s_k)$ for $k = 0, 1, 2, \ldots$. Note that if $r$ is bounded, the supremum does exist for any $s_0 \in \mathcal{S}$. However, it remains to be seen whether an optimal policy that attains the supremum exists for all $s_0$. What is more, we are especially interested in stationary deterministic policies, i.e. policies which are independent of the initial state and of time. Let $\mu$ denote a stationary deterministic policy, where the action is chosen as $a = \mu(s)$ directly. We will find that there exists a stationary optimal policy if some assumptions are made. To show this, we first prove some properties of the optimal value function $V^{*}$, and then we design a stationary policy which is optimal.
Assumption 1: The reward per stage satisfies
$$|r(s, a)| \leq M \quad \text{for all } (s, a) \in \mathcal{S} \times \mathcal{A},$$
where $M$ is some scalar. Furthermore, the discount factor per stage satisfies
$$0 < \gamma(s, a) \leq \bar{\gamma} < 1 \quad \text{for all } (s, a) \in \mathcal{S} \times \mathcal{A},$$
for some scalar $\bar{\gamma}$.
Boundedness of the reward per stage, as well as the condition on the bound of the discount factor per stage, is not as restrictive as it might appear. Assumption 1 automatically holds when the spaces $\mathcal{S}$ and $\mathcal{A}$ are finite sets. Now, it is easy to check that $V^{\pi}(s_0)$ is bounded for all $s_0$ with any given policy $\pi$, hence the existence of $V^{*}$ can be ensured.
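For instance, if the per-stage discount arises from a base discount raised to the action time, with action times bounded below (an assumed form used here only for illustration, consistent with the time-in-action setting), then the bound in Assumption 1 follows at once:
$$\gamma(s, a) = \gamma_0^{\tau(s, a)}, \quad 0 < \gamma_0 < 1, \quad \tau(s, a) \geq \tau_{\min} > 0 \;\;\Longrightarrow\;\; \gamma(s, a) \leq \gamma_0^{\tau_{\min}} =: \bar{\gamma} < 1.$$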
Actually, the optimal value function $V^{*}$ also satisfies the Bellman optimality equation, as in the conventional case, which we will prove in detail. Define the Bellman optimality operator $T$ by
$$(TV)(s) = \max_{a \in \mathcal{A}}\left[r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V(s')\right].$$
We denote by $T^{k}$ the composition of the mapping $T$ with itself $k$ times. The following proposition shows that any bounded function $V$ will finally converge to the optimal value function $V^{*}$, with $T$ applied infinitely many times.
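The operator $T$ can be sketched for the finite tabular case as follows; the array shapes for $P$, $r$ and $\gamma$ are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def bellman_optimality_operator(V, P, r, gamma):
    """One application of the operator T with a state-action-dependent discount.

    P[s, a, s2] : transition probabilities
    r[s, a]     : reward per stage
    gamma[s, a] : discount factor per stage
    V[s]        : current value estimate
    """
    # Q_V(s, a) = r(s, a) + gamma(s, a) * sum_{s'} P(s'|s, a) V(s')
    Q = r + gamma * np.einsum("sap,p->sa", P, V)
    return Q.max(axis=1)  # (T V)(s) = max_a Q_V(s, a)

# Repeated application T^K V converges to V* (see Proposition 1 below):
# for _ in range(K): V = bellman_optimality_operator(V, P, r, gamma)
```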
Proposition 1: Given any bounded function $V : \mathcal{S} \to \mathbb{R}$, the optimal value function satisfies
$$\left|V^{*}(s) - (T^{K}V)(s)\right| \leq \frac{\bar{\gamma}^{K} M}{1 - \bar{\gamma}} + \bar{\gamma}^{K}\max_{s' \in \mathcal{S}}\left|V(s')\right|$$
for any positive integer $K$ and $s \in \mathcal{S}$. Furthermore,
$$V^{*}(s) = \lim_{K \to \infty}(T^{K}V)(s) \quad \text{for all } s \in \mathcal{S}.$$
Proof: For any initial state $s_0$, policy $\pi$ and positive integer $K$, we partition $V^{\pi}(s_0)$ into two parts, the first $K$ stages and the remaining stages:
$$V^{\pi}(s_0) = \mathbb{E}\left[\sum_{k=0}^{K-1}\left(\prod_{i=0}^{k-1}\gamma(s_i, a_i)\right) r(s_k, a_k)\right] + \mathbb{E}\left[\sum_{k=K}^{\infty}\left(\prod_{i=0}^{k-1}\gamma(s_i, a_i)\right) r(s_k, a_k)\right],$$
where $a_k = \pi_k(s_k)$. Since, by Assumption 1, we have $|r(s, a)| \leq M$ and $\gamma(s, a) \leq \bar{\gamma} < 1$, we also obtain
$$\left|\mathbb{E}\left[\sum_{k=K}^{\infty}\left(\prod_{i=0}^{k-1}\gamma(s_i, a_i)\right) r(s_k, a_k)\right]\right| \leq \sum_{k=K}^{\infty}\bar{\gamma}^{k} M = \frac{\bar{\gamma}^{K} M}{1 - \bar{\gamma}}.$$
Using these relations, it follows that:
$$V^{\pi}(s_0) - \frac{\bar{\gamma}^{K} M}{1 - \bar{\gamma}} \leq \mathbb{E}\left[\sum_{k=0}^{K-1}\left(\prod_{i=0}^{k-1}\gamma(s_i, a_i)\right) r(s_k, a_k)\right] \leq V^{\pi}(s_0) + \frac{\bar{\gamma}^{K} M}{1 - \bar{\gamma}}$$
and
$$\left|\mathbb{E}\left[\left(\prod_{i=0}^{K-1}\gamma(s_i, a_i)\right) V(s_K)\right]\right| \leq \bar{\gamma}^{K} m.$$
Note that
$$\sup_{\pi}\mathbb{E}\left[\left(\prod_{i=0}^{K-1}\gamma(s_i, a_i)\right) V(s_K) + \sum_{k=0}^{K-1}\left(\prod_{i=0}^{k-1}\gamma(s_i, a_i)\right) r(s_k, a_k)\right] = (T^{K}V)(s_0),$$
we have
$$V^{\pi}(s_0) - \frac{\bar{\gamma}^{K} M}{1 - \bar{\gamma}} - \bar{\gamma}^{K} m \leq \mathbb{E}\left[\left(\prod_{i=0}^{K-1}\gamma(s_i, a_i)\right) V(s_K) + \sum_{k=0}^{K-1}\left(\prod_{i=0}^{k-1}\gamma(s_i, a_i)\right) r(s_k, a_k)\right]$$
and
$$\mathbb{E}\left[\left(\prod_{i=0}^{K-1}\gamma(s_i, a_i)\right) V(s_K) + \sum_{k=0}^{K-1}\left(\prod_{i=0}^{k-1}\gamma(s_i, a_i)\right) r(s_k, a_k)\right] \leq V^{\pi}(s_0) + \frac{\bar{\gamma}^{K} M}{1 - \bar{\gamma}} + \bar{\gamma}^{K} m,$$
where $m = \max_{s \in \mathcal{S}}|V(s)|$. By taking the supremum over $\pi$, we obtain, for any positive integer $K$ and $s_0 \in \mathcal{S}$,
$$V^{*}(s_0) - \frac{\bar{\gamma}^{K} M}{1 - \bar{\gamma}} - \bar{\gamma}^{K} m \leq (T^{K}V)(s_0) \leq V^{*}(s_0) + \frac{\bar{\gamma}^{K} M}{1 - \bar{\gamma}} + \bar{\gamma}^{K} m.$$
Now that $\bar{\gamma} < 1$, by taking the limit as $K \to \infty$, the results follow. □
Lemma 1: For any functions $V_1 : \mathcal{S} \to \mathbb{R}$ and $V_2 : \mathcal{S} \to \mathbb{R}$, such that
$$V_1(s) \leq V_2(s) \quad \text{for all } s \in \mathcal{S},$$
we have
$$(TV_1)(s) \leq (TV_2)(s) \quad \text{for all } s \in \mathcal{S}.$$
Proof: Let
$$a_1 = \arg\max_{a \in \mathcal{A}}\left[r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V_1(s')\right],$$
we have
$$(TV_1)(s) = r(s, a_1) + \gamma(s, a_1)\sum_{s' \in \mathcal{S}} P(s' \mid s, a_1)\, V_1(s') \leq r(s, a_1) + \gamma(s, a_1)\sum_{s' \in \mathcal{S}} P(s' \mid s, a_1)\, V_2(s') \leq (TV_2)(s),$$
hence $(TV_1)(s) \leq (TV_2)(s)$ for all $s \in \mathcal{S}$. □
Lemma 2: Let $e$ denote the unit function that takes the value $1$ identically on $\mathcal{S}$, i.e. $e(s) = 1$ for all $s \in \mathcal{S}$. For any $s \in \mathcal{S}$, any non-negative scalar $d$, and any value function $V$, we have
$$(T(V + d\,e))(s) \leq (TV)(s) + \bar{\gamma}\, d \quad \text{and} \quad (T(V - d\,e))(s) \geq (TV)(s) - \bar{\gamma}\, d.$$
Proof: As for the first inequality, we have
$$(T(V + d\,e))(s) = \max_{a \in \mathcal{A}}\left[r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\,\bigl(V(s') + d\bigr)\right] \leq \max_{a \in \mathcal{A}}\left[r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V(s')\right] + \bar{\gamma}\, d,$$
hence $(T(V + d\,e))(s) \leq (TV)(s) + \bar{\gamma}\, d$ holds for all $s \in \mathcal{S}$. As for the second inequality, $(T(V - d\,e))(s) \geq \max_{a \in \mathcal{A}}\left[r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V(s')\right] - \bar{\gamma}\, d$, hence $(T(V - d\,e))(s) \geq (TV)(s) - \bar{\gamma}\, d$ for all $s \in \mathcal{S}$. □
Now, we are ready to show that the optimal value function $V^{*}$ also satisfies the Bellman optimality equation, as in the conventional case. What is more, $V^{*}$ is the unique solution.
Theorem 1: The optimal value function $V^{*}$ satisfies the Bellman optimality equation
$$V^{*}(s) = \max_{a \in \mathcal{A}}\left[r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^{*}(s')\right] \quad \text{for all } s \in \mathcal{S},$$
or, equivalently,
$$V^{*} = TV^{*}.$$
Furthermore, $V^{*}$ is the unique solution of this equation within the class of bounded functions.
Proof: By Proposition 1, we have, for all $s \in \mathcal{S}$ and positive integer $K$,
$$V^{*}(s) - \frac{\bar{\gamma}^{K} M}{1 - \bar{\gamma}} \leq (T^{K}\mathbf{0})(s) \leq V^{*}(s) + \frac{\bar{\gamma}^{K} M}{1 - \bar{\gamma}},$$
where $\mathbf{0}$ is the zero function, $\mathbf{0}(s) = 0$ for any $s \in \mathcal{S}$. Applying the Bellman optimality operator $T$ to this inequality and using Lemma 1 as well as Lemma 2, we obtain, for all $s \in \mathcal{S}$ and positive integer $K$,
$$(TV^{*})(s) - \frac{\bar{\gamma}^{K+1} M}{1 - \bar{\gamma}} \leq (T^{K+1}\mathbf{0})(s) \leq (TV^{*})(s) + \frac{\bar{\gamma}^{K+1} M}{1 - \bar{\gamma}}.$$
By taking the limit as $K \to \infty$ and using the fact, established in Proposition 1, that
$$\lim_{K \to \infty}(T^{K+1}\mathbf{0})(s) = V^{*}(s),$$
we obtain
$$(TV^{*})(s) = V^{*}(s) \quad \text{for all } s \in \mathcal{S}.$$
To show uniqueness, observe that if $V$ is bounded and satisfies $V = TV$, then $V = T^{K}V$ for every positive integer $K$, so by Proposition 1, we have $V = V^{*}$. □
Now that the existence and uniqueness of $V^{*}$ have been proven, we define the optimal state-action-value function by
$$Q^{*}(s, a) = r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^{*}(s').$$
The Bellman optimality equation can also be written as
$$V^{*}(s) = \max_{a \in \mathcal{A}} Q^{*}(s, a).$$
Obviously, the stationary policy defined by
$$\mu^{*}(s) = \arg\max_{a \in \mathcal{A}} Q^{*}(s, a)$$
is optimal, because $V^{\mu^{*}}$ definitely satisfies the Bellman optimality equation, thus $V^{\mu^{*}} = V^{*}$. We can conclude that there exists an optimal policy which is stationary. Note that the existence and uniqueness of $V^{*}$ ensure those of $Q^{*}$ and $\mu^{*}$.
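Continuing the tabular sketch above (same assumed array shapes; illustrative only), $Q^{*}$ and the greedy stationary policy $\mu^{*}$ can be read off as:

```python
import numpy as np

def q_from_v(V_star, P, r, gamma):
    """Q*(s, a) = r(s, a) + gamma(s, a) * sum_{s'} P(s'|s, a) V*(s')."""
    return r + gamma * np.einsum("sap,p->sa", P, V_star)

def greedy_policy(Q_star):
    """mu*(s) = argmax_a Q*(s, a): a stationary deterministic optimal policy."""
    return Q_star.argmax(axis=1)
```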
Now, we proceed to show that the Q-learning algorithm still works in our special kind of infinite-horizon discounted MDP. Define the operator $H$ by
$$(HQ)(s, a) = r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\max_{a' \in \mathcal{A}} Q(s', a'),$$
where $Q$ is any state-action-value function.
Lemma 3: The optimal state-action-value function $Q^{*}$ is the equilibrium of the operator $H$, i.e.
$$HQ^{*} = Q^{*}.$$
Proof: By Theorem 1, we have $V^{*}(s') = \max_{a' \in \mathcal{A}} Q^{*}(s', a')$; then, for all $(s, a) \in \mathcal{S} \times \mathcal{A}$,
$$(HQ^{*})(s, a) = r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\max_{a' \in \mathcal{A}} Q^{*}(s', a') = r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, V^{*}(s') = Q^{*}(s, a).$$
In short, $HQ^{*} = Q^{*}$. □
Lemma 4: The operator $H$ is a contraction with respect to the maximum norm $\|\cdot\|_{\infty}$, i.e.
$$\|HQ - HQ'\|_{\infty} \leq \bar{\gamma}\, \|Q - Q'\|_{\infty}$$
for any state-action-value functions $Q$ and $Q'$.
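For completeness, a sketch of the standard argument (it parallels the constant-discount case and uses $\left|\max_{a} f(a) - \max_{a} g(a)\right| \leq \max_{a}\left|f(a) - g(a)\right|$): for every $(s, a)$,
$$\left|(HQ)(s, a) - (HQ')(s, a)\right| = \gamma(s, a)\left|\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\left(\max_{a' \in \mathcal{A}} Q(s', a') - \max_{a' \in \mathcal{A}} Q'(s', a')\right)\right| \leq \bar{\gamma}\,\|Q - Q'\|_{\infty},$$
and taking the maximum over $(s, a)$ gives the claim.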
Theorem 2: The random process $\{\Delta_t\}$ taking values in $\mathbb{R}^{n}$ and defined as
$$\Delta_{t+1}(x) = (1 - \alpha_t(x))\,\Delta_t(x) + \alpha_t(x)\, F_t(x)$$
converges to zero with probability 1 if the following conditions are satisfied:
• $0 \leq \alpha_t(x) \leq 1$, $\sum_t \alpha_t(x) = \infty$ and $\sum_t \alpha_t^{2}(x) < \infty$;
• $\left\|\mathbb{E}[F_t(x) \mid \mathcal{F}_t]\right\|_{W} \leq \kappa\,\|\Delta_t\|_{W}$, with $\kappa < 1$;
• $\mathrm{var}[F_t(x) \mid \mathcal{F}_t] \leq C\,(1 + \|\Delta_t\|_{W}^{2})$, for some $C > 0$;
where $\mathcal{F}_t$ denotes the history of the process up to time $t$ and $\|\cdot\|_{W}$ denotes a weighted maximum norm.
Proof: See [25]. □
Theorem 3: As for the special infinite-horizon discounted MDP defined above, the Q-learning algorithm, given by the update rule
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left[r_t + \gamma(s_t, a_t)\max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)\right],$$
converges with probability 1 to the optimal state-action-value function $Q^{*}$ as long as
$$\sum_t \alpha_t(s, a) = \infty \quad \text{and} \quad \sum_t \alpha_t^{2}(s, a) < \infty$$
for all $(s, a) \in \mathcal{S} \times \mathcal{A}$.
Proof: The update rule of Q-learning can be written as
$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left[r_t + \gamma(s_t, a_t)\max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a')\right].$$
Let $\Delta_t(s, a) = Q_t(s, a) - Q^{*}(s, a)$, the equation above yields
$$\Delta_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\,\Delta_t(s_t, a_t) + \alpha_t(s_t, a_t)\left[r_t + \gamma(s_t, a_t)\max_{a' \in \mathcal{A}} Q_t(s_{t+1}, a') - Q^{*}(s_t, a_t)\right].$$
If we write
$$F_t(s, a) = r(s, a) + \gamma(s, a)\max_{a' \in \mathcal{A}} Q_t(s', a') - Q^{*}(s, a),$$
where $s'$ is a random sample of the next state drawn from $P(\cdot \mid s, a)$, we have
$$\mathbb{E}[F_t(s, a) \mid \mathcal{F}_t] = r(s, a) + \gamma(s, a)\sum_{s' \in \mathcal{S}} P(s' \mid s, a)\max_{a' \in \mathcal{A}} Q_t(s', a') - Q^{*}(s, a) = (HQ_t)(s, a) - Q^{*}(s, a).$$
By Lemma 3, $(HQ_t)(s, a) - Q^{*}(s, a) = (HQ_t)(s, a) - (HQ^{*})(s, a)$. By Lemma 4, we have
$$\left\|\mathbb{E}[F_t(s, a) \mid \mathcal{F}_t]\right\|_{\infty} \leq \bar{\gamma}\,\|Q_t - Q^{*}\|_{\infty} = \bar{\gamma}\,\|\Delta_t\|_{\infty}.$$
Moreover,
$$\mathrm{var}[F_t(s, a) \mid \mathcal{F}_t] = \mathbb{E}\left[\left(r(s, a) + \gamma(s, a)\max_{a' \in \mathcal{A}} Q_t(s', a') - (HQ_t)(s, a)\right)^{2} \,\middle|\, \mathcal{F}_t\right].$$
Owing to Assumption 1, the term $r(s, a)$ is bounded; obviously, we have
$$\mathrm{var}[F_t(s, a) \mid \mathcal{F}_t] \leq C\,(1 + \|\Delta_t\|_{\infty}^{2})$$
for some constant $C > 0$. Finally, by Theorem 2, $\Delta_t$ converges to zero with probability 1, i.e. $Q_t$ converges to $Q^{*}$ with probability 1. □
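To make the update rule analysed in Theorem 3 concrete, here is a minimal tabular sketch; the array layout and the way $\gamma(s, a)$ is looked up are illustrative assumptions:

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, gamma, alpha):
    """One Q-learning update with a state-action-dependent discount.

    Q[s, a]     : current estimate of Q*
    gamma[s, a] : per-stage discount factor
    alpha       : step size for this visit to (s, a)
    """
    td_target = r + gamma[s, a] * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```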
Copyright
This is an open access article published by the IET and Zhejiang University Press under the Creative Commons Attribution-NonCommercial-NoDerivs License (http://creativecommons.org/licenses/by-nc-nd/3.0/)
History
Received: 14 December 2018
Revision received: 30 January 2019
Accepted: 07 February 2019
Published ahead of print: 13 February 2019
Published online: 01 June 2019
Published in print: June 2019
Funding Information
National Natural Science Foundation of China: 61473253