
Abstract

The authors propose a novel reinforcement learning (RL) framework in which agent behaviour is governed by traditional control theory. This integrated approach, called time-in-action RL, makes RL applicable to many real-world systems whose underlying dynamics are known in their control-theoretic formalism. The key insight that facilitates this integration is to model an explicit time function, which maps a state-action pair to the time the underlying controller takes to accomplish the action. In their framework, an action is described by its value (action value) and the time it takes to perform (action time). The action value is produced by the RL policy for a given state, while the action time is estimated by an explicit time model learnt from measured activities of the underlying controller. The RL value network is then trained with the embedded time model to predict action time. The approach is tested on a variant of Atari Pong and is proven to be convergent.
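As a rough illustration of the idea (an assumption for exposition, not the authors' stated formulation), a learnt time model can set a per-action discount by exponentiating a base discount with the predicted action time, which is exactly the kind of state-action-dependent discount $\gamma(s, a)$ analysed in the appendix. The function and constant names below are hypothetical placeholders.

```python
import numpy as np

# Hypothetical sketch: a learnt time model tau_hat(s, a) supplies the duration of
# each action, and the effective discount becomes GAMMA_BASE ** tau_hat(s, a).
GAMMA_BASE = 0.99  # assumed per-unit-time discount

def time_model(state, action):
    """Placeholder for the learnt time model: predicted action time (illustrative only)."""
    return 1.0 + 0.1 * abs(action)

def td_target(reward, state, action, next_q_values):
    """One-step target with a time-dependent discount gamma(s, a) = GAMMA_BASE ** tau(s, a)."""
    gamma_sa = GAMMA_BASE ** time_model(state, action)
    return reward + gamma_sa * np.max(next_q_values)
```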

6 References

1. Mnih V., Kavukcuoglu K., Silver D., et al: 'Playing Atari with deep reinforcement learning', 2013, arXiv preprint arXiv:1312.5602
2. Mnih V., Kavukcuoglu K., Silver D., et al: 'Human-level control through deep reinforcement learning', Nature, 2015, 518, (7540), pp. 529–533
3. Lillicrap T.P., Hunt J.J., Pritzel A., et al: 'Continuous control with deep reinforcement learning', 2015, arXiv preprint arXiv:1509.02971
4. Schulman J., Levine S., Abbeel P., et al: 'Trust region policy optimization'. Int. Conf. on Machine Learning, Lille, France, 2015, pp. 1889–1897
5. Schulman J., Wolski F., Dhariwal P., et al: 'Proximal policy optimization algorithms', 2017, arXiv preprint arXiv:1707.06347
6. Bellman R., Glicksberg I., and Gross O.: 'On the ‘bang–bang’ control problem', Q. Appl. Math., 1956, 14, (1), pp. 11–18
7. Åström K.J. and Hägglund T.: 'PID controllers: theory, design, and tuning', vol. 2 (Instrument Society of America, Research Triangle Park, NC, 1995)
8. Han J.: 'From PID to active disturbance rejection control', IEEE Trans. Ind. Electron., 2009, 56, (3), pp. 900–906
9. Kawato M., Furukawa K., and Suzuki R.: 'A hierarchical neural-network model for control and learning of voluntary movement', Biol. Cybern., 1987, 57, (3), pp. 169–185
10. Rosenbaum D.A., Kenny S.B., and Derr M.A.: 'Hierarchical control of rapid movement sequences', J. Exp. Psychol. Human Percept. Perform., 1983, 9, (1), p. 86
11. Ivry R.B. and Keele S.W.: 'Timing functions of the cerebellum', J. Cogn. Neurosci., 1989, 1, (2), pp. 136–152
12. Sutton R.S., Precup D., and Singh S.: 'Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning', Artif. Intell., 1999, 112, (1–2), pp. 181–211
13. Bradtke S.J. and Duff M.O.: 'Reinforcement learning methods for continuous-time Markov decision problems'. Advances in Neural Information Processing Systems, Denver, USA, 1995, pp. 393–400
14. Sutton R.S.: 'TD models: modeling the world at a mixture of time scales'. Proc. Int. Conf. on Machine Learning, Tahoe City, USA, 1995, pp. 531–539
15. Dietterich T.G.: 'Hierarchical reinforcement learning with the MAXQ value function decomposition', J. Artif. Intell. Res., 2000, 13, pp. 227–303
16. Hernandez-Gardiol N. and Mahadevan S.: 'Hierarchical memory-based reinforcement learning'. Advances in Neural Information Processing Systems, Vancouver, Canada, 2001, pp. 1047–1053
17. Vezhnevets A.S., Osindero S., Schaul T., et al: 'Feudal networks for hierarchical reinforcement learning', 2017, arXiv preprint arXiv:1703.01161
18. Hauskrecht M., Meuleau N., Kaelbling L.P., et al: 'Hierarchical solution of Markov decision processes using macro-actions'. Proc. 14th Conf. on Uncertainty in Artificial Intelligence, Madison, USA, 1998, pp. 220–229
19. Durugkar I.P., Rosenbaum C., Dernbach S., et al: 'Deep reinforcement learning with macro-actions', 2016, arXiv preprint arXiv:1606.04615
20. Kulkarni T.D., Narasimhan K., Saeedi A., et al: 'Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation'. Advances in Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 3675–3683
21. Arulkumaran K., Dilokthanakul N., Shanahan M., et al: 'Classifying options for deep reinforcement learning', 2016, arXiv preprint arXiv:1604.08153
22. Watkins C.J. and Dayan P.: 'Q-learning', Mach. Learn., 1992, 8, (3–4), pp. 279–292
23. Brockman G., Cheung V., Pettersson L., et al: 'OpenAI Gym', 2016, arXiv preprint arXiv:1606.01540
24. Bertsekas D.P.: 'Dynamic programming and optimal control', vol. II (Athena Scientific, Belmont, MA, USA, 2007)
25. Jaakkola T., Jordan M.I., and Singh S.P.: 'On the convergence of stochastic iterative dynamic programming algorithms', Neural Comput., 1994, 6, (6), pp. 1185–1201

7 Appendix

7.1 Convergence analysis

Consider a special kind of infinite-horizon discounted MDP, defined by the tuple $(S, A, P, r, \gamma)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $P: S \times A \times S \to \mathbb{R}$ is the transition probability distribution, $r: S \times A \times S \to \mathbb{R}$ is the reward function, and $\gamma: S \times A \to [0, 1)$ is the discount factor function. Note that the discount factor here is a function of the current state $s_t$ and action $a_t$, rather than a constant.
Given an initial state $s_0$, we want to find a deterministic policy $\tilde\pi: S \times S \times \mathbb{N} \to A$ that maximises the value function $V^{\tilde\pi}: S \to \mathbb{R}$ defined below:
$$V^{\tilde\pi}(s) := \mathbb{E}_{a_0, s_1, \ldots}\left[\sum_{t=0}^{\infty} \prod_{m=0}^{t-1} \gamma(s_m, a_m)\, r_t \,\middle|\, s_0 = s\right],$$
where $r_t := r(s_t, a_t, s_{t+1})$, $a_t = \tilde\pi(s_t, s_0, t)$ and $s_{t+1} \sim P(\,\cdot\,|s_t, a_t)$ for $t \ge 0$. The time-varying policy defined above is called Markov because it does not depend on states beyond the current and the initial ones. A more general definition of policy is history dependent: a history-dependent policy allows the current action $a_t$ to depend on the entire past history $s_0, a_0, \ldots, a_{t-1}, s_t$. The adequacy of Markov policies states that the expected value function of any history-dependent policy can be replicated with a Markov policy in an MDP, which is proved in [24]. Therefore, we only consider Markov policies in what follows.
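As a quick sanity check (an illustrative special case, not taken from the original text), when the discount function is constant, $\gamma(s, a) \equiv \gamma_0$, the product of per-stage discounts collapses to the usual geometric weighting and the definition reduces to the conventional discounted return:
$$\prod_{m=0}^{t-1} \gamma(s_m, a_m) = \gamma_0^{t} \quad\Longrightarrow\quad V^{\tilde\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma_0^{t}\, r_t \,\middle|\, s_0 = s\right].$$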
An optimal value function $V^*$ is defined by
$$V^*(s) := \sup_{\tilde\pi} V^{\tilde\pi}(s) = \sup_{\tilde\pi} \mathbb{E}_{a_0, s_1, \ldots}\left[\sum_{t=0}^{\infty} \prod_{m=0}^{t-1} \gamma(s_m, a_m)\, r_t \,\middle|\, s_0 = s\right], \quad \forall s \in S,$$
where $a_t = \tilde\pi(s_t, s_0, t)$ and $s_{t+1} \sim P(\,\cdot\,|s_t, a_t)$ for $t \ge 0$. Note that if $V^{\tilde\pi}$ is bounded, the supremum exists for every $s \in S$. However, it remains to be seen whether an optimal policy $\tilde\pi$ that attains the supremum exists for all $s \in S$. Moreover, we are especially interested in stationary deterministic policies, i.e. policies that are independent of the initial state and of time. Let $\pi: S \to A$ denote a stationary deterministic policy, where the action is chosen directly as $a_t = \pi(s_t)$. We will show that a stationary optimal policy exists under suitable assumptions. To this end, we first prove some properties of the optimal value function $V^*$, and then construct a stationary policy that is optimal.
Assumption 1
The reward per stage satisfies
$$|r(s, a, s')| \le M, \quad \forall (s, a, s') \in S \times A \times S,$$
where $M$ is some scalar. Furthermore,
$$\exists\, \xi < 1 \ \text{such that}\ \gamma(s, a) \le \xi, \quad \forall (s, a) \in S \times A.$$
Boundedness of the reward per stage, together with the uniform bound $\xi$ on the discount factor per stage, is not as restrictive as it might appear. Assumption 1 automatically holds when the spaces $S$ and $A$ are finite sets. It is now easy to check that $V^{\tilde\pi}(s)$ is bounded for all $s \in S$ under any given $\tilde\pi$, so the existence of $V^*$ is ensured.
In fact, the optimal value function $V^*$ also satisfies the Bellman optimality equation, as in the conventional case, which we prove in detail. Define the Bellman optimality operator $T: \mathbb{R}^S \to \mathbb{R}^S$ by
$$(TV)(s) := \max_{a \in A} \sum_{s'} P(s'|s, a)\big(r(s, a, s') + \gamma(s, a) V(s')\big), \quad \forall s \in S.$$
We denote by $T^k$ the composition of the mapping $T$ with itself $k$ times. The following proposition shows that, for any bounded function $V$, repeated application of $T$ converges to the optimal value function $V^*$.
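To make the operator concrete, here is a minimal tabular value-iteration sketch (an illustrative assumption, not code from the paper) for a finite MDP with a state-action-dependent discount. The array names P, R and GAMMA are placeholders: P[s, a, s'] is the transition probability, R[s, a, s'] the reward, and GAMMA[s, a] the per-stage discount, assumed bounded by $\xi < 1$.

```python
import numpy as np

def bellman_operator(V, P, R, GAMMA):
    """Apply (TV)(s) = max_a sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma(s,a) * V(s'))."""
    # Q[s, a] = sum_{s'} P(s'|s,a) * (R[s,a,s'] + GAMMA[s,a] * V[s'])
    Q = np.einsum('sat,sat->sa', P, R + GAMMA[:, :, None] * V[None, None, :])
    return Q.max(axis=1)

def value_iteration(P, R, GAMMA, tol=1e-8, max_iter=10_000):
    """Iterate V <- TV until convergence; Proposition 1 says the limit is V*."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        V_new = bellman_operator(V, P, R, GAMMA)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V
```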
Proposition 1
Given any bounded function $V_0: S \to \mathbb{R}$, the optimal value function satisfies
$$V^*(s_0) - \frac{\xi^K M}{1 - \xi} - \xi^K \max_{s \in S} |V_0(s)| \le (T^K V_0)(s_0) \le V^*(s_0) + \frac{\xi^K M}{1 - \xi} + \xi^K \max_{s \in S} |V_0(s)|$$
for any positive integer $K$ and $s_0 \in S$. Furthermore,
$$V^*(s) = \lim_{N \to \infty} (T^N V_0)(s), \quad \forall s \in S.$$
Proof
For any initial state $s_0 \in S$, policy $\tilde\pi$ and positive integer $K$, we partition $V^{\tilde\pi}$ into two parts, the first $K$ stages and the remaining stages:
$$V^{\tilde\pi}(s_0) = \mathbb{E}_{a_0, s_1, \ldots}\left[\sum_{t=0}^{\infty} \prod_{m=0}^{t-1} \gamma(s_m, a_m)\, r_t\right] = \mathbb{E}_{a_0, s_1, \ldots, s_K}\left[\sum_{t=0}^{K-1} \prod_{m=0}^{t-1} \gamma(s_m, a_m)\, r_t\right] + \mathbb{E}_{a_K, s_{K+1}, \ldots}\left[\sum_{t=K}^{\infty} \prod_{m=0}^{t-1} \gamma(s_m, a_m)\, r_t\right],$$
where $a_t = \tilde\pi(s_t, s_0, t)$ and $s_{t+1} \sim P(\,\cdot\,|s_t, a_t)$. Since, by Assumption 1, $|r_t| = |r(s_t, a_t, s_{t+1})| \le M$ and $\gamma(s, a) \le \xi < 1$, we obtain
$$\left|\mathbb{E}_{a_K, s_{K+1}, \ldots}\left[\sum_{t=K}^{\infty} \prod_{m=0}^{t-1} \gamma(s_m, a_m)\, r_t\right]\right| \le M \sum_{t=K}^{\infty} \xi^t = \frac{\xi^K M}{1 - \xi}.$$
Using these relations, it follows that
$$V^{\tilde\pi}(s_0) - \frac{\xi^K M}{1 - \xi} \le \mathbb{E}_{a_0, s_1, \ldots, s_K}\left[\sum_{t=0}^{K-1} \prod_{m=0}^{t-1} \gamma(s_m, a_m)\, r_t\right]$$
and
$$V^{\tilde\pi}(s_0) + \frac{\xi^K M}{1 - \xi} \ge \mathbb{E}_{a_0, s_1, \ldots, s_K}\left[\sum_{t=0}^{K-1} \prod_{m=0}^{t-1} \gamma(s_m, a_m)\, r_t\right].$$
Noting that
$$\left|\prod_{m=0}^{K-1} \gamma(s_m, a_m)\, V_0(s_K)\right| \le \xi^K \max_{s \in S} |V_0(s)|,$$
we have
$$V^{\tilde\pi}(s_0) - \frac{\xi^K M}{1 - \xi} - \xi^K \max_{s \in S} |V_0(s)| \le \mathbb{E}_{a_0, s_1, \ldots, s_K}\left[\sum_{t=0}^{K-1} \prod_{m=0}^{t-1} \gamma(s_m, a_m)\, r_t + \prod_{m=0}^{K-1} \gamma(s_m, a_m)\, V_0(s_K)\right]$$
and
$$V^{\tilde\pi}(s_0) + \frac{\xi^K M}{1 - \xi} + \xi^K \max_{s \in S} |V_0(s)| \ge \mathbb{E}_{a_0, s_1, \ldots, s_K}\left[\sum_{t=0}^{K-1} \prod_{m=0}^{t-1} \gamma(s_m, a_m)\, r_t + \prod_{m=0}^{K-1} \gamma(s_m, a_m)\, V_0(s_K)\right],$$
where $a_t = \tilde\pi(s_t, s_0, t)$ and $s_{t+1} \sim P(\,\cdot\,|s_t, a_t)$. The expectation on the right-hand side is the expected $K$-stage value with terminal cost $V_0$, so taking the supremum over $\tilde\pi$ yields, for any positive integer $K$ and $s_0 \in S$,
$$V^*(s_0) - \frac{\xi^K M}{1 - \xi} - \xi^K \max_{s \in S} |V_0(s)| \le (T^K V_0)(s_0) \le V^*(s_0) + \frac{\xi^K M}{1 - \xi} + \xi^K \max_{s \in S} |V_0(s)|.$$
Since $0 < \xi < 1$, taking the limit as $K \to \infty$ yields the result. □
Lemma 1
For any functions $V: S \to \mathbb{R}$ and $\bar V: S \to \mathbb{R}$ such that
$$V(s) \le \bar V(s), \quad \forall s \in S,$$
we have
$$(T^k V)(s) \le (T^k \bar V)(s), \quad \forall s \in S,\ \forall k \in \mathbb{Z}^+.$$
Proof
Let
$$a^\star = \arg\max_{a \in A} \sum_{s'} P(s'|s, a)\big(r(s, a, s') + \gamma(s, a) V(s')\big).$$
Then
$$(TV)(s) = \sum_{s'} P(s'|s, a^\star)\big(r(s, a^\star, s') + \gamma(s, a^\star) V(s')\big) \le \sum_{s'} P(s'|s, a^\star)\big(r(s, a^\star, s') + \gamma(s, a^\star) \bar V(s')\big) \le \max_{a \in A} \sum_{s'} P(s'|s, a)\big(r(s, a, s') + \gamma(s, a) \bar V(s')\big) = (T\bar V)(s),$$
hence $TV \le T\bar V$ pointwise, and applying this argument repeatedly gives $(T^k V)(s) \le (T^k \bar V)(s)$. □
Lemma 2
Let $e: S \to \mathbb{R}$ denote the unit function that takes the value 1 identically on $S$, i.e. $e(s) = 1$ for all $s \in S$. For any non-negative scalar $d$ and any value function $V: S \to \mathbb{R}$, we have
$$(T(V + de))(s) \le (TV)(s) + \xi d, \quad \forall s \in S,$$
$$(T(V - de))(s) \ge (TV)(s) - \xi d, \quad \forall s \in S.$$
Proof
For the first inequality, we have
$$(T(V + de))(s) = \max_{a \in A} \sum_{s'} P(s'|s, a)\big(r(s, a, s') + \gamma(s, a)(V(s') + d\,e(s'))\big) = \max_{a \in A} \sum_{s'} P(s'|s, a)\big(r(s, a, s') + \gamma(s, a)(V(s') + d)\big) = \max_{a \in A}\left[\gamma(s, a)\, d + \sum_{s'} P(s'|s, a)\big(r(s, a, s') + \gamma(s, a) V(s')\big)\right] \le \max_{a \in A} \sum_{s'} P(s'|s, a)\big(r(s, a, s') + \gamma(s, a) V(s')\big) + \max_{a \in A} \gamma(s, a)\, d \le (TV)(s) + \xi d, \quad (12)$$
hence $(T(V + de))(s) \le (TV)(s) + \xi d$ holds for all $s \in S$. For the second inequality, applying the first one to $V - de$ gives $(T(V - de + de))(s) \le (T(V - de))(s) + \xi d$, hence $(T(V - de))(s) \ge (TV)(s) - \xi d$ for all $s \in S$. □
Now, we are ready to show that the optimal value function $V^*$ also satisfies the Bellman optimality equation, as in the conventional case. Moreover, $V^*$ is the unique bounded solution of this equation.
Theorem 1
The optimal value function $V^*$ satisfies the Bellman optimality equation
$$V^*(s) = \max_{a \in A} \sum_{s'} P(s'|s, a)\big(r(s, a, s') + \gamma(s, a) V^*(s')\big),$$
or, equivalently,
$$TV^* = V^*.$$
Furthermore, V* is the unique solution of this equation within the class of bounded functions.
Proof
By Proposition 1, we have, for all $s \in S$ and $N \in \mathbb{Z}^+$,
$$V^*(s) - \frac{\xi^N M}{1 - \xi} \le (T^N V_{\mathrm{zero}})(s) \le V^*(s) + \frac{\xi^N M}{1 - \xi},$$
where $V_{\mathrm{zero}}$ is the zero function, $V_{\mathrm{zero}}(s) = 0$ for every $s \in S$. Applying the Bellman optimality operator to this inequality and using Lemma 1 together with Lemma 2, we obtain, for all $s \in S$ and $N \in \mathbb{Z}^+$,
$$(TV^*)(s) - \frac{\xi^{N+1} M}{1 - \xi} \le \left(T\!\left(V^* - \frac{\xi^N M}{1 - \xi}\, e\right)\right)(s) \le (T^{N+1} V_{\mathrm{zero}})(s) \le \left(T\!\left(V^* + \frac{\xi^N M}{1 - \xi}\, e\right)\right)(s) \le (TV^*)(s) + \frac{\xi^{N+1} M}{1 - \xi}.$$
Taking the limit as $N \to \infty$ and using the fact from Proposition 1 that
$$\lim_{N \to \infty} (T^{N+1} V_{\mathrm{zero}})(s) = V^*(s),$$
we obtain
$$TV^* = V^*.$$
To show uniqueness, observe that if $V$ is bounded and satisfies $V = TV$, then $V = \lim_{N \to \infty} T^N V$, so by Proposition 1 we have $V = V^*$. □
Now that the existence and uniqueness of V* have been proven, we define the optimal state-action-value function by
$$Q^*(s, a) := \sum_{s'} P(s'|s, a)\big(r(s, a, s') + \gamma(s, a) V^*(s')\big), \quad \forall (s, a) \in S \times A.$$
The Bellman optimality equation can also be written as
$$V^*(s) = \max_{a \in A} Q^*(s, a), \quad \forall s \in S.$$
Obviously, the stationary policy $\pi^*: S \to A$ defined by
$$\pi^*(s) := \arg\max_{a \in A} Q^*(s, a), \quad \forall s \in S,$$
is optimal, because $V^{\pi^*} := V^{\tilde\pi}\big|_{\tilde\pi = \pi^*}$ satisfies $TV^{\pi^*} = V^{\pi^*}$, and thus $V^{\pi^*} = V^*$. We conclude that there exists an optimal policy that is stationary. Note that the existence and uniqueness of $V^*$ ensure those of $Q^*$ and $\pi^*$.
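As a small follow-on to the value-iteration sketch above (again an illustrative assumption, not the paper's code), the stationary optimal policy can be read off from $V^*$ by a greedy argmax over the induced $Q^*$:

```python
import numpy as np

def greedy_policy(P, R, GAMMA, V_star):
    """pi*(s) = argmax_a sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma(s,a) * V*(s'))."""
    Q_star = np.einsum('sat,sat->sa', P, R + GAMMA[:, :, None] * V_star[None, None, :])
    return Q_star.argmax(axis=1)  # one optimal action index per state
```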
We now show that the Q-learning algorithm still works in our special kind of infinite-horizon discounted MDP. Define the operator $H: \mathbb{R}^{S \times A} \to \mathbb{R}^{S \times A}$ by
$$(HQ)(s, a) := \sum_{s'} P(s'|s, a)\left(r(s, a, s') + \gamma(s, a) \max_{a' \in A} Q(s', a')\right),$$
where $Q: S \times A \to \mathbb{R}$ is any state-action-value function.
Lemma 3
The optimal state-action-value function $Q^*$ is a fixed point (equilibrium) of the operator $H$, i.e.
$$HQ^* = Q^*.$$
Proof
By Theorem 1, we have $V^*(s) = \max_{a \in A} Q^*(s, a)$, so for all $(s, a) \in S \times A$
$$(HQ^*)(s, a) = \sum_{s'} P(s'|s, a)\left(r(s, a, s') + \gamma(s, a) \max_{a' \in A} Q^*(s', a')\right) = \sum_{s'} P(s'|s, a)\big(r(s, a, s') + \gamma(s, a) V^*(s')\big) = Q^*(s, a). \quad (13)$$
In short, $HQ^* = Q^*$. □
Lemma 4
The operator $H$ is a contraction with respect to $\|\cdot\|_\infty$, i.e.
$$\|HQ - H\bar Q\|_\infty \le \xi \|Q - \bar Q\|_\infty$$
for any state-action-value functions $Q$ and $\bar Q$.
Proof
$$\begin{aligned} \|HQ - H\bar Q\|_\infty &= \max_{(s, a)} \left|\gamma(s, a) \sum_{s'} P(s'|s, a)\left(\max_{a' \in A} Q(s', a') - \max_{a' \in A} \bar Q(s', a')\right)\right| \\ &\le \max_{(s, a)}\ \xi \sum_{s'} P(s'|s, a)\left|\max_{a' \in A} Q(s', a') - \max_{a' \in A} \bar Q(s', a')\right| \\ &\le \max_{(s, a)}\ \xi \sum_{s'} P(s'|s, a) \max_{a' \in A}\left|Q(s', a') - \bar Q(s', a')\right| \\ &\le \max_{(s, a)}\ \xi \sum_{s'} P(s'|s, a) \max_{(s'', a'') \in S \times A}\left|Q(s'', a'') - \bar Q(s'', a'')\right| \\ &= \xi \max_{(s'', a'')}\left|Q(s'', a'') - \bar Q(s'', a'')\right| = \xi \|Q - \bar Q\|_\infty. \end{aligned} \quad (14)$$
Here the third line uses the elementary inequality $|\max_{a'} f(a') - \max_{a'} g(a')| \le \max_{a'} |f(a') - g(a')|$. □
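The contraction property is easy to verify numerically; the following snippet (illustrative only, with randomly generated P, R and GAMMA) checks Lemma 4 on a small random MDP whose per-stage discounts are bounded by $\xi$:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, xi = 4, 3, 0.9
P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)  # transition kernel
R = rng.standard_normal((nS, nA, nS))                            # bounded rewards
GAMMA = xi * rng.random((nS, nA))                                # gamma(s,a) <= xi

def H(Q):
    """(HQ)(s,a) = sum_{s'} P(s'|s,a) * (r(s,a,s') + gamma(s,a) * max_a' Q(s',a'))."""
    return np.einsum('sat,sat->sa', P, R + GAMMA[:, :, None] * Q.max(axis=1)[None, None, :])

Q1, Q2 = rng.standard_normal((nS, nA)), rng.standard_normal((nS, nA))
lhs = np.abs(H(Q1) - H(Q2)).max()
rhs = xi * np.abs(Q1 - Q2).max()
assert lhs <= rhs + 1e-12  # Lemma 4: H is a xi-contraction in the sup norm
```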
Theorem 2
The random process $\{\Delta_t\}$ taking values in $\mathbb{R}^n$ and defined by
$$\Delta_{t+1}(x) = (1 - \alpha_t(x))\, \Delta_t(x) + \alpha_t(x)\, F_t(x)$$
converges to zero with probability 1 if the following conditions are satisfied:

  1. $0 \le \alpha_t \le 1$, $\sum_t \alpha_t(x) = \infty$ and $\sum_t \alpha_t^2(x) < \infty$;
  2. $\|\mathbb{E}[F_t(x) \mid \mathcal{F}_t]\|_W \le \xi \|\Delta_t\|_W$, with $\xi < 1$;
  3. $\mathrm{var}[F_t(x) \mid \mathcal{F}_t] \le C(1 + \|\Delta_t\|_W^2)$, for some $C > 0$.
Proof
See [25]. □
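As a concrete illustration of the first condition (not from the original text), the familiar schedule $\alpha_t(x) = 1/(n_t(x) + 1)$, where $n_t(x)$ counts previous updates of component $x$, satisfies it, since
$$\sum_{n=1}^{\infty} \frac{1}{n} = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty.$$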
Theorem 3
For the special infinite-horizon discounted MDP $(S, A, P, r, \gamma)$ defined above, the Q-learning algorithm, given by the update rule
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left[r(s_t, a_t, s_{t+1}) + \gamma(s_t, a_t) \max_{a \in A} Q_t(s_{t+1}, a) - Q_t(s_t, a_t)\right],$$
converges with probability 1 to the optimal state-action-value function $Q^*$ as long as
$$\sum_t \alpha_t(s, a) = \infty \quad \text{and} \quad \sum_t \alpha_t^2(s, a) < \infty$$
for all $(s, a) \in S \times A$.
Proof
The update rule of Q-learning can be written as
$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left[r(s_t, a_t, s_{t+1}) + \gamma(s_t, a_t) \max_{a \in A} Q_t(s_{t+1}, a)\right]. \quad (15)$$
Letting $\Delta_t(s, a) = Q_t(s, a) - Q^*(s, a)$, the equation above yields
$$\Delta_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, \Delta_t(s_t, a_t) + \alpha_t(s_t, a_t)\left[r(s_t, a_t, s_{t+1}) + \gamma(s_t, a_t) \max_{a \in A} Q_t(s_{t+1}, a) - Q^*(s_t, a_t)\right]. \quad (16)$$
If we write
$$F_t(s, a) = r(s, a, s') + \gamma(s, a) \max_{a' \in A} Q_t(s', a') - Q^*(s, a),$$
where $s' \sim P(\,\cdot\,|s, a)$, we have
$$\mathbb{E}[F_t(s, a) \mid \mathcal{F}_t] = \sum_{s' \in S} P(s'|s, a)\left[r(s, a, s') + \gamma(s, a) \max_{a' \in A} Q_t(s', a') - Q^*(s, a)\right] = (HQ_t)(s, a) - Q^*(s, a). \quad (17)$$
By Lemma 3, $HQ^* = Q^*$, so
$$\mathbb{E}[F_t(s, a) \mid \mathcal{F}_t] = (HQ_t)(s, a) - (HQ^*)(s, a).$$
By Lemma 4, we have
$$\|\mathbb{E}[F_t(s, a) \mid \mathcal{F}_t]\|_\infty \le \xi \|Q_t - Q^*\|_\infty = \xi \|\Delta_t\|_\infty.$$
Moreover,
$$\begin{aligned} \mathrm{var}[F_t(s, a) \mid \mathcal{F}_t] &= \mathbb{E}\left[\left(r(s, a, s') + \gamma(s, a) \max_{a' \in A} Q_t(s', a') - Q^*(s, a) - (HQ_t)(s, a) + Q^*(s, a)\right)^2\right] \\ &= \mathbb{E}\left[\left(r(s, a, s') + \gamma(s, a) \max_{a' \in A} Q_t(s', a') - (HQ_t)(s, a)\right)^2\right] \\ &= \mathrm{var}\left[r(s, a, s') + \gamma(s, a) \max_{a' \in A} Q_t(s', a') \,\middle|\, \mathcal{F}_t\right]. \end{aligned}$$
Owing to Assumption 1, the reward $r(s, a, s')$ is bounded, and since $Q_t = Q^* + \Delta_t$ with $Q^*$ bounded, the term $\gamma(s, a) \max_{a'} Q_t(s', a')$ is bounded by a constant plus $\xi \|\Delta_t\|_\infty$; hence there exists $C > 0$ such that
$$\mathrm{var}[F_t(s, a) \mid \mathcal{F}_t] \le C(1 + \|\Delta_t\|_\infty^2).$$
Finally, by Theorem 2, $\Delta_t$ converges to zero with probability 1, i.e. $Q_t$ converges to $Q^*$ with probability 1. □
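To connect the theorem back to practice, here is a minimal tabular Q-learning sketch with a state-action-dependent discount and visit-count step sizes satisfying the conditions of Theorem 3. It is an illustrative assumption, not the authors' implementation; `sample_transition(s, a)` is an assumed environment hook returning the next state and reward, and `gamma(s, a)` is assumed to satisfy $\gamma(s, a) \le \xi < 1$.

```python
import numpy as np
from collections import defaultdict

def q_learning(states, actions, sample_transition, gamma, n_steps=100_000,
               epsilon=0.1, rng=np.random.default_rng(0)):
    """Tabular Q-learning with per-(s, a) discount gamma(s, a) and alpha_t = 1/n_t(s, a)."""
    Q = defaultdict(float)       # Q[(s, a)], initialised to 0
    visits = defaultdict(int)    # per-(s, a) visit counts
    s = states[0]
    for _ in range(n_steps):
        # epsilon-greedy behaviour policy (any policy visiting all (s, a) infinitely often works)
        if rng.random() < epsilon:
            a = actions[rng.integers(len(actions))]
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s_next, r = sample_transition(s, a)
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]   # satisfies sum(alpha) = inf, sum(alpha^2) < inf
        target = r + gamma(s, a) * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # update rule of Theorem 3
        s = s_next
    return Q
```

The step size $\alpha_t(s, a) = 1/n_t(s, a)$ is one schedule meeting the Robbins-Monro conditions of Theorem 3; in practice a constant or slowly decaying rate is often used instead.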

Information & Authors

History

Received: 14 December 2018
Revision received: 30 January 2019
Accepted: 07 February 2019
Published ahead of print: 13 February 2019
Published online: 01 June 2019
Published in print: June 2019

Inspec keywords

  1. learning (artificial intelligence)

Keywords

  1. reinforcement learning framework
  2. control theoretical formalism
  3. explicit time function
  4. action value
  5. action time
  6. RL value network
  7. embedded time model
  8. time-in-action RL

Authors

Zhepei Wang
Institute of Cyber-Systems and Control, Department of Control Science and Engineering, Zhejiang University, Hangzhou, People's Republic of China

Douglas Mcilwraith
Data Science Institute, Department of Computing, Imperial College London, London, UK

Chao Wu
School of Public Affairs, Zhejiang University, Hangzhou, People's Republic of China

Chao Xu
Institute of Cyber-Systems and Control, Department of Control Science and Engineering, Zhejiang University, Hangzhou, People's Republic of China

Yike Guo
Data Science Institute, Department of Computing, Imperial College London, London, UK

Funding Information

National Natural Science Foundation of China: 61473253
