Two-time scale reinforcement learning and applications to production planning

This study is concerned with reinforcement learning enhanced by two-time-scale approximations. Many systems arising in applications are large and complex. To treat such problems, it is often beneficial, and sometimes necessary, to reduce the dimensionality by aggregating states that are ‘close’ to each other. In this study, the authors propose a two-time-scale reinforcement learning method for such an aggregation process. In particular, they show how to classify states that are ‘close’ and demonstrate the effectiveness of their state-aggregation-based two-time-scale methods. The problem can thus be viewed as using learning to identify the system. A production planning problem with failure-prone machines is used throughout the study to illustrate the main ideas, key steps, and results. Monte Carlo simulations are used to generate the random environment.
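A minimal sketch may help make the setting concrete: a fast process (the machine's up/down mode) is simulated by Monte Carlo, ‘close’ inventory levels are grouped into aggregate states, and a Q-learning iterate on the aggregated states runs with a slower step size than the iterate tracking the fast mode process. This is an illustrative toy, not the authors' algorithm: the inventory range, cost weights, transition probabilities, the `aggregate` grouping, and the step-size exponents are all assumptions.

```python
import random

random.seed(0)

# Toy production planning problem with one failure-prone machine.
# Slow state: inventory x in {-5, ..., 10} (negative = backlog).
# Fast state: machine mode m in {0 = down, 1 = up}, switching quickly.
# Action: production request u in {0, 1}, effective only when the machine is up.
# All parameters below are illustrative assumptions, not from the paper.

X_MIN, X_MAX = -5, 10
ACTIONS = (0, 1)
P_FAIL, P_REPAIR = 0.3, 0.6      # machine-mode transition probabilities
P_DEMAND = 0.4                   # Monte Carlo demand: one unit w.p. 0.4
GAMMA = 0.95                     # discount factor

def aggregate(x):
    """Group 'close' inventory levels into a few aggregate states."""
    if x < 0:
        return 0                 # backlogged
    if x <= 3:
        return 1                 # low stock
    if x <= 7:
        return 2                 # medium stock
    return 3                     # high stock

def cost(x):
    return 2.0 * max(-x, 0) + 1.0 * max(x, 0)   # backlog + holding cost

Q = {(s, u): 0.0 for s in range(4) for u in ACTIONS}
avail = 0.5                      # fast-time-scale estimate of machine availability

x, m = 0, 1
for n in range(1, 200_001):
    s = aggregate(x)
    # epsilon-greedy action chosen on the aggregated state
    if random.random() < 0.1:
        u = random.choice(ACTIONS)
    else:
        u = min(ACTIONS, key=lambda a: Q[(s, a)])
    # Monte Carlo simulation of the random environment
    m = (0 if random.random() < P_FAIL else 1) if m else (1 if random.random() < P_REPAIR else 0)
    produced = u if m == 1 else 0
    demand = 1 if random.random() < P_DEMAND else 0
    x = min(max(x + produced - demand, X_MIN), X_MAX)
    s_next = aggregate(x)
    # Two-time-scale step sizes: beta (fast) >> alpha (slow), so the
    # availability estimate tracks the fast mode process while Q on the
    # aggregated states evolves on the slow scale.
    beta = 1.0 / n ** 0.6
    alpha = 1.0 / n
    avail += beta * (m - avail)
    target = cost(x) + GAMMA * min(Q[(s_next, a)] for a in ACTIONS)
    Q[(s, u)] += alpha * (target - Q[(s, u)])

policy = {s: min(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(4)}
print(policy, round(avail, 2))
```

The learned `avail` should settle near the stationary up-probability of the two-state mode chain, P_REPAIR / (P_FAIL + P_REPAIR) ≈ 0.67 for the values assumed above, which is the kind of fast-variable averaging the two-time-scale analysis formalises.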

Inspec keywords: Monte Carlo methods; production planning; learning (artificial intelligence)

Other keywords: production planning problem; two-time scale approximations; two-time scale reinforcement learning method; aggregate states; two-time scale methods; aggregation process

Subjects: Production management; Interpolation and function approximation (numerical analysis); Knowledge engineering techniques; Probability theory, stochastic processes, and statistics

http://iet.metastore.ingenta.com/content/journals/10.1049/iet-cta.2020.0049