In the first four parts of this book, we considered decisions made by a single agent acting in an environment. In this fifth part, we extend the discussion of the previous four parts to multiple agents and address the challenges that arise when agents interact under uncertainty. We begin with the simplest case, in which a group of agents simultaneously each select an action; each agent then individually receives a reward based on the resulting joint action.

Markov games (MGs) generalize these simple games to multiple states and Markov decision processes to multiple agents. The agents now select actions that can stochastically change the state of a shared environment. Because each agent is uncertain about the policies of the other agents, algorithms for Markov games tend to rely on reinforcement learning.

Partially observable Markov games (POMGs) introduce state uncertainty, further generalizing both Markov games and POMDPs, since the agents now receive only noisy local observations.

Decentralized partially observable Markov decision processes (Dec-POMDPs) focus POMGs on a collaborative team of agents with a shared reward. This part of the book introduces these four categories of problems and discusses exact and approximate algorithms for solving them.
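To make the relationships among these four problem classes concrete, the sketch below nests them as data structures, each subclass adding only the ingredients its formulation introduces. This is a minimal illustration in Python, not the book's own code; all type and field names are assumptions chosen for exposition.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SimpleGame:
    gamma: float            # discount factor
    agents: List[int]       # the set of agents
    actions: List[list]     # actions[i] is agent i's action space
    reward: Callable        # reward(a) -> one reward per agent, for joint action a

@dataclass
class MarkovGame(SimpleGame):
    states: list            # shared environment states
    transition: Callable    # transition(s, a) -> distribution over next states

@dataclass
class POMG(MarkovGame):
    observations: List[list]  # observations[i] is agent i's observation space
    observe: Callable         # observe(a, s2) -> distribution over joint observations

@dataclass
class DecPOMDP(POMG):
    """Cooperative special case of a POMG: reward(a) returns a single
    value shared by the whole team rather than one reward per agent."""
```

Read bottom-up, the nesting mirrors the text: a Dec-POMDP is a POMG restricted to a shared reward, a POMG is a Markov game with only noisy local observations, and a Markov game is a simple game played over a shared, stochastically evolving state.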
[1] Here we focus on discrete-time problems; continuous-time problems fall within the domain of control theory. See D. E. Kirk, Optimal Control Theory: An Introduction. Prentice-Hall, 1970.
[2] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed. Pearson, 2021, provides a comprehensive overview of artificial intelligence.
[3] This application is discussed in M. J. Kochenderfer, Decision Making Under Uncertainty: Theory and Application. MIT Press, 2015.
[4] A similar application is explored in M. Bouton, A. Nakhaei, K. Fujimura, and M. J. Kochenderfer, "Safe Reinforcement Learning with Scene Decomposition for Navigating Complex Urban Environments," in IEEE Intelligent Vehicles Symposium (IV), 2019.
[5] This idea was first proposed by T. Ayer, O. Alagoz, and N. K. Stout, "A POMDP Approach to Personalize Mammography Screening Decisions," Operations Research, vol. 60, no. 5, pp. 1019-1034, 2012.
[6] A related problem was studied by R. C. Merton; see R. C. Merton, "Optimum Consumption and Portfolio Rules in a Continuous-Time Model," Journal of Economic Theory, vol. 3, no. 4, pp. 373-413, 1971.
[7] This application is explored in K. D. Julian and M. J. Kochenderfer, "Distributed Wildfire Surveillance with Autonomous Aircraft Using Deep Reinforcement Learning," AIAA Journal of Guidance, Control, and Dynamics, vol. 42, no. 8, pp. 1768-1778, 2019.
[8] This idea is proposed and validated in D. Gaines, G. Doran, M. Paton, B. Rothrock, J. Russino, R. Mackey, R. Anderson, R. Francis, C. Joswig, H. Justice, K. Kolcio, G. Rabideau, S. Schaffer, J. Sawoniewicz, A. Vasavada, V. Wong, K. Yu, and A.-a. Agha-mohammadi, "Self-Reliant Rovers for Increased Mission Productivity," Journal of Field Robotics, vol. 37, no. 7, pp. 1171-1196, 2020.
[9] S. Vasileiadou, D. Kalligeropoulos, and N. Karcanias, "Systems, Modelling and Control in Ancient Greece: Part 1: Mythical Automata," Measurement and Control, vol. 36, no. 3, pp. 76-80, 2003.
[10] N. J. Nilsson, The Quest for Artificial Intelligence. Cambridge University Press, 2009.
[11] G. B. Dantzig, "Linear Programming," Operations Research, vol. 50, no. 1, pp. 42-47, 2002.
[12] G. J. Stigler, "The Development of Utility Theory. I," Journal of Political Economy, vol. 58, no. 4, pp. 307-327, 1950.
[13] J. Bentham, Theory of Legislation. Trübner & Company, 1887.
[14] J. von Neumann and O. Morgenstern, Theory of Games and Economic Behavior. Princeton University Press, 1953.
[15] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2018.
[16] N. J. Nilsson, The Quest for Artificial Intelligence. Cambridge University Press, 2009.
[17] Quoted by J. Agar, Science in the 20th Century and Beyond. Polity, 2012.
[18] S. Thrun, "Probabilistic Robotics," Communications of the ACM, vol. 45, no. 3, pp. 52-57, 2002.
[19] G. E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, vol. 38, no. 8, pp. 114-117, 1965.
[20] D. A. Mindell, Between Human and Machine: Feedback, Control, and Computing Before Cybernetics. JHU Press, 2002.
[21] W. M. Bolstad and J. M. Curran, Introduction to Bayesian Statistics. Wiley, 2016.
[22] B. O. Koopman, Search and Screening: General Principles with Historical Applications. Pergamon Press, 1980.
[23] H. Koontz, "The Management Theory Jungle," Academy of Management Journal, vol. 4, no. 3, pp. 174-188, 1961.
[24] F. S. Hillier, Introduction to Operations Research. McGraw-Hill, 2012.
[25] For a more general discussion, see B. Christian, The Alignment Problem. Norton & Company, 2020. See also the related discussion in D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, "Concrete Problems in AI Safety," 2016. arXiv: 1606.06565v2.