Markov decision process

Markov decision process (MDP), also called a stochastic dynamic program or stochastic control problem, is a model for sequential decision making when outcomes are uncertain.^[1]

Originating from operations research in the 1950s,^[2]^[3] MDPs have since gained recognition in a variety of fields, including ecology, economics, healthcare, telecommunications and reinforcement learning.^[4] Reinforcement learning utilizes the MDP framework to model the interaction between a learning agent and its environment. In this framework, the interaction is characterized by states, actions, and rewards. The MDP framework is designed to provide a simplified representation of key elements of artificial intelligence challenges. These elements encompass the understanding of cause and effect, the management of uncertainty and nondeterminism, and the pursuit of explicit goals.^[4]

The name comes from its connection to Markov chains, a concept developed by the Russian mathematician Andrey Markov. The "Markov" in "Markov decision process" refers to the underlying structure of state transitions that still follow the Markov property. The process is called a "decision process" because it involves making decisions that influence these state transitions, extending the concept of a Markov chain into the realm of decision-making under uncertainty.

^ Puterman, Martin L. (1994). Markov decision processes: discrete stochastic dynamic programming. Wiley series in probability and mathematical statistics. Applied probability and statistics section. New York: Wiley. ISBN 978-0-471-61977-2.
^ Schneider, S.; Wagner, D. H. (1957-02-26). "Error detection in redundant systems". Papers presented at the February 26-28, 1957, western joint computer conference: Techniques for reliability. IRE-AIEE-ACM '57 (Western). New York, NY, USA: Association for Computing Machinery: 115–121. doi:10.1145/1455567.1455587. ISBN 978-1-4503-7861-1.
^ Bellman, Richard (1958-09-01). "Dynamic programming and stochastic control processes". Information and Control. 1 (3): 228–239. doi:10.1016/S0019-9958(58)80003-0. ISSN 0019-9958.
^ ^a ^b Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement learning: an introduction. Adaptive computation and machine learning series (2nd ed.). Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03924-6.

[1] Puterman, Martin L. (1994). Markov decision processes: discrete stochastic dynamic programming. Wiley series in probability and mathematical statistics. Applied probability and statistics section. New York: Wiley. ISBN 978-0-471-61977-2.

[2] Schneider, S.; Wagner, D. H. (1957-02-26). "Error detection in redundant systems". Papers presented at the February 26-28, 1957, western joint computer conference: Techniques for reliability. IRE-AIEE-ACM '57 (Western). New York, NY, USA: Association for Computing Machinery: 115–121. doi:10.1145/1455567.1455587. ISBN 978-1-4503-7861-1.

[3] Bellman, Richard (1958-09-01). "Dynamic programming and stochastic control processes". Information and Control. 1 (3): 228–239. doi:10.1016/S0019-9958(58)80003-0. ISSN 0019-9958.

[:0-4] Sutton, Richard S.; Barto, Andrew G. (2018). Reinforcement learning: an introduction. Adaptive computation and machine learning series (2nd ed.). Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03924-6.

[1]

[2]

[3]

[4]