Abstract
Decision making is a core competence for animals and humans acting and surviving in environments they only partially comprehend, gaining rewards and punishments for their troubles. Decision-theoretic concepts permeate experiments and computational models in ethology, psychology, and neuroscience. Here, we review a well-known, coherent Bayesian approach to decision making, showing how it unifies issues in Markovian decision problems, signal detection psychophysics, sequential sampling, and optimal exploration, and discuss paradigmatic psychological and neural examples of each problem. We discuss computational issues concerning what subjects know about their task and how ambitious they are in seeking optimal solutions; we address algorithmic topics concerning model-based and model-free methods for making choices; and we highlight key aspects of the neural implementation of decision making.
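To give a concrete flavor of the sequential-sampling ideas the review unifies, the sketch below implements Wald's sequential probability ratio test (SPRT), the classic optimal stopping rule for choosing between two hypotheses from noisy samples. This is our illustrative code, not the authors'; the Gaussian means, noise level, and error rates are arbitrary assumptions chosen for the example.

```python
import numpy as np

# Minimal sketch of Wald's sequential probability ratio test (SPRT).
# All parameter values are illustrative, not taken from the article.
rng = np.random.default_rng(0)

def sprt(mu_true, mu0=-0.5, mu1=0.5, sigma=1.0, alpha=0.05, beta=0.05):
    """Accumulate noisy samples until the log likelihood ratio crosses
    one of Wald's thresholds; return (chosen hypothesis, sample count)."""
    upper = np.log((1 - beta) / alpha)   # cross this -> choose H1
    lower = np.log(beta / (1 - alpha))   # cross this -> choose H0
    llr, n = 0.0, 0
    while lower < llr < upper:
        x = rng.normal(mu_true, sigma)   # draw one noisy observation
        # log likelihood ratio of this sample under H1 vs. H0
        llr += ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)
        n += 1
    return (1 if llr >= upper else 0), n

choice, n = sprt(mu_true=0.5)
print(f"decision: H{choice} after {n} samples")
```

Raising the thresholds (smaller alpha, beta) trades longer sampling for fewer errors, the speed-accuracy trade-off at the heart of drift-diffusion accounts of perceptual choice.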
Additional information
Funding came from the Gatsby Charitable Foundation (to P.D.).
About this article
Cite this article
Dayan, P., Daw, N.D. Decision theory, reinforcement learning, and the brain. Cognitive, Affective, & Behavioral Neuroscience 8, 429–453 (2008). https://doi.org/10.3758/CABN.8.4.429