Michael Fairbank
PhD Student
Mr Michael Fairbank PhD Student
Room:
Department of Computing
School of Informatics
City University
London EC1V OHB
abdy934@soi.city.ac.uk
tel: +44 20 7040
fax: +44 20 7040 8887
I am a part time PhD student, researching neural networks and reinforcement learning.
My PhD supervisor is Eduardo Alonso.
Reinforcement Learning by Value Gradients
In Reinforcement Learning (RL) the objective is for an agent (e.g. a robot) to learn to navigate through an environment so as to maximise a given reward function. Reinforcement learning often makes use of a value-function which is a scalar field of the state space that the robot is moving within. This value-function assigns a score to each point of state space rating how "good" that position is.
The idea of my research is to attempt to explicitly learn the gradient of this value-function with respect to the state vector. This has the advantage of obviating the need for local exploration, hence increasing efficiency. It also has yielded a successful convergence proof for value-function learning with a general smooth function approximator for control problems. This convergence proof was achieved by proving equivalence, under certain conditions, to gradient ascent on the total reward function (a process known as policy gradient learning), hence also providing a theoretical connection between these two rival paradigms of RL.
My research is most closely related to Dual Heuristic Programming (DHP) and Globalized Dual Heuristic Programming (GDHP) by Paul Werbos. These are two extremely powerful methods used by the Adaptive Dynamic Programming (ADP) community, but strangely neglected by the RL community.
Gentle introductions:
Overview Presentation. We presented this work at the 6th Barbados Workshop on Reinforcement Learning, Bellairs Institute in Holetown, Barbados, 23rd March 2011, and The Reinforcement Learning Reading Group, Gatsby Computational Neuroscience Unit-UCL, London, 18th April 2011.
This presentation summarises the relationship of Value Gradient Learning to Dual Heuristic Programming and to Backpropagation Through Time. It also gives a nice short visual explanation of Pontryagin's Maximum Principle in comparison to Bellman's Optimality Principle. We also give a quick explanation of why the greedy policy is linked to the value-gradient and not values, and hence why learning the value gradient makes sense, and should be faster and obviate the need for local exploration. It also summarises theoretical and experimental results.
Accompanying java-demo explaining value-gradient concepts, and showing VGL methods working extremely fast at bending trajectories into the correct shape, without any need for stochastic exploration. Also shows value-learning (i.e. TD(λ)) failing in a deterministic environment due to lack of exploration.
Alternative Presentation explaining main concepts of value-gradients.
In greater depth:
The Local Optimality of Reinforcement Learning by Value-Gradients and its relationship to Policy Gradient Learning. Fairbank and Alonso, 2011.
This paper defines the VGL(λ) algorithm which is an extension of DHP to include a bootstrapping parameter, λ (just like the "λ" parameter used in TD(λ)). We prove local optimality of any trajectory where all of the target value gradients are met under a greedy policy. This is the result that provides efficiency and local exploration for free in value-gradient learning. The paper also provides a convergence proof for a form of VGL(1) under a greedy policy. This is the only value-function learning algorithm that we know of that converges under a greedy policy. It is also the only value-function learning algorithm that we know of that converges for control problems and a general non-linear function approximator for the value function.Reinforcement Learning by Value Gradients. 2008. This contains similar theoretical results in an earlier form, but contains more background information and empirical results. It gives full specification of the One-Dimensional Lunar Lander RL experiment implemented with a neural network function approximator, as used in the above java demo.
The Divergence of TD(1) and Sarsa(1) and DHP under a greedy policy. Draft paper. Fairbank and Alonso 2011. This paper gives specific examples of two major reinforcement learning algorithms and one major Adaptive Dynamic Programming algorithm (DHP) diverging when combined with a greedy policy. The RL algorithms' divergence might be surprising with λ=1, but this is possible because a greedy policy changes as learning progresses. Paper and source code available by email request.
A Comparison of Learning Speed and Ability to Cope Without Exploration between DHP and TD(λ). Draft paper. Fairbank and Alonso 2011. This paper demonstrates the principal motivations for Value-Gradient Learning (and DHP) learning methods: That of the ability to work without stochastic exploration in deterministic environments, and learning speed. Learning speed of the value gradient algorithm (DHP) is shown to be 1700 times faster than TD(λ) in a simple experiment. Learning by TD(λ) is shown to fail 100% of the time when stochastic exploration is removed in this experiment that causes no problems for DHP. Paper and source code available by email request.
Applications:
A 2D lunar lander controlled by a value-function represented by a neural network. Bang-bang controlled.
The same problem but with smooth controls.
Recurrent Neural Networks applied to Control problems
These applications do not use a value-function at all. A recurrent neural network controls the agent directly thus allowing it to tackle non-Markovian environments.
Applications:
A spacecraft with active sensor, using a range finding scanner to find its target objective, and flying towards it.
An essay on how close these two applications come to intelligence.
