Python Project: Markov Decision Processes (policies, rewards
and values, value iteration) and Reinforcement Learning
(TD/Q-Learning, Exploration and Approximation)
Reinforcement Learning Project:
- We implemented Value Iteration and Q-learning, and
tested our agents first on GridWorld, then on a
simulated robot controller (Crawler), and finally on
Pacman.
- Reinforcement learning: an agent takes actions in an
environment, and the resulting state and reward are
passed back to the agent that took the action.
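- A minimal sketch of that interaction loop; the
env/agent interface names here (reset, step, get_action,
observe) are hypothetical, not the project's actual API:

    def run_episode(env, agent, max_steps=100):
        state = env.reset()                   # initial state
        total_reward = 0.0
        for _ in range(max_steps):
            action = agent.get_action(state)  # agent picks an action
            next_state, reward, done = env.step(action)      # environment responds
            agent.observe(state, action, next_state, reward)  # feedback to agent
            total_reward += reward
            state = next_state
            if done:
                break
        return total_reward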
- The underlying model is a Markov decision process,
which is non-deterministic.
- You have a set of states, each with a set of associated
actions, a transition function giving the probability of
each successor outcome (a model of the dynamics), and a
reward for each transition.
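- A toy illustration (not the project's GridWorld) of
those pieces as plain Python data: states, actions, and
a stochastic transition function with rewards:

    STATES = ["A", "B", "exit"]
    ACTIONS = {"A": ["go", "stay"], "B": ["go"], "exit": []}

    # T[(state, action)] is a list of (probability, next_state, reward) triples.
    T = {
        ("A", "go"):   [(0.8, "B", 0.0), (0.2, "A", 0.0)],  # intended move may fail
        ("A", "stay"): [(1.0, "A", -0.1)],
        ("B", "go"):   [(1.0, "exit", 1.0)],                # reaching the exit pays +1
    }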
- Markov means that given the present state, the future
and past are independent.
- For decision processes, Markov means action outcomes
depend only on the current state, just like search,
where the successor function could only depend on the
current state and not its history.
- For deterministic single-agent search problems, we
wanted an optimal plan, or sequence of actions, from
start to goal.
- For Markov decision processes, we want an optimal
policy, which specifies an action for each state.
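- A minimal sketch of value iteration over the toy MDP
above: repeatedly apply the Bellman update
V(s) = max_a sum_s' T(s,a,s') * [R(s,a,s') + gamma * V(s')],
then read the greedy policy out of the converged values:

    def value_iteration(states, actions, T, gamma=0.9, iterations=100):
        V = {s: 0.0 for s in states}
        for _ in range(iterations):
            new_V = {}
            for s in states:
                if not actions[s]:            # terminal state
                    new_V[s] = 0.0
                    continue
                new_V[s] = max(
                    sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])
                    for a in actions[s]
                )
            V = new_V
        policy = {
            s: max(actions[s],
                   key=lambda a, s=s: sum(p * (r + gamma * V[s2])
                                          for p, s2, r in T[(s, a)]))
            for s in states if actions[s]
        }
        return V, policy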
- The idea was to create a Pacman that could learn from
his environment.
- Once Pacman was done training, he should win very
reliably in test games, since he is now exploiting his
learned policy.
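- A minimal sketch of tabular Q-learning with
epsilon-greedy exploration, roughly what the agent does
during training (the legal_actions arguments are
hypothetical placeholders for the project's API):

    import random
    from collections import defaultdict

    class QLearner:
        def __init__(self, alpha=0.2, gamma=0.9, epsilon=0.05):
            self.Q = defaultdict(float)       # Q[(state, action)]
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

        def get_action(self, state, legal_actions):
            if random.random() < self.epsilon:         # explore
                return random.choice(legal_actions)
            return max(legal_actions, key=lambda a: self.Q[(state, a)])  # exploit

        def update(self, state, action, next_state, reward, next_legal_actions):
            # sample-based update toward reward + gamma * max_a' Q(s', a')
            best_next = max((self.Q[(next_state, a)] for a in next_legal_actions),
                            default=0.0)
            target = reward + self.gamma * best_next
            self.Q[(state, action)] += self.alpha * (target - self.Q[(state, action)])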
- This worked for small grids but not for medium ones,
since the state space becomes too large to learn a value
for every state separately.
- To correct this, we had to implement an approximate
Q-learning agent that learns weights for features of
states, where many states might share the same features.
- Approximate Q-learning assumes the existence of a
feature function over (state, action) pairs.
- The approximate Q-learning agent uses a feature
extractor; the default extractor assigns a single
feature to every (game state, action) pair.
- Even much larger layouts should be no problem for the
approximate Q-learning agent.
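- A minimal sketch of approximate Q-learning: Q(s, a) is
a dot product of learned weights with features f(s, a),
and a single TD error updates every weight (the feature
extractor here is a hypothetical stand-in for the
project's extractors):

    from collections import defaultdict

    class ApproximateQLearner:
        def __init__(self, feature_extractor, alpha=0.2, gamma=0.9):
            self.w = defaultdict(float)       # one weight per feature
            self.extract = feature_extractor  # maps (state, action) -> {name: value}
            self.alpha, self.gamma = alpha, gamma

        def q_value(self, state, action):
            feats = self.extract(state, action)
            return sum(self.w[name] * value for name, value in feats.items())

        def update(self, state, action, next_state, reward, next_legal_actions):
            best_next = max((self.q_value(next_state, a) for a in next_legal_actions),
                            default=0.0)
            difference = (reward + self.gamma * best_next) - self.q_value(state, action)
            for name, value in self.extract(state, action).items():
                self.w[name] += self.alpha * difference * value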
- We then had a learning Pacman agent for small, medium
or large grids.