Python Project: Markov Decision Processes (policies, rewards
and values, value iteration) and Reinforcement Learning
(TD/Q-Learning, Exploration and Approximation)
Reinforcement Learning Project:
- We implemented Value Iteration and Q-learning, and
tested our agents first on GridWorld, then on a
simulated robot controller (Crawler), and finally on
Pacman.
- Reinforcement learning: an agent takes actions in an
environment, and the resulting state and reward are
passed back to the agent that took the action.
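- A minimal sketch of that interaction loop; the
env/agent interface names here (reset, step, get_action,
observe) are hypothetical, not the project's actual API:

    def run_episode(env, agent, max_steps=100):
        state = env.reset()                   # initial state
        total_reward = 0.0
        for _ in range(max_steps):
            action = agent.get_action(state)  # agent picks an action
            next_state, reward, done = env.step(action)      # environment responds
            agent.observe(state, action, next_state, reward)  # feedback to agent
            total_reward += reward
            state = next_state
            if done:
                break
        return total_reward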
- The underlying model is a Markov decision process,
which is non-deterministic.
- You have a set of states, each with a set of associated
actions, a transition function giving the probability of
each successor outcome (a model of the dynamics), and a
reward for each transition.
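- A toy illustration (not the project's GridWorld) of
those pieces as plain Python data: states, actions, and
a stochastic transition function with rewards:

    STATES = ["A", "B", "exit"]
    ACTIONS = {"A": ["go", "stay"], "B": ["go"], "exit": []}

    # T[(state, action)] is a list of (probability, next_state, reward) triples.
    T = {
        ("A", "go"):   [(0.8, "B", 0.0), (0.2, "A", 0.0)],  # intended move may fail
        ("A", "stay"): [(1.0, "A", -0.1)],
        ("B", "go"):   [(1.0, "exit", 1.0)],                # reaching the exit pays +1
    }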
- Markov means that given the present state, the future
and past are independent.
- For decision processes, Markov means action outcomes
depend only on the current state, just like search,
where the successor function could only depend on the
current state and not its history.
- For deterministic single-agent search problems, we
wanted an optimal plan, or sequence of actions, from
start to goal.
- For Markov decision processes, we want an optimal
policy, which specifies an action for each state.
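- A minimal sketch of value iteration over the toy MDP
above: repeatedly apply the Bellman update
V(s) = max_a sum_s' T(s,a,s') * [R(s,a,s') + gamma * V(s')],
then read the greedy policy out of the converged values:

    def value_iteration(states, actions, T, gamma=0.9, iterations=100):
        V = {s: 0.0 for s in states}
        for _ in range(iterations):
            new_V = {}
            for s in states:
                if not actions[s]:            # terminal state
                    new_V[s] = 0.0
                    continue
                new_V[s] = max(
                    sum(p * (r + gamma * V[s2]) for p, s2, r in T[(s, a)])
                    for a in actions[s]
                )
            V = new_V
        policy = {
            s: max(actions[s],
                   key=lambda a, s=s: sum(p * (r + gamma * V[s2])
                                          for p, s2, r in T[(s, a)]))
            for s in states if actions[s]
        }
        return V, policy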
- The idea was to create a Pacman that could learn from
his environment.
- Once Pacman was done training, he should win very
reliably in test games, since he is now exploiting his
learned policy.
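- A minimal sketch of tabular Q-learning with
epsilon-greedy exploration, roughly what the agent does
during training (the legal_actions arguments are
hypothetical placeholders for the project's API):

    import random
    from collections import defaultdict

    class QLearner:
        def __init__(self, alpha=0.2, gamma=0.9, epsilon=0.05):
            self.Q = defaultdict(float)       # Q[(state, action)]
            self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

        def get_action(self, state, legal_actions):
            if random.random() < self.epsilon:         # explore
                return random.choice(legal_actions)
            return max(legal_actions, key=lambda a: self.Q[(state, a)])  # exploit

        def update(self, state, action, next_state, reward, next_legal_actions):
            # sample-based update toward reward + gamma * max_a' Q(s', a')
            best_next = max((self.Q[(next_state, a)] for a in next_legal_actions),
                            default=0.0)
            target = reward + self.gamma * best_next
            self.Q[(state, action)] += self.alpha * (target - self.Q[(state, action)])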
- This worked for small grids but not for medium ones,
since the state space becomes too large to learn a value
for every state separately.
- To correct this, we had to implement an approximate
Q-learning agent that learns weights for features of
states, where many states might share the same features.
- Approximate Q-learning assumes the existence of a
feature function over (state, action) pairs.
- The approximate Q-learning agent uses a feature
extractor; the default extractor assigns a single
feature to every (game state, action) pair.
- Even much larger layouts should be no problem for the
approximate Q-learning agent.
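- A minimal sketch of approximate Q-learning: Q(s, a) is
a dot product of learned weights with features f(s, a),
and a single TD error updates every weight (the feature
extractor here is a hypothetical stand-in for the
project's extractors):

    from collections import defaultdict

    class ApproximateQLearner:
        def __init__(self, feature_extractor, alpha=0.2, gamma=0.9):
            self.w = defaultdict(float)       # one weight per feature
            self.extract = feature_extractor  # maps (state, action) -> {name: value}
            self.alpha, self.gamma = alpha, gamma

        def q_value(self, state, action):
            feats = self.extract(state, action)
            return sum(self.w[name] * value for name, value in feats.items())

        def update(self, state, action, next_state, reward, next_legal_actions):
            best_next = max((self.q_value(next_state, a) for a in next_legal_actions),
                            default=0.0)
            difference = (reward + self.gamma * best_next) - self.q_value(state, action)
            for name, value in self.extract(state, action).items():
                self.w[name] += self.alpha * difference * value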
- We then had a learning Pacman agent for small, medium
or large grids.