Paving the Way: Exploring MDPs Through Cliff Walking Implementation

Getting Hands-On with Policy Evaluation and Policy Iteration

Alright, now that we’ve laid the groundwork for Markov Decision Processes (MDPs), it’s time to roll up our sleeves and get into the action! We’ll dive into Policy Evaluation and Policy Iteration—two key components that help us refine decision-making strategies in MDPs. By the end of this section, you’ll not only understand these concepts but also see them in action in the Cliff Walking Environment. So, grab a coffee (or your favorite beverage) and let’s break it down!

Policy Evaluation: Understanding the Worth of a Policy

Let’s say you’re playing a board game where you earn points based on the moves you make. You want to know: if you follow a particular strategy, how much reward can you expect to accumulate over time? That’s precisely what Policy Evaluation does—it helps compute the value function for a given policy.
Mathematically, we express the value function for a policy as:

    Vπ(s) = Eπ[ Rt+1 + γRt+2 + γ²Rt+3 + … | St = s ]
Where:

  • γ is the discount factor, controlling how much we care about future rewards.
  • Rt+1 is the reward at time t+1.
This means we evaluate the expected sum of rewards, discounted over time, if we start in state s and follow policy π. The Bellman Equation gives us a recursive way to compute this:

    Vπ(s) = Σa π(a|s) Σs' P(s'|s, a) [ R(s, a) + γVπ(s') ]

In words: the value of a state is the expected immediate reward plus the discounted value of the state we land in next.
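
Before moving on, here is a minimal sketch of iterative Policy Evaluation for a deterministic policy. It assumes, as in the implementation later in this post, that transition probabilities live in an array P[s, a, s'] and expected rewards in an array R[s, a]; these names are just conventions for the sketch, not a fixed API.

import numpy as np

def policy_evaluation(policy, P, R, gamma=0.9, theta=1e-6):
    """Sweep the Bellman backup over all states until the values stop changing."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            a = policy[s]  # deterministic policy: one action per state
            new_v = np.sum(P[s, a] * (R[s, a] + gamma * V))  # Bellman backup
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V

Each sweep applies the Bellman backup to every state; theta sets how small the largest change must become before we treat the values as converged.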

Policy Iteration: Making Better Decisions Over Time

Now that we know how to evaluate a policy, the next logical step is to improve it! Policy Iteration does exactly that: it alternates between Policy Evaluation (measuring the value of a policy) and Policy Improvement (updating the policy to make better decisions).

Algorithm: Policy Iteration

  1. Initialization: start with an arbitrary policy π and V(s) = 0 for every state.
  2. Policy Evaluation: compute Vπ by sweeping the Bellman backup until the values converge.
  3. Policy Improvement: in every state, act greedily with respect to Vπ:
     π'(s) = argmaxa Σs' P(s'|s, a) [ R(s, a) + γVπ(s') ]
  4. If the policy did not change, stop: it is optimal. Otherwise, return to step 2 with the improved policy.
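
The improvement step is just a one-step lookahead followed by an argmax. Here is a small illustrative sketch, reusing the P[s, a, s'] and R[s, a] array conventions from above:

import numpy as np

def greedy_improvement(V, P, R, gamma=0.9):
    """Return the policy that is greedy with respect to the value function V."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        # One-step lookahead: expected return of each action from state s
        action_values = [np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(n_actions)]
        policy[s] = int(np.argmax(action_values))
    return policy

Policy Iteration then simply alternates policy_evaluation and greedy_improvement until the returned policy stops changing.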

Applying Policy Iteration in the Cliff Walking Environment

Now, let’s put these concepts to the test in the Cliff Walking Environment—a classic reinforcement learning problem. Picture a grid where you must navigate from Start to Goal without falling off a cliff (which incurs a heavy penalty).

  • States: Each grid position.
  • Actions: Move left, right, up, or down.
  • Rewards: -1 for each step, -100 for falling off the cliff, and 0 at the goal.
  • Goal: Find the optimal policy that reaches the goal with minimal penalties.

Using Policy Iteration, we:

  1. Initialize an arbitrary policy (e.g., moving randomly or always moving right).
  2. Evaluate the policy to compute the value function.
  3. Improve the policy by choosing actions that minimize the risk of falling off while reaching the goal efficiently.
  4. Repeat until we find the optimal policy.

Implementation: Policy Iteration

                                        
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-6):
    """Alternate Policy Evaluation and Policy Improvement until the policy is stable.

    P[s, a, s'] holds transition probabilities and R[s, a] the expected reward
    for taking action a in state s.
    """
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)

    while True:
        # Policy Evaluation: sweep the Bellman backup until the values converge
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                a = policy[s]
                V[s] = np.sum(P[s, a] * (R[s, a] + gamma * V))
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break

        # Policy Improvement: act greedily with respect to the current values
        policy_stable = True
        for s in range(n_states):
            old_action = policy[s]
            action_values = [np.sum(P[s, a] * (R[s, a] + gamma * V)) for a in range(n_actions)]
            policy[s] = int(np.argmax(action_values))
            if old_action != policy[s]:
                policy_stable = False

        # Stop once the greedy policy no longer changes
        if policy_stable:
            break

    return policy, V
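
To try this on Cliff Walking, we need the P and R arrays. The sketch below builds them from Gymnasium's CliffWalking-v0, assuming (as with the other toy-text environments) that env.unwrapped.P maps each state and action to a list of (probability, next_state, reward, terminated) tuples; if your environment exposes its dynamics differently, adapt that part. Terminal states are made absorbing with zero reward so the goal keeps a value of 0.

import numpy as np
import gymnasium as gym

env = gym.make("CliffWalking-v0")
n_states = env.observation_space.n   # 48 cells in the 4 x 12 grid
n_actions = env.action_space.n       # 4 moves: up, right, down, left

P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions))
terminal_states = set()

for s in range(n_states):
    for a in range(n_actions):
        for prob, s_next, reward, terminated in env.unwrapped.P[s][a]:
            P[s, a, s_next] += prob
            R[s, a] += prob * reward
            if terminated:
                terminal_states.add(s_next)

# Make terminal states absorbing with zero reward so their value stays at 0
for s in terminal_states:
    P[s, :, :] = 0.0
    P[s, :, s] = 1.0
    R[s, :] = 0.0

policy, V = policy_iteration(P, R, gamma=0.9)
print(policy.reshape(4, 12))  # one greedy action (0-3) per grid cell

With deterministic transitions, the resulting policy should head straight for the goal along the row just above the cliff rather than taking a long detour.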
                                    
                                    

Wrapping Up

We’ve taken a deep dive into Policy Evaluation and Policy Iteration, seeing how they help refine decision-making in MDPs. Whether it’s choosing the best route as a delivery driver, improving Netflix recommendations, or avoiding the dreaded cliffs in reinforcement learning, these concepts have real-world significance.

In the next section, we’ll push further into Value Iteration, another powerful way to solve MDPs efficiently. Ready to see how we can find the best policies even faster? Stay tuned! 🚀
