import os
# get rid of the audio warnings
os.environ["SDL_AUDIODRIVER"] = "dummy"Reinforcement Learning: Tabular Q-Learning
I decided to start learning reinforcement learning with the Frozen Lake game and Q-learning.
As I began exploring reinforcement learning, a colleague suggested starting with Q-learning, one of the simplest and most widely used algorithms in the field. To get a hands-on feel for the fundamentals, I decided to replicate the official Solving Frozenlake with Tabular Q-Learning tutorial from the Gymnasium docs.
This post captures that learning journey: walking through the environment, understanding the steps of Q-learning, and getting comfortable with the Gymnasium library along the way.
The Frozen Lake Environment
The frozen lake is a small, grid-based reinforcement learning environment. We play as an elf whose goal is to cross a frozen lake from the starting tile in the top-left corner to a present in the bottom-right corner, without falling into any holes along the way.
To make the task more interesting (and more realistic), the lake can be slippery. When slipperiness is enabled, the elf does not always move in the intended direction and may occasionally slip sideways.

Action Space
The action space is simple and discrete. At each timestep, the agent can choose one of four actions:
- move left,
- move down,
- move right,
- move up.
Formally, the action is represented as a scalar with shape \((1,)\), taking values from the set \(\{ 0, 1, 2, 3 \}\).
This representation is a general abstraction used throughout Gymnasium. In more complex environments, a single action may encode multiple simultaneous commands. For example, in a game like Super Mario, a player can jump while moving left or right. Such combinations are still treated as a single action by the environment.
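For Frozen Lake specifically, the four-action discrete space can be inspected and sampled directly. A minimal sketch, assuming the standard FrozenLake-v1 registration:
import gymnasium as gym

env = gym.make("FrozenLake-v1")
print(env.action_space)           # Discrete(4): 0=left, 1=down, 2=right, 3=up
print(env.action_space.sample())  # a random action id from {0, 1, 2, 3}
env.close()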
Observation Space
The observation returned by the environment represents the agent’s current position on the grid. Since Frozen Lake consists of a finite number of discrete tiles, each tile is assigned a unique integer identifier.
For example, a \(3\times{}3\) grid is indexed as:
\[ \begin{matrix} 0 & 1 & 2 \\ 3 & 4 & 5 \\ 6 & 7 & 8 \end{matrix} \]
More generally, the tile index can be computed as:
\[ \text{tile}(r,c) = r \cdot N_{\text{cols}} + c, \]
where:
- \(r\) is the row index,
- \(c\) is the column index,
- \(N_{\text{cols}}\) is the number of columns in the grid.
This discrete state representation makes Frozen Lake particularly well-suited for tabular methods such as Q-learning.
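As a worked example, on the default \(4\times{}4\) map the tile in row 1, column 2 has index \(1 \cdot 4 + 2 = 6\). The same arithmetic as a tiny helper, written here for illustration only (it is not part of Gymnasium):
def tile_index(row: int, col: int, n_cols: int) -> int:
    """Map a (row, col) grid position to the flat state id used by Frozen Lake."""
    return row * n_cols + col

assert tile_index(0, 0, n_cols=4) == 0   # start tile, top-left corner
assert tile_index(1, 2, n_cols=4) == 6
assert tile_index(3, 3, n_cols=4) == 15  # goal tile of the default 4x4 map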
Rewards
The default reward structure is sparse:
- \(+1\) for reaching the goal tile,
- \(0\) for stepping onto a frozen tile,
- \(0\) for falling into a hole.
In other words, the agent receives a reward only when it successfully reaches the goal. This sparse reward setting makes the problem deceptively challenging and highlights the importance of exploration in reinforcement learning.
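The sparse reward is easy to observe directly. A minimal sketch, assuming the default \(4\times{}4\) map with slipperiness disabled so the move is deterministic:
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=False)
state, _ = env.reset(seed=0)

# Action 1 = move down, which lands on a frozen tile of the default map.
state, reward, terminated, truncated, _ = env.step(1)
print(state, reward, terminated)  # the reward stays 0 until the goal tile is reached
env.close()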
For full details, see the official Frozen Lake environment documentation.
Reinforcement Learning Formulation
Frozen Lake can be formalized as a finite Markov Decision Process (MDP) defined by the tuple \((\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)\).
State Space \(\mathcal{S}\)
The state space consists of all discrete tiles on the grid:
\[ \mathcal{S} = \{ 0, 1, \dots, (N_\text{rows}\cdot{}N_\text{cols}-1) \}. \]
Each state uniquely represents the agent’s current position in the lake.
Action Space \(\mathcal{A}\)
At each time step, the agent can choose one of four actions:
\[ \mathcal{A} = \{ \text{left}, \text{down}, \text{right}, \text{up} \}. \]
These actions correspond to deterministic intentions, even though the actual transition may be stochastic when the lake is slippery.
Transition Dynamics \(P(s'|s,a)\)
The transition function defines the probability of moving from state \(s\) to state \(s'\) after taking action \(a\).
- In the non-slippery version of the environment, transitions are deterministic.
- In the slippery version, the intended action may fail, and the agent may move in a perpendicular direction with non-zero probability.
This stochasticity makes Frozen Lake a useful testbed for algorithms that must learn under uncertainty.
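The toy-text environments keep this transition model in an attribute, env.unwrapped.P, a dictionary mapping state and action to a list of (probability, next_state, reward, terminated) tuples. It is an implementation detail rather than part of the public API, but it is handy for inspecting the slippery dynamics, as in this sketch:
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
env.reset(seed=0)

# From the start tile (state 0), intending to move right (action 2):
for prob, next_state, reward, terminated in env.unwrapped.P[0][2]:
    print(f"p={prob:.2f} -> s'={next_state}, r={reward}, done={terminated}")
env.close()
With is_slippery=True, three outcomes appear with probability 1/3 each: the intended move and the two perpendicular slips.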
Reward Function \(R(s, a, s')\)
The reward function is sparse and simple:
\[ R(s, a, s') = \begin{cases} 1, & \text{if $s'$ is the goal state},\\ 0, & \text{otherwise}. \end{cases} \]
Episodes terminate when the agent reaches the goal or falls into a hole.
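Putting the pieces together, tabular Q-learning maintains a table \(Q(s,a)\) and, after each transition \((s, a, r, s')\), applies the update
\[ Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big], \]
where \(\alpha\) is the learning rate. This is the rule implemented by the update method of the Qlearning class below.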
from dataclasses import dataclass
import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from aquarel import load_theme
from gymnasium.envs.toy_text.frozen_lake import generate_random_map
from tqdm.auto import tqdm, trange
%config InlineBackend.figure_formats = {'retina', 'png'}

@dataclass(frozen=True, slots=True)
class Params:
n_runs: int = 20 # number of runs from scratch
total_episodes: int = 2_000 # total episodes (# of playthroughs) in the same run
learning_rate: float = 0.8 # Q-Learning learning rate
gamma: float = 0.95 # discounting rate
epsilon: float = 0.1 # probability of exploration vs. exploitation
proba_frozen: float = 0.9 # probability that a tile is frozen (not a hole)
    is_slippery: bool = False  # enables slipping: 1/3 intended direction, 1/3 each perpendicular
    seed: int = 123  # seed for reproducibility

SHOW_PROGRESS: bool = False
The Implementation
class Qlearning:
qtable: np.ndarray
def __init__(self, lr: float, gamma: float, state_size: int, action_size: int) -> None:
self.lr = lr
self.gamma = gamma
self.state_size = state_size
self.action_size = action_size
self.reset_qtable()
def update(self, state: int, action: int, reward: float, new_state: int) -> float:
"""Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]"""
delta = reward + self.gamma * np.max(self.qtable[new_state, :]) - self.qtable[state, action]
q_update = self.qtable[state, action] + self.lr * delta
return q_update
def reset_qtable(self) -> None:
"""Reset the Q-table."""
self.qtable = np.zeros((self.state_size, self.action_size))
class EpsilonGreedy:
def __init__(self, epsilon: float, seed: int | None) -> None:
self.eps = epsilon
self.rng = np.random.default_rng(seed)
def choose_action(self, action_space: gym.spaces.Space, state: int, qtable: np.ndarray) -> int:
"""Choose an action `a` in the current world state (s)."""
action: int
# random number decides whether we do ...
explore_exploit_tradeoff = self.rng.uniform(0, 1)
if explore_exploit_tradeoff < self.eps: # ... exploration (random action) ...
action = action_space.sample()
else: # ... or exploitation (use direction with the biggest Q-value for this state)
(max_ids,) = np.where(qtable[state, :] == max(qtable[state, :]))
            action = self.rng.choice(max_ids)  # break ties randomly if several actions share the max Q-value
        return action
Define Training Loop
def run_env(env: gym.Env, learner: Qlearning, explorer: EpsilonGreedy, p: Params, state_size: int, action_size: int):
rewards = np.zeros((p.total_episodes, p.n_runs), dtype=float)
steps = np.zeros((p.total_episodes, p.n_runs), dtype=int)
episodes = np.arange(p.total_episodes, dtype=int)
qtables = np.zeros((p.n_runs, state_size, action_size), dtype=float)
all_states: list[int] = []
all_actions: list[int] = []
for run in trange(p.n_runs, leave=False, disable=(not SHOW_PROGRESS)):
learner.reset_qtable()
for episode in tqdm(episodes, leave=False, disable=(not SHOW_PROGRESS)):
state, _ = env.reset(seed=p.seed)
step: int = 0
done: bool = False
total_rewards: float = 0.0
while not done:
action = explorer.choose_action(action_space=env.action_space, state=state, qtable=learner.qtable)
# log all the stats and actions
all_states.append(state)
all_actions.append(action)
# take the action $a$ and observe the outcome state $s'$ and reward $r$
new_state, reward, terminated, truncated, info = env.step(action)
                # mark the episode as done when it terminated (goal or hole) or was truncated (time limit)
done = terminated or truncated
# learner updates Q-table
learner.qtable[state, action] = learner.update(state, action, float(reward), new_state)
total_rewards += float(reward)
step += 1
                # the new state becomes the current state
state = new_state
# log all rewards and steps
rewards[episode, run] = total_rewards
steps[episode, run] = step
qtables[run, :, :] = learner.qtable
    return rewards, steps, episodes, qtables, all_states, all_actions

def postprocess(episodes: np.ndarray, params: Params, rewards: np.ndarray, steps: np.ndarray, map_size: int):
"""Convert the results of the simulation into dataframes."""
res = pd.DataFrame(
data={
"Episodes": np.tile(episodes, reps=params.n_runs),
"Rewards": rewards.flatten(order="F"),
"Steps": steps.flatten(order="F"),
}
)
res["cum_rewards"] = rewards.cumsum(axis=0).flatten(order="F")
res["map_size"] = np.repeat(f"{map_size}x{map_size}", res.shape[0])
st = pd.DataFrame(data={"Episodes": episodes, "Steps": steps.mean(axis=1)})
st["map_size"] = np.repeat(f"{map_size}x{map_size}", st.shape[0])
    return res, st

def qtable_directions_map(qtable: np.ndarray, map_size: int):
"""Get the best learned action & map it to arrows."""
    eps = np.finfo(qtable.dtype).eps  # machine epsilon: smallest representable increment for this dtype
directions = {0: "←", 1: "↓", 2: "→", 3: "↑"}
qtable_val_max = qtable.max(axis=1).reshape(map_size, map_size)
qtable_best_action = np.argmax(qtable, axis=1).reshape(map_size, map_size)
qtable_directions = np.empty(qtable_best_action.size, dtype=str)
for idx, val in enumerate(qtable_best_action.flat):
if qtable_val_max.flat[idx] > eps:
            # Assign an arrow only if a non-zero Q-value was learned for this tile;
            # otherwise action 0 (left) would be drawn even on tiles where
            # nothing was actually learned.
qtable_directions[idx] = directions[val]
qtable_directions = qtable_directions.reshape(map_size, map_size)
    return qtable_val_max, qtable_directions

def plot_q_values_map(qtable: np.ndarray, env: gym.Env, map_size: int):
"""Plot the last frame of the simulation and the policy learned."""
qtable_val_max, qtable_directions = qtable_directions_map(qtable, map_size)
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5.5), constrained_layout=True)
ax[0].imshow(env.render(), aspect="equal", interpolation="none")
ax[0].axis("off")
ax[0].set_title("Last frame")
# Plot the policy
sns.heatmap(
qtable_val_max,
annot=qtable_directions,
fmt="",
square=True,
ax=ax[1],
cmap=sns.color_palette("Blues", as_cmap=True),
linewidths=0.5,
linecolor="black",
xticklabels=[],
yticklabels=[],
)
ax[1].set(title="Learned Q-values\nArrows represent best action")
ax[1].axis("off")
# autoscale annotation font size
rows, cols = qtable_val_max.shape
bbox = ax[0].get_window_extent().transformed(fig.dpi_scale_trans.inverted())
width_in, height_in = bbox.width, bbox.height
# Heuristic scaling factor (tweak as needed)
scale = min(width_in / cols, height_in / rows)
fontsize = scale * 50
# Apply new font size
for text in ax[1].texts:
text.set_fontsize(fontsize)
for _, spine in ax[1].spines.items():
spine.set_visible(True)
spine.set_linewidth(0.7)
spine.set_color("black")
    return fig, ax

def plot_states_actions_distribution(states: list[int], actions: list[int], map_size: int):
"""Plot the distributions of states and actions."""
labels = {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(11, 5), constrained_layout=True)
sns.histplot(data=states, ax=ax[0], kde=True)
ax[0].set_title("States")
sns.histplot(data=actions, ax=ax[1])
ax[1].set_xticks(list(labels.values()), labels=labels.keys())
ax[1].set_title("Actions")
    return fig, ax

def plot_steps_and_rewards(rewards_df: pd.DataFrame, steps_df: pd.DataFrame):
"""Plot the steps and rewards from dataframes."""
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(11, 5), constrained_layout=True)
sns.lineplot(data=rewards_df, x="Episodes", y="cum_rewards", hue="map_size", linewidth=0.7, ax=ax[0])
ax[0].set(ylabel="Cumulated rewards")
sns.lineplot(data=steps_df, x="Episodes", y="Steps", hue="map_size", linewidth=0.7, ax=ax[1])
ax[1].set(ylabel="Averaged steps number")
for axi in ax:
axi.legend(title="map size")
    return fig, ax

from collections.abc import Callable
EnvFactory = Callable[[int], gym.Env]
def run_experiments(make_env: EnvFactory, params: Params, map_sizes: list[int] | int):
res_all = pd.DataFrame()
st_all = pd.DataFrame()
if isinstance(map_sizes, int):
map_sizes = [map_sizes]
for map_size in map_sizes:
env = make_env(map_size)
action_size: int | None = getattr(env.action_space, "n", None)
assert action_size is not None
state_size: int | None = getattr(env.observation_space, "n", None)
assert state_size is not None
env.action_space.seed(params.seed) # Set the seed to get reproducible results when sampling the action space
learner = Qlearning(
lr=params.learning_rate, gamma=params.gamma, state_size=state_size, action_size=action_size
)
explorer = EpsilonGreedy(epsilon=params.epsilon, seed=params.seed)
print(f"Map size: {map_size}x{map_size}")
rewards, steps, episodes, qtables, all_states, all_actions = run_env(
env, learner, explorer, params, state_size, action_size
)
# Save the results in dataframes
res, st = postprocess(episodes, params, rewards, steps, map_size)
res_all = pd.concat([res_all, res])
st_all = pd.concat([st_all, st])
qtable = qtables.mean(axis=0) # Average the Q-table between runs
with load_theme("ambivalent"):
plot_states_actions_distribution(states=all_states, actions=all_actions, map_size=map_size)
plt.show()
with load_theme("ambivalent"):
plot_q_values_map(qtable, env, map_size)
plt.show()
env.close()
with load_theme("ambivalent"):
plot_steps_and_rewards(res_all, st_all)
        plt.show()

def make_frozenlake_env(params: Params) -> EnvFactory:
def _factory(map_size: int) -> gym.Env:
return gym.make(
"FrozenLake-v1",
is_slippery=params.is_slippery,
render_mode="rgb_array",
desc=generate_random_map(size=map_size, p=params.proba_frozen, seed=params.seed),
# reward_schedule=(10.0, -1.0, -0.01), # reach goal, reach hole, reach frozen (includes Start)
)
return _factory
map_sizes = [4, 7, 9, 11]
params = Params()
run_experiments(make_frozenlake_env(params), params, map_sizes)
Map size: 4x4
Map size: 7x7
Map size: 9x9
Map size: 11x11
[Per map size: state/action distribution plots and the learned Q-value map; final figure: cumulative rewards and average steps per episode]
Appendix
SHOW_PROGRESS = False
def make_frozenlake_env(params: Params) -> EnvFactory:
def _factory(map_size: int) -> gym.Env:
return gym.make(
"FrozenLake-v1",
is_slippery=params.is_slippery,
render_mode="rgb_array",
desc=generate_random_map(size=map_size, p=params.proba_frozen, seed=params.seed),
reward_schedule=(10.0, -10.0, -0.01), # reach goal, reach hole, reach frozen (includes Start)
)
return _factory
params = Params()
run_experiments(make_frozenlake_env(params), params, map_sizes=[5, 25])
Map size: 5x5
Map size: 25x25
[Per map size: state/action distribution plots and the learned Q-value map; final figure: cumulative rewards and average steps per episode]