Greg's blog

Reinforcement Learning: Tabular Q-Learning

Gregor Cerar — Tue, 16 Dec 2025 23:00:00 GMT

As I started exploring reinforcement learning, a colleague suggested I start with Q-learning, one of the simplest and most widely used algorithms in the field. To get a hands-on feel for the fundamentals, I decided to replicate the official Solving Frozenlake with Tabular Q-Learning tutorial from the Gymnasium docs.

This post captures that learning journey: walking through the environment, understanding the Q-learning steps, and getting comfortable with the Gymnasium library along the way.

The Frozen Lake Environment

The frozen lake is a small, grid-based reinforcement learning environment. We play as an elf whose goal is to cross a frozen lake from the starting tile (top-left corner) to a present (bottom-right corner) without falling into any holes along the way.

To make the task more interesting (and more realistic), the lake can be slippery. When slipperiness is enabled, the elf does not always move exactly in the intended direction and may occasionally slip sideways.

Frozen Lake (3x3) sample

Action Space

The action space is simple and discrete. At each timestep, the agent can choose one of four actions:

0: move left,
1: move down,
2: move right,
3: move up.

Formally, the action is represented as a scalar with shape $(1,)$, taking values from the set $\{0, 1, 2, 3\}$.

This representation is a general abstraction used throughout Gymnasium. In more complex environments, a single action may encode multiple simultaneous commands. For example, in a game like Super Mario, a player can jump while moving left or right. Such combinations are still treated as a single action by the environment.

Observation Space

The observation returned by the environment represents the agent’s current position on the grid. Since Frozen Lake consists of a finite number of discrete tiles, each tile is assigned a unique integer identifier.

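For example, a $3 \times 3$ grid is indexed as:

$$
\begin{bmatrix}
0 & 1 & 2 \\
3 & 4 & 5 \\
6 & 7 & 8
\end{bmatrix}
$$

More generally, the tile index can be computed as:

$$
s = r \cdot n_{\text{cols}} + c
$$

where:

$r$ is the row index,
$c$ is the column index,
$n_{\text{cols}}$ is the number of columns in the grid.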

This discrete state representation makes Frozen Lake particularly well-suited for tabular methods such as Q-learning.
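To make the indexing concrete, here is a minimal sketch of the row-major mapping and its inverse. The helper names `tile_index` and `tile_position` are my own for illustration and are not part of Gymnasium.

```python
def tile_index(row: int, col: int, n_cols: int) -> int:
    """Row-major index of the tile at (row, col)."""
    return row * n_cols + col


def tile_position(index: int, n_cols: int) -> tuple[int, int]:
    """Inverse mapping: recover (row, col) from a tile index."""
    return divmod(index, n_cols)


# On a 4x4 grid, the tile in row 2, column 3 gets index 2 * 4 + 3 = 11.
assert tile_index(2, 3, n_cols=4) == 11
assert tile_position(11, n_cols=4) == (2, 3)
```

Because every tile maps to exactly one integer, a Q-table can simply use that integer as its row index.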

Rewards

The default reward structure is sparse:

$+1$ for reaching the goal tile,
$0$ for stepping onto a frozen tile,
$0$ for falling into a hole.

In other words, the agent receives a reward only when it successfully reaches the goal. This sparse reward setting makes the problem deceptively challenging and highlights the importance of exploration in reinforcement learning.

For full details, see the official Frozen Lake environment documentation.

Reinforcement Learning Formulation

Frozen Lake can be formalized as a finite Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R)$.

State Space

The state space consists of all discrete tiles on the grid:

$$
\mathcal{S} = \{0, 1, \dots, n_{\text{rows}} \cdot n_{\text{cols}} - 1\}
$$

Each state uniquely represents the agent’s current position in the lake.

Action Space

At each time step, the agent can choose one of four actions:

$$
\mathcal{A} = \{\text{left}, \text{down}, \text{right}, \text{up}\}
$$

These actions correspond to deterministic intentions, even though the actual transition may be stochastic when the lake is slippery.

Transition Dynamics

The transition function $P(s' \mid s, a)$ defines the probability of moving from state $s$ to state $s'$ after taking action $a$.

In the non-slippery version of the environment, transitions are deterministic.
In the slippery version, the intended action may fail, and the agent may move in a perpendicular direction with non-zero probability.

This stochasticity makes Frozen Lake a useful testbed for algorithms that must learn under uncertainty.
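To see what this looks like for a single state-action pair, here is a rough sketch of the transition distribution on a grid. It assumes the 1/3 forward, 1/3 left, 1/3 right split used for slipping later in this post, and it ignores holes and the goal tile (which end the episode). The functions `move` and `transition_probs` are hypothetical helpers, not Gymnasium API.

```python
# Action encoding used by Frozen Lake: 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


def move(state: int, action: int, n_rows: int, n_cols: int) -> int:
    """Apply one move; bumping into the border leaves the agent in place."""
    row, col = divmod(state, n_cols)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), n_rows - 1)
    col = min(max(col + d_col, 0), n_cols - 1)
    return row * n_cols + col


def transition_probs(
    state: int, action: int, n_rows: int, n_cols: int, slippery: bool
) -> dict[int, float]:
    """Return {next_state: probability} for one (state, action) pair."""
    # Slippery: the intended direction and the two perpendicular ones, 1/3 each.
    outcomes = [(action - 1) % 4, action, (action + 1) % 4] if slippery else [action]
    probs: dict[int, float] = {}
    for a in outcomes:
        s_next = move(state, a, n_rows, n_cols)
        probs[s_next] = probs.get(s_next, 0.0) + 1.0 / len(outcomes)
    return probs


# Centre tile (index 4) of a 3x3 grid, intended action "right".
print(transition_probs(4, 2, 3, 3, slippery=False))
print(transition_probs(4, 2, 3, 3, slippery=True))
```

In the slippery case, the probability mass is spread over the tiles to the right, below, and above the centre tile, so the same action can lead to three different next states.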

Reward Function

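The reward function is sparse and simple:

$$
R(s, a, s') =
\begin{cases}
1 & \text{if } s' \text{ is the goal tile,} \\
0 & \text{otherwise.}
\end{cases}
$$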

Episodes terminate when the agent reaches the goal or falls into a hole.

import os

# get rid of the audio warnings
os.environ["SDL_AUDIODRIVER"] = "dummy"

from dataclasses import dataclass

import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from aquarel import load_theme
from gymnasium.envs.toy_text.frozen_lake import generate_random_map
from tqdm.auto import tqdm, trange

%config InlineBackend.figure_formats = {'retina', 'png'}

@dataclass(frozen=True, slots=True)
class Params:
    n_runs: int = 20  # number of runs from scratch
    total_episodes: int = 2_000  # total episodes (# of playthroughs) in the same run
    learning_rate: float = 0.8  # Q-Learning learning rate
    gamma: float = 0.95  # discounting rate
    epsilon: float = 0.1  # probability of exploration vs. exploitation
    proba_frozen: float = 0.9  # probability that a tile is frozen (not a hole)
    is_slippery: bool = False  # enables slipping: 1/3 forward, 1/3 left, 1/3 right
    seed: int = 123  # seed for reproducibility

SHOW_PROGRESS: bool = False

The Implementation
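Before the classes, a small worked example of the update rule they implement, Q(s,a) := Q(s,a) + lr * [r + gamma * max Q(s',a') - Q(s,a)]. This is only a sketch on plain floats, using the learning rate (0.8) and discount (0.95) from `Params`; the standalone function `q_update` is my own illustration.

```python
def q_update(q_sa: float, reward: float, max_q_next: float, lr: float, gamma: float) -> float:
    """One tabular Q-learning update for a single (state, action) entry."""
    td_target = reward + gamma * max_q_next
    return q_sa + lr * (td_target - q_sa)


# Stepping onto the goal from a fresh table: reward 1, nothing known about s' yet.
q_goal = q_update(q_sa=0.0, reward=1.0, max_q_next=0.0, lr=0.8, gamma=0.95)

# One tile earlier: no reward, but the next state now has value q_goal.
q_before = q_update(q_sa=0.0, reward=0.0, max_q_next=q_goal, lr=0.8, gamma=0.95)

print(q_goal, q_before)
```

The first update gives 0.8, and the second gives roughly 0.608. This is how the single reward at the goal propagates backwards through the table, one tile per visit, shrinking by the discount factor along the way.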

class Qlearning:
    qtable: np.ndarray

    def __init__(self, lr: float, gamma: float, state_size: int, action_size: int) -> None:
        self.lr = lr
        self.gamma = gamma
        self.state_size = state_size
        self.action_size = action_size
        self.reset_qtable()

    def update(self, state: int, action: int, reward: float, new_state: int) -> float:
        """Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]"""
        delta = reward + self.gamma * np.max(self.qtable[new_state, :]) - self.qtable[state, action]

        q_update = self.qtable[state, action] + self.lr * delta
        return q_update

    def reset_qtable(self) -> None:
        """Reset the Q-table."""
        self.qtable = np.zeros((self.state_size, self.action_size))


class EpsilonGreedy:
    def __init__(self, epsilon: float, seed: int | None) -> None:
        self.eps = epsilon
        self.rng = np.random.default_rng(seed)

    def choose_action(self, action_space: gym.spaces.Space, state: int, qtable: np.ndarray) -> int:
        """Choose an action `a` in the current world state (s)."""
        action: int

        # random number decides whether we do ...
        explore_exploit_tradeoff = self.rng.uniform(0, 1)
        if explore_exploit_tradeoff < self.eps:  # ... exploration (random action) ...
            action = action_space.sample()
        else:  # ... or exploitation (use direction with the biggest Q-value for this state)
            (max_ids,) = np.where(qtable[state, :] == np.max(qtable[state, :]))
            action = self.rng.choice(max_ids)  # break ties at random among actions sharing the max Q-value

        return action

Define Training Loop

def run_env(env: gym.Env, learner: Qlearning, explorer: EpsilonGreedy, p: Params, state_size: int, action_size: int):
    rewards = np.zeros((p.total_episodes, p.n_runs), dtype=float)
    steps = np.zeros((p.total_episodes, p.n_runs), dtype=int)
    episodes = np.arange(p.total_episodes, dtype=int)

    qtables = np.zeros((p.n_runs, state_size, action_size), dtype=float)

    all_states: list[int] = []
    all_actions: list[int] = []

    for run in trange(p.n_runs, leave=False, disable=(not SHOW_PROGRESS)):
        learner.reset_qtable()

        for episode in tqdm(episodes, leave=False, disable=(not SHOW_PROGRESS)):
            state, _ = env.reset(seed=p.seed)
            step: int = 0
            done: bool = False
            total_rewards: float = 0.0

            while not done:
                action = explorer.choose_action(action_space=env.action_space, state=state, qtable=learner.qtable)

                # log all the states and actions
                all_states.append(state)
                all_actions.append(action)

                # take the action $a$ and observe the outcome state $s'$ and reward $r$
                new_state, reward, terminated, truncated, info = env.step(action)

                # mark as done whether the episode terminated (goal or hole) or was truncated (time limit reached)
                done = terminated or truncated

                # learner updates Q-table
                learner.qtable[state, action] = learner.update(state, action, float(reward), new_state)

                total_rewards += float(reward)
                step += 1

                # the new state becomes the current state
                state = new_state

            # log all rewards and steps
            rewards[episode, run] = total_rewards
            steps[episode, run] = step

        qtables[run, :, :] = learner.qtable

    return rewards, steps, episodes, qtables, all_states, all_actions

def postprocess(episodes: np.ndarray, params: Params, rewards: np.ndarray, steps: np.ndarray, map_size: int):
    """Convert the results of the simulation into dataframes."""

    res = pd.DataFrame(
        data={
            "Episodes": np.tile(episodes, reps=params.n_runs),
            "Rewards": rewards.flatten(order="F"),
            "Steps": steps.flatten(order="F"),
        }
    )
    res["cum_rewards"] = rewards.cumsum(axis=0).flatten(order="F")
    res["map_size"] = np.repeat(f"{map_size}x{map_size}", res.shape[0])

    st = pd.DataFrame(data={"Episodes": episodes, "Steps": steps.mean(axis=1)})
    st["map_size"] = np.repeat(f"{map_size}x{map_size}", st.shape[0])

    return res, st

def qtable_directions_map(qtable: np.ndarray, map_size: int):
    """Get the best learned action & map it to arrows."""

    eps = np.finfo(qtable.dtype).eps  # machine epsilon: gap between 1.0 and the next representable float
    directions = {0: "←", 1: "↓", 2: "→", 3: "↑"}

    qtable_val_max = qtable.max(axis=1).reshape(map_size, map_size)
    qtable_best_action = np.argmax(qtable, axis=1).reshape(map_size, map_size)
    qtable_directions = np.empty(qtable_best_action.size, dtype=str)

    for idx, val in enumerate(qtable_best_action.flat):
        if qtable_val_max.flat[idx] > eps:
            # Assign an arrow only if a minimal Q-value has been learned as best action
            # otherwise since 0 is a direction, it also gets mapped on the tiles where
            # it didn't actually learn anything
            qtable_directions[idx] = directions[val]

    qtable_directions = qtable_directions.reshape(map_size, map_size)
    return qtable_val_max, qtable_directions

def plot_q_values_map(qtable: np.ndarray, env: gym.Env, map_size: int):
    """Plot the last frame of the simulation and the policy learned."""

    qtable_val_max, qtable_directions = qtable_directions_map(qtable, map_size)

    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5.5), constrained_layout=True)

    ax[0].imshow(env.render(), aspect="equal", interpolation="none")
    ax[0].axis("off")
    ax[0].set_title("Last frame")

    # Plot the policy
    sns.heatmap(
        qtable_val_max,
        annot=qtable_directions,
        fmt="",
        square=True,
        ax=ax[1],
        cmap=sns.color_palette("Blues", as_cmap=True),
        linewidths=0.5,
        linecolor="black",
        xticklabels=[],
        yticklabels=[],
    )
    ax[1].set(title="Learned Q-values\nArrows represent best action")
    ax[1].axis("off")

    # autoscale annotation font size
    rows, cols = qtable_val_max.shape
    bbox = ax[0].get_window_extent().transformed(fig.dpi_scale_trans.inverted())
    width_in, height_in = bbox.width, bbox.height

    # Heuristic scaling factor (tweak as needed)
    scale = min(width_in / cols, height_in / rows)
    fontsize = scale * 50

    # Apply new font size
    for text in ax[1].texts:
        text.set_fontsize(fontsize)

    for _, spine in ax[1].spines.items():
        spine.set_visible(True)
        spine.set_linewidth(0.7)
        spine.set_color("black")

    return fig, ax

def plot_states_actions_distribution(states: list[int], actions: list[int], map_size: int):
    """Plot the distributions of states and actions."""
    labels = {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}

    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(11, 5), constrained_layout=True)
    sns.histplot(data=states, ax=ax[0], kde=True)
    ax[0].set_title("States")

    sns.histplot(data=actions, ax=ax[1])
    ax[1].set_xticks(list(labels.values()), labels=labels.keys())
    ax[1].set_title("Actions")

    return fig, ax

def plot_steps_and_rewards(rewards_df: pd.DataFrame, steps_df: pd.DataFrame):
    """Plot the steps and rewards from dataframes."""
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(11, 5), constrained_layout=True)
    sns.lineplot(data=rewards_df, x="Episodes", y="cum_rewards", hue="map_size", linewidth=0.7, ax=ax[0])
    ax[0].set(ylabel="Cumulated rewards")

    sns.lineplot(data=steps_df, x="Episodes", y="Steps", hue="map_size", linewidth=0.7, ax=ax[1])
    ax[1].set(ylabel="Averaged steps number")

    for axi in ax:
        axi.legend(title="map size")

    return fig, ax

from collections.abc import Callable

EnvFactory = Callable[[int], gym.Env]


def run_experiments(make_env: EnvFactory, params: Params, map_sizes: list[int] | int):
    res_all = pd.DataFrame()
    st_all = pd.DataFrame()

    if isinstance(map_sizes, int):
        map_sizes = [map_sizes]

    for map_size in map_sizes:
        env = make_env(map_size)

        action_size: int | None = getattr(env.action_space, "n", None)
        assert action_size is not None

        state_size: int | None = getattr(env.observation_space, "n", None)
        assert state_size is not None

        env.action_space.seed(params.seed)  # Set the seed to get reproducible results when sampling the action space
        learner = Qlearning(
            lr=params.learning_rate, gamma=params.gamma, state_size=state_size, action_size=action_size
        )
        explorer = EpsilonGreedy(epsilon=params.epsilon, seed=params.seed)

        print(f"Map size: {map_size}x{map_size}")
        rewards, steps, episodes, qtables, all_states, all_actions = run_env(
            env, learner, explorer, params, state_size, action_size
        )

        # Save the results in dataframes
        res, st = postprocess(episodes, params, rewards, steps, map_size)
        res_all = pd.concat([res_all, res])
        st_all = pd.concat([st_all, st])
        qtable = qtables.mean(axis=0)  # Average the Q-table between runs

        with load_theme("ambivalent"):
            plot_states_actions_distribution(states=all_states, actions=all_actions, map_size=map_size)
        plt.show()

        with load_theme("ambivalent"):
            plot_q_values_map(qtable, env, map_size)
        plt.show()

        env.close()

    with load_theme("ambivalent"):
        plot_steps_and_rewards(res_all, st_all)
    plt.show()

def make_frozenlake_env(params: Params) -> EnvFactory:
    def _factory(map_size: int) -> gym.Env:
        return gym.make(
            "FrozenLake-v1",
            is_slippery=params.is_slippery,
            render_mode="rgb_array",
            desc=generate_random_map(size=map_size, p=params.proba_frozen, seed=params.seed),
            # reward_schedule=(10.0, -1.0, -0.01),  # reach goal, reach hole, reach frozen (includes Start)
        )

    return _factory


map_sizes = [4, 7, 9, 11]
params = Params()

run_experiments(make_frozenlake_env(params), params, map_sizes)

Map size: 4x4

Map size: 7x7

Map size: 9x9

Map size: 11x11

Appendix

SHOW_PROGRESS = False


def make_frozenlake_env(params: Params) -> EnvFactory:
    def _factory(map_size: int) -> gym.Env:
        return gym.make(
            "FrozenLake-v1",
            is_slippery=params.is_slippery,
            render_mode="rgb_array",
            desc=generate_random_map(size=map_size, p=params.proba_frozen, seed=params.seed),
            reward_schedule=(10.0, -10.0, -0.01),  # reach goal, reach hole, reach frozen (includes Start)
        )

    return _factory


params = Params()

run_experiments(make_frozenlake_env(params), params, map_sizes=[5, 25])

Map size: 5x5

Map size: 25x25

Reuse

CC BY-NC-SA 4.0

Bernoulli Multi-Armed Bandit Problem

Gregor Cerar — Mon, 15 Dec 2025 23:00:00 GMT

Note

Credits to Lil’s blog post. I slightly improved and extended it for myself to better understand statistical terms.

The exploitation-exploration dilemma exists in many aspects of our lives. If you stick with your favourite option (e.g., restaurant, chatbot, artist, music band), you are confident of what you will get, but you miss the chance to discover an even better one. But if you try new options all the time, you are very likely going to deal with unpleasant service from time to time. Not every new option pays off.

This trade-off becomes especially important when we operate under incomplete information. Without full knowledge of our environment, we must gather information while simultaneously making good decisions. Exploitation uses what we've learned, while exploration risks short-term loss to gain long-term insight.

To see how this plays out in a clean mathematical setting, we turn to a classic model.

What is a Multi-Armed Bandit?

The multi-armed bandit (MAB) captures this dilemma elegantly. Imagine a row of slot machines (i.e., "one-armed bandits"), each with an unknown probability of payout. The goal is to maximize the total reward over time. Each pull (i.e., action) gives you information, but also costs you the chance to pull a better machine.

The Environment

Let's consider the simplest version of the problem. You face several slot machines, each with an unknown Bernoulli reward distribution. Each play either gives you a fixed reward or gives nothing. You have plenty of trials, and your choices don't change the underlying probabilities.

The question is: What is the best strategy to achieve the highest long-term reward?

Note

For newcomers to reinforcement learning (as I was when writing this), the following clarifications help.

First, regret measures how much reward you lost compared to always choosing the best option in hindsight. It quantifies the "if only I had known…" feeling.

Second, the reward probabilities are not known ahead of time. You discover them through experience. This is what makes the problem interesting.

from abc import ABC, abstractmethod

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import numpy.typing as npt
from aquarel import load_theme

%config InlineBackend.figure_formats = {'retina', 'png'}

class BaseBandit(ABC):
    k: int  # number of arms
    best_proba: float | np.float64  # hidden to solver; for regret calculation, highest possible reward probability
    probas: npt.NDArray[np.float64]  # hidden to solver; reward probabilities

    @abstractmethod
    def generate_reward(self, i: int) -> float:
        """Returns reward after lever `i` is pulled."""
        raise NotImplementedError


class BaseSolver(ABC):
    bandit: BaseBandit  # reference to the bandit instance
    counts: npt.NDArray[np.int64]  # hold stats of pulled levers

    actions: list[int]
    rewards: list[float]
    regrets: list[float]

    @abstractmethod
    def __init__(self, bandit: BaseBandit) -> None:
        """bandit (BaseBandit): the target bandit to solve."""
        assert isinstance(bandit, BaseBandit)
        self.bandit = bandit

        self.counts = np.zeros(self.bandit.k, dtype=np.int64)

        self.actions = []  # a history of lever ids, 0 to bandit.k - 1.
        self.rewards = []  # a history of collected rewards.
        self.regrets = []  # a history of regrets for taken actions.

    @property
    def num_steps(self) -> int:
        return len(self.actions)

    def update_regret(self, i: int) -> None:
        """Update the regret after the lever `i` is pulled."""
        regret = self.bandit.best_proba - self.bandit.probas[i]
        self.regrets.append(regret)

    @property
    @abstractmethod
    def estimated_probas(self) -> npt.NDArray[np.float64]:
        """Retrieve learned reward probability for each arm `n` of the bandit."""
        raise NotImplementedError

    @abstractmethod
    def run_one_step(self) -> tuple[int, float]:
        """Return solver's selected action and bandit's outcome reward."""
        raise NotImplementedError

    def run(self, num_steps: int) -> None:
        """Run simulation for `num_steps` steps."""
        for _ in range(num_steps):
            i, r = self.run_one_step()

            self.counts[i] += 1
            self.actions.append(i)
            self.update_regret(i)
            self.rewards.append(r)

Formal Definition

With the intuition in place, we can now describe the Bernoulli multi-armed bandit more formally. A bandit problem is defined as a tuple $\langle \mathcal{A}, \mathcal{R} \rangle$, where:

We have $K$ machines (or levers) with reward probabilities $\{\theta_1, \dots, \theta_K\}$.
At each time step $t$, we take an action $a_t$ on one slot machine and receive a reward $r_t$.
$\mathcal{A}$ is a set of possible actions. The value of an action $a$ is its expected reward, $Q(a) = \mathbb{E}[r \mid a] = \theta$. If action $a_t$ corresponds to machine $i$, then $Q(a_t) = \theta_i$.
$\mathcal{R}$ is the reward function. In a Bernoulli bandit, each pull yields a reward of $1$ with probability $Q(a_t)$, and $0$ otherwise.

Note

Recall that a Bernoulli distribution is a discrete probability distribution, which takes the value $1$ with probability $p$ and $0$ with probability $1 - p$.

The symbol $\mathbb{E}$ denotes the expected value, a generalized weighted average. The expression $\mathbb{E}[r \mid a]$ reads as "the expected reward ($r$) given that we took action $a$."

Crucially, the probabilities are NOT known in advance. They must be estimated through interaction.
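As a quick illustration of "estimated through interaction", here is a minimal sketch that pulls a single Bernoulli arm many times and uses the sample mean as an estimate of its hidden probability. The function `estimate_proba` and the value 0.7 are made up for the example.

```python
import random


def estimate_proba(p: float, n_pulls: int, seed: int = 0) -> float:
    """Estimate a hidden Bernoulli parameter `p` from `n_pulls` simulated pulls."""
    rng = random.Random(seed)
    # Each pull pays 1 with probability p and 0 otherwise.
    rewards = [1 if rng.random() < p else 0 for _ in range(n_pulls)]
    return sum(rewards) / n_pulls


for n in (10, 100, 10_000):
    print(n, estimate_proba(p=0.7, n_pulls=n))
```

With only a handful of pulls the estimate can be far off, which is exactly why a solver cannot trust its early impressions of an arm.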

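A Bernoulli bandit can be seen as a simplified Markov decision process (MDP) without a state space. The objective is to maximize the total reward $\sum_{t=1}^{T} r_t$. Measured against the action with the highest reward probability, this is equivalent to minimizing the regret incurred by not always choosing that optimal action.

Let $\theta^*$ denote the reward probability of the optimal action $a^*$:

$$
\theta^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a) = \max_{1 \le i \le K} \theta_i
$$

The expected cumulative regret up to time $T$ is then:

$$
\mathcal{L}_T = \mathbb{E}\left[ \sum_{t=1}^{T} \left( \theta^* - Q(a_t) \right) \right]
$$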
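For a concrete feel, here is a small sketch that accumulates the per-step regret (best probability minus the probability of the chosen arm) for a fixed sequence of actions. The probabilities and the action sequence are invented for illustration, and `cumulative_regret` is a hypothetical helper.

```python
def cumulative_regret(probas: list[float], actions: list[int]) -> list[float]:
    """Running sum of (best probability - probability of the chosen arm)."""
    best = max(probas)
    total = 0.0
    history: list[float] = []
    for a in actions:
        total += best - probas[a]
        history.append(total)
    return history


# Three arms; the third (index 2) is the best one.
history = cumulative_regret(probas=[0.2, 0.5, 0.8], actions=[0, 2, 1, 2])
print(history)
```

Pulling the best arm adds nothing to the regret, so the curve stays flat on those steps and only grows when a worse arm is chosen.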

class BernoulliBandit(BaseBandit):
    def __init__(
        self,
        k: int,
        probas: list[float] | npt.NDArray[np.float64] | None = None,
        seed: int | None = None,
    ):
        # sanity check: `probas` needs to be None or of size `k`.
        assert probas is None or len(probas) == k

        self.k = k  # save number of arms

        self.rng = np.random.default_rng(seed=seed)

        # draw random probabilities if they are not explicitly defined
        if probas is None:
            probas = self.rng.random(size=self.k)

        # convert to numpy array for easier operations later
        self.probas = np.asarray(probas)

        # in case of a Bernoulli MAB, the highest probability is the optimal one
        self.best_proba = np.max(self.probas)

    def generate_reward(self, i: int) -> int:
        # The player selected the i-th machine.
        return int(self.rng.random() < self.probas[i])

Bandit Strategies

With the bandit problem formally defined, the next question is: how should we choose actions over time? Different strategies encode different assumptions about how exploration should be handled. Broadly, we can distinguish three categories:

No exploration: always exploit the best-known action (naive and generally poor).
Random exploration: explore uniformly at random.
Informed exploration: explore more often when uncertainty is high.

A simple and widely used example of random exploration is the $\epsilon$-greedy algorithm.

Epsilon-Greedy Algorithm

The $\epsilon$-greedy algorithm balances exploitation and exploration by choosing the currently best action most of the time, while occasionally exploring at random.

Information State

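At time step $t$, the algorithm maintains:

empirical action-value estimates $\hat{Q}_t(a)$,
action counts $N_t(a)$,

summarizing all past interactions.

The empirical value estimate for action $a$ is defined as:

$$
\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t-1} r_\tau \, \mathbb{1}[a_\tau = a]
$$

where:

$r_\tau$ is the reward received at time step $\tau$. For a Bernoulli bandit, this is either $1$ (success) or $0$ (no reward).
$\mathbb{1}[a_\tau = a]$ is an indicator function equal to $1$ when action $a$ was taken at time $\tau$, and $0$ otherwise.
$N_t(a)$ is the number of times action $a$ has been selected:

$$
N_t(a) = \sum_{\tau=1}^{t-1} \mathbb{1}[a_\tau = a]
$$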

Policy

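The $\epsilon$-greedy policy defines a stochastic action-selection rule:

with probability $1 - \epsilon$, the greedy action is selected: $a_t = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$,
with probability $\epsilon$, an action is selected uniformly at random.

Equivalently, the policy can be written as:

$$
\pi_t(a) =
\begin{cases}
1 - \epsilon + \dfrac{\epsilon}{K} & \text{if } a = \arg\max_{a'} \hat{Q}_t(a'), \\
\dfrac{\epsilon}{K} & \text{otherwise.}
\end{cases}
$$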
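To check the bookkeeping, here is a minimal sketch that turns a list of value estimates into epsilon-greedy action probabilities. It assumes a unique greedy action (ties go to the first maximiser) and that the random branch may also pick the greedy action. The name `epsilon_greedy_probs` is my own.

```python
def epsilon_greedy_probs(estimates: list[float], eps: float) -> list[float]:
    """Probability of selecting each arm under an epsilon-greedy policy."""
    k = len(estimates)
    greedy = max(range(k), key=lambda i: estimates[i])  # first maximiser on ties
    probs = [eps / k] * k  # uniform exploration mass on every arm
    probs[greedy] += 1.0 - eps  # remaining mass goes to the greedy arm
    return probs


print(epsilon_greedy_probs([0.1, 0.9, 0.3, 0.2], eps=0.1))
```

With four arms and an exploration rate of 0.1, the greedy arm is chosen about 92.5% of the time and each other arm about 2.5% of the time, no matter how bad its estimate is.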

Update Rule

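After selecting action $a_t$ and observing reward $r_t$, the estimate is updated incrementally using the new observation:

$$
N_{t+1}(a_t) = N_t(a_t) + 1, \qquad \hat{Q}_{t+1}(a_t) = \hat{Q}_t(a_t) + \frac{1}{N_{t+1}(a_t)} \left( r_t - \hat{Q}_t(a_t) \right)
$$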
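A running-mean update avoids storing the whole reward history. The sketch below (with a made-up reward sequence and a hypothetical helper `incremental_mean`) shows that nudging the estimate by the scaled error gives the same result as averaging all rewards at the end.

```python
def incremental_mean(rewards: list[float]) -> float:
    """Running average computed one observation at a time."""
    estimate, count = 0.0, 0
    for r in rewards:
        count += 1
        # Move the estimate towards the new reward by 1/count of the error.
        estimate += (r - estimate) / count
    return estimate


rewards = [1.0, 0.0, 1.0, 1.0]
print(incremental_mean(rewards), sum(rewards) / len(rewards))
```

Both expressions evaluate to 0.75 (up to floating-point rounding), so only the current estimate and the pull count need to be kept per arm.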

Note

Despite its simplicity, $\epsilon$-greedy often performs reasonably well. However, because exploration is random and does not depend on uncertainty, it can waste trials on clearly suboptimal actions.

class EpsilonGreedy(BaseSolver):
    def __init__(self, bandit: BaseBandit, eps: float, init_proba: float = 1.0, seed: int | None = None) -> None:
        """
        eps (float): the probability to explore at each time step.
        init_proba (float): default to be 1.0; optimistic initialization
        """
        super().__init__(bandit)

        assert 0.0 <= eps <= 1.0
        self.eps = eps

        # optimistic initialization
        self.estimates = np.full(self.bandit.k, fill_value=init_proba, dtype=np.float64)

        # define random generator with seed for reproducibility
        self.rng = np.random.default_rng(seed=seed)

    @property
    def estimated_probas(self) -> npt.NDArray[np.float64]:
        return self.estimates

    def run_one_step(self) -> tuple[int, float]:
        # With probability epsilon pick random exploration, or pick the known best lever.

        if self.rng.random() < self.eps:
            # pure random exploration
            i = int(self.rng.integers(0, self.bandit.k))
        else:
            # greedy selection with random tie-breaking
            candidates = np.flatnonzero(self.estimates == self.estimates.max())
            i = int(self.rng.choice(candidates))

        r = self.bandit.generate_reward(i)
        self.estimates[i] += 1.0 / (self.counts[i] + 1) * (r - self.estimates[i])

        return i, r

Upper Confidence Bounds (UCB)

Random exploration gives us the opportunity to try actions we know little about. However, pure randomness can also cause us to waste time re-exploring actions that we already have strong evidence are suboptimal (bad luck still happens!). Two broad alternatives exist:

Decay $\epsilon$ over time in $\epsilon$-greedy, making exploration less frequent, or
Act optimistically for uncertain actions, favoring actions where our estimates are still unreliable.

The second idea leads to the class of Upper Confidence Bound (UCB) algorithms. The key intuition is simple:

If we are unsure about an action's value, we pretend it could be good until proven otherwise.

More formally, UCB defines an upper confidence bound $\hat{U}_t(a)$ that measures the uncertainty in our estimate $\hat{Q}_t(a)$. With high probability, the true value satisfies:

$$
Q(a) \le \hat{Q}_t(a) + \hat{U}_t(a)
$$

The uncertainty term $\hat{U}_t(a)$ must shrink as we gather more data. Thus, it is a decreasing function of $N_t(a)$: the more we pull an arm, the more confident we become, and the smaller its uncertainty bonus should be.

Given this, the UCB policy selects the action whose optimistic estimate is highest:

$$
a_t = \arg\max_{a \in \mathcal{A}} \left[ \hat{Q}_t(a) + \hat{U}_t(a) \right]
$$

This ensures a natural balance: well-explored actions rely mostly on $\hat{Q}_t(a)$, while poorly explored actions get an extra boost from their larger uncertainty term.

Unified Definition

Information State

At time step $t$, the UCB algorithm maintains:

empirical action-value estimates $\hat{Q}_t(a)$,
action counts $N_t(a)$.

These quantities summarize the full interaction history.

Policy

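UCB defines a deterministic policy:

$$
\pi_t(a) =
\begin{cases}
1 & \text{if } a = \arg\max_{a'} \left[ \hat{Q}_t(a') + \hat{U}_t(a') \right], \\
0 & \text{otherwise.}
\end{cases}
$$

Unlike $\epsilon$-greedy, exploration is not injected explicitly. Instead, it emerges through optimism in the face of uncertainty.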

Action Selection

At each time step, the selected action is:

$$
a_t = \arg\max_{a \in \mathcal{A}} \left[ \hat{Q}_t(a) + \hat{U}_t(a) \right]
$$

Update Rule

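After selecting action $a_t$ and observing reward $r_t$, the algorithm updates:

the action counts: $N_{t+1}(a_t) = N_t(a_t) + 1$,
the empirical estimate: $\hat{Q}_{t+1}(a_t) = \hat{Q}_t(a_t) + \frac{1}{N_{t+1}(a_t)} \left( r_t - \hat{Q}_t(a_t) \right)$.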

Choosing the Uncertainty Bound

The remaining design choice is how to define the uncertainty bound $\hat{U}_t(a)$. Different choices lead to different members of the UCB family, such as UCB1, which derives its bound from Hoeffding's inequality.

Hoeffding’s Inequality

If we do not want to assume any prior knowledge about the shape of the reward distribution (e.g., Gaussian, exponential), we can rely on Hoeffding's inequality. This theorem applies to any bounded distribution.

A random variable is said to follow a bounded distribution if all its values lie within a fixed finite interval $[x_{\min}, x_{\max}]$. In our case, Bernoulli rewards always lie in $[0, 1]$, so the boundedness assumption is naturally satisfied.

Note

Here are a few examples for intuition:

A Bernoulli distribution is bounded on the interval $[0, 1]$.
A uniform distribution on a finite interval $[x_{\min}, x_{\max}]$ is bounded.
A Gaussian distribution is not bounded because of its infinite tails.

Hoeffding’s Inequality (informal version)

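Let $X_1, \dots, X_n$ be i.i.d. (independent and identically distributed) random variables, all bounded in the interval $[0, 1]$. The sample mean is

$$
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i
$$

Then for any $u > 0$, Hoeffding's inequality states:

$$
\mathbb{P}\left[ \mathbb{E}[X] > \bar{X}_n + u \right] \le e^{-2 n u^2}
$$

This inequality bounds the probability that the true mean exceeds the empirical mean by more than $u$.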
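A quick simulation can make the bound tangible. The sketch below repeatedly draws `n` Bernoulli samples, counts how often the true mean exceeds the sample mean by more than `u`, and compares that frequency with exp(-2 n u^2). The parameter values and the function `hoeffding_check` are arbitrary choices for illustration.

```python
import math
import random


def hoeffding_check(p: float, n: int, u: float, trials: int, seed: int = 0) -> tuple[float, float]:
    """Return (empirical exceedance frequency, Hoeffding bound) for Bernoulli(p)."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        sample_mean = sum(rng.random() < p for _ in range(n)) / n
        if p > sample_mean + u:  # true mean exceeds the estimate by more than u
            exceed += 1
    return exceed / trials, math.exp(-2 * n * u * u)


empirical, bound = hoeffding_check(p=0.5, n=50, u=0.1, trials=5_000)
print(f"empirical={empirical:.3f}  bound={bound:.3f}")
```

The empirical frequency should land well below the bound, which illustrates why Hoeffding is described as general but somewhat loose.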

Applying Hoeffding’s Inequality to Bandit Rewards

To apply this result to the multi-armed bandit setting, we observe that each fixed action defines its own random reward-generating process. Every time we select action $a$, we obtain a reward drawn independently from the same bounded distribution. Therefore, Hoeffding's inequality applies directly to each arm.

For a fixed target action $a$, define:

$r_t(a)$ as the reward random variable,
$Q(a)$ as the true mean reward,
$\hat{Q}_t(a)$ as the sample mean reward,
and $u = \hat{U}_t(a)$ as the upper confidence bound.

By directly identifying Hoeffding's variables with the bandit quantities:

$$
X_i \to r_t(a), \quad \mathbb{E}[X] \to Q(a), \quad \bar{X}_n \to \hat{Q}_t(a), \quad n \to N_t(a), \quad u \to \hat{U}_t(a)
$$

we obtain:

$$
\mathbb{P}\left[ Q(a) > \hat{Q}_t(a) + \hat{U}_t(a) \right] \le e^{-2 N_t(a) \hat{U}_t(a)^2}
$$

This gives a probabilistic upper bound on how much the true reward of an action can exceed its empirical estimate.

Choosing the Upper Confidence Bound

We want to select the confidence bound so that the probability of underestimating the true mean is very small. Let us require this probability to be below a small threshold $p$:

$$
e^{-2 N_t(a) \hat{U}_t(a)^2} = p
$$

Solving for $\hat{U}_t(a)$, we obtain:

$$
\hat{U}_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}
$$

This expression defines how much optimism we should add to the empirical estimate based on how many times the action has been sampled.
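To get a feel for the numbers, here is a minimal sketch that computes the optimism bonus sqrt(-log p / (2 N)) for a failure probability `p` and a pull count `N`. The function name `confidence_radius` and the chosen values are illustrative assumptions.

```python
import math


def confidence_radius(p: float, n_pulls: int) -> float:
    """Hoeffding-style bonus: sqrt(-log(p) / (2 * n_pulls)), for 0 < p < 1."""
    return math.sqrt(-math.log(p) / (2 * n_pulls))


for n in (1, 10, 100, 1000):
    print(n, round(confidence_radius(p=0.05, n_pulls=n), 3))
```

The bonus shrinks with the square root of the pull count, so a hundred times more data only makes the interval ten times narrower.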

UCB1

From the previous section, we obtained a general form of the confidence bound:

$$
\hat{U}_t(a) = \sqrt{\frac{-\log p}{2 N_t(a)}}
$$

The remaining question is how to choose the threshold probability $p$. Intuitively, as time goes on and we collect more data, we want our confidence bounds to become tighter and failures to become increasingly unlikely. A simple and effective heuristic is to let the failure probability decrease with time.

A common choice is:

$$
p = t^{-4}
$$

which makes the failure probabilities summable over time and enables strong regret guarantees.

Substituting this into the confidence bound gives:

$$
\hat{U}_t(a) = \sqrt{\frac{-\log t^{-4}}{2 N_t(a)}} = \sqrt{\frac{2 \log t}{N_t(a)}}
$$

This yields the classic UCB1 algorithm.

At each time step, UCB1 selects the action that maximizes the optimistic estimate of the reward:

$$
a_t = \arg\max_{a \in \mathcal{A}} \left[ \hat{Q}_t(a) + \sqrt{\frac{2 \log t}{N_t(a)}} \right]
$$

Here:

$\hat{Q}_t(a)$ promotes exploitation,
the square-root term promotes exploration, shrinking as $N_t(a)$ increases,
and the $\log t$ term ensures that even rarely chosen actions are revisited occasionally.

Note

Why this works (one sentence intuition)

UCB1 always chooses the action with the highest plausible reward, where “plausible” is defined by a confidence interval that shrinks as evidence accumulates.
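Here is a small standalone sketch of the selection rule on plain lists, separate from the solver class that follows. It assumes every arm has been pulled at least once (the class below adds 1 to the counts to sidestep division by zero instead). The helpers `ucb1_scores` and `ucb1_select` and the numbers are made up for illustration.

```python
import math


def ucb1_scores(estimates: list[float], counts: list[int], t: int) -> list[float]:
    """Optimistic score per arm: estimate + sqrt(2 * log(t) / count)."""
    return [q + math.sqrt(2 * math.log(t) / n) for q, n in zip(estimates, counts)]


def ucb1_select(estimates: list[float], counts: list[int], t: int) -> int:
    """Index of the arm with the highest optimistic score (first one on ties)."""
    scores = ucb1_scores(estimates, counts, t)
    return max(range(len(scores)), key=scores.__getitem__)


# Arm 0 looks better but is well explored; arm 1 has only been tried twice.
print(ucb1_select(estimates=[0.6, 0.5], counts=[100, 2], t=102))
# Once both arms are equally explored, the higher estimate wins.
print(ucb1_select(estimates=[0.6, 0.5], counts=[100, 100], t=200))
```

In the first call the rarely tried arm wins because its bonus dominates, and in the second call the bonuses are equal, so the choice falls back to the better estimate.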

class UCB1(BaseSolver):
    def __init__(self, bandit: BaseBandit, init_proba: float = 1.0, seed: int | None = None):
        super().__init__(bandit)
        self.t = 0  # number of time steps
        self.estimates = np.full(shape=self.bandit.k, fill_value=init_proba, dtype=np.float64)
        self.rng = np.random.default_rng(seed)

    @property
    def estimated_probas(self) -> npt.NDArray[np.float64]:
        return self.estimates

    def run_one_step(self) -> tuple[int, float]:
        self.t += 1

        # Pick the best one with consideration of upper confidence bounds.
        ucb = self.estimates + np.sqrt(2 * np.log(self.t) / (1 + self.counts))

        # tie-breaking
        candidates = np.flatnonzero(ucb == ucb.max())
        i = int(self.rng.choice(candidates))

        r = self.bandit.generate_reward(i)

        self.estimates[i] += 1.0 / (self.counts[i] + 1) * (r - self.estimates[i])

        return i, r

Bayesian UCB

Bayesian UCB is an instance of the UCB principle in which uncertainty is quantified using the posterior distribution of the reward model.

In the UCB and UCB1 algorithms, we do not assume any specific form of the reward distribution. Because of this, we rely on Hoeffding’s inequality, which provides a very general but also somewhat loose confidence bound that works for any bounded distribution.

However, in some applications we may have prior knowledge about how rewards are distributed. When such information is available, we can replace Hoeffding’s generic bound with a distribution-aware confidence bound, leading to a more data-efficient strategy. This idea gives rise to Bayesian UCB.

Using Distributional Assumptions

For example, suppose we model the rewards of each slot machine with a Gaussian likelihood, which (together with a Gaussian prior) induces a Gaussian posterior distribution over the mean reward of each action. After observing $N_t(a)$ rewards for a given action $a$, the posterior is characterized by:

a posterior mean $\mu_t(a)$,
and a posterior standard deviation $\sigma_t(a)$.

In this case, a natural choice for the upper confidence bound is an upper quantile of the posterior, for instance the upper end of a 95% credible interval:

$$\mathrm{UCB}_t(a) = \mu_t(a) + c\,\sigma_t(a)$$

where $c \approx 1.96$ corresponds to a 95% credible interval for a Gaussian distribution.

The Bayesian UCB action selection rule then becomes:

$$a_t = \arg\max_{a} \left[ \mu_t(a) + c\,\sigma_t(a) \right]$$

Interpretation:

$\mu_t(a)$ plays the role of exploitation (current best estimate),
$\sigma_t(a)$ captures uncertainty (how much we still do not know),
the constant $c$ controls how optimistic we are.

Compared to UCB1, where uncertainty depends only on the pull count $N_t(a)$, Bayesian UCB uses the full posterior uncertainty, which often leads to faster learning when the model assumptions are correct.
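As a rough illustration of the Gaussian case, the sketch below computes a posterior mean, a posterior standard deviation, and the resulting upper bound for a single arm. It assumes a known noise standard deviation and a $\mathcal{N}(0, 1)$ prior on the mean reward; both are assumptions made for this example, and the function names are hypothetical.

```python
import math


def gaussian_posterior(rewards, prior_mean=0.0, prior_std=1.0, noise_std=1.0):
    """Conjugate update for the mean of a Gaussian with known noise (assumed)."""
    prior_precision = 1.0 / prior_std**2
    data_precision = len(rewards) / noise_std**2
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean + sum(rewards) / noise_std**2) / post_precision
    post_std = math.sqrt(1.0 / post_precision)
    return post_mean, post_std


def bayes_ucb(rewards, c=1.96):
    """Upper bound mu + c * sigma of the Gaussian posterior."""
    mu, sigma = gaussian_posterior(rewards)
    return mu + c * sigma


few = [1.0] * 4    # an arm observed only 4 times
many = [1.0] * 99  # the same rewards, observed 99 times

print(gaussian_posterior(few), bayes_ucb(few))
print(gaussian_posterior(many), bayes_ucb(many))
```

With more observations the posterior standard deviation shrinks, so the bound moves closer to the posterior mean and the arm receives less of an optimism bonus.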

Key Difference from UCB1

UCB1	Bayesian UCB
No distributional assumption	Explicit reward model
Hoeffding bound	Posterior quantile
Worst-case guarantees	Model dependent efficiency

class BayesianUCB(BaseSolver):
    def __init__(
        self, bandit: BaseBandit, c: float = 2, init_a: float = 1, init_b: float = 1, seed: int | None = None
    ) -> None:
        super().__init__(bandit)

        self.c = c
        self._as = np.full(self.bandit.k, fill_value=init_a, dtype=np.float64)
        self._bs = np.full(self.bandit.k, fill_value=init_b, dtype=np.float64)

        self.t = 0
        self.rng = np.random.default_rng(seed)

    @property
    def estimated_probas(self) -> npt.NDArray[np.float64]:
        return self._as / (self._as + self._bs)

    def run_one_step(self) -> tuple[int, float]:
        from scipy.stats import beta

        self.t += 1

        # ensure each arm is tried at least once
        if self.t <= self.bandit.k:
            i = self.t - 1
        else:
            mu = self._as / (self._as + self._bs)  # posterior mean
            sigma = beta.std(self._as, self._bs)  # posterior std Beta(alpha, beta)
            confidence = mu + self.c * sigma

            # tie-breaking
            candidates = np.flatnonzero(confidence == confidence.max())
            i = int(self.rng.choice(candidates))  # cast NumPy integer to a plain int

        r = self.bandit.generate_reward(i)

        # update Beta posterior for Bernoulli reward
        self._as[i] += r  # successes
        self._bs[i] += 1 - r  # failures

        return i, r

Thompson Sampling

Thompson Sampling defines a stochastic policy that selects actions in proportion to their posterior probability of being optimal.

Bayesian UCB still follows the same basic philosophy as UCB1. It builds an explicit confidence bound and then acts optimistically with respect to that bound. Thompson Sampling takes a more direct and fully Bayesian approach. Instead of computing an upper bound, it samples directly from the posterior distribution and acts on that sample.

The idea is remarkably simple:

Instead of asking “Which action could be best?”, Thompson Sampling asks “Which action is most likely to be the best right now?”

At each time step, we treat the unknown reward probability of each action as a random variable and maintain a posterior distribution over its value. Then:

We sample one plausible reward probability from the posterior of each action.
We select the action with the highest sampled value.
We observe the reward and update the posterior.

This naturally balances exploration and exploitation:

actions with high uncertainty are more likely to occasionally produce large samples → exploration,
actions with high posterior mean consistently produce large samples → exploitation.

No explicit exploration parameter or confidence bound is required.

Thompson Sampling for Bernoulli Bandits (Beta-Bernoulli)

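In the Bernoulli bandit setting, the reward of each action is either $0$ or $1$. The conjugate prior for the Bernoulli distribution is the Beta distribution, so we model the unknown success probability $\theta_a$ of each action $a$ as:

$$\theta_a \sim \mathrm{Beta}(\alpha_a, \beta_a)$$

where:

$\alpha_a$ counts observed successes (plus the prior pseudo-count),
$\beta_a$ counts observed failures (plus the prior pseudo-count).

Initially, we typically use a non-informative prior such as:

$$\alpha_a = \beta_a = 1$$

which is the uniform distribution on $[0, 1]$.

Action Selection

At time $t$, Thompson Sampling performs:

$$\tilde{\theta}_a \sim \mathrm{Beta}(\alpha_a, \beta_a) \quad \text{for each action } a$$

and selects:

$$a_t = \arg\max_{a} \tilde{\theta}_a$$

That is, we draw one plausible value for each arm and act greedily with respect to this randomly sampled world.

Posterior Update

After observing the reward $r_t \in \{0, 1\}$, we update:

$$\alpha_{a_t} \leftarrow \alpha_{a_t} + r_t, \qquad \beta_{a_t} \leftarrow \beta_{a_t} + (1 - r_t)$$

This update is exact Bayesian inference for the Bernoulli-Beta model.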
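The sample, select, update loop can be sketched in a few lines of NumPy. This is a stripped-down illustration on a hypothetical 3-armed bandit, independent of the solver classes used in this post; the arm probabilities and the number of steps are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-armed Bernoulli bandit; the agent never sees these values.
true_probas = np.array([0.2, 0.5, 0.8])

alpha = np.ones(3)  # Beta(1, 1) prior: one pseudo-success per arm
beta = np.ones(3)   # ... and one pseudo-failure per arm

for _ in range(500):
    theta = rng.beta(alpha, beta)                     # 1) sample a plausible probability per arm
    arm = int(np.argmax(theta))                       # 2) act greedily on the sampled values
    reward = float(rng.random() < true_probas[arm])   #    draw a Bernoulli reward
    alpha[arm] += reward                              # 3) conjugate posterior update
    beta[arm] += 1.0 - reward

pulls = alpha + beta - 2.0               # subtract the prior pseudo-counts
posterior_mean = alpha / (alpha + beta)
print(pulls, posterior_mean)
```

Printing `pulls` shows how the trials were distributed across the arms, and `posterior_mean` shows the resulting estimates.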

Why Thompson Sampling Works so Well

Thompson Sampling does not separate exploration from exploitation. Instead, exploration emerges naturally from uncertainty in the posterior:

If an action is well understood, its posterior is sharp (little randomness).
If an action is uncertain, its posterior is wide (occasional optimistic samples).

In contrast:

$\epsilon$-greedy explores blindly,
UCB explores via deterministic optimism,
Thompson Sampling explores via probabilistic belief.
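To make the "sharp versus wide posterior" point concrete, the sketch below compares two Beta posteriors with the same mean but very different amounts of evidence. It uses the closed-form standard deviation of the Beta distribution; the helper name `beta_std` and the chosen parameters are illustrative.

```python
import math


def beta_std(a: float, b: float) -> float:
    """Standard deviation of a Beta(a, b) distribution."""
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))


# Both posteriors have mean 0.8, but they rest on very different sample sizes.
uncertain = beta_std(4.0, 1.0)      # e.g. 3 successes, 0 failures on top of Beta(1, 1)
confident = beta_std(400.0, 100.0)  # hundreds of observations

print(f"few observations:  std = {uncertain:.4f}")
print(f"many observations: std = {confident:.4f}")
```

Samples from the first posterior vary widely and will occasionally beat a better arm, which is exactly the exploration behavior described above; samples from the second stay close to 0.8.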

Relationship to Bayesian UCB

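Bayesian UCB selects actions using:

$$a_t = \arg\max_{a} \left[ \mu_t(a) + c\,\sigma_t(a) \right]$$

where $\mu_t(a)$ and $\sigma_t(a)$ are the posterior mean and standard deviation of action $a$. This corresponds to choosing a fixed upper quantile of the posterior.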

Thompson Sampling instead draws a random quantile at every time step. In this sense:

Bayesian UCB is optimistic; Thompson Sampling is probabilistic.

Both use Bayesian posteriors, but Thompson Sampling avoids manually choosing confidence levels.

class ThompsonSampling(BaseSolver):
    def __init__(self, bandit: BaseBandit, init_a: int = 1, init_b: int = 1, seed: int | None = None) -> None:
        super().__init__(bandit)

        self._as = np.full(self.bandit.k, fill_value=init_a, dtype=np.float64)
        self._bs = np.full(self.bandit.k, fill_value=init_b, dtype=np.float64)

        self.rng = np.random.default_rng(seed)

    @property
    def estimated_probas(self) -> npt.NDArray[np.float64]:
        return self._as / (self._as + self._bs)

    def run_one_step(self) -> tuple[int, float]:
        samples = self.rng.beta(self._as, self._bs)

        # tie-breaking
        candidates = np.flatnonzero(samples == samples.max())
        i = int(self.rng.choice(candidates))

        r = self.bandit.generate_reward(i)

        self._as[i] += r
        self._bs[i] += 1 - r

        return i, r

Benchmark

N_STEPS = 10_000
SEED = 0x42
K = 10

np.random.seed(SEED)
rng = np.random.default_rng(SEED)

# Probabilities {0.0, 0.1, ..., 0.9} then shuffle them
# probas = rng.uniform(0, 1, size=K)
probas = np.linspace(0, 1, K, endpoint=False, dtype=np.float64)
print(probas)
rng.shuffle(probas)

bbandit = BernoulliBandit(k=K, probas=probas, seed=SEED)
epsgreedy = EpsilonGreedy(bbandit, eps=0.01, seed=SEED)
epsgreedy.run(N_STEPS)

# Random is a special case of EpsilonGreedy (eps=1.0)
# bbandit = BernoulliBandit(k=K, probas=probas, seed=SEED)
# random = EpsilonGreedy(bbandit, eps=1.0, seed=SEED)
# random.run(N_STEPS)

bbandit = BernoulliBandit(k=K, probas=probas, seed=SEED)
ucb1 = UCB1(bbandit, seed=SEED)
ucb1.run(N_STEPS)

bbandit = BernoulliBandit(k=K, probas=probas, seed=SEED)
bayesian = BayesianUCB(bbandit, seed=SEED)
bayesian.run(N_STEPS)

bbandit = BernoulliBandit(k=K, probas=probas, seed=SEED)
thompson = ThompsonSampling(bbandit, seed=SEED)
thompson.run(N_STEPS)

[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]

with load_theme("ambivalent"):
    fig, ax = plt.subplots(ncols=3, nrows=1, figsize=(12, 4), facecolor="none", layout="constrained")

    solvers_labels = {
        r"$\epsilon$-greedy": epsgreedy,
        "UCB1": ucb1,
        "Bayesian": bayesian,
        "Thompson": thompson,
    }

    # --- 1) cumulative regret ---
    for label, solver in solvers_labels.items():
        ax[0].plot(np.cumsum(solver.regrets), label=label, clip_on=False)

    ax[0].set_xlabel("Time steps")
    ax[0].set_ylabel("Cumulative regret")

    # --- shared x for action-ranked plots ---
    sorted_indices = np.argsort(bbandit.probas)
    x = np.arange(bbandit.k)  # 0..k-1 (rank after sorting)
    p_true = bbandit.probas[sorted_indices]

    # jitter for scatter points (so methods don't overlap)
    n_methods = len(solvers_labels)
    jit = 0.12  # horizontal separation between methods (in "x units")
    offsets = (np.arange(n_methods) - (n_methods - 1) / 2) * jit

    # --- 2) estimated probability per action (jittered scatter + true line) ---
    ax[1].plot(
        x,
        p_true,
        linestyle="-.",
        marker="o",
        markersize=3,
        label="True $p(a)$",
        zorder=1,
        clip_on=False,
    )

    for off, (label, solver) in zip(offsets, solvers_labels.items(), strict=True):
        ax[1].scatter(
            x + off,
            solver.estimated_probas[sorted_indices],
            s=35,
            label=label,
            alpha=0.8,
            zorder=2,
            clip_on=False,
        )

    ax[1].set_xlabel(r"Actions sorted by $\theta$")
    ax[1].set_ylabel("Estimated probability")
    ax[1].set_xticks(x)
    ax[1].set_xticklabels([str(i) for i in x])  # or sorted_indices.astype(str) for original IDs
    ax[1].set_ylim(0.0, 1.0)

    # --- 3) action selection rate (grouped bars, centered on ranks) ---
    width = 0.18
    bar_offsets = (np.arange(n_methods) - (n_methods - 1) / 2) * width

    for off, (label, solver) in zip(bar_offsets, solvers_labels.items(), strict=True):
        ax[2].bar(
            x + off,
            solver.counts[sorted_indices] / len(solver.regrets) * 100.0,
            width=width,
            label=label,
            alpha=0.85,
            clip_on=False,
        )

    ax[2].set_xlabel(r"Actions sorted by $\theta$")
    ax[2].set_ylabel("% of trials")
    ax[2].set_xticks(x)
    ax[2].set_xticklabels([str(i) for i in x])  # or sorted_indices.astype(str)
    ax[2].set_ylim(0, 100)

    # (Optional) make the two right panels less "grid heavy" if your theme uses strong grids
    for a in (ax[1], ax[2]):
        a.grid(axis="y", alpha=0.25)
        a.set_axisbelow(True)

    # --- single shared legend (deduplicated) ---
    handles, labels = [], []
    for axis in fig.axes:
        _handles, _labels = axis.get_legend_handles_labels()
        handles.extend(_handles)
        labels.extend(_labels)

    by_label = dict(zip(labels, handles, strict=True))
    fig.legend(
        by_label.values(),
        by_label.keys(),
        loc=8,
        ncols=len(by_label),
        bbox_to_anchor=(0.5, -0.1),
        fancybox=True,
        frameon=True,
    )

plt.show()

Figure 1: The results of the experiment on solving a Bernoulli bandit with K=10 slot machines with reward probabilities {0.0, 0.1, …, 0.9}. Each solver runs for 10,000 steps.

The figure above shows a side-by-side comparison of four bandit strategies: $\epsilon$-greedy, UCB1, Bayesian UCB, and Thompson Sampling. All algorithms are evaluated on the same 10-armed Bernoulli bandit. Each subplot highlights a different aspect of algorithmic behavior: regret, reward estimation, and exploration patterns. Together, they illustrate how the theoretical ideas introduced earlier play out in practice.

1. Cumulative Regret Over Time (left)

The left subplot shows how much regret each algorithm accumulates over 10,000 time steps. Lower curves indicate better performance.

Thompson Sampling performs best. Its regret curve rises slowly at first and then flattens, showing that it quickly identifies the optimal arm and almost never leaves it afterward.
Bayesian UCB is slightly worse but still competitive. Using posterior uncertainty to size the exploration bonus leads to steady improvement, with only the optimism constant $c$ left to choose.
$\epsilon$-greedy suffers more early regret and converges more slowly, since it explores randomly rather than strategically.
UCB1 explores aggressively and therefore incurs noticeably higher regret. This is expected in settings where several arms have relatively high reward probabilities, making early optimistic exploration particularly costly.

The qualitative ordering, from lowest to highest cumulative regret, is Thompson Sampling $<$ Bayesian UCB $<$ $\epsilon$-greedy $<$ UCB1, which matches what is commonly reported for this type of environment.

2. Estimated Reward Probabilities (middle)

The middle subplot shows how accurately each method estimates the reward probability of each arm after training. Arms are sorted by their true values, and the dashed line represents perfect estimation.

Thompson Sampling and Bayesian UCB are concentrated near the diagonal. Their estimates are reasonably accurate even for suboptimal arms.
$\epsilon$-greedy is more scattered. Because it explores randomly and infrequently revisits some arms, several estimates remain biased or underdeveloped.
UCB1 tends to overestimate some suboptimal arms early on and then underexplore them later. UCB’s deterministic optimism often leads to a distinctive estimation bias: once the bonus term shrinks, there is little incentive to revisit an arm, even if its estimate is wrong.

This subplot highlights a key difference: good decision-making does not always require perfectly accurate models, but algorithms that maintain richer uncertainty estimates (Bayesian UCB and Thompson Sampling) tend to form more reliable estimates.

3. Fraction of Pulls per Arm (right)

The right subplot shows how often each algorithm selects each action. Here, the behavioral differences are most visible.

Thompson Sampling plays the best arm almost exclusively, with its bar nearly reaching 100%.
Bayesian UCB focuses heavily on the best arm but still allocates a small percentage of trials to others due to posterior uncertainty.
$\epsilon$-greedy spreads its attention more broadly. Because exploration is random, even clearly suboptimal arms continue to receive occasional pulls.
UCB1 revisits several arms during the optimistic exploration phase. Once the bonus term shrinks, it commits strongly to the best arm, but the early exploration leaves a visible footprint.

This subplot emphasizes how each strategy allocates exploration effort:

$\epsilon$-greedy: broad, unfocused exploration
UCB1: early over-exploration, later commitment
Bayesian UCB: exploration guided by posterior uncertainty
Thompson: exploration proportional to the probability of being optimal

Putting It All Together

These three views (regret, estimation accuracy, and action frequencies) provide a comprehensive picture of each algorithm’s strengths and weaknesses:

Thompson Sampling is consistent and strong: low regret, accurate estimation, and efficient exploration.
Bayesian UCB offers a principled middle ground and performs well when prior structure is appropriate.
$\epsilon$-greedy is simple but wasteful: random exploration leads to both under- and over-exploration.
UCB1 works as intended, but deterministic optimism causes large early regret when many arms have similar payoffs.

Results shown correspond to a single random seed; while relative performance may vary across runs, the qualitative behavior and average ordering are consistent with theoretical expectations.
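Since a single seed can be misleading, one way to check such conclusions is to repeat the run over several seeds and report the mean and spread of the final regret. The sketch below does this for a stripped-down Thompson Sampling loop. It is self-contained and does not use the `BernoulliBandit` or solver classes from this post; the function name, the shorter horizon, and the number of seeds are arbitrary choices, and regret is assumed to be the gap between the best arm's probability and the chosen arm's probability.

```python
import numpy as np


def thompson_final_regret(probas: np.ndarray, n_steps: int, seed: int) -> float:
    """Cumulative expected regret of Beta-Bernoulli Thompson Sampling (sketch)."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(len(probas))  # Beta(1, 1) prior for every arm
    beta = np.ones(len(probas))
    best = probas.max()
    regret = 0.0
    for _ in range(n_steps):
        arm = int(np.argmax(rng.beta(alpha, beta)))
        reward = float(rng.random() < probas[arm])
        alpha[arm] += reward
        beta[arm] += 1.0 - reward
        regret += best - probas[arm]  # expected regret of this pull
    return regret


probas = np.linspace(0, 1, 10, endpoint=False)  # {0.0, 0.1, ..., 0.9}
n_steps = 2_000
regrets = np.array([thompson_final_regret(probas, n_steps, seed) for seed in range(5)])

print(f"final regret: mean = {regrets.mean():.1f}, std = {regrets.std():.1f}")
```

The same loop can be wrapped around any of the solvers to compare them on averaged curves rather than on a single trajectory.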

Overall, the benchmark illustrates a central message of the exploration-exploitation dilemma: better uncertainty modeling leads to more efficient learning.

Conclusions

The benchmark highlights the core differences between bandit algorithms in practice:

Thompson Sampling achieves the lowest regret and concentrates almost all pulls on the optimal arm, reflecting efficient, uncertainty-aware exploration.
Bayesian UCB performs similarly well, balancing optimism with Bayesian posterior uncertainty.
$\epsilon$-greedy is simple but wasteful: random exploration leads to slower convergence and less accurate value estimates.
UCB1 explores aggressively early on, which increases regret in environments with many high-reward arms.

Overall, algorithms that model uncertainty explicitly, such as Thompson Sampling and Bayesian UCB, deliver more focused exploration and stronger performance.

Appendix

Method	Exploration mechanism	Deterministic?	Uses Posterior?
$\epsilon$-greedy	Random with prob. $\epsilon$	No	No
UCB1	Optimism via bound	Yes	No
Bayesian UCB	Posterior quantile	Yes	Yes
Thompson Sampling	Posterior sampling	No	Yes

Reuse

CC BY-NC-SA 4.0

Building a No‑Fluff Report Template in LaTeX

Gregor Cerar — Tue, 06 May 2025 22:00:00 GMT

Why a New Template?

Dense (less fluff) — Every square centimeter should serve the reader. Tighter vertical spacing and compact headings keep the narrative flowing.
Optional titles — Some documents benefit from a title; others (like a brief update) do not. The template should let me toggle them off with a single flag.
Flexible — Today, I might need a one‑pager. Tomorrow, a 20‑page appendix. Layout decisions (margins, font, color) should be parameterized — not hard‑wired.

Existing classes like article or even IEEEtran come close but still force unnecessary baggage on the author (abstract blocks, keywords, etc.). Then I stumbled upon the elegant ministate class. So I adapted it.

Meet ministate v3.0

ministate.cls

\ProvidesClass{ministate}[2023/03/29 v3.0 Minimalist statement class]
\LoadClass[11pt,a4paper]{article}

\usepackage[utf8]{inputenc} % from 2018, UTF-8 is default in LaTeX
\usepackage[T1]{fontenc}
\usepackage{lmodern}

\usepackage{microtype}

\usepackage[margin=0.8in]{geometry}
\usepackage{parskip}
\usepackage{fancyhdr}

\setlength{\headheight}{15.2pt}
\pagestyle{fancy}
\fancyhf{} % Clear all header and footer fields

%--------------------------------------------------%
%    Title, HeaderTitle, Author, HeaderAuthor,     %
%                 Custom Date                      %
%--------------------------------------------------%

\let\oldtitle\title
\let\oldauthor\author
\let\olddate\date

\def\@headertitle{}
\def\@headerauthor{}
\def\@headerdate{}

% Redefine the \title and \author commands
\renewcommand{\title}[1]{%
    \oldtitle{#1}%
    \ifx\@headertitle\@empty%
        \relax\def\@headertitle{#1}%
    \fi%
}
\renewcommand{\author}[1]{%
    \oldauthor{#1}%
    \ifx\@headerauthor\@empty%
        \relax\def\@headerauthor{#1}%
    \fi%
}

\renewcommand{\date}[1]{%
    \olddate{#1}%
    \ifx\@headerdate\@empty%
        \relax\def\@headerdate{#1}%
    \fi%
}

% Commands for explicitly setting the header title and header author
\newcommand{\headertitle}[1]{\def\@headertitle{#1}}
\newcommand{\headerauthor}[1]{\def\@headerauthor{#1}}

\fancypagestyle{ministate}{%
  \fancyhf{}% clear everything
  \fancyhead[L]{\textbf{\@headertitle}\ifx\@headerdate\@empty\else\ (\@headerdate)\fi}%
  \fancyhead[R]{\textbf{\@headerauthor}}%
  \fancyfoot[C]{\thepage}%
}

\pagestyle{ministate}

% Apply header settings including the custom date
%\fancyhead[L]{\textbf{\@headertitle}\ifx\@headerdate\@empty\else\ (\@headerdate)\fi} % Title (Custom Date)
%\fancyhead[R]{\textbf{\@headerauthor}} % Author

%\fancyfoot{} % Override existing foot numbering
%\fancyfoot[C]{\thepage} % Page number at center of footer


%--------------------------------------------------%
%                   Document Body                  %
%--------------------------------------------------%

% Usage:
% \title{Your Title Here}
% \author{Author Name}
% \headertitle{Your Header Title Here} - For custom header title
% \headerauthor{Your Header Author Here} - For custom header author
% \date{Custom Date or Empty String} - To change or remove the date

% Comment this block if we don't want header on the first page
\usepackage{etoolbox}   % load before you patch anything
\makeatletter
\patchcmd{\maketitle}%          the command to patch
  {\thispagestyle{plain}}%      code to replace
  {\thispagestyle{ministate}}%  replacement
  {}{}                          % ← success / failure actions (empty)
\makeatother


\makeatletter
\def\@maketitle{%
  \newpage
  \begin{center}%
    \let\footnote\thanks
    {\LARGE \@title\par}%
    \vskip 0.2em%
    {\large\begin{tabular}[t]{c}\@author\end{tabular}\par}%
    \vskip 0.2em%
    {\large \@date}\vskip 0.2em% pass an empty \date{} to hide the date
  \end{center}%
  \par
}%
\makeatother

An example document:

example.tex

\documentclass[11pt,a4paper,nonatbib]{./ministate}

% (optional) bibliography
%\usepackage[backend=biber,style=ieee,autocite=plain,sorting=none]{biblatex}
%\addbibresource{biblio.bib}
\usepackage[sfdefault]{atkinson}
\usepackage{fontawesome}

\usepackage{hyperref}
\usepackage{url}

\usepackage[english]{babel}
\usepackage[autostyle,english=british]{csquotes}

% (optional) prevent breaking words
\usepackage[none]{hyphenat}
\interdisplaylinepenalty=10000

% lorem ipsum generator
\usepackage{kantlipsum}


% ministate settings
\title{The Summary of Lorem Ipsum}
\headertitle{The Shorter Title}  % (optional) will use \title if not used

\author{Johnny English, PhD}
\headerauthor{Johnny E., PhD} % (optional) will use \author if not used

\date{May 7, 2025}

\begin{document}

\maketitle

\kant

%\clearpage
%\printbibliography[title={Personal bibliography}]

\end{document}

Outcome

Below is the rendered PDF output from the code above:

Reuse

CC BY-NC-SA 4.0

Visualizing Feature Maps from VGG11 and ResNet50 in PyTorch

Gregor Cerar — Mon, 05 May 2025 22:00:00 GMT

Prerequisites

Before we start, we need to install the following libraries: NumPy, Matplotlib, scikit-learn, PyTorch, and Torchvision. The code cells also use IPython to display headings in the notebook.

import math
from collections.abc import Callable
from pathlib import Path
from typing import Final, Literal

import numpy as np
import torch
from IPython.display import Markdown
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from torch import Tensor, nn
from torchvision import models
from torchvision.io import decode_image
from torchvision.transforms import v2 as T

In this article, we are going to use pre-trained neural networks, more specifically, weights trained on the ImageNet-1K dataset.

But before that, we will prepare the input images. We resize each image so that its shorter side is 256 pixels, center-crop it to 224x224, and normalize it with the per-channel ImageNet mean and standard deviation. This preparation step makes the pictures statistically similar to the training dataset. See the link for more details on why this step is necessary.

# ImageNet normalization weights per channel
IMAGENET1K_MEAN = [0.485, 0.456, 0.406]
IMAGENET1K_STD = [0.229, 0.224, 0.225]

transform = T.Compose(
    [
        T.Resize(256),
        T.CenterCrop(224),
        T.ToImage(),
        T.ToDtype(torch.float32, scale=True),
        T.Normalize(IMAGENET1K_MEAN, IMAGENET1K_STD),
    ]
)


def load_image(path: str | Path) -> Tensor:
    # Transform images into tensors
    img: Tensor = transform(decode_image(str(path)))

    # Add dimension to imitate batch size equal to 1: (C,H,W) -> (B,C,H,W)
    img = img.unsqueeze(0)
    return img

def inverse_normalize(
    x_norm: Tensor,
    mean: list[float] = IMAGENET1K_MEAN,
    std: list[float] = IMAGENET1K_STD,
) -> Tensor:
    # Ensure mean and std have the correct shape
    _mean = torch.as_tensor(mean).to(x_norm.device).view(1, -1, 1, 1)
    _std = torch.as_tensor(std).to(x_norm.device).view(1, -1, 1, 1)
    # Inverse normalization: x = x_normalized * std + mean
    return x_norm.mul(_std).add(_mean)


reverse_transform = T.Compose(
    [
        T.Lambda(inverse_normalize),
        T.Lambda(lambda x: torch.clamp(x, min=0.0, max=1.0)),
    ]
)

sample = load_image("bridge.jpg")
orig_sample = reverse_transform(sample)

fig, ax = plt.subplots(frameon=False)
fig.subplots_adjust()
ax.imshow(orig_sample.squeeze(0).permute(1, 2, 0))
ax.axis("off")
plt.show()

Original image, resized

def get_activation(name: str, activations: dict[str, Tensor]) -> Callable:
    def hook(model: nn.Module, tensor: Tensor, output: Tensor) -> None:
        # map layer's `name` to layer's output value
        activations[name] = output.detach()

    return hook


def set_hooks(model: nn.Module, layer_ids: list[str], out: dict[str, Tensor]) -> None:
    layer_ids = [str(i) for i in layer_ids]
    for name, module in model.named_modules():
        if name in layer_ids:
            module.register_forward_hook(get_activation(name, out))

def visualize_feature_maps(
    feature_map: Tensor | np.ndarray,
    max_maps: int | None = None,
    max_cols: int = 8,
    figsize_per_plot: float = 1.0,
    norm: Literal["linear", "log", "symlog", "logit", None] = None,
    cmap: str = "viridis",
):
    if isinstance(feature_map, Tensor):
        feature_map = feature_map.cpu().numpy()

    if feature_map.ndim == 4:
        feature_map = feature_map.squeeze(0)  # remove batch dimension if present
    assert feature_map.ndim == 3, "Expected tensor shape (C, H, W)"

    C, H, W = feature_map.shape

    if max_maps:
        C = min(C, max_maps)

    n_cols = min(C, max_cols)
    n_rows = math.ceil(C / n_cols)

    figsize = (figsize_per_plot * n_cols, figsize_per_plot * n_rows)

    fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=figsize, frameon=False, squeeze=False)
    fig.subplots_adjust(wspace=0.03, hspace=0.03)

    for ax in axes.flat:
        ax.axis("off")

    for i in range(C):
        t = feature_map[i]
        axes.flat[i].imshow(t, cmap=cmap, norm=norm, aspect="equal", interpolation="none")

    return fig, axes

def minmax_scale_per_channel(arr: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Per-channel MinMax normalization. Expects (C, H, W)."""
    assert arr.ndim == 3, f"{arr.ndim=}"

    c_min = arr.min(axis=(1, 2), keepdims=True)
    c_max = arr.max(axis=(1, 2), keepdims=True)

    scaled = (arr - c_min) / (c_max - c_min + eps)  # avoid division by zero
    return scaled


def pca_rgb(
    feature_map: np.ndarray | Tensor,
    n_components: Literal[1, 3] = 3,
    normalize: bool = True,
    random_state: int | None = None,
) -> np.ndarray:
    if isinstance(feature_map, torch.Tensor):
        feature_map = feature_map.cpu().numpy()

    if feature_map.ndim == 4:
        feature_map = feature_map.squeeze(0)  # remove batch dimension if present
    assert feature_map.ndim == 3, "Expected array shape (C, H, W)"

    C, H, W = feature_map.shape
    pca = PCA(n_components=n_components, random_state=random_state)
    flat = feature_map.reshape(C, -1).T
    rgb = pca.fit_transform(flat).T.reshape(n_components, H, W)

    if normalize:
        rgb = minmax_scale_per_channel(rgb)

    return rgb


def visualize_feature_maps_pca(
    feature_maps: dict[str, Tensor],
    n_components: Literal[1, 3] = 3,
    max_cols: int = 4,
    figsize_per_plot: float = 2.0,
    norm: Literal["linear", "log", "symlog", "logit", None] = None,
    subtitles: bool = True,
    cmap: str = "viridis",
):
    c = len(feature_maps)
    n_cols = min(c, max_cols)
    n_rows = math.ceil(c / n_cols)
    fig_size = (figsize_per_plot * n_cols, figsize_per_plot * n_rows)

    fig, axes = plt.subplots(n_rows, n_cols, figsize=fig_size, squeeze=False, frameon=False)
    fig.subplots_adjust(wspace=0.03, hspace=0.20, top=0.85)

    for ax in axes.flat:
        ax.axis("off")

    for ax, (layer, feature_map) in zip(axes.flat, feature_maps.items(), strict=False):
        rgb_features = pca_rgb(feature_map, n_components=n_components)
        rgb_features = rgb_features.transpose(1, 2, 0)
        rgb_features = rgb_features.squeeze()

        ax.imshow(rgb_features, cmap=cmap, norm=norm, aspect="equal", interpolation="none")
        if subtitles:
            ax.set_title(layer, color="0.5")

    return fig, axes

VGG

VGG networks are deep convolutional neural networks introduced by (Simonyan and Zisserman 2014). They stack many small 3x3 convolution filters in sequence. This simple “deeper‑is‑better” design once achieved top ImageNet performance while showing that depth and uniform layer structure can yield strong feature hierarchies, making VGG a popular baseline for vision tasks and transfer learning. Nowadays, the architecture is considered outdated.

model = models.vgg11(weights=models.VGG11_Weights.IMAGENET1K_V1).features

# Let's inspect the VGG's feature extractor layers
model

Sequential(
  (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU(inplace=True)
  (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (3): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): ReLU(inplace=True)
  (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (6): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (7): ReLU(inplace=True)
  (8): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (9): ReLU(inplace=True)
  (10): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (11): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (12): ReLU(inplace=True)
  (13): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (14): ReLU(inplace=True)
  (15): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (16): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (17): ReLU(inplace=True)
  (18): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (19): ReLU(inplace=True)
  (20): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)

# cherry-pick layers of which outputs we want to see
selected_layers = ["0", "3", "6", "8", "11", "13", "16", "18"]

# add forward hooks to the model
vgg_activations = {}
set_hooks(model, selected_layers, vgg_activations)

# make forward pass through NN
with torch.no_grad():
    model(sample)

Below we visualize the feature maps generated by a few hand‑picked layers. A feature map (also called an activation map) is simply the tensor that a layer outputs (for example, output = conv(input)). During training, each convolutional layer learns a set of spatial kernels that act as filters (see kernels in image processing), allowing the network to draw ever‑richer patterns from the feature maps produced by the preceding layers.
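To illustrate what "a kernel acting as a filter" means, here is a minimal NumPy sketch that slides a hand-written 3x3 Sobel kernel over a tiny image. The kernel is fixed rather than learned, and the helper `conv2d_valid` is a hypothetical stand-in for what a convolutional layer does per channel (like PyTorch's `Conv2d`, it computes a cross-correlation, without flipping the kernel).

```python
import numpy as np


def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` without padding and sum the products."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y : y + kh, x : x + kw] * kernel)
    return out


# A 5x6 image that is dark on the left and bright on the right (vertical edge).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Sobel kernel that responds to horizontal intensity changes.
sobel_x = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(image, sobel_x)
print(feature_map)
```

The resulting feature map is non-zero only where the window straddles the edge. A trained network learns many such kernels per layer, which is why each layer produces a whole stack of feature maps.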

for layer, filters in vgg_activations.items():
    display(Markdown(f"### Layer #{layer}"))
    visualize_feature_maps(filters, max_maps=8 * 8, norm="linear")
    plt.show()

Layer #0

Layer #3

Layer #6

Layer #8

Layer #11

Layer #13

Layer #16

Layer #18

Above, we noted that the number of visualizations grows with the number of filters. A large number of filters can be overwhelming when a layer produces dozens of maps. To condense this information, we can project the feature maps with principal‑component analysis (PCA). We treat each spatial position across all maps as a feature vector, run PCA, and then reconstruct the dominant components. The result is a single “average” activation image that captures the most salient variance across the entire stack of feature maps. It can be rendered in either 1‑channel (grayscale) or 3‑channel (RGB) form.
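The projection step can be sketched with plain NumPy, using an SVD in place of scikit-learn's `PCA`. The random array below is only a stand-in for real activations, and its shape (64 maps of 14x14) is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one layer's activations: 64 feature maps of size 14x14.
feature_map = rng.normal(size=(64, 14, 14))
C, H, W = feature_map.shape

# Each spatial position becomes one sample with C features: (H*W, C).
flat = feature_map.reshape(C, -1).T
centered = flat - flat.mean(axis=0, keepdims=True)

# PCA via SVD: rows of vt are principal directions, sorted by explained variance.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:3].T     # keep the 3 dominant components: (H*W, 3)
rgb = projected.T.reshape(3, H, W)  # back to image layout (3, H, W)

# Rescale every channel to [0, 1] so it can be displayed as an RGB image.
rgb_min = rgb.min(axis=(1, 2), keepdims=True)
rgb_max = rgb.max(axis=(1, 2), keepdims=True)
rgb = (rgb - rgb_min) / (rgb_max - rgb_min + 1e-5)

print(rgb.shape)
```

Keeping one component instead of three gives the grayscale variant; the three-component version maps the first, second, and third principal components to the red, green, and blue channels.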

visualize_feature_maps_pca(vgg_activations, max_cols=4)
plt.show()

Principal‑component projections of the feature‑map stacks for the corresponding layers of VGG‑11.

ResNet

ResNets (Residual Networks), introduced by He et al. in 2015 and later generalized by (Targ et al. 2016), add “skip” or residual connections that let inputs bypass one or more layers. These identity shortcuts make very deep CNNs (e.g., ResNet‑50/101/152) easier to train by mitigating vanishing gradients, enabling state‑of‑the‑art accuracy with hundreds of layers. [wiki]
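The idea of an identity shortcut fits in a few lines. The sketch below uses small dense layers in NumPy instead of the convolution and batch-norm layers of a real ResNet bottleneck, so it only illustrates the $y = F(x) + x$ structure; the function names and weights are made up for this example.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """y = relu(F(x) + x), with F a tiny two-layer transform (illustrative only)."""
    fx = relu(x @ w1) @ w2  # the residual branch F(x)
    return relu(fx + x)     # the identity shortcut adds the input back


x = np.array([[1.0, 2.0, 3.0]])

# With all-zero weights the branch contributes nothing and x passes through unchanged.
zeros = np.zeros((3, 3))
y_identity = residual_block(x, zeros, zeros)

# With non-zero weights the block learns a correction on top of the identity.
rng = np.random.default_rng(0)
y_learned = residual_block(x, rng.normal(size=(3, 3)) * 0.1, rng.normal(size=(3, 3)) * 0.1)

print(y_identity, y_learned)
```

Because the block only has to learn a correction to its input, a layer that has nothing useful to add can fall back to the identity, which is part of why stacking many such blocks remains trainable.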

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# inspect layers within ResNet
model

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
  )
  (layer2): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (3): Bottleneck(
      (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
  )
  (layer3): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (3): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (4): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (5): Bottleneck(
      (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
  )
  (layer4): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
        (1): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (1): Bottleneck(
      (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
    (2): Bottleneck(
      (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
    )
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (fc): Linear(in_features=2048, out_features=1000, bias=True)
)

selected_layers = ["conv1", "layer1", "layer2", "layer3", "layer4"]
resnet_feature_maps: dict[str, Tensor] = {}
set_hooks(model, selected_layers, resnet_feature_maps)

with torch.no_grad():
    model(sample)

for layer, filters in resnet_feature_maps.items():
    display(Markdown(f'### Layer "{layer}"'))
    visualize_feature_maps(filters, max_maps=8 * 8, norm="linear")
    plt.show()

Layer “conv1”

Layer “layer1”

Layer “layer2”

Layer “layer3”

Layer “layer4”

visualize_feature_maps_pca(resnet_feature_maps, max_cols=3)
plt.show()

Principal‑component projections of the feature‑map stacks for the corresponding layers of ResNet-50.

Conclusions

This article introduced a lightweight technique for visualizing the feature maps (layer-wise outputs) of pre-selected neural network layers. These visualizations offer an intuitive window into what a convolutional network attends to at each processing stage.

For deeper, production‑grade interpretability, explore the rich ecosystem of explainability libraries and frameworks, such as Captum or SHAP, and take a broader look at the rapidly growing fields of eXplainable AI (XAI) and Responsible AI.

Captum
SHAP

References

Simonyan, Karen, and Andrew Zisserman. 2014. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv Preprint arXiv:1409.1556.

Targ, Sasha, Diogo Almeida, and Kevin Lyman. 2016. “Resnet in Resnet: Generalizing Residual Architectures.” arXiv Preprint arXiv:1603.08029.

Reuse

CC BY-NC-SA 4.0

Research Compute Infrastructure

Gregor Cerar — Sun, 19 Nov 2023 23:00:00 GMT

Introduction

Recent technological advances have transformed education, elevating the quality of teaching and learning. Jupyter Notebooks have emerged as a leading tool for interactive computing, programming, and data analysis (Perkel 2018; Mendez et al. 2019; Granger et al. 2021). However, hardware limitations became a significant hurdle when handling larger research projects. While public cloud services are an option, they come with notable drawbacks. In response, we developed a private cloud solution for our lab using Kubernetes. This solution addresses cost and security concerns while ensuring adaptability. Through this technology, we have enabled efficient app management, scalability, and resource flexibility.

Jupyter Notebooks

A Jupyter Notebook is an open document format based on JSON¹. Notebooks are organized into a sequence of cells, with each cell containing code, descriptive text, equations, and rich outputs (e.g., text displays, tables, audio, images, and animations). Tools like JupyterLab provide a platform for interactive code execution, data analysis, and documentation, all within a single interface, culminating in a Jupyter Notebook. These notebooks support various programming languages (e.g., Python, R, Scala, C++) and allow users to write and execute code cells iteratively (using REPL² or WETL³ approaches), offering immediate visibility of intermediate results. This facilitates the creation of narrative-driven data analyses, educational materials, and interactive presentations. Due to their versatility and interactivity, Jupyter Notebooks are a robust tool for teaching, learning, data science, and computing research.

Because of these remarkable features, we decided to incorporate Jupyter Notebooks into our research lab’s educational and research processes. We encouraged students and researchers to use Jupyter Notebooks to document their work and share it more easily with others.

Scalability

However, large-scale projects that involve heavy data processing are a significant challenge to run in Jupyter Notebooks on personal computers. We frequently run into hardware limitations, such as storage space, RAM, processing power, and access to compute accelerators, which can hinder or even halt our progress. These projects are typically in the early stages of research, analysis, or prototyping, so intensive optimizations are impractical because they slow down experimental development. Two potential solutions emerge: running Jupyter Notebooks on grid or HPC infrastructure, or on cloud services.

HPC infrastructure, like SLING in Slovenia or EuroHPC at the European level, offers immense computational power. However, because HPC systems are significant investments, workload managers like SLURM are used to optimize their utilization. Computation tasks must be pre-packaged with metadata, code, and input data, and then join a waiting queue. This approach does not align well with data-driven research, which relies on interactive programming and quick feedback, and it prevents Jupyter Notebooks from being used to their full potential. Hence, cloud services are a more common choice for these notebooks.

Public cloud platforms like Google Colab and Kaggle have popularized Jupyter Notebook usage. Users can access the service anytime without queues, edit notebooks, and utilize cloud computing resources, all via a browser. Both services are freely accessible in a limited version. However, due to high user demand, these platforms sometimes limit computational resources, affecting service quality. Alternatives include custom paid services in the public cloud (e.g., AWS, Azure, GCP, Alibaba Cloud) that tailor infrastructure to customer needs. However, public cloud services have drawbacks, including high rental costs, unpredictable market-affected expenses, and security concerns when handling sensitive data.

Private clouds are an alternative to the public cloud that addresses cost and security challenges. They are crucial for research labs and companies that deal with sensitive data or require high adaptability, and they give organizations more transparency and cost control based on their needs and capabilities. Despite the initial requirements for technical knowledge and infrastructure investment, private clouds offer enhanced security, control, and flexibility, leading to more predictable costs in the long run.

Several technologies are available to set up a private cloud, including commercial options (e.g., VMware vSphere, Red Hat OpenShift, IBM Cloud Private) and open-source solutions (e.g., The Littlest JupyterHub, OpenStack, Eucalyptus, Kubernetes, or using Docker Compose [reference design, gcerar/jupyterhub-docker]). Among the open-source options, Kubernetes is the most popular solution.

Kubernetes (abbreviated as K8s) is an open-source platform designed for the automation, management, and deployment of applications within containers. Its advanced orchestration features allow for efficient application management, automatic scaling, monitoring of their performance, and high availability. It can simplify the development and maintenance of complex cloud-based applications.

In contrast to Docker and Docker Compose, which primarily focus on building, storing, and running individual containers, Kubernetes offers a much more comprehensive platform for managing containers across large environments that span multiple computing nodes. Docker makes it easy to create and run individual containers, and Docker Compose allows defining multiple containers as one application unit. Kubernetes manages entire clusters of these application units throughout their life cycle, which includes automatic deployment, dynamic scaling based on load, recovery from failures, and more advanced service and network management.

In our research lab, due to the growing computational demands prevalent in data science and the desire to retain the recognizable workflow present in Jupyter Notebooks, we have developed our private cloud solution based on Kubernetes technology.

The following sections will present a private cloud setup featuring Jupyter Notebooks built on top of open-source solutions. The user experience closely resembles that of existing paid cloud services. The private cloud must meet the following requirements:

System Scalability: The cloud should allow for easily adding computing nodes to the cluster without disrupting the operational system, supporting larger research projects or teaching groups.
Efficient Resource Management: The system must enable precise allocation of resources to users. In this context, an administrator can define a balance between a lax and strict resource allocation policy.
Enhanced Collaboration Experience: The system should allow for straightforward sharing of Jupyter Notebooks among users, promoting collaboration on joint projects and idea exchange between researchers and students.
No Waiting Queues: The system should eliminate waiting queues, offering users immediate access to computational resources to the best of their capacity.

Architecture

We decided to base our private cloud on the Kubernetes platform to meet system scalability and resource management requirements, aiming to enhance the functionality, accessibility, and sharing of Jupyter Notebooks (Bussonnier 2018) within the Kubernetes private cloud. In this section, we will delve deeper into the system’s architecture that integrates services and elaborate on the design decisions. Subsequently, we describe the individual services within our infrastructure.

Table 1: Services and Selected Solutions.

Service	Candidate solutions	Selected
Turnkey solution?	Custom, NVIDIA DeepOps	Custom
Basic Infrastructure
Operating System	Ubuntu, RHEL, NixOS, Talos	Ubuntu
Data Storage	ZFS, GlusterFS, Lustre, Ceph, iSCSI	ZFS
System Management	Ansible, Terraform, Puppet, Chef	Ansible
Internal Services
K8s Distribution	vanilla, MicroK8s, OpenShift, Rancher	vanilla
K8s Installation	Helm, Kustomize	Helm
Network Manager	Calico, Canal, Flannel, Weave	Calico
Data Manager	csi-driver-nfs, Rook, OpenEBS	csi-driver-nfs
Traffic Balancing	MetalLB, cloud provider specific	MetalLB
Traffic Manager	NGINX, Traefik	NGINX
GPU Manager	NVIDIA GPU-Operator	NVIDIA GPU-Operator
Services for Users
JupyterHub Manager	Z2JH (Zero-to-JupyterHub)	Z2JH (Zero-to-JupyterHub)
Metrics and Monitoring	kube-prometheus-stack, InfluxDB	kube-prometheus-stack

The first column of Table 1 lists all the services required for system operation, the second column lists the open-source solutions that can provide each service, and the third column names the solution we selected and use in our private cloud. We made our choices as follows. We first surveyed the technologies and solutions used in related projects. We then narrowed the selection to open-source solutions that had been tested in private clouds on native (bare-metal) infrastructure. Project popularity also weighed significantly in our decisions, gauged by the number of repository stars, the number of forks, and the level of development activity on GitHub/GitLab. We did not rely on a single empirical metric but took multiple factors into account to ensure a comprehensive assessment of the solutions.

Figure 1: A three-tier logical infrastructure diagram. At the bottom is the foundational infrastructure, followed by internal Kubernetes services in the middle, and on top are the services exposed to users.

The diagram in Figure 1 provides a high-level representation of our private cloud and its infrastructure across three levels. The first level comprises heterogeneous computing nodes, forming the foundational infrastructure. Each node runs its own operating system and a portion of the Kubernetes platform. The second level encompasses internal Kubernetes services, which are essential for operation but never directly accessed by users. The third and final level includes the services available to end users.

Turnkey Solution?

When planning the private cloud, we initially explored turnkey solutions, including NVIDIA DeepOps. Despite its advantages, we decided to build a custom solution. DeepOps is an excellent turnkey solution with maintained source code on GitHub and available commercial support, but its initial setup requires adjusting configuration files, including Ansible scripts for automated (re)configuration of the installed Linux distribution. This complexity discouraged us from investing more time in it.

One of our biggest concerns was that such an intricate solution tries to be both versatile and “simple”. This inevitably hides functionality and, when issues arise, forces the administrator to jump between the documentation of multiple unrelated tools used internally. Despite the proclaimed simplicity, troubleshooting and upgrade problems require manual intervention, which demands a thorough understanding of Linux, DeepOps, its internal tooling, and their interactions. Therefore, we decided to start with a minimalist solution and expand it over time, which also helps us understand the infrastructure’s operation better.

Foundation Infrastructure

In this section, we discuss the foundation infrastructure of our private cloud solution. We’ll go through these building blocks, including the selection of container management tools and resource sharing, which are vital for the operation of the Kubernetes platform.

Operating System: We chose Ubuntu Server, which is based on the Debian Linux distribution. The advantage of widely used Debian-based distributions is the abundance of available knowledge resources and support, which makes problem-solving easier. As alternatives, we considered the declarative, reproducible NixOS, RHEL-based distributions, and Talos, a distribution specialized for Kubernetes. However, we preferred to stick with Ubuntu Server because of the Talos project’s novelty and the associated risks.

Container Management: For container management, we selected containerd, which is also used in the DeepOps solution and officially supported by NVIDIA. It is an open-source container runtime that implements the Container Runtime Interface (CRI), through which Kubernetes starts and manages containers on each node.

Data Storage: For data storage, we chose ZFS, which resides on one of the nodes. Although solutions like HDFS, Gluster, Lustre, or Ceph are far more common in the HPC world, they require dedicated infrastructure and tools to provide features that ZFS offers out of the box. These features include checkpoints, data deduplication, compression, a copy-on-write (COW) design that prevents data loss during writes, protection against silent bit rot, redundancy across disks to survive mechanical failures, and the use of fast SSD devices as a cache. ZFS also allows easy manual intervention in the event of incidents. However, at the time of writing, ZFS does not stretch across multiple nodes, so a malfunction of the data-storing node can bring down the cluster (single point of failure). There is an ongoing effort to implement ZFS’ distributed RAID (dRAID) [src].

To access ZFS storage from Kubernetes, we used the NFS server, which is part of the Linux kernel. We chose NFS because it is one of the few methods that allow multiple containers to bind to the same mounting point (see table).

System Management: For remote management and node configuration, we use Ansible maintained by Red Hat. We selected it due to its prevalence in other significant open-source projects and positive experiences from past projects.

Kubernetes

In Kubernetes, everything operates as a service. These services provide various functionalities that enhance Kubernetes capabilities, such as storage access, CPU and GPU allocation, traffic management, and connecting services within a mesh network.

To support specific functionalities, appropriate services (much like operating system drivers) must be installed. These specialized services, often called “operators” in Kubernetes terminology [src], are essential. They not only deploy and manage functionalities but also respond to issues. Operators enhance Kubernetes by interfacing with standardized and version-controlled APIs.

Put simply, operators are deployed as controller pods (containers) that watch for changes to custom Kubernetes resources and react accordingly. They function as an intermediary layer, implementing application-specific logic that extends Kubernetes beyond its built-in capabilities.
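The watch-and-react behaviour can be illustrated with a toy control loop in plain Python. This is only a sketch of the pattern: FakeCluster, reconcile, and the resource names are hypothetical, and no real Kubernetes API is involved.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for the Kubernetes API."""

    desired: dict[str, int] = field(default_factory=dict)  # resource -> replicas wanted
    running: dict[str, int] = field(default_factory=dict)  # resource -> replicas observed


def reconcile(cluster: FakeCluster) -> list[str]:
    """One pass of the control loop: compare desired and observed state, then act."""
    actions = []
    for name, wanted in cluster.desired.items():
        have = cluster.running.get(name, 0)
        if have != wanted:
            actions.append(f"scale {name}: {have} -> {wanted}")
            cluster.running[name] = wanted
    for name in list(cluster.running):
        if name not in cluster.desired:
            actions.append(f"delete {name}")
            del cluster.running[name]
    return actions


cluster = FakeCluster(desired={"notebook-alice": 1}, running={"notebook-bob": 1})
first_pass = reconcile(cluster)   # brings the observed state in line with the desired one
second_pass = reconcile(cluster)  # nothing left to do
print(first_pass, second_pass)
```

A real operator runs such a pass whenever the resources it watches change, which is how it can also repair the system after a failure rather than only deploy it once.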

Internal Services

In Kubernetes, internal services are not intended for end users but are crucial for the system’s operation. These services operate in the background, ensuring vital functionalities that enable the stable operation and management of the container environment. In this subsection, we will introduce key services within Kubernetes and explain their role in our infrastructure. We will describe each service’s primary functionality and examine alternatives we explored in making our decision.

Kubernetes Distribution: When choosing a Kubernetes distribution, we examined three options: Canonical MicroK8s, Red Hat OpenShift, and the basic “vanilla” Kubernetes distribution. “Vanilla” Kubernetes is the unaltered version directly available in Google’s repository, without pre-installed applications or plugins. We went for the vanilla version because it provides flexibility and freedom in choosing extensions.

MicroK8s is an excellent solution for quick experimentation and setting up the system on smaller devices with limited resources (e.g., Raspberry Pi). However, it has many pre-installed applications and uses Canonical’s Snap packaging system, which can complicate adjusting configuration files and accessing external services, such as the NFS server.

We ruled out OpenShift due to the complexity of managing security profiles that, for our use case, were excessive, requiring substantial effort to implement these profiles for each service. Therefore, we opted for the basic “vanilla” Kubernetes distribution, offering more flexible and straightforward customization tailored to our needs.

Kubernetes Package Deployment: The most straightforward way to describe the deployment of a service in Kubernetes is to write one or more YAML configuration files (also called manifests), which are then submitted to Kubernetes via the command line. However, some services are quite complex, which has led developers to create service packages that make services more general-purpose and customizable through parameters. The most widespread packaging system is Helm, which allows for more portable and adaptable service packages. Helm uses YAML files as templates (much like forms), which are filled out based on the provided parameters and then sent to Kubernetes.
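As a rough analogy for this fill-in-the-form step, the sketch below renders a manifest from a template with Python’s string.Template. Helm itself uses Go templates, not this mechanism. The template text, the DEFAULTS values, and the render helper are assumptions for illustration, and the manifest is abbreviated rather than deployable.

```python
from string import Template

# Abbreviated, hypothetical manifest template; $name and $replicas are the "form fields".
MANIFEST_TEMPLATE = Template(
    "apiVersion: apps/v1\n"
    "kind: Deployment\n"
    "metadata:\n"
    "  name: $name\n"
    "spec:\n"
    "  replicas: $replicas\n"
)

# Package defaults, which the person installing the package may override.
DEFAULTS = {"name": "my-service", "replicas": 1}


def render(overrides: dict[str, object]) -> str:
    """Merge user-provided parameters over the defaults and fill in the template."""
    values = {**DEFAULTS, **overrides}
    return MANIFEST_TEMPLATE.substitute(values)


manifest = render({"replicas": 3})
print(manifest)
```

Only the overridden parameter changes. Everything else comes from the package defaults, which is what makes one package reusable across different clusters.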

Network Operator: Kubernetes services must be interconnected to communicate with other services. We opted for the open-source Tigera Calico operator to manage interconnections. Given its prevalence and functionalities, we found it the most suitable solution.

Calico and Flannel are the most common solutions for network operators. Flannel is more minimalistic and operates as a network switch (layer 2) using technologies like Open vSwitch or VXLAN. In contrast, Calico routes traffic like a network router (layer 3). Especially in cases of multi-cluster (i.e., multiple physical locations) or hybrid cloud services, Calico emerges as a better choice.

Storage Operator: For effective storage management within the Kubernetes system, we used csi-driver-nfs. It allows us to use the already established NFS servers. With it, we ensure uninterrupted access to persistent storage for any service within our private cloud.

The csi-driver-nfs proved most suitable since we already had an NFS server on one of the nodes. It gives us straightforward, centralized storage management for all services within Kubernetes. Centralization brings numerous advantages but also challenges. Chief among the latter is the system’s vulnerability to an outage of the node that stores the data. Nonetheless, centralization makes troubleshooting and backups easier.

Bare-Metal Ingress Load-Balancer: To balance ingress (i.e., incoming) traffic among the entry points of our Kubernetes cluster, we decided to use MetalLB. After thorough research, we could not find any alternative. Most of the online documentation (e.g., tutorials, blogs) focuses on setting up infrastructure on public clouds such as AWS or Azure and uses solutions tailored to public cloud providers. Since our infrastructure runs on our own hardware (i.e., bare metal), we opted for MetalLB, which has proven reliable and effective in routing traffic among our Kubernetes cluster’s entry points.

Ingress Operator: While a network operator manages interconnection between services within Kubernetes, the ingress operator manages access to services from the outside world. For security reasons, direct access to the internal network is prohibited. While it is possible to enter the internal network through a proxy (i.e., kubectl proxy), that is meant only for debugging purposes. The ingress operator matches the domain name of an incoming request and routes the traffic to the correct container and port, as described in the service’s YAML manifest. Routing by domain name has several advantages. Regardless of the service’s internal IP address, the ingress operator will always direct traffic correctly. Under high traffic load, the ingress operator can also act as a load balancer, distributing traffic between multiple copies of a service.

Among the most common solutions for ingress traffic management are NGINX and Traefik Ingress operators. We chose NGINX, but the operators’ interface is standardized, so there are almost no differences between the solutions. Regardless of the selected solution, once a new service is deployed, the operator will follow the service’s manifest and automatically route traffic to the appropriate container.
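The routing behaviour described above can be mimicked with a small host-name lookup plus round-robin selection. This is a toy model, not how NGINX or Traefik are implemented: the HostRouter class, the domain names, and the backend addresses are all hypothetical.

```python
from itertools import cycle


class HostRouter:
    """Toy ingress: maps a requested host name to one of the service's backends."""

    def __init__(self, rules: dict[str, list[str]]) -> None:
        # One endless round-robin iterator per host name.
        self._backends = {host: cycle(backends) for host, backends in rules.items()}

    def route(self, host: str) -> str:
        if host not in self._backends:
            raise LookupError(f"no rule for host {host!r}")
        return next(self._backends[host])


router = HostRouter(
    {
        "jupyter.example.org": ["10.0.0.11:8000", "10.0.0.12:8000"],  # two copies
        "grafana.example.org": ["10.0.0.21:3000"],
    }
)
picked = [router.route("jupyter.example.org") for _ in range(3)]
print(picked)
```

Clients only ever see the domain name, so the backend addresses can change or multiply without any change on the client side.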

GPU Operator: For efficient management of access to compute accelerators, we decided to use the official NVIDIA GPU-Operator suite of services. This suite provides two distinct installation options for NVIDIA drivers. The first option leverages host drivers, while the second involves drivers packaged within containers. Initially, we opted for the first option, wanting to enable the use of accelerators outside the Kubernetes framework. However, due to issues with conflicting driver versions, we decided to utilize the drivers provided by the GPU-Operator.

User Services

In this section, we introduce the selected services available to end users of our private cloud, enabling efficient execution and management of their research and educational projects.

JupyterHub is one of the key services in our private cloud, providing users with easy access to computing resources, data, and Jupyter Notebooks for research and teaching purposes. To deploy JupyterHub, we use Z2JH (Zero to JupyterHub), developed by a team of researchers at the University of California, Berkeley, in collaboration with the Jupyter community. This solution facilitates quick setup and maintenance.

Every user is granted access to an isolated container instance via their username and password or an OAuth provider, such as GitHub, Google, or Auth0. An isolated instance offers a stripped-down Linux environment with limited internet access and without admin permissions. Kubernetes then ensures access to shared data resources, common directories, and the use of compute accelerators.

The JupyterHub user interface is similar to Google Colab or Kaggle services. Upon entering the isolated instance, JupyterLab is already running, and the user also has access to the Linux terminal. Additional tools and software packages can be installed using pip, conda, or mamba commands.

Grafana is a key service in our private cloud, facilitating a straightforward display of the current workload of the compute cluster and the availability of compute accelerators. This data visualization platform allows users to present information clearly and transparently, aiding them in making decisions regarding resource usage and optimizing their tasks. Utilizing Grafana ensures efficient and transparent resource monitoring, enhancing user experience. Data collection (Prometheus) and visualization (Grafana) are deployed by kube-prometheus-stack.

Deployment

In this section, we present how we deployed our computing infrastructure. We first summarize the hardware decisions and caveats, and then describe the user experience with some screenshots.

Hardware

Table 2: Hardware specifications of the computing node.

Hardware	Specifications
Chassis	Supermicro A+ Server 4124GS-TNR, 4U size, up to 8 PCI-E GPUs
CPU	2x AMD EPYC 75F3 (32C/64T, up to 4.0GHz, 256MB L3 cache)
Memory	1TB (16x64GB) REG ECC DDR4, 3200MHz
System	2x 2TB SSD NVMe, software RAID-1 (mirror)
Storage	6x 8TB SSD SATA, software RAID-Z1 (1 disk redundancy)
GPU	2x NVIDIA A100 80GB PCI-E

When we bought the hardware in early 2022, we chose third-generation AMD EPYC processors. Specifically, we went for the F-series, which has higher base and turbo frequencies (up to 4.0GHz) at the cost of fewer cores. We picked a CPU with the highest available TDP of 280W. We installed server-grade registered error-correcting memory at the highest frequency supported by the processor and populated all eight memory channels on both processors, for sixteen memory modules in total. Although we considered solutions from Intel, AMD EPYC processors had better price-to-performance ratios.

From the perspective of numerical performance, a significant concern was Intel-optimized libraries, such as Intel MKL, often found in numerical tools. The library has a “bug” that causes non-Intel processors to use slower SSE instead of more advanced AVX vectorization instructions [src]. OpenBLAS is a good alternative but requires some effort to install. See Anaconda no-mkl package.

We chose two NVMe drives configured in the mirror configuration (RAID-1) for the system drive. We selected six 8TB SSD SATA drives configured in ZFS RAID-Z1 for data storage, which has one drive redundancy. We also chose two A100 GPUs as accelerators.

NVIDIA A100 GPUs come in two form factors: PCI-E and SXM4. The proprietary SXM4 form factor has a higher TDP and high-bandwidth NVLink interconnections between every GPU through NVSwitch hardware. The downside of SXM4 is that it only supports Ampere-generation GPUs and requires a special motherboard. The PCI-E variant has a lower TDP, and NVLink can bridge only two GPUs. However, we wanted to avoid vendor lock-in, which would limit us to one brand and generation, so we went with the PCI-E variant.

We considered the most likely workflow scenarios. We expected most communication to be CPU-to-GPU, with GPUs sliced into several instances via MIG (Multi-Instance GPU). When MIG mode is enabled, each GPU is partitioned into isolated instances that share the physical GPU resources but do not have access to NVLink interconnects. The slicing configuration can be changed at runtime by recreating the MIG instances.

Figure 2: The computing node on my desk, undergoing final checks before being installed in the server rack.

User Experience

After deploying the hardware and software stack, we conducted a month-long live test to stabilize the configuration. During this period, users were informed that we might reboot the system or make significant changes without responsibility for any potential data loss, though we aimed to minimize such occurrences.

We made two key decisions about resource allocation. Users can utilize all available memory and CPU cores. When CPU demand is high, Kubernetes and the operating system manage the scheduling of tasks. In cases of high memory usage, the job consuming the most memory is terminated to protect other running tasks.

Feedback from students and researchers was overwhelmingly positive, highlighting the high speed, numerous cores, ample memory, and dedicated GPU access without interference.

During the testing phase, “testers” identified several issues, which were promptly addressed. The fixes included adding a shared folder with datasets and Jupyter Notebooks, a shared package cache, and better persistence of running tasks in JupyterLab.

JupyterHub

JupyterHub has become a crucial component of our research infrastructure, enhancing our workflow significantly. Its smooth integration was largely due to the interface and functionality of JupyterHub, which closely resemble the tools our researchers and students were familiar with. This similarity played a key role in its quick adoption and high user satisfaction.

Figure 3: JupyterHub offers a list of predefined containers, some of which offer a GPU instance.

Upon logging into JupyterHub, users are presented with a list of predefined containers (as shown in Figure 3). Our recent update includes several options:

A basic minimal working environment.
A comprehensive data science environment equipped with multiple packages and support for Python, R, and Julia.
A selection of containers offering GPU instances.

The development environment greets users with a layout similar to modern IDEs, featuring a file explorer on the left and code editor tabs on the right (see Figure 4).

Figure 4: JupyterLab workspace with familiar layout: a file explorer on the left and code editor tabs on the right.

Grafana

For transparent insight into infrastructure availability, users have read-only access to the Grafana dashboard. The dashboard visualizes computing resource utilization, including metrics such as total and per-container CPU usage, memory usage per container, GPU utilization, temperature readings, and storage I/O (see Figure 5).

Figure 5: Visualization of computing cluster utilization showing total and per-container CPU utilization, per-container memory utilization, GPU slices utilization, temperatures, and storage I/O.

Conclusions

This article introduced our private cloud solution based on Kubernetes technology. This solution offers a scalable environment for using Jupyter Notebooks, an effective educational tool for data-driven narrative analysis, creating learning materials, and interactive presentations. Additionally, the system allows for the concurrent sharing of computing resources, significantly enhancing the utilization of our entire infrastructure.

The JupyterHub service on the Kubernetes platform facilitates easy access to the work environment and ensures user isolation, allowing for uninterrupted work and research. Users benefit from storage space and shared folders for file sharing, promoting collaboration and teamwork. Users also have access to compute accelerators when available.

We discussed our solution’s key components, architecture, and design decisions, revealing the technology choices that led to efficient operation and an exceptional user experience. Our focus has been on open-source platforms that have proven reliable and effective in our environment. As the core platform, Kubernetes enables scalable container management and high availability, while JupyterHub provides easy access to services and simplifies user management.

We plan to enhance our solution with additional services and technologies to improve user experience and increase our cloud’s performance. We remain open to new technologies and approaches that contribute to the better functioning of our private cloud solution for research and education. Data from Prometheus will be crucial for analyzing infrastructure utilization and understanding the extent of user competition for computing resources.

References

Bussonnier, Matthias. 2018. “Jupyter and HPC: Current state and future roadmap.” In Exascale Computing Project. https://www.exascaleproject.org/event/jupyter/.

Granger, Brian E. et al. 2021. “Jupyter: Thinking and Storytelling With Code and Data.” Computing in Science & Engineering 23 (2): 7–14. https://doi.org/10.1109/MCSE.2021.3059263.

Mendez, Kevin M et al. 2019. “Toward collaborative open data science in metabolomics using Jupyter Notebooks and cloud computing.” Metabolomics 15 (10): 1–16.

Perkel, Jeffrey M. 2018. “Why Jupyter is data scientists’ computational notebook of choice.” Nature 563 (7732): 145–47.

Footnotes

JSON: JavaScript Object Notation↩︎
REPL: read–eval–print loop↩︎
WETL: write-eval-think-loop↩︎

Reuse

CC BY-NC-SA 4.0

Generative Adversarial Networks

Gregor Cerar — Mon, 09 Oct 2023 22:00:00 GMT

Introduction

Generative Adversarial Networks (GANs) are an innovative class of unsupervised neural networks that have revolutionized the field of artificial intelligence. They were first introduced in Generative Adversarial Networks (Goodfellow et al. 2014) and consist of two separate neural networks: the generator (creates data) and the discriminator (evaluates data authenticity). The generator aims to fool the discriminator by producing realistic data, while the discriminator tries to differentiate real from fake. Over iterations, the generator’s data becomes more convincing.

As an analogy, consider two kids, one drawing counterfeit money (“Generator”) and another assessing its realism (“Discriminator”). Over time, the counterfeit drawings become increasingly convincing.

Vanilla GAN

The most fundamental variant of GAN is the “vanilla” GAN, where “vanilla” signifies the model in its original and most straightforward form rather than a flavor. To better understand its mechanism, I’ve illustrated its structure in Figure 1.

Figure 1: GAN architecture

Generator $G$ takes random noise $z$ as input and produces fabricated data $G(z)$.
- $z$ represents the input vector, a noise vector from the Gaussian distribution.
- $\theta_G$ denotes generator neural network weights.
- $G(z)$ is a fabricated data sample meant for the discriminator.
Discriminator $D$ differentiates between real and generated data.
- $x$ represents input vectors, which come from either a real dataset ($x \sim p_{data}$) or from the set of fabricated samples ($x = G(z)$).
- $\theta_D$ denotes discriminator neural network weights.

Objective Function

The interaction between the Generator and the Discriminator can be quantified by their objective or loss functions:

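Discriminator’s Objective: For real data $x$, $D$ wants $D(x)$ near $1$. For generated data $G(z)$, it targets $D(G(z))$ close to $0$. Its objective is:

$$\max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Generator’s Objective: $G$ aims for $D(G(z))$ to approach $1$, given by:

$$\min_G \; \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

Both $G$ and $D$ continuously improve to outperform each other in this game.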

Minimax Game in GANs

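Vanilla GANs are structured around the minimax game from game theory:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$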

In essence:

Discriminator: Maximizes its capacity to differentiate real data from generated.
Generator: Minimizes the discriminator’s success rate by producing superior forgeries.

The iterative competition refines both, targeting a proficient Generator and a perceptive Discriminator.
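
To make the value function $V(D, G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$ more concrete, here is a minimal sketch in plain Python that evaluates its two terms for hand-picked discriminator outputs. The function name `value_fn` and the sample probabilities are illustrative assumptions and not part of the original notebook.

```python
import math


def value_fn(d_real: list[float], d_fake: list[float]) -> float:
    """Sample estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator probabilities D(x) on real samples.
    d_fake: discriminator probabilities D(G(z)) on generated samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A confident, correct discriminator pushes V towards its upper bound of 0.
v_strong = value_fn(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])

# A discriminator that cannot tell the difference outputs 0.5 everywhere,
# which gives V = log(0.5) + log(0.5) = -log(4).
v_confused = value_fn(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(f"strong D:   V = {v_strong:.4f}")
print(f"confused D: V = {v_confused:.4f}")
```

The discriminator tries to push this value up, while the generator tries to pull it down by making the second term more negative.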

Prepare Components

In the upcoming sections, we’ll take the following steps to prepare the development environment:

Import necessary libraries, primarily PyTorch and Matplotlib.
Define constants, including project path and seed, for consistency.
Determine the computational device (e.g., GPU).
Provide a weight initialization helper function.

from collections.abc import Callable, Sequence
from pathlib import Path
from typing import Any, Final

import joblib
import numpy as np
from matplotlib import pyplot as plt

%config InlineBackend.figure_formats = {'retina', 'png'}

import torch
from torch import Tensor, nn, optim
from torch.utils.data import ConcatDataset, DataLoader, Dataset
from torchinfo import summary
from torchvision import transforms as T
from torchvision.utils import make_grid
from tqdm import tqdm

SEED: Final[int] = 42

PROJECT_PATH = Path(".").resolve()
FIGURE_PATH = PROJECT_PATH / "figures"
DATASET_PATH = Path.home() / "datasets"

# Common constants for all experiments
IMG_DIM: Final[tuple[int, int, int]] = (1, 28, 28)

device = torch.device("cpu")

if torch.cuda.is_available():
    device = torch.device("cuda")

def weights_init(net: nn.Module) -> None:
    for m in net.modules():
        if isinstance(m, nn.Conv2d | nn.ConvTranspose2d):
            nn.init.normal_(m.weight, 0.0, 0.02)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0.0)

        elif isinstance(m, nn.BatchNorm1d | nn.BatchNorm2d):
            nn.init.normal_(m.weight, 1.0, 0.02)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0.0)

        elif isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, 0, 0.02)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0.0)

Generator

The Generator in GANs acts as an artist, crafting data.

Input: Takes random noise, typically from a standard normal distribution.
Architecture: Uses dense layers, progressively increasing data dimensions.
Output: Reshapes data to desired format (e.g., image). Often uses ‘tanh’ for activation.
Objective: Generate data indistinguishable from real by the Discriminator.
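
Because a tanh output lies in the range [-1, 1], the real images have to be scaled to the same range. Later in this post that is done with `T.Normalize(0.5, 0.5)`. The sketch below reproduces the arithmetic in plain Python; the helper names `normalize` and `denormalize` are illustrative and not torchvision functions.

```python
def normalize(pixel: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Same arithmetic as torchvision's Normalize: (x - mean) / std."""
    return (pixel - mean) / std


def denormalize(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Inverse mapping, handy before displaying generated images."""
    return value * std + mean


# ToTensor() yields pixels in [0, 1]; after normalization they span [-1, 1],
# the same range that tanh can produce.
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
```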

class Generator(nn.Module):
    def __init__(self, out_dim: Sequence[int], nz: int = 100, ngf: int = 256, alpha: float = 0.2):
        """
        :param out_dim: output image dimension / shape
        :param nz: size of the latent z vector $z$
        :param ngf: size of feature maps (units in the hidden layers) in the generator
        :param alpha: negative slope of leaky ReLU activation
        """
        super().__init__()
        self.out_dim = out_dim
        self.model = nn.Sequential(
            nn.Linear(nz, ngf),
            nn.LeakyReLU(alpha, inplace=True),
            nn.Linear(ngf, 2 * ngf),
            nn.LeakyReLU(alpha, inplace=True),
            nn.Linear(2 * ngf, 4 * ngf),
            nn.LeakyReLU(alpha, inplace=True),
            nn.Linear(4 * ngf, int(np.prod(self.out_dim))),
            nn.Tanh(),
        )

    def forward(self, x: Tensor) -> Tensor:
        x = self.model(x)
        x = torch.reshape(x, (x.size(0), *self.out_dim))
        return x


summary(Generator(out_dim=(1, 28, 28)), input_size=[128, 100])

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Generator                                [128, 1, 28, 28]          --
├─Sequential: 1-1                        [128, 784]                --
│    └─Linear: 2-1                       [128, 256]                25,856
│    └─LeakyReLU: 2-2                    [128, 256]                --
│    └─Linear: 2-3                       [128, 512]                131,584
│    └─LeakyReLU: 2-4                    [128, 512]                --
│    └─Linear: 2-5                       [128, 1024]               525,312
│    └─LeakyReLU: 2-6                    [128, 1024]               --
│    └─Linear: 2-7                       [128, 784]                803,600
│    └─Tanh: 2-8                         [128, 784]                --
==========================================================================================
Total params: 1,486,352
Trainable params: 1,486,352
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 190.25
==========================================================================================
Input size (MB): 0.05
Forward/backward pass size (MB): 2.64
Params size (MB): 5.95
Estimated Total Size (MB): 8.63
==========================================================================================
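
The parameter counts in the summary above can be checked by hand: a fully connected layer has one weight per input-output pair plus one bias per output. A short sketch in plain Python, where `linear_params` is a hypothetical helper written for this check:

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weights (in * out) plus one bias per output unit."""
    return in_features * out_features + (out_features if bias else 0)


nz, ngf, out_pixels = 100, 256, 1 * 28 * 28
layer_sizes = [(nz, ngf), (ngf, 2 * ngf), (2 * ngf, 4 * ngf), (4 * ngf, out_pixels)]

per_layer = [linear_params(i, o) for i, o in layer_sizes]
print(per_layer)       # [25856, 131584, 525312, 803600]
print(sum(per_layer))  # 1486352
```

The same helper applied to the layer sizes of the dense Discriminator (784, 512, 256, 128, 1) gives 566,273 parameters, which matches its summary further down.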

Discriminator

The Discriminator is GAN’s evaluator, distinguishing real from fake data.

Input: Takes either real data samples or those from the Generator.
Architecture: Employs dense layers for binary classification of the input.
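Output: Produces a single raw score (logit) per sample. The final sigmoid, which would map it to a score between 0 and 1 reflecting the data’s authenticity, is commented out in the code below because the loss used later (BCEWithLogitsLoss) applies it internally.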
Objective: Recognize real data and identify fake data from the Generator.

class Discriminator(nn.Module):
    def __init__(self, input_dim: Sequence[int], ndf: int = 128, alpha: float = 0.2):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(int(np.prod(input_dim)), 4 * ndf),
            nn.LeakyReLU(alpha, inplace=True),
            nn.Dropout(0.3),
            nn.Linear(4 * ndf, 2 * ndf),
            nn.LeakyReLU(alpha, inplace=True),
            nn.Dropout(0.3),
            nn.Linear(2 * ndf, ndf),
            nn.LeakyReLU(alpha, inplace=True),
            nn.Dropout(0.3),
            nn.Linear(ndf, 1),
            # nn.Sigmoid(),
        )

    def forward(self, x: Tensor) -> Tensor:
        x = torch.reshape(x, (x.size(0), -1))
        return self.model(x)


summary(Discriminator(input_dim=(1, 28, 28)), input_size=[128, 1, 28, 28])

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Discriminator                            [128, 1]                  --
├─Sequential: 1-1                        [128, 1]                  --
│    └─Linear: 2-1                       [128, 512]                401,920
│    └─LeakyReLU: 2-2                    [128, 512]                --
│    └─Dropout: 2-3                      [128, 512]                --
│    └─Linear: 2-4                       [128, 256]                131,328
│    └─LeakyReLU: 2-5                    [128, 256]                --
│    └─Dropout: 2-6                      [128, 256]                --
│    └─Linear: 2-7                       [128, 128]                32,896
│    └─LeakyReLU: 2-8                    [128, 128]                --
│    └─Dropout: 2-9                      [128, 128]                --
│    └─Linear: 2-10                      [128, 1]                  129
==========================================================================================
Total params: 566,273
Trainable params: 566,273
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 72.48
==========================================================================================
Input size (MB): 0.40
Forward/backward pass size (MB): 0.92
Params size (MB): 2.27
Estimated Total Size (MB): 3.59
==========================================================================================

Training Loop

The training process is iterative:

Update Discriminator: With the Generator static, improve the Discriminator’s detection of real vs. fake.
Update Generator: With a static Discriminator, enhance the Generator’s ability to deceive.

Training continues until the Generator produces almost authentic data. Equilibrium is reached when the Discriminator sees every input as equally likely real or fake, assigning a probability of $0.5$.

Note

Switching between .eval() and .train() modes inside the training step initially seemed promising for faster training. However, the modes change the behavior of layers like BatchNorm2d and Dropout, which made the GAN diverge. Also, switching between eval and train modes is not free of charge.
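
As a small illustration of why the mode matters, the toy sketch below (separate from the training code) passes the same input through `nn.Dropout(0.3)`, the dropout setting used in the dense Discriminator, in both modes:

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.3)
x = torch.ones(1, 10)

layer.train()
out_train = layer(x)  # some entries zeroed, survivors scaled by 1 / (1 - p)

layer.eval()
out_eval = layer(x)  # identity: dropout is disabled in eval mode

print(out_train)
print(out_eval)
```

In train mode the layer is stochastic, while in eval mode it passes the input through unchanged, so a network evaluated in a different mode than it is trained in sees different activations.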

def train_step(
    generator: nn.Module,
    discriminator: nn.Module,
    optim_G: optim.Optimizer,
    optim_D: optim.Optimizer,
    criterion: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
    real_data: torch.Tensor,
    noise_dim: int,
    device: torch.device,
) -> tuple[float, float]:
    batch_size = real_data.size(0)
    real_data = real_data.to(device, non_blocking=True)

    ### Train Discriminator
    optim_D.zero_grad(set_to_none=True)

    noise = torch.randn(batch_size, noise_dim, device=device)

    output_real = discriminator(real_data)
    real_labels = torch.ones_like(output_real)
    loss_D_real = criterion(output_real, real_labels)

    fake_data = generator(noise)
    output_fake = discriminator(fake_data.detach())
    fake_labels = torch.zeros_like(output_fake)
    loss_D_fake = criterion(output_fake, fake_labels)

    loss_D = (loss_D_real + loss_D_fake) / 2

    loss_D.backward()
    optim_D.step()

    ### Train Generator
    optim_G.zero_grad(set_to_none=True)

    # Freeze D params so autograd does not waste work computing their grads
    for p in discriminator.parameters():
        p.requires_grad_(False)

    fake_data = generator(noise)
    output_fake = discriminator(fake_data)
    target_for_g = torch.ones_like(output_fake)
    loss_G = criterion(output_fake, target_for_g)

    loss_G.backward()
    optim_G.step()

    for p in discriminator.parameters():
        p.requires_grad_(True)

    return loss_G.detach().item(), loss_D.detach().item()
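
Note that `train_step` trains the generator against a target of ones through `criterion`, which later in this post is `nn.BCEWithLogitsLoss`. With that criterion, the generator minimizes $-\log D(G(z))$ (commonly called the non-saturating loss) instead of minimizing $\log(1 - D(G(z)))$ directly. The sketch below spells out the arithmetic in plain Python; `bce_with_logits` is a hand-written stand-in for illustration, not the PyTorch implementation.

```python
import math


def sigmoid(logit: float) -> float:
    return 1.0 / (1.0 + math.exp(-logit))


def bce_with_logits(logit: float, target: float) -> float:
    """Binary cross-entropy on a raw logit, in a numerically stable form.

    Equivalent to -[t * log(s) + (1 - t) * log(1 - s)] with s = sigmoid(logit).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


logit_fake = -2.0  # D is fairly sure the sample is fake: D(G(z)) ~ 0.12

# Generator step: target is 1, so the loss is -log D(G(z)).
loss_g = bce_with_logits(logit_fake, target=1.0)

# Discriminator step on the same sample: target is 0, loss is -log(1 - D(G(z))).
loss_d_fake = bce_with_logits(logit_fake, target=0.0)

print(f"D(G(z))     = {sigmoid(logit_fake):.4f}")
print(f"loss_G      = {loss_g:.4f}")
print(f"loss_D_fake = {loss_d_fake:.4f}")
```

When the discriminator confidently rejects a sample, the generator loss is large, which gives the generator a strong learning signal early in training.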

Evaluation

Before evaluation, we configured the learning rate (LR), the optimizer’s parameters, batch size, and data loader settings for all experiments. We used the MNIST digits and Fashion MNIST datasets for assessment.

OPTIMIZER_LR = 0.0002
L2_NORM = 1e-5
OPTIMIZER_BETAS = (0.5, 0.999)
N_EPOCHS = 100
BATCH_SIZE = 128

g = torch.Generator()
g.manual_seed(SEED)

loader_kwargs = {
    "num_workers": joblib.cpu_count(only_physical_cores=True),
    "pin_memory": True,
    "shuffle": True,
    "batch_size": BATCH_SIZE,
    "prefetch_factor": 2,
    "persistent_workers": True,
    "worker_init_fn": seed_worker,
    "generator": g,
}
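
The `loader_kwargs` above reference `seed_worker`, and later cells call `set_random_seed`, but neither helper is defined in this post. Below is a minimal sketch of what they might look like, assuming they follow the usual PyTorch reproducibility recipe. The function bodies are an assumption and not necessarily the code used for the experiments.

```python
import random

import numpy as np
import torch


def set_random_seed(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch (CPU and CUDA) random generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable


def seed_worker(worker_id: int) -> None:
    """Give each DataLoader worker a distinct but reproducible NumPy/random seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


# Re-seeding with the same value reproduces the same random numbers.
set_random_seed(42)
first = torch.randn(3)
set_random_seed(42)
second = torch.randn(3)
print(torch.equal(first, second))
```

Inside a worker process, `torch.initial_seed()` returns a seed derived from the `generator` passed to the `DataLoader`, which is why the worker function builds on it.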

MNIST Digits Dataset

The MNIST (Modified National Institute of Standards and Technology) dataset is a well-known collection of handwritten digits, extensively used in the fields of machine learning and computer vision for training and testing purposes. Its simplicity and size make it a popular choice for introductory courses and experiments in image recognition.

In total, the dataset contains 70,000 grayscale images of handwritten digits (from 0 to 9). Each image is 28x28 pixels.

def get_mnist_dataset(transform: T.Compose | None = None) -> Dataset:
    from torchvision.datasets import MNIST

    root = str(DATASET_PATH)
    trainset = MNIST(root=root, train=True, download=True, transform=transform)
    testset = MNIST(root=root, train=False, download=True, transform=transform)
    # Combine train and test dataset for more samples.
    dataset = ConcatDataset([trainset, testset])
    return dataset

NOISE_DIM = 100

transform = T.Compose([T.ToTensor(), T.Normalize(0.5, 0.5)])

dataset = get_mnist_dataset(transform=transform)
dataloader = DataLoader(dataset, **loader_kwargs)

# set seed for random generators
set_random_seed(seed=SEED)

# benchmark_noise is used for the animation to show how the output evolves on the same vector
benchmark_noise = torch.randn(16 * 16, NOISE_DIM, device=device)

generator = Generator(out_dim=IMG_DIM, nz=NOISE_DIM).to(device)
generator.apply(weights_init)

discriminator = Discriminator(input_dim=IMG_DIM).to(device)
discriminator.apply(weights_init)

optimizer_G = optim.AdamW(
    generator.parameters(),
    lr=OPTIMIZER_LR,
    betas=OPTIMIZER_BETAS,
    weight_decay=L2_NORM,
)

optimizer_D = optim.AdamW(
    discriminator.parameters(),
    lr=OPTIMIZER_LR,
    betas=OPTIMIZER_BETAS,
    weight_decay=L2_NORM,
)

criterion = nn.BCEWithLogitsLoss().to(device)

animation: list[Tensor] = []

g_losses: list[float] = []
d_losses: list[float] = []

for _ in tqdm(range(N_EPOCHS), unit="epochs"):
    generator.train()
    discriminator.train()

    for samples_real, _ in dataloader:
        g_loss, d_loss = train_step(
            generator,
            discriminator,
            optimizer_G,
            optimizer_D,
            criterion,
            samples_real,
            NOISE_DIM,
            device,
        )

        g_losses.append(g_loss)
        d_losses.append(d_loss)

    generator.eval()
    with torch.inference_mode():
        images = generator(benchmark_noise)
        images = images.cpu()

        images = make_grid(images, nrow=16, normalize=True)
        images = (images * 255).clamp(0, 255).to(torch.uint8)

        animation.append(images)

100%|██████████| 100/100 [05:44<00:00,  3.45s/epochs]

Generator and Discriminator loss evolution over epochs using Vanilla GAN on the MNIST digit dataset.
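
The plotting code is not part of this post. Since `g_losses` and `d_losses` hold one value per batch, one way to obtain a per-epoch curve is to average consecutive chunks of them. A small sketch, where `epoch_means` and the toy numbers are illustrative assumptions:

```python
import math


def epoch_means(batch_losses: list[float], batches_per_epoch: int) -> list[float]:
    """Average consecutive chunks of per-batch losses into one value per epoch."""
    means = []
    for start in range(0, len(batch_losses), batches_per_epoch):
        chunk = batch_losses[start : start + batches_per_epoch]
        means.append(sum(chunk) / len(chunk))
    return means


# With 70,000 images and a batch size of 128, every epoch has 547 batches
# (the last one is smaller than the rest).
batches_per_epoch = math.ceil(70_000 / 128)
print(batches_per_epoch)

# Toy example: three "epochs" of two batches each.
toy_losses = [1.0, 0.5, 0.5, 0.25, 0.25, 0.25]
print(epoch_means(toy_losses, batches_per_epoch=2))
```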

Fashion MNIST Dataset

The Fashion MNIST dataset is a collection of grayscale images of 10 different categories of clothing items, designed as a more challenging alternative to the classic MNIST dataset of handwritten digits. Each image in the dataset is 28x28 pixels. The 10 categories include items like t-shirts/tops, trousers, pullovers, dresses, coats, sandals, and more. With 70,000 images, Fashion MNIST is commonly used for benchmarking machine learning algorithms, especially in image classification tasks.

NOISE_DIM: int = 100

def get_mnist_fashion_dataset(transform: T.Compose | None = None) -> Dataset:
    from torchvision.datasets import FashionMNIST

    root = str(DATASET_PATH)
    trainset = FashionMNIST(root=root, train=True, download=True, transform=transform)
    testset = FashionMNIST(root=root, train=False, download=True, transform=transform)
    # Combine train and test dataset for more samples.
    dataset = ConcatDataset([trainset, testset])
    return dataset

transform = T.Compose([T.ToTensor(), T.Normalize(0.5, 0.5)])

data = get_mnist_fashion_dataset(transform=transform)
dataloader = DataLoader(data, **loader_kwargs)

# set seed for random generators
set_random_seed(seed=SEED)

# benchmark_noise is used for the animation to show how the output evolves on the same vector
benchmark_noise = torch.randn(16 * 16, NOISE_DIM, device=device)

generator = Generator(out_dim=IMG_DIM, nz=NOISE_DIM).to(device)
generator.apply(weights_init)

discriminator = Discriminator(input_dim=IMG_DIM).to(device)
discriminator.apply(weights_init)

optimizer_G = optim.AdamW(
    generator.parameters(),
    lr=OPTIMIZER_LR,
    betas=OPTIMIZER_BETAS,
    weight_decay=L2_NORM,
)

optimizer_D = optim.AdamW(
    discriminator.parameters(),
    lr=OPTIMIZER_LR,
    betas=OPTIMIZER_BETAS,
    weight_decay=L2_NORM,
)

criterion = nn.BCEWithLogitsLoss().to(device)

animation = []

g_losses, d_losses = [], []
for _ in tqdm(range(N_EPOCHS), unit="epochs"):
    generator.train()
    discriminator.train()

    for samples_real, _ in dataloader:
        g_loss, d_loss = train_step(
            generator,
            discriminator,
            optimizer_G,
            optimizer_D,
            criterion,
            samples_real,
            NOISE_DIM,
            device,
        )

        g_losses.append(g_loss)
        d_losses.append(d_loss)

    generator.eval()
    with torch.inference_mode():
        images = generator(benchmark_noise)
        images = images.cpu()

        images = make_grid(images, nrow=16, normalize=True)
        images = (images * 255).clamp(0, 255).to(torch.uint8)

        animation.append(images)

100%|██████████| 100/100 [05:47<00:00,  3.47s/epochs]

Generator and Discriminator loss evolution over epochs using Vanilla GAN on the Fashion MNIST dataset.

DCGAN

DCGAN, short for Deep Convolutional Generative Adversarial Network, differs from vanilla GAN by using convolutional layers. This design makes DCGAN better for image data. With specific architectural guidelines, DCGAN trains more consistently and generates clearer images than vanilla GANs across various hyperparameters.

Setting Up DCGANs

Generator

class Generator(nn.Module):
    def __init__(self, nz: int = 100, ngf: int = 32, nc: int = 1):
        """
        :param nz: size of the latent z vector
        :param ngf: size of feature maps in generator
        :param nc: number of channels in the training images.
        """
        super().__init__()
        self.layers = nn.Sequential(
            nn.ConvTranspose2d(nz, 4 * ngf, 4, 1, 0, bias=False),
            nn.BatchNorm2d(4 * ngf),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(4 * ngf, 2 * ngf, 3, 2, 1, bias=False),
            nn.BatchNorm2d(2 * ngf),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(2 * ngf, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, x: Tensor) -> Tensor:
        x = torch.reshape(x, (x.size(0), -1, 1, 1))
        return self.layers(x)


summary(Generator(), input_size=(128, 100))

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Generator                                [128, 1, 28, 28]          --
├─Sequential: 1-1                        [128, 1, 28, 28]          --
│    └─ConvTranspose2d: 2-1              [128, 128, 4, 4]          204,800
│    └─BatchNorm2d: 2-2                  [128, 128, 4, 4]          256
│    └─ReLU: 2-3                         [128, 128, 4, 4]          --
│    └─ConvTranspose2d: 2-4              [128, 64, 7, 7]           73,728
│    └─BatchNorm2d: 2-5                  [128, 64, 7, 7]           128
│    └─ReLU: 2-6                         [128, 64, 7, 7]           --
│    └─ConvTranspose2d: 2-7              [128, 32, 14, 14]         32,768
│    └─BatchNorm2d: 2-8                  [128, 32, 14, 14]         64
│    └─ReLU: 2-9                         [128, 32, 14, 14]         --
│    └─ConvTranspose2d: 2-10             [128, 1, 28, 28]          512
│    └─Tanh: 2-11                        [128, 1, 28, 28]          --
==========================================================================================
Total params: 312,256
Trainable params: 312,256
Non-trainable params: 0
Total mult-adds (Units.GIGABYTES): 1.76
==========================================================================================
Input size (MB): 0.05
Forward/backward pass size (MB): 24.26
Params size (MB): 1.25
Estimated Total Size (MB): 25.56
==========================================================================================
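
The spatial sizes in the summary above follow from the transposed-convolution size formula, out = (in - 1) * stride - 2 * padding + kernel, which holds for dilation 1 and no output padding. A short check in plain Python, where `conv_transpose2d_out` is a hypothetical helper written for this purpose:

```python
def conv_transpose2d_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of ConvTranspose2d with dilation=1 and output_padding=0."""
    return (size - 1) * stride - 2 * padding + kernel


# (kernel, stride, padding) of the four transposed convolutions in the generator
layers = [(4, 1, 0), (3, 2, 1), (4, 2, 1), (4, 2, 1)]

size = 1  # the latent vector is reshaped to (nz, 1, 1)
sizes = []
for kernel, stride, padding in layers:
    size = conv_transpose2d_out(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [4, 7, 14, 28]
```

The 3x3 kernel in the second layer is what turns 4x4 into 7x7, so that two further doublings land exactly on the 28x28 image size.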

Discriminator

class Discriminator(nn.Module):
    def __init__(self, ndf: int = 32, nc: int = 1, alpha: float = 0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ndf),
            nn.LeakyReLU(alpha, inplace=True),
            nn.Conv2d(ndf, 2 * ndf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(2 * ndf),
            nn.LeakyReLU(alpha, inplace=True),
            nn.Conv2d(2 * ndf, 4 * ndf, 3, 2, 1, bias=False),
            nn.BatchNorm2d(4 * ndf),
            nn.LeakyReLU(alpha, inplace=True),
            nn.Conv2d(4 * ndf, 1, 4, 1, 0, bias=False),
            # nn.Sigmoid(),
        )

    def forward(self, x: Tensor) -> Tensor:
        x = self.layers(x)
        x = torch.reshape(x, (x.size(0), -1))
        return x


summary(Discriminator(), input_size=(BATCH_SIZE, 1, 28, 28))

==========================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
==========================================================================================
Discriminator                            [128, 1]                  --
├─Sequential: 1-1                        [128, 1, 1, 1]            --
│    └─Conv2d: 2-1                       [128, 32, 14, 14]         512
│    └─BatchNorm2d: 2-2                  [128, 32, 14, 14]         64
│    └─LeakyReLU: 2-3                    [128, 32, 14, 14]         --
│    └─Conv2d: 2-4                       [128, 64, 7, 7]           32,768
│    └─BatchNorm2d: 2-5                  [128, 64, 7, 7]           128
│    └─LeakyReLU: 2-6                    [128, 64, 7, 7]           --
│    └─Conv2d: 2-7                       [128, 128, 4, 4]          73,728
│    └─BatchNorm2d: 2-8                  [128, 128, 4, 4]          256
│    └─LeakyReLU: 2-9                    [128, 128, 4, 4]          --
│    └─Conv2d: 2-10                      [128, 1, 1, 1]            2,048
==========================================================================================
Total params: 109,504
Trainable params: 109,504
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 369.68
==========================================================================================
Input size (MB): 0.40
Forward/backward pass size (MB): 23.46
Params size (MB): 0.44
Estimated Total Size (MB): 24.30
==========================================================================================
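
The shapes and parameter counts in the summary can be checked by hand. Below is a minimal sketch in plain Python, assuming the standard convolution output-size formula with dilation 1 and the default `ndf=32`, `nc=1`; the helper names `conv2d_out` and `conv2d_params` are mine and not part of PyTorch.

```python
def conv2d_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_params(c_in: int, c_out: int, kernel: int) -> int:
    """Number of weights in a square-kernel convolution without bias."""
    return c_in * c_out * kernel * kernel


# (in_channels, out_channels, kernel, stride, padding) for ndf=32, nc=1
layers = [(1, 32, 4, 2, 1), (32, 64, 4, 2, 1), (64, 128, 3, 2, 1), (128, 1, 4, 1, 0)]

size = 28  # MNIST images are 28x28
sizes, params = [], []
for c_in, c_out, k, s, p in layers:
    size = conv2d_out(size, k, s, p)
    sizes.append(size)
    params.append(conv2d_params(c_in, c_out, k))

# Each BatchNorm2d layer adds a scale and a shift per channel.
batchnorm_params = 2 * (32 + 64 + 128)

print(sizes)                           # spatial size after each convolution
print(params)                          # weights per convolution
print(sum(params) + batchnorm_params)  # total trainable parameters
```

The spatial size shrinks 28 → 14 → 7 → 4 → 1, and the total agrees with the 109,504 parameters reported above.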

Evaluation

MNIST Digits Dataset

NOISE_DIM = 128

transform = T.Compose(
    [
        T.ToTensor(),
        T.Normalize(0.5, 0.5),
    ]
)

data = get_mnist_dataset(transform)
dataloader = DataLoader(data, **loader_kwargs)

# set seed for random generators
set_random_seed()

# benchmark_noise is reused every epoch so the animation shows how the output evolves for the same noise vectors
benchmark_noise = torch.randn(16 * 16, NOISE_DIM, device=device)

generator = Generator(nz=NOISE_DIM, ngf=32, nc=IMG_DIM[0]).to(device)
generator.apply(weights_init)

discriminator = Discriminator(ndf=32, nc=IMG_DIM[0]).to(device)
discriminator.apply(weights_init)

optimizer_G = optim.AdamW(
    generator.parameters(),
    lr=OPTIMIZER_LR,
    betas=OPTIMIZER_BETAS,
    weight_decay=L2_NORM,
)

optimizer_D = optim.AdamW(
    discriminator.parameters(),
    lr=OPTIMIZER_LR,
    betas=OPTIMIZER_BETAS,
    weight_decay=L2_NORM,
)

criterion = nn.BCEWithLogitsLoss().to(device)

animation = []

g_losses, d_losses = [], []
for _ in tqdm(range(N_EPOCHS), unit="epochs"):
    generator.train()
    discriminator.train()

    for samples_real, _ in dataloader:
        g_loss, d_loss = train_step(
            generator,
            discriminator,
            optimizer_G,
            optimizer_D,
            criterion,
            samples_real,
            NOISE_DIM,
            device,
        )

        g_losses.append(g_loss)
        d_losses.append(d_loss)

    generator.eval()
    with torch.inference_mode():
        images = generator(benchmark_noise)
        images = images.cpu()

        images = make_grid(images, nrow=16, normalize=True)
        images = (images * 255).clamp(0, 255).to(torch.uint8)

        animation.append(images)

100%|██████████| 100/100 [04:50<00:00,  2.91s/epochs]

Generator and Discriminator loss evolution over epochs using DCGAN on the MNIST digit dataset.

MNIST Fashion Dataset

NOISE_DIM = 128

transform = T.Compose(
    [
        T.ToTensor(),
        T.Normalize(0.5, 0.5),
    ]
)

data = get_mnist_fashion_dataset(transform)
dataloader = DataLoader(data, **loader_kwargs)

# set seed for random generators
set_random_seed()

# benchmark_noise is reused every epoch so the animation shows how the output evolves for the same noise vectors
benchmark_noise = torch.randn(16 * 16, NOISE_DIM, device=device)

generator = Generator(nz=NOISE_DIM, ngf=32, nc=IMG_DIM[0]).to(device)
generator.apply(weights_init)

discriminator = Discriminator(ndf=32, nc=IMG_DIM[0]).to(device)
discriminator.apply(weights_init)

optimizer_G = optim.AdamW(
    generator.parameters(),
    lr=OPTIMIZER_LR,
    betas=OPTIMIZER_BETAS,
    weight_decay=L2_NORM,
)

optimizer_D = optim.AdamW(
    discriminator.parameters(),
    lr=OPTIMIZER_LR,
    betas=OPTIMIZER_BETAS,
    weight_decay=L2_NORM,
)

criterion = nn.BCEWithLogitsLoss().to(device)

animation = []

g_losses, d_losses = [], []
for _ in tqdm(range(N_EPOCHS), unit="epochs"):
    generator.train()
    discriminator.train()

    for samples_real, _ in dataloader:
        g_loss, d_loss = train_step(
            generator,
            discriminator,
            optimizer_G,
            optimizer_D,
            criterion,
            samples_real,
            NOISE_DIM,
            device,
        )

        g_losses.append(g_loss)
        d_losses.append(d_loss)

    generator.eval()
    with torch.inference_mode():
        images = generator(benchmark_noise)
        images = images.cpu()

        images = make_grid(images, nrow=16, normalize=True)
        images = (images * 255).clamp(0, 255).to(torch.uint8)

        animation.append(images)

100%|██████████| 100/100 [04:55<00:00,  2.96s/epochs]

Generator and Discriminator loss evolution over epochs using DCGAN on the MNIST fashion dataset.

Conclusion

Generative Adversarial Networks (GANs) represent an innovative class of unsupervised neural networks that have significantly impacted the field of artificial intelligence (AI). They consist of two components: a Generator that improves its output and a Discriminator that enhances its evaluative skills. In a competitive yet symbiotic relationship, these two networks converge towards a dynamic equilibrium. This interaction exemplifies the strength of GANs and the adaptability of adversarial learning in AI, blending creative generation with critical assessment.

In this post, I explored the original GAN, often referred to as the “vanilla” GAN, to understand the basic mechanics of GANs. Meanwhile, others have advanced this technology and applied it to a wide range of new areas.

18 Impressive Applications of Generative Adversarial Networks (GANs)

Tip

TODO for future refactoring: Replace nn.Sigmoid() in both Discriminator classes with raw logits and use nn.BCEWithLogitsLoss() instead of nn.BCELoss(). This combines the Sigmoid activation with the binary cross-entropy loss in a single numerically stable operation, following PyTorch best practices for GAN training.
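
To see why the fused loss is numerically safer, here is a minimal scalar sketch in plain Python. It is not PyTorch's actual implementation; the function names `bce_naive` and `bce_with_logits` are mine, and the stable form used is the well-known rewrite `max(x, 0) - x * y + log(1 + exp(-|x|))`.

```python
import math


def bce_naive(logit: float, target: float) -> float:
    """Sigmoid followed by binary cross-entropy, computed in two separate steps."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def bce_with_logits(logit: float, target: float) -> float:
    """The same loss computed directly from the logit in a numerically stable form."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


# For moderate logits both versions agree.
print(bce_naive(2.0, 1.0), bce_with_logits(2.0, 1.0))

# For a confident but wrong prediction the naive version breaks down:
# sigmoid(40) rounds to exactly 1.0 in float64, so log(1 - p) becomes log(0).
try:
    bce_naive(40.0, 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True

print(naive_failed, bce_with_logits(40.0, 0.0))
```

The fused version never evaluates the sigmoid on its own, so it stays finite even when the discriminator is very confident.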

References

Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, et al. 2014. Generative Adversarial Networks. https://arxiv.org/abs/1406.2661.

Reuse

CC BY-NC-SA 4.0

Neural Style Transfer

Gregor Cerar — Thu, 14 Sep 2023 22:00:00 GMT

Introduction

Neural Style Transfer (NST) is a deep learning technique that combines the content of one image with the style of another, like giving your photo a Van Gogh-esque makeover.

Using convolutional neural networks, NST examines both images’ features and creates a new image that merges the content’s structure with the style’s attributes. This technique became a hit due to its novel outcomes, leading to its adoption in various apps and platforms and highlighting deep learning’s prowess in image transformation.

Introduced initially in “A Neural Algorithm of Artistic Style” (Gatys et al. 2015), this method transfers art styles between images. Eager to learn how it works, I’ve implemented the original approach from scratch and presented a few cherry-picked transformed examples.

Prerequisites

Before we get started, we need to install NumPy, Matplotlib, the PyTorch deep learning framework, and the Torchvision library. The code below also imports tqdm for progress bars.

from collections.abc import Iterable, Sequence
from pathlib import Path

import numpy as np
from matplotlib import pyplot as plt

%config InlineBackend.figure_formats = {'retina', 'png'}

import torch
from torch import Tensor, nn, optim
from torch.nn import functional as F
from torchvision import models
from torchvision.io import decode_image
from torchvision.transforms import functional as VF
from torchvision.transforms import v2 as T
from torchvision.utils import make_grid
from tqdm import tqdm

# Random seed for reproducibility
SEED = 42

# Size of the output image
IMG_SIZE = 512

Although it is possible to run neural networks on a CPU, compute accelerators such as GPUs make the transformation much faster. Here, I use my NVIDIA RTX 3090 and take advantage of its tensor cores and the reduced-precision bfloat16 data type for a further speed-up.

AMP_ENABLED = False

device = torch.device("cpu")

if torch.cuda.is_available():
    device = torch.device("cuda")

    if torch.cuda.is_bf16_supported():
        AMP_ENABLED = True

Implementation

Figure 1: The Neural Style Transfer framework introduced by Gatys et al. distinguishes style and content features from designated layers.

Implementing NST was initially confusing since it does not follow the typical boilerplate used in deep learning. In the following sections, I’ll delve into its implementation step by step and often refer back to Figure 1. The steps are as follows:

Prepare the content, style, and target images.
Prepare a pre-trained VGG neural network and prevent changes to its weights.
Introduce three unique loss metrics.
Adjust the neural network to extract features during forward-backward passes, applying gradient modifications to the target image. The neural network stays unchanged in the process.
Iterate through this process.

# Weights for different features (were these used by original authors?)
STYLE_LAYERS_DEFAULT = {
    "conv1_1": 0.75,
    "conv2_1": 0.5,
    "conv3_1": 0.2,
    "conv4_1": 0.2,
    "conv5_1": 0.2,
}

CONTENT_LAYERS_DEFAULT = ("conv5_2",)

CONTENT_WEIGHT = 8  # "alpha" in the literature (default: 8)
STYLE_WEIGHT = 70  # "beta" in the literature (default: 70)
TV_WEIGHT = 10  # "gamma" in the literature (default: 10)


LEARNING_RATE = 0.004
N_EPOCHS = 5_000

Loss metrics

To effectively implement Neural Style Transfer, we need to quantify how well the generated image matches both the content and style of our source images. This is done using loss metrics. Let’s delve into the specifics of these metrics and how they drive the NST process.

Content loss metric

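Content loss is calculated through the squared Euclidean distance between the respective intermediate higher-level feature representations $F^l$ and $P^l$ of the original input image $\vec{x}$ and the content image $\vec{p}$ at layer $l$:

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left(F^l_{ij} - P^l_{ij}\right)^2$$

Here, a given input image $\vec{x}$ is encoded in each layer of the CNN by the filter responses to that image. A layer with $N_l$ distinct filters has $N_l$ feature maps of size $M_l$, where $M_l$ is the height times the width of the feature map. So the responses in a layer $l$ can be stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$, where $F^l_{ij}$ is the activation of the $i$-th filter at position $j$ in layer $l$. The implementation uses the mean squared error, which differs from the sum above only by a constant factor.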

def content_loss_func(target_features: dict[str, Tensor], precomputed_content_features: dict[str, Tensor]) -> Tensor:
    """Calculate the content loss metric for the given layers."""

    device = next(iter(target_features.values())).device
    content_loss = torch.tensor(0.0, device=device)

    for layer in precomputed_content_features:
        target_feature = target_features[layer]
        content_feature = precomputed_content_features[layer]

        content_loss += F.mse_loss(target_feature, content_feature)

    return content_loss
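
To make the scale difference concrete, here is a tiny worked example in plain Python. It compares the formulation from Gatys et al. (half the sum of squared differences) with the mean squared error that `F.mse_loss` computes by default. The feature values are made up, and both function names are mine.

```python
def squared_diffs(F, P):
    """Element-wise squared differences between two equally shaped matrices."""
    return [(f - p) ** 2 for row_f, row_p in zip(F, P) for f, p in zip(row_f, row_p)]


def content_loss_paper(F, P) -> float:
    """Half the sum of squared differences, as in Gatys et al."""
    return 0.5 * sum(squared_diffs(F, P))


def content_loss_mse(F, P) -> float:
    """Mean of squared differences (the default reduction of an MSE loss)."""
    diffs = squared_diffs(F, P)
    return sum(diffs) / len(diffs)


# Two filters (rows) with three spatial positions each (columns).
F_target = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
P_content = [[1.0, 0.0, 3.0], [0.0, 1.0, 2.0]]

print(content_loss_paper(F_target, P_content))  # 0.5 * (4 + 4)
print(content_loss_mse(F_target, P_content))    # (4 + 4) / 6
```

Both versions have the same minimizer; the different scale only changes how the content term trades off against the other losses, which is absorbed by the content weight.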

Style loss

The style loss is more involved than the content loss. We compute it by comparing the Gram matrices of the feature maps from the style image and the generated image.

First, let’s understand the Gram matrix. Given the feature map $F$ of size $C \times H \times W$, where $C$ is the number of channels and $H \times W$ are the spatial dimensions, the Gram matrix $G$ is of size $C \times C$ and is computed as

$$G_{ij} = \sum_{k} F_{ik} F_{jk}$$

where $k$ runs over the $H \cdot W$ spatial positions, so $G_{ij}$ is the inner product between vectorized feature maps $i$ and $j$. This results in a matrix that captures the correlation between different feature maps and, thus, the style information.

def gram_matrix(tensor: Tensor) -> Tensor:
    (b, c, h, w) = tensor.size()

    # reshape into ((B * C) x (H * W)); with a batch size of 1 this is (C x (H * W))
    features = tensor.view(b * c, h * w)

    # compute the gram product
    gram = torch.mm(features, features.t())

    return gram
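
A tiny worked example helps to see what the Gram matrix captures. The sketch below uses plain Python lists instead of tensors; the feature values are made up for illustration.

```python
def gram(features: list[list[float]]) -> list[list[float]]:
    """Gram matrix of row vectors: entry (i, j) is the dot product of rows i and j."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features] for fi in features]


# Three channels, each flattened to four spatial positions (e.g. a 2x2 map).
features = [
    [1.0, 0.0, 1.0, 0.0],
    [2.0, 0.0, 2.0, 0.0],  # same pattern as channel 0, twice as strong
    [0.0, 1.0, 0.0, 1.0],  # active exactly where channel 0 is silent
]

G = gram(features)
for row in G:
    print(row)

# Shuffling the spatial positions (the same way for every channel) leaves G unchanged.
shuffled = [[row[k] for k in (3, 1, 0, 2)] for row in features]
print(gram(shuffled) == G)
```

Channels 0 and 1 fire together, so their off-diagonal entry is large, while channel 2 never overlaps with them and gets zeros. The shuffle check shows that the Gram matrix discards where things are in the image and keeps only which features co-occur.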

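The style loss between the Gram matrix of the generated image and that of the style image (at a specific layer $l$) is:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G^l_{ij} - A^l_{ij}\right)^2$$

where $E_l$ is the style loss for layer $l$, and $N_l$ and $M_l$ are the number of channels and the height times width of the feature representation at layer $l$, respectively. $A^l$ and $G^l$ are the Gram matrices of the intermediate representations of the style image and the input base image, respectively.

The total style loss is:

$$\mathcal{L}_{style} = \sum_{l} w_l E_l$$

where $w_l$ is the weight assigned to layer $l$. The implementation follows the same idea but uses a slightly different normalization (mean squared error divided by $C \cdot H \cdot W$).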

def style_loss_func(
    target_features: dict[str, Tensor], style_features: dict[str, Tensor], precomputed_style_grams: dict[str, Tensor]
) -> Tensor:
    device = next(iter(target_features.values())).device
    style_loss = torch.tensor(0.0, device=device)

    for layer in style_features:
        target_feature = target_features[layer]
        target_gram = gram_matrix(target_feature)

        style_gram = precomputed_style_grams[layer]

        _, c, h, w = target_feature.shape

        weight = STYLE_LAYERS_DEFAULT[layer]
        layer_style_loss = weight * F.mse_loss(target_gram, style_gram) / (c * h * w)
        style_loss += layer_style_loss

    return style_loss
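
For comparison, here is a small sketch of the per-layer style loss with the normalization from Gatys et al., which scales the squared Gram difference by 1 / (4 N² M²). This is not the function above, which normalizes differently. The Gram matrices and the second layer's loss value are made up; the two layer weights are taken from `STYLE_LAYERS_DEFAULT`.

```python
def style_layer_loss(G, A, n_filters: int, n_positions: int) -> float:
    """Per-layer style loss: squared Gram difference scaled by 1 / (4 N^2 M^2)."""
    sq = sum((g - a) ** 2 for row_g, row_a in zip(G, A) for g, a in zip(row_g, row_a))
    return sq / (4 * n_filters**2 * n_positions**2)


# Hypothetical Gram matrices for a layer with N=2 filters and M=4 positions.
G_target = [[2.0, 4.0], [4.0, 8.0]]
A_style = [[2.0, 0.0], [0.0, 8.0]]

e_l = style_layer_loss(G_target, A_style, n_filters=2, n_positions=4)
print(e_l)  # (16 + 16) / (4 * 4 * 16)

# Weighted sum over layers; the loss for conv2_1 is an arbitrary placeholder.
layer_losses = {"conv1_1": e_l, "conv2_1": 0.5}
layer_weights = {"conv1_1": 0.75, "conv2_1": 0.5}
total = sum(layer_weights[name] * layer_losses[name] for name in layer_losses)
print(total)
```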

Total Variation Loss

Total Variation (TV) loss, also known as Total Variation Regularization, is commonly added to the Neural Style Transfer objective to encourage spatial smoothness in the generated image. Without it, the output might exhibit noise or oscillations, particularly in regions where the content and style objectives don’t offer much guidance.

Given an image $I$ of size $H \times W \times C$ (height, width, channels), the Total Variation loss is defined as the sum of the absolute differences between neighboring pixel values:

$$\mathcal{L}_{TV}(I) = \sum_{i,j} \left( \left| I_{i+1,j} - I_{i,j} \right| + \left| I_{i,j+1} - I_{i,j} \right| \right)$$

where $I_{i,j}$ is the pixel value at position $(i, j)$.

In simple terms, this loss penalizes abrupt changes in pixel values from one pixel to its neighbors. By minimizing this loss, the generated image becomes smoother, reducing artifacts and unwanted noise. When combined with content and style losses, the TV loss ensures that the resulting image not only captures the content and style of the source images but also looks visually coherent and smooth.

def total_variance_loss_func(target: Tensor) -> Tensor:
    tv_loss = F.l1_loss(target[:, :, :, :-1], target[:, :, :, 1:]) + F.l1_loss(
        target[:, :, :-1, :], target[:, :, 1:, :]
    )

    return tv_loss
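
The effect is easy to see on a toy single-channel image. The sketch below sums the absolute neighbor differences in plain Python; note that the function above uses `F.l1_loss`, which averages instead of summing, so the absolute numbers differ while the ordering stays the same.

```python
def total_variation(img: list[list[float]]) -> float:
    """Sum of absolute differences between vertical and horizontal neighbours."""
    h, w = len(img), len(img[0])
    vertical = sum(abs(img[i + 1][j] - img[i][j]) for i in range(h - 1) for j in range(w))
    horizontal = sum(abs(img[i][j + 1] - img[i][j]) for i in range(h) for j in range(w - 1))
    return vertical + horizontal


smooth = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
noisy = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

print(total_variation(smooth))  # flat image, no variation
print(total_variation(noisy))   # checkerboard, every neighbour pair differs by 1
```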

Total Loss

The total loss combines three loss metric components, each targeting a specific aspect of the image generation process. Let’s recap the components:

Content Loss: Ensures the generated image resembles the content image’s content.
Style Loss: Ensures the generated image captures the stylistic features of the style image.
Total Variation Loss: Encourages spatial smoothness in the generated image, reducing artifacts and noise.

Given the above components, the total loss for Neural Style Transfer can be formulated as:

$$\mathcal{L}_{total} = \alpha \mathcal{L}_{content} + \beta \mathcal{L}_{style} + \gamma \mathcal{L}_{TV}$$

$\alpha$, $\beta$, and $\gamma$ are weight factors that determine the relative importance of the content, style, and total variation losses, respectively. By adjusting these weights, one can control the balance between content preservation, style transfer intensity, and the smoothness of the generated image. The algorithm aims to adjust the generated image to minimize the total loss.
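
As a quick illustration of how the weights shift the balance, here is a minimal sketch using the weight constants defined earlier in this post. The unweighted loss values are made up.

```python
CONTENT_WEIGHT, STYLE_WEIGHT, TV_WEIGHT = 8, 70, 10  # alpha, beta, gamma


def total_loss(content: float, style: float, tv: float) -> float:
    """Weighted sum of the three loss components."""
    return CONTENT_WEIGHT * content + STYLE_WEIGHT * style + TV_WEIGHT * tv


# Made-up unweighted loss values, only to show the contribution of each term.
print(total_loss(content=1.0, style=0.5, tv=0.25))  # 8 + 35 + 2.5
```

Even though the raw style loss is half the content loss here, it dominates the total once the weights are applied.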

Input preparation

Here we specify the paths to the content and style images:

content_path = "./bridge.jpg"
style_path = "./walking-in-the-rain.jpg"

Neural Style Transfer Process

For feature extraction, we’ll leverage VGG19, pre-trained on ImageNet, the same network the original authors used. Note that we set the model to evaluation mode and disable gradients for its parameters, ensuring we only use VGG19 to extract features without altering its weights. We also transfer the neural network (NN) to a chosen device, ideally a GPU, for optimal performance.

Note

An intriguing choice by Gatys et al. was to modify VGG-19, replacing max pooling with average pooling, aiming for visually superior results. However, a challenge arises: our NN was initially trained with MaxPool2d layers. Substituting them can affect activations, because averaging a window yields smaller values than taking its maximum. To counteract this, we’ve introduced a custom ScaledAvgPool2d. The replacement is optional and left disabled in the code below.

# We will use a frozen pre-trained VGG neural network for feature extraction.
# In the original paper, authors have used VGG19 (without batch normalization)
model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features


# Authors in the original paper suggested using AvgPool instead of MaxPool
# for more pleasing results. However, changing the pooling also affects
# activation, so the input needs to be scaled (can't find the original source).
class ScaledAvgPool2d(nn.Module):
    def __init__(self, kernel_size, stride, padding=0, scale_factor=2.0):
        super().__init__()
        self.avgpool = torch.nn.AvgPool2d(kernel_size, stride, padding)
        self.scale_factor = scale_factor

    def forward(self, x):
        return self.avgpool(x) * self.scale_factor


# (OPTIONAL) Replace max-pooling layers with custom average pooling layers
# for i, layer in enumerate(model):
#     if isinstance(layer, torch.nn.MaxPool2d):
#         model[i] = ScaledAvgPool2d(kernel_size=2, stride=2, padding=0)

model = model.eval().requires_grad_(False).to(device)
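
To get a feeling for why average pooling lowers the activations, here is a minimal sketch in plain Python that pools a small, made-up activation map with 2x2 windows. The helper `pool2x2` is mine, and the scale factor of 2.0 is simply the default of `ScaledAvgPool2d` above.

```python
def pool2x2(x, reduce):
    """Apply `reduce` to each non-overlapping 2x2 window (kernel 2, stride 2)."""
    out = []
    for i in range(0, len(x), 2):
        row = []
        for j in range(0, len(x[0]), 2):
            window = [x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1]]
            row.append(reduce(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


# A sparse, post-ReLU-like activation map: mostly zeros with a few strong responses.
x = [
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 8.0],
]

max_pooled = pool2x2(x, max)
avg_pooled = pool2x2(x, mean)
scaled_avg = [[2.0 * v for v in row] for row in avg_pooled]

print(max_pooled)
print(avg_pooled)
print(scaled_avg)
```

For sparse windows the average is only a quarter of the maximum, while for a uniform window both agree, so a single scale factor is a rough compensation rather than an exact one.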

The pretrained VGG model was trained on normalized ImageNet samples. For effective style transfer, we’ll follow suit so that feature extraction sees inputs in the range it expects. Though images will appear altered post-normalization, they are reverted to their original state after the NST process. Next, we’ll transform the content and style images by:

Loading them from storage.
Resizing while maintaining aspect ratio.
Converting to tensors.
Normalizing using the ImageNet per-channel mean and standard deviation.

# ImageNet per-channel mean and standard deviation used for normalization
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

transform = T.Compose(
    [
        T.ToImage(),
        T.Resize(IMG_SIZE),  # Shorter edge of the image will be matched to `IMG_SIZE`
        T.ToDtype(torch.float32, scale=True),
        T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
    ]
)


def load_image(path: str | Path) -> Tensor:
    img = decode_image(str(path))

    # Transform images into tensors
    img: Tensor = transform(img)

    # Add dimension to imitate batch size equal to 1: (C,H,W) -> (B,C,H,W)
    img = img.unsqueeze(0)
    return img
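
To make the normalization step concrete, here is a small sketch in plain Python that normalizes a single RGB pixel with the ImageNet statistics and then reverts it. The function names are mine; the arithmetic is the usual `(x - mean) / std` and its inverse.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize(pixel):
    """Per-channel (x - mean) / std."""
    return tuple((v - m) / s for v, m, s in zip(pixel, IMAGENET_MEAN, IMAGENET_STD))


def denormalize(pixel):
    """Per-channel inverse: x * std + mean."""
    return tuple(v * s + m for v, m, s in zip(pixel, IMAGENET_MEAN, IMAGENET_STD))


rgb = (1.0, 0.5, 0.0)  # one pixel, channels scaled to [0, 1]
normed = normalize(rgb)
restored = denormalize(normed)

print(normed)    # values now fall outside [0, 1]
print(restored)  # matches the original pixel up to floating-point error
```

This is why a normalized image looks wrong when displayed directly: its values are no longer valid colors until the inverse transform is applied.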

The following code prepares the content, style, and target images. The target image is a clone of the content image, and we enable gradient computation on it.

# The "style" image from which we obtain style
style = load_image(style_path).to(device)

# The "content" image on which we apply style
content = load_image(content_path).to(device)

# The "target" image to store the outcome
target = content.clone().requires_grad_(True).to(device)

The function below retrieves feature maps from designated layers. As shown in Figure 1:

Content feature map comes from relu5_2.
Style feature maps are sourced from relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1.

def get_features(image: Tensor, model: nn.Module, layers: Iterable[str] | None = None) -> dict[str, Tensor]:
    if layers is None:
        layers = tuple(STYLE_LAYERS_DEFAULT.keys()) + CONTENT_LAYERS_DEFAULT

    features = {}
    block_num = 1
    conv_num = 0

    x = image

    for layer in model.children():
        x = layer(x)

        if isinstance(layer, nn.Conv2d):
            # produce layer name to find matching convolutions from the paper
            # and store their output for further processing.
            conv_num += 1
            name = f"conv{block_num}_{conv_num}"
            if name in layers:
                features[name] = x

        elif isinstance(layer, nn.MaxPool2d | nn.AvgPool2d | ScaledAvgPool2d):
            # In VGG, each block ends with max/avg pooling layer.
            block_num += 1
            conv_num = 0

        elif isinstance(layer, nn.BatchNorm2d | nn.ReLU):
            pass

        else:
            raise Exception(f"Unknown layer: {layer}")

    return features
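
The naming scheme is easier to follow without the network itself. The sketch below replays it on the layer layout of VGG19's feature extractor (numbers are convolution output channels, "M" marks a pooling layer). ReLU layers are omitted because they do not affect the names; the `conv_names` helper is mine.

```python
VGG19_CFG = [
    64, 64, "M",
    128, 128, "M",
    256, 256, 256, 256, "M",
    512, 512, 512, 512, "M",
    512, 512, 512, 512, "M",
]


def conv_names(cfg) -> list[str]:
    """Name every convolution as conv{block}_{index within block}."""
    names, block, conv = [], 1, 0
    for item in cfg:
        if item == "M":  # pooling closes the current block
            block += 1
            conv = 0
        else:  # a convolution inside the current block
            conv += 1
            names.append(f"conv{block}_{conv}")
    return names


names = conv_names(VGG19_CFG)
print(len(names))  # number of convolutions
print(names[:3], names[-3:])
```

All layer names used in `STYLE_LAYERS_DEFAULT` and `CONTENT_LAYERS_DEFAULT` appear in this list, which is a handy sanity check when changing the configuration.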

Since the content and style images never change, we can precompute their feature maps and Gram matrices to speed up the NST process.

# Precompute content features, style features, and style gram matrices.
content_features = get_features(content, model, CONTENT_LAYERS_DEFAULT)
style_features = get_features(style, model, STYLE_LAYERS_DEFAULT)

style_grams = {layer: gram_matrix(style_features[layer]) for layer in style_features}

Next, we use the Adam optimizer and pass it only the target image, so the image is the only thing being optimized.

optimizer = optim.Adam([target], lr=LEARNING_RATE)
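
Optimizing the input instead of the weights is the unusual part of NST, so here is a toy sketch of the idea in plain Python. A single fixed weight stands in for the frozen network and a single number stands in for the target image. It uses plain gradient descent rather than Adam, and all names and values are made up for illustration.

```python
def feature(x: float, w: float = 3.0) -> float:
    """A stand-in for the frozen network: one fixed weight that is never updated."""
    return w * x


def loss_and_grad(x: float, wanted: float, w: float = 3.0) -> tuple[float, float]:
    """Squared error between feature(x) and the wanted feature, plus d(loss)/dx."""
    diff = feature(x, w) - wanted
    return diff**2, 2.0 * diff * w


x = 0.0       # the "target image", here a single number
wanted = 6.0  # precomputed feature we want to match
lr = 0.01

for _ in range(200):
    loss, grad = loss_and_grad(x, wanted)
    x -= lr * grad  # the update goes to the input; w stays at 3.0

print(round(x, 4), round(loss, 8))
```

The input converges to the value whose feature matches the wanted one, while the "network" is left untouched, which mirrors what happens to the target image.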

The final step of NST is to transfer style using everything we’ve implemented. We extract feature maps, compute total loss, perform steps using gradient descent, and repeat the process N_EPOCHS times. Gradient changes will apply only to the target image.

To notably speed up NST, I used mixed precision with the bfloat16 data type available on newer hardware. In my tests, traditional half-precision float16 did not yield the same results, probably because its narrower dynamic range calls for gradient scaling, which this loop does not use.

pbar = tqdm(range(N_EPOCHS))

for _ in pbar:
    with torch.autocast("cuda", dtype=torch.bfloat16, enabled=AMP_ENABLED):
        target_features = get_features(target, model)

        content_loss = CONTENT_WEIGHT * content_loss_func(target_features, content_features)
        style_loss = STYLE_WEIGHT * style_loss_func(target_features, style_features, style_grams)
        tv_loss = TV_WEIGHT * total_variance_loss_func(target)

        total_loss = content_loss + style_loss + tv_loss

    optimizer.zero_grad()
    total_loss.backward()

    optimizer.step()

    pbar.set_postfix_str(
        f"total_loss={total_loss.item():.2f} "  # noqa: E501
        f"content_loss={content_loss.item():.2f} "
        f"style_loss={style_loss.item():.2f} "
        f"tv_loss={tv_loss.item():.2f}"
    )

100%|██████████| 5000/5000 [01:37<00:00, 51.45it/s, total_loss=43.91 content_loss=8.70 style_loss=29.11 tv_loss=6.11]
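
The difference in dynamic range between float16 and bfloat16 can be demonstrated with the standard library alone. In the sketch below, `struct`'s `"e"` format gives IEEE half precision, and bfloat16 is emulated by keeping only the top 16 bits of a float32. This truncation is a simplification (real hardware typically rounds), and the tiny gradient value is hypothetical.

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip a value through IEEE half precision (5 exponent bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Emulate bfloat16 by keeping the top 16 bits of a float32 (8 exponent bits)."""
    top = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top + b"\x00\x00")[0]


tiny_grad = 1e-8  # a hypothetical small gradient value

print(to_float16(tiny_grad))   # underflows to zero
print(to_bfloat16(tiny_grad))  # survives, with reduced precision

# The trade-off: float16 keeps more mantissa bits near 1.0 than bfloat16.
print(to_float16(1.001), to_bfloat16(1.001))
```

Small values that vanish in float16 are the reason float16 training is usually paired with gradient scaling, whereas bfloat16 shares float32's exponent range and gives up precision instead.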

As mentioned before, the images need to be denormalized (i.e., reverted) to restore the correct colors. After that, we compare the content, style, and target images side by side.

class InverseNormalize:
    def __init__(self, mean: Sequence[float], std: Sequence[float]) -> None:
        self.mean = torch.as_tensor(mean)
        self.std = torch.as_tensor(std)

    def __call__(self, x_norm: Tensor) -> Tensor:
        # Ensure mean and std have the correct shape
        mean = self.mean.to(x_norm.device).view(-1, 1, 1)
        std = self.std.to(x_norm.device).view(-1, 1, 1)

        # Inverse normalization: x = x_normalized * std + mean
        x = x_norm.mul(std).add(mean)
        return x


class Clip:
    def __init__(self, vmin: float = 0.0, vmax: float = 1.0) -> None:
        self.vmin = vmin
        self.vmax = vmax

    def __call__(self, x: Tensor) -> Tensor:
        return torch.clamp(x, self.vmin, self.vmax)


inv_transform_preview = T.Compose(
    [
        InverseNormalize(IMAGENET_MEAN, IMAGENET_STD),
        T.Resize(IMG_SIZE, antialias=True),
        T.CenterCrop((IMG_SIZE, IMG_SIZE)),
        Clip(),
    ]
)

imgs = [inv_transform_preview(i.detach().squeeze().cpu()) for i in (content, style, target)]

grid = make_grid(imgs)


def show(imgs):
    if not isinstance(imgs, list):
        imgs = [imgs]

    fig, axs = plt.subplots(ncols=len(imgs), figsize=(15, 5), squeeze=False, dpi=92, tight_layout=True, frameon=False)
    for i, img in enumerate(imgs):
        img = img.detach()
        img = VF.to_pil_image(img)
        axs[0, i].imshow(np.asarray(img))
        axs[0, i].set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])


show(grid)

Successfully applied neural style transfer. The content image (left), the style image (center), and the final target image (right).

Conclusions

Neural Style Transfer (NST) was a breakthrough deep learning approach that can transfer artistic style from one image to another. The key takeaway from my experience is the incredible potential of neural networks in merging art and tech, seamlessly blending the styles of different artworks with original images.

What stood out was the use of a pre-trained neural network for feature extraction, extracting feature maps from particular layers, and then the ability to balance the content and style weight parameters to maintain the essence of the original image while effectively imitating the artistic style.

Although NST achieves pleasing results, it was soon overshadowed by faster and more advanced methods, such as DALL-E, Stable Diffusion, and Midjourney. However, it represented a significant milestone toward artistic AI and generative AI models.

Acknowledgements

Helpful articles and code repositories while writing my implementation:

Gregor Koehler et al. gkoehler/pytorch-neural-style-transfer (best resource in my opinion)
Ritul’s Medium article (good resource)
Pragati Baheti’s blog, which visually presents style extraction
Aleksa Gordić (gordicaleksa/pytorch-neural-style-transfer)
ProGamerGov/neural-style-pt
Katherine Crowson (crowsonkb/style-transfer-pytorch)
Derrick Mwiti’s Medium article
Aman Kumar Mallik’s Medium article

I want to acknowledge the following artworks:

“Gray Bridge and Trees” by Martin Damboldt
“Walking in the Rain” by Leonid Afremov
“The Starry Night” by Vincent van Gogh

For a complete list of acknowledgments, please visit my GitHub repository:

gcerar/pytorch-neural-style-transfer

Appendix

Examples

A few cherry-picked examples of style transfer:

References

Gatys, Leon A, Alexander S Ecker, and Matthias Bethge. 2015. “A Neural Algorithm of Artistic Style.” arXiv Preprint arXiv:1508.06576.

Reuse

CC BY-NC-SA 4.0