The multi-armed bandit problem is a classic problem of sequential decision making under uncertainty: given a number of alternatives (the “arms”), at every time step you choose one of them and receive a noisy reward, and the goal is to maximize the total reward over time. The problem has many applications, including reinforcement learning, portfolio allocation, and clinical trials. At its core, the multi-armed bandit problem is about the exploration-exploitation trade-off: how do you choose between sticking with the option you believe is best and exploring other options?
Thompson Sampling is a Bayesian approach to the multi-armed bandit problem: at every time step you draw a sample from the posterior distribution of each arm’s reward and choose the arm whose sample is highest. This is also known as probability matching, since each arm is selected with probability equal to the posterior probability that it is the optimal arm.
In this article we will do a deep dive into Thompson Sampling for the multi-armed bandit problem with normally-distributed rewards. We’ll derive the necessary mathematics and evaluate the algorithm by comparing it to a well-known alternative, the Upper Confidence Bound (UCB) algorithm. All code can be found on GitHub.
But first, here is an applet that illustrates the problem. You can play with the parameters to see how they affect the exploration-exploitation trade-off. You’ll notice that over time the uncertainty for the reward estimate of the optimal arm becomes very small.
Can you find out how each parameter changes the behavior? Are there settings that lead the algorithm to commit to a suboptimal arm?
In the next section I’ll present a derivation of the posterior distribution for Gaussian bandits. If you’d like to skip the math and jump directly to the (interactive) evaluation of the algorithm, click here.
Mathematical Derivation
We start with a few definitions. We consider the $k$-armed bandit problem with the action space $\mathcal{A} = \{1, \ldots, k\}.$ We assume that the reward for each arm $a \in \mathcal{A}$ is independently normally distributed with mean $\mu_a$ and variance $\sigma_a^2.$ If we write $A_t \in \mathcal{A}$ for the arm that was chosen at time $t$ and $X_t \in \mathbb{R}$ for the reward, then the likelihood function can be written as $P(X_t \given A_t = a) = \mathcal{N}(X_t ; \mu_a, \sigma_a^2).$
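To make this setup concrete, here is a minimal NumPy sketch of such a Gaussian bandit environment. The class name and interface are illustrative (they are not taken from the accompanying repository); it simply draws $X_t \sim \mathcal{N}(\mu_a, \sigma_a^2)$ for the chosen arm.

```python
import numpy as np

class GaussianBandit:
    """k-armed bandit with independent, normally-distributed rewards (illustrative sketch)."""

    def __init__(self, means, stds, seed=None):
        self.means = np.asarray(means, dtype=float)  # mu_a for each arm
        self.stds = np.asarray(stds, dtype=float)    # sigma_a for each arm
        self.rng = np.random.default_rng(seed)

    @property
    def k(self):
        return len(self.means)

    def pull(self, a):
        """Return a reward X_t ~ N(mu_a, sigma_a^2) for arm a."""
        return self.rng.normal(self.means[a], self.stds[a])
```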
We’ll assume the most general scenario, where both the mean and the variance of the reward are unknown. The conjugate prior for this case is the normal-inverse-gamma distribution, written as $\text{N-}\Gamma^{-1},$ such that $$ (\mu_a, \sigma_a^2) \given m_a, \nu_a, \alpha_a, \beta_a \sim \text{N-}\Gamma^{-1}(m_a, \nu_a, \alpha_a, \beta_a) \qquad \forall a \in \mathcal{A}, $$ where $m_a, \nu_a, \alpha_a,$ and $\beta_a$ are the hyperparameters and represent the prior mean $(m_a),$ the prior “count” $(\nu_a > 0),$ the prior shape parameter $(\alpha_a > 0),$ and the prior scale parameter $(\beta_a > 0).$ Occasionally we will write $$ P(\mu_a, \sigma_a^2 \given m_a, \nu_a, \alpha_a, \beta_a) = P(\mu_a, \sigma_a^2 \given \omega_a), $$ using the shorthand notation $\omega_a = (m_a, \nu_a, \alpha_a, \beta_a)$ for all hyperparameters.
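A draw from this prior can be generated hierarchically: first $\sigma_a^2$ from an inverse-gamma distribution with parameters $(\alpha_a, \beta_a),$ then $\mu_a \given \sigma_a^2 \sim \mathcal{N}(m_a, \sigma_a^2 / \nu_a).$ A small SciPy sketch, with the function name chosen for illustration:

```python
import numpy as np
from scipy import stats

def sample_normal_inverse_gamma(m, nu, alpha, beta, rng=None):
    """Draw (mu, sigma^2) ~ N-Gamma^{-1}(m, nu, alpha, beta)."""
    rng = np.random.default_rng(rng)
    # sigma^2 ~ Inverse-Gamma(alpha, beta); SciPy's invgamma takes beta via `scale`.
    sigma2 = stats.invgamma.rvs(alpha, scale=beta, random_state=rng)
    # mu | sigma^2 ~ N(m, sigma^2 / nu)
    mu = rng.normal(m, np.sqrt(sigma2 / nu))
    return mu, sigma2
```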
When running the experiment, we’ll choose an action $A_t$ and observe a
reward $X_t$ at each time step. Since we assume the arms are independent, we
only have to do the analysis for a single arm $a \in \mathcal{A}.$ We’ll use
$$
\mathcal{D}_{t,a} = \{ X_{\tau} : A_{\tau} = a, \tau = 1, \ldots, t
- 1\},
$$
for the set of rewards received for arm $a$ up to but excluding time $t.$ By
Bayes’ theorem we then have, for each arm:
$$
\begin{align*}
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a} ) &= \frac{
P(\mathcal{D}_{t,a} \given \mu_a, \sigma_a^2 ) P(\mu_a, \sigma_a^2 \given
\omega_a )}{P(\mathcal{D}_{t,a})} \\\
&\propto P(\mathcal{D}_{t,a} \given \mu_a, \sigma_a^2 ) P(\mu_a,
\sigma_a^2 \given \omega_a),
\end{align*}
$$
where we’ll ignore the denominator $P(\mathcal{D}_{t,a})$ because it doesn’t
depend on $\mu_a$ or $\sigma_a^2.$ The next step is therefore to derive the
functional form of the posterior $P(\mu_a, \sigma_a^2 \given
\mathcal{D}_{t,a}).$ While this is generally well known and can be quite a
tedious algebraic exercise, I’ll show the main steps below for didactic
purposes.
Let’s introduce $N_t(a) = |\mathcal{D}_{t,a}|$ for the number of times arm
$a$ was pulled before time $t,$ and denote the average reward $\bar{x}_{t,a}$
and the sum of squared deviations $s_{t,a}$ as
$$\begin{align*}
\bar{x}_{t,a} &= \frac{1}{N_t(a)} \sum_{\tau = 1}^{t - 1} \id{A_{\tau} =
a} X_{\tau} \\\
s_{t,a} &= \sum_{\tau = 1}^{t - 1} \id{A_{\tau} = a} (X_{\tau} -
\bar{x}_{t,a})^2,
\end{align*}
$$
where $\id{Z}$ is the indicator
function that equals 1 if
condition $Z$ is true and 0 otherwise. Using the Gaussian likelihood and the identity
$\sum_{\tau=1}^{t-1} \id{A_{\tau} = a} (X_{\tau} - \mu_a)^2 = N_t(a) (\bar{x}_{t,a} - \mu_a)^2 + s_{t,a}$
(add and subtract $\bar{x}_{t,a}$ inside the square and expand), we find
$$
\begin{align*}
P(\mathcal{D}_{t,a} \given \mu_a, \sigma_a^2) &= \left( \frac{1}{\sigma_a
\sqrt{2\pi}} \right)^{N_t(a)} \exp \left( - \frac{1}{2\sigma_a^2}
\sum_{\tau = 1}^{t - 1} \id{A_{\tau} = a} (X_{\tau} - \mu_a)^2 \right)
\\\
&= \left( \frac{1}{\sigma_a \sqrt{2\pi}} \right)^{N_t(a)} \exp \left( -
\frac{1}{2\sigma_a^2} \left[ N_t(a) (\bar{x}_{t,a} - \mu_a)^2 + s_{t,a}
\right] \right).
\end{align*}
$$
The normal-inverse-gamma prior distribution on the parameters is given by
$$
\begin{align*}
P(\mu_a, \sigma_a^2 \given \omega_a) &= P(\mu_a, \sigma_a^2 \given m_a,
\nu_a, \alpha_a, \beta_a) \\\
&= \frac{\sqrt{\nu_a}}{\sigma_a \sqrt{2\pi}}
\frac{\beta_a^{\alpha_a}}{\Gamma(\alpha_a)} \left( \frac{1}{\sigma_a^2}
\right)^{\alpha_a + 1} \exp \left( - \frac{2 \beta_a + \nu_a (\mu_a - m_a)^2
}{ 2 \sigma_a^2 } \right).
\end{align*}
$$
Using Bayes’ theorem the posterior then becomes
$$
\begin{align*}
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a}) &\propto P(\mathcal{D}_{t,a}
\given \mu_a, \sigma_a^2 ) P(\mu_a, \sigma_a^2 \given m_a, \nu_a,
\alpha_a, \beta_a) \\\
&= \left( \frac{1}{\sigma_a \sqrt{2\pi}} \right)^{N_t(a)} \exp \left( -
\frac{1}{2\sigma_a^2} \left[ N_t(a) (\bar{x}_{t,a} - \mu_a)^2 + s_{t,a}
\right] \right) \\\
&\qquad \cdot \frac{\sqrt{\nu_a}}{\sigma_a \sqrt{2\pi}}
\frac{\beta_a^{\alpha_a}}{\Gamma(\alpha_a)} \left( \frac{1}{\sigma_a^2}
\right)^{\alpha_a + 1} \exp \left( - \frac{2 \beta_a + \nu_a (\mu_a -
m_a)^2 }{ 2 \sigma_a^2 } \right).
\end{align*}
$$
The reason we’ve chosen a conjugate
prior for the normal
distribution is that the posterior is now also a normal-inverse-gamma
distribution. To obtain the parameters of the posterior distribution, we’ll
have to rewrite the above expression into a form where we have a normal
distribution over $\mu_a$ and an inverse gamma distribution over
$\sigma_a^2.$ That is, we want to obtain:
$$
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a}) \propto \mathcal{N}(\mu_a ;
\sigma_a^2, \omega_a) \mathcal{IG}(\sigma_a^2 ; \omega_a),
$$
where the normal distribution depends in some way on $\sigma_a^2$ and
$\omega_a,$ and the inverse gamma distribution depends only on the
hyperparameters $\omega_a.$ By rearranging the expression for the posterior
given above it can be shown that
$$
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a}) \propto \mathcal{N}(\mu_a ;
\rho_{t,a}, \zeta_a^2) \; \mathcal{IG}(\sigma_a^2 ; \tfrac{1}{2}N_t(a) +
\alpha_a, \beta_{t,a})
$$
where
$$
\begin{align*}
\rho_{t,a} &= \frac{\nu_a m_a + N_t(a) \bar{x}_{t,a}}{\nu_a + N_t(a)}
\\\
\zeta_a^2 &= \frac{\sigma_a^2}{N_t(a) + \nu_a} \\\
\beta_{t,a} &= \beta_a + \tfrac{1}{2} s_{t,a} + \frac{N_t(a) \nu_a
(\bar{x}_{t,a} - m_a)^2}{2 (N_t(a) + \nu_a)}.
\end{align*}
$$
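As a compact reference (and a way to sanity-check the algebra numerically), the posterior quantities for a single arm can be computed directly from that arm’s observed rewards. This sketch uses my own names (e.g. `posterior_params`), not those of the accompanying code:

```python
import numpy as np

def posterior_params(rewards, m, nu, alpha, beta):
    """Posterior normal-inverse-gamma parameters for one arm, given its rewards."""
    x = np.asarray(rewards, dtype=float)
    n = len(x)                                     # N_t(a)
    xbar = x.mean() if n > 0 else 0.0              # \bar{x}_{t,a}
    s = ((x - xbar) ** 2).sum()                    # s_{t,a}
    rho = (nu * m + n * xbar) / (nu + n)           # posterior mean of mu
    nu_post = nu + n                               # so that zeta^2 = sigma^2 / nu_post
    alpha_post = alpha + 0.5 * n                   # posterior shape
    beta_post = beta + 0.5 * s + n * nu * (xbar - m) ** 2 / (2.0 * (nu + n))
    return rho, nu_post, alpha_post, beta_post
```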
Algorithm
Having derived the expression for the posterior distribution, we can present the Thompson sampling algorithm for Gaussian bandits:
1. For each arm $a \in \mathcal{A}$:
   - Draw $\hat{\sigma}_a^2 \sim \mathcal{IG}(\tfrac{1}{2}N_t(a) + \alpha_a, \beta_{t, a}).$
   - Compute $\hat{\zeta}_a^2 = \hat{\sigma}_a^2 / (N_t(a) + \nu_a).$
   - Draw $\hat{\mu}_a \sim \mathcal{N}(\rho_{t,a}, \hat{\zeta}_a^2).$
2. Select the arm with the highest expected reward: $$ A_t = \underset{a \in \mathcal{A}}{\operatorname{arg\,max\;}} \mathbb{E}[X_t \given \hat{\mu}_a, \hat{\sigma}_a^2] = \underset{a \in \mathcal{A}}{\operatorname{arg\,max\;}}\hat{\mu}_a. $$
3. Pull arm $A_t$ and observe reward $X_t.$
4. Update $N_t(a),$ $\bar{x}_{t,a},$ $\rho_{t,a},$ $s_{t,a},$ and $\beta_{t,a}$ for $a = A_t.$ Go to step 1.
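Here is a hedged sketch of the sampling and selection steps (1 and 2) for all arms at once, assuming the per-arm posterior quantities $\rho_{t,a},$ $N_t(a) + \nu_a,$ $\tfrac{1}{2}N_t(a) + \alpha_a,$ and $\beta_{t,a}$ are kept in NumPy arrays (for instance as returned by a helper like `posterior_params` above); the names are illustrative:

```python
import numpy as np
from scipy import stats

def thompson_select(rho, nu_post, alpha_post, beta_post, rng):
    """Sample (mu_hat, sigma2_hat) per arm from the posterior; return the arg max of mu_hat."""
    # sigma2_hat ~ IG(N_t(a)/2 + alpha_a, beta_{t,a}) for every arm (vectorized draw).
    sigma2_hat = stats.invgamma.rvs(alpha_post, scale=beta_post, random_state=rng)
    zeta2_hat = sigma2_hat / nu_post          # nu_post = N_t(a) + nu_a
    mu_hat = rng.normal(rho, np.sqrt(zeta2_hat))
    return int(np.argmax(mu_hat))
```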
For the last step, it helps to know the following incremental update equations
so we can get an efficient implementation:
$$
\begin{align*}
N_{t+1}(a) &= N_t(a) + 1 \\\
\bar{x}_{t+1,a} &= \bar{x}_{t,a} + \frac{1}{N_t(a) + 1} (X_t -
\bar{x}_{t,a}) \\\
s_{t+1,a} &= s_{t,a} + X_t^2 + N_t(a) \bar{x}_{t,a}^2 - N_{t+1}(a)
\bar{x}_{t+1,a}^2.
\end{align*}
$$
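These recursions translate into an $O(1)$ bookkeeping routine per pull. A minimal sketch, assuming the per-arm statistics live in a dictionary of arrays (the names and layout are mine):

```python
def update_statistics(state, a, x):
    """Incrementally update N_t(a), xbar_{t,a}, and s_{t,a} after observing reward x for arm a."""
    n_old = state["N"][a]
    xbar_old = state["xbar"][a]
    n_new = n_old + 1
    xbar_new = xbar_old + (x - xbar_old) / n_new
    # Sum-of-squares recursion from the equations above; the algebraically equivalent
    # Welford form, s += (x - xbar_old) * (x - xbar_new), is more robust numerically.
    state["s"][a] += x**2 + n_old * xbar_old**2 - n_new * xbar_new**2
    state["N"][a] = n_new
    state["xbar"][a] = xbar_new
```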
Evaluation
To evaluate the convergence properties of bandit algorithms, it is common to look at the regret, which is defined as $$ R_t = t \mu^* - \mathbb{E}\left[\sum_{\tau=1}^t X_{\tau}\right], $$ where $\mu^*$ is the expected reward of the best arm: $\mu^* = \max_{a \in \mathcal{A}} \mu_a.$ The regret measures how much the expected total collected reward falls short of what we would have collected if we had always pulled the best arm. In the experiments below, we’ll instead use what’s known as the pseudo-regret, $$ \bar{R}_t = t \mu^* - \sum_{\tau = 1}^{t} \mu_{A_{\tau}}, $$ as it removes the expectation and the effect of reward noise, and is somewhat easier to interpret.
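In simulation the pseudo-regret is easy to track, because the true means are known to the experimenter: it is just a cumulative sum of the gaps $\mu^* - \mu_{A_\tau}$ of the arms actually pulled. A small sketch (the function name is mine):

```python
import numpy as np

def pseudo_regret(chosen_arms, true_means):
    """Cumulative pseudo-regret: bar{R}_t = t * mu_star - sum_{tau <= t} mu_{A_tau}."""
    true_means = np.asarray(true_means, dtype=float)
    mu_star = true_means.max()
    gaps = mu_star - true_means[np.asarray(chosen_arms)]
    return np.cumsum(gaps)  # one value of bar{R}_t per time step
```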
We’ll take a look at the behavior of Thompson Sampling for different hyperparameter configurations below. I’ll also introduce the Upper Confidence Bound (UCB) algorithm, another commonly used bandit algorithm, and we’ll see how the two algorithms compare.
If you’d like to play with the evaluation yourself, you can use the next applet. The applet also supports the $\varepsilon$-greedy algorithm, which chooses a random arm with probability $\varepsilon$ and otherwise selects the one with the highest average reward so far.
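For reference, the $\varepsilon$-greedy rule described here amounts to a couple of lines; a sketch under the same illustrative conventions as above:

```python
import numpy as np

def epsilon_greedy_select(xbar, epsilon, rng):
    """With probability epsilon pick a uniformly random arm, otherwise the empirical best one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(xbar)))
    return int(np.argmax(xbar))
```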
Can you find out which algorithm performs best, and with which parameter settings? Are there differences between short-term and long-term performance?