The multi-armed bandit problem is the problem of sequential decision making under uncertainty. Given a number of alternatives, at every time step you choose one of the options and receive a noisy reward, and the goal is to maximize the total reward over time. This problem has many applications, including in reinforcement learning, portfolio allocation, clinical trials, and more. Inherently, the multi-armed bandit problem is about the exploration-exploitation trade-off: how do you choose between sticking with the option you believe is best and exploring other options?

Thompson Sampling is a Bayesian approach to the multi-armed bandit problem: at every time step you sample from the posterior distribution of the reward for each arm, and choose the one that gives the highest reward. This is also known as probability matching, since the probability that an arm is optimal is matched to the probability that sampling from that arm’s posterior distribution gives the highest reward.

In this article we will do a deep dive into Thompson Sampling for the multi-armed bandit problem with normally-distributed rewards. We’ll derive the necessary mathematics and evaluate the algorithm by comparing to a well-known alternative called the Upper Confidence Bound (UCB) algorithm. All code can be found on GitHub.

But first, here is an applet that illustrates the problem. You can play with the parameters to see how they affect the exploration-exploitation trade-off. You’ll notice that over time the uncertainty for the reward estimate of the optimal arm becomes very small.

Can you find out how each parameter changes the behavior? Are there settings that lead the algorithm to commit to a suboptimal arm?

In the next section I’ll present a derivation of the posterior distribution for Gaussian bandits. If you’d like to skip the math and jump directly to the (interactive) evaluation of the algorithm, click here.

## Mathematical Derivation

We start with a few definitions. We consider the $k$-armed bandit problem with the action space $\mathcal{A} = \{1, \ldots, k\}.$ We assume that the reward for each arm $a \in \mathcal{A}$ is independently normally distributed with mean $\mu_a$ and variance $\sigma_a^2.$ If we write $A_t \in \mathcal{A}$ for the arm that was chosen at time $t$ and $X_t \in \mathbb{R}$ for the reward, then the likelihood function can be written as $P(X_t \given A_t = a) = \mathcal{N}(X_t ; \mu_a, \sigma_a^2).$

We’ll assume the most general scenario where both the mean and variance of the reward is unknown. The conjugate prior for this case is the normal-inverse-gamma distribution, written as $\text{N-}\Gamma^{-1},$ such that $$ P(\mu_a, \sigma_a^2 \given m_a, \nu_a, \alpha_a, \beta_a) \sim \text{N-}\Gamma^{-1}(m_a, \nu_a, \alpha_a, \beta_a) \qquad \forall a \in \mathcal{A}, $$ where $m_a, \nu_a, \alpha_a,$ and $\beta_a$ are the hyperparameters and represent the prior mean $(m_a),$ the prior “count” $(\nu_a > 0),$ prior shape parameter $(\alpha_a > 0),$ and prior scale parameter $(\beta_a > 0).$ Occasionally we will write $P(\mu_a, \sigma_a^2 \given m_a, \nu_a, \alpha_a, \beta_a) = P(\mu_a, \sigma_a^2 \given \omega_a)$ using the shorthand notation $\omega_a = (m_a, \nu_a, \alpha_a, \beta_a)$ for all hyperparameters.

When running the experiment, we’ll choose an action $A_t$ and observe a
reward $X_t$ at each time step. Since we assume the arms are independent, we
only have to do the analysis for a single arm $a \in \mathcal{A}.$ We’ll write
$\mathcal{D}_{t,a} = \{ X_{\tau} : A_{\tau} = a, \tau = 1, \ldots, t
- 1\}$ for the set of rewards received for arm $a$ up to but excluding time
$t.$ By Bayes Theorem we
then have for each arm:
$$\begin{align*}
P( \mu_a, \sigma_a^2 \given \mathcal{D}_{t,a} ) &= \frac{
P(\mathcal{D}_{t,a} \given \mu_a, \sigma_a^2 ) P(\mu_a, \sigma_a^2 \given
\omega_a )}{P(\mathcal{D}_{t,a})} \\\

&\propto P(\mathcal{D}_{t,a} \given \mu_a, \sigma_a^2 ) P(\mu_a,
\sigma_a^2 \given \omega_a),
\end{align*}
$$
where we’ll ignore the denominator $P(\mathcal{D}_{t,a})$ because it doesn’t
depend on $\mu_a$ or $\sigma_a^2.$ The next step is therefore to derive the
functional form of the posterior $P(\mu_a, \sigma_a^2 \given
\mathcal{D}_{t,a}).$ While this is generally well known and can be quite a
tedious algebraic exercise, I’ll show the main steps below for didactic
purposes.

Let’s introduce $N_t(a) = |\mathcal{D}_{t,a}|$ for the number of times arm
$a$ was pulled before time $t,$ and denote the average reward $\bar{x}_{t,a}$
and the sum of squares $s_{t,a}$ as
$$\begin{align*}
\bar{x}_{t,a} &= \frac{1}{N_t(a)} \sum_{\tau = 1}^{t - 1} \id{A_{\tau} =
a} X_{\tau} \\\

s_{t,a} &= \sum_{\tau = 1}^{t - 1} \id{A_{\tau} = a} (X_{\tau} -
\bar{x}_{t,a})^2,
\end{align*}
$$
where $\id{Z}$ is the indicator
function that equals 1 if
condition $Z$ is true and 0 otherwise. Using the likelihood function for the
normal distribution we find
$$
\begin{align*}
P(\mathcal{D}_{t,a} \given \mu_a, \sigma_a^2) &= \left( \frac{1}{\sigma_a
\sqrt{2\pi}} \right)^{N_t(a)} \exp \left( - \frac{1}{2\sigma_a^2}
\sum_{\tau = 1}^{t - 1} \id{A_{\tau} = a} (X_{\tau} - \mu_a)^2 \right)
\\\

&= \left( \frac{1}{\sigma_a \sqrt{2\pi}} \right)^{N_t(a)} \exp \left( -
\frac{1}{2\sigma_a^2} \left[ N_t(a) (\bar{x}_{t,a} - \mu_a)^2 + s_{t,a}
\right] \right).
\end{align*}
$$

The normal-inverse-gamma prior distribution on the parameters is given by
$$
\begin{align*}
P(\mu_a, \sigma_a^2 \given \omega_a) &= P(\mu_a, \sigma_a^2 \given m_a,
\nu_a, \alpha_a, \beta_a) \\\

&= \frac{\sqrt{\nu_a}}{\sigma_a \sqrt{2\pi}}
\frac{\beta_a^{\alpha_a}}{\Gamma(\alpha_a)} \left( \frac{1}{\sigma_a^2}
\right)^{\alpha_a + 1} \exp \left( - \frac{2 \beta_a + \nu_a (\mu_a - m_a)^2
}{ 2 \sigma_a^2 } \right).
\end{align*}
$$
Using Bayes’ theorem the posterior then becomes
$$
\begin{align*}
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a}) &\propto P(\mathcal{D}_{t,a}
\given \mu_a, \sigma_a^2 ) P(\mu_a, \sigma_a^2 \given m_a, \nu_a,
\alpha_a, \beta_a) \\\

&= \left( \frac{1}{\sigma_a \sqrt{2\pi}} \right)^{N_t(a)} \exp \left( -
\frac{1}{2\sigma_a^2} \left[ N_t(a) (\bar{x}_{t,a} - \mu_a)^2 + s_{t,a}
\right] \right) \\\

&\qquad \cdot \frac{\sqrt{\nu_a}}{\sigma_a \sqrt{2\pi}}
\frac{\beta_a^{\alpha_a}}{\Gamma(\alpha_a)} \left( \frac{1}{\sigma_a^2}
\right)^{\alpha_a + 1} \exp \left( - \frac{2 \beta_a + \nu_a (\mu_a -
m_a)^2 }{ 2 \sigma_a^2 } \right).
\end{align*}
$$

The reason we’ve chosen a conjugate
prior for the normal
distribution is that the posterior is now *also* a normal-inverse-gamma
distribution. To obtain the parameters of the posterior distribution, we’ll
have to rewrite the above expression into a form where we have a normal
distribution over $\mu_a$ and an inverse gamma distribution over
$\sigma_a^2.$ That is, we want to obtain:
$$
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a}) \propto \mathcal{N}(\mu_a ;
\sigma_a^2, \omega_a) \mathcal{IG}(\sigma_a^2 ; \omega_a),
$$
where the normal distribution depends in some way on $\sigma_a^2$ and
$\omega_a,$ and the inverse gamma distribution depends only on the
hyperparameters $\omega_a.$ By rearranging the expression for the posterior
given above it can be shown that
$$
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a}) \propto \mathcal{N}(\mu_a ;
\rho_{t,a}, \zeta_a^2) \; \mathcal{IG}(\sigma_a^2 ; \tfrac{1}{2}N_t(a) +
\alpha_a, \beta_{t,a})
$$
where
$$
\begin{align*}
\rho_{t,a} &= \frac{\nu_a m_a + N_t(a) \bar{x}_{t,a}}{\nu_a + N_t(a)}
\\\

\zeta_a^2 &= \frac{\sigma_a^2}{N_t(a) + \nu_a} \\\

\beta_{t,a} &= \beta_a + \tfrac{1}{2} s_{t,a} + \frac{N_t(a) \nu_a
(\bar{x}_{t,a} - m_a)^2}{2 (N_t(a) + \nu_a)}.
\end{align*}
$$

## Algorithm

Having derived the expression for the posterior distribution, we can present the Thompson sampling algorithm for Gaussian bandits:

- For each arm $a \in \mathcal{A}$:
- Draw $\hat{\sigma}_a^2 \sim \mathcal{IG}(\tfrac{1}{2}N_t(a) + \alpha_a, \beta_{t, a}).$
- Compute $\hat{\zeta}_a^2 = \hat{\sigma}_a^2 / (N_t(a) + \nu_a).$
- Draw $\hat{\mu}_a \sim \mathcal{N}(\rho_{t,a}, \hat{\zeta}_a^2).$

- Select the arm with the highest expected reward: $$ A_t = \underset{a \in \mathcal{A}}{\operatorname{arg\,max\;}} \mathbb{E}[X_t \given \hat{\mu}_a, \hat{\sigma}_a^2] = \underset{a \in \mathcal{A}}{\operatorname{arg\,max\;}}\hat{\mu}_a. $$
- Pull arm $A_t$ and observe reward $X_t.$
- Update $N_t(a),$ $\bar{x}_{t,a},$ $\rho_{t,a},$ $s_{t,a},$ and $\beta_{t,a}$ for $a = A_t.$ Go to step 1.

For the last step, it helps to know the following incremental update equations
so we can get an efficient implementation:
$$
\begin{align*}
N_{t+1}(a) &= N_t(a) + 1 \\\\

\bar{x}_{t+1,a} &= \bar{x}_{t,a} + \frac{1}{N_t(a) + 1} (X_t -
\bar{x}_{t,a}) \\\\

s_{t+1,a} &= s_{t,a} + X_t^2 + N_t(a) \bar{x}_{t,a}^2 - N_{t+1}(a)
\bar{x}_{t+1,a}^2.
\end{align*}
$$

## Evaluation

To evaluate the convergence properties of bandit algorithms, it is common to
look at the *regret*, which is defined as
$$
R_t = t \mu^* - \mathbb{E}\left[\sum_{\tau=1}^t X_{\tau}\right],
$$
where $\mu^*$ is the maximum reward possible in the experiment: $\mu^* =
\max_{a \in \mathcal{A}} \mu_a.$ The regret measures how much the expected
total collected rewards differ from the maximum achievable reward for the
environment (if we would always pull the best arm). In the experiments below,
we’ll instead use what’s known as the *pseudo-regret*,
$$
\bar{R}_t = t \mu^* - \sum_{\tau = 1}^{t} \mu_{A_{\tau}},
$$
as it removes the expectation and the effect of noise, and is somewhat easier
to interpret.

We’ll take a look at the behavior of Thompson Sampling for different hyperparameters configurations below. I’ll also introduce the Upper Confidence Bound (UCB) algorithm, another commonly used bandit algorithm, and see how the two algorithms compare.

If you’d like to play with the evaluation yourself, you can use the next applet. The applet also supports the $\varepsilon$-greedy algorithm, which chooses a random arm with probability $\varepsilon$ and otherwise selects the one with the highest average reward so far.

Can you find out what algorithm performs best and with which parameter settings? Are there differences between short-term and long-term performance?

### Hyperparameter Settings

To explore the behavior of Thompson Sampling for different hyperparameter configurations, we run a simulation and show how the regret evolves over time. In these experiments the true mean reward for the arm follows $\mu_a \sim \mathcal{N}(0, 1)$ and the standard deviation follows $\sigma_a \sim U(0.5, 4.0),$ with $U(a, b)$ the uniform distribution on $[a, b).$ The next figure shows the result of this experiment for several distinct hyperparameter configurations, averaged over 1000 independent repetitions.

There are a few interesting behaviors that we can see in the graph. Perhaps the most notable is the regret curve that goes straight up immediately, which corresponds to a hyperparameter setting of $m_a = 0,$ $\nu_a = 1,$ $\alpha_a = 100$ and $\beta_a = 1.$ These parameters approach the limit case where $\alpha_a \rightarrow \infty$ and $\beta_a \rightarrow 0,$ which results in the mean and variance of the inverse gamma distribution both going to zero. This in turn means that the variance of the normal distribution for $\hat{\mu}_a$ will go to zero. As a consequence, the algorithm will do very little exploration and is much more likely to pick a suboptimal arm, resulting in the steep regret curve.

We can also see the effect of overestimating the true mean in the curve with $m_a = 10,$ which leads to a steep initial regret curve before the algorithm adapts. The other curves show the effect of a low value of $\alpha_a,$ over or underestimating the pseudocount $\nu_a,$ and the effect of using high values for both $\alpha_a$ and $\beta_a$ (zoom in on the graph for a closer look). We see that generally using $\alpha_a = \beta_a = 1$ works quite well.

### Upper Confidence Bound

The Upper Confidence Bound (UCB) algorithm is derived from a property of a certain class of probability distributions called sub-Gaussian distributions. At each iteration, the UCB algorithm selects the arm $A_t$ such that $$ A_t = \underset{a \in \mathcal{A}}{\operatorname{arg max}} \; \bar{x}_{t,a} + c\sqrt{\frac{\ln(t)}{N_t(a)}}. $$ For some hyperparameter $c > 0$ that controls the exploration vs. exploitation tradeoff.

We’ll skip the derivation in the main text, but you can expand this section to show the details and learn more about UCB.

In the next graph we compare the UCB algorithm with different values of $c$ against Thompson Sampling.

As the graph shows, the performance of UCB is quite sensitive to the particular value of $c,$ with $c = 2$ performing best. The best parameters for Thompson sampling in this experiment correspond to $m_a = 0,$ $\nu_a = 1,$ $\alpha_a = \beta_a = 1.$ While UCB with $c = 2$ has lower regret in the beginning it ends up with a steeper slope than the best Thompson sampling configuration, so we see that Thompson sampling will do better in the long run. You can play around with other configurations using the applet above.

## References & Further Reading

If this post has made you curious about bandit algorithms and you’re interested in learning more, a good resource is banditalgs.com, which is maintained by Tor Lattimore and Csaba Szepesvári. An extensive reference book by the same authors is available here.

Below are some additional references on Thompson Sampling you may find interesting:

- Chapelle & Li – An Empirical Evaluation of Thompson Sampling (2011). One of the papers that spurred recent interest in Thompson Sampling.
- Russo et al. – A Tutorial on Thompson Sampling (2017). Extensive tutorial on Thompson Sampling with many nice practical examples.
- Honda & Takemura – Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors (2014). Thorough analysis of Thompson Sampling for Gaussian bandits, which we also studied above, but using a slightly different prior specification (essentially using $\beta_a = 0).$
- Mazumdar et al. – On Thompson Sampling with Langevin Algorithms (2020). As you can see from the applet at the top of this page, eventually it becomes clear that some arms are suboptimal, so why keep sampling from them at every iteration? This paper presents approximations to the algorithm, while maintaining its theoretical properties.
- Weng – The Multi-Armed Bandit Problem and Its Solutions (2018). Great blog post on the multi-armed bandit problem, with a focus on Bernoulli bandits.

The multi-armed bandit problem is easy to state and to understand, but there is a rich collection of algorithms and variations of the problem, and a wealth of interesting mathematical analysis that goes into understanding them fully, so there’s a lot more to learn!

Thank you for reading!