The multi-armed bandit problem is the problem of sequential decision making under uncertainty. Given a number of alternatives, at every time step you choose one of the options and receive a noisy reward, and the goal is to maximize the total reward over time. This problem has many applications, including in reinforcement learning, portfolio allocation, clinical trials, and more. Inherently, the multi-armed bandit problem is about the exploration-exploitation trade-off: how do you choose between sticking with the option you believe is best and exploring other options?

Thompson Sampling is a Bayesian approach to the multi-armed bandit problem: at every time step you sample from the posterior distribution of the reward for each arm, and choose the arm whose sampled value is highest. This is also known as probability matching, since each arm is selected with exactly the posterior probability that it is the optimal arm.

In this article we will do a deep dive into Thompson Sampling for the multi-armed bandit problem with normally-distributed rewards. We’ll derive the necessary mathematics and evaluate the algorithm by comparing to a well-known alternative called the Upper Confidence Bound (UCB) algorithm. All code can be found on GitHub.

But first, here is an applet that illustrates the problem. You can play with the parameters to see how they affect the exploration-exploitation trade-off. You’ll notice that over time the uncertainty for the reward estimate of the optimal arm becomes very small.

Can you find out how each parameter changes the behavior? Are there settings that lead the algorithm to commit to a suboptimal arm?

In the next section I’ll present a derivation of the posterior distribution for Gaussian bandits. If you’d like to skip the math and jump directly to the (interactive) evaluation of the algorithm, click here.

## Mathematical Derivation

We start with a few definitions. We consider the $k$-armed bandit problem with the action space $\mathcal{A} = \{1, \ldots, k\}.$ We assume that the reward for each arm $a \in \mathcal{A}$ is independently normally distributed with mean $\mu_a$ and variance $\sigma_a^2.$ If we write $A_t \in \mathcal{A}$ for the arm that was chosen at time $t$ and $X_t \in \mathbb{R}$ for the reward, then the likelihood function can be written as $P(X_t \given A_t = a) = \mathcal{N}(X_t ; \mu_a, \sigma_a^2).$

We’ll assume the most general scenario, where both the mean and the variance of the reward are unknown. The conjugate prior for this case is the normal-inverse-gamma distribution, written as $\text{N-}\Gamma^{-1},$ such that $$ (\mu_a, \sigma_a^2) \given m_a, \nu_a, \alpha_a, \beta_a \sim \text{N-}\Gamma^{-1}(m_a, \nu_a, \alpha_a, \beta_a) \qquad \forall a \in \mathcal{A}, $$ where $m_a, \nu_a, \alpha_a,$ and $\beta_a$ are the hyperparameters: the prior mean $(m_a),$ the prior “count” $(\nu_a > 0),$ the prior shape parameter $(\alpha_a > 0),$ and the prior scale parameter $(\beta_a > 0).$ Occasionally we will write $$ P(\mu_a, \sigma_a^2 \given m_a, \nu_a, \alpha_a, \beta_a) = P(\mu_a, \sigma_a^2 \given \omega_a), $$ using the shorthand $\omega_a = (m_a, \nu_a, \alpha_a, \beta_a)$ for all hyperparameters.

When running the experiment, we’ll choose an action $A_t$ and observe a
reward $X_t$ at each time step. Since we assume the arms are independent, we
only have to do the analysis for a single arm $a \in \mathcal{A}.$ We’ll use
$$
\mathcal{D}_{t,a} = \{ X_{\tau} : A_{\tau} = a, \tau = 1, \ldots, t
- 1\},
$$
for the set of rewards received for arm $a$ up to but excluding time $t.$ By
Bayes’ theorem we then have, for each arm,
$$
\begin{align*}
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a}) &= \frac{P(\mathcal{D}_{t,a} \given \mu_a, \sigma_a^2) P(\mu_a, \sigma_a^2 \given \omega_a)}{P(\mathcal{D}_{t,a})} \\
&\propto P(\mathcal{D}_{t,a} \given \mu_a, \sigma_a^2) P(\mu_a, \sigma_a^2 \given \omega_a),
\end{align*}
$$
where we’ll ignore the denominator $P(\mathcal{D}_{t,a})$ because it doesn’t
depend on $\mu_a$ or $\sigma_a^2.$ The next step is therefore to derive the
functional form of the posterior $P(\mu_a, \sigma_a^2 \given
\mathcal{D}_{t,a}).$ While this is generally well known and can be quite a
tedious algebraic exercise, I’ll show the main steps below for didactic
purposes.

Let’s introduce $N_t(a) = |\mathcal{D}_{t,a}|$ for the number of times arm
$a$ was pulled before time $t,$ and denote the average reward $\bar{x}_{t,a}$
and the sum of squares $s_{t,a}$ as
$$
\begin{align*}
\bar{x}_{t,a} &= \frac{1}{N_t(a)} \sum_{\tau = 1}^{t - 1} \id{A_{\tau} = a} X_{\tau} \\
s_{t,a} &= \sum_{\tau = 1}^{t - 1} \id{A_{\tau} = a} (X_{\tau} - \bar{x}_{t,a})^2,
\end{align*}
$$
where $\id{Z}$ is the indicator
function that equals 1 if
condition $Z$ is true and 0 otherwise. Using the likelihood function for the
normal distribution we find
$$
\begin{align*}
P(\mathcal{D}_{t,a} \given \mu_a, \sigma_a^2) &= \left( \frac{1}{\sigma_a \sqrt{2\pi}} \right)^{N_t(a)} \exp \left( - \frac{1}{2\sigma_a^2} \sum_{\tau = 1}^{t - 1} \id{A_{\tau} = a} (X_{\tau} - \mu_a)^2 \right) \\
&= \left( \frac{1}{\sigma_a \sqrt{2\pi}} \right)^{N_t(a)} \exp \left( - \frac{1}{2\sigma_a^2} \left[ N_t(a) (\bar{x}_{t,a} - \mu_a)^2 + s_{t,a} \right] \right).
\end{align*}
$$

The normal-inverse-gamma prior distribution on the parameters is given by
$$
\begin{align*}
P(\mu_a, \sigma_a^2 \given \omega_a) &= P(\mu_a, \sigma_a^2 \given m_a, \nu_a, \alpha_a, \beta_a) \\
&= \frac{\sqrt{\nu_a}}{\sigma_a \sqrt{2\pi}} \frac{\beta_a^{\alpha_a}}{\Gamma(\alpha_a)} \left( \frac{1}{\sigma_a^2} \right)^{\alpha_a + 1} \exp \left( - \frac{2 \beta_a + \nu_a (\mu_a - m_a)^2}{2 \sigma_a^2} \right).
\end{align*}
$$
Using Bayes’ theorem the posterior then becomes
$$
\begin{align*}
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a}) &\propto P(\mathcal{D}_{t,a} \given \mu_a, \sigma_a^2) P(\mu_a, \sigma_a^2 \given m_a, \nu_a, \alpha_a, \beta_a) \\
&= \left( \frac{1}{\sigma_a \sqrt{2\pi}} \right)^{N_t(a)} \exp \left( - \frac{1}{2\sigma_a^2} \left[ N_t(a) (\bar{x}_{t,a} - \mu_a)^2 + s_{t,a} \right] \right) \\
&\qquad \cdot \frac{\sqrt{\nu_a}}{\sigma_a \sqrt{2\pi}} \frac{\beta_a^{\alpha_a}}{\Gamma(\alpha_a)} \left( \frac{1}{\sigma_a^2} \right)^{\alpha_a + 1} \exp \left( - \frac{2 \beta_a + \nu_a (\mu_a - m_a)^2}{2 \sigma_a^2} \right).
\end{align*}
$$

The reason we’ve chosen a conjugate
prior for the normal
distribution is that the posterior is now *also* a normal-inverse-gamma
distribution. To obtain the parameters of the posterior distribution, we’ll
have to rewrite the above expression into a form where we have a normal
distribution over $\mu_a$ and an inverse gamma distribution over
$\sigma_a^2.$ That is, we want to obtain:
$$
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a}) \propto \mathcal{N}(\mu_a ;
\sigma_a^2, \omega_a) \mathcal{IG}(\sigma_a^2 ; \omega_a),
$$
where the normal distribution depends in some way on $\sigma_a^2$ and
$\omega_a,$ and the inverse gamma distribution depends only on the
hyperparameters $\omega_a.$ By rearranging the expression for the posterior
given above it can be shown that
$$
P(\mu_a, \sigma_a^2 \given \mathcal{D}_{t,a}) \propto \mathcal{N}(\mu_a ;
\rho_{t,a}, \zeta_a^2) \; \mathcal{IG}(\sigma_a^2 ; \tfrac{1}{2}N_t(a) +
\alpha_a, \beta_{t,a})
$$
where
$$
\begin{align*}
\rho_{t,a} &= \frac{\nu_a m_a + N_t(a) \bar{x}_{t,a}}{\nu_a + N_t(a)} \\
\zeta_a^2 &= \frac{\sigma_a^2}{N_t(a) + \nu_a} \\
\beta_{t,a} &= \beta_a + \tfrac{1}{2} s_{t,a} + \frac{N_t(a) \nu_a (\bar{x}_{t,a} - m_a)^2}{2 (N_t(a) + \nu_a)}.
\end{align*}
$$
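To make the update concrete, here is a minimal Python sketch of the posterior hyperparameter computation; the function name and argument order are my own and need not match the accompanying GitHub code:

```python
def posterior_params(m, nu, alpha, beta, n, xbar, s):
    """Posterior normal-inverse-gamma hyperparameters for a single arm.

    m, nu, alpha, beta: prior hyperparameters (m_a, nu_a, alpha_a, beta_a).
    n: pull count N_t(a); xbar: average reward; s: sum of squares s_{t,a}.
    Returns (rho_{t,a}, nu_a + N_t(a), alpha_a + N_t(a)/2, beta_{t,a}).
    """
    rho = (nu * m + n * xbar) / (nu + n)
    beta_post = beta + 0.5 * s + (n * nu * (xbar - m) ** 2) / (2 * (n + nu))
    return rho, nu + n, alpha + 0.5 * n, beta_post
```

Note that with no data ($N_t(a) = 0$) the posterior reduces to the prior, as it should.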

## Algorithm

Having derived the expression for the posterior distribution, we can present the Thompson sampling algorithm for Gaussian bandits:

1. For each arm $a \in \mathcal{A}$:
   - Draw $\hat{\sigma}_a^2 \sim \mathcal{IG}(\tfrac{1}{2}N_t(a) + \alpha_a, \beta_{t, a}).$
   - Compute $\hat{\zeta}_a^2 = \hat{\sigma}_a^2 / (N_t(a) + \nu_a).$
   - Draw $\hat{\mu}_a \sim \mathcal{N}(\rho_{t,a}, \hat{\zeta}_a^2).$
2. Select the arm with the highest expected reward: $$ A_t = \underset{a \in \mathcal{A}}{\operatorname{arg\,max\;}} \mathbb{E}[X_t \given \hat{\mu}_a, \hat{\sigma}_a^2] = \underset{a \in \mathcal{A}}{\operatorname{arg\,max\;}}\hat{\mu}_a. $$
3. Pull arm $A_t$ and observe reward $X_t.$
4. Update $N_t(a),$ $\bar{x}_{t,a},$ $\rho_{t,a},$ $s_{t,a},$ and $\beta_{t,a}$ for $a = A_t.$ Go to step 1.
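The sampling loop and arm selection can be sketched with only the Python standard library; `thompson_select` and the tuple layout are hypothetical names of mine. Each tuple holds the posterior hyperparameters $(\rho_{t,a},\, \nu_a + N_t(a),\, \tfrac{1}{2}N_t(a) + \alpha_a,\, \beta_{t,a})$:

```python
import math
import random

def thompson_select(arms, rng=random):
    """One Thompson-sampling step over a list of posterior hyperparameter
    tuples (rho, nu_post, alpha_post, beta_post); returns the chosen arm."""
    best_arm, best_mu = 0, -math.inf
    for i, (rho, nu_post, alpha_post, beta_post) in enumerate(arms):
        # sigma^2 ~ IG(alpha', beta'): draw Gamma(alpha', scale=1/beta'), invert.
        sigma2_hat = 1.0 / rng.gammavariate(alpha_post, 1.0 / beta_post)
        # mu ~ N(rho, sigma^2 / nu'), since nu' = nu_a + N_t(a).
        mu_hat = rng.gauss(rho, math.sqrt(sigma2_hat / nu_post))
        if mu_hat > best_mu:
            best_arm, best_mu = i, mu_hat
    return best_arm
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.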

For the last step, it helps to know the following incremental update equations
so we can get an efficient implementation:
$$
\begin{align*}
N_{t+1}(a) &= N_t(a) + 1 \\
\bar{x}_{t+1,a} &= \bar{x}_{t,a} + \frac{1}{N_t(a) + 1} (X_t - \bar{x}_{t,a}) \\
s_{t+1,a} &= s_{t,a} + X_t^2 + N_t(a) \bar{x}_{t,a}^2 - N_{t+1}(a) \bar{x}_{t+1,a}^2.
\end{align*}
$$
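These updates translate directly into code; a small sketch with names of my choosing:

```python
def update_stats(n, xbar, s, x):
    """Incrementally update the pull count, running mean, and sum of
    squares of one arm after observing reward x."""
    n_new = n + 1
    xbar_new = xbar + (x - xbar) / n_new
    s_new = s + x * x + n * xbar * xbar - n_new * xbar_new * xbar_new
    return n_new, xbar_new, s_new
```

Starting from $(0, 0, 0)$ and feeding in rewards one at a time reproduces $N_t(a),$ $\bar{x}_{t,a},$ and $s_{t,a}$ without storing the full reward history.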

## Evaluation

To evaluate the convergence properties of bandit algorithms, it is common to
look at the *regret*, which is defined as
$$
R_t = t \mu^* - \mathbb{E}\left[\sum_{\tau=1}^t X_{\tau}\right],
$$
where $\mu^* = \max_{a \in \mathcal{A}} \mu_a$ is the largest mean reward in
the environment. The regret measures how much the expected total collected
reward falls short of what we would have earned by always pulling the best
arm. In the experiments below,
we’ll instead use what’s known as the *pseudo-regret*,
$$
\bar{R}_t = t \mu^* - \sum_{\tau = 1}^{t} \mu_{A_{\tau}},
$$
as it removes the expectation and the effect of noise, and is somewhat easier
to interpret.
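Computing the pseudo-regret from a trace of pulled arms is a one-liner; a sketch with hypothetical names:

```python
def pseudo_regret(mus, actions):
    """Pseudo-regret after t = len(actions) steps: t * mu_star minus the
    sum of the true means of the arms that were actually pulled."""
    mu_star = max(mus)
    return len(actions) * mu_star - sum(mus[a] for a in actions)
```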

We’ll take a look at the behavior of Thompson Sampling for different hyperparameter configurations below. I’ll also introduce the Upper Confidence Bound (UCB) algorithm, another commonly used bandit algorithm, and see how the two compare.

If you’d like to play with the evaluation yourself, you can use the next applet. The applet also supports the $\varepsilon$-greedy algorithm, which chooses a random arm with probability $\varepsilon$ and otherwise selects the one with the highest average reward so far.
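For reference, the $\varepsilon$-greedy rule described above fits in a few lines (the function name is mine):

```python
import random

def epsilon_greedy(xbars, eps, rng=random):
    """Pick a uniformly random arm with probability eps; otherwise
    exploit the arm with the highest average reward so far."""
    if rng.random() < eps:
        return rng.randrange(len(xbars))
    return max(range(len(xbars)), key=lambda a: xbars[a])
```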

Can you find out what algorithm performs best and with which parameter settings? Are there differences between short-term and long-term performance?

### Hyperparameter Settings

To explore the behavior of Thompson Sampling for different hyperparameter configurations, we run a simulation and show how the regret evolves over time. In these experiments the true mean reward for each arm is drawn as $\mu_a \sim \mathcal{N}(0, 1)$ and the standard deviation as $\sigma_a \sim U(0.5, 4.0),$ with $U(a, b)$ the uniform distribution on $[a, b).$ The next figure shows the result of this experiment for several distinct hyperparameter configurations, averaged over 1000 independent repetitions.

There are a few interesting behaviors that we can see in the graph. Perhaps the most notable is the regret curve that goes straight up immediately, which corresponds to a hyperparameter setting of $m_a = 0,$ $\nu_a = 1,$ $\alpha_a = 100$ and $\beta_a = 1.$ These parameters approach the limit case where $\alpha_a \rightarrow \infty$ and $\beta_a \rightarrow 0,$ which results in the mean and variance of the inverse gamma distribution both going to zero. This in turn means that the variance of the normal distribution for $\hat{\mu}_a$ will go to zero. As a consequence, the algorithm will do very little exploration and is much more likely to pick a suboptimal arm, resulting in the steep regret curve.

We can also see the effect of overestimating the true mean in the curve with $m_a = 10,$ which leads to a steep initial regret curve before the algorithm adapts. The other curves show the effect of a low value of $\alpha_a,$ of over- or underestimating the pseudocount $\nu_a,$ and of using high values for both $\alpha_a$ and $\beta_a$ (zoom in on the graph for a closer look). We see that $\alpha_a = \beta_a = 1$ generally works quite well.

### Upper Confidence Bound

The Upper Confidence Bound (UCB) algorithm is derived from a concentration property of a class of probability distributions called sub-Gaussian distributions. At each iteration, the UCB algorithm selects the arm $A_t$ such that $$ A_t = \underset{a \in \mathcal{A}}{\operatorname{arg\,max\;}} \bar{x}_{t,a} + c\sqrt{\frac{\ln(t)}{N_t(a)}}, $$ for some hyperparameter $c > 0$ that controls the exploration-exploitation trade-off.
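A sketch of this selection rule (names are mine; untried arms are played first, a common convention since their bonus is effectively infinite):

```python
import math

def ucb_select(t, counts, xbars, c):
    """UCB selection at time t: maximize xbar + c * sqrt(ln(t) / N_t(a)).
    counts[a] = N_t(a); xbars[a] = average reward of arm a so far."""
    for a, n in enumerate(counts):
        if n == 0:  # play every arm once before using the bound
            return a
    return max(range(len(counts)),
               key=lambda a: xbars[a] + c * math.sqrt(math.log(t) / counts[a]))
```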

We’ll skip the derivation in the main text, but you can expand this section to show the details and learn more about UCB.

In the next graph we compare the UCB algorithm with different values of $c$ against Thompson Sampling.

As the graph shows, the performance of UCB is quite sensitive to the particular value of $c,$ with $c = 2$ performing best. The best parameters for Thompson sampling in this experiment correspond to $m_a = 0,$ $\nu_a = 1,$ $\alpha_a = \beta_a = 1.$ While UCB with $c = 2$ has lower regret in the beginning, it ends up with a steeper slope than the best Thompson sampling configuration, so Thompson sampling will do better in the long run. You can play around with other configurations using the applet above.

## References & Further Reading

If this post has made you curious about bandit algorithms and you’re interested in learning more, a good resource is banditalgs.com, which is maintained by Tor Lattimore and Csaba Szepesvári. An extensive reference book by the same authors is available here.

Below are some additional references on Thompson Sampling you may find interesting:

- Chapelle & Li – An Empirical Evaluation of Thompson Sampling (2011). One of the papers that spurred recent interest in Thompson Sampling.
- Russo et al. – A Tutorial on Thompson Sampling (2017). Extensive tutorial on Thompson Sampling with many nice practical examples.
- Honda & Takemura – Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors (2014). Thorough analysis of Thompson Sampling for Gaussian bandits, which we also studied above, but using a slightly different prior specification (essentially using $\beta_a = 0).$
- Mazumdar et al. – On Thompson Sampling with Langevin Algorithms (2020). As you can see from the applet at the top of this page, eventually it becomes clear that some arms are suboptimal, so why keep sampling from them at every iteration? This paper presents approximations to the algorithm, while maintaining its theoretical properties.
- Weng – The Multi-Armed Bandit Problem and Its Solutions (2018). Great blog post on the multi-armed bandit problem, with a focus on Bernoulli bandits.

The multi-armed bandit problem is easy to state and to understand, but there is a rich collection of algorithms and variations of the problem, and a wealth of interesting mathematical analysis that goes into understanding them fully, so there’s a lot more to learn!

Thank you for reading!