LinearTS and LinearUCB
Linear Thompson Sampling and Linear UCB are two basic contextual bandit algorithms. The main idea is to regress the reward on the context with a linear model and to combine this estimate with an uncertainty measure that accounts for unexplored parts of the context space. The uncertainty measure is usually derived from the inverse of the covariance matrix of the chosen contexts; it is used to sample the parameters in the case of Thompson Sampling, and to compute the upper confidence bound in the case of UCB.
The key point is that both algorithms need to compute the inverse of the covariance matrix of the chosen contexts.
This computation can become expensive for high-dimensional contexts. Therefore, we also provide DiagonalPrecApprox variants, which use a diagonal approximation of the covariance matrix and are much faster to compute.
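To make the contrast concrete, here is a minimal NumPy sketch (illustrative only, not Calvera's actual API) of how the two algorithms score arms from the same linear-model state:

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 4
lambda_ = 1.0
alpha = 1.0  # exploration_rate in LinUCB

# Shared linear-bandit state: precision matrix M^{-1} and weight estimate theta.
M_inv = np.eye(n_features) * lambda_
theta = np.zeros(n_features)

contexts = rng.normal(size=(3, n_features))  # one row per candidate arm

# LinUCB: mean reward estimate plus an exploration bonus derived from M^{-1}.
bonus = np.sqrt(np.einsum("kd,de,ke->k", contexts, M_inv, contexts))
ucb_scores = contexts @ theta + alpha * bonus

# LinTS: draw one parameter sample from N(theta, M^{-1}) and score greedily.
theta_tilde = rng.multivariate_normal(theta, M_inv)
ts_scores = contexts @ theta_tilde

chosen_arm = int(np.argmax(ucb_scores))
```

The only difference between the two is where the randomness enters: UCB adds a deterministic bonus per arm, while TS randomizes the parameter vector once per round.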
LinearUCBBandit(n_features, selector=None, buffer=None, train_batch_size=32, eps=0.01, lambda_=1.0, lazy_uncertainty_update=False, clear_buffer_after_train=True, exploration_rate=1.0)
Bases: LinearBandit[Tensor]
Linear Upper Confidence Bound Bandit.
This implementation supports both standard and combinatorial bandit settings.
Implementation details
Standard setting:

- Initialize: \(M^{-1} = I \cdot \lambda\), \(b = 0\), \(\theta = 0\)
- Compute UCB scores: \(U_k(t) = x_k^T \hat{\theta}_t + \alpha \sqrt{x_k^T M^{-1} x_k}\), where \(\alpha\) is the exploration parameter
- Update:
    - \(b = b + r_t x_{a_t}\)
    - \(M^{-1} = \left( M + x_{a_t} x_{a_t}^T + \varepsilon I \right)^{-1}\)
    - \(M^{-1} = \frac{M^{-1} + \left( M^{-1} \right)^T}{2}\)
    - \(\theta_t = M^{-1} b\)
(We store \(M^{-1}\), the precision matrix, because we also allow the Sherman-Morrison update.)
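The rank-one structure of the update is what makes the Sherman-Morrison identity applicable. A minimal NumPy sketch of such an update (illustrative, not Calvera's implementation; it omits the \(\varepsilon I\) jitter term, which the exact rank-one identity does not cover):

```python
import numpy as np

def sherman_morrison_step(M_inv, b, x, r):
    """One rank-one bandit update using the Sherman-Morrison identity:
    (M + x x^T)^{-1} = M^{-1} - (M^{-1} x)(M^{-1} x)^T / (1 + x^T M^{-1} x).
    """
    Mx = M_inv @ x
    M_inv = M_inv - np.outer(Mx, Mx) / (1.0 + x @ Mx)
    # Symmetrize to counter floating-point drift, as in the formulas above.
    M_inv = (M_inv + M_inv.T) / 2.0
    b = b + r * x
    theta = M_inv @ b
    return M_inv, b, theta
```

This avoids a full \(O(d^3)\) matrix inversion per step, replacing it with \(O(d^2)\) vector-matrix work.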
Combinatorial setting:

- Uses the same initialization and UCB formula as the standard setting
- Action selection: a selector (oracle \(\mathcal{O}_S\)) chooses multiple arms as a super-action \(S_t\)
- \(M\) and \(\theta\) are updated for each chosen arm \(i \in S_t\)
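A minimal NumPy sketch of the combinatorial flow, using a hypothetical top-\(m\) function in place of Calvera's selector classes (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_arms, m = 4, 6, 2

M = np.eye(d)       # regularized Gram matrix (lambda_ = 1.0 prior)
b = np.zeros(d)

def top_m_selector(scores, m):
    # A simple stand-in for the oracle O_S: pick the m highest-scoring arms.
    return np.argsort(scores)[-m:]

contexts = rng.normal(size=(n_arms, d))
M_inv = np.linalg.inv(M)
theta = M_inv @ b
scores = contexts @ theta + np.sqrt(
    np.einsum("kd,de,ke->k", contexts, M_inv, contexts)
)

super_action = top_m_selector(scores, m)
rewards = rng.random(m)  # illustrative per-arm rewards

# Update M and b once for each chosen arm in the super-action S_t.
for i, r in zip(super_action, rewards):
    x = contexts[i]
    M += np.outer(x, x)
    b += r * x
theta = np.linalg.solve(M, b)
```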
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_features` | `int` | The number of features in the bandit model. | *required* |
| `selector` | `AbstractSelector \| None` | The selector used to choose the best action. Defaults to `ArgMaxSelector` if `None`. | `None` |
| `buffer` | `AbstractBanditDataBuffer[Tensor, Any] \| None` | The buffer used to store the data for continually updating the model. | `None` |
| `train_batch_size` | `int` | The mini-batch size used for the train loop. | `32` |
| `eps` | `float` | Small value added to the diagonal to ensure invertibility of the precision matrix. | `0.01` |
| `lambda_` | `float` | Prior precision for the precision matrix. Acts as a regularization parameter. | `1.0` |
| `lazy_uncertainty_update` | `bool` | If `True`, the precision matrix is not updated during `forward` but during the update step. | `False` |
| `clear_buffer_after_train` | `bool` | If `True`, the buffer is cleared after training. This is usually desired because the data is no longer needed once training has run. Only set this to `False` if you know what you are doing. | `True` |
| `exploration_rate` | `float` | The exploration parameter for LinUCB, denoted \(\alpha\) in the original paper. Must be greater than 0. | `1.0` |
Source code in src/calvera/bandits/linear_ucb_bandit.py
DiagonalPrecApproxLinearUCBBandit(n_features, selector=None, buffer=None, train_batch_size=32, eps=0.01, lambda_=1.0, lazy_uncertainty_update=False, clear_buffer_after_train=True, exploration_rate=1.0)
Bases: LinearUCBBandit
LinearUCB but the precision matrix is updated using a diagonal approximation.
Instead of performing a full update, the precision matrix is approximated by \(\operatorname{diag}(X X^T)^{-1}\), i.e. only the diagonal of the accumulated outer products is kept and inverted elementwise. For compatibility reasons the precision matrix is still stored as a full matrix.
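In code, this amounts to accumulating only the squared features per dimension. A minimal NumPy sketch (assuming contexts are stored as rows of `X`; names are illustrative, not Calvera's API):

```python
import numpy as np

def diagonal_precision(X, lambda_=1.0):
    # Accumulate only the diagonal of the Gram matrix (sum of squared
    # features per dimension) plus the lambda_ prior, then invert
    # elementwise instead of inverting a full d x d matrix.
    prec_diag = lambda_ + np.einsum("nd,nd->d", X, X)
    # For compatibility, return the precision matrix as a full matrix.
    return np.diag(1.0 / prec_diag)

X = np.random.default_rng(2).normal(size=(10, 5))
M_inv = diagonal_precision(X)
```

This reduces the per-update cost from \(O(d^3)\) (full inversion) to \(O(n d)\), at the price of ignoring feature correlations in the uncertainty estimate.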
Source code in src/calvera/bandits/linear_ucb_bandit.py
LinearTSBandit(n_features, selector=None, buffer=None, train_batch_size=32, eps=0.01, lambda_=1.0, lazy_uncertainty_update=False, clear_buffer_after_train=True)
Bases: LinearBandit[ActionInputType]
Linear Thompson Sampling Bandit.
This implementation supports both standard and combinatorial bandit settings.
Implementation details
Standard setting:

- Initialize: \(M^{-1} = I \cdot \lambda\), \(b = 0\), \(\theta = 0\)
- Sample: \(\tilde{\theta}_t \sim \mathcal{N}(\theta_t, M^{-1})\)
- Score: \(S_k(t) = x_k^T \tilde{\theta}_t\)
- Update:
    - \(b = b + r_t x_{a_t}\)
    - \(M^{-1} = \left( M + x_{a_t} x_{a_t}^T + \varepsilon I \right)^{-1}\)
    - \(M^{-1} = \frac{M^{-1} + \left( M^{-1} \right)^T}{2}\)
    - \(\theta_t = M^{-1} b\)
(We store \(M^{-1}\), the precision matrix, because we also allow the Sherman-Morrison update.)
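Putting the sampling and update steps together, here is a minimal NumPy simulation of Linear Thompson Sampling on a synthetic problem (the reward model, noise level, and all names are illustrative, not part of Calvera):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
M_inv = np.eye(d)        # M^{-1} = I * lambda_ with lambda_ = 1.0
b = np.zeros(d)
true_theta = np.array([1.0, -0.5, 0.2])  # unknown ground truth

for t in range(200):
    contexts = rng.normal(size=(5, d))      # 5 candidate arms per round
    theta_hat = M_inv @ b
    # Posterior sample: theta_tilde ~ N(theta_t, M^{-1}).
    theta_tilde = rng.multivariate_normal(theta_hat, M_inv)
    a = int(np.argmax(contexts @ theta_tilde))
    x = contexts[a]
    r = x @ true_theta + 0.1 * rng.normal()  # noisy linear reward
    # Rank-one Sherman-Morrison update of M^{-1}, then symmetrize.
    Mx = M_inv @ x
    M_inv = M_inv - np.outer(Mx, Mx) / (1.0 + x @ Mx)
    M_inv = (M_inv + M_inv.T) / 2.0
    b = b + r * x

theta_hat = M_inv @ b  # final ridge-style estimate of true_theta
```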
Combinatorial setting:

- Uses the same initialization and sampling as the standard setting
- Action selection: a selector (oracle \(\mathcal{O}_S\)) chooses multiple arms as a super-action \(S_t\)
- \(M\) and \(\theta\) are updated for each chosen arm \(i \in S_t\)
Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_features` | `int` | The number of features in the bandit model. | *required* |
| `selector` | `AbstractSelector \| None` | The selector used to choose the best action. Defaults to `ArgMaxSelector` if `None`. | `None` |
| `buffer` | `AbstractBanditDataBuffer[ActionInputType, Any] \| None` | The buffer used to store the data for continually updating the model. | `None` |
| `train_batch_size` | `int` | The mini-batch size used for the train loop. | `32` |
| `eps` | `float` | Small value added to the diagonal to ensure invertibility of the precision matrix. | `0.01` |
| `lambda_` | `float` | Prior precision for the precision matrix. Acts as a regularization parameter. | `1.0` |
| `lazy_uncertainty_update` | `bool` | If `True`, the precision matrix is not updated during `forward` but during the update step. | `False` |
| `clear_buffer_after_train` | `bool` | If `True`, the buffer is cleared after training. This is usually desired because the data is no longer needed once training has run. Only set this to `False` if you know what you are doing. | `True` |
Source code in src/calvera/bandits/linear_ts_bandit.py
DiagonalPrecApproxLinearTSBandit(n_features, selector=None, buffer=None, train_batch_size=32, eps=0.01, lambda_=1.0, lazy_uncertainty_update=False, clear_buffer_after_train=True)
Bases: LinearTSBandit[Tensor]
LinearTS but the precision matrix is updated using a diagonal approximation.
Instead of performing a full update, the precision matrix is approximated by \(\operatorname{diag}(X X^T)^{-1}\), i.e. only the diagonal of the accumulated outer products is kept and inverted elementwise. For compatibility reasons the precision matrix is still stored as a full matrix.