NeuralTS and NeuralUCB
The NeuralUCBBandit and the NeuralTSBandit share the same interface. Both use a neural network to learn the reward function for given contextualized actions. To estimate the uncertainty, the gradients of the estimated reward of the chosen action with respect to the network parameters are used. These gradients build a precision matrix, which is then used to compute the UCB or to perform Thompson sampling.
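Both classes can therefore be constructed and used interchangeably. A minimal construction sketch, assuming the classes are importable from `calvera.bandits` (the actual import path may differ; they are defined in `src/calvera/bandits/neural_bandit.py`) and using only the constructor arguments documented on this page:

```python
import torch.nn as nn

# Hypothetical import path; both classes are defined in
# src/calvera/bandits/neural_bandit.py.
from calvera.bandits import NeuralUCBBandit, NeuralTSBandit

n_features = 16

# A small MLP mapping a contextualized action to a scalar reward estimate.
network = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

# Both bandits accept the same constructor arguments
# (NeuralTSBandit additionally accepts `num_samples_per_arm`).
ucb_bandit = NeuralUCBBandit(n_features=n_features, network=network, exploration_rate=1.0)
ts_bandit = NeuralTSBandit(n_features=n_features, network=network, num_samples_per_arm=1)
```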
NeuralUCBBandit(n_features, network, buffer=None, selector=None, exploration_rate=1.0, train_batch_size=32, learning_rate=0.001, weight_decay=1.0, learning_rate_decay=1.0, learning_rate_scheduler_step_size=1, early_stop_threshold=0.001, min_samples_required_for_training=1024, initial_train_steps=1024, warm_start=True)
Bases: NeuralBandit
NeuralUCB bandit implementation as a PyTorch Lightning module.
Implements the NeuralUCB algorithm, which uses a neural network for function approximation and a diagonal approximation of the precision matrix for exploration. This implementation supports both standard and combinatorial bandit settings.
Implementation details
Standard setting:

- UCB: \(u_{t,a} = f(x_{t,a}; \theta_{t-1}) + \sqrt{\lambda \nu \cdot g(x_{t,a}; \theta_{t-1})^T Z_{t-1}^{-1} g(x_{t,a}; \theta_{t-1})}\)
- Update: \(Z_t = Z_{t-1} + g(x_{t,a_t}; \theta_{t-1})g(x_{t,a_t}; \theta_{t-1})^T\)

Combinatorial setting:

- Same UCB formula for each arm
- Select super arm: \(S_t = \mathcal{O}_S(u_t)\)
- Update includes gradients from all chosen arms: \(Z_t = Z_{t-1} + \sum_{a \in S_t} g(x_{t,a}; \theta_{t-1})g(x_{t,a}; \theta_{t-1})^T\)
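The following is a minimal sketch, not part of the calvera API, of how the UCB score and the precision-matrix update look under the diagonal approximation; `f_hat`, `g`, and `z_diag` are assumed to be the scalar reward estimate, the flattened parameter gradient, and the diagonal of \(Z\), respectively:

```python
import torch

def ucb_score(f_hat: torch.Tensor, g: torch.Tensor, z_diag: torch.Tensor,
              lam: float = 1.0, nu: float = 1.0) -> torch.Tensor:
    """UCB of a single arm under the diagonal approximation of Z.

    f_hat:  scalar reward estimate f(x; theta).
    g:      flattened gradient of f_hat w.r.t. the network parameters.
    z_diag: diagonal of the precision matrix Z (one entry per parameter).
    """
    # g^T Z^{-1} g reduces to sum(g^2 / diag(Z)) for a diagonal Z.
    return f_hat + torch.sqrt(lam * nu * torch.sum(g * g / z_diag))

def update_precision(z_diag: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Rank-1 update Z_t = Z_{t-1} + g g^T, keeping only the diagonal."""
    return z_diag + g * g
```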
References
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_features` | `int` | Number of input features. Must be greater than 0. | *required* |
| `network` | `Module` | Neural network module for function approximation. | *required* |
| `buffer` | `AbstractBanditDataBuffer[Tensor, Any] \| None` | Buffer for storing bandit interaction data. See the superclass for further information. | `None` |
| `selector` | `AbstractSelector \| None` | The selector used to choose the best action. Defaults to `ArgMaxSelector` (if None). | `None` |
| `exploration_rate` | `float` | Exploration parameter for UCB. Called \(\nu\) in the original paper. Must be greater than 0. | `1.0` |
| `train_batch_size` | `int` | Size of mini-batches for training. Must be greater than 0. | `32` |
| `learning_rate` | `float` | The learning rate for the optimizer of the neural network. Passed to the optimizer. | `0.001` |
| `weight_decay` | `float` | The regularization parameter for the neural network. Passed to the optimizer. | `1.0` |
| `learning_rate_decay` | `float` | Multiplicative factor for learning rate decay. Passed to the learning rate scheduler. | `1.0` |
| `learning_rate_scheduler_step_size` | `int` | The step size for the learning rate decay. Passed to the learning rate scheduler. | `1` |
| `early_stop_threshold` | `float \| None` | Loss threshold for early stopping. None to disable. Must be greater than or equal to 0. | `0.001` |
| `min_samples_required_for_training` | `int` | Minimum number of samples that must have been added since the last training before the network is trained again. | `1024` |
| `initial_train_steps` | `int` | For the first `initial_train_steps` samples, the network is trained on every update. | `1024` |
| `warm_start` | `bool` | If `True`, training continues from the current network parameters; if `False`, the network is re-initialized before each training. | `True` |
Source code in src/calvera/bandits/neural_bandit.py
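For the combinatorial setting, a similar illustrative sketch (again not part of the calvera API) shows how a super arm could be selected from per-arm UCB scores and how the precision matrix accumulates the gradients of all chosen arms; using the top-k arms as the oracle \(\mathcal{O}_S\) is purely an assumption for this example:

```python
import torch

def combinatorial_step(f_hat: torch.Tensor, grads: torch.Tensor,
                       z_diag: torch.Tensor, k: int = 3,
                       lam: float = 1.0, nu: float = 1.0):
    """Select a size-k super arm by UCB and update the diagonal precision.

    f_hat:  (n_arms,) reward estimates.
    grads:  (n_arms, n_params) flattened gradients, one row per arm.
    z_diag: (n_params,) diagonal of the precision matrix Z.
    """
    # Per-arm UCB under the diagonal approximation.
    scores = f_hat + torch.sqrt(lam * nu * torch.sum(grads * grads / z_diag, dim=1))
    # Oracle O_S: here simply the k arms with the highest UCB score (an assumption).
    super_arm = torch.topk(scores, k).indices
    # Z_t = Z_{t-1} + sum over chosen arms of g g^T (diagonal only).
    z_diag = z_diag + torch.sum(grads[super_arm] * grads[super_arm], dim=0)
    return super_arm, z_diag
```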
NeuralTSBandit(n_features, network, buffer=None, selector=None, exploration_rate=1.0, train_batch_size=32, learning_rate=0.001, weight_decay=1.0, learning_rate_decay=1.0, learning_rate_scheduler_step_size=1, early_stop_threshold=0.001, min_samples_required_for_training=64, initial_train_steps=1024, num_samples_per_arm=1, warm_start=True)
Bases: NeuralBandit
Neural Thompson Sampling (TS) bandit implementation as a PyTorch Lightning module.
Implements the NeuralTS algorithm using a neural network for function approximation with a diagonal approximation of the precision matrix. The module maintains a history of contexts and rewards, and periodically updates the network parameters via gradient descent. This implementation supports both standard and combinatorial bandit settings.
Implementation details
Standard setting:

- \(\sigma_{t,a} = \sqrt{\lambda \nu \cdot g(x_{t,a}; \theta_{t-1})^T Z_{t-1}^{-1} g(x_{t,a}; \theta_{t-1})}\)
- Sample rewards: \(\tilde{v}_{t,a} \sim \mathcal{N}(f(x_{t,a}; \theta_{t-1}), \sigma^2_{t,a})\)
- Update: \(Z_t = Z_{t-1} + g(x_{t,a_t}; \theta_{t-1})g(x_{t,a_t}; \theta_{t-1})^T\)

Combinatorial setting:

- Same variance and sampling formulas for each arm
- Select super arm: \(S_t = \mathcal{O}_S(\tilde{v}_t)\)
- Update includes gradients from all chosen arms: \(Z_t = Z_{t-1} + \sum_{a \in S_t} g(x_{t,a}; \theta_{t-1})g(x_{t,a}; \theta_{t-1})^T\)
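An illustrative sketch of the sampling step (not part of the calvera API); `f_hat`, `grads`, and `z_diag` are assumed per-arm reward estimates, flattened per-arm gradients, and the diagonal of \(Z\). Taking the maximum over `num_samples_per_arm` draws per arm is an assumption about how multiple samples are combined:

```python
import torch

def neural_ts_scores(f_hat: torch.Tensor, grads: torch.Tensor, z_diag: torch.Tensor,
                     lam: float = 1.0, nu: float = 1.0,
                     num_samples_per_arm: int = 1) -> torch.Tensor:
    """Sample rewards v_tilde ~ N(f, sigma^2) for every arm.

    f_hat:  (n_arms,) reward estimates f(x; theta).
    grads:  (n_arms, n_params) flattened gradients, one row per arm.
    z_diag: (n_params,) diagonal of the precision matrix Z.
    """
    # sigma_{t,a} = sqrt(lambda * nu * g^T Z^{-1} g) with a diagonal Z.
    sigma = torch.sqrt(lam * nu * torch.sum(grads * grads / z_diag, dim=1))
    # Draw num_samples_per_arm values per arm; keeping the largest per arm is
    # an assumed (optimistic) way of combining multiple samples.
    samples = f_hat + sigma * torch.randn(num_samples_per_arm, f_hat.shape[0])
    return samples.max(dim=0).values

# The chosen arm is argmax over the returned scores; in the combinatorial
# setting the oracle O_S is applied to the sampled scores instead.
```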
References
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_features` | `int` | Number of input features. Must be greater than 0. | *required* |
| `network` | `Module` | Neural network module for function approximation. | *required* |
| `buffer` | `AbstractBanditDataBuffer[Tensor, Any] \| None` | Buffer for storing bandit interaction data. | `None` |
| `selector` | `AbstractSelector \| None` | Action selector for the bandit. Defaults to `ArgMaxSelector` (if None). | `None` |
| `exploration_rate` | `float` | Exploration parameter scaling the sampling variance. Called \(\nu\) in the original paper. Must be greater than 0. | `1.0` |
| `train_batch_size` | `int` | Size of mini-batches for training. Must be greater than 0. | `32` |
| `learning_rate` | `float` | The learning rate for the optimizer of the neural network. Passed to the optimizer. | `0.001` |
| `weight_decay` | `float` | The regularization parameter for the neural network. Passed to the optimizer. | `1.0` |
| `learning_rate_decay` | `float` | Multiplicative factor for learning rate decay. Passed to the learning rate scheduler. | `1.0` |
| `learning_rate_scheduler_step_size` | `int` | The step size for the learning rate decay. Passed to the learning rate scheduler. | `1` |
| `early_stop_threshold` | `float \| None` | Loss threshold for early stopping. None to disable. Must be greater than or equal to 0. | `0.001` |
| `min_samples_required_for_training` | `int` | Minimum number of samples that must have been added since the last training before the network is trained again. | `64` |
| `initial_train_steps` | `int` | For the first `initial_train_steps` samples, the network is trained on every update. | `1024` |
| `num_samples_per_arm` | `int` | Number of samples to draw from each Normal distribution in Thompson Sampling. Must be greater than 0. | `1` |
| `warm_start` | `bool` | If `True`, training continues from the current network parameters; if `False`, the network is re-initialized before each training. | `True` |
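A closing sketch showing a `NeuralTSBandit` configured with the training-related hyperparameters from the table above; the import path is an assumption, the values are only illustrative, and the inline comments paraphrase the parameter descriptions:

```python
import torch.nn as nn

# Hypothetical import path (see src/calvera/bandits/neural_bandit.py).
from calvera.bandits import NeuralTSBandit

network = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

bandit = NeuralTSBandit(
    n_features=16,
    network=network,
    learning_rate=1e-3,                    # optimizer learning rate
    weight_decay=1.0,                      # regularization strength
    learning_rate_decay=0.95,              # multiplied into the learning rate ...
    learning_rate_scheduler_step_size=10,  # ... every 10 scheduler steps
    early_stop_threshold=1e-3,             # stop training once the loss falls below this
    min_samples_required_for_training=64,  # retrain only after 64 new samples
    initial_train_steps=1024,              # train on every update for the first 1024 samples
    num_samples_per_arm=1,                 # one Thompson sample per arm
    warm_start=True,                       # keep network weights between trainings
)
```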