
NeuralTS and NeuralUCB

The NeuralUCBBandit and the NeuralTSBandit share the same interface. Both use a neural network to learn the reward function for given contextualized actions. To estimate the uncertainty, the gradients of the estimated reward of the chosen action with respect to the network parameters are used. These gradients build a (diagonally approximated) precision matrix, which is then used to compute an upper confidence bound (UCB) or to perform Thompson sampling.
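
A minimal construction sketch of the shared interface, assuming a small feed-forward reward network (the import path and the network architecture are illustrative assumptions, not prescribed by the library):

```python
import torch.nn as nn

from calvera.bandits import NeuralUCBBandit  # import path assumed; see the source paths listed below

n_features = 16

# Small MLP mapping a contextualized action (n_features inputs) to a scalar reward estimate.
network = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

bandit = NeuralUCBBandit(
    n_features=n_features,
    network=network,
    exploration_rate=1.0,  # nu in the original paper
    weight_decay=1.0,      # lambda in the original paper
)
```

NeuralTSBandit is constructed the same way; feedback is supplied via record_feedback, and the network is retrained according to min_samples_required_for_training and initial_train_steps (see the parameter descriptions below).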

NeuralUCBBandit(n_features, network, buffer=None, selector=None, exploration_rate=1.0, train_batch_size=32, learning_rate=0.001, weight_decay=1.0, learning_rate_decay=1.0, learning_rate_scheduler_step_size=1, early_stop_threshold=0.001, min_samples_required_for_training=1024, initial_train_steps=1024, warm_start=True)

Bases: NeuralBandit

NeuralUCB bandit implementation as a PyTorch Lightning module.

Implements the NeuralUCB algorithm using a neural network for function approximation, with a diagonal approximation of the precision matrix for exploration. This implementation supports both standard and combinatorial bandit settings.

Implementation details

Standard setting:

  • UCB: \(u_{t,a} = f(x_{t,a}; \theta_{t-1}) + \sqrt{\lambda \nu \cdot g(x_{t,a}; \theta_{t-1})^T Z_{t-1}^{-1} g(x_{t,a}; \theta_{t-1})}\)

  • Update: \(Z_t = Z_{t-1} + g(x_{t,a_t}; \theta_{t-1})g(x_{t,a_t}; \theta_{t-1})^T\)

Combinatorial setting:

  • Same UCB formula for each arm

  • Select super arm: \(S_t = \mathcal{O}_S(u_t)\)

  • Update includes gradients from all chosen arms: \(Z_t = Z_{t-1} + \sum_{a \in S_t} g(x_{t,a}; \theta_{t-1})g(x_{t,a}; \theta_{t-1})^T\) (see the sketch below)
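
Because the implementation uses a diagonal approximation, \(Z_t\) is stored as a vector (see the Z_t buffer in the source below) and the quadratic form reduces to an elementwise sum. A minimal sketch of the score and precision update under that approximation, where f_pred is the network's reward estimate and g the flattened gradient of that estimate with respect to the network parameters (helper names are illustrative, not the library's API):

```python
import torch


def ucb_score(f_pred: torch.Tensor, g: torch.Tensor, Z_diag: torch.Tensor,
              weight_decay: float, exploration_rate: float) -> torch.Tensor:
    """UCB under the diagonal approximation: f + sqrt(lambda * nu * g^T Z^{-1} g)."""
    quad_form = torch.sum(g * g / Z_diag)  # g^T Z^{-1} g with diagonal Z
    return f_pred + torch.sqrt(weight_decay * exploration_rate * quad_form)


def update_precision(Z_diag: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Diagonal update Z_t = Z_{t-1} + diag(g g^T) for one chosen arm."""
    return Z_diag + g * g
```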


Parameters:

Name Type Description Default
n_features int

Number of input features. Must be greater than 0.

required
network Module

Neural network module for function approximation.

required
buffer AbstractBanditDataBuffer[Tensor, Any] | None

Buffer for storing bandit interaction data. See superclass for further information.

None
selector AbstractSelector | None

The selector used to choose the best action. Default is ArgMaxSelector (if None).

None
exploration_rate float

Exploration parameter for UCB. Called \(\nu\) in the original paper. Must be greater than 0.

1.0
train_batch_size int

Size of mini-batches for training. Must be greater than 0.

32
learning_rate float

The learning rate for the optimizer of the neural network. Passed to lr of torch.optim.Adam. Must be greater than 0.

0.001
weight_decay float

The regularization parameter for the neural network. Passed to weight_decay of torch.optim.Adam. Called \(\lambda\) in the original paper. Must be greater than 0 because the NeuralUCB algorithm is based on this parameter.

1.0
learning_rate_decay float

Multiplicative factor for learning rate decay. Passed to gamma of torch.optim.lr_scheduler.StepLR. Default is 1.0 (i.e. no decay). Must be greater than 0.

1.0
learning_rate_scheduler_step_size int

The step size for the learning rate decay. Passed to step_size of torch.optim.lr_scheduler.StepLR. Must be greater than 0.

1
early_stop_threshold float | None

Loss threshold for early stopping. None to disable. Must be greater than or equal to 0.

0.001
min_samples_required_for_training int

If fewer samples than this value have been added via record_feedback, the network is not trained. Must be greater than 0. Default is 1024.

1024
initial_train_steps int

For the first initial_train_steps samples, the network is always trained even if fewer new samples than min_samples_required_for_training have been seen. This value is therefore only relevant if min_samples_required_for_training is set. Set to 0 to disable this feature. Must be greater than or equal to 0.

1024
warm_start bool

If False, the network parameters are reset via network.reset_parameters() every time the network is retrained, so that it is retrained from scratch. If True, the network is trained from its current state.

True
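
The interplay of min_samples_required_for_training, initial_train_steps, and warm_start can be summarized by the following sketch; it is a simplified illustration of the documented behavior, not the library's internal code:

```python
def should_train(total_samples_seen: int, new_samples_since_last_training: int,
                 min_samples_required_for_training: int = 1024,
                 initial_train_steps: int = 1024) -> bool:
    # For the first `initial_train_steps` samples the network is always trained.
    if total_samples_seen <= initial_train_steps:
        return True
    # Afterwards, training only runs once enough new feedback has been recorded.
    return new_samples_since_last_training >= min_samples_required_for_training


# With warm_start=False the network is additionally reset via network.reset_parameters()
# before each retraining, so a custom network must implement reset_parameters().
```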
Source code in src/calvera/bandits/neural_bandit.py
def __init__(
    self,
    n_features: int,
    network: nn.Module,
    buffer: AbstractBanditDataBuffer[torch.Tensor, Any] | None = None,
    selector: AbstractSelector | None = None,
    exploration_rate: float = 1.0,
    train_batch_size: int = 32,
    learning_rate: float = 1e-3,
    weight_decay: float = 1.0,
    learning_rate_decay: float = 1.0,
    learning_rate_scheduler_step_size: int = 1,
    early_stop_threshold: float | None = 1e-3,
    min_samples_required_for_training: int = 1024,
    initial_train_steps: int = 1024,
    warm_start: bool = True,
) -> None:
    r"""Initialize the NeuralUCB bandit module.

    Args:
        n_features: Number of input features. Must be greater 0.
        network: Neural network module for function approximation.
        buffer: Buffer for storing bandit interaction data. See superclass for further information.
        selector: The selector used to choose the best action. Default is `ArgMaxSelector` (if None).
        exploration_rate: Exploration parameter for UCB. Called $\nu$ in the original paper.
            Must be greater 0.
        train_batch_size: Size of mini-batches for training. Must be greater 0.
        learning_rate: The learning rate for the optimizer of the neural network.
            Passed to `lr` of `torch.optim.Adam`.
            Must be greater than 0.
        weight_decay: The regularization parameter for the neural network.
            Passed to `weight_decay` of `torch.optim.Adam`. Called $\lambda$ in the original paper.
            Must be greater than 0 because the NeuralUCB algorithm is based on this parameter.
        learning_rate_decay: Multiplicative factor for learning rate decay.
            Passed to `gamma` of `torch.optim.lr_scheduler.StepLR`.
            Default is 1.0 (i.e. no decay). Must be greater than 0.
        learning_rate_scheduler_step_size: The step size for the learning rate decay.
            Passed to `step_size` of `torch.optim.lr_scheduler.StepLR`.
            Must be greater than 0.
        early_stop_threshold: Loss threshold for early stopping. None to disable.
            Must be greater equal 0.
        min_samples_required_for_training: If less samples have been added via `record_feedback`
            than this value, the network is not trained.
            Must be greater 0. Default is 1024.
        initial_train_steps: For the first `initial_train_steps` samples, the network is always trained even if
            less new data than `min_samples_required_for_training` has been seen. Therefore, this value is only
            required if `min_samples_required_for_training` is set. Set to 0 to disable this feature.
            Must be greater equal 0.
        warm_start: If `False` the parameters of the network are reset in order to be retrained from scratch using
            `network.reset_parameters()` everytime a retraining of the network occurs. If `True` the network is
            trained from the current state.
    """
    assert weight_decay >= 0, "Regularization parameter must be greater equal 0."
    assert exploration_rate > 0, "Exploration rate must be greater than 0."
    assert learning_rate > 0, "Learning rate must be greater than 0."
    assert learning_rate_decay >= 0, "The learning rate decay must be greater equal 0."
    assert learning_rate_scheduler_step_size > 0, "Learning rate scheduler step size must be greater than 0."
    assert (
        min_samples_required_for_training is not None and min_samples_required_for_training > 0
    ), "min_samples_required_for_training must not be None and must be greater than 0."
    assert (
        early_stop_threshold is None or early_stop_threshold >= 0
    ), "Early stop threshold must be greater than or equal to 0."
    assert initial_train_steps >= 0, "Initial training steps must be greater than or equal to 0."

    super().__init__(
        n_features=n_features,
        buffer=buffer,
        train_batch_size=train_batch_size,
        selector=selector,
    )

    self.save_hyperparameters(
        {
            "exploration_rate": exploration_rate,
            "train_batch_size": train_batch_size,
            "learning_rate": learning_rate,
            "weight_decay": weight_decay,
            "learning_rate_decay": learning_rate_decay,
            "learning_rate_scheduler_step_size": learning_rate_scheduler_step_size,
            "min_samples_required_for_training": min_samples_required_for_training,
            "early_stop_threshold": early_stop_threshold,
            "initial_train_steps": initial_train_steps,
            "warm_start": warm_start,
        }
    )

    # Model parameters: Initialize θ_t
    self.theta_t = network.to(self.device)
    self.theta_t_init = self.theta_t.state_dict().copy() if not self.hparams["warm_start"] else None

    self.total_params = sum(p.numel() for p in self.theta_t.parameters() if p.requires_grad)

    # Initialize the diagonal of Z_0 = λI (stored as a vector due to the diagonal approximation)
    self.register_buffer(
        "Z_t",
        self.hparams["weight_decay"] * torch.ones((self.total_params,), device=self.device),
    )

NeuralTSBandit(n_features, network, buffer=None, selector=None, exploration_rate=1.0, train_batch_size=32, learning_rate=0.001, weight_decay=1.0, learning_rate_decay=1.0, learning_rate_scheduler_step_size=1, early_stop_threshold=0.001, min_samples_required_for_training=64, initial_train_steps=1024, num_samples_per_arm=1, warm_start=True)

Bases: NeuralBandit

Neural Thompson Sampling (TS) bandit implementation as a PyTorch Lightning module.

Implements the NeuralTS algorithm using a neural network for function approximation, with a diagonal approximation of the precision matrix. The module maintains a history of contexts and rewards, and periodically updates the network parameters via gradient descent. This implementation supports both standard and combinatorial bandit settings.

Implementation details

Standard setting:

  • \(\sigma_{t,a} = \sqrt{\lambda \nu \cdot g(x_{t,a}; \theta_{t-1})^T Z_{t-1}^{-1} g(x_{t,a}; \theta_{t-1})}\)

  • Sample rewards: \(\tilde{v}_{t,a} \sim \mathcal{N}(f(x_{t,a}; \theta_{t-1}), \sigma^2_{t,a})\)

  • Update: \(Z_t = Z_{t-1} + g(x_{t,a_t}; \theta_{t-1})g(x_{t,a_t}; \theta_{t-1})^T\)

Combinatorial setting:

  • Same variance and sampling formulas for each arm

  • Select super arm: \(S_t = \mathcal{O}_S(\tilde{v}_t)\)

  • Update includes gradients from all chosen arms: \(Z_t = Z_{t-1} + \sum_{a \in S_t} g(x_{t,a}; \theta_{t-1})g(x_{t,a}; \theta_{t-1})^T\) (see the sketch below)
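
Under the same diagonal approximation, a minimal sketch of the per-arm sampling step, where f_pred is the network's reward estimate for the arm and g its flattened gradient (helper name and signature are illustrative, not the library's API):

```python
import torch


def thompson_sample(f_pred: torch.Tensor, g: torch.Tensor, Z_diag: torch.Tensor,
                    weight_decay: float, exploration_rate: float,
                    num_samples_per_arm: int = 1) -> torch.Tensor:
    """Draw samples from N(f, sigma^2) with sigma^2 = lambda * nu * g^T Z^{-1} g (diagonal Z)."""
    sigma2 = weight_decay * exploration_rate * torch.sum(g * g / Z_diag)
    return f_pred + torch.sqrt(sigma2) * torch.randn(num_samples_per_arm)


# The selector then picks the arm (or super arm) with the highest sampled value(s),
# and Z_diag is updated with g * g for every chosen arm, exactly as in NeuralUCB.
```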


Parameters:

Name Type Description Default
n_features int

Number of input features. Must be greater than 0.

required
network Module

Neural network module for function approximation.

required
buffer AbstractBanditDataBuffer[Tensor, Any] | None

Buffer for storing bandit interaction data.

None
selector AbstractSelector | None

Action selector for the bandit. Defaults to ArgMaxSelector (if None).

None
exploration_rate float

Exploration parameter for UCB. Called \(\nu\) in the original paper. Defaults to 1. Must be greater than 0.

1.0
train_batch_size int

Size of mini-batches for training. Defaults to 32. Must be greater than 0.

32
learning_rate float

The learning rate for the optimizer of the neural network. Passed to lr of torch.optim.Adam. Default is 1e-3. Must be greater than 0.

0.001
weight_decay float

The regularization parameter for the neural network. Passed to weight_decay of torch.optim.Adam. Called \(\lambda\) in the original paper. Default is 1.0. Must be greater than 0 because the NeuralUCB algorithm is based on this parameter.

1.0
learning_rate_decay float

Multiplicative factor for learning rate decay. Passed to gamma of torch.optim.lr_scheduler.StepLR. Default is 1.0 (i.e. no decay). Must be greater than 0.

1.0
learning_rate_scheduler_step_size int

The step size for the learning rate decay. Passed to step_size of torch.optim.lr_scheduler.StepLR. Default is 1. Must be greater than 0.

1
early_stop_threshold float | None

Loss threshold for early stopping. None to disable. Defaults to 1e-3. Must be greater than or equal to 0.

0.001
min_samples_required_for_training int

If fewer samples than this value have been added via record_feedback, the network is not trained. Defaults to 64. Must be greater than 0.

64
initial_train_steps int

For the first initial_train_steps samples, the network is always trained even if fewer new samples than min_samples_required_for_training have been seen. This value is therefore only relevant if min_samples_required_for_training is set. Set to 0 to disable this feature. Defaults to 1024. Must be greater than or equal to 0.

1024
num_samples_per_arm int

Number of samples to draw from each Normal distribution in Thompson Sampling. Defaults to 1. Must be greater than 0.

1
warm_start bool

If False, the network parameters are reset via network.reset_parameters() every time the network is retrained, so that it is retrained from scratch. If True, the network is trained from its current state.

True
Source code in src/calvera/bandits/neural_ts_bandit.py
def __init__(
    self,
    n_features: int,
    network: nn.Module,
    buffer: AbstractBanditDataBuffer[torch.Tensor, Any] | None = None,
    selector: AbstractSelector | None = None,
    exploration_rate: float = 1.0,
    train_batch_size: int = 32,
    learning_rate: float = 1e-3,
    weight_decay: float = 1.0,
    learning_rate_decay: float = 1.0,
    learning_rate_scheduler_step_size: int = 1,
    early_stop_threshold: float | None = 1e-3,
    min_samples_required_for_training: int = 64,
    initial_train_steps: int = 1024,
    num_samples_per_arm: int = 1,
    warm_start: bool = True,
) -> None:
    r"""Initialize the NeuralTS bandit module.

    Args:
        n_features: Number of input features. Must be greater 0.
        network: Neural network module for function approximation.
        buffer: Buffer for storing bandit interaction data.
        selector: Action selector for the bandit. Defaults to ArgMaxSelector (if None).
        exploration_rate: Exploration parameter for UCB. Called $\nu$ in the original paper.
            Defaults to 1. Must be greater 0.
        train_batch_size: Size of mini-batches for training. Defaults to 32. Must be greater 0.
        learning_rate: The learning rate for the optimizer of the neural network.
            Passed to `lr` of `torch.optim.Adam`.
            Default is 1e-3. Must be greater than 0.
        weight_decay: The regularization parameter for the neural network.
            Passed to `weight_decay` of `torch.optim.Adam`. Called $\lambda$ in the original paper.
            Default is 1.0. Must be greater than 0 because the NeuralUCB algorithm is based on this parameter.
        learning_rate_decay: Multiplicative factor for learning rate decay.
            Passed to `gamma` of `torch.optim.lr_scheduler.StepLR`.
            Default is 1.0 (i.e. no decay). Must be greater than 0.
        learning_rate_scheduler_step_size: The step size for the learning rate decay.
            Passed to `step_size` of `torch.optim.lr_scheduler.StepLR`.
            Default is 1. Must be greater than 0.
        early_stop_threshold: Loss threshold for early stopping. None to disable.
            Defaults to 1e-3. Must be greater equal 0.
        min_samples_required_for_training: If less samples have been added via `record_feedback`
            than this value, the network is not trained.
            Defaults to 64. Must be greater 0.
        initial_train_steps: For the first `initial_train_steps` samples, the network is always trained even if
            less new data than `min_samples_required_for_training` has been seen. Therefore, this value is only
            required if `min_samples_required_for_training` is set. Set to 0 to disable this feature.
            Defaults to 1024. Must be greater equal 0.
        num_samples_per_arm: Number of samples to draw from each Normal distribution in Thompson Sampling.
            Defaults to 1. Must be greater than 0.
        warm_start: If `False` the parameters of the network are reset in order to be retrained from scratch using
            `network.reset_parameters()` every time a retraining of the network occurs. If `True` the network is
            trained from the current state.
    """
    assert num_samples_per_arm > 0, "Number of samples must be greater than 0."

    super().__init__(
        n_features=n_features,
        network=network,
        buffer=buffer,
        selector=selector,
        exploration_rate=exploration_rate,
        train_batch_size=train_batch_size,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        learning_rate_decay=learning_rate_decay,
        learning_rate_scheduler_step_size=learning_rate_scheduler_step_size,
        early_stop_threshold=early_stop_threshold,
        min_samples_required_for_training=min_samples_required_for_training,
        initial_train_steps=initial_train_steps,
        warm_start=warm_start,
    )

    self.save_hyperparameters({"num_samples_per_arm": num_samples_per_arm})
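
As with NeuralUCBBandit, a minimal construction sketch (network architecture and import path are illustrative assumptions; note the smaller default for min_samples_required_for_training):

```python
import torch.nn as nn

from calvera.bandits import NeuralTSBandit  # import path assumed; source lives in src/calvera/bandits/neural_ts_bandit.py

network = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

bandit = NeuralTSBandit(
    n_features=16,
    network=network,
    exploration_rate=1.0,                  # nu
    weight_decay=1.0,                      # lambda
    min_samples_required_for_training=64,  # NeuralTS default; NeuralUCB defaults to 1024
    num_samples_per_arm=1,                 # samples drawn per arm for Thompson sampling
)
```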