
NeuralTS and NeuralUCB

The NeuralUCBBandit and the NeuralTSBandit share the same interface. Both use a neural network to learn the reward function for given contextualized actions. To estimate the uncertainty, the gradients of the estimated reward of the chosen action with respect to the network parameters are used. These gradients build a (diagonally approximated) precision matrix, which is then used to compute an upper confidence bound (UCB) or to perform Thompson sampling.
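
A minimal construction sketch of the shared interface, assuming a small feed-forward reward network (the import path and the network architecture are illustrative assumptions, not prescribed by the library):

```python
import torch.nn as nn

from calvera.bandits import NeuralUCBBandit  # import path assumed; see the source paths listed below

n_features = 16

# Small MLP mapping a contextualized action (n_features inputs) to a scalar reward estimate.
network = nn.Sequential(
    nn.Linear(n_features, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

bandit = NeuralUCBBandit(
    n_features=n_features,
    network=network,
    exploration_rate=1.0,  # nu in the original paper
    weight_decay=1.0,      # lambda in the original paper
)
```

NeuralTSBandit is constructed the same way; feedback is supplied via record_feedback, and the network is retrained according to min_samples_required_for_training and initial_train_steps (see the parameter descriptions below).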

NeuralUCBBandit(n_features, network, buffer=None, selector=None, exploration_rate=1.0, train_batch_size=32, learning_rate=0.001, weight_decay=1.0, learning_rate_decay=1.0, learning_rate_scheduler_step_size=1, early_stop_threshold=0.001, min_samples_required_for_training=1024, initial_train_steps=1024, warm_start=True)

Bases: NeuralBandit

NeuralUCB bandit implementation as a PyTorch Lightning module.

Implements the NeuralUCB algorithm using a neural network for function approximation, with a diagonal approximation of the precision matrix for exploration. This implementation supports both standard and combinatorial bandit settings.

Implementation details

Standard setting:

  • UCB: \(u_{t,a} = f(x_{t,a}; \theta_{t-1}) + \sqrt{\lambda \nu \cdot g(x_{t,a}; \theta_{t-1})^T Z_{t-1}^{-1} g(x_{t,a}; \theta_{t-1})}\)

  • Update: \(Z_t = Z_{t-1} + g(x_{t,a_t}; \theta_{t-1})g(x_{t,a_t}; \theta_{t-1})^T\)

Combinatorial setting:

  • Same UCB formula for each arm

  • Select super arm: \(S_t = \mathcal{O}_S(u_t)\)

  • Update includes gradients from all chosen arms: \(Z_t = Z_{t-1} + \sum_{a \in S_t} g(x_{t,a}; \theta_{t-1})g(x_{t,a}; \theta_{t-1})^T\) (see the sketch below)
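
Because the implementation uses a diagonal approximation, \(Z_t\) is stored as a vector (see the Z_t buffer in the source below) and the quadratic form reduces to an elementwise sum. A minimal sketch of the score and precision update under that approximation, where f_pred is the network's reward estimate and g the flattened gradient of that estimate with respect to the network parameters (helper names are illustrative, not the library's API):

```python
import torch


def ucb_score(f_pred: torch.Tensor, g: torch.Tensor, Z_diag: torch.Tensor,
              weight_decay: float, exploration_rate: float) -> torch.Tensor:
    """UCB under the diagonal approximation: f + sqrt(lambda * nu * g^T Z^{-1} g)."""
    quad_form = torch.sum(g * g / Z_diag)  # g^T Z^{-1} g with diagonal Z
    return f_pred + torch.sqrt(weight_decay * exploration_rate * quad_form)


def update_precision(Z_diag: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Diagonal update Z_t = Z_{t-1} + diag(g g^T) for one chosen arm."""
    return Z_diag + g * g
```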


Parameters:

Name Type Description Default
n_features int

Number of input features. Must be greater than 0.

required
network Module

Neural network module for function approximation.

required
buffer AbstractBanditDataBuffer[Tensor, Any] | None

Buffer for storing bandit interaction data. See superclass for further information.

None
selector AbstractSelector | None

The selector used to choose the best action. Default is ArgMaxSelector (if None).

None
exploration_rate float

Exploration parameter for UCB. Called \(\nu\) in the original paper. Must be greater than 0.

1.0
train_batch_size int

Size of mini-batches for training. Must be greater than 0.

32
learning_rate float

The learning rate for the optimizer of the neural network. Passed to lr of torch.optim.Adam. Must be greater than 0.

0.001
weight_decay float

The regularization parameter for the neural network. Passed to weight_decay of torch.optim.Adam. Called \(\lambda\) in the original paper. Must be greater than 0 because the NeuralUCB algorithm is based on this parameter.

1.0
learning_rate_decay float

Multiplicative factor for learning rate decay. Passed to gamma of torch.optim.lr_scheduler.StepLR. Default is 1.0 (i.e. no decay). Must be greater than 0.

1.0
learning_rate_scheduler_step_size int

The step size for the learning rate decay. Passed to step_size of torch.optim.lr_scheduler.StepLR. Must be greater than 0.

1
early_stop_threshold float | None

Loss threshold for early stopping. None to disable. Must be greater than or equal to 0.

0.001
min_samples_required_for_training int

If fewer samples than this value have been added via record_feedback, the network is not trained. Must be greater than 0. Default is 1024.

1024
initial_train_steps int

For the first initial_train_steps samples, the network is always trained even if fewer new samples than min_samples_required_for_training have been seen. This value is therefore only relevant if min_samples_required_for_training is set. Set to 0 to disable this feature. Must be greater than or equal to 0.

1024
warm_start bool

If False, the network parameters are reset via network.reset_parameters() every time the network is retrained, so that it is retrained from scratch. If True, the network is trained from its current state.

True
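
The interplay of min_samples_required_for_training, initial_train_steps, and warm_start can be summarized by the following sketch; it is a simplified illustration of the documented behavior, not the library's internal code:

```python
def should_train(total_samples_seen: int, new_samples_since_last_training: int,
                 min_samples_required_for_training: int = 1024,
                 initial_train_steps: int = 1024) -> bool:
    # For the first `initial_train_steps` samples the network is always trained.
    if total_samples_seen <= initial_train_steps:
        return True
    # Afterwards, training only runs once enough new feedback has been recorded.
    return new_samples_since_last_training >= min_samples_required_for_training


# With warm_start=False the network is additionally reset via network.reset_parameters()
# before each retraining, so a custom network must implement reset_parameters().
```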
Source code in src/calvera/bandits/neural_bandit.py
def __init__(
    self,
    n_features: int,
    network: nn.Module,
    buffer: AbstractBanditDataBuffer[torch.Tensor, Any] | None = None,
    selector: AbstractSelector | None = None,
    exploration_rate: float = 1.0,
    train_batch_size: int = 32,
    learning_rate: float = 1e-3,
    weight_decay: float = 1.0,
    learning_rate_decay: float = 1.0,
    learning_rate_scheduler_step_size: int = 1,
    early_stop_threshold: float | None = 1e-3,
    min_samples_required_for_training: int = 1024,
    initial_train_steps: int = 1024,
    warm_start: bool = True,
) -> None:
    r"""Initialize the NeuralUCB bandit module.

    Args:
        n_features: Number of input features. Must be greater 0.
        network: Neural network module for function approximation.
        buffer: Buffer for storing bandit interaction data. See superclass for further information.
        selector: The selector used to choose the best action. Default is `ArgMaxSelector` (if None).
        exploration_rate: Exploration parameter for UCB. Called $\nu$ in the original paper.
            Must be greater 0.
        train_batch_size: Size of mini-batches for training. Must be greater 0.
        learning_rate: The learning rate for the optimizer of the neural network.
            Passed to `lr` of `torch.optim.Adam`.
            Must be greater than 0.
        weight_decay: The regularization parameter for the neural network.
            Passed to `weight_decay` of `torch.optim.Adam`. Called $\lambda$ in the original paper.
            Must be greater than 0 because the NeuralUCB algorithm is based on this parameter.
        learning_rate_decay: Multiplicative factor for learning rate decay.
            Passed to `gamma` of `torch.optim.lr_scheduler.StepLR`.
            Default is 1.0 (i.e. no decay). Must be greater than 0.
        learning_rate_scheduler_step_size: The step size for the learning rate decay.
            Passed to `step_size` of `torch.optim.lr_scheduler.StepLR`.
            Must be greater than 0.
        early_stop_threshold: Loss threshold for early stopping. None to disable.
            Must be greater equal 0.
        min_samples_required_for_training: If less samples have been added via `record_feedback`
            than this value, the network is not trained.
            Must be greater 0. Default is 1024.
        initial_train_steps: For the first `initial_train_steps` samples, the network is always trained even if
            less new data than `min_samples_required_for_training` has been seen. Therefore, this value is only
            required if `min_samples_required_for_training` is set. Set to 0 to disable this feature.
            Must be greater equal 0.
        warm_start: If `False` the parameters of the network are reset in order to be retrained from scratch using
            `network.reset_parameters()` everytime a retraining of the network occurs. If `True` the network is
            trained from the current state.
    """
    assert weight_decay >= 0, "Regularization parameter must be greater equal 0."
    assert exploration_rate > 0, "Exploration rate must be greater than 0."
    assert learning_rate > 0, "Learning rate must be greater than 0."
    assert learning_rate_decay >= 0, "The learning rate decay must be greater equal 0."
    assert learning_rate_scheduler_step_size > 0, "Learning rate scheduler step size must be greater than 0."
    assert (
        min_samples_required_for_training is not None and min_samples_required_for_training > 0
    ), "min_samples_required_for_training must not be None and must be greater than 0."
    assert (
        early_stop_threshold is None or early_stop_threshold >= 0
    ), "Early stop threshold must be greater than or equal to 0."
    assert initial_train_steps >= 0, "Initial training steps must be greater than or equal to 0."

    super().__init__(
        n_features=n_features,
        buffer=buffer,
        train_batch_size=train_batch_size,
        selector=selector,
    )

    self.save_hyperparameters(
        {
            "exploration_rate": exploration_rate,
            "train_batch_size": train_batch_size,
            "learning_rate": learning_rate,
            "weight_decay": weight_decay,
            "learning_rate_decay": learning_rate_decay,
            "learning_rate_scheduler_step_size": learning_rate_scheduler_step_size,
            "min_samples_required_for_training": min_samples_required_for_training,
            "early_stop_threshold": early_stop_threshold,
            "initial_train_steps": initial_train_steps,
            "warm_start": warm_start,
        }
    )

    # Model parameters: Initialize θ_t
    self.theta_t = network.to(self.device)
    self.theta_t_init = self.theta_t.state_dict().copy() if not self.hparams["warm_start"] else None

    self.total_params = sum(p.numel() for p in self.theta_t.parameters() if p.requires_grad)

    # Initialize the diagonal of Z_0 = λI (stored as a vector due to the diagonal approximation)
    self.register_buffer(
        "Z_t",
        self.hparams["weight_decay"] * torch.ones((self.total_params,), device=self.device),
    )

NeuralTSBandit(n_features, network, buffer=None, selector=None, exploration_rate=1.0, train_batch_size=32, learning_rate=0.001, weight_decay=1.0, learning_rate_decay=1.0, learning_rate_scheduler_step_size=1, early_stop_threshold=0.001, min_samples_required_for_training=64, initial_train_steps=1024, num_samples_per_arm=1, warm_start=True)

Bases: NeuralBandit

Neural Thompson Sampling (TS) bandit implementation as a PyTorch Lightning module.

Implements the NeuralTS algorithm using a neural network for function approximation, with a diagonal approximation of the precision matrix. The module maintains a history of contexts and rewards, and periodically updates the network parameters via gradient descent. This implementation supports both standard and combinatorial bandit settings.

Implementation details

Standard setting:

  • \(\sigma_{t,a} = \sqrt{\lambda \nu \cdot g(x_{t,a}; \theta_{t-1})^T Z_{t-1}^{-1} g(x_{t,a}; \theta_{t-1})}\)

  • Sample rewards: \(\tilde{v}_{t,a} \sim \mathcal{N}(f(x_{t,a}; \theta_{t-1}), \sigma^2_{t,a})\)

  • Update: \(Z_t = Z_{t-1} + g(x_{t,a_t}; \theta_{t-1})g(x_{t,a_t}; \theta_{t-1})^T\)

Combinatorial setting:

  • Same variance and sampling formulas for each arm

  • Select super arm: \(S_t = \mathcal{O}_S(\tilde{v}_t)\)

  • Update includes gradients from all chosen arms: \(Z_t = Z_{t-1} + \sum_{a \in S_t} g(x_{t,a}; \theta_{t-1})g(x_{t,a}; \theta_{t-1})^T\) (see the sketch below)
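
Under the same diagonal approximation, a minimal sketch of the per-arm sampling step, where f_pred is the network's reward estimate for the arm and g its flattened gradient (helper name and signature are illustrative, not the library's API):

```python
import torch


def thompson_sample(f_pred: torch.Tensor, g: torch.Tensor, Z_diag: torch.Tensor,
                    weight_decay: float, exploration_rate: float,
                    num_samples_per_arm: int = 1) -> torch.Tensor:
    """Draw samples from N(f, sigma^2) with sigma^2 = lambda * nu * g^T Z^{-1} g (diagonal Z)."""
    sigma2 = weight_decay * exploration_rate * torch.sum(g * g / Z_diag)
    return f_pred + torch.sqrt(sigma2) * torch.randn(num_samples_per_arm)


# The selector then picks the arm (or super arm) with the highest sampled value(s),
# and Z_diag is updated with g * g for every chosen arm, exactly as in NeuralUCB.
```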


Parameters:

Name Type Description Default
n_features int

Number of input features. Must be greater than 0.

required
network Module

Neural network module for function approximation.

required
buffer AbstractBanditDataBuffer[Tensor, Any] | None

Buffer for storing bandit interaction data.

None
selector AbstractSelector | None

Action selector for the bandit. Defaults to ArgMaxSelector (if None).

None
exploration_rate float

Exploration parameter for UCB. Called \(\nu\) in the original paper. Defaults to 1. Must be greater than 0.

1.0
train_batch_size int

Size of mini-batches for training. Defaults to 32. Must be greater than 0.

32
learning_rate float

The learning rate for the optimizer of the neural network. Passed to lr of torch.optim.Adam. Default is 1e-3. Must be greater than 0.

0.001
weight_decay float

The regularization parameter for the neural network. Passed to weight_decay of torch.optim.Adam. Called \(\lambda\) in the original paper. Default is 1.0. Must be greater than 0 because the NeuralUCB algorithm is based on this parameter.

1.0
learning_rate_decay float

Multiplicative factor for learning rate decay. Passed to gamma of torch.optim.lr_scheduler.StepLR. Default is 1.0 (i.e. no decay). Must be greater than 0.

1.0
learning_rate_scheduler_step_size int

The step size for the learning rate decay. Passed to step_size of torch.optim.lr_scheduler.StepLR. Default is 1. Must be greater than 0.

1
early_stop_threshold float | None

Loss threshold for early stopping. None to disable. Defaults to 1e-3. Must be greater than or equal to 0.

0.001
min_samples_required_for_training int

If fewer samples than this value have been added via record_feedback, the network is not trained. Defaults to 64. Must be greater than 0.

64
initial_train_steps int

For the first initial_train_steps samples, the network is always trained even if fewer new samples than min_samples_required_for_training have been seen. This value is therefore only relevant if min_samples_required_for_training is set. Set to 0 to disable this feature. Defaults to 1024. Must be greater than or equal to 0.

1024
num_samples_per_arm int

Number of samples to draw from each Normal distribution in Thompson Sampling. Defaults to 1. Must be greater than 0.

1
warm_start bool

If False, the network parameters are reset via network.reset_parameters() every time the network is retrained, so that it is retrained from scratch. If True, the network is trained from its current state.

True
Source code in src/calvera/bandits/neural_ts_bandit.py
def __init__(
    self,
    n_features: int,
    network: nn.Module,
    buffer: AbstractBanditDataBuffer[torch.Tensor, Any] | None = None,
    selector: AbstractSelector | None = None,
    exploration_rate: float = 1.0,
    train_batch_size: int = 32,
    learning_rate: float = 1e-3,
    weight_decay: float = 1.0,
    learning_rate_decay: float = 1.0,
    learning_rate_scheduler_step_size: int = 1,
    early_stop_threshold: float | None = 1e-3,
    min_samples_required_for_training: int = 64,
    initial_train_steps: int = 1024,
    num_samples_per_arm: int = 1,
    warm_start: bool = True,
) -> None:
    r"""Initialize the NeuralTS bandit module.

    Args:
        n_features: Number of input features. Must be greater 0.
        network: Neural network module for function approximation.
        buffer: Buffer for storing bandit interaction data.
        selector: Action selector for the bandit. Defaults to ArgMaxSelector (if None).
        exploration_rate: Exploration parameter for UCB. Called $\nu$ in the original paper.
            Defaults to 1. Must be greater 0.
        train_batch_size: Size of mini-batches for training. Defaults to 32. Must be greater 0.
        learning_rate: The learning rate for the optimizer of the neural network.
            Passed to `lr` of `torch.optim.Adam`.
            Default is 1e-3. Must be greater than 0.
        weight_decay: The regularization parameter for the neural network.
            Passed to `weight_decay` of `torch.optim.Adam`. Called $\lambda$ in the original paper.
            Default is 1.0. Must be greater than 0 because the NeuralUCB algorithm is based on this parameter.
        learning_rate_decay: Multiplicative factor for learning rate decay.
            Passed to `gamma` of `torch.optim.lr_scheduler.StepLR`.
            Default is 1.0 (i.e. no decay). Must be greater than 0.
        learning_rate_scheduler_step_size: The step size for the learning rate decay.
            Passed to `step_size` of `torch.optim.lr_scheduler.StepLR`.
            Default is 1. Must be greater than 0.
        early_stop_threshold: Loss threshold for early stopping. None to disable.
            Defaults to 1e-3. Must be greater equal 0.
        min_samples_required_for_training: If less samples have been added via `record_feedback`
            than this value, the network is not trained.
            Defaults to 64. Must be greater 0.
        initial_train_steps: For the first `initial_train_steps` samples, the network is always trained even if
            less new data than `min_samples_required_for_training` has been seen. Therefore, this value is only
            required if `min_samples_required_for_training` is set. Set to 0 to disable this feature.
            Defaults to 1024. Must be greater equal 0.
        num_samples_per_arm: Number of samples to draw from each Normal distribution in Thompson Sampling.
            Defaults to 1. Must be greater than 0.
        warm_start: If `False` the parameters of the network are reset in order to be retrained from scratch using
            `network.reset_parameters()` every time a retraining of the network occurs. If `True` the network is
            trained from the current state.
    """
    assert num_samples_per_arm > 0, "Number of samples must be greater than 0."

    super().__init__(
        n_features=n_features,
        network=network,
        buffer=buffer,
        selector=selector,
        exploration_rate=exploration_rate,
        train_batch_size=train_batch_size,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        learning_rate_decay=learning_rate_decay,
        learning_rate_scheduler_step_size=learning_rate_scheduler_step_size,
        early_stop_threshold=early_stop_threshold,
        min_samples_required_for_training=min_samples_required_for_training,
        initial_train_steps=initial_train_steps,
        warm_start=warm_start,
    )

    self.save_hyperparameters({"num_samples_per_arm": num_samples_per_arm})
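
As with NeuralUCBBandit, a minimal construction sketch (network architecture and import path are illustrative assumptions; note the smaller default for min_samples_required_for_training):

```python
import torch.nn as nn

from calvera.bandits import NeuralTSBandit  # import path assumed; source lives in src/calvera/bandits/neural_ts_bandit.py

network = nn.Sequential(
    nn.Linear(16, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

bandit = NeuralTSBandit(
    n_features=16,
    network=network,
    exploration_rate=1.0,                  # nu
    weight_decay=1.0,                      # lambda
    min_samples_required_for_training=64,  # NeuralTS default; NeuralUCB defaults to 1024
    num_samples_per_arm=1,                 # samples drawn per arm for Thompson sampling
)
```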