Bandit Interface

Below is the interface that all bandit algorithms share, defined in the AbstractBandit class. The two outward-facing methods are forward(), used for inference, and training_step(), used for training. forward() validates the input contextualized actions before prediction, and training_step() validates the provided rewards and chosen contextualized actions before performing the update. When implementing a new bandit, the following methods need to be implemented:

  • _predict_action(self, contextualized_actions: ActionInputType, **kwargs) -> tuple[torch.Tensor, torch.Tensor]: Selects the action(s) for the given contextualized actions and returns them together with their probabilities.
  • _update(self, *args, **kwargs) -> torch.Tensor: Updates the bandit with the given batch of chosen contextualized actions and realized rewards, and returns a scalar loss.
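
For orientation, here is a minimal sketch of such a subclass. It assumes a single-tensor ActionInputType; the class name GreedyLinearBandit, the linear scorer, and the update rule are illustrative and not part of the library. It also assumes the selector is callable on a score tensor and returns a one-hot encoding, as its description below suggests.

import torch

from calvera.bandits.abstract_bandit import AbstractBandit  # import path inferred from the source listing below


class GreedyLinearBandit(AbstractBandit[torch.Tensor]):
    """Illustrative bandit: scores actions with a linear layer and picks greedily."""

    def __init__(self, n_features: int, **kwargs):
        super().__init__(n_features=n_features, **kwargs)
        self.automatic_optimization = False  # the interface does not use the Lightning optimizer
        self.scorer = torch.nn.Linear(n_features, 1)

    def _predict_action(self, contextualized_actions: torch.Tensor, **kwargs):
        # contextualized_actions: (batch_size, n_actions, n_features)
        scores = self.scorer(contextualized_actions).squeeze(-1)  # (batch_size, n_actions)
        chosen_actions = self.selector(scores)  # assumed: one-hot, (batch_size, n_actions)
        p = torch.ones(scores.shape[0], device=scores.device)  # deterministic choice -> probability 1
        return chosen_actions, p

    def _update(self, batch, batch_idx):
        contextualized_actions, _, realized_rewards, _ = batch
        # A real bandit would update its model here; this sketch only reports the loss.
        return -realized_rewards.mean()  # scalar (0-dim) tensor, e.g. the negative mean reward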

AbstractBandit(n_features, buffer=None, train_batch_size=32, selector=None)

Bases: ABC, LightningModule, Generic[ActionInputType]

Defines the interface for all bandit algorithms by implementing PyTorch Lightning Module methods.

Parameters:

  • n_features (int, required): The number of features in the contextualized actions.
  • buffer (AbstractBanditDataBuffer[ActionInputType, Any] | None, default: None): The buffer used for storing the data for continuously updating the neural network.
  • train_batch_size (int, default: 32): The mini-batch size used for the train loop (started by trainer.fit()).
  • selector (AbstractSelector | None, default: None): The selector used to choose the best action. Defaults to ArgMaxSelector if None.
Source code in src/calvera/bandits/abstract_bandit.py
def __init__(
    self,
    n_features: int,
    buffer: AbstractBanditDataBuffer[ActionInputType, Any] | None = None,
    train_batch_size: int = 32,
    selector: AbstractSelector | None = None,
):
    """Initializes the Bandit.

    Args:
        n_features: The number of features in the contextualized actions.
        buffer: The buffer used for storing the data for continuously updating the neural network.
        train_batch_size: The mini-batch size used for the train loop (started by `trainer.fit()`).
        selector: The selector used to choose the best action. Default is ArgMaxSelector (if None).
    """
    assert n_features > 0, "The number of features must be greater than 0."
    assert train_batch_size > 0, "The batch_size for training must be greater than 0."

    super().__init__()

    if buffer is None:
        self.buffer = TensorDataBuffer(
            retrieval_strategy=AllDataRetrievalStrategy(),
            max_size=None,
            device=self.device,
        )
    else:
        self.buffer = buffer

    self.selector = selector if selector is not None else ArgMaxSelector()

    self.save_hyperparameters(
        {
            "n_features": n_features,
            "train_batch_size": train_batch_size,
        }
    )
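
By way of example, a subclass might override the defaults as below. The import paths for TensorDataBuffer, AllDataRetrievalStrategy, and ArgMaxSelector are assumptions; only the class names appear in the listing above, and GreedyLinearBandit is the illustrative subclass sketched at the top of this page.

# Paths are assumptions; only the class names appear in the __init__ above.
from calvera.utils.data_storage import AllDataRetrievalStrategy, TensorDataBuffer
from calvera.utils.selectors import ArgMaxSelector

bandit = GreedyLinearBandit(
    n_features=16,
    # Bounded buffer instead of the unbounded default (max_size=None).
    buffer=TensorDataBuffer(retrieval_strategy=AllDataRetrievalStrategy(), max_size=10_000),
    train_batch_size=64,
    selector=ArgMaxSelector(),
)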

forward(*args, **kwargs)

Forward pass.

Given the contextualized actions, selects a single best action, or a set of actions in the case of combinatorial bandits. This can be computed for many samples in one batch.

Parameters:

  • contextualized_actions (ActionInputType, required): Tensor of shape (batch_size, n_actions, n_features), or a tuple of such tensors if the model takes several inputs.
  • *args (Any): Additional arguments. Passed to the _predict_action method.
  • **kwargs (Any): Additional keyword arguments. Passed to the _predict_action method.

Returns:

  • chosen_actions (Tensor): One-hot encoding of which actions were chosen. Shape: (batch_size, n_actions).
  • p (Tensor): The probability of the chosen actions. In the combinatorial case, this is one probability for the chosen superset of actions. Non-probabilistic algorithms should always return 1. Shape: (batch_size,).

Source code in src/calvera/bandits/abstract_bandit.py
def forward(
    self,
    *args: Any,
    **kwargs: Any,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Forward pass.

    Given the contextualized actions, selects a single best action, or a set of actions in the case of combinatorial
    bandits. This can be computed for many samples in one batch.

    Args:
        contextualized_actions: Tensor of shape (batch_size, n_actions, n_features).
        *args: Additional arguments. Passed to the `_predict_action` method.
        **kwargs: Additional keyword arguments. Passed to the `_predict_action` method.

    Returns:
        chosen_actions: One-hot encoding of which actions were chosen.
            Shape: (batch_size, n_actions).
        p: The probability of the chosen actions. In the combinatorial case,
            this will be one probability for the chosen superset of actions. Non-probabilistic
            algorithms should always return 1. Shape: (batch_size, ).
    """
    contextualized_actions = kwargs.get(
        "contextualized_actions", args[0] if args else None
    )  # shape: (batch_size, n_actions, n_features)
    assert contextualized_actions is not None, "contextualized_actions must be passed."

    if isinstance(contextualized_actions, torch.Tensor):
        assert contextualized_actions.ndim >= 3, (
            "Chosen actions must have shape (batch_size, num_actions, ...) "
            f"but got shape {contextualized_actions.shape}"
        )
        batch_size = contextualized_actions.shape[0]
    elif isinstance(contextualized_actions, tuple | list):
        assert len(contextualized_actions) > 1, "Tuple must contain at least 2 tensors"
        assert contextualized_actions[0].ndim >= 3, (
            "Chosen actions must have shape (batch_size, num_actions, ...) "
            f"but got shape {contextualized_actions[0].shape}"
        )
        batch_size = contextualized_actions[0].shape[0]
        assert all(
            action_item.ndim >= 3 for action_item in contextualized_actions
        ), "All tensors in tuple must have shape (batch_size, num_actions, ...)"
    else:
        raise ValueError(
            f"Contextualized actions must be a torch.Tensor or a tuple of torch.Tensors."
            f"Received {type(contextualized_actions)}."
        )

    result, p = self._predict_action(*args, **kwargs)

    # assert result.shape[0] == batch_size, (
    #     f"Batch size mismatch. Expected shape {batch_size} but got {result.shape[0]}"
    # )

    assert (
        p.ndim == 1 and p.shape[0] == batch_size and torch.all(p >= 0) and torch.all(p <= 1)
    ), f"The probabilities must be between 0 and 1 and have shape ({batch_size},) but got shape {p.shape}"

    return result, p
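
A short usage sketch of the inference path, using the illustrative GreedyLinearBandit from the top of the page; the shapes follow the assertions above.

import torch

bandit = GreedyLinearBandit(n_features=4)  # illustrative subclass from the sketch above

x = torch.randn(32, 10, 4)  # (batch_size=32, n_actions=10, n_features=4)
chosen_actions, p = bandit(x)  # forward() validates the input, then calls _predict_action

assert chosen_actions.shape == (32, 10)  # one-hot per sample
assert p.shape == (32,)                  # probability of each chosen action, here all ones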

training_step(batch, batch_idx)

Perform a single update step.

See the documentation for the LightningModule's training_step method. Acts as a wrapper for the _update method in case we want to change something for every bandit or use the update independently from lightning, e.g. in tests.

Parameters:

  • batch (BufferDataFormat[ActionInputType], required): The output of your data iterable, usually a DataLoader. It contains four elements: contextualized_actions with shape (batch_size, n_chosen_actions, n_features); embedded_actions with shape (batch_size, n_chosen_actions, n_features), which may be None and is only passed and required for certain bandits like the NeuralLinearBandit; realized_rewards with shape (batch_size, n_chosen_actions); and chosen_actions, which may be None and is likewise only required for certain bandits.
  • batch_idx (int, required): The index of this batch. Note that if a separate DataLoader is used for each step, this will be reset for each new data loader.

Returns:

  • Tensor: The loss value. In most cases, it makes sense to return the negative reward. Shape: scalar (a 0-dimensional tensor, see the assertion in the source below). Since we do not use the lightning optimizer, this value is only relevant for logging/visualization of the training process.

Source code in src/calvera/bandits/abstract_bandit.py
def training_step(self, batch: BufferDataFormat[ActionInputType], batch_idx: int) -> torch.Tensor:
    """Perform a single update step.

    See the documentation for the LightningModule's `training_step` method.
    Acts as a wrapper for the `_update` method in case we want to change something for every bandit or use the
    update independently from lightning, e.g. in tests.

    Args:
        batch: The output of your data iterable, usually a DataLoader. It contains four elements:
            contextualized_actions: shape (batch_size, n_chosen_actions, n_features).
            embedded_actions: shape (batch_size, n_chosen_actions, n_features). May be None; only
                passed and required for certain bandits like the NeuralLinearBandit.
            realized_rewards: shape (batch_size, n_chosen_actions).
            chosen_actions: May be None; only passed and required for certain bandits.
        batch_idx: The index of this batch. Note that if a separate DataLoader is used for each step,
            this will be reset for each new data loader.

    Returns:
        The loss value. In most cases, it makes sense to return the negative reward.
            Shape: scalar (a 0-dimensional tensor). Since we do not use the lightning optimizer,
            this value is only relevant for logging/visualization of the training process.
    """
    assert len(batch) == 4, (
        "Batch must contain four tensors: (contextualized_actions, embedded_actions, rewards, chosen_actions)."
        "`embedded_actions` and `chosen_actions` can be None."
    )

    realized_rewards: torch.Tensor = batch[2]  # shape: (batch_size, n_chosen_arms)

    assert realized_rewards.ndim == 2, "Rewards must have shape (batch_size, n_chosen_arms)"
    assert realized_rewards.device == self.device, "Realized reward must be on the same device as the model."

    batch_size, n_chosen_arms = realized_rewards.shape

    (
        contextualized_actions,
        embedded_actions,
    ) = batch[:2]

    if self._custom_data_loader_passed:
        self.record_feedback(contextualized_actions, realized_rewards)

    if isinstance(contextualized_actions, torch.Tensor):
        assert (
            contextualized_actions.device == self.device
        ), "Contextualized actions must be on the same device as the model."

        assert contextualized_actions.ndim >= 3, (
            f"Chosen actions must have shape (batch_size, n_chosen_arms, ...) "
            f"but got shape {contextualized_actions.shape}"
        )
        assert contextualized_actions.shape[0] == batch_size and contextualized_actions.shape[1] == n_chosen_arms, (
            "Chosen contextualized actions must have shape (batch_size, n_chosen_arms, ...) "
            f"same as reward. Expected shape ({(batch_size, n_chosen_arms)}, ...) "
            f"but got shape {contextualized_actions.shape}"
        )
    elif isinstance(contextualized_actions, tuple | list):
        assert all(
            action.device == self.device for action in contextualized_actions
        ), "Contextualized actions must be on the same device as the model."

        assert len(contextualized_actions) > 1 and contextualized_actions[0].ndim >= 3, (
            "The tuple of contextualized_actions must contain more than one element and be of shape "
            "(batch_size, n_chosen_arms, ...)."
        )
        assert (
            contextualized_actions[0].shape[0] == batch_size and contextualized_actions[0].shape[1] == n_chosen_arms
        ), (
            "Chosen contextualized actions must have shape (batch_size, n_chosen_arms, ...) "
            f"same as reward. Expected shape ({(batch_size, n_chosen_arms)}, ...) "
            f"but got shape {contextualized_actions[0].shape}"
        )
    else:
        raise ValueError(
            f"Contextualized actions must be a torch.Tensor or a tuple of torch.Tensors. "
            f"Received {type(contextualized_actions)}."
        )

    if embedded_actions is not None:
        assert embedded_actions.device == self.device, "Embedded actions must be on the same device as the model."
        assert (
            embedded_actions.ndim == 3
        ), "Embedded actions must have shape (batch_size, n_chosen_arms, n_features)"
        assert embedded_actions.shape[0] == batch_size and embedded_actions.shape[1] == n_chosen_arms, (
            "Chosen embedded actions must have shape (batch_size, n_chosen_arms, n_features) "
            f"same as reward. Expected shape ({(batch_size, n_chosen_arms)}, n_features) "
            f"but got shape {embedded_actions[0].shape}"
        )

    loss = self._update(
        batch,
        batch_idx,
    )

    assert loss.ndim == 0, "Loss must be a scalar value."

    return loss
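
As the docstring notes, training_step can also be exercised without Lightning, e.g. in tests. A sketch, again with the illustrative GreedyLinearBandit, assuming neither embedded nor chosen actions are needed and that the buffer bookkeeping not shown in this excerpt (e.g. _custom_data_loader_passed) is satisfied.

import torch

bandit = GreedyLinearBandit(n_features=4)  # illustrative subclass from the first sketch

contextualized_actions = torch.randn(8, 1, 4)  # (batch_size, n_chosen_actions, n_features)
realized_rewards = torch.rand(8, 1)            # (batch_size, n_chosen_actions)

# Four-element batch; embedded_actions and chosen_actions may be None.
batch = (contextualized_actions, None, realized_rewards, None)
loss = bandit.training_step(batch, batch_idx=0)

assert loss.ndim == 0  # scalar loss, per the assertion above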

_predict_action(contextualized_actions, **kwargs) abstractmethod

Forward pass, computed batch-wise.

Given the contextualized actions, selects a single best action, or a set of actions in the case of combinatorial bandits. Alongside the action(s), the selector also returns the probability of choosing this action. This allows for logging and Batch Learning from Logged Bandit Feedback (BLBF). Deterministic algorithms like UCB will always return 1.

Parameters:

  • contextualized_actions (ActionInputType, required): Input into the bandit or network containing all actions. Either a Tensor of shape (batch_size, n_actions, n_features) or a tuple of such tensors if there are several inputs to the model.
  • **kwargs (Any): Additional keyword arguments.

Returns:

  • chosen_actions (Tensor): One-hot encoding of which actions were chosen. Shape: (batch_size, n_actions).
  • p (Tensor): The probability of the chosen actions. In the combinatorial case, this is one probability for the chosen superset of actions. Deterministic algorithms (like UCB) should always return 1. Shape: (batch_size,).

Source code in src/calvera/bandits/abstract_bandit.py
@abstractmethod
def _predict_action(
    self,
    contextualized_actions: ActionInputType,
    **kwargs: Any,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Forward pass, computed batch-wise.

    Given the contextualized actions, selects a single best action, or a set of actions in the case of combinatorial
    bandits. Alongside the action(s), the selector also returns the probability of choosing this action. This will
    allow for logging and Batch Learning from Logged Bandit Feedback (BLBF). Deterministic algorithms like UCB will
    always return 1.

    Args:
        contextualized_actions: Input into bandit or network containing all actions. Either Tensor of shape
            (batch_size, n_actions, n_features) or a tuple of tensors of shape (batch_size, n_actions, n_features)
            if there are several inputs to the model.
        **kwargs: Additional keyword arguments.

    Returns:
        chosen_actions: One-hot encoding of which actions were chosen.
            Shape: (batch_size, n_actions).
        p: The probability of the chosen actions. In the combinatorial case,
            this will be one probability for the chosen superset of actions. Deterministic algorithms (like UCB) should
            always return 1. Shape: (batch_size, ).
    """
    pass
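
To illustrate a genuinely probabilistic p, here is a hedged epsilon-greedy variant of _predict_action. The scorer attribute and the fixed epsilon are illustrative; the point is that the returned p is the propensity of the realized choice, which BLBF estimators need.

def _predict_action(self, contextualized_actions: torch.Tensor, **kwargs):
    # Epsilon-greedy sketch: explore uniformly with probability eps, else exploit.
    batch_size, n_actions, _ = contextualized_actions.shape
    eps = 0.1  # illustrative fixed exploration rate

    scores = self.scorer(contextualized_actions).squeeze(-1)  # (batch_size, n_actions)
    greedy = torch.argmax(scores, dim=1)
    uniform = torch.randint(n_actions, (batch_size,), device=scores.device)
    explore = torch.rand(batch_size, device=scores.device) < eps
    picked = torch.where(explore, uniform, greedy)

    chosen_actions = torch.nn.functional.one_hot(picked, n_actions).float()
    # Propensity of the realized choice: eps/n_actions for any action, plus
    # (1 - eps) when the pick coincides with the greedy action.
    p = eps / n_actions + (1 - eps) * (picked == greedy).float()
    return chosen_actions, p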

_update(*args, **kwargs) abstractmethod

Abstract method to perform a single update step. Should be implemented by the concrete bandit classes.

Parameters:

  • batch: The output of your data iterable, usually a DataLoader. It contains four elements: contextualized_actions with shape (batch_size, n_chosen_actions, n_features); embedded_actions with shape (batch_size, n_chosen_actions, n_features), which may be None and is only passed and required for certain bandits like the NeuralLinearBandit; realized_rewards with shape (batch_size, n_chosen_actions); and chosen_actions, which may be None and is likewise only required for certain bandits like the NeuralLinearBandit.
  • batch_idx: The index of this batch. Note that if a separate DataLoader is used for each step, this will be reset for each new data loader.
  • data_loader_idx: The index of the data loader. This is useful if you have multiple data loaders at once and want to do something different for each one.
  • *args (Any): Additional arguments.
  • **kwargs (Any): Additional keyword arguments.

Returns:

  • Tensor: The loss value. In most cases, it makes sense to return the negative reward. Shape: scalar (a 0-dimensional tensor). Since we do not use the lightning optimizer, this value is only relevant for logging/visualization of the training process.

Source code in src/calvera/bandits/abstract_bandit.py
@abstractmethod
def _update(
    self,
    *args: Any,
    **kwargs: Any,
) -> torch.Tensor:
    """Abstract method to perform a single update step. Should be implemented by the concrete bandit classes.

    Args:
        batch: The output of your data iterable, usually a DataLoader. It contains four elements:
            contextualized_actions: shape (batch_size, n_chosen_actions, n_features).
            embedded_actions: shape (batch_size, n_chosen_actions, n_features). May be None; only
                passed and required for certain bandits like the NeuralLinearBandit.
            realized_rewards: shape (batch_size, n_chosen_actions).
            chosen_actions: May be None; only passed and required for certain bandits like the
                NeuralLinearBandit.
        batch_idx: The index of this batch. Note that if a separate DataLoader is used for each step,
            this will be reset for each new data loader.
        data_loader_idx: The index of the data loader. This is useful if you have multiple data loaders
            at once and want to do something different for each one.
        *args: Additional arguments.
        **kwargs: Additional keyword arguments.

    Returns:
        The loss value. In most cases, it makes sense to return the negative reward.
            Shape: scalar (a 0-dimensional tensor). Since we do not use the lightning optimizer,
            this value is only relevant for logging/visualization of the training process.
    """
    pass
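
A hedged sketch of a concrete _update that takes one manual gradient step (the interface does not use the Lightning optimizer, as noted above). The scorer and the _optimizer attribute are illustrative and would be created in the subclass's __init__.

def _update(self, batch, batch_idx):
    contextualized_actions, _, realized_rewards, _ = batch

    # Regress predicted rewards of the chosen actions onto the realized rewards.
    predicted = self.scorer(contextualized_actions).squeeze(-1)  # (batch_size, n_chosen_arms)
    loss = torch.nn.functional.mse_loss(predicted, realized_rewards)

    self._optimizer.zero_grad()  # illustrative: a torch.optim optimizer built in __init__
    loss.backward()
    self._optimizer.step()

    return loss.detach()  # scalar, matching the ndim == 0 assertion in training_step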

DummyBandit(n_features, k=1)

Bases: AbstractBandit[ActionInputType]

A dummy bandit that always selects random actions.

Parameters:

  • n_features (int, required): The number of features in the bandit model. Must be positive.
  • k (int, default: 1): Number of actions to select. Must be positive.
Source code in src/calvera/bandits/abstract_bandit.py
def __init__(self, n_features: int, k: int = 1) -> None:
    """Initializes a DummyBandit with a RandomSelector.

    Args:
        n_features: The number of features in the bandit model. Must be positive.
        k: Number of actions to select. Must be positive. Default is 1.
    """
    super().__init__(
        selector=RandomSelector(k=k),
        n_features=n_features,
    )
    self.automatic_optimization = False
    # Lightning requires at least one parameter to be registered in order to train the module on CUDA.
    self.register_parameter("_", None)