Bandit Interface
Below is the interface that all bandit algorithms share, defined in the `AbstractBandit` class. The outward-facing methods are `forward()` and `training_step()`: `forward()` is used for inference and `training_step()` is used for training. Assertions on the input happen in `forward()`, and assertions on the update, using the provided rewards and chosen contextualized actions, happen in `training_step()`.
So, when implementing a new bandit, the following methods need to be implemented (a sketch follows below):

- `_predict_action(self, contextualized_actions: ActionInputType, **kwargs) -> tuple[torch.Tensor, torch.Tensor]`: Predicts the action(s) for the given contextualized actions and returns the probability of choosing them.
- `_update(self, *args, **kwargs) -> torch.Tensor`: Updates the bandit with the given contexts and rewards and returns the loss value.
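A minimal sketch of such a subclass, assuming a single-tensor `ActionInputType`; the class name and the uniform-random policy are illustrative, not part of the library:

```python
import torch
import torch.nn.functional as F

from calvera.bandits.abstract_bandit import AbstractBandit


class RandomBandit(AbstractBandit[torch.Tensor]):
    """Illustrative bandit: picks one of the n_actions uniformly at random."""

    def _predict_action(
        self, contextualized_actions: torch.Tensor, **kwargs: object
    ) -> tuple[torch.Tensor, torch.Tensor]:
        batch_size, n_actions, _ = contextualized_actions.shape
        # One-hot encode a uniformly random action per sample in the batch.
        choices = torch.randint(n_actions, (batch_size,))
        chosen_actions = F.one_hot(choices, n_actions).float()
        # Probability of each chosen action under the uniform policy.
        p = torch.full((batch_size,), 1.0 / n_actions)
        return chosen_actions, p

    def _update(self, *args: object, **kwargs: object) -> torch.Tensor:
        # Nothing to learn for a random policy; return a dummy loss for logging.
        return torch.zeros(1)
```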
AbstractBandit(n_features, buffer=None, train_batch_size=32, selector=None)
Bases: `ABC`, `LightningModule`, `Generic[ActionInputType]`
Defines the interface for all bandit algorithms by implementing PyTorch Lightning's `LightningModule` methods.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_features` | `int` | The number of features in the contextualized actions. | required |
| `buffer` | `AbstractBanditDataBuffer[ActionInputType, Any] \| None` | The buffer used for storing the data for continuously updating the neural network. | `None` |
| `train_batch_size` | `int` | The mini-batch size used for the train loop (started by the Lightning `Trainer`). | `32` |
| `selector` | `AbstractSelector \| None` | The selector used to choose the best action. Default is `ArgMaxSelector` (if `None`). | `None` |
Source code in src/calvera/bandits/abstract_bandit.py
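For illustration, constructing the `RandomBandit` sketched above with these parameters, relying on the documented defaults:

```python
bandit = RandomBandit(
    n_features=16,        # dimensionality of each contextualized action
    buffer=None,          # documented default buffer behavior
    train_batch_size=32,  # mini-batch size for the training loop
    selector=None,        # defaults to ArgMaxSelector
)
```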
forward(*args, **kwargs)
Forward pass.
Given the contextualized actions, selects a single best action, or a set of actions in the case of combinatorial bandits. This can be computed for many samples in one batch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `contextualized_actions` | | Tensor of shape `(batch_size, n_actions, n_features)`. | required |
| `*args` | `Any` | Additional arguments, passed on to `_predict_action`. | `()` |
| `**kwargs` | `Any` | Additional keyword arguments, passed on to `_predict_action`. | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `chosen_actions` | `Tensor` | One-hot encoding of which actions were chosen. Shape: `(batch_size, n_actions)`. |
| `p` | `Tensor` | The probability of the chosen actions. In the combinatorial case, this is the probability of the chosen super set of actions. Non-probabilistic algorithms should always return 1. Shape: `(batch_size,)`. |
Source code in src/calvera/bandits/abstract_bandit.py
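Continuing with the bandit constructed above, a sketch of inference; the shapes follow the documented contract:

```python
import torch

batch_size, n_actions, n_features = 8, 5, 16
contextualized_actions = torch.randn(batch_size, n_actions, n_features)

# forward() selects one best action per sample in the batch.
chosen_actions, p = bandit(contextualized_actions)

assert chosen_actions.shape == (batch_size, n_actions)  # one-hot rows
assert p.shape == (batch_size,)  # probability of each chosen action
```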
training_step(batch, batch_idx)
Perform a single update step.
See the documentation of the LightningModule's `training_step` method. This acts as a wrapper around the `_update` method, in case we want to change something for every bandit or use the update independently of Lightning, e.g. in tests.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `batch` | `BufferDataFormat[ActionInputType]` | The output of your data iterable, usually a DataLoader. It may contain 2 or 3 elements: `contextualized_actions` of shape `(batch_size, n_chosen_actions, n_features)`; optionally `embedded_actions` of shape `(batch_size, n_chosen_actions, n_features)`; and `realized_rewards` of shape `(batch_size, n_chosen_actions)`. The `embedded_actions` are only passed and required for certain bandits like the `NeuralLinearBandit`. | required |
| `batch_idx` | `int` | The index of this batch. Note that if a separate DataLoader is used for each step, this will be reset for each new data loader. | required |
| `data_loader_idx` | | The index of the data loader. This is useful if you have multiple data loaders at once and want to do something different for each one. | required |
| `*args` | | Additional arguments, passed on to `_update`. | required |
| `**kwargs` | | Additional keyword arguments, passed on to `_update`. | required |
Returns:
| Type | Description |
|---|---|
| `Tensor` | The loss value. In most cases, it makes sense to return the negative reward. Shape: `(1,)`. Since we do not use the Lightning optimizer, this value is only relevant for logging/visualization of the training process. |
Source code in src/calvera/bandits/abstract_bandit.py
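Continuing the example, because `training_step()` follows the standard LightningModule contract, training can be driven by a Lightning `Trainer` over a DataLoader that yields the 2-element batch format. The rewards below are placeholders, and the Lightning 2.x import style is an assumption:

```python
import lightning as L
import torch
from torch.utils.data import DataLoader, TensorDataset

# The actions that were chosen during inference, with their observed rewards.
chosen = contextualized_actions[chosen_actions.bool()].reshape(batch_size, 1, n_features)
realized_rewards = torch.rand(batch_size, 1)  # placeholder rewards

# Each batch is (contextualized_actions, realized_rewards); the optional
# embedded_actions element is only needed for bandits like NeuralLinearBandit.
loader = DataLoader(TensorDataset(chosen, realized_rewards), batch_size=32)

trainer = L.Trainer(max_epochs=1)
trainer.fit(bandit, loader)
```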
_predict_action(contextualized_actions, **kwargs)
abstractmethod
Forward pass, computed batch-wise.
Given the contextualized actions, selects a single best action, or a set of actions in the case of combinatorial bandits. Alongside the action(s), the selector also returns the probability of choosing them. This allows for logging and Batch Learning from Logged Bandit Feedback (BLBF). Deterministic algorithms like UCB will always return 1.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `contextualized_actions` | `ActionInputType` | Input into bandit or network containing all actions. Either a Tensor of shape `(batch_size, n_actions, n_features)` or a tuple of such tensors if there are several inputs to the model. | required |
| `**kwargs` | `Any` | Additional keyword arguments. | `{}` |
Returns:
| Name | Type | Description |
|---|---|---|
| `chosen_actions` | `Tensor` | One-hot encoding of which actions were chosen. Shape: `(batch_size, n_actions)`. |
| `p` | `Tensor` | The probability of the chosen actions. In the combinatorial case, this will be one probability for the super set of actions. Deterministic algorithms (like UCB) should always return 1. Shape: `(batch_size,)`. |
Source code in src/calvera/bandits/abstract_bandit.py
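As an illustration of this contract, a deterministic greedy `_predict_action` might look as follows; `self.scorer` is a stand-in assumption for the bandit's model:

```python
import torch
import torch.nn.functional as F


def _predict_action(
    self, contextualized_actions: torch.Tensor, **kwargs: object
) -> tuple[torch.Tensor, torch.Tensor]:
    # Score every action; self.scorer stands in for the bandit's model.
    scores = self.scorer(contextualized_actions).squeeze(-1)  # (batch_size, n_actions)
    chosen_actions = F.one_hot(scores.argmax(dim=-1), scores.shape[-1]).float()
    # Deterministic selection: the chosen action always has probability 1.
    p = torch.ones(scores.shape[0])
    return chosen_actions, p
```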
_update(*args, **kwargs)
abstractmethod
Abstract method to perform a single update step. Should be implemented by the concrete bandit classes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `batch` | | The output of your data iterable, usually a DataLoader. It may contain up to 4 elements: `contextualized_actions` of shape `(batch_size, n_chosen_actions, n_features)`; optionally `embedded_actions` of shape `(batch_size, n_chosen_actions, n_features)`, only passed and required for certain bandits like the `NeuralLinearBandit`; `realized_rewards` of shape `(batch_size, n_chosen_actions)`; and optionally `chosen_actions`, only passed and required for certain bandits. | required |
| `batch_idx` | | The index of this batch. Note that if a separate DataLoader is used for each step, this will be reset for each new data loader. | required |
| `data_loader_idx` | | The index of the data loader. This is useful if you have multiple data loaders at once and want to do something different for each one. | required |
| `*args` | `Any` | Additional arguments. | `()` |
| `**kwargs` | `Any` | Additional keyword arguments. | `{}` |
Returns:
| Type | Description |
|---|---|
| `Tensor` | The loss value. In most cases, it makes sense to return the negative reward. Shape: `(1,)`. Since we do not use the Lightning optimizer, this value is only relevant for logging/visualization of the training process. |
Source code in src/calvera/bandits/abstract_bandit.py
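A sketch of a matching `_update`, assuming the 2-element batch format and a hypothetical `self._refit` helper:

```python
import torch


def _update(self, batch, batch_idx: int) -> torch.Tensor:
    contextualized_actions, realized_rewards = batch  # 2-element batch format
    # Hypothetical helper that refits the bandit's model on the new data.
    self._refit(contextualized_actions, realized_rewards)
    # Return the negative reward as the loss; it is only used for logging.
    return -realized_rewards.mean().reshape(1)
```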
DummyBandit(n_features, k=1)
Bases: AbstractBandit[ActionInputType]
A dummy bandit that always selects random actions.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_features` | `int` | The number of features in the bandit model. Must be positive. | required |
| `k` | `int` | Number of actions to select. Must be positive. Default is 1. | `1` |
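For example, assuming `DummyBandit` is importable from the same module as `AbstractBandit`:

```python
import torch

from calvera.bandits.abstract_bandit import DummyBandit

bandit = DummyBandit(n_features=16, k=1)

contextualized_actions = torch.randn(8, 5, 16)
chosen_actions, p = bandit(contextualized_actions)  # random one-hot choices
```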