ddql_optimal_execution.agent.DDQL

class ddql_optimal_execution.agent.DDQL(state_dict: Optional[dict] = None, greedy_decay_rate: float = 0.95, target_update_rate: int = 15, initial_greediness: float = 1, mode: str = 'train', lr: float = 0.001, state_size: int = 5, initial_budget: int = 100, horizon: int = 100, gamma: float = 0.99, quadratic_penalty_coefficient: float = 0.01, verbose: bool = False)[source]

The DDQL class inherits from the Agent class. It implements the Double Deep Q-Learning algorithm.

Parameters
  • state_dict (dict, optional) – A dictionary containing the state of the agent, by default None

  • greedy_decay_rate (float, optional) – The greedy decay rate, by default 0.95

  • target_update_rate (int, optional) – The target update rate, by default 15

  • initial_greediness (float, optional) – The initial greediness, by default 1

  • mode (str, optional) – The mode of the agent, either “train” or “eval”, by default “train”

  • lr (float, optional) – The learning rate, by default 1e-3

  • state_size (int, optional) – The state size, by default 5

  • initial_budget (int, optional) – The initial budget, by default 100

  • horizon (int, optional) – The horizon, by default 100

  • gamma (float, optional) – The gamma parameter used in the Q-Learning algorithm, by default 0.99

  • quadratic_penalty_coefficient (float, optional) – The quadratic penalty coefficient used to penalize the agent for selling large quantities of stock, by default 0.01

  • verbose (bool, optional) – Verbosity flag, by default False
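
Example (a minimal instantiation sketch, not taken from the package's own documentation; the import path follows the qualified name ddql_optimal_execution.agent.DDQL shown above, and all argument values are illustrative):

    from ddql_optimal_execution.agent import DDQL

    # Create an agent for an execution problem with a 5-dimensional state,
    # an initial inventory of 100 and a horizon of 100 periods.
    agent = DDQL(
        state_size=5,
        initial_budget=100,
        horizon=100,
        lr=1e-3,
        gamma=0.99,
        verbose=False,
    )
    agent.train()  # put the agent in "train" mode before collecting experiences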

device

The device used to run the agent

Type

torch.device

main_net

The main neural network used to predict the Q-values of the state-action pairs

Type

QNet

target_net

The target neural network used to predict the Q-values of the state-action pairs

Type

QNet

state_size

The state size

Type

int

greedy_decay_rate

The greedy decay rate

Type

float

target_update_rate

The target update rate

Type

int

initial_greediness

The initial greediness of the agent. It is used to determine the probability of the agent taking a random action.

Type

float

greediness

The current greediness of the agent.

Type

float

mode

The mode of the agent. It can be either “train” or “eval”.

Type

str

lr

The learning rate used to update the weights of the neural network.

Type

float

gamma

The gamma parameter used in the Q-Learning algorithm.

Type

float

quadratic_penalty_coefficient

The quadratic penalty coefficient used to penalize the agent for selling large quantities of stock.

Type

float

optimizer

The optimizer used to update the weights of the neural network.

Type

torch.optim

loss_fn

The loss function used to calculate the loss between the predicted Q-values and the target Q-values.

Type

torch.nn

__init__(state_dict: Optional[dict] = None, greedy_decay_rate: float = 0.95, target_update_rate: int = 15, initial_greediness: float = 1, mode: str = 'train', lr: float = 0.001, state_size: int = 5, initial_budget: int = 100, horizon: int = 100, gamma: float = 0.99, quadratic_penalty_coefficient: float = 0.01, verbose: bool = False) → None[source]

Methods

__init__([state_dict, greedy_decay_rate, ...])

eval()

This function sets the mode to "eval" and puts the main network in evaluation mode.

get_action(state)

This function returns a tensor that is either a random binomial distribution or the index of the maximum value in the output of a neural network, depending on certain conditions.

learn(experience_batch)

This function trains a neural network using a batch of experiences and updates the target network periodically.

train()

This function sets the mode to "train" and puts the main network in training mode.

__complete_target(experience_batch: numpy.ndarray) → Tuple[torch.Tensor, torch.Tensor, torch.Tensor]

This function takes in a batch of experiences and returns the corresponding targets, actions, and states for training a reinforcement learning agent.

Parameters
  • experience_batch (np.ndarray) – A numpy array containing a batch of experiences. Each experience is a dictionary containing information about a single step taken by the agent in the environment, with keys such as "state", "action", "reward", "next_state", and "dist2Horizon".

Returns

the targets, actions, and states tensors

Return type

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
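
The package's internal target computation is not reproduced here; the following is a minimal sketch of how a Double DQN target is typically formed, in which main_net, target_net and the batch tensors (rewards, next_states, dones) are illustrative assumptions rather than names from this package:

    import torch

    def double_dqn_targets(main_net, target_net, rewards, next_states, dones, gamma=0.99):
        with torch.no_grad():
            # The main network selects the greedy next action...
            next_actions = main_net(next_states).argmax(dim=1, keepdim=True)
            # ...and the target network evaluates that selected action.
            next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Terminal transitions contribute only the immediate reward.
        return rewards + gamma * next_q * (1.0 - dones)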

__update_target_net() → None

This function updates the target network by loading the state dictionary of the main network.
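
In PyTorch this kind of hard update is usually a single call; a sketch, assuming main_net and target_net are the QNet modules documented above:

    # Copy every weight of the main network into the target network.
    target_net.load_state_dict(main_net.state_dict())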

eval() → None[source]

This function sets the mode to “eval” and puts the main network in evaluation mode.

get_action(state: ddql_optimal_execution.state._state.State) → int[source]

This function returns a tensor that is either a random binomial distribution or the index of the maximum value in the output of a neural network, depending on certain conditions.

Parameters
  • state (State) – An instance of the State class, which contains information about the current state of the environment in which the agent is operating. This typically includes things like the agent's current position, the state of the game board, and any other relevant information the agent needs.

Returns

an integer representing the action to be taken given the state. If the greediness condition is met and the mode is “train”, the action is drawn from a binomial distribution using the state’s inventory as the number of trials and 1/inventory as the probability of success. Otherwise, the action is determined by the main neural network’s output, namely the index of the maximum value in the output Q-values tensor.

Return type

int
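
A sketch of the selection rule described above; it mirrors the documented behaviour rather than the package's source, and main_net, state_tensor, inventory and greediness are assumed inputs:

    import numpy as np
    import torch

    def epsilon_greedy_action(main_net, state_tensor, inventory, greediness, mode="train"):
        # Exploration: a binomial draw with `inventory` trials and success
        # probability 1 / inventory, used only while training.
        if mode == "train" and np.random.rand() < greediness and inventory > 0:
            return int(np.random.binomial(inventory, 1.0 / inventory))
        # Exploitation: index of the largest predicted Q-value.
        with torch.no_grad():
            q_values = main_net(state_tensor)
        return int(torch.argmax(q_values).item())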

learn(experience_batch: numpy.ndarray) → None[source]

This function trains a neural network using a batch of experiences and updates the target network periodically.

Parameters
  • experience_batch (np.ndarray) – A numpy array containing a batch of experiences, where each experience is a tuple of (state, action, reward, next_state, dist2Horizon). This batch is used to update the neural network's weights through backpropagation.
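
A sketch of how learn might be driven from a replay buffer; the buffer and batch construction are assumptions, while agent is the DDQL instance and learn the method documented on this page:

    import numpy as np

    batch_size = 32
    # `replay_buffer` is assumed to be a list of experience tuples
    # (state, action, reward, next_state, dist2Horizon) collected elsewhere.
    if len(replay_buffer) >= batch_size:
        idx = np.random.choice(len(replay_buffer), size=batch_size, replace=False)
        experience_batch = np.array([replay_buffer[i] for i in idx], dtype=object)
        agent.learn(experience_batch)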

train() → None[source]

This function sets the mode to “train” and puts the main network in training mode.
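
Taken together, train() and eval() switch the agent between exploratory learning and greedy evaluation; a short sketch, where the state variable is an assumed State instance built elsewhere:

    agent.train()  # exploration enabled; feed collected batches to agent.learn(...)
    # ... training episodes ...
    agent.eval()   # exploration disabled; get_action() now always exploits
    action = agent.get_action(state)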