a2rl.Simulator.sample

Simulator.sample(seq, max_size=3, as_token=False, correct_unseen_token=True)

Given a batch of contexts, perform one-step sampling of actions and rewards.

Example:

Input:

seq = [[1,2], [3,4]]
max_size = 2

Output:

wi.WiDataFrame([
    [10, 11], # From context [1,2]
    [12, 13], # From context [1,2]
    [20, 21], # From context [3,4]
    [22, 23], # From context [3,4]
])

Parameters:

seq (ndarray) – A batch of contexts (s, a, r, ..., s). Each context must end with the tokens of the states dataframe. Shape is (batch_size, context_length). If context_length is greater than AutoTokenizer.block_size, the input sequence is silently trimmed to (batch_size, block_size).
max_size (int) – Number of samples to return for each context.
as_token (bool) – Whether the returned dataframe should be in tokenized format or in the original value space (approximated).
correct_unseen_token (bool) – If True, map each unseen token to the closest valid token.
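
Because trimming of an over-long seq is silent, it can help to check the context length before sampling. The following is a minimal sketch using only NumPy; would_be_trimmed is a hypothetical helper, not part of a2rl, and block_size is assumed to be read by the caller from the tokenizer in use.

```python
import numpy as np


def would_be_trimmed(seq: np.ndarray, block_size: int) -> bool:
    """Hypothetical helper: True if seq exceeds block_size on the context axis."""
    if seq.ndim != 2:
        raise ValueError(
            f"expected shape (batch_size, context_length), got {seq.shape}"
        )
    # Axis 1 is context_length; anything beyond block_size would be trimmed.
    return seq.shape[1] > block_size


seq = np.array([[1, 2], [3, 4]])  # shape (2, 2), as in the example above
print(would_be_trimmed(seq, block_size=4))  # False
print(would_be_trimmed(seq, block_size=1))  # True
```

The check only reports whether trimming would happen; it makes no assumption about which end of the context is kept.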

Return type:

WiDataFrame

Returns:

Whatif dataframe where each row is a sample with actions and rewards columns. The as_token argument determines whether the dataframe contents are tokenized or in the original value space (approximated).

Shape is (batch_size * max_size, ...). Rows are grouped by context: the 1st context’s samples come first, followed by the 2nd context’s samples, and so on.
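
The row ordering described here implies that row i belongs to context i // max_size. As a rough sketch (context_index_per_row is a hypothetical helper, not part of a2rl), the owning context of each returned row can be computed with NumPy:

```python
import numpy as np


def context_index_per_row(batch_size: int, max_size: int) -> np.ndarray:
    """Hypothetical helper: index of the input context that produced each row."""
    # Each context contributes max_size consecutive rows, in batch order.
    return np.repeat(np.arange(batch_size), max_size)


owners = context_index_per_row(batch_size=2, max_size=2)
print(owners.tolist())  # [0, 0, 1, 1]
```

Attaching this array as an extra column is one way to group the returned samples by their originating context.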

Note

Ensure the correct context sequence (s, a, r, ..., s) is passed in.
Returns max_size samples for each context. The samples may not be unique.
Each row of the returned result represents actions, rewards and values.