a2rl.Simulator.lookahead

Simulator.lookahead(seq, action, correct_unseen_token=True)

Given a batch of contexts and a batch of actions, simulates the expected rewards and next states for all combinations of contexts and actions.

This is a simulated step that returns the estimated reward and next states; it can be run multiple times for planning purposes.

Example 1 - Reward and action have dim of 2

Input:

seq = np.array([[1,2], [3,4]])
action = np.array([[10,20], [30,40]])

Output:

reward = np.array([
                [80, 81], # From seq = [1,2], action = [10,20]
                [82, 83], # From seq = [1,2], action = [30,40]
                [90, 91], # From seq = [3,4], action = [10,20]
                [92, 93], # From seq = [3,4], action = [30,40]
                ])

next_states = np.array([
                    [180, 181], # From seq = [1,2], action = [10,20]
                    [182, 183], # From seq = [1,2], action = [30,40]
                    [190, 191], # From seq = [3,4], action = [10,20]
                    [192, 193], # From seq = [3,4], action = [30,40]
                    ])
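The rows of reward and next_states are ordered first by context and then by action, so with batched inputs they can be viewed as a (n_contexts, n_actions, ...) grid. A minimal NumPy sketch using the reward values above (the reshape is illustrative only, not part of the API):

import numpy as np

# Reward rows from Example 1, ordered by context then by action.
reward = np.array([[80, 81], [82, 83], [90, 91], [92, 93]])
n_contexts, n_actions = 2, 2

# View the flat output as a (n_contexts, n_actions, reward_dim) grid.
reward_grid = reward.reshape(n_contexts, n_actions, -1)
print(reward_grid[1, 0])  # reward for seq=[3, 4], action=[10, 20] -> [90 91]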

Example 2 - Reward has dim of 1, action is a list

Input:

seq = np.array([1,2])
action = [10,20]

Output:

reward = np.array([80, 81])
next_states = np.array([180, 181])
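Because lookahead only simulates a step, it can be called repeatedly to compare candidate actions before committing to one. The sketch below is illustrative, not part of the API: it assumes simulator is an already-built a2rl.Simulator, a single context, batched 2-dim shapes as in Example 1, and that ranking by the first reward column is the desired criterion (the helper greedy_one_step is hypothetical).

import numpy as np

def greedy_one_step(simulator, context, candidate_actions):
    # `simulator` is assumed to be an already-built a2rl.Simulator.
    # `context` is a single token sequence ending with the state tokens;
    # `candidate_actions` holds one row of action tokens per candidate.
    rewards, next_states = simulator.lookahead(context, candidate_actions)

    # With a single context, row i corresponds to candidate_actions[i].
    best = int(np.argmax(rewards[:, 0]))  # rank by the first reward column
    return candidate_actions[best], next_states[best]

For deeper planning, one could extend the context with the chosen action and the returned next-state tokens, then call lookahead again on the new sequence.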
Parameters:
  • seq (ndarray) – Context (s, a, r, …, s). Must end with states dataframe token.

  • action (ndarray) – Action dataframe token to be performed.

  • correct_unseen_token (bool) – Map unseen token to the closest valid token when True.

Return type:

tuple[ndarray, ndarray]

Returns:

Rewards array and next states array.