a2rl.Simulator.lookahead
- Simulator.lookahead(seq, action, correct_unseen_token=True)[source]
Given a batch of contexts and a batch of actions, simulates the expected rewards and next states for all combinations of contexts and actions.
This is a simulated step that estimates the reward and next state; it can be run multiple times for planning purposes.
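A minimal usage sketch, assuming `simulator` is an already-fitted a2rl.Simulator and the token values are placeholders taken from Example 1 below:

import numpy as np

# Assumed: `simulator` is a fitted a2rl.Simulator (not constructed here).
seq = np.array([[1, 2], [3, 4]])         # batch of 2 contexts
action = np.array([[10, 20], [30, 40]])  # batch of 2 candidate actions

# One simulated step: rewards and next states for every
# (context, action) combination; the call can be repeated
# with other candidate actions when planning.
reward, next_states = simulator.lookahead(seq, action)
# reward.shape == (4, 2): rows ordered context-major, as in Example 1.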
Example 1 - Rewards and actions have dim of 2
Input:
seq = np.array([[1,2], [3,4]])
action = np.array([[10,20], [30,40]])
Output:
reward = np.array([
    [80, 81],  # From seq = [1,2], action = [10,20]
    [82, 83],  # From seq = [1,2], action = [30,40]
    [90, 91],  # From seq = [3,4], action = [10,20]
    [92, 93],  # From seq = [3,4], action = [30,40]
])
next_states = np.array([
    [180, 181],  # From seq = [1,2], action = [10,20]
    [182, 183],  # From seq = [1,2], action = [30,40]
    [190, 191],  # From seq = [3,4], action = [10,20]
    [192, 193],  # From seq = [3,4], action = [30,40]
])
Example 2 - Reward has dim of 1, action is a list
Input:
seq = np.array([1,2])
action = [10,20]
Output:
reward = np.array([80, 81])
next_states = np.array([180, 181])
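A hedged planning sketch building on Example 1: reshape the combined rewards to pick the highest-reward action per context. The context-major row ordering follows Example 1, summing the reward tokens is an illustrative aggregation, and `simulator` is again an assumed fitted instance:

import numpy as np

seq = np.array([[1, 2], [3, 4]])
action = np.array([[10, 20], [30, 40]])
reward, next_states = simulator.lookahead(seq, action)

n_ctx, n_act = seq.shape[0], action.shape[0]
# Rows are ordered context-major (Example 1), so group them by context.
per_ctx = reward.reshape(n_ctx, n_act, -1)  # (context, action, reward dim)
best = per_ctx.sum(axis=-1).argmax(axis=1)  # best action index per context
best_actions = action[best]                 # one action per context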
- Parameters:
  - seq (np.ndarray) – Batch of context sequences.
  - action (np.ndarray or list) – Batch of actions to combine with each context.
  - correct_unseen_token (bool) – Defaults to True.
- Return type:
  Tuple[np.ndarray, np.ndarray]
- Returns:
  Rewards array and next-states array.