This section shows how to use a2rl.Simulator to get a recommendation.

Simulator provides recommendations differently from a typical reinforcement learning workflow, in which you first train an RL agent (e.g. SAC, PPO) against a simulator, and only then can the agent recommend an action.

First, a Q-value is calculated internally when you load the data using wi_df.add_value(). The Simulator is then trained on sequences of states, actions, rewards, and Q-values. To choose an action, you only need to sample multiple trajectories based on the current context.
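The text does not show how `add_value()` derives the Q-value. As a generic illustration only, the sketch below applies a tabular SARSA update to logged transitions. The function name `sarsa_values`, the `alpha` and `gamma` defaults, and the toy transitions are all hypothetical, and a2rl's own computation may differ.

```python
from collections import defaultdict


def sarsa_values(transitions, alpha=0.1, gamma=0.6, sweeps=1):
    """Estimate Q(s, a) from logged (state, action, reward) tuples in time order.

    Each update is Q(s, a) += alpha * (r + gamma * Q(s', a') - Q(s, a)),
    where (s', a') is the next logged state-action pair.
    """
    q = defaultdict(float)  # unseen pairs start at 0.0
    for _ in range(sweeps):
        for (s, a, r), (s_next, a_next, _) in zip(transitions, transitions[1:]):
            td_target = r + gamma * q[(s_next, a_next)]
            q[(s, a)] += alpha * (td_target - q[(s, a)])
    return dict(q)


# Hypothetical toy log: three consecutive steps.
log = [("s0", "a", 1.0), ("s1", "b", 2.0), ("s2", "a", 0.0)]
q_values = sarsa_values(log)
```

The last transition has no successor in the log, so it is only used as a bootstrap target and never updated itself.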

%matplotlib inline
%load_ext autoreload
%autoreload 2

import my_nb_path  # isort: skip
import os
from pathlib import Path

import numpy as np

import a2rl as wi
from a2rl.nbtools import pprint, print  # Enable color outputs when rich is installed.
Load Dataset#

Instantiate a tokenizer given the selected dataset.

wi_df = wi.read_csv_dataset(wi.sample_dataset_path("chiller"))

# Speed up training for demo purpose
wi_df = wi_df.iloc[:1000]
tokenizer = wi.AutoTokenizer(wi_df, block_size_row=2)

   condenser_inlet_temp  evaporator_heat_load_rt  staging  system_power_consumption        value
0                  29.5                    455.4        1                     756.4  1007.294795
1                  30.2                    913.1        0                     959.3   780.987943
Train a model#

The default hyperparameters are located at src/a2rl/config.yaml. Alternatively, you can (1) specify your own configuration file using config_dir and config_name, or (2) pass the configuration in through the config parameter. Refer to GPTBuilder for more information.

model_dir = "model-simulator"
config = None  # Default training configuration

# To run in fast mode, set env var NOTEBOOK_FAST_RUN=1 prior to starting Jupyter
if os.environ.get("NOTEBOOK_FAST_RUN", "0") != "0":
    config = {
        "train_config": {
            "epochs": 1,
            "batch_size": 512,
            "embedding_dim": 512,
            "gpt_n_layer": 1,
            "gpt_n_head": 1,
            "learning_rate": 6e-4,
            "num_workers": 0,
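            "lr_decay": True,
        }
    }

# Assumed reconstruction: the line that creates `builder` is missing from this page.
builder = wi.GPTBuilder(tokenizer, model_dir, config)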

Start GPT model training.

%%time
builder.fit()

2023-05-22 10:08:55.525 | INFO     | a2rl.simulator:fit:753 - {'epochs': 1, 'batch_size': 512, 'embedding_dim': 512, 'gpt_n_layer': 1, 'gpt_n_head': 1, 'learning_rate': 0.0006, 'num_workers': 0, 'lr_decay': True}
2023-05-22 10:09:00.403 | INFO     | a2rl.simulator:fit:787 - Training time in mins: 0.08
CPU times: user 9.12 s, sys: 351 ms, total: 9.47 s
Wall time: 4.9 s
  (tok_emb): Embedding(351, 512)
  (drop): Dropout(p=0.1, inplace=False)
  (blocks): Sequential(
    (0): Block(
      (ln1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (attn): CausalSelfAttention(
        (key): Linear(in_features=512, out_features=512, bias=True)
        (query): Linear(in_features=512, out_features=512, bias=True)
        (value): Linear(in_features=512, out_features=512, bias=True)
        (attn_drop): Dropout(p=0.1, inplace=False)
        (resid_drop): Dropout(p=0.1, inplace=False)
        (proj): Linear(in_features=512, out_features=512, bias=True)
      (mlp): Sequential(
        (0): Linear(in_features=512, out_features=2048, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=2048, out_features=512, bias=True)
        (3): Dropout(p=0.1, inplace=False)
  (ln_f): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (head): Linear(in_features=512, out_features=351, bias=False)

Plot the original GPT tokens against the predicted horizon, given an initial context window.

builder.evaluate(context_len=5, sample=False, horizon=50);

The graph above resembles behaviour cloning: the model acts according to historical patterns. In the next graph, setting sample=True samples different trajectories.

builder.evaluate(context_len=5, sample=True, horizon=50);
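The difference between `sample=False` and `sample=True` can be pictured as greedy decoding versus drawing from the predicted distribution. The following is a minimal standard-library sketch of that idea, not a2rl's internal code; `softmax`, `next_token`, and the `logits` values are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(logits, sample, rng=random):
    """Pick the next token index, greedily or by sampling."""
    probs = softmax(logits)
    if not sample:
        # Greedy: always take the most probable token (repeats history).
        return max(range(len(probs)), key=probs.__getitem__)
    # Stochastic: draw an index in proportion to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
greedy = next_token(logits, sample=False)
drawn = [next_token(logits, sample=True, rng=random.Random(seed)) for seed in range(5)]
```

Greedy decoding returns the same token every time for a given context, whereas sampling can return any token with non-zero probability, which is what lets the second plot diverge from the historical pattern.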

Get Recommendation#

simulator = wi.Simulator(tokenizer, builder.model)
   condenser_inlet_temp  evaporator_heat_load_rt  staging  system_power_consumption  value
0                    26                       45      344                       173    300
1                    33                      104      343                       213    261

Get a custom context sequence.

Note: The sequence should end with a state, i.e. (s, a, r, …, s)

custom_context = tokenizer.df_tokenized.sequence[:7]
array([ 26,  45, 344, 173, 300,  33, 104])
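The seven-token context above is consistent with the tokenized rows being flattened row by row, then cut right after the two state columns of the second row. A small sketch of that reading follows; the helper names `flatten_rows` and `ends_with_state` are hypothetical, and the column layout (2 states, 1 action, 1 reward, 1 value) is taken from the table header shown earlier.

```python
def flatten_rows(rows):
    """Concatenate tokenized rows into one flat sequence."""
    return [token for row in rows for token in row]


def ends_with_state(context, n_states, n_cols):
    """True when the context stops right after the state columns of a row."""
    return len(context) % n_cols == n_states


# Tokenized rows copied from the dataframe shown above.
rows = [
    [26, 45, 344, 173, 300],
    [33, 104, 343, 213, 261],
]
sequence = flatten_rows(rows)
context = sequence[:7]  # one full row plus the next row's two state tokens
```

A check like `ends_with_state(context, n_states=2, n_cols=5)` is a cheap guard before asking the simulator for actions.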

One step sample#

sample returns a dataframe whose columns are (actions, reward, value, next states) given the context. The contents of the dataframe are in the original space (approximated).

recommendation_df = simulator.sample(custom_context, max_size=10, as_token=False)
   staging  system_power_consumption        value
0        6                  839.9500   804.951501
1        0                  861.1995   921.267877
2        1                  777.8640  1069.562352
3        6                 1100.9950   660.180949
4        2                 1028.8500   981.629609
5        2                 1000.1635   987.438738
6        6                 1199.6880   792.387914
7        5                  741.0085   804.951501
8        0                  861.1995  1357.274873
9        1                  918.1375  1301.132534
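One way to turn these samples into a single recommendation is to average the value per action and pick the best one. The sketch below does this with the (staging, value) pairs copied from the table above. Whether a higher or a lower value is preferable depends on how the reward is defined, so the `maximize` flag and the helper names are assumptions for illustration.

```python
from collections import defaultdict

# (staging, value) pairs copied from the sampled dataframe above.
samples = [
    (6, 804.951501), (0, 921.267877), (1, 1069.562352), (6, 660.180949),
    (2, 981.629609), (2, 987.438738), (6, 792.387914), (5, 804.951501),
    (0, 1357.274873), (1, 1301.132534),
]


def mean_value_by_action(samples):
    """Average the sampled value for each distinct action."""
    grouped = defaultdict(list)
    for action, value in samples:
        grouped[action].append(value)
    return {action: sum(vals) / len(vals) for action, vals in grouped.items()}


def pick_action(samples, maximize=True):
    """Return the action with the best mean value."""
    means = mean_value_by_action(samples)
    chooser = max if maximize else min
    return chooser(means, key=means.get)


best_if_maximizing = pick_action(samples)
best_if_minimizing = pick_action(samples, maximize=False)
```

With the dataframe itself, the equivalent pandas expression would be `recommendation_df.groupby("staging")["value"].mean().idxmax()`.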

Build Your Own Planner#

If you want to build your own planner, a2rl provides a few lower-level APIs.

Get valid actions#

get_valid_actions returns a dataframe of potential actions (in tokenized form) given the context.

Let’s take a custom context that runs up to the current states, and find the next top_k actions.

valid_actions = simulator.get_valid_actions(custom_context, max_size=2)
0 351
1 353

One step lookahead#

lookahead returns the reward and next states, given the context and action.

Let’s pick an action and simulate the reward and next states. This API does not change the simulator’s internal counter or states.

custom_context = np.array([0, 100])
action_seq = [valid_actions.loc[0, :]]
print(f"Given the context: {custom_context} and action: {action_seq}\n")

reward, next_states = simulator.lookahead(custom_context, action_seq)
Given the context: [  0 100] and action: [staging    351
Name: 0, dtype: int64]

reward=array([180, 296])
next_states=array([22, 69])
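These two calls are enough for a simple one-step planner: enumerate the valid actions, look ahead with each, and keep the best. The sketch below runs against a hypothetical `StubSimulator` so that it is self-contained. The stub returns plain lists rather than the dataframe and arrays a2rl returns, and its reward for action 353 is made up. With the real simulator you would iterate over rows, as in `valid_actions.loc[0, :]` above.

```python
class StubSimulator:
    """Hypothetical stand-in exposing the two calls used in this section."""

    def get_valid_actions(self, context, max_size):
        return [351, 353][:max_size]

    def lookahead(self, context, action_seq):
        # 180 mirrors the first reward token shown above; 150 is invented.
        reward = {351: 180, 353: 150}[action_seq[0]]
        return [reward], [22, 69]


def one_step_plan(simulator, context, max_size=2, maximize=True):
    """Try each valid action once and return (best_action, best_reward)."""
    best_action, best_reward = None, None
    for action in simulator.get_valid_actions(context, max_size=max_size):
        reward, _next_states = simulator.lookahead(context, [action])
        score = reward[0]
        if best_reward is None:
            better = True
        else:
            better = score > best_reward if maximize else score < best_reward
        if better:
            best_action, best_reward = action, score
    return best_action, best_reward


plan_max = one_step_plan(StubSimulator(), [0, 100])
plan_min = one_step_plan(StubSimulator(), [0, 100], maximize=False)
```

Because lookahead leaves the simulator's internal state untouched, the loop can probe every candidate from the same context.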


Get a gym-compatible simulator using SimulatorWrapper.

sim_wrapper = wi.SimulatorWrapper(env=simulator)
Get the action-to-gym encoding mapping. Gym expects actions to be a list of consecutive integers.

{'staging': {'0': 0,
  '1': 1,
  '10': 2,
  '2': 3,
  '3': 4,
  '4': 5,
  '5': 6,
  '6': 7,
  '7': 8,
  '8': 9,
  '9': 10}}
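The mapping printed above matches what you get by sorting the action labels as strings and numbering them in that order, which is why '10' lands at index 2. The sketch below reproduces it under that assumption. It is an observation about the output, not a confirmed description of how SimulatorWrapper builds the mapping, and the variable names are hypothetical.

```python
# Action labels "0" .. "10", as strings.
labels = [str(level) for level in range(11)]

# Lexicographic sort puts "10" between "1" and "2".
action_to_gym = {label: code for code, label in enumerate(sorted(labels))}

# Inverse mapping, to decode a gym action index back to the original label.
gym_to_action = {code: label for label, code in action_to_gym.items()}
```

Keeping the inverse mapping at hand makes it easy to translate an agent's integer action back to a staging level.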
array([30.2, 913.1], dtype=object)
obs, reward, done, info = sim_wrapper.step([0])
obs, reward
(array([31.1505, 733.693], dtype=object), 731.473)
Dict('condenser_inlet_temp': Box(26.3, 31.9, (1,), float32), 'evaporator_heat_load_rt': Box(185.6, 1436.0, (1,), float32))

3rd Party Tools#

Use with 3rd party packages such as stable_baselines3.

Because PPO requires observations as an array of np.float32, use OpenAI Gym’s observation wrapper to perform the transformation your training agent needs.


import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.ppo import MlpPolicy

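class CustomObservation(gym.ObservationWrapper):
    def __init__(self, env: gym.Env):
        super().__init__(env)
        # Box arguments are cut off in the source; the values below are assumptions.
        self.observation_space = gym.spaces.Box(
            low=-np.inf,
            high=np.inf,
            shape=(len(env.observation_space.spaces),),
            dtype=np.float32,
        )

    def observation(self, observation):
        # Cast the object-dtype observation to float32 for PPO.
        return observation.astype(np.float32)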

new_sim = CustomObservation(sim_wrapper)
model = PPO(MlpPolicy, new_sim, verbose=0)

obs = new_sim.reset()
for i in range(2):
    action, _state = model.predict(obs, deterministic=True)
    obs, reward, done, info = new_sim.step(action)
    if done:
        obs = new_sim.reset()

mean_reward, std_reward = evaluate_policy(model, new_sim, n_eval_episodes=1)
print(f"Mean reward:{mean_reward:.2f} +/- {std_reward:.2f}")
Mean reward:84940.77 +/- 0.00
CPU times: user 1min 15s, sys: 115 ms, total: 1min 15s
Wall time: 42.9 s
