Simulator#
In this section we show how to use a2rl.Simulator to get a recommendation.
The way Simulator provides recommendations differs from a typical Reinforcement Learning approach, where you first need to train an RL agent (e.g. SAC, PPO) against a simulator, and only then can the agent recommend an action.
Here, a Q-value is first calculated internally when you call wi_df.add_value() on the loaded data. The Simulator is then trained on sequences of states, actions, rewards, and Q-values. To choose an action, you simply sample multiple trajectories based on the current context.
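At a glance, the end-to-end workflow in this notebook looks like the rough sketch below; each step is walked through in the sections that follow, and the context passed to sample() is simply a token sequence ending with the current states.

import a2rl as wi

wi_df = wi.read_csv_dataset(wi.sample_dataset_path("chiller"))
wi_df.add_value()  # compute the Q-value column internally
tokenizer = wi.AutoTokenizer(wi_df, block_size_row=2)
builder = wi.GPTBuilder(tokenizer, "model-simulator")
builder.fit()  # train the GPT-based simulator
simulator = wi.Simulator(tokenizer, builder.model)
# Sample candidate (action, reward, value, next-states) rows for a context
context = tokenizer.df_tokenized.sequence[:7]  # tokens ending with the current states
recommendation_df = simulator.sample(context, max_size=10, as_token=False)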
[1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2
import my_nb_path # isort: skip
import os
from pathlib import Path
import numpy as np
import a2rl as wi
from a2rl.nbtools import pprint, print # Enable color outputs when rich is installed.
Load Dataset#
Instantiate a tokenizer given the selected dataset.
[2]:
wi_df = wi.read_csv_dataset(wi.sample_dataset_path("chiller"))
wi_df.add_value()
# Speed up training for demo purposes
wi_df = wi_df.iloc[:1000]
tokenizer = wi.AutoTokenizer(wi_df, block_size_row=2)
tokenizer.df.head(2)
/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/sklearn/preprocessing/_discretization.py:279: UserWarning: Bins whose width are too small (i.e., <= 1e-8) in feature 0 are removed. Consider decreasing the number of bins.
warnings.warn(
[2]:
|   | condenser_inlet_temp | evaporator_heat_load_rt | staging | system_power_consumption | value |
|---|---|---|---|---|---|
| 0 | 29.5 | 455.4 | 1 | 756.4 | 1007.294795 |
| 1 | 30.2 | 913.1 | 0 | 959.3 | 780.987943 |
[3]:
tokenizer.df_tokenized.head(2)
[3]:
|   | condenser_inlet_temp | evaporator_heat_load_rt | staging | system_power_consumption | value |
|---|---|---|---|---|---|
| 0 | 26 | 45 | 344 | 173 | 300 |
| 1 | 33 | 104 | 343 | 213 | 261 |
Train a model#
The default hyperparameters are located at src/a2rl/config.yaml. Alternatively, you can (1) specify your own configuration file using config_dir and config_name, or (2) pass the configuration in directly as the config parameter. Refer to GPTBuilder for more info.
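For example, option (1) might look like the sketch below. The directory and file names are placeholders, not files shipped with the package; check the GPTBuilder docs for the exact config_name convention.

# Option (1): load hyperparameters from your own YAML file.
# "my_configs" and "my_config.yaml" are placeholder names.
builder = wi.GPTBuilder(
    tokenizer,
    "model-simulator",
    config_dir="my_configs",
    config_name="my_config.yaml",
)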
[4]:
model_dir = "model-simulator"
config = None # Default training configuration
################################################################################
# To run in fast mode, set env var NOTEBOOK_FAST_RUN=1 prior to starting Jupyter
################################################################################
if os.environ.get("NOTEBOOK_FAST_RUN", "0") != "0":
config = {
"train_config": {
"epochs": 1,
"batch_size": 512,
"embedding_dim": 512,
"gpt_n_layer": 1,
"gpt_n_head": 1,
"learning_rate": 6e-4,
"num_workers": 0,
"lr_decay": True,
}
}
from IPython.display import Markdown
display(
Markdown(
'<p style="color:firebrick; background-color:yellow; font-weight:bold">'
"NOTE: notebook runs in fast mode. Use only 1 epoch. Results may differ."
)
)
################################################################################
builder = wi.GPTBuilder(tokenizer, model_dir, config)
NOTE: notebook runs in fast mode. Use only 1 epoch. Results may differ.
Start GPT model training.
[5]:
%%time
builder.fit()
2023-05-22 10:08:55.525 | INFO | a2rl.simulator:fit:753 - {'epochs': 1, 'batch_size': 512, 'embedding_dim': 512, 'gpt_n_layer': 1, 'gpt_n_head': 1, 'learning_rate': 0.0006, 'num_workers': 0, 'lr_decay': True}
2023-05-22 10:09:00.403 | INFO | a2rl.simulator:fit:787 - Training time in mins: 0.08
CPU times: user 9.12 s, sys: 351 ms, total: 9.47 s
Wall time: 4.9 s
[5]:
GPT(
(tok_emb): Embedding(351, 512)
(drop): Dropout(p=0.1, inplace=False)
(blocks): Sequential(
(0): Block(
(ln1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(ln2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(attn): CausalSelfAttention(
(key): Linear(in_features=512, out_features=512, bias=True)
(query): Linear(in_features=512, out_features=512, bias=True)
(value): Linear(in_features=512, out_features=512, bias=True)
(attn_drop): Dropout(p=0.1, inplace=False)
(resid_drop): Dropout(p=0.1, inplace=False)
(proj): Linear(in_features=512, out_features=512, bias=True)
)
(mlp): Sequential(
(0): Linear(in_features=512, out_features=2048, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=2048, out_features=512, bias=True)
(3): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(head): Linear(in_features=512, out_features=351, bias=False)
)
Plot the original GPT tokens vs the predicted horizon, given an initial context window.
[6]:
builder.evaluate(context_len=5, sample=False, horizon=50);
The graph above is similar to behaviour cloning: the model acts according to the historical pattern. In the next graph, you can sample different trajectories by setting sample=True.
[7]:
builder.evaluate(context_len=5, sample=True, horizon=50);
Get Recommendation#
[8]:
simulator = wi.Simulator(tokenizer, builder.model)
simulator.tokenizer.df_tokenized.head(2)
[8]:
|   | condenser_inlet_temp | evaporator_heat_load_rt | staging | system_power_consumption | value |
|---|---|---|---|---|---|
| 0 | 26 | 45 | 344 | 173 | 300 |
| 1 | 33 | 104 | 343 | 213 | 261 |
Get a custom context sequence.
Note: The sequence should end with states, i.e. (s,a,r,…,s)
[9]:
custom_context = tokenizer.df_tokenized.sequence[:7]
custom_context
[9]:
array([ 26, 45, 344, 173, 300, 33, 104])
One step sample#
sample returns a dataframe whose columns are (actions, reward, value, next states) given the context. The contents of the dataframe are in the original space (approximated).
[10]:
recommendation_df = simulator.sample(custom_context, max_size=10, as_token=False)
recommendation_df
[10]:
|   | staging | system_power_consumption | value |
|---|---|---|---|
| 0 | 6 | 839.9500 | 804.951501 |
| 1 | 0 | 861.1995 | 921.267877 |
| 2 | 1 | 777.8640 | 1069.562352 |
| 3 | 6 | 1100.9950 | 660.180949 |
| 4 | 2 | 1028.8500 | 981.629609 |
| 5 | 2 | 1000.1635 | 987.438738 |
| 6 | 6 | 1199.6880 | 792.387914 |
| 7 | 5 | 741.0085 | 804.951501 |
| 8 | 0 | 861.1995 | 1357.274873 |
| 9 | 1 | 918.1375 | 1301.132534 |
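From here, a simple greedy planner could just pick the sampled action with the highest estimated value. A minimal sketch, using the column names shown above:

# Greedy choice: the sampled action with the largest estimated value.
best = recommendation_df.sort_values("value", ascending=False).iloc[0]
print("Recommended staging:", best["staging"])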
Build Your Own Planner#
If you want to build your own planner, whatif provides a few lower-level APIs.
Get valid actions#
get_valid_actions returns a dataframe of potential actions (in tokenized form) given the context.
Let's take a custom context, assumed to run up to the current states, and find the next top_k actions.
[11]:
valid_actions = simulator.get_valid_actions(custom_context, max_size=2)
valid_actions
[11]:
|   | staging |
|---|---|
| 0 | 351 |
| 1 | 353 |
One step lookahead#
lookahead returns the reward and next states, given the context and an action.
Let's pick an action to simulate the reward and next states. This API does not change the simulator's internal counter and states.
[12]:
custom_context = np.array([0, 100])
action_seq = [valid_actions.loc[0, :]]
print(f"Given the context: {custom_context} and action: {action_seq}\n")
reward, next_states = simulator.lookahead(custom_context, action_seq)
print(f"{reward=}")
print(f"{next_states=}")
Given the context: [ 0 100] and action: [staging 351
Name: 0, dtype: int64]
reward=array([180, 296])
next_states=array([22, 69])
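Putting get_valid_actions and lookahead together, a minimal one-step greedy planner might look like the sketch below. The ranking rule is only an assumption for illustration (the last entry of the returned reward array, assumed here to correspond to the value column); adapt it to your own objective.

# One-step greedy planner sketch: score each candidate action via lookahead.
candidates = simulator.get_valid_actions(custom_context, max_size=3)
scored = []
for _, action in candidates.iterrows():
    reward, next_states = simulator.lookahead(custom_context, [action])
    scored.append((action, reward, next_states))
# Assumption: rank candidates by the last reward entry.
best_action, best_reward, _ = max(scored, key=lambda x: x[1][-1])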
Gym#
Get a gym-compatible simulator using SimulatorWrapper.
[13]:
sim_wrapper = wi.SimulatorWrapper(env=simulator)
/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/gym/core.py:317: DeprecationWarning: WARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
deprecation(
Get the action-to-gym encoding mapping. Gym expects actions to be encoded as contiguous integers.
[14]:
sim_wrapper.gym_action_to_enc
[14]:
{'staging': {'0': 0,
'1': 1,
'10': 2,
'2': 3,
'3': 4,
'4': 5,
'5': 6,
'6': 7,
'7': 8,
'8': 9,
'9': 10}}
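For example, to translate between a raw staging value and its gym encoding, you can invert the mapping shown above (a small sketch):

enc = sim_wrapper.gym_action_to_enc["staging"]
gym_idx = enc["3"]  # raw staging "3" -> gym action index
dec = {v: k for k, v in enc.items()}
raw = dec[gym_idx]  # gym action index -> raw staging value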
[15]:
sim_wrapper.reset()
[15]:
array([30.2, 913.1], dtype=object)
[16]:
obs, reward, done, info = sim_wrapper.step([0])
obs, reward
[16]:
(array([31.1505, 733.693], dtype=object), 731.473)
[17]:
sim_wrapper.observation_space
[17]:
Dict('condenser_inlet_temp': Box(26.3, 31.9, (1,), float32), 'evaporator_heat_load_rt': Box(185.6, 1436.0, (1,), float32))
[18]:
sim_wrapper.action_space
[18]:
MultiDiscrete([11])
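With the standard gym spaces in place, a quick random rollout works as usual (a sketch using the old 4-tuple step API reported by the wrapper):

obs = sim_wrapper.reset()
for _ in range(5):
    action = sim_wrapper.action_space.sample()  # random staging choice
    obs, reward, done, info = sim_wrapper.step(action)
    if done:
        obs = sim_wrapper.reset()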
3rd Party Tools#
Use with a 3rd-party package like stable_baselines3.
As PPO requires observations to be an array of np.float32, use OpenAI Gym's observation wrapper to perform the transformation needed by your training agent.
[19]:
%%time
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.ppo import MlpPolicy
class CustomObservation(gym.ObservationWrapper):
def __init__(self, env: gym.Env):
super().__init__(env)
self.observation_space = gym.spaces.Box(
low=-np.inf,
high=np.inf,
shape=(len(self.tokenizer.state_indices),),
dtype=np.float32,
)
def observation(self, observation):
new_obs = observation.astype(np.float32)
return new_obs
new_sim = CustomObservation(sim_wrapper)
model = PPO(MlpPolicy, new_sim, verbose=0)
model.learn(total_timesteps=2)
obs = new_sim.reset()
for i in range(2):
action, _state = model.predict(obs, deterministic=True)
obs, reward, done, info = new_sim.step(action)
if done:
obs = new_sim.reset()
mean_reward, std_reward = evaluate_policy(model, new_sim, n_eval_episodes=1)
print(f"Mean reward:{mean_reward:.2f} +/- {std_reward:.2f}")
/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/gym/core.py:317: DeprecationWarning: WARN: Initializing wrapper in old step API which returns one bool instead of two. It is recommended to set `new_step_api=True` to use new step API. This will be the default behaviour in future.
deprecation(
/opt/hostedtoolcache/Python/3.10.11/x64/lib/python3.10/site-packages/stable_baselines3/common/evaluation.py:65: UserWarning: Evaluation environment is not wrapped with a ``Monitor`` wrapper. This may result in reporting modified episode lengths and rewards, if other wrappers happen to modify these. Consider wrapping environment first with ``Monitor`` wrapper.
warnings.warn(
Mean reward:84940.77 +/- 0.00
CPU times: user 1min 15s, sys: 115 ms, total: 1min 15s
Wall time: 42.9 s