{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Simulator\n", "\n", "In this section we will show how to use the `a2rl.Simulator` to get a recommendation.\n", "\n", "The way the Simulator provides a recommendation differs from a typical Reinforcement Learning approach, where you first need to train an RL agent (e.g. SAC, PPO) against a simulator, and only then can the agent recommend an action.\n", "\n", "First, a Q-value is calculated internally when you load the data and call `wi_df.add_value()`. Then the Simulator is trained on sequences of states, actions, rewards, and Q-values. To choose an action, you just need to sample multiple trajectories based on the current context.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%matplotlib inline\n", "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import my_nb_path # isort: skip\n", "import os\n", "from pathlib import Path\n", "\n", "import numpy as np\n", "\n", "import a2rl as wi\n", "from a2rl.nbtools import pprint, print # Enable color outputs when rich is installed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Dataset\n", "\n", "Instantiate a tokenizer given the selected dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wi_df = wi.read_csv_dataset(wi.sample_dataset_path(\"chiller\"))\n", "wi_df.add_value()\n", "\n", "# Speed up training for demo purposes\n", "wi_df = wi_df.iloc[:1000]\n", "tokenizer = wi.AutoTokenizer(wi_df, block_size_row=2)\n", "\n", "tokenizer.df.head(2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokenizer.df_tokenized.head(2)" ] },
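{ "cell_type": "markdown", "metadata": {}, "source": [ "Before training, it can help to double-check what the tokenizer will model. The cell below is an illustrative sketch only: it assumes the `WiDataFrame` exposes the state/action/reward column names via `sar_d`, and simply prints them together with the dataframe columns (including the value column appended by `add_value()` above)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative check: which columns are states, actions, and rewards,\n", "# and which columns exist after wi_df.add_value() was called earlier.\n", "print(wi_df.sar_d)\n", "print(list(tokenizer.df.columns))" ] },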
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Train a model\n", "\n", "The default hyperparameters are located at `src/a2rl/config.yaml`. Alternatively, you can (1) specify your own configuration file using `config_dir` and `config_name`, or (2) pass in the configuration as the `config` parameter. Refer to `GPTBuilder` for more info." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_dir = \"model-simulator\"\n", "config = None # Default training configuration\n", "\n", "################################################################################\n", "# To run in fast mode, set env var NOTEBOOK_FAST_RUN=1 prior to starting Jupyter\n", "################################################################################\n", "if os.environ.get(\"NOTEBOOK_FAST_RUN\", \"0\") != \"0\":\n", "    config = {\n", "        \"train_config\": {\n", "            \"epochs\": 1,\n", "            \"batch_size\": 512,\n", "            \"embedding_dim\": 512,\n", "            \"gpt_n_layer\": 1,\n", "            \"gpt_n_head\": 1,\n", "            \"learning_rate\": 6e-4,\n", "            \"num_workers\": 0,\n", "            \"lr_decay\": True,\n", "        }\n", "    }\n", "\n", "    from IPython.display import Markdown\n", "\n", "    display(\n", "        Markdown(\n", "            \"NOTE: notebook runs in fast mode. Use only 1 epoch. Results may differ.\"\n", "        )\n", "    )\n", "################################################################################\n", "builder = wi.GPTBuilder(tokenizer, model_dir, config)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start GPT model training." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "builder.fit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the original GPT tokens vs. the predicted horizon, given the initial context window." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "builder.evaluate(context_len=5, sample=False, horizon=50);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The graph above is like behaviour cloning: the model acts according to the historical pattern. In the next graph, you can sample different trajectories by setting `sample=True`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "nbsphinx-thumbnail" ] }, "outputs": [], "source": [ "builder.evaluate(context_len=5, sample=True, horizon=50);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get Recommendation\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "simulator = wi.Simulator(tokenizer, builder.model)\n", "simulator.tokenizer.df_tokenized.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get a custom context sequence.\n", "\n", "**Note:** The sequence should end with a state, i.e. (s,a,r,...,s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "custom_context = tokenizer.df_tokenized.sequence[:7]\n", "custom_context" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One step sample\n", "\n", "`sample` returns a dataframe whose columns are (actions, reward, value, next states) given the\n", "context. The contents of the dataframe are in the original space (approximated)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "recommendation_df = simulator.sample(custom_context, max_size=10, as_token=False)\n", "recommendation_df" ] },
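{ "cell_type": "markdown", "metadata": {}, "source": [ "A simple way to turn these samples into a one-step recommendation is to aggregate them per action and rank the actions. The cell below is an illustrative sketch only: it assumes the action columns from `wi_df.sar_d[\"actions\"]` and a Q-value column named `value` appear in `recommendation_df` (check `recommendation_df.columns` for your dataset), and whether the best action is the minimum or the maximum depends on how your reward is defined (e.g. cost vs. gain)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative one-step planner: average the sampled Q-value per action and\n", "# rank the actions. Column names are assumptions; inspect recommendation_df.columns\n", "# to adapt this to your dataset. Whether to pick the smallest or the largest\n", "# value depends on your reward definition.\n", "action_cols = wi_df.sar_d[\"actions\"]\n", "ranked = recommendation_df.groupby(action_cols)[\"value\"].mean().sort_values()\n", "ranked" ] },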
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Build Your Own Planner\n", "\n", "If you want to build your own planner, `whatif` provides a few lower-level APIs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get valid actions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`get_valid_actions` returns a dataframe of potential actions (in tokenized form) given the context.\n", "\n", "Let's take a custom context, assumed to run up to the current states, and find the next top_k actions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "valid_actions = simulator.get_valid_actions(custom_context, max_size=2)\n", "valid_actions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One step lookahead\n", "\n", "`lookahead` returns the reward and next states, given the context and an action.\n", "\n", "Let's pick an action to simulate the reward and next states. This API does not change the simulator's internal counter and states." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "custom_context = np.array([0, 100])\n", "action_seq = [valid_actions.loc[0, :]]\n", "print(f\"Given the context: {custom_context} and action: {action_seq}\\n\")\n", "\n", "reward, next_states = simulator.lookahead(custom_context, action_seq)\n", "print(f\"{reward=}\")\n", "print(f\"{next_states=}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gym\n", "\n", "Get a gym-compatible simulator using `SimulatorWrapper`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sim_wrapper = wi.SimulatorWrapper(env=simulator)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the action-to-gym-encoding mapping. Gym expects actions to be a list of consecutive integers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sim_wrapper.gym_action_to_enc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sim_wrapper.reset()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "obs, reward, done, info = sim_wrapper.step([0])\n", "obs, reward" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sim_wrapper.observation_space" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sim_wrapper.action_space" ] },
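{ "cell_type": "markdown", "metadata": {}, "source": [ "Putting the gym interface together, the cell below is an illustrative sketch only: it repeats the `reset`/`step` calls shown above for a few steps, reusing the same hard-coded action encoding `[0]`, and accumulates the reward along the way." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative rollout: reuse the reset/step pattern from the cells above for a\n", "# few steps with the same hard-coded action encoding [0], accumulating reward.\n", "obs = sim_wrapper.reset()\n", "total_reward = 0.0\n", "for _ in range(3):\n", "    obs, reward, done, info = sim_wrapper.step([0])\n", "    total_reward += reward\n", "    if done:\n", "        obs = sim_wrapper.reset()\n", "total_reward" ] },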
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3rd Party Tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use with third-party packages like `stable_baselines3`.\n", "\n", "As PPO requires observations as an array of np.float32, use OpenAI Gym's observation wrapper to perform the transformation needed by your training agent." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "%%time\n", "\n", "import gym\n", "from stable_baselines3 import PPO\n", "from stable_baselines3.common.evaluation import evaluate_policy\n", "from stable_baselines3.ppo import MlpPolicy\n", "\n", "\n", "class CustomObservation(gym.ObservationWrapper):\n", "    \"\"\"Cast the simulator's observations to np.float32 for stable_baselines3.\"\"\"\n", "\n", "    def __init__(self, env: gym.Env):\n", "        super().__init__(env)\n", "        self.observation_space = gym.spaces.Box(\n", "            low=-np.inf,\n", "            high=np.inf,\n", "            shape=(len(self.tokenizer.state_indices),),\n", "            dtype=np.float32,\n", "        )\n", "\n", "    def observation(self, observation):\n", "        new_obs = observation.astype(np.float32)\n", "        return new_obs\n", "\n", "\n", "new_sim = CustomObservation(sim_wrapper)\n", "model = PPO(MlpPolicy, new_sim, verbose=0)\n", "model.learn(total_timesteps=2)\n", "\n", "obs = new_sim.reset()\n", "for i in range(2):\n", "    action, _state = model.predict(obs, deterministic=True)\n", "    obs, reward, done, info = new_sim.step(action)\n", "    if done:\n", "        obs = new_sim.reset()\n", "\n", "mean_reward, std_reward = evaluate_policy(model, new_sim, n_eval_episodes=1)\n", "print(f\"Mean reward:{mean_reward:.2f} +/- {std_reward:.2f}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "toc-autonumbering": true, "vscode": { "interpreter": { "hash": "22f92e4608f34d3393fc5e7884f8906c6794e2d0198ea9b43992c442775a4328" } } }, "nbformat": 4, "nbformat_minor": 4 }