a2rl.AutoTokenizer
class a2rl.AutoTokenizer(df, block_size_row, train_ratio=1.0, copy=True, field_tokenizer=<factory>)

Bases: object
AutoTokenizer processes an input Whatif dataset and provides data-level helper functions for Trainer and Simulator.

Dataframe token refers to the tokenized dataframe column values. GPT token refers to the token passed as input to the GPT model. The tokenized_val_to_gpt_token_map property gives the mapping between dataframe tokens and GPT tokens.

Parameters:
- df (WiDataFrame) – The input Whatif dataframe.
- block_size_row (int) – Number of rows to be used as the context window for the GPT model. If there are n columns in the dataframe, the context window is calculated as n * block_size_row tokens.
- train_ratio (float) – The ratio of data to be used for training. Default is 0.8 (80%).
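For a concrete sense of the context-window arithmetic, here is an illustrative sketch only; the five-column count matches the s1, s2, a, r, value layout used in the examples below:

>>> n_columns = 5        # s1, s2, a, r, value
>>> block_size_row = 2   # hypothetical choice of rows per context window
>>> n_columns * block_size_row
10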
Note

Context length that is greater than block_size_row will be discarded before passing to the GPT model for next-token prediction.

Examples
You can instantiate an AutoTokenizer with a Whatif dataframe, specifying block_size_row in terms of the number of dataframe rows.
>>> import a2rl as wi
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(
...     np.array(
...         [
...             [0, 10, 20, 200],
...             [1, 12, 21, 225],
...             [2, 15, 22, 237],
...         ]
...     ),
...     columns=["s1", "s2", "a", "r"],
... )
>>> wi_df = wi.WiDataFrame(df, states=["s1", "s2"], actions=["a"], rewards=["r"])
>>> wi_df.add_value()
   s1  s2   a    r   value
0   0  10  20  200  184...
1   1  12  21  225  154...
2   2  15  22  237    0...
Retrieve the discretized dataframe using the df_tokenized property.

>>> field_tokenizer = wi.DiscreteTokenizer(num_bins_strategy="uniform")
>>> tokenizer = wi.AutoTokenizer(wi_df, 1, field_tokenizer=field_tokenizer)
>>> tokenizer.df_tokenized
   s1   s2    a    r  value
0   0  100  200  300    499
1  50  140  250  367    483
2  99  199  299  399    400
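Each row of df_tokenized, flattened row-major, is the kind of dataframe-token sequence that the sequence-level helpers further below operate on. A small sketch, using the values shown above:

>>> seq_row0 = tokenizer.df_tokenized.iloc[0].to_numpy()  # array([0, 100, 200, 300, 499])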
To tokenize a new dataframe, use AutoTokenizer.field_tokenizer.transform().

>>> new_df = pd.DataFrame(
...     np.array(
...         [
...             [0, 14, 25, 210],
...             [2, 15, 26, 211],
...         ]
...     ),
...     columns=["s1", "s2", "a", "r"],
... )
>>> new_wi_df = wi.WiDataFrame(new_df, states=["s1", "s2"], actions=["a"], rewards=["r"])
>>> new_wi_df = new_wi_df.add_value()
>>> tokenizer.field_tokenizer.transform(new_wi_df)
   s1   s2    a    r  value
0   0  180  299  327    474
1  99  199  299  329    400
Note

The data for each column cannot consist of just a single value. In order to reuse a tokenizer, the dataframe must have the same columns. In this example, you must create the value column as well by calling add_value().

You can transform dataframe tokens into GPT tokens, or vice versa, as follows.
>>> seq = np.array([0, 100, 200, 300, 499])
>>> gpt_token = tokenizer.gpt_tokenize(seq)
>>> gpt_token
array([ 0,  3,  6,  9, 14])
>>> gpt_token_inv = tokenizer.gpt_inverse_tokenize(gpt_token)
>>> gpt_token_inv
array([  0, 100, 200, 300, 499])
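The mapping between dataframe tokens and GPT tokens that gpt_tokenize() applies is also exposed through the tokenized_val_to_gpt_token_map property described earlier. A sketch of inspecting it (the exact container type is not shown on this page, so treat this as an assumption):

>>> mapping = tokenizer.tokenized_val_to_gpt_token_map  # dataframe token -> GPT token
>>> # Consistent with gpt_tokenize() above, dataframe token 100 corresponds to GPT token 3.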
To convert a sequence back into a dataframe:

>>> tokenizer.from_seq_to_dataframe(seq)
     s1      s2      a        r      value
0  0.01  10.025  20.01  200.185  121.99732
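A full round trip from GPT tokens back to an approximate dataframe can be composed from the calls shown above (a sketch; the result should match the from_seq_to_dataframe output above):

>>> gpt_seq = tokenizer.gpt_tokenize(np.array([0, 100, 200, 300, 499]))
>>> recovered = tokenizer.from_seq_to_dataframe(tokenizer.gpt_inverse_tokenize(gpt_seq))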
Methods

from_seq_to_dataframe(seq[, inverse])
    Convert a sequence of tokenized values back into the original values, in the form of a dataframe.

gpt_inverse_tokenize(seq)
    Convert an input sequence from GPT tokens to dataframe tokens.

gpt_tokenize(seq)
    Convert an input sequence from dataframe tokens to GPT tokens.
Attributes