Tokenizers
tokenizers
DummyTokenizer
DummyTokenizer(*args, **kwargs)
Bases: Tokenizer
A dummy tokenizer that splits the input text on whitespace and returns the resulting tokens as-is.
This tokenizer will generally underestimate token counts in English and other Latin-script languages (where words comprise more than one token on average), and gives very poor results for languages where the whitespace/"word" heuristic breaks down (e.g. Chinese, Japanese, Korean, Thai).
However, it requires no dependencies beyond the Python standard library, since it uses only str.split().
Source code in llmeter/tokenizers.py (lines 169–170)
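As a rough illustration of the whitespace heuristic this class relies on (standard library only, not llmeter itself):

```python
# Sketch of the whitespace heuristic behind DummyTokenizer:
# str.split() with no arguments splits on any run of whitespace.
text = "LLMs often split words into multiple sub-word tokens."
tokens = text.split()
print(tokens)
print(len(tokens))  # 8 whitespace-delimited "tokens"; a sub-word
                    # tokenizer would usually report more for this text
```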
Tokenizer
Tokenizer(*args, **kwargs)
Bases: ABC
Source code in llmeter/tokenizers.py (lines 13–14)
load
staticmethod
load(tokenizer_info)
Loads a tokenizer from a dictionary.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer_info` | `Dict` | The tokenizer information to load. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| `Tokenizer` | `Tokenizer` | The loaded tokenizer. |
Source code in llmeter/tokenizers.py (lines 62–73)
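`load` is the inverse of `to_dict` below. The exact keys in llmeter's `tokenizer_info` dict aren't documented here, so the sketch below only illustrates the dict-based round-trip pattern with a hypothetical class and key name:

```python
# Hypothetical sketch of the to_dict/load round-trip pattern.
# The "class" key and WhitespaceTokenizer are illustrative; llmeter's
# actual tokenizer_info schema may differ.
class WhitespaceTokenizer:
    def count_tokens(self, text: str) -> int:
        return len(text.split())

    @staticmethod
    def to_dict(tokenizer) -> dict:
        # Serialize just enough information to rebuild the tokenizer.
        return {"class": type(tokenizer).__name__}

    @staticmethod
    def load(tokenizer_info: dict) -> "WhitespaceTokenizer":
        assert tokenizer_info["class"] == "WhitespaceTokenizer"
        return WhitespaceTokenizer()

info = WhitespaceTokenizer.to_dict(WhitespaceTokenizer())
restored = WhitespaceTokenizer.load(info)
print(restored.count_tokens("hello world"))  # 2
```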
load_from_file
staticmethod
load_from_file(tokenizer_path)
Loads a tokenizer from a file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer_path` | `UPath` | The path to the serialized tokenizer file. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| `Tokenizer` | `Tokenizer` | The loaded tokenizer. |
Source code in llmeter/tokenizers.py (lines 44–60)
to_dict
staticmethod
to_dict(tokenizer)
Serializes a tokenizer to a dictionary.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer` | `Tokenizer` | The tokenizer to serialize. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| `Dict` | `dict` | The serialized tokenizer. |
Source code in llmeter/tokenizers.py (lines 75–86)
save_tokenizer
save_tokenizer(tokenizer, output_path)
Save tokenizer information to a file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer` | `Tokenizer` | The tokenizer to serialize. | required |
| `output_path` | `UPath` | The path to save the serialized tokenizer to. | required |
Returns:

| Name | Type | Description |
|---|---|---|
| `UPath` | `UPath` | The path to the serialized tokenizer file. |
Source code in llmeter/tokenizers.py (lines 111–129)
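Together, `save_tokenizer` and `load_from_file` form a file-based round trip. A minimal stdlib sketch of that pattern is below (illustrative only: it uses `json` and `pathlib.Path`, whereas llmeter's on-disk format and `UPath` handling may differ):

```python
import json
import tempfile
from pathlib import Path

# Illustrative file-based save/load round trip; llmeter's actual
# serialization format may differ.
def save_tokenizer(tokenizer_info: dict, output_path: Path) -> Path:
    # Write the serialized tokenizer and return the path it was saved to.
    output_path.write_text(json.dumps(tokenizer_info))
    return output_path

def load_from_file(tokenizer_path: Path) -> dict:
    # Read the serialized tokenizer back from disk.
    return json.loads(tokenizer_path.read_text())

with tempfile.TemporaryDirectory() as tmp:
    path = save_tokenizer({"name": "demo"}, Path(tmp) / "tokenizer.json")
    print(load_from_file(path))  # {'name': 'demo'}
```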