Tokenizers

tokenizers

DummyTokenizer

DummyTokenizer(*args, **kwargs)

Bases: Tokenizer

A dummy tokenizer that splits the input text on whitespace and returns the tokens as is.

This tokenizer will generally underestimate token counts in English and other Latin-script languages (where a word typically corresponds to more than one token on average), and will give very poor results for languages where the whitespace/"word" heuristic breaks down (e.g. Chinese, Japanese, Korean, Thai).

However, it requires no dependencies beyond the Python standard library, as it relies only on str.split().

Source code in llmeter/tokenizers.py
def __init__(self, *args, **kwargs):
    pass
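
A minimal illustration of the whitespace heuristic that DummyTokenizer relies on (the example text is illustrative only): str.split() with no arguments splits on any run of whitespace, so each whitespace-delimited chunk counts as one "token".

```python
# str.split() with no arguments splits on any run of whitespace,
# so punctuation stays attached to the adjacent word.
text = "LLMeter measures latency and throughput."
tokens = text.split()
print(tokens)
print(len(tokens))  # 5 whitespace-delimited chunks
```

A subword tokenizer would typically report a higher count for the same text, since long or rare words are split into multiple tokens, which is why DummyTokenizer tends to underestimate.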

Tokenizer

Tokenizer(*args, **kwargs)

Bases: ABC

Source code in llmeter/tokenizers.py
def __init__(self, *args, **kwargs):
    pass

load staticmethod

load(tokenizer_info)

Loads a tokenizer from a dictionary.

Parameters:

Name Type Description Default
tokenizer_info Dict

The tokenizer information to load.

required

Returns:

Name Type Description
Tokenizer Tokenizer

The loaded tokenizer.

Source code in llmeter/tokenizers.py
@staticmethod
def load(tokenizer_info: dict) -> Tokenizer:
    """
    Loads a tokenizer from a dictionary.

    Args:
        tokenizer_info (Dict): The tokenizer information to load.

    Returns:
        Tokenizer: The loaded tokenizer.
    """
    return _load_tokenizer_from_info(tokenizer_info)
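
A hedged sketch of the dictionary round trip that Tokenizer.to_dict and Tokenizer.load support. The "name" key and the registry lookup below are assumptions for illustration, not llmeter's actual _to_dict / _load_tokenizer_from_info schema.

```python
# Hypothetical stand-ins for llmeter's serialization helpers.
class WhitespaceTokenizer:
    def encode(self, text: str) -> list[str]:
        return text.split()

# Maps a serializable name to the class to reconstruct (assumed schema).
REGISTRY = {"whitespace": WhitespaceTokenizer}

def to_dict(tokenizer) -> dict:
    # Record enough information to reconstruct the tokenizer later.
    name = next(k for k, v in REGISTRY.items() if isinstance(tokenizer, v))
    return {"name": name}

def load(tokenizer_info: dict):
    # Reverse of to_dict: look the class up by name and instantiate it.
    return REGISTRY[tokenizer_info["name"]]()

info = to_dict(WhitespaceTokenizer())
restored = load(info)
print(restored.encode("a b c"))  # ['a', 'b', 'c']
```

The design point is that a tokenizer is serialized as plain data (a dict) rather than pickled, so it can be written to JSON and reconstructed in a different process.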

load_from_file staticmethod

load_from_file(tokenizer_path)

Loads a tokenizer from a file.

Parameters:

Name Type Description Default
tokenizer_path UPath

The path to the serialized tokenizer file.

required

Returns:

Name Type Description
Tokenizer Tokenizer

The loaded tokenizer.

Source code in llmeter/tokenizers.py
@staticmethod
def load_from_file(tokenizer_path: UPath | None) -> Tokenizer:
    """
    Loads a tokenizer from a file.

    Args:
        tokenizer_path (UPath): The path to the serialized tokenizer file.

    Returns:
        Tokenizer: The loaded tokenizer.
    """
    if tokenizer_path is None:
        return DummyTokenizer()
    with open(tokenizer_path, "r") as f:
        tokenizer_info = json.load(f)

    return _load_tokenizer_from_info(tokenizer_info)

to_dict staticmethod

to_dict(tokenizer)

Serializes a tokenizer to a dictionary.

Parameters:

Name Type Description Default
tokenizer Tokenizer

The tokenizer to serialize.

required

Returns:

Name Type Description
Dict dict

The serialized tokenizer.

Source code in llmeter/tokenizers.py
@staticmethod
def to_dict(tokenizer: Any) -> dict:
    """
    Serializes a tokenizer to a dictionary.

    Args:
        tokenizer (Tokenizer): The tokenizer to serialize.

    Returns:
        Dict: The serialized tokenizer.
    """
    return _to_dict(tokenizer)

save_tokenizer

save_tokenizer(tokenizer, output_path)

Saves tokenizer information to a file.

Parameters:

Name Type Description Default
tokenizer Tokenizer

The tokenizer to serialize.

required
output_path UPath

The path to save the serialized tokenizer to.

required

Returns:

Name Type Description
UPath UPath

The path to the serialized tokenizer file.

Source code in llmeter/tokenizers.py
def save_tokenizer(tokenizer: Any, output_path: UPath | str) -> UPath:
    """
    Save tokenizer information to a file.

    Args:
        tokenizer (Tokenizer): The tokenizer to serialize.
        output_path (UPath): The path to save the serialized tokenizer to.

    Returns:
        UPath: The path to the serialized tokenizer file.
    """
    tokenizer_info = _to_dict(tokenizer)

    output_path = UPath(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        json.dump(tokenizer_info, f)

    return output_path
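
A minimal stand-in for the save path above: serialize a dict to JSON, creating parent directories first, as save_tokenizer does. pathlib.Path is used here in place of UPath (which extends the same interface to remote filesystems); the "name" key is illustrative only, not llmeter's schema.

```python
import json
import tempfile
from pathlib import Path  # stand-in for UPath in this sketch

tokenizer_info = {"name": "example"}  # hypothetical serialized form

# Mirror save_tokenizer: ensure the parent directory exists, then dump JSON.
output_path = Path(tempfile.mkdtemp()) / "artifacts" / "tokenizer.json"
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "w") as f:
    json.dump(tokenizer_info, f)

# Reading the file back (as load_from_file does) recovers the same dict.
with open(output_path, "r") as f:
    restored = json.load(f)
print(restored == tokenizer_info)  # True
```

Creating the parent directory with parents=True, exist_ok=True makes the call safe whether or not the output directory already exists.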