
Tokenizers

tokenizers

DummyTokenizer

DummyTokenizer(*args, **kwargs)

Bases: Tokenizer

A dummy tokenizer that splits the input text on whitespace and returns the tokens as-is.

This tokenizer will generally underestimate token counts for English and other Latin-script languages (where words average more than one token each), and will give very poor results for languages where the whitespace/"word" heuristic breaks down (e.g. Chinese, Japanese, Korean, Thai).

However, it requires no dependencies beyond the Python standard library, relying only on str.split().

Source code in llmeter/tokenizers.py
def __init__(self, *args, **kwargs):
    pass
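The trade-off described above can be seen in a minimal standalone sketch of the whitespace heuristic (an illustration using str.split(), not the llmeter class itself):

```python
# Standalone illustration of the whitespace-splitting heuristic that
# DummyTokenizer relies on; not the llmeter implementation itself.
def count_tokens(text: str) -> int:
    """Approximate token count by splitting on whitespace."""
    return len(text.split())

# Latin-script text: each word counts as one "token", which typically
# underestimates real subword token counts.
print(count_tokens("The quick brown fox"))  # 4

# Unsegmented scripts have no whitespace between words, so the whole
# sentence collapses into a single "token".
print(count_tokens("これは日本語の文です"))  # 1
```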

Tokenizer

Tokenizer(*args, **kwargs)

Bases: ABC

Source code in llmeter/tokenizers.py
def __init__(self, *args, **kwargs):
    pass

load staticmethod

load(tokenizer_info)

Loads a tokenizer from a dictionary.

Parameters:

    tokenizer_info (Dict): The tokenizer information to load. Required.

Returns:

    Tokenizer: The loaded tokenizer.

Source code in llmeter/tokenizers.py
@staticmethod
def load(tokenizer_info: dict) -> Tokenizer:
    """
    Loads a tokenizer from a dictionary.

    Args:
        tokenizer_info (Dict): The tokenizer information to load.

    Returns:
        Tokenizer: The loaded tokenizer.
    """
    return _load_tokenizer_from_info(tokenizer_info)
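The helper `_load_tokenizer_from_info` is not shown here, but serialized-tokenizer dictionaries of this kind are commonly resolved by importing a class by name. A hypothetical sketch of that pattern (the `module`/`class` keys are assumptions for illustration, not llmeter's actual schema):

```python
import importlib

def load_from_info(tokenizer_info: dict):
    """Hypothetical loader: resolve a class from a {'module', 'class'}
    dictionary via dynamic import. Illustrates the general pattern only;
    the real _load_tokenizer_from_info may use a different schema."""
    module = importlib.import_module(tokenizer_info["module"])
    cls = getattr(module, tokenizer_info["class"])
    return cls()

# A standard-library class stands in for a tokenizer here:
obj = load_from_info({"module": "collections", "class": "Counter"})
print(type(obj).__name__)  # Counter
```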

load_from_file staticmethod

load_from_file(tokenizer_path)

Loads a tokenizer from a file.

Parameters:

    tokenizer_path (ReadablePathLike | None): The path to the serialized tokenizer file. If None, a DummyTokenizer is returned. Required.

Returns:

    Tokenizer: The loaded tokenizer.

Source code in llmeter/tokenizers.py
@staticmethod
def load_from_file(tokenizer_path: ReadablePathLike | None) -> Tokenizer:
    """
    Loads a tokenizer from a file.

    Args:
        tokenizer_path (ReadablePathLike): The path to the serialized tokenizer file.

    Returns:
        Tokenizer: The loaded tokenizer.
    """
    if tokenizer_path is None:
        return DummyTokenizer()
    tokenizer_path = ensure_path(tokenizer_path)
    with tokenizer_path.open("r") as f:
        tokenizer_info = json.load(f)

    return _load_tokenizer_from_info(tokenizer_info)
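The control flow above (None falls back to a dummy, otherwise the file is parsed as JSON) can be sketched with only the standard library; llmeter's ensure_path/UPath additionally handle remote paths, which this simplified version does not:

```python
import json
import tempfile
from pathlib import Path

def load_info_from_file(path):
    """Simplified sketch of load_from_file's control flow: a None path
    falls back to a default (standing in for DummyTokenizer); otherwise
    the file contents are parsed as JSON."""
    if path is None:
        return {}  # stands in for the DummyTokenizer fallback
    with Path(path).open("r") as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "tokenizer.json"
    p.write_text(json.dumps({"name": "example"}))
    print(load_info_from_file(p))     # {'name': 'example'}
    print(load_info_from_file(None))  # {}
```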

to_dict staticmethod

to_dict(tokenizer)

Serializes a tokenizer to a dictionary.

Parameters:

    tokenizer (Tokenizer): The tokenizer to serialize. Required.

Returns:

    dict: The serialized tokenizer.

Source code in llmeter/tokenizers.py
@staticmethod
def to_dict(tokenizer: Any) -> dict:
    """
    Serializes a tokenizer to a dictionary.

    Args:
        tokenizer (Tokenizer): The tokenizer to serialize.

    Returns:
        Dict: The serialized tokenizer.
    """
    return _to_dict(tokenizer)

save_tokenizer

save_tokenizer(tokenizer, output_path)

Saves tokenizer information to a file.

Parameters:

    tokenizer (Tokenizer): The tokenizer to serialize. Required.
    output_path (WritablePathLike): The path to save the serialized tokenizer to. Required.

Returns:

    UPath: The path to the serialized tokenizer file.

Source code in llmeter/tokenizers.py
def save_tokenizer(tokenizer: Any, output_path: WritablePathLike) -> UPath:
    """
    Saves tokenizer information to a file.

    Args:
        tokenizer (Tokenizer): The tokenizer to serialize.
        output_path (WritablePathLike): The path to save the serialized tokenizer to.

    Returns:
        UPath: The path to the serialized tokenizer file.
    """
    tokenizer_info = _to_dict(tokenizer)

    output_path = ensure_path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w") as f:
        json.dump(tokenizer_info, f)

    return output_path
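The file handling in save_tokenizer can be sketched with the standard library alone: create any missing parent directories, then dump the serialized dictionary as JSON (a simplified stand-in using pathlib; llmeter's UPath also supports remote storage):

```python
import json
import tempfile
from pathlib import Path

def save_info(info: dict, output_path) -> Path:
    """Simplified sketch of save_tokenizer's file handling: ensure the
    parent directory exists, then write the info dictionary as JSON."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w") as f:
        json.dump(info, f)
    return output_path

with tempfile.TemporaryDirectory() as d:
    # The nested directory does not exist yet; mkdir(parents=True) creates it.
    path = save_info({"name": "example"}, Path(d) / "sub" / "tokenizer.json")
    print(json.loads(path.read_text()))  # {'name': 'example'}
```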