Tokenizer

labl.utils.tokenizer.Tokenizer

Tokenizer(
    transform: AbstractTransform | Compose,
    has_bos_token: bool = False,
    has_eos_token: bool = False,
)

Bases: ABC

Base class for tokenizers.

This class provides a common interface for tokenizing and detokenizing text, unifying the behavior of jiwer and transformers tokenizers for alignment and visualization.

Attributes:

Name Type Description
transform AbstractTransform | Compose

The transformation to apply to the input strings. This should be a composition of transformations whose final step produces a list of lists of tokens, following the convention of jiwer transformations.

has_bos_token bool

Whether the tokenizer sets a beginning-of-sequence token. Defaults to False.

has_eos_token bool

Whether the tokenizer sets an end-of-sequence token. Defaults to False.

__call__

__call__(
    texts: str | list[str], with_offsets: bool = False
) -> (
    list[list[str]]
    | tuple[list[list[str]], Sequence[Sequence[OffsetType]]]
)

Tokenizes one or more input strings.

Parameters:

Name Type Description Default

texts

str | list[str]

The strings to tokenize.

required

with_offsets

bool

If True, returns the (start, end) character indices of the tokens. If False, returns only the tokens.

False

Returns:

Type Description
list[list[str]] | tuple[list[list[str]], Sequence[Sequence[OffsetType]]]

The tokens of the input strings, and optionally the character spans of the tokens.
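The relationship between __call__, tokenize and tokenize_with_offsets can be illustrated with a small standalone sketch. It does not import labl: SketchTokenizer and CharTokenizer are hypothetical names, and the dispatch logic is an assumption based only on the signatures documented here (in particular, tokenize simply drops the offsets instead of applying a jiwer transform).

```python
from abc import ABC, abstractmethod
from typing import Optional, Union

# Assumption: an offset is a (start, end) character pair, or None when the
# token does not occur in the original text.
Offset = Optional[tuple[int, int]]


class SketchTokenizer(ABC):
    """Hypothetical stand-in mirroring the documented interface."""

    @abstractmethod
    def tokenize_with_offsets(
        self, texts: Union[str, list[str]]
    ) -> tuple[list[list[str]], list[list[Offset]]]:
        ...

    def tokenize(self, texts: Union[str, list[str]]) -> list[list[str]]:
        # Simplification: reuse the offset-aware method and keep only tokens.
        return self.tokenize_with_offsets(texts)[0]

    def __call__(self, texts: Union[str, list[str]], with_offsets: bool = False):
        # with_offsets selects between the two documented return shapes.
        if with_offsets:
            return self.tokenize_with_offsets(texts)
        return self.tokenize(texts)


class CharTokenizer(SketchTokenizer):
    """Toy subclass: every character is a token."""

    def tokenize_with_offsets(self, texts):
        if isinstance(texts, str):
            texts = [texts]  # a single string is treated as a batch of one
        tokens = [list(text) for text in texts]
        offsets = [[(i, i + 1) for i in range(len(text))] for text in texts]
        return tokens, offsets
```

Calling CharTokenizer()("ab") returns only the nested token list, while passing with_offsets=True returns a (tokens, offsets) pair with one inner list per input string.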

tokenize

tokenize(texts: str | list[str]) -> list[list[str]]

Tokenizes one or more input texts.

Parameters:

Name Type Description Default

texts

str | list[str]

The strings to tokenize.

required

Returns:

Type Description
list[list[str]]

A list of lists, each containing the tokens of the corresponding input string.

detokenize abstractmethod

detokenize(
    tokens: list[str] | list[list[str]],
) -> list[str]

Detokenizes the input tokens.

Parameters:

Name Type Description Default

tokens

list[str] | list[list[str]]

The tokens of one or more strings to detokenize.

required

Returns:

Type Description
list[str]

A list containing the detokenized string(s).

tokenize_with_offsets abstractmethod

tokenize_with_offsets(
    texts: str | list[str],
    add_gaps: bool = False,
    gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]

Tokenizes the input texts and returns the character spans of the tokens.

Parameters:

Name Type Description Default

texts

str | list[str]

The texts to tokenize.

required

add_gaps

bool

Whether gaps should be added before/after tokens and offsets.

False

gap_token

str

The token to use for gaps. Default: ▁.

'▁'

Returns:

Type Description
tuple[list[list[str]], list[list[OffsetType]]]

The tokens of the input texts, and tuples (start_idx, end_idx) marking the position of each token in the original text. If a token is not present in the original text, None is used instead.
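To make the None offsets concrete, the sketch below shows one way gap tokens could be interleaved with real tokens. The helper name add_gaps_sketch is hypothetical, and the layout (one gap before the first token, one between each pair of tokens, one after the last) is an assumption; this page does not specify the exact placement labl uses.

```python
from typing import Optional

Offset = Optional[tuple[int, int]]


def add_gaps_sketch(
    tokens: list[str],
    offsets: list[Offset],
    gap_token: str = "▁",
) -> tuple[list[str], list[Offset]]:
    """Surround every token with gap tokens; gaps carry a None offset."""
    out_tokens: list[str] = [gap_token]
    out_offsets: list[Offset] = [None]  # the gap is not part of the text
    for token, offset in zip(tokens, offsets):
        out_tokens += [token, gap_token]
        out_offsets += [offset, None]
    return out_tokens, out_offsets
```

Under these assumptions, the tokens ["Hello", "world"] with offsets [(0, 5), (6, 11)] become five tokens, three of which are gaps whose offset is None because they have no position in the original text.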

labl.utils.tokenizer.WhitespaceTokenizer

WhitespaceTokenizer(word_delimiter: str = ' ')

Bases: Tokenizer

Tokenizer that uses whitespace to split the input strings into tokens.

Hardcodes the Compose([Strip(), ReduceToListOfListOfWords()]) transformation for tokenization.

Parameters:

Name Type Description Default

word_delimiter

str

The delimiter to use for splitting words. Defaults to a single space (' ').

' '

detokenize

detokenize(
    tokens: list[str] | list[list[str]],
) -> list[str]

Detokenizes the input tokens using whitespace.

Parameters:

Name Type Description Default

tokens

list[str] | list[list[str]]

The tokens of one or more strings to detokenize.

required

Returns:

Type Description
list[str]

A list containing the detokenized string(s).
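The input normalization implied by the signature (one tokenized text or several, always returning a list of strings) can be sketched as follows. detokenize_sketch is a hypothetical standalone function, not labl's implementation, and joining on the word delimiter is an assumption.

```python
from typing import Union


def detokenize_sketch(
    tokens: Union[list[str], list[list[str]]],
    word_delimiter: str = " ",
) -> list[str]:
    """Join tokens back into strings, always returning a list."""
    # A flat list of strings is treated as a single tokenized text.
    if tokens and isinstance(tokens[0], str):
        tokens = [tokens]
    return [word_delimiter.join(sequence) for sequence in tokens]
```

With this sketch, a flat token list yields a one-element list, and a nested list yields one string per inner list.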

tokenize_with_offsets

tokenize_with_offsets(
    texts: str | list[str],
    add_gaps: bool = False,
    gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]

Tokenizes the input texts and returns the character spans of the tokens.

Parameters:

Name Type Description Default

texts

str | list[str]

The strings to tokenize.

required

add_gaps

bool

Whether gaps should be added before/after tokens and offsets.

False

gap_token

str

The token to use for gaps. Default: ▁.

'▁'

Returns:

Type Description
tuple[list[list[str]], list[list[OffsetType]]]

The tokens of the input texts, and tuples (start_idx, end_idx) marking the position of each token in the original text. If a token is not present in the original text, None is used instead.
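A minimal sketch of whitespace tokenization with character spans, using only the standard library. The function name is hypothetical, and the sketch splits on any run of whitespace via a regular expression instead of the configurable word_delimiter; offsets here are relative to the unmodified input string, which is an assumption.

```python
import re
from typing import Union


def whitespace_offsets_sketch(
    texts: Union[str, list[str]],
) -> tuple[list[list[str]], list[list[tuple[int, int]]]]:
    """Split on whitespace and record each token's (start, end) span."""
    if isinstance(texts, str):
        texts = [texts]  # a single string is treated as a batch of one
    all_tokens, all_offsets = [], []
    for text in texts:
        # Each maximal run of non-whitespace characters is one token.
        matches = list(re.finditer(r"\S+", text))
        all_tokens.append([m.group() for m in matches])
        all_offsets.append([m.span() for m in matches])
    return all_tokens, all_offsets
```

Each span satisfies text[start:end] == token, which is what makes the offsets usable for alignment and highlighting.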

labl.utils.tokenizer.WordBoundaryTokenizer

WordBoundaryTokenizer(exp: str = SPLIT_REGEX)

Bases: Tokenizer

Tokenizer that uses word boundaries to split the input strings into tokens.

Hardcodes the Compose([Strip(), RegexReduceToListOfListOfWords()]) transformation for tokenization.

Parameters:

Name Type Description Default

exp

str

The regular expression to use for splitting. Defaults to r"[\w']+|[.,!?:;'”#$%&\(\)\*\+-/<=>@\[\]^_{|}~"]. This pattern keeps words (including contractions) together as single tokens, and treats each punctuation mark or special character as its own separate token.

SPLIT_REGEX
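The splitting behavior described for exp can be illustrated with a simplified pattern. SIMPLE_SPLIT below is a hypothetical stand-in, not the library's SPLIT_REGEX: it matches runs of word characters and apostrophes, or any single character that is neither a word character nor whitespace.

```python
import re

# Simplified stand-in for SPLIT_REGEX (assumption, not the library constant).
SIMPLE_SPLIT = r"[\w']+|[^\w\s]"


def regex_offsets_sketch(
    text: str, exp: str = SIMPLE_SPLIT
) -> tuple[list[str], list[tuple[int, int]]]:
    """Return the regex matches in order, with their (start, end) spans."""
    matches = list(re.finditer(exp, text))
    return [m.group() for m in matches], [m.span() for m in matches]
```

With this pattern, "Don't stop, now!" splits into Don't, stop, a comma, now, and an exclamation mark: the contraction stays whole while each punctuation mark becomes its own token.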

detokenize

detokenize(
    tokens: list[str] | list[list[str]],
) -> list[str]

Detokenizes the input tokens using word boundaries.

Parameters:

Name Type Description Default

tokens

list[str] | list[list[str]]

The tokens of one or more texts to detokenize.

required

Returns:

Type Description
list[str]

A list containing the detokenized string(s).

tokenize_with_offsets

tokenize_with_offsets(
    texts: str | list[str],
    add_gaps: bool = False,
    gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]

Tokenizes the input texts and returns the character spans of the tokens.

Parameters:

Name Type Description Default

texts

str | list[str]

The strings to tokenize.

required

add_gaps

bool

Whether gaps should be added before/after tokens and offsets.

False

gap_token

str

The token to use for gaps. Default: ▁.

'▁'

Returns:

Type Description
tuple[list[list[str]], list[list[OffsetType]]]

The tokens of the input texts, and tuples (start_idx, end_idx) marking the position of each token in the original text. If a token is not present in the original text, None is used instead.

labl.utils.tokenizer.HuggingfaceTokenizer

HuggingfaceTokenizer(
    tokenizer_or_id: str
    | PreTrainedTokenizer
    | PreTrainedTokenizerFast,
    add_special_tokens: bool = False,
    has_bos_token: bool = True,
    has_eos_token: bool = True,
    **kwargs,
)

Bases: Tokenizer

Tokenizer that uses a transformers.PreTrainedTokenizer to split the input strings into tokens. Hardcodes the ReduceToListOfListOfTokens transformation for tokenization.

Parameters:

Name Type Description Default

tokenizer_or_id

str | PreTrainedTokenizer | PreTrainedTokenizerFast

The tokenizer or its ID. If a string is provided, it will be used to load the tokenizer from the transformers library.

required

add_special_tokens

bool

Whether to add special tokens to the tokenized output. Defaults to False.

False

has_bos_token

bool

Whether the tokenizer sets a beginning-of-sequence token. Defaults to True.

True

has_eos_token

bool

Whether the tokenizer sets an end-of-sequence token. Defaults to True.

True

kwargs

dict

Additional keyword arguments to pass to the tokenizer initialization.

{}

detokenize

detokenize(
    tokens: list[str] | list[list[str]],
) -> list[str]

tokenize_with_offsets

tokenize_with_offsets(
    texts: str | list[str],
    add_gaps: bool = False,
    gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]

Tokenizes the input texts and returns the character spans of the tokens.

Parameters:

Name Type Description Default

texts

str | list[str]

The strings to tokenize.

required

add_gaps

bool

Whether gaps should be added before/after tokens and offsets.

False

gap_token

str

The token to use for gaps. Default: ▁.

'▁'

Returns:

Type Description
tuple[list[list[str]], list[list[OffsetType]]]

The tokens of the input texts, and the character spans of the tokens.