Tokenizer

labl.utils.tokenizer.Tokenizer ¶

Tokenizer(
    transform: AbstractTransform | Compose,
    has_bos_token: bool = False,
    has_eos_token: bool = False,
)

Bases: ABC

Base class for tokenizers.

This class provides a common interface for tokenizing and detokenizing text, unifying the behavior of jiwer and transformers tokenizers for alignment and visualization.

Attributes:

Name	Type	Description
`transform`	`AbstractTransform \| Compose`	The transformation to apply to the input strings. This should be a composition of jiwer-style transformations whose final step produces a list of lists of tokens.
`has_bos_token`	`bool`	Whether the tokenizer sets a beginning-of-sequence token. Defaults to False.
`has_eos_token`	`bool`	Whether the tokenizer sets an end-of-sequence token. Defaults to False.

__call__ ¶

__call__(
    texts: str | list[str], with_offsets: bool = False
) -> (
    list[list[str]]
    | tuple[list[list[str]], Sequence[Sequence[OffsetType]]]
)

Tokenizes one or more input strings.

Parameters:

Name	Type	Description	Default
`texts` ¶	`str \| list[str]`	The strings to tokenize.	required
`with_offsets` ¶	`bool`	If True, returns the (start, end) character indices of the tokens. If False, returns only the tokens.	`False`

Returns:

Type	Description
`list[list[str]] \| tuple[list[list[str]], Sequence[Sequence[OffsetType]]]`	The tokens of the input strings, and optionally the character spans of the tokens.
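
The relationship between `__call__`, `tokenize`, and `tokenize_with_offsets` can be sketched without the library. This is a minimal illustration, assuming that `__call__` simply dispatches on `with_offsets` and that `tokenize` drops the offsets. `SketchTokenizer`, `CharSketch`, and the `Offset` alias are hypothetical names, not part of labl, and the real class routes tokenization through its jiwer `transform`.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Tuple, Union

# Hypothetical alias standing in for OffsetType: a (start, end) span, or None.
Offset = Optional[Tuple[int, int]]


class SketchTokenizer(ABC):
    """Illustrative stand-in for the documented interface, not labl's code."""

    def __call__(self, texts: Union[str, List[str]], with_offsets: bool = False):
        # Assumption: __call__ only chooses between the two tokenize methods.
        if with_offsets:
            return self.tokenize_with_offsets(texts)
        return self.tokenize(texts)

    def tokenize(self, texts: Union[str, List[str]]) -> List[List[str]]:
        # Assumption: plain tokenization is tokenization with the offsets dropped.
        tokens, _ = self.tokenize_with_offsets(texts)
        return tokens

    @abstractmethod
    def tokenize_with_offsets(
        self, texts: Union[str, List[str]]
    ) -> Tuple[List[List[str]], List[List[Offset]]]: ...

    @abstractmethod
    def detokenize(self, tokens) -> List[str]: ...


class CharSketch(SketchTokenizer):
    """Toy subclass: every character is its own token."""

    def tokenize_with_offsets(self, texts):
        # A single string is treated as a batch of one.
        batch = [texts] if isinstance(texts, str) else list(texts)
        tokens = [list(text) for text in batch]
        offsets = [[(i, i + 1) for i in range(len(text))] for text in batch]
        return tokens, offsets

    def detokenize(self, tokens):
        # A flat list of strings is treated as the tokens of a single text.
        batch = [tokens] if tokens and isinstance(tokens[0], str) else tokens
        return ["".join(seq) for seq in batch]


tok = CharSketch()
tokens_only = tok("ab")
tokens_and_offsets = tok(["ab", "c"], with_offsets=True)
```

In this sketch a single string is wrapped into a one-element batch, which matches the documented `list[list[str]]` return type for both `str` and `list[str]` inputs.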

tokenize ¶

tokenize(texts: str | list[str]) -> list[list[str]]

Tokenizes one or more input texts.

Parameters:

Name	Type	Description	Default
`texts` ¶	`str \| list[str]`	The strings to tokenize.	required

Returns:

Type	Description
`list[list[str]]`	A list of lists, each containing the tokens of the corresponding input string.

detokenize `abstractmethod` ¶

detokenize(
    tokens: list[str] | list[list[str]],
) -> list[str]

Detokenizes the input tokens.

Parameters:

Name	Type	Description	Default
`tokens` ¶	`list[str] \| list[list[str]]`	The tokens of one or more strings to detokenize.	required

Returns:

Type	Description
`list[str]`	A list containing the detokenized string(s).

tokenize_with_offsets `abstractmethod` ¶

tokenize_with_offsets(
    texts: str | list[str],
    add_gaps: bool = False,
    gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]

Tokenizes the input texts and returns the character spans of the tokens.

Parameters:

Name	Type	Description	Default
`texts` ¶	`str \| list[str]`	The texts to tokenize.	required
`add_gaps` ¶	`bool`	Whether gaps should be added before/after tokens and offsets.	`False`
`gap_token` ¶	`str`	The token to use for gaps. Default: `▁`.	`'▁'`

Returns:

Type	Description
`tuple[list[list[str]], list[list[OffsetType]]]`	The tokens of the input texts, and tuples `(start_idx, end_idx)` marking the position of each token in the original text. If a token is not present in the original text, None is used instead.
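
The effect of `add_gaps` is not spelled out above, so the following is only one plausible reading: a gap token is placed before every token and after the last one, and each gap gets a `None` offset because it does not occur in the original text. `add_gaps_sketch` and the `Offset` alias are hypothetical names used for illustration, and the actual placement of gaps in labl may differ.

```python
from typing import List, Optional, Tuple

# Hypothetical alias standing in for OffsetType.
Offset = Optional[Tuple[int, int]]


def add_gaps_sketch(
    tokens: List[str], offsets: List[Offset], gap_token: str = "▁"
) -> Tuple[List[str], List[Offset]]:
    """Interleave a gap token before every token and after the last one."""
    out_tokens: List[str] = [gap_token]
    out_offsets: List[Offset] = [None]  # gaps have no span in the source text
    for token, offset in zip(tokens, offsets):
        out_tokens += [token, gap_token]
        out_offsets += [offset, None]
    return out_tokens, out_offsets


gapped_tokens, gapped_offsets = add_gaps_sketch(["Hello", "world"], [(0, 5), (6, 11)])
```

Under this assumption, the two real tokens keep their spans and the three gaps around them carry `None`.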

labl.utils.tokenizer.WhitespaceTokenizer ¶

WhitespaceTokenizer(word_delimiter: str = ' ')

Bases: Tokenizer

Tokenizer that uses whitespace to split the input strings into tokens.

Hardcodes the Compose([Strip(), ReduceToListOfListOfWords()]) transformation for tokenization.

Parameters:

Name	Type	Description	Default
`word_delimiter` ¶	`str`	The delimiter to use for splitting words. Defaults to a single space.	`' '`

detokenize ¶

detokenize(
    tokens: list[str] | list[list[str]],
) -> list[str]

Detokenizes the input tokens using whitespace.

Parameters:

Name	Type	Description	Default
`tokens` ¶	`list[str] \| list[list[str]]`	The tokens of one or more strings to detokenize.	required

Returns:

Type	Description
`list[str]`	A list containing the detokenized string(s).
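
Whitespace detokenization amounts to joining tokens with the delimiter. The sketch below shows how both accepted input shapes can map to a `list[str]`. `detokenize_sketch` is a hypothetical helper, and treating a flat list as a single text is an assumption drawn from the parameter description rather than from labl's source.

```python
from typing import List, Union


def detokenize_sketch(
    tokens: Union[List[str], List[List[str]]], word_delimiter: str = " "
) -> List[str]:
    """Join tokens with the delimiter, always returning a list of strings."""
    # A flat list of strings is treated as the tokens of a single text.
    if tokens and isinstance(tokens[0], str):
        batch = [tokens]
    else:
        batch = tokens
    return [word_delimiter.join(seq) for seq in batch]


single = detokenize_sketch(["Hello", "world"])
batched = detokenize_sketch([["Hello", "world"], ["Bye"]])
```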

tokenize_with_offsets ¶

tokenize_with_offsets(
    texts: str | list[str],
    add_gaps: bool = False,
    gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]

Tokenizes the input texts and returns the character spans of the tokens.

Parameters:

Name	Type	Description	Default
`texts` ¶	`str \| list[str]`	The strings to tokenize.	required
`add_gaps` ¶	`bool`	Whether gaps should be added before/after tokens and offsets.	`False`
`gap_token` ¶	`str`	The token to use for gaps. Default: `▁`.	`'▁'`

Returns:

Type	Description
`tuple[list[list[str]], list[list[OffsetType]]]`	The tokens of the input texts, and tuples `(start_idx, end_idx)` marking the position of each token in the original text. If a token is not present in the original text, None is used instead.
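
To make the shape of the returned character spans concrete, here is a self-contained approximation using only the standard library. It is a sketch, not the class's implementation: `whitespace_offsets_sketch` is a hypothetical function, it splits on any run of whitespace rather than on `word_delimiter`, and its offsets index into the input exactly as given (whether labl computes offsets before or after `Strip()` is not stated here).

```python
import re
from typing import List, Tuple, Union


def whitespace_offsets_sketch(
    texts: Union[str, List[str]],
) -> Tuple[List[List[str]], List[List[Tuple[int, int]]]]:
    """Split on whitespace and report (start, end) character spans."""
    # A single string is treated as a batch of one.
    batch = [texts] if isinstance(texts, str) else list(texts)
    all_tokens, all_offsets = [], []
    for text in batch:
        # Each maximal run of non-whitespace characters is one token.
        matches = list(re.finditer(r"\S+", text))
        all_tokens.append([m.group() for m in matches])
        all_offsets.append([m.span() for m in matches])
    return all_tokens, all_offsets


ws_tokens, ws_offsets = whitespace_offsets_sketch("Hello  world")
```

Each span satisfies `text[start:end] == token`, which is the property that alignment and visualization code typically relies on.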

labl.utils.tokenizer.WordBoundaryTokenizer ¶

WordBoundaryTokenizer(exp: str = SPLIT_REGEX)

Bases: Tokenizer

Tokenizer that uses word boundaries to split the input strings into tokens.

Hardcodes the Compose([Strip(), RegexReduceToListOfListOfWords()]) transformation for tokenization.

Parameters:

Name	Type	Description	Default
`exp` ¶	`str`	The regular expression to use for splitting. Defaults to `r"[\w']+\|[.,!?:;'”#$%&\*\+-/<=>@\[\]^_{\|}~"]`. This regex keeps words (including contractions) together as single tokens, and treats each punctuation mark or special character as its own separate token.	`SPLIT_REGEX`

detokenize ¶

detokenize(
    tokens: list[str] | list[list[str]],
) -> list[str]

Detokenizes the input tokens using word boundaries.

Parameters:

Name	Type	Description	Default
`tokens` ¶	`list[str] \| list[list[str]]`	The tokens of one or more texts to detokenize.	required

Returns:

Type	Description
`list[str]`	A list containing the detokenized string(s).

tokenize_with_offsets ¶

tokenize_with_offsets(
    texts: str | list[str],
    add_gaps: bool = False,
    gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]

Tokenizes the input texts and returns the character spans of the tokens.

Parameters:

Name	Type	Description	Default
`texts` ¶	`str \| list[str]`	The strings to tokenize.	required
`add_gaps` ¶	`bool`	Whether gaps should be added before/after tokens and offsets.	`False`
`gap_token` ¶	`str`	The token to use for gaps. Default: `▁`.	`'▁'`

Returns:

Type	Description
`tuple[list[list[str]], list[list[OffsetType]]]`	The tokens of the input texts, and tuples `(start_idx, end_idx)` marking the position of each token in the original text. If a token is not present in the original text, None is used instead.
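
The word-boundary behavior described for `exp` can be illustrated with a plain regular expression. The sketch below deliberately uses a simplified stand-in pattern, `[\w']+|[^\w\s]`, instead of `SPLIT_REGEX`, so its output may differ from the library's on unusual punctuation. `SIMPLE_SPLIT` and `word_boundary_sketch` are hypothetical names.

```python
import re
from typing import List, Tuple

# Simplified stand-in for SPLIT_REGEX (an assumption, not the library's pattern):
# a run of word characters or apostrophes, or a single punctuation character.
SIMPLE_SPLIT = r"[\w']+|[^\w\s]"


def word_boundary_sketch(
    text: str, exp: str = SIMPLE_SPLIT
) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Return regex-matched tokens and their (start, end) character spans."""
    matches = list(re.finditer(exp, text))
    return [m.group() for m in matches], [m.span() for m in matches]


wb_tokens, wb_offsets = word_boundary_sketch("Don't stop, please!")
```

The contraction stays in one token while the comma and exclamation mark become tokens of their own, in line with the description of the default expression.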

labl.utils.tokenizer.HuggingfaceTokenizer ¶

HuggingfaceTokenizer(
    tokenizer_or_id: str
    | PreTrainedTokenizer
    | PreTrainedTokenizerFast,
    add_special_tokens: bool = False,
    has_bos_token: bool = True,
    has_eos_token: bool = True,
    **kwargs,
)

Bases: Tokenizer

Tokenizer that uses a transformers.PreTrainedTokenizer to split the input strings into tokens. Hardcodes the ReduceToListOfListOfTokens transformation for tokenization.

Parameters:

Name	Type	Description	Default
`tokenizer_or_id` ¶	`str \| PreTrainedTokenizer \| PreTrainedTokenizerFast`	The tokenizer or its ID. If a string is provided, it will be used to load the tokenizer from the `transformers` library.	required
`add_special_tokens` ¶	`bool`	Whether to add special tokens to the tokenized output. Defaults to False.	`False`
`has_bos_token` ¶	`bool`	Whether the tokenizer sets a beginning-of-sequence token. Defaults to True.	`True`
`has_eos_token` ¶	`bool`	Whether the tokenizer sets an end-of-sequence token. Defaults to True.	`True`
`kwargs` ¶	`dict`	Additional keyword arguments to pass to the tokenizer initialization.	`{}`

detokenize ¶

detokenize(
    tokens: list[str] | list[list[str]],
) -> list[str]

tokenize_with_offsets ¶

tokenize_with_offsets(
    texts: str | list[str],
    add_gaps: bool = False,
    gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]

Tokenizes the input texts and returns the character spans of the tokens.

Parameters:

Name	Type	Description	Default
`texts` ¶	`str \| list[str]`	The strings to tokenize.	required
`add_gaps` ¶	`bool`	Whether gaps should be added before/after tokens and offsets.	`False`
`gap_token` ¶	`str`	The token to use for gaps. Default: `▁`.	`'▁'`

Returns:

Type	Description
`tuple[list[list[str]], list[list[OffsetType]]]`	The tokens of the input texts, and the character spans of the tokens.
