Tokenizer
labl.utils.tokenizer.Tokenizer
Tokenizer(
transform: AbstractTransform | Compose,
has_bos_token: bool = False,
has_eos_token: bool = False,
)
Bases: ABC
Base class for tokenizers.
This class provides a common interface for tokenizing and detokenizing text, unifying the behavior of
jiwer and transformers tokenizers for alignment and visualization.
Attributes:
| Name | Type | Description |
|---|---|---|
| `transform` | `AbstractTransform \| Compose` | The transformation to apply to the input strings. This should be a composition of transformations whose final step produces a list of lists of tokens, following jiwer transformations. |
| `has_bos_token` | `bool` | Whether the tokenizer sets a beginning-of-sequence token. Defaults to False. |
| `has_eos_token` | `bool` | Whether the tokenizer sets an end-of-sequence token. Defaults to False. |
__call__
__call__(
texts: str | list[str], with_offsets: bool = False
) -> (
list[list[str]]
| tuple[list[list[str]], Sequence[Sequence[OffsetType]]]
)
Tokenizes one or more input strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `str \| list[str]` | The strings to tokenize. | required |
| `with_offsets` | `bool` | If True, returns the (start, end) character indices of the tokens. If False, returns only the tokens. | `False` |
Returns:
| Type | Description |
|---|---|
| `list[list[str]] \| tuple[list[list[str]], Sequence[Sequence[OffsetType]]]` | The tokens of the input strings, and optionally the character spans of the tokens. |
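The call contract described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not labl's implementation: `sketch_call` and `_split_with_offsets` are hypothetical names, `Offset` stands in for `OffsetType`, and whitespace splitting stands in for whatever `transform` a concrete tokenizer applies.

```python
from __future__ import annotations

import re
from typing import Optional

# Stand-in for labl's OffsetType (assumption): a (start, end) span or None.
Offset = Optional[tuple[int, int]]


def _split_with_offsets(text: str) -> tuple[list[str], list[Offset]]:
    """Hypothetical helper: whitespace-separated tokens and their character spans."""
    matches = list(re.finditer(r"\S+", text))
    return [m.group() for m in matches], [(m.start(), m.end()) for m in matches]


def sketch_call(texts: str | list[str], with_offsets: bool = False):
    """Mimic the documented contract: one or more strings in, nested lists out."""
    # A single string is treated as a batch of one.
    batch = [texts] if isinstance(texts, str) else texts
    pairs = [_split_with_offsets(t) for t in batch]
    tokens = [toks for toks, _ in pairs]
    if not with_offsets:
        return tokens
    # With offsets, return a (tokens, spans) tuple of parallel nested lists.
    return tokens, [offs for _, offs in pairs]
```

Calling `sketch_call("a bc")` yields one inner list per input string, while `with_offsets=True` adds a parallel structure of spans, so callers should unpack the result only in the second case.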
tokenize
tokenize(texts: str | list[str]) -> list[list[str]]
Tokenizes one or more input texts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `str \| list[str]` | The strings to tokenize. | required |
Returns:
| Type | Description |
|---|---|
| `list[list[str]]` | A list of lists, each containing the tokens of the corresponding input string. |
detokenize
abstractmethod
detokenize(
tokens: list[str] | list[list[str]],
) -> list[str]
Detokenizes the input tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[str] \| list[list[str]]` | The tokens of one or more strings to detokenize. | required |
Returns:
| Type | Description |
|---|---|
| `list[str]` | A list containing the detokenized string(s). |
tokenize_with_offsets
abstractmethod
tokenize_with_offsets(
texts: str | list[str],
add_gaps: bool = False,
gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]
Tokenizes the input texts and returns the character spans of the tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `str \| list[str]` | The texts to tokenize. | required |
| `add_gaps` | `bool` | Whether gaps should be added before/after tokens and offsets. | `False` |
| `gap_token` | `str` | The token to use for gaps. Default: `▁`. | `'▁'` |
Returns:
| Type | Description |
|---|---|
| `tuple[list[list[str]], list[list[OffsetType]]]` | The tokens of the input texts, and tuples (start_idx, end_idx) marking the position of tokens in the original text. If a token is not present in the original text, None is used instead. |
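To make the role of `add_gaps` and of `None` offsets more concrete, here is a small sketch. The page does not spell out where gaps are placed, so the layout used here (one gap before the first token and one after every token) is purely an assumption, and `add_gaps_sketch` is a hypothetical helper rather than part of labl. The only documented behavior it relies on is that tokens absent from the original text get `None` instead of a span.

```python
from __future__ import annotations

from typing import Optional

# Stand-in for labl's OffsetType (assumption): a (start, end) span or None.
Offset = Optional[tuple[int, int]]


def add_gaps_sketch(
    tokens: list[str],
    offsets: list[Offset],
    gap_token: str = "▁",
) -> tuple[list[str], list[Offset]]:
    """Interleave gap tokens around real tokens (assumed layout: gap, tok, gap, ..., gap)."""
    out_tokens: list[str] = [gap_token]
    # Gap tokens do not occur in the original text, so their offset is None.
    out_offsets: list[Offset] = [None]
    for tok, off in zip(tokens, offsets):
        out_tokens += [tok, gap_token]
        out_offsets += [off, None]
    return out_tokens, out_offsets
```

Under this assumed layout, a sequence of n tokens becomes 2n + 1 entries, and the two returned lists stay aligned index by index.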
labl.utils.tokenizer.WhitespaceTokenizer
WhitespaceTokenizer(word_delimiter: str = ' ')
Bases: Tokenizer
Tokenizer that uses whitespace to split the input strings into tokens.
Hardcodes the Compose([Strip(), ReduceToListOfListOfWords()]) transformation for tokenization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `word_delimiter` | `str` | The delimiter to use for splitting words. Defaults to a single space. | `' '` |
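The effect of the hardcoded transformation can be approximated without jiwer. The sketch below assumes that `Strip()` trims surrounding whitespace and that `ReduceToListOfListOfWords()` splits on the delimiter and drops empty strings; `whitespace_tokenize` is a hypothetical stand-in, not the class's actual code.

```python
from __future__ import annotations


def whitespace_tokenize(
    texts: str | list[str], word_delimiter: str = " "
) -> list[list[str]]:
    """Strip each text, split it on the delimiter, and drop empty pieces."""
    # A single string is treated as a batch of one.
    batch = [texts] if isinstance(texts, str) else texts
    return [
        [word for word in text.strip().split(word_delimiter) if word]
        for text in batch
    ]
```

Dropping empty pieces means that repeated delimiters, such as two consecutive spaces, do not produce empty tokens in this sketch.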
detokenize
detokenize(
tokens: list[str] | list[list[str]],
) -> list[str]
Detokenizes the input tokens using whitespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[str] \| list[list[str]]` | The tokens of one or more strings to detokenize. | required |
Returns:
| Type | Description |
|---|---|
| `list[str]` | A list containing the detokenized string(s). |
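Because `detokenize` accepts either a flat token list or a batch of token lists but always returns `list[str]`, the input has to be normalized first. A minimal sketch of that normalization follows; `whitespace_detokenize` is a hypothetical name, and treating an empty input as a single empty sequence is an assumption.

```python
from __future__ import annotations


def whitespace_detokenize(
    tokens: list[str] | list[list[str]], word_delimiter: str = " "
) -> list[str]:
    """Join tokens back into strings, always returning a list of strings."""
    # A flat list of strings (or an empty list) is treated as one sequence.
    if not tokens or isinstance(tokens[0], str):
        tokens = [tokens]
    return [word_delimiter.join(sequence) for sequence in tokens]
```

Note that joining with a single delimiter cannot restore runs of repeated whitespace that were collapsed during tokenization, so the round trip is only exact for normalized text.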
tokenize_with_offsets
tokenize_with_offsets(
texts: str | list[str],
add_gaps: bool = False,
gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]
Tokenizes the input texts and returns the character spans of the tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `str \| list[str]` | The strings to tokenize. | required |
| `add_gaps` | `bool` | Whether gaps should be added before/after tokens and offsets. | `False` |
| `gap_token` | `str` | The token to use for gaps. Default: `▁`. | `'▁'` |
Returns:
| Type | Description |
|---|---|
| `tuple[list[list[str]], list[list[OffsetType]]]` | The tokens of the input texts, and tuples (start_idx, end_idx) marking the position of tokens in the original text. If a token is not present in the original text, None is used instead. |
labl.utils.tokenizer.WordBoundaryTokenizer
WordBoundaryTokenizer(exp: str = SPLIT_REGEX)
Bases: Tokenizer
Tokenizer that uses word boundaries to split the input strings into tokens.
Hardcodes the Compose([Strip(), RegexReduceToListOfListOfWords()]) transformation for tokenization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `exp` | `str` | The regular expression to use for splitting. Defaults to `SPLIT_REGEX`. | `SPLIT_REGEX` |
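Regex-based splitting of this kind can be illustrated as follows. The value of `SPLIT_REGEX` is not shown on this page, so `ASSUMED_SPLIT_REGEX` below is a hypothetical pattern chosen only for illustration, and `word_boundary_tokenize` is a stand-in function rather than labl's implementation.

```python
from __future__ import annotations

import re

# Hypothetical pattern (assumption): runs of word characters, or any single
# character that is neither a word character nor whitespace.
# labl's actual SPLIT_REGEX may differ.
ASSUMED_SPLIT_REGEX = r"\w+|[^\w\s]"


def word_boundary_tokenize(
    texts: str | list[str], exp: str = ASSUMED_SPLIT_REGEX
) -> list[list[str]]:
    """Strip each text and collect every non-overlapping match of the pattern."""
    # A single string is treated as a batch of one.
    batch = [texts] if isinstance(texts, str) else texts
    return [re.findall(exp, text.strip()) for text in batch]
```

Compared with whitespace splitting, a pattern like this separates punctuation from adjacent words, so `"Hello, world!"` yields four tokens instead of two.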
detokenize
detokenize(
tokens: list[str] | list[list[str]],
) -> list[str]
Detokenizes the input tokens using word boundaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[str] \| list[list[str]]` | The tokens of one or more texts to detokenize. | required |
Returns:
| Type | Description |
|---|---|
| `list[str]` | A list containing the detokenized string(s). |
tokenize_with_offsets
tokenize_with_offsets(
texts: str | list[str],
add_gaps: bool = False,
gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]
Tokenizes the input texts and returns the character spans of the tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `str \| list[str]` | The strings to tokenize. | required |
| `add_gaps` | `bool` | Whether gaps should be added before/after tokens and offsets. | `False` |
| `gap_token` | `str` | The token to use for gaps. Default: `▁`. | `'▁'` |
Returns:
| Type | Description |
|---|---|
| `tuple[list[list[str]], list[list[OffsetType]]]` | The tokens of the input texts, and tuples (start_idx, end_idx) marking the position of tokens in the original text. If a token is not present in the original text, None is used instead. |
labl.utils.tokenizer.HuggingfaceTokenizer
HuggingfaceTokenizer(
tokenizer_or_id: str
| PreTrainedTokenizer
| PreTrainedTokenizerFast,
add_special_tokens: bool = False,
has_bos_token: bool = True,
has_eos_token: bool = True,
**kwargs,
)
Bases: Tokenizer
Tokenizer that uses a transformers.PreTrainedTokenizer to split the input strings into tokens.
Hardcodes the ReduceToListOfListOfTokens transformation for tokenization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tokenizer_or_id` | `str \| PreTrainedTokenizer \| PreTrainedTokenizerFast` | The tokenizer or its ID. If a string is provided, it will be used to load the tokenizer from the | required |
| `add_special_tokens` | `bool` | Whether to add special tokens to the tokenized output. Defaults to False. | `False` |
| `has_bos_token` | `bool` | Whether the tokenizer sets a beginning-of-sequence token. Defaults to True. | `True` |
| `has_eos_token` | `bool` | Whether the tokenizer sets an end-of-sequence token. Defaults to True. | `True` |
| `**kwargs` | `dict` | Additional keyword arguments to pass to the tokenizer initialization. | `{}` |
tokenize_with_offsets
tokenize_with_offsets(
texts: str | list[str],
add_gaps: bool = False,
gap_token: str = "▁",
) -> tuple[list[list[str]], list[list[OffsetType]]]
Tokenizes the input texts and returns the character spans of the tokens.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `str \| list[str]` | The strings to tokenize. | required |
| `add_gaps` | `bool` | Whether gaps should be added before/after tokens and offsets. | `False` |
| `gap_token` | `str` | The token to use for gaps. Default: `▁`. | `'▁'` |
Returns:
| Type | Description |
|---|---|
| `tuple[list[list[str]], list[list[OffsetType]]]` | The tokens of the input texts, and the character spans of the tokens. |