Dataset
labl.data.base_sequence.BaseLabeledDataset
Bases: BaseLabeledSequence[EntryType | BaseMultiLabelEntry[EntryType]], ABC
Base class for all dataset classes containing BaseLabeledEntry objects.
Source code in labl/data/base_sequence.py
get_agreement
get_agreement(
other: BaseLabeledSequence[
EntryType | BaseMultiLabelEntry[EntryType]
]
| None = None,
level_of_measurement: LevelOfMeasurement | None = None,
) -> AgreementOutput
Compute the inter-annotator agreement for the token labels of all label sets using Krippendorff's alpha.
Source code in labl/data/base_sequence.py
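To make the agreement measure concrete, here is a minimal pure-Python sketch of Krippendorff's alpha at the nominal level of measurement. It is not the implementation used by `get_agreement`; the function name `krippendorff_alpha_nominal` and the input layout (one list of labels per token, one label per annotator) are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of units (e.g. tokens); each unit is the list of labels
    assigned to it by the annotators who labeled it (hypothetical layout).
    """
    # Coincidence matrix: every ordered pair of labels given to the same unit
    # by two different annotators contributes 1 / (m - 1).
    coincidences = Counter()
    for values in units:
        m = len(values)
        if m < 2:
            continue  # units with a single label carry no agreement information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("At least one unit with two or more labels is required.")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label value was ever used
    return 1 - observed / expected
```

An alpha of 1 indicates perfect agreement, 0 indicates agreement at chance level, and negative values indicate systematic disagreement.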
labl.data.labeled_dataset.LabeledDataset
Bases: BaseLabeledDataset[LabeledEntry]
Dataset class for handling collections of LabeledEntry objects.
Attributes:
| Name | Type | Description |
|---|---|---|
| `data` | `list[LabeledEntry]` | A list of LabeledEntry objects. |
Source code in labl/data/base_sequence.py
from_spans
classmethod
from_spans(
texts: list[str],
spans: list[list[Span]] | list[list[SpanType]],
infos: list[InfoDictType] | None = None,
tokenizer: str
| Tokenizer
| PreTrainedTokenizer
| PreTrainedTokenizerFast
| None = None,
tokenizer_kwargs: dict = {},
) -> LabeledDataset
Create a LabeledDataset from a set of texts and one or more spans for each text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `list[str]` | The set of texts. | required |
| `spans` | `list[list[Span]] \| list[list[dict[str, str \| int \| float \| None]]]` | A list of spans for each text. | required |
| `infos` | `list[dict[str, str \| int \| float \| bool]] \| None` | A list of dictionaries containing additional information for each entry. If None, no additional information is added. Defaults to None. | `None` |
| `tokenizer` | `str \| Tokenizer \| PreTrainedTokenizer \| PreTrainedTokenizerFast \| None` | A … | `None` |
| `tokenizer_kwargs` | `dict` | Additional arguments for the tokenizer. | `{}` |
Source code in labl/data/labeled_dataset.py
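The relationship between character spans and token labels can be illustrated with a small standalone sketch. It does not call `from_spans`; it assumes whitespace tokenization and span dictionaries with `start`, `end` and `label` keys, both of which are assumptions for illustration rather than the library's documented format.

```python
import re


def label_tokens_from_spans(text, spans):
    """Assign each whitespace token the label of the first span it overlaps.

    `spans` is a list of dicts with hypothetical keys `start`, `end`, `label`
    (character offsets, end exclusive).
    """
    labeled = []
    for match in re.finditer(r"\S+", text):
        token_label = None
        for span in spans:
            # Two half-open intervals overlap when each starts before the other ends.
            if match.start() < span["end"] and span["start"] < match.end():
                token_label = span["label"]
                break
        labeled.append((match.group(), token_label))
    return labeled


pairs = label_tokens_from_spans(
    "Hello big world",
    [{"start": 6, "end": 9, "label": "ADJ"}],
)
```

Tokens that overlap no span keep `None` as their label, which mirrors the `None` option in the span value type above.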
from_tagged
classmethod
from_tagged(
tagged: list[str],
tokenizer: str
| Tokenizer
| PreTrainedTokenizer
| PreTrainedTokenizerFast
| None = None,
keep_tags: list[str] = [],
ignore_tags: list[str] = [],
tokenizer_kwargs: dict = {},
infos: list[InfoDictType] | None = None,
) -> LabeledDataset
Create a LabeledDataset from a set of tagged texts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tagged` | `list[str]` | The set of tagged texts. | required |
| `tokenizer` | `str \| Tokenizer \| PreTrainedTokenizer \| PreTrainedTokenizerFast \| None` | A … | `None` |
| `infos` | `list[dict[str, str \| int \| float \| bool]] \| None` | A list of dictionaries containing additional information for each entry. If None, no additional information is added. Defaults to None. | `None` |
| `keep_tags` | `list[str]` | A list of tags to keep. | `[]` |
| `ignore_tags` | `list[str]` | A list of tags to ignore. | `[]` |
| `tokenizer_kwargs` | `dict` | Additional arguments for the tokenizer. | `{}` |
Source code in labl/data/labeled_dataset.py
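As a rough illustration of what a tagged text encodes, the sketch below splits a tagged string into plain text plus character spans. It is not the parser used by `from_tagged`: the XML-like `<tag>...</tag>` syntax, the helper name `parse_tagged` and the span dictionary keys are assumptions, and nested tags are not handled.

```python
import re

# Matches a non-nested <tag>body</tag> pair (assumed tag syntax).
TAG_RE = re.compile(r"<(?P<tag>\w+)>(?P<body>.*?)</(?P=tag)>")


def parse_tagged(tagged):
    """Return the untagged text and a list of spans with character offsets."""
    text_parts = []
    spans = []
    cursor = 0  # position in the tagged string
    length = 0  # length of the plain text built so far
    for match in TAG_RE.finditer(tagged):
        before = tagged[cursor:match.start()]
        text_parts.append(before)
        length += len(before)
        body = match.group("body")
        spans.append(
            {"start": length, "end": length + len(body), "label": match.group("tag")}
        )
        text_parts.append(body)
        length += len(body)
        cursor = match.end()
    text_parts.append(tagged[cursor:])
    return "".join(text_parts), spans


text, spans = parse_tagged("Hello <h>big</h> world")
```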
from_tokens
classmethod
from_tokens(
tokens: list[list[str]],
labels: Sequence[Sequence[LabelType]],
infos: list[InfoDictType] | None = None,
keep_labels: list[str] = [],
ignore_labels: list[str] = [],
tokenizer: str
| Tokenizer
| PreTrainedTokenizer
| PreTrainedTokenizerFast
| None = None,
tokenizer_kwargs: dict = {},
) -> LabeledDataset
Create a LabeledDataset from a set of tokenized texts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[list[str]] \| None` | A list of lists of string tokens. | required |
| `labels` | `list[list[str \| int \| float \| None]] \| None` | A list of lists of labels for the tokens. | required |
| `infos` | `list[dict[str, str \| int \| float \| bool]] \| None` | A list of dictionaries containing additional information for each entry. If None, no additional information is added. Defaults to None. | `None` |
| `keep_labels` | `list[str]` | A list of labels to keep. | `[]` |
| `ignore_labels` | `list[str]` | A list of labels to ignore. | `[]` |
| `tokenizer` | `str \| Tokenizer \| PreTrainedTokenizer \| PreTrainedTokenizerFast \| None` | A … | `None` |
| `tokenizer_kwargs` | `dict` | Additional arguments for the tokenizer. | `{}` |
Source code in labl/data/labeled_dataset.py
labl.data.edited_dataset.EditedDataset
Bases: BaseLabeledDataset[EditedEntry]
Dataset class for handling collections of EditedEntry and MultiEditEntry objects.
Attributes:
| Name | Type | Description |
|---|---|---|
| `data` | `list[EditedEntry] \| list[MultiEditEntry]` | A list of … |
Source code in labl/data/base_sequence.py
from_edits
classmethod
from_edits(
texts: list[str],
edits: list[str] | list[list[str]],
infos: list[InfoDictType]
| list[list[InfoDictType]]
| None = None,
tokenizer: str
| Tokenizer
| PreTrainedTokenizer
| PreTrainedTokenizerFast
| None = None,
tokenizer_kwargs: dict = {},
with_gaps: bool = True,
sub_label: str = "S",
ins_label: str = "I",
del_label: str = "D",
gap_token: str = "▁",
) -> EditedDataset
Create an EditedDataset from a set of texts and one or more edits for each text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `list[str]` | The set of texts. | required |
| `edits` | `list[str] \| list[list[str]] \| None` | One or more edited versions for each text. | required |
| `infos` | `list[dict[str, str \| int \| float \| bool]] \| list[list[dict[str, str \| int \| float \| bool]]] \| None` | A list of dictionaries containing additional information for each entry. If multiple edits are provided for each text, … | `None` |
| `tokenizer` | `str \| Tokenizer \| PreTrainedTokenizer \| PreTrainedTokenizerFast \| None` | A … | `None` |
| `tokenizer_kwargs` | `dict` | Additional arguments for the tokenizer. | `{}` |
| `with_gaps` | `bool` | Whether to add gaps to the tokens and offsets. Gaps are used to mark the positions of insertions and deletions in the original/edited texts, respectively. If false, those are merged into the next token to the right. Default: True. | `True` |
| `sub_label` | `str` | The label for substitutions. Default: "S". | `'S'` |
| `ins_label` | `str` | The label for insertions. Default: "I". | `'I'` |
| `del_label` | `str` | The label for deletions. Default: "D". | `'D'` |
| `gap_token` | `str` | The token to use for gaps. Default: "▁". | `'▁'` |
Source code in labl/data/edited_dataset.py
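The idea of labeling substitutions, insertions and deletions between an original and an edited text can be sketched with the standard library's `difflib`. This is only an approximation for illustration: the alignment actually used by `from_edits` is not described here, and the helper name `label_edits` and its return format are assumptions.

```python
from difflib import SequenceMatcher


def label_edits(orig_tokens, edit_tokens, sub_label="S", ins_label="I", del_label="D"):
    """List edit operations between two token sequences.

    Returns tuples of (label, original tokens affected, edited tokens affected).
    Uses difflib's matching, which may differ from the library's own alignment.
    """
    operations = []
    matcher = SequenceMatcher(a=orig_tokens, b=edit_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            operations.append((sub_label, orig_tokens[i1:i2], edit_tokens[j1:j2]))
        elif tag == "delete":
            operations.append((del_label, orig_tokens[i1:i2], []))
        elif tag == "insert":
            operations.append((ins_label, [], edit_tokens[j1:j2]))
        # "equal" blocks are unchanged tokens and receive no label.
    return operations


ops = label_edits(
    ["Hello", "big", "world"],
    ["Hello", "small", "world", "!"],
)
```

In this example `big` is substituted by `small`, and `!` is inserted at the end of the text, which is the kind of position a trailing gap token is meant to mark.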
from_edits_dataframe
classmethod
from_edits_dataframe(
df,
text_column: str,
edit_column: str,
entry_ids: str | list[str],
infos_columns: list[str] = [],
tokenizer: str
| Tokenizer
| PreTrainedTokenizer
| PreTrainedTokenizerFast
| None = None,
tokenizer_kwargs: dict[str, Any] = {},
with_gaps: bool = True,
sub_label: str = "S",
ins_label: str = "I",
del_label: str = "D",
gap_token: str = "▁",
) -> EditedDataset
Create an EditedDataset from a pandas.DataFrame with edits.
Each row in the DataFrame belongs to an entry identified by entry_ids. The text_column contains the
original text, and the edit_column contains the edits. If multiple rows share the same entry_ids,
they are all treated as edits of the same text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The DataFrame containing the text and edits. | required |
| `text_column` | `str` | The name of the column in the dataframe containing the original text. | required |
| `edit_column` | `str` | The name of the column in the dataframe containing the edited text. | required |
| `entry_ids` | `str \| list[str]` | One or more column names acting as unique identifiers for each entry. If multiple entries are found with the same … | required |
| `infos_columns` | `list[str]` | A list of columns containing additional information for each entry. | `[]` |
| `tokenizer` | `str \| Tokenizer \| PreTrainedTokenizer \| PreTrainedTokenizerFast \| None` | A … | `None` |
| `tokenizer_kwargs` | `dict[str, Any]` | Additional arguments for the tokenizer. Defaults to {}. | `{}` |
| `with_gaps` | `bool` | Whether to add gaps to the tokens and offsets. Gaps are used to mark the positions of insertions and deletions in the original/edited texts, respectively. If false, those are merged into the next token to the right. Default: True. | `True` |
| `sub_label` | `str` | The label for substitutions. Default: "S". | `'S'` |
| `ins_label` | `str` | The label for insertions. Default: "I". | `'I'` |
| `del_label` | `str` | The label for deletions. Default: "D". | `'D'` |
| `gap_token` | `str` | The token to use for gaps. Default: "▁". | `'▁'` |
Returns:
| Type | Description |
|---|---|
| `EditedDataset` | An … |
Source code in labl/data/edited_dataset.py
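The grouping behaviour described for `from_edits_dataframe` (rows sharing the same identifier values become several edits of one text) can be sketched without pandas. Plain dictionaries stand in for DataFrame rows here, and the helper name `group_edits` and the column names in the example are hypothetical.

```python
def group_edits(rows, text_column, edit_column, entry_ids):
    """Group rows by their identifier columns, collecting all edits per entry.

    `rows` is a list of dicts standing in for DataFrame rows (an assumption
    made to keep the sketch dependency-free).
    """
    if isinstance(entry_ids, str):
        entry_ids = [entry_ids]  # accept a single column name, as the signature allows
    grouped = {}
    for row in rows:
        key = tuple(row[column] for column in entry_ids)
        entry = grouped.setdefault(key, {"text": row[text_column], "edits": []})
        entry["edits"].append(row[edit_column])
    return grouped


rows = [
    {"doc_id": 1, "src": "Hello big world", "edited": "Hello small world"},
    {"doc_id": 1, "src": "Hello big world", "edited": "Hello world"},
    {"doc_id": 2, "src": "Good morning", "edited": "Good evening"},
]
grouped = group_edits(rows, text_column="src", edit_column="edited", entry_ids="doc_id")
```

Entry `(1,)` ends up with two edits of the same text, which corresponds to the multi-edit case, while entry `(2,)` has a single edit.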
merge_gap_annotations
merge_gap_annotations(
merge_fn: Callable[[Sequence[LabelType]], LabelType]
| None = None,
keep_final_gap: bool = True,
) -> None
Merge gap annotations in the tokens of orig and edit.
This method is equivalent to calling EditedEntry.from_edits with with_gaps=False. Gap annotations are merged
to the next non-gap token to the right, and the gap label is added to the label of the non-gap token. The last
gap is kept to account for insertions at the end of the text.
E.g. the tokens GAP Hello GAP World GAP ! GAP with labels I, S, I, -, -, -, I become Hello World ! GAP with labels IS, I, -, I (where - marks an unlabeled token).
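The merging rule can be sketched in a few lines of standalone Python. This is not the library's `merge_gap_annotations`: the helper name `merge_gaps`, the use of string concatenation as the merge function, and the reading of the example above (first, second and last gaps labeled `I`, `Hello` labeled `S`) are all assumptions.

```python
def merge_gaps(tokens, labels, gap_token="▁", keep_final_gap=True):
    """Merge each gap's label into the next non-gap token to the right.

    Labels are concatenated (gap label first), which is an assumed merge
    function. `None` marks an unlabeled token.
    """
    merged_tokens, merged_labels = [], []
    pending = None  # label carried over from the preceding gap(s)
    for token, label in zip(tokens, labels):
        if token == gap_token:
            if label is not None:
                pending = (pending or "") + label
            continue
        merged = (pending or "") + (label or "")
        merged_tokens.append(token)
        merged_labels.append(merged or None)
        pending = None
    # The final gap has no token to its right; keep it so that insertions at
    # the end of the text are not lost. If it is not kept, its label is dropped.
    if keep_final_gap and tokens and tokens[-1] == gap_token:
        merged_tokens.append(gap_token)
        merged_labels.append(pending)
    return merged_tokens, merged_labels


tokens = ["▁", "Hello", "▁", "World", "▁", "!", "▁"]
labels = ["I", "S", "I", None, None, None, "I"]
merged_tokens, merged_labels = merge_gaps(tokens, labels)
```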