Translation
labl.datasets.translation.load_qe4pe
load_qe4pe(
configs: Qe4peTask | list[Qe4peTask] = "main",
langs: Qe4peLanguage | list[Qe4peLanguage] = [
"ita",
"nld",
],
domains: Qe4peDomain | list[Qe4peDomain] | None = None,
speed_groups: Qe4peSpeedGroup
| list[Qe4peSpeedGroup]
| None = None,
highlight_modalities: Qe4peHighlightModality
| list[Qe4peHighlightModality]
| None = None,
tokenizer: str
| Tokenizer
| PreTrainedTokenizer
| PreTrainedTokenizerFast
| None = None,
tokenizer_kwargs: dict[str, Any] = {},
filter_issues: bool = True,
with_gaps: bool = True,
sub_label: str = "S",
ins_label: str = "I",
del_label: str = "D",
gap_token: str = "▁",
) -> dict[str, dict[str, EditedDataset]]
Load the QE4PE dataset by Sarti et al. (2025), containing multiple edits over a single set of machine-translated sentences in two languages (Italian and Dutch).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `configs` | `Literal["pretask", "main", "posttask"] \| list[Literal["pretask", "main", "posttask"]]`, *optional* | One or more task configurations to load. Defaults to "main". Available options: "pretask", "main", "posttask". | `'main'` |
| `langs` | `Literal["ita", "nld"] \| list[Literal["ita", "nld"]]`, *optional* | One or more languages to load. Defaults to ["ita", "nld"]. Available options: "ita", "nld". | `['ita', 'nld']` |
| `domains` | `Literal["biomedical", "social"] \| list[Literal["biomedical", "social"]] \| None`, *optional* | One or more text categories to load. Defaults to ["biomedical", "social"]. Available options: "biomedical", "social". | `None` |
| `speed_groups` | `Literal["faster", "avg", "slower"] \| list[Literal["faster", "avg", "slower"]] \| None`, *optional* | One or more translator speed groups to load. Defaults to ["faster", "avg", "slower"]. Available options: "faster", "avg", "slower". | `None` |
| `highlight_modalities` | `Literal["no_highlight", "oracle", "supervised", "unsupervised"] \| list[Literal["no_highlight", "oracle", "supervised", "unsupervised"]] \| None`, *optional* | One or more highlight modalities to load. Defaults to all modalities. Available options: "no_highlight", "oracle", "supervised", "unsupervised". | `None` |
| `filter_issues` | `bool`, *optional* | Whether to filter out issues from the dataset. Defaults to True. | `True` |
| `tokenizer` | `str \| Tokenizer \| PreTrainedTokenizer \| PreTrainedTokenizerFast`, *optional* | The tokenizer to use for tokenization. If None, a default whitespace tokenizer will be used. | `None` |
| `tokenizer_kwargs` | `dict[str, Any]`, *optional* | Additional arguments for the tokenizer. | `{}` |
| `with_gaps` | `bool`, *optional* | Whether to include gaps in the tokenization. Defaults to True. | `True` |
| `sub_label` | `str`, *optional* | The label for substitutions. Defaults to "S". | `'S'` |
| `ins_label` | `str`, *optional* | The label for insertions. Defaults to "I". | `'I'` |
| `del_label` | `str`, *optional* | The label for deletions. Defaults to "D". | `'D'` |
| `gap_token` | `str`, *optional* | The token used for gaps. Defaults to "▁". | `'▁'` |
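Several parameters accept a single string, a list of strings, or `None`, where `None` stands for every available option. The sketch below shows one way such an argument could be normalized into a plain list. `normalize_choice` and `QE4PE_DOMAINS` are hypothetical names used only for illustration; this is not taken from the `labl` source.

```python
from typing import Optional, Sequence, Union

# Options copied from the `domains` row of the table above.
QE4PE_DOMAINS = ("biomedical", "social")


def normalize_choice(
    value: Optional[Union[str, Sequence[str]]],
    options: Sequence[str],
) -> list:
    """Turn a `str | list[str] | None` argument into a validated list.

    Hypothetical helper: `None` selects every option, a single string
    becomes a one-element list, and unknown values raise `ValueError`.
    """
    if value is None:
        return list(options)
    selected = [value] if isinstance(value, str) else list(value)
    unknown = [item for item in selected if item not in options]
    if unknown:
        raise ValueError(f"Unknown option(s) {unknown}; expected any of {list(options)}")
    return selected
```

With this helper, `normalize_choice(None, QE4PE_DOMAINS)` yields both domains, while `normalize_choice("social", QE4PE_DOMAINS)` yields `["social"]`.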
Returns:
| Type | Description |
|---|---|
| `dict[str, dict[str, EditedDataset]]` | A dictionary containing the loaded datasets for each task and language. The keys are the task configurations, and the values are dictionaries with language keys and `EditedDataset` values. |
Source code in labl/datasets/translation/qe4pe.py
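A minimal usage sketch, assuming `labl` is installed and the QE4PE data can be downloaded. The import path follows the qualified name of `load_qe4pe` given above. `iter_datasets` and `load_main_biomedical` are hypothetical helpers written for this example, not part of `labl`; the first one walks the two-level dictionary described under Returns.

```python
from typing import Any, Iterator


def iter_datasets(nested: dict) -> Iterator[tuple]:
    """Yield (config, lang, dataset) triples from a config -> lang -> dataset dict."""
    for config, by_lang in nested.items():
        for lang, dataset in by_lang.items():
            yield config, lang, dataset


def load_main_biomedical() -> Any:
    """Load the biomedical part of the main task for both languages."""
    # Imported lazily so that `iter_datasets` stays usable without labl installed.
    from labl.datasets.translation import load_qe4pe

    return load_qe4pe(configs="main", langs=["ita", "nld"], domains="biomedical")


def show_loaded() -> None:
    """Print one line per loaded dataset (requires labl and network access)."""
    for config, lang, dataset in iter_datasets(load_main_biomedical()):
        print(config, lang, type(dataset).__name__)
```

Given the return type, a single dataset would be reached with two lookups, for example `load_main_biomedical()["main"]["ita"]`.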
labl.datasets.translation.load_divemt
load_divemt(
configs: DivemtTask | list[DivemtTask] = "main",
langs: DivemtLanguage | list[DivemtLanguage] = [
"ara",
"nld",
"ita",
"tur",
"ukr",
"vie",
],
mt_models: DivemtMTModel | list[DivemtMTModel] = [
"gtrans",
"mbart50",
],
tokenizer: str
| Tokenizer
| PreTrainedTokenizer
| PreTrainedTokenizerFast
| None = None,
tokenizer_kwargs: dict[str, Any] = {},
with_gaps: bool = True,
sub_label: str = "S",
ins_label: str = "I",
del_label: str = "D",
gap_token: str = "▁",
) -> dict[str, dict[str, dict[str, EditedDataset]]]
Load the DivEMT dataset by Sarti et al. (2022), containing edits over two sets of machine-translated sentences across six typologically diverse languages.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `configs` | `Literal["warmup", "main"] \| list[Literal["warmup", "main"]]`, *optional* | One or more task configurations to load. Defaults to "main". Available options: "warmup", "main". | `'main'` |
| `langs` | `Literal["ara", "nld", "ita", "tur", "ukr", "vie"] \| list[Literal["ara", "nld", "ita", "tur", "ukr", "vie"]]`, *optional* | One or more languages to load. Defaults to ["ara", "nld", "ita", "tur", "ukr", "vie"]. Available options: "ara", "nld", "ita", "tur", "ukr", "vie". | `['ara', 'nld', 'ita', 'tur', 'ukr', 'vie']` |
| `mt_models` | `Literal["gtrans", "mbart50"] \| list[Literal["gtrans", "mbart50"]]`, *optional* | One or more models for which post-edits need to be loaded. Defaults to ["gtrans", "mbart50"]. Available options: "gtrans", "mbart50". | `['gtrans', 'mbart50']` |
| `tokenizer` | `str \| Tokenizer \| PreTrainedTokenizer \| PreTrainedTokenizerFast`, *optional* | The tokenizer to use for tokenization. If None, a default whitespace tokenizer will be used. | `None` |
| `tokenizer_kwargs` | `dict[str, Any]`, *optional* | Additional arguments for the tokenizer. | `{}` |
| `with_gaps` | `bool`, *optional* | Whether to include gaps in the tokenization. Defaults to True. | `True` |
| `sub_label` | `str`, *optional* | The label for substitutions. Defaults to "S". | `'S'` |
| `ins_label` | `str`, *optional* | The label for insertions. Defaults to "I". | `'I'` |
| `del_label` | `str`, *optional* | The label for deletions. Defaults to "D". | `'D'` |
| `gap_token` | `str`, *optional* | The token used for gaps. Defaults to "▁". | `'▁'` |
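To make the roles of `with_gaps`, `sub_label`, `ins_label`, `del_label`, and `gap_token` more concrete, the sketch below labels the tokens of a machine translation against its post-edit. It is not `labl`'s implementation. It assumes whitespace tokenization, uses `difflib.SequenceMatcher` from the standard library as the aligner, interleaves one gap before each token plus one at the end, and attaches an insertion to the gap where the new text would go. `label_edits` is a hypothetical name.

```python
from difflib import SequenceMatcher


def label_edits(mt, pe, sub_label="S", ins_label="I", del_label="D",
                gap_token="▁", with_gaps=True):
    """Return (token, label) pairs for the MT tokens; unedited tokens get None.

    Illustrative assumption only: alignment comes from difflib, not from labl.
    """
    mt_toks, pe_toks = mt.split(), pe.split()
    tok_labels = [None] * len(mt_toks)
    gap_labels = [None] * (len(mt_toks) + 1)  # one gap around every token

    matcher = SequenceMatcher(a=mt_toks, b=pe_toks, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "replace":
            for i in range(i1, i2):
                tok_labels[i] = sub_label
        elif op == "delete":
            for i in range(i1, i2):
                tok_labels[i] = del_label
        elif op == "insert":
            gap_labels[i1] = ins_label  # new text goes before MT token i1

    if not with_gaps:
        # Without gaps there is no slot for insertions, so they are dropped here.
        return list(zip(mt_toks, tok_labels))

    out = []
    for i, tok in enumerate(mt_toks):
        out.append((gap_token, gap_labels[i]))
        out.append((tok, tok_labels[i]))
    out.append((gap_token, gap_labels[-1]))
    return out
```

For the pair `"the cat sat"` and `"the black cat sits"`, this sketch marks the gap after `the` with `"I"` and the token `sat` with `"S"`; every other token and gap stays unlabeled.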
Returns:
| Type | Description |
|---|---|
| `dict[str, dict[str, dict[str, EditedDataset]]]` | A dictionary containing the loaded datasets for each task, language, and MT model. The keys are the task configurations, and the values are dictionaries with language keys and, for each language, a dictionary mapping MT models to `EditedDataset` values. |
Source code in labl/datasets/translation/divemt.py
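Because `load_divemt` returns three levels of nesting, a depth-limited walker can be handier than three nested loops. The following is a sketch under the assumption that the levels are ordered task, language, MT model, as the return description suggests. `walk` and `load_mbart_subset` are hypothetical helpers, not part of `labl`.

```python
from typing import Any, Iterator


def walk(nested: Any, depth: int, path: tuple = ()) -> Iterator[tuple]:
    """Yield (path, value) pairs after descending `depth` dictionary levels."""
    if depth == 0:
        yield path, nested
        return
    for key, value in nested.items():
        yield from walk(value, depth - 1, path + (key,))


def load_mbart_subset() -> Any:
    """Load mBART-50 post-edits for Italian and Turkish from the main task."""
    # Lazy import: requires labl to be installed and the data to be reachable.
    from labl.datasets.translation import load_divemt

    return load_divemt(configs="main", langs=["ita", "tur"], mt_models="mbart50")


def show_subset() -> None:
    """Print the key path of every dataset in the subset."""
    for (task, lang, model), dataset in walk(load_mbart_subset(), depth=3):
        print(task, lang, model, type(dataset).__name__)
```

The explicit `depth` argument stops the descent at the dataset objects without having to inspect their type.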
labl.datasets.translation.load_wmt24esa
load_wmt24esa(
langs: Wmt24EsaLanguage
| list[Wmt24EsaLanguage]
| None = None,
domains: Wmt24EsaDomain
| list[Wmt24EsaDomain]
| None = None,
mt_models: Wmt24EsaMTModel
| list[Wmt24EsaMTModel]
| None = None,
tokenizer: str
| Tokenizer
| PreTrainedTokenizer
| PreTrainedTokenizerFast
| None = None,
tokenizer_kwargs: dict[str, Any] = {},
) -> dict[str, dict[str, LabeledDataset]]
Load the WMT24 ESA annotations from Kocmi et al. (2024), containing partially overlapping segments across multiple language pairs with a single set of ESA annotations over multiple MT system outputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `langs` | `Wmt24EsaLanguage \| list[Wmt24EsaLanguage] \| None` | One or more languages to load. | `None` |
| `domains` | `Wmt24EsaDomain \| list[Wmt24EsaDomain] \| None` | One or more text categories to load. | `None` |
| `mt_models` | `Wmt24EsaMTModel \| list[Wmt24EsaMTModel] \| None` | One or more models for which annotations need to be loaded. Defaults to all models. | `None` |
| `tokenizer` | `str \| Tokenizer \| PreTrainedTokenizer \| PreTrainedTokenizerFast`, *optional* | The tokenizer to use for tokenization. If None, a default whitespace tokenizer will be used. | `None` |
| `tokenizer_kwargs` | `dict[str, Any]`, *optional* | Additional arguments for the tokenizer. | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, dict[str, LabeledDataset]]` | A nested dictionary containing the loaded datasets for each MT model and language. The values of the outer dictionary are dictionaries with language keys and `LabeledDataset` values. |
Source code in labl/datasets/translation/wmt24_esa.py
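Since the available languages, domains, and MT models are not listed here, a quick way to discover them is to load everything with the default arguments and inspect the keys of the result. The sketch below assumes only the two-level return type stated above. `key_layout` and `inspect_wmt24esa` are hypothetical helpers, not part of `labl`.

```python
from typing import Any


def key_layout(nested: dict) -> tuple:
    """Return (sorted outer keys, sorted union of inner keys) of a two-level dict."""
    outer = sorted(nested)
    inner = sorted({key for by_inner in nested.values() for key in by_inner})
    return outer, inner


def inspect_wmt24esa() -> tuple:
    """Load all WMT24 ESA annotations and report which keys are present."""
    # Lazy import: requires labl to be installed and the data to be reachable.
    from labl.datasets.translation import load_wmt24esa

    data: Any = load_wmt24esa()  # every argument left at its default of None
    return key_layout(data)
```

The two returned lists can then be used to pick valid values for `langs` and `mt_models` in a narrower second call.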