Word-level Edit Analysis with labl 🏷️
In this notebook, we will use labl to analyze machine translation post-edits from multiple annotators, extracting useful statistics and visualizations. Finally, we will compare the annotator edit proportions with the error spans predicted by the word-level quality estimation model XCOMET-XXL to evaluate its performance.
First, we load some edit data hosted on the 🤗 Datasets Hub. We will use the QE4PE dataset, which contains 315 sentences, each with 12 human post-edits, for English-Italian and English-Dutch. The large number of annotators will prove useful for analyzing agreement.
# type: ignore
from datasets import load_dataset
full_main_dict = load_dataset("gsarti/qe4pe", "main")
full_main = full_main_dict["train"].to_pandas()
main = full_main[(~full_main["has_issue"]) & (full_main["translator_main_id"] != "no_highlight_t4")]
ita_main = main[main["tgt_lang"] == "ita"].reset_index(drop=True)
nld_main = main[main["tgt_lang"] == "nld"].reset_index(drop=True)
print("Italian main data:", len(ita_main), "total edits")
print("Dutch main data:", len(nld_main), "total edits")
We will now create an EditedDataset containing the multiple post-edits for each sentence using the from_edits_dataframe method, which allows for quick import from a pandas DataFrame. The required arguments are:
text_column: The name of the column containing the text before edits.
edit_column: The name of the column containing the text after edits.
entry_ids: A list of column names to be used as unique identifiers for each entry. This is useful when the same sentence has multiple edits, as in this case.
from labl import EditedDataset
ita = EditedDataset.from_edits_dataframe(
    ita_main,
    text_column="mt_text",
    edit_column="pe_text",
    entry_ids=["doc_id", "segment_in_doc_id"],
)
print("Italian main data:", len(ita), "unique entries")
nld = EditedDataset.from_edits_dataframe(
    nld_main,
    text_column="mt_text",
    edit_column="pe_text",
    entry_ids=["doc_id", "segment_in_doc_id"],
)
print("Dutch main data:", len(nld), "unique entries")
We can now visualize the contents of each entry by simply printing it. EditedDataset is a list-like object containing entries, and since multiple edits are available for each entry, every entry is also a list-like object of EditedEntry. An EditedEntry is, in essence, a combination of two LabeledEntry objects (see the Quickstart tutorial), one for the original text and one for the edited text, plus some additional information regarding edit alignments.
The aligned attribute is obtained using jiwer and corresponds to the Levenshtein alignment between the original and edited text. Since no tokenizer was provided, whitespace tokenization was used by default.
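To make the alignment step concrete, here is a minimal pure-Python sketch of a word-level Levenshtein alignment over whitespace tokens. It does not call jiwer or labl: the function name align_words and the K/S/D/I operation codes are assumptions chosen for illustration, and tie-breaking between equally cheap alignments may differ from what jiwer produces.

```python
from typing import Optional


def align_words(orig: str, edit: str) -> list[tuple[str, Optional[str], Optional[str]]]:
    """Word-level Levenshtein alignment between two whitespace-tokenized texts.

    Returns (op, orig_token, edit_token) triples, where op is one of
    "K" (kept), "S" (substitution), "D" (deletion), "I" (insertion).
    """
    a, b = orig.split(), edit.split()
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i
    for j in range(len(b) + 1):
        dist[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # keep or substitute
                dist[i - 1][j] + 1,  # delete a token of the original
                dist[i][j - 1] + 1,  # insert a token of the edit
            )
    # Backtrace from the bottom-right corner to recover the operations.
    ops = []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ops.append(("K" if a[i - 1] == b[j - 1] else "S", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("D", a[i - 1], None))
            i -= 1
        else:
            ops.append(("I", None, b[j - 1]))
            j -= 1
    return ops[::-1]


alignment = align_words("the cat sat down", "the black cat sat")
for op, orig_tok, edit_tok in alignment:
    print(op, orig_tok, edit_tok)
```

In this toy pair, one word is inserted and one is deleted, so the alignment contains both an I and a D operation alongside the kept tokens.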
Handling gaps
You might also note that orig.tokens and edit.tokens contain gap tokens (▁, see e.g. the MLQE-PE dataset for an example of gap usage). These are added by default when importing edits to keep annotations for insertions and deletions distinct on both sequences (for example, the insertion label I on the second gap of orig.tokens marks that the token ricerche was added in that position in edit.tokens, while the deletion label D on the fourth gap of edit.tokens marks that the token ricerche was deleted from orig.tokens).
If you want to restrict the analysis to the actual tokens, gap annotations can be transferred to the token on the right to obtain a more compact representation of the sequence. By default, labels are concatenated (so if a gap marked with I is followed by a token marked with S, the resulting label will be IS), but the merging behavior can be customized with the merge_fn argument.
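The merging behavior described above can be sketched without labl as follows. The helper merge_gaps_right, its signature, and the handling of a labeled trailing gap are assumptions made for this illustration; only the default concatenation (I followed by S gives IS) and the idea of a custom merge_fn come from the text.

```python
GAP = "▁"


def concatenate(gap_label, token_label):
    """Default merge assumed here: "I" on a gap plus "S" on a token gives "IS"."""
    return gap_label + token_label


def merge_gaps_right(tokens, labels, merge_fn=concatenate):
    """Move each gap's label onto the next real token and drop the gap.

    `labels` holds one label (or None) per position in `tokens`.
    `merge_fn(gap_label, token_label)` decides how two labels combine.
    """
    out_tokens, out_labels = [], []
    pending = None  # label carried over from the most recent gap
    for token, label in zip(tokens, labels):
        if token == GAP:
            if label is not None:
                pending = label if pending is None else merge_fn(pending, label)
            continue
        if pending is not None:
            label = pending if label is None else merge_fn(pending, label)
            pending = None
        out_tokens.append(token)
        out_labels.append(label)
    # A labeled trailing gap has no token to its right; this sketch keeps it as is.
    if pending is not None:
        out_tokens.append(GAP)
        out_labels.append(pending)
    return out_tokens, out_labels


tokens = ["▁", "Le", "▁", "analisi", "▁", "mostrano", "▁"]
labels = [None, None, "I", "S", None, None, None]

merged_tokens, merged_labels = merge_gaps_right(tokens, labels)
print(merged_tokens, merged_labels)

# A custom merge function that collapses any combination into a single label.
_, collapsed_labels = merge_gaps_right(tokens, labels, merge_fn=lambda gap, tok: "E")
print(collapsed_labels)
```

Passing a different merge_fn is enough to switch from combined labels such as IS to a single edit label.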
Agreement
We can now measure the edit agreement between annotators with get_agreement, which computes Krippendorff's alpha coefficient. Provided that every entry has multiple edits, the agreement is computed across all annotations of the original text, and also for every annotator pair.
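For readers unfamiliar with the coefficient, the following is a minimal sketch of Krippendorff's alpha for nominal labels, written from its standard coincidence-matrix definition. It is not labl's implementation, and the function name and input layout (one list of annotator labels per token) are assumptions for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of units (here, tokens); each unit is the list of labels
    assigned to it by the annotators who rated it.
    """
    coincidences = Counter()  # (label_c, label_k) -> weighted pair count
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a unit with a single rating carries no agreement information
        for c, k in permutations(ratings, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    totals = Counter()  # label -> marginal frequency
    for (c, _), value in coincidences.items():
        totals[c] += value
    n = sum(totals.values())
    observed = sum(v for (c, k), v in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        return float("nan")  # only one label was ever used: alpha is undefined
    return 1 - (n - 1) * observed / expected


# Four tokens, each labeled by two annotators ("K" = kept, "S" = substituted).
token_labels = [["K", "K"], ["S", "S"], ["K", "S"], ["S", "K"]]
alpha = krippendorff_alpha_nominal(token_labels)
print(alpha)
```

An alpha of 1 means perfect agreement, while values near 0 mean the annotators agree about as often as expected by chance.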
The agreement is quite low, but currently we are considering every type of edit as a separate label (including the combinations derived from merging, e.g. IS and ID). We can try to relabel the entries to use a single label to mark edits (e.g. E), and see how this affects the agreement computation. Relabeling with the relabel method can be done either with a relabel_map dictionary specifying the mapping from old to new labels, or with a relabel_fn function that takes a label and returns the new label. The latter is useful when we want to apply a more complex relabeling strategy, such as merging multiple labels into one.
⚠️ While relabeling affects all properties of the orig and edit LabeledEntry attributes in each EditedEntry, it does not affect the aligned attribute, which cannot be changed after the entry is created. This does not affect in any way the rest of the analysis.
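The two relabeling styles can be illustrated on a plain list of labels. This is a standalone sketch: the relabel function below is hypothetical and only mirrors the relabel_map and relabel_fn arguments named in the text. Leaving unmapped labels unchanged is an assumption of this sketch, not documented labl behavior.

```python
def relabel(labels, relabel_map=None, relabel_fn=None):
    """Return a relabeled copy of `labels` using either a mapping or a function."""
    if (relabel_map is None) == (relabel_fn is None):
        raise ValueError("Provide exactly one of relabel_map or relabel_fn")
    if relabel_map is not None:
        # Assumption of this sketch: labels missing from the map stay unchanged.
        return [relabel_map.get(label, label) for label in labels]
    return [relabel_fn(label) for label in labels]


labels = [None, "S", "IS", None, "D", "ID"]

# A dictionary must list every label that should change.
with_map = relabel(labels, relabel_map={"S": "E", "I": "E", "D": "E", "IS": "E", "ID": "E"})

# A function covers all edit labels at once, including merged ones such as "IS".
with_fn = relabel(labels, relabel_fn=lambda label: None if label is None else "E")

print(with_map)
print(with_fn)
```

The function form is more convenient here because the set of merged labels does not have to be enumerated in advance.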
The new agreement is now a Spearman's rank correlation coefficient, since the relabeling resulted in a binary labeling scheme. We can mark all unchanged tokens with a label K for "kept" to compute the agreement on both E and K labels. Correlation is not defined across multiple label sets, so the full attribute is None.
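As a reference for the correlation mentioned above, here is a small pure-Python sketch of Spearman's rank correlation applied to two annotators' E/K labels encoded as 1 and 0. The helper names are assumptions for illustration and this is not how labl computes the value internally. With binary labels, the rank transform only rescales the values, so the result equals the Pearson correlation of the 0/1 encodings.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant sequence has no defined correlation
    return cov / (var_x * var_y) ** 0.5


annotator_a = ["E", "K", "K", "E", "K", "K"]
annotator_b = ["E", "K", "E", "E", "K", "K"]
encode = {"E": 1, "K": 0}
rho = spearman([encode[l] for l in annotator_a], [encode[l] for l in annotator_b])
print(round(rho, 3))  # 0.707
```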
import ast
from labl.data import LabeledDataset
# Keep one row per segment: the MT text is repeated in every annotator's row.
ita_main_unique = ita_main.groupby(["doc_id", "segment_in_doc_id"]).first().reset_index(drop=True)
all_spans = []
for spans_str in ita_main_unique["mt_xcomet_errors"]:
    # Spans are stored as the string form of a list of dicts; literal_eval
    # parses Python literals without executing arbitrary code, unlike eval.
    list_dic_span = ast.literal_eval(spans_str)
    curr_spans = []
    for span in list_dic_span:
        curr_spans.append(
            {
                "start": span["start"],
                "end": span["end"],
                "label": span["severity"],
                "text": span["text"],
            }
        )
    all_spans.append(curr_spans)
ita_xcomet_spans = LabeledDataset.from_spans(
    texts=list(ita_main_unique["mt_text"]),
    spans=all_spans,
)
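To compare character-level spans with word-level edits, the spans must be projected onto tokens. The sketch below shows one way to do this with whitespace tokenization. It is independent of labl, the helper name spans_to_token_labels is hypothetical, and it assumes that each span's end offset is exclusive and that a token inherits the label of the first span it overlaps.

```python
import re


def spans_to_token_labels(text, spans):
    """Label each whitespace token with the first span that overlaps it."""
    labeled = []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        label = None
        for span in spans:
            # Overlap test for half-open intervals [start, end).
            if span["start"] < tok_end and span["end"] > tok_start:
                label = span["label"]
                break
        labeled.append((match.group(), label))
    return labeled


text = "Il gatto nero dorme sul divano"
spans = [{"start": 3, "end": 13, "label": "minor", "text": "gatto nero"}]
token_labels = spans_to_token_labels(text, spans)
for token, label in token_labels:
    print(token, label)
```

Once both the predicted spans and the human edits are expressed per token, the two can be compared position by position.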