Word-level Edit Analysis with `labl` 🏷️

In this notebook, we will use labl to analyze machine translation post-edits from multiple annotators, extracting useful statistics and visualizations. Finally, we will compare the annotator edit proportions with the error spans predicted by the word-level quality estimation model XCOMET-XXL to evaluate its performance.

First, we load some edit data hosted on the 🤗 Datasets Hub. We use the QE4PE dataset, which contains 315 sentences, each with 12 human post-edits, for English-Italian and English-Dutch. The large number of annotators will prove useful for analyzing agreement.

# type: ignore
from datasets import load_dataset

full_main_dict = load_dataset("gsarti/qe4pe", "main")
full_main = full_main_dict["train"].to_pandas()
main = full_main[(~full_main["has_issue"]) & (full_main["translator_main_id"] != "no_highlight_t4")]

ita_main = main[main["tgt_lang"] == "ita"].reset_index(drop=True)
nld_main = main[main["tgt_lang"] == "nld"].reset_index(drop=True)

print("Italian main data:", len(ita_main), "total edits")
print("Dutch main data:", len(nld_main), "total edits")

Italian main data: 3780 total edits
Dutch main data: 3780 total edits

We will now create an EditedDataset containing the multiple post-edits for each sentence using the from_edits_dataframe method, which allows for quick import from a pandas DataFrame. The required arguments are:

text_column: The name of the column containing the text before edits.
edit_column: The name of the column containing the text after edits.
entry_ids: A list of column names to be used as unique identifiers for each entry. This is useful when the same sentence has multiple edits, as in this case.

from labl import EditedDataset

ita = EditedDataset.from_edits_dataframe(
    ita_main,
    text_column="mt_text",
    edit_column="pe_text",
    entry_ids=["doc_id", "segment_in_doc_id"],
)
print("Italian main data:", len(ita), "unique entries")

nld = EditedDataset.from_edits_dataframe(
    nld_main,
    text_column="mt_text",
    edit_column="pe_text",
    entry_ids=["doc_id", "segment_in_doc_id"],
)
print("Dutch main data:", len(nld), "unique entries")

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Extracting texts and edits: 100%|██████████| 315/315 [00:00<00:00, 1501.24entries/s]
Creating EditedDataset: 100%|██████████| 315/315 [00:00<00:00, 777.58entries/s]

Italian main data: 315 unique entries

Extracting texts and edits: 100%|██████████| 315/315 [00:00<00:00, 1516.79entries/s]
Creating EditedDataset: 100%|██████████| 315/315 [00:00<00:00, 861.47entries/s]

Dutch main data: 315 unique entries
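
To illustrate the role of entry_ids, here is a minimal plain-Python sketch of the grouping step. It is not labl's actual implementation: the helper group_edits and the toy rows are hypothetical, and only the column names are taken from the call above.

```python
from collections import defaultdict

# Toy rows mimicking the dataframe layout: one row per (sentence, annotator) pair.
rows = [
    {"doc_id": 1, "segment_in_doc_id": 1, "mt_text": "il gatto nero", "pe_text": "il gatto nero"},
    {"doc_id": 1, "segment_in_doc_id": 1, "mt_text": "il gatto nero", "pe_text": "un gatto nero"},
    {"doc_id": 1, "segment_in_doc_id": 2, "mt_text": "la casa", "pe_text": "la casa grande"},
]


def group_edits(rows, text_column, edit_column, entry_ids):
    """Hypothetical helper: collect all edits that share the same identifier columns."""
    grouped = defaultdict(lambda: {"text": None, "edits": []})
    for row in rows:
        # The tuple of identifier values acts as the unique key of an entry.
        key = tuple(row[col] for col in entry_ids)
        grouped[key]["text"] = row[text_column]
        grouped[key]["edits"].append(row[edit_column])
    return dict(grouped)


entries = group_edits(rows, "mt_text", "pe_text", ["doc_id", "segment_in_doc_id"])
print(len(rows), "total edits ->", len(entries), "unique entries")
```

On the real data, the same kind of grouping is what turns 3780 rows into 315 unique entries with 12 edits each.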

We can now visualize the contents of each entry by simply printing it. EditedDataset is a list-like object containing entries, and since multiple edits are available for each entry, every entry is also a list-like object of EditedEntry. An EditedEntry is, in essence, a combination of two LabeledEntry objects (see the Quickstart tutorial), one for the original text and one for the edited text, plus some additional information regarding edit alignments.

# Accessing all edits for the first unique entry
id_0_all_edits = ita[0]

# Accessing a single edit (here the one at index 5) for the first unique entry
id_0_edit = ita[0][5]

# Visualize the contents of an edited entry
print(id_0_edit)

orig.text:
            Esistono limitate ricerche riguardanti la continuità, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
edit.text:
            Esistono ricerche limitate riguardanti la costanza, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
orig.tokens:
            ▁ Esistono ▁ limitate ▁ ricerche ▁ riguardanti ▁ la ▁ continuità, ▁ la ▁ stabilità ▁ e ▁ il ▁ ruolo ▁ del ▁ paese ▁ di ▁ origine ▁ nel ▁ temperamento ▁ del ▁ neonato ▁ prematuro ▁ durante ▁ il ▁ primo ▁ anno ▁ di ▁ vita. ▁
                       I                   D                                S                                                                                                                                                             

edit.tokens:
            ▁ Esistono ▁ ricerche ▁ limitate ▁ riguardanti ▁ la ▁ costanza, ▁ la ▁ stabilità ▁ e ▁ il ▁ ruolo ▁ del ▁ paese ▁ di ▁ origine ▁ nel ▁ temperamento ▁ del ▁ neonato ▁ prematuro ▁ durante ▁ il ▁ primo ▁ anno ▁ di ▁ vita. ▁
                                I            D                            S                                                                                                                                                             

aligned:
            ORIG: Esistono ******** limitate ricerche riguardanti la continuità, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
            EDIT: Esistono ricerche limitate ******** riguardanti la   costanza, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
                                  I                 D                          S

The aligned attribute is obtained using jiwer and corresponds to the Levenshtein alignment between the original and edited text. Since no tokenizer was provided, whitespace tokenization was used by default.
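
To make the alignment step concrete, the following is a minimal pure-Python sketch of a word-level Levenshtein alignment over whitespace tokens. It is not jiwer's implementation: the function name align_words and the K label for unchanged words are conventions of this sketch. When several alignments have the same cost, it may pick a different one than jiwer, for instance two substitutions instead of an insertion plus a deletion.

```python
def align_words(orig: str, edit: str):
    """Return a list of (op, orig_word, edit_word) with op in {K, S, I, D}."""
    a, b = orig.split(), edit.split()
    n, m = len(a), len(b)
    # dist[i][j] is the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # keep or substitute
                dist[i - 1][j] + 1,         # delete a word of orig
                dist[i][j - 1] + 1,         # insert a word of edit
            )
    # Walk back from the bottom-right corner to recover one optimal alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            ops.append(("K", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("S", a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("D", a[i - 1], None))
            i -= 1
        else:
            ops.append(("I", None, b[j - 1]))
            j -= 1
    ops.reverse()
    return ops


for op, orig_word, edit_word in align_words("a c d", "a b c e"):
    print(op, orig_word, edit_word)
```

The number of operations other than K equals the word-level edit distance between the two texts, regardless of which optimal alignment is chosen.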

Handling gaps

You might also note that orig.tokens and edit.tokens contain gap tokens (▁, see e.g. the MLQE-PE dataset for an example of gap usage). These are added by default when importing edits to keep annotations for insertions and deletions distinct on both sequences (for example, the insertion label I on the second gap of orig.tokens marks that the token ricerche was added in that position in edit.tokens, while the deletion label D on the fourth gap of edit.tokens marks that the token ricerche was deleted from orig.tokens).

If you want to restrict the analysis to the actual tokens, gap annotations can be transferred to the token on the right to obtain a more compact representation of the sequence. By default, labels are concatenated (so if a gap marked with I is followed by a token marked with S, the resulting label will be IS), but the merging behavior can be customized with the merge_fn argument:

# Merge gap annotations in-place
ita.merge_gap_annotations(keep_final_gap=False)
print(ita[0][5])

orig.text:
            Esistono limitate ricerche riguardanti la continuità, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
edit.text:
            Esistono ricerche limitate riguardanti la costanza, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
orig.tokens:
            Esistono limitate ricerche riguardanti la continuità, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
                            I        D                          S                                                                                                                   

edit.tokens:
            Esistono ricerche limitate riguardanti la costanza, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
                            I                    D            S                                                                                                                   

aligned:
            ORIG: Esistono ******** limitate ricerche riguardanti la continuità, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
            EDIT: Esistono ricerche limitate ******** riguardanti la   costanza, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
                                  I                 D                          S
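
The merging logic can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions (gaps and tokens strictly alternate, labels are strings or None), not labl's implementation: the helper merge_gaps and its default concat_labels are hypothetical names, and only the merge_fn and keep_final_gap parameter names are borrowed from the text above.

```python
GAP = "▁"


def concat_labels(gap_label, token_label):
    """Default merge of this sketch: concatenate the labels, skipping missing ones."""
    if gap_label is None:
        return token_label
    if token_label is None:
        return gap_label
    return gap_label + token_label


def merge_gaps(tokens, labels, merge_fn=concat_labels, keep_final_gap=False):
    """Move each gap label onto the token to its right and drop the gap tokens."""
    out_tokens, out_labels = [], []
    pending = None  # label of the most recent gap, waiting for the next token
    for tok, lab in zip(tokens, labels):
        if tok == GAP:
            pending = lab
        else:
            out_tokens.append(tok)
            out_labels.append(merge_fn(pending, lab))
            pending = None
    # The final gap has no token on its right, so it is either kept as-is or dropped.
    if keep_final_gap and tokens and tokens[-1] == GAP:
        out_tokens.append(GAP)
        out_labels.append(pending)
    return out_tokens, out_labels


tokens = [GAP, "Esistono", GAP, "ricerche", GAP, "costanza,", GAP]
labels = [None, None, "I", "S", None, None, "D"]

merged_tokens, merged_labels = merge_gaps(tokens, labels)
print(merged_tokens, merged_labels)

# A custom merge function: the token label wins whenever both are present.
_, token_first = merge_gaps(tokens, labels, merge_fn=lambda gap, tok: tok if tok is not None else gap)
print(token_first)
```

Note that with keep_final_gap=False any label attached to the final gap is discarded in this sketch, since there is no token to its right to receive it.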

Agreement

We can now easily measure the edit agreement between annotators with get_agreement, which here uses Krippendorff's alpha coefficient. Provided that every entry has multiple edits, the agreement will be computed across all annotations of the original text, and for every annotator pair:

agreement_output = ita.get_agreement()
print(agreement_output)

AgreementOutput(
    type: krippendorff_nominal,
    full: 0.3234,
    pair:
            | A0   | A1   | A2   | A3   | A4   | A5   | A6   | A7   | A8   | A9   | A10  | A11  |
        A0  |      | 0.36 | 0.53 | 0.32 | 0.18 | 0.37 | 0.35 | 0.36 | 0.41 | 0.35 | 0.35 | 0.37 |
        A1  | 0.36 |      | 0.18 | 0.32 | 0.33 | 0.27 | 0.4  | 0.42 | 0.35 | 0.32 | 0.36 | 0.34 |
        A2  | 0.53 | 0.18 |      | 0.45 | 0.23 | 0.37 | 0.39 | 0.35 | 0.34 | 0.56 | 0.34 | 0.38 |
        A3  | 0.32 | 0.32 | 0.45 |      | 0.34 | 0.38 | 0.34 | 0.32 | 0.33 | 0.38 | 0.29 | 0.41 |
        A4  | 0.18 | 0.33 | 0.23 | 0.34 |      | 0.32 | 0.33 | 0.31 | 0.28 | 0.21 | 0.33 | 0.27 |
        A5  | 0.37 | 0.27 | 0.37 | 0.38 | 0.32 |      | 0.3  | 0.34 | 0.31 | 0.33 | 0.34 | 0.32 |
        A6  | 0.35 | 0.4  | 0.39 | 0.34 | 0.33 | 0.3  |      | 0.31 | 0.28 | 0.27 | 0.3  | 0.3  |
        A7  | 0.36 | 0.42 | 0.35 | 0.32 | 0.31 | 0.34 | 0.31 |      | 0.34 | 0.4  | 0.34 | 0.34 |
        A8  | 0.41 | 0.35 | 0.34 | 0.33 | 0.28 | 0.31 | 0.28 | 0.34 |      | 0.31 | 0.33 | 0.35 |
        A9  | 0.35 | 0.32 | 0.56 | 0.38 | 0.21 | 0.33 | 0.27 | 0.4  | 0.31 |      | 0.36 | 0.3  |
        A10 | 0.35 | 0.36 | 0.34 | 0.29 | 0.33 | 0.34 | 0.3  | 0.34 | 0.33 | 0.36 |      | 0.35 |
        A11 | 0.37 | 0.34 | 0.38 | 0.41 | 0.27 | 0.32 | 0.3  | 0.34 | 0.35 | 0.3  | 0.35 |      |

)
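
For readers unfamiliar with the krippendorff_nominal statistic reported above, here is a small self-contained sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It only illustrates the formula and makes no claim about how labl computes it internally. The function name and the toy data are hypothetical, and unedited tokens are written as an explicit "K" label, with None reserved for missing annotations.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """units: one list per item, holding the labels assigned by each annotator (None = missing)."""
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels carry no agreement information
        # Every ordered pair of labels from different annotators counts 1 / (m - 1).
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    totals = Counter()
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        return float("nan")  # a single label overall: alpha is undefined
    return 1 - (n - 1) * observed / expected


# Four tokens labeled by two annotators: they disagree only on the third one.
toy_units = [["E", "E"], ["K", "K"], ["E", "K"], ["K", "K"]]
alpha = krippendorff_alpha_nominal(toy_units)
print(round(alpha, 4))
```

On the real data, each unit would correspond to one token of the original text, with up to 12 labels per token.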

The agreement is quite low, but currently we are considering every type of edit as a separate label (including the combinations derived from merging, e.g. IS and ID). We can try to relabel the entries to use a single label to mark edits (e.g. E), and see how this affects the agreement computation. Relabeling with the relabel method can be done either with a relabel_map dictionary specifying the mapping from old to new labels, or with a relabel_fn function that takes a label and returns the new label. The latter is useful when we want to apply a more complex relabeling strategy, such as merging multiple labels into one.

⚠️ While relabeling affects all properties of the orig and edit LabeledEntry attributes in each EditedEntry, it does not affect the aligned attribute, which cannot be changed after the entry is created. This does not affect in any way the rest of the analysis.
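
The difference between the two relabeling styles can be sketched on a plain list of labels. The relabel function below is a hypothetical stand-in written for illustration, not the labl method. In particular, leaving labels that are missing from the map untouched is an assumption of this sketch.

```python
def relabel(labels, relabel_map=None, relabel_fn=None):
    """Hypothetical helper: apply either a mapping or a function to every label."""
    if (relabel_map is None) == (relabel_fn is None):
        raise ValueError("Provide exactly one of relabel_map or relabel_fn")
    if relabel_map is not None:
        # Assumption of this sketch: labels absent from the map are kept as they are.
        return [relabel_map.get(lab, lab) for lab in labels]
    return [relabel_fn(lab) for lab in labels]


labels = [None, "I", "D", None, "IS"]

# A dictionary must enumerate every label, including merged ones such as IS.
with_map = relabel(labels, relabel_map={"I": "E", "D": "E", "S": "E", "IS": "E"})

# A function covers any label combination with a single rule.
with_fn = relabel(labels, relabel_fn=lambda lab: "E" if lab is not None else None)

print(with_map)
print(with_fn)
```

The function form is more convenient here because the merged labels are open-ended combinations of I, D and S.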

ita.relabel(lambda lab: "E" if lab is not None else None)

# Visualize the contents of an edited entry
print(ita[0][5])

orig.text:
            Esistono limitate ricerche riguardanti la continuità, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
edit.text:
            Esistono ricerche limitate riguardanti la costanza, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
orig.tokens:
            Esistono limitate ricerche riguardanti la continuità, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
                            E        E                          E                                                                                                                   

edit.tokens:
            Esistono ricerche limitate riguardanti la costanza, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
                            E                    E            E                                                                                                                   

aligned:
            ORIG: Esistono ******** limitate ricerche riguardanti la continuità, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
            EDIT: Esistono ricerche limitate ******** riguardanti la   costanza, la stabilità e il ruolo del paese di origine nel temperamento del neonato prematuro durante il primo anno di vita.
                                  I                 D                          S

agreement_output = ita.get_agreement()
print(agreement_output)

AgreementOutput(
    type: spearmanr_binary,
    full: None,
    pair:
            | A0   | A1   | A2   | A3   | A4   | A5   | A6   | A7   | A8   | A9   | A10  | A11  |
        A0  |      | 0.34 | 0.22 | 0.33 | 0.27 | 0.3  | 0.33 | 0.27 | 0.28 | 0.3  | 0.27 | 0.27 |
        A1  | 0.34 |      | 0.22 | 0.38 | 0.33 | 0.36 | 0.4  | 0.32 | 0.34 | 0.34 | 0.36 | 0.35 |
        A2  | 0.22 | 0.22 |      | 0.26 | 0.21 | 0.23 | 0.25 | 0.18 | 0.22 | 0.28 | 0.2  | 0.21 |
        A3  | 0.33 | 0.38 | 0.26 |      | 0.36 | 0.37 | 0.39 | 0.31 | 0.36 | 0.36 | 0.35 | 0.35 |
        A4  | 0.27 | 0.33 | 0.21 | 0.36 |      | 0.4  | 0.35 | 0.32 | 0.39 | 0.25 | 0.34 | 0.34 |
        A5  | 0.3  | 0.36 | 0.23 | 0.37 | 0.4  |      | 0.37 | 0.34 | 0.37 | 0.33 | 0.34 | 0.38 |
        A6  | 0.33 | 0.4  | 0.25 | 0.39 | 0.35 | 0.37 |      | 0.37 | 0.37 | 0.35 | 0.38 | 0.38 |
        A7  | 0.27 | 0.32 | 0.18 | 0.31 | 0.32 | 0.34 | 0.37 |      | 0.37 | 0.34 | 0.39 | 0.4  |
        A8  | 0.28 | 0.34 | 0.22 | 0.36 | 0.39 | 0.37 | 0.37 | 0.37 |      | 0.3  | 0.36 | 0.37 |
        A9  | 0.3  | 0.34 | 0.28 | 0.36 | 0.25 | 0.33 | 0.35 | 0.34 | 0.3  |      | 0.33 | 0.31 |
        A10 | 0.27 | 0.36 | 0.2  | 0.35 | 0.34 | 0.34 | 0.38 | 0.39 | 0.36 | 0.33 |      | 0.4  |
        A11 | 0.27 | 0.35 | 0.21 | 0.35 | 0.34 | 0.38 | 0.38 | 0.4  | 0.37 | 0.31 | 0.4  |      |

)

The agreement is now reported as Spearman's rank correlation coefficient, since the relabeling resulted in a binary labeling scheme. We can mark all unchanged tokens with a label K for "kept" to compute the agreement on both E and K labels. Correlation is only defined between two label sequences at a time, so the full attribute is None.
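
On binary labels, Spearman's coefficient coincides with Pearson's correlation on the 0/1 vectors (the phi coefficient), because ranking with averaged ties is an affine transformation of the two values. The sketch below checks this on toy data in plain Python. The helper names are hypothetical, and it does not reflect how labl computes the score internally.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, where tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# 1 = token edited (E), 0 = token kept, for two annotators over four tokens.
annotator_a = [1, 0, 1, 0]
annotator_b = [1, 0, 0, 0]
rho = spearman(annotator_a, annotator_b)
phi = pearson(annotator_a, annotator_b)
print(round(rho, 4), round(phi, 4))
```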

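Next, we turn to the error spans predicted by XCOMET-XXL, which are stored as a stringified list of dictionaries in the mt_xcomet_errors column. We keep one row per unique entry and import the spans as a LabeledDataset:

import ast

from labl.data import LabeledDataset

ita_main_unique = ita_main.groupby(["doc_id", "segment_in_doc_id"]).first().reset_index(drop=True)

all_spans = []
for spans_str in ita_main_unique["mt_xcomet_errors"]:
    curr_spans = []
    # Parse the stringified list of span dictionaries without executing arbitrary code
    list_dic_span = ast.literal_eval(spans_str)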
    for span in list_dic_span:
        curr_spans.append(
            {
                "start": span["start"],
                "end": span["end"],
                "label": span["severity"],
                "text": span["text"],
            }
        )
    all_spans.append(curr_spans)

ita_xcomet_spans = LabeledDataset.from_spans(
    texts=list(ita_main_unique["mt_text"]),
    spans=all_spans,
)

Creating labeled dataset: 100%|██████████| 315/315 [00:00<00:00, 2770.52entries/s]

print(ita_xcomet_spans[5])

text:
       La continuità del temperamento dai 6 ai 12 mesi varia a seconda del paese: le madri cilene hanno riportato un aumento del sorriso e della risata e del livello di attività dai 6 ai 12 mesi, e le madri del Regno Unito hanno riportato una diminuzione del sorriso e della risata e un aumento della paura dai 6 ai 12 mesi.
tagged:
       La continuità del temperamento dai 6 ai 12 mesi varia a seconda del paese: le madri cilene hanno riportato un aumento del sorriso<minor> e</minor> della<minor> risata</minor> e del livello di attività dai 6 ai 12 mesi, e le madri del Regno Unito hanno riportato una diminuzione del sorriso e della risata e un aumento della paura dai 6 ai 12 mesi.
tokens:
       La continuità del temperamento dai 6 ai 12 mesi varia a seconda del paese: le madri cilene hanno riportato un aumento del sorriso     e della risata e del livello di attività dai 6 ai 12 mesi, e le madri del Regno Unito hanno riportato una diminuzione del sorriso e della risata e un aumento della paura dai 6 ai 12 mesi.
                                                                                                                                         minor        minor                                                                                                                                                                             

spans:
       0: 129:131 (e) => minor
       1: 137:144 (risata) => minor
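
The token-level view printed above is derived from character spans. A minimal sketch of that projection, assuming whitespace tokenization and a simple overlap rule, could look as follows. The helper spans_to_token_labels is hypothetical and may differ from what from_spans does internally, for example in how partially covered tokens are handled.

```python
import re


def spans_to_token_labels(text, spans):
    """Hypothetical helper: label each whitespace token that overlaps a character span."""
    tokens, labels = [], []
    for match in re.finditer(r"\S+", text):
        label = None
        for span in spans:
            # Two half-open intervals overlap when each one starts before the other ends.
            if match.start() < span["end"] and match.end() > span["start"]:
                label = span["label"]
                break
        tokens.append(match.group())
        labels.append(label)
    return tokens, labels


text = "il sorriso e della risata"
spans = [
    {"start": 10, "end": 12, "label": "minor"},  # covers " e", leading space included
    {"start": 18, "end": 25, "label": "minor"},  # covers " risata"
]
tokens, labels = spans_to_token_labels(text, spans)
for tok, lab in zip(tokens, labels):
    print(tok, lab)
```

As in the printed entry, the spans in this toy example include the leading space of the flagged word, which does not affect the token that receives the label.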

# Collapse the severity labels into the single edit label E used for the human edits
ita_xcomet_spans.relabel(lambda lab: "E" if lab is not None else None)
print(ita_xcomet_spans[5])

text:
       La continuità del temperamento dai 6 ai 12 mesi varia a seconda del paese: le madri cilene hanno riportato un aumento del sorriso e della risata e del livello di attività dai 6 ai 12 mesi, e le madri del Regno Unito hanno riportato una diminuzione del sorriso e della risata e un aumento della paura dai 6 ai 12 mesi.
tagged:
       La continuità del temperamento dai 6 ai 12 mesi varia a seconda del paese: le madri cilene hanno riportato un aumento del sorriso<E> e</E> della<E> risata</E> e del livello di attività dai 6 ai 12 mesi, e le madri del Regno Unito hanno riportato una diminuzione del sorriso e della risata e un aumento della paura dai 6 ai 12 mesi.
tokens:
       La continuità del temperamento dai 6 ai 12 mesi varia a seconda del paese: le madri cilene hanno riportato un aumento del sorriso e della risata e del livello di attività dai 6 ai 12 mesi, e le madri del Regno Unito hanno riportato una diminuzione del sorriso e della risata e un aumento della paura dai 6 ai 12 mesi.
                                                                                                                                         E            E                                                                                                                                                                             

spans:
       0: 129:131 (None) => E
       1: 137:144 (None) => E

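# Compare the XCOMET labels with each annotator's labels on the original (MT) tokens
for idx in range(len(ita[0])):
    annotator_labels = LabeledDataset([e[idx].orig for e in ita])
    agreement = ita_xcomet_spans.get_agreement(annotator_labels)
    print(f"Agreement of XCOMET with annotator {idx}: {agreement.pair}")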

Agreement of XCOMET with annotator 0: 0.21915394517009773
Agreement of XCOMET with annotator 1: 0.2275337204552564
Agreement of XCOMET with annotator 2: 0.22868301380157635
Agreement of XCOMET with annotator 3: 0.20886058547534597
Agreement of XCOMET with annotator 4: 0.18324181750361304
Agreement of XCOMET with annotator 5: 0.2350649104677996
Agreement of XCOMET with annotator 6: 0.2599132539885884
Agreement of XCOMET with annotator 7: 0.1887801438815674
Agreement of XCOMET with annotator 8: 0.18871477020233016
Agreement of XCOMET with annotator 9: 0.24598840796027605
Agreement of XCOMET with annotator 10: 0.1912293643198062
Agreement of XCOMET with annotator 11: 0.19964725018467855
