A Linguistic Features

The features listed below were used in the experiments of Chapter 3. They summarize the full feature set presented in Brunato et al. (2020):

A.1 Raw Text Properties and Lexical Variety

  • Sentence length (n_tokens): Length of the sentence in terms of number of tokens.

  • Word length (char_per_tok): Average number of characters per word in a sentence, excluding punctuation.

  • Type/Token Ratio for forms and lemmas (ttr_form, ttr_lemma): Ratio between the number of lexical types and the number of tokens within a sentence.
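
As a rough illustration of the three definitions above, the following sketch computes them from parallel lists of word forms and lemmas. The function name `raw_text_features`, the ASCII-based punctuation test, and the case-insensitive comparison of types are assumptions made for this example; the actual Profiling-UD implementation may differ on these points.

```python
import string


def raw_text_features(forms, lemmas):
    """Sketch of the raw-text features for one tokenized sentence."""
    # Assumption: a token is punctuation if it consists only of ASCII
    # punctuation characters (a real pipeline would rely on the UPOS tag).
    words = [f for f in forms if not all(ch in string.punctuation for ch in f)]

    n_tokens = len(forms)
    char_per_tok = sum(len(w) for w in words) / len(words) if words else 0.0

    # Assumption: types are compared case-insensitively.
    ttr_form = len({f.lower() for f in forms}) / n_tokens if n_tokens else 0.0
    ttr_lemma = len({l.lower() for l in lemmas}) / n_tokens if n_tokens else 0.0

    return {
        "n_tokens": n_tokens,
        "char_per_tok": char_per_tok,
        "ttr_form": ttr_form,
        "ttr_lemma": ttr_lemma,
    }


# Toy sentence, invented for illustration.
forms = ["The", "cat", "saw", "the", "cats", "."]
lemmas = ["the", "cat", "see", "the", "cat", "."]
feats = raw_text_features(forms, lemmas)
print(feats)
```

In this toy sentence the lemma-based ratio is lower than the form-based one because "cat" and "cats" collapse into a single lemma type.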

A.2 Morpho-syntactic Information

  • Distribution of grammatical categories (upos_dist_*, xpos_dist_*): Percentage distribution in the sentence of the 17 core part-of-speech categories of the Universal POS tagset (adjective, adverb, interjection, noun, proper noun, verb, adposition, auxiliary, coordinating conjunction, determiner, numeral, particle, pronoun, subordinating conjunction, punctuation, symbol, and other).

  • Lexical density (lexical_density): Ratio of content words (verbs, nouns, adjectives, and adverbs) over the total number of words in a sentence.

  • Inflectional morphology (aux_mood_*, aux_tense_*): Percentage distribution of a set of inflectional features (Mood, Number, Person, Tense, and Verbal Form) over the lexical verbs and auxiliaries of each sentence.
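
The part-of-speech distribution and lexical density can be sketched from a list of UPOS tags as shown below. The function name `morphosyntactic_features` is hypothetical, and excluding punctuation from the word count in the lexical density denominator is an assumption of this example, not a documented detail of the original tool.

```python
from collections import Counter

# Content-word categories as listed in the definition of lexical density.
CONTENT_UPOS = {"VERB", "NOUN", "ADJ", "ADV"}


def morphosyntactic_features(upos_tags):
    """Sketch of upos_dist_* and lexical_density for one sentence."""
    n = len(upos_tags)
    counts = Counter(upos_tags)
    # Percentage of tokens carrying each UPOS tag observed in the sentence.
    upos_dist = {f"upos_dist_{tag}": 100.0 * c / n for tag, c in counts.items()}

    # Assumption: "words" means all tokens except punctuation.
    words = [t for t in upos_tags if t != "PUNCT"]
    content = sum(1 for t in words if t in CONTENT_UPOS)
    lexical_density = content / len(words) if words else 0.0
    return upos_dist, lexical_density


# Tags for an invented sentence such as "The dog chased a small cat ."
tags = ["DET", "NOUN", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]
upos_dist, lexical_density = morphosyntactic_features(tags)
print(upos_dist)
print(lexical_density)
```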

A.3 Verbal Predicate Structure

  • Distribution of verbal heads (vb_head_per_sent): Number of verbal heads in the sentence, corresponding to the number of main or subordinate clauses co-occurring in it.

  • Distribution of verbal roots (dep_dist_root): Percentage of verbal roots out of the total sentence roots.

  • Verb arity (verb_arity): Average number of dependency links sharing the same verbal head per sentence, excluding punctuation and copula dependencies.
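
A minimal sketch of the three verbal predicate features follows, assuming each token is given as an `(id, head, upos, deprel)` tuple with 1-based ids and head 0 for the root, as in CoNLL-U. Treating every token tagged `VERB` as a verbal head is an assumption of this example, and the function name `verbal_predicate_features` is hypothetical.

```python
from collections import defaultdict


def verbal_predicate_features(tokens):
    """Sketch of vb_head_per_sent, dep_dist_root and verb_arity.

    tokens: list of (id, head, upos, deprel) tuples for one sentence.
    """
    upos_by_id = {tid: upos for tid, _, upos, _ in tokens}
    # Assumption: every VERB token counts as a verbal head.
    verb_ids = [tid for tid, upos in upos_by_id.items() if upos == "VERB"]

    # Count dependents of each verb, skipping punctuation and copula links.
    n_dependents = defaultdict(int)
    for _, head, _, deprel in tokens:
        if upos_by_id.get(head) == "VERB" and deprel not in {"punct", "cop"}:
            n_dependents[head] += 1

    if verb_ids:
        verb_arity = sum(n_dependents[v] for v in verb_ids) / len(verb_ids)
    else:
        verb_arity = 0.0

    # Percentage of sentence roots that are verbs.
    root_tags = [upos for _, head, upos, _ in tokens if head == 0]
    if root_tags:
        dep_dist_root = 100.0 * sum(u == "VERB" for u in root_tags) / len(root_tags)
    else:
        dep_dist_root = 0.0

    return {
        "vb_head_per_sent": len(verb_ids),
        "dep_dist_root": dep_dist_root,
        "verb_arity": verb_arity,
    }


# Invented parse of "The cat that sleeps eats fish ."
sentence = [
    (1, 2, "DET", "det"),
    (2, 5, "NOUN", "nsubj"),
    (3, 4, "PRON", "nsubj"),
    (4, 2, "VERB", "acl:relcl"),
    (5, 0, "VERB", "root"),
    (6, 5, "NOUN", "obj"),
    (7, 5, "PUNCT", "punct"),
]
verbal_feats = verbal_predicate_features(sentence)
print(verbal_feats)
```

Here "sleeps" governs one dependent and "eats" governs two once the punctuation link is discarded, so the average arity is 1.5.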

A.4 Global and Local Parsed Tree Structures

  • Syntactic tree depth (parse_depth): Maximum syntactic tree depth extracted for the sentence, i.e., the longest path in terms of dependency links from the root of the dependency tree to some leaf.

  • Average and maximum length of dependency links (avg_links_len, max_links_len)

  • Number and average length of prepositional chains (n_prep_chains, prep_chain_len), with the latter expressed in number of tokens.

  • Subject-object ordering (subj_pre, subj_post, obj_pre, obj_post): Relative order of the subject and object arguments with respect to the verbal root of the clause in the sentence.
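
Tree depth and dependency link lengths can be illustrated with a list of head indices alone. In the sketch below, `heads[i]` is the 1-based head of token `i + 1`, with 0 marking the root. Measuring link length as the absolute difference between the positions of head and dependent, and keeping punctuation links in the computation, are assumptions of this example; the function name `tree_features` is hypothetical.

```python
def tree_features(heads):
    """Sketch of parse_depth, avg_links_len and max_links_len.

    heads: list where heads[i] is the 1-based head of token i + 1 (0 = root).
    The input is assumed to be a well-formed tree (no cycles).
    """

    def depth(token_id):
        # Number of dependency links between this token and the root.
        d = 0
        while heads[token_id - 1] != 0:
            token_id = heads[token_id - 1]
            d += 1
        return d

    parse_depth = max((depth(t) for t in range(1, len(heads) + 1)), default=0)

    # Assumption: link length = distance in token positions between
    # dependent and head; the root has no incoming link and is skipped.
    link_lens = [abs(h - t) for t, h in enumerate(heads, start=1) if h != 0]
    avg_links_len = sum(link_lens) / len(link_lens) if link_lens else 0.0
    max_links_len = max(link_lens, default=0)

    return {
        "parse_depth": parse_depth,
        "avg_links_len": avg_links_len,
        "max_links_len": max_links_len,
    }


# Heads for an invented parse of "The cat that sleeps eats fish ."
heads = [2, 5, 4, 2, 0, 5, 5]
tree_feats = tree_features(heads)
print(tree_feats)
```

The deepest leaf in this example is "that", which reaches the root through "sleeps" and "cat", giving a depth of three links.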

A.5 Syntactic Relations

  • Distribution of dependency relations (dep_dist_*): Percentage distribution of the 37 universal relations in the UD dependency annotation scheme.

A.6 Subordination Phenomena

  • Distribution of main and subordinate clauses (princ_prop_dist, sub_prop_dist): Percentage distribution of main vs subordinate clauses in the sentence.

  • Relative ordering of subordinates (sub_pre, sub_post): As for subjects and objects, whether the subordinate occurs in pre-verbal or post-verbal position in the sentence.

  • Average length of embedded subordinates (sub_chain_len): Average length of subordinate clauses recursively embedded into each other to form a subordinate chain.

Readers are referred to the original paper by Brunato et al. (2020) and the Profiling-UD webpage for additional details on these linguistic features.

References

Brunato, Dominique, Andrea Cimino, Felice Dell’Orletta, Giulia Venturi, and Simonetta Montemagni. 2020. “Profiling-UD: A Tool for Linguistic Profiling of Texts.” In Proceedings of the 12th Language Resources and Evaluation Conference, 7145–51. Marseille, France: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.883.