A Linguistic Features
The features listed below were used in the experiments of Chapter 3 and summarize the full feature set presented in Brunato et al. (2020):
A.1 Raw Text Properties and Lexical Variety
Sentence length (n_tokens): Length of the sentence in terms of number of tokens.
Word length (char_per_tok): Average number of characters per word in a sentence, excluding punctuation.
Type/Token Ratio for forms and lemmas (ttr_form, ttr_lemma): Ratio between the number of lexical types and the number of tokens within a sentence.
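As a rough illustration of the definitions above, the following sketch computes the four raw-text features for a single tokenized sentence. It is not the Profiling-UD implementation: the function name `raw_text_features`, the use of `string.punctuation` to detect punctuation, the lowercasing before counting types, and the choice to keep punctuation in the type/token counts are all assumptions made for this example.

```python
import string


def raw_text_features(tokens, lemmas):
    """Compute raw-text features for one sentence.

    tokens and lemmas are parallel lists of strings (hypothetical input format).
    """
    # Assumption: a token counts as punctuation if every character is ASCII punctuation.
    words = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
    n_tokens = len(tokens)
    # Average word length, excluding punctuation tokens.
    char_per_tok = sum(len(w) for w in words) / len(words) if words else 0.0
    # Assumption: types are compared case-insensitively.
    ttr_form = len({t.lower() for t in tokens}) / n_tokens if n_tokens else 0.0
    ttr_lemma = len({l.lower() for l in lemmas}) / n_tokens if n_tokens else 0.0
    return {
        "n_tokens": n_tokens,
        "char_per_tok": char_per_tok,
        "ttr_form": ttr_form,
        "ttr_lemma": ttr_lemma,
    }


features = raw_text_features(
    ["The", "cat", "saw", "the", "dog", "."],
    ["the", "cat", "see", "the", "dog", "."],
)
```

In this example, "The" and "the" collapse into one type, so the sentence has five types over six tokens.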
A.2 Morpho-syntactic Information
Distribution of grammatical categories (upos_dist_*, xpos_dist_*): Percentage distribution in the sentence of the 17 core part-of-speech categories present in the Universal POS tagset (adjective, adposition, adverb, auxiliary, coordinating conjunction, determiner, interjection, noun, numeral, particle, pronoun, proper noun, punctuation, subordinating conjunction, symbol, verb, and other).
Lexical density (lexical_density): Ratio of content words (verbs, nouns, adjectives, and adverbs) over the total number of words in a sentence.
Inflectional morphology (aux_mood_*, aux_tense_*): Percentage distribution of a set of inflectional features (Mood, Number, Person, Tense, and Verbal Form) over the lexical verbs and auxiliaries of each sentence.
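The part-of-speech distribution and lexical density defined in this section can be sketched from a list of UPOS tags, one per token. This is a minimal sketch under stated assumptions, not the tool's actual code: the function names are hypothetical, and excluding punctuation from the word count in `lexical_density` is an assumption about what counts as a "word".

```python
from collections import Counter

# Content-word categories as listed in the definition of lexical density.
CONTENT_UPOS = {"VERB", "NOUN", "ADJ", "ADV"}


def upos_distribution(upos_tags):
    """Percentage of tokens per UPOS tag, keyed as upos_dist_<TAG>."""
    total = len(upos_tags)
    if total == 0:
        return {}
    counts = Counter(upos_tags)
    return {f"upos_dist_{tag}": 100.0 * n / total for tag, n in counts.items()}


def lexical_density(upos_tags):
    """Ratio of content words over all words in the sentence."""
    # Assumption: punctuation tokens are not counted as words.
    words = [tag for tag in upos_tags if tag != "PUNCT"]
    if not words:
        return 0.0
    return sum(tag in CONTENT_UPOS for tag in words) / len(words)


tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]
dist = upos_distribution(tags)
density = lexical_density(tags)
```

Here three of the five non-punctuation tokens are content words, giving a lexical density of 0.6.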
A.3 Verbal Predicate Structure
Distribution of verbal heads (vb_head_per_sent): Number of verbal heads in the sentence, corresponding to the number of main or subordinate clauses co-occurring in it.
Distribution of verbal roots (dep_dist_root): Percentage of verbal roots out of the total sentence roots.
Verb arity (verb_arity): Average number of dependency links sharing the same verbal head per sentence, excluding punctuation and copula dependencies.
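The number of verbal heads and the verb arity can be approximated from a dependency-parsed sentence as follows. The input format (a list of dictionaries with CoNLL-U style `id`, `head`, `upos`, and `deprel` fields) and the function name are hypothetical, and treating every token tagged `VERB` as a verbal head is a simplifying assumption rather than the tool's documented behavior.

```python
EXCLUDED_DEPRELS = {"punct", "cop"}  # links ignored when counting arity


def verb_features(sentence):
    """sentence: list of dicts with 'id', 'head', 'upos', 'deprel' keys."""
    # Assumption: a verbal head is any token tagged VERB.
    arity = {tok["id"]: 0 for tok in sentence if tok["upos"] == "VERB"}
    for tok in sentence:
        # Count each dependent attached to a verbal head, skipping excluded relations.
        if tok["head"] in arity and tok["deprel"] not in EXCLUDED_DEPRELS:
            arity[tok["head"]] += 1
    n_heads = len(arity)
    return {
        "vb_head_per_sent": n_heads,
        "verb_arity": sum(arity.values()) / n_heads if n_heads else 0.0,
    }


# "The cat saw the dog ."
sentence = [
    {"id": 1, "head": 2, "upos": "DET", "deprel": "det"},
    {"id": 2, "head": 3, "upos": "NOUN", "deprel": "nsubj"},
    {"id": 3, "head": 0, "upos": "VERB", "deprel": "root"},
    {"id": 4, "head": 5, "upos": "DET", "deprel": "det"},
    {"id": 5, "head": 3, "upos": "NOUN", "deprel": "obj"},
    {"id": 6, "head": 3, "upos": "PUNCT", "deprel": "punct"},
]
features = verb_features(sentence)
```

The single verb "saw" governs a subject and an object, so the arity is 2 once the punctuation link is discarded.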
A.4 Global and Local Parsed Tree Structures
Syntactic tree depth (parse_depth): Maximum syntactic tree depth extracted for the sentence, i.e., the longest path in terms of dependency links from the root of the dependency tree to some leaf.
Average and maximum length of dependency links (avg_links_len, max_links_len)
Number and average length of prepositional chains (n_prep_chains, prep_chain_len), with the latter expressed in number of tokens.
Subject-object ordering (subj_pre, subj_post, obj_pre, obj_post): Relative order of the subject and object arguments with respect to the verbal root of the clause in the sentence.
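The tree depth and link-length features of this section can be illustrated with a short sketch that works directly on the head indices of a dependency tree. The function name and input format are hypothetical. The sketch assumes a well-formed tree (no cycles) and, as a further assumption, includes punctuation links in the computation.

```python
def tree_features(heads):
    """heads[i] is the 1-based index of the head of token i+1; 0 marks the root."""

    def depth(token_id):
        # Number of dependency links between this token and the root.
        d = 0
        while heads[token_id - 1] != 0:
            token_id = heads[token_id - 1]
            d += 1
        return d

    # Longest root-to-leaf path, measured in dependency links.
    parse_depth = max((depth(i) for i in range(1, len(heads) + 1)), default=0)
    # Link length: linear distance in tokens between a dependent and its head.
    link_lens = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return {
        "parse_depth": parse_depth,
        "avg_links_len": sum(link_lens) / len(link_lens) if link_lens else 0.0,
        "max_links_len": max(link_lens, default=0),
    }


# "The cat saw the dog ." with "saw" (token 3) as root
features = tree_features([2, 3, 0, 5, 3, 3])
```

For this sentence the deepest tokens are the two determiners, two links away from the root, and the longest link connects the final punctuation mark to the verb.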
A.5 Syntactic Relations
Distribution of dependency relations (dep_dist_*): Percentage distribution of the 37 universal relations in the UD dependency annotation scheme.
A.6 Subordination Phenomena
Distribution of main and subordinate clauses (princ_prop_dist, sub_prop_dist): Percentage distribution of main vs subordinate clauses in the sentence.
Relative ordering of subordinates (sub_pre, sub_post): As for subjects and objects, whether the subordinate occurs in pre-verbal or post-verbal position in the sentence.
Average length of embedded subordinates (sub_chain_len): Average length of subordinate clauses recursively embedded into each other to form a subordinate chain.
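The relative ordering of subordinates can be sketched as below. How subordinate clauses are identified is not specified in this appendix, so the set of dependency relations in `SUBORD_DEPRELS` is an assumption made for illustration, as are the function name and the input format (CoNLL-U style `id`, `head`, and `deprel` fields). Expressing the result as percentages is also an assumption.

```python
# Assumption: clausal UD relations taken here to mark the head of a subordinate clause.
SUBORD_DEPRELS = {"acl", "advcl", "ccomp", "csubj", "xcomp"}


def subordinate_order(sentence):
    """sentence: list of dicts with 'id', 'head', 'deprel' keys."""
    # Strip language-specific subtypes such as "acl:relcl" before matching.
    subs = [t for t in sentence if t["deprel"].split(":")[0] in SUBORD_DEPRELS]
    if not subs:
        return {"sub_pre": 0.0, "sub_post": 0.0}
    # A subordinate is "pre" when its head token precedes the governing head.
    pre = sum(1 for t in subs if t["id"] < t["head"])
    return {
        "sub_pre": 100.0 * pre / len(subs),
        "sub_post": 100.0 * (len(subs) - pre) / len(subs),
    }


# "When it rains , I stay home": "rains" (advcl) precedes its head "stay"
sentence = [
    {"id": 1, "head": 3, "deprel": "advmod"},
    {"id": 2, "head": 3, "deprel": "nsubj"},
    {"id": 3, "head": 6, "deprel": "advcl"},
    {"id": 4, "head": 3, "deprel": "punct"},
    {"id": 5, "head": 6, "deprel": "nsubj"},
    {"id": 6, "head": 0, "deprel": "root"},
    {"id": 7, "head": 6, "deprel": "advmod"},
]
order = subordinate_order(sentence)
```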
Readers are referred to the original paper by Brunato et al. (2020) and the Profiling-UD webpage for additional details on the linguistic features.
References
Brunato, Dominique, Andrea Cimino, Felice Dell’Orletta, Giulia Venturi, and Simonetta Montemagni. 2020. “Profiling-UD: A Tool for Linguistic Profiling of Texts.” In Proceedings of the 12th Language Resources and Evaluation Conference, 7145–51. Marseille, France: European Language Resources Association. https://www.aclweb.org/anthology/2020.lrec-1.883.