Does It Capture STEL? Towards a Modular and Content-Controlled Linguistic Style Evaluation Benchmark

Anna Wegmann and Dong Nguyen

Natural language is not only about what is said (i.e., content), but also about how it is said (i.e., linguistic style). Linguistic style and social context are highly interrelated. For example, people can accommodate their linguistic style to each other based on social power differences. Furthermore, linguistic style can influence perception, e.g., the persuasiveness of news or the success of pitches on crowdsourcing platforms. As a result, style is relevant for natural language understanding, e.g., in author profiling, abuse detection, or understanding conversational interactions. Style is also important in natural language generation, including identity modeling in dialogue systems and style preservation in machine translation.

There are several general evaluation benchmarks for different linguistic phenomena, but less emphasis has been put on linguistic style. Nevertheless, the natural language processing literature shows a variety of approaches for evaluating style-measuring methods: they have been tested on whether they group texts by the same author together, whether they correctly classify style on ground-truth datasets, and whether ‘similar style words’ are similarly represented. However, these evaluation approaches (i) are often application-specific, (ii) are rarely used to compare different style methods, (iii) usually do not control for content, and (iv) often do not test for fine-grained style differences.

These shortcomings (i)-(iv) may stem from the following challenges in constructing style evaluation methods: (1) Style is a highly ambiguous and elusive term. We propose a modular framework where components can be removed or added to fit an application or a specific understanding of style. (2) Variation in style can be very small. Our proposed evaluation framework can test for fine-grained style differences. (3) Style is hard to disentangle from content, as the two are often correlated. For example, people might speak more formally in a job interview than in a bar with friends. Thus, language models and methods might pick up on spurious content correlations in a benchmark that does not control for content.

To this end, we propose STEL, a modular, fine-grained, and content-controlled similarity-based STyle EvaLuation framework. STEL demonstrates two general dimensions of style (formal/informal and simple/complex) as well as two specific characteristics of style (contraction and number substitution). By design, the STEL characteristic tasks are easier to solve than the STEL dimension tasks. STEL contains 815 task instances per dimension and 100 task instances per characteristic. Any method that can calculate a similarity between two sentences can be evaluated: (1) methods that compute similarity values directly (e.g., edit distance or cross-encoders) and (2) methods that produce vector representations of sentences, which are compared with a similarity measure (e.g., cosine similarity). This similarity-based setup also simplifies extending the framework with new tasks (cf. modularity). We find that the RoBERTa base model outperforms simple versions of commonly used style-measuring approaches such as LIWC, punctuation frequency, and character 3-grams. We invite the addition of complementary tasks and hope that this framework will facilitate the development of improved style-sensitive models and methods.
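To make the similarity-based setup concrete, the sketch below scores one hypothetical task instance with a character 3-gram baseline: given two anchor sentences in different styles and two alternative sentences, a method succeeds if the style-matched pairing scores higher than the crossed pairing. The task format, function names, and example sentences are illustrative assumptions, not the official STEL implementation.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Character n-gram counts -- a simple surface-level style representation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(c1[k] * c2[k] for k in c1)
    norm1 = sqrt(sum(v * v for v in c1.values()))
    norm2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

def solve_task(anchor1, anchor2, sent1, sent2, sim):
    """Return True if pairing (anchor1, sent1) and (anchor2, sent2) by style
    scores higher than the crossed pairing -- i.e., the method 'solves' the task."""
    matched = sim(anchor1, sent1) + sim(anchor2, sent2)
    crossed = sim(anchor1, sent2) + sim(anchor2, sent1)
    return matched > crossed

# A character-3-gram similarity method; any sentence-pair similarity works here.
sim = lambda a, b: cosine(char_ngrams(a), char_ngrams(b))

# Hypothetical formal/informal task instance (invented sentences):
a1 = "I do not know whether that is correct."  # formal anchor
a2 = "dunno if that's right lol"               # informal anchor
s1 = "We cannot attend the meeting tomorrow."  # formal alternative
s2 = "can't make it tomorrow, sorry"           # informal alternative
print(solve_task(a1, a2, s1, s2, sim))
```

Any method exposing a `sim(sentence_a, sentence_b)` function, whether a direct similarity or a cosine over sentence embeddings, can be plugged into `solve_task` unchanged, which is what makes the setup easy to extend.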