Conversational data collection through Wizard of Oz experiments for emotion detection and stylometric analysis

Sofie Labat, Thomas Demeester and Véronique Hoste

Recent advances in cognitive technologies, such as natural language processing, are transforming customer service (CS) delivery models (Deloitte Digital, 2021). These technologies not only assist human operators in their daily tasks, but can also be used to design automated conversational agents that interact with humans in a personalized and empathetic manner. In this respect, the automatic modelling of fine-grained emotion trajectories in CS dialogues forms an important application. The term ‘trajectory’ reflects that emotions are considered dynamic attributes of the customer that can change with each utterance in the conversation. To better predict such shifts, we view (i) the event that happened prior to the dialogue and (ii) the response strategies of operators as part of the trajectory.

Many small and medium-sized enterprises (SMEs), however, do not have the necessary data at their disposal to train such machine learning systems. We therefore compiled a large multilingual Twitter corpus containing 275k conversations (Hadifar et al., 2021). Upon closer inspection of 9,489 Dutch conversations in the corpus (Labat et al., under review), we noticed that many of these Twitter conversations are too short to model emotion shifts, as most companies focus on redirecting their customers to private channels for complaint handling (see also Van Herck et al., 2020).

To address this issue, we have collected a novel resource of text-based dialogues in the domain of customer service by means of Wizard of Oz (WOZ) experiments. During WOZ experiments, participants are told they are talking to an automated conversational agent, while this agent is in fact a human operator. Building on insights from a pilot study with 16 participants (presented at CLIN31), we have now completed a full-scale study involving 179 participants, 73.7% of whom believed they were talking to an actual chatbot. In our setup, each participant had 12 conversations that were all grounded in an event linked to a commercial sector (e-commerce, telecommunication, tourism) and a sentiment trajectory (from negative/neutral to negative/neutral/positive). The operator utterances were simultaneously written and labelled with one of 14 possible categories. The participant utterances were annotated afterwards for emotions according to a categorical taxonomy of 10 emotion labels and an additional neutral category, as well as with valence-arousal-dominance scores (Mehrabian and Russell, 1974). We performed an inter-annotator agreement (IAA) study and obtained a Krippendorff’s α of 0.608 on the categorical framework, and Krippendorff’s α scores of 0.585, 0.495, and 0.314 on the valence, arousal, and dominance dimensions, respectively. We also collected profile data (age, gender, personality) on the participants with the goal of (i) modelling interpersonal emotional variance and (ii) performing stylometric experiments. In addition to describing our dataset, we aim to present the first results of our machine learning systems for (i) emotion detection and (ii) stylometric analysis by the time of the CLIN conference.
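For readers unfamiliar with the agreement metric, the sketch below illustrates how a Krippendorff’s α of the kind reported above can be computed for nominal (categorical) labels. This is our own minimal illustration, not the project’s annotation tooling; the emotion labels and two-annotator setup in the example are hypothetical.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha with the nominal difference function.

    `units` is a list of tuples, one tuple per annotated utterance,
    containing the labels the annotators assigned to that utterance.
    Returns 1.0 for perfect agreement, ~0.0 for chance-level agreement.
    """
    # Build the coincidence matrix o[(c, k)]: each ordered label pair
    # within a unit contributes 1 / (m - 1), m = #labels in the unit.
    o = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # units coded by fewer than 2 annotators add no pairs
        for c, k in permutations(labels, 2):
            o[(c, k)] += 1.0 / (m - 1)

    # Marginal totals n_c per label and grand total n.
    n_c = Counter()
    for (c, _k), v in o.items():
        n_c[c] += v
    n = sum(n_c.values())

    # Observed and expected disagreement (nominal metric: c != k).
    d_o = sum(v for (c, k), v in o.items() if c != k)
    d_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - d_o / d_e

# Hypothetical toy data: two annotators labelling four utterances.
labels = [("joy", "joy"), ("joy", "anger"), ("anger", "anger"), ("anger", "joy")]
alpha = krippendorff_alpha_nominal(labels)  # 0.125: half the pairs disagree
```

For the valence-arousal-dominance scores, the same coincidence-matrix computation applies with an interval difference function, (c − k)², in place of the nominal one.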