A novel textual data augmentation method for identifying comparative text from user-generated content

Na Wei and Shenghui Wang

Mining user-generated content on e-commerce platforms and social media offers a timelier and more objective source of competitive intelligence than other information channels. Identifying comparative text within large volumes of non-comparative text is an important but challenging task. On the one hand, existing methods are time-consuming and do not generalize across domains. On the other hand, the datasets available for the task generally suffer from severe class imbalance. To address these problems, we propose a framework that adopts advanced deep learning methods to learn features automatically, together with a novel textual data augmentation method, named TA3S, to deal with the data imbalance. Specifically, TA3S simultaneously considers the syntactic structure and the semantic information of comparative text samples. Moreover, to support the implementation of TA3S, we develop a novel method based on word embeddings and a label propagation algorithm to distinguish between synonymous and antonymous substitute words. Experiments on two real-world datasets demonstrate the feasibility and effectiveness of our framework and show that it outperforms state-of-the-art methods in identifying comparative text from user-generated content.