Combining domain-adapted BERT models and statistical filtering methods for unsupervised keyword extraction from tenders

Jens Van Nooten, Walter Daelemans and Andriy Kosar

In this paper, we propose a new method for keyword extraction from Dutch tender documents (documents that describe offers for the delivery of goods or services). Access to good keywords and key phrases for tender categories is important for users looking for relevant tenders, especially because the keywords in use sometimes change over time. Each document is annotated with one or more industry-specific codes that denote the subject of the tender.

The proposed approach embeds keywords and cluster texts in the same space. The embeddings are transformer-based and fine-tuned for semantic similarity, which allows the most representative keywords for a group of related tenders to be selected based on distance. The BERT model used to create the embeddings was domain-adapted on a dataset of tender-related text and fine-tuned on a Dutch translation of the Stanford Natural Language Inference (SNLI) corpus. To generate keywords, statistical methods such as IDF and word-frequency-based post-filtering were used in conjunction with a predefined grammar filter. The resulting keyword sets were manually evaluated by domain experts and implemented in a keyword recommendation system.
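
The two stages described above (statistical filtering of candidates, then distance-based selection in a shared embedding space) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the thresholds `min_idf` and `min_count`, and the toy data are all hypothetical, the grammar filter is omitted, and `toy_embed` is a bag-of-words stand-in for the domain-adapted BERT model.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Inverse document frequency of each token over tokenized documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n / df[tok]) for tok in df}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def rank_keywords(cluster_docs, all_docs, embed,
                  min_idf=0.1, min_count=2, top_k=5):
    """Filter candidates by IDF and frequency, then rank them by similarity
    to the mean embedding of the cluster texts (thresholds are assumptions)."""
    idf = idf_scores(all_docs)
    counts = Counter(tok for doc in cluster_docs for tok in doc)
    candidates = [t for t, c in counts.items()
                  if c >= min_count and idf.get(t, 0.0) >= min_idf]
    centroid = mean_vector([embed(" ".join(doc)) for doc in cluster_docs])
    scored = [(t, cosine(embed(t), centroid)) for t in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy data: one cluster of road-works tenders plus two unrelated tenders.
cluster_docs = [["road", "asphalt", "repair", "the"],
                ["asphalt", "road", "the", "works"]]
other_docs = [["software", "license", "the"],
              ["catering", "the", "services"]]
all_docs = cluster_docs + other_docs

# Placeholder embedding: word counts over a fixed vocabulary. A real system
# would call a sentence-embedding model here instead.
vocab = sorted({tok for doc in all_docs for tok in doc})


def toy_embed(text):
    words = text.split()
    return [float(words.count(w)) for w in vocab]


keywords = rank_keywords(cluster_docs, all_docs, toy_embed)
print(keywords)
```

In this toy setting the IDF threshold removes the word that occurs in every document, and the frequency threshold removes words that occur only once in the cluster; the remaining candidates are ordered by their cosine similarity to the cluster centroid.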