Collected Works: corpora for philosophers

Martin Reynaert, Arianna Betti and Jesse de Does

We aim to facilitate the application of corpus-based methods to the study of the history of ideas by providing a user-friendly environment for building and deploying multilingual philosophical text corpora from different periods. We draw primarily on available digital collections of philosophers’ works and on existing CLARIN infrastructure to give Digital Humanities researchers access, ideally, to an author’s complete works.

To this end, we are gradually broadening the scope of the extensive toolkit for FoLiA XML to other languages and periods. This allows for uniform processing and linguistic enrichment of the collected works. The INT tool Autosearch allows third parties to upload and index their own corpora through the BlackLab back-end. We use it to make selected corpora available to invited researchers in the WhiteLab interface, originally developed for the online version of the Dutch reference corpus SoNaR and now in its third major incarnation.

Our starting point is to make the legacy texts queryable in the Autosearch environment through their modern lemmata and linguistic annotation. WhiteLab offers four levels of text querying, from the simplest ngram searches to the most complex queries based on CQL syntax. These let philosophers delve into the texts immediately, relying on a standard interface that is easy to learn and affords a rich interplay between metadata and textual content, rather than first having to deal with the intricacies of attaining data literacy, which are not their core business.

In a CLARIAH PLUS use case project we are extending the functionality of the system to allow for concept-focused searches. We integrate the mega-querying offered by the novel HitPaRank algorithm. Rather than working on the plain text in FoLiA paragraphs, or on its POS-tagged and modernized, lemmatized version, HitPaRank will leverage the indexing provided by Autosearch to identify all paragraphs containing hits for any of the weighted terms in user-defined lists, and to rank them according to user-defined criteria of relevance to the research question. With the full texts and ngram querying, ranking and grouping functionality at their fingertips, experts will first be able to compile data-driven lists of terms relevant to their particular research question, and then use these lists to extract and rank paragraphs from the corpus. Relevance-ranked paragraphs can then be exported to a new database add-on connected to the user interface, which allows the experts to focus directly on, and grade or annotate, those paragraphs likely to be most relevant to the research at hand.
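The paragraph-ranking idea can be illustrated with a minimal sketch. It is not the HitPaRank implementation: the abstract does not specify the scoring, so the score used here (term weight times term frequency, summed per paragraph) is an assumption, and the sketch works on in-memory lemma lists rather than on the Autosearch index. The names `Paragraph`, `rank_paragraphs` and `min_score` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Paragraph:
    """A paragraph reduced to its identifier and its modern lemmata (hypothetical type)."""
    par_id: str
    lemmas: List[str]


def rank_paragraphs(
    paragraphs: List[Paragraph],
    weighted_terms: Dict[str, float],
    min_score: float = 0.0,
) -> List[Tuple[str, float, Dict[str, int]]]:
    """Return (paragraph id, score, hits per term), best-scoring paragraphs first.

    Assumed scoring: sum over matched terms of weight * frequency.
    Paragraphs without any hit are left out.
    """
    ranked = []
    for par in paragraphs:
        # Count how often each listed term occurs in this paragraph.
        hits: Dict[str, int] = {}
        for lemma in par.lemmas:
            if lemma in weighted_terms:
                hits[lemma] = hits.get(lemma, 0) + 1
        if not hits:
            continue
        score = sum(weighted_terms[term] * count for term, count in hits.items())
        if score >= min_score:
            ranked.append((par.par_id, score, hits))
    # Highest score first; ties are broken by paragraph id for stable output.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Toy data: a user-defined weighted term list and three lemmatized paragraphs.
weighted_terms = {"concept": 2.0, "science": 1.0, "method": 1.5}
paragraphs = [
    Paragraph("p1", ["the", "concept", "of", "science", "be", "a", "concept"]),
    Paragraph("p2", ["science", "and", "method"]),
    Paragraph("p3", ["nothing", "relevant", "here"]),
]

for par_id, score, hits in rank_paragraphs(paragraphs, weighted_terms):
    print(par_id, score, hits)
```

In this toy run the paragraph with two occurrences of the highest-weighted term is ranked first, and the paragraph without hits is dropped. Other relevance criteria, such as requiring hits from several term lists, could replace the scoring line without changing the overall shape.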

Targeted corpora range from Classical Greek and (Neo-)Latin works and the large German legacy offerings by the Deutsches Textarchiv and other major digital publishers to the works of specific authors: the 18th-century Christian Wolff, whose works afford a comparable corpus in German and Latin, and Immanuel Kant; the 19th-century Bolzano, Hegel, Marx and Engels; and the 20th-century Willard Van Orman Quine and Bertrand Russell.

That we can now put all these ‘Collected Works’ online at the ready disposal of Digital Humanities researchers should rightly be seen as a culmination and celebration of the many years of Computational Linguistics in the Netherlands.