Bram M.A. van Dijk, Marco R. Spruit and Max J. van Duijn
The (semi-)spontaneous language of Dutch children is relatively well documented. Before the 2000s, several audio and text corpora were compiled in unstructured home settings that involved a small number of younger children. A decade later, more structured settings were used (e.g. with picture books) to target older children’s language production. More recently, representative large-scale corpora have appeared, e.g. BasiScript, which consists of written output from thousands of children aged 4-12 (Tellings et al. 2018), and JASMIN-CGN, which contains more than 15h of high-quality recordings of children’s speech (Cucchiarini and van Hamme 2013). Most of these corpora were compiled primarily for studying language acquisition and speech development.
Experimental work has suggested that in children’s development, linguistic and sociocognitive skills mutually reinforce each other. Still, there is currently no Dutch corpus that combines children’s spontaneous speech, linguistic metadata (e.g. lemma frequencies), measures of children’s sociocognitive skills (e.g. performance on Theory of Mind tests), and background information (e.g. parents’ educational level, siblings), which could fuel work on the intersection of language acquisition and developmental psychology. To fill this gap, we present ChiSCor (CHIldren’s Story CORpus), a new text (60k tokens) and audio (15h speech) corpus of >500 freely-told fantasy narratives collected from hundreds of children (4-12 years), enriched with the metadata mentioned above. We focus on narrative because in storytelling, children need to draw on both linguistic and cognitive skills (esp. Theory of Mind) to present a story that is coherent and interesting for an audience, and need to reason about what story characters feel, intend, think, etc.
We expect ChiSCor to be valuable not only for linguists and developmental psychologists, but also for researchers in natural language processing more widely. In our contribution, we first show that ChiSCor is a representative sample of Dutch child speech: its token distribution approximates a Zipfian distribution, implying that in this respect it is as ‘normal’ a language sample as BasiScript. Second, by comparing lemma part-of-speech distributions and building a frequency profile of the corpus, we explicate differences that follow from the two corpora’s domains (spoken ChiSCor vs. written BasiScript). Third, using word embeddings we show that our corpus carries meaningful information about children’s syntactic and semantic skills, since in vector space we can distinguish, for example, different word classes, mental from physical predicates, and intentional from non-intentional verbs. Fourth, we highlight potential applications of ChiSCor by reflecting on pilot work done with the corpus (Van Dijk and Van Duijn 2021) and by evaluating how well contemporary neural networks process spontaneous narratives on tasks such as POS-tagging and mapping speech to text.
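As an illustration of the first step, the check for a Zipfian token distribution can be sketched as follows. This is a minimal sketch, not the procedure used for ChiSCor: the function names `rank_frequency` and `zipf_slope` are hypothetical, tokenisation is assumed to have been done already, and the exponent is estimated with a plain least-squares fit in log-log space, which is a simple stand-in for more careful estimators. Under Zipf's law, token frequency falls off roughly as a power of frequency rank, so the fitted slope should lie near -1.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent type."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_slope(tokens):
    """Estimate the slope of log(frequency) against log(rank) by least squares.

    A slope close to -1 is what an idealised Zipfian sample would give.
    This simple fit is an assumption made for illustration only.
    """
    pairs = rank_frequency(tokens)
    if len(pairs) < 2:
        raise ValueError("need at least two distinct token types")
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Calling `zipf_slope` on the token list of each corpus and comparing the two slopes gives a rough indication of whether both samples behave alike in this respect; a rank-frequency plot on log-log axes would show the same information visually.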
Cucchiarini and van Hamme (2013). The JASMIN Speech Corpus: Recordings of Children, Non-natives and Elderly People. In Spyns et al. (eds.), Essential SLT for Dutch, pp. 43-59.
Tellings, Oostdijk, Monster, Grootjen, and Van Den Bosch (2018). BasiScript: A corpus of contemporary Dutch texts written by primary school children. Int. Journal Corpus Linguistics, pp. 494-508.
Van Dijk and Van Duijn (2021). Modelling Characters’ Mental Depth in Stories Told by Children Aged 4-10. Proceedings of the Cognitive Science Society, pp. 2384-2390.