Computational approaches to study systematicity in word learning

Willeke van der Varst and Giovanni Cassani

Most traditional linguistic theories posit an arbitrary relation between the form of a word (how it sounds) and its lexical semantics, implying that the meaning of a word cannot be inferred from its form. However, recent studies suggest otherwise, showing that the relationship between the form and meaning of a word is likely not entirely arbitrary. Non-arbitrary form-meaning correspondences seem especially prevalent during the first few years of language acquisition, when they appear to help children acquire new words.

The current work investigates the relationship between word forms and lexical meanings during language acquisition, using computational tools from distributional semantics and adopting a zero-shot learning perspective to test whether systematicity facilitates subsequent word learning. Specifically, we use the Linear Discriminative Learning (LDL) model and the Form-Semantic Consistency (FSC) model. Both map representations of word form onto semantic space: LDL uses multivariate multiple regression, whereas FSC uses cross-modal analogies. Consequently, they can generate semantic representations for novel words by exploiting the statistical regularities in the form-meaning mapping observed in the vocabulary learned up to that point. This improves on previous approaches to the study of systematicity in language acquisition, which looked at correlations between form and meaning representations of known words and could not assess the degree of systematicity of a novel word with respect to a learner’s current vocabulary.
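As a rough illustration of the regression-based mapping, the sketch below fits a linear map from form vectors to semantic vectors by least squares and applies it to an unseen form vector. It is a minimal sketch under stated assumptions, not the LDL implementation used in the study: the function names are hypothetical, and scoring a novel word by the cosine between its predicted and observed semantic vectors is only one plausible way to quantify systematicity. The analogy-based FSC model is not covered here.

```python
import numpy as np

def fit_form_to_meaning(C, S):
    """Least-squares map F such that C @ F approximates S.

    C: (n_words, n_form_features) form matrix of the known vocabulary.
    S: (n_words, n_semantic_dims) semantic matrix of the same words.
    """
    F, *_ = np.linalg.lstsq(C, S, rcond=None)
    return F

def predict_meaning(c_new, F):
    """Project the form vector of a novel word into semantic space."""
    return c_new @ F

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0
```

Under this assumption, a novel word whose predicted vector lies close to its observed distributional vector would count as more systematic with respect to the reference vocabulary.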

The US and UK English parts of the CHILDES corpus were used to train the models. The corpus was split into nine bins, based on the age of the child in each transcript. Semantic representations were derived from each bin using word2vec spaces for the FSC model and Naïve Discriminative Learning (NDL) spaces for the LDL model, in line with previous studies. Word forms were encoded as Boolean vectors indicating which character trigrams occur in each word.
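The Boolean trigram encoding can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, and padding each word with a boundary symbol ("#") before extracting trigrams is an assumption that the description above does not confirm.

```python
import numpy as np

def char_trigrams(word, boundary="#"):
    """Set of character trigrams in a word (boundary padding is an assumption)."""
    padded = f"{boundary}{word}{boundary}"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def boolean_form_matrix(words):
    """Return (C, trigram_vocab), where C[i, j] is True if word i contains trigram j."""
    vocab = sorted(set().union(*(char_trigrams(w) for w in words)))
    index = {trigram: j for j, trigram in enumerate(vocab)}
    C = np.zeros((len(words), len(vocab)), dtype=bool)
    for i, word in enumerate(words):
        for trigram in char_trigrams(word):
            C[i, index[trigram]] = True
    return C, vocab
```

Each row of the resulting matrix is the form vector of one word, so words that share trigrams share active dimensions.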

The FSC and LDL models were used to derive a measure of form-meaning systematicity for words yet to be learned, taking the words produced by children at previous time points as the reference vocabulary. We then coded to-be-learned words according to the corpus bin in which they were first produced by a child and fitted a Cumulative Link Mixed Model in R to predict how many bins into the future a word would first be produced as a function of its frequency, semantic neighborhood density, and length. We then added each target measure of systematicity separately and measured the change in the Akaike Information Criterion (AIC).
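The comparison step relies on standard AIC arithmetic, which can be sketched as below. The models in the study were fitted in R; this Python fragment only shows how a change in AIC would be computed from the log-likelihoods and parameter counts of two fitted models, and both function names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def delta_aic(base_ll, base_k, extended_ll, extended_k):
    """AIC of the base model minus AIC of the extended model.

    A positive value means the extended model (here, the one with an added
    systematicity predictor) has the lower, i.e. better, AIC.
    """
    return aic(base_ll, base_k) - aic(extended_ll, extended_k)
```

Because each added predictor costs 2 AIC points, a drop in AIC means the gain in log-likelihood more than offsets the extra parameter.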

Preliminary results indicate that systematicity has a positive effect, with more systematic words being learned earlier. Including the FSC and LDL systematicity measures lowers the AIC by 32.5 and 18.9 points, respectively, relative to the base model. These results corroborate evidence for the non-arbitrariness of form-meaning relationships in the lexicon and for its beneficial role in word learning. Next, we will test random baselines to probe whether this is truly an effect of systematicity or an epiphenomenon of other, untested properties of the lexicon. We also plan to test further models, including FastText.