Khazar Khorrami: Artificial neural networks can shed light on the mysteries behind language acquisition in infants

Tampere University
Location: Korkeakoulunkatu 1, Tampere
Hervanta Campus, Tietotalo, auditorium TB109 and remote connection
Date: 18.10.2024 12.00–16.00 (UTC+3)
Language: English
Entrance fee: Free of charge
Photo: Nahid Sheikhipour
In her doctoral dissertation, M.Sc. (Tech) Khazar Khorrami explored the development of early language perception skills using computational modeling. In her research, she applied self-supervised artificial neural networks as models of infant language learners, simulating and interpreting the emergence of language capabilities in these models when they are exposed to speech or audiovisual input.

Infants understand speech long before they start speaking their first words. By the age of 1 year, most infants comprehend the meanings of many common words. How infants acquire their early language skills has remained a mystery to the public and a crucial research question for scientists.

Computational simulations provide powerful tools for studying complex phenomena. In the context of early language acquisition, computational models enable researchers to test theories and evaluate how different learning scenarios, algorithms, and input structures shape learning outcomes. Furthermore, understanding the complexities of human learning processes contributes to advancing human-like artificial intelligence systems.

"The methodological core of this research involves studying the emergence of different linguistic skills in an artificial neural network learner model exposed to speech and/or audiovisual data. To replicate a realistic learning scenario for infants, self-supervised training algorithms were applied to unlabeled raw speech signals as well as images of everyday scenes paired with spoken utterances describing these scenes and events. After training, the models were evaluated to determine whether they could recognize phonetic and word forms and map speech to corresponding visual objects and contexts," says M.Sc. Khazar Khorrami.

Predictive processing of sensory audiovisual input

Some theories posit that the environment is the primary driver of language acquisition, suggesting that early language skills develop through statistical processing of environmental data. For instance, infants might identify repeating speech patterns or associate speech sounds with visual objects.

However, other theories argue that statistical learning alone is insufficient for language development, particularly due to infants' limited cognitive capacities and the sparse and highly ambiguous input they receive during the first year of life. These theories propose that language acquisition is an inherent human ability, with the environment serving merely as a trigger to activate this pre-existing capacity.

"The thesis tests the feasibility of statistical learning as a mechanism for early language perception development using self-supervised algorithms and neural network models trained on realistic-scale data. Evaluated on phoneme recognition, word form recognition, and word meaning understanding, the results indicate that certain early language skills can emerge from statistical learning using limited auditory and audiovisual inputs comparable to what infants receive during their first year of life. Moreover, the vocabulary growth trajectory of the model is consistent with real-world data collected through parental reports," Khorrami explains. 

"Consequently, the findings serve as evidence supporting the predictive processing theory of language acquisition," she adds.

Do infants need to recognize phonemes before understanding words?

According to bottom-up theories of language learning, language is organized hierarchically in the learner’s mind: first, phoneme patterns are identified, which then allow segmentation of continuous speech into meaningful units such as words and syllables. At the final stage, these word forms are mapped to their referential meanings in the external world.

In contrast, some top-down theories challenge the hierarchical notion, suggesting that language acquisition may start with meaning-oriented mechanisms. In this view, learners interpret incoming sensory data, such as speech input, to understand and interact with the world. Knowledge of linguistic structures, such as phonemes and words, then emerges as a by-product of this meaning-centric processing.

"The thesis investigates the developmental timeline of different levels of language perception skills within the context of bottom-up versus top-down theories. Through a series of studies, it demonstrates that speech segmentation, alignment of words with visual objects, and phonemic knowledge can emerge as a by-product of meaning-oriented learning of audiovisual patterns," Khorrami says.

"Nonetheless, we found that the emergence of linguistic patterns consistently follows a hierarchical order: phoneme organization occurs first, followed by lexical and semantic understanding at later stages," she adds.

Public defence on Friday 18 October 

The doctoral dissertation of M.Sc. (Tech) Khazar Khorrami in the field of Computing and Electrical Engineering, titled "Computational modeling of early language acquisition with multimodal neural networks", will be publicly examined at the Faculty of Information Technology and Communication Sciences of Tampere University at 12:00 on Friday, 18 October 2024, at Hervanta Campus, Tietotalo building (Korkeakoulunkatu 1), auditorium TB109. The Opponents will be Associate Professor Afra Alishahi from Tilburg University and Professor Martti Vainio from the University of Helsinki. The Custos will be Associate Professor Okko Räsänen from Tampere University.

The doctoral dissertation is available online
The public defence can be followed via remote connection