Scientists at New York University have developed a machine learning model that mimics the way children learn language. Using video and audio recordings from a young child’s perspective, the model successfully learned to match words with corresponding images. Its accuracy was nearly twice that of a much larger model, suggesting we are closer to understanding how children begin to understand and use language. Until now, we’ve been dealing with unconfirmed but plausible theories.
“There are many different theories on how children acquire language. On one hand, we have the theory of natural language acquisition – many researchers claim that children are born with special elements or knowledge built into the brain, which uniquely allows us to learn language. On the other hand, there is the theory based on nurturing, according to which children learn language mainly based on sensory experiences. It’s not about innate abilities, but everyday experiences that allow us to, for example, learn English,” explains Wai Keen Vong of the Data Analytics Center at New York University, in a conversation with Newseria Innovation News Agency.
In their work, the scientists relied primarily on the latter theory. In developing the model, they wanted to better understand the process of early language acquisition.
The data used for the study was collected with a head-mounted camera worn by a single child. It recorded video and audio from the child’s perspective between the ages of 6 and 25 months. The dataset comprised 600,000 video frames paired with 37,500 transcribed utterances. It was crucial that the study be conducted not in laboratory conditions but in as natural an environment as possible.
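To make the pairing of frames and transcripts concrete, here is a minimal sketch of how video frames might be matched to transcribed utterances by timestamp. The `Utterance` structure, field names, and the containment-based alignment rule are illustrative assumptions, not the study’s actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str     # transcribed speech heard by the child
    start: float  # utterance start time, in seconds
    end: float    # utterance end time, in seconds

def pair_frames_with_utterances(frame_times, utterances):
    """Toy alignment: attach each video frame to the transcribed
    utterance (if any) whose time span contains it. Frames that fall
    outside every utterance are simply skipped."""
    pairs = []
    for t in frame_times:
        for u in utterances:
            if u.start <= t <= u.end:
                pairs.append((t, u.text))
                break
    return pairs
```

Pairs produced this way would serve as the (image, text) training examples for a multimodal model.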
“This is the largest model of its kind yet developed. Existing datasets contain recordings of people talking to their children, but none covered such a long period and included video materials showing what the child sees at any given moment. Our data provides a unique perspective on the issue of language acquisition. Moreover, we combined this data with a multimodal neural network. The model is quite general in terms of how it learns. Data comes in, and simple update rules guide the learning process. The fact that learning occurs offers a new insight into early language acquisition, one that differs from previous approaches to the question. Much more emphasis is put on learning and what can be achieved through it,” emphasizes Wai Keen Vong.
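A common way to train such a multimodal network on paired (image, text) data is a contrastive objective: embeddings of frames and utterances recorded at the same moment are pulled together, while mismatched pairs are pushed apart. The sketch below shows a CLIP-style symmetric contrastive loss in plain Python; the embeddings, the temperature value, and the batch layout are illustrative assumptions, not the study’s exact formulation.

```python
import math

def _normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch: row i of image_embs
    and row i of text_embs are assumed to come from the same moment
    and form the 'true' pair; all other rows are negatives."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # Similarity matrix: cosine similarity scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature
               for t in txts] for i in imgs]

    def xent(rows):
        # Cross-entropy with the diagonal (true pairing) as target.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # transpose: text -> image
    return 0.5 * (xent(logits) + xent(cols))
```

In practice the loss is low when each frame is most similar to its own utterance, and high when the pairings are scrambled, which is what drives the simple update rules the researcher describes.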
The model’s performance was evaluated on categorization, which involved assigning words to corresponding visualizations. The classification accuracy was 61.6%. For comparison, tests were conducted with CLIP, a contrastive image-and-text neural network trained on a much larger dataset; its classification accuracy was only 34.7%.
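The categorization test can be sketched as nearest-neighbor classification in the shared embedding space: each image is assigned the vocabulary word whose embedding it is most similar to, and accuracy is the fraction of correct assignments. The toy vocabulary and embeddings below are invented for illustration.

```python
import math

def _cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify_images(image_embs, vocab):
    """vocab maps each word to its text embedding. Each image gets
    the word whose embedding is most cosine-similar to it."""
    words = list(vocab)
    preds = []
    for img in image_embs:
        sims = [_cosine(img, vocab[w]) for w in words]
        preds.append(words[sims.index(max(sims))])
    return preds

def accuracy(preds, truth):
    """Fraction of predictions matching the ground-truth labels."""
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)
```

Running this over a labeled evaluation set yields a single accuracy figure of the kind reported above (61.6% for the child-data model versus 34.7% for CLIP).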
“For the first time, we have confirmed that our model can learn these words based on children’s real experiences. This ability to learn validates the idea that this may be one way children acquire such concepts. We can’t state it categorically, but it’s very suggestive and an interesting direction for further research,” the researcher says.
According to Precedence Research, the global machine learning market reached over $38 billion in revenue in 2022. By 2032, it is projected to exceed $771 billion.