Contact
TUM Campus Heilbronn gGmbH, Bildungscampus 2, 74076 Heilbronn, Germany, https://www.chn.tum.de/de
Contact person: Kerstin Besemer, +49 7131 26418501

When language models learn like babies

(PresseBox) (Heilbronn, Germany)
Whether in digital assistance systems, text summarization, or programming—wherever language needs to be processed efficiently, AI-supported large language models (LLMs) are used. But these supposed all-rounders have their weaknesses. One of them is that trillions of words are sometimes needed to train a model. This has significant disadvantages, ranging from high costs and enormous energy consumption to greater sensitivity to bias.

In addition, LLMs often fail at tasks that seem trivial to us humans, explains Dr. Lukas Edman, postdoc at Prof. Alexander Fraser's Chair of Data Analytics and Statistics at TUM Campus Heilbronn: “They have difficulty with long-term contexts. For example, if you talk to ChatGPT for a very long time, it often no longer understands what was said some time ago. They have problems with logical thinking—complex tasks have to be broken down into smaller steps. They even fail at very simple tasks: they often can't insert a specific letter in a particular place in a word, or they don't recognize that a sentence can be grammatically correct even though it doesn't make sense in terms of content.”

The young scientist is researching Masked Language Modeling (MLM) – a training method in which individual words in a sentence are masked, i.e., left out. The model is supposed to predict the missing words and thus learn to figure out the meaning from the context. MLM improves the general understanding of sentences and makes it possible to get by with significantly less training data than before. The biggest advantage, in Edman's view, is that “the method is very similar to human learning: when we listen to someone, our brain constantly tries to predict the next word. If our prediction is wrong, we have to adapt and learn from it. This is exactly how the training of the models works – and that makes it easy to implement.”
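To make the idea concrete, here is a minimal, illustrative Python sketch of how MLM training examples are typically constructed. The sentence, the mask rate, and the word-level masking are simplifying assumptions for illustration (real systems operate on subword tokens), not the specific setup used at the chair.

```python
import random

# Minimal sketch of how masked-language-modeling examples are built:
# randomly hide words and train the model to recover them from context.
MASK = "[MASK]"

def make_mlm_example(tokens, mask_prob=0.15, rng=random):
    """Replace each token with [MASK] with probability mask_prob."""
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)    # the model is trained to predict this word
        else:
            inputs.append(tok)
            targets.append(None)   # unmasked positions contribute no loss
    return inputs, targets

sentence = "I like to go shopping".split()
masked, labels = make_mlm_example(sentence, mask_prob=0.3, rng=random.Random(0))
print(masked)  # e.g. ['I', 'like', 'to', '[MASK]', 'shopping']
print(labels)  # e.g. [None, None, None, 'go', None]
```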

Refinement Through Selective Masking

But MLM also has disadvantages: with simple sentences, the model learns very quickly which word is missing. For example, if you leave out the word “to” in the sentence “I like to go shopping,” it fills in the gap correctly after just a few attempts. “If such a passage continues to be masked, it does not provide any new insights and costs unnecessary computing time,” says Edman. 

This is where Adaptive MLM comes in – a refinement of standard MLM in which the masked words are specifically selected. “First, we leave out randomly selected words. During training, we check whether the model predicts them correctly. We weight all correctly predicted words lower so that they are masked less frequently in the future. Instead, the training focuses on the difficult cases,” explains Edman. For example, versatile adjectives or adverbs are more difficult to predict than very common words such as “the” or “and”.
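A rough sketch of this adaptive weighting, again in illustrative Python: the class name, the 0.5 down-weighting factor, and the per-word (rather than per-occurrence) bookkeeping are assumptions for illustration, not the exact recipe described by Edman.

```python
import random
from collections import defaultdict

# Illustrative sketch of adaptive masking: words the model already predicts
# correctly are down-weighted and therefore masked less often later on.
# The 0.5 factor and per-word bookkeeping are assumptions, not the real recipe.

class AdaptiveMasker:
    def __init__(self, mask_prob=0.15, downweight=0.5, rng=None):
        self.mask_prob = mask_prob
        self.downweight = downweight
        self.weights = defaultdict(lambda: 1.0)  # every word starts equally likely
        self.rng = rng or random.Random(0)

    def mask(self, tokens):
        """Mask each token with a probability scaled by its current weight."""
        masked, positions = [], []
        for i, tok in enumerate(tokens):
            p = min(1.0, self.mask_prob * self.weights[tok])
            if self.rng.random() < p:
                masked.append("[MASK]")
                positions.append(i)
            else:
                masked.append(tok)
        return masked, positions

    def update(self, tokens, positions, predictions):
        """After a training step, down-weight the words the model got right."""
        for i in positions:
            if predictions.get(i) == tokens[i]:
                self.weights[tokens[i]] *= self.downweight

masker = AdaptiveMasker(mask_prob=0.3)
tokens = "I like to go shopping".split()
masked, pos = masker.mask(tokens)
# Pretend the model predicted every masked word correctly:
masker.update(tokens, pos, {i: tokens[i] for i in pos})
# Those words now carry a lower weight and are masked less often in later epochs.
```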

Recognizing Connections Without Large Data Sets

It is often helpful to break words down into tokens—smaller units—or even finer components called subtokens. By splitting the word “walking” into the tokens “walk” and “ing,” the model can recognize the connection between “walk” and “walking” without relying on extremely large amounts of training data. “In fact, there has been some progress here, especially in adjective nominalization – that is, when an adjective such as ‘laughable’ is converted into a noun such as ‘laughability’. We often work with invented adjectives such as ‘wuggable’, which the model is supposed to convert into the noun ‘wuggability’. This teaches it the rule that ‘able’ typically becomes ‘ability’ and not ‘ness,’ for example,” explains Edman. 
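The toy snippet below illustrates the idea of subword splitting; the hand-written suffix list is a stand-in for a real subword tokenizer such as BPE or WordPiece, not the tokenization actually used in the research.

```python
# Toy illustration of subword splitting (not a real BPE/WordPiece tokenizer):
# separating a stem token from a suffix token lets the model relate "walk" to
# "walking" and map an invented "wuggable" to "wuggability" by analogy.
SUFFIXES = ["ability", "able", "ing", "ness"]

def split_into_subtokens(word):
    """Greedily strip the first matching suffix; otherwise keep the word whole."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [word[:-len(suffix)], "##" + suffix]
    return [word]

print(split_into_subtokens("walking"))      # ['walk', '##ing']
print(split_into_subtokens("wuggable"))     # ['wugg', '##able']
print(split_into_subtokens("wuggability"))  # ['wugg', '##ability']
```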

The goal is to develop a model that can access all letters in every word: “We humans can do that. We normally ignore this information when reading. But when we see that something is misspelled, we notice it. Language models should be able to do this too.” To achieve this goal, adaptive training approaches need to be systematically investigated further: “For example, we could analyze how the model behaves on a large scale. To do this, we would use larger data sets and compare whether the advantages of adaptive MLM only come into play with smaller amounts of data.” Edman also wants to test the simultaneous masking of several words that are related: “This could help convey grammatical concepts even more effectively.”
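As a hedged sketch of what masking several related words at once could look like: the snippet below hides a short contiguous span instead of isolated positions. Choosing the span at random is an assumption for illustration; a real setup might instead select syntactically related words.

```python
import random

# Sketch of span masking: hide several neighbouring words together.
# Random span selection is an illustrative assumption, not the planned method.
def mask_span(tokens, span_len=2, rng=None):
    rng = rng or random.Random(0)
    start = rng.randrange(0, len(tokens) - span_len + 1)
    masked = list(tokens)
    for i in range(start, start + span_len):
        masked[i] = "[MASK]"
    return masked, list(range(start, start + span_len))

print(mask_span("she is walking to the store".split()))
```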

Opportunity for Stronger Cooperation

Last fall, Edman achieved a major success: At the Conference on Empirical Methods in Natural Language Processing (EMNLP) in Suzhou, China, a leading international conference in the field of empirical language processing and machine language understanding, he won first prize in the Baby Language Modeling (BabyLM) Challenge. BabyLM refers to a research approach that investigates how language models learn languages with very little training data – similar to a baby, which does not have an infinite amount of data at its disposal.

“The Challenge Award means a lot to me,” says Edman. “It helps to publicize my research and hopefully convinces other people that it is worth looking into this topic. At the same time, it offers the opportunity to collaborate with other expert researchers. This is particularly computationally intensive, so it helps enormously that we have found an efficient method.”
