Unsupervised Lemmatization Model

- 15 mins

The code for this unsupervised lemmatization model can be found here.

How frequent is each word in a language?

This is a basic question in language modeling. To answer it, we need to group together all the strings in our corpus that belong to the same word. This is called lemmatization. The difficult thing about lemmatization is that sometimes it isn’t obvious which lemma, or dictionary entry, a word in our text stems from. This is especially true in inflected languages like Latin and Ancient Greek.

One approach to solving this problem, implemented by Patrick Burns in his 2016 Google Summer of Code project for CLTK, involves supervised training data. ‘Supervised training data,’ in this case, means a lot of texts whose lemmatizations have been worked out by hand. Various machine learning techniques can pick up patterns in the training data, and then the computer can look for those patterns in fresh, unlemmatized texts. The presence of such patterns helps the computer make an educated guess in ambiguous cases. But what happens when we want to know the frequency of words and don’t have a training data set? This is the situation in ancient languages other than Greek and Latin, and it also comes up when one wants to scrutinize a subset of Latin, for example the Flavian poets, without relying on models trained on texts that were written centuries apart.

To build an unsupervised model of word frequency, I grouped together all the tokens in the text that might have come from the same lemma. Whenever a word could stem from a particular lemma, I counted that lemma as ‘seen’ in the text.

[Figure: the counting procedure applied to three phrases from the Aeneid; in the original image, correct lemmatizations are highlighted in green.]

Phrase #1, Aeneid 1.1 ("Arma virumque cano"): arma could stem from Arma, n. pl. ('weapon, arms') or from Armō, v. ('to arm, to equip'), so each lemma's count becomes 1.

Phrase #2, Aeneid 9.115 ("Neve armate manus"): armate could stem from Armō, v. (count: 2) or from Armatus, adj. ('armed person, soldier'; count: 1).

Phrase #3, Aeneid 12.107 ("saevos in armis"): armis could stem from Arma, n. pl. (count: 2) or from Armus, n. ('shoulder, flank (of an animal)'; count: 1).

After thousands of repetitions across the corpus, the final unsupervised counts are: Arma, n. pl.: 5,581; Armō, v.: 4,876; Armatus, adj.: 1,135; Armus, n.: 2,740.
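In code, the counting step can be sketched roughly as follows; the candidate table and function name are illustrative placeholders, since the real model draws its candidate lemmata from Latin morphological data rather than a hand-written dictionary:

```python
from collections import Counter

# Illustrative lookup table: maps an inflected form to every lemma it could
# stem from. In the real model this comes from Latin morphological data,
# not a tiny hand-written dictionary like this one.
CANDIDATES = {
    "arma":   ["Arma (n. pl.)", "Armo (v.)"],
    "armate": ["Armo (v.)", "Armatus (adj.)"],
    "armis":  ["Arma (n. pl.)", "Armus (n.)"],
}

def count_possible_lemmata(tokens, candidates):
    """Count a lemma as 'seen' every time a token *could* stem from it."""
    counts = Counter()
    for token in tokens:
        for lemma in candidates.get(token.lower(), []):
            counts[lemma] += 1
    return counts

# Toy run over the three Aeneid phrases from the figure above.
tokens = "Arma virumque cano Neve armate manus saevos in armis".split()
print(count_possible_lemmata(tokens, CANDIDATES))
# Counter({'Arma (n. pl.)': 2, 'Armo (v.)': 2, 'Armatus (adj.)': 1, 'Armus (n.)': 1})
```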

This works because different inflected forms of a lemma ‘overlap’ with different other lemmata. Put another way, each instance of a word is ‘right’ in the same way (its true lemma is always among the candidates and always gets counted) but ‘wrong’ in a different way (the spurious counts are scattered across different competing lemmata). Over millions of tokens, the wrong answers begin to cancel each other out, and the frequency counts in the model gradually come to reflect the true rate of appearance of each lemma.

How well does this work?

I trained this model on some 8 million words of Latin poetry and prose from the Tesserae Latin corpus. Then I used the model to create a lemmatizer that returns all possibilities for an ambiguous form, but weights the probability of each lemma according to its frequency in the model.

[Figure: the weighted lemmatizer applied to "Arma virumque cano"; in the original image, the correct lemmatization is highlighted in green.]

For the token arma, the model's counts are Arma, n. pl. ('weapon, arms'): 5,581 and Armō, v. ('to arm, to equip'): 4,876. Dividing each count by their sum gives the weights:

Arma, n. pl.: 5,581 / (5,581 + 4,876) ≈ 0.53
Armō, v.: 4,876 / (5,581 + 4,876) ≈ 0.47
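In code, that weighting step amounts to normalizing the model's counts over the candidate lemmata of the form in question. Here is a minimal sketch, again with placeholder names rather than the project's actual API:

```python
def lemma_probabilities(candidate_lemmata, counts):
    """Weight each candidate lemma by its frequency in the unsupervised model."""
    total = sum(counts.get(lemma, 0) for lemma in candidate_lemmata)
    if total == 0:
        # No frequency information at all: fall back to a uniform distribution.
        return {lemma: 1 / len(candidate_lemmata) for lemma in candidate_lemmata}
    return {lemma: counts.get(lemma, 0) / total for lemma in candidate_lemmata}

# Counts taken from the final unsupervised model shown in the figures above.
model_counts = {"Arma (n. pl.)": 5581, "Armo (v.)": 4876}
print(lemma_probabilities(["Arma (n. pl.)", "Armo (v.)"], model_counts))
# {'Arma (n. pl.)': 0.5337..., 'Armo (v.)': 0.4662...}
```

Choosing the candidate with the highest weight from a distribution like this is the decision rule evaluated below.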

When presented with a token that might stem from more than one lemma, I simply chose the most probable answer according to the model. I tested this system against ~17,000 tokens of Latin from the CLTK corpus whose correct lemmatizations have been entered by hand. The result was correct roughly 91% of the time. Let’s break down those results.

First, it should be said that 73% of the tokens in this data set are unambiguous: they can only come from one lemma, so it isn’t possible to make a mistake.

Second, if you guess randomly in ambiguous cases, you happen to be right about 30% of the time (presumably because each ambiguous case has, on average, 3 possible lemmatizations). So random selection gives roughly 81% accuracy overall: 73% + (27% × 30%) ≈ 81%.

Using the unsupervised frequency model, we end up guessing correctly in 66% of ambiguous cases (more than twice as often as random selection), for a total accuracy of ~91%. That’s strong enough to suggest that the frequency model seen here is good enough to extend to other NLP tasks or to combine with other unsupervised lemmatization techniques.
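Those percentages are consistent with one another; as a quick arithmetic check, using nothing beyond the figures quoted above:

```python
unambiguous = 0.73            # share of tokens with only one possible lemma
ambiguous = 1 - unambiguous   # share of tokens with several candidate lemmata

# Baseline: guess uniformly at random among each ambiguous token's candidates.
random_accuracy = unambiguous + ambiguous * 0.30   # ~0.81
# Frequency model: pick the candidate with the highest unsupervised count.
model_accuracy = unambiguous + ambiguous * 0.66    # ~0.91

print(f"random baseline: {random_accuracy:.1%}, frequency model: {model_accuracy:.1%}")
# random baseline: 81.1%, frequency model: 90.8%
```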

Observations

A few things about this model surprised me. First, if we build the counts from unambiguous forms only, the model doesn’t lead to a successful lemmatizer; in fact, its accuracy drops below what we see with random selection.

Second, this model was developed as part of a more complex lemmatizer that also looked at surrounding language. That model worked roughly 86% of the time, until I added a step to remove word-frequency information from it. That caused accuracy to drop below 80% (in other words, worse than random selection), and it became clear that nearby words were not as effective as word frequency in assigning probabilities to possible lemmata. Still, nearby language is a feature worth revisiting, and you can find the code for that project here.
