Google Summer of Code Results
Overview:
The Classical Language Toolkit (CLTK) is a suite of Natural Language Processing (NLP) tools used by humanists to study ancient texts. The original focus of the project was Latin and Greek, and the most sophisticated tools are available only for these languages. However, one of the main goals of the project is to support scholarship into cultures outside of the Western canon, particularly where NLP tools are not available.
What I Set Out to Do:
Most NLP techniques rely on a large corpus of training data. For example, ‘lemmatization,’ wherein all the inflected forms of a word are analyzed as a single item, is often done by training a lemmatizer on a large group of texts whose words have already been grouped by lemma. Machine learning techniques are applied to build a model, and the model is then used to lemmatize texts which have not been analyzed by hand.
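As a rough illustration of the supervised approach described above, here is a minimal sketch of a most-frequent-lemma baseline. It is not CLTK code: the function names and the toy training pairs are hypothetical, and a real lemmatizer would be trained on a large hand-annotated corpus with a more capable model.

```python
from collections import Counter, defaultdict


def train_lemmatizer(annotated_tokens):
    """Count how often each word-form was hand-tagged with each lemma,
    then keep the most frequent lemma for every form."""
    counts = defaultdict(Counter)
    for form, lemma in annotated_tokens:
        counts[form.lower()][lemma] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def lemmatize(model, tokens):
    """Look each token up in the trained model.
    Forms never seen in training are returned unchanged (lowercased)."""
    return [model.get(t.lower(), t.lower()) for t in tokens]


# Toy hand-annotated data (illustrative only, not taken from a real treebank).
training = [
    ("amat", "amo"),
    ("amant", "amo"),
    ("puellam", "puella"),
    ("puellae", "puella"),
    ("amat", "amo"),
]

model = train_lemmatizer(training)
result = lemmatize(model, ["Puellam", "amat", "rosas"])
print(result)
```

The last token, which never appeared in the training pairs, comes back unlemmatized. That is the gap that appears when hand-curated data is scarce.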
However, under-studied languages lack adequate training data to support this standard approach; there just aren’t enough hand-curated texts to train a model. Therefore, I set out to build an unsupervised model which can learn, for example, the correct lemmatization without ‘knowing the answers.’ The goal is to provide accuracy comparable to supervised approaches, without the need for hand-curated training data.
Lemmatization is a useful first step because the success of an unsupervised algorithm can be easily tested. However, my unsupervised model can be extended to create tools for, e.g., cross-language document alignment, where testing is more difficult.
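To make the unsupervised idea concrete, here is one possible sketch. It is not the model built during this project. It assumes a dictionary that lists every candidate lemma for each word-form, plus an unannotated corpus, and it uses a simple expectation-maximization loop over unigram lemma frequencies to choose among the candidates. All names and the toy data are hypothetical.

```python
from collections import Counter


def estimate_lemma_probs(corpus_tokens, candidates, iterations=10):
    """Estimate lemma probabilities from raw text, with no hand-tagged answers.

    Each occurrence of an ambiguous form is shared among its candidate
    lemmas in proportion to the current estimates, and the estimates are
    then recomputed from those fractional counts."""
    form_counts = Counter(t for t in corpus_tokens if t in candidates)
    lemmas = {l for form in form_counts for l in candidates[form]}
    # Start from a uniform guess over every lemma that could occur.
    probs = {l: 1.0 / len(lemmas) for l in lemmas}
    for _ in range(iterations):
        expected = Counter()
        for form, n in form_counts.items():
            total = sum(probs[l] for l in candidates[form])
            for l in candidates[form]:
                expected[l] += n * probs[l] / total
        grand_total = sum(expected.values())
        probs = {l: expected[l] / grand_total for l in lemmas}
    return probs


def lemmatize(token, candidates, probs):
    """Pick the most probable candidate lemma; unknown forms pass through."""
    options = candidates.get(token)
    if not options:
        return token
    return max(options, key=lambda l: probs.get(l, 0.0))


# Hypothetical candidate dictionary: "amor" is ambiguous between the noun
# "amor" and a passive form of the verb "amo".
candidates = {
    "amor": ["amor", "amo"],
    "amoris": ["amor"],
    "amorem": ["amor"],
    "amat": ["amo"],
}
corpus = ["amor", "amoris", "amorem", "amor", "amat"]

probs = estimate_lemma_probs(corpus, candidates)
choice = lemmatize("amor", candidates, probs)
print(choice)
```

In this toy corpus the unambiguous forms favor the noun, so the ambiguous form "amor" is resolved to the lemma "amor". Because lemmatization can be checked against known answers, the accuracy of a model like this is easy to measure.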
Steps Completed:
- Added lemmatizers for Greek and for Latin that provide all possible word groupings
- Built an unsupervised language model extensible to other NLP tasks such as translation or document alignment
- Used the language model to build an unsupervised Latin lemmatizer with > 90% accuracy
- Added tools to suggest synonyms for Latin and Greek words
- Added tools to suggest Latin translations for Greek words, and vice versa
Additional Service Completed:
- Coordinated the incorporation of CLTK into the core logic of the next generation of the open-source Tesserae Project
- Expanded the CLTK corpus of texts with aligned translations of Latin and Greek (useful for testing translation tools)
Steps Remaining:
- Build an unsupervised language model for Greek
- Use Greek model to build unsupervised lemmatizer for Greek
- Use Greek and Latin language models to assign probabilities to potential synonyms and translations
Where the Code Lives:
Already Pulled into CLTK:
- A dictionary of lemmas for Latin word-forms
- A dictionary of Greek translations for Latin lemmas
- A dictionary of Latin synonyms for Latin lemmas
- Dictionaries of lemmas, synonyms, and Latin translations for Greek words
- A tool for looking up all possible lemmas of a Greek or Latin word
Not Yet Pulled into CLTK:
- The unsupervised model and unsupervised lemmatization tool
- Experiments with various unsupervised language models
- The new corpus of plaintext, aligned Latin and Greek
- A corpus of 10 million words of Latin poetry and prose, used to train the unsupervised model
Documentation:
- How to use the lemma lookup tool
- The logic behind the unsupervised lemmatization model
- Initial GSoC blog post
Learning Experience:
- Improved Python coding skills
- Developed better habits for open-source contributions
- Learned how large coding projects are managed smoothly