Google Summer of Code Results


Overview:

The Classical Language Toolkit (CLTK) is a suite of Natural Language Processing (NLP) tools used by humanists to study ancient texts. The original focus of the project was Latin and Greek, and the most sophisticated tools are available only for these languages. However, one of the main goals of the project is to support scholarship into cultures outside of the Western canon, particularly where NLP tools are not available.

What I Set Out to Do:

Most NLP techniques rely on a large corpus of training data. For example, ‘lemmatization,’ in which all the inflected forms of a word are analyzed as a single item, is often done by training a lemmatizer on a large group of texts whose words have already been grouped by lemma. Machine learning techniques are applied to build a model, and the model is then used to lemmatize texts which have not been analyzed by hand.
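To make the supervised setup concrete, here is a minimal sketch in Python. It is not CLTK code: the function names and the toy Latin pairs are hypothetical, and a simple frequency lookup table stands in for the machine learning model a real lemmatizer would train.

```python
from collections import Counter, defaultdict


def train_lemmatizer(pairs):
    """Build a lookup from each inflected form to its most frequent lemma.

    `pairs` is an iterable of (form, lemma) tuples taken from
    hand-annotated text. This is a stand-in for a trained model.
    """
    counts = defaultdict(Counter)
    for form, lemma in pairs:
        counts[form.lower()][lemma] += 1
    # Keep only the most common lemma seen for each form.
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}


def lemmatize(model, tokens):
    """Map each token to its learned lemma; unseen tokens pass through unchanged."""
    return [model.get(token.lower(), token) for token in tokens]


# Toy hand-annotated training data (hypothetical, for illustration only).
training_pairs = [
    ("amat", "amo"), ("amant", "amo"), ("amavit", "amo"),
    ("puellae", "puella"), ("puellam", "puella"),
]

model = train_lemmatizer(training_pairs)
result = lemmatize(model, ["Amat", "puellam", "rosas"])
print(result)
```

Note that "rosas" comes back unchanged because it never appeared in the training pairs. That is the coverage problem in miniature: a supervised model can only be as good as the annotated data behind it.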

However, under-studied languages lack adequate training data to support this standard approach; there simply aren’t enough hand-curated texts to train a model. Therefore, I set out to build an unsupervised model which can learn, for example, the correct lemmatization without ‘knowing the answers.’ The goal is to provide accuracy comparable to supervised approaches, without the need for hand-curated training data.

Lemmatization is a first step, and it is useful because the success of an unsupervised algorithm can be easily tested. However, my unsupervised model can be extended to create tools for, e.g., cross-language document alignment, where testing is more difficult.
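One way such a test might look is a plain token-level accuracy score against a small set of hand-checked lemmas. The sketch below assumes that setup; the function name and the sample data are hypothetical and do not describe the evaluation actually used in the project.

```python
def lemmatization_accuracy(predicted, gold):
    """Return the fraction of tokens whose predicted lemma matches the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical output of a lemmatizer next to hand-checked answers.
predicted_lemmas = ["amo", "puella", "rosas"]
gold_lemmas = ["amo", "puella", "rosa"]

score = lemmatization_accuracy(predicted_lemmas, gold_lemmas)
print(f"accuracy: {score:.2f}")
```

A task like cross-language document alignment has no equally simple per-token answer key, which is part of why it is harder to test.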

Steps Completed:

Additional Service Completed:

Steps Remaining:

Where the code lives:

Already Pulled into CLTK:

Not Yet Pulled into CLTK:

Documentation:

Learning Experience:
