<h1 id="unsupervised-lemmatization-model">Unsupervised Lemmatization Model</h1>
<p>James Gawley, 2018-08-14. <a href="https://jamesgawley.github.io/Unsupervised-Lemmatization-Model">https://jamesgawley.github.io/Unsupervised-Lemmatization-Model</a></p>
<p><em><a href="https://github.com/jamesgawley/cltk/blob/master/cltk/lemmatize/latin/unsupervised.py">The code for this unsupervised lemmatization model can be found here.</a></em></p>
<h2 id="how-frequent-is-each-word-in-a-language">How frequent is each word in a language?</h2>
<p>This is a basic question in language modeling. To answer it, we need to group together all the strings in our corpus that belong to the same word. This is called lemmatization. The difficulty is that it is not always obvious which lemma, or dictionary entry, a word in our text stems from. This is especially true in inflected languages like Latin and Ancient Greek.</p>
<p>One approach to solving this problem, implemented by <a href="https://disiectamembra.wordpress.com/2016/08/23/wrapping-up-google-summer-of-code/">Patrick Burns in his 2016 Google Summer of Code project for CLTK</a>, involves supervised training data. ‘Supervised training data,’ in this case, means a large body of texts that have been lemmatized by hand. Various machine learning techniques can pick up patterns in the training data, and then the computer can look for those patterns in fresh, unlemmatized texts. The presence of such patterns helps the computer make an educated guess in ambiguous cases. But what happens when we want to know the frequency of words and we have no training data set? This is the situation in ancient languages other than Greek and Latin, and it also comes up when one wants to scrutinize a subset of Latin, for example the Flavian poets, without relying on models trained on texts that were written centuries apart.</p>
<p>To build an unsupervised model of word frequency, I grouped together all the tokens in the text that might have come from the same lemma. Whenever a word <em>could</em> stem from a particular lemma, I counted that lemma as ‘seen’ in the text.</p>
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="839px" version="1.1" viewBox="0 0 839 451" style="cursor:pointer;max-width:100%;max-height:451px;"><defs /><g transform="translate(0.5,0.5)"><rect x="626" y="292.5" width="170" height="115" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="12.5" y="292.5" width="175" height="115" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="305" y="292.5" width="180" height="110" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="130" y="160" width="120" height="60" fill="#ffffff" stroke="#000000" 
pointer-events="none" /><rect x="30" y="60" width="140" height="40" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(44.5,73.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="110" height="12" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 112px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Arma virumque cano</div></div></foreignObject><text x="55" y="12" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">Arma virumque cano</text></switch></g><path d="M 48 90 L 70 90" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 60 90 L 60 148.63" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 60 153.88 L 56.5 146.88 L 60 148.63 L 63.5 146.88 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 60 120 L 180 120 Q 190 120 190 130 L 190 148.63" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 190 153.88 L 186.5 146.88 L 190 148.63 L 193.5 146.88 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><rect x="0" y="160" width="120" height="60" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="0" y="155" width="120" height="70" fill="#f8cecc" stroke="#b85450" pointer-events="none" /><g transform="translate(1.5,169.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="116" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: 
Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 116px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Armō, <i>v</i>. : 'to arm, to equip'<br />COUNT: 1<br /></div></div></foreignObject><text x="58" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><rect x="130" y="155" width="120" height="70" fill="#d5e8d4" stroke="#82b366" pointer-events="none" /><g transform="translate(131.5,169.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="116" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 116px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Arma, <i>n pl.</i>.: weapon, arms (pl.)<br />COUNT: 1<br /></div></div></foreignObject><text x="58" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><rect x="407" y="160" width="120" height="60" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="307" y="60" width="140" height="40" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(322.5,73.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="108" height="12" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 110px; white-space: nowrap; 
word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Neve armate manus</div></div></foreignObject><text x="54" y="12" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">Neve armate manus</text></switch></g><path d="M 365 90 L 385 90 Q 395 90 385 90 L 355 90" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><rect x="277" y="160" width="120" height="60" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="277" y="155" width="120" height="70" fill="#d5e8d4" stroke="#82b366" pointer-events="none" /><g transform="translate(278.5,169.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="116" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 116px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Armō, <i>v</i>. 
: 'to arm, to equip'<br />COUNT: 2<br /></div></div></foreignObject><text x="58" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><rect x="407" y="155" width="120" height="70" fill="#f8cecc" stroke="#b85450" pointer-events="none" /><g transform="translate(408.5,169.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="116" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 116px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Armatus, <i>adj</i>.: armed person, soldier.<br />COUNT: 1<br /></div></div></foreignObject><text x="58" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><path d="M 374 125 L 374 90" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 337 148.63 L 337 135 Q 337 125 347 125 L 357 125 Q 367 125 377 125 L 447 125 Q 457 125 457 135 L 457 148.63" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 337 153.88 L 333.5 146.88 L 337 148.63 L 340.5 146.88 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 457 153.88 L 453.5 146.88 L 457 148.63 L 460.5 146.88 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><rect x="684" y="160" width="120" height="60" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="584" y="60" width="140" height="40" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(611.5,73.5)"><switch><foreignObject style="overflow:visible;" 
pointer-events="all" width="84" height="12" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 84px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">saevos in armis</div></div></foreignObject><text x="42" y="12" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">saevos in armis</text></switch></g><path d="M 667 90 L 667 90 L 689 90 Q 699 90 689 90 L 667 90" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><rect x="554" y="160" width="120" height="60" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="554" y="155" width="120" height="70" fill="#f8cecc" stroke="#b85450" pointer-events="none" /><g transform="translate(555.5,169.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="116" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 116px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Armus, <i>n.</i>: 'shoulder, flank (of animal)'<br />COUNT: 1<br /></div></div></foreignObject><text x="58" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><rect x="684" y="155" width="120" height="70" fill="#d5e8d4" stroke="#82b366" pointer-events="none" /><g transform="translate(685.5,169.5)"><switch><foreignObject style="overflow:visible;" 
pointer-events="all" width="116" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 116px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Arma, <i>n</i>. pl.: weapon, arms (pl.)<br />COUNT: 2<br /></div></div></foreignObject><text x="58" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><path d="M 680 125 L 680 90" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 614 148.63 L 614 135 Q 614 125 624 125 L 634 125 Q 644 125 654 125 L 724 125 Q 734 125 734 135 L 734 148.63" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 614 153.88 L 610.5 146.88 L 614 148.63 L 617.5 146.88 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 734 153.88 L 730.5 146.88 L 734 148.63 L 737.5 146.88 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><rect x="60" y="5" width="80" height="40" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(70.5,11.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="58" height="26" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 60px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Phrase #1:<br /><i>Aeneid </i>1.1<br 
/></div></div></foreignObject><text x="29" y="19" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><rect x="337" y="5" width="80" height="40" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(341.5,11.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="70" height="26" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 70px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Phrase #2:<br /><i>Aeneid </i>9.115<br /></div></div></foreignObject><text x="35" y="19" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><rect x="614" y="5" width="95" height="40" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(620.5,11.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="78" height="26" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 78px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Phrase #3:<br /><i>Aeneid </i>12.107<br /></div></div></foreignObject><text x="39" y="19" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><path d="M 709 25 L 709 25" fill="none" stroke="#000000" stroke-miterlimit="10" 
pointer-events="none" /><path d="M 709 25 L 709 25 L 709 25 L 709 25 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><g transform="translate(305.5,334.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="178" height="26" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 178px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">THOUSANDS OF REPETITIONS<br /></div></div></foreignObject><text x="89" y="19" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">THOUSANDS OF REPETITIONS<br></text></switch></g><g transform="translate(52.5,308.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="95" height="82" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 95px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">UNSUPERVISED COUNT:<br />Arma, <i>n. 
pl.</i>: 2<br />Armo,<i>v</i>.: 2<br />Armatus,<i>adj</i>.: 1<br />Armus,<i>n.</i>: 1</div></div></foreignObject><text x="48" y="47" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><path d="M 120 345 L 120 345" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 120 345 L 120 345 L 120 345 L 120 345 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 140 25 L 330.63 25" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 335.88 25 L 328.88 28.5 L 330.63 25 L 328.88 21.5 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 417 25 L 603.63 25" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 608.88 25 L 601.88 28.5 L 603.63 25 L 601.88 21.5 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 710 25 L 820 25 Q 830 25 830 35 L 830 230 Q 830 240 820 240 L 110 240 Q 100 240 100 250 L 100 283.63" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 100 288.88 L 96.5 281.88 L 100 283.63 L 103.5 281.88 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 188 348 L 298.63 348" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 303.88 348 L 296.88 351.5 L 298.63 348 L 296.88 344.5 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 485 348 L 620.63 349.91" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 625.88 349.98 L 618.83 353.39 L 620.63 349.91 L 618.93 346.39 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><g transform="translate(636.5,308.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="144" height="82" 
requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 144px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">UNSUPERVISED COUNT <br />(FINAL):<br />Arma, <i>n. pl.</i>: 5,581<br />Armo,<i>v</i>.: 4,876<br />Armatus,<i>adj</i>.: 1135<br />Armus,<i>n.</i>: 2740<br /></div></div></foreignObject><text x="72" y="47" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><g transform="translate(265.5,433.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="258" height="12" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 260px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Correct Lemmatizations are highlighted in green.</div></div></foreignObject><text x="129" y="12" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">Correct Lemmatizations are highlighted in green.</text></switch></g></g></svg>
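<p>This works because different inflected forms of a lemma ‘overlap’ with different other lemmata. Every ambiguous token credits its true lemma, but the spurious lemma it also credits changes from one inflected form to the next: each instance is ‘right’ in the same way, but ‘wrong’ in a different way. Over millions of tokens, the correct counts accumulate on a single lemma while the wrong answers are spread across many, so they begin to cancel each other out and the frequency counts in the model gradually begin to reflect the true rate of appearance of each lemma.</p>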
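<p>The counting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the linked CLTK module: the <code>CANDIDATES</code> table and the function name <code>count_possible_lemmata</code> are hypothetical, and the table holds only the three forms from the example phrases.</p>

```python
from collections import Counter

# Hypothetical lookup table (an assumption for this sketch): each surface
# form maps to every lemma it could stem from.
CANDIDATES = {
    "arma": ["arma", "armo"],
    "armate": ["armo", "armatus"],
    "armis": ["arma", "armus"],
}


def count_possible_lemmata(tokens, candidates):
    """Credit every lemma that a token could stem from."""
    counts = Counter()
    for token in tokens:
        # An ambiguous token adds one to each of its candidate lemmata;
        # a token missing from the table adds nothing.
        for lemma in candidates.get(token.lower(), []):
            counts[lemma] += 1
    return counts


# The ambiguous forms from the three example phrases.
tokens = ["arma", "armate", "armis"]
counts = count_possible_lemmata(tokens, CANDIDATES)
```

<p>After these three tokens, <code>arma</code> and <code>armo</code> have each been credited twice, and <code>armatus</code> and <code>armus</code> once each.</p>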
<h2 id="how-well-does-this-work">How well does this work?</h2>
<p>I trained this model on some 8 million words of Latin poetry and prose from the <a href="https://github.com/tesserae/tesserae/tree/master/texts/la">Tesserae Latin corpus</a>. Then I used the model to create a lemmatizer that returns all possibilities for an ambiguous form, but weights the probability of each lemma by its share of the combined frequency counts of all the candidates.</p>
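<p>As a rough sketch of that weighting step, assume each candidate lemma receives its count divided by the summed counts of all candidates for the form. The function name <code>weight_candidates</code> and the two tables are hypothetical, and the even split for unseen candidates is a choice made for this sketch; the counts are the ones reported for <em>arma</em> and <em>armo</em>.</p>

```python
# Hypothetical tables for this sketch: candidate lemmata per surface form,
# and the unsupervised counts reported for those lemmata.
CANDIDATES = {"arma": ["arma", "armo"]}
COUNTS = {"arma": 5581, "armo": 4876}


def weight_candidates(form, candidates, counts):
    """Return (lemma, probability) pairs for a form, most probable first."""
    lemmata = candidates.get(form.lower(), [])
    if not lemmata:
        return []
    total = sum(counts.get(lemma, 0) for lemma in lemmata)
    if total == 0:
        # No candidate was ever counted: fall back to an even split.
        return [(lemma, 1.0 / len(lemmata)) for lemma in lemmata]
    weighted = [(lemma, counts.get(lemma, 0) / total) for lemma in lemmata]
    return sorted(weighted, key=lambda pair: pair[1], reverse=True)


result = weight_candidates("Arma", CANDIDATES, COUNTS)
for lemma, probability in result:
    print(f"{lemma}: {probability:.2f}")
```

<p>For <em>Arma</em> this gives 5,581 / (5,581 + 4,876) for the noun and 4,876 / (5,581 + 4,876) for the verb, so the noun is ranked first.</p>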
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="675px" version="1.1" viewBox="0 0 675 222" style="cursor:pointer;max-width:100%;max-height:222px;"><defs /><g transform="translate(0.5,0.5)"><rect x="0" y="55" width="140" height="40" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(14.5,69.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="110" height="12" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 112px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;"><u>Arma</u> virumque cano</div></div></foreignObject><text x="55" y="12" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica"><u>Arma</u> virumque cano</text></switch></g><rect x="190" y="85" width="120" height="60" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="190" y="0" width="120" height="70" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(191.5,14.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="117" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 
117px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Armō, <i>v</i>. : 'to arm, to equip'<br />COUNT: 4,876<br /></div></div></foreignObject><text x="59" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><rect x="190" y="80" width="120" height="70" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(191.5,94.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="117" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 117px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Arma, <i>n pl.</i>.: weapon, arms (pl.)<br />COUNT: 5,581<br /></div></div></foreignObject><text x="59" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><rect x="554" y="5" width="120" height="60" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="554" y="0" width="120" height="70" fill="#f8cecc" stroke="#b85450" pointer-events="none" /><g transform="translate(555.5,14.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="117" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 117px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" 
style="display:inline-block;text-align:inherit;text-decoration:inherit;">Armō, <i>v</i>. : 'to arm, to equip'<br />PROBABILITY: 0.47<br /></div></div></foreignObject><text x="59" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><rect x="554" y="85" width="120" height="60" fill="#ffffff" stroke="#000000" pointer-events="none" /><rect x="554" y="80" width="120" height="70" fill="#d5e8d4" stroke="#82b366" pointer-events="none" /><g transform="translate(555.5,94.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="117" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 117px; white-space: normal; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Arma, <i>n</i>. 
pl.: weapon, arms (pl.)<br />PROBABILITY: 0.53<br /></div></div></foreignObject><text x="59" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">[Not supported by viewer]</text></switch></g><path d="M 485.03 115.07 L 543.63 115.01" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 548.88 115 L 541.89 118.51 L 543.63 115.01 L 541.88 111.51 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 482.97 35.07 L 547.63 35.07" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 552.88 35.07 L 545.88 38.57 L 547.63 35.07 L 545.88 31.57 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><g transform="translate(254.5,203.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="258" height="12" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 260px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">Correct Lemmatizations are highlighted in green.</div></div></foreignObject><text x="129" y="12" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">Correct Lemmatizations are highlighted in green.</text></switch></g><path d="M 309.86 35.07 L 353.63 35.01" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 358.88 35 L 351.89 38.51 L 353.63 35.01 L 351.88 31.51 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 310 115 L 353.63 115" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 358.88 115 L 351.88 118.5 L 353.63 115 L 351.88 
111.5 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><rect x="363" y="5" width="120" height="60" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(386.5,14.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="73" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 75px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">5,581<br /><br />5,581 + 4,876<br /></div></div></foreignObject><text x="37" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">5,581<br><br>5,581 + 4,876<br></text></switch></g><path d="M 388 35 L 458 35" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><rect x="365" y="85" width="120" height="60" fill="#ffffff" stroke="#000000" pointer-events="none" /><g transform="translate(388.5,94.5)"><switch><foreignObject style="overflow:visible;" pointer-events="all" width="73" height="40" requiredFeatures="http://www.w3.org/TR/SVG11/feature#Extensibility"><div xmlns="http://www.w3.org/1999/xhtml" style="display: inline-block; font-size: 12px; font-family: Helvetica; color: rgb(0, 0, 0); line-height: 1.2; vertical-align: top; width: 75px; white-space: nowrap; word-wrap: normal; text-align: center;"><div xmlns="http://www.w3.org/1999/xhtml" style="display:inline-block;text-align:inherit;text-decoration:inherit;">4,876<br /><br />5,581 + 4,876<br /></div></div></foreignObject><text x="37" y="26" fill="#000000" text-anchor="middle" font-size="12px" font-family="Helvetica">4,876<br><br>5,581 + 4,876<br></text></switch></g><path d="M 390 115 L 460 115" fill="none" stroke="#000000" 
stroke-miterlimit="10" pointer-events="none" /><path d="M 140.21 75.07 L 145.1 75.03 Q 150 75 150 65 L 150 52 Q 150 42 160 41.99 L 180.74 41.97" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 185.99 41.97 L 178.99 45.47 L 180.74 41.97 L 178.98 38.47 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 140.21 84.72 L 145.1 84.86 Q 150 85 150 95 L 150 105 Q 150 115 160 115.02 L 183.49 115.06" fill="none" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /><path d="M 188.74 115.07 L 181.74 118.55 L 183.49 115.06 L 181.75 111.55 Z" fill="#000000" stroke="#000000" stroke-miterlimit="10" pointer-events="none" /></g></svg>
<p>When presented with a token which might stem from more than one lemma, I simply chose the most probable answer according to the model. I tested this system against ~17,000 tokens of Latin from the CLTK corpus whose correct lemmatizations have been entered by hand. The result was correct roughly 91% of the time. Let’s break down those results.</p>
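<p>The counting and selection steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in the linked repository: the lookup table, the toy corpus, and the function names are all hypothetical, and a real table of possible lemmas would come from a morphological dictionary.</p>

```python
from collections import Counter

# Hypothetical lookup table: each surface form maps to every lemma it
# could stem from. These entries are invented for illustration.
POSSIBLE_LEMMAS = {
    "arma": ["arma", "armo"],   # ambiguous: noun 'arma' or verb 'armo'
    "armis": ["arma"],
    "armat": ["armo"],
    "uirum": ["uir"],
}


def count_lemmas(tokens, lookup):
    """Count a lemma as 'seen' every time a token could stem from it."""
    counts = Counter()
    for token in tokens:
        for lemma in lookup.get(token, []):
            counts[lemma] += 1
    return counts


def lemma_probabilities(token, lookup, counts):
    """Normalise the counts of a token's candidate lemmas into probabilities."""
    candidates = lookup.get(token, [])
    total = sum(counts[lemma] for lemma in candidates)
    if total == 0:
        return {}
    return {lemma: counts[lemma] / total for lemma in candidates}


def most_probable_lemma(token, lookup, counts):
    """Pick the candidate lemma with the highest probability, if any."""
    probs = lemma_probabilities(token, lookup, counts)
    return max(probs, key=probs.get) if probs else None


# Toy corpus (hypothetical); the real model was trained on a large corpus.
corpus = ["arma", "uirum", "armis", "armis", "armat", "arma"]
counts = count_lemmas(corpus, POSSIBLE_LEMMAS)

print(lemma_probabilities("arma", POSSIBLE_LEMMAS, counts))
print(most_probable_lemma("arma", POSSIBLE_LEMMAS, counts))
```

<p>Plugging the counts from the figure into the same normalisation (5,581 for <em>arma</em> and 4,876 for <em>armō</em>) gives the probabilities shown there, roughly 0.53 and 0.47.</p>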
<p>First, it should be said that 73% of the tokens in this data set are unambiguous–they can only come from one lemma, so it isn’t possible to make a mistake.</p>
<p>Second, if you guess randomly in ambiguous cases, you happen to be right about 30% of the time (presumably because each ambiguous case has, on average, 3 possible lemmatizations). Since the remaining 27% of tokens are ambiguous, random selection gives roughly 73% + (27% × 30%) ≈ 81% accuracy.</p>
<p>Using the unsupervised frequency model, we end up guessing correctly in 66% of ambiguous cases (more than twice as often as random guessing), for a total accuracy of 73% + (27% × 66%) ≈ 91%. That’s strong enough to suggest that this model of word frequency can be extended to other NLP tasks or combined with other unsupervised lemmatization techniques.</p>
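<p>The overall accuracy figures follow from a weighted sum: unambiguous tokens are always right, and ambiguous tokens are right at whatever rate the guessing strategy achieves. The short sketch below reproduces that arithmetic from the percentages quoted above; the function name is my own.</p>

```python
def overall_accuracy(unambiguous_share, ambiguous_accuracy):
    """Combine the always-correct unambiguous share with the hit rate
    on the remaining, ambiguous tokens."""
    ambiguous_share = 1 - unambiguous_share
    return unambiguous_share + ambiguous_share * ambiguous_accuracy


# Figures quoted in the post: 73% unambiguous, 30% vs. 66% on ambiguous tokens.
random_baseline = overall_accuracy(0.73, 0.30)
frequency_model = overall_accuracy(0.73, 0.66)

print(f"random guessing: {random_baseline:.1%}")   # 81.1%
print(f"frequency model: {frequency_model:.1%}")   # 90.8%
```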
<h2 id="observations">Observations</h2>
<p>A few things about this model surprised me. First, if we only count unambiguous forms, the model doesn’t lead to a successful lemmatizer. In fact its accuracy drops <em>below</em> what we see through random selection.</p>
<p>Second, this model was developed as part of a more complex lemmatizer that looked at surrounding language. That model was correct roughly 86% of the time, until I added a step to remove word-frequency information from the model. That caused accuracy to drop below 80% (in other words, <em>worse</em> than random chance), and it became clear that nearby words were not as effective as word frequency in assigning probabilities to possible lemmas. Still, nearby language is a feature worth revisiting, and you can find the code for that project <a href="https://github.com/jamesgawley/latin_lemma_disambiguation_models">here</a>.</p>
<p>James Gawley</p>
<p>Google Summer of Code Results (2018-08-14): https://jamesgawley.github.io/Google-Summer-of-Code</p>
<h1 id="overview">Overview:</h1>
<p>The Classical Language Toolkit (CLTK) is a suite of Natural Language Processing (NLP) tools used by humanists to study ancient texts. The original focus of the project was Latin and Greek, and the most sophisticated tools are available only for these languages. However, one of the main goals of the project is to support scholarship into cultures outside of the Western canon, particularly where NLP tools are not available.</p>
<h1 id="what-i-set-out-to-do">What I Set Out to Do:</h1>
<p>Most NLP techniques rely on a large corpus of training data. For example, ‘lemmatization,’ wherein all the inflected forms of a word are analysed as a single item, is often done by training a lemmatizer on a large group of texts whose words have already been grouped by lemma. Machine learning techniques are applied to build a model, and the model is then used to lemmatize texts which have not been analyzed by hand.</p>
<p>However, under-studied languages lack adequate training data to support this standard approach; there just aren’t enough texts which have been hand-curated to train a model. Therefore I set out to build an unsupervised model which can learn, for example, the correct lemmatization without ‘knowing the answers.’ The goal is to provide accuracy comparable to supervised approaches, without the need for hand-curated training data.</p>
<p>Lemmatization is a first step, and it is useful because the success of an unsupervised algorithm can be easily tested. However, my unsupervised model can be extended to create tools for, e.g., cross-language document alignment, where testing is more difficult.</p>
<h1 id="steps-completed">Steps Completed:</h1>
<ul>
<li>Added lemmatizers for Greek and for Latin that provide all possible word groupings</li>
<li>Built an unsupervised language model extensible to other NLP tasks such as translation or document alignment</li>
<li>Used the language model to build an unsupervised Latin lemmatizer with > 90% accuracy</li>
<li>Added tools to suggest synonyms for Latin and Greek words</li>
<li>Added tools to suggest Latin translations for Greek words, and vice versa</li>
</ul>
<h1 id="additional-service-completed">Additional Service Completed:</h1>
<ul>
<li>Coordinated the incorporation of CLTK into the core logic of the next generation of the open-source <a href="http://tesserae.caset.buffalo.edu">Tesserae Project</a></li>
<li>Expanded the CLTK corpus of texts with aligned translation of Latin and Greek (useful for testing translation tools)</li>
</ul>
<h1 id="steps-remaining">Steps Remaining:</h1>
<ul>
<li>Build an unsupervised language model for Greek</li>
<li>Use Greek model to build unsupervised lemmatizer for Greek</li>
<li>Use Greek and Latin language models to assign probabilities to potential synonyms and translations</li>
</ul>
<h1 id="where-the-code-lives">Where the code lives:</h1>
<h3 id="already-pulled-into-cltk">Already Pulled into CLTK:</h3>
<ul>
<li><a href="https://github.com/cltk/latin_models_cltk/blob/master/semantics/lemmata.py">A dictionary of lemmas for Latin word-forms</a></li>
<li><a href="https://github.com/cltk/latin_models_cltk/blob/master/semantics/translations.py">A dictionary of Greek translations for Latin lemmas</a></li>
<li><a href="https://github.com/cltk/latin_models_cltk/blob/master/semantics/synonyms.py">A dictionary of Latin synonyms for Latin lemmas</a></li>
<li><a href="https://github.com/cltk/greek_models_cltk/tree/master/semantics">Dictionaries of lemmata, synonyms, and Latin translations for Greek words</a></li>
<li><a href="https://github.com/cltk/cltk/blob/master/cltk/semantics/latin/lookup.py">A tool for looking up all possible lemmas of a Greek or Latin word</a></li>
</ul>
<h3 id="not-yet-pulled-into-cltk">Not Yet Pulled into CLTK:</h3>
<ul>
<li><a href="https://github.com/jamesgawley/cltk/tree/master/cltk/lemmatize/latin/unsupervised.py">The unsupervised model and unsupervised lemmatization tool</a></li>
<li><a href="https://github.com/jamesgawley/latin_lemma_disambiguation_models">Experiments with various unsupervised language models</a></li>
<li><a href="https://github.com/jamesgawley/greek_text_greek_fragmentary_historians">The new corpus of plaintext, aligned Latin and Greek</a></li>
<li><a href="https://github.com/jamesgawley/latin_text_tesserae_collection">A corpus of 10 million words of Latin poetry and prose, used to train the unsupervised model</a></li>
</ul>
<h1 id="documentation">Documentation:</h1>
<ul>
<li><a href="http://docs.cltk.org/en/latest/latin.html#semantics">How to use the lemma lookup tool</a></li>
<li><a href="https://jamesgawley.github.io/Unsupervised-Lemmatization-Model">The logic behind the unsupervised lemmatization model</a></li>
<li><a href="https://jamesgawley.github.io/Initial-GSoC-Blog-Post-for-CLTK">Initial GSoC blog post</a></li>
</ul>
<h1 id="learning-experience">Learning Experience:</h1>
<ul>
<li>Improved Python coding skill</li>
<li>Developed better habits for open-source contributions</li>
<li>Learned how large coding projects are managed smoothly</li>
</ul>
<p>James Gawley</p>
<p>Google Summer of Code Initial Post (2018-04-30): https://jamesgawley.github.io/Initial-GSoC-Blog-Post-for-CLTK</p>
<p><em>This post was originally submitted to the Classical Language Toolkit’s blog at the outset of the 2018 GSoC period.</em></p>
<p>My name is James Gawley, and I’m a PhD candidate in Classics at the University at Buffalo, SUNY. My dissertation is about how poets manipulate reader attention and memory to craft successful allusions. In particular I’m focused on Vergil’s relationship to Homer, which means I’m looking at Latin allusions to Greek poetry. I take two fundamentally different approaches to the research: on the one hand I gather responses from expert readers who try to formally describe the similarities between known allusions, and on the other hand I build computer models to derive all the points of similarity between the entire texts, not limited to known connections. I believe that when we compare what a computer derives and what a human notices, we will learn something fundamental about how allusion is signalled to the reader.</p>
<p>The first part of my GSoC proposal for CLTK involves two steps. First, I will add the dictionaries my friends and I have developed for suggesting Latin-to-Greek translations (and vice versa), as well as dictionaries of intra-language synonyms. Second, I will combine those translations and synonyms with CLTK’s ‘backoff’ approach to lemma disambiguation. Right now, the translation and synonym dictionaries suggest a list of words without telling the user which translation or synonym is more probable. But the probability of each translation or synonym is critical for many NLP applications. So what determines which translation or synonym is most likely? The answer is context, and that’s what CLTK’s backoff lemmatizer analyzes. Once these tools have been combined, CLTK users will be able to ask, for any given word in context, what the most probable translations or synonyms might be.</p>
<p>The second thing I’m going to do during my GSoC fellowship is to add pre-trained word embeddings to CLTK. Embeddings are a way of getting at the meaning of a word through the context in which it appears. For example, the words ‘monarch’ and ‘king’ probably appear in similar contexts in an English corpus. The neat thing about embeddings in particular is that they use latent variables that are so finely tuned that they incorporate all the contexts in which a word appears, and one can add or subtract parts of that context with interesting results. To take a famous example, if one subtracts the vector representation of the word ‘man’ from that of the word ‘king’ and adds the vector for ‘woman’ in a well-trained model of English, the resulting vector is closest to that of the word ‘queen.’ In other words, ‘king’ – ‘man’ + ‘woman’ ≈ ‘queen’ in English. Analogies of this kind also hold in the Latin and Greek vector models I have trained, which I will add to the CLTK.</p>
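<p>To make the vector arithmetic concrete, here is a toy sketch in plain Python. The two-dimensional vectors are invented purely for illustration (one axis loosely standing for ‘royalty,’ the other for ‘gender’); real embeddings have hundreds of dimensions learned from a corpus, and none of these names or numbers come from the models mentioned above. The analogy is written in its usual form, ‘king’ – ‘man’ + ‘woman.’</p>

```python
import math

# Toy vectors, invented for illustration only:
# axis 0 ~ "royalty", axis 1 ~ "gender".
VECTORS = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "monarch": (1.0, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative, vectors):
    """Add the 'positive' vectors, subtract the 'negative' ones, and return
    the nearest remaining word by cosine similarity."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in excluded}
    return max(scores, key=scores.get)


print(analogy(["king", "woman"], ["man"], VECTORS))  # queen
```

<p>The query words themselves are excluded from the candidates, which is the usual convention for analogy tests; without that step the nearest neighbour is often one of the input words.</p>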
<p>I’m looking forward to working with Patrick Burns and Kyle Johnson over the next few months, and I can’t wait to make this technology available to the CLTK user base!</p>
<p>James Gawley</p>