Translating Ottoman Turkish Spelling to Latin Alphabet using Surface Forms

dervaze is a project I started in 2015, during my Ph.D. work, to translate Ottoman Turkish into modern Turkish spelling and to provide an OCR/ICR/handwriting recognition engine for the Ottoman language.

I had to stop because of the lack of data: without a considerable amount of it, statistical methods for both Natural Language Processing and Computer Vision fail. Producing and maintaining data seemed a much heavier burden than building the technical solutions, so I mostly gave up on the idea that a working solution could be obtained with classical OCR techniques. The research is still waiting for me to finish it.

I am beginning this series of explorations in Computer Vision, Machine Learning and related fields to document my progress and to provide some basis for further research. From time to time, I will post results related to the ideas here. Currently this is mostly a hobby/side project.

My current endeavour is to write a robust translation engine between Ottoman and Turkish in Dart.¹

Instead of building a full-fledged morphological analyzer like TRMorph, Google Research Morphological Analyzer or Starlang Morphological Analyzer, our aim is to provide a surface-level to surface-level translation between Arabic and Turkish Latin letters.

It is true that it’s possible to use something like:


+-----------------+                +----------------+                 +---------------+
|                 |                |                |                 |               |
|                 |                |                |                 |               |
|    kelimeler    |    +------->   |  kelime+PLU    |  +----------->  |   کلمهلر      |
|                 |                |                |                 |               |
|                 |                |                |                 |               |
+-----------------+                +----------------+                 +---------------+

to translate between Turkish Latin and Ottoman, but there are two problems here:

There is no Ottoman morphological analyzer, and although Ottoman is grammatically Turkish, its surface forms would still have to be translated. Even the analyzers for Turkish in Latin script are relatively recent.
Translating the output of a Turkish Latin morphological analyzer to Ottoman seems to be more work than writing a translation method directly.

For example, for the query kelimeler, TRMorph gives

    kelime<N><pl>
    kel<Adj><0><N><p1s><dat><0><V><cpl:pres><3p>
    kel<Adj><p1s><Prn><dat><0><V><cpl:pres><3p>
    kelime<N><0><V><cpl:pres><3p>
    kelime<N><pl><0><V>
    kelime<N><pl><0><V><cpl:pres><3p>
    kelime<N><pl><0><V><cpl:pres><3s>

and although most of these analyses lead to identical surface forms, we would need to work through all the suffixes and the different ways they connect. Also, since the surface form of Ottoman Turkish carries less information, an Ottoman morphological analysis would yield many more results than its Turkish Latin counterpart.

Because of these hindrances, and in trying to come up with a quick-and-dirty yet workable solution, I have made the following observations:

Our part-of-speech system does not need to be highly specialized. In fact, we need only three classes: nouns, verbs and proper nouns. Proper nouns are grammatically nouns, but their orthography may require different rules. We will call these three the word classes.
We can get away with a set of surface-level rules to translate suffixes for each word class. These rules use attributes that can be derived from Turkish Latin orthography.
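
To make the second observation more concrete, here is a minimal sketch of what a surface-level suffix table could look like. It is written in Python for brevity, not in Dart, and it is not the actual dervaze code: the names `SUFFIX_RULES` and `ottoman_suffix` are hypothetical, and the three entries (plural, locative, ablative) are only illustrative. The point it shows is that several Latin variants of a suffix collapse into a single Ottoman spelling, so no full morphological analysis is needed for this direction.

```python
from typing import Optional

# Hypothetical rule table: (word class, Latin suffix variants, Ottoman spelling).
# Only a few illustrative noun suffixes are listed; a real table would be larger.
SUFFIX_RULES = [
    ("noun", ("lar", "ler"), "\u0644\u0631"),                # plural: لر
    ("noun", ("da", "de", "ta", "te"), "\u062f\u0647"),      # locative: ده
    ("noun", ("dan", "den", "tan", "ten"), "\u062f\u0646"),  # ablative: دن
]


def ottoman_suffix(word_class: str, latin_suffix: str) -> Optional[str]:
    """Return the Ottoman spelling of a Latin suffix, or None if no rule matches."""
    for rule_class, variants, ottoman in SUFFIX_RULES:
        if rule_class == word_class and latin_suffix in variants:
            return ottoman
    return None


# Both harmony variants of the plural map to the same Ottoman spelling.
print(ottoman_suffix("noun", "ler") == ottoman_suffix("noun", "lar"))  # True
```

Going the other way, from Ottoman to Latin, one spelling has to be expanded into the right variant, and that is where attributes of the root come in.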

The attributes these rules use are:

Part of Speech: The word class of the root, as discussed above.
Last Vowel: The last vowel of the Turkish root, which is needed to pick the actual suffix form when vowel harmony applies.
Last Consonant: The last consonant of the Turkish root. This is needed to find the palatalization/softening of certain consonants.
Ends with Vowel: Whether the root ends with a vowel, as in ata.
Has Single Vowel: Whether the root has a single vowel. This matters for irregularities in the aorist inflection of verbs.
Last Vowel Hard: Whether the last vowel is one of a, ı, o, u or not.
Last Consonant Hard: Whether the last consonant is one of p, ç, t, k and undergoes softening when it receives a suffix starting with a vowel.
Has Consonant Softening: The inverse of the Last Consonant Hard attribute: whether the root ends with one of b, c, d, g and receives a suffix.

From these attributes, which can be derived from the Turkish Latin forms of the words using regular expressions, we can write rules to translate Turkish to Ottoman and vice versa.
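
As a rough illustration, the attributes listed above could be derived along these lines. This is a sketch in Python, not Dart, and not the dervaze implementation: `RootAttributes`, `derive_attributes` and `plural` are hypothetical names. It assumes the root is already lowercase (Turkish I/İ casing is not handled) and follows each attribute description literally. Whether a given root really undergoes softening is a property of the word and is not decided here.

```python
import re
from dataclasses import dataclass
from typing import Optional

VOWELS = "aeıioöuü"
CONSONANTS = "bcçdfgğhjklmnprsştvyz"


@dataclass
class RootAttributes:
    part_of_speech: str  # "noun", "verb" or "proper_noun"
    last_vowel: Optional[str]
    last_consonant: Optional[str]
    ends_with_vowel: bool
    has_single_vowel: bool
    last_vowel_hard: bool
    last_consonant_hard: bool
    has_consonant_softening: bool


def derive_attributes(root: str, part_of_speech: str) -> RootAttributes:
    """Derive the surface attributes of a lowercase Turkish Latin root."""
    vowels = re.findall(f"[{VOWELS}]", root)
    consonants = re.findall(f"[{CONSONANTS}]", root)
    last_vowel = vowels[-1] if vowels else None
    last_consonant = consonants[-1] if consonants else None
    return RootAttributes(
        part_of_speech=part_of_speech,
        last_vowel=last_vowel,
        last_consonant=last_consonant,
        ends_with_vowel=re.search(f"[{VOWELS}]$", root) is not None,
        has_single_vowel=len(vowels) == 1,
        last_vowel_hard=last_vowel is not None and last_vowel in "aıou",
        last_consonant_hard=last_consonant is not None and last_consonant in "pçtk",
        has_consonant_softening=re.search("[bcdg]$", root) is not None,
    )


def plural(root: str) -> str:
    """Attach the plural suffix, choosing -lar/-ler by vowel harmony."""
    attrs = derive_attributes(root, "noun")
    return root + ("lar" if attrs.last_vowel_hard else "ler")


print(derive_attributes("kitap", "noun"))
print(plural("kelime"), plural("kitap"))  # kelimeler kitaplar
```

The `plural` helper shows how a single attribute, Last Vowel Hard, is enough to pick between the two Latin variants of a suffix that has only one Ottoman spelling.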

TBC.

¹ Dart has this _little_ benefit of being able to run on mobile (Android and iOS), on servers and
on the web, so _write once, run on Linux, iOS and Android_ is feasible without much jumping
between architectures. I hope Flutter gains more traction and becomes the mainstream way of
writing mobile applications.