Translating Ottoman Turkish Spelling to Latin Alphabet using Surface Forms

dervaze is a project I started in 2015, during my Ph.D. work, to translate Ottoman Turkish into modern Turkish spelling and to provide an OCR/ICR/handwriting recognition engine for the Ottoman language.

I had to stop because of the lack of data: without a considerable amount of it, statistical methods for both Natural Language Processing and Computer Vision fail. Producing and maintaining data seemed a much heavier burden than building technical solutions, so I mostly gave up on the idea that a working solution could be obtained with classical OCR techniques. The research is still waiting for me to finish it.

I am beginning this series of explorations in Computer Vision, Machine Learning and related fields to document my progress and to provide some basis for further research. From time to time, I will post results regarding the ideas here. Currently this is mostly a hobby/side project.

My current endeavour is to write a robust translation engine between Ottoman and Turkish in Dart. [1]

Instead of performing a full-fledged morphological analysis, as TRMorph, the Google Research Morphological Analyzer or the Starlang Morphological Analyzer do, our aim is to provide a surface-level to surface-level translation between the Arabic-script (Ottoman) and Latin-script (Turkish) spellings.

It is true that a pipeline like the following could be used


+-----------------+                +----------------+                 +---------------+
|                 |                |                |                 |               |
|                 |                |                |                 |               |
|    kelimeler    |    +------->   |  kelime+PLU    |  +----------->  |   کلمهلر      |
|                 |                |                |                 |               |
|                 |                |                |                 |               |
+-----------------+                +----------------+                 +---------------+

to translate between Turkish Latin and Ottoman, but there are two problems here:

  1. There is no Ottoman morphological analyzer, so although Ottoman is grammatically Turkish, we would still have to translate surface-level forms. Even the analyzers for Latin-script Turkish are relatively recent.

  2. Translating the output of a Turkish Latin morphological analyzer into Ottoman seems like more work than writing a direct translation method.

For example, for the query kelimeler, TRMorph gives

    kelime<N><pl>
    kel<Adj><0><N><p1s><dat><0><V><cpl:pres><3p>
    kel<Adj><p1s><Prn><dat><0><V><cpl:pres><3p>
    kelime<N><0><V><cpl:pres><3p>
    kelime<N><pl><0><V>
    kelime<N><pl><0><V><cpl:pres><3p>
    kelime<N><pl><0><V><cpl:pres><3s>

and although most of these analyses share the same surface form, we would need to work through all the suffixes and the different ways they connect. Also, since the surface form of Ottoman Turkish carries less information, an Ottoman morphological analysis would yield many more results than its Latin-script counterpart.
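To make the size of that search space concrete, here is a minimal sketch that splits the analyses quoted above into a root and a tag sequence. It is written in Python purely for illustration and is not part of the Dart engine; `parse_analysis` and the variable names are hypothetical, and the only assumption about the format is what is visible in the quoted output: a root followed by tags in angle brackets.

```python
import re

# Analyses copied from the TRMorph output quoted above for "kelimeler".
ANALYSES = """\
kelime<N><pl>
kel<Adj><0><N><p1s><dat><0><V><cpl:pres><3p>
kel<Adj><p1s><Prn><dat><0><V><cpl:pres><3p>
kelime<N><0><V><cpl:pres><3p>
kelime<N><pl><0><V>
kelime<N><pl><0><V><cpl:pres><3p>
kelime<N><pl><0><V><cpl:pres><3s>
""".splitlines()

# A tag is anything between one pair of angle brackets.
TAG = re.compile(r"<([^<>]+)>")


def parse_analysis(line):
    """Split one analysis string into (root, [tags]). Hypothetical helper."""
    root = line.split("<", 1)[0]
    return root, TAG.findall(line)


parsed = [parse_analysis(a) for a in ANALYSES]

# Every distinct root and tag below would need its own Ottoman mapping.
roots = sorted({root for root, _ in parsed})
distinct_tags = sorted({tag for _, tags in parsed for tag in tags})

print(len(parsed), "analyses,", len(roots), "roots,", len(distinct_tags), "distinct tags")
```

A single surface form already produces seven analyses over two candidate roots, and a translator built on this output would have to handle every tag and every tag sequence.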

Because of these hindrances, and in trying to come up with a quick-and-dirty yet workable solution, I have made the following observations:

  1. Our part-of-speech system does not need to be highly specialized. Actually, we need only three classes: nouns, verbs and proper nouns. Proper nouns are grammatically nouns, but their orthography may require different rules. We will call these three the word classes.

  2. We can get away with a set of surface-level rules to translate suffixes for each word class. These rules use attributes that can be derived from Turkish Latin orthography.

These rules are:

Using these rules, which rely on attributes that can be derived from the Turkish Latin forms of the words with regular expressions, we can translate Turkish to Ottoman and vice versa.
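As a rough illustration of the idea, and not the actual rule set, the sketch below derives two attributes from a Latin-script word with regular expressions and applies one suffix rule for nouns. Everything named here is an assumption made for the example: the attribute names, the `attributes` and `translate_plural_noun` helpers, and the one-entry stem table, whose Ottoman spelling is taken from the kelimeler diagram above. It is in Python for illustration only and expects lowercase input. The one linguistic fact it relies on is that the plural suffix is written لر in Ottoman whether it is pronounced -lar or -ler.

```python
import re

VOWELS = "aeıioöuü"
FRONT_VOWELS = "eiöü"


def attributes(word):
    """Derive hypothetical orthographic attributes from a lowercase Latin-script word."""
    vowels = re.findall(f"[{VOWELS}]", word)
    last = vowels[-1] if vowels else None
    return {
        "ends_with_vowel": bool(re.search(f"[{VOWELS}]$", word)),
        "last_vowel_front": (last in FRONT_VOWELS) if last else None,
    }


# Surface-level rule: a noun stem followed by the plural suffix -lar / -ler.
PLURAL = re.compile(r"^(?P<stem>.+)(?P<suffix>l[ae]r)$")
OTTOMAN_PLURAL = "لر"  # one spelling for both -lar and -ler

# Hypothetical one-entry stem table; a real engine would need a full lexicon.
STEMS = {"kelime": "کلمه"}


def translate_plural_noun(word):
    """Translate stem + plural to Ottoman, or return None if the rule does not apply."""
    m = PLURAL.match(word)
    if m is None or m["stem"] not in STEMS:
        return None
    # Vowel harmony check using the derived attribute. This is a simplification:
    # loanwords such as "saatler" do not follow it.
    expected = "ler" if attributes(m["stem"])["last_vowel_front"] else "lar"
    if m["suffix"] != expected:
        return None
    return STEMS[m["stem"]] + OTTOMAN_PLURAL


print(attributes("kelime"))
print(translate_plural_noun("kelimeler"))
```

The reverse direction is harder, because the Ottoman spelling does not say which of -lar or -ler to emit; an attribute like `last_vowel_front` computed on the translated stem is the kind of information such a rule would need.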

TBC.

[1] Dart has the _little_ benefit of being able to run both on mobile (Android and iOS) and on
servers, with support for the web, so _write once, run on Linux, iOS and Android_ is feasible
without much architecture jumping. I hope Flutter gains more traction and becomes the mainstream
way of writing mobile applications.