Progress on Ottoman Translation - 2018, Week 6

Some of the following posts will read like a TODO list for the coming months: what I'm planning for dervaze and its mobile versions. Since I've become mostly a solo developer, I'll share my experience with these problems here to shed light for those interested.

The technology for Ottoman OCR was mostly ready when family life started interrupting my work. I'll need to check what is available, but a more pressing problem for me is the speed of translation. Currently it's so slow that it's barely usable.

I have written the dictionary as a C library, without any database dependencies, and I have integrated it into the Android version. I'm currently updating the search functionality in the mobile app to use this library instead of the web service. It will be orders of magnitude faster than the current version, because it works offline and uses a trie to store the words: both small and fast.

In my experience, having a single data structure and functions to transform and index it is much simpler than juggling multiple data structures. In my case, this single basic structure is the following:

typedef struct _dervaze_lexical_item {
  int index;                       /* running index assigned to each word */
  bstr latin_search_key;           /* key for searching the Latin spelling */
  bstr latin;                      /* Latin (modern Turkish) spelling */
  bstr visenc_search_key;          /* key for searching the visenc form */
  bstr visenc_dotless_search_key;  /* key for searching with dots ignored */
  bstr visenc;                     /* visual encoding of the Ottoman spelling */
  bstr annotation;
  bstr meaning;
  bstr abjad;                      /* traditional numeric value of the word */
  lexical_role role;               /* noun/verb suffix distinction */
  int last_vowel;                  /* last vowel, used for vowel harmony */
  int props;                       /* bit field of properties, see below */
} dervaze_lexical_item;

visenc is our abbreviation for visual encoding, which we use to encode Arabic/Ottoman/Farsi words with basic ASCII letters. It's documented on its own page.

index is a running index that assigns a number to each word. The search keys for Latin, Visenc, and Dotless Visenc (e.g. so that a word written with ﺥ can also be found when searching with ح, چ, ج…) are the keys used to look up this lexical item. The tries I mentioned above use these keys to find the lexical items.
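
To make the idea concrete, here is a minimal trie sketch over plain ASCII keys that maps a search key to a lexical item index. This is only an illustration of the approach, not the actual dervaze trie; the node layout, the function names, and the example index 42 are made up for the sketch.

#include <stdio.h>
#include <stdlib.h>

#define TRIE_ALPHABET 128   /* search keys are plain ASCII */

typedef struct trie_node {
  struct trie_node *child[TRIE_ALPHABET];
  int item_index;           /* index of a dervaze_lexical_item, -1 if none */
} trie_node;

static trie_node *trie_new(void) {
  trie_node *n = calloc(1, sizeof *n);
  n->item_index = -1;
  return n;
}

/* store a search key pointing at a lexical item index */
static void trie_insert(trie_node *root, const char *key, int index) {
  for (; *key; ++key) {
    unsigned char c = (unsigned char)*key;
    if (!root->child[c])
      root->child[c] = trie_new();
    root = root->child[c];
  }
  root->item_index = index;
}

/* exact lookup: returns the lexical item index, or -1 if not found */
static int trie_find(const trie_node *root, const char *key) {
  for (; *key && root; ++key)
    root = root->child[(unsigned char)*key];
  return root ? root->item_index : -1;
}

int main(void) {
  trie_node *latin = trie_new();
  trie_insert(latin, "gemi", 42);   /* 42 is a made-up item index */
  printf("gemi -> %d\n", trie_find(latin, "gemi"));
  printf("kale -> %d\n", trie_find(latin, "kale"));
  return 0;
}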

I'm intentionally avoiding UTF-8 and other Unicode encodings, because I like to inspect results on the terminal, and most terminals are not well suited to displaying Arabic text.

When I began writing this software, one of my intentions was to be able to search words by their traditional numeric values. These are used in classical Ottoman poetry to mark a date with a verse. For example, the letter alef corresponds to 1, be (ب) to 2, and so on. To be able to search words by these numerals, e.g. typing 246 to get the words whose letters add up to that number, we also keep this value here.
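
As an illustration of how such a value can be computed, the sketch below sums per-letter values. The letter-to-value table uses stand-in ASCII letters and covers only a handful of letters; the real code computes this over the visenc representation, which is not shown here.

#include <stdio.h>

/* a few classical abjad values, keyed by stand-in ASCII letters */
typedef struct { char letter; int value; } abjad_entry;

static const abjad_entry abjad_table[] = {
  { 'a', 1 },    /* alef */
  { 'b', 2 },    /* be   */
  { 'd', 4 },    /* dal  */
  { 'm', 40 },   /* mim  */
  { 'r', 200 },  /* rı   */
};

static int abjad_value(const char *word) {
  int sum = 0;
  for (; *word; ++word)
    for (size_t i = 0; i < sizeof abjad_table / sizeof abjad_table[0]; ++i)
      if (abjad_table[i].letter == *word) {
        sum += abjad_table[i].value;
        break;
      }
  return sum;
}

int main(void) {
  /* stand-in word "dram": 4 + 200 + 1 + 40 = 245 */
  printf("%d\n", abjad_value("dram"));
  return 0;
}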

lexical_role is used in translation. Currently we have two roles, which distinguish between noun and verb suffixes in Turkish.
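
In simplified, illustrative form (the names here are not the actual definition), the role type is an enum along these lines:

typedef enum lexical_role {
  ROLE_NOUN,   /* items that take noun suffixes */
  ROLE_VERB    /* items that take verb suffixes */
} lexical_role;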

last_vowel, as the name implies, is the last vowel of the word. Turkish vowel harmony is not reflected in Ottoman spelling, so when we want to convert an Ottoman word to the corresponding Turkish Latin spelling, we check the last vowel to attach the correct suffixes. (gemiler, not gemilar, for example.)
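
The sketch below shows the idea in isolation: choose the plural suffix by looking at the last vowel. It is illustrative only; the ASCII-only vowel test and the helper names are simplifications rather than the dervaze code.

#include <stdio.h>
#include <string.h>

/* back vowels take -lar, front vowels take -ler; the real code also has
 * to handle ı, ö, ü, which don't fit in plain ASCII */
static int is_back_vowel(char v) {
  return v == 'a' || v == 'o' || v == 'u';
}

static const char *plural_suffix(char last_vowel) {
  return is_back_vowel(last_vowel) ? "lar" : "ler";
}

int main(void) {
  char word[32] = "gemi";              /* last vowel is 'i', a front vowel */
  strcat(word, plural_suffix('i'));    /* -> "gemiler", not "gemilar" */
  printf("%s\n", word);

  char word2[32] = "kitap";            /* last vowel is 'a', a back vowel */
  strcat(word2, plural_suffix('a'));   /* -> "kitaplar" */
  printf("%s\n", word2);
  return 0;
}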

props is a bit field that records various properties of the lexical item. For now it has the following flags:

#define HAS_FINAL_VOWEL 0x01
#define HAS_SINGLE_VOWEL 0x02
#define IS_LAST_VOWEL_HARD 0x04
#define IS_FINAL_CONSONANT_HARD 0x08
#define HAS_CONSONANT_SOFTENING 0x10

These flags affect various steps, especially when converting Ottoman to Turkish.
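
As a quick illustration, the flags are combined and tested with bitwise operators. Only the flag names and values below come from the definitions above; the surrounding logic is made up for the example.

#include <stdio.h>

#define HAS_FINAL_VOWEL          0x01
#define HAS_SINGLE_VOWEL         0x02
#define IS_LAST_VOWEL_HARD       0x04
#define IS_FINAL_CONSONANT_HARD  0x08
#define HAS_CONSONANT_SOFTENING  0x10

int main(void) {
  /* a made-up item: ends in a hard consonant that softens before
   * vowel-initial suffixes (like p -> b, t -> d, k -> ğ in Turkish) */
  int props = IS_FINAL_CONSONANT_HARD | HAS_CONSONANT_SOFTENING;

  if (props & HAS_CONSONANT_SOFTENING)
    printf("soften the final consonant before a vowel-initial suffix\n");
  if (!(props & HAS_FINAL_VOWEL))
    printf("the word ends in a consonant\n");
  return 0;
}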

The current version of the translation works in Python, but I'll be rewriting it in C this week. As this is its third rewrite, I don't think it will pose an algorithmic challenge, though I'm not sure debugging the C version will be any easier.