Converting Latin-based Turkish spelling to Ottoman

Posted on 2012-09-23 :: Tags: search, conversion, query, transcription, Ottoman, Turkish, NLP

I’m working on a system to search Ottoman document collections.

To query a large Ottoman collection, the user has to write the query in Ottoman, which uses an Arabic-based alphabet with a completely different set of spelling rules. This limits usability: experts know the spelling, but we can’t assume that most users do.

There are various methods of transcribing Ottoman into modern Turkish. Many of them use diacritics to mark long vowels. Most are silent, however, about consonants that are spelled identically in Turkish but correspond to different letters in Ottoman. For example, they don’t represent the difference between the letters tha and sin, although the first is fairly common in words of Arabic origin. Turks have pronounced these two letters identically since Ottoman times, so both are written as s in Turkish.

The two writing systems carry different information. Ottoman has three letters for s, while Latin-based Turkish has a rich set of vowels, and so on. The Latin-based script looks more verbose, so I decided to reduce this verbosity to get a set of probable Ottoman spellings for a query.
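
To make the idea concrete, here is a minimal sketch in Python. It is an illustration only, not the code of the actual system. The table CANDIDATES and the function ottoman_candidates are hypothetical names, the table is deliberately incomplete, and treating every vowel as "either unwritten or one vowel letter" is a simplifying assumption. Each Latin letter is mapped to the Ottoman letters it might stand for, and all combinations are generated.

```python
from itertools import product

# Hypothetical, incomplete table: Latin letter -> possible Ottoman letters.
# An empty string means "this sound may not be written at all".
CANDIDATES = {
    "s": ["ث", "س", "ص"],        # the three s letters
    "t": ["ت", "ط"],
    "z": ["ذ", "ز", "ض", "ظ"],
    "h": ["ح", "خ", "ه"],
    "k": ["ق", "ك"],
    "a": ["", "ا"],              # assumption: vowel is omitted or written
    "o": ["", "و"],
    "u": ["", "و"],
}

def ottoman_candidates(word):
    """Return the set of candidate Ottoman spellings for a lowercase word.

    Letters missing from the table are passed through unchanged, which
    is only acceptable in a sketch like this one.
    """
    options = [CANDIDATES.get(ch, [ch]) for ch in word]
    return {"".join(combo) for combo in product(*options)}

if __name__ == "__main__":
    for spelling in sorted(ottoman_candidates("su")):
        print(spelling)
```

With this table, ottoman_candidates("su") produces six candidates (three choices for s times two for u), among them صو. The number of candidates grows quickly with word length, so a real system would need to filter them, for example by keeping only spellings that actually occur in the collection.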