A Regular Conversion Algorithm Between Turkish and Ottoman


Modern Turkish spells all words of Turkish, Arabic, or Farsi origin according to their pronunciation. This makes conversion from one writing system to the other a problem, one that can be addressed with the aid of regular expressions.

For example, a word is spelled mnwr in Ottoman, with each letter corresponding to a letter of the Arabic script, whereas the Turkish spelling, münevver, reflects the pronunciation. Since a one-to-one mapping between the two writing systems is not possible, a set of candidate Ottoman spellings must be generated with a regular expression.

When the parser sees münevver, it should convert it to the regular expression mv?nh?(a1)?ww?h?r. This expression describes a set of strings, of which mvnha1wwhr is the longest and mnwr the shortest. A dictionary search then confirms that Ottoman has a word spelled mnwr, so that candidate is selected as the correct spelling.
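The candidate-generation and lookup step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the study's implementation: the function names (tokenize, expand, ottoman_spellings) and the sample dictionary labels are hypothetical, and the tokenizer handles only the pattern subset used in the example (literal characters, parenthesised groups, and a trailing ?).

```python
import itertools
import re

# Pattern given in the text for the Turkish word "münevver".
PATTERN = "mv?nh?(a1)?ww?h?r"


def tokenize(pattern):
    """Split a pattern into (text, is_optional) pairs.

    Assumption: only literal characters, parenthesised literal groups,
    and a trailing '?' occur, as in the example pattern.
    """
    tokens = []
    for match in re.finditer(r"\(([^()]+)\)(\?)?|([^()?])(\?)?", pattern):
        group, group_opt, char, char_opt = match.groups()
        if group is not None:
            tokens.append((group, group_opt is not None))
        else:
            tokens.append((char, char_opt is not None))
    return tokens


def expand(pattern):
    """Return every string the pattern can produce."""
    # An optional token contributes either nothing or its text.
    choices = [("", text) if optional else (text,)
               for text, optional in tokenize(pattern)]
    return {"".join(parts) for parts in itertools.product(*choices)}


def ottoman_spellings(pattern, dictionary):
    """Keep only the candidates that appear among the dictionary labels."""
    return sorted(expand(pattern) & set(dictionary))


if __name__ == "__main__":
    # Hypothetical dictionary labels, made up for illustration only.
    labels = {"mnwr", "ktab", "qlm"}
    print(len(expand(PATTERN)), "candidates")
    print(ottoman_spellings(PATTERN, labels))
```

The number of candidates doubles with every optional element, so for long words it may be cheaper to test each dictionary label against the pattern with re.fullmatch instead of enumerating all candidates first.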

The dictionary in our study consists of word labels. Once a label is found, the system looks up the set of handwritten word images for that label and searches for these images in the text. It can also create a set of images directly from the regular expression by spelling out each candidate.