Converting Latin-based Turkish spelling to Ottoman

I’m working on a system to search Ottoman document collections.

To query a large collection in Ottoman, the user needs to write the query in Ottoman, which uses an Arabic-based alphabet with a completely different set of spelling rules. This limits usability, since most users will not be familiar with Ottoman spelling. Experts are, but we can't assume that every user will be an expert.

There are various methods of transcribing Ottoman into modern Turkish. Many of these use diacritics to mark the different long vowels. When it comes to consonants that are spelled identically in Turkish but correspond to different letters in Ottoman, most of these transcription systems are silent. For example, they don't represent the difference between the letters tha (ث) and sin (س), although the former occurs frequently in words of Arabic origin. Turks have pronounced these two letters identically since Ottoman times, so both are written as s in Turkish.
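To make the loss of information concrete, here is a minimal sketch in Python that groups a few Ottoman consonants by the Latin letter they end up as. The table is a small hand-picked sample for illustration, not a complete transcription scheme, and the names OTTOMAN_TO_LATIN and group_by_latin are my own.

```python
from collections import defaultdict

# Partial, illustrative table: Ottoman consonant -> Latin letter in modern Turkish.
# This is a hand-picked sample, not a full transcription system.
OTTOMAN_TO_LATIN = {
    "ث": "s", "س": "s", "ص": "s",   # tha, sin, sad
    "ت": "t", "ط": "t",             # te, ti
    "ح": "h", "خ": "h", "ه": "h",   # three h letters
    "ب": "b",                       # be (unambiguous)
}

def group_by_latin(table):
    """Invert the table: Latin letter -> list of Ottoman letters it may hide."""
    groups = defaultdict(list)
    for ottoman, latin in table.items():
        groups[latin].append(ottoman)
    return dict(groups)

groups = group_by_latin(OTTOMAN_TO_LATIN)

# Latin letters that correspond to more than one Ottoman letter.
ambiguous = {latin: letters for latin, letters in groups.items() if len(letters) > 1}
print(ambiguous)
```

Every Latin letter that shows up in ambiguous is a place where a plain Turkish spelling cannot tell us which Ottoman letter was meant.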

The information content of these two writing systems is different. One has three s letters; the other has a rich set of vowels, and so on. The Latin-based Turkish script looks more verbose, so I decided to reduce this verbosity to get a set of probable Ottoman spellings.
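The idea can be sketched roughly as follows. This is not the actual system, only a toy illustration under simplifying assumptions: each Latin letter maps to a fixed list of possible Ottoman letters, and a vowel is either left unwritten or written with a single vowel letter. The CANDIDATES table and the ottoman_candidates function are hypothetical names, and the table covers only the letters needed for the example.

```python
from itertools import product

# Hypothetical, deliberately tiny table: Latin letter -> possible Ottoman letters.
# An empty string means the letter may be left unwritten (a simplifying
# assumption about how vowels are handled).
CANDIDATES = {
    "s": ["س", "ص", "ث"],
    "t": ["ت", "ط"],
    "b": ["ب"],
    "a": ["", "ا"],
    "i": ["", "ی"],
}

def ottoman_candidates(word):
    """Return every spelling obtained by choosing one option per Latin letter.

    The input is assumed to be lowercase already; letters missing from the
    table are passed through unchanged.
    """
    options = [CANDIDATES.get(ch, [ch]) for ch in word]
    return {"".join(parts) for parts in product(*options)}

candidates = ottoman_candidates("sabit")
print(len(candidates))
print("ثابت" in candidates)
```

Under these assumptions the word "sabit" expands to 3 × 2 × 1 × 2 × 2 = 24 candidates, and the usual Ottoman spelling ثابت is one of them. The candidate set grows multiplicatively with word length, so a real system would need some way to rank or prune it, for example by checking candidates against the words that actually occur in the collection.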