Various transliteration systems already exist for representing Arabic-based scripts in the Roman alphabet. However, all of them aim to represent phonemes and pay no attention to distinct visual elements. This is adequate when texts are transcribed manually. When we try to represent the visual elements of scanned handwritten documents, however, these systems pose the following problems.
Since conventional systems aim to represent phonemes, a correct reading is necessary and this requires expertise in the language being represented. For Ottoman, this is a deeper problem since the writing system is not actively used.
Labels should correspond to classes in the classification of visual elements. For example, in some cases we label ى, ي and ـيـ identically as y although their visual features are distinct from each other. In other cases, the same letter shape is coded as y, e or i depending on its reading, although it is visually identical. This makes the visual classes mixed and fuzzy, so any measurement of performance gives little indication of the effectiveness of features or classifiers.
Here we document a new transliteration system for visual items in Ottoman Turkish. In this work, our aim is to provide a simple visual approach to transliteration and then devise necessary conversion tables into phonetic transliteration or transcription systems.
Except for numerals, words in Arabic-based writing systems are composed of items in two categories. The first category is a large letter group composed of a continuous movement of the pen, like کلمه. The other category is the smaller elements like dots and diacritics found around these larger items.
A visual transliteration should represent distinct visual elements differently. In order to keep simplicity in labeling and application, dotless letters will be used as base letters and the other letters will be written in terms of these dotless items.
A full letter code consists of one base letter code followed by an optional sequence of diacritic codes. The base letter code is a lowercase Roman letter. Each diacritic code has two parts. The first part is one of o, u or i, meaning over, under or ligature respectively. The second part gives the type of the diacritic: a string of digits after o or u, or the base letter code of the joined letter after i. A full letter code conforms to the following regular expression: [a-hj-np-tv-z](([ou][0-9]+)|(i[a-hj-np-tv-z]))*
The letters o, u and i are not used as base letter codes, so there is no ambiguity in parsing the elements.
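The parsing claim can be illustrated with a short Python sketch. It uses the regular expression for full letter codes, written without commas inside the character classes, since a comma there would match a literal comma. The function names `parse_letter_code` and `split_word` are illustrative assumptions and not part of the system itself.

```python
import re

# Base letter codes: lowercase Roman letters except o, u and i.
BASE = "a-hj-np-tv-z"
FULL_CODE = re.compile(rf"[{BASE}](?:[ou][0-9]+|i[{BASE}])*")
TOKEN = re.compile(rf"(?P<pos>[ou])(?P<digits>[0-9]+)|i(?P<lig>[{BASE}])")


def parse_letter_code(code):
    """Split one full letter code into (base, [(part1, part2), ...])."""
    if not FULL_CODE.fullmatch(code):
        raise ValueError(f"not a valid letter code: {code!r}")
    marks = []
    # Start at index 1 to skip the base letter.
    for m in TOKEN.finditer(code, 1):
        if m.group("pos"):
            marks.append((m.group("pos"), m.group("digits")))
        else:
            marks.append(("i", m.group("lig")))
    return code[0], marks


def split_word(word):
    """Split concatenated letter codes into a list of full letter codes."""
    pieces = FULL_CODE.findall(word)
    if "".join(pieces) != word:
        raise ValueError(f"cannot split into letter codes: {word!r}")
    return pieces
```

Under these assumptions, `parse_letter_code("lo5o3")` returns `("l", [("o", "5"), ("o", "3")])` and `split_word("so3r")` returns `["so3", "r"]`. Because o, u and i never start a letter code, each new base letter marks the start of the next code.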
The codes for all visual elements are listed below in the tables of base letters, diacritics and numerals.
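| Letter Shape | Transliteration | Used in Letters |
|---|---|---|
| ا | e | ا |
| ٮ | b | ب, ت, ث, ن, پ, یـ |
| ح | x | ح, خ, ج, چ |
| د | d | د, ذ |
| ر | r | ر, ز |
| س | s | س, ش |
| ص | z | ص, ض |
| ط | t | ط, ظ |
| ع | a | ع, غ |
| ٯ | f | ف, ق |
| ک | k | ک, گ |
| ل | l | ل, ك |
| م | m | م |
| و | w | و |
| ه | h | ه, ة |
| ی | y | ی |
| ﺀ | c | ﺀ |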
Table: The Transliterations of Base Letters
| Transliteration | Description |
|---|---|
| 1 | Dot of ب, ن or خ |
| 2 | Dots of ق, ت or یـ |
| 3 | Dots of ث, ش or چ |
| 5 | ء hamza and also in ك |
| 8 | ـّ shadda |
| 0 | ـْ sukun |
| 6 | ـٓ madda |
| 7 | / sign above گ |
| 4 | ـَ fatha and ـِ kasra |
| 9 | ـُ damma |
| 44 | ـً fathatan and ـٍ kasratan |
| 99 | ـٌ dammatan |
Table: Transliterations for Diacritics
| Numeral | Transliteration |
|---|---|
| ۱ | n1 |
| ۲ | n2 |
| ۳ | n3 |
| ۴ | n4 |
| ۵ | n5 |
| ۶ | n6 |
| ۷ | n7 |
| ۸ | n8 |
| ۹ | n9 |
| ۰ | n0 |
Table: Transliterations for Numerals
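The numeral mapping is regular enough to be computed rather than looked up. The following Python sketch is one possible way to do so, using the standard `unicodedata` module. The function names are hypothetical, and emitting the codes in the logical order of the input string is an assumption, since the text does not specify an order for multi-digit numbers.

```python
import unicodedata


def numeral_code(ch):
    """Return the numeral code ('n' followed by the digit value) for one digit."""
    # unicodedata.digit raises ValueError if ch is not a digit character.
    return "n" + str(unicodedata.digit(ch))


def numeral_codes(text):
    """Return the numeral codes for a string of digits, in string order."""
    return [numeral_code(ch) for ch in text]
```

Since `unicodedata.digit` knows the digit value of both the Arabic-Indic and the Extended Arabic-Indic digit ranges, the same function covers either form of the numerals.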
The advantages of using this transliteration instead of a phonetic transliteration can be summarized as follows:
A phonetic representation of visual elements is not well suited to Computer Vision research. Information from the phonetic representation leaks into the visual information, producing classes that bear no direct connection to visual features. In turn, these classes become harder to classify and to understand.
The transliteration system described in this paper does not need expertise in the language. Anyone who recognizes the letters should be able to transliterate word images.
With visual transliteration it is possible to denote single elements that do not represent a sound in the language. No transliteration system for Arabic and similar languages represents diacritics as in the table of diacritics above. This is important for Computer Vision, since these items are visual elements on the page just as legitimate as any others. Without an independent representation, these items would have to be represented together with other elements, and they could not be classified independently.
The system is much more flexible than a phonetic transliteration system. It allows for the development of new letter signs by combining existing diacritics with base letters. For example, ﭪ is a letter not found in historical documents, but can be seen in modern Arabic to represent the v sound. Although not thought of beforehand, this can be represented as fo3 in the system we describe. There are also writing variations, for example, in Maghribi (Western) Arabic, the letter ف is written as ڢ but is phonetically identical. In a phonetic transliteration this difference is lost, but the visual transliteration is able to represent the usual case with fo1 and the specific case with fu1.
It is common in handwriting for the diacritics of one letter to appear attached to another. For example, three dots in the middle of سر may be read as شر or سژ. In a phonetic transliteration system, the first might be represented as şr and the second as sj, which makes such in-between cases complex to describe. In a visual transliteration system, the two are represented as so3r and sro3, so we can write rules that exchange o3 between neighboring letters and decide on the best reading in a later stage.
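A rule of this kind could be sketched in Python as follows. This is a minimal illustration and not the system's own implementation: the function name `shift_candidates` is hypothetical, and the sketch only generates the alternative readings, leaving the choice among them to a later stage.

```python
import re


def shift_candidates(letters, mark):
    """Return variants of `letters` with one `mark` moved to a neighboring letter."""
    # Match the mark as a whole code, so that "o3" is not found inside "o33".
    token = re.compile(re.escape(mark) + r"(?![0-9])")
    variants = []
    for i, code in enumerate(letters):
        if not token.search(code):
            continue
        stripped = token.sub("", code, count=1)
        # Try the letter before and the letter after the current one.
        for j in (i - 1, i + 1):
            if 0 <= j < len(letters):
                variant = list(letters)
                variant[i] = stripped
                variant[j] = letters[j] + mark
                variants.append(variant)
    return variants
```

For the word above, `shift_candidates(["so3", "r"], "o3")` returns `[["s", "ro3"]]`, so both readings are available to whatever later stage scores them.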
Diacritics are written especially loosely, both in handwriting and in print. For example, the three dots above ث are written as three separate dots in print but are usually contracted into a single shape in handwriting. For the print case we can write bo1o1o1, denoting the three dots separately, while for the handwriting case we use bo3; at a higher level, the former is converted to the latter by a rule stating that three separate dots are equivalent to a single three-dot mark.
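Such a normalization rule can be written as a single substitution. The Python sketch below is one way to do it. The name `merge_dots` is hypothetical, and applying the same rule to dots under a letter is an assumption, since the text only describes the case of dots above.

```python
import re

# Three consecutive single-dot codes in the same position (all over or all under).
THREE_SINGLE_DOTS = re.compile(r"(?P<pos>[ou])1(?P=pos)1(?P=pos)1(?![0-9])")


def merge_dots(code):
    """Rewrite three separate single dots as one three-dot mark."""
    return THREE_SINGLE_DOTS.sub(r"\g<pos>3", code)
```

For example, `merge_dots("so1o1o1r")` gives `"so3r"`, while a code with only two single dots, such as `"so1o1"`, is left unchanged.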
The table of Ottoman letters below lists all letters of the Ottoman alphabet with their transliterations.
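| Ottoman Letter | Transliteration | Ottoman Letter | Transliteration |
|---|---|---|---|
| ا | e | اَ | eo4 |
| اِ | eu4 | اُ | eo9 |
| أ | eo5 | إ | eu5 |
| آ | eo6 | ب | bu1 |
| پ | bu3 | ت | bo2 |
| ث | bo3 | ج | xu1 |
| چ | xu3 | ح | x |
| خ | xo1 | د | d |
| ذ | do1 | ر | r |
| ز | ro1 | ژ | ro3 |
| س | s | ش | so3 |
| ص | z | ض | zo1 |
| ط | t | ظ | to1 |
| ع | a | غ | ao1 |
| ف | fo1 | ق | fo2 |
| ك | lo5 | گ | ko7 |
| ل | l | م | m |
| ن | bo1 | ڭ | lo5o3 |
| ه | h | و | w |
| ی | y | ـیـ | bu2 |
| لا | lie | ک | k |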
Table: Ottoman Letters and Transliterations