Dervaze: A Transliteration System for Ottoman

Author: Emre Sahin
Date: <2014-01-08 Wed>

/Dervaze/ (meaning “the portal”) is a set of tools that aim to transliterate historical Ottoman documents into Modern Turkish. These will be hosted at http://dervaze.com in the near future.

In this document, I describe the transliteration system. The system is organized as a pipeline in which each stage produces the input of the next. The input to the system is a set of historical document images, and the output is either a search result or a textual representation of these documents.

The following sections describe these stages briefly.

=binarize= Color Images to Binary Images

Binarization converts color or grayscale images to black-and-white binary images. The document images come in various flavors, mostly as color images, and this color information is mostly noise for the later stages. Therefore we need to remove the color but keep the textual content intact.

Although easy to describe, this stage involves challenges such as determining the ink color or removing ink stains from the images.

In the literature, the usual approach is to use a mathematical model, such as Otsu’s method, to convert the image to binary. We look at this problem differently, as a classification problem. The idea is briefly the following:
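For reference, the standard approach mentioned above can be sketched in a few lines. This is a minimal numpy sketch of Otsu’s method on a single 8-bit grayscale channel; the tiny synthetic image is purely illustrative:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold maximizing between-class variance
    (Otsu's method) for an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # class probability up to each level
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean up to each level
    mu_total = mu[-1]
    # Between-class variance for every candidate threshold.
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)
    return int(np.argmax(sigma_b))

# Tiny synthetic example: two clearly separated intensity clusters.
img = np.array([[20, 25, 30], [200, 210, 220]], dtype=np.uint8)
t = otsu_threshold(img)
binary = img > t   # True = light background, False = dark ink
```

A single global threshold like this works well on clean scans but, as noted above, it collapses all color information into one cutoff, which motivates the channel-wise approach below.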

Color images consist of 3 channels. The ink color of a region should be persistently present or absent in these channels. For example, a dark blue ink should appear within similar numeric ranges in each channel, while a red ink should appear more strongly in the red channel than in the others. A standard binarization approach tries to derive a single cumulative ink color threshold from all three channels. We do it differently.

Instead of trying to find a cumulative threshold for binarization, we detect edges in each channel, treating each channel as a separate binary image. Once the components in each channel are found, they are evaluated by various features (such as size and presence in the other channels) and classified as text or non-text.

After classification, the text elements are drawn onto a canvas in black, and the document thus becomes binarized.
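The channel-wise idea can be sketched as follows. This is a minimal illustration, not Dervaze’s implementation: it substitutes a fixed darkness threshold for per-channel edge detection, uses 4-connected BFS labeling, and uses only two toy features — component size, and the fraction of the component’s pixels that are also dark in the other two channels:

```python
import numpy as np
from collections import deque

def connected_components(mask):
    """4-connected components of a boolean 'ink' mask (BFS labeling)."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    comps = []
    for y in range(h):
        for x in range(w):
            if mask[y, x] and not seen[y, x]:
                q, pix = deque([(y, x)]), []
                seen[y, x] = True
                while q:
                    cy, cx = q.popleft()
                    pix.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                comps.append(pix)
    return comps

def binarize_by_channels(rgb, thresh=128, min_size=2):
    """Find components per channel, classify each as text if its pixels
    are persistently dark in every channel and it is not a tiny speck,
    then draw text components in black on a white canvas."""
    h, w, _ = rgb.shape
    canvas = np.full((h, w), 255, dtype=np.uint8)
    dark = rgb < thresh                       # per-channel ink masks
    for c in range(3):
        for pix in connected_components(dark[:, :, c]):
            if len(pix) < min_size:
                continue                      # reject specks as non-text
            ys, xs = zip(*pix)
            # Toy feature: how dark is this component in the OTHER channels?
            others = [o for o in range(3) if o != c]
            support = min(dark[ys, xs, o].mean() for o in others)
            if support > 0.8:                 # persistent across channels
                canvas[ys, xs] = 0            # draw as black text
    return canvas

rgb = np.full((4, 4, 3), 255, dtype=np.uint8)
rgb[1, 0:3] = (30, 30, 30)     # dark ink stroke: dark in all channels
rgb[3, 0:3] = (200, 40, 40)    # reddish stain: dark only in G and B
out = binarize_by_channels(rgb)
```

In this toy example the dark stroke survives because it is dark in all three channels, while the reddish stain is rejected because the red channel does not support it — the kind of decision a single cumulative threshold cannot make.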

=extract-components= Convert binary images to components

Although the components are already extracted in the binarization stage, we extract them again in an independent stage in order to have well-defined inputs and outputs. The primary reason for this is to evaluate the performance of different binarization options. This stage works even if the binarization step uses another (standard) approach.

Binarized document images are converted to sets of components by finding their edges. Each component is recorded with its location and binary image.
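A self-contained sketch of this stage, assuming black-on-white binarized input and 4-connected components; the record layout (bounding box plus a small binary crop) is illustrative, not Dervaze’s actual data format:

```python
import numpy as np
from collections import deque

def extract_components(binary):
    """Extract ink components from a binarized page; record each with
    its location (bounding box) and its own small binary image."""
    ink = binary == 0                        # black pixels are ink
    h, w = ink.shape
    seen = np.zeros((h, w), dtype=bool)
    components = []
    for y in range(h):
        for x in range(w):
            if ink[y, x] and not seen[y, x]:
                q, pix = deque([(y, x)]), []
                seen[y, x] = True
                while q:
                    cy, cx = q.popleft()
                    pix.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and ink[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                ys, xs = zip(*pix)
                y0, y1, x0, x1 = min(ys), max(ys), min(xs), max(xs)
                crop = np.full((y1 - y0 + 1, x1 - x0 + 1), 255, dtype=np.uint8)
                for py, px in pix:
                    crop[py - y0, px - x0] = 0
                components.append({"box": (y0, x0, y1, x1), "image": crop})
    return components

page = np.full((5, 8), 255, dtype=np.uint8)
page[1, 1:4] = 0          # first stroke
page[3, 5:7] = 0          # second stroke
comps = extract_components(page)
```

Each entry keeps enough information to place the component back on the page, so later stages can work on components alone.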

=extract-features= Find features of components for comparison

Each component can have several different features, such as height, width, number of holes, and number of ascenders and descenders. Some of these features work better than others for classification, but we do not know beforehand which ones.

In order to find a set of good features, we extract all the features we can think of and run a Principal Component Analysis on them. Previously, these components had no class labels, so it was impossible to identify good features and we had to inspect the outputs manually. Now that we have labeled around 11,000 components from 50 handwritten pages, we can determine which features are better than others.
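The PCA step can be sketched with plain numpy. This is a hedged illustration on a synthetic feature matrix (the feature names height/width/holes are just examples): the explained-variance ratios and first-component loadings hint at which features carry the most variance, though note that variance alone does not guarantee class-separating power.

```python
import numpy as np

def pca_feature_weights(X):
    """PCA via SVD on a components-by-features matrix X.
    Returns explained-variance ratios per principal component and the
    absolute loadings of each feature on the first component."""
    Xc = X - X.mean(axis=0)                  # center each feature column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)          # variance ratio per component
    return explained, np.abs(Vt[0])          # |loadings| on PC1

# Toy feature matrix: rows = components, columns = (height, width, holes).
# 'height' varies a lot, 'width' follows it, 'holes' barely varies.
rng = np.random.default_rng(0)
height = rng.normal(50, 10, 200)
width = height * 0.5 + rng.normal(0, 1, 200)
holes = rng.normal(1, 0.1, 200)
X = np.column_stack([height, width, holes])
explained, loadings = pca_feature_weights(X)
```

Here the first component absorbs almost all the variance and loads heavily on height and width, flagging the near-constant holes feature as uninformative — the kind of ranking the labeled component set now makes it possible to validate.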

=extract-features= extracts a large set of features from the component set. These features are stored in CSV files and analyzed for their classification value.
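The CSV output can be produced with the standard library alone. A minimal sketch — the column names and values below are hypothetical, not Dervaze’s actual schema, and an in-memory buffer stands in for a file on disk:

```python
import csv
import io

# Hypothetical feature rows for two components.
rows = [
    {"component_id": 1, "height": 34, "width": 18, "holes": 1, "label": "text"},
    {"component_id": 2, "height": 5, "width": 4, "holes": 0, "label": "non-text"},
]

buf = io.StringIO()   # replace with open("features.csv", "w", newline="")
writer = csv.DictWriter(
    buf, fieldnames=["component_id", "height", "width", "holes", "label"]
)
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
```

One row per component keeps the file easy to load into any analysis tool for the PCA step described above.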