Dervaze: A Transliteration System for Ottoman

Posted on 2014-01-07 :: Tags: Dervaze, Transliteration, Binarization, OCR, Image Processing

Dervaze (meaning “the portal”) is a set of tools that aims to transliterate historical Ottoman documents to Modern Turkish.

Here, I describe the transliteration system. The system is organized as a pipeline in which the tools at each stage produce the input for the next. The input to the system is a set of historical document images. The output is either a search result or a textual representation of these documents.

The sections below describe these stages briefly.

Binarize Color Images to Binary Images

Binarization is the process of converting color or grayscale images to black-and-white binary images. Document images come in various formats, mostly in color. For the later stages, color information is mostly noise, so it is better to remove the color while keeping the text intact.

Although seemingly easy at first, this stage includes challenges like determining the ink color or removing ink stains from images.

In the literature, the standard approach is to use a mathematical model, such as Otsu’s method, to compute a threshold that converts the image to binary. We approach the problem differently, as a classification problem. Briefly, the idea is as follows:

Color images consist of three channels. The ink color of a region should be consistently present or absent in these channels. For example, a dark blue ink should fall within similar numeric ranges in each channel, whereas a red ink should be represented more strongly in the red channel than in the others. A standard binarization approach tries to come up with a single cumulative ink color value using all three channels. We do it differently.
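For comparison, here is a minimal sketch of the standard thresholding approach, using Otsu’s method on a grayscale image. It is not part of Dervaze: the function names, the choice of treating pixels at or below the threshold as ink, and the tiny test image are all assumptions made for illustration. A color image would first have to be reduced to one value per pixel.

```python
def otsu_threshold(gray):
    """Return the cutoff t that maximizes the between-class variance.

    `gray` is a list of rows of integers in 0..255. Pixels <= t are
    treated as ink (dark), pixels > t as background.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count = 0   # number of pixels with value <= t
    dark_sum = 0     # sum of their intensities
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0:
            continue
        if light_count == 0:
            break
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        # Between-class variance, up to a constant factor of 1 / total**2.
        score = dark_count * light_count * (dark_mean - light_mean) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(gray, t):
    """Map a grayscale image to 1 (ink) / 0 (background)."""
    return [[1 if value <= t else 0 for value in row] for row in gray]


# Tiny made-up image: three dark "ink" pixels on a light background.
gray = [[20, 30, 200],
        [25, 210, 220]]
t = otsu_threshold(gray)
binary = binarize(gray, t)
```

A single global cutoff like this works when ink and paper separate cleanly in intensity. It has no way to tell a dark stain from dark ink, which is the case the classification approach targets.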

Instead of trying to find a cumulative threshold for binarization, we detect edges in each channel, considering each channel as a separate binary image. Once the components in each channel are found, they are evaluated on various features (such as size and presence in other channels) and classified as text or non-text.

After classification, the text components are drawn in black on a canvas, which yields the binarized document.
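The idea can be illustrated with a toy sketch. It makes several simplifying assumptions that are not taken from Dervaze: a fixed cutoff of 128 per channel stands in for edge detection, components are found by a 4-connected flood fill, and a hand-written rule on two features (size, and overlap with at least one other channel) stands in for a trained classifier. All names and threshold values are hypothetical.

```python
from collections import deque


def channel_mask(image, c, cutoff=128):
    """1 where channel c is dark, else 0. The fixed cutoff is an assumption."""
    return [[1 if px[c] < cutoff else 0 for px in row] for row in image]


def components(mask):
    """4-connected components of a 0/1 mask, each as a list of (y, x)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            found.append(pixels)
    return found


def binarize_by_classification(image, min_size=2, min_overlap=0.5):
    """Draw components judged to be text in black (0) on a white (255) canvas."""
    h, w = len(image), len(image[0])
    masks = [channel_mask(image, c) for c in range(3)]
    canvas = [[255] * w for _ in range(h)]
    for c in range(3):
        others = [masks[o] for o in range(3) if o != c]
        for pixels in components(masks[c]):
            # Feature 1: size in pixels.
            size = len(pixels)
            # Feature 2: best fraction of pixels also dark in another channel.
            overlap = max(sum(m[y][x] for y, x in pixels) / size
                          for m in others)
            # Stand-in rule for the classifier (hypothetical thresholds).
            if size >= min_size and overlap >= min_overlap:
                for y, x in pixels:
                    canvas[y][x] = 0
    return canvas


PAPER, INK = (240, 240, 240), (20, 20, 80)      # light paper, dark blue ink
STAIN, SPECK = (230, 230, 100), (10, 10, 10)    # yellowish stain, 1-pixel dot
image = [
    [INK,   INK,   PAPER, STAIN],
    [PAPER, INK,   PAPER, STAIN],
    [SPECK, PAPER, PAPER, PAPER],
]
canvas = binarize_by_classification(image)
```

In this made-up image the stain is dark only in the blue channel and the speck is too small, so only the three ink pixels end up black on the canvas.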

Extract Components: Convert Binary Images to Components

Although the components are already extracted during the binarization stage, we extract them again in an independent stage so that the stage has a well-defined input and output. The primary reason is to be able to evaluate the performance of different binarization options. This stage works even if the binarization stage uses another (standard) approach.

Binarized document images are converted to sets of components by finding their edges. Each component is recorded with its location and binary image.
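As a rough illustration of this stage, the sketch below turns a binary image into a list of components, each recorded with its location and its own cropped binary image. It uses an 8-connected flood fill rather than edge tracing, which is a swapped-in technique chosen for brevity; the function name and the record layout (`x`, `y`, `image`) are assumptions.

```python
def extract_components(binary):
    """Split a 0/1 image (1 = ink) into 8-connected components.

    Each component is a dict with the top-left corner of its bounding
    box ("x", "y") and a cropped 0/1 image holding only its own pixels.
    """
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    result = []
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            # Flood fill from this unvisited ink pixel.
            seen[y][x] = True
            stack, pixels = [(y, x)], []
            while stack:
                cy, cx = stack.pop()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            # Bounding box and cropped image of the component.
            top = min(p[0] for p in pixels)
            bottom = max(p[0] for p in pixels)
            left = min(p[1] for p in pixels)
            right = max(p[1] for p in pixels)
            crop = [[0] * (right - left + 1) for _ in range(bottom - top + 1)]
            for py, px in pixels:
                crop[py - top][px - left] = 1
            result.append({"x": left, "y": top, "image": crop})
    return result


page = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
parts = extract_components(page)
```

Because the function only needs a 0/1 image, it can be run on the output of any binarization method, which is what makes comparing binarization options possible.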

Extract Features: Find Features of Components for Comparison

Each component can have several different features, such as height, width, number of holes, and number of ascenders and descenders. Some of these features work better than others in classification, but we don’t know beforehand which ones.
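To make a few of these features concrete, here is a small sketch that computes height, width, ink pixel count, and number of holes for a component given as a 0/1 image. The function names and the returned dictionary layout are assumptions. Holes are counted as background regions that do not reach the image border. Ascenders and descenders are left out because they need a baseline estimate.

```python
def count_holes(image):
    """Count background regions fully enclosed by ink (1 = ink)."""
    h, w = len(image), len(image[0])
    # Pad with a one-pixel background border so that all background
    # touching the edge merges into a single outer region.
    padded = ([[0] * (w + 2)]
              + [[0] + list(row) + [0] for row in image]
              + [[0] * (w + 2)])
    seen = [[False] * (w + 2) for _ in range(h + 2)]
    regions = 0
    for y in range(h + 2):
        for x in range(w + 2):
            if padded[y][x] or seen[y][x]:
                continue
            regions += 1
            seen[y][x] = True
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                # 4-connectivity for background, so a diagonal ink
                # stroke still separates the regions on either side.
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h + 2 and 0 <= nx < w + 2
                            and not padded[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return regions - 1  # subtract the outer region


def basic_features(image):
    """Return a few simple shape features of a component image."""
    return {
        "height": len(image),
        "width": len(image[0]),
        "ink_pixels": sum(sum(row) for row in image),
        "holes": count_holes(image),
    }


ring = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
features = basic_features(ring)
```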

To find a set of good features, we extract all the features we can think of and run a Principal Component Analysis (PCA) on them. Previously, since we lacked labels for these components, it was impossible to find good features automatically, and we had to check the outputs manually. However, now that we have labeled around 11,000 components from 50 handwritten pages, we can determine which features are better than others.
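As a rough sketch of the PCA step, the code below standardizes a small feature table, builds its covariance matrix, and finds the first principal component by power iteration. This is only an illustration under assumptions: a real analysis would normally use a numerical library and look at more than one component, and the feature values here are made up.

```python
import math


def standardize(rows):
    """Scale every column to zero mean and unit (population) variance."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # A constant column has zero deviation; divide by 1.0 in that case.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]


def covariance(rows):
    """Covariance matrix of rows that already have zero-mean columns."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Dominant eigenvector and eigenvalue of cov by power iteration.

    Starts from a uniform vector, so it assumes that vector is not
    orthogonal to the dominant eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


# Made-up feature table: (height, width, holes) for four components.
rows = [(10, 20, 0), (20, 40, 1), (30, 60, 1), (40, 80, 0)]
cov = covariance(standardize(rows))
loadings, eigenvalue = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
```

In this made-up table, height and width are perfectly correlated and the hole count is uncorrelated with both, so the first component loads equally on height and width and explains about two thirds of the total variance. Features that move together like this carry overlapping information.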

extract-features extracts a large set of features from the component set. These features are stored in CSV files and analyzed for their classification value.
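The actual column layout written by extract-features is not described here, so the following is only a hypothetical example of storing per-component features as CSV and reading them back for analysis. The column names and helper functions are assumptions.

```python
import csv
import io

# Hypothetical column names; the real tool may use different ones.
FIELDS = ["component_id", "height", "width", "holes"]


def write_features(rows, out):
    """Write one CSV row per component, with a header line."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_features(src):
    """Read the CSV back, converting every value to an integer."""
    return [{key: int(value) for key, value in row.items()}
            for row in csv.DictReader(src)]


rows = [
    {"component_id": 0, "height": 3, "width": 3, "holes": 1},
    {"component_id": 1, "height": 1, "width": 3, "holes": 0},
]

# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_features(rows, buffer)
buffer.seek(0)
loaded = read_features(buffer)
```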
