Dervaze (meaning "the portal") is a set of tools that aim to transliterate historical Ottoman documents to Modern Turkish.
Here, I describe the transliteration system. The system is organized as a pipeline in which the tools at each stage produce the input of the next stage. The input to the system is a set of historical document images. The output is either a search result or a textual representation of these documents.
The following sections describe these stages briefly.
binarize Color Images to Binary Images
Binarization converts color or grayscale images to black-and-white binary images. The document images come in various flavors, mostly as color images. For the later stages, color information is mostly noise, so it is better to remove the color while keeping the text intact.
Although seemingly easy at first, this stage includes challenges like determining the ink color or removing ink stains from images.
In the literature, the usual idea is to use a mathematical model, such as Otsu's method, to convert color to binary. We look at the problem differently, as a classification problem. The idea is briefly the following:
Color images consist of 3 channels. The ink color of a region should be persistently present or absent in these channels. For example, a dark blue ink should be represented within similar numeric ranges in each of these channels, while a red ink should be represented more in the red channel than in the others. A standard binarization approach tries to come up with a cumulative ink color value from all three channels. We do it differently.
Instead of trying to find a cumulative threshold for binarization, we detect edges in each channel, considering each channel as a separate binary image. Once the components in each channel are found, they are evaluated by various features (like size or presence in other channels) and classified as text or non-text.
After classification, the text elements are drawn in black on a canvas, and the document thus becomes binarized.
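The per-channel idea above can be illustrated with a small sketch. This is not the actual binarize tool: it assumes the image is a list of rows of (r, g, b) tuples, it replaces edge detection with a plain darkness threshold per channel, and it replaces the classifier with two hand-written rules. The names binarize, components, dark, min_size and min_overlap are all hypothetical.

```python
from collections import deque


def components(mask):
    """Return the 4-connected components of True pixels as lists of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            found.append(pixels)
    return found


def binarize(image, dark=128, min_size=3, min_overlap=0.5):
    """image: rows of (r, g, b) tuples. Returns rows of 0 (ink) / 1 (background)."""
    h, w = len(image), len(image[0])
    # Assumption: a simple threshold per channel stands in for edge detection.
    masks = [[[px[ch] < dark for px in row] for row in image] for ch in range(3)]
    canvas = [[1] * w for _ in range(h)]  # start with an all-white canvas
    for ch, mask in enumerate(masks):
        others = [m for i, m in enumerate(masks) if i != ch]
        for pixels in components(mask):
            # Feature 1: component size in pixels.
            size = len(pixels)
            # Feature 2: fraction of pixels that are also dark in another channel.
            overlap = sum(any(m[y][x] for m in others) for y, x in pixels) / size
            # Hypothetical rule-based "classifier": text if large and persistent.
            if size >= min_size and overlap >= min_overlap:
                for y, x in pixels:
                    canvas[y][x] = 0  # draw the text component in black
    return canvas
```

With these rules, a dark blue stroke that is dark in all three channels is kept, while a yellowish stain that is dark only in the blue channel and a single-pixel speck are both dropped. A trained classifier over more features would take the place of the two thresholds.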
extract-components Convert binary images to components
Although the components are already extracted in the binarization stage, we extract components in an independent stage in order to have a definite input and output. The primary reason for this is to evaluate the performance of different binarization options. This stage works even if the binarization part uses another (standard) approach.
Binarized document images are converted to sets of components by finding their edges. Each component is recorded with its location and binary image.
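As a rough illustration of this stage, the sketch below turns a binary image into a list of components, each recorded with its location and its own binary image. It is only a sketch under assumptions: it uses a flood fill over 8-connected ink pixels rather than edge finding, it assumes 0 means ink and 1 means background, and the function name and the dictionary keys are hypothetical.

```python
from collections import deque


def extract_components(binary):
    """binary: rows of 0 (ink) / 1 (background).

    Returns a list of dicts holding the top-left corner of each component
    and a cropped binary image that contains only that component.
    """
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    result = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] != 0 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit all 8 neighbours so diagonal strokes stay connected.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 0 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            top = min(y for y, _ in pixels)
            left = min(x for _, x in pixels)
            bottom = max(y for y, _ in pixels)
            right = max(x for _, x in pixels)
            # Copy only this component's pixels into its own small image.
            crop = [[1] * (right - left + 1) for _ in range(bottom - top + 1)]
            for y, x in pixels:
                crop[y - top][x - left] = 0
            result.append({"top": top, "left": left, "image": crop})
    return result
```

Because the function only needs a binary image, it can be run on the output of any binarization method, which is what makes comparing binarization options possible.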
extract-features Find features of components for comparison
Each component can have several different features, like height, width, number of holes, number of ascenders and descenders, etc. Some of these features work better than others in classification. However, we don't know beforehand which ones work better.
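To make a few of these features concrete, here is a minimal sketch that computes height, width and number of holes for one component. It assumes the component is a list of rows with 0 for ink and 1 for background, and it counts a hole as a background region that cannot be reached from outside the component. The function names are hypothetical, and ascenders and descenders are left out because they need baseline information that a single component does not carry.

```python
def count_holes(image):
    """Count background regions fully enclosed by ink (0 = ink, 1 = background)."""
    h, w = len(image) + 2, len(image[0]) + 2
    # Pad with a background frame so the outside background forms one region.
    grid = [[1] * w] + [[1] + list(row) + [1] for row in image] + [[1] * w]

    def fill(r, c):
        # Erase one 4-connected background region by overwriting it with 0.
        stack = [(r, c)]
        grid[r][c] = 0
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == 1:
                    grid[ny][nx] = 0
                    stack.append((ny, nx))

    fill(0, 0)  # remove the outside background first
    holes = 0
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 1:  # background that the outside fill never reached
                holes += 1
                fill(r, c)
    return holes


def extract_features(image):
    """Return a small feature dictionary for one component image."""
    return {
        "height": len(image),
        "width": len(image[0]),
        "holes": count_holes(image),
    }
```

For example, a 3x3 ring of ink around one background pixel gives height 3, width 3 and one hole, while a solid horizontal bar gives no holes.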
In order to find a set of good features, we extract all the features we can think of and run a Principal Component Analysis on them. Previously, as the components had no class labels, it was impossible to find good features directly and we had to check the outputs. Now that we have labeled around 11,000 components from 50 handwritten pages, we can determine which features are better than the others.
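As an illustration of the analysis step, the sketch below finds the first principal component of a feature table using power iteration on the correlation matrix. It is a simplified stand-in and not the analysis actually used: it needs at least two rows, it returns only the first component, and the function name and the iterations parameter are hypothetical. The loadings show how strongly each feature contributes to the direction of largest variance.

```python
import math


def first_principal_component(rows, iterations=200):
    """rows: equal-length feature vectors. Returns (loadings, explained_ratio)."""
    n, d = len(rows), len(rows[0])
    # Standardize each feature so that scale differences do not dominate.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) or 1.0
        for j in range(d)
    ]
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    # Covariance of the standardized data, i.e. the correlation matrix.
    cov = [
        [sum(row[a] * row[b] for row in z) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration. The uneven start vector avoids the common case where a
    # uniform start is orthogonal to the dominant direction.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate data, nothing to iterate on
            break
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    length_sq = sum(x * x for x in v)
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    ) / length_sq
    total = sum(cov[a][a] for a in range(d))
    return v, (eigenvalue / total if total else 0.0)
```

Features with large absolute loadings vary together along the main direction of the data. With labeled components, the same feature table can also be scored directly against the labels.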
extract-features extracts a large set of features from the component set.
These features are stored in CSV files and are analyzed for their classification value.
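Storing the features can be sketched with Python's standard csv module. The column names and the write_features helper below are assumptions for illustration. The actual file layout used by extract-features is not described here.

```python
import csv
import io


def write_features(rows, stream):
    """Write a list of feature dicts (all with the same keys) as CSV."""
    writer = csv.DictWriter(stream, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Two hypothetical components: location plus a few features each.
rows = [
    {"top": 1, "left": 1, "height": 2, "width": 2, "holes": 0},
    {"top": 2, "left": 4, "height": 2, "width": 1, "holes": 0},
]
buffer = io.StringIO()  # stands in for an open CSV file
write_features(rows, buffer)
```

Keeping one row per component and one column per feature makes it straightforward to load the file into any analysis tool later.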