Turning Ottoman Letters into Graphs (1)

Posted on 2012-09-22 :: Tags: Ottoman, Arabic, connected components, character recognition, document processing, graphs

Today’s work was about splitting a page into its components and saving each one as a new image. Instead of artificial boundaries (like word or sentence boundaries), the labeling should rely on connected components.
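
To make the idea concrete, here is a minimal sketch of connected-component extraction in plain Python. It is not the code used in this project: it assumes the page has already been binarized into a grid of 0s (background) and 1s (ink), it treats diagonal neighbors as connected (8-connectivity), and the names `connected_components`, `crop`, and `page` are made up for illustration.

```python
from collections import deque


def connected_components(image):
    """Return one list of (row, col) pixels per 8-connected ink region.

    `image` is a list of rows; 1 marks ink, 0 marks background.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def crop(pixels):
    """Cut one component out as its own small binary image."""
    top = min(y for y, _ in pixels)
    left = min(x for _, x in pixels)
    height = max(y for y, _ in pixels) - top + 1
    width = max(x for _, x in pixels) - left + 1
    out = [[0] * width for _ in range(height)]
    for y, x in pixels:
        out[y - top][x - left] = 1
    return out


# A toy "page": one three-pixel stroke and two isolated pixels.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
pieces = [crop(p) for p in connected_components(page)]
```

Each entry of `pieces` is a small image holding exactly one component, which could then be written to disk with whatever image library is at hand.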

There are two problems here. The first is that dots play a much more significant role in Arabic-based writing systems than in Latin-based scripts: letters such as ب, ت, and ث share the same body and differ only in their dots. Therefore, these dots should be classified correctly.
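
One rough way to think about dot handling is sketched below. It is only an illustration under loud assumptions, not the method used here: a component counts as a dot candidate when it is small and roughly square, and each candidate is attached to the non-dot component whose horizontal center is closest. The thresholds `max_area` and `max_aspect` and the function names are invented for the example and would need tuning on real scans.

```python
def bounding_box(pixels):
    """Return (top, left, bottom, right) of a list of (row, col) pixels."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def is_dot_candidate(pixels, max_area=9, max_aspect=2.0):
    """Heuristic: small, roughly square components are dot candidates.

    Both thresholds are illustrative assumptions, not measured values.
    """
    top, left, bottom, right = bounding_box(pixels)
    height = bottom - top + 1
    width = right - left + 1
    aspect = max(height, width) / min(height, width)
    return len(pixels) <= max_area and aspect <= max_aspect


def center(pixels):
    """Return the (row, col) center of a component's bounding box."""
    top, left, bottom, right = bounding_box(pixels)
    return (top + bottom) / 2, (left + right) / 2


def attach_dots(components):
    """Map each dot index to (nearest body index, 'above' or 'below')."""
    dots = [i for i, p in enumerate(components) if is_dot_candidate(p)]
    bodies = [i for i, p in enumerate(components) if not is_dot_candidate(p)]
    result = {}
    for d in dots:
        if not bodies:
            break
        dot_y, dot_x = center(components[d])
        nearest = min(bodies, key=lambda b: abs(center(components[b])[1] - dot_x))
        # Row indices grow downward, so a smaller row means "above".
        side = "above" if dot_y < center(components[nearest])[0] else "below"
        result[d] = (nearest, side)
    return result


# Two horizontal strokes, one with a dot below and one with a dot above.
body_a = [(2, x) for x in range(0, 6)]
dot_a = [(4, 2)]
body_b = [(2, x) for x in range(10, 16)]
dot_b = [(0, 12), (0, 13)]
pairs = attach_dots([body_a, dot_a, body_b, dot_b])
```

Recording whether a dot sits above or below its body matters because that position is part of what distinguishes one letter from another. Nearest-center matching is crude and would misassign dots that fall between two letters.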

The second problem is that the connected components are not always reliable. Some shapes that should form a single component come out split into several pieces. We can’t label them as they are, and uniting them into a single component requires manual intervention, something we try to avoid.
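
As a hedged illustration of the splitting problem, the sketch below lists pairs of components whose bounding boxes are separated by at most a pixel or so of background. This is an assumption about how one might narrow down the search, not the treatment used here; `box_gap`, `merge_candidates`, and the `max_gap` threshold are hypothetical.

```python
def bounding_box(pixels):
    """Return (top, left, bottom, right) of a list of (row, col) pixels."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def box_gap(a, b):
    """Count background pixels separating two boxes (0 if they touch or overlap)."""
    top_a, left_a, bottom_a, right_a = a
    top_b, left_b, bottom_b, right_b = b
    dx = max(left_b - right_a - 1, left_a - right_b - 1, 0)
    dy = max(top_b - bottom_a - 1, top_a - bottom_b - 1, 0)
    return max(dx, dy)


def merge_candidates(components, max_gap=1):
    """Return index pairs of components that sit within `max_gap` of each other."""
    boxes = [bounding_box(p) for p in components]
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_gap(boxes[i], boxes[j]) <= max_gap:
                pairs.append((i, j))
    return pairs


# A stroke broken by a one-pixel gap, plus an unrelated far-away pixel.
stroke_left = [(0, 0), (0, 1), (0, 2)]
stroke_right = [(0, 4), (0, 5)]
far_away = [(0, 20)]
suspects = merge_candidates([stroke_left, stroke_right, far_away])
```

A list like `suspects` only points at places worth a look. A dot sitting close to its letter would be flagged as well, so this does not remove the manual step; at best it shortens it.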

In the coming days, I’ll try to illustrate these problems and how we handle them.