Turning Ottoman Letters into Graphs (1).
Today’s work was about splitting a page into its components and saving each one as a new image. Instead of relying on artificial boundaries (such as word or sentence boundaries), the labeling should rely on connected components.
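To make the idea concrete, here is a minimal sketch of splitting a binarized page into connected components and cutting each one out as its own small image. It is not the pipeline used in this project: the pure-Python flood fill, the 8-connectivity choice, the function names `connected_components` and `crop`, and the toy `page` grid are all assumptions made for illustration (1 stands for ink, 0 for background).

```python
from collections import deque


def connected_components(grid):
    """Return a list of components; each is a list of (row, col) ink pixels.

    Assumes a binary grid (1 = ink) and 8-connectivity, so diagonal
    neighbours count as touching.
    """
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def crop(pixels):
    """Cut one component out into its own binary image (its bounding box)."""
    top = min(y for y, _ in pixels)
    left = min(x for _, x in pixels)
    height = max(y for y, _ in pixels) - top + 1
    width = max(x for _, x in pixels) - left + 1
    image = [[0] * width for _ in range(height)]
    for y, x in pixels:
        image[y - top][x - left] = 1
    return image


# Toy page: an isolated dot, a vertical stroke, and a hooked stroke.
page = [
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
parts = [crop(p) for p in connected_components(page)]
```

On the toy page this yields three separate images, and the isolated dot comes out as a component of its own.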
There are two problems here. The first is that dots play a significant role in Arabic-based writing systems, much more so than in Latin-based scripts. A dot is detached from the letter body, so it shows up as a separate connected component, and these dots must be classified correctly.
The second problem is that connected components are not always reliable. Some strokes that belong to a single component come out split into several pieces. We can’t label those pieces as they are, and merging them back into one component requires manual intervention, which we try to avoid.
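Both problems involve small components: a dot and a broken-off fragment can look alike when judged by size alone. As a hedged illustration only, the sketch below flags components whose pixel count is far below the page median, so that at least the candidates for review are narrowed down. The function name `flag_small_components`, the `ratio` threshold of 0.25, and the sample areas are hypothetical, and the heuristic does not decide whether a flagged component is a dot or a fragment.

```python
from statistics import median


def flag_small_components(areas, ratio=0.25):
    """Return indices of components whose area is below ratio * median area.

    `areas` is a list of pixel counts, one per connected component.
    The ratio is an arbitrary illustrative threshold, not a tuned value.
    """
    if not areas:
        return []
    cutoff = ratio * median(areas)
    return [i for i, area in enumerate(areas) if area < cutoff]


# Made-up pixel counts for six components on one page.
areas = [120, 4, 95, 110, 6, 130]
suspects = flag_small_components(areas)
```

Here the two tiny components (indices 1 and 4) are flagged; everything else is left alone.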
In the coming days, I’ll try to exemplify these problems and how we treat them.