Today’s work was about splitting a page into its components and saving each one as a new image. Instead of artificial boundaries (such as word or sentence boundaries), the labeling should rely on connected components.
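To make the idea concrete, here is a minimal sketch of extracting connected components from a binarized page and cropping each one into its own small image. It is not the project's actual code: the function names `connected_components` and `crop`, the choice of 8-connectivity, and the list-of-rows image representation are all assumptions made for illustration.

```python
from collections import deque


def connected_components(image):
    """Group 8-connected ink pixels (value 1) of a binary image.

    `image` is a list of rows of 0/1 ints. Returns a list of components,
    each a list of (row, col) coordinates. Hypothetical helper.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def crop(pixels):
    """Redraw one component inside its tight bounding box."""
    rows = [y for y, _ in pixels]
    cols = [x for _, x in pixels]
    top, left = min(rows), min(cols)
    out = [[0] * (max(cols) - left + 1) for _ in range(max(rows) - top + 1)]
    for y, x in pixels:
        out[y - top][x - left] = 1
    return out


# Tiny made-up "page": a three-pixel stroke, a two-pixel stroke, one lone dot.
page = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
]
parts = connected_components(page)
crops = [crop(p) for p in parts]
```

On a real scan the same job would more likely be done with an image library's labeling routine; the pure-Python version is only meant to show what "a component" means here.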

There are two problems here. The first is that dots play a significant role in Arabic-based writing systems, much more so than in Latin-based scripts. Since a dot is detached from the letter body, it shows up as a separate component, so these dots must be classified correctly.
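One simple way to get a first list of dot candidates is to flag components that are much smaller than the typical component on the page. This is only a sketch under my own assumptions, not the method we settled on: the function name `dot_candidates` and the `ratio` threshold are hypothetical, and size alone cannot tell a dot from a speck of noise.

```python
from statistics import median


def dot_candidates(pixel_counts, ratio=0.25):
    """Return indices of components much smaller than the typical one.

    `pixel_counts` holds one ink-pixel count per connected component.
    `ratio` is a hypothetical threshold chosen for illustration.
    """
    if not pixel_counts:
        return []
    typical = median(pixel_counts)
    return [i for i, n in enumerate(pixel_counts) if n < ratio * typical]


# Made-up pixel counts for six components on a page.
counts = [120, 9, 135, 110, 7, 98]
dots = dot_candidates(counts)
```

Using the median rather than the mean keeps a few very large components from skewing what counts as "typical".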

The second problem is that the connected components are not always reliable. Some are fragments of what should be a single component, wrongly split apart. We can’t label them as they are, and uniting them into a single component requires manual intervention, which is something we try to avoid.

In the coming days, I’ll try to illustrate these problems and how we handle them.