Today's work was about splitting a page into its components and saving each one as a new image. Instead of artificial boundaries (like word or sentence boundaries), the labeling should rely on connected components.
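To make the idea concrete, here is a minimal sketch of cutting a page into connected components and turning each one into its own small image. It is not the code used in this project: the page is assumed to be already binarized into a grid of 0 (background) and 1 (ink), the names `connected_components` and `crop` are made up for illustration, and the choice of 8-connectivity (diagonal pixels count as touching) is an assumption.

```python
from collections import deque

def connected_components(page):
    """Return a list of components; each is a list of (row, col) ink pixels."""
    rows, cols = len(page), len(page[0])
    seen = [[False] * cols for _ in range(rows)]
    # Assumption: 8-connectivity, so diagonal neighbours count as touching.
    neighbours = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                  if (dy, dx) != (0, 0)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if page[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in neighbours:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and page[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components

def crop(pixels):
    """Cut one component out as its own image, using its tight bounding box."""
    top = min(y for y, _ in pixels)
    left = min(x for _, x in pixels)
    height = max(y for y, _ in pixels) - top + 1
    width = max(x for _, x in pixels) - left + 1
    image = [[0] * width for _ in range(height)]
    for y, x in pixels:
        image[y - top][x - left] = 1
    return image

# Toy page: a single-pixel "dot" above a horizontal stroke, plus a small hook.
page = [
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
crops = [crop(p) for p in connected_components(page)]
```

On this toy page the dot comes out as a component of its own, separate from the stroke beneath it. Each entry in `crops` could then be written to disk with whatever image library is at hand.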
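There are two problems here. The first is that in Arabic-based writing systems dots play a significant role, much more so than in Latin-based scripts: a dot is often the only thing that distinguishes one letter from another. Since each dot ends up as a separate connected component, these dots must be classified correctly.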
The second problem is that connected components are not always reliable. Some components are wrongly split into fragments that actually belong to a single component. We can't label them as is, and merging them back into a single component requires manual intervention, something we try to avoid.
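One way to at least narrow down the manual work is to flag pairs of components that look like fragments of the same stroke. The sketch below is only an illustration of that idea, not how we handle it: it assumes components are given as lists of (row, col) pixels, and the names `bounding_box` and `fragment_candidates` as well as the `max_gap` threshold are hypothetical. It only lists candidates for review and merges nothing.

```python
def bounding_box(pixels):
    """(top, left, bottom, right), inclusive, for a list of (row, col) pixels."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)

def fragment_candidates(components, max_gap=2):
    """Index pairs of components that share rows and lie close together.

    max_gap is a hypothetical threshold: the largest number of blank
    columns allowed between the two bounding boxes.
    """
    boxes = [bounding_box(p) for p in components]
    pairs = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            top_i, left_i, bottom_i, right_i = boxes[i]
            top_j, left_j, bottom_j, right_j = boxes[j]
            # The boxes share at least one row.
            shares_rows = top_i <= bottom_j and top_j <= bottom_i
            # Blank columns between the boxes (negative if they overlap).
            gap = max(left_i, left_j) - min(right_i, right_j) - 1
            if shares_rows and gap <= max_gap:
                pairs.append((i, j))
    return pairs

components = [
    [(2, 0), (2, 1), (2, 2)],  # left part of a broken stroke
    [(2, 4), (2, 5)],          # right part, one blank column away
    [(0, 1)],                  # a dot above the stroke
    [(2, 10)],                 # an unrelated component further along the line
]
candidates = fragment_candidates(components)
```

Here only the two halves of the broken stroke are flagged. The dot is left alone because it shares no rows with the stroke, but a dot sitting beside a stroke on the same rows would be flagged too, so this is a crude filter rather than a fix.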
In the coming days I'll try to illustrate these problems and how we treat them.