Paper Review: Text Line Segmentation of Historical Documents: A Survey
Authors: Laurence Likforman-Sulem, Abderrazak Zahour, Bruno Taconet
URL: http://arxiv.org/pdf/0704.1267.pdf
Keywords:
- page segmentation
- overlapping components
- image quality
- document complexity
- preprocessing
- projection based
- smearing based
- grouping based
- Hough transform based
- repulsive attractive
- stochastic
- touching components
Q1: What are the most usable techniques for Ottoman divans?
Likforman-Sulem and Faure's technique, which uses Gestalt criteria to associate text elements, might be of use. Feldbach and Tönnies' work, which was tested on church registers, may also be helpful. The Hough transform may be used. The repulsive-attractive method of Oztop et al. is another option. The stochastic method of Tseng and Lee, which uses a probabilistic Viterbi algorithm, can also be used.
Q2: How are touching components delimited successfully?
A touching component can be detected by its size. After that, it should either be assigned to the lower or upper line, or be separated. A successful separation requires letter images or skeletons, which we lack.
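The size test and the "assign to a neighbouring line" option can be sketched as follows. This is only an illustration under assumptions of mine, not the survey's procedure: components are given as hypothetical bounding boxes `(x, y, w, h)`, "too large" means taller than `height_factor` times the median height, and a flagged component goes to the line whose vertical center is nearest. Actual separation is not attempted here.

```python
from statistics import median

def flag_touching_components(boxes, height_factor=1.8):
    """Return indices of boxes that are suspiciously tall.

    boxes: list of (x, y, w, h) bounding boxes (hypothetical input format).
    height_factor: assumed threshold relative to the median height.
    """
    if not boxes:
        return []
    typical = median(h for (_, _, _, h) in boxes)
    return [i for i, (_, _, _, h) in enumerate(boxes)
            if h > height_factor * typical]

def assign_to_line(box, line_centers):
    """Return the index of the line whose vertical center is closest
    to the vertical center of the box."""
    _, y, _, h = box
    cy = y + h / 2
    return min(range(len(line_centers)),
               key=lambda k: abs(line_centers[k] - cy))

# Three ordinary components and one tall component spanning two lines.
boxes = [(0, 0, 10, 20), (12, 0, 10, 22), (24, 0, 10, 21), (36, 0, 10, 45)]
suspects = flag_touching_components(boxes)
line_of_suspect = assign_to_line(boxes[suspects[0]], line_centers=[10, 40])
```

The threshold of 1.8 is arbitrary and would need tuning per collection; a median is used so that a few oversized components do not shift the reference height.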
Q3: How is the Hough transform used?
Centroids of the connected components (CCs) are used as the units of the Hough transform. Line hypotheses are generated in the Hough domain and verified in the image domain.
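A minimal sketch of this two-step idea follows, assuming CC centroids are already available as `(x, y)` pairs. Each centroid votes for all `(theta, rho)` cells it could lie on, and well-supported cells become line hypotheses. The verification step is a simple stand-in of my own (no large gap between successive centroids along the line), not the criterion used in the surveyed work. The parameters `n_theta`, `rho_step`, `min_votes`, and `max_gap` are illustrative.

```python
import math
from collections import defaultdict

def hough_line_hypotheses(centroids, n_theta=180, rho_step=5.0, min_votes=3):
    """Vote in (theta, rho) space; return hypotheses sorted by support.

    Each hypothesis is (theta, rho, member_indices) for the line
    x*cos(theta) + y*sin(theta) = rho.
    """
    acc = defaultdict(list)
    for idx, (x, y) in enumerate(centroids):
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            acc[(t, round(rho / rho_step))].append(idx)
    cells = [(t, r, m) for (t, r), m in acc.items() if len(m) >= min_votes]
    cells.sort(key=lambda c: -len(c[2]))
    return [(math.pi * t / n_theta, r * rho_step, m) for t, r, m in cells]

def verify_hypothesis(centroids, theta, members, max_gap=40.0):
    """Image-domain check (assumed criterion): successive member
    centroids, ordered along the line direction, must be close together."""
    along = sorted(-centroids[i][0] * math.sin(theta)
                   + centroids[i][1] * math.cos(theta) for i in members)
    return all(b - a <= max_gap for a, b in zip(along, along[1:]))

# Four centroids on a roughly horizontal text line plus one stray component.
centroids = [(10, 50), (30, 50), (50, 50), (70, 50), (40, 120)]
hypotheses = hough_line_hypotheses(centroids)
best_theta, best_rho, best_members = hypotheses[0]
accepted = verify_hypothesis(centroids, best_theta, best_members)
```

Because the accumulator is quantized, several neighbouring cells can collect the same votes, so the best angle is only approximately that of the true line.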
Q4: What problems are specific to non-Latin texts?
The baseline of Hebrew is at the upper part of the letters because of their box shape. Devanagari and similar scripts also have a headline on top of the letters. Diacritics and inter-letter shapes pose problems for Arabic.