Posted on 2012-07-27 :: Tags: text line segmentation, historical documents, Hough transform, document analysis, image processing

Authors: Laurence Likforman-Sulem, Abderrazak Zahour, Bruno Taconet

URL: http://arxiv.org/pdf/0704.1267.pdf

Keywords:

page segmentation
overlapping components
image quality
document complexity
preprocessing
projection based
smearing based
grouping based
hough transform based
repulsive attractive
stochastic
touching components

Q1: What are the most usable techniques for Ottoman divans?

Likforman-Sulem and Faure’s technique, which uses Gestalt criteria to group text elements, might be of use. Feldbach and Tönnies’ work, which was tested on church registers, may also be helpful. The Hough transform is another option, as is the repulsive-attractive method of Öztop et al. The stochastic method of Tseng and Lee, which uses a probabilistic Viterbi algorithm, could also be applied.

Q2: How are touching components successfully delimited?

A touching component can be detected by its size. It should then be either assigned to the upper or lower line, or split between the two. Successful separation requires letter images or skeletons (which we lack).
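
As a rough illustration of size-based detection, the sketch below flags connected components whose bounding-box height is much larger than the typical height on the page. The function name `flag_touching`, the use of the median as the "typical" height, and the threshold `factor=1.8` are assumptions made for this example and are not taken from the survey.

```python
from statistics import median

def flag_touching(heights, factor=1.8):
    """Return indices of components that are suspiciously tall.

    heights: bounding-box heights of the connected components, in pixels.
    factor:  hypothetical threshold; a component taller than
             factor * median height is flagged as possibly touching.
    """
    typical = median(heights)
    return [i for i, h in enumerate(heights) if h > factor * typical]

# Five ordinary components and one that spans two text lines.
heights = [20, 22, 19, 21, 45, 20]
suspects = flag_touching(heights)
print(suspects)
```

Flagged components would still need the second step described above: assignment to one line, or separation.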

Q3: How is the Hough transform used?

The centroids of the connected components (CCs) serve as the voting units of the Hough transform. Line hypotheses are generated in the Hough domain and then verified in the image domain.
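
The two stages can be sketched as follows, assuming NumPy and a list of (x, y) centroids. The function names, the angle range, the `rho_step` bin width, and the distance tolerance are all illustrative assumptions. The verification step here is a simple distance check and only a stand-in for whatever image-domain validation a real method would apply.

```python
import numpy as np

def hough_best_line(centroids, thetas_deg=np.arange(85, 96), rho_step=5.0):
    """Hough domain: return (theta_deg, rho, votes) of the strongest line.

    Each centroid votes for every candidate angle using
    rho = x*cos(theta) + y*sin(theta), quantized into bins of rho_step.
    """
    pts = np.asarray(centroids, dtype=float)
    thetas = np.deg2rad(thetas_deg)
    rhos = pts[:, 0:1] * np.cos(thetas) + pts[:, 1:2] * np.sin(thetas)
    bins = np.round(rhos / rho_step).astype(int)
    best = (None, None, 0)
    for j, theta in enumerate(thetas_deg):
        values, counts = np.unique(bins[:, j], return_counts=True)
        k = counts.argmax()
        if counts[k] > best[2]:
            best = (float(theta), float(values[k] * rho_step), int(counts[k]))
    return best

def line_members(centroids, theta_deg, rho, tol=3.0):
    """Image domain: indices of centroids within tol pixels of the line."""
    pts = np.asarray(centroids, dtype=float)
    t = np.deg2rad(theta_deg)
    dist = np.abs(pts[:, 0] * np.cos(t) + pts[:, 1] * np.sin(t) - rho)
    return [i for i, d in enumerate(dist) if d <= tol]

# Two roughly horizontal text lines: five centroids at y=10, three at y=50.
centroids = [(0, 10), (20, 10), (40, 10), (60, 10), (80, 10),
             (0, 50), (20, 50), (40, 50)]
theta, rho, votes = hough_best_line(centroids)
print(votes, line_members(centroids, theta, rho))
```

Restricting the angles to a band around 90 degrees reflects the assumption that text lines are close to horizontal; a skewed or fluctuating page would need a wider range.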

Q4: What are the problems specific to non-Latin texts?

In Hebrew, the baseline lies at the upper part of the letters because of their box shape. Devanagari and similar scripts likewise have a headline running along the top of the letters. In Arabic, diacritics and inter-letter shapes pose problems.
