|
|
This is a block image labeling system. It does not assume that the blocks
to be processed are already segmented, nor that they contain homogeneous data.
It uses connected component (cc) analysis to label each block's contents
as small letter text, medium letter text, large letter text, graphics or
photographs, and gives the percentage of each of these components in terms
of the surface area it occupies. A recursive algorithm allows the result
of segmentation to be improved. The performance of the method
is given. |
Line forming
The document is deskewed if its skew angle exceeds a threshold beyond
which horizontal alignment is harmed. After the cc's are extracted and
those considered as noise are eliminated (i.e. those whose number of
black pixels or surface area is less than an a priori fixed threshold),
we merge the remaining cc's into bigger entities. The cc's are then grouped
to form lines. The line portions so formed are then merged into larger sets
to obtain real text lines and to discard those that are not horizontally
aligned.
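As a rough illustration of the noise-elimination and grouping steps above, the following Python sketch filters cc's by black-pixel count and surface area, then groups the survivors into line portions by vertical overlap. The `CC` record, the threshold values, and the overlap criterion are all assumptions made for this example; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass
class CC:
    # Hypothetical record for a connected component:
    # bounding box (x0, y0, x1, y1) plus its black-pixel count.
    x0: int
    y0: int
    x1: int
    y1: int
    black: int

    @property
    def area(self) -> int:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def remove_noise(ccs, min_black=4, min_area=9):
    # Drop cc's whose black-pixel count or surface area falls below a
    # fixed threshold. The default values are placeholders only.
    return [c for c in ccs if c.black >= min_black and c.area >= min_area]


def vertical_overlap(a, b):
    # Fraction of the shorter box's height that the two boxes share.
    shared = min(a.y1, b.y1) - max(a.y0, b.y0)
    shorter = min(a.y1 - a.y0, b.y1 - b.y0)
    return max(shared, 0) / shorter if shorter > 0 else 0.0


def group_into_lines(ccs, min_overlap=0.5):
    # Greedy left-to-right grouping (an assumed strategy): a cc joins the
    # first line whose last member it overlaps vertically; otherwise it
    # starts a new line portion.
    lines = []
    for c in sorted(ccs, key=lambda c: c.x0):
        for line in lines:
            if vertical_overlap(line[-1], c) >= min_overlap:
                line.append(c)
                break
        else:
            lines.append([c])
    return lines
```

A later pass would still be needed to merge line portions into full text lines and to reject groups that are not horizontally aligned; that step is not sketched here.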
|
|
Line classification
The classification of the formed lines is based on coefficients
extracted from their constituent cc's (such as size, density and the percentage
of the black segments whose width is approximately equal to
|
CC classification
When a cc is isolated, it is passed through a series of filters, on
the basis of attributes extracted from it, until its type is obtained.
In all, there are sixteen filters, which are applied in order. Many
thresholds are used in these filters, but they are determined beforehand
during a learning stage on many kinds of documents, thus
|
|
We have chosen to report the percentage of the surface area occupied by each type of component. However, we would like to give text a percentage close to the one we would have given visually. (The human eye often surrounds a text zone with an invisible rectangle and considers the interline spaces an integral part of the text.) Therefore, when a text line is obtained, we do not consider the total of the surface areas of its cc's, but the area of its circumscribing rectangle. Furthermore, when a rectangle circumscribing a cc is enclosed in a rectangle circumscribing another cc, the area of the former is subtracted from that of the latter (for example, a surrounded title).
The method has been tested on about 120 blocks for each class, chosen
from scientific journals such as those of IEEE, IBM and ACM, and from technical reports.
When the page images are not segmented beforehand, the method can be used
as a means of separating different media in the page. We observed that
Small Letter Text blocks are very rare. This reinforces the idea that the
notion of Small Letter Text is very subjective. In fact, each font has
a particular size for small letters. Therefore, in a multifont document,
Small Letter blocks are always confused with Medium Letter blocks.
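The surface-area rule described above (circumscribing rectangles, with enclosed rectangles subtracted from the rectangle that surrounds them) can be sketched in Python as follows. This is only a minimal illustration: the `(label, rectangle)` input format and the function names are hypothetical, percentages are taken relative to the sum of the labeled areas (an assumption, since the source does not state the denominator), and only a single level of nesting is handled.

```python
def rect_area(r):
    # r is (x0, y0, x1, y1); degenerate boxes count as zero area.
    x0, y0, x1, y1 = r
    return max(x1 - x0, 0) * max(y1 - y0, 0)


def encloses(outer, inner):
    # True when `inner` lies entirely within `outer`.
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def class_percentages(regions):
    # regions: list of (label, rect) pairs, where rect is the circumscribing
    # rectangle of a text line or of a non-text cc (assumed input format).
    totals = {}
    for i, (label, rect) in enumerate(regions):
        area = rect_area(rect)
        # Subtract every rectangle enclosed in this one, e.g. a title
        # surrounded by a frame. Assumes one level of nesting and no
        # identical rectangles.
        for j, (_, other) in enumerate(regions):
            if i != j and encloses(rect, other):
                area -= rect_area(other)
        totals[label] = totals.get(label, 0) + area
    grand = sum(totals.values())
    if grand == 0:
        return {}
    return {label: 100.0 * a / grand for label, a in totals.items()}
```

For instance, a 100 x 100 graphics frame enclosing a 50 x 20 title line yields 9000 units of graphics and 1000 units of text, so the title is not counted twice.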
|
|