Presentation

CLASSIFICATION

Principle

The system labels image blocks. It assumes neither that the blocks to be treated are already segmented nor that they contain homogeneous data. It relies on connected component analysis to label the contents of a block as small letter text, medium letter text, large letter text, graphics or photographs, and it reports the percentage of each of these components in terms of the surface area it occupies. It uses a recursive algorithm that makes it possible to improve on the result of segmentation. The performance of the method is reported.

Line forming

The document is deskewed if its skew angle exceeds the value beyond which horizontal alignment is harmed. The connected components (CCs) are then extracted, and those considered noise (i.e. those whose number of black pixels or surface area is below an a priori fixed threshold) are eliminated. The remaining CCs are merged into larger entities: they are first grouped into line portions, and the line portions so formed are then merged into larger sets to obtain real text lines and to discard those that are not horizontally aligned.
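The noise elimination and grouping steps can be illustrated with a minimal sketch. It is not the authors' implementation: the CC record, the threshold values (min_black, min_area, max_gap, min_overlap) and the greedy left-to-right grouping rule are all assumptions chosen for illustration, and the later merging of line portions into real text lines is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CC:
    """Hypothetical record for one connected component (bounding box + pixel count)."""
    x0: int
    y0: int
    x1: int  # exclusive right edge
    y1: int  # exclusive bottom edge
    black: int  # number of black pixels

    @property
    def area(self) -> int:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def remove_noise(ccs: List[CC], min_black: int = 4, min_area: int = 6) -> List[CC]:
    # A CC is treated as noise if its black-pixel count OR its area is below
    # a fixed threshold (threshold values here are illustrative assumptions).
    return [c for c in ccs if c.black >= min_black and c.area >= min_area]


def group_into_lines(ccs: List[CC], max_gap: int = 20,
                     min_overlap: float = 0.5) -> List[List[CC]]:
    # Greedy grouping (an assumption, not the paper's rule): scan CCs from left
    # to right and attach each one to the first line portion whose last CC is
    # horizontally close and vertically overlapping; otherwise start a new one.
    lines: List[List[CC]] = []
    for c in sorted(ccs, key=lambda cc: cc.x0):
        for line in lines:
            last = line[-1]
            overlap = max(0, min(last.y1, c.y1) - max(last.y0, c.y0))
            smaller_height = min(last.y1 - last.y0, c.y1 - c.y0)
            if c.x0 - last.x1 <= max_gap and overlap >= min_overlap * smaller_height:
                line.append(c)
                break
        else:
            lines.append([c])
    return lines


components = [
    CC(0, 0, 10, 10, 50),
    CC(12, 1, 22, 11, 40),
    CC(0, 30, 10, 40, 60),
    CC(5, 5, 6, 6, 1),  # a single-pixel speck, removed as noise
]
line_portions = group_into_lines(remove_noise(components))
```

With these illustrative values, the speck is discarded, the two components on the first row form one line portion, and the component on the lower row forms another.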

Line classification

The classification of the formed lines is based on coefficients extracted from their constituent CCs (such as size, density and the percentage of black segments whose width is approximately equal to that of the CC), as well as on the homogeneity of the spaces separating them. It proceeds in one of two ways, depending on the number of CCs in the line. When there is only one CC, it is passed through a series of filters to determine its type. Otherwise, the line is either cut into smaller sets according to the homogeneity of the spaces and sizes of its CCs, or classified globally.
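As an illustration of cutting a line where the spacing stops being homogeneous, the sketch below splits a line at any gap much larger than the typical gap. The criterion (a gap exceeding ratio times the median gap) and the function name split_on_gaps are assumptions made for this example; the text does not specify the actual homogeneity test, and the sketch ignores CC sizes.

```python
from statistics import median
from typing import List, Tuple

Box = Tuple[int, int]  # (x0, x1): horizontal extent of one CC, hypothetical format


def split_on_gaps(boxes: List[Box], ratio: float = 3.0) -> List[List[Box]]:
    """Cut a left-to-right sorted line wherever a gap is far above the median gap."""
    if len(boxes) < 3:
        # Too few gaps to judge homogeneity; keep the line whole.
        return [list(boxes)]
    gaps = [b[0] - a[1] for a, b in zip(boxes, boxes[1:])]
    typical = max(median(gaps), 1)  # guard against zero or negative medians
    segments: List[List[Box]] = [[boxes[0]]]
    for gap, box in zip(gaps, boxes[1:]):
        if gap > ratio * typical:
            segments.append([box])  # inhomogeneous space: start a new set
        else:
            segments[-1].append(box)
    return segments


line = [(0, 10), (12, 22), (24, 34), (80, 90), (92, 102)]
parts = split_on_gaps(line)
```

Here the 46-pixel gap stands out against a median gap of 2 pixels, so the line is cut into a set of three CCs and a set of two; each set could then be classified on its own.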

CC Classification

When a CC is isolated, it is passed through a series of filters, on the basis of attributes extracted from it, until its type is obtained. There are sixteen filters in all, applied in order. The filters use many thresholds, but these are determined beforehand during a learning stage on many kinds of documents, which ensures their stability.
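The ordered filtering can be pictured as a cascade in which each filter either returns a type or passes the CC on to the next filter. The sketch below shows only that control flow. The three filters, the Features fields and every threshold are hypothetical stand-ins, since the sixteen actual filters and their learned thresholds are not described here.

```python
from typing import Callable, List, NamedTuple, Optional


class Features(NamedTuple):
    """Hypothetical attributes extracted from an isolated CC."""
    width: int
    height: int
    density: float  # black pixels divided by bounding-box area


Filter = Callable[[Features], Optional[str]]


def thin_rule_filter(f: Features) -> Optional[str]:
    # Illustrative threshold: long, very thin components look like ruled lines.
    return "graphics" if f.height <= 3 and f.width >= 50 else None


def dense_region_filter(f: Features) -> Optional[str]:
    # Illustrative threshold: large and dense components look like photographs.
    if f.width * f.height >= 10_000 and f.density >= 0.4:
        return "photograph"
    return None


def height_filter(f: Features) -> Optional[str]:
    # Illustrative fallback: bucket the remaining components by height.
    if f.height < 12:
        return "small letter text"
    if f.height < 30:
        return "medium letter text"
    return "large letter text"


FILTERS: List[Filter] = [thin_rule_filter, dense_region_filter, height_filter]


def classify_cc(f: Features, filters: List[Filter] = FILTERS) -> str:
    # Apply the filters in order; the first one that returns a type decides.
    for flt in filters:
        label = flt(f)
        if label is not None:
            return label
    return "unknown"


label = classify_cc(Features(width=10, height=20, density=0.3))
```

Keeping the filters in an ordered list makes the order of application explicit and lets thresholds obtained in a learning stage be swapped in without changing the cascade itself.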

Experiments

We have chosen to use the percentage of the surface area of each type of component. However, we would like to assign to text a percentage close to the one a human observer would give visually (the human eye often surrounds a text zone with an invisible rectangle and considers the interline spaces an integral part of the text). Therefore, when a text line is obtained, we do not consider the total of the surface areas of its CCs, but the area of its circumscribing rectangle. Furthermore, when a rectangle circumscribing a CC is enclosed in a rectangle circumscribing another CC, the area of the former is subtracted from that of the latter (for example, a surrounded title).
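The area accounting can be sketched as follows. Each labeled region contributes the area of its circumscribing rectangle, minus the area of any rectangle it encloses. The function names, the (x0, y0, x1, y1) rectangle format and the choice to normalize by the sum of attributed areas are assumptions for this example. The sketch also assumes that enclosed rectangles are not themselves nested or overlapping, since deeper nesting would be subtracted more than once.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), hypothetical format


def rect_area(r: Rect) -> int:
    x0, y0, x1, y1 = r
    return max(0, x1 - x0) * max(0, y1 - y0)


def encloses(outer: Rect, inner: Rect) -> bool:
    # True when inner lies entirely within outer (identical rectangles excluded).
    return (outer != inner
            and outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def class_percentages(regions: List[Tuple[str, Rect]]) -> Dict[str, float]:
    """Share of attributed area per label, after subtracting enclosed rectangles."""
    totals: Dict[str, float] = defaultdict(float)
    for i, (label, rect) in enumerate(regions):
        a = rect_area(rect)
        for j, (_, other) in enumerate(regions):
            if i != j and encloses(rect, other):
                a -= rect_area(other)  # the enclosed area belongs to the inner region
        totals[label] += max(a, 0)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: 100.0 * a / grand_total for label, a in totals.items()}


# A graphic frame surrounding a title: the title's rectangle is removed
# from the frame's area before the percentages are computed.
regions = [
    ("graphics", (0, 0, 100, 100)),
    ("large letter text", (10, 10, 60, 30)),
]
shares = class_percentages(regions)
```

In this example the frame keeps 9000 of its 10000 pixels after the title's 1000 pixels are subtracted, giving shares of 90% graphics and 10% large letter text.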

The method has been tested on about 120 blocks for each class, chosen from scientific journals (IEEE, IBM, ACM, etc.) and technical reports. When the page images are not segmented beforehand, the method can be used as a means of separating the different media in the page. We observed that Small Letter Text blocks are very rare. This reinforces the idea that the notion of Small Letter Text is very subjective. In fact, each font has a particular size for small letters. Therefore, in a multifont document, Small Letter Text blocks are always confused with Medium Letter Text blocks.