Presentation

DSS

This software was designed by Didier Grzejszczak, improved a first time in 2012 by Hervé Locteau, and improved a second time by Ahmad Montaser Awal between September 2013 and August 2014.

First version

The principle:
The system is composed of several sub-steps:

Preprocessing:

Extraction of connected components (CC)
Border removal by rules based on the CC shape
Salt/pepper suppression by a modified KFill algorithm
Correction of the tilt and orientation of the document by the RAST algorithm
Exclusion of CCs that are too small or too large
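
As an illustration of the first and last preprocessing steps, the following minimal sketch extracts connected components from a small binary image and discards those that are too small or too large. The function names, the use of 8-connectivity and the pixel-count criterion are assumptions made for this example; the actual system may rely on other size measures (for instance bounding-box dimensions).

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected foreground (1) components as lists of (y, x) pixels."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def filter_by_size(components, min_pixels, max_pixels):
    """Keep components whose pixel count lies in [min_pixels, max_pixels] (assumed criterion)."""
    return [c for c in components if min_pixels <= len(c) <= max_pixels]


image = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
components = connected_components(image)
kept = filter_by_size(components, min_pixels=2, max_pixels=5)
```

In this toy image the isolated pixel and the long bottom stroke are rejected, and only the 2x2 block is kept.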

Segmentation into pseudo lines and pseudo-words:

This is done by a double smearing: a first smearing pass extracts the document lines, then an analysis of the gaps between the CCs finds the pseudo-words.
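
The two ideas above can be sketched as follows. The first function is a plain horizontal run-length smearing of one pixel row, and the second splits the CCs of one line wherever the horizontal gap exceeds a threshold. Both are illustrative only: the function names, the threshold values and the exact gap definition (left edge of the next CC minus the right edge reached so far) are assumptions, not the system's actual parameters.

```python
def smear_row(row, max_gap):
    """Fill background runs (0) of length <= max_gap lying between two foreground pixels (1)."""
    out = list(row)
    last_fg = None
    for x, value in enumerate(row):
        if value == 1:
            if last_fg is not None and 0 < x - last_fg - 1 <= max_gap:
                for k in range(last_fg + 1, x):
                    out[k] = 1
            last_fg = x
    return out


def smear_horizontally(image, max_gap):
    """Apply the row smearing to every row of a binary image."""
    return [smear_row(row, max_gap) for row in image]


def split_into_pseudo_words(extents, gap_threshold):
    """Group (x_min, x_max) extents of one line; a gap above the threshold starts a new group."""
    groups = []
    for extent in sorted(extents):
        if groups and extent[0] - max(e[1] for e in groups[-1]) <= gap_threshold:
            groups[-1].append(extent)
        else:
            groups.append([extent])
    return groups


smeared = smear_row([1, 0, 0, 1, 0, 0, 0, 0, 1, 0], max_gap=2)
words = split_into_pseudo_words([(20, 25), (0, 4), (6, 9), (27, 30)], gap_threshold=3)
```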

Segmentation and classification

Unlike most existing work, the document is first segmented into pseudo-lines before being segmented into pseudo-words. A pseudo-line is a set of connected components whose horizontal and vertical distances are below given thresholds. A pseudo-word is a set of connected components belonging to the same pseudo-line. The extraction relies on an adjacency and linear graph. Each pseudo-word is described by several descriptors providing 137 features. A multi-class SVM classifies each pseudo-word as handwritten text, printed text or noise.
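
A minimal sketch of the pseudo-line definition, assuming each CC is represented by its bounding box (x_min, y_min, x_max, y_max): two CCs are linked when both their horizontal gap and their vertical offset are below a threshold, and the pseudo-lines are the connected groups of the resulting adjacency graph. The distance definitions, the threshold values and the function names are assumptions made for the example, and the linear graph mentioned above is not modelled here.

```python
def horizontal_gap(a, b):
    """Horizontal distance between two boxes (0 if they overlap horizontally)."""
    return max(0, max(a[0], b[0]) - min(a[2], b[2]))


def vertical_offset(a, b):
    """Distance between the vertical centres of two boxes."""
    return abs((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2)


def group_into_pseudo_lines(boxes, max_h_gap, max_v_offset):
    """Return lists of box indices, one list per pseudo-line."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of boxes that satisfies both thresholds.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if (horizontal_gap(boxes[i], boxes[j]) <= max_h_gap
                    and vertical_offset(boxes[i], boxes[j]) <= max_v_offset):
                parent[find(i)] = find(j)

    lines = {}
    for i in range(len(boxes)):
        lines.setdefault(find(i), []).append(i)
    return sorted(lines.values())


boxes = [
    (0, 0, 10, 10), (12, 1, 20, 11), (22, 0, 30, 10),  # first text line
    (0, 40, 10, 50), (13, 41, 22, 51),                 # second text line
]
pseudo_lines = group_into_pseudo_lines(boxes, max_h_gap=5, max_v_offset=3)
```

Note that boxes 0 and 2 are too far apart to be linked directly; they end up in the same pseudo-line only through box 1, which is what the graph formulation provides.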

Contextual re-labelling

Contextual information from the neighbourhood of each pseudo-word is used to re-label it. Different techniques have been used.

Results

Training was done on a DB of 75 documents (10173 pseudo-words, distributed as H: 2357; P: 4892; N: 2924, where H, P and N denote handwritten, printed and noise). The test DB contains 24 documents (~7669 pseudo-words, distributed as H: 1191; P: 4218; N: 1934). All documents are labelled at the pixel level.

Second version

Ahmad Montaser Awal contributed several improvements, described below:

New datasets

Training DB_T0 (new): 107 documents (32706 pseudo-words), H: 5888; P: 18078; N: 8740. Test DB0 (new): 202 documents (~82142 pseudo-words), H: 11970; P: 43705; N: 25190. All documents are labelled at the pixel level.

Horizontal overlapping

Line detachment

MLP and classifier combination

The outputs of the MLP and the SVM are combined.
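
The combination rule is not detailed here. As one common possibility, and purely as an assumption, the sketch below combines the two classifiers by a weighted average of their per-class scores and picks the class with the highest combined score. The score values, the weight and the function names are hypothetical.

```python
CLASSES = ("handwritten", "printed", "noise")


def combine_scores(svm_scores, mlp_scores, svm_weight=0.5):
    """Weighted average of per-class scores (assumed combination rule)."""
    return {
        c: svm_weight * svm_scores[c] + (1.0 - svm_weight) * mlp_scores[c]
        for c in CLASSES
    }


def predict(svm_scores, mlp_scores, svm_weight=0.5):
    """Return the class with the highest combined score."""
    combined = combine_scores(svm_scores, mlp_scores, svm_weight)
    return max(CLASSES, key=combined.get)


# Hypothetical scores for one pseudo-word on which the two classifiers disagree.
svm_scores = {"handwritten": 0.5, "printed": 0.4, "noise": 0.1}
mlp_scores = {"handwritten": 0.2, "printed": 0.7, "noise": 0.1}
label = predict(svm_scores, mlp_scores)
```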

Contextual re-labelling by CRF

Re-labelling using pseudo-lines

Re-labelling using dominant class of line

Example

Results

System specialisation
A. M. Awal studied here the effect of training on a specialised dataset. He took the case of invoices, where most of the text is regular printed text, together with boxes containing marks.

Three adaptations have been made for:

Noise identification: thin horizontal lines, thin vertical lines, isolated dense pseudo-words
Inspired by printed text: CCs belonging to the same pseudo-line must be aligned horizontally
Contextual re-labelling: a) pseudo-lines: ignore the regularity factor; b) local: consider the left and right horizontal neighbours of each pseudo-word and, if both neighbours have the same label, change its label based on regularity and classification scores
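
The local re-labelling rule can be sketched as follows for the pseudo-words of one line, ordered from left to right. A pseudo-word takes the label shared by its two neighbours only when its own classification score is low. The confidence threshold and the function name are assumptions, and the regularity score is left out because its definition is not given here.

```python
def relabel_locally(labels, scores, min_confidence=0.6):
    """Re-label a pseudo-word when both horizontal neighbours agree and its own score is low."""
    new_labels = list(labels)
    for i in range(1, len(labels) - 1):
        left, right = labels[i - 1], labels[i + 1]
        # Decisions use the original labels, so one change does not cascade along the line.
        if left == right and labels[i] != left and scores[i] < min_confidence:
            new_labels[i] = left
    return new_labels


# H: handwritten, P: printed, N: noise; hypothetical labels and scores.
labels = ["P", "H", "P", "P", "N", "H"]
scores = [0.9, 0.4, 0.8, 0.9, 0.5, 0.7]
relabelled = relabel_locally(labels, scores)
```

Only the second pseudo-word changes here: it is weakly classified as handwritten and sits between two printed neighbours.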

Feature selection

Only 48 of the features, selected by ReliefF, are kept and used to train the classifier
The global performance is very similar to that obtained with all the features
Classification time is halved
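
To give an idea of how this family of methods ranks features, here is a minimal sketch of the basic Relief weighting (one nearest hit and one nearest miss per sample), followed by a top-k selection. It is a simplification and not the full ReliefF algorithm, which averages over several neighbours and weights the misses per class. The toy data, the function names and the assumption that features are scaled to [0, 1] are made up for the example.

```python
def relief_weights(samples, labels):
    """Basic Relief weights; features are assumed to be scaled to [0, 1]."""
    n_features = len(samples[0])
    weights = [0.0] * n_features

    def distance(a, b):
        # Manhattan distance between two feature vectors.
        return sum(abs(x - y) for x, y in zip(a, b))

    for i, sample in enumerate(samples):
        hits = [s for j, s in enumerate(samples) if j != i and labels[j] == labels[i]]
        misses = [s for j, s in enumerate(samples) if labels[j] != labels[i]]
        if not hits or not misses:
            continue
        nearest_hit = min(hits, key=lambda s: distance(sample, s))
        nearest_miss = min(misses, key=lambda s: distance(sample, s))
        for f in range(n_features):
            # A useful feature differs across classes and is stable within a class.
            weights[f] += (abs(sample[f] - nearest_miss[f])
                           - abs(sample[f] - nearest_hit[f])) / len(samples)
    return weights


def select_top_features(weights, k):
    """Return the indices of the k highest-weighted features, in ascending order."""
    ranked = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
    return sorted(ranked[:k])


# Toy data: feature 0 separates the two classes, feature 1 does not.
samples = [[0.0, 0.2], [0.1, 0.9], [0.9, 0.3], [1.0, 0.8]]
labels = ["H", "H", "P", "P"]
weights = relief_weights(samples, labels)
selected = select_top_features(weights, k=1)
```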

Ambiguity layer

The objective is to detect as many classification errors as possible
In addition to the three layers (handwritten, printed, noise), a fourth layer is produced containing ambiguous pseudo-words. Ideally, this layer will contain all the classification errors
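
One simple way to build such a fourth layer, given here only as an assumed illustration, is a reject rule on the classifier scores: a pseudo-word is sent to the ambiguous layer when the margin between its two best class scores is too small. The margin value, the score values and the function name are hypothetical.

```python
def assign_layer(scores, min_margin=0.2):
    """Return the best class, or 'ambiguous' when the two best scores are too close."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, second = ranked[0], ranked[1]
    if scores[best] - scores[second] < min_margin:
        return "ambiguous"
    return best


# Hypothetical per-class scores for two pseudo-words.
uncertain = {"handwritten": 0.45, "printed": 0.40, "noise": 0.15}
confident = {"handwritten": 0.10, "printed": 0.80, "noise": 0.10}
layers = [assign_layer(uncertain), assign_layer(confident)]
```

Raising the margin moves more pseudo-words to the ambiguous layer, which catches more errors at the cost of rejecting more correct decisions.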

Pseudo-word Ambiguity

Ambiguity fusion

Ambiguity layer evaluation