DSS
 
This software was designed by Didier Grzejszczak. It was improved a first time in 2012 by Hervé Locteau, and a second time by Ahmad Montaser Awal between September 2013 and August 2014.
First version
The principle:
The system is composed of several sub-steps:
  • Preprocessing:
    • Extraction of connected components (CC)
    • Border removal by rules based on the CC shape
    • Salt-and-pepper noise suppression by a modified KFill algorithm
    • Correction of the document's skew and orientation by the RAST algorithm
    • Exclusion of CCs that are too small or too large
  • Segmentation into pseudo-lines and pseudo-words:
    • This is done in two passes: a first smearing pass extracts the document lines, then an analysis of the gaps between the CCs yields the pseudo-words.
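
The smearing pass mentioned above can be illustrated with a minimal sketch of horizontal run-length smearing. This is the classic technique in its simplest form and not necessarily the variant used in DSS; the function name smear_rows, the max_gap parameter and the list-of-rows image representation are assumptions made for the example.

```python
def smear_rows(image, max_gap):
    """Horizontal run-length smearing on a binary image (list of rows, 1 = ink).

    Background runs of at most max_gap pixels that lie between two ink
    pixels on the same row are filled, so that nearby CCs merge into lines.
    """
    out = [row[:] for row in image]  # work on a copy
    for row in out:
        last_ink = None  # column of the most recent ink pixel on this row
        for x, value in enumerate(row):
            if value:
                gap = x - last_ink - 1 if last_ink is not None else 0
                if 0 < gap <= max_gap:
                    for k in range(last_ink + 1, x):
                        row[k] = 1
                last_ink = x
    return out


row = [[1, 0, 0, 1, 0, 0, 0, 0, 1]]
smeared = smear_rows(row, max_gap=2)
```

In this example the two-pixel gap is filled while the four-pixel gap is kept, which is how a well-chosen max_gap merges characters of one line without bridging wider gaps.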


Segmentation and classification

Unlike most existing work, the document is first segmented into pseudo-lines and only then into pseudo-words. A pseudo-line is a set of connected components whose horizontal and vertical distances are below given thresholds. A pseudo-word is a set of connected components belonging to the same pseudo-line. The extraction relies on an adjacency graph and a linear graph. Each pseudo-word is described by several descriptors providing 137 features in total. A multi-class SVM classifies each pseudo-word as handwritten text, printed text or noise.
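
As an illustration of the distance-threshold grouping described above, here is a minimal sketch that links CC bounding boxes whose horizontal and vertical gaps are both below thresholds and returns the resulting groups. It uses a union-find structure as a stand-in for the adjacency graph. The box format (x0, y0, x1, y1), the function names and the threshold values are assumptions, not taken from the DSS code.

```python
def box_gaps(a, b):
    """Horizontal and vertical gaps between two boxes (x0, y0, x1, y1).

    A gap is 0 when the boxes overlap on that axis.
    """
    dx = max(0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0, max(a[1], b[1]) - min(a[3], b[3]))
    return dx, dy


def group_boxes(boxes, max_dx, max_dy):
    """Group boxes that are transitively closer than both thresholds."""
    parent = list(range(len(boxes)))

    def find(i):
        # Find the group representative, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            dx, dy = box_gaps(boxes[i], boxes[j])
            if dx < max_dx and dy < max_dy:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


boxes = [(0, 0, 10, 10), (13, 1, 20, 9), (60, 0, 70, 10), (0, 40, 10, 50)]
groups = group_boxes(boxes, max_dx=5, max_dy=3)
```

With these hypothetical thresholds the first two boxes are grouped and the other two stay alone. Running the same function with a larger max_dx would yield line-level groups, and running it again inside each line with a smaller max_dx would yield word-level groups.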


Contextual re-labelling

Contextual information is exploited by relabelling each pseudo-word according to its neighbourhood. Several techniques have been used.


Results

Training was done on a database of 75 documents (10173 pseudo-words, distributed as H: 2357; P: 4892; N: 2924, where H = handwritten, P = printed and N = noise). The test database contains 24 documents (~7669 pseudo-words, distributed as H: 1191; P: 4218; N: 1934). All documents are labelled at the pixel level.


Second version

Ahmad Montaser Awal contributed several improvements, described below:

New datasets

Training DB_T0 (new): 107 documents (32706 pseudo-words), H: 5888; P: 18078; N: 8740. Test DB0 (new): 202 documents (~82142 pseudo-words), H: 11970; P: 43705; N: 25190. All documents are labelled at the pixel level.


Horizontal overlapping


Line detachment


MLP and classifier combination

The outputs of an MLP and of the SVM are combined.
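
The combination rule is not specified here. As one common possibility, the sketch below takes a weighted average of the per-class scores of the two classifiers and picks the best class. The function name, the svm_weight parameter and the assumption that both classifiers output comparable scores in [0, 1] are all hypothetical.

```python
CLASSES = ("handwritten", "printed", "noise")


def combine_scores(svm_scores, mlp_scores, svm_weight=0.5):
    """Weighted average of two per-class score vectors (assumed rule).

    Returns the winning class name and the combined scores.
    """
    combined = [
        svm_weight * s + (1.0 - svm_weight) * m
        for s, m in zip(svm_scores, mlp_scores)
    ]
    best = max(range(len(combined)), key=combined.__getitem__)
    return CLASSES[best], combined


# The SVM favours "handwritten", the MLP favours "printed".
label, combined = combine_scores([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```

With equal weights the more confident MLP vote wins in this example; other fusion rules (product, maximum, a trained meta-classifier) would fit the same interface.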


Contextual re-labelling by CRF


Re-labelling using pseudo-lines


Re-labelling using dominant class of line

Example

Results


System specialisation
Here A. M. Awal studied the effect of training on a specialised dataset. The case considered is that of invoices, where most of the text is printed regularly, with boxes containing marks.

Three adaptations have been made:

  • Noise identification: thin horizontal lines, thin vertical lines, isolated dense pseudo-words
  • Inspired by printed text: CCs belonging to the same pseudo-line must be aligned horizontally
  • Contextual relabelling: a) pseudo-lines: ignore the regularity factor; b) local: consider the left and right horizontal neighbours of each pseudo-word and, if both neighbours have the same label, change its label based on regularity and classification scores
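
The local relabelling rule can be sketched as follows. This is a simplified illustration: it uses only the classification score of each pseudo-word and leaves out the regularity criterion, whose definition is not given here. The function name and the min_score threshold are assumptions.

```python
def relabel_local(labels, scores, min_score=0.6):
    """Relabel pseudo-words of one pseudo-line, ordered left to right.

    A pseudo-word takes the label of its two horizontal neighbours when
    they agree with each other, disagree with it, and its own
    classification score is below min_score (hypothetical threshold).
    Neighbours are read from the original labels, so changes do not cascade.
    """
    new_labels = list(labels)
    for i in range(1, len(labels) - 1):
        left, right = labels[i - 1], labels[i + 1]
        if left == right and labels[i] != left and scores[i] < min_score:
            new_labels[i] = left
    return new_labels


labels = ["P", "H", "P", "P", "N", "P"]
scores = [0.9, 0.4, 0.9, 0.8, 0.95, 0.9]
relabelled = relabel_local(labels, scores)
```

In this example the weakly scored "H" between two "P" neighbours is relabelled, while the confidently scored "N" is kept.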

Feature selection

  • Only 48 features, selected with ReliefF, are kept and used to train the classifier
  • The global performance is very similar to that obtained with all the features
  • Classification time is halved
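
To give an idea of how Relief-style feature ranking works, here is a minimal sketch of the basic Relief algorithm (one nearest hit and one nearest miss per sample, Manhattan distance). It is a simplification of ReliefF, which uses k neighbours and class priors, and it is not the implementation used in DSS. Features are assumed to be scaled to [0, 1]; the function names are hypothetical.

```python
def relief_weights(X, y):
    """Basic Relief feature weights for samples X (list of rows) and labels y."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # no same-class or other-class sample to compare with
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            # Reward features that differ on the miss and agree on the hit.
            weights[f] += (abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])) / n
    return weights


def top_features(X, y, k):
    """Indices of the k highest-weighted features."""
    w = relief_weights(X, y)
    return sorted(range(len(w)), key=lambda f: w[f], reverse=True)[:k]


# Feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 0.5], [0.1, 0.9], [0.9, 0.4], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
kept = top_features(X, y, k=1)
```

The discriminative feature receives a positive weight and the uninformative one a negative weight, so keeping the top-ranked features (48 of 137 in the case above) discards those that do not help separate the classes.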

Ambiguity layer

  • The objective is to detect a maximum of classification errors
  • In addition to the three layers (handwritten, printed, noise), a fourth layer is produced containing the ambiguous pseudo-words. Ideally, this layer contains all the classification errors.
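
How a pseudo-word is judged ambiguous is not detailed here. One plausible criterion, shown below purely as an assumption, is to send a pseudo-word to the ambiguity layer when the margin between its two best class scores is small. The function name and the min_margin threshold are hypothetical.

```python
def assign_layer(scores, min_margin=0.2):
    """Return the output layer for one pseudo-word.

    scores maps each class name to its classification score. The
    pseudo-word goes to the "ambiguous" layer when the two best scores
    differ by less than min_margin (assumed criterion).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    return best if s1 - s2 >= min_margin else "ambiguous"


close_call = assign_layer({"handwritten": 0.5, "printed": 0.45, "noise": 0.05})
clear_case = assign_layer({"handwritten": 0.8, "printed": 0.15, "noise": 0.05})
```

Raising min_margin captures more classification errors in the fourth layer at the cost of also moving more correctly classified pseudo-words into it.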

Pseudo-word Ambiguity

Ambiguity fusion

Ambiguity layer evaluation