DSS
 
Principle
This software was designed by Hani Daher during his postdoctoral period, between 2012 and 2014. The model links two incremental classification modules. The first assigns each pair of successive pages to one of two classes: continuity or rupture. Its input vectors result from comparing the factual descriptor vectors of the two pages. When a rupture is detected on a pair of pages, the second page becomes the beginning of a new document. The second classifier classifies the resulting documents from a bag of words; it simply confirms or rejects the segmentation made by the first. This second classification also feeds a correction module, based on case-based reasoning, that corrects any recorded error.
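The segmentation rule stated above (a rupture on a pair of pages starts a new document at the second page) can be sketched as follows. This is only an illustration: the function name `split_stream` and the representation of the first classifier's output as a list of booleans are assumptions, not part of the original software.

```python
from typing import List, Sequence


def split_stream(pages: Sequence[str], rupture: Sequence[bool]) -> List[List[str]]:
    """Cut a page stream into documents.

    rupture[i] is the (assumed boolean) decision of the first classifier
    for the pair (pages[i], pages[i + 1]).
    """
    if not pages:
        return []
    if len(rupture) != len(pages) - 1:
        raise ValueError("expected one decision per pair of successive pages")

    documents = [[pages[0]]]
    for page, is_rupture in zip(pages[1:], rupture):
        if is_rupture:
            # Rupture: the second page of the pair opens a new document.
            documents.append([page])
        else:
            # Continuity: the page extends the current document.
            documents[-1].append(page)
    return documents


# Four pages with a rupture between the second and the third.
docs = split_stream(["p1", "p2", "p3", "p4"], [False, True, False])
```

In the described system, each resulting document would then go to the second classifier, which confirms or rejects the cut.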
Descriptor logbook
Each page is processed by OCR, and the result is an XML file. Textual descriptors such as dates, phone numbers, numeric and alphanumeric codes, etc. are extracted from this file. These descriptors are compared across each pair of consecutive pages in the stream to form a descriptor vector. For each descriptor observed on the two pages, three values are computed, giving its degree of continuity, rupture, or uncertainty.
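A minimal sketch of how such a descriptor vector could be built is given below. The comparison rule is an assumption made for illustration: a shared value counts as continuity, two present but different values count as rupture, and a descriptor missing on either page counts as uncertainty. The original software may compute graded degrees rather than these 0/1 values, and the names `compare_descriptor` and `descriptor_vector` are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

# One (continuity, rupture, uncertainty) triplet per descriptor.
CONTINUITY = (1.0, 0.0, 0.0)
RUPTURE = (0.0, 1.0, 0.0)
UNCERTAINTY = (0.0, 0.0, 1.0)


def compare_descriptor(values_a: Set[str], values_b: Set[str]) -> Tuple[float, float, float]:
    """Compare the values of one descriptor on two consecutive pages (assumed rule)."""
    if not values_a or not values_b:
        return UNCERTAINTY  # descriptor absent from at least one page
    if values_a & values_b:
        return CONTINUITY  # at least one value is shared
    return RUPTURE  # present on both pages, but no value in common


def descriptor_vector(
    page_a: Dict[str, Set[str]],
    page_b: Dict[str, Set[str]],
    names: Sequence[str],
) -> List[float]:
    """Concatenate the triplets of all descriptors, in the order given by names."""
    vector: List[float] = []
    for name in names:
        vector.extend(compare_descriptor(page_a.get(name, set()), page_b.get(name, set())))
    return vector


# Illustrative values only; in the real system they come from the OCR XML output.
page_a = {"date": {"2013-05-02"}, "phone": {"555-0100"}}
page_b = {"date": {"2013-05-02"}, "phone": {"555-0199"}, "code": {"INV-42"}}
vec = descriptor_vector(page_a, page_b, ["date", "phone", "code"])
```

Here the date is shared (continuity), the phone numbers differ (rupture), and the code appears on one page only (uncertainty).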
Classifier functioning
The algorithm of the proposed method builds and maintains a model in the form of a graph G whose nodes are representatives of documents. Each node of G is a feature vector that the algorithm updates continuously.
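The idea of nodes that are feature vectors updated continuously can be illustrated with a generic incremental prototype model. This is a swapped-in technique, not the author's algorithm: the nearest-node rule, the running-mean update, the `new_node_distance` parameter, and the class name `NodeModel` are all assumptions, and the edges of G are left out because the text does not describe them.

```python
import math
from typing import List, Optional, Sequence


class NodeModel:
    """Incremental set of representative vectors (assumed running-mean update)."""

    def __init__(self, new_node_distance: float) -> None:
        # Hypothetical parameter: beyond this distance a new node is created.
        self.new_node_distance = new_node_distance
        self.nodes: List[List[float]] = []
        self.counts: List[int] = []

    def nearest(self, x: Sequence[float]) -> Optional[int]:
        """Index of the node closest to x, or None if the model is empty."""
        if not self.nodes:
            return None
        return min(range(len(self.nodes)), key=lambda i: math.dist(self.nodes[i], x))

    def update(self, x: Sequence[float]) -> int:
        """Present one vector; return the index of the node that absorbed it."""
        i = self.nearest(x)
        if i is None or math.dist(self.nodes[i], x) > self.new_node_distance:
            # No node is close enough: x becomes a new representative.
            self.nodes.append(list(x))
            self.counts.append(1)
            return len(self.nodes) - 1
        # Move the winning node towards x so that it stays the mean of its inputs.
        self.counts[i] += 1
        rate = 1.0 / self.counts[i]
        self.nodes[i] = [w + rate * (v - w) for w, v in zip(self.nodes[i], x)]
        return i


model = NodeModel(new_node_distance=1.0)
first = model.update([0.0, 0.0])   # creates node 0
second = model.update([0.5, 0.0])  # absorbed by node 0
third = model.update([5.0, 5.0])   # too far away: creates node 1
```

Because each vector is seen once and then discarded, the model can keep learning as new pages arrive, which is the property an incremental classifier needs.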
The algorithm



Rejection



Experiments
Databases

Four databases were built:

  • Database 1: 761 documents, 1522 pages, 15 classes
  • Database 2: 1280 documents, 5039 pages, 23 classes
  • Database 3: 3797 documents, 11753 pages, 137 classes
  • Database 4 (combination of the 3 databases above): 5106 documents, 16882 pages, 164 classes
Demonstration
Figure 1

The self-test function varies the rejection threshold over the learning flow. The result is a set of curves plotting the various scores:

  • On the x-axis: the threshold, varied between 0.1 and 1
  • On the y-axis:
    • Precision of the continuity class
    • Precision of the rupture class
    • Average precision: (continuity + rupture) / 2
    • Over-segmentation
      • The higher the rejection threshold, the more cases of over-segmentation occur
  • On the left, the values for each threshold are listed, with the number of pages and documents in the flow
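The scores plotted in this self-test can be sketched as a simple threshold sweep. The sketch assumes that every pair of pages carries a continuity confidence in [0, 1] and that a pair whose confidence falls below the rejection threshold is treated as a rupture; this decision rule, the toy data, and the function names are assumptions and may differ from the actual rejection mechanism.

```python
from typing import Dict, List, Sequence


def precision(predicted: Sequence[str], truth: Sequence[str], label: str) -> float:
    """Share of the pairs predicted as `label` that really are `label`."""
    hits = sum(1 for p, t in zip(predicted, truth) if p == label and t == label)
    total = sum(1 for p in predicted if p == label)
    return hits / total if total else 0.0


def evaluate(scores: Sequence[float], truth: Sequence[str], threshold: float) -> Dict[str, float]:
    """Scores for one rejection threshold (assumed rule: below threshold means rupture)."""
    predicted = ["continuity" if s >= threshold else "rupture" for s in scores]
    p_cont = precision(predicted, truth, "continuity")
    p_rupt = precision(predicted, truth, "rupture")
    # Over-segmentation: a cut was made where the pages actually belong together.
    over = sum(1 for p, t in zip(predicted, truth) if p == "rupture" and t == "continuity")
    return {
        "threshold": threshold,
        "continuity": p_cont,
        "rupture": p_rupt,
        "average": (p_cont + p_rupt) / 2,
        "over_segmentation": over,
    }


def sweep(scores: Sequence[float], truth: Sequence[str], steps: int = 10) -> List[Dict[str, float]]:
    """Evaluate thresholds 0.1, 0.2, ..., 1.0 (for the default steps=10)."""
    return [evaluate(scores, truth, k / steps) for k in range(1, steps + 1)]


# Toy data: confidence of continuity for four page pairs and their true labels.
scores = [0.9, 0.6, 0.2, 0.8]
truth = ["continuity", "continuity", "rupture", "continuity"]
rows = sweep(scores, truth)
```

Under this rule the over-segmentation count can only grow as the threshold rises, which matches the behaviour described in the list above.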
Figure 2

Figure 2 illustrates the continuity, rupture, and over-segmentation scores for a given threshold.

Figure 3

Figure 3 illustrates a continuity case between two successive pages in the stream. On the right, the descriptors whose value is equal to 1 are checked.

Figure 4

Figure 4 illustrates a confusion case.