Presentation

DSS

Principle

This software is designed by Hani Daher during his postdoctoral period, between 2012 and 2014. The model links two incremental classification modules. The first creates two classes, one representing the continuity, another the rupture. The input vectors are the result of comparison of factual descriptor vectors in two successive pages. In case of rupture shown on a pair of pages, the second page becomes the beginning of a new document. The second classifier classifies documents from a bag of words. He just confirm or deny the segmentation made ??by his predecessor. The latter classification is also used for a correction module of a case based reasoning type that will correct any registered error. segmentation

Descriptor logbook

Each page is recognized by OCR and the result is an XML file. From this file are extracted textual descriptors such as dates, phone numbers, numeric and alphanumeric codes, etc. These descriptors are compared in pairs of consecutive pages in the stream and a descriptor vector is formed. For each descriptor observed in two pages, 3 values are performed showing its degree in continuity, rupture or uncertainty.

Classifier functioning

The algorithm of the proposed method constructs and maintains a model in the form of a graph G whose nodes are representatives of documents. Each node of G is a vector of characteristics that is updated continuously by the algorithm.
The algorithm

Rejection

Experiments

Databases

4 databases have been constituted:

Database 1: 761 documents,1522 pages,15 classes

Database 2: 1280 documents, 5039 pages, 23 classes

Database 3: 3797 documents,11753 pages,137 classes

Database 4 (combination of all the 3 databases): 5106 documents, 16882 pages, 164 classes

Demonstration

Figure 1

The self-test function allows variations in the rejection thresholds in the learning flow. We get a curve plotting the various scores

In abscissa: there is the variation of the threshold between 0.1 and 1

In ordinate:

Precision of the continuity class
Precision of the rupture class
Average precision (continuity +rupture)/2
Over-segmentation

More the rejection threshold increases more we have cases of over-segmentation

On the left, there are values for each thresholdwith the number of pages and documents in the flow

Figure 2

Figure 2 illustrates continuity scores,rupture and over-segmentation for a given threshold.

Figure 3

Figure 3 illustrates the continuity case between two successive pages in the stream. On the right, the descriptors which value is equal to 1 are checked.

Figure 4

The Figure 4 illustrate the confusion case.