Presentation

DSS

This software was designed by KC Santosh and improved by Tapan Bhowmik between January 2014 and August 2014. The goal of the project is to extract content from tables in document images based on learnt patterns. Clients provide a set of key fields within the tables that they consider relevant, in the form of a client pattern. Extraction is then based on searching for similar patterns.

The pattern

The client pattern is represented with a graph by:

Assigning fields to nodes
Labeling attributes via several features

for instance, each attribute possesses a regular expression, a feature vector of its content, a size, a number of words, a number of lines, a word separation gap, etc.

Computing possible relations that exist between attributes

spatial relations

Here is a generated graph
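As a separate illustration of the representation described in this section, here is a minimal Python sketch. It is not the project's code: the names FieldNode, PatternGraph and spatial_relation, the four relation labels, and the sample fields are all assumptions made for the example. Each key field becomes a node carrying a few of the features listed above, and every pair of nodes is linked by a coarse spatial relation.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class FieldNode:
    """One key field of the client pattern (hypothetical structure)."""
    name: str
    text: str
    box: Tuple[int, int, int, int]  # (x, y, width, height), y grows downward
    regex: str = ""                 # expected content format, if any

    @property
    def n_words(self) -> int:
        return len(self.text.split())

    @property
    def n_lines(self) -> int:
        return len(self.text.splitlines())


def spatial_relation(a: FieldNode, b: FieldNode) -> str:
    """Coarse position of b relative to a, from box centres (assumed labels)."""
    ax, ay = a.box[0] + a.box[2] / 2, a.box[1] + a.box[3] / 2
    bx, by = b.box[0] + b.box[2] / 2, b.box[1] + b.box[3] / 2
    dx, dy = bx - ax, by - ay
    if abs(dx) >= abs(dy):
        return "right_of" if dx >= 0 else "left_of"
    return "below" if dy >= 0 else "above"


@dataclass
class PatternGraph:
    nodes: Dict[str, FieldNode] = field(default_factory=dict)
    edges: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add_field(self, node: FieldNode) -> None:
        # Relate the new node to every node already present, in both directions.
        for other in self.nodes.values():
            self.edges[(other.name, node.name)] = spatial_relation(other, node)
            self.edges[(node.name, other.name)] = spatial_relation(node, other)
        self.nodes[node.name] = node


# Made-up fields, for illustration only.
pattern = PatternGraph()
pattern.add_field(FieldNode("date", "12/03/2014", (10, 10, 80, 12), r"\d{2}/\d{2}/\d{4}"))
pattern.add_field(FieldNode("amount", "1 250.00", (200, 10, 60, 12), r"[\d ]+\.\d{2}"))
pattern.add_field(FieldNode("label", "Total due", (10, 60, 70, 12)))
```

With these sample boxes, pattern.edges[("date", "amount")] is "right_of" and pattern.edges[("date", "label")] is "below". A real pattern would presumably use a richer set of features and relations than this sketch.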

Graph Mining

Graph mining in the presence of a client pattern (major steps)

Step-1: Starting from an arbitrary node in the graph, find a similar field in the document using a vertical window search
Step-2: Validate the similar field via its feature score: if the feature score is greater than a certain threshold, the field is considered similar
Step-3: Find the associated fields via relations
Step-4: Compose the graph
Step-5: Calculate the graph matching score
Step-6: Validate the similar pattern if the graph matching score is greater than a certain threshold
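
The six steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's implementation: the Field structure, the toy feature_score, the reading of "vertical window search" as a scan down the column of the starting field, the graph score (mean feature score over all pattern nodes, with unmatched nodes counting as zero), and both threshold values are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Field:
    name: str
    text: str
    box: Tuple[int, int, int, int]  # (x, y, width, height), width > 0
    regex: str = ""


def feature_score(model: Field, cand: Field) -> float:
    """Toy similarity in [0, 1]: word count, width ratio, optional regex."""
    parts = [
        1.0 if len(model.text.split()) == len(cand.text.split()) else 0.0,
        min(model.box[2], cand.box[2]) / max(model.box[2], cand.box[2]),
    ]
    if model.regex:
        parts.append(1.0 if re.fullmatch(model.regex, cand.text) else 0.0)
    return sum(parts) / len(parts)


def relation(a: Field, b: Field) -> str:
    """Coarse position of b relative to a, from top-left corners."""
    dx, dy = b.box[0] - a.box[0], b.box[1] - a.box[1]
    if abs(dx) >= abs(dy):
        return "right_of" if dx >= 0 else "left_of"
    return "below" if dy >= 0 else "above"


def vertical_window(seed: Field, blocks: List[Field], tol: int = 20) -> List[Field]:
    """Blocks whose left edge lies within tol pixels of the seed's column."""
    return [b for b in blocks if abs(b.box[0] - seed.box[0]) <= tol]


def mine(pattern: List[Field], blocks: List[Field],
         feat_thr: float = 0.6, graph_thr: float = 0.7) -> List[Dict[str, Field]]:
    seed, others = pattern[0], pattern[1:]       # arbitrary starting node
    found = []
    for anchor in vertical_window(seed, blocks):             # Step 1
        seed_score = feature_score(seed, anchor)
        if seed_score <= feat_thr:                           # Step 2
            continue
        match, scores = {seed.name: anchor}, [seed_score]
        for node in others:                                  # Step 3
            wanted = relation(seed, node)
            cands = [b for b in blocks
                     if b is not anchor
                     and relation(anchor, b) == wanted
                     and feature_score(node, b) > feat_thr]
            if not cands:
                continue
            # Keep the candidate whose offset from the anchor is closest
            # to the offset observed in the pattern.
            ex, ey = node.box[0] - seed.box[0], node.box[1] - seed.box[1]
            best = min(cands, key=lambda b: abs(b.box[0] - anchor.box[0] - ex)
                       + abs(b.box[1] - anchor.box[1] - ey))
            match[node.name] = best                          # Step 4
            scores.append(feature_score(node, best))
        graph_score = sum(scores) / len(pattern)             # Step 5
        if graph_score > graph_thr:                          # Step 6
            found.append(match)
    return found


# Made-up pattern and document blocks, for illustration only.
pattern = [Field("date", "12/03/2014", (10, 10, 80, 12), r"\d{2}/\d{2}/\d{4}"),
           Field("amount", "1250.00", (200, 10, 60, 12), r"\d+\.\d{2}")]
blocks = [Field("b0", "Invoice", (10, 5, 60, 12)),
          Field("b1", "05/06/2014", (12, 100, 78, 12)),
          Field("b2", "980.50", (205, 100, 55, 12)),
          Field("b3", "07/06/2014", (12, 130, 78, 12)),
          Field("b4", "43.10", (205, 130, 50, 12)),
          Field("b5", "Notes", (12, 160, 40, 12))]
matches = mine(pattern, blocks)
```

On this sample data, matches contains two entries, one per table row: 05/06/2014 paired with 980.50, and 07/06/2014 paired with 43.10. The "Invoice" and "Notes" blocks fall inside the vertical window but are rejected at Step 2 because their feature scores do not exceed the threshold.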