Current Research Areas

 
Project Osé-ITESOFT: Document on Demand

There is now a range of products to dematerialize (scan, identify and read automatically) paper documents (letters, forms, invoices, checks, ...), but these solutions are currently only available to large companies. The DOD project is based on a new vision: dematerialization for all. It is developing a new generation of technologies to build software and web-service (SaaS) products that can automatically handle, after a simple learning-by-example phase, all the paper and electronic documents that arrive daily in small and medium-sized enterprises, very small businesses and homes. The READ team takes part in the development of such products. About five topics are studied, concerning document stream segmentation, incremental classification, heterogeneous document segmentation and information extraction.


Document Stream Segmentation
  • Hani Daher proposes a framework for the segmentation and classification of multi-page administrative document streams. It is based on a careful search for textual descriptors and their neighborhood to model the limits between pages. An incremental classifier is used to accommodate new classes of documents that may appear in the stream, without the need for a fixed training set. Lastly, a semi-automatic case-based reasoning framework allows the system to learn from previous mistakes and correct potential future errors.
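
As a hedged illustration only (this is not the actual DOD framework), the following Python sketch uses TF-IDF descriptors of consecutive pages and a similarity threshold to hypothesize the limits between documents in a stream; the threshold value and the choice of representation are assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def segment_stream(pages, threshold=0.2):
        """Return indices of pages assumed to open a new document."""
        vectors = TfidfVectorizer().fit_transform(pages)
        boundaries = [0]                       # the first page always opens a document
        for i in range(1, len(pages)):
            sim = cosine_similarity(vectors[i - 1], vectors[i])[0, 0]
            if sim < threshold:                # weak textual continuity -> new document
                boundaries.append(i)
        return boundaries

    stream = ["invoice n. 123 total due ...",
              "invoice n. 123 page 2 payment terms ...",
              "contract between parties ..."]
    print(segment_stream(stream))              # e.g. [0, 2]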

Printed/Handwritten Text Separation
  • Here we tackle the problem of machine-printed and handwritten text separation in real, noisy documents. A first version of the system was proposed by Didier Grzejszczak. Pseudo-lines and pseudo-words are used as basic blocks for classification. Then, the local neighborhood of each pseudo-word is studied in order to propagate the context and correct classification errors. Montaser Awal extends this separation problem with conditional random fields considering the horizontal neighborhood. He then enhanced this method by using a more global context based on class dominance in the pseudo-line.
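
A minimal sketch of the class-dominance idea (not the published CRF system): each pseudo-word receives an initial printed/handwritten label with a confidence, and low-confidence labels are revised from the dominant class of their pseudo-line. The confidence threshold and data structures are assumed for illustration.

    from collections import Counter

    def correct_with_line_context(pseudo_lines, conf_threshold=0.6):
        corrected = []
        for line in pseudo_lines:                   # line = [(label, confidence), ...]
            dominant, _ = Counter(lbl for lbl, _ in line).most_common(1)[0]
            corrected.append([
                (dominant if conf < conf_threshold else lbl, conf)
                for lbl, conf in line
            ])
        return corrected

    line = [("printed", 0.9), ("handwritten", 0.4), ("printed", 0.8)]
    print(correct_with_line_context([line]))
    # the uncertain middle word is re-labelled "printed" by class dominance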

Incremental Clustering
  • Based on an adaptive incremental neural gas algorithm (AING), Rafik Bouguélia proposes a new stream-based semi-supervised active learning method (A2ING) for document classification, which is able to actively query (from a human annotator) the class labels of the documents that are most informative for learning, according to an uncertainty measure. The method maintains a model as a dynamically evolving graph topology of labelled document representatives that we call neurons. Experiments on different real-world datasets show that the proposed method requires on average only 36.3% of the incoming documents to be labelled in order to learn a model that achieves an average gain of 2.15-3.22% in precision, compared to traditional supervised learning with fully labelled training documents.
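
The following simplified sketch illustrates stream-based active querying only; A2ING itself maintains an adaptive neural-gas graph, whereas here plain nearest-prototype matching stands in, and the uncertainty measure and threshold are illustrative assumptions.

    import numpy as np

    class StreamActiveLearner:
        def __init__(self, uncertainty_threshold=0.8):
            self.prototypes, self.labels = [], []
            self.threshold = uncertainty_threshold

        def process(self, x, oracle):
            if len(set(self.labels)) < 2:            # bootstrap: always query
                self.prototypes.append(x); self.labels.append(oracle(x))
                return self.labels[-1]
            d = np.linalg.norm(np.array(self.prototypes) - x, axis=1)
            order = np.argsort(d)
            nearest = order[0]
            runner_up = next(i for i in order[1:]
                             if self.labels[i] != self.labels[nearest])
            uncertainty = d[nearest] / (d[runner_up] + 1e-9)   # close to 1 = ambiguous
            if uncertainty > self.threshold:         # informative -> ask the annotator
                self.prototypes.append(x); self.labels.append(oracle(x))
                return self.labels[-1]
            return self.labels[nearest]              # confident -> keep own prediction

    learner = StreamActiveLearner()
    for x, true_label in [(np.array([0.0, 0.0]), "invoice"), (np.array([5.0, 5.0]), "letter")]:
        print(learner.process(x, oracle=lambda _: true_label))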

Table Extraction
  • Thotreingam Kasar is involved in the development of an interactive tool for testing and evaluating a query-based table information extraction system developed at our laboratory. The interface is also being used for generating table ground truth and for performance evaluation.
  • Tapan Bhowmik is working on the development of a system for table information retrieval. The objective of this work is to retrieve similar patterns from the document on the basis of an input query pattern defined by a client. A graph searching technique is used to locate these patterns.
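
A hypothetical sketch of query-based pattern retrieval by graph matching (the actual system and its features are not reproduced): table cells become attributed nodes, spatial adjacency becomes edges, and occurrences of a client-defined query graph are searched as subgraph isomorphisms with networkx.

    import networkx as nx
    from networkx.algorithms import isomorphism

    def build_cell_graph(cells, adjacencies):
        g = nx.Graph()
        for cell_id, cell_type in cells:             # e.g. ("c1", "amount")
            g.add_node(cell_id, type=cell_type)
        g.add_edges_from(adjacencies)                # e.g. ("c1", "c2") if neighbours
        return g

    def retrieve(document_graph, query_graph):
        matcher = isomorphism.GraphMatcher(
            document_graph, query_graph,
            node_match=isomorphism.categorical_node_match("type", None))
        return list(matcher.subgraph_isomorphisms_iter())

    doc = build_cell_graph([("c1", "label"), ("c2", "amount"), ("c3", "label"), ("c4", "amount")],
                           [("c1", "c2"), ("c3", "c4"), ("c2", "c3")])
    query = build_cell_graph([("q1", "label"), ("q2", "amount")], [("q1", "q2")])
    print(retrieve(doc, query))                      # each mapping is one retrieved occurrence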

Entity matching in documents with structured database
  • Nihel Kooli proposes an approach to entity recognition in OCR'd documents, based on entity resolution in industrial databases. A preliminary entity resolution step provides a non-redundant database. Then, an adapted method is used to retrieve entities from their structure while tolerating possible OCR errors. Nihel applied an adapted version of EROCS (Chakaravarthy et al., 2006), which operates on document segments to match the document to its corresponding entities. The segments correspond to the elementary blocks given by the OCR.
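
A toy sketch of segment-to-entity matching tolerant to OCR errors (the EROCS adaptation itself is not reproduced): each OCR segment is scored against the entities of the resolved database by fuzzy token similarity; the threshold and data layout are assumptions.

    from difflib import SequenceMatcher

    def token_similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def match_segment(segment_tokens, entities, threshold=0.8):
        """entities: {entity_id: [attribute values]} from the resolved database."""
        scores = {}
        for eid, values in entities.items():
            hits = sum(max(token_similarity(t, v) for v in values) >= threshold
                       for t in segment_tokens)
            scores[eid] = hits
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None

    entities = {"E1": ["DUPONT", "NANCY", "54000"], "E2": ["MARTIN", "PARIS", "75001"]}
    print(match_segment(["DUP0NT", "54OOO"], entities))   # an OCR confusion still matches E1
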
Project ECLEIR-eNovalys - Cifre

Under the ECLEIR project, in collaboration with the company eNovalys (Strasbourg), we seek to develop an aid for collecting chemical data in order to provide the community with a dedicated database. The work involves digitizing the heritage knowledge contained in chemists' laboratory notebooks in order to make it accessible. These data are very well structured, always following the same sequence: a reaction scheme, a table, several paragraphs of text and some images. However, these documents are produced by hand and are therefore difficult to exploit digitally. One of the challenges of the project is thus to develop tools that simplify the conversion of such handwritten notebooks into electronic documents.


Document Segmentation
  • Nabil Ghanmi addresses the problem of segmenting chemistry documents into homogeneous areas such as chemical formulas, tables and text lines. After noise filtering, elementary structures such as lines and parts of words are extracted. Specific descriptors taking into account the texture of text and graphics are considered. Then, the elementary document structures are classified into one of the three classes. For table separation, Nabil Ghanmi uses a labeling task based on a CRF model. It combines two classifiers: a local classifier which assigns a label to a line based on local features, and a contextual classifier which uses features taking the neighbourhood into account. The CRF model gives the global conditional probability of a given labeling of the lines, considering the results of the two classifiers.
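
As a simplified stand-in for this labeling step (the actual CRF model and its two classifiers are not reproduced), the sketch below uses local scores per line, a transition table playing the role of the contextual classifier, and Viterbi decoding to obtain the globally most probable labeling; all scores are illustrative.

    import numpy as np

    LABELS = ["text", "table", "formula"]

    def viterbi(local_scores, transition):
        """local_scores: (n_lines, n_labels); transition: (n_labels, n_labels) log-scores."""
        n, k = local_scores.shape
        delta = np.zeros((n, k)); back = np.zeros((n, k), dtype=int)
        delta[0] = local_scores[0]
        for t in range(1, n):
            cand = delta[t - 1][:, None] + transition + local_scores[t][None, :]
            back[t] = cand.argmax(axis=0)
            delta[t] = cand.max(axis=0)
        path = [int(delta[-1].argmax())]
        for t in range(n - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return [LABELS[i] for i in reversed(path)]

    local = np.log(np.array([[0.7, 0.2, 0.1],      # line 1: looks like text
                             [0.4, 0.5, 0.1],      # line 2: ambiguous
                             [0.1, 0.8, 0.1]]))    # line 3: looks like a table row
    trans = np.log(np.array([[0.6, 0.3, 0.1],      # labels tend to persist across lines
                             [0.2, 0.7, 0.1],
                             [0.3, 0.2, 0.5]]))
    print(viterbi(local, trans))                   # ['text', 'table', 'table']
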
Former Research Areas

 
Writing recognition

We study the contribution of stochastic models and neural networks to unconstrained handwritten text recognition. The research is focused on two aspects: bidimensional Markov fields for stochastic models and adaptive topology for neural networks. Several applications have been explored, such as bank check recognition, multi-scale digit recognition and multilingual word recognition.


  • George Saon described a new formalism, the NSHP-HMM, which combines the two-dimensional modeling of random fields with the assets of HMMs. The word image is regarded as the realization of a random field, which in turn is seen as a sequence of column observations. He experimented with this model on both digit and word recognition.
  • Christophe Choisy uses the NSHP-HMM to normalize the word images to a fixed size, then an SVM (Support Vector Machine) for classification. The letter models are obtained by cross-training.
  • Szilard Vajda extends this model to large vocabularies and Indian script identification. A first improvement is reached for large vocabularies by building a common word graph, visited partially thanks to thresholds computed on letter occurrences in the word paths. He is also studying the influence of structural information added to the pixels of the raw images to increase the discriminating power of the model.
  • Najoua Ben Amara proposed a pseudo-bidimensional architecture (PHMM) for printed Arabic sub-words. The role of the horizontal HMMs is to specialize the analysis in horizontal bands, which are very specific to Arabic writing morphology.
  • Samia Maddouri proposes a transparent neural network for Arabic writing recognition. This approach ensures complementarity between a global vision by the NN and a local vision by Fourier descriptors.
  • Kaouther Bouriel tries to improve this neural model by adding a training module and data input selection.
  • For printed text, Laurent Pierron developed a digitization platform combining the results of different commercial OCRs and using heuristics for error correction. He obtains error rates lower than 1 per 10,000 for a particular class of documents.
  • In the same context as above, Hubert Cecotti experiments with a neural network with convolutional layers and adaptive topology. The MLP combines the OCR results and categorizes the error rejections. These errors are then processed by a local ICR (Intelligent Character Recognition). The topology adaptation is performed thanks to Self-Organizing Maps that modify the connection layers by changing their node connections.

 
Document Image Analysis

This area's aim is to analyze digitized document images and extract their structure in order to obtain a reusable electronic version. We study adaptive binarization techniques, segmentation into blocks and areas, and keyword extraction. We are also studying several strategies for structure recognition and content interpretation. Several applications have been experimented with on different types of documents: administrative, scientific, historical, etc.

  • Tundé Akindelé proposed a technique for segmenting documents into polygonal blocks. The method takes the Pavlidis method, based on the prolongation of white bands, as its starting point. The idea is to propose a more general technique able to handle mosaic structures and to produce fine contours for the blocks.
  • Mériem Zaoui used two techniques for the separation of Arabic and French text in bilingual documents. The first one is based on fractal dimensions; the second uses morphological primitives of the letters close to the central band. The latter is inspired by Larry Spitz's approach for separating Latin and Eastern languages.

  • Yolande Belaïd used the Hough transform for the extraction of lines and columns in forms, and a hierarchy of neural networks for item classification. The classification features are extracted directly from the image labels and correspond to some morphological aspects of the labels.
  • Jean Luc Panchèvre and Yolande Belaïd used a hierarchy of neural networks for the classification of the cell contents.


  • Hatem Hamza proposes a cascade of neural classifiers for image segmentation and binarization. The cascade starts with a self-organizing map labelled by K-means, followed by an MLP. This classification approach allows not only the extraction of the different planes of the image but also their adaptive compression. One application of this approach is historical image restoration.
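
A minimal stand-in for such a cascade, using the third-party minisom package for the self-organizing map and scikit-learn for K-means and the MLP; the raw-pixel features and all hyper-parameters are assumptions.

    import numpy as np
    from minisom import MiniSom
    from sklearn.cluster import KMeans
    from sklearn.neural_network import MLPClassifier

    def train_cascade(pixels, n_planes=3):
        """pixels: (n_samples, n_features) colour values sampled from the image."""
        som = MiniSom(8, 8, pixels.shape[1], sigma=1.0, learning_rate=0.5)
        som.train_random(pixels, 1000)
        codebook = som.get_weights().reshape(-1, pixels.shape[1])
        # Label the SOM neurons with K-means, then propagate labels to the pixels.
        neuron_labels = KMeans(n_clusters=n_planes, n_init=10).fit_predict(codebook)
        pixel_labels = np.array([neuron_labels[np.ravel_multi_index(som.winner(p), (8, 8))]
                                 for p in pixels])
        # Final stage: an MLP learns to reproduce (and smooth) the plane assignment.
        return MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(pixels, pixel_labels)

    rng = np.random.default_rng(0)
    model = train_cascade(rng.random((500, 3)))     # 500 random RGB samples
    print(model.predict(rng.random((5, 3))))        # plane index for new pixels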

  • Norbert Valverde developed an adaptive technology for the localization of informative areas in forms of the purchase-order type. The method is based on the search for anchor points by discrete relaxation.
  • Christophe Choisy and Yolande Belaïd proposed a Markovian method for the automatic classification of invoices. The method is an adaptation of the LBG algorithm in which the class kernels are synthesized by HMMs.
  • Hatem Hamza experiments with a new document classification approach based on important keyword extraction. The importance notion is based on a careful search for occurrence and redundancy within the same document image and between documents. The application concerns form and invoice recognition.
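
A small illustration of the occurrence/redundancy idea (not the published measure): a keyword is considered important when it is frequent inside one document but not redundant across the collection, i.e. a TF-IDF-like weighting.

    import math
    from collections import Counter

    def keyword_importance(documents):
        """documents: list of token lists; returns one {token: score} dict per document."""
        doc_freq = Counter(tok for doc in documents for tok in set(doc))
        n_docs = len(documents)
        scores = []
        for doc in documents:
            tf = Counter(doc)
            scores.append({tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
                           for tok, count in tf.items()})
        return scores

    docs = [["invoice", "total", "total", "due"],
            ["invoice", "order", "delivery"],
            ["invoice", "total", "vat"]]
    print(keyword_importance(docs)[0])   # "total" and "due" score high, "invoice" scores zero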

 
Document retroconversion

Retroconversion consists in recovering the leading structure of the document (physical and logical fields). It is based on the segmentation result on the one hand, and on the content recognition result (OCR) on the other. An a priori model guides the strategy by offering hypotheses of physico-logical structuring.

Several techniques have been studied in our team concerning bibliographical document structure:
  • Julian Anigbogu used an attribute grammar for bibliographic cards (MORE project, a European project in collaboration with the Jouve company and the Royal Library of Belgium).
  • François Parmentier used conceptual graphs for bibliographical references. 
  • Laurent Pierron employs a part-of-speech tagging technique for tables of contents (collaboration with the Xerox Research Center: Calliope document server)
  • Dominique Besagni extended this technique with adaptive modeling of citations (collaboration with the INIST institute)

Most of our research was focused on structure modeling and on reasoning strategies for the retroconversion of these structures.
  • Yannick Chenevoy used a multi-agent system, ATOME, for macrostructure recognition. This system uses a blackboard architecture to initially select knowledge meta-sources, which in turn trigger specialized knowledge sources through events for the actual processing. The system was applied to scientific papers. The structure model provides the working hypotheses to the system.
  • Tunde Akindele proposed a semi-automatic technique for structure model training using a tree inference technique. The training is initialized by a physical segmentation into blocks, supplemented manually with logical attributes. Then, the system infers the structure from new examples by acting on constructors and qualifiers.

Yves Rangoni employs a specific neural network, called a transparent network, for structure extraction. Contrary to the classical MLP, the layers are observable: each neuron is associated with a specific concept. Local observations, and hence the low-level features, constitute the first layers, while the global context is given by the last layers. To classify the input, a left-to-right propagation is used: each neuron in each layer votes for a neuron in a higher layer, leading to a higher-level concept. Hence the context progressively becomes more precise, layer after layer, until the final one. When ambiguities are observed in the final layer, the system performs a right-to-left propagation allowing it 1) to find the error sources in each layer and 2) to refine the input based on these errors.
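
A heavily simplified sketch of this two-way mechanism (not the actual transparent network): each layer is a vector of named concepts, forward propagation accumulates votes, and when the last layer is ambiguous the strongest contributors are traced back as candidate error sources; the weights, margin and normalization are assumptions.

    import numpy as np

    def forward(features, weight_mats):
        activations = [np.asarray(features, dtype=float)]
        for w in weight_mats:                        # w: (lower_concepts, higher_concepts)
            a = activations[-1] @ w
            activations.append(a / (a.sum() + 1e-9))   # keep scores comparable
        return activations

    def trace_back(activations, weight_mats, margin=0.1):
        out = activations[-1]
        top2 = np.sort(out)[-2:]
        if top2[1] - top2[0] > margin:               # final layer is unambiguous
            return None
        ambiguous = int(out.argmax())
        sources = [ambiguous]
        for a, w in zip(activations[-2::-1], weight_mats[::-1]):
            contrib = a * w[:, sources[-1]]          # contribution of each lower neuron
            sources.append(int(contrib.argmax()))
        return sources[::-1]                         # neuron indices, low level to logical label

    w1 = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])   # 3 low-level -> 2 mid concepts
    w2 = np.array([[0.6, 0.4], [0.45, 0.55]])             # 2 mid -> 2 logical labels
    acts = forward([0.4, 0.3, 0.3], [w1, w2])
    print(trace_back(acts, [w1, w2]))    # chain of neurons to re-examine, or None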

 


Document indexing and retrieval

Nowadays, documentation needs have evolved from building up information to consulting it intelligently. Indeed, the mass of documents, combined with increasingly large storage capacities, obliges information systems to introduce intelligence into the search and personalized exploitation processes, by taking more and more account of user requirements.

The RNRT PAPLOO project is positioned in this area. It aims at defining a generic framework of transformation and document retrieval for personalized use. User preferences are taken into account throughout the processing chain. The first step of PAPLOO concerns document preparation: it makes it possible to obtain the documents in as detailed a form as possible with respect to their structure and content. For weakly structured or unstructured documents, the project envisages further recognition steps and structure retroconversion, followed by content indexing. Document retrieval is performed on the basis of keywords extracted from the user request. The search is then refined on the content in order to better satisfy the user's needs in terms of structure and layout.