A range of products now exists to dematerialize paper documents, that is, to scan, identify, and read them
automatically (letters, forms, invoices, checks, etc.), but these solutions are currently available only to large
companies. The DOD project is based on a new vision: dematerialization for all.
It is developing a new generation of technologies for building software and web-service (SaaS) products that can
handle automatically, after a simple learning-by-example phase, all the paper and electronic documents that arrive
daily in small and medium-sized enterprises, very small businesses, and homes.
The READ team takes part in the development of such products. About five topics are studied, concerning document
stream segmentation, incremental classification, heterogeneous document segmentation, and information extraction.
|
|
|
Document Stream Segmentation
|
|
Printed/Handwritten Text Separation
|
|
Incremental Clustering
|
|
Table Extraction
|
|
Entity Matching between Documents and a Structured Database
|
|
|
Under the ECLEIR project, in collaboration with the company eNovadys (Strasbourg), we seek to develop an aid
to the collection of chemical data in order to provide the community with a dedicated database. The work involves
digitizing the heritage knowledge contained in chemists' laboratory notebooks to make it accessible. These data
are highly structured, always following the same sequence: a reaction scheme, a table, several paragraphs of
text, and some images. Such a document is produced by hand and is therefore difficult to exploit digitally.
One of the challenges of the project is thus to develop tools that simplify the conversion of such manuscripts into electronic documents.
|
|
|
Document Segmentation
|
|
|
|
We study the contribution of stochastic models and neural networks to
unconstrained handwritten text recognition. The research focuses on two
aspects: two-dimensional Markov fields for the stochastic models and
adaptive topologies for the neural networks. Several applications have
been experimented with, such as bank check recognition, multi-scale
digit recognition, and multilingual word recognition.
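As a toy illustration of the stochastic side (the states, symbols, and probabilities below are invented for the example and are not the models actually used by the team), a hidden Markov model over character classes can be decoded with the Viterbi algorithm:

```python
import numpy as np

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence (log domain)."""
    n, m = len(obs), len(states)
    logv = np.full((n, m), -np.inf)   # best log-probability ending in state j at step t
    back = np.zeros((n, m), dtype=int)
    logv[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, n):
        scores = logv[t - 1][:, None] + np.log(trans_p)  # (from, to)
        back[t] = scores.argmax(axis=0)
        logv[t] = scores.max(axis=0) + np.log(emit_p[:, obs[t]])
    path = [int(logv[-1].argmax())]
    for t in range(n - 1, 0, -1):     # backtrace from the best final state
        path.append(int(back[t][path[-1]]))
    return [states[s] for s in reversed(path)]

# Toy model: two character classes emitting three stroke-feature symbols.
states = ["a", "b"]
start_p = np.array([0.6, 0.4])
trans_p = np.array([[0.7, 0.3], [0.4, 0.6]])
emit_p  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], states, start_p, trans_p, emit_p))
```

Two-dimensional Markov fields generalize this one-dimensional chain to a grid of pixel or cell states, but the decoding principle (maximizing a joint probability of hidden labels given observations) is the same.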
|
|
|
|
|
|
|
|
|
Document Image Analysis
|
|
This area aims
to analyze digitized document images and extract their structure in
order to obtain a reusable electronic version. We
study adaptive binarization techniques, segmentation into blocks and
regions, and keyword extraction. We are also investigating several
strategies for structure recognition and content interpretation.
Several applications have been experimented with on different types of
documents: administrative, scientific, historical, etc.
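One adaptive binarization technique in this family is Niblack-style local thresholding; the sketch below (the window size and k value are illustrative choices, not the team's actual parameters) computes a per-pixel threshold from the local mean and standard deviation, so that uneven illumination does not swallow the ink:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(img, window=15, k=-0.2):
    """Local threshold T = mean + k*std over a sliding window (Niblack's rule)."""
    img = img.astype(np.float64)
    mean = uniform_filter(img, size=window)
    sq_mean = uniform_filter(img * img, size=window)
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))
    threshold = mean + k * std
    return (img > threshold).astype(np.uint8)  # 1 = background, 0 = ink

# Synthetic page: a dark stroke on an uneven background gradient.
page = np.tile(np.linspace(120, 200, 64), (64, 1))
page[30:34, 10:50] = 20  # a horizontal "stroke"
binary = niblack_binarize(page)
print(binary[32, 30], binary[5, 30])  # stroke pixel -> 0, background -> 1
```

A global threshold would misclassify part of this gradient; the local rule adapts the threshold to each neighborhood, which is the point of adaptive binarization.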
|
|
|
|
|
|
|
|
|
|
|
|
Retroconversion consists in recovering the leading structure of the
document (its physical and logical fields). It relies on the one hand
on the segmentation result and on the other hand on the result of
content recognition (OCR). An a priori model guides the strategy by
proposing hypotheses of physico-logical structuring.
|
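As a hedged sketch of this idea (the model, labels, thresholds, and block format below are hypothetical, not the actual system), an a priori model can be written as an ordered list of logical hypotheses tested against the physical blocks and their OCR text:

```python
# Hypothetical a priori model: ordered logical fields with simple physical
# constraints (position on the page, font size) and textual constraints.
MODEL = [
    ("title",    lambda b: b["y"] < 0.15 and b["font_size"] >= 14),
    ("authors",  lambda b: b["y"] < 0.30 and "," in b["text"]),
    ("abstract", lambda b: b["text"].lower().startswith("abstract")),
    ("body",     lambda b: True),  # fallback hypothesis
]

def retroconvert(blocks):
    """Label each physical block with the first logical hypothesis it satisfies."""
    labeled = []
    for block in sorted(blocks, key=lambda b: b["y"]):  # reading order
        for label, accepts in MODEL:
            if accepts(block):
                labeled.append((label, block["text"]))
                break
    return labeled

# Blocks as they might come from segmentation + OCR (invented sample data).
blocks = [
    {"y": 0.05, "font_size": 18, "text": "Neural Document Analysis"},
    {"y": 0.20, "font_size": 10, "text": "A. Author, B. Author"},
    {"y": 0.35, "font_size": 10, "text": "Abstract: we study ..."},
    {"y": 0.50, "font_size": 10, "text": "Introduction ..."},
]
print(retroconvert(blocks))
```

The real strategy is richer (it reasons over competing hypotheses rather than taking the first match), but the principle is the same: the a priori model turns raw physical blocks into labeled logical fields.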
|
|
Several techniques have been studied
in our team concerning bibliographical document structure:
|
|
Most of our research has focused on structure modeling and on
reasoning strategies for the retroconversion of these structures.
|
|
Yves Rangoni employs a specific neural network, called a
transparent network, for structure extraction. Contrary to a classical
MLP, its layers are observable: each neuron is associated with a
specific concept. Local observations, and hence low-level features,
constitute the first layers, while the global context is given by the
last layers. To classify an input, a left-to-right propagation is used:
each neuron in each layer votes for a neuron in the next layer, leading
to a higher-level concept. The context thus becomes progressively more
precise, layer after layer, up to the final one. If ambiguities are
observed in the final layer, the system performs a right-to-left
propagation, which allows it 1) to find the error sources in each
layer and 2) to refine the input based on these errors. |
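The voting scheme can be sketched as follows (the concept names, weights, and ambiguity threshold are toy values, not Rangoni's actual network): each layer's neurons correspond to named concepts, a forward pass accumulates votes for higher-level concepts, and an ambiguous final decision triggers a backward inspection to locate the weakest evidence.

```python
import numpy as np

# Toy concept hierarchy: low-level features -> intermediate cues -> logical labels.
layers = [
    ["high_density", "large_font", "left_margin"],   # local observations
    ["text_line", "heading_cue"],                    # intermediate concepts
    ["paragraph", "title"],                          # logical structure
]
# weights[l][i, j]: vote of concept i in layer l for concept j in layer l+1.
weights = [
    np.array([[0.9, 0.1], [0.1, 0.9], [0.6, 0.4]]),
    np.array([[0.8, 0.2], [0.2, 0.8]]),
]

def forward(x, ambiguity=0.1):
    """Left-to-right propagation; returns the activations of every layer."""
    acts = [np.asarray(x, dtype=float)]
    for w in weights:
        a = acts[-1] @ w
        acts.append(a / a.sum())       # normalized votes for the next concepts
    top = np.sort(acts[-1])[::-1]
    if top[0] - top[1] < ambiguity:    # ambiguous decision: inspect backwards
        # Right-to-left pass: which input contributed most to the runner-up?
        blame = weights[0] @ weights[1][:, np.argsort(acts[-1])[-2]]
        print("ambiguous; most suspicious input:", layers[0][int(blame.argmax())])
    return acts

acts = forward([0.9, 0.1, 0.5])
print("decision:", layers[-1][int(acts[-1].argmax())])
```

Because every neuron is named, any intermediate activation vector in `acts` can be read directly as a set of concept confidences, which is what makes the network "transparent".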
|
|
Nowadays, documentation needs have evolved from the constitution of
information toward its intelligent consultation. Indeed, the growing
mass of documents, together with ever-larger storage capacities,
obliges information systems to introduce intelligence into search and
personalized-exploitation processes, taking more and more account of
user requirements.
|
|
|
The RNRT project PAPLOO is positioned in this area. It aims at defining a generic framework for document transformation and retrieval for personalized use. The user's wishes are taken into account throughout the processing chain. The first step of PAPLOO concerns document preparation: it makes the documents available in the most detailed possible form with respect to their structure and content. For weakly structured or unstructured documents, the project envisages additional recognition and structure retroconversion steps, followed by content indexing. Document retrieval is performed on the basis of keywords extracted from the user's request. The search is then refined on the content in order to better satisfy the user's needs in terms of structure and layout. |
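The keyword-based retrieval step can be illustrated with a simple tf-idf ranking (a generic sketch, not the PAPLOO implementation; the documents and query are invented):

```python
import math
from collections import Counter

docs = {
    "d1": "document structure recognition and layout analysis",
    "d2": "personalized retrieval of structured documents",
    "d3": "chemical reaction schemes in laboratory notebooks",
}

def tfidf_scores(query, docs):
    """Rank documents by the summed tf-idf of the query keywords."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n = len(docs)
    def idf(term):  # smoothed inverse document frequency
        df = sum(term in toks for toks in tokenized.values())
        return math.log((1 + n) / (1 + df)) + 1
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        scores[d] = sum(tf[t] / len(toks) * idf(t) for t in query.lower().split())
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(tfidf_scores("structured document retrieval", docs))
```

In the project's chain, this keyword ranking is only the first pass; the result set is then refined using the recognized structure and layout of the documents, which plain tf-idf ignores.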