Multilingual Automatic Document Classification Analysis and Translation (MADCAT)
Program Manager: Dr. Joseph Olive
Approach:
MADCAT has been structured in terms of the following technical task areas:
- Automatic transcription into English of handwritten or combined handwritten and printed Arabic images: These engines will accept scanned, photographed, or PDF documents that are either handwritten or a combination of printed and handwritten text. The output will be an English translation with confidence measures and proper capitalization and punctuation, preserving the original layout of the documents.
- Linguistic data acquisition: At least 10,000 printed documents, 15,000 handwritten documents, and 15,000 mixed printed and handwritten documents will be collected, organized, and annotated to support research, algorithm development, and performance evaluation. The data must include writing samples from at least 500 different writers. Annotation will consist of segmentation, zone interpretation, transcription, and translation into English.
- Evaluation: Methods will be created for evaluating the accuracy of the processing engines, including methods for generation of a gold standard translation and methods for editing system output.
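The data acquisition minimums above can be expressed as a simple check over a corpus manifest. The following is a minimal sketch only: the manifest layout (pairs of document type and writer ID), the type labels, and the function name `corpus_shortfalls` are assumptions made for illustration and are not part of the program description.

```python
# Minimum document counts and writer count stated in the task description.
MIN_DOCS = {"printed": 10_000, "handwritten": 15_000, "mixed": 15_000}
MIN_WRITERS = 500


def corpus_shortfalls(records):
    """Report how far a corpus falls short of the stated minimums.

    `records` is a hypothetical manifest: an iterable of
    (doc_type, writer_id) pairs, where writer_id is None for
    documents with no handwriting. Returns an empty dict when
    every minimum is met.
    """
    counts = {doc_type: 0 for doc_type in MIN_DOCS}
    writers = set()
    for doc_type, writer_id in records:
        if doc_type not in counts:
            raise ValueError(f"unknown document type: {doc_type!r}")
        counts[doc_type] += 1
        if writer_id is not None:
            writers.add(writer_id)

    # Keep only the categories that are below their minimum.
    shortfalls = {
        doc_type: MIN_DOCS[doc_type] - n
        for doc_type, n in counts.items()
        if n < MIN_DOCS[doc_type]
    }
    if len(writers) < MIN_WRITERS:
        shortfalls["writers"] = MIN_WRITERS - len(writers)
    return shortfalls
```

An empty result means the collected corpus satisfies all four minimums; otherwise each entry gives the number of documents or writers still missing.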
DARPA's desired end result includes:
- A transcription engine that produces English transcripts with 95% accuracy from Arabic printed images for 95% of the documents,
- A transcription engine that produces English transcripts with 90% accuracy from Arabic handwritten or combined handwritten and printed images for 95% of the documents.
To achieve this result, DARPA will test the MADCAT technologies in a series of carefully selected operational applications. Although the main thrust of the program is to develop technology for handling Arabic documents, the technologies developed under MADCAT will also be quickly adaptable to other languages and scripts. This language and script independence will be evaluated by testing on a surprise language during later program phases.
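The accuracy targets in the desired end result are two-level criteria: a per-document accuracy threshold (95% for printed, 90% for handwritten or mixed) that must be reached by 95% of the documents. The sketch below illustrates one reading of that criterion. How per-document accuracy is measured is not specified here, so the scores are assumed to be values between 0 and 1 produced by some evaluation method; the function name `meets_target` and the sample scores are hypothetical.

```python
def meets_target(doc_accuracies, min_accuracy, min_fraction=0.95):
    """Return True if at least `min_fraction` of the documents
    reach `min_accuracy`.

    `doc_accuracies` is a list of per-document scores in [0, 1]
    (an assumption; the scoring method is not defined here).
    """
    if not doc_accuracies:
        raise ValueError("no documents scored")
    passing = sum(1 for acc in doc_accuracies if acc >= min_accuracy)
    return passing / len(doc_accuracies) >= min_fraction


# Hypothetical scores: 90 documents at 0.99 and 10 at 0.80.
scores = [0.99] * 90 + [0.80] * 10
mean_accuracy = sum(scores) / len(scores)

# The mean is above 0.95, yet only 90% of documents reach the
# printed-document threshold, so the target is not met.
printed_ok = meets_target(scores, min_accuracy=0.95)
```

Under this reading, a high average accuracy is not sufficient: the target fails whenever more than 5% of the documents fall below the threshold, however well the remaining documents score.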
