Solicitation: BAA 07-38 (Closed)

Multilingual Automatic Document Classification Analysis and Translation (MADCAT)

Program Manager: Dr. Joseph Olive

Approach:

MADCAT has been structured in terms of the following technical task areas:
  • Automatic transcription into English of handwritten or combined handwritten and printed Arabic images: These engines will accept scanned, photographed, or PDF documents that are handwritten or that combine printed and handwritten text. The output will be an English translation with confidence measures and proper capitalization and punctuation, and it will preserve the original layout of the documents.


  • Linguistic data acquisition: A minimum of 10,000 printed documents, 15,000 handwritten documents, and 15,000 mixed printed and handwritten documents will be collected, organized, and annotated for effective research, algorithm development, and performance evaluation. The data must contain writing samples from at least 500 different writers. Annotation will consist of segmentation, zone interpretation, transcription, and translation into English.


  • Evaluation: Methods will be created for evaluating the accuracy of the processing engines, including methods for generation of a gold standard translation and methods for editing system output.
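
The collection minimums in the data acquisition task can be expressed as a simple check. The following is a minimal sketch, not part of the MADCAT program itself: the record format (a document type paired with an optional writer identifier), the names `MINIMUMS`, `MIN_WRITERS`, and `corpus_shortfalls` are all assumptions made for illustration. Only the numeric thresholds come from the text above.

```python
from collections import Counter

# Thresholds taken from the task description; the names are hypothetical.
MINIMUMS = {"printed": 10_000, "handwritten": 15_000, "mixed": 15_000}
MIN_WRITERS = 500


def corpus_shortfalls(records):
    """Report how far a corpus falls short of the stated minimums.

    `records` is an iterable of (doc_type, writer_id) pairs, where
    writer_id may be None for documents with no handwriting.
    Returns a dict mapping each unmet requirement to the missing amount;
    an empty dict means every minimum is met.
    """
    counts = Counter()
    writers = set()
    for doc_type, writer_id in records:
        counts[doc_type] += 1
        if writer_id is not None:
            writers.add(writer_id)

    shortfalls = {
        doc_type: needed - counts[doc_type]
        for doc_type, needed in MINIMUMS.items()
        if counts[doc_type] < needed
    }
    if len(writers) < MIN_WRITERS:
        shortfalls["writers"] = MIN_WRITERS - len(writers)
    return shortfalls
```

Calling `corpus_shortfalls` on a manifest that is one mixed document short would return a dict with a single `"mixed"` entry, which makes gaps in the collection easy to spot.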

DARPA's desired end result includes:
  • A transcription engine that produces English transcripts with 95% accuracy from Arabic printed images for 95% of the documents.


  • A transcription engine that produces English transcripts with 90% accuracy from Arabic handwritten or combined handwritten and printed images for 95% of the documents.
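
Each target above combines two thresholds: a per-document accuracy level and the share of documents that must reach it. The sketch below shows one possible reading of that criterion. It assumes each document already has an accuracy score between 0 and 1; how that score is computed is not specified here, and the function name `meets_target` and its parameters are hypothetical.

```python
from typing import Sequence


def meets_target(
    doc_accuracies: Sequence[float],
    min_accuracy: float,
    min_fraction: float = 0.95,
) -> bool:
    """Return True if at least `min_fraction` of the documents
    reach `min_accuracy`.

    `doc_accuracies` holds one score in [0, 1] per document.
    An empty collection is treated as not meeting the target.
    """
    if not doc_accuracies:
        return False
    passing = sum(1 for score in doc_accuracies if score >= min_accuracy)
    return passing / len(doc_accuracies) >= min_fraction


# Printed-image target: 95% accuracy for 95% of the documents.
printed_scores = [0.97] * 96 + [0.80] * 4
printed_ok = meets_target(printed_scores, min_accuracy=0.95)

# Handwritten or mixed target: 90% accuracy for 95% of the documents.
handwritten_scores = [0.92] * 90 + [0.70] * 10
handwritten_ok = meets_target(handwritten_scores, min_accuracy=0.90)
```

Under this reading, a few very poor documents do not cause a failure as long as they make up no more than 5% of the set, whereas a high average accuracy alone is not sufficient.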

To achieve this result, DARPA will test the MADCAT technologies in a series of carefully selected operational applications. Although the main thrust of the program is to develop technology to handle Arabic documents, the technologies developed under MADCAT will also be quickly adaptable to other languages and scripts. Language and script independence will be evaluated by testing on a surprise language during later program phases.


