Translingual Information Detection, Extraction, and Summarization (TIDES) 1. Good morning, and thank you for joining us as ITO launches a new program in translingual information services. Your presence here is particularly important. Increasing portions of our time are spent seeking, finding, and utilizing information accessible over networks such as the World Wide Web. I am here today to seek your help, for it is clearly the case that information volume will continue to grow, and increasing proportions of it will not be in English. To that end, DARPA is initiating a program to develop the technology to seek, acquire, and utilize foreign language materials using English tools. The Translingual Information Detection, Extraction, and Summarization, or TIDES, program will deliver the machine translation technology and text processing capability to enable an individual to use English to specify a query of multilingual materials, and subsequently to extract detailed information from retrieved documents, to summarize a set of documents from multiple languages, and to translate relevant materials into English. 2. Translingual information services are needed now more than ever. The World Wide Web continues to grow exponentially. Two years ago, there were more than 200 million Web pages. Michael Lesk, the director of the Information and Intelligent Systems Division at NSF, tracked Web growth in his previous position at Bellcore, and estimated its growth at 10-fold per year, putting the Web today at well over 10 billion pages. Add to this the global production of audio and print materials, and we have a truly awesome amount of information. Prof. Robert Frederking at Carnegie Mellon University monitors the production of international Web pages. Observing that foreign language materials are growing at a faster rate than English materials, he predicts that by midyear 1999 the amount of non-English resources on the Web will exceed the English resources. 3. 
There exists on the Web an increasingly valuable array of international information. Many organizations and individuals of potential interest routinely release timely information via the Web and related networks. Yet much of this material is difficult to filter and interpret due to the relative lack of language skills within the U.S. Consider, for example, a speech delivered by the Tamil National leader, Mr. Pirapaharan, on May 13, 1998. This was the anniversary of the launch of Sri Lanka's biggest and longest assault on the Tamil homelands. Mr. Pirapaharan described how the Liberation Tigers of Tamil Eelam, or LTTE, defended against Sri Lanka's latest military ambitions. His speech was published on the Web and is reproduced here. Suffice it to say that there are relatively few individuals in the U.S. who are competent to interpret his remarks. 4. The TIDES program’s goal is to make these kinds of foreign language materials accessible and usable in English. This includes the ability to express an information need in English, use this query to search among materials in a multitude of languages, retrieve relevant materials, translate their content into English, extract names of people, places, organizations, and related entities, identify events of interest, and correlate the content of an array of documents occurring in multiple languages. The objective is to be able to quickly and accurately develop a comprehensive understanding of unfolding international situations. 5. This world map is color-coded with a most optimistic view of the state of machine translation today. Countries colored dark green are either English-speaking countries or countries for which commercial machine translation products are available for the primary language of that country. These systems were largely the result of cold war Defense Department investments and have resulted in a modestly successful commercial industry. 
In light green are countries for which machine translation capability is being developed. Yellow indicates countries in which a primitive capability is available, and the red identifies those countries for which no machine translation capability is known to exist. Machine translation is available for major European and, to a lesser extent, Asian languages, but it is generally not available for areas of current and anticipated concern. 6. Overlaying the map with stars to identify countries for which the State Department has issued travel warnings, and with bombs to indicate countries of recent terrorist incidents, we readily see that areas of concern are poorly served by contemporary machine translation products and services. 7. The targets of the TIDES program are, therefore, (1) to enable the acquisition of information across a wide variety of languages, (2) to accommodate a new language of interest by rapidly developing the necessary machine translation capability, and (3) to facilitate the effective utilization of foreign language materials by providing the capability to extract detailed information from them, correlate facts across multiple documents, and generate coherent summaries of multiple document sets including materials in multiple languages. 8. Analysts confront an impossible problem. Exhaustive search is typically expected, in order to avoid missing a key fact, event, or relationship, and most information is in text, including, for example, the Web, newswire, cables, printed documents, OCR’d paper documents, and transcribed speech. Critical information sources occur in unfamiliar languages. There are always many simmering pots, and it is unpredictable which will heat up. There are over 70 languages of critical interest in PACOM’s area of responsibility alone. Commercial machine translation is inadequate, and essentially non-existent for all but the major world languages. 9. The world today is a complex, dynamic place. 
There are about 228 countries, in which there are more than 6700 languages spoken, and if you count the different dialects, variations, and names by which these languages go, the number balloons up closer to 40,000. The TIDES program harbors no expectations of addressing a scope this broad, but it does intend to deliver representative capabilities for the major families of world languages and to develop the capability to move into a new language of interest rapidly. 10. In order to better understand the TIDES problem space, consider the typical requirement to produce a report. The analyst’s job is to find that information which contributes to an understanding of the problem and to utilize this information in the creation of the report. The total set of information available to the analyst, including libraries, local databases, the Web, intelligence reports, and related materials is included within the information space. This is typically a huge volume of information. The analyst’s skills are brought to bear to extract that which is relevant to the problem from this space. 11. The typical steps an analyst goes through in this process are summarized here. Information Retrieval creates a subset of the total information available presumed to be relevant to the specific problem. Topic detection buckets the retrieved materials into categories of interest. Entity extraction identifies the names of people, corporations, and organizations, dates, times, events, and related entities, in order to establish correlation among related entities. The volume of information relevant to a specific problem overwhelms the analyst. Summarization typically strives to reduce the set of information to be examined by a factor of 10. We will now look at this process as it could be assembled today. 12. Information retrieval is a relatively mature technology, reflected in a highly successful commercial market. Information retrieval has been an active field of research for decades. 
But despite the successes, it remains the case today that on formal evaluations measured by precision and recall, the best retrieval systems perform only modestly well. Precision measures the fraction of materials retrieved that are actually relevant to the query. Recall measures the fraction of relevant materials that are actually retrieved. The best retrieval systems available today typically deliver a combined performance of about 50% on precision and recall. So the analyst responsible for developing a comprehensive assessment is already working at a disadvantage. 13. DARPA initiated research in 1997 on automated topic detection through a program called Topic Detection and Tracking, or TDT. The objective was to automatically identify the topics of news stories. The pilot study used text from Reuters North American and transcripts from various shows on CNN. The problem consisted of 3 subproblems: (1) identifying boundaries between stories in the text stream, or “segmentation”, (2) recognizing stories that are related to target events, or “recognition”, and (3) correlating stories on the same event, or “tracking”. In the first year of research, target events were identified out of a series of nearly 16,000 stories with typically 75% accuracy. This very encouraging result suggests that automatic topic detection could become a fundamental enhancement to information retrieval and utilization. 14. DARPA has, likewise, pursued through the Message Understanding Conferences (otherwise known as MUC) the automated extraction of the names of people and their organizations, references to geographic places, and identification of events. Substantial success has been achieved in this form of extraction of low-level semantics. We can now automatically extract these relatively simple noun forms with approximately 80% accuracy and use them to improve our overall retrieval strategies and information understanding processes. 15. 
Automated summarization of a single document is difficult and complex. Summarization can be characterized by type, content, perspective, and performance. Two types of summarization are extraction, containing verbatim material selected from the source, and abstraction, representing a condensation and reformulation of material in the source. Content can focus on the newest facts, or may provide background or tutorial information. The perspective may be that of the author or designed to reflect the specific interests of the user. Performance can be measured simply by the level of compression, the ratio of the length of the summary to the length of the article, or by omission, the ratio of relevant information omitted in the summary to the set of relevant information contained in the article. This latter measure is very difficult to ascertain. Coherent summarization of multiple documents is currently beyond the state of the art, but when available could be used to sharpen the query and improve the return from information retrieval. 16. These information retrieval and processing stages, while individually imprecise, still produce valuable results. While the serial application of them to a specific problem may return only a third of the content represented in a large information space, the immensity of the information spaces with which we deal works to the advantage of statistical processes. Additionally, human expertise is a critical component. A recent experiment conducted on the MEDLINE database by researchers at the University of Illinois was reported in the September 18, 1998 issue of Science magazine. One participating physician observed before the trials that “If you work hard at it and you have a lot of time, you can usually, but not always, find the information that you are looking for.” Augmented with advanced access and visualization tools produced by the research team, another physician observed, “It’s wonderful. 
I’m now getting far more useful information out of MEDLINE, and I’m getting it in a time frame… while the patient was in my office.” It is also important to note that the tools in use today are designed for single languages, not the multiplicity of languages represented on the Web or in commercial publication. 17. When one moves into the multilingual arena, issues of mapping your query into the target language and interpreting the results dominate. The TREC text retrieval conference continues to explore this area in a cross-lingual retrieval track for English, German, French, and Italian. Performance to date has typically been around half of the precision and recall results returned on monolingual collections. The TIDES program intends to substantially increase the level of performance on cross-lingual retrieval and expand dramatically the number of languages in which retrieval is supported. 18. It intends to do this by exploiting the knowledge extracted in each stage of the information utilization process, and by feeding this knowledge back to the user for refinement and back to the retrieval process as a more precise representation of the query. The TIDES program will test a number of hypotheses in this area: (1) an end user can employ machine translation to refine a foreign language query and improve retrieval performance by 50%; (2) identification of coherent and consistent topics appearing in retrieved materials and feeding these back to the retrieval process will improve performance by another 25%; (3) names, places, events, and related entities can be extracted from the results of a multilingual search, correlated, and fed back to improve retrieval performance by yet another 25%; and (4) a coherent multidocument summary can be used as a refined query to deliver translingual performance comparable to that achieved currently in monolingual systems. In the interest of time, I’ll skip the next 3 slides in your book and move directly to the goals of the program. 19. 
The 3-year goals for the TIDES program include (1) performance on translingual queries at 75% of the performance for monolingual queries, (2) the ability to establish a retrieval capability in a new language within one month, and to deliver high quality translation capability within one year, (3) 50% accuracy in recognition of topics and named entities across multiple languages, and (4) the generation of coherent query-specific summaries from multiple documents in at least 2 languages. 20. In five years, the TIDES program intends to deliver (1) translingual capabilities for at least 30 languages, (2) 80% accuracy in translingual entity correlation, (3) 70% accuracy in filling out multilingual templates, and (4) the ability to generate query-specific, coherent summaries of up to 20 documents in at least 4 languages. 21. This is the TIDES program. It intends to develop tools to tame the flood of multilingual information. Armed with these tools, Defense analysts will be better able to understand unfolding international situations and provide timely and valid interpretations to strategic and tactical decision makers. I am pleased to announce that the TIDES BAA is being released today. Proposals are due on July 26. I thank you for your attention, and I look forward enthusiastically to your ideas, approaches, and capabilities for achieving the goals of the TIDES program.