Global Document Annotation (GDA)


The GDA Initiative aims at having Internet authors annotate their electronic documents with a common standard tag set which allows machines to automatically recognize the semantic and pragmatic structures of the documents. A huge amount of annotated data is expected to emerge, which should serve not just as tagged linguistic corpora but also as a worldwide, self-extending knowledge base mainly consisting of examples of how our knowledge manifests.

GDA comprises the following three steps.

  1. Propose an XML tag set which allows machines to automatically infer underlying structures of documents.
  2. Promote development and spread of communication-aiding applications which exploit those tags.
  3. Thereby motivate the authors of WWW files to annotate their documents with those tags.

The example of annotated text below, which will appear just as "time flies like an arrow" when seen through a WWW browser, should give a flavor of the tags to be proposed in Step 1.
  <np sem="time0">time</np>
  <v sem="fly1">flies</v>
  <adp>like <np>an <n>arrow</n></np></adp>
XML elements such as <np>...</np> encode parse tree bracketing, and the attribute sem disambiguates polysemous words. Note that these tags reduce the notorious ambiguities involved here, so that the underlying structure of the sentence can easily be determined automatically with present technology. The tags proposed in Step 1 will also encode coreferences, scopes of logical/modal operators, rhetorical structure, social relationship between the author and the audience, and so on, in order to render the document machine-understandable to various degrees.
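
As a minimal sketch of how a program might read such annotation, the Python fragment below parses the example with the standard library. The <sentence> wrapper element and the helper names are assumptions made only for illustration (the fragment needs a single root to be well-formed XML); they are not part of any GDA tag set.

```python
import xml.etree.ElementTree as ET

# The annotated example from the text. The <sentence> root is a hypothetical
# wrapper added here only so that the fragment is well-formed XML.
ANNOTATED = (
    '<sentence>'
    '<np sem="time0">time</np> '
    '<v sem="fly1">flies</v> '
    '<adp>like <np>an <n>arrow</n></np></adp>'
    '</sentence>'
)


def plain_text(elem):
    """Return the text as a browser would show it, with all tags ignored."""
    return "".join(elem.itertext())


def sense_tags(elem):
    """Collect (word, sense) pairs from every element carrying a sem attribute."""
    return [(e.text, e.get("sem")) for e in elem.iter() if e.get("sem") is not None]


def bracketing(elem):
    """Render the element nesting as a labeled bracket string (the parse tree)."""
    parts = []
    if elem.text and elem.text.strip():
        parts.append(elem.text.strip())
    for child in elem:
        parts.append(bracketing(child))
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return "[" + elem.tag + " " + " ".join(parts) + "]"


root = ET.fromstring(ANNOTATED)
print(plain_text(root))
print(sense_tags(root))
print(bracketing(root))
```

The point of the sketch is that no parsing or word-sense disambiguation is performed: the bracketing and the senses are simply read off the markup.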

Step 2 concerns AI applications such as machine translation, information retrieval, information filtering, data mining, consultation, and expert systems. If annotation with tags such as those described above can be assumed, the accuracy of such applications can certainly be improved drastically. For instance, translation using those tags will be almost guaranteed to produce intelligible output. New types of communication-aid applications may be invented as well. As far as Steps 1 and 2 are concerned, there are similar research activities (Cunningham et al. 1997, McKelvie et al. 1997, and Zajac 1997). GDA could be regarded as a standardization of NLP tools such as those addressed in these activities, plus tagging by humans to embody Step 3.

The Internet has opened up an era in which an unrestricted number of people publish their messages electronically through their home pages. Step 3 channels their desire to present themselves to the widest and best possible audience towards tagging. WWW authors will be motivated to annotate their home pages because documents annotated according to a common standard can be translated, retrieved, and so on with higher accuracy, and thus have a greater chance of reaching many readers of the right kind. Tagging thus makes documents stand out much more effectively than decorating them with pictures and sounds. People tend to be unaware of the original spirit of SGML and use HTML almost solely for visual layout, but GDA will reintroduce that spirit to the HTML world.

Mitchell (Selman et al., 1996) proposes to develop a program to turn the WWW into a knowledge base, but that involves automatic understanding of WWW documents, which requires a huge knowledge base from the outset. Given the current state of the art, human aid to machine understanding of those documents is necessary in order to maintain the quality of the resulting knowledge base. The annotation proposed in GDA will serve as such aid. It will be possible to automatically transform tagged documents into a formal language for knowledge representation.

However, GDA does not intend to propose a canonical knowledge representation language, and the tag sets proposed in GDA are by no means intended to be one. In line with the spirit of SGML, it is important that annotated documents can be transformed into various formats of internal representation depending on the needs of different applications. Seen from the viewpoint of a worldwide knowledge base, the GDA tags aim at facilitating information interchange, sharing, and reuse (communication in a wide sense of the term) across different formats of internal representation. This is a reasonable target because AI researchers will not agree upon a definitive knowledge representation language very soon. More importantly, this target overlaps with the interchange of information among end users across the Internet. Communication support for creating the knowledge base and communication support for serving general Internet users are thus woven together. GDA captures the point where researchers' demand and end users' demand happen to meet, and steers end users towards meeting both.

Generally speaking, manual tagging varies from one user to another and from one situation to another, even if professionals are employed under strict control. With many end users tagging their files, the variance in GDA is likely to be very large. However, disagreement in tagging is allowed as long as the underlying structure of the documents can be automatically recognized without ambiguity. The good news is that resolving tagging disagreement is obviously much easier than resolving ambiguity in documents, and that the amount of annotated data is expected to be huge enough to statistically exclude tagging noise.
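
One simple way to picture statistical exclusion of tagging noise is a majority vote over the sense tags that different authors assign to the same word in comparable contexts. The sketch below is only an illustration under that assumption: the function name, the threshold, and the toy observations (including the sense identifier fly2) are hypothetical and are not taken from GDA.

```python
from collections import Counter


def majority_sense(observations, min_share=0.5):
    """Return the sense tag used in more than min_share of the observations.

    Returns None when there are no observations or when no tag is frequent
    enough, i.e. when the disagreement cannot be settled by counting alone.
    """
    counts = Counter(observations)
    if not counts:
        return None
    tag, n = counts.most_common(1)[0]
    return tag if n / len(observations) > min_share else None


# Toy data: four hypothetical authors tag "flies" in the same context.
print(majority_sense(["fly1", "fly1", "fly2", "fly1"]))
```

With enough annotated data, an occasional deviant tag is outvoted; an even split is left unresolved rather than guessed.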

We are planning to publish the first versions of standard tag sets in the summer of 1998. These tag sets will be mostly extracted from the tags proposed in TEI, EAGLES, and CES, among others. Each tag set will be designed to serve some particular application such as translation or retrieval.

Tag sets include sense tags to disambiguate polysemous words and phrases. The first version of the sense tag dictionary will probably be made from WordNet, which raises no copyright problem. The dictionary should then be continually extended by adding new entries and dividing existing entries into finer-grained ones, typically as new vocabulary is incorporated from languages for which GDA applications are developed for the first time; the sense tags are of course designed to be translingual. In this connection, EuroWordNet will be a large source of new entries in the near future.

There are more issues concerning the tags. For example, software tools that help people mark up their documents are necessary to reduce the cost of manual annotation. Also, the tags should be extended to encode multimodal structures, though the first versions of the tag sets will mainly concern text data.

Tag-Based Applications

Steps 2 and 3 of GDA form a positive feedback cycle. When plenty of high-quality, low-cost communication-aid services using the tags are available, many people will be strongly motivated to mark up their files. When tagged documents abound, more competitive communication-aid services will be provided, mainly for commercial benefit. Once the cycle somehow starts turning, economic principles will account for the rest of the story. The question is how to kick off this cycle.

The AI research community should kick it off. The obvious merit of tagging for general users is simply the enjoyment of communication-support technologies, whereas the AI research community can enjoy not only social recognition but also the resulting knowledge base. It should hence be relatively easy to persuade the AI community of the merits of GDA.

So we would like to encourage the AI community to develop application systems which use GDA tags for information interchange across the Internet. Completed application systems are welcome because they will help kick off the cycle discussed above, but proposals to develop applications are also welcome as long as they are promising; we are expecting interesting ideas about how to use the tags.

Listed below are some examples of tag-based applications which could address the above challenge. They encompass not only short-term developments but also long-term research themes. Translation, retrieval, and filtering partially belong to the former category and are expected to provide high-quality service to Internet users very soon. The other applications discussed below might involve more basic research issues.

Since the tags are designed to help machines understand documents, applications using the tags will use NLP (natural language processing) technologies to various degrees. Once a document has been understood by exploiting the tags and transformed into an appropriate internal representation, however, the rest of the processing may have nothing to do with NLP. In many cases, the transformation to internal representation may be done as precompilation, if the scope of information needed for the application task is predictable.
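
As a rough illustration of such precompilation, the sketch below turns tagged documents into an inverted index from sense tags to document identifiers, after which retrieval is a plain lookup with no further language processing. It assumes a retrieval task that needs only the sem attribute; the document identifiers, the <sentence> wrapper, the sense identifier fly2, and the function name are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_sense_index(documents):
    """Precompile tagged documents into a mapping: sense tag -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for elem in root.iter():
            sem = elem.get("sem")
            if sem is not None:
                index[sem].add(doc_id)
    return index


# Toy collection: the same surface word "flies" with two different senses.
DOCS = {
    "doc-a": '<sentence><np sem="time0">time</np> <v sem="fly1">flies</v></sentence>',
    "doc-b": '<sentence><np sem="fly2">flies</np> <v>buzz</v></sentence>',
}

INDEX = build_sense_index(DOCS)
print(sorted(INDEX["fly1"]))
```

A query for the verb sense then retrieves only the first document, even though both contain the string "flies".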

To make good use of AI technologies in the current state of the art, we must construct an artificial environment in which AI is maximally useful. GDA is to create such an environment.

HASIDA Kôiti, Dr. Sci.
Social Intelligence Technology Research Lab.
AIST Tôkyô Waterfront
2-3-26, Aomi, Kôtô-ku, Tôkyô, 135-0064, Japan.