Tag-Based Information Filtering

Automatic abstraction is useful when you want to skim documents, typically to decide whether to read them carefully. Abstraction essentially means picking out the important parts of a document. Importance is calculated from features of those parts and from the rhetorical structure of the document. For instance, a very simple system (Luhn, 1958) uses only the number of keywords, whereas more complicated systems (Edmundson, 1969; Watanabe, 1996) use keywords together with sentence types, sentence locations, rhetorical relations, and so forth. Most such systems do not parse the text but rely on surface-level clues. If appropriate tags are present in the text, such systems can therefore extract more accurate information from it by parsing, and so generate better abstracts. Also, methods that exploit the semantic structure of input documents (Hasida et al., 1988) can be implemented easily by assuming tagged inputs.
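The keyword-counting idea can be made concrete with a small sketch. The Python code below is only an illustration in the spirit of the simplest approach mentioned above, not a reconstruction of any cited system: the stop-word list, the naive sentence splitter, and the names split_sentences, tokenize, and keyword_abstract are all assumptions introduced here.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a real system would use a larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "it", "on", "for"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def keyword_abstract(text, num_keywords=5, num_sentences=1):
    """Return the sentences containing the most keyword occurrences, in document order."""
    sentences = split_sentences(text)
    # Keywords are simply the most frequent non-stop words in the whole document.
    counts = Counter(w for s in sentences for w in tokenize(s))
    keywords = {w for w, _ in counts.most_common(num_keywords)}
    # Score each sentence by how many keyword occurrences it contains.
    scored = [(sum(1 for w in tokenize(s) if w in keywords), i)
              for i, s in enumerate(sentences)]
    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:num_sentences]
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]
```

Because the score depends only on surface word counts, this sketch knows nothing about sentence types or rhetorical relations; that is exactly the kind of information tags could supply.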

Sentences in an abstract need not simply be picked out of the original document; they may be composed from various parts of it. Resolving anaphora, ellipsis, paraphrase, and the like is vital in this respect. Most such coreferences will be marked up by GDA tags, which should greatly facilitate abstract generation. CES tags such as <ref> will be employed in GDA for that purpose.
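As a rough illustration of how marked-up coreference could help, the Python sketch below replaces each referring expression with the text of its antecedent, so that a sentence lifted out of context still reads on its own. The markup is hypothetical: the <s> and <np> elements and the id and target attributes are assumptions made for this example, not the actual GDA or CES attribute names.

```python
import xml.etree.ElementTree as ET

def render(element, antecedents):
    """Flatten an element to text, substituting antecedent text for resolved <ref>s."""
    target = element.get("target")
    if element.tag == "ref" and target in antecedents:
        return antecedents[target]
    parts = [element.text or ""]
    for child in element:
        parts.append(render(child, antecedents))
        parts.append(child.tail or "")  # text following the child inside this element
    return "".join(parts)

def self_contained_sentences(root):
    """Return each <s> as plain text with coreferences replaced by their antecedents."""
    # Hypothetical convention: any element carrying an id can serve as an antecedent.
    antecedents = {el.get("id"): "".join(el.itertext())
                   for el in root.iter() if el.get("id")}
    return [render(sentence, antecedents) for sentence in root.iter("s")]

# Hypothetical tagged input, for illustration only.
doc = ET.fromstring(
    '<doc><s><np id="j1">John Smith</np> wrote the report.</s>'
    '<s><ref target="j1">He</ref> finished it on Friday.</s></doc>'
)
print(self_contained_sentences(doc))
```

With the sample input, the second sentence comes out as "John Smith finished it on Friday.", which can be placed in an abstract without the sentence that introduced the antecedent.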

GDA will also incorporate tags like <figure>, which happens to be a CES tag as well, to encode embedded figures, tables, and so on. These tags can be used to include such extrasentential material in the abstract. This should be important in abstraction, because figures and tables are often useful for a quick overview. A major research issue would be how to estimate their importance.
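How to estimate the importance of a figure is left open here. Purely as a placeholder, the Python sketch below ranks figures by how often the text refers to them. This heuristic is an assumption introduced for illustration and is not proposed in this document, and the id and target attributes are again hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def rank_figures(root):
    """Return (figure id, reference count) pairs, most referenced first."""
    # Count how many <ref> elements point at each target (hypothetical attribute).
    ref_counts = Counter(ref.get("target") for ref in root.iter("ref"))
    figure_ids = [fig.get("id") for fig in root.iter("figure") if fig.get("id")]
    # sorted() is stable, so equally referenced figures keep document order.
    return sorted(((fid, ref_counts[fid]) for fid in figure_ids),
                  key=lambda pair: -pair[1])
```

A system could then include the top-ranked figures in the abstract, although reference counts alone are a crude proxy for importance.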
