Tag-Based Information Retrieval

Current information retrieval (IR) systems usually rely on either full-text matching or keyword spotting. A major problem with these systems is overmatching: they are biased towards high recall at the expense of precision. To improve both, matching mechanisms should exploit more of the information in the text, such as the semantic class of keywords and the meaning of sentences. Current natural language processing (NLP) techniques, however, are not mature enough to extract such information accurately. GDA tags should be of great help in improving NLP accuracy, which in turn will improve IR.

For instance, if you want to find documents about Mr. Washington, you may end up finding many documents about people called Washington as well as documents about the city of Washington or Washington State. A retrieval system could discard the irrelevant documents by using markup such as <name type=person>Washington</name>, as proposed in CES. The same distinction can be made with other tags. In particular, sense tags attached to keywords will be useful here.
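The filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not a GDA or CES implementation: the sample documents, the helper name `mentions`, and the `type=place` value are assumptions made for the example. Only the <name type=person> markup comes from the text.

```python
import re

# Matches name markup whose type attribute is unquoted or quoted,
# e.g. <name type=person>Washington</name>.
NAME_TAG = re.compile(r'<name\s+type="?(\w+)"?\s*>(.*?)</name>',
                      re.IGNORECASE | re.DOTALL)

def mentions(doc, surface, name_type):
    """Return True if doc tags `surface` as a name of the given type."""
    return any(
        tag_type.lower() == name_type and text.strip() == surface
        for tag_type, text in NAME_TAG.findall(doc)
    )

# Hypothetical annotated documents; "place" is an assumed type value.
docs = {
    "d1": "<name type=person>Washington</name> crossed the Delaware.",
    "d2": "The summit was held in <name type=place>Washington</name>.",
    "d3": "Apples from <name type=place>Washington</name> State sell well.",
}

# A plain keyword search for "Washington" would return all three documents.
hits = [doc_id for doc_id, text in docs.items()
        if mentions(text, "Washington", "person")]
print(hits)  # ['d1']
```

A regular expression is used here instead of an XML parser because the attribute value in the example markup is unquoted, which a strict XML parser would reject.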

Some retrieval systems accept a natural language sentence as a query, but most of them just extract keywords from the query sentence and search the database for these keywords. When you issue the query `Mary files a suit against Tom,' you are probably not interested in documents saying that Tom files a suit against Mary. To retrieve appropriate documents for this type of query, the system needs to understand the facts stated in the documents. Parsing is necessary for this understanding, but parsing accuracy becomes very low for long sentences. Tags that encode parse tree bracketing are very promising for improving parsing accuracy.
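The difference between keyword matching and matching on understood facts can be sketched as follows. The sketch assumes that each document has already been reduced to a single (agent, action, target) fact, for example by a parser helped by bracketing tags. The `Fact` type, its field names, and the sample data are hypothetical and serve only as illustration.

```python
from typing import NamedTuple

class Fact(NamedTuple):
    """A hypothetical, simplified representation of one stated fact."""
    agent: str
    action: str
    target: str

def keywords(fact):
    """The bag-of-keywords view: who did what to whom is lost."""
    return {fact.agent, fact.action, fact.target}

# Hypothetical facts assumed to have been extracted from two documents.
docs = {
    "d1": Fact("Mary", "file-suit", "Tom"),
    "d2": Fact("Tom", "file-suit", "Mary"),
}
query = Fact("Mary", "file-suit", "Tom")

# Keyword matching cannot tell the two documents apart.
keyword_hits = [d for d, f in docs.items() if keywords(query) <= keywords(f)]

# Matching on the roles keeps only the document that states the queried fact.
role_hits = [d for d, f in docs.items() if f == query]

print(keyword_hits)  # ['d1', 'd2']
print(role_hits)     # ['d1']
```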

The retrieved documents are usually sorted or classified when presented to the user. However, current methods for sorting and classification use only surface-level information from the texts. More accurate methods become possible if the documents are annotated. Morohashi et al. (1995, 1996) propose a unique navigation system which presents some properties of the currently retrieved documents through multiple views. For instance, a geographical view presents the number of documents for each region on a map. Their system infers the semantic categories of keywords just by consulting thesauri, but sense tags for keywords will greatly improve the performance of such a system.
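As a rough sketch of the data behind such a geographical view, the following Python fragment counts retrieved documents per region. It is not the system of Morohashi et al.: it assumes that the region labels have already been read off sense tags in each document, and the document identifiers and region names are made up for the example.

```python
from collections import Counter

# Hypothetical input: the regions mentioned in each retrieved document,
# assumed to come from sense tags rather than from a thesaurus lookup.
doc_regions = {
    "d1": {"Kanto"},
    "d2": {"Kanto", "Kansai"},
    "d3": {"Kansai"},
    "d4": set(),  # mentions no region
}

def region_counts(doc_regions):
    """Count how many documents mention each region."""
    counts = Counter()
    for regions in doc_regions.values():
        counts.update(regions)  # a set, so each document counts once per region
    return counts

counts = region_counts(doc_regions)
for region, n in sorted(counts.items()):
    print(region, n)
```

Storing the regions of a document as a set ensures that a document mentioning the same region several times is still counted once for that region.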
