Content-Based Indexing and Retrieval of Text Documents - Web Search Engines
The amount of information available on the Internet is growing at a very high rate, yet the lack of efficient indexing remains a major barrier to effective information retrieval on the web. Despite extensive research into web retrieval systems, it has been estimated that, with current indexing techniques, on average only 30% of the returned items are relevant to the user’s need (a precision of about 30%), and that 70% of all relevant items in the collection are never returned (a recall of about 30%). These results are far from ideal: a keyword query returns thousands of documents within milliseconds, but most of them do not meet the user’s need.

Existing indexing techniques, mainly those used by search engines, are keyword based. In other words, each document is represented by a set of meaningful terms (also called descriptors or keywords) that are believed to express its content. The major drawback of keyword-based methods is that they use only a small part of the information associated with a document as the basis for relevance decisions. As a consequence, irrelevant documents that use a query word in a different context may be retrieved, and relevant documents that describe the desired content in different words may be missed. To achieve better performance, more semantic information about the documents needs to be captured. Attempts to improve the traditional techniques using Natural Language Processing, logic and document clustering have offered some improvements.

Some KMRC members have supervised PhD students who worked on the development of new content-based indexing and retrieval algorithms (such as RST-Index), which use computational and linguistic techniques such as Rhetorical Structure Theory (RST) and Natural Language Understanding (NLU). There is still a need for PhD students to extend this research on content-based indexing and retrieval and to improve on what has been achieved so far.
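
The keyword-based representation described above, and its main drawback, can be illustrated with a minimal sketch. It is not RST-Index or any search engine's actual method; it is a plain inverted index written for illustration. The three toy documents, the relevance judgments, and all function names (`tokenize`, `build_index`, `search`, `precision_recall`) are assumptions introduced here.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic keyword tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Build an inverted index: keyword -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every keyword of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


def precision_recall(retrieved, relevant):
    """Precision = relevant share of what was returned;
    recall = returned share of what is relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy collection (assumption, not data from the text).
docs = {
    "d1": "The jaguar is a large cat native to the Americas.",
    "d2": "Jaguar unveiled a new luxury car this year.",
    "d3": "This big cat of the rainforest hunts at night.",
}

# Assumed relevance judgments for the information need "the animal jaguar".
relevant = {"d1", "d3"}

index = build_index(docs)
retrieved = search(index, "jaguar")
precision, recall = precision_recall(retrieved, relevant)

print(sorted(retrieved))   # ['d1', 'd2']
print(precision, recall)   # 0.5 0.5
```

In this toy case the query "jaguar" retrieves d2, which uses the word in a different context (the car), and misses d3, which describes the animal in different words. Both failure modes follow from matching on keywords alone, which is the gap that content-based indexing aims to close.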