Automatic indexing is the computerized process of scanning large volumes of documents against a controlled vocabulary, taxonomy, thesaurus or ontology and using those controlled terms to index large electronic document depositories quickly and effectively. The keywords or controlled language are applied by training a system on rules that determine which words to match. The rules can also take into account syntax, usage, proximity, and other criteria, depending on the system and on what the indexing requires. These conditions are expressed as Boolean statements that capture the indexing information from the text.[1] As the number of documents increases exponentially with the proliferation of the Internet, automatic indexing will become essential to maintaining the ability to find relevant information in a sea of irrelevant information. Natural language systems are trained using seven different methods to help with this problem: morphological, lexical, syntactic, numerical, phraseological, semantic, and pragmatic. Each of these looks at different parts of speech and terms to build a domain for the specific information being indexed, and this domain is then used in the automated indexing process.[1]
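As a rough illustration of how a word match can be combined with a proximity condition in a Boolean rule, the following Python sketch assigns controlled terms to a piece of text. The rule format, the `RULES` list, the sample terms and the function names are all hypothetical and do not describe any particular indexing product; real systems also apply the syntactic, semantic and other methods listed above.

```python
import re

# Hypothetical rule base (an assumption for illustration only).
# Each rule says: assign `term` IF the word `match` occurs
# AND (when `near` is given) at least one of the `near` words
# appears within `window` tokens on either side of it.
RULES = [
    {"term": "Mercury (planet)", "match": "mercury",
     "near": {"planet", "orbit", "solar"}, "window": 5},
    {"term": "Mercury (element)", "match": "mercury",
     "near": {"toxic", "thermometer", "metal"}, "window": 5},
    {"term": "Indexing", "match": "indexing", "near": None, "window": 0},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_document(text, rules=RULES):
    """Return the sorted list of controlled terms whose rule fires."""
    tokens = tokenize(text)
    assigned = set()
    for rule in rules:
        for i, token in enumerate(tokens):
            if token != rule["match"]:
                continue
            if rule["near"] is None:
                # Plain word match with no proximity condition.
                assigned.add(rule["term"])
                break
            # Collect the tokens within `window` positions of the match.
            start = max(0, i - rule["window"])
            context = set(tokens[start:i] + tokens[i + 1:i + 1 + rule["window"]])
            if context & rule["near"]:
                assigned.add(rule["term"])
                break
    return sorted(assigned)


print(index_document("Mercury is the planet closest to the Sun."))
print(index_document("Automatic indexing flagged the toxic mercury spill."))
```

The proximity window is what lets the same trigger word map to different controlled terms; a rule with no `near` set behaves as a simple keyword match.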
The automated process can encounter problems, which are primarily caused by two factors: 1) the complexity of the language; and 2) the computing technology's lack of intuitiveness and its difficulty in extrapolating concepts from statements.[2] These are primarily linguistic challenges, and the specific problems involve semantic and syntactic aspects of language.[2] The problems arise relative to a set of defined keywords, against which the accuracy of the system can be measured in terms of Hits, Misses, and Noise. Hits are exact matches, Misses are keywords that the computerized system missed but a human indexer would have selected, and Noise is keywords that the computer selected but a human would not have. Measured against human indexing as the 100% baseline, the accuracy statistic should show Hits above 85%, which leaves Misses and Noise combined at 15% or less. This scale provides a basis for what is considered a good automatic indexing system and shows where problems are being encountered.[1]
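The Hits, Misses and Noise figures can be computed by comparing the keywords a human indexer assigned with those the system assigned. The sketch below is one possible reading, not a standard formula: it assumes that the three counts are expressed as shares of their combined total, so that Hits above 85% implies Misses and Noise together at 15% or less. The function names and the way the 0.85 threshold is applied are illustrative assumptions.

```python
def score_indexing(human_terms, machine_terms):
    """Compare machine-assigned keywords with a human indexer's keywords.

    Assumption: each rate is a share of hits + misses + noise,
    so the three rates sum to 1.0.
    """
    human, machine = set(human_terms), set(machine_terms)
    hits = human & machine      # chosen by both human and machine
    misses = human - machine    # chosen by the human only
    noise = machine - human     # chosen by the machine only
    total = len(hits) + len(misses) + len(noise)
    if total == 0:
        raise ValueError("no keywords to compare")
    return {
        "hits": len(hits) / total,
        "misses": len(misses) / total,
        "noise": len(noise) / total,
    }


def is_acceptable(scores, threshold=0.85):
    """Apply the 'Hits above 85%' rule of thumb described in the text."""
    return scores["hits"] > threshold


human = ["taxonomy", "thesaurus", "ontology", "indexing"]
machine = ["taxonomy", "thesaurus", "ontology", "internet"]
scores = score_indexing(human, machine)
print(scores)
print(is_acceptable(scores))
```

In this example the system finds three of the four human keywords, misses one and adds one unwanted keyword, so it falls well short of the 85% mark.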