Download Language Modeling for Information Retrieval by W. Bruce Croft, John Lafferty PDF
By W. Bruce Croft, John Lafferty
A statistical language model, or more simply a language model, is a probabilistic mechanism for generating text. Such a definition is general enough to include an endless variety of schemes. However, a distinction should be made between generative models, which can in principle be used to synthesize artificial text, and discriminative techniques to classify text into predefined categories. The first statistical language modeler was Claude Shannon. In exploring the application of his newly founded theory of information to human language, Shannon considered language as a statistical source, and measured how well simple n-gram models predicted or, equivalently, compressed natural text. To do this, he estimated the entropy of English through experiments with human subjects, and also estimated the cross-entropy of the n-gram models on natural text. The ability of language models to be quantitatively evaluated in this way is one of their important virtues. Of course, estimating the true entropy of language is an elusive goal, aiming at many moving targets, since language is so varied and evolves so quickly. Yet fifty years after Shannon's study, language models remain, by all measures, far from the Shannon entropy limit in terms of their predictive power. However, this has not kept them from being useful for a variety of text processing tasks, and moreover can be viewed as encouragement that there is still great room for improvement in statistical language modeling.
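As a rough illustration of how a language model can be evaluated quantitatively, the sketch below estimates the cross-entropy (in bits per token) of a word-level unigram model, the simplest n-gram case, on held-out text. The add-one smoothing, the toy sentences, and the function names `train_unigram` and `cross_entropy` are assumptions made for this example; they are not taken from the book.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab):
    """Estimate a unigram model with add-one smoothing (an assumption here)."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def cross_entropy(model, tokens):
    """Average negative log2 probability the model assigns to each token."""
    return -sum(math.log2(model[w]) for w in tokens) / len(tokens)


# Toy data, purely illustrative.
train = "the cat sat on the mat".split()
held_out = "the cat sat".split()
vocab = set(train) | set(held_out)

model = train_unigram(train, vocab)
bits_per_token = cross_entropy(model, held_out)
perplexity = 2 ** bits_per_token
print(f"cross-entropy: {bits_per_token:.3f} bits/token, perplexity: {perplexity:.3f}")
```

A lower cross-entropy means the model predicts (or, equivalently, compresses) the held-out text better, which is the sense in which models can be compared against one another.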
Similar storage & retrieval books
On the World Wide Web, speed and efficiency are vital. Users have little patience for slow web pages, while network administrators want to make the most of their available bandwidth. A properly designed web cache reduces network traffic and improves access times to popular web sites, a boon to network administrators and web users alike.
The two-volume set LNCS 8796 and 8797 constitutes the refereed proceedings of the 13th International Semantic Web Conference, ISWC 2014, held in Riva del Garda, in October 2014. The International Semantic Web Conference is the premier forum for Semantic Web research, where cutting-edge scientific results and technological innovations are presented, where problems and solutions are discussed, and where the future of this vision is being developed.
This book identifies and discusses the main challenges facing digital business innovation and the emerging trends and practices that will define its future. The book is divided into three sections covering trends in digital systems, digital management, and digital innovation. The opening chapters consider the issues associated with machine intelligence, wearable technology, digital currencies, and distributed ledgers as their relevance for business grows.
This book offers a thorough yet easy-to-read reference guide to various aspects of cloud computing security. It begins with an introduction to the general concepts of cloud computing, followed by a discussion of security aspects that examines how cloud security differs from conventional information security and reviews cloud-specific classes of threats and attacks.
Additional resources for Language Modeling for Information Retrieval
3 English processing. Prior to any experiments, each dataset was processed as follows. Both documents and queries were tokenized on whitespace and punctuation characters. Tokens with fewer than two characters were discarded. , 2001) stemmer, which combines morphological rules with a large dictionary of special cases and exceptions. , 2001) stop-list were removed. All of the remaining tokens were used for indexing, and no other form of processing was used on either the queries or the documents. 4 Chinese Resources.
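The English preprocessing steps described above can be sketched roughly as follows. The excerpt does not preserve the names of the stemmer or the stop-list, so both are passed in as parameters: `toy_stem` and `toy_stop_words` are hypothetical stand-ins, not the resources used in the book. Applying the stop-list after stemming follows the order in which the steps appear in the excerpt and is also an assumption.

```python
import re
import string

# Split on any run of whitespace and/or ASCII punctuation characters.
_SPLIT = re.compile(r"[\s" + re.escape(string.punctuation) + r"]+")


def preprocess(text, stem=lambda token: token, stop_words=frozenset()):
    """Tokenize, drop tokens shorter than two characters, stem, remove stop words."""
    tokens = [t for t in _SPLIT.split(text) if len(t) >= 2]
    stems = [stem(t) for t in tokens]
    return [t for t in stems if t not in stop_words]


def toy_stem(token):
    """Hypothetical placeholder stemmer: strips a trailing 's' from longer words."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


# Hypothetical stop-list, for illustration only.
toy_stop_words = frozenset({"the", "of", "in"})

print(preprocess("environmental protection laws, in a nutshell!",
                 stem=toy_stem, stop_words=toy_stop_words))
```

The same function would be applied to both queries and documents, so that the indexed terms and the query terms are normalized identically.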
Without experimental evidence, it would be difficult to guess which ranking principle would perform better. Is it preferable to favor a few highly relevant words, as done by the probability ratio, or many possibly relevant words, as favored by cross-entropy? 7 suggest that rankings based on cross-entropy are noticeably better. However, it is important to realize that our experiments do not contradict the arguments of Robertson (Robertson, 1977), and do not diminish the importance of the probability ranking principle.
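To make the two ranking criteria concrete, here is a minimal sketch that scores documents in both ways. It assumes a relevance model and a background (non-relevance) model defined over the same vocabulary with nonzero probabilities, and a document model smoothed by linear interpolation with the background. The smoothing weight `lam`, the toy probabilities, and all function names are assumptions for illustration; the excerpt does not give the exact formulas used in the experiments.

```python
import math
from collections import Counter


def doc_model(doc, background, lam=0.5):
    """Document language model, linearly interpolated with the background model."""
    counts = Counter(doc)
    n = len(doc)
    return {w: lam * counts[w] / n + (1 - lam) * p for w, p in background.items()}


def cross_entropy_score(rel_model, doc, background, lam=0.5):
    """Cross-entropy of the relevance model against the document model; lower ranks higher."""
    p_doc = doc_model(doc, background, lam)
    return -sum(p * math.log(p_doc[w]) for w, p in rel_model.items())


def log_prob_ratio_score(rel_model, doc, background):
    """Log of P(doc | relevant) / P(doc | background); higher ranks higher."""
    return sum(math.log(rel_model[w] / background[w]) for w in doc)


# Toy models over a five-word vocabulary (illustrative numbers only).
background = {"the": 0.4, "of": 0.3, "environmental": 0.1, "protection": 0.1, "laws": 0.1}
rel_model = {"the": 0.1, "of": 0.1, "environmental": 0.3, "protection": 0.3, "laws": 0.2}

on_topic = ["environmental", "protection", "laws", "the"]
off_topic = ["the", "of", "the", "of"]

for name, doc in [("on_topic", on_topic), ("off_topic", off_topic)]:
    ce = cross_entropy_score(rel_model, doc, background)
    lr = log_prob_ratio_score(rel_model, doc, background)
    print(f"{name}: cross-entropy={ce:.3f}, log-ratio={lr:.3f}")
```

The probability ratio multiplies per-token ratios over the words of the document, so a handful of strongly indicative words can dominate the score. Cross-entropy instead sums over every word the relevance model gives mass to, weighted by that mass, so it rewards documents that cover many plausible words.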
We used query number 58 from the cross-language retrieval task of TREC-9. The English query was: "environmental protection laws". We show 20 tokens with highest probability under the model. It is evident that many stop-words and punctuation characters are assigned high probabilities. This is not surprising, since these characters were not removed during pre-processing, and we naturally expect these characters to occur frequently in the documents that discuss any topic. However, the model also assigns high probabilities to words that one would consider highly relevant to the topic of environmental protection.