Slides on Formal Characterization of IR Models | CS 410 | Study notes Computer Science

Lecture 3

CS 410/510

Information Retrieval on the Internet

Models in IR

I have met with but one or two persons in the course of my life who

understood the art of Walking, that is, of taking walks,—who had a genius,

so to speak, for sauntering: which word is beautifully derived from “idle

people who roved about the country, in the Middle Ages, and asked charity,

under pretence of going à la Sainte Terre,” to the Holy Land, till the children

exclaimed, “There goes a Sainte-Terrer,” a Saunterer, a Holy-Lander. They

who never go to the Holy Land in their walks, as they pretend, are indeed

mere idlers and vagabonds; but they who do go there are saunterers in the

good sense, such as I mean. Some, however, would derive the word from

sans terre, without land or a home, which, therefore, in the good sense, will

mean, having no particular home, but equally at home everywhere. For this

is the secret of successful sauntering. He who sits still in a house all the

time may be the greatest vagrant of all; but the saunterer, in the good

sense, is no more vagrant than the meandering river, which is all the while

sedulously seeking the shortest course to the sea. But I prefer the first,

which, indeed, is the most probable derivation. For every walk is a sort of

crusade, preached by some Peter the Hermit in us, to go forth and

reconquer this Holy Land from the hands of the Infidels.

- from an essay by Henry David Thoreau

What is this essay about? Justify your answer.

Taxonomy of IR models

• Basic taxonomy:

– Boolean (set theoretic)

– Vector (algebraic)

– Probabilistic (probabilistic)

• Note, will only consider “ad hoc” retrieval

tasks, not filtering or routing tasks

Note: See chapter 2 of Baeza-Yates text for more complete treatment of

definitions and formalisms

Formal characterization of IR models

An IR model is a quadruple $[D, Q, F, R(q_i, d_j)]$ where:

1. $D$ is a set of logical views (representations) for the documents in the collection

2. $Q$ is a set of logical views (representations) for the user information needs (queries)

3. $F$ is a framework for modeling document representations, queries, and their relationships

4. $R(q_i, d_j)$ is a ranking function that associates a real number with a query $q_i \in Q$ and a document $d_j \in D$ that defines an ordering among documents wrt the query $q_i$.
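As a rough illustration of how a ranking function induces an ordering, here is a minimal Python sketch. It assumes the simplest possible logical view (a document or query is a set of index terms) and uses a toy overlap count as the ranking function. The names `View`, `overlap_score`, and `rank` are hypothetical and are not taken from the lecture or the Baeza-Yates text.

```python
from typing import Callable, FrozenSet, List, Tuple

# Assumed logical view: a document or a query is a set of index terms.
View = FrozenSet[str]
# R(q_i, d_j): maps a (query, document) pair to a real number.
RankingFunction = Callable[[View, View], float]


def overlap_score(query: View, doc: View) -> float:
    """Toy ranking function (an assumption): count of shared terms."""
    return float(len(query & doc))


def rank(query: View, docs: List[View],
         r: RankingFunction) -> List[Tuple[int, float]]:
    """Order document indices by decreasing R(q, d)."""
    scored = [(j, r(query, d)) for j, d in enumerate(docs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = [
    frozenset({"vector", "space", "model"}),
    frozenset({"boolean", "model"}),
    frozenset({"walking", "essay"}),
]
query = frozenset({"vector", "model"})
ranking = rank(query, docs, overlap_score)
```

Swapping `overlap_score` for another function changes the model's ranking without touching the representations, which is the separation the quadruple is meant to express.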

Basic concepts

• Documents described by representative

keywords called index terms

– Could be assigned, extracted, selected

• May want to assign numerical weights to

indicate importance, reflecting ability to

– Summarize document contents

– Discriminate this document from others

• Ranking function generally predicts the relevance of query $q_i$ to document $d_j$

Weighted index terms

• Let $k_i$ be an index term and $d_j$ be a document

– $w_{i,j} \geq 0$ is the weight associated with the pair $(k_i, d_j)$

– Quantifies the importance of $k_i$ for describing $d_j$ or for discriminating $d_j$ from other documents

• Let $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$ be a vector of weighted index terms to describe $d_j$

– where $t$ is the number of index terms

• Usually make a simplifying assumption that the index term weights are independent

– They are not independent

– Some systems try to exploit co-occurrence data
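A small Python sketch of a document's weight vector may help. The slides do not fix a weighting scheme at this point, so the sketch assumes the crudest one, raw term counts. The function name `weight_vector` and the sample index terms are hypothetical.

```python
from collections import Counter
from typing import List


def weight_vector(doc_tokens: List[str], index_terms: List[str]) -> List[float]:
    """Return (w_1j, ..., w_tj) for one document.

    Assumption: the weight of a term is its raw count in the document.
    Terms that do not occur get weight 0, so every weight is >= 0.
    """
    counts = Counter(doc_tokens)
    return [float(counts[k]) for k in index_terms]


index_terms = ["holy", "land", "saunter", "walk"]  # t = 4
d_j = "every walk is a crusade to the holy land".split()
w_j = weight_vector(d_j, index_terms)
```

Each position in `w_j` is computed without looking at the other terms, which mirrors the independence assumption stated above.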



algebra

$(d_j, q)$

way to order results (but may not be useful)

Vector space model

to index terms

documents and queries

Vector space model

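Cosine of angle between two vectors:

$$\mathrm{sim}(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}$$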
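The cosine measure can be sketched in a few lines of Python. This is an illustrative implementation, not code from the course. The function name `cosine_similarity` is hypothetical, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math
from typing import Sequence


def cosine_similarity(d: Sequence[float], q: Sequence[float]) -> float:
    """Cosine of the angle between a document vector and a query vector."""
    if len(d) != len(q):
        raise ValueError("vectors must have the same number of index terms")
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        # Assumption: an empty document or query matches nothing.
        return 0.0
    return dot / (norm_d * norm_q)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the score depends only on the angle, a document and a scaled copy of it receive the same score for a given query. With non-negative weights the result lies between 0 and 1.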


relevance to the query, not similarity

be independent of relevance of other

documents

the relationship between document and

query, independent of user situation

probability of being in the relevant set

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})}$$

P(A|B) and rearrange:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
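A short numeric sketch of Bayes' rule in Python. It assumes that P(B) is obtained from the law of total probability over A and not-A. The function name `posterior` and the example probabilities are invented purely for illustration.

```python
def posterior(p_b_given_a: float, p_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) = P(B|A) P(A) / P(B).

    Assumption: P(B) = P(B|A) P(A) + P(B|not A) (1 - P(A)).
    """
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    if p_b == 0.0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return p_b_given_a * p_a / p_b


# Hypothetical reading: A = "document is relevant", B = "document contains the term".
p_relevant_given_term = posterior(p_b_given_a=0.8, p_a=0.1, p_b_given_not_a=0.2)
```

With these made-up numbers, observing the term raises the probability of relevance from the prior of 0.1 to roughly 0.31.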


to $p_i$

$$\log \frac{r\,(N - n - R + r)}{(R - r)(n - r)}$$
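The surviving symbols here appear to match the standard relevance weight of the probabilistic model, log of r(N - n - R + r) over (R - r)(n - r). The Python sketch below rests on that assumption. It also assumes the usual meanings of the symbols, which are not defined in the surviving text: N documents in the collection, n documents containing the term, R known relevant documents, and r relevant documents containing the term. The function name `relevance_weight` is hypothetical.

```python
import math


def relevance_weight(N: int, n: int, R: int, r: int) -> float:
    """Term weight log[ r (N - n - R + r) / ((R - r) (n - r)) ].

    Assumed meanings: N = collection size, n = documents containing the
    term, R = known relevant documents, r = relevant documents containing
    the term. The unsmoothed form is undefined when any factor is zero,
    so those cases are rejected here instead of being smoothed.
    """
    numerator = r * (N - n - R + r)
    denominator = (R - r) * (n - r)
    if numerator <= 0 or denominator <= 0:
        raise ValueError("weight is undefined for these counts")
    return math.log(numerator / denominator)


w_informative = relevance_weight(N=1000, n=100, R=10, r=5)
w_uninformative = relevance_weight(N=100, n=50, R=10, r=5)
```

A term that is as frequent among relevant documents as in the rest of the collection gets weight 0, and a term concentrated in the relevant set gets a positive weight.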