Slides on Formal Characterization of IR Models | CS 410, Study notes of Computer Science

Material Type: Notes; Professor: Maier; Class: TOP: INTRO TO MULTIMEDIA NTWRK; Subject: Computer Science; University: Portland State University; Term: Unknown 1989;

Typology: Study notes

Pre 2010

Uploaded on 08/16/2009

koofers-user-sog

Lecture 3

CS 410/510

Information Retrieval on the Internet

Models in IR

I have met with but one or two persons in the course of my life who understood the art of Walking, that is, of taking walks, who had a genius, so to speak, for sauntering: which word is beautifully derived from “idle people who roved about the country, in the Middle Ages, and asked charity, under pretence of going à la Sainte Terre,” to the Holy Land, till the children exclaimed, “There goes a Sainte-Terrer,” a Saunterer, a Holy-Lander. They who never go to the Holy Land in their walks, as they pretend, are indeed mere idlers and vagabonds; but they who do go there are saunterers in the good sense, such as I mean. Some, however, would derive the word from sans terre, without land or a home, which, therefore, in the good sense, will mean, having no particular home, but equally at home everywhere. For this is the secret of successful sauntering. He who sits still in a house all the time may be the greatest vagrant of all; but the saunterer, in the good sense, is no more vagrant than the meandering river, which is all the while sedulously seeking the shortest course to the sea. But I prefer the first, which, indeed, is the most probable derivation. For every walk is a sort of crusade, preached by some Peter the Hermit in us, to go forth and reconquer this Holy Land from the hands of the Infidels.

- from an essay by Henry David Thoreau

What is this essay about? Justify your answer.

Taxonomy of IR models

  • Basic taxonomy:
    • Boolean (set theoretic)
    • Vector (algebraic)
    • Probabilistic (probabilistic)
  • Note: will only consider “ad hoc” retrieval tasks, not filtering or routing tasks

Note: See chapter 2 of Baeza-Yates text for more complete treatment of definitions and formalisms

Formal characterization of IR models

An IR model is a quadruple $[D, Q, F, R(q_i, d_j)]$ where:

  1. $D$ is a set of logical views (representations) for the documents in the collection
  2. $Q$ is a set of logical views (representations) for the user information needs (queries)
  3. $F$ is a framework for modeling document representations, queries, and their relationships
  4. $R(q_i, d_j)$ is a ranking function that associates a real number with a query $q_i \in Q$ and a document $d_j \in D$, and that defines an ordering among documents with respect to the query $q_i$.
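
As a rough illustration of how a ranking function defines an ordering, the sketch below represents each logical view as a set of index terms and sorts documents by the value of a ranking function. The names `rank` and `overlap`, the document ids, and the choice of term overlap as the ranking function are all assumptions made for this example, not part of the slides.

```python
from typing import Callable, Dict, List, Set, Tuple

View = Set[str]  # hypothetical logical view: a set of index terms


def rank(query: View, docs: Dict[str, View],
         R: Callable[[View, View], float]) -> List[Tuple[str, float]]:
    """Order documents by decreasing value of the ranking function R(q, d)."""
    scored = [(doc_id, R(query, view)) for doc_id, view in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def overlap(query: View, doc: View) -> float:
    """Toy ranking function (an assumption): number of shared index terms."""
    return float(len(query & doc))


docs = {
    "d1": {"boolean", "model"},
    "d2": {"vector", "model", "cosine"},
    "d3": {"poisson"},
}
ranked = rank({"vector", "model"}, docs, overlap)
print(ranked)
```

Swapping `overlap` for another function changes the model's ranking behavior without changing the surrounding machinery.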

Basic concepts

  • Documents described by representative keywords called index terms
    • Could be assigned, extracted, selected
  • May want to assign numerical weights to indicate importance, reflecting ability to
    • Summarize document contents
    • Discriminate this document from others
  • Ranking function generally predicts the relevance of query $q_i$ to document $d_j$

Weighted index terms

  • Let $k_i$ be an index term and $d_j$ be a document
    • $w_{i,j} \ge 0$ is the weight associated with $(k_i, d_j)$
    • Quantifies the importance of $k_i$ for describing $d_j$ or for discriminating $d_j$ from other documents
  • Let $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$ be a vector of weighted index terms to describe $d_j$
    • where $t$ is the number of index terms
  • Usually make a simplifying assumption that the index term weights are independent
    • They are not independent
    • Some systems try to exploit co-occurrence data

Boolean model

  • Queries are Boolean expressions
  • Retrieval based on set theory & Boolean algebra

[Figure: Venn diagram of the documents with index term $k_1$ and the documents with index term $k_2$. Query: $k_1$ AND $k_2$. Documents retrieved: the overlap of the two sets.]

Boolean model

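[Figure: Venn diagram of the documents with index term $k_1$ and the documents with index term $k_2$. Query: $k_1$ OR $k_2$. Documents retrieved: the union of the two sets.]

[Figure: Venn diagram of the documents with index term $k_1$ and the documents with index term $k_2$. Query: $k_1$ NOT $k_2$. Documents retrieved: those in the $k_1$ set that are not in the $k_2$ set.]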

Boolean model

  • Index term weights all = 1
    • Index terms either present or absent
  • similarity:

$$\mathrm{sim}(d_j, q) = \begin{cases} 1 & \text{if any of the conjunctive components of the query is satisfied}^{*} \\ 0 & \text{otherwise} \end{cases}$$

  • *Where query is written as a Boolean expression in DNF (disjunctive normal form, i.e. a disjunction of conjunctive components)

Note similarity to data retrieval and DB query language
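
A minimal sketch of this binary similarity, assuming the query is given as a disjunction of conjunctive components. Each component is represented here as a dict mapping an index term to whether it must be present (`True`) or absent (`False`); that representation and the name `boolean_sim` are choices made for this example only.

```python
def boolean_sim(doc_terms, query_components):
    """Return 1 if any conjunctive component of the query is satisfied, else 0."""
    for component in query_components:
        # A component is satisfied when every term it mentions has the
        # required presence (True) or absence (False) in the document.
        if all((term in doc_terms) == required
               for term, required in component.items()):
            return 1
    return 0


# Hypothetical query: (k1 AND NOT k2) OR k3
query = [{"k1": True, "k2": False}, {"k3": True}]

print(boolean_sim({"k1"}, query))
print(boolean_sim({"k1", "k2"}, query))
print(boolean_sim({"k1", "k2", "k3"}, query))
```

The result is always 0 or 1, which is why the model offers no ranking among the retrieved documents.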

Boolean model

  • Prediction of relevance is binary
    • relevant or nonrelevant
  • No inherent ranking function; need some way to order results (but may not be useful)
    • Publication date
    • Alphabetically, e.g. by title or author
    • Random
  • Boolean operators difficult for many users
  • Result set often too small or too large

Doc 1: We had so much fun at the Kohler factory that Kaye suggested we check out the GM plant in Janesville. It's a huge plant, 3.5 million square feet, with 3 assembly lines. Two of them make trucks and Bluebird bus frames, but the line we saw makes Chevy Suburbans and similar light trucks, at the rate of one every 67 seconds.

Doc 2: A few overall comments on the Suburban line. Janesville is an assembly plant, so all the parts are made elsewhere, and come to the plant by truck and rail.

Doc 3: The Janesville facility was built by GM in 1919 as the Sampson Tractor Plant, and started making trucks as well the next year. In 1922 they started making Chevrolet passenger cars there.

Doc 4: There is very little inventory of parts on site. Basically, enough parts for one shift arrive at one time by train or truck.

Query | Doc 1 | Doc 2 | Doc 3 | Doc 4
janesville AND parts | | | |
frames OR parts | | | |
(truck OR trucks) NOT cars | | | |
(plant NOT parts) OR (truck AND train) | | | |
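
The four queries can be evaluated with ordinary set operations. The sketch below is one way to do it, under several assumptions: the passages are numbered 1 to 4 in the order they appear, index terms are lowercased alphabetic tokens with no stemming (so "truck" and "trucks" are different terms), and "A NOT B" is read as "A AND NOT B". The helper names are hypothetical.

```python
import re

# Passages copied from the slide; numbering them in order of appearance
# is an assumption.
DOCS = {
    1: ("We had so much fun at the Kohler factory that Kaye suggested we "
        "check out the GM plant in Janesville. It's a huge plant, 3.5 million "
        "square feet, with 3 assembly lines. Two of them make trucks and "
        "Bluebird bus frames, but the line we saw makes Chevy Suburbans and "
        "similar light trucks, at the rate of one every 67 seconds."),
    2: ("A few overall comments on the Suburban line. Janesville is an "
        "assembly plant, so all the parts are made elsewhere, and come to "
        "the plant by truck and rail."),
    3: ("The Janesville facility was built by GM in 1919 as the Sampson "
        "Tractor Plant, and started making trucks as well the next year. "
        "In 1922 they started making Chevrolet passenger cars there."),
    4: ("There is very little inventory of parts on site. Basically, enough "
        "parts for one shift arrive at one time by train or truck."),
}


def index_terms(text):
    """Lowercased alphabetic tokens; no stemming is applied."""
    return set(re.findall(r"[a-z]+", text.lower()))


TERMS = {doc_id: index_terms(text) for doc_id, text in DOCS.items()}


def docs_with(term):
    """Set of document ids whose index terms include the given term."""
    return {doc_id for doc_id, terms in TERMS.items() if term in terms}


# AND is intersection (&), OR is union (|), NOT is set difference (-).
results = {
    "janesville AND parts": docs_with("janesville") & docs_with("parts"),
    "frames OR parts": docs_with("frames") | docs_with("parts"),
    "(truck OR trucks) NOT cars":
        (docs_with("truck") | docs_with("trucks")) - docs_with("cars"),
    "(plant NOT parts) OR (truck AND train)":
        (docs_with("plant") - docs_with("parts"))
        | (docs_with("truck") & docs_with("train")),
}

for query, retrieved in results.items():
    print(query, "->", sorted(retrieved))
```

A different tokenizer (for example one that stems "trucks" to "truck") could change which documents match, which is part of why Boolean result sets are hard for users to predict.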

Vector space model

  • Start with a little history

Vector space model

  • Allows assignment of non-binary weights to index terms
  • Allows computation of similarity between documents and queries
    • Usually calculated as the cosine of the angle between two vectors $\vec{d_j}$ and $\vec{q}$ (or a variation on that calculation)
  • Natural to return ranked list of documents

Vector space model

Cosine of angle between two vectors:

$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}$$

$|\vec{q}|$ is the same for all docs; does not affect ranking

$|\vec{d_j}|$ allows normalization for length of the document

$\mathrm{sim}(d_j, q)$ ranges from 0 to +1

cosine coefficient for similarity can be used with either binary or real-valued term weights
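
The cosine coefficient translates directly into code. This is a minimal sketch that takes the two weight vectors as plain Python lists of equal length; returning 0.0 when either vector is all zeros is a convention chosen here, not something the slides specify.

```python
import math


def cosine_similarity(d, q):
    """Cosine of the angle between weight vectors d and q (same length)."""
    dot = sum(w_d * w_q for w_d, w_q in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # convention for an empty vector (assumption)
    return dot / (norm_d * norm_q)


# Works with binary weights as well as real-valued ones.
print(cosine_similarity([1, 1, 0], [1, 0, 0]))
print(cosine_similarity([3.0, 4.0], [3.0, 0.0]))
```

Because the weights are non-negative, the value stays between 0 and 1.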

Vector space model

  • Assignment of term weights is typically based on word frequencies
    • Frequency of term in document
      • High frequency indicates term reflects document content
    • Frequency of term in the entire collection
      • Document frequency is # documents with term
      • Low frequency in collection suggests good discriminator
    • TF*IDF
      • TF is frequency of term in document (possibly normalized)
      • IDF is inverse of document frequency (usually calculated as $\log(N/n_i)$, where $N$ = # docs in collection and $n_i$ = # docs with term $i$)

Vector space model

  • Many variations on term-weighting have been tried, e.g.
    • Logarithmic term frequencies
    • Term frequencies normalized to max term frequency and scaled to fall in range 0.5 – 1
    • Salton and Buckley experimented extensively with various permutations in the SMART system using multiple test collections
  • Query terms may be weighted or binary
    • Weighted may be useful for long queries, such as documents or long descriptions of information needs
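
Two of these variations can be sketched as small functions. The exact formulas are assumptions: `1 + log(tf)` is one common logarithmic form, and `0.5 + 0.5 * tf / max_tf` is one way to normalize by the maximum term frequency so that terms present in the document fall in the range 0.5 to 1. The function names are hypothetical.

```python
import math


def log_tf(tf):
    """Logarithmic term frequency; one common form (an assumption)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def augmented_tf(tf, max_tf):
    """Term frequency normalized by the document's maximum term frequency,
    scaled so that present terms fall in the range 0.5 to 1."""
    return 0.5 + 0.5 * tf / max_tf if tf > 0 else 0.0


# A term occurring once in a document whose most frequent term occurs 4 times.
print(log_tf(1), augmented_tf(1, 4))
# The most frequent term itself.
print(log_tf(4), augmented_tf(4, 4))
```

Both variants dampen the effect of very high raw counts, so a term occurring 20 times does not count 20 times as much as a term occurring once.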

Vector space model

  • Advantages
    • Term weighting improves performance compared to term overlap (weights = 0 or 1)
    • Allows partial matching
    • Allows ranked retrieval
  • Drawbacks
    • Assumes indexing terms are independent

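Query: Chevy assembly Janesville

Doc 1: Chevy assembly occurs in Janesville at the Chevy factory.

Doc 2: Assembly of cars in Janesville is interesting.

Doc 3: Factory assembly of Chevy cars is interesting.

Term | TF 1 | TF 2 | TF 3 | DF | log(N/n_i) | w_{i,d1} | w_{i,d2} | w_{i,d3} | w_{i,q}
assembly | | | | | | | | |
cars | | | | | | | | |
chevy | | | | | | | | |
factory | | | | | | | | |
interesting | | | | | | | | |
janesville | | | | | | | | |
occurs | | | | | | | | |

(TF 1 to TF 3 are the raw term frequencies in each doc; log(N/n_i) is the IDF.)

Doc | Similarity to query
1 |
2 |
3 |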
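
The table can be filled in programmatically. The sketch below makes several assumptions that the slide leaves open: the three passages are Doc 1 to Doc 3 in order, only the seven terms listed in the table are index terms (words such as "in", "of" and "the" are dropped), the term weight is raw TF times log(N/n_i) with the natural logarithm, and the query is weighted the same way as the documents.

```python
import math
from collections import Counter

# Index terms from the table; all other words are ignored (assumption).
VOCAB = ["assembly", "cars", "chevy", "factory",
         "interesting", "janesville", "occurs"]

DOCS = {
    1: "Chevy assembly occurs in Janesville at the Chevy factory.",
    2: "Assembly of cars in Janesville is interesting.",
    3: "Factory assembly of Chevy cars is interesting.",
}
QUERY = "Chevy assembly Janesville"


def term_freqs(text):
    """Raw frequency of each index term in the text."""
    words = text.lower().replace(".", " ").split()
    return Counter(w for w in words if w in VOCAB)


def tfidf_vector(text, idf):
    """Vector of TF*IDF weights, one entry per term in VOCAB."""
    tf = term_freqs(text)
    return [tf[t] * idf[t] for t in VOCAB]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


N = len(DOCS)
df = {t: sum(1 for text in DOCS.values() if term_freqs(text)[t] > 0)
      for t in VOCAB}
idf = {t: math.log(N / df[t]) for t in VOCAB}

q_vec = tfidf_vector(QUERY, idf)
sims = {doc_id: cosine(tfidf_vector(text, idf), q_vec)
        for doc_id, text in DOCS.items()}
ranking = sorted(sims, key=sims.get, reverse=True)

for doc_id in ranking:
    print(f"Doc {doc_id}: {sims[doc_id]:.3f}")
```

Note that "assembly" occurs in all three documents, so its IDF is log(3/3) = 0 and it contributes nothing to any similarity, even though it is a query term. The base of the logarithm rescales every weight by the same factor and so does not change the cosine values.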

Probabilistic approach

  • Answer the “Basic Question”:
    • “What is the probability that this document is relevant to this query?” *
  • Rank documents by probability of relevance
    • estimate P (Relevance|Document)
  • Follows from Probability Ranking Principle
    • “If retrieved documents are ordered by decreasing probability of relevance on the data available, then the system’s effectiveness is the best to be gotten for the data.” *

* K. Sparck Jones, S. Walker, S.E. Robertson. A Probabilistic Model of Information Retrieval: Development and Status. TR 446, Cambridge University Computer Laboratory, September 1998.

Probabilistic model

  • Ranking is based on probability of relevance to the query, not similarity
  • Relevance assumed to be binary
  • Relevance of one document assumed to be independent of relevance of other documents
  • Relevance assumed to be an attribute of the relationship between document and query, independent of user situation

A little history: Probabilistic model

  • Maron and Kuhns (1960)
    • Proposed calculation of a relevance number
      • A measure of the probable relevance of a document for a requestor
      • A number used to rank documents
    • Proposed probabilistic indexing
      • Indexer assigns terms with a probability
        • probability that the document will be relevant to a user who is interested in the subject designated by the term
      • Weighted index terms will characterize content more accurately

A little history: Probabilistic model

  • Maron and Kuhns (1960)
    • Proposed techniques for finding the “closest” index terms to the original query terms in order to retrieve more documents
      • Query expansion
    • Proposed expanding the result set using a distance function (based on weighted index terms) to find documents similar to the original retrieved documents
      • Relevance feedback

Probabilistic model

  • Goal: rank documents according to probability of being in the relevant set
    • estimate P(Relevance|Document) for a query
  • Based on Bayes theorem

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})}$$

Probabilistic model

  • Bayes theorem
    • P(A) is prior probability
      • Assumes no knowledge of P(B)
    • P(A|B) is conditional probability
      • Is the probability of A given a known value for B
  • Conditional probability defined:

$$P(A \mid B) = \frac{P(AB)}{P(B)}$$

  • Combine with corresponding equation for P(B|A) and rearrange:

$$P(A \mid B)\,P(B) = P(AB) = P(B \mid A)\,P(A)$$

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
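
A small numeric sketch of Bayes theorem, P(A|B) = P(B|A) P(A) / P(B), with P(B) expanded over A and not-A. The reading of A as "document is relevant" and B as "document contains the term", and all the probabilities used, are made-up values for illustration only.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) from P(B|A), the prior P(A), and P(B|not A)."""
    # Total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / p_b


# Hypothetical numbers: 10% of documents are relevant, the term occurs in
# 80% of relevant documents and in 20% of non-relevant ones.
posterior = bayes_posterior(p_b_given_a=0.8, p_a=0.1, p_b_given_not_a=0.2)
print(round(posterior, 4))
```

Observing the term raises the probability of relevance above the 10% prior, because the term is more likely in relevant documents than in non-relevant ones.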

Probabilistic model

  • Problem with $\sum_i \log \frac{P(A_i = a_i \mid L)}{P(A_i = a_i \mid \bar{L})}$ (here $A_i$ is the attribute for term $i$, $L$ is the event that the document is relevant, and $\bar{L}$ that it is not)
    • Requires a component for each term that is absent ($A_i = 0$); a zero value would be natural
    • Subtract the “natural zero” component from every document, only consider terms present
    • Order-preserving transformation:

$$\sum_i \left( \log \frac{P(A_i = a_i \mid L)}{P(A_i = a_i \mid \bar{L})} - \log \frac{P(A_i = 0 \mid L)}{P(A_i = 0 \mid \bar{L})} \right) = \sum_i \log \frac{P(A_i = a_i \mid L)\,P(A_i = 0 \mid \bar{L})}{P(A_i = a_i \mid \bar{L})\,P(A_i = 0 \mid L)}$$

    • Let W be a function that assigns a weight for each value of each attribute:
      • $W(A_i = a_i) = \log \frac{P(A_i = a_i \mid L)\,P(A_i = 0 \mid \bar{L})}{P(A_i = a_i \mid \bar{L})\,P(A_i = 0 \mid L)}$
      • $W(A_i = 0) = 0$
      • Score = $\sum_i W(A_i = a_i)$

Probabilistic model

  • Score is based on:

$$\sum_i \log \frac{P(A_i = a_i \mid L)\,P(A_i = 0 \mid \bar{L})}{P(A_i = a_i \mid \bar{L})\,P(A_i = 0 \mid L)}$$

  • Weight for term presence:

$$w_i = \log \frac{p_i\,(1 - q_i)}{q_i\,(1 - p_i)}$$

    • where $p_i$ = P(term $i$ present | $L$)
    • and $q_i$ = P(term $i$ present | $\bar{L}$)
  • So, how to assign $w_i$?
    • i.e. how to assign $p_i$ and $q_i$?
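
A minimal sketch of the term-presence weight, w_i = log(p_i (1 - q_i) / (q_i (1 - p_i))), and of the resulting score, which sums the weights of the query terms present in the document. Here p_i is the probability that term i is present in a relevant document and q_i the probability that it is present in a non-relevant one. The function names and the example probabilities are assumptions for illustration.

```python
import math


def presence_weight(p, q):
    """w = log(p (1 - q) / (q (1 - p))); requires 0 < p < 1 and 0 < q < 1."""
    return math.log(p * (1.0 - q) / (q * (1.0 - p)))


def score(doc_terms, query_weights):
    """Sum the weights of query terms that are present in the document.
    Absent terms contribute zero, matching W(A_i = 0) = 0."""
    return sum(w for term, w in query_weights.items() if term in doc_terms)


# Hypothetical estimates of (p, q) for two query terms.
weights = {
    "janesville": presence_weight(0.8, 0.2),  # much likelier in relevant docs
    "plant": presence_weight(0.5, 0.5),       # equally likely in both: weight 0
}
print(weights)
print(score({"janesville", "plant", "truck"}, weights))
```

A term that is equally likely in relevant and non-relevant documents gets weight zero, so it neither helps nor hurts the ranking.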

Probabilistic model

  • Consider possible ways to assign values to $p_i$
  • Unweighted: presence/absence of term
    • ($p_i$, $q_i$ = 0 or 1)
  • Collection frequency weights
  • Incorporate relevance information from user
  • Term frequency (within document) weights

Contingency table

 | Relevant | Non-relevant | Total
Contains term | $r$ | $n - r$ | $n$
Does not contain term | $R - r$ | $N - n - R + r$ | $N - n$
Total | $R$ | $N - R$ | $N$

Probabilistic model

  • To use contingency table to assign weights:
    • For any term $i$, $p_i = \frac{r}{R}$ and $q_i = \frac{n - r}{N - R}$
  • So

$$w_i = \log \frac{p_i\,(1 - q_i)}{q_i\,(1 - p_i)} = \log \frac{r\,(N - n - R + r)}{(R - r)(n - r)}$$

Probabilistic model

  • Must estimate values for p and q
  • To get a well-behaved formula, add 0.5 to all central cells in contingency table
  • If no relevance info available, using collection frequencies for weighting:
    • Assume R is small relative to N
    • Estimate q as n/N (proportion of docs that contain term)
    • Assume p = a constant
    • Transform equation, get CFW = $\sum_i \log \frac{N}{n_i}$
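
The step from the term-presence weight to the collection frequency weight can be sketched as follows. This is an illustrative derivation under the assumptions listed above, with the additional assumption that $n_i \ll N$; it is not taken from the slides.

```latex
\begin{align*}
w_i &= \log \frac{p\,(1 - q_i)}{q_i\,(1 - p)}
     = \log \frac{p}{1 - p} + \log \frac{1 - q_i}{q_i} \\
    &= \log \frac{p}{1 - p} + \log \frac{N - n_i}{n_i}
     && \text{with } q_i = n_i / N \\
    &\approx \log \frac{p}{1 - p} + \log \frac{N}{n_i}
     && \text{when } n_i \ll N
\end{align*}
```

The first term is the same constant for every term (and is zero for $p = 0.5$), so the part that varies from term to term is $\log(N / n_i)$, the same quantity used as IDF in the vector space model.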

Probabilistic model

  • If relevance info available
    • Use term frequency information from relevant documents to estimate p and q using the contingency table (after addition of 0.5 to central cells)
  • RW = $\sum_i \log \frac{(r_i + 0.5)(N - n_i - R + r_i + 0.5)}{(R - r_i + 0.5)(n_i - r_i + 0.5)}$
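
The per-term relevance weight can be computed straight from the four contingency-table counts, with 0.5 added to each central cell. In this sketch `r` is the number of relevant documents containing the term, `n` the number of documents containing the term, `R` the number of relevant documents, and `N` the collection size; the function name and the example counts are hypothetical.

```python
import math


def relevance_weight(r, n, R, N):
    """log of (r + 0.5)(N - n - R + r + 0.5) / ((R - r + 0.5)(n - r + 0.5))."""
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)


# Hypothetical collection of 100 docs, 5 judged relevant; term occurs in 10 docs.
print(relevance_weight(r=5, n=10, R=5, N=100))  # term in every relevant doc
print(relevance_weight(r=1, n=10, R=5, N=100))  # term in one relevant doc
print(relevance_weight(r=0, n=10, R=0, N=100))  # no relevance information
```

The 0.5 terms keep the formula defined when a cell is zero, for example when a term occurs in every relevant document or when no relevance judgments are available yet.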

Probabilistic model

  • No relevance data: using within document term frequency data
  • Early work modeled as two types of occurrences
    • Document is about the topic of the term
      • document is elite for that term
    • Document not about the topic of the term
    • Modeled as a mixture of two Poisson distributions
    • Very complex formula, difficult to estimate parameters, little performance benefit
    • Simplified equation (with tuning parameters) has similar desired behavior; has been very successful
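
The slides do not give the simplified equation. One widely used form with two tuning parameters, usually associated with the Okapi BM25 weighting, is sketched below as an assumption about what is meant. The parameter names `k1` and `b` and their default values are conventional choices, not values from the slides.

```python
def tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating within-document term frequency factor (BM25-style sketch).

    k1 controls how quickly the factor levels off as tf grows;
    b controls how strongly long documents are penalized.
    """
    if tf <= 0:
        return 0.0
    length_norm = (1.0 - b) + b * doc_len / avg_doc_len
    return tf * (k1 + 1.0) / (k1 * length_norm + tf)


# The factor grows with tf but never exceeds k1 + 1.
for tf in (1, 2, 5, 50):
    print(tf, round(tf_component(tf, doc_len=100, avg_doc_len=100), 3))
```

In a full scoring function this factor would typically be multiplied by a collection frequency or relevance weight for the term and summed over the query terms.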

Probabilistic model

  • Summary
    • Based on estimating the probability that a document is relevant to a query
    • Requires various assumptions
    • Like the vector model, calculates a score to be used for relevance and uses a weighted vector to represent the terms in queries and documents
    • Underlies some very successful research prototypes