Document = Set of Words
Each Word = Dimension in Vector
– After removing very common and rare words
– Stemming
→ (retriev*, inform*, visual*, interact*) = 4D vector
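
A minimal sketch of turning a document into such a 4D vector. It assumes simple prefix matching in place of a real stemmer, and the names STEMS and to_vector are hypothetical, chosen only for illustration. Words that match none of the four stems are dropped, which stands in for the removal of very common and rare words.

```python
from collections import Counter

# Assumption: the vocabulary is the four stems from the slide, and prefix
# matching approximates stemming (a real system would use a proper stemmer).
STEMS = ["retriev", "inform", "visual", "interact"]

def to_vector(text):
    """Return one count per stem: how many words in `text` start with it."""
    counts = Counter()
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'()")  # drop surrounding punctuation
        for stem in STEMS:
            if word.startswith(stem):
                counts[stem] += 1
                break  # each word contributes to at most one dimension
    return [counts[stem] for stem in STEMS]

doc = "Information retrieval and interactive visualization of retrieved information"
vec = to_vector(doc)
print(vec)  # [2, 2, 1, 1]
```

Each position in the list is one dimension of the vector, in the order (retriev*, inform*, visual*, interact*).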
Each Word / Dimension Weighted Based on Frequency
– “Inverse” = 1 / Frequency
→ The less frequent, the greater the weight
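
A minimal sketch of the inverse weighting, under the assumption that "frequency" means the number of documents in which a word appears. The function names and the small corpus of counts are hypothetical. Many systems use a log-scaled variant such as log(N / frequency) rather than the plain reciprocal shown here.

```python
def inverse_frequency_weights(vectors):
    """Return one weight per dimension: 1 / (number of documents using it)."""
    n_dims = len(vectors[0])
    weights = []
    for d in range(n_dims):
        doc_freq = sum(1 for v in vectors if v[d] > 0)
        # A dimension used by no document gets weight 0 to avoid dividing by zero.
        weights.append(1.0 / doc_freq if doc_freq else 0.0)
    return weights

def apply_weights(vector, weights):
    """Scale each raw count by the weight of its dimension."""
    return [count * w for count, w in zip(vector, weights)]

# Hypothetical raw counts over four word dimensions, one row per document.
corpus = [
    [2, 2, 1, 1],
    [0, 3, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 2],
]
weights = inverse_frequency_weights(corpus)
print(weights)                            # [0.5, 0.25, 1.0, 0.5]
print(apply_weights(corpus[0], weights))  # [1.0, 0.5, 1.0, 0.5]
```

The second dimension occurs in every document and so receives the smallest weight, while the third occurs in only one and receives the largest.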
Similarity of Documents = Angle between Vectors
– Two text passages are similar if their vectors point in a similar direction
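
A minimal sketch of comparing two document vectors by angle, using the standard cosine formula (dot product divided by the product of the vector lengths). The function names and the example vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Angle between a and b in degrees; smaller means more similar."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))  # guard against rounding
    return math.degrees(math.acos(cos))

same_direction = cosine_similarity([1, 0, 0, 0], [2, 0, 0, 0])
no_overlap = cosine_similarity([1, 0, 0, 0], [0, 3, 0, 0])
print(same_direction, no_overlap)
```

Because only direction matters, a short passage and a long one with the same mix of words score as highly similar, while passages that share no words are at 90 degrees.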