Thursday, May 21, 2015

Why use cosine distance instead of Euclidean distance in text mining?

Euclidean distance is a bad idea, because Euclidean distance is large for vectors of different lengths.

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Key idea of cosine distance:
Rank documents according to angle with query.
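A minimal sketch of the contrast, using made-up term-count vectors (the names q and d2 follow the example above; here d2 is simply the query's counts tripled, so the two vectors point in the same direction):

```python
import math

# Hypothetical term-count vectors: d2 has the same distribution of
# terms as the query q, just three times as many of each, so the
# vectors differ in length but not in direction.
q  = [1.0, 2.0, 0.0, 1.0]
d2 = [3.0, 6.0, 0.0, 3.0]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

print(euclidean(q, d2))        # large (~4.9), despite similar distributions
print(cosine_distance(q, d2))  # 0.0: the angle between q and d2 is zero
```

Ranking by angle (equivalently, by cosine distance) puts d2 right next to q, whereas Euclidean distance pushes it far away purely because of its length.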

But even if we normalize the data and use Euclidean distance, then if the data are sparse (contain many zeros), we run into trouble:

Taking the Euclidean metric, we can write: |x − y|² = |x|² + |y|² − 2 x·y.
If the space is sparse, then the term 2 x·y is zero most of the time. Hence the squared metric degrades into |x|² + |y|². And that is useless as a distance measure.

On the other hand, cosine distance looks only at x·y. Hence it is not adversely impacted by the combination of high dimension and sparseness [SAS].
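The degradation can be sketched numerically (a toy illustration with randomly generated sparse vectors; the dimensions and sparsity levels are made up for the demonstration):

```python
import math
import random

random.seed(0)
dim = 10_000

def sparse_vector(nnz):
    # A vector with only `nnz` nonzero coordinates out of `dim`.
    v = [0.0] * dim
    for i in random.sample(range(dim), nnz):
        v[i] = random.random()
    return v

x = sparse_vector(20)
y = sparse_vector(20)

dot = sum(a * b for a, b in zip(x, y))
nx2 = sum(a * a for a in x)
ny2 = sum(b * b for b in y)
dist2 = sum((a - b) ** 2 for a, b in zip(x, y))

# The identity |x - y|^2 = |x|^2 + |y|^2 - 2 x.y always holds:
assert abs(dist2 - (nx2 + ny2 - 2 * dot)) < 1e-9

# With 20 nonzeros in 10,000 dimensions the supports of x and y
# almost never overlap, so x.y is (almost always) exactly zero and
# the squared distance collapses to |x|^2 + |y|^2 -- a quantity that
# says nothing about how similar the two vectors are.
print(dot)
```

In that regime, Euclidean distance mostly measures the lengths of the two vectors, which is exactly the information cosine distance deliberately discards.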

But of course, sometimes we don't want to look just at the angle of the vectors. Hence for dense data, it is still advisable to use Euclidean distance.

For further discussion, see a related question at stats.stackexchange.com.
