. . . because Euclidean distance is large for vectors of different lengths.
Key idea of cosine distance:
Rank documents according to angle with query.
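As a rough illustration of ranking by angle, here is a minimal sketch using NumPy. The helper names `cosine_similarity` and `rank_by_angle` and the toy vectors are made up for this example; they are not from any particular library.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_angle(query, docs):
    """Return document indices ordered from smallest to largest angle with the query."""
    sims = [cosine_similarity(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: sims[i], reverse=True)

# Hypothetical term-count vectors (made-up data for illustration).
query = np.array([1.0, 1.0, 0.0])
docs = [
    np.array([10.0, 10.0, 0.0]),  # same direction as the query, but much longer
    np.array([1.0, 0.0, 1.0]),    # similar length, different direction
]

# Ranking by angle puts the first document on top.
order = rank_by_angle(query, docs)

# Plain Euclidean distance prefers the second document instead,
# only because the first one is longer.
euclid = [float(np.linalg.norm(query - d)) for d in docs]
```

In this toy case the first document points in exactly the same direction as the query, so ranking by angle places it first, while its Euclidean distance to the query is the larger of the two.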
But even if we normalize the data and use Euclidean distance, we run into trouble when the data are sparse (contain many zeros):
Taking the squared Euclidean metric, we can write:
||x - y||^2 = x^2 + y^2 - 2xy
(where xy denotes the dot product of x and y).
If the space is sparse, then 2xy is zero most of the time. Hence the metric degenerates into x^2 + y^2, which is useless as a distance measure because it no longer depends on how x and y relate to each other.
On the other hand, cosine distance looks only at xy (scaled by the vector lengths). Hence it is not adversely impacted by the combination of high dimension and sparseness [SAS].
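The argument can be checked numerically: squared Euclidean distance splits into x^2 + y^2 - 2xy, and for sparse vectors with no shared nonzero dimensions the last term vanishes. The following is only a sketch with made-up vectors; `squared_euclidean_terms` is a hypothetical helper name.

```python
import numpy as np

def squared_euclidean_terms(x, y):
    """Return the three terms x.x, y.y and -2 x.y of ||x - y||^2."""
    return float(np.dot(x, x)), float(np.dot(y, y)), float(-2.0 * np.dot(x, y))

# Made-up sparse vectors: 1000 dimensions, three nonzero entries each.
x = np.zeros(1000)
x[[3, 40, 500]] = [1.0, 2.0, 2.0]

y = np.zeros(1000)
y[[7, 90, 800]] = [2.0, 1.0, 2.0]   # no nonzero dimension in common with x

z = np.zeros(1000)
z[[3, 40, 800]] = [1.0, 2.0, 2.0]   # shares two nonzero dimensions with x

# Disjoint supports: the cross term is zero, only the norms remain.
xx, yy, cross_xy = squared_euclidean_terms(x, y)

# Overlapping supports: the cross term carries the similarity information.
_, zz, cross_xz = squared_euclidean_terms(x, z)
```

Here the cross term is the only part that tells x's relation to y apart from its relation to z; the terms x^2, y^2 and z^2 depend on each vector alone.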
But of course, sometimes we don't want to look only at the angle between the vectors. Hence for dense data, it is still advisable to use Euclidean distance.
For further discussion, see a related question at stats.stackexchange.com.