Dominik Schnitzer,
"Indexing Content-Based Music Similarity Models for Fast Retrieval in Massive Databases,"
2012
Original title:
Indexing Content-Based Music Similarity Models for Fast Retrieval in Massive Databases
Language of the title:
English
Original abstract:
This thesis develops a large-scale music recommendation system. To achieve this goal, we solve
three problems that prevent the currently top-performing class of content-based music similarity
algorithms from being used as a recommendation engine in huge databases with millions of songs:
First, we show how to correctly use the non-vectorial music similarity features with their non-metric
divergences in centroid-computing algorithms. All previous approaches had to artificially
vectorize the data before they were able to work with the features.
Second, we show how the problem of "hubs" can be alleviated. Hubs are objects in a recommendation
system that are retrieved as nearest neighbors unusually often. The examined
music recommendation methods are especially prone to hubs, which significantly decreases their retrieval
quality. We also identify hubs as a general problem in machine learning and show the beneficial
effects of our method on a large number of publicly available machine learning datasets.
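
As a rough illustration of the definition above, hubs can be made visible by counting how often each object shows up in the nearest-neighbor lists of all other objects. The sketch below is only an assumption-laden toy: it uses synthetic Gaussian vectors and Euclidean distance as stand-ins for the music similarity features and divergences, and the function name `k_occurrence` is hypothetical. It measures hubness only; it does not implement the hub-reducing method of the thesis.

```python
import math
import random

def k_occurrence(points, k):
    """Count how often each point appears among the k nearest neighbors of the others."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Rank all other points by Euclidean distance to point i
        # (a stand-in for the real music similarity measure).
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            counts[j] += 1
    return counts

# Synthetic data: 150 random points in 30 dimensions (an assumption for illustration).
random.seed(0)
points = [[random.gauss(0.0, 1.0) for _ in range(30)] for _ in range(150)]
counts = k_occurrence(points, k=5)
print("largest k-occurrence:", max(counts), "mean:", sum(counts) / len(counts))
```

Every object contributes exactly k neighbors, so the mean count always equals k. Objects whose count lies far above that mean are the hub candidates.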
Third, we present a new method to speed up music recommendation queries. The method
uses a filter-and-refine system layout. It achieves a very high retrieval accuracy and speeds up
queries by a factor of 10 to 40 compared to a linear scan. The method enables us to use the music
similarity methods with very large databases.
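
The general filter-and-refine idea can be sketched as follows: a cheap approximate distance first selects a small candidate set, and only those candidates are ranked with the expensive exact distance. Everything specific here is an assumption for illustration: the truncated Euclidean distance used as the filter, the full Euclidean distance used as the exact measure, and the names `cheap_distance`, `filter_and_refine`, and `linear_scan` are hypothetical and are not taken from the thesis.

```python
import heapq
import math
import random

def cheap_distance(a, b, dims):
    """Filter step: approximate distance using only the first `dims` coordinates."""
    return math.dist(a[:dims], b[:dims])

def filter_and_refine(query, database, k, n_candidates, dims):
    """Return indices of the (approximate) k nearest neighbors of `query`."""
    # Filter: keep the n_candidates closest items under the cheap distance.
    candidates = heapq.nsmallest(
        n_candidates,
        range(len(database)),
        key=lambda i: cheap_distance(query, database[i], dims),
    )
    # Refine: rank only the candidates with the exact (expensive) distance.
    return heapq.nsmallest(
        k, candidates, key=lambda i: math.dist(query, database[i])
    )

def linear_scan(query, database, k):
    """Baseline: exact distance to every item in the database."""
    return heapq.nsmallest(
        k, range(len(database)), key=lambda i: math.dist(query, database[i])
    )

# Synthetic data (an assumption): 2000 random 40-dimensional vectors.
random.seed(1)
database = [[random.gauss(0.0, 1.0) for _ in range(40)] for _ in range(2000)]
query = [random.gauss(0.0, 1.0) for _ in range(40)]

approx = filter_and_refine(query, database, k=10, n_candidates=200, dims=8)
exact = linear_scan(query, database, k=10)
recall = len(set(approx) & set(exact)) / len(exact)
print("recall against linear scan:", recall)
```

The trade-off is controlled by `n_candidates`: a larger candidate set raises the chance that the true neighbors survive the filter but requires more exact distance computations. When the candidate set covers the whole database, the result matches the linear scan.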
Finally, we merge all three introduced methods in a large-scale, high-quality music recommendation
prototype: the system computes (i) a natural clustering of the music similarity features
in order to (ii) apply the introduced hub-reducing method and (iii) use the filter-and-refine method to
allow for fast retrieval. The prototype is called "Wolperdinger". It operates on a collection of 2.3
million songs and can answer recommendation queries in a fraction of a second. It is the
largest content-based music recommendation system published to date.