Improving prediction accuracy on Netflix data by constructing algorithmic compositions (report)

From MachineLearning.


Introduction to the Project

Project Specifications

The Goal

Boost the performance of existing algorithms on Netflix data with the help of an algebraic approach to the synthesis of algorithmic compositions.

The Motivation

There is a possibility that individually weak algorithms, when used in a composition, could yield much better performance. After the Netflix Prize, a large number of probes (in Netflix's notation) were left behind. It would therefore be interesting to apply the algebraic approach to combine them and achieve better accuracy.

The Data

The data comes from 1,151 sets of predictions of integer ratings in the range 1-5. Each set of predictions was generated by an algorithm with the objective of minimizing the RMSE over the full set of over 1.4 million ratings. All the individual probe files were merged into one large file (1,151 columns, 1.4 million rows). From this file, random sampling was used to generate the data sets. Each data set is split into 2 files, one for Training (Rating provided) and one for Scoring. There are 2 small and 2 medium data sets.

In each of the data sets, the Training and Scoring files are of the same size:

  • RMSE Small — 200 sets of predictions for 15,000 ratings. The objective is to minimize the RMSE.
  • AUC Small — 200 sets of predictions for 15,000 ratings. Only ratings of 2 particular values are included (e.g. 1s and 5s), sampled with different weightings on each value. The objective is to maximize the AUC.
  • RMSE Medium — 250 sets of predictions for 20,000 ratings. The objective is to minimize the RMSE.
  • AUC Medium — 250 sets of predictions for 20,000 ratings. The objective is to maximize the AUC.

The data sets are CSV files with header rows. The first 2 columns are rowID and Target (the actual rating), followed by either 200 or 250 columns of predicted rating values. In the Scoring files the Target values are zeros. For both the RMSE and AUC data sets, the predicted rating values have been converted to integers by rounding to 3 decimal places and multiplying by 1,000. Thus each predicted rating is an integer and should lie in the range 1,000 to 5,000.
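The layout above can be decoded with a few lines of standard-library Python. This is only an illustration; the column names and values in the sample are invented, not taken from the actual files:

```python
import csv
import io

# Hypothetical two-row sample in the described layout: rowID, Target,
# then prediction columns stored as integers equal to (rating * 1000).
sample = io.StringIO(
    "rowID,Target,Pred1,Pred2\n"
    "1,4,3875,4120\n"
    "2,2,2210,1998\n"
)

rows = list(csv.DictReader(sample))
# Decode a stored prediction column back to the original 1.0-5.0 scale.
pred1 = [int(r["Pred1"]) / 1000.0 for r in rows]
```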

Evaluation Criteria

RMSE (http://en.wikipedia.org/wiki/Mean_squared_error)

AUC (http://en.wikipedia.org/wiki/Receiver_operating_characteristic)

For the AUC data sets, the Target values are 1 and -1 (zeros in the Scoring files).

In other words, the Netflix data set was modified to allow 2-class recognition (where 1 ~ 5 stars and -1 ~ 1 star).
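Both criteria are straightforward to compute. Below is a minimal pure-Python sketch (function names are ours); the AUC is the standard pairwise-ranking formulation over the +1/-1 labels described above, with ties counted as half:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over paired ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked
    correctly; labels are +1/-1, ties contribute one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```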

Requirements

Synthesized algorithms should at least outperform the simple averaged composition.
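The baseline itself is trivial; a minimal sketch of the simple averaged composition (the function name is ours):

```python
def average_composition(prediction_sets):
    """Baseline: the plain mean of all base predictions for each rating.
    prediction_sets is a list of columns, one column per base algorithm."""
    n = len(prediction_sets)
    return [sum(col[i] for col in prediction_sets) / n
            for i in range(len(prediction_sets[0]))]
```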

Feasibility and Possible Obstacles

We can only assess the diversity of the algorithms; we do not know how any given algorithm was trained or what correlations exist among the algorithms. To build a strong composition, it is better to have diverse and individually accurate algorithms.

Approach

Boosting, Bagging, Genetic Algorithm, Stochastic Local Search with Adaptation, Forward/Backward Stepwise Selection (Greedy Search), Linear Regression
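As one concrete example from this list, forward stepwise selection can be sketched as greedy ensemble selection with replacement over the probe columns. This is a simplified illustration under our own assumptions, not the project's actual implementation:

```python
import math

def rmse(y, p):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def forward_selection(candidates, y, max_size=10):
    """Greedy forward stepwise selection with replacement: at each step,
    add the candidate column whose inclusion most lowers the RMSE of the
    running average of the chosen columns."""
    chosen, total = [], [0.0] * len(y)
    for _ in range(max_size):
        best_idx, best_err = None, None
        for idx, col in enumerate(candidates):
            trial = [(t + c) / (len(chosen) + 1) for t, c in zip(total, col)]
            err = rmse(y, trial)
            if best_err is None or err < best_err:
                best_idx, best_err = idx, err
        chosen.append(best_idx)
        total = [t + c for t, c in zip(total, candidates[best_idx])]
    return chosen, [t / len(chosen) for t in total]
```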

Mathematical Formalization

Given a matrix X of predictions (e.g. Netflix probes) for a finite set of samples and a vector y of true labels, build a correcting operator in the space of all functions over the initial space F that yields better performance on a qualifying (test) set. The initial data set was partitioned into two independent data sets: training and qualifying.
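The setup can be written out formally as follows; the notation below is ours, not from the original report:

```latex
% X is the matrix of base predictions, y the vector of true ratings:
X \in \mathbb{R}^{\ell \times n}, \qquad y \in \mathbb{R}^{\ell},
% where x_{ij} is the prediction of the j-th probe on the i-th sample.
%
% A correcting operator b maps the n base predictions to a final one,
% chosen from a family B to minimize the RMSE on the training part:
b^{*} = \arg\min_{b \in B}
  \sqrt{\frac{1}{\ell} \sum_{i=1}^{\ell}
  \bigl( b(x_{i1}, \dots, x_{in}) - y_i \bigr)^{2}},
% with the final quality reported on the independent qualifying set.
```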

Descriptions of the Algorithms

References

[1] R. Bell and Y. Koren. Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights. IEEE International Conference on Data Mining (ICDM’07), pp. 43–52, 2007.

[2] Breiman L. Bagging predictors // Machine Learning. 1996. Vol. 24, No. 2. Pp. 123–140.

[3] Friedman J., Hastie T., Tibshirani R. Additive logistic regression: a statistical view of boosting: Tech. rep.: Dept. of Statistics, Stanford University Technical Report, 1998.

[4]...

Hypotheses about underlying data properties

All probe predictions perform better than a simple random-guessing algorithm (in the classification problem) and better than uniform predictions over the set {1, ..., 5} in Z (in the regression problem). The distribution of labels in the training and qualifying sets is the same. Because all probes were built by well-known researchers using stable and robust algorithmic models, all probes are informative. We can treat this problem as a feature-selection problem in which each probe prediction is a unique feature; therefore, Backward Stepwise Selection should yield better results.
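The backward-selection hypothesis can be made concrete with a greedy elimination sketch over the probe columns. This is an illustration under the assumptions above, not the project's implementation:

```python
import math

def rmse(y, p):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def backward_elimination(candidates, y):
    """Backward stepwise selection: start from all base predictors and
    greedily drop the one whose removal most improves the RMSE of the
    plain average, stopping when no single removal helps."""
    active = list(range(len(candidates)))

    def avg_err(idxs):
        preds = [sum(candidates[j][i] for j in idxs) / len(idxs)
                 for i in range(len(y))]
        return rmse(y, preds)

    err = avg_err(active)
    improved = True
    while improved and len(active) > 1:
        improved = False
        for j in list(active):
            rest = [k for k in active if k != j]
            e = avg_err(rest)
            if e < err:
                active, err, improved = rest, e, True
                break
    return active
```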

Mathematical Foundations

Heuristics, Modifications and Suggestions

  • A regularization parameter is used because of the large number of probes (i.e. of parameters to be tuned). Its value was found by cross-validation (CV).
  • To measure pairwise similarity of predictions we used the Kendall-Kemeny coefficient, a well-suited measure of feature similarity for problems where the target variable is on a rank scale.
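For reference, a plain Kendall rank-correlation coefficient between two prediction columns can be computed as below. This is a simple O(n^2) sketch; the Kendall-Kemeny coefficient mentioned above may differ in details such as the handling of ties:

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two prediction vectors:
    (concordant - discordant) pairs over all pairs; tied pairs
    contribute zero to the numerator."""
    n = len(a)
    num, den = 0, n * (n - 1) // 2
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            num += (s > 0) - (s < 0)
    return num / den
```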

System Description

  • Link to the system.docs file
  • Link to the system files

Report on Computational Experiments

Visual Analysis of the Algorithm's Behavior

Analysis of the Algorithm's Quality

Analysis of the Algorithm's Dependence on Parameters

Report on the Results Obtained

References

This article is an unreviewed student assignment.
Student: Nikita Spirin
Advisor: V. V. Strijov
Due date: December 15, 2009

Until the stated deadline, this article must not be edited by other participants of the MachineLearning.ru project. After the deadline, any participant may revise the article as they see fit and remove this notice, which is displayed via the {{Задание}} template.

See also the methodological guidelines on using the MachineLearning.ru resource in the educational process.

