Повышение точности прогнозов на данных Netflix с помощью построения алгоритмических композиций (отчет)

Материал из MachineLearning.

Версия от 16:57, 1 февраля 2010; Nikita Spirin (Обсуждение | вклад)

(разн.) ← Предыдущая | Текущая версия (разн.) | Следующая → (разн.)

Введение в проект

Project Specifications

The Goal

Boost the performance of exisiting algorithms on Netfix data with the help of algebraic approach to the synthesis of algorithmic compositions.

The Motivation

There is a non-null probability exists that individually weak algorithms while used in a compositions could result in a much better performance. After the Netflix Prize there has been left a lot of probes(according to Netflix's notation). So, it would be interesting to apply algebraic approach to combine them and achieve better accuracy.

The Data

The data comes from 1,151 sets of predictions of integer ratings in the range (1-5). Each set of predictions was generated by algorithms with the objective to minimise the RMSE over the full set of over 1.4 million ratings. All the individual probe files were merged into one large file (1,151 columns, 1.4 million rows). Form this file, random sampling was used to generate data sets. Each of the data sets is split into 2 files, one for Training (Rating provided) and one for Scoring. There are 2 small and 2 medium data sets.

In each of the data sets, the Training and Scoring files are of the same size:

RMSE Small — 200 sets of predictions for 15,000 ratings. The objective is to minimize the RMSE.

AUC Small — 200 sets of prediction for 15,000 ratings. Only ratings of 2 particular values are included (eg 1s and 5s), sampled with different weightings on each value. The objective is to minimize the AUC.

RMSE Medium — 250 sets of predictions for 20,000 ratings. The objective is to minimize the RMSE.

AUC Medium — 250 sets of predictions for 20,000 ratings. The objective is to minimize the AUC.

The data sets are csv files with header rows. The first 2 columns are rowID and Target (actual rating) and then either 250 or 200 columns of predicted rating values. In the Scoring files the Target values are zeros. For both the RMSE and AUC tasks, the predicted ratings values have been converted to integers by rounding to 3 decimal places and multiplying by 1,000. Thus each predicted rating will be an integer and should lie in the range 1,000 to 5,000. The initial datasets are taken from AusDM Ensemble Competition.

Evaluation Criteria

RMSE (http://en.wikipedia.org/wiki/Mean_squared_error)

AUC (http://en.wikipedia.org/wiki/Receiver_operating_characteristic)

For the AUC tasks, the Target values are 1 and -1 (zeros in the Scoring files). Therefore, in this case Netflix Dataset was modified so that to allow 2-class recognition (where 1~5stars and -1~1star)

Requirements

Synthesized algorithms should at least outperform simple averaged composition.

Feasibility and Possible Obstacles

We can only assess the diversity of algorithms but we don't know the underlying principle of how any given algorithm has been trained and what kind of correlations exist among algorithms. To build strong ensemble it is better to have diversed algorithms.

Approach

In progress... ;)

Mathematical Formalization

Descriptions of the Algorithms

References

[1] R. Bell and Y. Koren. Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights. IEEE International Conference on Data Mining (ICDM’07), pp. 43–52, 2007.

[2] A. Toscher and M. Jahrer. The BigChaos Solution to the Netflix Grand Prize, 2009.

[3] Breiman L. Bagging predictors // Machine Learning.� 1996.� Vol. 24, no. 2.� Pp. 123–140.

[4] Friedman J., Hastie T., Tibshirani R. Additive logistic regression: a statistical view of boosting: Tech. rep.: Dept. of Statistics, Stanford University Technical Report, 1998.

[5] J. H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, Vol. 29 (2001), pp. 1189-1232.

[6] J. H. Friedman. Stochastic Gradient Boosting. Computational Statistics & Data Analysis, Vol. 38 (2002), pp. 367–378.

[7]...

Hypotheses about underlying data properties

Mathematical Foundations

Heuristics and Modifications

Описание системы

Ссылка на файл system.docs
Ссылка на файлы системы

Отчет о вычислительных экспериментах

Визуальный анализ работы алгоритма

Анализ качества работы алгоритма

Анализ зависимости работы алгоритма от параметров

Отчет о полученных результатах

Список литературы

Данная статья является непроверенным учебным заданием.

Студент: Участник:Спирин Никита

Преподаватель: Участник:В.В. Стрижов

Срок: 15 декабря 2009

До указанного срока статья не должна редактироваться другими участниками проекта MachineLearning.ru. По его окончании любой участник вправе исправить данную статью по своему усмотрению и удалить данное предупреждение, выводимое с помощью шаблона {{Задание}}.

См. также методические указания по использованию Ресурса MachineLearning.ru в учебном процессе.

Источник — «http://machinelearning.ru/wiki/index.php?title=%D0%9F%D0%BE%D0%B2%D1%8B%D1%88%D0%B5%D0%BD%D0%B8%D0%B5_%D1%82%D0%BE%D1%87%D0%BD%D0%BE%D1%81%D1%82%D0%B8_%D0%BF%D1%80%D0%BE%D0%B3%D0%BD%D0%BE%D0%B7%D0%BE%D0%B2_%D0%BD%D0%B0_%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85_Netflix_%D1%81_%D0%BF%D0%BE%D0%BC%D0%BE%D1%89%D1%8C%D1%8E_%D0%BF%D0%BE%D1%81%D1%82%D1%80%D0%BE%D0%B5%D0%BD%D0%B8%D1%8F_%D0%B0%D0%BB%D0%B3%D0%BE%D1%80%D0%B8%D1%82%D0%BC%D0%B8%D1%87%D0%B5%D1%81%D0%BA%D0%B8%D1%85_%D0%BA%D0%BE%D0%BC%D0%BF%D0%BE%D0%B7%D0%B8%D1%86%D0%B8%D0%B9_%28%D0%BE%D1%82%D1%87%D0%B5%D1%82%29»

Категория: Непроверенные учебные задания