Повышение точности прогнозов на данных Netflix с помощью построения алгоритмических композиций (отчет)
Материал из MachineLearning.
Введение в проект
Project Specifications
The Goal
Boost the performance of exisiting algorithms on Netfix data with the help of algebraic approach to the synthesis of algorithmic compositions.
The Motivation
There is a non-null probability exists that individually weak algorithms while used in a compositions could result in a much better performance. After the Netflix Prize there has been left a lot of probes(according to Netflix's notation). So, it would be interesting to apply algebraic approach to combine them and achieve better accuracy.
The Data
The data comes from 1,151 sets of predictions of integer ratings in the range (1-5). Each set of predictions was generated by algorithms with the objective to minimise the RMSE over the full set of over 1.4 million ratings. All the individual probe files were merged into one large file (1,151 columns, 1.4 million rows). Form this file, random sampling was used to generate data sets. Each of the data sets is split into 2 files, one for Training (Rating provided) and one for Scoring. There are 2 small and 2 medium data sets.
In each of the data sets, the Training and Scoring files are of the same size:
- RMSE Small — 200 sets of predictions for 15,000 ratings. The objective is to minimize the RMSE.
- AUC Small — 200 sets of prediction for 15,000 ratings. Only ratings of 2 particular values are included (eg 1s and 5s), sampled with different weightings on each value. The objective is to minimize the AUC.
- RMSE Medium — 250 sets of predictions for 20,000 ratings. The objective is to minimize the RMSE.
- AUC Medium — 250 sets of predictions for 20,000 ratings. The objective is to minimize the AUC.
The data sets are csv files with header rows. The first 2 columns are rowID and Target (actual rating) and then either 250 or 200 columns of predicted rating values. In the Scoring files the Target values are zeros. For both the RMSE and AUC tasks, the predicted ratings values have been converted to integers by rounding to 3 decimal places and multiplying by 1,000. Thus each predicted rating will be an integer and should lie in the range 1,000 to 5,000. The initial datasets are taken from AusDM Ensemble Competition.
Evaluation Criteria
RMSE (http://en.wikipedia.org/wiki/Mean_squared_error)
AUC (http://en.wikipedia.org/wiki/Receiver_operating_characteristic)
For the AUC tasks, the Target values are 1 and -1 (zeros in the Scoring files). Therefore, in this case Netflix Dataset was modified so that to allow 2-class recognition (where 1~5stars and -1~1star)
Requirements
Synthesized algorithms should at least outperform simple averaged composition.
Feasibility and Possible Obstacles
We can only assess the diversity of algorithms but we don't know the underlying principle of how any given algorithm has been trained and what kind of correlations exist among algorithms. To build strong ensemble it is better to have diversed algorithms.
Approach
In progress... ;)
Mathematical Formalization
Descriptions of the Algorithms
References
[1] R. Bell and Y. Koren. Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights. IEEE International Conference on Data Mining (ICDM’07), pp. 43–52, 2007.
[2] A. Toscher and M. Jahrer. The BigChaos Solution to the Netflix Grand Prize, 2009.
[3] Breiman L. Bagging predictors // Machine Learning.� 1996.� Vol. 24, no. 2.� Pp. 123–140.
[4] Friedman J., Hastie T., Tibshirani R. Additive logistic regression: a statistical view of boosting: Tech. rep.: Dept. of Statistics, Stanford University Technical Report, 1998.
[5] J. H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, Vol. 29 (2001), pp. 1189-1232.
[6] J. H. Friedman. Stochastic Gradient Boosting. Computational Statistics & Data Analysis, Vol. 38 (2002), pp. 367–378.
[7]...
Hypotheses about underlying data properties
Mathematical Foundations
Heuristics and Modifications
Описание системы
- Ссылка на файл system.docs
- Ссылка на файлы системы
Отчет о вычислительных экспериментах
Визуальный анализ работы алгоритма
Анализ качества работы алгоритма
Анализ зависимости работы алгоритма от параметров
Отчет о полученных результатах
Список литературы
Данная статья является непроверенным учебным заданием.
До указанного срока статья не должна редактироваться другими участниками проекта MachineLearning.ru. По его окончании любой участник вправе исправить данную статью по своему усмотрению и удалить данное предупреждение, выводимое с помощью шаблона {{Задание}}. См. также методические указания по использованию Ресурса MachineLearning.ru в учебном процессе. |