Time series (library of examples)

From MachineLearning.

(Difference between revisions)
-
'''A time series''' is a set of measurements taken at equal intervals of time. This page presents a number of example time series intended for testing forecasting algorithms.
+
A '''time series''' is a sequence of equally spaced measurements. This page lists a number of example time series for testing forecasting algorithms.
-
== File structure ==
+
== File structure ==
-
The file is named tsName.csv; the values in each row are separated by commas, and decimals are separated by a period. The first column is time. The second column is the time series to be forecast; the remaining columns are an auxiliary set of time series. The file is accompanied by an auxiliary file tsNameReadme.txt, which specifies:
+
Data is stored in comma-separated .csv format, with decimals separated by periods. The first column contains timestamps. The second column stores the time series to be forecast; other columns may store complementary time series. Each dataset is accompanied by a tsNameReadme.txt file, which specifies:
-
* the data source (or the problem to be solved),
+
* data source (or a problem that was solved),
-
* the format of the timestamps,
+
* timestamp format,
-
* the column names (meaningful),
+
* interpretation of the data, column-wise,
-
* the scale type of each column,
+
* type of data scale for each column,
-
* the periods, if any,
+
* periodicity, if present,
-
* other information.
+
* other information.
-
== Examples ==
+
{|class="wikitable"
 +
|-
 +
|ts
 +
|colspan="2" |Describes time series
 +
|-
 +
|t
 +
|[T,1]
 +
|Time in milliseconds since 1/1/1970 (UNIX format)
 +
|-
 +
|x
 +
|[T, N]
 +
|Columns of the matrix are time series; missing values are NaNs
 +
|-
 +
|legend
 +
|{1, N }
 +
|Descriptions of the time series in ts.x, e.g. ts.legend={‘Consumption’, ‘Price’, ‘Temperature’};
 +
|-
 +
|readme
 +
|[string]
 +
|Data information (source, formation time etc.)
 +
|-
 +
|type
 +
|[1,N]
 +
|(optional) Types of the time series in ts.x: 1 for real-valued, 2 for binary, k for k-valued
 +
|-
 +
|timegen
 +
|[T,1]=func(timetick)
 +
|(optional) Time ticks generator, may contain the start (end) time in UNIX format and a function to generate the vector t [T,1]
 +
|-
 +
|}
-
=== Synthetic series (in ts format, see [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/Technologies/интерфейсы.doc TSForecastingInterfaces])===
+
== Examples ==
-
* Constant [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/constants.mat Constant]
+
-
* Sine [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/sines.mat Sine]
+
-
* Two sines [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/2sines.mat 2Sines]
+
-
* Saw [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/saws.mat Saw]
+
-
* Trapezoid [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/trapezia.mat Trapezium]
+
-
=== Highly periodic ===
+
=== Synthetic time series (in ts format, see [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/Technologies/интерфейсы.doc TSForecastingInterfaces])===
-
* Electricity consumption [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsEnergyConsumption.csv EnergyConsumption]
+
* Constant [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/constants.mat Constant]
-
* Operation of machines and mechanisms
+
* Sine [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/sines.mat Sine]
-
* Sound
+
* Two sines [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/2sines.mat 2Sines]
-
* Music [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsLedZeppelin.csv LedZeppelin]
+
* Triangles [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/saws.mat Saw]
 +
* Trapezoid [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/trapezia.mat Trapezium]
-
=== Noisy periodic ===
+
=== Highly periodic ===
-
* Electricity prices
+
* Electricity consumption [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsEnergyConsumption.csv EnergyConsumption]
-
* Consumer goods prices
+
* Machinery
-
* Sales volume of goods [https://dmba.svn.sourceforge.net/svnroot/dmba/Data/RetialSalesItems.csv RetailSalesItems]
+
* Sounds
-
* Sugar prices [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsSugarPrice.csv SugarPrice]
+
* Music [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsLedZeppelin.csv LedZeppelin]
-
* Bread prices [https://dmba.svn.sourceforge.net/svnroot/dmba/Data/WhiteBreadPrices.csv WhiteBreadPrices]
+
-
* Beverage consumption volume
+
-
* Weather: temperature, humidity, wind strength [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsGermanWeather.csv GermanWeather]
+
-
* Passenger (and freight) traffic volume
+
-
=== Complex period ===
+
=== Noisy periodic time series ===
-
* Electrocardiogram [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsEcg.csv ECG]
+
* Electricity prices
-
* Pulse wave
+
* Prices for consumables and commodities
-
* Encephalogram
+
* Retail sales [https://dmba.svn.sourceforge.net/svnroot/dmba/Data/RetialSalesItems.csv RetailSalesItems]
-
* Reflected waves
+
* Sugar prices [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsSugarPrice.csv SugarPrice]
 +
* Bread prices [https://dmba.svn.sourceforge.net/svnroot/dmba/Data/WhiteBreadPrices.csv WhiteBreadPrices]
 +
* Drink consumption
 +
* Weather: temperature, humidity, wind [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsGermanWeather.csv GermanWeather]
 +
* Passenger (and freight) transportation
-
=== Aperiodic ===
+
=== Complex periodicity ===
-
* Spread of influenza [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsFluUSA.csv FluUSA]
+
* ECG [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsEcg.csv ECG]
-
* Population migration
+
* Pulse wave
-
* Bird migration
+
* MEG
 +
* Reflected time series
-
=== Highly noisy ===
+
=== Aperiodic ===
-
* Prices (volumes) of major exchange-traded instruments [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsCSCO.csv Cisco]
+
* Flu propagation [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsFluUSA.csv FluUSA]
-
* Stock market indicators [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsDJIA.csv DowJonesIndustrialAverage]
+
* Migration
-
* Option prices (on a grid)
+
-
=== Event-driven ===
+
=== High noise ===
-
* Earthquakes [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsEarthquakesArkansas.csv ArkansasEarthquakes]
+
* Stock exchange [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsCSCO.csv Cisco]
-
* Financial bubbles [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsFinancialBubbles.csv FinancialBubbles]
+
* Market indices [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsDJIA.csv DowJonesIndustrialAverage]
-
* Records
+
* Option prices
-
=== Accelerometer ===
+
-
* http://hasc.jp/hc2010/HASC2010corpus/hasc2010corpus-en.html (2010) 540 subjects, 6 activities
+
-
* http://www-scf.usc.edu/~mizhang/datasets.html (2012) 14 subjects, 12 activities, accel+gyro sensors
+
=== Event-driven ===
 +
* Earthquakes [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsEarthquakesArkansas.csv ArkansasEarthquakes.csv]
 +
* Financial Bubbles [https://mlalgorithms.svn.sourceforge.net/svnroot/mlalgorithms/TSForecasting/TimeSeries/Sources/tsFinancialBubbles.csv FinancialBubbles.csv]
 +
* Records
 +
=== Accelerometry ===
 +
* [https://archive.ics.uci.edu/ml/datasets/OPPORTUNITY+Activity+Recognition]
 +
OPPORTUNITY Activity Recognition Data Set for Human Activity Recognition from Wearable, Object, and Ambient Sensors is a dataset devised to benchmark human activity recognition algorithms (classification, etc.).
-
* http://www.cis.fordham.edu/wisdm/dataset.php (2010-2011) 36 subjects, 6 activities, accelerometer
+
* http://hasc.jp/hc2010/HASC2010corpus/hasc2010corpus-en.html (2010) 540 subjects, 6 activities (stay, walk, jog, skip, stair up, stair down). Includes segmented data (only one activity type, 20 seconds) and sequence data.
-
* http://www.opportunity-project.eu/challengedatasetdownload (2011) 4 subjects, 17 sensors in different positions
+
* http://www-scf.usc.edu/~mizhang/datasets.html (2012) 14 subjects, 12 activities (walk forward, walk left, walk right, go upstairs, go downstairs, run forward, jump up and down, sit and fidget, stand, sleep, elevator up, and elevator down). The data is captured by the [http://www.motionnode.com/ MotionNode] inertial sensing device, which integrates a 3-axis accelerometer (+-6g) and a 3-axis gyroscope (+-500dps), sampled at 100 Hz.
-
* http://smartlab.ws/component/content/article?id=60 (2013) 30 subjects, 6 activities, fixed set of features from
+
* http://www.cis.fordham.edu/wisdm/dataset.php (2010-2011) 36 subjects, 6 activities (stay, walk, jog, sit, stair up, stair down).
-
* http://llmpp.nih.gov/lymphoma/: classification of DLBCL (Diffuse large B-cell lymphoma) patients into curable and noncurable groups ([http://www.broadinstitute.org/mpr/publications/projects/Lymphoma/Shipp_et_al_2002.pdf pdf]). Raw data for all Lymphochip microarrays are [http://llmpp.nih.gov/lymphoma/data/rawdata/ available here]. For each microarray, two scan files were generated, one for each fluorescence emission wavelength corresponding to the fluorophor used in the reverse transcription labeling reaction.
+
-
== Kaggle competitions ==
+
* http://www.opportunity-project.eu/challengedatasetdownload (2011), 4 subjects. An annotated dataset of complex, interleaved and hierarchical activities, with a particularly large number of atomic activities (around 30’000), collected in a rich sensor environment. The full setup including both ambient and on-body sensors comprises 72 sensors of 10 modalities, integrated in the environment and on the body.
-
* https://www.kaggle.com/c/seizure-prediction: predict seizures in intracranial EEG recordings. Intracranial EEG was recorded from dogs with naturally occurring epilepsy using an ambulatory monitoring system. EEG was sampled from 16 electrodes at 400 Hz, and recorded voltages were referenced to the group average. These are long duration recordings, spanning multiple months up to a year and recording up to a hundred seizures in some dogs. Preictal training and testing data segments are provided covering one hour prior to seizure with a five minute seizure horizon.
+
-
* https://www.kaggle.com/c/belkin-energy-disaggregation-competition/data: SmartHouse energy consumption prediction. Electromagnetic Interference (EMI) is measured using a special sensor built at the Ubicomp Lab to identify what appliance is being used and how much energy it is consuming. The data is available from 4 homes (H1-H4) consisting of both training datasets and testing datasets. The training set includes information about which appliance was turned ON or OFF and at what timestamps.
+
-
* https://www.kaggle.com/c/predicting-parkinson-s-disease-progression-with-smartphone-data: measure the symptoms of Parkinson’s disease with a smartphone. The data was collected from 9 PD patients, at varying stages of the disease, and 7 healthy controls over a period within 4 months. The data includes the following streams: audio, accelerometry (3D, for each of the 3 axes: mean, absolute central moment, standard deviation, maximum deviation, power spectral density across four separate bands), GPS (latitude, longitude, altitude), compass (for each of the 3 axes: mean, absolute central moment, standard deviation, maximum deviation).
+
-
* https://www.kaggle.com/c/accelerometer-biometric-competition/data: recognize users of mobile devices from accelerometer data. The dataset contains approximately 60 million unique samples of accelerometer data collected from 387 different devices. These are split into equal sets for training and test. Samples in the training set are labeled with the unique device from which the data was collected. The test set is demarcated into 90k sequences of consecutive samples from one device.
+
 +
* http://smartlab.ws/component/content/article?id=60 (2013) 30 subjects, 6 activities, fixed set of features from
-
== Databases ==
+
* http://www.ife.ee.ethz.ch/research/groups/Dataset/skoda_mini_checkpoint/SkodaMiniCP.zip '''Skoda Mini Checkpoint'''. The dataset contains acceleration measurements (calibrated and raw) of 10 manipulative gestures performed in a car maintenance scenario (1 subject, 70 instances per activity).
 +
** Some of the datasets, available from http://www.ife.ee.ethz.ch/research/groups/Dataset are described [http://www.ife.ee.ethz.ch/research/groups/Dataset/dateset_description here].
 +
 
 +
* https://cloud5.cs.fau.de/owncloud/public.php?service=files&t=9a07b48c7950d1b61d8fb8b0382ff6c7 12 subjects, 4 swimming styles (butterfly, backstroke, breaststroke and freestyle), two states (swimming/resting), one type of events (turns). See [http://www5.informatik.uni-erlangen.de/Forschung/Publikationen/2013/Jensen13-COK.pdf Classification of Kinematic Swimming Data with Emphasis on Resource Consumption]
 +
 
 +
=== Other ===
 +
* http://llmpp.nih.gov/lymphoma/: classification of DLBCL (Diffuse large B-cell lymphoma) patients via '''gene expression''' ([http://www.broadinstitute.org/mpr/publications/projects/Lymphoma/Shipp_et_al_2002.pdf pdf]). Raw data for all Lymphochip microarrays are [http://llmpp.nih.gov/lymphoma/data/rawdata/ available here]. For each microarray, two scan files were generated, one for each fluorescence emission wavelength corresponding to the fluorophor used in the reverse transcription labeling reaction.
 +
* http://www.cse.ust.hk/~qyang/ICDMDMC07/: '''indoor location and transfer learning'''. The task is to predict the location of each collection of received signal strength (RSS) values in an indoor environment, received from the WiFi Access Points (APs).
 +
*# The training data are a set of (RSS values, Location Label) pairs, where the location labels are discrete (non-sequential), and a collection of partially labelled user traces, each of which corresponds to a sequence of RSS values collected as a user continuously walks around a building.
 +
*# The training data and test data are collected at different time periods. Some test data objects are associated with location labels to use as benchmarks.
 +
** A similar dataset, http://www.cse.ust.hk/~derekhh/ActivityRecognition/dataset/hkust.rar, was used in the papers [https://www.aaai.org/Papers/AAAI/2004/AAAI04-092.pdf High-level Goal Recognition in a Wireless LAN] and [https://www.aaai.org/Papers/AAAI/2005/AAAI05-001.pdf Multiple-Goal Recognition from Low-Level Signals] for "inferring high-level user-behavior patterns from low-level sensory data through '''location-based plan recognition'''".
 +
* [http://www.caida.org/data/overview/ CAIDA] collects several different types of Internet-related data at geographically and topologically diverse locations.
 +
 
 +
== Kaggle competitions ==
 +
* https://www.kaggle.com/c/seizure-prediction: '''predict seizures in intracranial EEG''' recordings. Intracranial EEG was recorded from dogs with naturally occurring epilepsy using an ambulatory monitoring system. EEG was sampled from 16 electrodes at 400 Hz, and recorded voltages were referenced to the group average. These are long duration recordings, spanning multiple months up to a year and recording up to a hundred seizures in some dogs. Preictal training and testing data segments are provided covering one hour prior to seizure with a five minute seizure horizon.
 +
* https://www.kaggle.com/c/belkin-energy-disaggregation-competition/data: '''SmartHouse energy consumption prediction'''. Electromagnetic Interference (EMI) is measured using a special sensor built at the Ubicomp Lab to identify what appliance is being used and how much energy it is consuming. The data is available from 4 homes (H1-H4) consisting of both training datasets and testing datasets. The training set includes information about which appliance was turned ON or OFF and at what timestamps.
 +
* https://www.kaggle.com/c/predicting-parkinson-s-disease-progression-with-smartphone-data: measure the symptoms of '''Parkinson’s disease''' with a smartphone. The data was collected from 9 PD patients, at varying stages of the disease, and 7 healthy controls over a period within 4 months. The data includes the following streams: audio, accelerometry (3D, for each of the 3 axes: mean, absolute central moment, standard deviation, maximum deviation, power spectral density across four separate bands), GPS (latitude, longitude, altitude), compass (for each of the 3 axes: mean, absolute central moment, standard deviation, maximum deviation).
 +
* https://www.kaggle.com/c/accelerometer-biometric-competition/data: '''recognize users of mobile devices''' from accelerometer data. The dataset contains approximately 60 million unique samples of accelerometer data collected from 387 different devices. These are split into equal sets for training and test. Samples in the training set are labeled with the unique device from which the data was collected. The test set is demarcated into 90k sequences of consecutive samples from one device.
 +
* https://www.kaggle.com/c/connectomics: '''network structure reconstruction'''. Test data includes time series of neural activities obtained from fluorescence signals and (x, y) coordinates of the neurons. (Neurons are arranged on a flat surface simulating a neural culture). For training data, the network connectivity is also provided. The task is to reconstruct their connectivity from activity data.
 +
* https://www.kaggle.com/c/grasp-and-lift-eeg-detection: identify '''hand motions from EEG recordings'''. Dataset contains EEG time series for 12 subjects in total, 10 series of trials for each subject (8 series in training set and 2 series in test set), and approximately 30 trials within each series. The task is to detect each of six events: HandStart, FirstDigitTouch, BothStartLoadPhase, LiftOff, Replace or BothReleased.
 +
 
 +
== Databases ==
 +
Medicine:
* http://www.physionet.org/ contains collections of recorded physiologic signals (accelerometry, ECG, EEG, EHG, EMG, blood pressure, heart rate, auditory brainstem response, etc.)
* http://www.ebi.ac.uk/arrayexpress/experiments/browse.html is a database of genomic data. Data can be searched by a number of parameters, such as molecule (DNA, RNA, amplicon, metabolite, protein) or experimental technology (array, high-throughput sequencing, mass spectrometry)
* https://www.ieeg.org/ includes a large database of scientific data and tools to analyze epilepsy datasets.
 +
* https://sleepdata.org/datasets offers six public datasets of sleep research data collected in children and adults across the U.S.
 +
Cross-disciplinary data repositories, data collections and data search engines (from datacentral.com):
 +
* [http://aws.amazon.com/ru/datasets/ AWS public data sets]
 +
* https://datahub.io/ - data management platform from the Open Knowledge Foundation, based on the CKAN data management system
-
== See also ==
+
== Sensors ==
 +
* [http://iot.ee.surrey.ac.uk:8080/datasets.html Smart City] Includes Vehicle Traffic, Pollution and Weather data.
 +
* At the bottom of the page there are links to the [http://iot.ee.surrey.ac.uk:8080/datasets.html Live data set], which contains sensor data from the meeting room, including presence of people in the room with temperature, humidity, oxygen and carbon dioxide values.
 +
 
 +
== Spatial-time series ==
 +
* [http://copernicus.eu/data-access-satellite Copernicus: Europe's eyes on Earth]
 +
 
 +
== See also ==
* [[Временной ряд]]
-
The library is used in the projects:
+
The dataset is used in:
* [[Численные методы обучения по прецедентам (практика, В.В. Стрижов)/Группа 874, весна 2011|«study of the properties of forecasting algorithms»]],
* [[Руководство исследовательскими проектами (практика, В.В. Стрижов)|«selection of forecasting models»]].
 +

Current version

A time series is a sequence of equally spaced measurements. This page lists a number of example time series for testing forecasting algorithms.


File structure

Data is stored in comma-separated .csv format, with decimals separated by periods. The first column contains timestamps. The second column stores the time series to be forecast; other columns may store complementary time series. Each dataset is accompanied by a tsNameReadme.txt file, which specifies:

  • data source (or a problem that was solved),
  • timestamp format,
  • interpretation of the data, column-wise,
  • type of data scale for each column,
  • periodicity, if present,
  • other information.
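
The file layout described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the function name load_ts_csv and the sample rows are hypothetical, the file is assumed to have no header row, and timestamps are kept as raw strings because their format is given only in the accompanying readme file.

```python
import csv
import io


def load_ts_csv(text):
    """Parse CSV text laid out as described above (hypothetical helper).

    Column 1 holds timestamps, column 2 the series to be forecast,
    and any further columns hold complementary series.
    """
    timestamps, target, extras = [], [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        # Timestamps stay as strings: their format is dataset-specific.
        timestamps.append(row[0])
        # Empty cells are treated as missing values (an assumption).
        values = [float(v) if v.strip() else float("nan") for v in row[1:]]
        target.append(values[0])
        extras.append(values[1:])
    return timestamps, target, extras


# Made-up sample rows, for illustration only.
sample = "2011-01-01 00:00,10.5,3.2\n2011-01-01 01:00,11.0,3.1\n"
timestamps, target, extras = load_ts_csv(sample)
```

Keeping the forecast target separate from the complementary columns mirrors the distinction the format itself makes between the second column and the rest.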
ts (describes a time series)

Field    Size                  Description
t        [T,1]                 Time in milliseconds since 1/1/1970 (UNIX format)
x        [T,N]                 Columns of the matrix are time series; missing values are NaNs
legend   {1,N}                 Descriptions of the time series in ts.x, e.g. ts.legend={‘Consumption’, ‘Price’, ‘Temperature’};
readme   [string]              Data information (source, formation time, etc.)
type     [1,N]                 (optional) Types of the time series in ts.x: 1 for real-valued, 2 for binary, k for k-valued
timegen  [T,1]=func(timetick)  (optional) Time ticks generator; may contain the start (end) time in UNIX format and a function to generate the vector t [T,1]
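
As an illustration of the ts structure, the fields can be mirrored in a small Python class. This is a sketch, not the format used by the library: the class name TimeSeries, the is_equally_spaced helper and the sample values are assumptions, the optional timegen field is left out, and x is stored as T rows of N values.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TimeSeries:
    """Hypothetical Python mirror of the ts structure."""
    t: List[int]                      # [T,1] time in ms since 1/1/1970
    x: List[List[float]]              # [T,N] rows of samples, NaN = missing
    legend: List[str]                 # {1,N} one description per series
    readme: str = ""                  # free-form data information
    type: Optional[List[int]] = None  # [1,N] 1 real, 2 binary, k k-valued

    def __post_init__(self):
        # Check that the declared shapes agree with each other.
        if len(self.x) != len(self.t):
            raise ValueError("x must have one row per time tick")
        n = len(self.legend)
        if any(len(row) != n for row in self.x):
            raise ValueError("each row of x needs one value per legend entry")
        if self.type is not None and len(self.type) != n:
            raise ValueError("type needs one entry per series")

    def is_equally_spaced(self):
        """True if consecutive time ticks differ by a single constant step."""
        steps = {b - a for a, b in zip(self.t, self.t[1:])}
        return len(steps) <= 1


# Made-up hourly data with one missing value, for illustration only.
ts = TimeSeries(
    t=[0, 3600000, 7200000],
    x=[[10.5, 3.2, 1.0], [math.nan, 3.1, 2.0], [11.0, 3.0, 1.5]],
    legend=["Consumption", "Price", "Temperature"],
)
```

The equal-spacing check reflects the definition of a time series given at the top of the page.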

Examples

Synthetic time series (in ts format, see TSForecastingInterfaces)

  • Constant Constant
  • Sine Sine
  • Two sines 2Sines
  • Triangles Saw
  • Trapezoid Trapezium

Highly periodic

  • Electricity consumption EnergyConsumption
  • Machinery
  • Sounds
  • Music LedZeppelin

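Noisy periodic time series

  • Electricity prices
  • Prices for consumables and commodities
  • Retail sales RetailSalesItems
  • Sugar prices SugarPrice
  • Bread prices WhiteBreadPrices
  • Drink consumption
  • Weather: temperature, humidity, wind GermanWeather
  • Passenger (and freight) transportation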

Complex periodicity

  • ECG ECG
  • Pulse wave
  • MEG
  • Reflected time series

Aperiodic

  • Flu propagation FluUSA
  • Migration

High noise

  • Stock exchange Cisco
  • Market indices DowJonesIndustrialAverage
  • Option prices

Event-driven

  • Earthquakes ArkansasEarthquakes.csv
  • Financial Bubbles FinancialBubbles.csv
  • Records

Accelerometry

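  • https://archive.ics.uci.edu/ml/datasets/OPPORTUNITY+Activity+Recognition

OPPORTUNITY Activity Recognition Data Set for Human Activity Recognition from Wearable, Object, and Ambient Sensors is a dataset devised to benchmark human activity recognition algorithms (classification, etc.).

  • http://hasc.jp/hc2010/HASC2010corpus/hasc2010corpus-en.html (2010) 540 subjects, 6 activities (stay, walk, jog, skip, stair up, stair down). Includes segmented data (only one activity type, 20 seconds) and sequence data.
  • http://www-scf.usc.edu/~mizhang/datasets.html (2012) 14 subjects, 12 activities (walk forward, walk left, walk right, go upstairs, go downstairs, run forward, jump up and down, sit and fidget, stand, sleep, elevator up, and elevator down). The data is captured by the MotionNode inertial sensing device, which integrates a 3-axis accelerometer (+-6g) and a 3-axis gyroscope (+-500dps), sampled at 100 Hz.
  • http://www.cis.fordham.edu/wisdm/dataset.php (2010-2011) 36 subjects, 6 activities (stay, walk, jog, sit, stair up, stair down).
  • http://www.opportunity-project.eu/challengedatasetdownload (2011), 4 subjects. An annotated dataset of complex, interleaved and hierarchical activities, with a particularly large number of atomic activities (around 30’000), collected in a rich sensor environment. The full setup including both ambient and on-body sensors comprises 72 sensors of 10 modalities, integrated in the environment and on the body.
  • http://smartlab.ws/component/content/article?id=60 (2013) 30 subjects, 6 activities, fixed set of features from
  • http://www.ife.ee.ethz.ch/research/groups/Dataset/skoda_mini_checkpoint/SkodaMiniCP.zip Skoda Mini Checkpoint. The dataset contains acceleration measurements (calibrated and raw) of 10 manipulative gestures performed in a car maintenance scenario (1 subject, 70 instances per activity).
    • Some of the datasets available from http://www.ife.ee.ethz.ch/research/groups/Dataset are described here.
  • https://cloud5.cs.fau.de/owncloud/public.php?service=files&t=9a07b48c7950d1b61d8fb8b0382ff6c7 12 subjects, 4 swimming styles (butterfly, backstroke, breaststroke and freestyle), two states (swimming/resting), one type of events (turns). See Classification of Kinematic Swimming Data with Emphasis on Resource Consumption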

Other

  • http://llmpp.nih.gov/lymphoma/: classification of DLBCL (Diffuse large B-cell lymphoma) patients via gene expression (pdf). Raw data for all Lymphochip microarrays are available here. For each microarray, two scan files were generated, one for each fluorescence emission wavelength corresponding to the fluorophor used in the reverse transcription labeling reaction.
  • http://www.cse.ust.hk/~qyang/ICDMDMC07/: indoor location and transfer learning. The task is to predict the location of each collection of received signal strength (RSS) values in an indoor environment, received from the WiFi Access Points (APs).
    1. The training data are a set of (RSS values, Location Label) pairs, where the location labels are discrete (non-sequential), and a collection of partially labelled user traces, each of which corresponds to a sequence of RSS values collected as a user continuously walks around a building.
    2. The training data and test data are collected at different time periods. Some test data objects are associated with location labels to use as benchmarks.
  • CAIDA collects several different types of Internet-related data at geographically and topologically diverse locations.

Kaggle competitions

  • https://www.kaggle.com/c/seizure-prediction: predict seizures in intracranial EEG recordings. Intracranial EEG was recorded from dogs with naturally occurring epilepsy using an ambulatory monitoring system. EEG was sampled from 16 electrodes at 400 Hz, and recorded voltages were referenced to the group average. These are long duration recordings, spanning multiple months up to a year and recording up to a hundred seizures in some dogs. Preictal training and testing data segments are provided covering one hour prior to seizure with a five minute seizure horizon.
  • https://www.kaggle.com/c/belkin-energy-disaggregation-competition/data: SmartHouse energy consumption prediction. Electromagnetic Interference (EMI) is measured using a special sensor built at the Ubicomp Lab to identify what appliance is being used and how much energy it is consuming. The data is available from 4 homes (H1-H4) consisting of both training datasets and testing datasets. The training set includes information about which appliance was turned ON or OFF and at what timestamps.
  • https://www.kaggle.com/c/predicting-parkinson-s-disease-progression-with-smartphone-data: measure the symptoms of Parkinson’s disease with a smartphone. The data was collected from 9 PD patients, at varying stages of the disease, and 7 healthy controls over a period within 4 months. The data includes the following streams: audio, accelerometry (3D, for each of the 3 axes: mean, absolute central moment, standard deviation, maximum deviation, power spectral density across four separate bands), GPS (latitude, longitude, altitude), compass (for each of the 3 axes: mean, absolute central moment, standard deviation, maximum deviation).
  • https://www.kaggle.com/c/accelerometer-biometric-competition/data: recognize users of mobile devices from accelerometer data. The dataset contains approximately 60 million unique samples of accelerometer data collected from 387 different devices. These are split into equal sets for training and test. Samples in the training set are labeled with the unique device from which the data was collected. The test set is demarcated into 90k sequences of consecutive samples from one device.
  • https://www.kaggle.com/c/connectomics: network structure reconstruction. Test data includes time series of neural activities obtained from fluorescence signals and (x, y) coordinates of the neurons. (Neurons are arranged on a flat surface simulating a neural culture). For training data, the network connectivity is also provided. The task is to reconstruct their connectivity from activity data.
  • https://www.kaggle.com/c/grasp-and-lift-eeg-detection: identify hand motions from EEG recordings. Dataset contains EEG time series for 12 subjects in total, 10 series of trials for each subject (8 series in training set and 2 series in test set), and approximately 30 trials within each series. The task is to detect each of six events: HandStart, FirstDigitTouch, BothStartLoadPhase, LiftOff, Replace or BothReleased.
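
The per-axis summary statistics named in the Parkinson’s disease entry above (mean, absolute central moment, standard deviation, maximum deviation) can be computed from a window of raw samples roughly as follows. This is a sketch under assumptions, not the competition's feature extraction: the function name axis_features is hypothetical, the absolute central moment is taken to be the first-order one (mean absolute deviation from the mean), the standard deviation is the population form, and the maximum deviation is the largest absolute distance from the mean.

```python
import math


def axis_features(samples):
    """Summary statistics for one sensor axis (hypothetical helper)."""
    n = len(samples)
    mean = sum(samples) / n
    # Absolute distance of every sample from the window mean.
    deviations = [abs(v - mean) for v in samples]
    return {
        "mean": mean,
        # Assumed to be the first absolute central moment.
        "abs_central_moment": sum(deviations) / n,
        # Population standard deviation (divides by n, an assumption).
        "std": math.sqrt(sum(d * d for d in deviations) / n),
        "max_deviation": max(deviations),
    }


# Made-up window of samples for a single axis.
window = [1.0, 2.0, 3.0, 4.0]
features = axis_features(window)
```

Applying the same function to each of the three axes gives one feature vector per window; the power spectral density bands mentioned in the entry are not covered here.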

Databases

Medicine:

  • http://www.physionet.org/ contains collections of recorded physiologic signals (accelerometry, ECG, EEG, EHG, EMG, blood pressure, heart rate, auditory brainstem response, etc.)
  • http://www.ebi.ac.uk/arrayexpress/experiments/browse.html is a database of genomic data. Data can be searched by a number of parameters, such as molecule (DNA, RNA, amplicon, metabolite, protein) or experimental technology (array, high-throughput sequencing, mass spectrometry)
  • https://www.ieeg.org/ includes a large database of scientific data and tools to analyze epilepsy datasets.
  • https://sleepdata.org/datasets offers six public datasets of sleep research data collected in children and adults across the U.S.

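Cross-disciplinary data repositories, data collections and data search engines (from datacentral.com):

  • AWS public data sets
  • https://datahub.io/ - data management platform from the Open Knowledge Foundation, based on the CKAN data management system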

Sensors

  • Smart City Includes Vehicle Traffic, Pollution and Weather data.
  • At the bottom of the page there are links to the Live data set, which contains sensor data from the meeting room, including presence of people in the room with temperature, humidity, oxygen and carbon dioxide values.

Spatial-time series

  • Copernicus: Europe's eyes on Earth

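See also

  • Временной ряд (Time series)

The dataset is used in:

  • «study of the properties of forecasting algorithms»,
  • «selection of forecasting models».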
