Get prediction from multiple rows - Decision Tree Regressor - python

I have a dataset of weather data that I want to use to make a prediction.
The data set consists of data from several different locations. The features in the data set are as follows:
datetime
location
rain
snow
temp_min
temp_max
clouds
pressure
humidity
wind_speed
wind_deg
weather_description
The measurements have been made at the same time in all locations, which makes it possible to distinguish between the individual measurements.
I want to use data from all locations as input to get a prediction for a single location.
Is it possible to use several rows as input, or can the input only consist of one row?

The DecisionTreeRegressor from scikit-learn expects a dataframe where each output is generated from a single row. You can nevertheless move all your measurements into one row (during training and testing), as below:
rain_stn1, rain_stn2, rain_stn3, ..., snow_stn1, snow_stn2, snow_stn3, ...
rain_value#stn1, rain_value#stn2, rain_value#stn3, ...
Of course, this means that there needs to be some logical relationship between the stations, such as distance. You could also create aggregate features such as rain_nearby (average of stations within 5 km) and rain_far (average of stations beyond 5 km), which is probably more helpful in your case.
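For illustration, a minimal sketch of that reshaping with pandas might look like the following (df, the chosen feature subset, and the station suffixes are assumptions based on the column list above):
import pandas as pd
# df is assumed to be the long-format dataframe described in the question,
# with one row per (datetime, location) pair
wide = df.pivot(index="datetime", columns="location",
                values=["rain", "snow", "temp_min", "temp_max"])
# flatten the resulting MultiIndex columns into names like 'rain_stn1', 'snow_stn2'
wide.columns = [f"{feature}_{station}" for feature, station in wide.columns]
wide = wide.reset_index()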
To give more specific answers, we would need more details on the use case, what you are trying to achieve, and what the dataset looks like.

Related

How to interpolate a multidimensional xarray?

I read a netCDF file using xarray. The dataset contains lat, lon information of the population for the years: 1975, 1990, 2000 and 2015.
The dataset looks like the following, and I have also made it available here:
import xarray as xr
ds = xr.open_dataset('borneo_pop_t.nc')
ds
For each pixel I would like to have the information of each year between 1975 and 2000 given the trend of the data points I have. In particular I would like to generate more information and different layers of the population for the missing years.
How can I do that?
You can use xarray's interpolation function.
Using your variable names,
import pandas as pd
# Create time series of 1x a year data
# (you can use any date range; you can generate pretty much
# any sequence you need with pd.date_range())
dates = pd.date_range('1975-01-01','2015-01-01',freq='1Y')
# Linear interpolation
ds_interp = ds.interp(time=dates)
Note a few things:
This code just generates a simple linear interpolation, though ds.interp() supports everything that scipy.interpolate.interp1d() does, e.g. cubic or polynomial interpolation. Check the docs linked above for examples.
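For instance (assuming SciPy is installed, since xarray delegates the interpolation to scipy.interpolate), a cubic interpolation over the same dates would look like:
ds_interp_cubic = ds.interp(time=dates, method='cubic')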
This of course doesn't create new information; you still only "know" the population at the original points. Be wary of what consequences interpolating the data will have on your understanding of what the population actually was in a given year.

TSFRESH - features extracted by a symmetric sliding window

As raw data we have measurements m_{i,j}, measured every 30 seconds (i=0, 30, 60, 90,...720,..) for every subject j in the dataset.
I wish to use TSFRESH (the package) to extract time-series features, such that for a point of interest at time i, features are calculated based on a symmetric rolling window.
We wish to calculate the feature vector of time point i,j based on measurements of 3 hours of context before i and 3 hours after i.
Thus, each window of 721 measurements represents a point of interest surrounded by 6 hours of "context", i.e. 360 measurements before and 360 measurements after the point of interest.
For every point of interest, features should be extracted based on 721 measurements of m_{i,j}.
I've tried using rolling_direction param in roll_time_series(), but the only options are either roll backwards or forwards in “time” - I'm looking for a way to include both "past" and "future" data in features calculation.
If I understand your idea correctly, it is even possible to do this with only one-sided rolling. Let's try with one example:
You want to predict for the time 8:00 - and you need for this the data from 5:00 until 11:00.
If you roll through the data with a window size of 6 h and positive rolling direction, you will end up with a dataset which includes exactly this part of the data (5:00 to 11:00). Normally it would be used to train for the value at 11:00 (or 12:00), but nothing prevents you from using it to predict the value at 8:00.
Basically, it is just a matter of re-indexing.
(Same is actually true for negative rolling direction)
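A toy illustration of that re-indexing idea, with assumed, simplified data (one reading per 30 minutes instead of 30 seconds):
import pandas as pd
# toy data: one target value per 30 minutes from 5:00 to 11:00
targets = pd.Series(range(13),
                    index=pd.date_range('2021-01-01 05:00', periods=13, freq='30min'))
# a forward-rolled window is identified by its end time, e.g. the 5:00-11:00
# window carries the id 11:00; to predict the value at the window centre (8:00),
# shift the target index by 3 hours so the 8:00 target is paired with the 11:00 window id
recentred = targets.copy()
recentred.index = recentred.index + pd.Timedelta(hours=3)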
A "workaround" solution:
Use the "roll_time_series" function twice; one for "backward" rolling (setting rolling_direction=1) and the second for "forward" (rolling_direction=-1), and then combine them into one.
This will provide, for each time point in the original dataset m_{i,j}, a rolled time series with 360 values "from the past" and 360 values "from the future" (i.e., the time point is at the center of the window and max_timeshift=360).
Note the use of the pandas functions concat(), sort_values(), and drop_duplicates() below; they are mandatory for this solution to work.
import numpy as np
import pandas as pd
from tsfresh.utilities.dataframe_functions import roll_time_series
from tsfresh.feature_extraction import EfficientFCParameters, MinimalFCParameters
rolled_backward = roll_time_series(activity_data,
                                   column_id=id_column,
                                   column_sort=sort_column,
                                   column_kind=None,
                                   rolling_direction=1,
                                   max_timeshift=360)
rolled_forward = roll_time_series(activity_data,
                                  column_id=id_column,
                                  column_sort=sort_column,
                                  column_kind=None,
                                  rolling_direction=-1,
                                  max_timeshift=360)
# merge into one dataframe, with a backward and a forward window for every time point (sample)
df = pd.concat([rolled_backward, rolled_forward])
# important! - sort and drop duplicates
df.sort_values(by=[id_column, sort_column], inplace=True)
df.drop_duplicates(subset=[id_column, sort_column, 'activity'], inplace=True, keep='first')
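As a follow-up (a sketch, assuming the same column names as above), the combined dataframe can then be passed to tsfresh's feature extraction:
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters
# each rolled id in df now corresponds to one original time point with up to
# 360 values before and 360 values after it
features = extract_features(df,
                            column_id=id_column,
                            column_sort=sort_column,
                            default_fc_parameters=MinimalFCParameters())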

Unsupervised learning: Anomaly detection on discrete time series

I am working on a final year project on an unlabelled dataset consisting of vibration data from multiple components inside a wind turbine.
Datasets:
I have data from 4 wind turbines each consisting of 415 10-second intervals.
About the 10 second interval data:
Each of the 415 10-second intervals consists of vibration data for the generator, gearbox, etc. (14 features in total).
The vibration data (the 14 features) are sampled at 25.6 kHz (262,144 rows in each interval).
The 10-second intervals are recorded once every day, at different times, so there is a little more than one year's worth of data.
(A head of the dataframe with some of the features was shown here.)
Plan:
My current plan is to
Do a Fast Fourier Transform (FFT) of the time-domain signal for each of the different sensors (gearbox, generator, etc.) for each of the 415 intervals. From the FFT I am able to extract frequency information to put in a dataframe (statistical data from the FFT, like spectral RMS per bin; see the sketch after this list).
Build different data sets for different components.
Add features such as wind speed, wind direction, power produced etc.
I will then build unsupervised ML models that can detect anomalies.
Unsupervised models I am considering are encoder-decoder models and clustering.
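A rough sketch of the FFT/spectral-RMS step mentioned in the plan (the signal here is a random placeholder; the 25.6 kHz sampling rate and 262,144 samples follow the numbers above, and the 100 Hz bin width is only an assumption):
import numpy as np
fs = 25600                    # 25.6 kHz sampling rate
x = np.random.randn(262144)   # placeholder for one sensor's 10-second interval
# one-sided amplitude spectrum and the corresponding frequencies
spectrum = np.abs(np.fft.rfft(x)) / len(x)
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
# RMS of the spectrum within 100 Hz wide frequency bins
bin_edges = np.arange(0, freqs.max() + 100, 100)
bin_idx = np.digitize(freqs, bin_edges)
rms_per_bin = np.array([np.sqrt(np.mean(spectrum[bin_idx == b] ** 2))
                        for b in range(1, len(bin_edges))])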
Questions:
Does it look like I have enough data for this type of task? (415 intervals x 4 different turbines = 1660 rows and approx. 20 features)
Should the data be treated as a time series? (It is sampled for 10 seconds once a day at random times.)
What other unsupervised ML models/approaches could be good for this task?
I hope this was clearly written. Thanks in advance for any input!

Detecting and Replacing Outliers

In my mind, there are multiple ways to treat dataset outliers:
Delete the data
Transform using log or binning
Replace using the mean or median
Test separately
I have a dataset of around 50,000 observations, and each observation has quite a few outlier values (some variables have a small number of outliers, some have 100-200), so deleting data is not what I'm looking for, as it would cause me to lose a huge chunk of data.
I read somewhere that using the mean and median is for artificial outliers, but in my case I think the outliers are natural.
I was actually about to use the median to get rid of the outliers and then use the mean to fill in missing values, but it doesn't seem right. However, I did use it nevertheless with this code:
import numpy as np
median = X.median()
std = X.std()
# flag values more than one standard deviation away from the median
outliers = (X - median).abs() > std
# replace the flagged values with NaN, then fill them with the median
X[outliers] = np.nan
X.fillna(median, inplace=True)
It did lower the overfitting of just one model (logistic regression), but Random Forest still gives 100%, and the shape of the graph changed (before/after plots not reproduced here).
So I'm really confused about which technique to use. I also tried clipping at the 5th and 95th percentiles of the data, but that didn't work either. Should I bin the data in each column from 1-10? Also, should I normalize or standardize my data before applying any model? Any guidance will be appreciated.
Check robust statistics.
I would suggest looking at Huber's method / winsorization, which are also available in Python.
For hypothesis testing you have the Wilcoxon signed-rank test and, I think, the Mann-Whitney test.
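A minimal sketch of winsorization with SciPy (assuming X is the dataframe from the question; clipping 5% on each tail is only an example):
import numpy as np
from scipy.stats.mstats import winsorize
# clip the lowest and highest 5% of each numeric column instead of deleting rows
for col in X.select_dtypes(include=np.number).columns:
    X[col] = np.asarray(winsorize(X[col].to_numpy(), limits=[0.05, 0.05]))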

preprocessing EEG dataset in python to get better accuracy

I have an EEG dataset with 8 features taken using an 8-channel EEG headset. Each row represents readings taken at 250 ms intervals. The values are all floating point, representing voltages in microvolts. If I plot the individual features, I can see that they form a continuous wave. The target has 3 categories: 0, 1, 2, and for a duration of time the target doesn't change, because each sample spans multiple rows. I would appreciate any guidance on how to pre-process the dataset, since using it as-is gives me very low accuracy (80%), and according to Wikipedia the P300 signal can be detected with 95% accuracy. Please note that I have almost zero knowledge about signal processing and analysing waveforms.
I did try making a 3D array where each row represented a single target and the value of each feature was a list of values that originally spanned multiple rows, but I get an error saying the estimator expected an array with dim <= 2. I'm not sure if this was the right approach; in any case, it didn't work.
Here, have a look at my feature set:
-1.2198,-0.32769,-1.22,2.4115,0.057031,-2.6568,7.372,-0.2789
-1.4262,-4.19,-5.6546,-7.7161,-5.4359,-9.4553,-3.6705,-5.4851
-1.3152,-6.8708,-8.5599,-14.739,-9.1808,-14.268,-11.632,-8.929
-0.53987,-7.5156,-8.9646,-16.656,-10.119,-15.791,-14.616,-9.4095
Their corresponding targets:
0
0
0
0
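Regarding the 3D-array attempt described above: scikit-learn estimators expect a 2D matrix of shape (n_samples, n_features), so one common workaround is to flatten each window into a single row. A rough sketch with placeholder data (the window length of 4 readings is only an assumption):
import numpy as np
# placeholder: 10 windows, each spanning 4 consecutive readings of the 8 channels
segments = np.random.randn(10, 4, 8)
# flatten each (4 x 8) window into one 32-dimensional feature row
X_2d = segments.reshape(len(segments), -1)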
