Multivariate Time Series Classification using Machine Learning Algorithms - Python

I am fairly new to machine learning and am currently working on a way to classify time series data. In order to do so, I would like to get a better understanding of how time series data can be fed into machine learning algorithms.
Further information:
Each sample is a time series consisting of 2000 time points. For each time point, there are several variables, like temperature, speed, acceleration, etc. The data can be represented like this:
[Image: data structure for one time series sample]
The whole dataset consists of 3000 samples: 3000 samples x 2000 data points per sample = 6,000,000 data points per variable.
The goal is to classify the samples into classes 0 to 4.
My first attempt was just feeding the data as an array into the machine learning algorithms.
Let's say we just focus on temperature. We can structure the data like this:
[Image: input training data for an ML algorithm]
Let X be the training input and y the training output; the data looks like:
[21,21,22,...]=0
[35,35,35,...]=2
[11,12,12,...]=1
[18,17,18,...]=0
Can I just feed a machine learning algorithm (an SVC, for example) with array-type time series data like this? How does the algorithm know that the elements in the array are chronological data points and not individual features?
Here is an example code of what I did so far:
dataframe.head()

   sample_nr  timestamp  temperature  speed  acceleration
0          1       0.01           21  -0.43       0.34205
1          1       0.02           21  -0.43       0.34205
2          1       0.03           22  -0.43       0.34205
Create a data_list that contains one sub-dataframe per sample by grouping the dataframe by sample_nr:
data_list = []
for sample_nr, sample_df in dataframe.groupby('sample_nr'):
    data_list.append(sample_df)
For a first step, we will only focus on one feature, let's say the temperature:
X_list = []
y_list = []
for sample in data_list:
    temp_X = np.array(sample['temperature'])
    temp_y = sample['label'].unique()[0]
    X_list.append(temp_X)
    y_list.append(temp_y)
Transform the lists to pandas.DataFrames:
X_df = pd.DataFrame(X_list)
y_df = pd.DataFrame(y_list)
Now X_df is a 3000x2000 DataFrame: each row describes a sample, and the values in the columns are the temperature values for each of the 2000 time steps:
print(X_df)

    0   1   2   3
0  21  21  22  22
1  35  35  35  36
2  11  12  12  12
Also, for the output value:
print(y_df)

   0
0  0
1  2
2  1
Now split up the dataframe to train and test data:
from sklearn.model_selection import train_test_split

X_train_array, X_test_array, y_train_array, y_test_array = train_test_split(
    X_df, y_df, test_size=0.2, shuffle=True, random_state=42)
X_train_df = pd.DataFrame(X_train_array)
X_test_df = pd.DataFrame(X_test_array)
y_train_df = pd.DataFrame(y_train_array)
y_test_df = pd.DataFrame(y_test_array)
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train_df, y_train_df)
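For reference, here is a compressed, runnable sketch of the pipeline above, with synthetic stand-in data (shapes scaled down from 3000x2000 to keep it quick):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm

# Synthetic stand-in: 30 samples x 200 temperature values each, labels 0-4
rng = np.random.default_rng(42)
X_df = pd.DataFrame(rng.normal(20.0, 5.0, size=(30, 200)))
y = pd.Series(rng.integers(0, 5, size=30))

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.2, shuffle=True, random_state=42)

clf = svm.SVC()
clf.fit(X_train, y_train)  # each column is treated as an independent feature
print(clf.score(X_test, y_test))

Note that the SVC itself has no notion of temporal order: it sees each of the columns as a plain feature in a feature vector, which is exactly the concern raised in the question.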

Related

How to Generate list of Predictions after implementing ML Model on full data

I am working on an ML project. I have selected my best dataset and am making predictions with it; the model returns a list of 0 (no) and 1 (yes). For example, if I pass 10 random points from the dataset, it generates a NumPy array of 10 predictions containing 0s and 1s. But what I want is to pass my whole dataset of 700 records into the model and have it return each prediction along with the "ID" column value, so I can identify to whom each prediction belongs. Currently, I am passing the full dataset and it returns a NumPy array of 0s and 1s, so I can't tell which data point a prediction belongs to. Following is my code:
def predicting_attrition(features):
    features = pre_process(features)
    features['Performance Rating'] = features['Performance Rating'].fillna(mv_pr)
    features['Training Hours'] = features['Training Hours'].fillna(mv_th)
    features['Length of Total Service'] = features['Length of Total Service'].fillna(mv_lts)
    features = pd.get_dummies(features)
    missing_cols = set(trainX.columns) - set(features.columns)
    for c in missing_cols:
        features[c] = 0
    features = features[trainX.columns]
    return xgb.predict(features)

x = predicting_attrition(data)
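A minimal sketch of one way to do what the question asks: keep the ID column aside and join it back onto the predictions (this assumes the dataframe has an 'ID' column, as mentioned in the question):

import pandas as pd

predictions = predicting_attrition(data)

# Pair each prediction with the row it came from via the ID column
results = pd.DataFrame({
    'ID': data['ID'].values,  # 'ID' is the assumed identifier column
    'prediction': predictions,
})
print(results.head())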

Multiple-series training input is giving NaN loss while the same data as one-series training input is not

I want to train an N-BEATS time series model using Darts. I have a time series DataFrame for each user, so I want to use multiple-series training, but when I feed the list of TimeSeries I immediately get NaN losses during training. If I concatenate all users' TimeSeries into one, I get a normal loss. In both cases the data is scaled, filled, and cast to float32:
data = scaler.transform(filler.transform(data)).astype(np.float32)
Here is the code I use to combine the list of TimeSeries into a single TimeSeries. I also have pure Darts code for this, but it is much slower for the same result.
SPLIT = 0.8
if concatenate_to_one_ts:
    all_dfs = []
    all_dfs_cov = []
    for i in range(len(list_of_target_ts)):
        all_dfs.append(list_of_target_ts[i].pd_series())
        all_dfs_cov.append(list_of_cov_ts[i].pd_dataframe())
    all_dfs = pd.concat(all_dfs)
    all_dfs_cov = pd.concat(all_dfs_cov)
    nbr_train_sample = int(len(all_dfs) * SPLIT)
    all_dfs_train = all_dfs[:nbr_train_sample]
    all_dfs_test = all_dfs[nbr_train_sample:]
    list_of_target_ts_train = TimeSeries.from_series(all_dfs_train.reset_index(drop=True))
    list_of_target_ts_test = TimeSeries.from_series(all_dfs_test.reset_index(drop=True))
    all_dfs_cov_train = all_dfs_cov[:nbr_train_sample]
    all_dfs_cov_test = all_dfs_cov[nbr_train_sample:]
    list_of_cov_ts_train = TimeSeries.from_dataframe(all_dfs_cov_train.reset_index(drop=True))
    list_of_cov_ts_test = TimeSeries.from_dataframe(all_dfs_cov_test.reset_index(drop=True))
else:
    nbr_train_sample = int(len(list_of_target_ts) * SPLIT)
    list_of_target_ts_train = list_of_target_ts[:nbr_train_sample]
    list_of_target_ts_test = list_of_target_ts[nbr_train_sample:]
    list_of_cov_ts_train = list_of_cov_ts[:nbr_train_sample]
    list_of_cov_ts_test = list_of_cov_ts[nbr_train_sample:]

model = NBEATSModel(input_chunk_length=4,
                    output_chunk_length=1,
                    batch_size=512,
                    n_epochs=5,
                    nr_epochs_val_period=1,
                    model_name="NBEATS_test",
                    generic_architecture=True,
                    force_reset=True,
                    save_checkpoints=True,
                    show_warnings=True,
                    log_tensorboard=True,
                    torch_device_str='cuda:0')

model.fit(series=list_of_target_ts_train,
          past_covariates=list_of_cov_ts_train,
          val_series=list_of_target_ts_val,
          val_past_covariates=list_of_cov_ts_val,
          verbose=True,
          num_loader_workers=20)
With multiple-series training I get:
Epoch 0: 8%|██████████▉ | 2250/27807 [03:00<34:11, 12.46it/s, loss=nan, v_num=logs, train_loss=nan.0
With single-series training I get:
Epoch 0: 24%|█████████████████████████▋ | 669/2783 [01:04<03:24, 10.33it/s, loss=0.00758, v_num=logs, train_loss=0.00875]
I am also confused by the number of samples per epoch given the same batch size: from what I read here: https://unit8.com/resources/training-forecasting-models/ the single series should have more samples, as the window-size cut is not happening for each of the multiple series.
Regarding the NaNs, I would try reducing the learning rate if I were you. Also double-check that there is no NaN remaining in your data (see the corresponding entry here).
Regarding the number of samples, each of the separate time series are split into several (input, output) slices. For the single series, this split is done once overall, whereas for the multiple series, this split is done once per series and then all the resulting samples are regrouped in a common training set. So it is expected to have more training samples with multiple series (and each training sample will have fewer dimensions compared to the single-multivariate-series case).
Thanks Julien Herzen, your answer helped me a lot in finding the issue. I want to add more detail on what was happening.
Regarding the NaNs: the filler from Darts uses pandas interpolation by default. That interpolation was not possible for the multiple series, because some of the series had only NaNs in certain columns, so there was nothing to interpolate from, and the filler returned series that still contained NaN values. This did not happen for the concatenated single series: with all the series concatenated, there were values to interpolate from. If you do not need interpolation, just pass fill=0.0, i.e. MissingValuesFiller(fill=0.0).
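A minimal sketch of that option (the import path matches current Darts releases):

from darts.dataprocessing.transformers import MissingValuesFiller

# Fill remaining NaNs with a constant instead of the default pandas interpolation
filler = MissingValuesFiller(fill=0.0)
list_of_target_ts = filler.transform(list_of_target_ts)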
Regarding the number of samples: after digging into the Darts code, I found that the N-BEATS model uses GenericShiftedDataset, which for multiple series computes the length of the dataset by taking the length of the longest sub-series and multiplying it by the number of series:
self.max_samples_per_ts = (max(len(ts) for ts in self.target_series) - self.size_of_both_chunks + 1)
Then, when __getitem__ is called:
target_idx = idx // self.max_samples_per_ts
target_series = self.target_series[target_idx]
It selects a series by dividing idx by the maximum number of samples, so shorter series get sampled more often than longer ones: they contain less data but have the same chance of being picked.
Here is my smallest example, with input_chunk_length = 4 and output_chunk_length = 1:
Multiple series with lengths [71, 19] -> number of samples: (71 * 2) - (2 * input_chunk_length) = 134
Concatenated into a single series of length 90 -> number of samples: 90 - input_chunk_length = 86
In the multi-series case, the samples in the short sub-series will likely be drawn more times.
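A quick sketch reproducing that arithmetic from the GenericShiftedDataset formula above:

input_chunk_length, output_chunk_length = 4, 1
size_of_both_chunks = input_chunk_length + output_chunk_length

def n_samples(series_lengths):
    # longest sub-series, minus both chunks, plus one, times the number of series
    max_samples_per_ts = max(series_lengths) - size_of_both_chunks + 1
    return max_samples_per_ts * len(series_lengths)

print(n_samples([71, 19]))  # 134 for the multi-series case
print(n_samples([90]))      # 86 for the single concatenated series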

Preparing Grid weather data for ConvLSTM2d

I am attempting to use a ConvLSTM2D model with hourly grid weather data. I can get the data into a 4-D array with dimensions (num_hours, lat, lon, num_features). ConvLSTM2D requires 5-D input, and I was planning on setting a variable for a sequence length of maybe 24 hours. My question is: how do I create an additional dimension in this array to get the sequence-length dimension (num_hours, sequence_length, lat, lon, num_features)? Is there a smarter, more efficient way to get the data into the correct form from a pandas dataframe that has columns for lat, lon, time, feature type, and value?
I realize it is always easier to have a sample dataset when asking a question, so I created a set to mimic the issue.
import pandas as pd
import numpy as np

weather_variables = ['windspeed', 'temp', 'pressure']
lats = [x/10 for x in range(400, 500, 5)]
lons = [x/10 for x in range(900, 1000, 5)]
hours = pd.date_range('1/1/2021', '9/28/2021', freq='H')

df = []
for i in range(0, len(hours)):
    for weather in weather_variables:
        temp_df = pd.DataFrame(index=lats, columns=lons,
                               data=np.random.randint(0, 100, size=(len(lats), len(lons))))
        temp_df = temp_df.unstack().to_frame()
        temp_df.reset_index(inplace=True)
        temp_df['weather_variable'] = weather
        temp_df['ts'] = hours[i]
        df.append(temp_df)
df = pd.concat(df)
df.columns = ['lon', 'lat', 'value', 'weather_variable', 'ts']
So this code creates a dummy dataset containing 3 grids (one per weather variable) for each hour. The goal is to convert this into a 5-D array of overlapping 24-hour sequences; I think the array would look like (len(hours)?, 24, 20, 20, 3).
From the ConvLSTM paper,
The weather radar data is recorded every 6 minutes, so there
are 240 frames per day. To get disjoint subsets for training, testing and validation, we partition each
daily sequence into 40 non-overlapping frame blocks and randomly assign 4 blocks for training, 1
block for testing and 1 block for validation. The data instances are sliced from these blocks using
a 20-frame-wide sliding window. Thus our radar echo dataset contains 8148 training sequences,
2037 testing sequences and 2037 validation sequences and all the sequences are 20 frames long (5
for the input and 15 for the prediction).
If my calculations are correct, each of the "non-overlapping frame blocks" should have 6 frames in it (240 frames per day / 40 blocks per day = 6 frames per block). So I'm not sure how you create a 20-frame-wide sliding window in a given block. Nonetheless, you could take a similar approach: divide your data into non-overlapping windows of a specific length. Perhaps you use 6 hours of data to predict the next 6. I'm not sure that you need to keep the windows within a given day--a change from 11 pm to 1 am seems just as valid a time window as from say 3 am to 5 am.
I don't think Pandas will be an efficient way to massage the data. I would stick with NumPy or probably a TensorFlow data structure.
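If overlapping 24-hour sequences are what you want, here is a minimal NumPy sketch using sliding_window_view (it assumes the 4-D array (num_hours, lat, lon, num_features) from the question already exists):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Dummy 4-D array matching the question: (num_hours, lat, lon, num_features)
arr = np.random.rand(100, 20, 20, 3)

seq_len = 24
# Slide a 24-step window over the time axis: shape (num_hours - 23, 20, 20, 3, 24)
windows = sliding_window_view(arr, window_shape=seq_len, axis=0)
# Move the window axis into second position: (num_windows, 24, 20, 20, 3)
windows = np.moveaxis(windows, -1, 1)
print(windows.shape)  # (77, 24, 20, 20, 3)

For the non-overlapping blocks suggested above, a plain reshape would do: arr[:len(arr) // seq_len * seq_len].reshape(-1, seq_len, 20, 20, 3).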

Adding a feature from a dataset into a function causes "TypeError: can't convert type 'ndarray' to numerator/denominator"

The task requires you to load a feature of the diabetes dataset and write your own line of best fit for the training data.
I have written the required line-of-best-fit algorithm; however, when feeding the training data into it, I receive this error:
"TypeError: can't convert type 'ndarray' to numerator/denominator"
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from statistics import mean

diabetes = datasets.load_diabetes()
diabetes_X = diabetes.data[:, np.newaxis, 2]
diabetes_X_train = diabetes_X[:-20]  # creating the testing and training data
diabetes_X_test = diabetes_X[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

## The below code is where the issue is occurring
xs = np.array(diabetes_X_train, dtype=np.float64)
ys = np.array(diabetes_y_train, dtype=np.float64)

## the algorithm to calculate the line of best fit
def best_fit_slope_and_intercept(xs, ys):
    m = (((mean(xs)*mean(ys)) - mean(xs*ys)) /
         ((mean(xs)*mean(xs)) - mean(xs*xs)))
    b = mean(ys) - m*mean(xs)
    return m, b

m, b = best_fit_slope_and_intercept(xs, ys)
print(m, b)
I understand that converting the data into the correct format is the issue, but after doing research I am unable to find the correct way to do so.
All input on how to correctly concatenate or convert the training data as required is appreciated.
The best format for dealing with data is a DataFrame, in my view. You could easily make your dataframe like this:
urdataframe = {'headername': bestfitted}
urdataframe = pd.DataFrame(data=urdataframe)
Then you can concat your dataframes easily, like:
finaldata = pd.concat((traindata, urdataframe), axis=1)
You can apply all machine learning functions to DataFrames and are less likely to get errors. For example, if your train data is something like:
age sex sugar
42 1 120
45 0 250
32 1 98
and your answer is like:
answer
yes
no
yes
So after the concat with the code I mentioned, it would be like:
age sex sugar answer
42 1 120 yes
45 0 250 no
32 1 98 yes
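A runnable sketch of that approach (the column names come from the example above):

import pandas as pd

traindata = pd.DataFrame({'age': [42, 45, 32],
                          'sex': [1, 0, 1],
                          'sugar': [120, 250, 98]})
answers = pd.DataFrame({'answer': ['yes', 'no', 'yes']})

# Column-wise concat lines the answers up with the training rows
finaldata = pd.concat((traindata, answers), axis=1)
print(finaldata)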

Will pandas dataframe object work with sklearn kmeans clustering?

dataset is a pandas DataFrame. This is sklearn.cluster.KMeans:
km = KMeans(n_clusters = n_Clusters)
km.fit(dataset)
prediction = km.predict(dataset)
This is how I decide which entity belongs to which cluster:
for i in range(len(prediction)):
    cluster_fit_dict[dataset.index[i]] = prediction[i]
This is how dataset looks:
A 1 2 3 4 5 6
B 2 3 4 5 6 7
C 1 4 2 7 8 1
...
where A, B, C are indices.
Is this the correct way of using k-means?
Assuming all the values in the dataframe are numeric,
# Convert DataFrame to matrix
mat = dataset.values
# Using sklearn
km = sklearn.cluster.KMeans(n_clusters=5)
km.fit(mat)
# Get cluster assignment labels
labels = km.labels_
# Format results as a DataFrame
results = pandas.DataFrame([dataset.index,labels]).T
Alternatively, you could try KMeans++ for Pandas.
To know if your dataframe dataset has suitable content you can explicitly convert to a numpy array:
dataset_array = dataset.values
print(dataset_array.dtype)
print(dataset_array)
If the array has a homogeneous numerical dtype (typically numpy.float64) then it should be fine for scikit-learn 0.15.2 and later. You might still need to normalize the data with sklearn.preprocessing.StandardScaler, for instance.
If your data frame is heterogeneously typed, the dtype of the corresponding numpy array will be object which is not suitable for scikit-learn. You need to extract a numerical representation for all the relevant features (for instance by extracting dummy variables for categorical features) and drop the columns that are not suitable features (e.g. sample identifiers).
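A minimal sketch of that preparation step (the 'id', 'color', and 'value' column names here are hypothetical):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical heterogeneous frame: an identifier, a categorical and a numeric feature
dataset = pd.DataFrame({'id': ['A', 'B', 'C', 'D'],
                        'color': ['red', 'blue', 'red', 'blue'],
                        'value': [1.0, 4.5, 0.9, 5.1]})

X = pd.get_dummies(dataset.drop(columns=['id']))  # drop the identifier, encode categoricals
X_scaled = StandardScaler().fit_transform(X)      # normalize before k-means
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(dict(zip(dataset['id'], labels)))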
