Can we cluster a multivariate time series dataset in Python?

I have a dataset with many financial signal values for different stocks at different times. For example:
StockName   Date     Signal1   Signal2
---------------------------------------
Stock1      1/1/20   a         b
Stock1      1/2/20   c         d
...
Stock2      1/1/20   e         f
Stock2      1/2/20   g         h
...
I would like to build a time series table that looks like the one below and cluster stocks based on both Signal1 and Signal2 (two variables):
StockName   1/1/20   1/2/20   ...   Cluster#
---------------------------------------------
Stock1      [a,b]    [c,d]          0
Stock2      [e,f]    [g,h]          1
Stock3      ...      ...            0
...
1) Is there a way to do this (clustering stocks based on multiple variables of the time series data)? Everything I found online is about clustering time series based on a single variable. (A reshaping and clustering sketch follows after question 2.)
2) Also, is there a way to cluster different stocks at different times as well (so that, say, Stock1 at time 1 ends up in the same cluster as Stock2 at time 3)?
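For question 1, here is a minimal sketch of one possible approach (my illustration, not taken from any answer below): pivot the long table into one multivariate series per stock and cluster with tslearn's TimeSeriesKMeans, which accepts a 3-D array of shape (n_series, n_timesteps, n_variables). The file name 'signals.csv' and the exact column names are assumptions matching the table above, and every stock is assumed to have a value on every date.
import pandas as pd
from tslearn.clustering import TimeSeriesKMeans

# long-format table with columns StockName, Date, Signal1, Signal2 (hypothetical file name)
df = pd.read_csv('signals.csv', parse_dates=['Date'])

# wide table: one row per stock, MultiIndex columns (signal, date)
wide = df.pivot(index='StockName', columns='Date', values=['Signal1', 'Signal2'])

# reshape to (n_stocks, n_dates, 2): the first block of columns is Signal1, the second is Signal2
n_stocks = wide.shape[0]
n_dates = wide['Signal1'].shape[1]
X = wide.to_numpy().reshape(n_stocks, 2, n_dates).transpose(0, 2, 1)

# k-means over the multivariate series; metric='dtw' is also supported
km = TimeSeriesKMeans(n_clusters=2, metric='euclidean', random_state=0)
result = pd.DataFrame({'StockName': wide.index, 'Cluster#': km.fit_predict(X)})
print(result)
For question 2, one simple option is to treat every (stock, date) row's [Signal1, Signal2] pair as an independent observation and run an ordinary sklearn KMeans on just those two columns, so that Stock1 at one date can share a cluster with Stock2 at another date.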

I am revising my answer here, based on the new information that you last posted.
from utils import *
import time
import numpy as np
from mxnet import nd, autograd, gluon
from mxnet.gluon import nn, rnn
import mxnet as mx
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
# %matplotlib inline
from sklearn.decomposition import PCA
import math
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
context = mx.cpu(); model_ctx=mx.cpu()
mx.random.seed(1719)
# Note: The purpose of this section (3. The Data) is to show the data preprocessing and to give rationale for using different sources of data, hence I will only use a subset of the full data (that is used for training).
def parser(x):
    return datetime.datetime.strptime(x, '%Y-%m-%d')
# dataset_ex_df = pd.read_csv('data/panel_data_close.csv', header=0, parse_dates=[0], date_parser=parser)
import yfinance as yf
# Get the data for the stock GS (Goldman Sachs)
start = '2018-01-01'
end = '2020-04-22'
data = yf.download('GS', start, end)
data = data.reset_index()
data
data.dtypes
# re-name field from 'Adj Close' to 'Adj_Close'
data = data.rename(columns={"Adj Close": "Adj_Close"})
data
num_training_days = int(data.shape[0]*.7)
print('Number of training days: {}. Number of test days: {}.'.format(num_training_days, data.shape[0]-num_training_days))
# TECHNICAL INDICATORS
#def get_technical_indicators(dataset):
# Create 7 and 21 days Moving Average
data['ma7'] = data['Adj_Close'].rolling(window=7).mean()
data['ma21'] = data['Adj_Close'].rolling(window=21).mean()
# Create exponential weighted moving average
data['26ema'] = data['Adj_Close'].ewm(span=26).mean()
data['12ema'] = data['Adj_Close'].ewm(span=12).mean()
data['MACD'] = (data['12ema']-data['26ema'])
# Create Bollinger Bands
data['20sd'] = data['Adj_Close'].rolling(window=20).std()
data['upper_band'] = data['ma21'] + (data['20sd']*2)
data['lower_band'] = data['ma21'] - (data['20sd']*2)
# Create Exponential moving average
data['ema'] = data['Adj_Close'].ewm(com=0.5).mean()
# Create Momentum
data['momentum'] = data['Adj_Close']-1
dataset_TI_df = data
dataset = data
def plot_technical_indicators(dataset, last_days):
    plt.figure(figsize=(16, 10), dpi=100)
    shape_0 = dataset.shape[0]
    xmacd_ = shape_0 - last_days
    dataset = dataset.iloc[-last_days:, :]
    x_ = range(3, dataset.shape[0])
    x_ = list(dataset.index)
    # Plot first subplot
    plt.subplot(2, 1, 1)
    plt.plot(dataset['ma7'], label='MA 7', color='g', linestyle='--')
    plt.plot(dataset['Adj_Close'], label='Closing Price', color='b')
    plt.plot(dataset['ma21'], label='MA 21', color='r', linestyle='--')
    plt.plot(dataset['upper_band'], label='Upper Band', color='c')
    plt.plot(dataset['lower_band'], label='Lower Band', color='c')
    plt.fill_between(x_, dataset['lower_band'], dataset['upper_band'], alpha=0.35)
    plt.title('Technical indicators for Goldman Sachs - last {} days.'.format(last_days))
    plt.ylabel('USD')
    plt.legend()
    # Plot second subplot
    plt.subplot(2, 1, 2)
    plt.title('MACD')
    plt.plot(dataset['MACD'], label='MACD', linestyle='-.')
    plt.hlines(15, xmacd_, shape_0, colors='g', linestyles='--')
    plt.hlines(-15, xmacd_, shape_0, colors='g', linestyles='--')
    # plt.plot(dataset['log_momentum'], label='Momentum', color='b', linestyle='-')
    plt.legend()
    plt.show()
plot_technical_indicators(dataset_TI_df, 400)
This will give you some signals to work with. Of course, these features can be anything you want. I'm sure you know this is technical analysis, and not fundamental analysis. Now, you can do your clustering, and whatever else you want, at this point.
Here is a good link for clustering.
https://www.pythonforfinance.net/2018/02/08/stock-clusters-using-k-means-algorithm-in-python/
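In the spirit of that link, here is a minimal, hedged sketch of the clustering step itself: build one feature vector per stock (here mean daily return and volatility, but any of the indicators engineered above would work the same way), scale the features, and run k-means. The ticker list is purely illustrative.
import pandas as pd
import yfinance as yf
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

tickers = ['GS', 'JPM', 'MS', 'AAPL', 'MSFT']  # illustrative tickers
prices = yf.download(tickers, start='2018-01-01', end='2020-04-22')['Adj Close']

# one row of features per stock: average daily return and its volatility
returns = prices.pct_change().dropna()
features = pd.DataFrame({'mean_return': returns.mean(), 'volatility': returns.std()})

X = StandardScaler().fit_transform(features)
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
features['cluster'] = kmeans.labels_
print(features.sort_values('cluster'))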

Good material to read (Title: Time Series Clustering and Dimensionality Reduction)
https://towardsdatascience.com/time-series-clustering-and-dimensionality-reduction-5b3b4e84f6a3

Related

Any way for grouped forecast with multivariates? (python)

I was trying a VAR multivariate forecast for economic analysis.
I have quarterly financial data for 20 banks from 2010 to 2021, together with the corresponding quarterly macro data.
I tried a VAR multivariate forecast, but it did not work since I have 20 observations on every single date.
I tried grouped / hierarchical auto-ARIMA, but that did not work either since I need a multivariate forecast.
Below is my code for the VAR approach. I prefer a VAR forecast but can change models if something else is more appropriate.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.tools.eval_measures import rmse, aic
data = pd.read_csv('last.csv', parse_dates=['quarter'], index_col='quarter')
dff = data[["npl", "nom_gdp","kospi", "cpi", "avgex","interest", "m2", "lnasset", "bis", "nim", "lnlend"]]
df = data[["npl", "nom_gdp","kospi", "cpi", "avgex","interest", "m2", "nim"]]
nobs = 10
df_train, df_test = df[0:-nobs], df[-nobs:]
df_differenced = df_train.diff().dropna()
model = VAR(df_differenced)
model_fitted = model.fit(4)
model_fitted.summary()
# forecast_input was not defined in the original post; use the last k_ar differenced observations
forecast_input = df_differenced.values[-model_fitted.k_ar:]
fc = model_fitted.forecast(y=forecast_input, steps=nobs)
df_forecast = pd.DataFrame(fc, index=df.index[-nobs:], columns=df.columns + '_1q')
df_forecast
def invert_transformation(df_train, df_forecast, second_diff=False):
    """Revert back the differencing to get the forecast to original scale."""
    df_fc = df_forecast.copy()
    columns = df_train.columns
    for col in columns:
        df_fc[str(col)+'_forecast'] = df_train[col].iloc[-1] + df_fc[str(col)+'_1q'].cumsum()
    return df_fc
df_results = invert_transformation(df_train, df_forecast, second_diff=True)
df_results.loc[:, ['npl_forecast', 'nom_gdp_forecast', 'kospi_forecast', 'cpi_forecast',
                   'avgex_forecast', 'interest_forecast', 'm2_forecast', 'nim_forecast']]
This code gives me numbers without errors, but since multiple observations exist in the same period, those numbers are not correct. Is there any way I can produce the forecast properly? I want a forecast per bank as well as a total forecast, plus forecasts more than one period ahead, e.g. the next four quarters.
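One possible workaround for the "many banks per date" problem, sketched under the assumption that last.csv has a column identifying the bank (called 'bank' here, which is not shown in the question): fit a separate VAR per bank with groupby, forecast several quarters ahead, and undo the differencing per bank. The per-bank forecasts can then be aggregated into a total however you need.
import pandas as pd
from statsmodels.tsa.api import VAR

data = pd.read_csv('last.csv', parse_dates=['quarter'], index_col='quarter')
cols = ['npl', 'nom_gdp', 'kospi', 'cpi', 'avgex', 'interest', 'm2', 'nim']
steps = 4  # forecast the next four quarters

forecasts = {}
for bank, grp in data.groupby('bank'):  # 'bank' column is an assumption
    levels = grp[cols].sort_index()
    diffed = levels.diff().dropna()
    fitted = VAR(diffed).fit(4)
    fc = fitted.forecast(diffed.values[-fitted.k_ar:], steps=steps)
    # cumulatively sum the forecast differences and add back the last observed level
    idx = pd.date_range(levels.index[-1], periods=steps + 1, freq='Q')[1:]
    forecasts[bank] = pd.DataFrame(fc, index=idx, columns=cols).cumsum() + levels.iloc[-1]

print(forecasts[list(forecasts)[0]])  # forecast for one bank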

Forecasting a time series with multiple seasonalities using auto_arima (SARIMAX) and Fourier terms

I am trying to forecast a time series in Python by using auto_arima and adding Fourier terms as exogenous features. The data come from kaggle's Store item demand forecasting challenge. It consists of a long format time series for 10 stores and 50 items resulting in 500 time series stacked on top of each other. The specificity of this time series is that it has daily data with weekly and annual seasonalities.
In order to capture these two levels of seasonality I first used TBATS as recommended by Rob J Hyndman in Forecasting with daily data which worked pretty well actually.
I also followed this medium article posted by the creator of TBATS python library who compared it with SARIMAX + Fourier terms (also recommended by Hyndman).
But now, when I tried to use the second approach with pmdarima's auto_arima and Fourier terms as exogenous features, I get unexpected results.
In the following code, I only used the train.csv file that I split into train and test data (last year used for forecasting) and set the maximum order of Fourier terms K = 2.
My problem is that I obtain a smoothed forecast (see the image below) that does not seem to capture the weekly seasonality, which is different from the result at the end of that article.
Is there something wrong with my code?
Complete code :
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Load the data, which consists of a long-format time series of multiple TS stacked on top of each other
# There are 10 (stores) * 50 (items) = 500 time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales
# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 2)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])
# Split the time series as well as exogenous features data into train and test splits
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test = y_prime.iloc[(len(y_prime)-365):] # last year for testing
exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]
# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, exogenous=exog_to_train, seasonal=True, m=7)
# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast , index = pd.date_range(start='2017-01-01', end= '2017-12-31'))
# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()
Thanks in advance for your answers!
Here's the answer in case someone's interested.
Thanks again Flavia Giammarino.
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
from pmdarima import auto_arima
import matplotlib.pyplot as plt
# Load the data, which consists of a long-format time series of multiple TS stacked on top of each other
# There are 10 (stores) * 50 (items) time series
train_data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Select only one time series for store 1 and item 1 for the purpose of the example
train_data = train_data.query('store == 1 and item == 1').sales
# Prepare the fourier terms to add as exogenous features to auto_arima
# Annual seasonality covered by fourier terms
four_terms = FourierFeaturizer(365.25, 1)
y_prime, exog = four_terms.fit_transform(train_data)
exog['date'] = y_prime.index # is exactly the same as manual calculation in the above cells
exog = exog.set_index(exog['date'])
exog.index.freq = 'D'
exog = exog.drop(columns=['date'])
# Split the time series as well as exogenous features data into train and test splits
y_to_train = y_prime.iloc[:(len(y_prime)-365)]
y_to_test = y_prime.iloc[(len(y_prime)-365):] # last year for testing
exog_to_train = exog.iloc[:(len(exog)-365)]
exog_to_test = exog.iloc[(len(exog)-365):]
# Fit model
# Weekly seasonality covered by SARIMAX
arima_exog_model = auto_arima(y=y_to_train, D=1, exogenous=exog_to_train, seasonal=True, m=7)
# Forecast
y_arima_exog_forecast = arima_exog_model.predict(n_periods=365, exogenous=exog_to_test)
y_arima_exog_forecast = pd.DataFrame(y_arima_exog_forecast , index = pd.date_range(start='2017-01-01', end= '2017-12-31'))
# Plots
plt.plot(y_to_test, label='Actual data')
plt.plot(y_arima_exog_forecast, label='Forecast')
plt.legend()
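The two changes relative to the original attempt are the single Fourier pair (365.25, 1) for the annual cycle and D=1, which forces one order of seasonal differencing at m=7 so that the SARIMAX part actually picks up the weekly pattern. As a quick sanity check (my addition, not part of the original answer), you can inspect which orders auto_arima ended up selecting:
# print the non-seasonal (p,d,q) and seasonal (P,D,Q,m) orders chosen by auto_arima;
# with D=1 the seasonal tuple should report D=1 and m=7
print(arima_exog_model.order, arima_exog_model.seasonal_order)
print(arima_exog_model.summary())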

Any workaround to do forward forecasting for a time series in Python?

I want to make a forward forecast for a monthly time series of air pollution data, e.g. an estimate of the air pollution index 3 to 6 months ahead. I tried scikit-learn models for forecasting, and fitting the data to the model works fine. But what I want is a forward-period estimate, i.e. what the air pollution index will be 6 months from now. In my current attempt I was able to train the model with scikit-learn, but I don't know how the forward forecasting can be done in Python. To make a forward-period estimate, what should I do? Can anyone suggest a possible workaround? Any idea?
my attempt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import BayesianRidge
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url, parse_dates=['dates'])
df.drop(columns=['Unnamed: 0'], inplace=True)
resultsDict={}
predictionsDict={}
split_date ='2017-12-01'
df_training = df.loc[df.index <= split_date]
df_test = df.loc[df.index > split_date]
df_tr = df_training.drop(['pollution_index'],axis=1)
df_te = df_test.drop(['pollution_index'],axis=1)
scaler = StandardScaler()
scaler.fit(df_tr)
X_train = scaler.transform(df_tr)
y_train = df_training['pollution_index']
X_test = scaler.transform(df_te)
y_test = df_test['pollution_index']
X_train_df = pd.DataFrame(X_train,columns=df_tr.columns)
X_test_df = pd.DataFrame(X_test,columns=df_te.columns)
reg = BayesianRidge()
reg.fit(X_train, y_train)
yhat = reg.predict(X_test)
resultsDict['BayesianRidge'] = accuracy_score(df_test['pollution_index'], yhat)
new update 2
This is my attempt using an ARMA model:
from statsmodels.tsa.arima_model import ARMA
from tqdm import tqdm
index = len(df_training)
yhat = list()
for t in tqdm(range(len(df_test['pollution_index']))):
    temp_train = df[:len(df_training)+t]
    model = ARMA(temp_train['pollution_index'], order=(1, 1))
    model_fit = model.fit(disp=False)
    predictions = model_fit.predict(start=len(temp_train), end=len(temp_train), dynamic=False)
    yhat = yhat + [predictions]
yhat = pd.concat(yhat)
resultsDict['ARMA'] = evaluate(df_test['pollution_index'], yhat.values)
But this does not give me a forward forecast of my time series. What I want are estimated values of pollution_index 3 to 6 months ahead. Can anyone suggest a possible workaround, or a way to overcome the limitations of my current attempt? Any thoughts?
update: goal
To clarify: I am not asking which model or approach works best. What I am trying to figure out is how to make a reliable forward forecast for the given time series (pollution index), and how to correct my current attempt if it is not ready for forward-period estimation. Can anyone suggest a possible way to do this?
update-desired output
Here is a sketch of the desired forecasting plot that I want to make:
In order to obtain your desired output, I think you need to use a model that can return the standard deviation in the predicted value. Therefore, I adopt Gaussian process regression. From the code you provided in your post, I don't see how this is a time series forecasting task, so in my solution below, I also treat this task as a usual regression task.
First, prepare the data
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# drop the date column
df_training.drop(columns=['date'],axis=1,inplace=True)
df_test.drop(columns=['date'],axis=1,inplace=True)
y_train = df_training['pollution_index']
y_test = df_test['pollution_index']
df_training = df_training.drop(['pollution_index'], axis=1)
df_test = df_test.drop(['pollution_index'], axis=1)
scaler = StandardScaler()
scaler.fit(df_training)
X_train = scaler.transform(df_training)
X_test = scaler.transform(df_test)
X_train_df = pd.DataFrame(X_train,columns=df_training.columns)
X_test_df = pd.DataFrame(X_test,columns=df_test.columns)
With the dataframes prepared above, you can train a GaussianProcessRegressor and make predictions with:
gpr = GaussianProcessRegressor(normalize_y=True).fit(X_train_df,y_train)
pred,std = gpr.predict(X_test_df,return_std=True)
in which std is an array of standard deviations of the predicted values. Then you can plot the data with:
import numpy as np
from matplotlib import pyplot as plt
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 225
# plot the training data
ax.plot(y_train.index[plot_start:],y_train.values[plot_start:],'navy',marker='o',label='observed')
# plot the test data
ax.plot(y_test.index,y_test.values,'navy',marker='o')
ax.plot(y_test.index,pred,'darkgreen',marker='o',label='pred')
sigma = np.sqrt(std)
ax.fill(np.concatenate([y_test.index, y_test.index[::-1]]),
        np.concatenate([pred - 1.960*sigma, (pred + 1.960*sigma)[::-1]]),
        alpha=.5, fc='silver', ec='tomato', label='95% confidence interval')
ax.legend(loc='upper left',prop={'size':16})
the output plot looks like
UPDATE
I thought pollution_index was something that could be predicted from 'dew', 'temp', 'press', 'wnd_spd', and 'rain'. If you want one-step-ahead forecasting, here is what you can do:
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_model import ARIMA
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# extract the relevant info
train_date,train_polltidx = df_training['date'].values,df_training['pollution_index'].values
test_date,test_polltidx = df_test['date'].values,df_test['pollution_index'].values
# train an ARIMA model
model = ARIMA(train_polltidx,order=(1,1,1))
model_fit = model.fit(disp=0)
# you can predict as many steps as you want; here I only predict len(test_date) days
forecast,stderr,conf = model_fit.forecast(len(test_date))
# plot the result
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 225
# plot the training data
plt.plot(train_date[plot_start:],train_polltidx[plot_start:],'navy',marker='o',label='observed')
# plot the test data
plt.plot(test_date,test_polltidx,'navy',marker='o')
plt.plot(test_date,forecast,'darkgreen',marker='o',label='pred')
# ax.errorbar(np.arange(len(pred)),pred,std,fmt='r')
plt.fill(np.concatenate([test_date, test_date[::-1]]),
         np.concatenate((conf[:, 0], conf[:, 1][::-1])),
         alpha=.5, fc='silver', ec='tomato', label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.set_xlim([df_training['date'].values[plot_start],df_test['date'].values[-1]])
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
The output figure is
Clearly, the prediction is very bad, because I haven't done any preprocessing of the training data.
UPDATE 2
Since I'm not familiar with ARIMA, I implement one-step forecasting using GaussianProcessRegressor with the help of this wonderful post.
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.dates as mdates
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.preprocessing import StandardScaler
url = "https://gist.githubusercontent.com/jerry-shad/36912907ba8660e11cd27be0d3e30639/raw/424f0891dc46d96cd5f867f3d2697777ac984f68/pollution.csv"
df = pd.read_csv(url,parse_dates=['date'])
df.drop(columns=['Unnamed: 0'],axis=1,inplace=True)
# sort the dataframe by date and reset the index
df = df.sort_values(by='date').reset_index(drop=True)
# after sorting the dataframe, split the dataframe
split_date ='2017-12-01'
df_training = df.loc[(df.date <= split_date).values]
df_test = df.loc[(df.date > split_date).values]
# extract the relevant info
train_date,train_polltidx = df_training['date'].values,df_training['pollution_index'].values[:,None]
test_date,test_polltidx = df_test['date'].values,df_test['pollution_index'].values[:,None]
# preprocessing
scalar = StandardScaler()
scalar.fit(train_polltidx)
train_polltidx = scalar.transform(train_polltidx)
test_polltidx = scalar.transform(test_polltidx)
def series_to_supervised(data, n_in, n_out):
    df = pd.DataFrame(data)
    cols = list()
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    agg = pd.concat(cols, axis=1)
    agg.dropna(inplace=True)
    return agg.values
months_look_back = 1
# train
pollt_series = series_to_supervised(train_polltidx,months_look_back,1)
x_train,y_train = pollt_series[:,:months_look_back],pollt_series[:,-1]
# test
pollt_series = series_to_supervised(test_polltidx,months_look_back,1)
x_test,y_test = pollt_series[:,:months_look_back],pollt_series[:,-1]
print("The first %i months in the test set won't be predicted." % months_look_back)
def walk_forward_validation(x_train, y_train, x_test, y_test):
    predictions = []
    history_x = x_train.tolist()
    history_y = y_train.tolist()
    for rep, target in zip(x_test, y_test):
        # train model
        gpr = GaussianProcessRegressor(alpha=1e-4, normalize_y=False).fit(history_x, history_y)
        pred, std = gpr.predict([rep], return_std=True)
        predictions.append([pred, std])
        history_x.append(rep)
        history_y.append(target)
    return predictions
predictions = walk_forward_validation(x_train,y_train,x_test,y_test)
pred_test,pred_std = zip(*predictions)
# put back
pred_test = scalar.inverse_transform(pred_test)
pred_std = scalar.inverse_transform(pred_std)
train_polltidx = scalar.inverse_transform(train_polltidx)
test_polltidx = scalar.inverse_transform(test_polltidx)
# plot the result
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 100
# plot the training data
plt.plot(train_date[plot_start:],train_polltidx[plot_start:],'navy',marker='o',label='observed')
# plot the test data
plt.plot(test_date[months_look_back:],test_polltidx[months_look_back:],'navy',marker='o')
plt.plot(test_date[months_look_back:],pred_test,'darkgreen',marker='o',label='pred')
sigma = np.sqrt(pred_std)
ax.fill(np.concatenate([test_date[months_look_back:], test_date[months_look_back:][::-1]]),
        np.concatenate([pred_test - 1.960*sigma, (pred_test + 1.960*sigma)[::-1]]),
        alpha=.5, fc='silver', ec='tomato', label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.set_xlim([df_training['date'].values[plot_start],df_test['date'].values[-1]])
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
The idea of this script is to cast the time series forecasting task as a supervised regression task. plot_start is a parameter that controls from which point we want to plot; clearly plot_start cannot be greater than the length of the training data. The output figure of the script is:
As you can see, the first month in the test dataset is not predicted, because we need to look back one month to make a prediction.
In order to make predictions further into unseen data, based on this post on the CV site, you can train a new model using the predicted value from the last step. Here is how you can do it:
unseen_dates = pd.date_range(test_date[-1],periods=180,freq='D').values
all_data = series_to_supervised(df['pollution_index'].values, months_look_back, 1)  # the original used an undefined months_to_predict; 1 gives one-step targets
def predict_unseen(unseen_dates, all_data, days_look_back):
    predictions = []
    history_x = all_data[:, :days_look_back].tolist()
    history_y = all_data[:, -1].tolist()
    inds = np.arange(unseen_dates.shape[0])
    for ind in inds:
        # train model
        gpr = GaussianProcessRegressor(alpha=1e-2, normalize_y=False).fit(history_x, history_y)
        rep = np.array(history_y[-days_look_back:]).reshape(days_look_back, 1)
        pred, std = gpr.predict(rep, return_std=True)
        predictions.append([pred, std])
        history_x.append(history_y[-days_look_back:])
        history_y.append(pred)
    return predictions
predictions = predict_unseen(unseen_dates,all_data,days_look_back=1)
pred_test,pred_std = zip(*predictions)
fig,ax = plt.subplots(figsize=(12,8))
plot_start = 100
# plot the test data
plt.plot(unseen_dates,pred_test,'navy',marker='o')
sigma = np.sqrt(pred_std)
ax.fill(np.concatenate([unseen_dates, unseen_dates[::-1]]),
        np.concatenate([pred_test - 1.960*sigma, (pred_test + 1.960*sigma)[::-1]]),
        alpha=.5, fc='silver', ec='tomato', label='95% confidence interval')
plt.legend(loc='upper left',prop={'size':16})
ax = plt.gca()
ax.xaxis.set_major_locator(mdates.DayLocator(interval=7))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.gcf().autofmt_xdate()
plt.show()
One very important thing to note: the timestep of the real data is a month; using such data to make predictions about days may not be correct.
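A small adjustment in line with that note (my suggestion, not from the original answer): generate the unseen dates at a monthly frequency rather than daily, so that each predicted step corresponds to one month of data.
# month-start steps instead of daily ones, matching the monthly timestep of the data
unseen_dates = pd.date_range(test_date[-1], periods=6, freq='MS').values
predictions = predict_unseen(unseen_dates, all_data, days_look_back=1)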
The model you have built links what you are trying to model, 'pollution_index', to some input variables, in your case ['dew', 'temp', 'press', 'wnd_spd', 'rain']. So to predict pollution_index into the future using your model, at the high level, you need to estimate what these variables would be over the next 3-6 months, and then run your model on that. Practically, you need to come up with something that looks like X_test but has your projections for these variables for the future, and then call:
yhat = reg.predict(X_test)
... to produce the model estimate of where the pollution_index will be. Hope this makes sense. This gives you a "mechanical" ability to use your model for prediction.
For example, following up on your main example where reg is BayesianRidge() that you fit, we would do the following:
import sys
from io import StringIO
import pandas as pd
import matplotlib.pyplot as plt
# Here we load your predictions for input variables
# I stubbed it with some random data
df_predict_data = StringIO(
"""
date,dew,temp,press,wnd_spd,rain
2021-01-01,59,28,16,0.78,98.7
2021-02-01,68,32,18,0.79,46.1
2021-03-01,75,34,20,0.81,91.5
2021-04-01,63,31,16,0.83,19.1
2021-05-01,74,38,19,0.83,21.8
2021-06-01,65,32,17,0.85,35.4
""")
df_predict = pd.read_csv(df_predict_data, index_col = 'date')
# scale it using the same scaler you used in training
X_predict = scaler.transform(df_predict)
# predict pollution_index
y_predict = reg.predict(X_predict)
# plot it
plt.plot(df_predict.index, y_predict, '.-')
So we get this:
Whether the linear regression you built is a good model for such prediction is a completely different question. As @Sergey Bushmanov mentioned, there is a vast literature on forecasting and on which models are best for this or that, and this thread is probably not the right place to debate that aspect of your question.

How can I get a dataframe of frequency and time from a wav file in python?

I have been able to read a wave file using librosa, and I got a list of frequencies and amplitudes, but it is missing the most important part, time, and I don't know how to get it.
import librosa
from matplotlib import pyplot as plt
from librosa import display
import seaborn as sns
sns.set() # Use seaborn's default style to make attractive graphs
plt.rcParams['figure.dpi'] = 100 # Show nicely large images in this notebook
samples, sampling_rate = librosa.load('last.aac')
plt.figure(figsize=(13,4))
librosa.display.waveplot(y = samples, sr = sampling_rate)
plt.xlabel("Time (Seconds)")
plt.ylabel("Amplitude")
plt.title("Time Domain Frequency")
plt.show()
print("sampling rate is :", sampling_rate)
print("Number of samples :", len(samples))
max1 = max(samples*10)
print("Maximum amplitude", round(max1, 2))
min1 = min(samples*10)
print("Minimum amplitude", round(min1, 2))
import pandas as pd
d=pd.DataFrame(samples,columns=["Amplitudes"], index=None)
d.head()
Amplitudes    frequencies
  0.091039    2756.813965
 -0.040613    2427.044189
 -0.046945    2347.927002
 -0.136473    2323.757080
  0.050659    2317.414062
What I really want to extract is this:
Time    Frequency    Amplitude
0         0.00000    -57.72730
0        21.53320    -63.55554
0        43.06641    -85.05077
0        64.59961    -91.29989
0        86.13281    -83.86144
0       107.66602    -81.54240
If you have the sampling rate (in Hz), you can use that to create an index in seconds. This is an example using a 4-second audio file.
import librosa
import pandas as pd
y,sr = librosa.load('myfile.wav')
df = pd.DataFrame(y, columns=['Amplitude'])
df.index = [(1/sr)*i for i in range(len(df.index))]
print(df.head())
print(df.tail())
Output:
Amplitude
0.000000 -0.004855
0.000045 -0.007008
0.000091 -0.005919
0.000136 -0.006276
0.000181 -0.005805
Amplitude
3.999773 0.000892
3.999819 0.000765
3.999864 0.000543
3.999909 0.000428
3.999955 0.000216
The index is now in seconds.
Just to note: where you got the frequencies column from isn't shown in your example, and all the plotting code is not relevant to the issue, so you should trim some of that when posting questions on SO.
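If the goal is the Time / Frequency / Amplitude table shown in the question, one way to get it (a sketch, assuming a file named 'myfile.wav') is a short-time Fourier transform: librosa.stft gives an amplitude for every (frequency bin, time frame) pair, and librosa.fft_frequencies / librosa.frames_to_time give the matching frequency and time axes. Converting the magnitudes to decibels produces negative values like those in the desired output.
import librosa
import numpy as np
import pandas as pd

y, sr = librosa.load('myfile.wav')

n_fft, hop = 2048, 512
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))  # shape: (freq bins, frames)
freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)  # one frequency (Hz) per bin
times = librosa.frames_to_time(np.arange(S.shape[1]), sr=sr, hop_length=hop)  # seconds
amps_db = librosa.amplitude_to_db(S, ref=np.max)  # amplitudes in dB

# long-format table: one row per (time, frequency) pair
df = pd.DataFrame({
    'Time': np.repeat(times, len(freqs)),
    'Frequency': np.tile(freqs, len(times)),
    'Amplitude': amps_db.T.ravel(),
})
print(df.head())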

Clustering with mixed data type

Currently my data frame consists of both numerical and categorical values (mixed data types). My data frame looks like this:
id  age  txn_duration  Statename    amount  gender  religion
1   27   275           bihar        110     m       hindu
2   33   163           maharashtra  50      f       muslim
3   53   63            delhi        50      f       muslim
4   47   100           up           50      m       hindu
5   39   263           punjab       100     m       punjabi
6   41   303           delhi        50      m       punjabi
There are 20 states (Statename) and 7 religions. I applied get_dummies to both Statename and religion but got a lot of noise; I have also detected outliers. My questions are:
1. How do I find the optimum number of clusters for mixed data types?
2. In this case I am using the k-means algorithm. Can I use k-modes or another method that would improve my results? I am not getting good results with k-means. (A k-prototypes sketch appears at the end of this section.)
3. How do I interpret my cluster results? I have used
print(cluster_data[clmns].groupby(['clusters']).mean())
Is there any other way I can inspect or plot this? Please provide code.
My code is:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import numpy as np
#Importing libraries
import os
import matplotlib.pyplot as plt#visualization
from PIL import Image
%matplotlib inline
import seaborn as sns#visualization
import itertools
import warnings
warnings.filterwarnings("ignore")
import io
from scipy import stats
from sklearn.cluster import KMeans
from kmodes.kprototypes import KPrototypes
cluster_data = pd.read_csv("cluster.csv")
cluster_data = pd.get_dummies(cluster_data, columns=['StateName'])
cluster_data = pd.get_dummies(cluster_data, columns=['gender'])
cluster_data = pd.get_dummies(cluster_data, columns=['religion'])
clmns = ['mobile', 'age', 'txn_duration', 'amount', 'StateName_Bihar',
'StateName_Delhi', 'StateName_Gujarat', 'StateName_Karnataka',
'StateName_Maharashtra', 'StateName_Punjab', 'StateName_Rajasthan',
'StateName_Telangana', 'StateName_Uttar Pradesh',
'StateName_West Bengal', 'gender_female',
'gender_male', 'religion_buddhist',
'religion_christian', 'religion_hindu',
'religion_jain', 'religion_muslim',
'religion_other', 'religion_sikh']
df_tr_std = stats.zscore(cluster_data[clmns])
#Cluster the data
kmeans = KMeans(n_clusters=3, random_state=0).fit(df_tr_std)
labels = kmeans.labels_
# Glue back to original data
cluster_data['clusters'] = labels
clmns.extend(['clusters'])
#Lets analyze the clusters
print (cluster_data[clmns].groupby(['clusters']).mean())
You can run something like this code:
Look at the attached image: in that plot you can see that having more than 3 clusters (for the dataset it was run on) does not provide a significant decrease in distortion, so the optimum cluster number would be 3 in that case (simple synthetic data). For noisy data the decision might be harder.
Reference: A. Mueller's scipy notes on sklearn
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(X)  # X is your scaled numeric feature matrix, e.g. df_tr_std from above
    distortions.append(km.inertia_)
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()
Edit for ValueError:
For the ValueError: you need only numerical columns, so you can do something like this:
df_numerics = df.drop(['Statename', 'gender', 'religion'], axis=1)
You can also drop other columns that you don't want included in the clustering analysis.
With df_numerics, try the elbow method to find a good cluster number.
Then, let's say you found out that 3 clusters was good, you can run:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=42)
labels = kmeans.fit_predict(X)
labels contains the cluster number (0, 1 or 2 for 3 clusters) for each row in your dataframe. You can also save this as a column in your dataframe:
df['cluster_labels'] = labels
Then, to visualize it, you can pick 2 columns (more than that is difficult to visualize). Let's say you picked 'txn_duration' and 'amount'; you can plot those columns and add the cluster labels as the color, like this:
import matplotlib.pyplot as plt
plt.scatter(df['txn_duration'],df['amount'], c=df['cluster_labels'])
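Regarding the k-modes part of question 2: with mixed data you can skip the dummy encoding entirely and use k-prototypes from the kmodes package that your code already imports. A minimal sketch; the column names follow the sample data frame above and may need adjusting to your file.
import pandas as pd
from kmodes.kprototypes import KPrototypes

cluster_data = pd.read_csv("cluster.csv")

# numeric columns first, categorical columns last (no get_dummies needed)
cols = ['age', 'txn_duration', 'amount', 'Statename', 'gender', 'religion']
X = cluster_data[cols].to_numpy()
categorical_idx = [3, 4, 5]  # positions of the categorical columns in cols

kproto = KPrototypes(n_clusters=3, init='Cao', random_state=42)
cluster_data['clusters'] = kproto.fit_predict(X, categorical=categorical_idx)

# interpret the clusters on the numeric columns, as in the original groupby
print(cluster_data.groupby('clusters')[['age', 'txn_duration', 'amount']].mean())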
