Summing time series after k-means clustering - python

I am trying out different variations of K in K-means clustering on a set with time series data.
For each experiment I want to sum up the time series for each cluster label and perform predictions on them.
So for example:
If I cluster the time series into 3 clusters, I want to sum all the time series (column-wise) belonging to cluster 1, all the time series belonging to cluster 2, and the same for cluster 3. After that I will make predictions on each aggregated time-series cluster, but I do not need help on the prediction part.
I was thinking of adding the cluster labels to the original dataframe and then using .loc and a loop to extract the time series belonging to the same cluster. But I am wondering if there is a more efficient way?
import pandas as pd
from datetime import datetime
import numpy as np
from sklearn.cluster import KMeans
#create dataframe with time series
date_rng = pd.date_range(start='1/1/2018', end='1/08/2018', freq='H')
df = pd.DataFrame(date_rng, columns=['date'])
for i in range(20):
    df['ts' + str(i)] = np.random.randint(0, 100, size=(len(date_rng)))
df_pivot = df.pivot_table(columns='date', values=df.columns)
#cluster
K = range(1, 10, 2)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_pivot)
    print(km.labels_)
    # sum/aggregate all ts in each cluster column-wise
    # forecast next step for each cluster (don't need help with this part)

You can access data points for every cluster and then sum their values.
Something like this:
labels = km.labels_
centroids = km.cluster_centers_
cluster_sums_dict = {}  # cluster number: column-wise sum of its time series
for i in range(k):
    # select the rows (time series) assigned to cluster i
    temp_cluster = df_pivot[labels == i]
    # sum them column-wise to get one aggregated series per cluster
    cluster_sums_dict[i] = temp_cluster.sum(axis=0)
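A more concise alternative, if you prefer to avoid the loop, is to let pandas do the grouping; a minimal sketch, assuming df_pivot and km from the question (groupby accepts a label array aligned with the rows):
# group the rows (time series) by their cluster label and sum column-wise;
# the result has one aggregated time series per cluster
cluster_sums = df_pivot.groupby(km.labels_).sum()
print(cluster_sums)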
Also, on a side note: instead of aggregating the values in each cluster, could you use the centroid of each cluster for prediction?
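If you do go that route, the centroids can be put back into a dataframe aligned with the original timestamps; a small sketch, again assuming df_pivot and km from above:
# one row per cluster; the columns are the original timestamps
centroid_series = pd.DataFrame(km.cluster_centers_, columns=df_pivot.columns)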

Related

Create clusters depending on scores performance

I have data from students who took a test that has 2 sections: the first section tests their digital skills at level 2, and the second section tests their digital skills at level 3. I need to come up with 3 clusters of students, depending on their scores, to place them in 3 different skill levels (1, 2 and 3). Code sample below:
import pandas as pd
# initialize data of lists
data = {'Name': ['Marc', 'Fay', 'Emile', 'bastian', 'Karine', 'kathia', 'John', 'moni'],
        'Scores_section1': [12, 24, 14, 20, 8, 10, 5, 23],
        'Scores_section2': [20, 4, 1, 0, 18, 9, 12, 10],
        'Sum_all_scores': [32, 28, 15, 20, 26, 19, 17, 33]}
# create DataFrame
df = pd.DataFrame(data)
# print dataframe
df
I thought about using k-means clustering, but following a tutorial online, I'd need to use x, y coordinates. Should I use Scores_section1 as x and Scores_section2 as y, or vice versa, and why?
Many thanks in advance for your help!
Try it this way. Note that k-means is not limited to two features, so there is no need to pick one section as x and the other as y; the x/y framing in tutorials is just for 2-D visualization. You can cluster on all the score columns at once:
import pandas as pd
from sklearn.cluster import KMeans

# initialize data of lists
data = {'Name': ['Marc', 'Fay', 'Emile', 'bastian', 'Karine', 'kathia', 'John', 'moni'],
        'Scores_section1': [12, 24, 14, 20, 8, 10, 5, 23],
        'Scores_section2': [20, 4, 1, 0, 18, 9, 12, 10],
        'Sum_all_scores': [32, 28, 15, 20, 26, 19, 17, 33]}
# create DataFrame
df = pd.DataFrame(data)

# initialize the estimator
kmeans = KMeans(n_clusters=3)
# predict the cluster labels
df = df[['Scores_section1', 'Scores_section2', 'Sum_all_scores']]
label = kmeans.fit_predict(df)
label
df['kmeans'] = label
df
# K-Means Clustering may be the most widely known clustering algorithm and involves assigning examples to
# clusters in an effort to minimize the variance within each cluster.
# The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets
# on the basis of a sample. The process, which is called ‘k-means,’ appears to give partitions which are reasonably
# efficient in the sense of within-class variance.
# plot X & Y coordinates and color by cluster number
import plotly.express as px
fig = px.scatter(df, x="Scores_section1", y="Scores_section2", color="kmeans", size='Sum_all_scores', hover_data=['kmeans'])
fig.show()
Feel free to modify the code to suit your needs.
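One refinement worth considering: k-means is distance-based, so when the features sit on different scales it usually helps to standardize them before clustering. A minimal sketch using sklearn's StandardScaler, assuming the df built above:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# standardize the score columns so no single feature dominates the distance
scaler = StandardScaler()
scaled = scaler.fit_transform(df[['Scores_section1', 'Scores_section2', 'Sum_all_scores']])

# cluster on the scaled features instead of the raw scores
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df['kmeans_scaled'] = kmeans.fit_predict(scaled)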

Fourier Result on Time Series explained python

I have passed my time series data, which is essentially pressure measurements from a sensor, through a Fourier transform, similar to what is described in https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101.
The file used can be found here:
https://docs.google.com/spreadsheets/d/1MLETSU5Trl5gLGO6pv32rxBsR8xZNkbK/edit?usp=sharing&ouid=110574180158524908052&rtpof=true&sd=true
The related code is this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

file = 'test.xlsx'
df = pd.read_excel(file, header=0)
df.head()
df.tail()
# drop ID: keep only the timestamp and the two signal columns
df = df[['JSON_TIMESTAMP',
         'ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB',
         'ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_ADH_COATWEIGHT_SP']]
# extract year, month and day from the timestamp string
df["year"] = df["JSON_TIMESTAMP"].str[:4]
df["month"] = df["JSON_TIMESTAMP"].str[5:7]
df["day"] = df["JSON_TIMESTAMP"].str[8:10]
df = df.sort_values(['year', 'month', 'day'], ascending=[True, True, True])
df['JSON_TIMESTAMP'] = df['JSON_TIMESTAMP'].astype('datetime64[ns]')
# sort_values is not in-place, so the result must be assigned back
df = df.sort_values(by='JSON_TIMESTAMP', ascending=True)

df1 = df.copy()
df1 = df1.set_index('JSON_TIMESTAMP')
df1 = df1[["ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB"]]

plt.rcParams["figure.figsize"] = (25, 8)
df1.plot()
plt.show()
df1.hist(bins=20)
# https://towardsdatascience.com/fourier-transform-for-time-series-292eb887b101
# convert into x and y
x = list(range(len(df1.index)))
y = df1['ADH_DEL_CURTAIN_DELIVERY~ADH_DEL_AVERAGE_ADH_WEIGHT_FB']
# apply the fast Fourier transform (keep the complex output; the power
# spectrum below needs both the real and imaginary parts)
f = np.fft.fft(df1)
# get the list of frequencies (cycles per sample)
num = np.size(x)
freq = [i / num for i in list(range(num))]
# get the power spectrum, normalized by its zero-frequency component
spectrum = f.real * f.real + f.imag * f.imag
nspectrum = spectrum / spectrum[0]
# plot nspectrum per frequency, with a semilog scale on nspectrum
plt.semilogy(freq, nspectrum)
freq = np.array(freq)
nspectrum = nspectrum.flatten()
# improve the plot by showing periods in number of days rather than frequency
results = pd.DataFrame({'freq': freq, 'nspectrum': nspectrum})
results['period'] = results['freq'] / (1 / 365)
plt.semilogy(results['period'], results['nspectrum'])
# improve the plot by rounding the periods to whole days to avoid spurious peaks
results['period_round'] = results['period'].round()
grouped_day = results.groupby('period_round')['nspectrum'].sum()
plt.semilogy(grouped_day.index, grouped_day)
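To read the strongest periodicity off the grouped spectrum programmatically rather than by eye, something like this should work (a sketch assuming grouped_day from above; the zero-period bin is dropped because it only carries the signal's mean):
# drop the zero-period (DC) bin, then take the period with the largest power
dominant = grouped_day.drop(index=0.0, errors='ignore').idxmax()
print(f'dominant period: {dominant} days')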
My end result is this plot:
[Plot: result of the Fourier transformation of the data]
My question is: what does this eventually show for our data, and intuitively, what does the spike in the last section mean? What can I do with such a result?
Thanks in advance all!

How can I drop low-correlated features

I am writing preprocessing code for my LSTM training. My csv contains more than 30 variables. After applying some EDA techniques, I found that half of the features can be dropped without any effect on training.
Right now I am dropping such features manually using pandas.
I want code that can drop such features automatically.
I wrote code to visualize the heat map and the correlations this way:
# I am making a class, so this part is from preprocessing.
# self.data is a DataFrame which contains all the csv data
def calculateCorrelationByPearson(self):
    columns = self.data.columns
    plt.figure(figsize=(12, 8))
    sns.heatmap(data=self.data.corr(method='pearson'), annot=True, fmt='.2f',
                linewidths=0.5, cmap='Blues')
    plt.show()
    for column in columns:
        # correlate each column with the 'total' column
        corr = stats.spearmanr(self.data['total'], self.data[column])
        print(f'{column} - corr coefficient:{corr[0]}, p-value:{corr[1]}')
This gives me a perfect view of my features and their relationships with each other.
Now I want to drop the columns which are not important, say those with a correlation of less than 0.4.
How can I apply this logic in my code?
Here is an approach to remove variables with a correlation coefficient below some threshold:
import pandas as pd
from scipy.stats import spearmanr

data = pd.DataFrame([{"A": 1, "B": 2, "C": 3}, {"A": 2, "B": 3, "C": 1},
                     {"A": 3, "B": 4, "C": 0}, {"A": 4, "B": 4, "C": 1},
                     {"A": 5, "B": 6, "C": 2}])
targetVar = "A"  # assumed to be the first column (see the [:, 0] slice below)
corr_threshold = 0.4

corr = spearmanr(data)
corrSeries = pd.Series(corr[0][:, 0], index=data.columns)  # column names and their correlation coefficients with targetVar
corrSeries = corrSeries[(corrSeries.index != targetVar) & (corrSeries > corr_threshold)]  # apply the threshold
vars_to_keep = list(corrSeries.index.values)  # list of variables to keep
vars_to_keep.append(targetVar)  # add the target variable back in
data2 = data[vars_to_keep]
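The same selection can also be written with pandas alone, which additionally catches strong negative correlations via abs(); a sketch under the same assumptions (data, targetVar and corr_threshold as above):
# Spearman correlation of every column with the target, via pandas
corr_with_target = data.corr(method='spearman')[targetVar]
# keep columns whose |correlation| clears the threshold; the target keeps itself,
# since its self-correlation is 1.0
keep = corr_with_target[corr_with_target.abs() > corr_threshold].index
data2 = data[keep]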

Selecting specific features based on correlation values

I am using the Housing train.csv data from Kaggle to run a prediction.
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv
I am trying to generate a correlation and only keep the features whose correlation with SalePrice is between 0.5 and 0.9. I tried to use the function below to filter them, but it only removes the correlation values above 0.9.
How would I update this function to keep only the specific features I need and generate a correlation heat map?
data = train
corr = data.corr()
columns = np.full((corr.shape[0],), True, dtype=bool)
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if corr.iloc[i, j] >= 0.9:
            if columns[j]:
                columns[j] = False
selected_columns = data.columns[columns]
data = data[selected_columns]
import pandas as pd

data = pd.read_csv('train.csv')
col = data.columns
# keep only the numeric columns (dropping columns with dtype == object)
c = [i for i in col if data[i].dtypes == 'int64' or data[i].dtypes == 'float64']
main_col = ['SalePrice']  # column against which we compare the correlations
corr_saleprice = data.corr().filter(main_col).drop(main_col)
c1 = (corr_saleprice['SalePrice'] >= 0.5) & (corr_saleprice['SalePrice'] <= 0.9)
c2 = (corr_saleprice['SalePrice'] >= -0.9) & (corr_saleprice['SalePrice'] <= -0.5)
req_index = list(corr_saleprice[c1 | c2].index)  # select columns meeting the criteria
# req_index.append('SalePrice')  # uncomment if you want the SalePrice column in your final dataframe too
data = data[req_index]
data
Also, using for loops is not very efficient; a direct implementation like this is preferable. I hope this is what you want!
For generating the heat map, you can use the following code:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

a = data.corr()
# mask the upper triangle so each pair appears only once
mask = np.triu(np.ones_like(a, dtype=bool))  # np.bool is deprecated; use the builtin bool
plt.figure(figsize=(10, 10))
_ = sns.heatmap(a, cmap=sns.diverging_palette(250, 20, n=250), square=True,
                mask=mask, annot=True, center=0.5)
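As a side note, the two boolean conditions c1 and c2 above can be collapsed into one using abs() and Series.between(); a small sketch, assuming corr_saleprice from the answer:
# |corr| between 0.5 and 0.9 covers both the positive and the negative band
band = corr_saleprice['SalePrice'].abs().between(0.5, 0.9)
req_index = list(corr_saleprice[band].index)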

How do I obtain the individual centroids of k-means clusters using nltk (python)

I have used nltk to perform k-means clustering, as I would like to change the distance metric to cosine distance. However, how do I obtain the centroids of all the clusters?
kclusterer = KMeansClusterer(8, distance = nltk.cluster.util.cosine_distance, repeats = 1)
predict = kclusterer.cluster(features, assign_clusters = True)
centroids = kclusterer._centroid
df_clustering['cluster'] = predict
#df_clustering['centroid'] = centroids[df_clustering['cluster'] - 1].tolist()
df_clustering['centroid'] = centroids
I am trying to perform k-means clustering on a pandas dataframe, and would like the coordinates of the centroid of each data point's cluster to be in the dataframe column 'centroid'.
Thank you in advance!
import pandas as pd
import numpy as np
import nltk
from nltk.cluster import KMeansClusterer

# create a dummy dataframe with 3 features
df = pd.DataFrame([[1, 2, 3], [50, 51, 52], [2.0, 6.0, 8.5], [50.11, 53.78, 52]],
                  columns=['feature1', 'feature2', 'feature3'])
print(df)

obj = KMeansClusterer(2, distance=nltk.cluster.util.cosine_distance)  # number of clusters = 2
vectors = [np.array(f) for f in df.values]
df['predicted_cluster'] = obj.cluster(vectors, assign_clusters=True)
print(obj.means())
# output:
# [array([50.055, 52.39 , 52. ]), array([1.5 , 4. , 5.75])]
# i.e. the mean of the three features for each of the 2 clusters
# now, if you want the cluster center in a pandas dataframe column:
df['centroid'] = df['predicted_cluster'].apply(lambda x: obj.means()[x])
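If you ever need the same idea without nltk: for L2-normalized vectors, Euclidean distance is a monotonic function of cosine distance, so a common workaround is to normalize the rows and run sklearn's KMeans. A sketch under that assumption (note that the centroids it returns are not re-normalized, so this only approximates spherical k-means):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# L2-normalize the rows so Euclidean distance tracks cosine distance
X = normalize(df[['feature1', 'feature2', 'feature3']].values)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
df['sk_cluster'] = km.labels_
centroids = km.cluster_centers_  # not unit-length; re-normalize if needed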
