I have a pandas dataframe with 27 columns of electricity consumption data. The first column holds the date and time over a two-year period, and the other 26 columns hold hourly electricity consumption readings for 26 houses over those two years. I am clustering the data with k-means. Whenever I try to plot the date on the x-axis and the electricity consumption on the y-axis, I get an error saying that x and y must be the same size. I tried reshaping the data, but that did not solve the problem.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import datetime
data_consumption2 = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
data_consumption2['Timestamp'] = pd.to_datetime(data_consumption2['Timestamp'], unit='s')
X=data_consumption2.iloc[: , 1:26].values
X=np.nan_to_num(X)
np.concatenate(X)
date=data_consumption2.iloc[: , 0].values
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
C = kmeans.cluster_centers_
plt.scatter(X, R , s=40, c= kmeans.labels_.astype(float), alpha=0.7)
plt.scatter(C[:,0] , C[:,1] , marker='*' , c='r', s=100)
I always get the same error message: x and y must be the same size, try to reshape your data. Reshaping did not work, because the date column always has fewer elements than the rest of the columns combined.
I think what you are essentially doing is time series clustering of all households, to find similar electricity usage patterns over time.
For that, each timestamp becomes a 'feature', while each household's usage becomes a data row. This makes it easier to apply sklearn clustering methods, which typically take the form method.fit(x), where x holds the features (pass the data as a 2D array with the shape (rows, columns)). So your data needs to be transposed.
The refactored code is as follows:
# what you have done
import pandas as pd
df = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')
# this is to fill all the NaN values with 0
df.fillna(0,inplace=True)
# transpose the dataframe accordingly
df = df.set_index('Timestamp').transpose()
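# format the timestamps as strings ('%D' is not supported by strftime on every platform, so it is spelled out)
df.rename(columns=lambda x : x.strftime('%m/%d/%y %H:%M:%S'), inplace=True)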
df.reset_index(inplace=True)
df.rename(columns={'index':'house_no'}, inplace=True)
df.columns.rename(None, inplace=True)
df.head()
You should then see something like this (don't mind the data shown; I created some dummy data similar to yours).
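Since the screenshot is not reproduced here, the following is a minimal sketch of the same transposition on synthetic data. The house names (`house_1` to `house_3`), the 48-hour range, and the random values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Excel sheet: 48 hourly rows, 3 houses.
rng = np.random.default_rng(0)
timestamps = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
raw = pd.DataFrame(rng.random((48, 3)), columns=["house_1", "house_2", "house_3"])
raw.insert(0, "Timestamp", timestamps)

# One row per house, one column per timestamp.
wide = raw.set_index("Timestamp").transpose()
wide.columns = wide.columns.strftime("%m/%d/%y %H:%M:%S")
wide = wide.rename_axis("house_no").reset_index()
wide.columns.name = None

print(raw.shape)   # (48, 4): rows are timestamps
print(wide.shape)  # (3, 49): rows are houses, plus the house_no column
```

Every column after `house_no` is now one feature, which is the layout `KMeans.fit` expects.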
Next, for clustering, this is what you can do:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(df.iloc[:,1:])
y_kmeans = kmeans.predict(df.iloc[:,1:])
C = kmeans.cluster_centers_
# add a new column to your dataframe that contains the predicted clusters
df['cluster'] = y_kmeans
Finally, for plotting, you can produce the scatter chart you wanted using the code below:
import matplotlib.pyplot as plt
color = ['red','green','blue']
plt.figure(figsize=(16,4))
for index, row in df.iterrows():
    plt.scatter(x=row.index[1:-1], y=row.iloc[1:-1], c=color[row.iloc[-1]], marker='x', alpha=0.7, s=40)
for index, cluster_center in enumerate(kmeans.cluster_centers_):
    plt.scatter(x=df.columns[1:-1], y=cluster_center, c=color[index], marker='o', s=100)
plt.xticks(rotation='vertical')
plt.ylabel('Electricity Consumption')
plt.title(f'All Clusters - Scatter', fontsize=20)
plt.show()
But I would recommend line plots for the individual clusters, which I find more visually appealing:
plt.figure(figsize=(16,16))
for cluster_index in [0,1,2]:
    plt.subplot(3,1,cluster_index + 1)
    for index, row in df.iterrows():
        if row.iloc[-1] == cluster_index:
            plt.plot(row.iloc[1:-1], c=color[row.iloc[-1]], linestyle='--', marker='x', alpha=0.5)
    plt.plot(kmeans.cluster_centers_[cluster_index], c = color[cluster_index], marker='o', alpha=1)
    plt.xticks(rotation='vertical')
    plt.ylabel('Electricity Consumption')
    plt.title(f'Cluster {cluster_index}', fontsize=20)
plt.tight_layout()
plt.show()
Cheers!
Related
I have a dataframe with two columns, dates and counts, both of integer type. I am reading a CSV file with pandas and using the code below to create a line graph from it:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
colnames=['records', 'dates']
df = pd.read_csv(".../counts.csv",names=colnames, header=None )
x = df['dates'].to_list()
y = df['records'].to_list()
plt.figure(figsize=(20,5))
plt.plot(x, y)
plt.xticks(np.arange(20220924, 20221025, 1), rotation=45, fontsize = 'x-small')
plt.yticks(np.arange(77980072923, 27182138761, 1000000))
plt.xlabel('dates')
plt.ylabel('records')
The sample input data is shown in the link: input data.
However, the resulting graph does not have proper dates on the x ticks or counts on the y ticks (image: line graph with bad ticks and good plot).
I have tried converting the dtypes to string like the code below
df = df.astype(str)
plt.figure(figsize=(20,5))
# time series plot for multiple columns
sb.lineplot(x="dates", y="records", data=df)
plt.xticks(rotation=45, fontsize = 'xx-small')
# set label
plt.ylabel("Record Count")
plt.show()
With that I get the expected ticks on both x and y, but it messes up the line graph, as shown (image: line graph with exact ticks and bad plot).
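One possible way to keep both a correct line and readable ticks is to parse the integer dates into real datetimes rather than strings. The sketch below assumes the `dates` column holds integers in YYYYMMDD form (as the `np.arange(20220924, 20221025, 1)` call suggests); the sample values are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample in the assumed layout: counts plus YYYYMMDD integer dates.
df = pd.DataFrame({
    "records": [77980072923, 60125000000, 45210000000, 27182138761],
    "dates": [20220924, 20221004, 20221014, 20221024],
})

# Parse the integers as calendar dates so matplotlib spaces them correctly.
df["dates"] = pd.to_datetime(df["dates"].astype(str), format="%Y%m%d")

fig, ax = plt.subplots(figsize=(20, 5))
ax.plot(df["dates"], df["records"])  # y stays numeric, so the line keeps its shape
ax.set_xlabel("dates")
ax.set_ylabel("records")
fig.autofmt_xdate(rotation=45)  # rotate the automatically chosen date ticks
```

Calling `plt.show()` (or `fig.savefig(...)`) afterwards displays the figure. Leaving tick placement to matplotlib avoids hand-built `np.arange` ranges over integers that are not real dates.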
I'm getting something weird with the legend in a seaborn jointplot. I want to plot some quantity y as function of a quantity x for 8 different datasets. These datasets have only two columns for x and y and a different number of rows. First of all I concatenate all rows of all datasets using numpy
y = np.concatenate(((data1[:,1]), (data2[:,1]), (data3[:,1]), (data4[:,1]),(data5[:,1]), (data6[:,1]), (data7[:,1]), (data8[:,1])), axis=0)
x = np.concatenate(((data1[:,0]), (data2[:,0]), (data3[:,0]), (data4[:,0]), (data5[:,0]), (data6[:,0]), (data7[:,0]), (data8[:,0])), axis=0)
Then I create the array of values that I will use for the "hue" parameter of the jointplot, which distinguishes the datasets in the legend and colors. I do this by assigning to every dataset a number from 1 to 8, which is repeated for every row of the cumulative dataset:
indexes = np.concatenate((np.ones(len(data1[:,0])), 2*np.ones(len(data2[:,0])), 3*np.ones(len(data3[:,0])), 4*np.ones(len(data4[:,0])), 5*np.ones(len(data5[:,0])), 6*np.ones(len(data6[:,0])), 7*np.ones(len(data7[:,0])), 8*np.ones(len(data8[:,0]))), axis=0)
Then I create the dataset:
all_together = np.column_stack((x, y, indexes))
df = pd.DataFrame(all_together, columns = ['x','y','Dataset'])
So now I can create the jointplot. This is simply done by:
g = sns.jointplot(y="y", x="x", data=df, hue="Dataset", palette='turbo')
handles, labels = g.ax_joint.get_legend_handles_labels()
g.ax_joint.legend(handles=handles, labels=['data1', 'data2', 'data3', 'data4', 'data5', 'data6', 'data7', 'data8'], fontsize=10)
At this point, the problem is: all points are getting plotted (at least I think), but the legend only shows: data1, data2, data3, data4 and data5. I don't understand why it is not showing also the other three labels, and in this way the plot is difficult to read. I have checked and the cumulative dataset df has the correct shape. Any ideas?
You can add legend='full' to obtain a full legend. By default, sns.jointplot uses sns.scatterplot for the central plot. The keyword parameters which aren't used by jointplot are sent to scatterplot. The legend parameter can be "auto", "brief", "full", or False.
From the docs:
If “brief”, numeric hue and size variables will be represented with a sample of evenly spaced values. If “full”, every group will get an entry in the legend. If “auto”, choose between brief or full representation based on number of levels. If False, no legend data is added and no legend is drawn.
The following code is tested with seaborn 0.11.2:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
N = 200
k = np.repeat(np.arange(1, 9), N // 8)
df = pd.DataFrame({'x': 5 * np.cos(2 * k * np.pi / 8) + np.random.randn(N),
                   'y': 5 * np.sin(2 * k * np.pi / 8) + np.random.randn(N),
                   'Dataset': k})
g = sns.jointplot(y="y", x="x", data=df, hue="Dataset", palette='turbo', legend='full')
plt.show()
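As a variation on the same idea, the hue column can hold string labels instead of numbers; seaborn then treats it as categorical, so every dataset should get its own legend entry without relabelling the handles afterwards. The sketch below only builds such a long-form frame; the arrays `data1` to `data3` and their contents are hypothetical stand-ins for the eight datasets in the question.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical two-column (x, y) arrays with different numbers of rows.
datasets = {
    "data1": rng.normal(size=(5, 2)),
    "data2": rng.normal(size=(8, 2)),
    "data3": rng.normal(size=(3, 2)),
}

# Stack the arrays and record which dataset each row came from.
df = pd.concat(
    [pd.DataFrame(arr, columns=["x", "y"]).assign(Dataset=name)
     for name, arr in datasets.items()],
    ignore_index=True,
)

print(df.shape)  # (16, 3)
```

Passing this frame to `sns.jointplot(data=df, x="x", y="y", hue="Dataset")` should then list each label in the legend.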
I am trying to use MiniBatchKMeans with a larger data set and plot 2 different attributes. I am receiving a KeyError: 2. I believe I am making an error in my for loop, but I am not sure where. Can someone help me see where my error is? I am running the following code:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
from sklearn.cluster import MiniBatchKMeans
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
                 "relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
                 "less50kmoreeq50kn"]
print("reviewing dataframe:")
print(Adult.head()) #Getting an overview of the data
print(Adult.shape)
print(Adult.dtypes)
np.median(Adult['fnlwgt']) #Calculating median for final weight column
TooLarge = Adult.loc[:,'fnlwgt'] > 748495 #Setting a value to replace outliers from final weight column with median
Adult.loc[TooLarge,'fnlwgt']=np.median(Adult['fnlwgt']) #replacing values from final weight Column with the median of the final weight column
Adult.loc[:,'fnlwgt']
X = pd.DataFrame()
X.loc[:,0] = Adult.loc[:,'age']
X.loc[:,1] = Adult.loc[:,'hoursperweek']
kmeans = MiniBatchKMeans(n_clusters = 2)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r."]
for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X.loc[:,0][i],X.loc[:,1][i], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0], centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()
When I run the for loop I only see 2 data points plotted in the scatter matrix. Do I need to call the points differently from the created data frame?
You can avoid this problem by not looping over all of the roughly 32,000 points to plot them individually, which is unnecessary and bad practice. Simply pass two arrays to plt.scatter() to make this scatter plot; no loop is needed. Use these lines:
colors = ["green","red"]
plt.scatter(X.iloc[:,0], X.iloc[:,1], c=np.array(colors)[labels],
            s = 10, alpha=.1)
plt.scatter(centroids[:, 0], centroids[:, 1], marker = "x", s=150,
            linewidths = 5, zorder = 10, c=['green', 'red'])
plt.show()
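Your original error was caused by incorrect pandas indexing: on a DataFrame, X[i] selects the column labelled i, not the i-th row. X only has the columns 0 and 1, so the loop fails with KeyError: 2 as soon as i reaches 2, after plotting just two points. You can reproduce the same kind of error like this: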
df = pd.DataFrame(list('dasdasas'))
df[1]
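The sketch below (with made-up values) contrasts label-based column lookup, which is what `X[i]` does on a DataFrame, with positional row access through `.iloc`.

```python
import pandas as pd

# Two columns labelled 0 and 1, like X in the question (values are made up).
X = pd.DataFrame({0: [25, 38, 52], 1: [40, 45, 20]})

print(X[1].tolist())       # column labelled 1 -> [40, 45, 20]
print(X.iloc[2].tolist())  # third row by position -> [52, 20]

try:
    X[2]                   # there is no column labelled 2
except KeyError as err:
    message = f"KeyError: {err}"
    print(message)         # KeyError: 2
```

If a loop over rows is really needed, `X.iloc[i]` is the positional equivalent of what the original loop intended.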
I would like to plot certain slices of my Pandas Dataframe for each rows (based on row indexes) with different colors.
My data look like the following:
I already tried with the help of this tutorial to find a way but I couldn't - probably due to a lack of skills.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(r"D:\SOF10.csv" , header=None)
df.head()
#Slice interested data
C = df.iloc[:, 2::3]
#Plot Temp base on row index colorfully
C.apply(lambda x: plt.scatter(x.index, x, c='g'))
plt.show()
Following is my expected plot:
I was also wondering whether I could display the mean of each row of the sliced data (each row contains 480 values) somewhere in the plot or in a legend beside it. Is it feasible (as in the following picture) to calculate the mean and display it in the legend, or next to its own data in the graph in a small font?
Data sample: data
This gives the plot without a legend:
C = df.iloc[:,2::3].stack().reset_index()
C.columns = ['level_0', 'level_1', 'Temperature']
fig, ax = plt.subplots(1,1)
C.plot('level_0', 'Temperature',
       ax=ax, kind='scatter',
       c='level_0', colormap='tab20',
       colorbar=False, legend=True)
ax.set_xlabel('Cycles')
plt.show()
Edit to reflect modified question:
stack() transforms your (sliced) dataframe into a series with the index (row, col).
reset_index() resets the two-level index above into the columns level_0 (row) and level_1 (col).
set_xlabel sets the x-axis label to what you want.
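A tiny sketch of what `stack()` and `reset_index()` do, using a made-up 2x3 frame in place of the sliced temperature data (the column labels 2, 5, 8 imitate what `df.iloc[:, 2::3]` would select):

```python
import pandas as pd

# Made-up stand-in for the sliced frame: 2 rows (cycles), 3 columns.
sliced = pd.DataFrame([[20.1, 20.5, 20.3],
                       [21.0, 21.4, 21.2]], columns=[2, 5, 8])

stacked = sliced.stack()       # Series indexed by (row, column)
C = stacked.reset_index()      # row -> level_0, column -> level_1
C.columns = ["level_0", "level_1", "Temperature"]

print(C.shape)                 # (6, 3): one line per original cell
print(C)
```

Each original cell becomes its own row, which is why `level_0` can serve as both the x value and the color key in the scatter plot.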
Edit 2: The following produces scatter with legend:
CC = df.iloc[:,2::3]
fig, ax = plt.subplots(1,1, figsize=(16,9))
labels = CC.mean(axis=1)
for i in CC.index:
    ax.scatter([i]*len(CC.columns[1:]), CC.iloc[i,1:], label=labels[i])
ax.legend()
ax.set_xlabel('Cycles')
ax.set_ylabel('Temperature')
plt.show()
This may be only an approximate answer: the c= and cmap= arguments of scatter() can be used to get the desired coloring.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import itertools
df = pd.DataFrame({'a':[34,22,1,34]})
fig, subplot_axes = plt.subplots(1, 1, figsize=(20, 10)) # width, height
colors = ['red','green','blue','purple']
cmap=matplotlib.colors.ListedColormap(colors)
for col in df.columns:
    subplot_axes.scatter(df.index, df[col].values, c=df.index, cmap=cmap, alpha=.9)
I'm trying to smooth out a graph line, but since the x-axis values are dates I'm having great trouble doing this. Say we have a dataframe as follows:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
startDate = '2015-05-15'
endDate = '2015-12-5'
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ['value']
df = pd.DataFrame(data, index=index, columns=cols)
Then we plot the data
fig, axs = plt.subplots(1,1, figsize=(18,5))
x = df.index
y = df.value
axs.plot(x, y)
fig.show()
we get
Now, to smooth this line there are already some useful Stack Overflow questions, such as:
Generating smooth line graph using matplotlib,
Plot smooth line with PyPlot
Creating numpy linspace out of datetime
But I just can't seem to get any code working for my example. Any suggestions?
You can use the interpolation functionality that ships with pandas. Because your dataframe already has a value for every index entry, you can reindex it with a denser index (hourly instead of daily), which fills every previously non-existent index entry with NaN. Then, after choosing one of the many interpolation methods available, interpolate and plot your data:
index_hourly = pd.date_range(startDate, endDate, freq='1H')
df_smooth = df.reindex(index=index_hourly).interpolate('cubic')
df_smooth = df_smooth.rename(columns={'value':'smooth'})
df_smooth.plot(ax=axs, alpha=0.7)
df.plot(ax=axs, alpha=0.7)
fig.show()
One workaround is to create two plots: 1) the non-smoothed (non-interpolated) data with date labels, and 2) the smoothed data without date labels.
Plot 1) with the argument linestyle=" " and convert the dates on the x-axis to strings.
Plot 2) with the argument linestyle="-", building the new x values with np.linspace and interpolating the y values with make_interp_spline.
The workaround applied to your code looks like this:
# your initial code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline
%matplotlib inline
startDate = "2015-05-15"
endDate = "2015-07-5" #reduced the end date so smoothness is clearly seen
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ["value"]
df = pd.DataFrame(data, index=index, columns=cols)
fig, axs = plt.subplots(1, 1, figsize=(40, 12))
x = df.index
y = df.value
# workaround: build an evenly spaced numeric grid over the positions of your x axis
# (stop at the last position, len - 1, so the spline is never extrapolated past the data)
x_new = np.linspace(0, len(df.index) - 1, 300)
a_BSpline = make_interp_spline(
    np.arange(len(df.index)),
    df.value,
    k=5,
)
y_new = a_BSpline(x_new)
# plot this new plot with linestyle = "-"
axs.plot(
    x_new,
    y_new,
    "-",
    label="interpolated"
)
# to get the date on x axis we will keep our previous plot but linestyle will be None so it won't be visible
x = list(x.astype(str))
axs.plot(x, y, linestyle=" ", alpha=0.75, label="initial")
xt = [x[i] for i in range(0,len(x),5)]
plt.xticks(xt,rotation="vertical")
plt.legend()
fig.show()
Resulting Plot
Overlapped plot to see the smoothing.
Depending on what exactly you mean by "smoothing," the easiest way can be the use of savgol_filter or something similar. Unlike with interpolated splines, this method means that the smoothed line does not pass through the measured points, effectively filtering out higher-frequency noise.
from scipy.signal import savgol_filter
...
windowSize = 21
polyOrder = 1
smoothed = savgol_filter(values, windowSize, polyOrder)
axes.plot(datetimes, smoothed, color=chart.color)
The higher the polynomial order value, the closer the smoothed line is to the raw data.
Here is an example.
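A self-contained sketch of this approach on a date-indexed series follows. The random data mirrors the question's setup; the window length of 21 and polynomial order of 1 are taken from the snippet above, and the remaining names are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

# Daily random data over the question's date range.
index = pd.date_range("2015-05-15", "2015-12-05")
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(0, 1, size=len(index))}, index=index)

# The filter only sees the y values, so the datetime index needs no conversion.
window_size = 21  # number of consecutive samples used for each local fit
poly_order = 1    # must be smaller than window_size
df["smooth"] = savgol_filter(df["value"].to_numpy(), window_size, poly_order)

fig, ax = plt.subplots(figsize=(18, 5))
ax.plot(df.index, df["value"], alpha=0.4, label="raw")
ax.plot(df.index, df["smooth"], label="smoothed")
ax.legend()
```

Because the smoothed series keeps the original index, the date axis is handled by matplotlib as usual; `plt.show()` displays the result.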