I am trying to use MiniBatchKMeans on a larger data set and plot two of its attributes. I receive a KeyError: 2. I believe I am making an error in my for loop, but I am not sure where. Can someone help me see where my error is? I am running the following code:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
from sklearn.cluster import MiniBatchKMeans
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
print("reviewing dataframe:")
print(Adult.head()) #Getting an overview of the data
print(Adult.shape)
print(Adult.dtypes)
np.median(Adult['fnlwgt']) #Calculating median for final weight column
TooLarge = Adult.loc[:,'fnlwgt'] > 748495 #Setting a value to replace outliers from final weight column with median
Adult.loc[TooLarge,'fnlwgt']=np.median(Adult['fnlwgt']) #replacing values from final weight Column with the median of the final weight column
Adult.loc[:,'fnlwgt']
X = pd.DataFrame()
X.loc[:,0] = Adult.loc[:,'age']
X.loc[:,1] = Adult.loc[:,'hoursperweek']
kmeans = MiniBatchKMeans(n_clusters = 2)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r."]
for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X.loc[:,0][i],X.loc[:,1][i], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0], centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()
When I run the for loop I only see 2 data points plotted in the scatter matrix. Do I need to call the points differently from the created data frame?
You can avoid this problem by not plotting each of the roughly 32,000 points individually in a loop, which is slow and unnecessary. Simply pass two arrays to plt.scatter() to make the scatter plot; there is no need for a loop. Use these lines:
colors = ["green","red"]
plt.scatter(X.iloc[:,0], X.iloc[:,1], c=np.array(colors)[labels],
s = 10, alpha=.1)
plt.scatter(centroids[:, 0], centroids[:, 1], marker = "x", s=150,
linewidths = 5, zorder = 10, c=['green', 'red'])
plt.show()
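Your original error was caused by incorrect pandas indexing: X[i] looks up a column labelled i, not the i-th row. With columns labelled 0 and 1, the loop works for i = 0 and i = 1 and then fails with KeyError: 2, which is also why only two points were plotted. You can reproduce the error like this: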
df = pd.DataFrame(list('dasdasas'))
df[1]
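As a minimal sketch of the difference (the three-row frame below is made up for illustration and only stands in for X): square brackets on a DataFrame look up a column label, while .iloc looks up a position.

```python
import pandas as pd

# made-up stand-in for X: two columns labelled 0 and 1
X = pd.DataFrame({0: [25, 38, 52], 1: [40, 45, 20]})

col0 = X[0]        # label lookup: the whole column named 0
row2 = X.iloc[2]   # positional lookup: the third row

# there is no column labelled 2, so this is the KeyError from the question
try:
    X[2]
    raised = False
except KeyError:
    raised = True

print(list(row2), raised)
```

So inside a loop over row positions, X.iloc[i] is the call that returns the i-th point.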
I am trying to plot a density curve with seaborn using the age of vehicles.
My density curve has dips between the whole numbers, although my age values are all whole numbers.
I can't seem to find anything related to this issue, so I thought I would try my luck here. Any input is appreciated.
My current fix is to use a histogram with larger bins, but I would like to get this working with a density plot.
Thanks!
In seaborn.displot you are passing the kind = 'kde' parameter in order to get a continuous curve. However, this parameter triggers a Kernel Density Estimation, which computes a density for every value on the axis, including non-integer ones. Because your ages are all whole numbers, the estimate peaks at each integer and dips in between.
Instead, you need to tune seaborn.histplot in order to get a continuous step curve with element and fill parameters (I create a fake dataframe just to draw a plot, since you didn't provide your data):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
N = 10000
df = pd.DataFrame({'age': np.random.poisson(lam = 4, size = N)})
df['age'] = df['age'] + 1
fig, ax = plt.subplots(1, 2, figsize = (8, 4))
sns.histplot(ax = ax[0], data = df, bins = np.arange(0.5, df['age'].max() + 1, 1))
sns.histplot(ax = ax[1], data = df, bins = np.arange(0.5, df['age'].max() + 1, 1), element = 'step', fill = False)
ax[0].set_xticks(range(1, 14))
ax[1].set_xticks(range(1, 14))
plt.show()
For comparison, here is seaborn.displot on the same dataframe with the kind = 'kde' parameter:
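The dips can also be checked numerically. The following is only a sketch under assumptions, not a reproduction of seaborn's internals: it uses scipy.stats.gaussian_kde with its default bandwidth on the same kind of fake Poisson ages, and compares the estimated density at a whole number with the density halfway to the next one.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# fake whole-number ages, generated the same way as in the example above
age = (rng.poisson(lam=4, size=10000) + 1).astype(float)

kde = gaussian_kde(age)       # default bandwidth rule
at_integer = kde(5.0)[0]      # density estimate exactly at age 5
between = kde(5.5)[0]         # density estimate halfway between 5 and 6

print(at_integer, between)
```

With this many samples the default bandwidth is narrow compared to the spacing of 1 between ages, so the estimate at 5.5 comes out lower than at 5. A wider bandwidth would smooth the dips away, at the cost of blurring the distribution.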
I have this dataframe and I want to plot it as a line plot. I have plotted it so far with the following code:
fig, ax = plt.subplots(figsize=(15, 5))
date_time = pd.to_datetime(df.Date)
df = df.set_index(date_time)
plt.xticks(rotation=90)
pd.DataFrame(df, columns=df.columns).plot.line( ax=ax,
xticks=pd.to_datetime(frame.Date))
I want a marker with the innovationScore value on the open and close lines wherever innovationScore is not 0, to show that the change happens where innovationScore changes.
You have to address two problems: marking the corresponding spots on your curves and using the dates on the x-axis. The first problem can be solved by identifying the dates where the score is not zero, then plotting markers on top of the curve at these dates. The second problem is more structural: pandas often interferes with matplotlib when it comes to datetime objects. Using the standard pandas plotting functions is good because they address common problems. But mixing pandas with matplotlib plotting (and to set the markers, you have to revert to matplotlib afaik) is usually a bad idea because the two do not necessarily represent datetimes in the same format.
import pandas as pd
from matplotlib import pyplot as plt
#fake data generation, the following code block is just for illustration
import numpy as np
np.random.seed(1234)
n = 50
date_range = pd.date_range("20180101", periods=n, freq="D")
choice = np.zeros(10)
choice[0] = 3
df = pd.DataFrame({"Date": date_range,
"Open": np.random.randint(100, 150, n),
"Close": np.random.randint(100, 150, n),
"Innovation Score": np.random.choice(choice, n)})
fig, ax = plt.subplots()
#plot the three curves
l = ax.plot(df["Date"], df[["Open", "Close", "Innovation Score"]])
ax.legend(iter(l), ["Open", "Close", "Innovation Score"])
#filter dataset for score not zero
IS = df[df["Innovation Score"] > 0]
#plot markers on these positions
ax.plot(IS["Date"], IS[["Open", "Close"]], "ro")
#and/or set vertical lines to indicate the position
ax.vlines(IS["Date"], 0, max(df[["Open", "Close"]].max()), ls="--")
#label x-axis score not zero
ax.set_xticks(IS["Date"])
#beautify the output
ax.set_xlabel("Month")
ax.set_ylabel("Artificial score people take seriously")
fig.autofmt_xdate()
plt.show()
Sample output:
You can achieve it like this:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], "ro-") # r is red, o is a circle marker, - is a solid line
plt.plot([3, 2, 1], "b.-") # b is blue, . is a small point marker, - is a solid line
plt.show()
Also check out the example here for another approach:
https://matplotlib.org/3.3.3/gallery/lines_bars_and_markers/markevery_prop_cycle.html
I very often get inspiration from this list of examples:
https://matplotlib.org/3.3.3/gallery/index.html
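As a sketch of the markevery idea applied to a case like the one above (the data and column names here are invented for illustration), a line accepts a list of integer positions through markevery, so markers are drawn only where a second column is non-zero:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# made-up data: a price curve and a score that is mostly zero
df = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=10, freq="D"),
    "Open": [100, 104, 103, 110, 108, 112, 115, 111, 117, 120],
    "Score": [0, 0, 3, 0, 0, 0, 3, 0, 0, 3],
})

# integer positions of the rows where the score is not zero
marked = np.flatnonzero(df["Score"].to_numpy() != 0).tolist()

fig, ax = plt.subplots()
# "b-o" draws a blue line with circle markers, but only at the listed positions
(line,) = ax.plot(df["Date"], df["Open"], "b-o", markevery=marked, label="Open")
ax.legend()
fig.autofmt_xdate()
```

Compared with plotting a second set of "ro" points on top, this keeps a single line object, which makes the legend simpler.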
prices = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
I have my prices dataframe, and it currently has 3 columns. But at other times, it could have more or fewer columns. Is there a way to use some sort of twinx() loop to create a line-chart of all the different timeseries with a (potentially) infinite number of y-axes?
I tried the double for loop below, but I got a TypeError: 'AxesSubplot' object does not support item assignment
# for i in range(0,len(prices.columns)):
# for column in list(prices.columns):
# fig, ax[i] = plt.subplots()
# ax[i].set_xlabel(prices.index())
# ax[i].set_ylabel(column[i])
# ax[i].plot(prices.Date, prices[column])
# ax[i].tick_params(axis ='y')
#
# ax[i+1] = ax[i].twinx()
# ax[i+1].set_ylabel(column[i+1])
# ax[i+1].plot(prices.Date, column[i+1])
# ax[i+1].tick_params(axis ='y')
#
# fig.suptitle('matplotlib.pyplot.twinx() function \ Example\n\n', fontweight ="bold")
# plt.show()
# =============================================================================
I believe I understand why I got the error - the ax object does not allow the assignment of the i variable. I'm hoping there is some ingenious way to accomplish this.
It turned out that the main problem was mixing pandas plotting functions with matplotlib, which led to a duplication of the axes. Otherwise, the implementation is rather straightforward, adapted from this matplotlib example.
from mpl_toolkits.axes_grid1 import host_subplot
import mpl_toolkits.axisartist as AA
from matplotlib import pyplot as plt
from itertools import cycle
import pandas as pd
#fake data creation with different spread for different axes
#this entire block can be deleted if you import your df
from pandas._testing import rands_array
import numpy as np
fakencol=5
fakenrow=7
np.random.seed(20200916)
df = pd.DataFrame(np.random.randint(1, 10, fakenrow*fakencol).reshape(fakenrow, fakencol), columns=rands_array(2, fakencol))
df = df.multiply(np.power(np.asarray([10]), np.arange(fakencol)))
df.index = pd.date_range("20200916", periods=fakenrow)
#defining a color scheme with unique colors
#if you want to include more than 20 axes, well, what can I say
sc_color = cycle(plt.cm.tab20.colors)
#defining the size of the figure in relation to the number of dataframe columns
#might need adjustment for optimal data presentation
offset = 60
plt.rcParams['figure.figsize'] = 10+df.shape[1], 5
#host figure and first plot
host = host_subplot(111, axes_class=AA.Axes)
h, = host.plot(df.index, df.iloc[:, 0], c=next(sc_color), label=df.columns[0])
host.set_ylabel(df.columns[0])
host.axis["left"].label.set_color(h.get_color())
host.set_xlabel("time")
#plotting the rest of the axes
for i, cols in enumerate(df.columns[1:]):
    curr_ax = host.twinx()
    new_fixed_axis = curr_ax.get_grid_helper().new_fixed_axis
    curr_ax.axis["right"] = new_fixed_axis(loc="right",
                                           axes=curr_ax,
                                           offset=(offset*i, 0))
    curr_p, = curr_ax.plot(df.index, df[cols], c=next(sc_color), label=cols)
    curr_ax.axis["right"].label.set_color(curr_p.get_color())
    curr_ax.set_ylabel(cols)
    curr_ax.yaxis.label.set_color(curr_p.get_color())
plt.legend()
plt.tight_layout()
plt.show()
Coming to think of it - it would probably have been better to distribute the axes equally to the left and the right of the plot. Oh, well.
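For reference, the TypeError in the question comes from plt.subplots() returning a single Axes object when called without a grid, so ax[i] = ... has nothing to assign into. A plain-matplotlib alternative, sketched here with the three-column prices frame from the question and an assumed spacing of 60 points between axes, keeps the twin axes in an ordinary Python list and pushes each extra right-hand spine outward:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

prices = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                      columns=["a", "b", "c"])

fig, host = plt.subplots(figsize=(8, 4))
axes = [host]                    # a list can grow; an Axes object cannot
colors = plt.cm.tab10.colors

for i, column in enumerate(prices.columns):
    if i > 0:
        twin = host.twinx()      # new y-axis sharing the host's x-axis
        # shift each additional right-hand spine outward (60 pt is a guess)
        twin.spines["right"].set_position(("outward", 60 * (i - 1)))
        axes.append(twin)
    ax = axes[i]
    color = colors[i % len(colors)]
    ax.plot(prices.index, prices[column], color=color)
    ax.set_ylabel(column, color=color)

fig.tight_layout()
```

The loop adapts to however many columns the frame has, although beyond a handful of axes the figure needs to be widened to stay readable.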
I have a pandas dataframe with 27 columns of electricity consumption data. The first column holds the date and time over a two-year period, and the other columns hold hourly electricity consumption values for 26 houses over those two years. I am clustering the data using k-means. Whenever I try to plot the date on the x-axis and the electricity consumption values on the y-axis, I get an error saying that x and y must have the same size. I tried reshaping the data, but that did not solve the problem.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import datetime
data_consumption2 = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
data_consumption2['Timestamp'] = pd.to_datetime(data_consumption2['Timestamp'], unit='s')
X=data_consumption2.iloc[: , 1:26].values
X=np.nan_to_num(X)
np.concatenate(X)
date=data_consumption2.iloc[: , 0].values
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
C = kmeans.cluster_centers_
plt.scatter(X, R , s=40, c= kmeans.labels_.astype(float), alpha=0.7)
plt.scatter(C[:,0] , C[:,1] , marker='*' , c='r', s=100)
I always get the same error message: x and y must be the same size, try to reshape your data. When I tried to reshape the data it did not work, because the date column is always smaller than the rest of the columns.
I think what you are essentially doing is a time series clustering of all households to find similar electricity usage patterns over time.
For that, each timestamp becomes a 'feature', while each household's usage becomes a data row. This makes it easier to apply sklearn clustering methods, which typically take the form method.fit(x), where x holds the features as a 2D array of shape (rows, columns). So your data needs to be transposed.
The refactored code is as follows:
# what you have done
import pandas as pd
df = pd.read_excel(r"C:\Users\user\Desktop\Thesis\Tarek\Parent.xlsx", sheet_name="Consumption")
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='s')
# this is to fill all the NaN values with 0
df.fillna(0,inplace=True)
# transpose the dataframe accordingly
df = df.set_index('Timestamp').transpose()
# '%m/%d/%y' is spelled out because the '%D' shorthand is not available on every platform
df.rename(columns=lambda x : x.strftime('%m/%d/%y %H:%M:%S'), inplace=True)
df.reset_index(inplace=True)
df.rename(columns={'index':'house_no'}, inplace=True)
df.columns.rename(None, inplace=True)
df.head()
and you should see something like this (don't mind the data shown, I created some dummy data that is similar to yours).
Next, for clustering, this is what you can do:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(df.iloc[:,1:])
y_kmeans = kmeans.predict(df.iloc[:,1:])
C = kmeans.cluster_centers_
# add a new column to your dataframe that contains the predicted clusters
df['cluster'] = y_kmeans
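Because the Excel file is not available here, the same transpose-then-cluster idea can be tried on made-up data. This is only a sketch: the house names, the 48 hourly readings, and the three usage levels are all invented, and random_state and n_init are set just to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
timestamps = pd.date_range("2020-01-01", periods=48,
                           freq=pd.Timedelta(hours=1))

# 6 invented houses with three clearly different usage levels
levels = [1, 1, 5, 5, 10, 10]
wide = pd.DataFrame(
    {f"house_{i}": level + rng.normal(0, 0.1, len(timestamps))
     for i, level in enumerate(levels)},
    index=timestamps,
)                                # shape: (timestamps, houses)

per_house = wide.transpose()     # shape: (houses, timestamps)

# one row per house, one feature per timestamp
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(per_house.to_numpy())

print(wide.shape, per_house.shape)
print(dict(zip(per_house.index, labels)))
```

After the transpose there is exactly one label per house, which is why the size mismatch between the date column and the consumption values no longer comes up.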
Finally, for plotting, you can produce the scatter chart you wanted using the code below:
import matplotlib.pyplot as plt
color = ['red','green','blue']
plt.figure(figsize=(16,4))
for index, row in df.iterrows():
    plt.scatter(x=row.index[1:-1], y=row.iloc[1:-1], c=color[row.iloc[-1]], marker='x', alpha=0.7, s=40)
for index, cluster_center in enumerate(kmeans.cluster_centers_):
    plt.scatter(x=df.columns[1:-1], y=cluster_center, c=color[index], marker='o', s=100)
plt.xticks(rotation='vertical')
plt.ylabel('Electricity Consumption')
plt.title(f'All Clusters - Scatter', fontsize=20)
plt.show()
But I would recommend plotting line plots for the individual clusters, which is more visually appealing (to me):
plt.figure(figsize=(16,16))
for cluster_index in [0,1,2]:
    plt.subplot(3,1,cluster_index + 1)
    for index, row in df.iterrows():
        if row.iloc[-1] == cluster_index:
            plt.plot(row.iloc[1:-1], c=color[row.iloc[-1]], linestyle='--', marker='x', alpha=0.5)
    plt.plot(kmeans.cluster_centers_[cluster_index], c = color[cluster_index], marker='o', alpha=1)
    plt.xticks(rotation='vertical')
    plt.ylabel('Electricity Consumption')
    plt.title(f'Cluster {cluster_index}', fontsize=20)
plt.tight_layout()
plt.show()
Cheers!
I'm trying to smooth a graph line out, but since the x-axis values are dates I'm having great trouble doing this. Say we have a dataframe as follows:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
startDate = '2015-05-15'
endDate = '2015-12-5'
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ['value']
df = pd.DataFrame(data, index=index, columns=cols)
Then we plot the data
fig, axs = plt.subplots(1,1, figsize=(18,5))
x = df.index
y = df.value
axs.plot(x, y)
fig.show()
we get
Now, to smooth this line there are already some useful Stack Overflow questions, like:
Generating smooth line graph using matplotlib,
Plot smooth line with PyPlot
Creating numpy linspace out of datetime
But I just can't seem to get any code working for my example. Any suggestions?
You can use the interpolation functionality that ships with pandas. Because your dataframe already has a value for every index entry, you can reindex it with a denser index (hourly instead of daily), which fills every previously non-existent index entry with NaN. Then, after choosing one of the many available interpolation methods, interpolate and plot your data:
index_hourly = pd.date_range(startDate, endDate, freq='1H')
df_smooth = df.reindex(index=index_hourly).interpolate('cubic')
df_smooth = df_smooth.rename(columns={'value':'smooth'})
df_smooth.plot(ax=axs, alpha=0.7)
df.plot(ax=axs, alpha=0.7)
fig.show()
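A self-contained version of the same idea, as a sketch: it rebuilds the daily frame from the question with a fixed seed, and it assumes SciPy is installed, since pandas delegates the 'cubic' method to it. The hourly frequency is passed as a Timedelta only to avoid depending on a particular frequency alias spelling.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
index = pd.date_range("2015-05-15", "2015-12-05")          # daily
df = pd.DataFrame(np.random.normal(0, 1, size=len(index)),
                  index=index, columns=["value"])

# denser index: one entry per hour between the same two dates
index_hourly = pd.date_range("2015-05-15", "2015-12-05",
                             freq=pd.Timedelta(hours=1))

# reindexing inserts NaN at the new timestamps; interpolate fills them in
df_smooth = df.reindex(index=index_hourly).interpolate("cubic")

print(len(df), len(df_smooth))
```

Only the NaN entries are filled, so the smoothed curve still passes through every original daily value.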
There is one workaround: we create two plots, 1) the non-smoothed/non-interpolated data with date labels and 2) the smoothed data without date labels.
Plot 1) using the argument linestyle=" " and convert the dates to be plotted on the x-axis to string type.
Plot 2) using the argument linestyle="-", interpolating the x-axis and y-axis using np.linspace and make_interp_spline respectively.
The following applies this workaround to your code.
# your initial code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline
%matplotlib inline
startDate = "2015-05-15"
endDate = "2015-07-5" #reduced the end date so smoothness is clearly seen
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ["value"]
df = pd.DataFrame(data, index=index, columns=cols)
fig, axs = plt.subplots(1, 1, figsize=(40, 12))
x = df.index
y = df.value
# workaround: create a linspace over the positions of your x axis
# stop at the last data position (len - 1) so the spline is not extrapolated
# past the data, which is what produced large values at the end
x_new = np.linspace(0, len(df.index) - 1, 300)
a_BSpline = make_interp_spline(
    [i for i in range(0, len(df.index))],
    df.value,
    k=5,
)
y_new = a_BSpline(x_new)
# plot this new plot with linestyle = "-"
axs.plot(
    x_new,
    y_new,
    "-",
    label="interpolated"
)
# to get the date on x axis we will keep our previous plot but linestyle will be None so it won't be visible
x = list(x.astype(str))
axs.plot(x, y, linestyle=" ", alpha=0.75, label="initial")
xt = [x[i] for i in range(0,len(x),5)]
plt.xticks(xt,rotation="vertical")
plt.legend()
fig.show()
Resulting Plot
Overlapped plot to see the smoothing.
Depending on what exactly you mean by "smoothing," the easiest way can be the use of savgol_filter or something similar. Unlike with interpolated splines, this method means that the smoothed line does not pass through the measured points, effectively filtering out higher-frequency noise.
from scipy.signal import savgol_filter
...
windowSize = 21
polyOrder = 1
smoothed = savgol_filter(values, windowSize, polyOrder)
axes.plot(datetimes, smoothed, color=chart.color)
The higher the polynomial order value, the closer the smoothed line is to the raw data.
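A runnable sketch of this approach on the daily frame from the question (the seed and the variable names below are placeholders, not part of the original snippet). Since savgol_filter only transforms the y values, the dates can stay unchanged on the x-axis:

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

np.random.seed(0)
index = pd.date_range("2015-05-15", "2015-12-05")
df = pd.DataFrame(np.random.normal(0, 1, size=len(index)),
                  index=index, columns=["value"])

window_size = 21   # number of samples in each local fit
poly_order = 1     # degree of the fitted polynomial; must be < window_size
df["smooth"] = savgol_filter(df["value"].to_numpy(), window_size, poly_order)

# df["smooth"] has one value per date, so it plots against df.index directly,
# e.g. axs.plot(df.index, df["smooth"])
print(df[["value", "smooth"]].std())
```

The smoothed column varies much less than the raw one. Shrinking window_size or raising poly_order moves it back toward the raw data.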
Here is an example.