I have implemented a regression model and retrieved results. To evaluate them, I want to create a plot where the MAE and its standard deviation are shown in the same figure. However, I want to group the data into intervals and compute the statistics per interval. Although I can use sklearn metrics to calculate the mean absolute error, that works on the entire range of the data. Can someone give me an idea of how to group the data based on intervals?
The data is very large, so I cannot share it here. However, I am attaching random data and the code I implemented for calculating the bias below.
import pandas as pd
import random
import matplotlib.pyplot as plt
yact = random.sample(range(1, 100), 50)
ypred=random.sample(range(1, 100), 50)
df = pd.DataFrame(yact,columns=['yact'])
df['ypred']=ypred
df['bias']=df['yact']-df['ypred']
#groups=[20,40,60,80,100]
I want to create groups of ypred based on yact (similar to the groups given above).
A reference figure for what I am trying to plot is in the first quadrant of the figure attached below.
We could use only pandas/matplotlib, but seaborn makes this kind of plotting so much easier. First, we categorize the data with pd.cut based on the bins provided, then we plot them with seaborn's pointplot. The mean is the default estimator, but I pass it explicitly to point out that you can feed other functions into the plot here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#random data generation
rng = np.random.default_rng(123)
n=500
yact = rng.choice(range(1, 100), n)
ypred = rng.choice(range(1, 100), n)
df = pd.DataFrame({"yact": yact, "ypred": ypred})
df['bias']=df['yact']-df['ypred']
#binning of data
bins = [0, 30, 50, 80, 100]
labels = [f"({first}; {second}]" for first, second in zip(bins[:-1], bins[1:])]
df["cats"] = pd.cut(x=df['yact'], bins=bins, labels=labels, include_lowest=True)
#plotting with seaborn
sns.pointplot(x="cats", y="ypred", data=df, order=labels, estimator=np.mean, ci="sd", join=False)
plt.show()
(Unsurprisingly uniform) sample output:
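The plot above shows the mean and spread of ypred per bin. If the MAE and its standard deviation per interval are what is needed, as the question asks, a minimal pandas-only sketch could look like this. The column names abs_err and cats and the table name stats are my own choices for illustration, and the random data is only a stand-in for the real predictions.

```python
import numpy as np
import pandas as pd

# random stand-in data, similar to the answer above
rng = np.random.default_rng(123)
n = 500
df = pd.DataFrame({"yact": rng.integers(1, 100, n), "ypred": rng.integers(1, 100, n)})

# absolute error per sample; its mean within a group is that group's MAE
df["abs_err"] = (df["yact"] - df["ypred"]).abs()

# bin the rows by the actual value
bins = [0, 30, 50, 80, 100]
df["cats"] = pd.cut(df["yact"], bins=bins, include_lowest=True)

# MAE, standard deviation of the absolute error, and sample count per bin
stats = df.groupby("cats", observed=False)["abs_err"].agg(mae="mean", sd="std", n="count")
print(stats)
```

The resulting table can then be drawn with, for example, plt.errorbar(range(len(stats)), stats["mae"], yerr=stats["sd"], fmt="o").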
I want to perform spectral clustering on the 3 circles dataset that I have generated using make_circles, as shown in the figure. The three circles all belong to different classes.
from sklearn.datasets import make_circles
import seaborn as sns
import pandas as pd
import numpy as np
from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt
import pylab as pl
import networkx as nx
X_small, y_small = make_circles(n_samples=(100,200), random_state=3,
noise=0.07, factor = 0.7)
X_large, y_large = make_circles(n_samples=(100,200), random_state=3,
noise=0.07, factor = 0.4)
y_large[y_large==1] = 2
df = pd.DataFrame(np.vstack([X_small,X_large]),columns=['x1','x2'])
df['label'] = np.hstack([y_small,y_large])
df.label.value_counts()
sns.scatterplot(data=df,x='x1',y='x2',hue='label',style='label',palette="bright")
Since I can't flag this question as duplicate (the similar question has no accepted answer), here is a working example of Spectral Clustering on 3 circles using your code:
X_small, y_small = make_circles(n_samples=(1000,2000), random_state=3,
noise=0.07, factor = 0.1)
X_large, y_large = make_circles(n_samples=(1000,2000), random_state=3,
noise=0.07, factor = 0.6)
y_large[y_large==1] = 2
df = pd.DataFrame(np.vstack([X_small,X_large]),columns=['x1','x2'])
df['label'] = np.hstack([y_small,y_large])
df.label.value_counts()
sns.scatterplot(data=df,x='x1',y='x2',hue='label',style='label',palette="bright")
Then adapt the slightly modified 3 circles dataset (added samples and spread the circles) to the code of this SO answer:
x1 = np.expand_dims(df['x1'].values,axis=1)
x2 = np.expand_dims(df['x2'].values,axis=1)
X = np.concatenate((x1,x2),axis=1)
y = df['label'].values
from sklearn.cluster import SpectralClustering
clustering = SpectralClustering(n_clusters=3, gamma=1000).fit(X)
colors = ['r','g','b']
colors = np.array([colors[label] for label in clustering.labels_])
plt.scatter(X[y==0, 0], X[y==0, 1], c=colors[y==0], marker='X')
plt.scatter(X[y==1, 0], X[y==1, 1], c=colors[y==1], marker='o')
plt.scatter(X[y==2, 0], X[y==2, 1], c=colors[y==2], marker='*')
plt.show()
The np.expand_dims(...,axis=1) is necessary to create the dimension along which to concatenate features with np.concatenate() (we initially have 1D vectors, and we don't want to concatenate along the existing initial dimension which is the samples index dimension). Each plt.scatter() line plots the points of a single true data class (hence the y==y_true index selection) using the associated marker, the colors indicating the class provided by the clustering.
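To make the shape argument concrete, here is a small sketch with made-up 1D vectors (the values are arbitrary). It also shows np.column_stack, which I believe gives the same result in one call, and what happens if the 1D vectors are concatenated directly.

```python
import numpy as np

# two 1D feature vectors with three samples each (arbitrary values)
x1 = np.array([0.1, 0.2, 0.3])
x2 = np.array([1.0, 2.0, 3.0])

# add a trailing axis to each, then join along that new axis: shape (3, 2)
a = np.concatenate((np.expand_dims(x1, axis=1), np.expand_dims(x2, axis=1)), axis=1)

# column_stack does both steps at once
b = np.column_stack((x1, x2))

# concatenating the 1D vectors directly just appends the samples: shape (6,)
flat = np.concatenate((x1, x2))

print(a.shape, b.shape, flat.shape)
```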
Resulting dataset:
Resulting clusters:
Edit: to use different markers to identify true classes (colors already indicating the clustering classes), as asked by OP in the comments. Unfortunately, we cannot pass an array of markers (as we do for colors) to produce the plot in a single line of code, because marker does not accept a list as input (discussed here).
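If the three nearly identical plt.scatter() lines bother you, a loop over (class, marker) pairs is one way to shorten them. This is only a sketch: the data and cluster labels below are synthetic stand-ins, the markers dict is my own naming, and the Agg backend is selected just so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, an assumption for running without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))      # stand-in features
y = np.arange(30) % 3             # stand-in true classes 0, 1, 2
cluster = np.arange(30) // 10     # stand-in cluster labels 0, 1, 2

# one color per cluster label, one marker per true class
colors = np.array(["r", "g", "b"])[cluster]
markers = {0: "X", 1: "o", 2: "*"}

fig, ax = plt.subplots()
for label, marker in markers.items():
    mask = y == label  # points of a single true class
    ax.scatter(X[mask, 0], X[mask, 1], c=colors[mask], marker=marker)
```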
Edit2: added motivation for the use of np.expand_dims(...,axis=1) and some explanation for the plt.scatter() lines, as asked by OP in the comments.
I have 2 tables, one 10 by 110 and one 35 by 110, and both contain random numbers from an exponential distribution function provided by my professor. The assignment is to prove the central limit theorem in statistics.
What I thought to try is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# importing data
df1 = pd.read_excel(r'C:\Users\Henry\Desktop\n10.xlsx')
df2 = pd.read_excel(r'C:\Users\Henry\Desktop\n30.xlsx')
df1avg = pd.read_excel(r'C:\Users\Henry\Desktop\n10avg.xlsx')
df2avg = pd.read_excel(r'C:\Users\Henry\Desktop\n30avg.xlsx')
# plotting n10 histogram
plt.hist(df1, bins=34)
plt.hist(df1avg, bins=11)
# plotting n30 histogram
plt.hist(df2, bins=63)
plt.hist(df2avg, bins=11)
Is that OK, or do I need to reshape the tables into a single column? If so, what is the most efficient way to do that?
I suspect that you will want to flatten your dataframe first, as illustrated below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = np.random.exponential(1, [40, 5])
df = pd.DataFrame(N) # convert to dataframe
bin_edges = np.linspace(0,6,30)
plt.figure()
plt.hist(df, bins = bin_edges, density = True)
plt.xlabel('Value')
plt.ylabel('Probability density')
The five differently coloured bars per bin show the histogram of each column of the data frame.
Fortunately, this is not hard to adjust. You can convert the data frame to a numpy array and flatten it:
plt.hist(df.to_numpy().flatten(), bins = bin_edges, density = True)
plt.ylabel('Probability density')
plt.xlabel('Value')
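For the central limit theorem part itself, the averages can also be computed directly from the tables rather than read from separate files. The sketch below assumes that each column of a table is one sample of size n (10 or 35 rows) and uses simulated exponential data with scale 1 in place of the professor's numbers, so the names n10, n35, means10 and means35 are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-ins for the two tables: n rows (sample size) by 110 columns (repeated samples)
n10 = rng.exponential(1.0, size=(10, 110))
n35 = rng.exponential(1.0, size=(35, 110))

# one sample mean per column
means10 = n10.mean(axis=0)
means35 = n35.mean(axis=0)

# the CLT predicts a spread of sigma / sqrt(n) around the true mean (both are 1 here)
for n, means in ((10, means10), (35, means35)):
    print(n, means.mean(), means.std(ddof=1), 1 / np.sqrt(n))
```

Plotting plt.hist(means10) and plt.hist(means35) next to the flattened raw data should then show the skewed exponential shape giving way to a narrower, more symmetric one as n grows.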
I am trying to plot a density curve with seaborn using age of vehicles.
My density curve has dips between the whole numbers, even though my age values are all whole numbers.
Can't seem to find anything related to this issue so I thought I would try my luck here, any input is appreciated.
My fix currently is just using a histogram with a larger bin but would like to get this working with a density plot.
Thanks!
In seaborn.displot you are passing the kind = 'kde' parameter in order to get a continuous curve. However, this parameter triggers the Kernel Density Estimation computation, which computes values for all numbers, including non-integer ones.
Instead, you need to tune seaborn.histplot in order to get a continuous step curve with element and fill parameters (I create a fake dataframe just to draw a plot, since you didn't provide your data):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
N = 10000
df = pd.DataFrame({'age': np.random.poisson(lam = 4, size = N)})
df['age'] = df['age'] + 1
fig, ax = plt.subplots(1, 2, figsize = (8, 4))
sns.histplot(ax = ax[0], data = df, bins = np.arange(0.5, df['age'].max() + 1, 1))
sns.histplot(ax = ax[1], data = df, bins = np.arange(0.5, df['age'].max() + 1, 1), element = 'step', fill = False)
ax[0].set_xticks(range(1, 14))
ax[1].set_xticks(range(1, 14))
plt.show()
As a comparison, here is the seaborn.displot on the same dataframe, passing the kind = 'kde' parameter:
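The dips can also be reproduced numerically without any plotting. The sketch below uses scipy.stats.gaussian_kde as a stand-in for what seaborn computes internally (an assumption on my side, and the exact bandwidth seaborn picks may differ) and evaluates the density at an integer and at the half-way point next to it, once with the default bandwidth and once with a wider one.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
age = rng.poisson(lam=4, size=10000) + 1   # integer ages, as in the fake dataframe

points = [5.0, 5.5]                        # an integer age and the gap next to it

# default (Scott) bandwidth: narrow kernels around each integer leave dips in between
narrow = gaussian_kde(age)(points)

# wider bandwidth: neighbouring kernels overlap and the dips mostly disappear
wide = gaussian_kde(age, bw_method=0.5)(points)

print("default bandwidth:", narrow)
print("wider bandwidth:  ", wide)
```

In seaborn the corresponding knob is the bw_adjust parameter of kdeplot/displot; a value above 1 widens the kernels, at the cost of smearing the distribution.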
I have a distribution that changes over time for which I would like to plot a violin plot for each time step side-by-side using seaborn. My initial attempt failed as violinplot cannot handle a np.ndarray for the y argument:
import numpy as np
import seaborn as sns
time = np.arange(0, 10)
samples = np.random.randn(10, 200)
ax = sns.violinplot(x=time, y=samples) # Exception: Data must be 1-dimensional
The seaborn documentation has an example for a vertical violinplot grouped by a categorical variable. However, it uses a DataFrame in long format.
Do I need to convert my time series into a DataFrame as well? If so, how do I achieve this?
A closer look at the documentation made me realize that omitting the x and y argument altogether leads to the data argument being interpreted in wide-form:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
samples = np.random.randn(20, 10)
ax = sns.violinplot(data=samples)
plt.show()
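Note that wide-form input draws one violin per column, so the (10, 200) array from the question would have to be transposed first. If a DataFrame is preferred, a possible conversion to both wide and long format is sketched below; the column names time and value are my own choice.

```python
import numpy as np
import pandas as pd

time = np.arange(0, 10)
samples = np.random.default_rng(0).normal(size=(10, 200))  # one row per time step

# wide form: one column per time step, one row per sample
wide = pd.DataFrame(samples.T, columns=time)

# long form: one row per (time, value) pair
long = wide.melt(var_name="time", value_name="value")

print(wide.shape, long.shape)
```

Either sns.violinplot(data=wide) or sns.violinplot(data=long, x="time", y="value") should then give one violin per time step.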
In the violin plot documentation it says that the input x and y parameters do not have to be a data frame, but they have a restriction of having the same dimension. In addition, the variable y that you created has 10 rows and 200 columns. This is detrimental when plotting the graphics and causes a dimension problem.
I tested it and this code has no problems when reading the python file.
import numpy as np
import seaborn as sns
import pandas as pd
time = np.arange(0, 200)
samples = np.random.randn(10, 200)
for sample in samples:
    ax = sns.violinplot(x=time, y=sample)
You can then group the resulting graphs using this link:
https://python-graph-gallery.com/199-matplotlib-style-sheets/
If you want to convert your data into data frames it is also possible. You just need to use pandas.
Example:
import pandas as pd
x = [1,2,3,4]
df = pd.DataFrame(x)
I am trying to do a Kernel Density Estimation (KDE) plot with seaborn and locate the median. The code looks something like this:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
sns.set_palette("hls", 1)
data = np.random.randn(30)
sns.kdeplot(data, shade=True)
# x_median, y_median = magic_function()
# plt.vlines(x_median, 0, y_median)
plt.show()
As you can see I need a magic_function() to fetch the median x and y values from the kdeplot. Then I would like to plot them with e.g. vlines. However, I can't figure out how to do that. The result should look something like this (obviously the black median bar is wrong here):
I guess my question is not strictly related to seaborn and also applies to other kinds of matplotlib plots. Any ideas are greatly appreciated.
You need to:
1. Extract the data of the KDE line
2. Integrate it to calculate the cumulative distribution function (CDF)
3. Find the value at which the CDF equals 1/2; that is the median
import numpy as np
import scipy.integrate
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_palette("hls", 1)
data = np.random.randn(30)
p=sns.kdeplot(data, shade=True)
x,y = p.get_lines()[0].get_data()
#careful with the argument order: y comes first, then x
#initial=0 prepends a 0 so the result has the same length as x
#cumulative_trapezoid is the current name of cumtrapz, which older SciPy releases used
cdf = scipy.integrate.cumulative_trapezoid(y, x, initial=0)
nearest_05 = np.abs(cdf-0.5).argmin()
x_median = x[nearest_05]
y_median = y[nearest_05]
plt.vlines(x_median, 0, y_median)
plt.show()
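Depending on the seaborn version, a filled kdeplot may not expose a line artist, in which case get_lines() comes back empty. A variant that does not rely on the plot at all is sketched below: it computes the KDE with scipy.stats.gaussian_kde, which I am assuming is close enough to what seaborn draws, and then follows the same three steps. The padding of three standard deviations around the data is an arbitrary choice of mine.

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=30)

# 1. evaluate the KDE on a grid that extends well beyond the data
kde = gaussian_kde(data)
pad = 3 * data.std()
x = np.linspace(data.min() - pad, data.max() + pad, 1000)
y = kde(x)

# 2. integrate the density to get the CDF (initial=0 keeps the length of x)
cdf = cumulative_trapezoid(y, x, initial=0)

# 3. the median is the grid point where the CDF is closest to 1/2
idx = np.abs(cdf - 0.5).argmin()
x_median, y_median = x[idx], y[idx]
print(x_median, y_median)
```

x_median and y_median can be passed to plt.vlines(x_median, 0, y_median) exactly as in the code above.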