Plotting a histogram with overlaid PDF - python

This is a follow-up to my previous couple of questions. Here's the code I'm playing with:
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
dictOne = {'Name':['First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh', 'Eighth', 'Ninth'],
"A":[1, 2, -3, 4, 5, np.nan, 7, np.nan, 9],
"B":[4, 5, 6, 5, 3, np.nan, 2, 9, 5],
"C":[7, np.nan, 10, 5, 8, 6, 8, 2, 4]}
df2 = pd.DataFrame(dictOne)
column = 'B'
df2[df2[column] > -999].hist(column, alpha = 0.5)
param = stats.norm.fit(df2[column].dropna()) # Fit a normal distribution to the data
print(param)
pdf_fitted = stats.norm.pdf(df2[column], *param)
plt.plot(pdf_fitted, color = 'r')
I'm trying to make a histogram of the numbers in a single column in the dataframe -- I can do this -- but with an overlaid normal curve...something like the last graph on here. I'm trying to get it working on this toy example so that I can apply it to my much larger dataset for real. The code I've pasted above gives me this graph:
Why doesn't pdf_fitted match the data in this graph? How can I overlay the proper PDF?

You should plot the histogram with density=True if you hope to compare it to a true PDF. Otherwise your normalization (amplitude) will be off.
Also, you need to specify the x-values (as an ordered array) when you plot the pdf:
fig, ax = plt.subplots()
df2[df2[column] > -999].hist(column, alpha = 0.5, density=True, ax=ax)
param = stats.norm.fit(df2[column].dropna())
x = np.linspace(*df2[column].agg(['min', 'max']), 100) # ordered x-values spanning the data
plt.plot(x, stats.norm.pdf(x, *param), color = 'r')
plt.show()
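To see why both points matter, here is a minimal numeric sketch that uses only NumPy and SciPy (no plotting). It assumes the values of column B from the question with the NaN removed, and the bin count of 10 is an arbitrary choice. It checks that a density-normalized histogram has total area 1, which is what puts it on the same scale as a PDF, and it evaluates the fitted PDF on an ordered grid rather than on the raw column.

```python
import numpy as np
import scipy.stats as stats

# Column "B" from the question, with the NaN already dropped.
data = np.array([4, 5, 6, 5, 3, 2, 9, 5], dtype=float)

# density=True rescales the bar heights so the total bar area is 1,
# which makes the histogram comparable to a probability density.
heights, edges = np.histogram(data, bins=10, density=True)
area = np.sum(heights * np.diff(edges))

# For the normal distribution, fit returns (mean, standard deviation).
mu, sigma = stats.norm.fit(data)

# Evaluate the fitted PDF on an ordered grid spanning the data,
# not on the unsorted column values.
x = np.linspace(data.min(), data.max(), 100)
pdf = stats.norm.pdf(x, mu, sigma)

print(f"histogram area = {area:.3f}")
print(f"fitted mean = {mu:.3f}, fitted std = {sigma:.3f}")
```

Plotting `pdf` against `x` on the same Axes as the density histogram then gives curves on matching scales.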
As an aside, a histogram isn't always the best way to compare a continuous variable with a distribution. (Your sample data are discrete, but the link uses a continuous variable.) The choice of bins can alias the shape of your histogram, which may lead to incorrect inference. Instead, the ECDF is a much better (choice-free) illustration of the distribution of a continuous variable:
def ECDF(data):
    n = sum(data.notnull())
    x = np.sort(data.dropna())
    y = np.arange(1, n+1) / n
    return x,y
fig, ax = plt.subplots()
plt.plot(*ECDF(df2.loc[df2[column] > -999, 'B']), marker='o')
param = stats.norm.fit(df2[column].dropna())
x = np.linspace(*df2[column].agg(['min', 'max']), 100) # ordered x-values spanning the data
plt.plot(x, stats.norm.cdf(x, *param), color = 'r')
plt.show()
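The ECDF idea can also be checked numerically without a plot. The sketch below is a NumPy-only variant of the function above (the name `ecdf` and the `gap` measure are illustrative, not part of the original answer). It computes the ECDF of column B and the largest vertical distance between the ECDF step tops and the fitted normal CDF. Note that this only looks at the upper end of each step, so it is a rough indicator rather than a full Kolmogorov-Smirnov statistic.

```python
import numpy as np
import scipy.stats as stats

def ecdf(values):
    """Return the sorted non-NaN values and their cumulative proportions."""
    values = np.asarray(values, dtype=float)
    x = np.sort(values[~np.isnan(values)])
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Column "B" from the question, including its NaN.
column_b = [4, 5, 6, 5, 3, np.nan, 2, 9, 5]
x, y = ecdf(column_b)

# Fit a normal distribution to the same values.
mu, sigma = stats.norm.fit(x)

# Largest vertical gap between the ECDF step tops and the fitted CDF.
gap = np.max(np.abs(y - stats.norm.cdf(x, mu, sigma)))
print(f"largest gap at the step tops = {gap:.3f}")
```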

Related

Double for loop to add multiple subplots on same figure

I am working on a clustering analysis problem. My goal is to write a double for loop that varies the number of clusters (3 different values) and cycles through the three linkage types for each number of clusters, then plots all of the subplots on the same figure.
I am hoping to achieve a 3x3 grid of subplots, where each number of clusters runs along the x-axis and each linkage type for that number of clusters runs down the y-axis.
The CSV file I am working with is simply two columns with x1 and x2 values. I excluded the code where I import and read the CSV file. The code I have so far is as follows:
X1 = input_data.X1.values
X2 = input_data.X2.values
X = np.column_stack((X1, X2))
clusters = 4
Y_Kmeans = KMeans(n_clusters = clusters)
Y_Kmeans.fit(X)
Y_Kmeans_labels = Y_Kmeans.labels_
Y_Kmeans_silhouette = metrics.silhouette_score(X, Y_Kmeans_labels, metric='sqeuclidean')
linkage_types = ['ward', 'average', 'complete']
Y_hierarchy = AgglomerativeClustering(linkage=linkage_types[0], n_clusters=clusters)
Y_hierarchy.fit(X)
Y_hierarchy_labels = Y_hierarchy.labels_
Y_hierarchy_silhouette = metrics.silhouette_score(X, Y_hierarchy_labels,
metric='sqeuclidean')
I have tried this and am not getting the desired results:
fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(15, 12))
plt.subplots_adjust(hspace=0.5)
cluster = [4, 7, 10]
link = [0, 1, 2]
for i in cluster:
    for j in link:
        plt.scatter(X[:, 0], X[:, 1], c=colormap[Y_hierarchy_labels])
This is the output:
I see two problems:
you have to make calculations inside for-loops - and use i,j in KMeans(n_clusters=i) and AgglomerativeClustering(linkage=linkage_types[j], n_clusters=i)
you have to enumerate() cluster and link in for-loops to get ax = axs[number_cluster, number_link] and draw ax.scatter()
Minimal working code with random data.
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)
fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(15, 12))
plt.subplots_adjust(hspace=0.5)
cluster = [4, 7, 10]
link = [0, 1, 2]
for number_cluster, i in enumerate(cluster):
    # Y_Kmeans = KMeans(n_clusters=i)
    # ... code ...
    for number_link, j in enumerate(link):
        # Y_hierarchy = AgglomerativeClustering(linkage=linkage_types[j], n_clusters=i)
        # ... code ...
        X = np.random.rand(3+j, 3+i)  # random stand-in data
        print(X[:, 0], X[:, 1])
        ax = axs[number_cluster, number_link]  # pick the subplot for this combination
        ax.scatter(X[:, 0], X[:, 1])
        ax.set_title(f'cluster: {i}, link: {j}')
plt.show()
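Putting both points together with real clustering calls might look like the sketch below. It assumes scikit-learn is installed and uses random points as a stand-in for the two CSV columns. The names `cluster_counts` and `results` are illustrative. Linkage types run down the rows and cluster counts across the columns, as the question asks. The `Agg` backend line is only there so the sketch runs without a display and can be dropped for interactive use.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.random((60, 2))  # stand-in for the x1/x2 columns of the CSV

cluster_counts = [4, 7, 10]
linkage_types = ["ward", "average", "complete"]

fig, axs = plt.subplots(nrows=3, ncols=3, figsize=(15, 12))
fig.subplots_adjust(hspace=0.5)

results = {}
for row, linkage in enumerate(linkage_types):
    for col, k in enumerate(cluster_counts):
        # The clustering is recomputed inside the loops for every combination.
        model = AgglomerativeClustering(linkage=linkage, n_clusters=k)
        labels = model.fit_predict(X)
        results[(linkage, k)] = labels

        # Draw on the Axes that belongs to this (row, col) cell.
        ax = axs[row, col]
        ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10")
        ax.set_title(f"linkage: {linkage}, clusters: {k}")

# Call plt.show() or fig.savefig(...) here to view the grid.
```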

How to map heatmap tick labels to a value and add those values as a legend

I want to create a heatmap in seaborn, and have a nice way to see the labels.
With ax.figure.tight_layout(), I am getting
which is obviously bad.
Without ax.figure.tight_layout(), the labels get cropped.
The code is
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn
n_classes = 10
confusion = np.random.randint(low=0, high=100, size=(n_classes, n_classes))
label_length = 20
label_ind_by_names = {
"A"*label_length: 0,
"B"*label_length: 1,
"C"*label_length: 2,
"D"*label_length: 3,
"E"*label_length: 4,
"F"*label_length: 5,
"G"*label_length: 6,
"H"*label_length: 7,
"I"*label_length: 8,
"J"*label_length: 9,
}
# confusion matrix
df_cm = pd.DataFrame(
confusion,
index=label_ind_by_names.keys(),
columns=label_ind_by_names.keys()
)
plt.figure()
sn.set(font_scale=1.2)
ax = sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}, fmt='d')
# ax.figure.tight_layout()
plt.show()
I would like to create an extra legend based on label_ind_by_names, then post an abbreviation on the heatmap itself, and be able to look up the abbreviation in the legend.
How can this be done in seaborn?
You can define your own legend handler, e.g. for integers:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn
n_classes = 10
confusion = np.random.randint(low=0, high=100, size=(n_classes, n_classes))
label_length = 20
label_ind_by_names = {
"A"*label_length: 0,
"B"*label_length: 1,
"C"*label_length: 2,
"D"*label_length: 3,
"E"*label_length: 4,
"F"*label_length: 5,
"G"*label_length: 6,
"H"*label_length: 7,
"I"*label_length: 8,
"J"*label_length: 9,
}
# confusion matrix
df_cm = pd.DataFrame(
confusion,
index=label_ind_by_names.values(),
columns=label_ind_by_names.values()
)
fig, ax = plt.subplots(figsize=(10, 5))
fig.subplots_adjust(left=0.05, right=.65)
sn.set(font_scale=1.2)
sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}, fmt='d', ax=ax)
class IntHandler:
    def legend_artist(self, legend, orig_handle, fontsize, handlebox):
        x0, y0 = handlebox.xdescent, handlebox.ydescent
        text = plt.matplotlib.text.Text(x0, y0, str(orig_handle))
        handlebox.add_artist(text)
        return text
ax.legend(label_ind_by_names.values(),
label_ind_by_names.keys(),
handler_map={int: IntHandler()},
loc='upper left',
bbox_to_anchor=(1.2, 1))
plt.show()
Explanation of the hard-coded numbers: the first two (0.05 and 0.65) are the left and right edges of the Axes as fractions of the figure width (0.05 = 5 % of the figure width, and so on). The pair (1.2, 1) is the location of the upper left corner of the legend box relative to the Axes: (1, 1) is the upper right corner of the Axes, and we add 0.2 to the first coordinate to account for the space used by the colorbar. Ideally one would use a constrained layout instead of fiddling with the parameters, but it doesn't (yet) support figure legends, and with an Axes legend it places the legend between the Axes and the colorbar.
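For comparison, here is a sketch of a different way to get a similar lookup legend without a custom handler. It is not the approach above: it uses plain matplotlib (`imshow` instead of seaborn), invisible `Patch` handles, and puts the index into the label text itself. The anchor value 1.05 and the names `names` and `labels` are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
import numpy as np

rng = np.random.default_rng(0)
n_classes = 10
names = [chr(ord("A") + k) * 20 for k in range(n_classes)]  # long class names
confusion = rng.integers(0, 100, size=(n_classes, n_classes))

fig, ax = plt.subplots(figsize=(10, 5))
fig.subplots_adjust(left=0.05, right=0.65)  # leave room on the right for the legend

# Short integer tick labels on the matrix itself.
ax.imshow(confusion)
ax.set_xticks(range(n_classes))
ax.set_yticks(range(n_classes))

# One invisible handle per class; the label carries "index: long name".
handles = [Patch(alpha=0) for _ in names]
labels = [f"{k}: {name}" for k, name in enumerate(names)]
legend = ax.legend(handles, labels, loc="upper left",
                   bbox_to_anchor=(1.05, 1), handlelength=0, handletextpad=0)
```

The trade-off is that the numbers are ordinary label text rather than legend handles, so they cannot be styled separately from the names.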

Phase plot using matplotlib tricontourf

I want to plot an image of the results of a finite element simulation with a personalized colormap.
I have been trying to use tricontourf to plot it as follows:
#Z = self.phi.compute_vertex_values(self.mesh)
Z = np.mod(self.phi.compute_vertex_values(self.mesh),2*np.pi)
triang = tri.Triangulation(*self.mesh.coordinates().reshape((-1, 2)).T,
triangles=self.mesh.cells())
zMax = np.max(Z)
print(zMax)
#Colormap creation
nColors = np.max(Z)*200/(2*np.pi)
phiRange = np.linspace(0,zMax,nColors)
intensity = np.sin(phiRange)**2
intensityArray = np.array([intensity, intensity, intensity])
colors = tuple(map(tuple, intensityArray.T))
self.cm = LinearSegmentedColormap.from_list("BAM", colors, N=nColors)
#Figure creation
fig, ax = plt.subplots()
levels2 = np.linspace(0., zMax,nColors)
cax = ax.tricontourf(triang, Z,levels=levels2, cmap = self.cm) #plot of the solution
fig.colorbar(cax)
ax.triplot(triang, lw=0.5, color='yellow') #plot of the mesh
plt.savefig("yolo.png")
plt.close(fig)
And it gives the result :
As you can see, there is some trouble where the phase goes from 2pi to 0, which comes from tricontourf when there is a modulo...
My first idea for a workaround was to work directly on my phase Z. The problem is that if I do this I need to create a much larger colormap. Ultimately, the phase will be very large, and so will the colormap if I want a correct color resolution... Furthermore, I would like to have only one period in the colormap on the right (just like in the first figure).
Any idea how I could obtain a figure just like the second one, with a colormap just like the one from the first figure, and without creating a very large and expensive colormap?
EDIT: I have written a small piece of code that is runnable out of the box. It reproduces the problem I have, and I have also tried to apply Thomas Kuhn's answer to my problem. However, there seems to be a problem with the colorbar... Any idea how I could fix this?
import matplotlib.pyplot as plt
import matplotlib.tri as mtri
import numpy as np
import matplotlib.colors as colors
class PeriodicNormalize(colors.Normalize):
    def __init__(self, vmin=None, vmax=None, clip=False):
        colors.Normalize.__init__(self, vmin, vmax, clip)
    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.vmax], [0, 1]
        return np.ma.masked_array(np.interp(
            np.mod(value-self.vmin, self.vmax-self.vmin),x,y
        ))
# Create triangulation.
x = np.asarray([0, 1, 2, 3, 0.5, 1.5, 2.5, 1, 2, 1.5])
y = np.asarray([0, 0, 0, 0, 1.0, 1.0, 1.0, 2, 2, 3.0])
triangles = [[0, 1, 4], [1, 2, 5], [2, 3, 6], [1, 5, 4], [2, 6, 5], [4, 5, 7],
[5, 6, 8], [5, 8, 7], [7, 8, 9]]
triang = mtri.Triangulation(x, y, triangles)
cm = colors.LinearSegmentedColormap.from_list('test', ['k','w','k'], N=1000)
#Figure 1 : modulo is applied on the data :
#Results : problem with the interpolation, but the colorbar is fine
z = np.mod(10*x,2*np.pi)
zMax = np.max(z)
levels = np.linspace(0., zMax,100)
fig1, ax1 = plt.subplots()
cax1=ax1.tricontourf(triang, z,cmap = cm,levels= levels)
fig1.colorbar(cax1)
plt.show()
#Figure 2 : We use the norm parameter with a custom norm that does the modulo
#Results : the graph is the way it should be but the colormap is messed up
z = 10*x
zMax = np.max(z)
levels = np.linspace(0., zMax,100)
fig2, ax2 = plt.subplots()
cax2=ax2.tricontourf(triang, z,levels= levels,norm = PeriodicNormalize(0, 2*np.pi),cmap = cm)
fig2.colorbar(cax2)
plt.show()
Last solution would be to do as I did above : to create a much larger colormap that goes up to zmax and is periodic every 2 pi. However the colorbar would not be nice...
here are the results :
I'm guessing that your problem arises from applying the modulo to your data before you call tricontourf (which, I guess, does some interpolation on your data and then maps that interpolated data to a colormap). Instead, you can pass a norm to your tricontourf function. By writing a small class following this tutorial, you can make the norm take care of the modulo of your data. As your code is not runnable as such, I came up with a somewhat simpler example. Hopefully this is applicable to your problem:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors as colors
class PeriodicNormalize(colors.Normalize):
    def __init__(self, vmin=None, vmax=None, clip=False):
        colors.Normalize.__init__(self, vmin, vmax, clip)
    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.vmax], [0, 1]
        return np.ma.masked_array(np.interp(
            np.mod(value-self.vmin, self.vmax-self.vmin),x,y
        ))
fig,ax = plt.subplots()
x,y = np.meshgrid(
np.linspace(0, 1, 1000),
np.linspace(0, 1, 1000),
)
z = x*10*np.pi
cm = colors.LinearSegmentedColormap.from_list('test', ['k','w','k'], N=1000)
ax.pcolormesh(x,y,z,norm = PeriodicNormalize(0, 2*np.pi), cmap = cm)
plt.show()
The result looks like this:
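A quick numeric check makes the behaviour of such a norm concrete. The sketch below is a slight variation of the class above: it divides by the period directly instead of calling np.interp, which gives the same result when vmin is 0. Phases that differ by a whole number of periods should map to the same position in the colormap.

```python
import numpy as np
import matplotlib.colors as colors

class PeriodicNormalize(colors.Normalize):
    """Map values to [0, 1) after wrapping them into one period."""
    def __call__(self, value, clip=None):
        period = self.vmax - self.vmin
        wrapped = np.mod(np.asarray(value, dtype=float) - self.vmin, period)
        return np.ma.masked_array(wrapped / period)

norm = PeriodicNormalize(0, 2 * np.pi)

# Three phases that differ by whole periods.
phases = np.array([0.5 * np.pi, 2.5 * np.pi, 4.5 * np.pi])
mapped = norm(phases)
print(mapped)  # all three values land a quarter of the way through the colormap
```

Because the wrapping happens inside the norm, the plotting function interpolates the unwrapped data and the jump from 2pi to 0 never enters the interpolation.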
EDIT:
As the ContourSet you get back from tricontourf spans the full phase, not just the first [0,2pi], the colorbar is created for that full range, which is why you see the colormap repeat itself many times. I'm not quite sure I understand how the ticks are created, but I'm guessing that it would be quite some work to get that automated to work right. Instead, I suggest generating a colorbar "by hand", as is done in this tutorial. This, however, requires that you create the axes (cax) where the colorbar is put yourself. Luckily there is a function called matplotlib.colorbar.make_axes() that does this for you (all thanks go to this answer). So, instead of your original colorbar command, use the following lines:
import matplotlib.colorbar as mcbar  # provides make_axes and ColorbarBase
norm = PeriodicNormalize(0, 2*np.pi)  # assumed: the same norm that is passed to tricontourf
cax,kw = mcbar.make_axes([ax2], location = 'right')
cb1 = mcbar.ColorbarBase(cax, cmap = cm, norm = norm, orientation='vertical')
To get this picture:
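On more recent matplotlib versions, a similar one-period colorbar can be sketched with a standalone `ScalarMappable` instead of `ColorbarBase`. This is a variation on the approach above, not the original code: it assumes a matplotlib version in which `fig.colorbar` accepts a mappable that is not attached to any Axes, and it gives the colorbar a plain `Normalize(0, 2*pi)` so that it covers exactly one period. The name `one_period` is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import matplotlib.tri as mtri
import matplotlib.colors as colors
from matplotlib.cm import ScalarMappable
import numpy as np

class PeriodicNormalize(colors.Normalize):
    """Wrap values into one period before scaling them to [0, 1)."""
    def __call__(self, value, clip=None):
        period = self.vmax - self.vmin
        wrapped = np.mod(np.asarray(value, dtype=float) - self.vmin, period)
        return np.ma.masked_array(wrapped / period)

# Same small triangulation as in the question.
x = np.asarray([0, 1, 2, 3, 0.5, 1.5, 2.5, 1, 2, 1.5])
y = np.asarray([0, 0, 0, 0, 1.0, 1.0, 1.0, 2, 2, 3.0])
triangles = [[0, 1, 4], [1, 2, 5], [2, 3, 6], [1, 5, 4], [2, 6, 5],
             [4, 5, 7], [5, 6, 8], [5, 8, 7], [7, 8, 9]]
triang = mtri.Triangulation(x, y, triangles)
cm = colors.LinearSegmentedColormap.from_list("test", ["k", "w", "k"], N=1000)

z = 10 * x  # unwrapped phase
levels = np.linspace(0.0, z.max(), 100)

fig, ax = plt.subplots()
# The periodic norm handles the modulo, so the data stay unwrapped.
ax.tricontourf(triang, z, levels=levels,
               norm=PeriodicNormalize(0, 2 * np.pi), cmap=cm)

# The colorbar is built from a separate mappable that spans one period only.
one_period = ScalarMappable(norm=colors.Normalize(0, 2 * np.pi), cmap=cm)
cbar = fig.colorbar(one_period, ax=ax)
```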

Control order of plotting on Seaborn plot in Python

I am trying to plot residuals on a linear regression plot. It works, with one caveat: there is an unpleasant-looking overlap between the residuals and the data points. Is there a way to tell matplotlib to plot the residuals first, followed by the Seaborn plot? I tried changing the order of the code, but it didn't help.
import numpy as np
import pandas as pd
import seaborn as sns
from pylab import *
from sklearn.linear_model import LinearRegression
x = np.array([1, 2, 3, 4, 5, 7, 8, 9, 10])
y = np.array([-3, 0, 4, 5, 9, 5, 7, 7, 12])
dat = pd.DataFrame({'x': x, 'y': y})
x = x.reshape(-1,1)
y = y.reshape(-1,1)
linear_model = LinearRegression()
linear_model.fit(X=x, y=y)
pred = linear_model.predict(x)
for ix in range(len(x)):
    plot([x[ix], x[ix]], [pred[ix], y[ix]], '#C9B97D')
g = sns.regplot(x='x', y='y', data=dat, ci=None, fit_reg=True)
sns.set(font_scale=1.1)
g.figure.set_size_inches(6, 6)
sns.set_style('ticks')
sns.despine()
The argument you are looking for is zorder. This allows you to control which object appears on top in your figure.
For regplot you have to use the argument scatter_kws which is a dictionary of arguments to be passed to plt.scatter which is used under the hood.
Your sns.regplot becomes:
g = sns.regplot(x='x', y='y', data=dat, ci=None, fit_reg=True,
scatter_kws={"zorder":10, "alpha":1})
Note that I've set alpha to 1 so that the markers are not transparent; otherwise the residual lines would still show through them.
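The effect of zorder can be seen in plain matplotlib as well. The sketch below reuses the data from the question but, as an assumption to keep it free of seaborn and scikit-learn, fits the line with np.polyfit and draws the residuals with `ax.vlines`. Artists with a higher zorder are drawn on top, regardless of the order of the plotting calls.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5, 7, 8, 9, 10], dtype=float)
y = np.array([-3, 0, 4, 5, 9, 5, 7, 7, 12], dtype=float)

# Least-squares straight line through the points.
slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept

fig, ax = plt.subplots(figsize=(6, 6))

# The markers are drawn first in code but end up on top because of zorder.
points = ax.scatter(x, y, zorder=10, alpha=1)
(fit_line,) = ax.plot(x, pred, zorder=2)
residual_lines = ax.vlines(x, pred, y, color="#C9B97D", zorder=1)
```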

matplotlib - How to plot a graph with uneven intervals of 2^n?

I have 2 lists, each has 128 elements
x = [1,2,3,...,128]
y = [y1,y2,...,y128]
How should I use matplotlib to plot (x,y) with x axis appearing as shown in this screenshot?
To replicate the graph, I have (1) created 2 additional lists from the original lists, and (2) used set_xticklabels:
f, ax1 = plt.subplots(1,1,figsize=(16,7))
x1 = [1, 2, 4, 8, 16, 32, 64, 128]
y1 = [y[0],y[1],y[3],y[7],y[15],y[31],y[63],y[127]]
line1 = ax1.plot(x1,y1,label="Performance",color='b',linestyle="-")
ax1.set_xticklabels([0,1,2,4,8,16,32,64,128])
ax1.set_xlabel('Time Period',fontsize=15)
ax1.set_ylabel("Value",color='b',fontsize=15)
The problem with this approach is that only 8 pairs of values are plotted, and 120 pairs are omitted.
If my comments aren't clear enough, please, ask. :)
from matplotlib import pyplot as plt
# Instantiating my lists...
f = lambda x:x**2
x = [nb for nb in range(1, 129)]
y = [f(nb) for nb in x]
# New values you want to plot, with linear spacing.
indexes_to_keep = [1, 2, 4, 8, 16, 32, 64, 128]
y_to_use = [y[nb - 1] for nb in indexes_to_keep]
# First plot that shows the 128 points as a whole.
fig = plt.figure(figsize=(10, 5.4))
ax1 = fig.add_subplot(121)
ax1.plot(x, y)
ax1.set_title('Former values')
# Second plot that shows only the indexes you wish to keep.
ax2 = fig.add_subplot(122)
# my_ticks = [0, 1, 2, 3, 4, 5, 6, 7]
# meaning : my_ticks will be linear values.
my_ticks = [i for i in range(len(indexes_to_keep))]
# We set the ticks we want to show, meaning : all our list
# instead of some linear spacing matplotlib will show by default
ax2.set_xticks(my_ticks)
# Then, we manually change the name of the X ticks.
ax2.set_xticklabels(indexes_to_keep)
# We will then, plot the LINEAR x axis,
# but with respect to the y-axis values pre-processed.
ax2.plot(my_ticks, y_to_use)
ax2.set_title('New selected values with linear spacing')
plt.show()
Showing...
What you are looking for is a logarithmic scale with base 2. matplotlib provides logarithmic scales and you can define any base you want:
from matplotlib import pyplot as plt
from matplotlib.ticker import ScalarFormatter
#sample data
x = list(range(1, 130))
y = list(range(3, 260, 2))
f, ax1 = plt.subplots(1,1,figsize=(16,7))
x1 = [ 1, 2, 4, 8, 16, 32, 64, 128]
y1 = [y[0],y[1],y[3],y[7],y[15],y[31],y[63],y[127]]
#just the points, where the ticks are
ax1.plot(x1, y1,"bo-", label = "Performance")
#all other points to contrast this
ax1.plot(x, [270 - i for i in y], "rx-", label = "anti-Performance")
#transform x axis into logarithmic scale with base 2
plt.xscale("log", base = 2)  # matplotlib >= 3.3; older versions used basex = 2
#modify x axis ticks from exponential representation to float
ax1.get_xaxis().set_major_formatter(ScalarFormatter())
ax1.set_xlabel('Time Period',fontsize=15)
ax1.set_ylabel("Value",color='b',fontsize=15)
plt.legend()
plt.show()
Output:
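Applied to the original question, the log scale means all 128 points can be plotted while the ticks sit at the powers of two. The sketch below assumes matplotlib 3.3 or newer (for the `base` keyword) and uses a square-root curve as placeholder y data, since the real values are not given.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter

x = list(range(1, 129))
y = [v ** 0.5 for v in x]  # placeholder values; substitute the real y list

fig, ax = plt.subplots(figsize=(16, 7))

# Plot every point, not just the eight at the tick positions.
(line,) = ax.plot(x, y, "b-", label="Performance")

# Base-2 log scale spaces 1, 2, 4, ..., 128 evenly along the axis.
ax.set_xscale("log", base=2)
ticks = [2 ** k for k in range(8)]
ax.set_xticks(ticks)

# Show tick labels as plain numbers instead of powers of two.
ax.xaxis.set_major_formatter(ScalarFormatter())

ax.set_xlabel("Time Period", fontsize=15)
ax.set_ylabel("Value", color="b", fontsize=15)
ax.legend()
```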
