I am trying to plot residuals on a linear regression plot. It works, with one caveat: there is an unpleasant-looking overlap between the residuals and the data points. Is there a way to tell matplotlib to plot the residuals first, followed by the Seaborn plot? I tried changing the order of the code, but it didn't help.
import numpy as np
import pandas as pd
import seaborn as sns
from pylab import *
from sklearn.linear_model import LinearRegression
x = np.array([1, 2, 3, 4, 5, 7, 8, 9, 10])
y = np.array([-3, 0, 4, 5, 9, 5, 7, 7, 12])
dat = pd.DataFrame({'x': x, 'y': y})
x = x.reshape(-1,1)
y = y.reshape(-1,1)
linear_model = LinearRegression()
linear_model.fit(X=x, y=y)
pred = linear_model.predict(x)
for ix in range(len(x)):
    plot([x[ix], x[ix]], [pred[ix], y[ix]], '#C9B97D')
g = sns.regplot(x='x', y='y', data=dat, ci=None, fit_reg=True)
sns.set(font_scale=1.1)
g.figure.set_size_inches(6, 6)
sns.set_style('ticks')
sns.despine()
The argument you are looking for is zorder. This allows you to control which object appears on top in your figure.
For regplot you have to use the argument scatter_kws, a dictionary of keyword arguments that is passed on to plt.scatter, which regplot uses under the hood.
Your sns.regplot becomes:
g = sns.regplot(x='x', y='y', data=dat, ci=None, fit_reg=True,
                scatter_kws={"zorder": 10, "alpha": 1})
Note that I've set alpha to 1 so that the markers are not transparent.
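The same idea also works from the other side: give the residual lines a low zorder instead of raising the markers. The sketch below is only an illustration. It uses plain matplotlib and np.polyfit in place of seaborn and scikit-learn (my substitution, to keep it self-contained), and the zorder values 1, 2 and 10 are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.array([1, 2, 3, 4, 5, 7, 8, 9, 10])
y = np.array([-3, 0, 4, 5, 9, 5, 7, 7, 12])

# Least-squares line (np.polyfit stands in for LinearRegression here).
slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept

fig, ax = plt.subplots(figsize=(6, 6))
# Residuals: vertical segments between fitted value and observation,
# drawn with a low zorder so they end up underneath the markers.
residuals = ax.vlines(x, pred, y, colors='#C9B97D', zorder=1)
fit_line, = ax.plot(x, pred, color='C0', zorder=2)
points = ax.scatter(x, y, color='C0', zorder=10)
# Call plt.show() or fig.savefig(...) to see the result.
```

Artists with a higher zorder are drawn later, so the order of the plotting calls no longer matters.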
I want to create a heatmap in seaborn, and have a nice way to see the labels.
With ax.figure.tight_layout(), I am getting a plot (figure omitted) which is obviously bad.
Without ax.figure.tight_layout(), the labels get cropped.
The code is
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn
n_classes = 10
confusion = np.random.randint(low=0, high=100, size=(n_classes, n_classes))
label_length = 20
label_ind_by_names = {
"A"*label_length: 0,
"B"*label_length: 1,
"C"*label_length: 2,
"D"*label_length: 3,
"E"*label_length: 4,
"F"*label_length: 5,
"G"*label_length: 6,
"H"*label_length: 7,
"I"*label_length: 8,
"J"*label_length: 9,
}
# confusion matrix
df_cm = pd.DataFrame(
confusion,
index=label_ind_by_names.keys(),
columns=label_ind_by_names.keys()
)
plt.figure()
sn.set(font_scale=1.2)
ax = sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}, fmt='d')
# ax.figure.tight_layout()
plt.show()
I would like to create an extra legend based on label_ind_by_names, then post an abbreviation on the heatmap itself, and be able to look up the abbreviation in the legend.
How can this be done in seaborn?
You can define your own legend handler, e.g. for integers:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn
n_classes = 10
confusion = np.random.randint(low=0, high=100, size=(n_classes, n_classes))
label_length = 20
label_ind_by_names = {
"A"*label_length: 0,
"B"*label_length: 1,
"C"*label_length: 2,
"D"*label_length: 3,
"E"*label_length: 4,
"F"*label_length: 5,
"G"*label_length: 6,
"H"*label_length: 7,
"I"*label_length: 8,
"J"*label_length: 9,
}
# confusion matrix
df_cm = pd.DataFrame(
confusion,
index=label_ind_by_names.values(),
columns=label_ind_by_names.values()
)
fig, ax = plt.subplots(figsize=(10, 5))
fig.subplots_adjust(left=0.05, right=.65)
sn.set(font_scale=1.2)
sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}, fmt='d', ax=ax)
class IntHandler:
    def legend_artist(self, legend, orig_handle, fontsize, handlebox):
        x0, y0 = handlebox.xdescent, handlebox.ydescent
        text = plt.matplotlib.text.Text(x0, y0, str(orig_handle))
        handlebox.add_artist(text)
        return text
ax.legend(label_ind_by_names.values(),
label_ind_by_names.keys(),
handler_map={int: IntHandler()},
loc='upper left',
bbox_to_anchor=(1.2, 1))
plt.show()
Explanation of the hard-coded figures: the first two (0.05 and 0.65) are the left and right edges of the Axes as fractions of the figure width (0.05 = 5 % of the figure width, etc.). 1.2 and 1 give the location of the upper-left corner of the legend box relative to the Axes ((1, 1) is the upper-right corner of the Axes; we add 0.2 to 1 to account for the space used by the colorbar). Ideally one would use a constrained layout instead of fiddling with the parameters, but it doesn't (yet) support figure legends, and with an Axes legend it places the legend between the Axes and the colorbar.
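To make those numbers concrete, here is a minimal sketch on an empty Axes without the heatmap or colorbar; the demo line and its label are made up. It reads the Axes position back after subplots_adjust and places a legend with the same loc and bbox_to_anchor values.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 5))
# The Axes now spans from 5 % to 65 % of the figure width.
fig.subplots_adjust(left=0.05, right=0.65)
ax.plot([0, 1], [0, 1], label='demo line')

# loc names the corner of the legend box that gets pinned;
# bbox_to_anchor says where, in Axes coordinates, that corner goes.
legend = ax.legend(loc='upper left', bbox_to_anchor=(1.2, 1))

box = ax.get_position()  # Axes extent in figure fractions
left_edge, right_edge = box.x0, box.x1
```

Since bbox_to_anchor is expressed in Axes coordinates, x = 1.2 lies 20 % of the Axes width to the right of the Axes' right edge.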
This is a follow-up to my previous couple of questions. Here's the code I'm playing with:
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
dictOne = {'Name':['First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh', 'Eighth', 'Ninth'],
"A":[1, 2, -3, 4, 5, np.nan, 7, np.nan, 9],
"B":[4, 5, 6, 5, 3, np.nan, 2, 9, 5],
"C":[7, np.nan, 10, 5, 8, 6, 8, 2, 4]}
df2 = pd.DataFrame(dictOne)
column = 'B'
df2[df2[column] > -999].hist(column, alpha = 0.5)
param = stats.norm.fit(df2[column].dropna()) # Fit a normal distribution to the data
print(param)
pdf_fitted = stats.norm.pdf(df2[column], *param)
plt.plot(pdf_fitted, color = 'r')
I'm trying to make a histogram of the numbers in a single column in the dataframe -- I can do this -- but with an overlaid normal curve...something like the last graph on here. I'm trying to get it working on this toy example so that I can apply it to my much larger dataset for real. The code I've pasted above gives me this graph:
Why doesn't pdf_fitted match the data in this graph? How can I overlay the proper PDF?
You should plot the histogram with density=True if you hope to compare it to a true PDF. Otherwise your normalization (amplitude) will be off.
Also, you need to specify the x-values (as an ordered array) when you plot the pdf:
fig, ax = plt.subplots()
df2[df2[column] > -999].hist(column, alpha = 0.5, density=True, ax=ax)
param = stats.norm.fit(df2[column].dropna())
x = np.linspace(*df2[column].agg(['min', 'max']), 100) # x-values
plt.plot(x, stats.norm.pdf(x, *param), color = 'r')
plt.show()
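The effect of density=True can be checked numerically. The following sketch uses np.histogram directly on the non-missing values of column B, with 10 bins to match the pandas default: the density histogram has area 1, like a PDF, while the area of the count histogram depends on sample size and bin width.

```python
import numpy as np

# Column B from the question, with the NaN dropped.
data = np.array([4, 5, 6, 5, 3, 2, 9, 5], dtype=float)

counts, edges = np.histogram(data, bins=10)
density, _ = np.histogram(data, bins=10, density=True)
widths = np.diff(edges)

area_counts = np.sum(counts * widths)    # n * bin_width, here 8 * 0.7
area_density = np.sum(density * widths)  # 1 by construction
```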
As an aside, using a histogram to compare a continuous variable with a distribution isn't always the best approach. (Your sample data are discrete, but the link uses a continuous variable.) The choice of bins can alias the shape of your histogram, which may lead to incorrect inference. Instead, the ECDF is a much better (choice-free) illustration of the distribution of a continuous variable:
def ECDF(data):
    n = sum(data.notnull())
    x = np.sort(data.dropna())
    y = np.arange(1, n+1) / n
    return x,y
fig, ax = plt.subplots()
plt.plot(*ECDF(df2.loc[df2[column] > -999, 'B']), marker='o')
param = stats.norm.fit(df2[column].dropna())
x = np.linspace(*df2[column].agg(['min', 'max']), 100) # x-values
plt.plot(x, stats.norm.cdf(x, *param), color = 'r')
plt.show()
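As a small worked check of what such an ECDF helper returns, here is a sketch on a made-up five-element series (the names ecdf and sample are illustrative). Missing values are dropped, the remaining values are sorted, and the k-th sorted value is paired with k/n.

```python
import numpy as np
import pandas as pd

def ecdf(data):
    """Return sorted non-missing values and their cumulative fractions."""
    values = np.sort(data.dropna().to_numpy())
    n = len(values)
    return values, np.arange(1, n + 1) / n

sample = pd.Series([3.0, np.nan, 1.0, 2.0, 2.0])
xs, ys = ecdf(sample)
# xs holds 1, 2, 2, 3 and ys holds 0.25, 0.5, 0.75, 1.0
```

For tied values such as the repeated 2, the larger of the paired fractions is the ECDF value at that point.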
I have a problem similar to this one, or rather to its answer: RBF interpolation: LinAlgError: singular matrix
But I want to estimate the probability distribution with RBF.
My code so far:
from scipy.interpolate.rbf import Rbf # radial basis functions
import cv2
import matplotlib.pyplot as plt
import numpy as np
x = [1, 1, 2 ,3, 4, 4, 2, 6, 7]
y = [0, 2, 5, 6, 2, 4, 1, 5, 2]
rbf_adj = Rbf(x, y, function='gaussian')
plt.figure()
# Plotting the original points.
plot3 = plt.plot(x, y, 'ko', markersize=12) # the original points.
plt.show()
My problem is that I only have the coordinates of the points: x, y.
But what can I use for z and d?
This is my error message:
numpy.linalg.linalg.LinAlgError: Matrix is singular.
First, here is a 1D example to emphasize the difference between Radial Basis Function interpolation and Kernel Density Estimation of a probability distribution:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
from scipy.interpolate import Rbf # radial basis functions
from scipy.stats import gaussian_kde
coords = np.linspace(0, 2, 7)
values = np.ones_like(coords)
x_fine = np.linspace(-1, 3, 101)
rbf_interpolation = Rbf(coords, values, function='gaussian')
interpolated_y = rbf_interpolation(x_fine)
kernel_density_estimation = gaussian_kde(coords)
plt.figure()
plt.plot(coords, values, 'ko', markersize=12)
plt.plot(x_fine, interpolated_y, '-r', label='RBF Gaussian interpolation')
plt.plot(x_fine, kernel_density_estimation(x_fine), '-b', label='kernel density estimation')
plt.legend(); plt.xlabel('x')
plt.show()
And this is the 2D interpolation using a Gaussian RBF for the provided data, with the values arbitrarily set to z=1:
from scipy.interpolate import Rbf # radial basis functions
import matplotlib.pyplot as plt
import numpy as np
x = [1, 1, 2 ,3, 4, 4, 2, 6, 7]
y = [0, 2, 5, 6, 2, 4, 1, 5, 2]
z = [1]*len(x)
rbf_adj = Rbf(x, y, z, function='gaussian')
x_fine = np.linspace(0, 8, 81)
y_fine = np.linspace(0, 8, 82)
x_grid, y_grid = np.meshgrid(x_fine, y_fine)
z_grid = rbf_adj(x_grid.ravel(), y_grid.ravel()).reshape(x_grid.shape)
plt.pcolor(x_fine, y_fine, z_grid);
plt.plot(x, y, 'ok');
plt.xlabel('x'); plt.ylabel('y'); plt.colorbar();
plt.title('RBF Gaussian interpolation');
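If the goal is a probability distribution rather than an interpolated surface, the 2D counterpart of the kernel density estimate from the 1D example might look like the sketch below. The grid limits are my own choice: they are deliberately wide so that a Riemann sum can confirm the estimate integrates to roughly 1. The bandwidth is whatever gaussian_kde picks by default.

```python
import numpy as np
from scipy.stats import gaussian_kde

x = [1, 1, 2, 3, 4, 4, 2, 6, 7]
y = [0, 2, 5, 6, 2, 4, 1, 5, 2]

# gaussian_kde expects an array of shape (n_dimensions, n_points).
kde = gaussian_kde(np.vstack([x, y]))

# Assumed grid limits, chosen generously around the data.
x_fine = np.linspace(-8, 16, 121)
y_fine = np.linspace(-9, 15, 121)
x_grid, y_grid = np.meshgrid(x_fine, y_fine)
density = kde(np.vstack([x_grid.ravel(), y_grid.ravel()])).reshape(x_grid.shape)

# Riemann-sum check that the estimate behaves like a probability density.
cell_area = (x_fine[1] - x_fine[0]) * (y_fine[1] - y_fine[0])
total_mass = density.sum() * cell_area
```

The density array can be drawn with plt.pcolor(x_fine, y_fine, density), in the same way as z_grid above. No z values are needed, because the estimate is built from the point locations alone.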
I've written a function that reads data from a csv file and plots it. Now I need to add a subplot with another part of the data from the same file, so I've tried to write a function that calls the first function and adds a subplot. When I do this, I get the two to show up as different figures. How can I suppress this and make both of them show in the same figure?
Here is a mockup of my code:
def timex(h_ratio = [3, 1]):
    import matplotlib.pyplot as plt
    import numpy as np
    import matplotlib.gridspec as gridspec
    total_height = h_ratio[0] + h_ratio[1]
    gs = gridspec.GridSpec(total_height, 1)
    time = [1, 2, 3, 4, 5]
    x = [1, 2, 3, 4, 5]
    y = [1, 1, 1, 1, 1]
    ax1 = plt.subplot(gs[:h_ratio[0], :])
    plt.plot(time, x)
    plot = plt.gcf
    plt.show()
    return time, x, y, plot, gs, h_ratio

def timeyx():
    import matplotlib.pyplot as plt
    import matplotlib.gridspec as gridspec
    time, x, y, plot, gs, h_ratio = timex(h_ratio = [3, 1])
    ax2 = plt.subplot(gs[h_ratio[1], :])
    plt.plot(time, y)
    plt.show()

timeyx()
I realize that I have two plt.show() statements, but if I remove one that figure will not show at all.
I am not sure whether you need to use matplotlib.gridspec specifically or not, but you can use subplot2grid to make the job easy.
import matplotlib.pyplot as plt
def timex():
    time = [1, 2, 3, 4, 5]
    x = [1, 2, 3, 4, 5]
    y = [1, 1, 1, 1, 1]
    ax1 = plt.subplot2grid((1,2), (0,0))
    ax1.plot(time, x)
    return time, x, y

def timeyx():
    time, x, y = timex()
    ax2 = plt.subplot2grid((1,2), (0,1))
    ax2.plot(time, y)

timeyx()
plt.show()
This produces a single figure with two subplots.
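Another possible arrangement, sketched here with hypothetical function names (plot_x, plot_x_and_y), is to create the figure and both Axes once and pass the Axes into the plotting function. This keeps the 3:1 height ratio from the question and avoids relying on pyplot's notion of the current figure.

```python
import matplotlib.pyplot as plt

def plot_x(ax):
    """Draw the first series on the given Axes and return the shared data."""
    time = [1, 2, 3, 4, 5]
    x = [1, 2, 3, 4, 5]
    y = [1, 1, 1, 1, 1]
    ax.plot(time, x)
    return time, y

def plot_x_and_y(height_ratios=(3, 1)):
    """Create one figure with two stacked Axes and fill both."""
    fig, (ax_top, ax_bottom) = plt.subplots(
        2, 1, sharex=True, gridspec_kw={'height_ratios': height_ratios})
    time, y = plot_x(ax_top)
    ax_bottom.plot(time, y)
    return fig

fig = plot_x_and_y()
# A single plt.show() (or fig.savefig(...)) at the very end displays both panels.
```

Because nothing is shown until the end, both panels stay in the same figure.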
I have an algorithm that can be controlled by two parameters so now I want to plot the runtime of the algorithm depending on these parameters.
My Code:
from matplotlib import pyplot
import pylab
from mpl_toolkits.mplot3d import Axes3D
fig = pylab.figure()
ax = Axes3D(fig)
sequence_containing_x_vals = [5,5,5,5,10,10,10,10,15,15,15,15,20,20,20,20]
sequence_containing_y_vals = [1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4]
sequence_containing_z_vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
ax.scatter(sequence_containing_x_vals, sequence_containing_y_vals, sequence_containing_z_vals)
pyplot.show()
This will plot all the points in the space but I want them connected and have something like this:
(The coloring would be nice but not necessary)
To plot the surface you need to use plot_surface, and have the data as a regular 2D array (that reflects the 2D geometry of the x-y plane). Usually meshgrid is used for this, but since your data already has the x and y values repeated appropriately, you just need to reshape them. I did this with numpy reshape.
from matplotlib import pyplot, cm
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
fig = pyplot.figure()
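ax = fig.add_subplot(projection='3d')  # Axes3D(fig) is no longer added to the figure automatically in recent matplotlib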
sequence_containing_x_vals = np.array([5,5,5,5,10,10,10,10,15,15,15,15,20,20,20,20])
X = sequence_containing_x_vals.reshape((4,4))
sequence_containing_y_vals = np.array([1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4])
Y = sequence_containing_y_vals.reshape((4,4))
sequence_containing_z_vals = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
Z = sequence_containing_z_vals.reshape((4,4))
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.hot)
pyplot.show()
Note that Y, X = np.meshgrid([1, 2, 3, 4], [5, 10, 15, 20]) gives the same X and Y as above, but more easily.
Of course, the surface shown here is just a plane, since your data is consistent with z = 0.8x + y - 4, but this method will work with generic surfaces, as can be seen in the many matplotlib surface examples.
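The reshaping step can be checked numerically without any plotting. In this sketch, indexing='ij' is my choice so that meshgrid returns the arrays directly in X, Y order. The last line confirms that the sample z values lie on the plane z = 0.8x + y - 4.

```python
import numpy as np

x_vals = np.array([5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 15, 20, 20, 20, 20])
y_vals = np.array([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])
z_vals = np.arange(1, 17)

# Reshape the flat sequences into the 4x4 grids that plot_surface expects.
X = x_vals.reshape((4, 4))
Y = y_vals.reshape((4, 4))
Z = z_vals.reshape((4, 4))

# The same grids built from the unique axis values.
X_grid, Y_grid = np.meshgrid([5, 10, 15, 20], [1, 2, 3, 4], indexing='ij')
same_grid = np.array_equal(X, X_grid) and np.array_equal(Y, Y_grid)

# The sample data lie on a plane.
is_plane = np.allclose(Z, 0.8 * X + Y - 4)
```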