I have a dataset that looks similar to the one simulated in the code below. There are two sets of observations, one for those at X=0 and another for those at X>0.
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
X1 = np.random.normal(0, 1, 100)
X1 = X1 - np.min(X1)
Y1 = X1 + np.random.normal(0, 1, 100)
X0 = np.zeros(100)
Y0 = np.random.normal(0, 1.2, 100) + 2
X = np.concatenate((X1, X0))
Y = np.concatenate((Y1, Y0))
sns.distplot(Y0, color="orange")
plt.show()
sns.scatterplot(x=X, y=Y, hue=(X == 0), legend=False)
plt.show()
There are two plots: a histogram with KDE and a scatterplot.
I want to take the histogram with KDE, rotate it, and orient it appropriately with respect to the scatter plot. I would also like to add a trend line for each respective set of observations.
The ideal result would look something like this:
How do you do this in python, either using seaborn or matplotlib?
This can be done by combining plt.subplots with a shared y-axis (to keep the scale consistent) and seaborn plots. The trend line needs some additional computation, but np.polyfit gives a quick fit. Here is an example of how to achieve your goal, and here is a Jupyter notebook to play with.
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
# Prepare some data
np.random.seed(2020)
mean_Y1 = 0
std_Y1 = 1
size_Y1 = 100
X1 = np.random.normal(mean_Y1, std_Y1, size_Y1)
X1 = X1 - np.min(X1)
Y1 = X1 + np.random.normal(mean_Y1, std_Y1, size_Y1)
# this for computing trend line
Z = np.polyfit(X1, Y1, 1)
Y_ = np.poly1d(Z)(X1)
mean_Y0 = 2
std_Y0 = 1.2
size_Y0 = 100
X0 = np.zeros(100)
Y0 = np.random.normal(mean_Y0, std_Y0, size_Y0)
X = np.concatenate((X1, X0))
Y = np.concatenate((Y1, Y0))
# Now time for plotting
fig, axs = plt.subplots(1, 2,
                        sharey=True,
                        figsize=(10, 5),
                        gridspec_kw={'width_ratios': (1, 2)}
                        )
# control space between plots
fig.subplots_adjust(wspace=0.1)
# set the ticks for y-axis:
axs[0].yaxis.set_tick_params(left=False, labelleft=False, labelright=True)
# if you wish you can rotate xticks on the histogram with:
axs[0].xaxis.set_tick_params(rotation=90)
# plot histogram
dist = sns.distplot(Y0, color="orange", vertical=True, ax=axs[0])
# now we need to get the coordinate of the peak, we need this for mean line
line_data = dist.get_lines()[0].get_data()
max_Y0 = np.max(line_data[0])
# plotting the mean line
axs[0].plot([0, max_Y0], [mean_Y0, mean_Y0], '--', c='orange')
# inverting xaxis
axs[0].invert_xaxis()
# Plotting scatterplot (x and y are passed by keyword, which recent seaborn versions require)
sns.scatterplot(x=X, y=Y, hue=(X == 0), legend=False, ax=axs[1])
# Plotting trend line
sns.lineplot(x=X1, y=Y_, ax=axs[1])
# Plotting mean again
axs[1].plot([0, max(X1)], [mean_Y0, mean_Y0], '--', c='orange')
plt.show()
Out:
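The trend line in the example is evaluated at the unsorted X1 values, which is fine for a straight line. As a minimal sketch (the helper name fit_trend_line and the demo arrays are made up for illustration), the fit can also be sampled on an evenly spaced grid so that it can be drawn with a plain matplotlib plot call:

```python
import numpy as np


def fit_trend_line(x, y, num=50):
    """Least-squares straight line through (x, y), sampled on a sorted grid."""
    # Degree-1 polynomial fit returns (slope, intercept)
    slope, intercept = np.polyfit(x, y, 1)
    # Evenly spaced, increasing x values spanning the data range
    x_line = np.linspace(np.min(x), np.max(x), num)
    y_line = slope * x_line + intercept
    return slope, intercept, x_line, y_line


# Demo data shaped like the X1/Y1 arrays in the question (illustrative only)
rng = np.random.default_rng(2020)
x_demo = rng.normal(0, 1, 100)
x_demo = x_demo - x_demo.min()
y_demo = x_demo + rng.normal(0, 1, 100)

slope, intercept, x_line, y_line = fit_trend_line(x_demo, y_demo)
```

With this in place, axs[1].plot(x_line, y_line) could stand in for the sns.lineplot call.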
I want to plot an equation in Matplotlib, but the result differs from the one WolframAlpha gives.
This is the equation:
y = 10yt + y^2t + 20
The plot from WolframAlpha is:
But when I try to plot it in Matplotlib with this code
# Creating vectors X and Y
x = np.linspace(-2, 2, 100)
# Assuming α is 10
y = ((10*y*x)+((y**2)*x)+20)
# Create the plot
fig = plt.figure(figsize = (10, 5))
plt.plot(x, y)
The result is:
Any suggestions for modifying the code so that the plot looks like the WolframAlpha result? Thank you.
As @Him suggested in the comments, y = ((10*y*x)+((y**2)*x)+20) doesn't describe a relationship so much as make an assignment, and the fact that y appears on both sides of the equation makes this difficult.
It's not trivial to express y cleanly in terms of x, but it's relatively easy to express x in terms of y, and then graph that relationship, like so:
import numpy as np
import matplotlib.pyplot as plt
y = np.linspace(-40, 40, 2000)
x = (y-20)*(((10*y)+(y**2))**-1)
fig, ax = plt.subplots()
ax.plot(x, y, linestyle = 'None', marker = '.')
ax.set_xlim(left = -4, right = 4)
ax.grid()
ax.set_xlabel('x')
ax.set_ylabel('y')
Which produces the following result:
If you try to plot this with a line instead of points, you'll get a big discontinuity as the asymptotic limbs try to join up.
So you'd have to define the same function and evaluate it in three different ranges and plot them all so you don't get any crossovers.
import numpy as np
import matplotlib.pyplot as plt
y1 = np.linspace(-40, -10, 2000)
y2 = np.linspace(-10, 0, 2000)
y3 = np.linspace(0, 40, 2000)
x = lambda y: (y-20)*(((10*y)+(y**2))**-1)
y = np.hstack([y1, y2, y3])
fig, ax = plt.subplots()
ax.plot(x(y), y, linestyle = '-', color = 'b')
ax.set_xlim(left = -4, right = 4)
ax.grid()
ax.set_xlabel('x')
ax.set_ylabel('y')
Which produces the result you were after:
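For reference, rearranging the equation gives x(10y + y^2) = y - 20, so x = (y - 20)/(y^2 + 10y), with asymptotes at y = -10 and y = 0. Below is a hedged variant of the code above that samples each branch strictly between the asymptotes and draws each one with its own plot call. The helper names x_of_y and branch and the offset eps are assumptions of this sketch, not part of the original answer:

```python
import numpy as np
import matplotlib.pyplot as plt


def x_of_y(y):
    """x = (y - 20) / (y**2 + 10*y), rearranged from y = 10*y*x + y**2*x + 20."""
    return (y - 20) / (y**2 + 10 * y)


def branch(y_low, y_high, num=2000, eps=1e-3):
    """Sample one branch, staying eps inside the interval so the
    asymptotes at y = -10 and y = 0 are never evaluated."""
    y = np.linspace(y_low + eps, y_high - eps, num)
    return x_of_y(y), y


fig, ax = plt.subplots()
# One plot call per branch, so no line is drawn across an asymptote
for y_low, y_high in [(-40, -10), (-10, 0), (0, 40)]:
    x_b, y_b = branch(y_low, y_high)
    ax.plot(x_b, y_b, linestyle='-', color='b')
ax.set_xlim(left=-4, right=4)
ax.grid()
ax.set_xlabel('x')
ax.set_ylabel('y')
```

Because each branch is a separate line, the result does not depend on how matplotlib treats non-finite values at the branch boundaries.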
I am looking for a way to color the intervals below the curve with different colors; on the interval x < 0, I would like to fill the area under the curve with one color and on the interval x >= 0 with another color, like the following image:
This is the code for basic kde plot:
fig, (ax1) = plt.subplots(1, 1, figsize = ((plot_size + 1.5) * 1,(plot_size + 1.5)))
sns.kdeplot(data=pd.DataFrame(w_contrast, columns=['contrast']), x="contrast", ax=ax1);
ax1.set_xlabel(f"Dry Yield Posterior Contrast (kg)");
Is there a way to fill the area under the curve with different colors using seaborn?
seaborn is a high-level API for matplotlib, so the curve will have to be calculated separately; this is similar to, but simpler than, this answer.
Calculate the values for the kde curve with scipy.stats.gaussian_kde
Use matplotlib.pyplot.fill_between to fill the areas.
Use scipy.integrate.simpson to calculate the area under the curve, which will be passed to matplotlib.pyplot.annotate to annotate.
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# simpson is the current name of the function formerly imported as simps
from scipy.integrate import simpson
import numpy as np
# load sample data
df = sns.load_dataset('planets')
# create the kde model
kde = gaussian_kde(df.mass.dropna())
# plot
fig, ax = plt.subplots(figsize=(9, 6))
g = sns.kdeplot(data=df.mass, ax=ax, color='k')
# remove margins; optional
g.margins(x=0, y=0)
# get the min and max of the x-axis
xmin, xmax = g.get_xlim()
# create points between the min and max
x = np.linspace(xmin, xmax, 1000)
# calculate the y values from the model
kde_y = kde(x)
# select x values below 0
x0 = x[x < 0]
# get the len, which will be used for slicing the other arrays
x0_len = len(x0)
# slice the arrays
y0 = kde_y[:x0_len]
x1 = x[x0_len:]
y1 = kde_y[x0_len:]
# calculate the area under the curves, as a percentage
area0 = np.round(simpson(y0, x=x0) * 100, 0)
area1 = np.round(simpson(y1, x=x1) * 100, 0)
# fill the areas
g.fill_between(x=x0, y1=y0, color='r', alpha=.5)
g.fill_between(x=x1, y1=y1, color='b', alpha=.5)
# annotate
g.annotate(f'{area0:.0f}%', xy=(-1, 0.075), xytext=(10, 0.150), arrowprops=dict(arrowstyle="->", color='r', alpha=.5))
g.annotate(f'{area1:.0f}%', xy=(1, 0.05), xytext=(10, 0.125), arrowprops=dict(arrowstyle="->", color='b', alpha=.5))
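As an alternative sketch, not taken from the answer above: gaussian_kde also has an integrate_box_1d method, which returns the probability mass of the fitted KDE between two bounds, and fill_between accepts a boolean where mask, so the manual slicing can be skipped. The sample data below is synthetic and the variable names are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Synthetic stand-in for the real data (an assumption of this sketch)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.5, scale=1.0, size=500)

kde = gaussian_kde(sample)
x = np.linspace(sample.min() - 2, sample.max() + 2, 1000)
kde_y = kde(x)

# Probability mass of the fitted KDE on each side of zero
area_neg = kde.integrate_box_1d(-np.inf, 0)
area_pos = kde.integrate_box_1d(0, np.inf)

fig, ax = plt.subplots()
ax.plot(x, kde_y, color='k')
# Boolean masks select which part of the curve each call fills
ax.fill_between(x, kde_y, where=(x < 0), color='r', alpha=0.5,
                label=f'x < 0: {area_neg:.0%}')
ax.fill_between(x, kde_y, where=(x >= 0), color='b', alpha=0.5,
                label=f'x >= 0: {area_pos:.0%}')
ax.legend()
```

One caveat: with a where mask, the sliver between the last grid point below zero and the first one at or above zero is left unfilled, so a reasonably dense grid (1000 points here) keeps that gap invisible.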
Sorry, but I can't post my real data or plot, so I made a pictorial plot in MS Paint.
So I have my plot (the orange line), given as a set of X and Y values: plt.plot(data_x, data_y).
Then I added a horizontal line (the blue line) this way: plt.axvline(x=10).
Now I would like to fill the space between this line and my plot with color (ultimately, with one color where the values are below the horizontal line and a second color where they are above).
I tried plt.fill, plt.fill_between and plt.axhspan, but I get errors about either dimensionality or elements vs. sequences.
Is there an easy way to do this?
Yes, there is a where parameter of ax.fill_between for doing this:
import matplotlib.pyplot as plt
import numpy as np
# make data
x = np.linspace(0, np.pi * 2, 300)
y = np.sin(x)
# init figure
fig, ax = plt.subplots()
# plot sin and line
ax.plot(x, y, color='orange')
ax.axhline(0)
# fill between hline and y, but use (y > 0) and (y < 0)
# to create boolean masks determining where to fill
ax.fill_between(x, y, where=(y > 0), color='orange', alpha=.3)
ax.fill_between(x, y, where=(y < 0), color='blue', alpha=.3)
You have to use fill_between with the where argument:
import matplotlib.pyplot as plt
import numpy as np
data_x = np.arange(0.0, 2, 0.01)
data_y = np.sin(2 * np.pi * data_x)
data_y2 = 0
fig, ax = plt.subplots()
# green where the curve is at or below the reference value
ax.fill_between(data_x, data_y, data_y2,
                where=data_y2 >= data_y,
                facecolor='green', interpolate=True)
# red where the curve is at or above the reference value
ax.fill_between(data_x, data_y, data_y2,
                where=data_y2 <= data_y,
                facecolor='red', interpolate=True)
Note that data_y2 has to be a scalar (e.g. 0) or of the same shape as data_y.
Here you will find the relevant documentation:
https://matplotlib.org/3.1.1/gallery/lines_bars_and_markers/fill_between_demo.html
and
https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.fill_between.html
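To illustrate the remark about shapes, here is a small sketch in which the second curve is an array of the same shape as the data rather than a scalar. The data, the sloped baseline, and the variable names are all hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: a wave oscillating around a slowly rising baseline
data_x = np.linspace(0, 4 * np.pi, 400)
baseline = 10 + 0.2 * data_x              # same shape as data_y
data_y = baseline + 3 * np.sin(data_x)

fig, ax = plt.subplots()
ax.plot(data_x, data_y, color='orange')
ax.plot(data_x, baseline, color='blue')
# Element-wise comparison works because both arrays have the same shape
above = ax.fill_between(data_x, data_y, baseline,
                        where=(data_y >= baseline),
                        facecolor='red', alpha=0.3, interpolate=True)
below = ax.fill_between(data_x, data_y, baseline,
                        where=(data_y <= baseline),
                        facecolor='green', alpha=0.3, interpolate=True)
```

For a flat reference line like the one in the question, replacing baseline with the scalar 10 should behave the same way.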
Matplotlib offers various options for the drawstyle. steps-mid does the following:
The steps variants connect the points with step-like lines, i.e. horizontal lines with vertical steps. [...]
'steps-mid': The step is halfway between the points.
This works fine when the x-scale is linear; however, with a log scale it still seems to compute the step points by averaging in data space rather than log space. This leads to data points not being centered between the steps.
import matplotlib.pyplot as plt
import numpy as np
x = np.logspace(0, 10, num=10)
y = np.arange(x.size) % 2
fig, ax = plt.subplots()
ax.set_xscale('log')
ax.plot(x, y, drawstyle='steps-mid', marker='s')
Is there a way to use step-like plotting together with x-log-scale such that the steps are centered between data points in log-space?
I don't know of a way other than building the steps correctly in log space yourself:
import matplotlib.pyplot as plt
import numpy as np
x = np.logspace(0, 10, num=10)
y = np.arange(x.size) % 2
def log_steps_mid(x, y, **kwargs):
    x_log = np.log10(x)
    x_log_mid = x_log[:-1] + np.diff(x_log)/2
    x_mid = 10 ** x_log_mid
    x_mid = np.hstack([x[0],
                       np.repeat(x_mid, 2),
                       x[-1]])
    y_mid = np.repeat(y, 2)
    ax.plot(x_mid, y_mid, **kwargs)
fig, ax = plt.subplots()
ax.set_xscale('log')
ax.plot(x, y, ls='', marker='s', color='b')
log_steps_mid(x, y, color='b')
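The midpoint in log space is the same as the geometric mean of two neighbouring x values, so the step vertices can also be computed without taking logarithms. A minimal sketch, assuming strictly positive x; the function name log_mid_step_vertices is made up here, and it returns the vertices instead of plotting them:

```python
import numpy as np


def log_mid_step_vertices(x, y):
    """Vertices of a 'steps-mid' line whose steps sit halfway between
    neighbouring points in log space (x must be strictly positive)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    # sqrt(a * b) is the midpoint of a and b on a logarithmic axis
    x_mid = np.sqrt(x[:-1] * x[1:])
    # Each interior midpoint appears twice: once per adjacent y level
    x_step = np.concatenate([[x[0]], np.repeat(x_mid, 2), [x[-1]]])
    y_step = np.repeat(y, 2)
    return x_step, y_step


x_step, y_step = log_mid_step_vertices([1, 100, 10000], [0, 1, 0])
```

The returned arrays can then be drawn with ax.plot(x_step, y_step) on an axis whose x-scale is set to 'log'.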
I'd like to make a scatter plot where each point is colored by the spatial density of nearby points.
I've come across a very similar question, which shows an example of this using R:
R Scatter Plot: symbol color represents number of overlapping points
What's the best way to accomplish something similar in python using matplotlib?
In addition to hist2d or hexbin as @askewchan suggested, you can use the same method that the accepted answer in the question you linked to uses.
If you want to do that:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Generate fake data
x = np.random.normal(size=1000)
y = x * 3 + np.random.normal(size=1000)
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=100)
plt.show()
If you'd like the points to be plotted in order of density so that the densest points are always on top (similar to the linked example), just sort them by the z-values. I'm also going to use a smaller marker size here as it looks a bit better:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Generate fake data
x = np.random.normal(size=1000)
y = x * 3 + np.random.normal(size=1000)
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
# Sort the points by density, so that the densest points are plotted last
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=50)
plt.show()
Plotting >100k data points?
The accepted answer, using gaussian_kde(), will take a lot of time. On my machine, 100k rows took about 11 minutes. Here I will add two alternative methods (mpl-scatter-density and datashader) and compare the given answers on the same dataset.
In the following, I used a test data set of 100k rows:
import matplotlib.pyplot as plt
import numpy as np
# Fake data for testing
x = np.random.normal(size=100000)
y = x * 3 + np.random.normal(size=100000)
Output & computation time comparison
Below is a comparison of different methods.
1: mpl-scatter-density
Installation
pip install mpl-scatter-density
Example code
import mpl_scatter_density # adds projection='scatter_density'
from matplotlib.colors import LinearSegmentedColormap
# "Viridis-like" colormap with white background
white_viridis = LinearSegmentedColormap.from_list('white_viridis', [
    (0, '#ffffff'),
    (1e-20, '#440053'),
    (0.2, '#404388'),
    (0.4, '#2a788e'),
    (0.6, '#21a784'),
    (0.8, '#78d151'),
    (1, '#fde624'),
], N=256)
def using_mpl_scatter_density(fig, x, y):
    ax = fig.add_subplot(1, 1, 1, projection='scatter_density')
    density = ax.scatter_density(x, y, cmap=white_viridis)
    fig.colorbar(density, label='Number of points per pixel')
fig = plt.figure()
using_mpl_scatter_density(fig, x, y)
plt.show()
Drawing this took 0.05 seconds:
And the zoom-in looks quite nice:
2: datashader
Datashader is an interesting project. It has added support for matplotlib in datashader 0.12.
Installation
pip install datashader
Code (source & parameter listing for dsshow):
import datashader as ds
from datashader.mpl_ext import dsshow
import pandas as pd
def using_datashader(ax, x, y):
    df = pd.DataFrame(dict(x=x, y=y))
    dsartist = dsshow(
        df,
        ds.Point("x", "y"),
        ds.count(),
        vmin=0,
        vmax=35,
        norm="linear",
        aspect="auto",
        ax=ax,
    )
    plt.colorbar(dsartist)
fig, ax = plt.subplots()
using_datashader(ax, x, y)
plt.show()
It took 0.83 s to draw this:
There is also the possibility to colorize by a third variable. The third parameter for dsshow controls the coloring. See more examples here and the source for dsshow here.
3: scatter_with_gaussian_kde
from scipy.stats import gaussian_kde
def scatter_with_gaussian_kde(ax, x, y):
    # https://stackoverflow.com/a/20107592/3015186
    # Answer by Joel Kington
    xy = np.vstack([x, y])
    z = gaussian_kde(xy)(xy)
    # 'none' is the documented way to turn off marker edges
    ax.scatter(x, y, c=z, s=100, edgecolor='none')
It took 11 minutes to draw this:
4: using_hist2d
import matplotlib.pyplot as plt
def using_hist2d(ax, x, y, bins=(50, 50)):
    # https://stackoverflow.com/a/20105673/3015186
    # Answer by askewchan
    ax.hist2d(x, y, bins, cmap=plt.cm.jet)
It took 0.021 s to draw this with bins=(50,50):
It took 0.173 s to draw this with bins=(1000,1000):
Cons: The zoomed-in data does not look as good as with mpl-scatter-density or datashader. Also, you have to determine the number of bins yourself.
5: density_scatter
The code is as in the answer by Guillaume.
It took 0.073 s to draw this with bins=(50,50):
It took 0.368 s to draw this with bins=(1000,1000):
Also, if the number of points makes the KDE calculation too slow, the color can be interpolated from np.histogram2d [Update in response to comments: If you wish to show the colorbar, use plt.scatter() instead of ax.scatter(), followed by plt.colorbar()]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import Normalize
from scipy.interpolate import interpn
def density_scatter( x , y, ax = None, sort = True, bins = 20, **kwargs ) :
    """
    Scatter plot colored by 2d histogram
    """
    if ax is None :
        fig , ax = plt.subplots()
    else :
        # the figure is needed below for the colorbar
        fig = ax.get_figure()
    data , x_e, y_e = np.histogram2d( x, y, bins = bins, density = True )
    z = interpn( ( 0.5*(x_e[1:] + x_e[:-1]) , 0.5*(y_e[1:]+y_e[:-1]) ) , data , np.vstack([x,y]).T , method = "splinef2d", bounds_error = False)
    # To be sure to plot all data
    z[np.where(np.isnan(z))] = 0.0
    # Sort the points by density, so that the densest points are plotted last
    if sort :
        idx = z.argsort()
        x, y, z = x[idx], y[idx], z[idx]
    ax.scatter( x, y, c=z, **kwargs )
    norm = Normalize(vmin = np.min(z), vmax = np.max(z))
    cbar = fig.colorbar(cm.ScalarMappable(norm = norm), ax=ax)
    cbar.ax.set_ylabel('Density')
    return ax
if "__main__" == __name__ :
    x = np.random.normal(size=100000)
    y = x * 3 + np.random.normal(size=100000)
    density_scatter( x, y, bins = [30,30] )
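Regarding the colorbar note, a simplified sketch: the object returned by scatter can be handed straight to fig.colorbar, so the colorbar follows the scatter's own color mapping. Note that this sketch swaps the spline interpolation for a plain lookup of the histogram bin each point falls into; the helper name binned_density is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt


def binned_density(x, y, bins=30):
    """Density of the 2D histogram bin that each point falls into."""
    hist, x_edges, y_edges = np.histogram2d(x, y, bins=bins, density=True)
    # Bin index per point; clipping keeps the maximum value in the last bin
    ix = np.clip(np.searchsorted(x_edges, x, side='right') - 1, 0, hist.shape[0] - 1)
    iy = np.clip(np.searchsorted(y_edges, y, side='right') - 1, 0, hist.shape[1] - 1)
    return hist[ix, iy]


rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x * 3 + rng.normal(size=10_000)

z = binned_density(x, y)
order = np.argsort(z)                     # densest points drawn last

fig, ax = plt.subplots()
sc = ax.scatter(x[order], y[order], c=z[order], s=5)
cbar = fig.colorbar(sc, ax=ax)            # colorbar tied to the scatter itself
cbar.set_label('Density')
```

The bin lookup gives blockier colors than interpolation, which may or may not matter at this point count.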
You could make a histogram:
import numpy as np
import matplotlib.pyplot as plt
# fake data:
a = np.random.normal(size=1000)
b = a*3 + np.random.normal(size=1000)
plt.hist2d(a, b, (50, 50), cmap=plt.cm.jet)
plt.colorbar()