How to plot a trendline on scatter-plot matplotlib based on KDE? - python

I am trying to plot a trend line on my scatter plot in Matplotlib.
I am aware of NumPy's polyfit function, but it does not do what I want.
Here is what I have so far:
plot = plt.figure(figsize=(10,10)) #Set up the size of the figure
cmap = "viridis" #Set up the color map
plt.scatter(samples[1], samples[0], s=0.1, c=density_sm, cmap=cmap) #Plot the Cross-Plot
plt.colorbar().set_label('Density of points')
plt.axis('scaled')
plt.xlim(-0.3,0.3)
plt.ylim(-0.3,0.3)
plt.xlabel("Intercept")
plt.ylabel("Gradient")
plt.axhline(0, color='green', alpha=0.5, linestyle="--")
plt.axvline(0, color='green', alpha=0.5, linestyle="--")
#Trend-line_1
z = np.polyfit(samples[1], samples[0], 1)
p = np.poly1d(z)
plt.plot(samples[0],p(samples[0]),color="#CC3333", linewidth=0.5)
#Trend-line_2
reg = sm.WLS(samples[0], samples[1]).fit()
plt.plot(samples[1], reg.fittedvalues)
And here is the result:
Scatter-plot with trends
What I want is:
Scatter-Plot_desired
Trend can easily be seen, but the question is what function to use?

The behaviour of polyfit is as expected and the result is correct. The problem is that polyfit does not do what you expect. Typical fitting routines, polyfit included, minimize the (squared) vertical (y-axis) distance between the fit and the data points. What you seem to expect, however, is that it minimizes the Euclidean (perpendicular) distance between the fit and the data. See the difference in this figure:
The code below illustrates this with random data. Note that the linear relationship of the data (parameter a) is recovered by the fit, which would not be the case for the Euclidean fit. Therefore the seemingly off fit is to be preferred.
import numpy as np
import matplotlib.pyplot as plt
N = 10000
a = -1 # true slope of the synthetic data
b = 0.1 # noise scale
datax = 0.3*b*np.random.randn(N)
datay = a*datax+b*np.random.randn(N)
plot = plt.figure(1,figsize=(10,10)) #Set up the size of the figure
plot.clf()
plt.scatter(datax,datay) #Plot the Cross-Plot
popt = np.polyfit(datax,datay,1)
print("Result is {0:1.2f} and should be {1:1.2f}".format(popt[-2],a))
xplot = np.linspace(-1,1,1000)
def pol(x,popt):
    # Evaluate the polynomial; polyfit returns the highest order first, so reverse
    popt = popt[::-1]
    res = 0
    for i,p in enumerate(popt):
        res += p*x**i
    return res
plt.plot(xplot,pol(xplot,popt))
plt.xlim(-0.3,0.3)
plt.ylim(-0.3,0.3)
plt.xlabel("Intercept")
plt.ylabel("Gradient")
plt.tight_layout()
plt.show()
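For comparison, here is a minimal sketch of the Euclidean (orthogonal) fit mentioned above, computed from the principal axis of the point cloud. The helper name orthogonal_fit is made up for this illustration, and the data are a seeded synthetic sample of the same kind as in the code above. Under these assumptions the ordinary least-squares slope stays close to a, while the orthogonal slope comes out much steeper:

```python
import numpy as np

def orthogonal_fit(x, y):
    """Slope and intercept of the line that minimises perpendicular distances.

    Hypothetical helper for illustration; it fails for a perfectly vertical cloud.
    """
    pts = np.column_stack([x, y])
    centre = pts.mean(axis=0)
    # The first right-singular vector is the direction of largest variance
    _, _, vt = np.linalg.svd(pts - centre, full_matrices=False)
    dx, dy = vt[0]
    slope = dy / dx
    intercept = centre[1] - slope * centre[0]
    return slope, intercept

rng = np.random.default_rng(0)
a, b = -1.0, 0.1
datax = 0.3 * b * rng.standard_normal(10000)
datay = a * datax + b * rng.standard_normal(10000)

ols_slope = np.polyfit(datax, datay, 1)[0]            # vertical distances
tls_slope, tls_intercept = orthogonal_fit(datax, datay)  # perpendicular distances

print("least-squares slope:", round(ols_slope, 2))
print("orthogonal slope:   ", round(tls_slope, 2))
```

The orthogonal line follows the long axis of the point cloud, which is probably what the eye picks out in the scatter plot, but it does not recover the parameter a used to generate the data.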

samples[0] is your "y" and samples[1] is your "x". In the trend-line plot, use samples[1] both as the x values and as the argument of p, i.e. plt.plot(samples[1], p(samples[1]), ...).

Related

Difficult to plot linear regression line on scatter plot with log scale

I have a example dataframe like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame({'a':[0.05, 0.11, 0.18, 0.20, 0.22, 0.27],
'b':[3.14, 1.56, 33.10, 430.00, 239.10, 2600.22]})
I would like to plot these properties as a scatter plot and then show the linear trend line of these samples. I also need the y axis (df['b']) to be on a log scale.
However, when I try to do that with np.polyfit, I get a strange line.
# Coefficients for polynomial function (degree 1)
coefs = np.polyfit(df['a'], df['b'], 1)
fit_coefs = np.poly1d(coefs)
plt.figure()
plt.scatter(df['a'], df['b'], s = 50, edgecolors = 'black')
plt.plot(df['a'], fit_coefs(df['a']), color='red',linestyle='--')
plt.xlabel('a')
plt.ylabel('b')
plt.yscale('log')
If I convert df['b'] to log before plotting, I get the right linear trend, but I would like the y-axis to show the values of the previous plot rather than the converted log values, as in the plot below:
df['b_log'] = np.log10(df['b'])
coefs = np.polyfit(df['a'], df['b_log'], 1)
fit_coefs = np.poly1d(coefs)
plt.figure()
plt.scatter(df['a'], df['b_log'], s = 50, edgecolors = 'black')
plt.plot(df['a'], fit_coefs(df['a']), color='red', linestyle='--')
plt.xlabel('a')
plt.ylabel('b_log')
So basically, I need a plot like the last one, but the y-axis should show the original (not log-transformed) values, and I should still get the right linear trend. Could anyone help me?
You are doing two different things there: first, you are fitting a straight line to your exponential data (which is presumably not what you want); then you are fitting a straight line to your log data, which is OK.
To draw the line fitted in log space on the logarithmic plot, you can transform it back with 10**fit_coefs(df['a']):
df['b_log'] = np.log10(df['b'])
coefs = np.polyfit(df['a'], df['b_log'], 1)
fit_coefs = np.poly1d(coefs)
plt.figure()
plt.scatter(df['a'], df['b'], s = 50, edgecolors = 'black')
plt.plot(df['a'], 10**fit_coefs(df['a']), color='red', linestyle='--')
plt.xlabel('a')
plt.ylabel('b')
plt.yscale("log")
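A small numerical check, using the data from the question, suggests why the first attempt looks strange: a straight line fitted to the raw values goes negative at small a, and a log axis cannot show non-positive values, whereas the back-transformed log-space fit is positive everywhere. The variable names are chosen for this sketch only:

```python
import numpy as np

a = np.array([0.05, 0.11, 0.18, 0.20, 0.22, 0.27])
b = np.array([3.14, 1.56, 33.10, 430.00, 239.10, 2600.22])

# Straight line fitted to the raw values
linear_fit = np.poly1d(np.polyfit(a, b, 1))(a)

# Straight line fitted to log10(b), then transformed back to the original units
log_fit = 10 ** np.poly1d(np.polyfit(a, np.log10(b), 1))(a)

print("linear fit:   ", np.round(linear_fit, 1))
print("log-space fit:", np.round(log_fit, 2))
print("linear fit goes negative:", bool((linear_fit < 0).any()))
```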

Seaborn joint_plot and marginal hists mis-aligned

I'm trying to generate a jointplot for data with linear x and log y. The ranges are -22, -13 for x and 1e-3, 1 for y. The plot seems ok, however the marginal histograms are not correct: at least the one for the x data:
Here's my code...
# Convert observed magnitude to Absolute ...
absMag, pop3Mag, nmAbsMag = compMags(dir,z)
pop3Fraction = haloData[dir][z]['1500A_P3']/haloData[dir][z]['1500A']
pop3Fraction[pop3Fraction < 1e-3] = 1e-3 # Map Pop 3 flux < 1e-3 to 1e-3
data = np.array((absMag,pop3Fraction)).T # data is list of (x,y) pairs...
df = pd.DataFrame(data, columns=["M", "f"])
x, y = data.T
# g = sns.jointplot(x="x", y="y", data=df)
g = sns.JointGrid(x='M', y='f', data=df, xlim=[-22,-13],ylim=[0.001,1])
g.plot_joint(plt.scatter)
g.ax_marg_x.set_xscale('linear')
g.ax_marg_y.set_yscale('log')
x_h = g.ax_marg_x.hist(df['M'], color='b', edgecolor='k', bins=magBins)
y_h = g.ax_marg_y.hist(df['f'], orientation="horizontal", color='r', edgecolor='k', bins=fracBins, log=True)
ax = g.ax_joint
ax.set_xscale('linear')
ax.set_yscale('log')
ax.set_xlim([-22,-13])
ax.set_xticks([-21,-19,-17,-15,-13,-11])
ax.set_ylim([1e-3,1])
I'm not sure why the top histogram is not aligned with the data... ???
Never-mind ... on closer inspection there really are more points near -13 than anywhere else... I really need a 2d histogram here to show these nuances.
If someone has a suggestion as to how to make that plot clearly with seaborn I'd appreciate it.
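One possible starting point, sketched here with plain NumPy rather than seaborn, is to bin the points with linear edges along x and log-spaced edges along y. The arrays M and f below are random stand-ins (an assumption, not the real halo data), and the bin counts are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in data: magnitudes in [-22, -13], fractions in [1e-3, 1]
M = rng.uniform(-22, -13, 5000)
f = 10 ** rng.uniform(-3, 0, 5000)

mag_edges = np.linspace(-22, -13, 19)   # 18 linear bins along x
frac_edges = np.logspace(-3, 0, 16)     # 15 log-spaced bins along y

# counts[i, j] is the number of points in x-bin i and y-bin j
counts, x_edges, y_edges = np.histogram2d(M, f, bins=[mag_edges, frac_edges])

print(counts.shape)
print(int(counts.sum()), "points binned")
```

With matplotlib, the result can be drawn with ax.pcolormesh(x_edges, y_edges, counts.T) on an axes whose y scale is set to 'log', so the cells line up with the log-scaled joint axis.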

Density scatter plot for huge dataset in matplotlib

I wrote some code a while ago that used gaussian kde to make simple density scatter plots. However, for datasets larger than about 100,000 points, it just ran 'forever' (I killed it after a few days). A friend gave me some code in R that could create such a density plot in seconds (plot_fun.R), and it seems like matplotlib should be able to do the same thing.
I think the right place to look is 2d histograms, but I am struggling to get the density to be 'right'. I modified code I found at this question to accomplish this, but the density is not showing; it looks like only the densest possible points are getting any color.
Here is approximately the code I am using:
# initial data
x = -np.log10(np.random.random_sample(10000))
y = -np.log10(np.random.random_sample(10000))
#histogram definition
bins = [1000, 1000] # number of bins
thresh = 3 #density threshold
#data definition
mn = min(x.min(), y.min())
mx = max(x.max(), y.max())
mn = mn-(mn*.1)
mx = mx+(mx*.1)
xyrange = [[mn, mx], [mn, mx]]
# histogram the data
hh, locx, locy = np.histogram2d(x, y, range=xyrange, bins=bins)
posx = np.digitize(x, locx)
posy = np.digitize(y, locy)
#select points within the histogram
ind = (posx > 0) & (posx <= bins[0]) & (posy > 0) & (posy <= bins[1])
hhsub = hh[posx[ind] - 1, posy[ind] - 1] # values of the histogram where the points are
xdat1 = x[ind][hhsub < thresh] # low density points
ydat1 = y[ind][hhsub < thresh]
hh[hh < thresh] = np.nan # fill the areas with low density by NaNs
f, a = plt.subplots(figsize=(12,12))
c = a.imshow(
    np.flipud(hh.T), cmap='jet',
    extent=np.array(xyrange).flatten(), interpolation='none',
    origin='upper'
)
f.colorbar(c, ax=a, orientation='vertical', shrink=0.75, pad=0.05)
s = a.scatter(
    xdat1, ydat1, color='darkblue', edgecolor='none', label=None,
    picker=True, zorder=2
)
That produces this plot:
The KDE code is here:
f, a = plt.subplots(figsize=(12,12))
xy = np.vstack([x, y])
z = sts.gaussian_kde(xy)(xy)
# Sort the points by density, so that the densest points are
# plotted last
idx = z.argsort()
x2, y2, z = x[idx], y[idx], z[idx]
s = a.scatter(
    x2, y2, c=z, s=50, cmap='jet',
    edgecolor='none', label=None, picker=True, zorder=2
)
That produces this plot:
The problem is, of course, that this code is unusable on large data sets.
My question is: how can I use the 2d histogram to produce a scatter plot like that? ax.hist2d does not produce a useful output, because it colors the whole plot, and all my efforts to get the above 2d histogram data to actually color the dense regions of the plot correctly have failed, I always end up with either no coloring or a tiny percentage of the densest points being colored. Clearly I just don't understand the code very well.
Your histogram code assigns a single color (color='darkblue') to all scattered points, so what are you expecting?
I think you are also over complicating things. This much simpler code works fine:
import numpy as np
import matplotlib.pyplot as plt
x, y = -np.log10(np.random.random_sample((2,10**6)))
#histogram definition
bins = [1000, 1000] # number of bins
# histogram the data
hh, locx, locy = np.histogram2d(x, y, bins=bins)
# Sort the points by density, so that the densest points are plotted last
z = np.array([hh[np.argmax(a<=locx[1:]),np.argmax(b<=locy[1:])] for a,b in zip(x,y)])
idx = z.argsort()
x2, y2, z2 = x[idx], y[idx], z[idx]
plt.figure(1,figsize=(8,8)).clf()
s = plt.scatter(x2, y2, c=z2, cmap='jet', marker='.')
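The list comprehension that looks up the bin count for each point is a Python-level loop and can be slow for millions of points. A possible vectorized variant, sketched below, uses np.searchsorted to find the same bin indices in one call. The sample size and seed are arbitrary choices for this sketch, and 1 - rng.random(...) is used only to avoid taking the log of zero:

```python
import numpy as np

rng = np.random.default_rng(0)
# 1 - random() lies in (0, 1], so the logarithm is always finite
x, y = -np.log10(1.0 - rng.random((2, 10**5)))

bins = [1000, 1000]  # number of bins
hh, locx, locy = np.histogram2d(x, y, bins=bins)

# Index of the first right-hand bin edge that is >= each coordinate,
# which mirrors np.argmax(a <= locx[1:]) in the loop above
ix = np.clip(np.searchsorted(locx[1:], x, side='left'), 0, bins[0] - 1)
iy = np.clip(np.searchsorted(locy[1:], y, side='left'), 0, bins[1] - 1)
z = hh[ix, iy]  # histogram count of the bin each point falls into

# Sort the points by density, so that the densest points are plotted last
idx = z.argsort()
x2, y2, z2 = x[idx], y[idx], z[idx]
```

The arrays x2, y2 and z2 can then be passed to plt.scatter exactly as in the answer above.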

Plotting confidence and prediction intervals with repeated entries

I have a correlation plot for two variables, the predictor variable (temperature) on the x-axis, and the response variable (density) on the y-axis. My best fit least squares regression line is a 2nd order polynomial. I would like to also plot confidence and prediction intervals. The method described in this answer seems perfect. However, my dataset (n=2340) has repeated entries for many (x,y) pairs. My resulting plot looks like this:
Here is my relevant code (slightly modified from linked answer above):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import summary_table
d = {'temp': x, 'dens': y}
df = pd.DataFrame(data=d)
x = df.temp
y = df.dens
plt.figure(figsize=(6 * 1.618, 6))
plt.scatter(x,y, s=10, alpha=0.3)
plt.xlabel('temp')
plt.ylabel('density')
# points linearly spaced for predictor variable
x1 = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), 100)})
# 2nd order polynomial
poly_2 = smf.ols(formula='dens ~ 1 + temp + I(temp ** 2.0)', data=df).fit()
# this correctly plots my single 2nd-order poly best-fit line:
plt.plot(x1.temp, poly_2.predict(x1), 'g-', label='Poly n=2 $R^2$=%.2f' % poly_2.rsquared,
alpha=0.9)
prstd, iv_l, iv_u = wls_prediction_std(poly_2)
st, data, ss2 = summary_table(poly_2, alpha=0.05)
fittedvalues = data[:,2]
predict_mean_se = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
predict_ci_low, predict_ci_upp = data[:,6:8].T
# check we got the right things
print(np.max(np.abs(poly_2.fittedvalues - fittedvalues)))
print(np.max(np.abs(iv_l - predict_ci_low)))
print(np.max(np.abs(iv_u - predict_ci_upp)))
plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)
The print statements evaluate to 0.0, as expected.
However, I need single lines for the polynomial best fit line, and the confidence and prediction intervals (rather than the multiple lines I currently have in my plot). Any ideas?
Update:
Following the first answer from @kpie, I ordered my confidence and prediction interval arrays according to temperature:
data_intervals = {'temp': x, 'predict_low': predict_ci_low, 'predict_upp': predict_ci_upp, 'conf_low': predict_mean_ci_low, 'conf_high': predict_mean_ci_upp}
df_intervals = pd.DataFrame(data=data_intervals)
df_intervals_sort = df_intervals.sort_values(by='temp')
This achieved desired results:
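The same reordering can also be sketched without a DataFrame by applying one np.argsort permutation to every array, so that each interval value stays paired with its temperature. The short arrays below are made-up stand-ins for the real columns:

```python
import numpy as np

# Hypothetical stand-ins for the temperature column and one interval bound
temp = np.array([20.0, 5.0, 12.5, 5.0, 30.0])
predict_low = np.array([1.8, 0.4, 1.1, 0.4, 2.6])

# Indices that sort by temperature; 'stable' keeps repeated entries in order
order = np.argsort(temp, kind='stable')

temp_sorted = temp[order]
predict_low_sorted = predict_low[order]  # same permutation keeps pairs aligned

print(temp_sorted)
print(predict_low_sorted)
```

Plotting temp_sorted against each sorted bound then draws a single line from left to right instead of jumping back and forth between unsorted x values.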
You need to order your predicted values by temperature, I think.
To get nice smooth curves, you can use numpy.polynomial.polynomial.polyfit, which returns an array of coefficients (lowest order first). You will have to pass the x and y data as two separate arrays.
You can then plot the fitted polynomial with:
import numpy as np
import matplotlib.pyplot as plt
from numpy.polynomial import polynomial as P
# xs, ys: your data arrays; degree: order of the polynomial (2 in your case)
coeffs = P.polyfit(xs, ys, degree)
x = np.linspace(-15, 45, 300) # your smooth line will be made of 300 short pieces
y = P.polyval(x, coeffs) # evaluate the fitted polynomial on the grid
plt.plot(x, y)
You can look more into plotting smooth lines here; just remember that all plotted lines are piecewise linear, because a truly continuous curve cannot be drawn.
I believe that the polynomial fitting is done with least squares fitting (process described here)
Good Luck!

Matplotlib: avoiding overlapping datapoints in a "scatter/dot/beeswarm" plot

When drawing a dot plot using matplotlib, I would like to offset overlapping datapoints to keep them all visible. For example, if I have:
CategoryA: 0,0,3,0,5
CategoryB: 5,10,5,5,10
I want each of the CategoryA "0" datapoints to be set side by side, rather than right on top of each other, while still remaining distinct from CategoryB.
In R (ggplot2) there is a "jitter" option that does this. Is there a similar option in matplotlib, or is there another approach that would lead to a similar result?
Edit: to clarify, the "beeswarm" plot in R is essentially what I have in mind, and pybeeswarm is an early but useful start at a matplotlib/Python version.
Edit: to add that Seaborn's Swarmplot, introduced in version 0.7, is an excellent implementation of what I wanted.
Extending the answer by @user2467675, here’s how I did it:
import numpy as np
import matplotlib.pyplot as plt
def rand_jitter(arr):
    # noise scale: 1% of the data range
    stdev = .01 * (max(arr) - min(arr))
    return arr + np.random.randn(len(arr)) * stdev
def jitter(x, y, s=20, c='b', marker='o', cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, **kwargs):
    return plt.scatter(rand_jitter(x), rand_jitter(y), s=s, c=c, marker=marker, cmap=cmap, norm=norm, vmin=vmin, vmax=vmax, alpha=alpha, linewidths=linewidths, **kwargs)
The stdev variable makes sure that the jitter is enough to be seen on different scales, but it assumes that the limits of the axes are zero and the max value.
You can then call jitter instead of scatter.
Seaborn provides histogram-like categorical dot-plots through sns.swarmplot() and jittered categorical dot-plots via sns.stripplot():
import seaborn as sns
sns.set(style='ticks', context='talk')
iris = sns.load_dataset('iris')
sns.swarmplot(x='species', y='sepal_length', data=iris)
sns.despine()
sns.stripplot(x='species', y='sepal_length', data=iris, jitter=0.2)
sns.despine()
I used numpy.random to "scatter/beeswarm" the data along X-axis but around a fixed point for each category, and then basically do pyplot.scatter() for each category:
import matplotlib.pyplot as plt
import numpy as np
#random data for category A, B, with B "taller"
yA, yB = np.random.randn(100), 5.0+np.random.randn(1000)
xA, xB = (np.random.normal(1, 0.1, len(yA)),
          np.random.normal(3, 0.1, len(yB)))
plt.scatter(xA, yA)
plt.scatter(xB, yB)
plt.show()
One way to approach the problem is to think of each 'row' in your scatter/dot/beeswarm plot as a bin in a histogram:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(100)
width = 0.8 # the maximum width of each 'row' in the scatter plot
xpos = 0 # the centre position of the scatter plot in x
counts, edges = np.histogram(data, bins=20)
centres = (edges[:-1] + edges[1:]) / 2.
yvals = centres.repeat(counts)
max_offset = width / counts.max()
offsets = np.hstack([(np.arange(cc) - 0.5 * (cc - 1)) for cc in counts])
xvals = xpos + (offsets * max_offset)
fig, ax = plt.subplots(1, 1)
ax.scatter(xvals, yvals, s=30, c='b')
This obviously involves binning the data, so you may lose some precision. If you have discrete data, you could replace:
counts, edges = np.histogram(data, bins=20)
centres = (edges[:-1] + edges[1:]) / 2.
with:
centres, counts = np.unique(data, return_counts=True)
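As a small worked sketch of the discrete case, here is the same offset calculation applied to the CategoryA values from the question (0, 0, 3, 0, 5). The values of width and xpos are carried over from the snippet above; everything else is illustrative:

```python
import numpy as np

data = np.array([0, 0, 3, 0, 5])  # CategoryA from the question
width = 0.8  # the maximum width of each 'row' in the scatter plot
xpos = 0     # the centre position of the scatter plot in x

# Each distinct value becomes one 'row'; counts says how many points it holds
centres, counts = np.unique(data, return_counts=True)
yvals = centres.repeat(counts)

# Spread the points of each row symmetrically around the centre
max_offset = width / counts.max()
offsets = np.hstack([np.arange(cc) - 0.5 * (cc - 1) for cc in counts])
xvals = xpos + offsets * max_offset

print(yvals)
print(np.round(xvals, 3))
```

The three zeros end up side by side at x-offsets of roughly -0.267, 0 and 0.267, while the single 3 and 5 stay on the centre line.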
An alternative approach that preserves the exact y-coordinates, even for continuous data, is to use a kernel density estimate to scale the amplitude of random jitter in the x-axis:
from scipy.stats import gaussian_kde
kde = gaussian_kde(data)
density = kde(data) # estimate the local density at each datapoint
# generate some random jitter between 0 and 1
jitter = np.random.rand(*data.shape) - 0.5
# scale the jitter by the KDE estimate and add it to the centre x-coordinate
xvals = 1 + (density * jitter * width * 2)
ax.scatter(xvals, data, s=30, c='g')
for sp in ['top', 'bottom', 'right']:
    ax.spines[sp].set_visible(False)
ax.tick_params(top=False, bottom=False, right=False)
ax.set_xticks([0, 1])
ax.set_xticklabels(['Histogram', 'KDE'], fontsize='x-large')
fig.tight_layout()
This second method is loosely based on how violin plots work. It still cannot guarantee that none of the points are overlapping, but I find that in practice it tends to give quite nice-looking results as long as there are a decent number of points (>20), and the distribution can be reasonably well approximated by a sum-of-Gaussians.
I don't know of a direct mpl alternative, so here is a very rudimentary proposal:
from matplotlib import pyplot as plt
from itertools import groupby
CA = [0,4,0,3,0,5]
CB = [0,0,4,4,2,2,2,2,3,0,5]
x = []
y = []
for indx, klass in enumerate([CA, CB]):
    klass = groupby(sorted(klass))
    for item, objt in klass:
        objt = list(objt)
        points = len(objt)
        pos = 1 + indx + (1 - points) / 50.
        for item in objt:
            x.append(pos)
            y.append(item)
            pos += 0.04
plt.plot(x, y, 'o')
plt.xlim((0,3))
plt.show()
Seaborn's swarmplot seems like the most apt fit for what you have in mind, but you can also jitter with Seaborn's regplot:
import seaborn as sns
iris = sns.load_dataset('iris')
sns.swarmplot(x='species', y='sepal_length', data=iris)
sns.regplot(x='sepal_length',
y='sepal_width',
data=iris,
fit_reg=False, # do not fit a regression line
x_jitter=0.1, # could also dynamically set this with range of data
y_jitter=0.1,
scatter_kws={'alpha': 0.5}) # set transparency to 50%
Extending the answer by @wordsforthewise (sorry, can't comment with my reputation), if you need both jitter and the use of hue to color the points by some categorical variable (like I did), Seaborn's lmplot is a great choice instead of regplot:
import seaborn as sns
iris = sns.load_dataset('iris')
sns.lmplot(x='sepal_length', y='sepal_width', hue='species', data=iris, fit_reg=False, x_jitter=0.1, y_jitter=0.1)
