How do I separate a data set on a scatter plot - python

I'm very new to Python, but I'm interested in learning a technique whereby I can mark different data points in a scatter plot with different markers according to where they fall in the plot.
My specific case is much like this example: http://www.astroml.org/examples/datasets/plot_sdss_line_ratios.html
I have a BPT plot and want to split the data along the demarcation line.
I have a data set in this format:
data = [[a, b, c],
        [a, b, c],
        [a, b, c]]
And I also have the following for the demarcation line:
NII = np.linspace(-3.0, 0.35)
def log_OIII_Hb_NII(log_NII_Ha, eps=0):
    return 1.19 + eps + 0.61 / (log_NII_Ha - eps - 0.47)
Any help would be great!

There was not enough room in the comments section. Not too dissimilar to what @DrV wrote, but maybe more astronomically inclined:
import random
import numpy as np
import matplotlib.pyplot as plt
def log_OIII_Hb_NII(log_NII_Ha, eps=0):
    return 1.19 + eps + 0.61 / (log_NII_Ha - eps - 0.47)
# Make some fake measured NII_Ha data
iternum = 100
# Ranged -2.1 to 0.4:
Measured_NII_Ha = np.array([random.random()*2.5-2.1 for i in range(iternum)])
# Ranged -1.5 to 1.5:
Measured_OIII_Hb = np.array([random.random()*3-1.5 for i in range(iternum)])
# For each measured x-value, compute the corresponding cut-off value
Measured_Predicted_OIII_Hb = log_OIII_Hb_NII(Measured_NII_Ha)
# Now compare the cut-off line to the measured emission line fluxes
# by using numpy True/False arrays
#
# i.e., x = numpy.array([1,2,3,4])
# >> index = x >= 3
# >> print(index)
# >> numpy.array([False, False, True, True])
# >> print(x[index])
# >> numpy.array([3,4])
Above_Predicted_Red_Index = Measured_OIII_Hb > Measured_Predicted_OIII_Hb
Below_Predicted_Blue_Index = Measured_OIII_Hb < Measured_Predicted_OIII_Hb
# Alternatively, you can invert Above_Predicted_Red_Index
# Make the cut-off line for a range of values for plotting it as
# a continuous line
Predicted_NII_Ha = np.linspace(-3.0, 0.35)
Predicted_log_OIII_Hb_NII = log_OIII_Hb_NII(Predicted_NII_Ha)
fig = plt.figure(0)
ax = fig.add_subplot(111)
# Plot the modelled cut-off line
ax.plot(Predicted_NII_Ha, Predicted_log_OIII_Hb_NII, color="black", lw=2)
# Plot the data for a given colour
ax.errorbar(Measured_NII_Ha[Above_Predicted_Red_Index], Measured_OIII_Hb[Above_Predicted_Red_Index], fmt="o", color="red")
ax.errorbar(Measured_NII_Ha[Below_Predicted_Blue_Index], Measured_OIII_Hb[Below_Predicted_Blue_Index], fmt="o", color="blue")
# Make it aesthetically pleasing
ax.set_ylabel(r"$\rm \log([OIII]/H\beta)$")
ax.set_xlabel(r"$\rm \log([NII]/H\alpha)$")
plt.show()

I assume you have the pixel coordinates as a, b in your example. The column with the c values is then used to calculate which of the two groups each point belongs to.
First, make your data an ndarray:
import numpy as np
data = np.array(data)
Now you may create two arrays by checking which part of the data belongs to which area:
dataselector = log_OIII_Hb_NII(data[:,2]) > 0
This creates a vector of Trues and Falses which has a True whenever the data in the third column (column 2) gives a positive value from the function. The length of the vector equals the number of rows in data.
Then you can plot the two data sets:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
# the plotting part
ax.plot(data[dataselector,0], data[dataselector,1], 'ro')
ax.plot(data[~dataselector,0], data[~dataselector,1], 'bo')
I.e.:
create a vector of True/False values which tells which rows of data belong to which group
plot the two groups (~dataselector means "all the rows where there is a False in dataselector")
A full runnable sketch of these steps follows.
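Putting it together, a minimal runnable sketch with fake (a, b, c) rows (the value ranges here are arbitrary stand-ins for real measurements):
import numpy as np
import matplotlib.pyplot as plt

def log_OIII_Hb_NII(log_NII_Ha, eps=0):
    return 1.19 + eps + 0.61 / (log_NII_Ha - eps - 0.47)

# Fake rows of (a, b, c): plot coordinates plus the value fed to the function
rng = np.random.default_rng(0)
data = np.column_stack([rng.uniform(-2.1, 0.4, 50),   # a: x coordinates
                        rng.uniform(-1.5, 1.5, 50),   # b: y coordinates
                        rng.uniform(-2.0, 0.0, 50)])  # c: third column

# True where the function of column 2 is positive, False elsewhere
dataselector = log_OIII_Hb_NII(data[:, 2]) > 0

fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(data[dataselector, 0], data[dataselector, 1], 'ro')
ax.plot(data[~dataselector, 0], data[~dataselector, 1], 'bo')
plt.show()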

Related

How to make standard deviation and percentile bands in a python scatter plot

I have data for a scatter plot (for reference, the x values are labelled sm and the y values bhm). My three goals are to find the medians of binned data, to create standard deviation bands, and to create bands at the 90th and 10th percentiles. I've managed the first, and while I've been able to make vertical bars indicating the standard deviation, I can't figure out how to make filled-in bands: every time I try to set parameters with the fill_between function, it complains that the operands are incompatible, since sm/bhm are whole datasets and I'm comparing them to single values (the mean line). I copied all of my code below, with a comment pointing out the relevant part. I kept the rest because the variable names matter and because some parts of the plot don't show up properly without the seemingly extraneous code.
To create the bands at the 90th/10th percentiles, I tried the following bit of code, binning the mean as I did for the median and then filling above and below that line, but I keep getting
patsy.PatsyError: model is missing required outcome variables
#stuff that really doesn't work
model = smf.quantreg(bhm, sm)
quantiles = [0.1, 0.9]
fits = [model.fit(q=q) for q in quantiles]
figure, axes = plt.subplots()
_sm = np.linspace(min(sm), max(sm))
for index, quantile in enumerate(quantiles):
    _bhm = (fits[index].params['world'] * _sm +
            fits[index].params['Intercept'])
    axes.plot(_sm, _bhm, label=quantile)
axes.plot(_sm, _sm, 'g--', label='i guess this line is the mean')
#stuff that also doesn't really work
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as mpatches
import h5py
import statistics as stat
import pandas as pd
import statsmodels.formula.api as smf
#my files and labels for things
f=h5py.File(r'C:\Users\hanna\Downloads\CatalogueGalsz0p0.hdf5', 'r')
sm = f['StellarMass']
bhm = f['BHMass']
bt = f['BtoT']
dt = f['DtoT']
nbins = 125
#titles and scaling for the plot
plt.title('Relationships Between Stellar Mass, Black Hole Mass, and Bulge to Total Ratios')
plt.xlabel('Stellar Mass')
plt.ylabel('Black Hole Mass')
plt.xscale('log')
plt.yscale('log')
axes = plt.gca()
axes.set_ylim([500000,max(bhm)])
axes.set_xlim([min(sm),max(sm)])
#labels for the legend and how I colored the points in the plot
DtoT = np.copy(f['DtoT'].value)
colour = np.zeros(len(DtoT),dtype=str)
for i in np.arange(0, len(bt)):
    if bt[i] >= 0.5:
        colour[i] = 'green'
    else:
        colour[i] = 'red'
redbt = mpatches.Patch(color = 'red', label = 'Bulge to Total Ratios Below 0.5')
greenbt = mpatches.Patch(color = 'green', label = 'Bulge to Total Ratios Above 0.5')
plt.legend(handles = [(redbt), (greenbt)])
#the important part - this is how I binned my data to make the median line, and this part works but not the standard deviation bands
bins = np.linspace(0, max(sm), nbins)
delta = bins[1]-bins[0]
idx = np.digitize(sm, bins)
runningmedian = [np.median(bhm[idx==k]) for k in range(nbins)]
runningstd = [bhm[idx==k].std() for k in range(nbins)]
plt.plot(bins-delta/2, runningmedian, c = 'b', lw=1)
plt.scatter(sm, bhm, c=colour, s=.2)
plt.show()
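For the filled bands themselves, here is a minimal sketch of one approach (not from the thread): bin the data as above, take per-bin percentiles with np.percentile, and pass those arrays, not scalars, to plt.fill_between. The data below is a fake stand-in for sm and bhm:
import numpy as np
import matplotlib.pyplot as plt

# Fake stand-ins for sm (x) and bhm (y); replace with the real arrays
rng = np.random.default_rng(0)
sm = rng.uniform(1, 100, 5000)
bhm = sm * rng.lognormal(0, 0.5, 5000)

nbins = 25
bins = np.linspace(sm.min(), sm.max(), nbins + 1)
centers = 0.5 * (bins[:-1] + bins[1:])
idx = np.digitize(sm, bins)

# Per-bin median and 10th/90th percentiles (assumes no bin is empty)
med = np.array([np.median(bhm[idx == k]) for k in range(1, nbins + 1)])
p10 = np.array([np.percentile(bhm[idx == k], 10) for k in range(1, nbins + 1)])
p90 = np.array([np.percentile(bhm[idx == k], 90) for k in range(1, nbins + 1)])

plt.scatter(sm, bhm, s=0.2, color='grey')
plt.plot(centers, med, c='b', lw=1)
# fill_between takes one y-array per band edge, which avoids the
# scalar-vs-dataset shape mismatch from the question
plt.fill_between(centers, p10, p90, alpha=0.3)
plt.show()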

Replacing part of a plot with a dotted line

I would like to replace part of my plot where the function dips down to '-1' with a dashed line carrying on from the previous point (see plots below).
Here's some code I've written, along with its output:
import numpy as np
import matplotlib.pyplot as plt
y = [5,6,8,3,5,7,3,6,-1,3,8,5]
plt.plot(np.linspace(1,12,12),y,'r-o')
plt.show()
for i in range(1, len(y)):
    if y[i] != -1:
        plt.plot(np.linspace(i-1, i, 2), y[i-1:i+1], 'r-o')
    else:
        y[i] = y[i-1]
        plt.plot(np.linspace(i-1, i, 2), y[i-1:i+1], 'r--o')
plt.ylim(-1,9)
plt.show()
Here's the original plot
Modified plot:
The code I've written works (it produces the desired output), but it's inefficient and takes a long time when I actually run it on my (much larger) dataset. Is there a smarter way to go about doing this?
You can achieve something similar without the loops:
import pandas as pd
import matplotlib.pyplot as plt
# Create a data frame from the list
a = pd.DataFrame([5,6,-1,-1, 8,3,5,7,3,6,-1,3,8,5])
# Prepare a boolean mask
mask = a > 0
# New data frame with missing values filled with the last element of
# the previous segment. Use .bfill() instead to take the first element
# of the next segment.
a_masked = a[mask].ffill()  # fillna(method='ffill') in older pandas
# Prepare the plot
fig, ax = plt.subplots()
line, = ax.plot(a_masked, ls = '--', lw = 1)
ax.plot(a[mask], color=line.get_color(), lw=1.5, marker = 'o')
plt.show()
You can also highlight the negative regions by choosing a different colour for the lines:
My answer is based on a great post from July, 2017. The latter also tackles the case when the first element is NaN or in your case a negative number:
Dotted lines instead of a missing value in matplotlib
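To see what the masking and forward fill actually produce, a small illustration on the same sample data (expected output shown as comments):
import pandas as pd

a = pd.DataFrame([5, 6, -1, -1, 8, 3, 5, 7, 3, 6, -1, 3, 8, 5])
mask = a > 0
print(a[mask][0].head().tolist())
# [5.0, 6.0, nan, nan, 8.0]  (the -1 entries become NaN)
print(a[mask].ffill()[0].head().tolist())
# [5.0, 6.0, 6.0, 6.0, 8.0]  (each NaN takes the last valid value)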
I would use numpy functionality to cut your line into segments and then plot all solid and dashed lines separately. In the example below I added two additional -1s to your data to see that this works universally.
import numpy as np
import matplotlib.pyplot as plt
Y = np.array([5,6,-1,-1, 8,3,5,7,3,6,-1,3,8,5])
X = np.arange(len(Y))
idxs = np.where(Y==-1)[0]
sub_y = np.split(Y,idxs)
sub_x = np.split(X,idxs)
fig, ax = plt.subplots()
##replacing -1 values and plotting dotted lines
for i in range(1, len(sub_y)):
    val = sub_y[i-1][-1]
    sub_y[i][0] = val
    ax.plot([sub_x[i-1][-1], sub_x[i][0]], [val, val], 'r--')
##plotting rest
for x, y in zip(sub_x, sub_y):
    ax.plot(x, y, 'r-o')
plt.show()
The result looks like this:
Note, however, that this will fail if the first value is -1, as then your problem is not well defined (no previous value to copy from). Hope this helps.
Not too elegant, but here's something that doesn't use loops which I came up with (based on the above answers) which works. @KRKirov and @Thomas Kühn, thank you for your answers, I really appreciate them.
import pandas as pd
import matplotlib.pyplot as plt
# Create a data frame from the list
a = pd.DataFrame([5,6,-1,-1, 8,3,5,7,3,6,-1,3,8,5])
b=a.copy()
b[2]=b[0].shift(1,axis=0)
b[4]=(b[0]!=-1) & (b[2]==-1)
b[5]=b[4].shift(-1,axis=0)
b[0] = (b[5] | b[4])
c=b[0]
d=pd.DataFrame(c)
# Prepare a boolean mask
mask = a > 0
# New data frame with missing values filled with the last element of
# the previous segment. Use .bfill() instead to take the first element
# of the next segment.
a_masked = a[mask].ffill()  # fillna(method='ffill') in older pandas
# Prepare the plot
fig, ax = plt.subplots()
line, = ax.plot(a_masked, 'b:o', lw = 1)
ax.plot(a[mask], color=line.get_color(), lw=1.5, marker = 'o')
ax.plot(a_masked[d], color=line.get_color(), lw=1.5, marker = 'o')
plt.show()

Matplotlib Boxplot: Showing Number of Occurrences of Integer Outliers

I have a plot like the following (using plt.boxplot()):
Now, what I want is to plot a number showing how often each outlier occurred (preferably to the top right of each outlier).
Is that somehow achievable?
ax.boxplot returns a dictionary of all the elements in the boxplot. The key you need here from that dict is 'fliers'.
In boxdict['fliers'], there are the Line2D instances that are used to plot the fliers. We can grab their x and y locations using .get_xdata() and .get_ydata().
You can find all the unique y locations using a set, and then find the number of fliers plotted at that location using .count().
Then it's just a case of using matplotlib's ax.text to add a text label to the plot.
Consider the following example:
import matplotlib.pyplot as plt
import numpy as np
# Some fake data
data = np.zeros((10000, 2))
data[0:4, 0] = 1
data[4:6, 0] = 2
data[6:10, 0] = 3
data[0:9, 1] = 1
data[9:14, 1] = 2
data[14:20, 1] = 3
# create figure and axes
fig, ax = plt.subplots(1)
# plot boxplot, grab dict
boxdict = ax.boxplot(data)
# the fliers from the dictionary
fliers = boxdict['fliers']
# loop over boxes in x direction
for j in range(len(fliers)):
    # the y and x positions of the fliers
    yfliers = boxdict['fliers'][j].get_ydata()
    xfliers = boxdict['fliers'][j].get_xdata()
    # the unique locations of fliers in y
    ufliers = set(yfliers)
    # loop over unique fliers
    for i, uf in enumerate(ufliers):
        # print number of fliers
        ax.text(xfliers[i] + 0.03, uf + 0.03, list(yfliers).count(uf))
plt.show()
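As a small variant (not from the original answer), collections.Counter can replace the set-plus-count step; this reuses fliers and ax from the example above:
from collections import Counter

for fl in fliers:
    xpos = fl.get_xdata()
    if len(xpos) == 0:  # box without any outliers
        continue
    # Counter maps each unique y location to its number of occurrences
    for uf, n in Counter(fl.get_ydata()).items():
        ax.text(xpos[0] + 0.03, uf + 0.03, n)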

Matplotlib scatterplot error bars two data sets

I have two data sets, which I'd like to scatter plot next to each other with error bars. Below is my code to plot one data set with error bars, and also the code to generate the second data set. I'd like the points and error bars for the two data sets to sit adjacent at each value.
I'd also like to remove the line connecting the dots.
import random
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as ss
data = []
n = 100
m = 10
for i in range(m):
    d = []
    for j in range(n):
        d.append(random.random())
    data.append(d)
mean_data = []
std_data = []
for i in range(m):
    mean = np.mean(data[i])
    mean_data.append(mean)
    std = np.std(data[i])
    std_data.append(std)
df_data = [n] * m
plt.errorbar(range(m), mean_data, yerr=ss.t.ppf(0.95, df_data)*std_data)
plt.scatter(range(m), mean_data)
plt.show()
new_data = []
for i in range(m):
    d = []
    for j in range(n):
        d.append(random.random())
    new_data.append(d)
mean_new_data = []
std_new_data = []
for i in range(m):
    mean = np.mean(new_data[i])
    mean_new_data.append(mean)
    std = np.std(new_data[i])
    std_new_data.append(std)
df_new_data = [n] * m
To remove the line in the scatter plot use the fmt argument in plt.errorbar(). The plt.scatter() call is then no longer needed. To plot a second set of data, simply call plt.errorbar() a second time, with the new data.
If you don't want the datasets to overlap, you can add some small random scatter in x to the new dataset. You can do this in two ways: add a single scatter float with
random.uniform(-x_scatter, x_scatter)
which will move all the points as one, or generate a random offset for each point with
x_scatter = np.random.uniform(-.5, .5, m)
which jitters each point independently.
To plot both datasets (using the second method), you can use:
plt.errorbar(range(m), mean_data, yerr=ss.t.ppf(0.95, df_data)*std_data,
             fmt='o', label="Data")
# Add some random scatter in x
x_scatter = np.random.uniform(-.5, .5, m)
plt.errorbar(np.arange(m) + x_scatter, mean_new_data,
             yerr=ss.t.ppf(0.95, df_new_data)*std_new_data, fmt='o',
             label="New data")
plt.legend()
plt.show()

How to plot empirical cdf (ecdf)

How can I plot the empirical CDF of an array of numbers in matplotlib in Python? I'm looking for the cdf analog of pylab's "hist" function.
One thing I can think of is:
from scipy.stats import cumfreq
a = array([...]) # my array of numbers
num_bins = 20
b = cumfreq(a, num_bins)
plt.plot(b)
If you like linspace and prefer one-liners, you can do:
plt.plot(np.sort(a), np.linspace(0, 1, len(a), endpoint=False))
Given my tastes, I almost always do:
# a is the data array
x = np.sort(a)
y = np.arange(len(x))/float(len(x))
plt.plot(x, y)
Which works for me even if there are >O(1e6) data values.
If you really need to downsample I'd set
x = np.sort(a)[::down_sampling_step]
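with y recomputed from the thinned x so the lengths match, e.g.:
down_sampling_step = 100  # arbitrary choice
x = np.sort(a)[::down_sampling_step]
y = np.arange(len(x)) / float(len(x))
plt.plot(x, y)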
Edit: to respond to the comment/edit about why I use endpoint=False, or y as defined above, some technical details follow.
The empirical CDF is usually formally defined as
CDF(x) = "number of samples <= x"/"number of samples"
in order to exactly match this formal definition you would need to use y = np.arange(1,len(x)+1)/float(len(x)) so that we get
y = [1/N, 2/N ... 1]. This estimator is an unbiased estimator that will converge to the true CDF in the limit of infinite samples (see the Wikipedia article on the empirical distribution function).
I tend to use y = [0, 1/N, 2/N ... (N-1)/N] since:
(a) it is easier to code/more idiomatic,
(b) it is still formally justified, since one can always exchange CDF(x) with 1-CDF(x) in the convergence proof, and
(c) works with the (easy) downsampling method described above.
In some particular cases, it is useful to define
y = (arange(len(x))+0.5)/len(x)
which is intermediate between these two conventions. In effect, it says "there is a 1/(2N) chance of a value less than the lowest one I've seen in my sample, and a 1/(2N) chance of a value greater than the largest one I've seen so far."
Note that the choice of convention interacts with the where parameter of plt.step if you display the CDF as a piecewise-constant function. To exactly match the formal definition mentioned above, you would need to use where='pre' with the suggested y = [0, 1/N, ..., 1-1/N] convention, or where='post' with the y = [1/N, 2/N, ..., 1] convention, but not the other way around.
However, for large samples and reasonable distributions, the convention given in the main body of the answer is easy to write, is an unbiased estimator of the true CDF, and works with the downsampling methodology.
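A short sketch contrasting the two pairings described above (random data, for illustration only):
import numpy as np
import matplotlib.pyplot as plt

a = np.random.randn(20)
x = np.sort(a)
N = len(x)

# where='post' pairs with y = [1/N, 2/N, ..., 1]
plt.step(x, np.arange(1, N + 1) / N, where='post', label="post, y=[1/N..1]")
# where='pre' pairs with y = [0, 1/N, ..., (N-1)/N]
plt.step(x, np.arange(N) / N, where='pre', label="pre, y=[0..(N-1)/N]")
plt.legend()
plt.show()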
You can use the ECDF function from the scikits.statsmodels library:
import numpy as np
import scikits.statsmodels as sm
import matplotlib.pyplot as plt
sample = np.random.uniform(0, 1, 50)
ecdf = sm.tools.ECDF(sample)
x = np.linspace(min(sample), max(sample))
y = ecdf(x)
plt.step(x, y)
With version 0.4, scikits.statsmodels was renamed to statsmodels. ECDF is now located in the distributions module (while statsmodels.tools.tools.ECDF is deprecated).
import numpy as np
import statsmodels.api as sm # recommended import according to the docs
import matplotlib.pyplot as plt
sample = np.random.uniform(0, 1, 50)
ecdf = sm.distributions.ECDF(sample)
x = np.linspace(min(sample), max(sample))
y = ecdf(x)
plt.step(x, y)
plt.show()
That looks to be (almost) exactly what you want. Two things:
First, the results are a tuple of four items. The third is the size of the bins. The second is the starting point of the smallest bin. The first is the number of points in or below each bin. (The last is the number of points outside the limits, but since you haven't set any, all points will be binned.)
Second, you'll want to rescale the results so the final value is 1, to follow the usual conventions of a CDF, but otherwise it's right.
Here's what it does under the hood:
def cumfreq(a, numbins=10, defaultreallimits=None):
    # docstring omitted
    h, l, b, e = histogram(a, numbins, defaultreallimits)
    cumhist = np.cumsum(h * 1, axis=0)
    return cumhist, l, b, e
It does the histogramming, then produces a cumulative sum of the counts in each bin. So the ith value of the result is the number of array values less than or equal to the maximum of the ith bin. And the final value is just the size of the initial array.
Finally, to plot it, you'll need to use the initial value of the bin, and the bin size to determine what x-axis values you'll need.
Another option is to use numpy.histogram which can do the normalization and returns the bin edges. You'll need to do the cumulative sum of the resulting counts yourself.
a = array([...]) # your array of numbers
num_bins = 20
counts, bin_edges = numpy.histogram(a, bins=num_bins, density=True)  # density= replaces the removed normed= argument
cdf = numpy.cumsum(counts)
pylab.plot(bin_edges[1:], cdf)
(bin_edges[1:] is the upper edge of each bin.)
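A possible refinement, reusing counts and bin_edges from above: prepend the left edge so the curve starts at zero and ends exactly at 1.
cdf = numpy.concatenate([[0.0], numpy.cumsum(counts * numpy.diff(bin_edges))])
pylab.plot(bin_edges, cdf)  # starts at (bin_edges[0], 0) and ends at 1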
Have you tried the cumulative=True argument to pyplot.hist?
One-liner based on Dave's answer:
plt.plot(np.sort(arr), np.linspace(0, 1, len(arr), endpoint=False))
Edit: this was also suggested by hans_meine in the comments.
Assuming that vals holds your values, then you can simply plot the CDF as follows:
y = numpy.arange(0, 101)
x = numpy.percentile(vals, y)
plot(x, y)
To scale it between 0 and 1, just divide y by 100.
What do you want to do with the CDF ?
To plot it, that's a start. You could try a few different values, like this:
from __future__ import division
import numpy as np
from scipy.stats import cumfreq
import pylab as plt
hi = 100.
a = np.arange(hi) ** 2
for nbins in (2, 20, 100):
    cf = cumfreq(a, nbins)  # bin values, lowerlimit, binsize, extrapoints
    w = hi / nbins
    x = np.linspace(w/2, hi - w/2, nbins)  # care
    # print(x, cf)
    plt.plot(x, cf[0], label=str(nbins))
plt.legend()
plt.show()
The Wikipedia article on Histogram lists various rules for the number of bins, e.g. num_bins ~ sqrt(len(a)).
(Fine print: two quite different things are going on here:
binning / histogramming the raw data, and
plot interpolating a smooth curve through the, say, 20 binned values.
Either of these can go way off on data that's "clumpy" or has long tails, even for 1d data; 2d and 3d data get increasingly difficult.
See also Density_estimation and using scipy gaussian kernel density estimation.)
I have a trivial addition to AFoglia's method, to normalize the CDF
n_counts, bin_edges = np.histogram(myarray, bins=11, density=True)  # density= replaces the removed normed= argument
cdf = np.cumsum(n_counts)  # cdf not normalized, despite density=True above
scale = 1.0 / cdf[-1]
ncdf = scale * cdf
Normalizing the histogram makes its integral unity, which means the cdf will not be normalized. You've got to scale it yourself.
If you want to display the actual true ECDF (which as David B noted is a step function that increases 1/n at each of n datapoints), my suggestion is to write code to generate two "plot" points for each datapoint:
a = array([...]) # your array of numbers
sorted_vals = np.sort(a)  # avoid shadowing the built-in sorted()
x2 = []
y2 = []
y = 0
for x in sorted_vals:
    x2.extend([x, x])
    y2.append(y)
    y += 1.0 / len(a)
    y2.append(y)
plt.plot(x2, y2)
This way you will get a plot with the n steps that are characteristic of an ECDF, which is nice especially for data sets that are small enough for the steps to be visible. Also, there is no need to do any binning with histograms (which risks introducing bias into the drawn ECDF).
We can just use the step function from matplotlib, which makes a step-wise plot, which is the definition of the empirical CDF:
import numpy as np
from matplotlib import pyplot as plt
data = np.random.randn(11)
levels = np.linspace(0, 1, len(data) + 1) # endpoint 1 is included by default
plt.step(sorted(list(data) + [max(data)]), levels)
The final vertical line at max(data) was added manually. Otherwise the plot just stops at level 1 - 1/len(data).
Alternatively we can use the where='post' option to step()
levels = np.linspace(1. / len(data), 1, len(data))
plt.step(sorted(data), levels, where='post')
in which case the initial vertical line from zero is not plotted.
It's a one-liner in seaborn using the cumulative=True parameter. Here you go,
import seaborn as sns
sns.kdeplot(a, cumulative=True)
This is using bokeh
from bokeh.plotting import figure, show
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(pd_series)  # pd_series: your pandas Series (or array) of values
p = figure(title="tests", tools="save", background_fill_color="#E8DDCB")
p.line(ecdf.x,ecdf.y)
show(p)
Although there are many great answers here, I would include a more customized ECDF plot.
Generate values for the empirical cumulative distribution function
import numpy as np
import matplotlib.pyplot as plt

def ecdf_values(x):
    """
    Generate values for empirical cumulative distribution function

    Params
    --------
    x (array or list of numeric values): distribution for ECDF

    Returns
    --------
    x (array): x values
    y (array): percentile values
    """
    # Sort values and find length
    x = np.sort(x)
    n = len(x)
    # Create percentiles
    y = np.arange(1, n + 1, 1) / n
    return x, y
def ecdf_plot(x, name='Value', plot_normal=True, log_scale=False, save=False, save_name='Default'):
    """
    ECDF plot of x

    Params
    --------
    x (array or list of numerics): distribution for ECDF
    name (str): name of the distribution, used for labeling
    plot_normal (bool): plot the normal distribution (from mean and std of data)
    log_scale (bool): transform the scale to logarithmic
    save (bool) : save/export plot
    save_name (str) : filename to save the plot

    Returns
    --------
    none, displays plot
    """
    xs, ys = ecdf_values(x)
    fig = plt.figure(figsize=(10, 6))
    ax = plt.subplot(1, 1, 1)
    plt.step(xs, ys, linewidth=2.5, c='b')
    plot_range = ax.get_xlim()[1] - ax.get_xlim()[0]
    fig_sizex = fig.get_size_inches()[0]
    data_inch = plot_range / fig_sizex
    right = 0.6 * data_inch + max(xs)
    gap = right - max(xs)
    left = min(xs) - gap
    if log_scale:
        ax.set_xscale('log')
    if plot_normal:
        gxs, gys = ecdf_values(np.random.normal(loc=xs.mean(),
                                                scale=xs.std(),
                                                size=100000))
        plt.plot(gxs, gys, 'g')
    plt.vlines(x=min(xs),
               ymin=0,
               ymax=min(ys),
               color='b',
               linewidth=2.5)
    # Add ticks
    plt.xticks(size=16)
    plt.yticks(size=16)
    # Add Labels
    plt.xlabel(f'{name}', size=18)
    plt.ylabel('Percentile', size=18)
    plt.vlines(x=min(xs),
               ymin=min(ys),
               ymax=0.065,
               color='r',
               linestyle='-',
               alpha=0.8,
               linewidth=1.7)
    plt.vlines(x=max(xs),
               ymin=0.935,
               ymax=max(ys),
               color='r',
               linestyle='-',
               alpha=0.8,
               linewidth=1.7)
    # Add Annotations (text passed positionally; the old s= keyword was removed)
    plt.annotate(f'{min(xs):.2f}',
                 xy=(min(xs), 0.065),
                 horizontalalignment='center',
                 verticalalignment='bottom',
                 size=15)
    plt.annotate(f'{max(xs):.2f}',
                 xy=(max(xs), 0.935),
                 horizontalalignment='center',
                 verticalalignment='top',
                 size=15)
    ps = [0.25, 0.5, 0.75]
    for p in ps:
        ax.set_xlim(left=left, right=right)
        ax.set_ylim(bottom=0)
        value = xs[np.where(ys > p)[0][0] - 1]
        pvalue = ys[np.where(ys > p)[0][0] - 1]
        plt.hlines(y=p, xmin=left, xmax=value,
                   linestyles=':', colors='r', linewidth=1.4)
        plt.vlines(x=value, ymin=0, ymax=pvalue,
                   linestyles=':', colors='r', linewidth=1.4)
        plt.text(x=p / 3, y=p - 0.01,
                 transform=ax.transAxes,
                 s=f'{int(100*p)}%', size=15,
                 color='r', alpha=0.7)
        plt.text(x=value, y=0.01, size=15,
                 horizontalalignment='left',
                 s=f'{value:.2f}', color='r', alpha=0.8)
    # fit the labels into the figure
    plt.title(f'ECDF of {name}', size=20)
    plt.tight_layout()
    if save:
        plt.savefig(save_name + '.png')

ecdf_plot(np.random.randn(100), name='Normal Distribution', save=True, save_name="ecdf")
Additional Resources:
ECDF
Interpreting ECDF
(This is a copy of my answer to the question: Plotting CDF of a pandas series in python)
A CDF or cumulative distribution function plot is basically a graph with the sorted values on the X-axis and the cumulative distribution on the Y-axis. So, I would create a new series with the sorted values as index and the cumulative distribution as values.
First create an example series:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series (older pandas used ser.order()):
ser = ser.sort_values()
Now, before proceeding, append again the last (and largest) value. This step is important especially for small sample sizes in order to get an unbiased CDF:
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as index and the cumulative distribution as values
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')
None of the answers so far covers what I wanted when I landed here, which is:
import numpy as np

def empirical_cdf(x, data):
    "evaluate ecdf of data at points x"
    return np.mean(data[None, :] <= x[:, None], axis=1)
It evaluates the empirical CDF of a given dataset at an array of points x, which do not have to be sorted. There is no intermediate binning and no external libraries.
An equivalent method that scales better for large x is to sort the data and use np.searchsorted:
def empirical_cdf(x, data):
    "evaluate ecdf of data at points x"
    data = np.sort(data)
    # side='right' counts points <= x, matching the version above
    return np.searchsorted(data, x, side='right') / float(data.size)
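For instance, a small usage sketch evaluating and plotting either version:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randn(1000)
x = np.linspace(-4, 4, 200)
plt.plot(x, empirical_cdf(x, data))
plt.show()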
In my opinion, none of the previous methods do the complete (and strict) job of plotting the empirical CDF, which was the asker's original question. I post my proposal for any lost and sympathetic souls.
My proposal has the following features: 1) it considers the empirical CDF defined as in the first expression here, i.e., as in A. W. van der Vaart's Asymptotic Statistics (1998); 2) it explicitly shows the step behavior of the function; 3) it explicitly shows that the empirical CDF is continuous from the right, by showing marks that resolve the discontinuities; 4) it extends the zero and one values at the extremes up to user-defined margins. I hope it helps someone:
import numpy as np
import matplotlib.pyplot as plt

def plot_cdf(data, xaxis=None, figsize=(20, 10), line_style='b-',
             ball_style='bo', xlabel=r"Random variable $X$",
             ylabel=r"$N$-samples empirical CDF $F_{X,N}(x)$"):
    # Contribution of each data point to the empirical distribution
    weights = 1 / data.size * np.ones_like(data)
    # CDF estimation
    cdf = np.cumsum(weights)
    # Plot central part of the CDF
    plt.figure(figsize=figsize)
    plt.step(np.sort(data), cdf, line_style, where='post')
    # Plot valid points at discontinuities
    plt.plot(np.sort(data), cdf, ball_style)
    # Extract plot axis and extend outside the data range
    if xaxis is not None:
        (xmin, xmax, ymin, ymax) = plt.axis()
        xmin = xaxis[0]
        xmax = xaxis[1]
        plt.axis([xmin, xmax, ymin, ymax])
    else:
        (xmin, xmax, _, _) = plt.axis()
    plt.plot([xmin, data.min(), data.min()], np.zeros(3), line_style)
    plt.plot([data.max(), xmax], np.ones(2), line_style)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
What I did to evaluate the cdf for a large dataset:
Find the unique values (assuming the data is in a pandas Series, here called series, as in the example below):
unique_values = np.sort(series.unique())
Make the rank array for these sorted, unique values:
ranks = np.arange(0, len(unique_values)) / (len(unique_values) - 1)
Plot unique_values vs ranks
Example
The code below plots the cdf of the population dataset from Kaggle:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

us_census_data = pd.read_csv('acs2015_census_tract_data.csv')
population = us_census_data['TotalPop'].dropna()
## sort the unique values using the pandas unique function
unique_pop = np.sort(population.unique())
cdf = np.arange(0, len(unique_pop), step=1) / (len(unique_pop) - 1)
## plotting
plt.plot(unique_pop, cdf)
plt.show()
This can easily be done with seaborn, which is a high-level API for matplotlib.
data can be a pandas.DataFrame, numpy.ndarray, mapping, or sequence.
An axes-level plot can be done using seaborn.ecdfplot.
A figure-level plot can be done using sns.displot with kind='ecdf'.
See How to use markers with ECDF plot for other options.
It’s also possible to plot the empirical complementary CDF (1 - CDF) by specifying complementary=True.
Tested in python 3.11, pandas 1.5.2, matplotlib 3.6.2, seaborn 0.12.1
import seaborn as sns
import matplotlib.pyplot as plt
# load sample dataframe
df = sns.load_dataset('penguins', cache=False)
# display(df.head(3))
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
# plot ecdf
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.ecdfplot(data=df, x='bill_length_mm', ax=ax1)
ax1.set_title('Without hue')
sns.ecdfplot(data=df, x='bill_length_mm', hue='species', ax=ax2)
ax2.set_title('Separated species by hue')
