Creating subplots with equal axis scale, Python, matplotlib - python

I am plotting seismological data and am creating a figure featuring 16 subplots of different depth slices. Each subplot displays the lat/lon of the epicenter and the color is scaled to its magnitude. I am trying to do two things:
1. Adjust the scale of all plots to the x and y min and max of the selected area, so that every plot ranges from xmin to xmax and from ymin to ymax. This will allow easy comparison across the plots.
2. Adjust the magnitude colors so they also share one scale (i.e. colors represent all available points, not just the points on that specific subplot).
I have seen this accomplished a number of ways but am struggling to apply them to the loop in my code. The data I am using is here: Data.
I posted my code and what the current output looks like below.
import matplotlib.pyplot as plt
import pandas as pd
eq_df = pd.read_csv(eq_csv)
eq_data = eq_df[['LON', 'LAT', 'DEPTH', 'MAG']]
nbound = max(eq_data.LAT)
sbound = min(eq_data.LAT)
ebound = max(eq_data.LON)
wbound = min(eq_data.LON)
xlimit = (wbound, ebound)
ylimit = (sbound, nbound)
magmin = min(eq_data.MAG)
magmax = max(eq_data.MAG)
for n in list(range(1,17)):
    km = eq_data[(eq_data.DEPTH > n - 1) & (eq_data.DEPTH <= n)]
    plt.subplot(4, 4, n)
    plt.scatter(km["LON"], km['LAT'], s = 10, c = km['MAG'], vmin = magmin, vmax = magmax) #added vmin/vmax to scale my magnitude data
    plt.ylim(sbound, nbound) # set y limits of plot
    plt.xlim(wbound, ebound) # set x limits of plot
    plt.tick_params(axis='both', which='major', labelsize= 6)
    plt.subplots_adjust(hspace = 1)
    plt.gca().set_title('Depth = ' + str(n - 1) +'km to ' + str(n) + 'km', size = 8) #set title of subplots
    plt.suptitle('Magnitude of Events at Different Depth Slices, 1950 to Today')
plt.show()
ETA: new code to resolve my issue

In response to this comment on the other answer, here is a demonstration of the use of sharex=True and sharey=True for this use case:
import matplotlib.pyplot as plt
import numpy as np
# Supply the limits since random data will be plotted
wbound = -0.1
ebound = 1.1
sbound = -0.1
nbound = 1.1
fig, axs = plt.subplots(nrows=4, ncols=4, figsize=(16,12), sharex=True, sharey=True)
plt.xlim(wbound, ebound)
plt.ylim(sbound, nbound)
for n, ax in enumerate(axs.flatten()):
    ax.scatter(np.random.random(20), np.random.random(20),
               c = np.random.random(20), marker = '.')
    ticks = [n % 4 == 0, n >= 12]  # left column, bottom row (n runs from 0 to 15)
    ax.tick_params(left=ticks[0], bottom=ticks[1])
    ax.set_title('Depth = ' + str(n) +'km to ' + str(n + 1) + 'km', size = 12)
plt.suptitle('Magnitude of Events at Different Depth Slices, 1950 to Today', y = 0.95)
plt.subplots_adjust(wspace=0.05)
plt.show()
Explanation of a couple of things:
I have reduced the horizontal spacing between subplots with subplots_adjust(wspace=0.05).
plt.suptitle does not need to be (and should not be) in the loop.
ticks = [n % 4 == 0, n >= 12] creates a pair of bools for each axis (is it in the left column, is it in the bottom row), which is then used to control which tick marks are drawn. Because enumerate starts at 0, the bottom row of the 4x4 grid is n = 12 to 15.
Left and bottom tick marks are controlled for each axis with ax.tick_params(left=ticks[0], bottom=ticks[1]).
plt.xlim() and plt.ylim() need only be called once, before the loop; with sharex=True and sharey=True the limits propagate to every subplot.

Finally got it thanks to some help above and some extended googling.
I have updated my code above with notes indicating where code was added.
To adjust the limits of my plot axes I used:
plt.ylim(sbound, nbound)
plt.xlim(wbound, ebound)
To scale my magnitude data across all plots I added vmin, vmax to the following line:
plt.scatter(km["LON"], km['LAT'], s = 10, c = km['MAG'], vmin = magmin, vmax = magmax)
And here is the resulting figure:
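For readers who want to try the two fixes together (shared axis limits and a shared colour scale), here is a minimal sketch. It does not use the original earthquake file: the lon, lat, depth and mag arrays are synthetic stand-ins, and the 2x2 grid and the single colorbar are illustrative choices rather than part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the earthquake table (hypothetical values)
lon = rng.uniform(-120, -115, 400)
lat = rng.uniform(32, 36, 400)
depth = rng.uniform(0, 4, 400)
mag = rng.uniform(1, 6, 400)

fig, axs = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
for n, ax in enumerate(axs.flat):
    in_slice = (depth > n) & (depth <= n + 1)
    # vmin/vmax come from ALL points, so colours are comparable across panels
    sc = ax.scatter(lon[in_slice], lat[in_slice], s=10, c=mag[in_slice],
                    vmin=mag.min(), vmax=mag.max())
    ax.set_title(f'Depth = {n}km to {n + 1}km', size=8)

# With shared axes, setting the limits on one subplot sets them on all
axs[0, 0].set_xlim(lon.min(), lon.max())
axs[0, 0].set_ylim(lat.min(), lat.max())

# One colorbar for the whole grid; any of the scatter handles will do,
# since they all use the same vmin/vmax
cbar = fig.colorbar(sc, ax=axs, label='Magnitude')
```

Call plt.show() or fig.savefig(...) afterwards to look at the result.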

Related

matplotlib barh: how to make a visual gap between two groups of bars?

I have some sorted data of which I only show the highest and lowest values in a figure. This is a minimal version of what currently I have:
import matplotlib.pyplot as plt
# some dummy data (real data contains about 250 entries)
x_data = list(range(98, 72, -1))
labels = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
ranks = list(range(1, 27))
fig, ax = plt.subplots()
# plot 3 highest entries
bars_top = ax.barh(labels[:3], x_data[:3])
# plot 3 lowest entries
bars_bottom = ax.barh(labels[-3:], x_data[-3:])
ax.invert_yaxis()
# print values and ranks
for bar, value, rank in zip(bars_top + bars_bottom,
                            x_data[:3] + x_data[-3:],
                            ranks[:3] + ranks[-3:]):
    y_pos = bar.get_y() + 0.5
    ax.text(value - 4, y_pos, value, ha='right')
    ax.text(4, y_pos, rf'$rank:\ {rank}$')  # raw string keeps the mathtext "\ " intact
ax.set_title('Comparison of Top 3 and Bottom 3')
plt.show()
Result:
I'd like to make an additional gap to this figure to make it more visually clear that the majority of data is in fact not displayed in this plot. For example, something very simple like the following would be sufficient:
Is this possible in matplotlib?
Here is a flexible approach that just plots a dummy bar in-between. The yaxis-transform together with the dummy bar's position is used to plot 3 black dots.
If multiple separations are needed, they all need a different dummy label, for example repeating the space character.
import matplotlib.pyplot as plt
import numpy as np
# some dummy data (real data contains about 250 entries)
x_data = list(range(98, 72, -1))
labels = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
ranks = list(range(1, 27))
fig, ax = plt.subplots()
# plot 3 highest entries
bars_top = ax.barh(labels[:3], x_data[:3])
# dummy bar inbetween
dummy_bar = ax.barh(" ", 0, color='none')
# plot 3 lowest entries
bars_bottom = ax.barh(labels[-3:], x_data[-3:])
ax.invert_yaxis()
# print values and ranks
for bar, value, rank in zip(bars_top + bars_bottom,
                            x_data[:3] + x_data[-3:],
                            ranks[:3] + ranks[-3:]):
    y_pos = bar.get_y() + 0.5
    ax.text(value - 4, y_pos, value, ha='right')
    ax.text(4, y_pos, rf'$rank:\ {rank}$')  # raw string keeps the mathtext "\ " intact
# add three dots using the dummy bar's position
ax.scatter([0.05] * 3, dummy_bar[0].get_y() + np.linspace(0, dummy_bar[0].get_height(), 3),
           marker='o', s=5, color='black', transform=ax.get_yaxis_transform())
ax.set_title('Comparison of Top 3 and Bottom 3')
ax.tick_params(axis='y', length=0) # hide the tick marks
ax.margins(y=0.02) # less empty space at top and bottom
plt.show()
The following function,
def top_bottom(x, l, n, ax=None, gap=1):
    from matplotlib.pyplot import gca
    if n <= 0 : raise ValueError('No. of top/bottom values must be positive')
    if n > len(x) : raise ValueError('No. of top/bottom values should be not greater than data length')
    if n+n > len(x):
        print('Warning: no. of top/bottom values is larger than one'
              ' half of data length, OVERLAPPING')
    if gap < 0 : print('Warning: some bar will be overlapped')
    ax = ax if ax else gca()
    top_x = x[:+n]
    bot_x = x[-n:]
    top_y = list(range(n+n, n, -1))
    bot_y = list(range(n-gap, -gap, -1))
    top_l = l[:+n] # A B C
    bot_l = l[-n:] # X Y Z
    top_bars = ax.barh(top_y, top_x)
    bot_bars = ax.barh(bot_y, bot_x)
    ax.set_yticks(top_y+bot_y)
    ax.set_yticklabels(top_l+bot_l)
    return top_bars, bot_bars
when invoked with your data and n=4, gap=4
bars_top, bars_bottom = top_bottom(x_data, labels, 4, gap=4)
produces
Later, you'll be able to customize the appearance of the bars as you like using the Artists returned by the function.

Make plt.colorbar extend to the steps immediately before and after vmin/vmax

I want to do something with plt.hist2d and plt.colorbar and I'm having real trouble working out how to do it. To explain, I've written the following example:
import numpy as np
from matplotlib import pyplot as plt
x = np.random.random(int(1e6))  # size must be an integer
y = np.random.random(int(1e6))
plt.hist2d(x, y)
plt.colorbar()
plt.show()
This code generates a plot that looks something like the image below.
If I generate a histogram, ideally I would like the colour bar to extend beyond the maximum and minimum range of the data to the next step beyond the maximum and minimum. In the example in this question, this would set the colour bar extent from 9660 to 10260 in increments of 60.
How can I force either plt.hist2d or plt.colorbar to set the colour bar such that ticks are assigned to the start and end of the plotted colour bar?
I think this is what you're looking for:
h = plt.hist2d(x, y)
mn, mx = h[-1].get_clim()
mn = 60 * np.floor(mn / 60.)
mx = 60 * np.ceil(mx / 60.)
h[-1].set_clim(mn, mx)
cbar = plt.colorbar(h[-1], ticks=np.arange(mn, mx + 1, 60), )
This gives something like,
It's also often convenient to use the tickers from matplotlib.ticker and their tick_values method, but for this purpose I think the above is most convenient.
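The floor/ceil rounding used above can be pulled out into a small helper. This is only a sketch: snap_limits is a hypothetical name, and the limits 9687 and 10243 are made-up stand-ins for whatever get_clim() returns on a real histogram. They were picked so that they snap to the 9660 to 10260 range mentioned in the question.

```python
import numpy as np

def snap_limits(lo, hi, step):
    """Widen (lo, hi) outward to the nearest multiples of step."""
    return step * np.floor(lo / step), step * np.ceil(hi / step)

# Hypothetical colour limits, standing in for h[-1].get_clim()
mn, mx = snap_limits(9687.0, 10243.0, 60)

# Tick positions from the snapped minimum to the snapped maximum, inclusive
ticks = np.arange(mn, mx + 1, 60)
```

The resulting mn, mx and ticks would then be passed to set_clim and plt.colorbar(..., ticks=ticks) exactly as in the snippet above.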
Good luck!
With huge thanks to farenorth, who got me thinking about this in the right way, I came up with a function, get_colour_bar_ticks:
def get_colour_bar_ticks(colourbar):
    import numpy as np
    # Note: written against an older matplotlib, where the colour bar itself
    # had get_clim() and its axis ticks were reported as fractions.
    # Get the limits and the extent of the colour bar.
    limits = colourbar.get_clim()
    extent = limits[1] - limits[0]
    # Get the yticks of the colour bar as values (ax.get_yticks() returns them as fractions).
    fractions = colourbar.ax.get_yticks()
    yticks = (fractions * extent) + limits[0]
    increment = yticks[1] - yticks[0]
    # Generate the expanded ticks.
    if (fractions[0] == 0) and (fractions[-1] == 1):
        return yticks
    else:
        start = yticks[0] - increment
        end = yticks[-1] + increment
        if fractions[0] == 0:
            newticks = np.concatenate((yticks, [end]))
        elif fractions[-1] == 1:  # last tick already sits at the top
            newticks = np.concatenate(([start], yticks))
        else:
            newticks = np.concatenate(([start], yticks, [end]))
        return newticks
With this function I can then do this:
import numpy as np
from matplotlib import pyplot as plt
x = np.random.random(int(1e6))  # size must be an integer
y = np.random.random(int(1e6))
h = plt.hist2d(x, y)
cbar = plt.colorbar()
ticks = get_colour_bar_ticks(cbar)
h[3].set_clim(ticks[0], ticks[-1])
cbar.set_clim(ticks[0], ticks[-1])
cbar.set_ticks(ticks)
plt.show()
Which results in this, which is what I really wanted:

Improve ticking and grid using matplotlib

I have the following code:
import datetime
from matplotlib.ticker import FormatStrFormatter
from pylab import *
hits=array([100,250,130,290])
misses=array([13,18,105,15])
X = np.arange(len(hits))
base=datetime.date(2014, 8, 1)
date_list=array([base + datetime.timedelta(days=x) for x in range(0,len(hits))])
fig,ax = plt.subplots(1,1,figsize=(15,10))
bar_handles=[]
for i in range(len(hits)):
    bar_handles.append(
        ax.barh(
            -X[i],hits[i],facecolor='#89E07E', edgecolor='white',
            align='center',label="Impressions"))
    bar_handles.append(
        ax.barh(-X[i],-misses[i],facecolor='#F03255', edgecolor='white',
                align='center',label="Misses"))
for i in range(len(bar_handles)):
    patch = bar_handles[i].get_children()[0]
    bl = patch.get_xy()
    percent_x = 0.5*patch.get_width() + bl[0]
    percent_y = 0.5*patch.get_height() + bl[1]
    percentage=0
    if i%2==0:
        j=i//2  # integer division so j can be used as an index
        percentage = 100*(float(hits[j])/float(hits[j]+misses[j]))
    else:
        j=(i-1)//2
        percentage = 100*(float(misses[j])/float(hits[j]+misses[j]))
    ax.text(percent_x,percent_y,"%d%%" % percentage,ha='center',va='center')
for i in range(len(hits)):
    plt.yticks(-X,date_list)
plt.tick_params(which='both', width=0)
max_hits_num=round(np.amax(hits),-2)
max_miss_num=round(np.amax(misses),-2)
xticks=np.arange(-max_miss_num,max_hits_num,50)
minorLocator = FixedLocator(xticks)
majorLocator = FixedLocator([0])
ax.xaxis.set_major_locator(majorLocator)
ax.xaxis.set_minor_locator(minorLocator)
ax.xaxis.set_minor_formatter(FormatStrFormatter('%d'))
ax.yaxis.grid(False)
ax.xaxis.grid(True,which='minor', color='0.5', linestyle='-',linewidth=1)
ax.xaxis.grid(True,which='major', color='b', linestyle='-',linewidth=2.5)
# ax2 = plt.twinx()
# ax2.grid(False)
# for i in range(len(hits)):
# plt.yticks(-X,hits+misses)
plt.show()
This generates the following image:
I am left with one big issue and two minor problems. The big issue is that I want to add the sums of the values on the right y-axis, that is, add 113, 268, 235 and 305. Trying something along the lines of twinx or shared subplots did not work out for me.
The minor issues are:
On the x-axis, the values to the left of 0 should be shown without the minus sign.
If you look closely, you see that the blue major vertical grid line coincides with a gray minor one. It would be nice to have only the blue one. This can be solved by first finding the index of 0 in xticks using numpy.where and then removing this element using numpy.delete.
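The numpy.where / numpy.delete idea for the second minor issue, together with one possible way of handling the first, might look like the sketch below. The tick range is a made-up example, and using FuncFormatter to print absolute values is an assumption on my part, not something taken from the code above.

```python
import numpy as np
from matplotlib.ticker import FuncFormatter

# Example minor-tick positions (hypothetical range)
xticks = np.arange(-100, 300, 50)

# Find the index of 0 and drop it, so no minor grid line is drawn
# underneath the major one at x = 0
zero_idx = np.where(xticks == 0)[0]
minor_ticks = np.delete(xticks, zero_idx)

# Formatter that prints the absolute value, hiding the minus sign
abs_formatter = FuncFormatter(lambda value, pos: f'{abs(value):.0f}')
```

minor_ticks would go into FixedLocator(...) in place of xticks, and abs_formatter into ax.xaxis.set_minor_formatter(...) in place of FormatStrFormatter('%d').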

Histogram bars overlapping matplotlib

I am able to build the histogram I need. However, the bars overlap over one another.
As you can see I changed the width of the bars to 0.2 but it still overlaps. What is the mistake I am doing?
from matplotlib import pyplot as plt
import numpy as np
from matplotlib.font_manager import FontProperties
from random import randrange
color = ['r', 'b', 'g','c','m','y','k','darkgreen', 'darkkhaki', 'darkmagenta', 'darkolivegreen', 'darkorange', 'darkorchid', 'darkred']
label = ['2','6','10','14','18','22','26','30','34','38','42','46']
file_names = ['a','b','c']
diff = [[randrange(10) for a in range(0, len(label))] for a in range(0, len(file_names))]
print(diff)
x = diff
name = file_names
y = zip(*x)
pos = np.arange(len(x))
width = 1. / (1 + len(x))
fig, ax = plt.subplots()
for idx, (serie, color,label) in enumerate(zip(y, color,label)):
    ax.bar(pos + idx * width, serie, width, color=color, label=label)
ax.set_xticks(pos + width)
plt.xlabel('foo')
plt.ylabel('bar')
ax.set_xticklabels(name)
ax.legend()
plt.savefig("final" + '.eps', bbox_inches='tight', pad_inches=0.5,dpi=100,format="eps")
plt.clf()
Here is the graph:
As you can see in the below example, you can easily get non-overlapping bars using a heavily simplified version of your plotting code. I'd suggest you to have a closer look at whether x and y really are what you expect them to be. (And that you try to simplify your code as much as possible when you are looking for an error in the code.)
Also have a look at the computation of the width of the bars. You appear to use the number of subjects for this, while it should be the number of bars per subject instead.
Have a look at this example:
import numpy as np
import matplotlib.pyplot as plt
subjects = ('Tom', 'Dick', 'Harry', 'Sally', 'Sue')
# number of bars per subject
n = 5
# y-data per subject
y = np.random.rand(n, len(subjects))
# x-positions for the bars
x = np.arange(len(subjects))
# plot bars
width = 1./(1+n) # <-- n.b., use number of bars, not number of subjects
for i, yi in enumerate(y):
    plt.bar(x+i*width, yi, width)
# add labels
plt.xticks(x+n/2.*width, subjects)
plt.show()
This is the result image:
For reference:
http://matplotlib.org/examples/api/barchart_demo.html
http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.bar
The problem is that the width of your bars is calculated from the three subjects, not the twelve bars per subject. That means you're placing multiple bars at each x-position. Try swapping in these lines where appropriate to fix that:
n = len(x[0]) # New variable with the right length to calculate bar width
width = 1. / (1 + n)
ax.set_xticks(pos + n/2. * width)
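To see numerically why the width has to come from the number of bars per group, here is a small sketch with no plotting at all. The counts (3 groups, 12 bars each) mirror the question; the variable names are mine. The tick position assumes matplotlib's current default of align='center' for bars, which is why it uses (n_bars - 1) / 2 rather than n / 2.

```python
import numpy as np

n_groups = 3   # e.g. the three files 'a', 'b', 'c'
n_bars = 12    # bars drawn inside each group

# Width derived from bars per group, leaving one bar-width of gap between groups
width = 1.0 / (1 + n_bars)

pos = np.arange(n_groups)
# x position of every bar: one row per series, one column per group
bar_x = pos[None, :] + np.arange(n_bars)[:, None] * width

# Centre of each group for centre-aligned bars, where the tick label belongs
tick_pos = pos + (n_bars - 1) / 2.0 * width

# Width computed the way the question did it, from the number of groups
wrong_width = 1.0 / (1 + n_groups)
```

With width, the 12 bars of a group span 12/13 of the unit spacing and stay clear of the next group; with wrong_width they would span 3 units and run into the neighbouring groups.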

How to plot empirical cdf (ecdf)

How can I plot the empirical CDF of an array of numbers in matplotlib in Python? I'm looking for the cdf analog of pylab's "hist" function.
One thing I can think of is:
from scipy.stats import cumfreq
a = array([...]) # my array of numbers
num_bins = 20
b = cumfreq(a, num_bins)
plt.plot(b)
If you like linspace and prefer one-liners, you can do:
plt.plot(np.sort(a), np.linspace(0, 1, len(a), endpoint=False))
Given my tastes, I almost always do:
# a is the data array
x = np.sort(a)
y = np.arange(len(x))/float(len(x))
plt.plot(x, y)
Which works for me even if there are >O(1e6) data values.
If you really need to downsample I'd set
x = np.sort(a)[::down_sampling_step]
Edit to respond to comment/edit on why I use endpoint=False or the y as defined above. The following are some technical details.
The empirical CDF is usually formally defined as
CDF(x) = "number of samples <= x"/"number of samples"
in order to exactly match this formal definition you would need to use y = np.arange(1,len(x)+1)/float(len(x)) so that we get
y = [1/N, 2/N ... 1]. This estimator is an unbiased estimator that will converge to the true CDF in the limit of infinite samples (Wikipedia ref.).
I tend to use y = [0, 1/N, 2/N ... (N-1)/N] since:
(a) it is easier to code/more idiomatic,
(b) but is still formally justified since one can always exchange CDF(x) with 1-CDF(x) in the convergence proof, and
(c) works with the (easy) downsampling method described above.
In some particular cases, it is useful to define
y = (arange(len(x))+0.5)/len(x)
which is intermediate between these two conventions. In effect, it says "there is a 1/(2N) chance of a value less than the lowest one I've seen in my sample, and a 1/(2N) chance of a value greater than the largest one I've seen so far".
Note that the selection of this convention interacts with the where parameter used in plt.step if it seems more useful to display the CDF as a piecewise constant function. In order to exactly match the formal definition mentioned above, one would need to use where='pre' with the suggested y=[0,1/N..., 1-1/N] convention, or where='post' with the y=[1/N, 2/N ... 1] convention, but not the other way around.
However, for large samples and reasonable distributions, the convention given in the main body of the answer is easy to write, is an unbiased estimator of the true CDF, and works with the downsampling methodology.
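The three y conventions discussed above are easy to compare side by side on a tiny sample. The four values below are arbitrary, and the variable names are only for illustration.

```python
import numpy as np

a = np.array([3.0, 1.0, 4.0, 2.0])  # arbitrary sample
x = np.sort(a)
n = len(x)

y_formal = np.arange(1, n + 1) / n   # [1/N, 2/N, ..., 1], matches the formal definition
y_shifted = np.arange(n) / n         # [0, 1/N, ..., (N-1)/N], the convention used above
y_mid = (np.arange(n) + 0.5) / n     # halfway between the two
```

Any of these can be passed to plt.plot(x, y) or plt.step(x, y, where=...), keeping in mind the pairing of convention and where value described above.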
You can use the ECDF function from the scikits.statsmodels library:
import numpy as np
import scikits.statsmodels as sm
import matplotlib.pyplot as plt
sample = np.random.uniform(0, 1, 50)
ecdf = sm.tools.ECDF(sample)
x = np.linspace(min(sample), max(sample))
y = ecdf(x)
plt.step(x, y)
With version 0.4, scikits.statsmodels was renamed to statsmodels. ECDF is now located in the distributions module (while statsmodels.tools.tools.ECDF is deprecated).
import numpy as np
import statsmodels.api as sm # recommended import according to the docs
import matplotlib.pyplot as plt
sample = np.random.uniform(0, 1, 50)
ecdf = sm.distributions.ECDF(sample)
x = np.linspace(min(sample), max(sample))
y = ecdf(x)
plt.step(x, y)
plt.show()
That looks to be (almost) exactly what you want. Two things:
First, the results are a tuple of four items. The third is the size of the bins. The second is the starting point of the smallest bin. The first is the number of points in or below each bin. (The last is the number of points outside the limits, but since you haven't set any, all points will be binned.)
Second, you'll want to rescale the results so the final value is 1, to follow the usual conventions of a CDF, but otherwise it's right.
Here's what it does under the hood:
def cumfreq(a, numbins=10, defaultreallimits=None):
    # docstring omitted
    h,l,b,e = histogram(a,numbins,defaultreallimits)
    cumhist = np.cumsum(h*1, axis=0)
    return cumhist,l,b,e
It does the histogramming, then produces a cumulative sum of the counts in each bin. So the ith value of the result is the number of array values less than or equal to the maximum of the ith bin. So, the final value is just the size of the initial array.
Finally, to plot it, you'll need to use the initial value of the bin, and the bin size to determine what x-axis values you'll need.
Another option is to use numpy.histogram which can do the normalization and returns the bin edges. You'll need to do the cumulative sum of the resulting counts yourself.
a = array([...]) # your array of numbers
num_bins = 20
counts, bin_edges = numpy.histogram(a, bins=num_bins, density=True)  # density replaces the old normed keyword
cdf = numpy.cumsum(counts)
pylab.plot(bin_edges[1:], cdf)
(bin_edges[1:] is the upper edge of each bin.)
Have you tried the cumulative=True argument to pyplot.hist?
One-liner based on Dave's answer:
plt.plot(np.sort(arr), np.linspace(0, 1, len(arr), endpoint=False))
Edit: this was also suggested by hans_meine in the comments.
Assuming that vals holds your values, then you can simply plot the CDF as follows:
y = numpy.arange(0, 101)
x = numpy.percentile(vals, y)
plot(x, y)
To scale it between 0 and 1, just divide y by 100.
What do you want to do with the CDF ?
To plot it, that's a start. You could try a few different values, like this:
from __future__ import division
import numpy as np
from scipy.stats import cumfreq
import pylab as plt
hi = 100.
a = np.arange(hi) ** 2
for nbins in ( 2, 20, 100 ):
    cf = cumfreq(a, nbins) # bin values, lowerlimit, binsize, extrapoints
    w = hi / nbins
    x = np.linspace( w/2, hi - w/2, nbins ) # care
    # print x, cf
    plt.plot( x, cf[0], label=str(nbins) )
plt.legend()
plt.show()
Histogram lists various rules for the number of bins, e.g. num_bins ~ sqrt( len(a) ).
(Fine print: two quite different things are going on here:
binning / histogramming the raw data, and
plot interpolating a smooth curve through the, say, 20 binned values.
Either of these can go way off on data that's "clumpy" or has long tails, even for 1d data; 2d and 3d data get increasingly difficult.
See also Density_estimation and using scipy gaussian kernel density estimation.)
I have a trivial addition to AFoglia's method, to normalize the CDF
n_counts,bin_edges = np.histogram(myarray,bins=11,density=True)  # density replaces the old normed keyword
cdf = np.cumsum(n_counts) # cdf not normalized, despite above
scale = 1.0/cdf[-1]
ncdf = scale * cdf
Normalizing the histo makes its integral unity, which means the cdf will not be normalized. You've got to scale it yourself.
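A short check of that statement, as a sketch on synthetic data: it assumes the current NumPy spelling density=True, and the variable names are mine. Because the histogram integrates to one, multiplying each density by its bin width before the cumulative sum gives a curve that ends at 1, and with equal-width bins this matches the rescaling by cdf[-1].

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=1000)  # synthetic data

density, bin_edges = np.histogram(values, bins=11, density=True)

# Option 1: turn densities into probabilities via the bin widths
cdf_by_width = np.cumsum(density * np.diff(bin_edges))

# Option 2: the rescaling shown above
raw = np.cumsum(density)
cdf_by_rescale = raw / raw[-1]
```

The bin-width version has the advantage that it stays correct if the bins are not all the same width.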
If you want to display the actual true ECDF (which as David B noted is a step function that increases 1/n at each of n datapoints), my suggestion is to write code to generate two "plot" points for each datapoint:
a = array([...]) # your array of numbers
sorted_a = np.sort(a)  # renamed so the builtin sorted() is not shadowed
x2 = []
y2 = []
y = 0
for x in sorted_a:
    x2.extend([x,x])
    y2.append(y)
    y += 1.0 / len(a)
    y2.append(y)
plt.plot(x2,y2)
This way you will get a plot with the n steps that are characteristic of an ECDF, which is nice especially for data sets that are small enough for the steps to be visible. Also, there is no need to do any binning with histograms (which risks introducing bias into the drawn ECDF).
We can just use the step function from matplotlib, which makes a step-wise plot, which is the definition of the empirical CDF:
import numpy as np
from matplotlib import pyplot as plt
data = np.random.randn(11)
levels = np.linspace(0, 1, len(data) + 1) # endpoint 1 is included by default
plt.step(sorted(list(data) + [max(data)]), levels)
The final vertical line at max(data) was added manually. Otherwise the plot just stops at level 1 - 1/len(data).
Alternatively we can use the where='post' option to step()
levels = np.linspace(1. / len(data), 1, len(data))
plt.step(sorted(data), levels, where='post')
in which case the initial vertical line from zero is not plotted.
It's a one-liner in seaborn using the cumulative=True parameter. Here you go,
import seaborn as sns
sns.kdeplot(a, cumulative=True)
This is using bokeh
from bokeh.plotting import figure, show
from statsmodels.distributions.empirical_distribution import ECDF
ecdf = ECDF(pd_series)
p = figure(title="tests", tools="save", background_fill_color="#E8DDCB")
p.line(ecdf.x,ecdf.y)
show(p)
Although there are many great answers here, I thought I would include a more customized ECDF plot.
Generate values for the empirical cumulative distribution function
import matplotlib.pyplot as plt
import numpy as np
def ecdf_values(x):
    """
    Generate values for empirical cumulative distribution function
    Params
    --------
    x (array or list of numeric values): distribution for ECDF
    Returns
    --------
    x (array): x values
    y (array): percentile values
    """
    # Sort values and find length
    x = np.sort(x)
    n = len(x)
    # Create percentiles
    y = np.arange(1, n + 1, 1) / n
    return x, y
def ecdf_plot(x, name = 'Value', plot_normal = True, log_scale=False, save=False, save_name='Default'):
    """
    ECDF plot of x
    Params
    --------
    x (array or list of numerics): distribution for ECDF
    name (str): name of the distribution, used for labeling
    plot_normal (bool): plot the normal distribution (from mean and std of data)
    log_scale (bool): transform the scale to logarithmic
    save (bool) : save/export plot
    save_name (str) : filename to save the plot
    Returns
    --------
    none, displays plot
    """
    xs, ys = ecdf_values(x)
    fig = plt.figure(figsize = (10, 6))
    ax = plt.subplot(1, 1, 1)
    plt.step(xs, ys, linewidth = 2.5, c= 'b')
    plot_range = ax.get_xlim()[1] - ax.get_xlim()[0]
    fig_sizex = fig.get_size_inches()[0]
    data_inch = plot_range / fig_sizex
    right = 0.6 * data_inch + max(xs)
    gap = right - max(xs)
    left = min(xs) - gap
    if log_scale:
        ax.set_xscale('log')
    if plot_normal:
        gxs, gys = ecdf_values(np.random.normal(loc = xs.mean(),
                                                scale = xs.std(),
                                                size = 100000))
        plt.plot(gxs, gys, 'g')
    plt.vlines(x=min(xs),
               ymin=0,
               ymax=min(ys),
               color = 'b',
               linewidth = 2.5)
    # Add ticks
    plt.xticks(size = 16)
    plt.yticks(size = 16)
    # Add Labels
    plt.xlabel(f'{name}', size = 18)
    plt.ylabel('Percentile', size = 18)
    plt.vlines(x=min(xs),
               ymin = min(ys),
               ymax=0.065,
               color = 'r',
               linestyle = '-',
               alpha = 0.8,
               linewidth = 1.7)
    plt.vlines(x=max(xs),
               ymin=0.935,
               ymax=max(ys),
               color = 'r',
               linestyle = '-',
               alpha = 0.8,
               linewidth = 1.7)
    # Add Annotations (text passed positionally; the old s= keyword was renamed)
    plt.annotate(f'{min(xs):.2f}',
                 xy = (min(xs),
                       0.065),
                 horizontalalignment = 'center',
                 verticalalignment = 'bottom',
                 size = 15)
    plt.annotate(f'{max(xs):.2f}',
                 xy = (max(xs),
                       0.935),
                 horizontalalignment = 'center',
                 verticalalignment = 'top',
                 size = 15)
    ps = [0.25, 0.5, 0.75]
    for p in ps:
        ax.set_xlim(left = left, right = right)
        ax.set_ylim(bottom = 0)
        value = xs[np.where(ys > p)[0][0] - 1]
        pvalue = ys[np.where(ys > p)[0][0] - 1]
        plt.hlines(y=p, xmin=left, xmax = value,
                   linestyles = ':', colors = 'r', linewidth = 1.4)
        plt.vlines(x=value, ymin=0, ymax = pvalue,
                   linestyles = ':', colors = 'r', linewidth = 1.4)
        plt.text(x = p / 3, y = p - 0.01,
                 transform = ax.transAxes,
                 s = f'{int(100*p)}%', size = 15,
                 color = 'r', alpha = 0.7)
        plt.text(x = value, y = 0.01, size = 15,
                 horizontalalignment = 'left',
                 s = f'{value:.2f}', color = 'r', alpha = 0.8)
    # fit the labels into the figure
    plt.title(f'ECDF of {name}', size = 20)
    plt.tight_layout()
    if save:
        plt.savefig(save_name + '.png')
ecdf_plot(np.random.randn(100), name='Normal Distribution', save=True, save_name="ecdf")
Additional Resources:
ECDF
Interpreting ECDF
(This is a copy of my answer to the question: Plotting CDF of a pandas series in python)
A CDF or cumulative distribution function plot is basically a graph with on the X-axis the sorted values and on the Y-axis the cumulative distribution. So, I would create a new series with the sorted values as index and the cumulative distribution as values.
First create an example series:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()  # Series.order() was removed from pandas; sort_values() is the replacement
Now, before proceeding, append again the last (and largest) value. This step is important especially for small sample sizes in order to get an unbiased CDF:
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as index and the cumulative distribution as values
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')
None of the answers so far covers what I wanted when I landed here, which is:
def empirical_cdf(x, data):
    "evaluate ecdf of data at points x"
    return np.mean(data[None, :] <= x[:, None], axis=1)
It evaluates the empirical CDF of a given dataset at an array of points x, which do not have to be sorted. There is no intermediate binning and no external libraries.
An equivalent method that scales better for large x is to sort the data and use np.searchsorted:
def empirical_cdf(x, data):
    "evaluate ecdf of data at points x"
    data = np.sort(data)
    # side='right' counts values <= x, matching the definition above
    return np.searchsorted(data, x, side='right')/float(data.size)
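One detail decides whether the two variants really agree: np.searchsorted defaults to side='left', which counts values strictly less than x, whereas the broadcasting version counts values less than or equal to x. The sketch below uses side='right' and checks the two against each other on data with ties; the function names and test points are mine.

```python
import numpy as np

def ecdf_broadcast(x, data):
    """Fraction of data <= each x, via an (len(x), len(data)) comparison."""
    return np.mean(data[None, :] <= x[:, None], axis=1)

def ecdf_searchsorted(x, data):
    """Same quantity via sorting; side='right' makes ties count as <= x."""
    data = np.sort(data)
    return np.searchsorted(data, x, side='right') / data.size

rng = np.random.default_rng(0)
# Integer-valued data so that ties with the evaluation points actually occur
data = rng.integers(0, 10, size=50).astype(float)
x = np.array([-1.0, 0.0, 3.0, 9.0, 20.0])

via_broadcast = ecdf_broadcast(x, data)
via_sort = ecdf_searchsorted(x, data)
```

For continuous data without ties the choice of side makes no practical difference, but for discrete data it does.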
In my opinion, none of the previous methods do the complete (and strict) job of plotting the empirical CDF, which was the asker's original question. I post my proposal for any lost and sympathetic souls.
My proposal has the following features: 1) it considers the empirical CDF defined as in the first expression here, i.e., as in A. W. van der Vaart's Asymptotic Statistics (1998), 2) it explicitly shows the step behavior of the function, 3) it explicitly shows that the empirical CDF is continuous from the right by showing marks to resolve discontinuities, 4) it extends the zero and one values at the extremes up to user-defined margins. I hope it helps someone:
def plot_cdf( data, xaxis = None, figsize = (20,10), line_style = 'b-',
              ball_style = 'bo', xlabel = r"Random variable $X$",
              ylabel = "$N$-samples empirical CDF $F_{X,N}(x)$" ):
    # Contribution of each data point to the empirical distribution
    weights = 1/data.size * np.ones_like( data )
    # CDF estimation
    cdf = np.cumsum( weights )
    # Plot central part of the CDF
    plt.figure( figsize = figsize )
    plt.step( np.sort( data ), cdf, line_style, where = 'post' )
    # Plot valid points at discontinuities
    plt.plot( np.sort( data ), cdf, ball_style )
    # Extract plot axis and extend outside the data range
    if xaxis is not None:
        (xmin, xmax, ymin, ymax) = plt.axis( )
        xmin = xaxis[0]
        xmax = xaxis[1]
        plt.axis( [xmin, xmax, ymin, ymax] )
    else:
        (xmin,xmax,_,_) = plt.axis()
    plt.plot( [xmin, data.min(), data.min()], np.zeros( 3 ), line_style )
    plt.plot( [data.max(), xmax], np.ones( 2 ), line_style )
    plt.xlabel( xlabel )
    plt.ylabel( ylabel )
What I did to evaluate cdf for large dataset -
Find the unique values
unique_values = np.sort(pd.Series)
Make the rank array for these sorted and unique values in the dataset -
ranks = np.arange(0,len(unique_values))/(len(unique_values)-1)
Plot unique_values vs ranks
Example
The code below plots the cdf of population dataset from kaggle -
us_census_data = pd.read_csv('acs2015_census_tract_data.csv')
population = us_census_data['TotalPop'].dropna()
## sort the unique values using pandas unique function
unique_pop = np.sort(population.unique())
cdf = np.arange(0,len(unique_pop),step=1)/(len(unique_pop)-1)
## plotting
plt.plot(unique_pop,cdf)
plt.show()
This can easily be done with seaborn, which is a high-level API for matplotlib.
data can be a pandas.DataFrame, numpy.ndarray, mapping, or sequence.
An axes-level plot can be done using seaborn.ecdfplot.
A figure-level plot can be done using sns.displot with kind='ecdf'.
See How to use markers with ECDF plot for other options.
It’s also possible to plot the empirical complementary CDF (1 - CDF) by specifying complementary=True.
Tested in python 3.11, pandas 1.5.2, matplotlib 3.6.2, seaborn 0.12.1
import seaborn as sns
import matplotlib.pyplot as plt
# load sample dataframe
df = sns.load_dataset('penguins', cache=False)
# display(df.head(3))
#   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
# 0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
# 1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
# 2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
# plot ecdf
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.ecdfplot(data=df, x='bill_length_mm', ax=ax1)
ax1.set_title('Without hue')
sns.ecdfplot(data=df, x='bill_length_mm', hue='species', ax=ax2)
ax2.set_title('Separated species by hue')
CDF: complementary=True
