identify graph uptrend or downtrend - python

I am attempting to read in data and plot it as a standard line graph in Python. Can someone please advise how I can programmatically classify whether certain points in the graph are uptrends or downtrends? What would be the best way to achieve this? Surely this is a solved problem, and a mathematical method exists to identify it?
Here is some sample data with some uptrends and downtrends:
x = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]
y = [2,5,7,9,10,13,16,18,21,22,21,20,19,18,17,14,10,9,7,5,7,9,10,12,13,15,16,17,22,27]
thanks in advance

A simple way would be to look at the rate of change of y with respect to x, known as the derivative. This usually works best with continuous (smooth) functions, so you could apply it to your data by interpolating them with an n-th order polynomial, as already suggested. A simple implementation would look something like this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from scipy.misc import derivative
x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,\
16,17,18,19,20,21,22,23,24,25,26,27,28,29,30])
y = np.array([2,5,7,9,10,13,16,18,21,22,21,20,19,18,\
17,14,10,9,7,5,7,9,10,12,13,15,16,17,22,27])
# Simple interpolation of x and y
f = interp1d(x, y)
x_fake = np.arange(1.1, 30, 0.1)
# derivative of y with respect to x
df_dx = derivative(f, x_fake, dx=1e-6)
# Plot
fig = plt.figure()
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)
ax1.errorbar(x, y, fmt="o", color="blue", label='Input data')
ax1.errorbar(x_fake, f(x_fake), label="Interpolated data", lw=2)
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax2.errorbar(x_fake, df_dx, lw=2)
ax2.errorbar(x_fake, np.array([0 for i in x_fake]), ls="--", lw=2)
ax2.set_xlabel("x")
ax2.set_ylabel("dy/dx")
leg = ax1.legend(loc=2, numpoints=1, scatterpoints=1)
leg.draw_frame(False)
You can see that when the plot transitions from an 'upwards trend' (positive gradient) to a 'downwards trend' (negative gradient), the derivative (dy/dx) goes from positive to negative. The transition happens at dy/dx = 0, which is shown by the green dashed line. For the scipy routines you can look at:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.misc.derivative.html
http://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
NumPy's diff/gradient should also work, and would not require the interpolation, but I showed the above so you could get the idea. For a complete mathematical description of differentiation/calculus, look at Wikipedia.
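To illustrate that simpler route, here is a minimal sketch of my own (not part of the answer above) using np.gradient directly on the raw samples; since scipy.misc.derivative has been removed from recent SciPy releases, this is also the more future-proof option:
import numpy as np
x = np.arange(1, 31)
y = np.array([2,5,7,9,10,13,16,18,21,22,21,20,19,18,\
17,14,10,9,7,5,7,9,10,12,13,15,16,17,22,27])
# Estimate dy/dx from the samples themselves: central differences
# in the interior, one-sided differences at the ends.
dy_dx = np.gradient(y.astype(float), x)
# +1 marks an uptrend point, -1 a downtrend point, 0 a flat point.
trend = np.sign(dy_dx)
# Trend reversals are where the sign of the derivative changes.
turning_points = x[1:][np.diff(trend) != 0]
print(trend)
print(turning_points)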

I found this topic very important and interesting. I would like to extend the above-mentioned answer:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from scipy.misc import derivative
x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,\
16,17,18,19,20,21,22,23,24,25,26,27,28,29,30])
y = np.array([2,5,7,9,10,13,16,18,21,22,21,20,19,18,\
17,14,10,9,7,5,7,9,10,12,13,15,16,17,22,27])
# Simple interpolation of x and y
f = interp1d(x, y, fill_value="extrapolate")
x_fake = np.arange(1.1, 30, 0.1)
# derivative of y with respect to x
df_dx = derivative(f, x_fake, dx=1e-6)
plt.plot(x,y, label = "Data")
plt.plot(x_fake,df_dx,label = "Trend")
plt.legend()
plt.show()
average = np.average(df_dx)
if average > 0:
    print("Uptrend", average)
elif average < 0:
    print("Downtrend", average)
elif average == 0:
    print("No trend!", average)
print("Max trend measure is:")
print(np.max(df_dx))
print("min trend measure is:")
print(np.min(df_dx))
print("Overall trend measure:")
print(((np.max(df_dx))-np.min(df_dx)-average)/((np.max(df_dx))-np.min(df_dx)))
extremum_list_y = []
extremum_list_x = []
for i in range(0, df_dx.shape[0]):
    if df_dx[i] < 0.001 and df_dx[i] > -0.001:
        extremum_list_x.append(x_fake[i])
        extremum_list_y.append(df_dx[i])
plt.scatter(extremum_list_x, extremum_list_y, label="Extremum", marker="o", color="green")
plt.plot(x,y, label = "Data")
plt.plot(x_fake, df_dx, label="Trend")
plt.legend()
plt.show()
So, overall, the total trend is upward for this graph!
This approach is also nice when you want to find the x where the slope is zero, for example the extrema of the curve. The local minimum and maximum points can be found this way with good accuracy and low computation time.

Related

Recover the time shift from numpy.correlate result in Python

This is not a duplicate question, since other answers only explain how to plot the cross-correlation function and do not explain how to get the time difference.
Given a sine signal and a shifted version, we should be able to get the time delay between them.
I have created a sine signal and shifted it by t_d = 0.05. The following is my code and its output:
import numpy as np
import matplotlib.pyplot as plt
fs = 1000
x = np.linspace(0, 1, fs)
f = 5
t_shift = 0.05
y = np.sin(2*np.pi*f*x)
y_shifted = np.sin(2*np.pi*f*(x-t_shift))
fig, ax = plt.subplots()
ax.plot(x, y, x, y_shifted)
plt.show()
By normalizing signals and applying numpy.correlate we get the following:
y_norm = (y-y.mean())/y.std()
y_shifted_norm = (y_shifted - y_shifted.mean())/y_shifted.std()
cc = np.correlate(y_norm, y_shifted_norm, 'full')
fig, ax = plt.subplots()
ax.plot(range(len(cc)), cc)
plt.show()
Question
From the indices of cross-correlation function, how can I get t_shift=0.05?
@Sepide. It seems to me as if you are trying to maximise the correlation between the signal y and the shifted version y_shifted. This might be accomplished using np.correlate(), but it seems nontrivial indeed to recover the time shift from it directly. In the solution below I manually shift the time series and compute the correlation coefficient using np.corrcoef. As soon as this Pearson correlation coefficient equals 1, the two signals are aligned.
import numpy as np
import matplotlib.pyplot as plt
# Setting
fs = 1000
x = np.linspace(0, 1, fs)
f = 5
t_shift = 0.05
t_step = 1/fs
# Data
y = np.sin(2*np.pi*f*x)
y_shifted = np.sin(2*np.pi*f*(x-t_shift))
# Compute correlation
MaxTimeShift = 200
CorrelationList = np.empty((MaxTimeShift, 1))
CorrelationList[:] = np.nan
# Compute correlation for various shifts
for shift in range(MaxTimeShift):
    CorrelationList[shift] = np.corrcoef(y[0:801].T, y_shifted[shift:(801 + shift)].T)[0, 1]
# Plot 1
plt.figure(1)
plt.plot(x, y, x, y_shifted)
plt.show()
# Plot 2
plt.figure(2)
ShiftList = t_step*np.arange(MaxTimeShift)
plt.plot(ShiftList, CorrelationList)
plt.title("Correlation coefficient")
plt.show()
print("The time shift between the signals is: ", ShiftList[np.argmax(CorrelationList)])

How Can I calculate the uncertainty of a fourth degree polynomial fit?

I have experimental data with relative uncertainties, and I would like to fit it with a fourth-degree polynomial using a model in logarithms. The issue I'm facing is how to get an uncertainty on the fit given the experimental data. I have no idea how to do this. Any suggestion would be appreciated.
Thank you.
import numpy as np
import matplotlib.pyplot as plt
x = [121.78, 244.69, 344.278, 411.1165, 778.904, 867.38, 964.079, 1112.076, 1212.948, 1299.142, 1408.013]
y = [2.98E-03, 2.18E-03, 1.64E-03, 1.79E-03, 8.09E-04, 7.54E-04, 6.80E-04, 6.11E-04, 5.83E-04, 5.49E-04, 5.01E-04]
error_abs = [4.72E-05, 3.64E-05, 2.63E-05, 3.38E-05, 1.38E-05, 1.50E-05, 1.15E-05, 1.06E-05, 1.66E-05, 1.51E-05, 8.41E-06] # y_err
# x and y in logarithms
t1 = np.log(x)
t2 = np.log(y)
l = np.linspace(np.min(t1),np.max(t1),1000)
p,cov = np.polyfit(t1,t2,4, cov=True)
fit = np.polyval(p,l)
plt.errorbar(x, y, yerr=error_abs, lw=0.5, capsize=2, capthick=0.5, linestyle='none', marker='s',markersize=3, label='efficiency', color='orange')
plt.plot(np.exp(l),np.exp(fit), label ='Fit')
plt.show()
print(np.sqrt(np.diag(cov)))
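Since no answer is recorded here, a hedged sketch of one standard approach: weight the fit by the measurement uncertainties (in log space the absolute error on log(y) is approximately error_abs/y), and propagate the returned covariance matrix through the Vandermonde matrix of the evaluation points. This is my own suggestion, not an accepted answer; t1, t2, l, y and error_abs are as defined above:
import numpy as np
sigma_t2 = np.array(error_abs) / np.array(y)  # approximate error on log(y)
p, cov = np.polyfit(t1, t2, 4, w=1/sigma_t2, cov=True)
fit = np.polyval(p, l)
# Variance of the fitted curve: row-wise V @ cov @ V.T, where V is the
# Vandermonde matrix of l (columns l^4 ... l^0, matching p's ordering).
V = np.vander(l, 5)
fit_sigma = np.sqrt(np.einsum('ij,jk,ik->i', V, cov, V))
# 1-sigma band, mapped back out of log space
upper = np.exp(fit + fit_sigma)
lower = np.exp(fit - fit_sigma)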

Fitting a line through 3D x,y,z scatter plot data

I have a handful of data points that cluster along a line in 3d space. I have the x,y,z data in a csv file that I want to import. I would like to find an equation that represents that line, or the plane perpendicular to that line, or whatever is mathematically correct. These data are independent of each other. Maybe there are better ways to do this than what I tried to do but...
I attempted to replicate an old post here that seemed to be doing exactly what I'm trying to do
Fitting a line in 3D
but it seems that maybe updates over the past decade have left the second part of the code not working? Or maybe I'm just doing something wrong. I've included the entire thing that I frankensteined together from this at the bottom. There are two lines that seem to be giving me a problem.
I've snippeted them out here...
import numpy as np
pts = np.add.accumulate(np.random.random((10,3)))
x,y,z = pts.T
# this will find the slope and x-intercept of a plane
# parallel to the y-axis that best fits the data
A_xz = np.vstack((x, np.ones(len(x)))).T
m_xz, c_xz = np.linalg.lstsq(A_xz, z)[0]
# again for a plane parallel to the x-axis
A_yz = np.vstack((y, np.ones(len(y)))).T
m_yz, c_yz = np.linalg.lstsq(A_yz, z)[0]
# the intersection of those two planes and
# the function for the line would be:
# z = m_yz * y + c_yz
# z = m_xz * x + c_xz
# or:
def lin(z):
    x = (z - c_xz)/m_xz
    y = (z - c_yz)/m_yz
    return x, y
#verifying:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = Axes3D(fig)
zz = np.linspace(0,5)
xx,yy = lin(zz)
ax.scatter(x, y, z)
ax.plot(xx,yy,zz)
plt.savefig('test.png')
plt.show()
They return this, but no values...
FutureWarning: rcond parameter will change to the default of machine precision times max(M, N) where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass rcond=None, to keep using the old, explicitly pass rcond=-1.
m_xz, c_xz = np.linalg.lstsq(A_xz, z)[0]
FutureWarning: rcond parameter will change to the default of machine precision times max(M, N) where M and N are the input matrix dimensions.
To use the future default and silence this warning we advise to pass rcond=None, to keep using the old, explicitly pass rcond=-1.
m_yz, c_yz = np.linalg.lstsq(A_yz, z)[0]
I don't know where to go from here. I don't even actually need the plot, I just needed an equation and am ill-equipped to move forward. If anyone knows an easier way to do this, or can point me in the right direction, I'm willing to learn, but I'm very, very lost. Thank you in advance!!
Here is my entire frankensteined code in case that is what is causing the issue.
import pandas as pd
import numpy as np
mydataset = pd.read_csv('line1.csv')
x = mydataset.iloc[:,0]
y = mydataset.iloc[:,1]
z = mydataset.iloc[:,2]
data = np.concatenate((x[:, np.newaxis],
y[:, np.newaxis],
z[:, np.newaxis]),
axis=1)
# Calculate the mean of the points, i.e. the 'center' of the cloud
datamean = data.mean(axis=0)
# Do an SVD on the mean-centered data.
uu, dd, vv = np.linalg.svd(data - datamean)
# Now vv[0] contains the first principal component, i.e. the direction
# vector of the 'best fit' line in the least squares sense.
# Now generate some points along this best fit line, for plotting.
# we want it to have mean 0 (like the points we did
# the svd on). Also, it's a straight line, so we only need 2 points.
linepts = vv[0] * np.mgrid[-100:100:2j][:, np.newaxis]
# shift by the mean to get the line in the right place
linepts += datamean
# Verify that everything looks right.
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d as m3d
ax = m3d.Axes3D(plt.figure())
ax.scatter3D(*data.T)
ax.plot3D(*linepts.T)
plt.show()
# this will find the slope and x-intercept of a plane
# parallel to the y-axis that best fits the data
A_xz = np.vstack((x, np.ones(len(x)))).T
m_xz, c_xz = np.linalg.lstsq(A_xz, z)[0]
# again for a plane parallel to the x-axis
A_yz = np.vstack((y, np.ones(len(y)))).T
m_yz, c_yz = np.linalg.lstsq(A_yz, z)[0]
# the intersection of those two planes and
# the function for the line would be:
# z = m_yz * y + c_yz
# z = m_xz * x + c_xz
# or:
def lin(z):
    x = (z - c_xz)/m_xz
    y = (z - c_yz)/m_yz
    return x, y
print(x,y)
#verifying:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
fig = plt.figure()
ax = Axes3D(fig)
zz = np.linspace(0,5)
xx,yy = lin(zz)
ax.scatter(x, y, z)
ax.plot(xx,yy,zz)
plt.savefig('test.png')
plt.show()
As was proposed in the old post you refer to, you could also make use of principal component analysis instead of a least squares approach. For that I suggest sklearn.decomposition.PCA from the sklearn package.
An example can be found below using the csv-file you provided.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
mydataset = pd.read_csv('line1.csv')
x = mydataset.iloc[:,0]
y = mydataset.iloc[:,1]
z = mydataset.iloc[:,2]
coords = np.array((x, y, z)).T
pca = PCA(n_components=1)
pca.fit(coords)
direction_vector = pca.components_
print(direction_vector)
# Create plot
origin = np.mean(coords, axis=0)
euclidian_distance = np.linalg.norm(coords - origin, axis=1)
extent = np.max(euclidian_distance)
line = np.vstack((origin - direction_vector * extent,
origin + direction_vector * extent))
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(coords[:, 0], coords[:, 1], coords[:,2])
ax.plot(line[:, 0], line[:, 1], line[:, 2], 'r')
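If what you mainly need is an equation rather than the plot, the PCA output above already gives the line in parametric form. A small sketch of my own, reusing origin, direction_vector and extent as computed above:
# Parametric form of the fitted line: point(t) = origin + t * d,
# where d is the unit direction vector found by the PCA.
d = direction_vector[0]  # pca.components_ has shape (1, 3)
def point_on_line(t):
    return origin + t * d
# For example, the two endpoints used for plotting above:
print(point_on_line(-extent))
print(point_on_line(extent))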
You can get rid of the complaint from least squares by adding rcond=None like this:
m_xz, c_xz = np.linalg.lstsq(A_xz, z, rcond=None)[0]
Is this the right decision for your situation? I have no idea. But there's more about it in the docs.
When I run your code with your inputs it seems to run just fine, and I get values assigned to m_xz, c_xz, etc. If you don't output them explicitly with print(m_xz) (or similar), or echo them in an interactive session, you won't see them.
m_xz
Out[42]: 5.186132604596112
c_xz
Out[43]: 62.5764694106141
Also, you reference your data in kind of two different ways. You get x, y, and z from your csv, but also put it into a numpy array. You can get rid of the duplication and pandas by just using numpy:
data = np.genfromtxt('line1.csv', delimiter=',', skip_header=1)
x = data[:,0]
y = data[:,1]
z = data[:,2]

Flow visualisation in python using curved (path-following) vectors

I would like to plot a vector field with curved arrows in python, as can be done in vfplot (see below) or IDL.
You can get close in matplotlib, but using quiver() limits you to straight vectors (see below left) whereas streamplot() doesn't seem to permit meaningful control over arrow length or arrowhead position (see below right), even when changing integration_direction, density, and maxlength.
So, is there a python library that can do this? Or is there a way of getting matplotlib to do it?
If you look at the streamplot.py that is included in matplotlib, on lines 196-202 (ish; I don't know if this has changed between versions, I'm on matplotlib 2.1.2), we see the following:
... (to line 195)
# Add arrows half way along each trajectory.
s = np.cumsum(np.sqrt(np.diff(tx) ** 2 + np.diff(ty) ** 2))
n = np.searchsorted(s, s[-1] / 2.)
arrow_tail = (tx[n], ty[n])
arrow_head = (np.mean(tx[n:n + 2]), np.mean(ty[n:n + 2]))
... (after line 196)
changing that part to this will do the trick (changing assignment of n):
... (to line 195)
# Add arrows half way along each trajectory.
s = np.cumsum(np.sqrt(np.diff(tx) ** 2 + np.diff(ty) ** 2))
n = np.searchsorted(s, s[-1]) ### THIS IS THE EDITED LINE! ###
arrow_tail = (tx[n], ty[n])
arrow_head = (np.mean(tx[n:n + 2]), np.mean(ty[n:n + 2]))
... (after line 196)
If you modify this to put the arrow at the end, then you could generate the arrows more to your liking.
Additionally, from the docs at the top of the function, we see the following:
*linewidth* : numeric or 2d array
vary linewidth when given a 2d array with the same shape as velocities.
The linewidth can be a numpy.ndarray, and if you can pre-calculate the desired width of your arrows, you'll be able to modify the pencil width while drawing the arrows. It looks like this part has already been done for you.
So, in combination with shortening the arrows maxlength, increasing the density, and adding start_points, as well as tweaking the function to put the arrow at the end instead of the middle, you could get your desired graph.
With these modifications, and the following code, I was able to get a result much closer to what you wanted:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.patches as pat
w = 3
Y, X = np.mgrid[-w:w:100j, -w:w:100j]
U = -1 - X**2 + Y
V = 1 + X - Y**2
speed = np.sqrt(U*U + V*V)
fig = plt.figure(figsize=(14, 18))
gs = gridspec.GridSpec(nrows=3, ncols=2, height_ratios=[1, 1, 2])
grains = 10
tmp = tuple([x]*grains for x in np.linspace(-2, 2, grains))
xs = []
for x in tmp:
    xs += x
ys = tuple(np.linspace(-2, 2, grains))*grains
seed_points = np.array([list(xs), list(ys)])
# Varying color along a streamline
ax1 = fig.add_subplot(gs[0, 1])
strm = ax1.streamplot(X, Y, U, V, color=U, linewidth=np.array(5*np.random.random_sample((100, 100))**2 + 1), cmap='winter', density=10,
minlength=0.001, maxlength = 0.07, arrowstyle='fancy',
integration_direction='forward', start_points = seed_points.T)
fig.colorbar(strm.lines)
ax1.set_title('Varying Color')
plt.tight_layout()
plt.show()
tl;dr: go copy the source code, and change it to put the arrows at the end of each path, instead of in the middle. Then use your streamplot instead of the matplotlib streamplot.
Edit: I got the linewidths to vary
Starting with David Culbreth's modification, I rewrote chunks of the streamplot function to achieve the desired behaviour. The changes are slightly too numerous to list them all here, but they include a length-normalising method and disable the trajectory-overlap checking. I've appended two comparisons of the new curved quiver function with the original streamplot and quiver.
Here's a way to obtain the desired output in vanilla pyplot (i.e., without modifying the streamplot function or anything that fancy). As a reminder, the goal is to visualize a vector field with curved arrows whose length is proportional to the norm of the vector.
The trick is to:
1. make a streamplot with no arrows that is traced backward from a given point
2. plot a quiver from that point, made small enough that only the arrowhead is visible
3. repeat 1. and 2. in a loop for every seed, and scale the length of the streamplot to be proportional to the norm of the vector.
import matplotlib.pyplot as plt
import numpy as np
w = 3
Y, X = np.mgrid[-w:w:8j, -w:w:8j]
U = -Y
V = X
norm = np.sqrt(U**2 + V**2)
norm_flat = norm.flatten()
start_points = np.array([X.flatten(),Y.flatten()]).T
plt.clf()
scale = .2/np.max(norm)
plt.subplot(121)
plt.title('scaling only the length')
for i in range(start_points.shape[0]):
    plt.streamplot(X, Y, U, V, color='k', start_points=np.array([start_points[i,:]]),
                   minlength=.95*norm_flat[i]*scale, maxlength=1.0*norm_flat[i]*scale,
                   integration_direction='backward', density=10, arrowsize=0.0)
plt.quiver(X, Y, U/norm, V/norm, scale=30)
plt.axis('square')
plt.subplot(122)
plt.title('scaling length, arrowhead and linewidth')
for i in range(start_points.shape[0]):
    plt.streamplot(X, Y, U, V, color='k', start_points=np.array([start_points[i,:]]),
                   minlength=.95*norm_flat[i]*scale, maxlength=1.0*norm_flat[i]*scale,
                   integration_direction='backward', density=10, arrowsize=0.0, linewidth=.5*norm_flat[i])
plt.quiver(X, Y, U/np.max(norm), V/np.max(norm), scale=30)
plt.axis('square')
Here's the result:
Just looking at the documentation on streamplot(): what if you used something like streamplot(..., minlength=n/2, maxlength=n) where n is the desired length? You will need to play with those numbers a bit to get your desired graph.
You can control the seed points using start_points, as shown in the example provided by @JohnKoch.
Here's an example of how I controlled the length with streamplot() -- it's pretty much a straight copy/paste/crop from the example above.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.patches as pat
w = 3
Y, X = np.mgrid[-w:w:100j, -w:w:100j]
U = -1 - X**2 + Y
V = 1 + X - Y**2
speed = np.sqrt(U*U + V*V)
fig = plt.figure(figsize=(14, 18))
gs = gridspec.GridSpec(nrows=3, ncols=2, height_ratios=[1, 1, 2])
grains = 10
tmp = tuple([x]*grains for x in np.linspace(-2, 2, grains))
xs = []
for x in tmp:
    xs += x
ys = tuple(np.linspace(-2, 2, grains))*grains
seed_points = np.array([list(xs), list(ys)])
arrowStyle = pat.ArrowStyle.Fancy()
# Varying color along a streamline
ax1 = fig.add_subplot(gs[0, 1])
strm = ax1.streamplot(X, Y, U, V, color=U, linewidth=1.5, cmap='winter', density=10,
minlength=0.001, maxlength = 0.1, arrowstyle='->',
integration_direction='forward', start_points = seed_points.T)
fig.colorbar(strm.lines)
ax1.set_title('Varying Color')
plt.tight_layout()
plt.show()
Edit: made it prettier, though still not quite what we were looking for.

How to plot empirical cdf (ecdf)

How can I plot the empirical CDF of an array of numbers in matplotlib in Python? I'm looking for the cdf analog of pylab's "hist" function.
One thing I can think of is:
from scipy.stats import cumfreq
a = array([...]) # my array of numbers
num_bins = 20
b = cumfreq(a, num_bins)
plt.plot(b)
If you like linspace and prefer one-liners, you can do:
plt.plot(np.sort(a), np.linspace(0, 1, len(a), endpoint=False))
Given my tastes, I almost always do:
# a is the data array
x = np.sort(a)
y = np.arange(len(x))/float(len(x))
plt.plot(x, y)
This works for me even if there are more than 1e6 data values.
If you really need to downsample I'd set
x = np.sort(a)[::down_sampling_step]
Edit, in response to a comment about why I use endpoint=False and define y as above. Here are some technical details.
The empirical CDF is usually formally defined as
CDF(x) = "number of samples <= x"/"number of samples"
in order to exactly match this formal definition, you would need to use y = np.arange(1, len(x)+1)/float(len(x)) so that you get
y = [1/N, 2/N, ..., 1]. This is an unbiased estimator that will converge to the true CDF in the limit of infinite samples (see the Wikipedia article on the empirical distribution function).
I tend to use y = [0, 1/N, 2/N, ..., (N-1)/N] since:
(a) it is easier to code/more idiomatic,
(b) it is still formally justified, since one can always exchange CDF(x) with 1-CDF(x) in the convergence proof, and
(c) it works with the (easy) downsampling method described above.
In some particular cases, it is useful to define
y = (arange(len(x)) + 0.5) / len(x)
which is intermediate between these two conventions. In effect, it says "there is a 1/(2N) chance of a value less than the lowest one I've seen in my sample, and a 1/(2N) chance of a value greater than the largest one I've seen so far."
Note that the choice of convention interacts with the where parameter of plt.step if you display
the CDF as a piecewise-constant function. To exactly match the formal definition mentioned above, you would need to use where='pre' with the suggested y = [0, 1/N, ..., (N-1)/N] convention, or where='post' with the y = [1/N, 2/N, ..., 1] convention, but not the other way around.
However, for large samples and reasonable distributions, the convention given in the main body of the answer is easy to write, is an unbiased estimator of the true CDF, and works with the downsampling methodology.
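To make the three conventions concrete, here is a small sketch of my own (not part of the original answer):
import numpy as np
a = np.random.randn(1000)
x = np.sort(a)
N = len(x)
y_low = np.arange(N) / N  # [0, 1/N, ..., (N-1)/N]
y_high = np.arange(1, N + 1) / N  # [1/N, 2/N, ..., 1], the formal definition
y_mid = (np.arange(N) + 0.5) / N  # midpoint convention
# For a step plot, pair where='pre' with y_low, or where='post' with y_high,
# to match the formal right-continuous definition.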
You can use the ECDF function from the scikits.statsmodels library:
import numpy as np
import scikits.statsmodels as sm
import matplotlib.pyplot as plt
sample = np.random.uniform(0, 1, 50)
ecdf = sm.tools.ECDF(sample)
x = np.linspace(min(sample), max(sample))
y = ecdf(x)
plt.step(x, y)
With version 0.4, scikits.statsmodels was renamed to statsmodels. ECDF is now located in the distributions module (while statsmodels.tools.tools.ECDF is deprecated).
import numpy as np
import statsmodels.api as sm # recommended import according to the docs
import matplotlib.pyplot as plt
sample = np.random.uniform(0, 1, 50)
ecdf = sm.distributions.ECDF(sample)
x = np.linspace(min(sample), max(sample))
y = ecdf(x)
plt.step(x, y)
plt.show()
That looks to be (almost) exactly what you want. Two things:
First, the results are a tuple of four items. The third is the size of the bins. The second is the starting point of the smallest bin. The first is the number of points in or below each bin. (The last is the number of points outside the limits, but since you haven't set any, all points will be binned.)
Second, you'll want to rescale the results so the final value is 1, to follow the usual conventions of a CDF, but otherwise it's right.
Here's what it does under the hood:
def cumfreq(a, numbins=10, defaultreallimits=None):
    # docstring omitted
    h, l, b, e = histogram(a, numbins, defaultreallimits)
    cumhist = np.cumsum(h*1, axis=0)
    return cumhist, l, b, e
It does the histogramming, then produces a cumulative sum of the counts in each bin. So the i-th value of the result is the number of array values less than or equal to the maximum of the i-th bin. So, the final value is just the size of the initial array.
Finally, to plot it, you'll need to use the initial value of the bin, and the bin size to determine what x-axis values you'll need.
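Putting that together, a short sketch of my own (modern SciPy returns cumfreq's result as a named tuple, which makes the unpacking explicit):
import numpy as np
from scipy.stats import cumfreq
import matplotlib.pyplot as plt
a = np.random.randn(1000)
res = cumfreq(a, numbins=20)
# Bin centers reconstructed from the lower limit and the bin size,
# cumulative counts rescaled so the final value is 1.
x = res.lowerlimit + res.binsize * (np.arange(res.cumcount.size) + 0.5)
plt.plot(x, res.cumcount / res.cumcount[-1])
plt.show()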
Another option is to use numpy.histogram which can do the normalization and returns the bin edges. You'll need to do the cumulative sum of the resulting counts yourself.
a = array([...])  # your array of numbers
num_bins = 20
# density=True replaces the long-deprecated normed=True argument
counts, bin_edges = numpy.histogram(a, bins=num_bins, density=True)
cdf = numpy.cumsum(counts)
pylab.plot(bin_edges[1:], cdf)
(bin_edges[1:] is the upper edge of each bin.)
Have you tried the cumulative=True argument to pyplot.hist?
One-liner based on Dave's answer:
plt.plot(np.sort(arr), np.linspace(0, 1, len(arr), endpoint=False))
Edit: this was also suggested by hans_meine in the comments.
Assuming that vals holds your values, then you can simply plot the CDF as follows:
y = numpy.arange(0, 101)
x = numpy.percentile(vals, y)
plot(x, y)
To scale it between 0 and 1, just divide y by 100.
What do you want to do with the CDF ?
To plot it, that's a start. You could try a few different values, like this:
from __future__ import division
import numpy as np
from scipy.stats import cumfreq
import pylab as plt
hi = 100.
a = np.arange(hi) ** 2
for nbins in (2, 20, 100):
    cf = cumfreq(a, nbins)  # bin values, lowerlimit, binsize, extrapoints
    w = hi / nbins
    x = np.linspace(w/2, hi - w/2, nbins)  # care
    # print x, cf
    plt.plot(x, cf[0], label=str(nbins))
plt.legend()
plt.show()
The Wikipedia article on histograms lists various rules for the number of bins, e.g. num_bins ~ sqrt(len(a)).
(Fine print: two quite different things are going on here: binning/histogramming the raw data, and plotting a smooth curve interpolated through the, say, 20 binned values. Either of these can go way off on data that's "clumpy" or has long tails, even for 1d data; 2d and 3d data get increasingly difficult. See also Density_estimation and using scipy gaussian kernel density estimation.)
I have a trivial addition to AFoglia's method, to normalize the CDF:
n_counts, bin_edges = np.histogram(myarray, bins=11, density=True)
cdf = np.cumsum(n_counts)  # cdf not normalized, despite above
scale = 1.0 / cdf[-1]
ncdf = scale * cdf
Normalizing the histogram makes its integral unity, which means the CDF will not be normalized; you've got to scale it yourself.
If you want to display the actual true ECDF (which as David B noted is a step function that increases 1/n at each of n datapoints), my suggestion is to write code to generate two "plot" points for each datapoint:
a = array([...]) # your array of numbers
a_sorted = np.sort(a)
x2 = []
y2 = []
y = 0
for x in a_sorted:
    x2.extend([x, x])
    y2.append(y)
    y += 1.0 / len(a)
    y2.append(y)
plt.plot(x2, y2)
This way you will get a plot with the n steps that are characteristic of an ECDF, which is nice especially for data sets small enough for the steps to be visible. Also, there is no need to do any binning with histograms (which risks introducing bias into the drawn ECDF).
We can just use the step function from matplotlib, which makes a step-wise plot, which is the definition of the empirical CDF:
import numpy as np
from matplotlib import pyplot as plt
data = np.random.randn(11)
levels = np.linspace(0, 1, len(data) + 1) # endpoint 1 is included by default
plt.step(sorted(list(data) + [max(data)]), levels)
The final vertical line at max(data) was added manually. Otherwise the plot just stops at level 1 - 1/len(data).
Alternatively we can use the where='post' option to step()
levels = np.linspace(1. / len(data), 1, len(data))
plt.step(sorted(data), levels, where='post')
in which case the initial vertical line from zero is not plotted.
It's a one-liner in seaborn using the cumulative=True parameter. Here you go,
import seaborn as sns
sns.kdeplot(a, cumulative=True)
This is using Bokeh:
from bokeh.plotting import figure, show
from statsmodels.distributions.empirical_distribution import ECDF
# pd_series is your pandas Series (or array) of samples
ecdf = ECDF(pd_series)
p = figure(title="tests", tools="save", background_fill_color="#E8DDCB")
p.line(ecdf.x, ecdf.y)
show(p)
Although there are many great answers here, I would like to include a more customized ECDF plot.
Generate values for the empirical cumulative distribution function
import numpy as np
import matplotlib.pyplot as plt
def ecdf_values(x):
    """
    Generate values for empirical cumulative distribution function

    Params
    --------
        x (array or list of numeric values): distribution for ECDF

    Returns
    --------
        x (array): x values
        y (array): percentile values
    """
    # Sort values and find length
    x = np.sort(x)
    n = len(x)
    # Create percentiles
    y = np.arange(1, n + 1, 1) / n
    return x, y
def ecdf_plot(x, name='Value', plot_normal=True, log_scale=False, save=False, save_name='Default'):
    """
    ECDF plot of x

    Params
    --------
        x (array or list of numerics): distribution for ECDF
        name (str): name of the distribution, used for labeling
        plot_normal (bool): plot the normal distribution (from mean and std of data)
        log_scale (bool): transform the scale to logarithmic
        save (bool): save/export plot
        save_name (str): filename to save the plot

    Returns
    --------
        none, displays plot
    """
    xs, ys = ecdf_values(x)
    fig = plt.figure(figsize=(10, 6))
    ax = plt.subplot(1, 1, 1)
    plt.step(xs, ys, linewidth=2.5, c='b')
    plot_range = ax.get_xlim()[1] - ax.get_xlim()[0]
    fig_sizex = fig.get_size_inches()[0]
    data_inch = plot_range / fig_sizex
    right = 0.6 * data_inch + max(xs)
    gap = right - max(xs)
    left = min(xs) - gap
    if log_scale:
        ax.set_xscale('log')
    if plot_normal:
        gxs, gys = ecdf_values(np.random.normal(loc=xs.mean(),
                                                scale=xs.std(),
                                                size=100000))
        plt.plot(gxs, gys, 'g')
    plt.vlines(x=min(xs),
               ymin=0,
               ymax=min(ys),
               color='b',
               linewidth=2.5)
    # Add ticks
    plt.xticks(size=16)
    plt.yticks(size=16)
    # Add Labels
    plt.xlabel(f'{name}', size=18)
    plt.ylabel('Percentile', size=18)
    plt.vlines(x=min(xs),
               ymin=min(ys),
               ymax=0.065,
               color='r',
               linestyle='-',
               alpha=0.8,
               linewidth=1.7)
    plt.vlines(x=max(xs),
               ymin=0.935,
               ymax=max(ys),
               color='r',
               linestyle='-',
               alpha=0.8,
               linewidth=1.7)
    # Add Annotations (annotate's text keyword replaces the deprecated s keyword)
    plt.annotate(text=f'{min(xs):.2f}',
                 xy=(min(xs), 0.065),
                 horizontalalignment='center',
                 verticalalignment='bottom',
                 size=15)
    plt.annotate(text=f'{max(xs):.2f}',
                 xy=(max(xs), 0.935),
                 horizontalalignment='center',
                 verticalalignment='top',
                 size=15)
    ps = [0.25, 0.5, 0.75]
    for p in ps:
        ax.set_xlim(left=left, right=right)
        ax.set_ylim(bottom=0)
        value = xs[np.where(ys > p)[0][0] - 1]
        pvalue = ys[np.where(ys > p)[0][0] - 1]
        plt.hlines(y=p, xmin=left, xmax=value,
                   linestyles=':', colors='r', linewidth=1.4)
        plt.vlines(x=value, ymin=0, ymax=pvalue,
                   linestyles=':', colors='r', linewidth=1.4)
        plt.text(x=p / 3, y=p - 0.01,
                 transform=ax.transAxes,
                 s=f'{int(100*p)}%', size=15,
                 color='r', alpha=0.7)
        plt.text(x=value, y=0.01, size=15,
                 horizontalalignment='left',
                 s=f'{value:.2f}', color='r', alpha=0.8)
    # fit the labels into the figure
    plt.title(f'ECDF of {name}', size=20)
    plt.tight_layout()
    if save:
        plt.savefig(save_name + '.png')
ecdf_plot(np.random.randn(100), name='Normal Distribution', save=True, save_name="ecdf")
Additional Resources:
ECDF
Interpreting ECDF
(This is a copy of my answer to the question: Plotting CDF of a pandas series in python)
A CDF or cumulative distribution function plot is basically a graph with on the X-axis the sorted values and on the Y-axis the cumulative distribution. So, I would create a new series with the sorted values as index and the cumulative distribution as values.
First create an example series:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()  # Series.order() was removed in modern pandas; sort_values() is the replacement
Now, before proceeding, append again the last (and largest) value. This step is important especially for small sample sizes in order to get an unbiased CDF:
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as index and the cumulative distribution as values
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')
None of the answers so far covers what I wanted when I landed here, which is:
def empirical_cdf(x, data):
    "evaluate ecdf of data at points x"
    return np.mean(data[None, :] <= x[:, None], axis=1)
It evaluates the empirical CDF of a given dataset at an array of points x, which do not have to be sorted. There is no intermediate binning and no external libraries.
An equivalent method that scales better for large x is to sort the data and use np.searchsorted:
def empirical_cdf(x, data):
    "evaluate ecdf of data at points x"
    data = np.sort(data)
    return np.searchsorted(data, x) / float(data.size)
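A quick usage sketch of my own, valid for either version:
import numpy as np
data = np.random.randn(1000)
x = np.linspace(-3, 3, 7)  # evaluation points, need not be sorted
print(empirical_cdf(x, data))  # ECDF of data evaluated at each point of x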
In my opinion, none of the previous methods does the complete (and strict) job of plotting the empirical CDF, which was the asker's original question. I post my proposal for any lost and sympathetic souls.
My proposal has the following features: 1) it considers the empirical CDF defined as in the first expression here, i.e., as in A. W. van der Vaart's Asymptotic Statistics (1998); 2) it explicitly shows the step behavior of the function; 3) it explicitly shows that the empirical CDF is continuous from the right by showing marks to resolve discontinuities; 4) it extends the zero and one values at the extremes up to user-defined margins. I hope it helps someone:
import numpy as np
import matplotlib.pyplot as plt
def plot_cdf(data, xaxis=None, figsize=(20, 10), line_style='b-',
             ball_style='bo', xlabel=r"Random variable $X$",
             ylabel=r"$N$-samples empirical CDF $F_{X,N}(x)$"):
    # Contribution of each data point to the empirical distribution
    weights = np.ones(data.size) / data.size
    # CDF estimation
    cdf = np.cumsum(weights)
    # Plot central part of the CDF
    plt.figure(figsize=figsize)
    plt.step(np.sort(data), cdf, line_style, where='post')
    # Plot valid points at discontinuities
    plt.plot(np.sort(data), cdf, ball_style)
    # Extract plot axis and extend outside the data range
    if xaxis is not None:
        (xmin, xmax, ymin, ymax) = plt.axis()
        xmin = xaxis[0]
        xmax = xaxis[1]
        plt.axis([xmin, xmax, ymin, ymax])
    else:
        (xmin, xmax, _, _) = plt.axis()
    plt.plot([xmin, data.min(), data.min()], np.zeros(3), line_style)
    plt.plot([data.max(), xmax], np.ones(2), line_style)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
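And a quick usage sketch (my addition, not the author's):
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randn(50)
plot_cdf(data, xaxis=(-4, 4))
plt.show()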
What I did to evaluate the CDF for a large dataset:
Find the sorted unique values:
unique_values = np.sort(series.unique())  # series is your pandas Series
Make the rank array for these sorted unique values:
ranks = np.arange(0, len(unique_values)) / (len(unique_values) - 1)
Plot unique_values vs ranks
Example
The code below plots the cdf of population dataset from kaggle -
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
us_census_data = pd.read_csv('acs2015_census_tract_data.csv')
population = us_census_data['TotalPop'].dropna()
## sort the unique values using pandas unique function
unique_pop = np.sort(population.unique())
cdf = np.arange(0,len(unique_pop),step=1)/(len(unique_pop)-1)
## plotting
plt.plot(unique_pop,cdf)
plt.show()
This can easily be done with seaborn, which is a high-level API for matplotlib.
data can be a pandas.DataFrame, numpy.ndarray, mapping, or sequence.
An axes-level plot can be done using seaborn.ecdfplot.
A figure-level plot can be done using sns.displot with kind='ecdf'.
See How to use markers with ECDF plot for other options.
It’s also possible to plot the empirical complementary CDF (1 - CDF) by specifying complementary=True.
Tested in python 3.11, pandas 1.5.2, matplotlib 3.6.2, seaborn 0.12.1
import seaborn as sns
import matplotlib.pyplot as plt
# load sample dataframe
df = sns.load_dataset('penguins', cache=False)
# display(df.head(3))
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
# plot ecdf
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.ecdfplot(data=df, x='bill_length_mm', ax=ax1)
ax1.set_title('Without hue')
sns.ecdfplot(data=df, x='bill_length_mm', hue='species', ax=ax2)
ax2.set_title('Separated species by hue')
(Figure: complementary CDF, plotted with complementary=True.)
