Plotting CDF of a pandas series in python

Is there a way to do this? I cannot seem to find an easy way to plot a CDF from a pandas Series.

I believe the functionality you're looking for is in the hist method of a Series object, which wraps matplotlib's hist() function.
Here's the relevant documentation:
In [10]: import matplotlib.pyplot as plt
In [11]: plt.hist?
...
Plot a histogram.
Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
If `True`, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints. If `normed` is also `True`
then the histogram is normalized such that the last bin equals 1.
If `cumulative` evaluates to less than 0 (e.g., -1), the direction
of accumulation is reversed. In this case, if `normed` is also
`True`, then the histogram is normalized such that the first bin
equals 1.
...
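(Note: the normed flag in this older docstring is called density in current matplotlib, which is what the example below uses.)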
For example
In [12]: import pandas as pd
In [13]: import numpy as np
In [14]: ser = pd.Series(np.random.normal(size=1000))
In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>
In [16]: plt.show()

In case you are also interested in the values, not just the plot.
import pandas as pd
import numpy as np
# If you are in jupyter
%matplotlib inline
This will always work (discrete and continuous distributions)
# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series
# Frequency
stats_df = df \
    .groupby('value') \
    ['value'] \
    .agg('count') \
    .pipe(pd.DataFrame) \
    .rename(columns = {'value': 'frequency'})
# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df
# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and the table should be 'pmf'
# (Probability Mass Function), since the distribution is discrete.
# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
Alternative example with a sample drawn from a continuous distribution, or when you have many individual values:
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
For continuous distributions only
Please note: if it is reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then the groupby() + agg('count') step is not necessary, since the count is always 1.
In this case, a percent rank can be used to get to the cdf directly.
Use your best judgment when taking this kind of shortcut! :)
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)
# Get to the CDF directly
df['cdf'] = df.rank(method = 'average', pct = True)
# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)

A CDF or cumulative distribution function plot is basically a graph with the sorted values on the X-axis and the cumulative distribution on the Y-axis. So, I would create a new series with the sorted values as index and the cumulative distribution as values.
First create an example series:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()
Now, before proceeding, append again the last (and largest) value. This step is important especially for small sample sizes in order to get an unbiased CDF:
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as index and the cumulative distribution as values:
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')
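Note that pandas' 'steps' draw style is matplotlib's steps-pre; if you prefer the jumps to occur exactly at the data values (as in the NumPy-based answer further down), steps-post does that:
ser_cdf.plot(drawstyle='steps-post')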

I came here looking for a plot like this with bars and a CDF line:
It can be achieved like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
n, bins, patches = ax.hist(series, bins=100, density=False)
n, bins, patches = ax2.hist(
    series, cumulative=1, histtype='step', bins=100, color='tab:orange')
plt.savefig('test.png')
If you want to remove the vertical line, then it's explained how to accomplish that here. Or you could just do:
ax.set_xlim((ax.get_xlim()[0], series.max()))
I also saw an elegant solution here on how to do it with seaborn.
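For reference, a minimal sketch of the seaborn route, assuming seaborn >= 0.11 where ecdfplot is available:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

series = pd.Series(np.random.normal(size=10000))
sns.ecdfplot(series)  # empirical CDF, no binning required
plt.show()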

This is the easiest way.
import pandas as pd
ser = pd.Series(range(100))
ser.hist(cumulative=True)
Image of cumulative histogram
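If you want probabilities rather than counts on the y-axis, density=True normalizes the cumulative histogram (a sketch; bins=50 is an arbitrary choice):
ser.hist(cumulative=True, density=True, bins=50)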

I found another solution in "pure" Pandas, that does not require specifying the number of bins to use in a histogram:
import pandas as pd
import numpy as np # used only to create example data
series = pd.Series(np.random.normal(size=10000))
cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
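Note that the y-axis here shows a cumulative count rather than a probability; dividing by the series length rescales it to [0, 1]:
cdf = series.value_counts().sort_index().cumsum() / len(series)
cdf.plot()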

To me, this seemed like a simple way to do it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
heights = pd.Series(np.random.normal(size=100))
# empirical CDF
def F(x, data):
    return float(len(data[data <= x])) / len(data)
vF = np.vectorize(F, excluded=['data'])
plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))
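Since np.vectorize still loops in Python, this is O(n^2). For larger series, an equivalent sort-based sketch (same heights data as above; ties don't affect the plot) is much faster:
xs = np.sort(heights.values)
ys = np.arange(1, len(xs) + 1) / len(xs)  # fraction of points <= each sorted value
plt.plot(xs, ys)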

I really like the answer by Raphvanns. It is helpful because it not only produces the plot, but also helps me understand what pdf, cdf, and ccdf are.
I have two things to add to Raphvanns's solution: (1) use collections.Counter wisely to make the process easier; (2) remember to sort (ascending) value before calculating pdf, cdf, and ccdf.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
Generate random numbers:
s = pd.Series(np.random.randint(1000, size=(1000)))
Build a dataframe as Raphvanns suggested:
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# sort values ascending before computing pdf/cdf (see point (2) above)
df = df.sort_values('value').reset_index(drop=True)
Calculate PDF, CDF, and CCDF:
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
Plot:
df.plot(x = 'value', y = ['cdf', 'ccdf'], grid = True)
You may wonder why we have to sort the values before calculating PDF, CDF, and CCDF. Well, let's see what the results would be if we didn't sort them (note that dict(Counter(s)) preserves encounter order rather than sorted order, which is why we sorted explicitly above; below we make the order random instead).
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# randomize the order of `value` (frac=1 shuffles all rows):
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
df.plot(x = 'value', y = ['cdf'], grid = True)
This is the plot:
Why did this happen? Well, the essence of a CDF is "the number of data points we have seen so far", citing YY's lecture slides from his Data Visualization class. Therefore, if value is not sorted (either ascending or descending is fine), then when you plot, with the x-axis in ascending order, the y values will of course be a mess.
If you apply a descending order, you can imagine that the CDF and CCDF will just swap their places:
I will leave a question to the readers of this post: if I randomize the order of value like above, will sorting value after (rather than before) calculating PDF, CDF, and CCDF solve the problem?
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# randomize the order of `value` (frac=1 shuffles all rows):
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
# Will this solve the problem?
df = df.sort_values(by='value')
df.plot(x = 'value', y = ['cdf'], grid = True)

Building on wroscoe's answer:
df[your_column].plot(kind = 'hist', histtype = 'step', density = True, cumulative = True)
You can also provide a number of desired bins.
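For example (bins=250 here is an arbitrary choice):
df[your_column].plot(kind='hist', histtype='step', density=True, cumulative=True, bins=250)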

If you're looking to plot a "true" empirical CDF, which jumps exactly at the values of your data set a, and with the jump at each value proportional to the frequency of the value, NumPy has builtin functions to do the work:
import matplotlib.pyplot as plt
import numpy as np

def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    y = np.cumsum(counts)
    x = np.insert(x, 0, x[0])
    y = np.insert(y/y[-1], 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
The call to unique() returns the data values in sorted order along with their corresponding frequencies. The option drawstyle='steps-post' in the plot() call ensures that the jumps occur where they should. To force a jump at the smallest data value, the code inserts an additional element in front of x and y.
Example usage:
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
ecdf(xvec)
Another usage:
import pandas as pd
df = pd.DataFrame({'x': [7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7]})
ecdf(df['x'])

Related

Overlay Graphs at same point

I want to overlay some graphs out of CSV data (two datasets).
The graph I got from my dataset is shown down below.
Is there any way to align those datasets at specific points? I would like to overlay these plots using the "big drop" as an anchor, to compare them more easily.
The code used:
import pandas as pd
import matplotlib.pyplot as plt
# Read the data
data1 = pd.read_csv('data1.csv', delimiter=";", decimal=",")
data2 = pd.read_csv('data2.csv', delimiter=";", decimal=",")
data3 = pd.read_csv('data3.csv', delimiter=";", decimal=",")
data4 = pd.read_csv('data4.csv', delimiter=";", decimal=",")
# Plot the data
plt.plot(data1['Zeit'], data1['Kanal A'])
plt.plot(data2['Zeit'], data2['Kanal A'])
plt.plot(data3['Zeit'], data3['Kanal A'])
plt.plot(data4['Zeit'], data4['Kanal A'])
plt.show()
plt.close()
I would like to share some data here:
Link to data
Part 1: Anchor times
A simple way is to find the times of interest (lowest point) in each frame, then plot each series with x=t - t_peak instead of x=t. Two ways come to mind to find the desired anchor points:
Simply using the global minimum (in your plots, that would work fine), or
Using the most prominent local minimum, either from first principles, or using scipy's find_peaks().
But first of all, let us attempt to build a reproducible example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
def make_sample(t_peak, tmax_approx=17.5, n=100):
    # uneven times
    t = np.random.uniform(0, 2*tmax_approx/n, n).cumsum()
    y = -1 / (0.1 + 2 * np.abs(t - t_peak))
    trend = 4 * np.random.uniform(-1, 1) / n
    level = np.random.uniform(10, 12)
    y += np.random.normal(trend, 1/n, n).cumsum() + level
    return pd.DataFrame({'t': t, 'y': y})
poi = [2, 2.48, 2.6, 2.1]
np.random.seed(0)
frames = [make_sample(t_peak) for t_peak in poi]
plt.rcParams['figure.figsize'] = (6,2)
fig, ax = plt.subplots()
for df in frames:
    ax.plot(*df.values.T)
In this case, we made the problem maximally inconvenient by giving each time series its own, independent, unevenly distributed time sampling.
Now, retrieving the "maximum drop" by global minimum:
peaks = [df.loc[df['y'].idxmin(), 't'] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
fig, ax = plt.subplots()
for t_peak, df in zip(peaks, frames):
    ax.plot(df['t'] - t_peak, df['y'])
But imagine a case where the global minimum is not suitable. For example, add a large sine wave to each series:
frames = [df.assign(y=df['y'] + 5 * np.sin(df['t'])) for df in frames]
# just plotting the first series
df = frames[0]
plt.plot(*df.values.T)
Clearly, there are several local minima, and the one we want ("sharpest drop") is not the global one.
A simple way to find the desired sharpest drop time is by looking at the difference from each point to its two neighbors:
def arg_steepest_min(v):
    # simply find the minimum that is most separated from the surrounding points
    diff = np.diff(v)
    i = np.argmin(diff[:-1] - diff[1:]) + 1
    return i
peaks = [df['t'].iloc[arg_steepest_min(df['y'])] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
# just plotting the first curve and the peak found
df = frames[0]
plt.plot(*df.values.T)
plt.plot(*df.iloc[arg_steepest_min(df['y'])].T, 'x')
There are more complex cases where you want to bring the full power of find_peaks(). Here is an example that uses the most prominent minimum, using a certain number of samples for neighborhood:
from scipy.signal import find_peaks, peak_prominences
def arg_most_prominent_min(v, prominence=1, wlen=10):
    peaks, details = find_peaks(-v, prominence=prominence, wlen=wlen)
    i = peaks[np.argmax(details['prominences'])]
    return i
peaks = [df['t'].iloc[arg_most_prominent_min(df['y'])] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
In this case, the peaks found by both methods are the same. Aligning the curves gives:
fig, ax = plt.subplots()
for t_peak, df in zip(peaks, frames):
    ax.plot(df['t'] - t_peak, df['y'])
Part 2: aligning the time series for numeric operations
Having found the anchor times and plotted the time series by shifting the x-axis accordingly, suppose now that we want to align all the time series, for example to somehow compare them to one another (e.g.: differences, correlation, etc.). In this example we made up, the time samples are not equidistant and all series have their own sampling.
We can use resample() to achieve our goal. Let us convert the frames into actual time series, transforming the column t (supposed in seconds) into a DateTimeIndex, after shifting the time using the previously found t_peak and using an arbitrary "0" date:
frames = [
    pd.Series(
        df['y'].values,
        index=pd.Timestamp(0) + (df['t'] - t_peak) * pd.Timedelta(1, 's')
    ) for t_peak, df in zip(peaks, frames)]
>>> frames[0]
t
1969-12-31 23:59:58.171107267 11.244308
1969-12-31 23:59:58.421423545 12.387291
1969-12-31 23:59:58.632390727 13.268186
1969-12-31 23:59:58.823099841 13.942224
1969-12-31 23:59:58.971379021 14.359900
...
1970-01-01 00:00:14.022717327 10.422229
1970-01-01 00:00:14.227996854 9.504693
1970-01-01 00:00:14.235034496 9.489011
1970-01-01 00:00:14.525163506 8.388377
1970-01-01 00:00:14.526806922 8.383366
Length: 100, dtype: float64
At this point, the sampling is still uneven, so we use resample to get a fixed frequency. One strategy is to oversample and interpolate:
frames = [df.resample('100ms').mean().interpolate() for df in frames]
for df in frames:
    df.plot()
At this point, we can compare the Series. Here are the pairwise differences and correlations:
fig, axes = plt.subplots(nrows=len(frames), ncols=len(frames), figsize=(10, 5))
for axrow, a in zip(axes, frames):
    for ax, b in zip(axrow, frames):
        (b-a).plot(ax=ax)
        ax.set_title(fr'$\rho = {b.corr(a):.3f}$')
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
plt.tight_layout()

What is the most efficient and Pythonic way of calculating limsup/liminf in pandas?

Limsup is defined as the limit of the suprema of the tails of a sequence. In other words, at the current moment one can look at all future values and take their supremum to form the running limsup.
Question
What is the most efficient and Pythonic way of calculating limsup/liminf in pandas?
My try
I am calculating the limsup using a for loop which I am sure is not an efficient way.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
x = np.random.randn(2000)
y = np.cumsum(x)
df = pd.DataFrame(y, columns=['value'])
df['lim_sup'] = 0
fig, ax = plt.subplots(figsize=(20, 4))
for i in range(len(df)):
    df['lim_sup'].iloc[i] = df['value'].iloc[i:].max()
df['value'].plot(ax=ax)
df['lim_sup'].plot(ax=ax, color='r')
ax.legend(['value', 'limsup'])
plt.show()
Reverse the values and use cummax to get the cumulative maximum from the bottom up:
df["suprema"] = df.loc[::-1, "value"].cummax()
This column should probably be referred to as the suprema for m >= n, rather than the limsup.
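Applied to the question's setup, this single vectorized line replaces the whole loop; a sketch:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'value': np.cumsum(np.random.randn(2000))})
# reversed cumulative max = running maximum over all future values
df['lim_sup'] = df.loc[::-1, 'value'].cummax()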

Displot dips between whole numbers

I am trying to plot a density curve with seaborn using the age of vehicles.
My density curve has dips between the whole numbers, while my age values are all whole numbers.
I can't seem to find anything related to this issue, so I thought I would try my luck here; any input is appreciated.
My current fix is just using a histogram with larger bins, but I would like to get this working with a density plot.
Thanks!
In seaborn.displot you are passing the kind = 'kde' parameter in order to get a continuous curve. However, this parameter triggers a Kernel Density Estimation, which computes values for all numbers, including non-integer ones.
Instead, you can tune seaborn.histplot to get a continuous step curve with the element and fill parameters (I create a fake dataframe just to draw a plot, since you didn't provide your data):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
N = 10000
df = pd.DataFrame({'age': np.random.poisson(lam = 4, size = N)})
df['age'] = df['age'] + 1
fig, ax = plt.subplots(1, 2, figsize = (8, 4))
sns.histplot(ax = ax[0], data = df, bins = np.arange(0.5, df['age'].max() + 1, 1))
sns.histplot(ax = ax[1], data = df, bins = np.arange(0.5, df['age'].max() + 1, 1), element = 'step', fill = False)
ax[0].set_xticks(range(1, 14))
ax[1].set_xticks(range(1, 14))
plt.show()
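Alternatively, assuming seaborn >= 0.11, histplot's discrete=True parameter centers a unit-wide bin on each integer, which avoids computing bin edges by hand; a minimal sketch on the same fake dataframe:
sns.histplot(data=df, x='age', discrete=True, element='step', fill=False)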
As a comparison, here is seaborn.displot on the same dataframe, passing the kind = 'kde' parameter:

half yearly colorbar in matplotlib and pandas

I have a pandas DataFrame. I am making a scatter plot and trying to categorize the data with a colorbar. I did it for monthly and quarterly classification, as shown in the example code below.
a = np.random.rand(366)
b = np.random.rand(366)*0.4
index = (pd.date_range(pd.to_datetime('01-01-2000'), periods=366))
df = pd.DataFrame({'a':a,'b':b},index = index)
plt.scatter(df['a'],df['b'],c = df.index.month)
plt.colorbar()
And also for quarters:
plt.scatter(df['a'],df['b'],c = df.index.quarter)
plt.colorbar()
My question: is there any way to categorize by half-year, for example months 1-6 and 7-12, or by a custom split such as months 10-3 and 4-9?
Thank you; your help/suggestions will be highly appreciated.
Make a custom function to pass to scatter's color argument. I made an example for a half-yearly division; you can use it as a template for your own split function:
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
# if month is 1 to 6 then the first halfyear else the second halfyear
def halfyear(m):
    return 0 if (m <= 6) else 1
# vectorize function to use with Series
hy = np.vectorize(halfyear)
a = np.random.rand(366)
b = np.random.rand(366)*0.4
index = (pd.date_range(pd.to_datetime('01-01-2000'), periods=366))
df = pd.DataFrame({'a':a,'b':b},index = index)
# apply custom function 'hy' for 'c' argument
plt.scatter(df['a'],df['b'], c = hy(df.index.month))
plt.colorbar()
plt.show()
Another way is to use a lambda function:
plt.scatter(df['a'], df['b'],
            c=df.index.map(lambda m: 0 if (m.month > 0 and m.month < 7) else 1))
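The same pattern covers the months 10-3 / 4-9 split asked about; a sketch:
# months 4-9 -> 0; months 10-12 and 1-3 -> 1
plt.scatter(df['a'], df['b'],
            c=df.index.map(lambda d: 0 if 4 <= d.month <= 9 else 1))
plt.colorbar()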
I would opt for a solution which does not completely discard the monthly information. Using colors which are similar but distinguishable for the months allows one to visually classify by half-year as well as by month.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors
a = np.random.rand(366)
b = np.random.rand(366)*0.4
index = (pd.date_range(pd.to_datetime('01-01-2000'), periods=366))
df = pd.DataFrame({'a':a,'b':b},index = index)
colors=["crimson", "orange", "darkblue", "skyblue"]
cdic = list(zip([0,.499,.5,1],colors))
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("name", cdic,12 )
norm = matplotlib.colors.BoundaryNorm(np.arange(13)+.5,12)
plt.scatter(df['a'],df['b'],c = df.index.month, cmap=cmap, norm=norm)
plt.colorbar(ticks=np.arange(1,13))
plt.show()

Heat map for a very large matrix, including NaNs

I am trying to see if NaNs are concentrated somewhere, or if there is any pattern for their distribution.
The idea is to use Python to plot a heat map of the matrix (which is 200K rows and 1k columns) and set a special color for NaN values (the rest of the values can be represented by a single color; that doesn't matter).
An example of a possible display:
Thank you all in advance
A 1:200 aspect ratio is pretty bad and, since you could run into memory issues, you should probably break it up into several Nx1k pieces.
That being said, here is my solution (inspired by your example image):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.axes_grid1 import AxesGrid
# generate random matrix
xDim = 2000
yDim = 4000
# number of nans
nNans = int(xDim*yDim*.1)
rands = np.random.rand(yDim, xDim)
# create a skewed distribution for the nans
x = np.clip(np.random.gamma(2, yDim*.125, size=nNans).astype(int), 0, yDim-1)
y = np.random.randint(0, xDim, size=nNans)
rands[x, y] = np.nan
# find the nans:
isNan = np.isnan(rands)
fig = plt.figure()
# make axesgrid so we can put a histogram-like plot next to the data
grid = AxesGrid(fig, 111, nrows_ncols=(1, 2), axes_pad=0.05)
# plot the data using binary colormap
grid[0].imshow(isNan, cmap=cm.binary)
# plot the histogram
grid[1].plot(np.sum(isNan,axis=1), range(isNan.shape[0]))
# set ticks and limits, so the figure looks nice
grid[0].set_xticks([0,250,500,750,1000,1250,1500,1750])
grid[1].set_xticks([0,250,500,750])
grid[1].set_xlim([0,750])
grid.axes_llc.set_ylim([0, yDim])
plt.show()
Here is what it looks like:
# Learn about API authentication here: https://plot.ly/python/getting-started
# Find your api_key here: https://plot.ly/settings/api
import plotly.plotly as py
import plotly.graph_objs as go
data = [
    go.Heatmap(
        z=[[1, 20, 30],
           [20, 1, 60],
           [30, 60, 1]]
    )
]
plot_url = py.plot(data, filename='basic-heatmap')
source: https://plot.ly/python/heatmaps/
What you could do is use a scatter plot:
import matplotlib.pyplot as plt
import numpy as np
# create a matrix with random numbers
A = np.random.rand(2000,10)
# make some NaNs in it:
for _ in range(1000):
    i = np.random.randint(0, 2000)
    j = np.random.randint(0, 10)
    A[i, j] = np.nan
# get a matrix to plot with only the NaNs:
B = np.isnan(A)
# if NaN plot a point.
for i in range(2000):
    for j in range(10):
        if B[i, j]: plt.scatter(i, j)
plt.show()
When using Python 2.6 or 2.7, consider using xrange instead of range for a speedup.
Note: it could be faster to do:
C = np.where(B)
plt.scatter(C[0],C[1])
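Another matplotlib option: imshow draws NaN cells with the colormap's "bad" color, which set_bad() controls; a minimal sketch:
import copy
import numpy as np
import matplotlib.pyplot as plt

A = np.random.rand(2000, 10)
A[np.random.rand(*A.shape) < 0.05] = np.nan  # sprinkle in some NaNs

cmap = copy.copy(plt.cm.viridis)
cmap.set_bad(color='red')  # dedicated color for NaN cells
plt.imshow(A, cmap=cmap, aspect='auto', interpolation='none')
plt.show()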
