How to generate legible plots in pandas when looping over columns? - python

Generate the dataframe for replicability:
df = pd.DataFrame(np.random.randn(50, 1000), columns=list('ABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDED'))
Check for normalcy of distribution of each variable (note: this takes a long time to run)
# Set the column names
columns= df.columns
# Loop over all columns
fig, axs = plt.subplots(len(df.columns), figsize=(5, 25))
for n, col in enumerate(df.columns):
    df[col].hist(ax=axs[n])
Result generates illegible histograms and takes a very long time to run.
The length of time is okay, but I am curious if anyone has suggestions for generating legible histograms (do not have to be fancy), which can be quickly reviewed for the entire dataframe to ensure the normality of the distributions.

This code generates 1000 histograms and allows you to see each one in sufficient detail to understand how normally-distributed the columns are:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))
# Loop over all columns, drawing each histogram in a 25 x 40 grid
fig = plt.figure(figsize=(16, 10))
for n, col in enumerate(df.columns):
    plt.subplot(25, 40, n + 1)
    df[col].hist(ax=plt.gca())
    plt.axis('off')
plt.tight_layout()
plt.savefig('1000_histograms.png', bbox_inches='tight', pad_inches=0, dpi=200)
Another way to ascertain normality is with a QQ plot, which may be easier to visualize in bulk compared to a histogram:
import statsmodels.api as sm
cols = 1000
df = pd.DataFrame(np.random.normal(0,1, [50, cols]))
fig = plt.figure(figsize=(18, 12))
for n, col in enumerate(df.columns):
    plt.subplot(25, 40, n + 1)
    sm.qqplot(df[col], ax=plt.gca(),  # line='45',
              marker='.', markerfacecolor='C0', markeredgecolor='C0',
              markersize=2)
    # sm.qqline(ax=plt.gca(), line='45', fmt='lightgray')
    plt.axis('off')
plt.savefig('1000_QQ_plots13.png', bbox_inches='tight', pad_inches=0, dpi=200)
The closer each line is to a 45 degree diagonal, the more normally-distributed the column data is.
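If eyeballing 1000 QQ plots is still too much, the straightness of each QQ plot can be summarised as a single number and the columns ranked by it. The following is only a sketch, not part of the answer above: it assumes scipy is available, uses scipy.stats.probplot (which returns the correlation coefficient r of the QQ points), and works on made-up data with 200 rows and one deliberately skewed column. The helper name qq_correlation is hypothetical. Note that r measures how straight the QQ line is, not whether its slope is exactly 45 degrees.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
# Made-up data: 20 normal columns, 200 rows each (an assumption for this sketch).
df = pd.DataFrame(rng.normal(0, 1, size=(200, 20)))
# Replace one column with clearly skewed data so there is something to find.
df[3] = rng.exponential(1.0, size=200)


def qq_correlation(values):
    """Correlation between sample quantiles and normal quantiles (1.0 = straight line)."""
    (_, _), (_, _, r) = stats.probplot(values, dist="norm")
    return r


# One number per column, lowest (least normal-looking) first.
qq_r = df.apply(qq_correlation).sort_values()
print(qq_r.head())
```

Only the columns at the top of this ranking would then need a visual check.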

Plotting vs normality test
As discussed in the comments below, the OP's question has shifted to managing thousands of plots. From that perspective, Nathaniel's answer is appropriate.
However, I felt that the unstated intent was to decide whether a given variable is normally distributed, with thousands of variables to consider.
Check for normalcy of distribution of each variable (note: this takes a long time to run)
With that in mind, it seems (to me) that having a human review thousands of plots to spot normal/non-normal distributions is an inappropriate method. There is a French idiom for this: "usine à gaz" ("gas factory").
Therefore, this answer focuses on performing the analysis programmatically and providing a more concise report.
Proposition
Perform analysis of data normality over a huge number of columns.
It relies on the suggestion expressed in this answer.
The idea is to:
perform a distribution (normality) test on every column
collect the results in a dataframe
report the normal/non-normal ratio in a graph
report the names of the non-normal columns
With this method, we can further use programming to manipulate the normal/non-normal columns. For instance, we could perform additional distribution tests, or plot only the non-normal distributions, thus reducing the number of graphs that actually need to be looked at.
Output example:
------------
Columns probably not a normal dist:
Column Not_Normal p-value Normality
0 V True 0.0 Not Normal
0 W True 0.0 Not Normal
0 X True 0.0 Not Normal
0 Y True 0.0 Not Normal
0 Z True 0.0 Not Normal
Disclaimer: the methods used may not be statistically "canonical". One should be very careful when using statistical tools, since each one has its own domain of use.
I chose a p-value threshold of 0.01 (1%), since it could become the standard in scientific publications instead of the usual 0.05 (5%).
One should read https://en.wikipedia.org/wiki/Normality_test
Tests of univariate normality include the following:
D'Agostino's K-squared test,
Jarque–Bera test,
Anderson–Darling test,
Cramér–von Mises criterion,
Lilliefors test,
Kolmogorov–Smirnov test
Shapiro–Wilk test, and
Pearson's chi-squared test.
Code
Behavior may vary on your computer depending on the RNG (random number generation).
The following example uses 5 columns sampled from a normal distribution and 5 columns sampled from a Pareto distribution, using numpy.
The normality test performs well under these conditions (even though I find p-values of exactly 0.0 suspicious, even for Pareto samples).
Nevertheless, I think we can agree that this is about the method, not the actual results.
import pandas as pd
import numpy as np
import scipy
from scipy import stats
import seaborn as sb
import matplotlib.pyplot as plt
import sys
print('System: {}'.format(sys.version))
for module in [pd, np, scipy, sb]:
    print('Module {:10s} - version {}'.format(module.__name__, module.__version__))
nb_lines = 10000
headers_normal = 'ABCDE'
headers_pareto = 'VWXYZ'
reapeat_factor = 1
nb_cols = len(list(reapeat_factor * headers_normal))
df_normal = pd.DataFrame(np.random.randn(nb_lines, nb_cols), columns=list(reapeat_factor * headers_normal))
df_pareto = pd.DataFrame((np.random.pareto(12.0, size=(nb_lines,nb_cols )) + 15.) * 4., columns=list(reapeat_factor * headers_pareto))
df = df_normal.join(df_pareto)
alpha = 0.01
df_list = list()
# normality code taken from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html
cat_map = {True: 'Not Normal',
           False: 'Maybe Normal'}
for col in df.columns:
    k2, p = stats.normaltest(df[col])
    is_not_normal = p < alpha
    tmp_df = pd.DataFrame({'Column': [col],
                           'Not_Normal': [is_not_normal],
                           'p-value': [p],
                           'Normality': cat_map[is_not_normal]
                           })
    df_list.append(tmp_df)
df_results = pd.concat(df_list)
df_results['Normality'] = df_results['Normality'].astype('category')
print('------------')
print('Columns names probably not a normal dist:')
# full data
print(df_results[(df_results['Normality'] == 'Not Normal')])
# only column names
# print(df_results[(df_results['Normality'] == 'Not Normal')]['Column'])
print('------------')
print('Plotting countplot')
sb.countplot(data=df_results, y='Normality', orient='v')
plt.show()
Outputs:
System: 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
Module pandas - version 0.24.1
Module numpy - version 1.16.2
Module scipy - version 1.2.1
Module seaborn - version 0.9.0
------------
Columns names probably not a normal dist:
Column Not_Normal p-value Normality
0 V True 0.0 Not Normal
0 W True 0.0 Not Normal
0 X True 0.0 Not Normal
0 Y True 0.0 Not Normal
0 Z True 0.0 Not Normal
------------
Plotting countplot
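The follow-up idea mentioned earlier in this answer, plotting only the columns that fail the test, could look roughly like the sketch below. It is not the code that produced the output above: the data is made up (2000 rows, 5 normal and 5 Pareto columns), the output file name is arbitrary, and it relies on scipy.stats.normaltest accepting an axis argument so that all columns are tested in one call.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
# Made-up data for this sketch: 5 normal columns and 5 Pareto columns.
normal = pd.DataFrame(rng.normal(size=(2000, 5)), columns=list("ABCDE"))
skewed = pd.DataFrame(rng.pareto(12.0, size=(2000, 5)), columns=list("VWXYZ"))
df = normal.join(skewed)

alpha = 0.01
# normaltest works column-wise when given axis=0, so no Python loop is needed.
_, p_values = stats.normaltest(df.to_numpy(), axis=0)
flagged = df.columns[p_values < alpha]
print("Columns flagged as not normal:", list(flagged))

# Plot only the flagged columns instead of all of them.
if len(flagged) > 0:
    df[flagged].hist(figsize=(10, 6), bins=40)
    plt.tight_layout()
    plt.savefig("flagged_columns.png")
```

With thousands of columns, keep in mind that a threshold of 0.01 will still flag roughly 1% of truly normal columns by chance.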

I really like Nathaniel's answer but I will add my two cents.
I would go for seaborn and in particular seaborn.distplot.
This will allow you to easily fit a normal distribution to each histogram plot and make the visualization easier.
import seaborn as sns
from scipy.stats import norm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))
fig = plt.figure(figsize=(16, 10))
for i, col in enumerate(df.columns):
    # 25 x 40 grid so that all 1000 columns fit
    ax = fig.add_subplot(25, 40, i + 1)
    # note: distplot is deprecated in recent seaborn releases
    sns.distplot(df[col], fit=norm, kde=False, ax=ax)
plt.tight_layout()
Additionally, I am not sure if putting columns with the same name in your example was done on purpose. If that's the case the easiest solution to loop through the columns is to use .iloc and the code would look like this:
import seaborn as sns
from scipy.stats import norm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(50, 1000), columns=list('ABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDED'))
fig, ax = plt.subplots(figsize = (12, 10))
for i, col in enumerate(df.columns):
    plt.subplot(25, 40, i + 1)
    sns.distplot(df.iloc[:, i], fit=norm, kde=False, ax=plt.gca())
    plt.axis('off')
plt.tight_layout()

Try something like this:
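plt.figure(figsize=(26, 3 * len(df.columns)))
for i, col in enumerate(df.columns):
    plt.subplot(3, 4, i + 1)
    plt.hist(df[col], color='blue', bins=100)
    plt.title(col)
In plt.subplot(3, 4, i + 1), 4 is the number of columns and 3 is the number of rows, so this grid only holds 12 plots. Instead of hard-coding 3, it is better to derive the number of rows from the number of dataframe columns (rounded up to an integer, since plt.subplot does not accept a float):
plt.subplot(math.ceil(len(df.columns) / 4), 4, i + 1)  # requires `import math`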

Try this - tight_layout ensures no overlap, figsize controls the size of each plot.
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 3*30), columns=list('ABC'*30))
df.hist(figsize=(20,20))
plt.tight_layout()
plt.show()
However, if you are after a normality test, I would suggest using something like scipy.stats.normaltest (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html) instead of relying on visual inspection, especially if you have many variables.

Related

Replace outliers with neighbour-Value

I have a plot with some outliers (wrong measurements):
The base data is good though. I want to just delete everything that is too far off the "current average". I tried using pd.rolling().mean() but with no satisfactory result:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = np.genfromtxt('shard_height_plot.csv', delimiter = ',')
df = pd.DataFrame(data)
df.set_index(0, inplace = True)
df2 = df.rolling(20).mean()
plt.plot(df)
plt.plot(df2)
plt.show()
I tried to search the web for a good solution but couldn't find one. It shouldn't be that hard to delete data points, that jump through the roof, should it?
Edit:
data file can be downloaded here: https://ufile.io/pviuc
Edit2:
I tackled this problem of too many outliers by improving my data set creation.
The core of it:
if abs(D - D_List[-2]) > 30:
    D = D_List[-2]
    D_List.pop()
    D_List.append(D)
Basically, this checks whether the change in a value is larger than 30; if so, it deletes the last value and replaces it with the second-to-last one. Not very spectacular, but just what I need. I used one of the answers though, because it is so much prettier. Thank you guys very much.
Let's try using scipy.signal see docs:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import signal
data = np.genfromtxt('shard_height_plot.csv', delimiter = ',')
df = pd.DataFrame(data)
df.set_index(0, inplace = True)
df2 = df.rolling(20).mean()
b, a = signal.butter(3, 0.05)
y = signal.filtfilt(b,a, df[1].values)
df3 = pd.DataFrame(y, index=df2.index)
plt.plot(df, alpha=.3)
plt.plot(df2, alpha=.3)
plt.plot(df3)
plt.show()
Output:
Use medfilt:
y = signal.medfilt(df[1].values)
Output:
There are many ways to smooth a curve (rolling mean, GAM, smoothing spline etc.), my favorite one is the Savitzky–Golay method.
It works as follows: a small window around a data point y is fitted with a polynomial by least squares, and the value of that polynomial at the point is used as the estimate ŷ. The window is then shifted forward by one data point and the process repeats.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
x = np.linspace(0,5,150)
y = np.cos(x) + np.random.random(150) * 0.15
yhat = savgol_filter(y, 49, 3)
plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()
Note that a rolling mean cannot work well in your case with a window as small as 20, since an outlier carries a non-negligible weight (5%) in every window that contains it and will always induce a large bias...
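Since the question asks for replacing outliers rather than smoothing the whole curve, a rolling median is another option worth mentioning: unlike the mean, it is barely moved by an isolated spike. The sketch below is an illustration on synthetic data, not on the file from the question; the window length (21) and the threshold (0.5) are assumptions that would need tuning to the real data's scale.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 300)
# Synthetic signal with small noise, standing in for the real measurements.
s = pd.Series(np.cos(x) + rng.normal(0, 0.02, size=x.size))
# Inject a few obviously wrong measurements.
outlier_pos = [40, 150, 260]
s.iloc[outlier_pos] += 5.0

# A centered rolling median is barely affected by isolated spikes.
med = s.rolling(21, center=True, min_periods=1).median()
threshold = 0.5  # hypothetical cutoff; pick one that suits the data's scale
is_outlier = (s - med).abs() > threshold

# Replace flagged points with the local median, keep everything else untouched.
cleaned = s.where(~is_outlier, med)
print("Replaced positions:", list(np.flatnonzero(is_outlier.to_numpy())))
```

The good measurements stay exactly as they were; only the flagged points are overwritten.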

Ridgeline/Joyplot across a moving range

(Using Python 3.0) In increments of 0.25, I want to calculate and plot PDFs for the given data across specified ranges for easy visualization.
Calculating the individual plot has been done thanks to the SO community, but I cannot quite get the algorithm right to iterate properly across the range of values.
Data: https://www.dropbox.com/s/y78pynq9onyw9iu/Data.csv?dl=0
What I have so far is normalized toy data that looks like a shotgun blast with one of the target areas isolated between the black lines with an increment of 0.25:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Data=pd.read_csv("Data.csv")
g = sns.jointplot(x="x", y="y", data=Data)
bottom_lim = 0
top_lim = 0.25
temp = Data.loc[(Data.y>=bottom_lim)&(Data.y<top_lim)]
g.ax_joint.axhline(top_lim, c='k', lw=2)
g.ax_joint.axhline(bottom_lim, c='k', lw=2)
# we have to create a secondary y-axis to the joint-plot, otherwise the kde
# might be very small compared to the scale of the original y-axis
ax_joint_2 = g.ax_joint.twinx()
sns.kdeplot(temp.x, shade=True, color='red', ax=ax_joint_2, legend=False)
ax_joint_2.spines['right'].set_visible(False)
ax_joint_2.spines['top'].set_visible(False)
ax_joint_2.yaxis.set_visible(False)
And now what I want to do is make a ridgeline/joyplot of this data across each 0.25 band of data.
I tried a few techniques from the various Seaborn examples out there, but nothing really accounts for the band or range of values as the y-axis. I'm struggling to translate my written algorithm into working code as a result.
I don't know if this is exactly what you are looking for, but hopefully this gets you in the ballpark. I also know very little about python, so here is some R:
library(tidyverse)
library(ggridges)
data = read_csv("https://www.dropbox.com/s/y78pynq9onyw9iu/Data.csv?dl=1")
data2 = data %>%
  mutate(breaks = cut(x, breaks = seq(-1,7,.5), labels = FALSE))
data2 %>%
  ggplot(aes(x=x,y=breaks)) +
  geom_density_ridges() +
  facet_grid(~breaks, scales = "free")
data2 %>%
  ggplot(aes(x=x,y=y)) +
  geom_point() +
  geom_density() +
  facet_grid(~breaks, scales = "free")
And please forgive the poorly formatted axis.
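For a Python version of the same idea, one possible approach is to assign each row to a 0.25-wide band of y, estimate a KDE of x within each band, and draw each curve starting from its band's lower edge. This is only a sketch under assumptions: it uses synthetic data in place of Data.csv (with the same column names "x" and "y"), scipy's gaussian_kde rather than seaborn, and an arbitrary scaling factor and output file name.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Synthetic stand-in for Data.csv (columns "x" and "y" as in the question).
rng = np.random.default_rng(3)
data = pd.DataFrame({"x": rng.normal(3, 1, 2000), "y": rng.uniform(0, 1, 2000)})

step = 0.25
# Lower edge of the half-open band [low, low + step) that each row falls into.
data["band"] = np.floor(data["y"] / step) * step

grid = np.linspace(data["x"].min(), data["x"].max(), 200)
fig, ax = plt.subplots(figsize=(6, 6))
densities = {}
for low, group in data.groupby("band"):
    if len(group) < 2:
        continue  # a KDE needs at least two points
    density = gaussian_kde(group["x"].to_numpy())(grid)
    densities[low] = density
    # Scale each curve to fit inside its band and draw it from the band's lower edge.
    scaled = density / density.max() * step * 0.9
    ax.fill_between(grid, low, low + scaled, alpha=0.6)
ax.set_xlabel("x")
ax.set_ylabel("y band (lower edge)")
fig.savefig("ridgeline_bands.png")
```

Because every curve is rescaled to its own maximum, the heights are comparable in shape only, not in absolute density.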

Plotting colored lines connecting individual data points of two swarmplots

I have:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
# Generate random data
set1 = np.random.randint(0, 40, 24)
set2 = np.random.randint(0, 100, 24)
# Put into dataframe and plot
df = pd.DataFrame({'set1': set1, 'set2': set2})
data = pd.melt(df)
sb.swarmplot(data=data, x='variable', y='value')
The two random distributions plotted with seaborn's swarmplot function:
I want the individual plots of both distributions to be connected with a colored line such that the first data point of set 1 in the dataframe is connected with the first data point of set 2.
I realize that this would probably be relatively simple without seaborn but I want to keep the feature that the individual data points do not overlap.
Is there any way to access the individual plot coordinates in the seaborn swarmfunction?
EDIT: Thanks to @Mead, who pointed out a bug in my post prior to 2021-08-23 (I forgot to sort the locations in the prior version).
I gave the nice answer by Paul Brodersen a try, and despite him saying that
Madness lies this way
... I actually think it's pretty straight forward and yields nice results:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Generate random data
rng = np.random.default_rng(42)
set1 = rng.integers(0, 40, 5)
set2 = rng.integers(0, 100, 5)
# Put into dataframe
df = pd.DataFrame({"set1": set1, "set2": set2})
print(df)
data = pd.melt(df)
# Plot
fig, ax = plt.subplots()
sns.swarmplot(data=data, x="variable", y="value", ax=ax)
# Now connect the dots
# Find idx0 and idx1 by inspecting the elements return from ax.get_children()
# ... or find a way to automate it
idx0 = 0
idx1 = 1
locs1 = ax.get_children()[idx0].get_offsets()
locs2 = ax.get_children()[idx1].get_offsets()
# before plotting, we need to sort so that the data points
# correspond to each other as they did in "set1" and "set2"
sort_idxs1 = np.argsort(set1)
sort_idxs2 = np.argsort(set2)
# revert "ascending sort" through sort_idxs2.argsort(),
# and then sort into order corresponding with set1
locs2_sorted = locs2[sort_idxs2.argsort()][sort_idxs1]
for i in range(locs1.shape[0]):
    x = [locs1[i, 0], locs2_sorted[i, 0]]
    y = [locs1[i, 1], locs2_sorted[i, 1]]
    ax.plot(x, y, color="black", alpha=0.1)
It prints:
set1 set2
0 3 85
1 30 8
2 26 69
3 17 20
4 17 9
And you can see that the data is linked correspondingly in the plot.
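The double indexing in locs2[sort_idxs2.argsort()][sort_idxs1] is the least obvious step, so here is a small NumPy-only sketch of what it does. It assumes, as the code above does, that the plotted points of each swarm come back in ascending order of value; that is worth verifying for your seaborn version, since the ordering is an implementation detail. The values are made up and chosen without ties, and plain sorted arrays stand in for the plot offsets.

```python
import numpy as np

set1 = np.array([3, 30, 26, 17, 18])
set2 = np.array([85, 8, 69, 20, 9])

# Stand-ins for the plotted points: assume each swarm is stored in ascending
# order of value, which is what the sorting step in the answer relies on.
locs1 = np.sort(set1)
locs2 = np.sort(set2)

sort_idxs1 = np.argsort(set1)
sort_idxs2 = np.argsort(set2)

# sort_idxs2.argsort() gives, for every original row, its rank within set2,
# so indexing with it puts locs2 back into original row order ...
locs2_original_order = locs2[sort_idxs2.argsort()]
# ... and indexing with sort_idxs1 then lines it up with locs1.
locs2_sorted = locs2_original_order[sort_idxs1]

pairs = list(zip(locs1.tolist(), locs2_sorted.tolist()))
print(pairs)
```

Each printed pair should be one row of the original dataframe, which is exactly the pairing the connecting lines need.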
Sure, it's possible (but you really don't want to).
seaborn.swarmplot returns the axis instance (here: ax). You can grab the children ax.get_children to get all plot elements. You will see that for each set of points there is an element of type PathCollection. You can determine the x, y coordinates by using the PathCollection.get_offsets() method.
I do not suggest you do this! Madness lies this way.
I suggest you have a look at the source code (found here), and derive your own _PairedSwarmPlotter from _SwarmPlotter and change the draw_swarmplot method to your needs.

Heat map for a very large matrix, including NaNs

I am trying to see if NaNs are concentrated somewhere, or if there is any pattern for their distribution.
The idea is to use python to plot a heatMap of the matrix (which is 200K rows and 1k columns) and set a special color for NaN values (the rest of the values can be represented by the same color, this doesn't matter)
An example of a possible display:
Thank you all in advance
A 1:200 aspect ratio is pretty bad and, since you could run into memory issues, you should probably break it up into several Nx1k pieces.
That being said, here is my solution (inspired by your example image):
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import AxesGrid
# generate random matrix
xDim = 2000
yDim = 4000
# number of nans (must be an integer to be used as a size)
nNans = int(xDim*yDim*.1)
rands = np.random.rand(yDim, xDim)
# create a skewed distribution for the nans
x = np.clip(np.random.gamma(2, yDim*.125, size=nNans).astype(int), 0, yDim-1)
y = np.random.randint(0, xDim, size=nNans)
rands[x, y] = np.nan
# find the nans:
isNan = np.isnan(rands)
fig = plt.figure()
# make axesgrid so we can put a histogram-like plot next to the data
grid = AxesGrid(fig, 111, nrows_ncols=(1, 2), axes_pad=0.05)
# plot the data using binary colormap
grid[0].imshow(isNan, cmap='binary')
# plot the histogram
grid[1].plot(np.sum(isNan, axis=1), range(isNan.shape[0]))
# set ticks and limits, so the figure looks nice
grid[0].set_xticks([0, 250, 500, 750, 1000, 1250, 1500, 1750])
grid[1].set_xticks([0, 250, 500, 750])
grid[1].set_xlim([0, 750])
grid.axes_llc.set_ylim([0, yDim])
plt.show()
Here is what it looks like:
# Learn about API authentication here: https://plot.ly/python/getting-started
# Find your api_key here: https://plot.ly/settings/api
import plotly.plotly as py
import plotly.graph_objs as go
data = [
go.Heatmap(
z=[[1, 20, 30],
[20, 1, 60],
[30, 60, 1]]
)
]
plot_url = py.plot(data, filename='basic-heatm
Source: https://plot.ly/python/heatmaps/
What you could do is use a scatter plot:
import matplotlib.pyplot as plt
import numpy as np
# create a matrix with random numbers
A = np.random.rand(2000,10)
# make some NaNs in it:
for _ in range(1000):
    i = np.random.randint(0, 2000)
    j = np.random.randint(0, 10)
    A[i, j] = np.nan
# get a matrix to plot with only the NaNs:
B = np.isnan(A)
# if NaN plot a point.
for i in range(2000):
    for j in range(10):
        if B[i, j]:
            plt.scatter(i, j)
plt.show()
When using Python 2.6 or 2.7, consider using xrange instead of range for a speedup.
Note: it could be faster to do:
C = np.where(B)
plt.scatter(C[0],C[1])
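For a matrix as tall as the one in the question (200K rows), drawing every cell is probably more detail than a screen can show. One option, sketched below under assumptions, is to aggregate the NaN mask into blocks of rows and plot the fraction of NaNs per block and column. The matrix here is a smaller made-up stand-in, the block height of 200 is arbitrary, and the reshape trick requires the number of rows to be divisible by the block height.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n_rows, n_cols = 20_000, 100  # smaller stand-in for the 200K x 1K matrix
A = rng.random((n_rows, n_cols))
# Concentrate NaNs in the first quarter of the rows so there is a pattern to see.
nan_rows = rng.integers(0, n_rows // 4, size=50_000)
nan_cols = rng.integers(0, n_cols, size=50_000)
A[nan_rows, nan_cols] = np.nan

block = 200  # hypothetical block height; n_rows must be divisible by it here
mask = np.isnan(A)
# Fraction of NaNs per (block of rows, column): shape (n_rows // block, n_cols).
nan_fraction = mask.reshape(n_rows // block, block, n_cols).mean(axis=1)

fig, ax = plt.subplots(figsize=(6, 6))
im = ax.imshow(nan_fraction, aspect="auto", cmap="binary", interpolation="nearest")
fig.colorbar(im, ax=ax, label="fraction of NaN")
ax.set_xlabel("column")
ax.set_ylabel(f"row block ({block} rows each)")
fig.savefig("nan_density.png")
```

Darker cells then mark regions where NaNs are concentrated, and the image stays small regardless of how many rows the original matrix has.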

Plotting CDF of a pandas series in python

Is there a way to do this? I cannot seem to find an easy way to interface a pandas Series with plotting a CDF.
I believe the functionality you're looking for is in the hist method of a Series object which wraps the hist() function in matplotlib
Here's the relevant documentation
In [10]: import matplotlib.pyplot as plt
In [11]: plt.hist?
...
Plot a histogram.
Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
If `True`, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints. If `normed` is also `True`
then the histogram is normalized such that the last bin equals 1.
If `cumulative` evaluates to less than 0 (e.g., -1), the direction
of accumulation is reversed. In this case, if `normed` is also
`True`, then the histogram is normalized such that the first bin
equals 1.
...
For example
In [12]: import pandas as pd
In [13]: import numpy as np
In [14]: ser = pd.Series(np.random.normal(size=1000))
In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>
In [16]: plt.show()
In case you are also interested in the values, not just the plot.
import pandas as pd
import numpy as np  # needed for the continuous examples further down
# If you are in jupyter
%matplotlib inline
This will always work (discrete and continuous distributions)
# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series
# Frequency
stats_df = df \
    .groupby('value') \
    ['value'] \
    .agg('count') \
    .pipe(pd.DataFrame) \
    .rename(columns = {'value': 'frequency'})
# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df
# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and the table should be 'pmf'
# (Probability Mass Function) since the distribution is discrete.
# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
Alternative example, for a sample drawn from a continuous distribution or when you have a lot of individual values:
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
For continuous distributions only
Please note: if it is reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then the groupby() + agg('count') step is not necessary (since the count is always 1).
In this case, a percent rank can be used to get to the cdf directly.
Use your best judgment when taking this kind of shortcut! :)
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)
# Get to the CDF directly
df['cdf'] = df.rank(method = 'average', pct = True)
# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
A CDF or cumulative distribution function plot is basically a graph with on the X-axis the sorted values and on the Y-axis the cumulative distribution. So, I would create a new series with the sorted values as index and the cumulative distribution as values.
First create an example series:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()
Now, before proceeding, append again the last (and largest) value. This step is important especially for small sample sizes in order to get an unbiased CDF:
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as index and the cumulative distribution as values:
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')
I came here looking for a plot like this with bars and a CDF line:
It can be achieved like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
# `normed` was removed from newer matplotlib versions; `density` is its replacement
n, bins, patches = ax.hist(series, bins=100, density=False)
n, bins, patches = ax2.hist(
    series, cumulative=1, histtype='step', bins=100, color='tab:orange')
plt.savefig('test.png')
If you want to remove the vertical line, then it's explained how to accomplish that here. Or you could just do:
ax.set_xlim((ax.get_xlim()[0], series.max()))
I also saw an elegant solution here on how to do it with seaborn.
This is the easiest way.
import pandas as pd
df = pd.Series([i for i in range(100)])
df.hist( cumulative = True )
Image of cumulative histogram
I found another solution in "pure" Pandas, that does not require specifying the number of bins to use in a histogram:
import pandas as pd
import numpy as np # used only to create example data
series = pd.Series(np.random.normal(size=10000))
cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
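Note that the cumulative sum above is in counts, so the curve ends at the sample size rather than at 1. If a CDF scaled to [0, 1] is wanted, value_counts(normalize=True) can be used instead. The sketch below is an illustration on a small made-up series; the lookup helper ecdf_at is a hypothetical name added for this example, not part of pandas.

```python
import numpy as np
import pandas as pd

series = pd.Series([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])

# Relative frequency of each distinct value, ordered by value, then accumulated.
cdf = series.value_counts(normalize=True).sort_index().cumsum()
print(cdf)


def ecdf_at(x):
    """Empirical CDF at x: the accumulated share of observations <= x."""
    below = cdf.loc[:x]  # label-based slice on the sorted index
    return float(below.iloc[-1]) if len(below) else 0.0


print(ecdf_at(3))
```

Calling cdf.plot(drawstyle='steps-post') on this series would draw it as a step function.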
To me, this seemed like a simple way to do it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
heights = pd.Series(np.random.normal(size=100))
# empirical CDF
def F(x, data):
    return float(len(data[data <= x])) / len(data)
vF = np.vectorize(F, excluded=['data'])
plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))
I really like the answer by Raphvanns. It is helpful because it not only produces the plot, but also helps me understand what the pdf, cdf, and ccdf are.
I have two things to add to Raphvanns's solution: (1) use collections.Counter wisely to make the process easier; (2) remember to sort value (ascending) before calculating the pdf, cdf, and ccdf.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
Generate random numbers:
s = pd.Series(np.random.randint(1000, size=(1000)))
Build a dataframe as Raphvanns suggested:
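dic = dict(Counter(s))
# build the frame from the counts and sort by value so the cumulative sum runs in ascending order
df = pd.DataFrame(list(dic.items()), columns = ['value', 'frequency']).sort_values('value')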
Calculate PDF, CDF, and CCDF:
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
Plot:
df.plot(x = 'value', y = ['cdf', 'ccdf'], grid = True)
You may wonder why we have to sort value before calculating the PDF, CDF, and CCDF. Let's see what the results would be if we don't sort (note that dict(Counter(s)) keeps the order in which the values were first seen, which is not guaranteed to be sorted; here we randomize the order deliberately).
dic = dict(Counter(s))
df = pd.DataFrame(list(dic.items()), columns = ['value', 'frequency'])
# randomize the order of `value`:
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
df.plot(x = 'value', y = ['cdf'], grid = True)
This is the plot:
Why did it happen? Well, the essence of CDF is "The number of data points we have seen so far", citing YY's lecture slides of his Data Visualization class. Therefore, if the order of value is not sorted (either ascending or descending is fine), then when you plot, where x axis is in ascending order, the y value of course will be just a mess.
If you apply a descending order, you can imagine that the CDF and CCDF will just swap their places:
I will leave a question to the readers of this post: if I randomize the order of value like above, will sorting value after (rather than before) calculating PDF, CDF, and CCDF solve the problem?
dic = dict(Counter(s))
df = pd.DataFrame(list(dic.items()), columns = ['value', 'frequency'])
# randomize the order of `value`:
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
# Will this solve the problem?
df = df.sort_values(by='value')
df.plot(x = 'value', y = ['cdf'], grid = True)
Upgrading the answer of @wroscoe:
df[your_column].plot(kind = 'hist', histtype = 'step', density = True, cumulative = True)
You can also provide a number of desired bins.
If you're looking to plot a "true" empirical CDF, which jumps exactly at the values of your data set a, and with the jump at each value proportional to the frequency of the value, NumPy has builtin functions to do the work:
import matplotlib.pyplot as plt
import numpy as np
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    y = np.cumsum(counts)
    x = np.insert(x, 0, x[0])
    y = np.insert(y/y[-1], 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
The call to unique() returns the data values in sorted order along with their corresponding frequencies. The option drawstyle='steps-post' in the plot() call ensures that the jumps occur where they should. To force a jump at the smallest data value, the code inserts an additional element in front of x and y.
Example usage:
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
ecdf(xvec)
Another usage:
df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
ecdf(df['x'])
with output:
