Replace outliers with neighbour-Value - python

I have a plot with some outliers (wrong measurements):
The base data is good though. I want to just delete everything that is too far off the "current average". I tried using pd.rolling().mean() but with no satisfactory result:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = np.genfromtxt('shard_height_plot.csv', delimiter = ',')
df = pd.DataFrame(data)
df.set_index(0, inplace = True)
df2 = df.rolling(20).mean()
plt.plot(df)
plt.plot(df2)
plt.show()
I tried to search the web for a good solution but couldn't find one. It shouldn't be that hard to delete data points that jump through the roof, should it?
Edit:
data file can be downloaded here: https://ufile.io/pviuc
Edit2:
I tackled this problem of too many outliers by improving my data set creation.
The core of it:
if abs(D - D_List[-2]) > 30:
    D = D_List[-2]
    D_List.pop()
    D_List.append(D)
Basically, this checks whether the change in a value is larger than 30; if so, it deletes the last value and replaces it with the second-to-last one. Not very spectacular, but just what I need. I used one of the answers though, because it is so much prettier. Thank you guys very much.
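The replacement rule described above can be sketched as a small standalone function. This is only an illustration: the name `replace_jumps` and the `max_jump` parameter are made up here, and the threshold of 30 is simply taken from the snippet.

```python
def replace_jumps(values, max_jump=30):
    """Replace any value that jumps more than max_jump from the last accepted value."""
    cleaned = []
    for v in values:
        if cleaned and abs(v - cleaned[-1]) > max_jump:
            v = cleaned[-1]  # reuse the previous (accepted) neighbour
        cleaned.append(v)
    return cleaned


raw = [100, 102, 500, 104, 103, -50, 105]
print(replace_jumps(raw))
```

One caveat of this rule: a genuine step change larger than `max_jump` would be suppressed as well, so it only suits data that is known to vary slowly.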

Let's try using scipy.signal (see the docs):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import signal
data = np.genfromtxt('shard_height_plot.csv', delimiter = ',')
df = pd.DataFrame(data)
df.set_index(0, inplace = True)
df2 = df.rolling(20).mean()
b, a = signal.butter(3, 0.05)
y = signal.filtfilt(b,a, df[1].values)
df3 = pd.DataFrame(y, index=df2.index)
plt.plot(df, alpha=.3)
plt.plot(df2, alpha=.3)
plt.plot(df3)
plt.show()
Output:
Use medfilt:
y = signal.medfilt(df[1].values)
Output:

There are many ways to smooth a curve (rolling mean, GAM, smoothing spline etc.), my favorite one is the Savitzky–Golay method.
It works as follows: a polynomial is fitted by least squares to a small window around a data point y, and this polynomial is then evaluated to obtain the estimate ŷ of that data point. The window is then shifted forward by one data point and the process repeats.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
x = np.linspace(0,5,150)
y = np.cos(x) + np.random.random(150) * 0.15
yhat = savgol_filter(y, 49, 3)
plt.plot(x,y)
plt.plot(x,yhat, color='red')
plt.show()
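To make the windowed fit concrete, here is a small sketch that reproduces a single interior point of the filter output by hand with np.polyfit. The seed, the index `i = 75`, and the variable names are arbitrary choices for illustration; the comparison only applies to interior points, since the edges are handled differently by savgol_filter.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 150)
y = np.cos(x) + rng.random(150) * 0.15

window, order = 49, 3
yhat = savgol_filter(y, window, order)

# Reproduce one interior point by hand: least-squares cubic over its window
i = 75
half = window // 2
sl = slice(i - half, i + half + 1)
coeffs = np.polyfit(x[sl] - x[i], y[sl], order)  # centre x on the point of interest
manual = np.polyval(coeffs, 0.0)                 # evaluate the fit at the centre

print(manual, yhat[i])
```

The two printed numbers should agree up to floating-point error.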
Note that a rolling mean can't work in your case with a window as small as 20, since each outlier point will have a non-negligible weight (5%) and will always induce a big bias...
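If the goal is to replace the outliers rather than smooth the whole curve, one possible approach (not taken from the answers above) is to compare each point with a rolling median, which is hardly affected by a single extreme value in the window. The synthetic data, the window of 21, and the threshold of 30 below are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series(100 + rng.normal(0, 1, 200))
s.iloc[[30, 90, 150]] = [400, -200, 350]  # inject three fake outliers

# A rolling median is barely moved by one extreme point in the window
med = s.rolling(21, center=True, min_periods=1).median()

# Flag points that sit far from the local median (threshold is an assumption)
threshold = 30
is_outlier = (s - med).abs() > threshold

# Replace flagged points with the local median, keep everything else as-is
cleaned = s.where(~is_outlier, med)
print(int(is_outlier.sum()), cleaned.min(), cleaned.max())
```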

Related

How do I change the order of the x axis in Python?

import pandas as pd
from pandas_datareader import wb
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.formula.api as smf
G_pop = wb.download(indicator='SP.POP.TOTL', country="DEU", end=2020, start=2010)
G_pop = G_pop.reset_index(1)
G_pop.columns = ["year", "Population in Germany"]
pd.set_option("display.max.columns", 100000)
pd.set_option("display.max.rows", 300000)
pd.set_option("display.width", 1000000)
x = G_pop["year"]
y = G_pop["Population in Germany"]
plt.xticks(rotation=45, ha='left')
plt.plot(x,y)
plt.show()
I'm new to programming and am trying to create graphs from the World Bank database. It works quite well apart from my x-axis. Does anyone know how I can reverse it, so that the left is 2000 and the right is 2020, ascending? Currently the left end of the x-axis is 2020 and it is descending. I've been struggling with this problem for two days and can't get any further. I've tried both invert_xaxis() and invert_yaxis(), and they only give me error messages. I would be very thankful for any help.
My code and the graph with the wrong x-axis:
Picture of the wrong Graph:
Add this line:
G_pop.columns = ["year", "Population in Germany"]
G_pop['year'] = pd.to_numeric(G_pop['year']) # Convert text to numeric for automatic sorting
pd.set_option("display.max.columns", 100000)
Output:
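The reason this helps: matplotlib treats string values as categories and plots them in order of appearance, so years stored as text stay in the order the rows arrive in. A minimal sketch with a hypothetical stand-in for the downloaded data (the numbers are placeholders, not World Bank values):

```python
import pandas as pd

# Hypothetical stand-in: years arrive as strings, newest first
G_pop = pd.DataFrame({
    "year": ["2020", "2019", "2018", "2017"],
    "Population in Germany": [4.0, 3.0, 2.0, 1.0],  # placeholder values
})

G_pop["year"] = pd.to_numeric(G_pop["year"])  # text -> integers
G_pop = G_pop.sort_values("year")             # optional: rows in ascending order too

print(G_pop["year"].tolist())
```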

Can I take a table from excel and plot a histogram in python?

I have 2 tables, a 10 by 110 and a 35 by 110, and both contain random numbers from an exponential distribution function provided by my professor. The assignment is to prove the central limit theorem in statistics.
What I thought to try is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
"importing data"
df1 = pd.read_excel(r'C:\Users\Henry\Desktop\n10.xlsx')
df2 = pd.read_excel(r'C:\Users\Henry\Desktop\n30.xlsx')
df1avg = pd.read_excel(r'C:\Users\Henry\Desktop\n10avg.xlsx')
df2avg = pd.read_excel(r'C:\Users\Henry\Desktop\n30avg.xlsx')
"plotting n10 histogram"
plt.hist(df1, bins=34)
plt.hist(df1avg, bins=11)
"plotting n30 histogram"
plt.hist(df2, bins=63)
plt.hist(df2avg, bins=11)
Is that ok or do I need to format the tables into a singular column, and if so what is the most efficient way to do that?
I suspect that you will want to flatten your dataframe first, as illustrated below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
N = np.random.exponential(1, [40, 5])
df = pd.DataFrame(N) # convert to dataframe
bin_edges = np.linspace(0,6,30)
plt.figure()
plt.hist(df, bins = bin_edges, density = True)
plt.xlabel('Value')
plt.ylabel('Probability density')
The multiple (5) colours of bars per bin show the histograms for each column of the data frame.
Fortunately, this is not hard to adjust. You can convert the data frame to a numpy array and flatten it:
plt.hist(df.to_numpy().flatten(), bins = bin_edges, density = True)
plt.ylabel('Probability density')
plt.xlabel('Value')
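For the central limit theorem part of the assignment, the quantity of interest is the distribution of the sample means rather than of the raw values. Below is a rough sketch with simulated data instead of the Excel files; the function name `sample_means`, the seed, and the scale of 1 are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)


def sample_means(n, n_samples=110, scale=1.0):
    """Draw n_samples samples of size n from an exponential distribution and average each."""
    draws = rng.exponential(scale, size=(n_samples, n))
    return draws.mean(axis=1)


means_10 = sample_means(10)
means_30 = sample_means(30)

# The CLT predicts a standard deviation of roughly scale / sqrt(n) for the means
print(means_10.std(), 1 / np.sqrt(10))
print(means_30.std(), 1 / np.sqrt(30))
```

A histogram of `means_10` and `means_30` (for example with plt.hist) should look increasingly bell-shaped as n grows, even though the raw data is strongly skewed.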

How to plot a smooth curve in python for a list of values?

I have created a list of values of Shannon entropy for a pair of multiple sequence aligned sequences. While plotting the values I get a simple plot. I want to plot a smooth curve over the lines. Can anyone suggest the right way to do this? Basically, I want to plot a smooth curve that touches the tip of every bar and goes to zero where the y-axis value is zero.
link for image: [1]: https://i.stack.imgur.com/SY3jH.png
#importing the relevant packages
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import make_interp_spline
from Bio import AlignIO
import warnings
warnings.filterwarnings("ignore")
#function to calculate the Shannon Entropy of a MSA
# H = -sum[p(x).log2(px)]
def shannon_entropy(list_input):
    unique_aa = set(list_input)
    M = len(list_input)
    entropy_list = []
    # Number of residues in column
    for aa in unique_aa:
        n_i = list_input.count(aa)
        P_i = n_i/float(M)
        entropy_i = P_i*(math.log(P_i,2))
        entropy_list.append(entropy_i)
    sh_entropy = -(sum(entropy_list))
    #print(sh_entropy)
    return sh_entropy
#importing the MSA file
#importing the clustal file
align_clustal1 =AlignIO.read("/home/clustal.aln", "clustal")
def shannon_entropy_list_msa(alignment_file):
    shannon_entropy_list = []
    for col_no in range(len(list(alignment_file[0]))):
        list_input = list(alignment_file[:, col_no])
        shannon_entropy_list.append(shannon_entropy(list_input))
    return shannon_entropy_list
clustal_omega1 = shannon_entropy_list_msa(align_clustal1)
# Plotting the data
plt.figure(figsize=(18,10))
plt.plot(clustal_omega1, 'r')
plt.xlabel('Residue', fontsize=16)
plt.ylabel("Shannon's entropy", fontsize=16)
plt.show()
Edit 1:
Here is what my graph looks like after implementing the "pchip" method. link for the pchip output: https://i.stack.imgur.com/hA3KW.png
One approach would be to use PCHIP interpolation, which will give you the monotonic curve with the required behaviour for zero values on the y-axis.
We can't run your exact code example on our machines because you point to a local Clustal file in your 'home' directory.
Here's a simple working example, with link to output image:
import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import pchip
mylist = [10,0,0,0,0,9,9,0,0,0,11,11,11,0,0]
mylist_np = np.array(mylist)
samples = np.array(range(len(mylist)))
xnew = np.linspace(samples.min(), samples.max(), 100)
plt.plot(xnew,pchip(samples, mylist_np )(xnew))
plt.show()
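To see why PCHIP fits the requirement of staying at zero, it can be compared numerically with an ordinary cubic spline on the same data. This is a sketch for illustration only; CubicSpline is used here as a stand-in for "a regular smooth spline" and is not part of the answer above.

```python
import numpy as np
from scipy.interpolate import CubicSpline, pchip

values = np.array([10, 0, 0, 0, 0, 9, 9, 0, 0, 0, 11, 11, 11, 0, 0], dtype=float)
samples = np.arange(len(values))
xnew = np.linspace(samples.min(), samples.max(), 500)

y_pchip = pchip(samples, values)(xnew)          # shape-preserving, no overshoot
y_cubic = CubicSpline(samples, values)(xnew)    # smoother, but can overshoot

print("PCHIP min:", y_pchip.min())
print("cubic spline min:", y_cubic.min())
```

The PCHIP curve stays within the range of the data, while the cubic spline dips below zero next to the steep drops.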

How to generate legible plots in pandas when looping over columns?

Generate the dataframe for replicability:
df = pd.DataFrame(np.random.randn(50, 1000), columns=list('ABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDED'))
Check for normalcy of distribution of each variable (note: this takes a long time to run)
# Set the column names
columns= df.columns
# Loop over all columns
fig, axs = plt.subplots(len(df.columns), figsize=(5, 25))
for n, col in enumerate(df.columns):
    df[col].hist(ax=axs[n])
Result generates illegible histograms and takes a very long time to run.
The length of time is okay, but I am curious if anyone has suggestions for generating legible histograms (do not have to be fancy), which can be quickly reviewed for the entire dataframe to ensure the normality of the distributions.
This code generates 1000 histograms and allows you to see each one in sufficient detail to understand how normally-distributed the columns are:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))
# Loop over all columns
fig, ax = plt.subplots(figsize = (16, 10))
for n, col in enumerate(df.columns):
    plt.subplot(25, 40, n+1)
    df[col].hist(ax = plt.gca())
    plt.axis('off')
plt.tight_layout()
plt.savefig('1000_histograms.png', bbox_inches='tight', pad_inches = 0, dpi = 200)
Another way to ascertain normality is with a QQ plot, which may be easier to visualize in bulk compared to a histogram:
import statsmodels.api as sm
cols = 1000
df = pd.DataFrame(np.random.normal(0,1, [50, cols]))
fig, axs = plt.subplots(figsize=(18, 12))
for n, col in enumerate(df.columns):
    plt.subplot(25,40,n+1)
    sm.qqplot(df[col], ax=plt.gca(), #line='45',
              marker='.', markerfacecolor='C0', markeredgecolor='C0',
              markersize=2)
    # sm.qqline(ax=plt.gca(), line='45', fmt='lightgray')
    plt.axis('off')
plt.savefig('1000_QQ_plots13.png', bbox_inches='tight', pad_inches=0, dpi=200)
The closer each line is to a 45 degree diagonal, the more normally-distributed the column data is.
Plotting vs normality test
Proposition
Output example
Corresponding code sample
Plotting vs normality test
As discussed in the comments below, the OP's question has changed to managing thousands of plots. From that perspective, Nathaniel's answer is appropriate.
However, I felt that the unstated intent was to decide whether a given variable is normally distributed or not, with thousands of variables to consider.
Check for normalcy of distribution of each variable (note: this takes a long time to run)
With that in mind, it appears (to me) that having a human review thousands of plots to spot normal/non-normal distributions is an inappropriate method. There is a French idiom for this: "usine à gaz" ("gas factory").
Therefore, this answer focuses on performing the analysis programmatically and providing a more concise report.
Proposition
Perform analysis of data normality over a huge number of columns.
It relies on the suggestion expressed in this answer.
The idea is to:
perform a distribution test (normality) for all columns
capitalize into a dataframe the results
Report into a graph the normal/non-normal ratios.
Report the non-normal column names.
With this method, we can further use programming to manipulate the normal/non-normal columns. For instance, we could perform additional distribution tests, or plot only the non normal distribution, thus reducing the number of graphs to actually observe.
Output example:
------------
Columns probably not a normal dist:
Column Not_Normal p-value Normality
0 V True 0.0 Not Normal
0 W True 0.0 Not Normal
0 X True 0.0 Not Normal
0 Y True 0.0 Not Normal
0 Z True 0.0 Not Normal
Disclaimer: the methods used may not be statistically "canonical". One should be very careful when using statistical tools, since each one has its specific usage domain/use case.
I chose a 0.01 (1%) p-value threshold, since it could become the standard value in scientific publications instead of the usual 0.05 (5%).
One should read https://en.wikipedia.org/wiki/Normality_test
Tests of univariate normality include the following:
D'Agostino's K-squared test,
Jarque–Bera test,
Anderson–Darling test,
Cramér–von Mises criterion,
Lilliefors test,
Kolmogorov–Smirnov test
Shapiro–Wilk test, and
Pearson's chi-squared test.
Code
Behavior may vary on your computer depending on RNG (random numbers generation).
The following example is made with 5 normal random sampling and 5 pareto random sampling using numpy.
The normality test performs well in these conditions (even if I feel that the 0.0 p value tests are suspicious even for a pareto random generation)
Nevertheless, I think we can agree that it is about the method, not the actual results.
import pandas as pd
import numpy as np
import scipy
from scipy import stats
import seaborn as sb
import matplotlib.pyplot as plt
import sys
print('System: {}'.format(sys.version))
for module in [pd, np, scipy, sb]:
    print('Module {:10s} - version {}'.format(module.__name__, module.__version__))
nb_lines = 10000
headers_normal = 'ABCDE'
headers_pareto = 'VWXYZ'
reapeat_factor = 1
nb_cols = len(list(reapeat_factor * headers_normal))
df_normal = pd.DataFrame(np.random.randn(nb_lines, nb_cols), columns=list(reapeat_factor * headers_normal))
df_pareto = pd.DataFrame((np.random.pareto(12.0, size=(nb_lines,nb_cols )) + 15.) * 4., columns=list(reapeat_factor * headers_pareto))
df = df_normal.join(df_pareto)
alpha = 0.01
df_list = list()
# normality code taken from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html
cat_map = {True: 'Not Normal',
           False: 'Maybe Normal'}
for col in df.columns:
    k2, p = stats.normaltest(df[col])
    is_not_normal = p < alpha
    tmp_df = pd.DataFrame({'Column': [col],
                           'Not_Normal': [is_not_normal],
                           'p-value': [p],
                           'Normality': cat_map[is_not_normal]
                           })
    df_list.append(tmp_df)
df_results = pd.concat(df_list)
df_results['Normality'] = df_results['Normality'].astype('category')
print('------------')
print('Columns names probably not a normal dist:')
# full data
print(df_results[(df_results['Normality'] == 'Not Normal')])
# only column names
# print(df_results[(df_results['Normality'] == 'Not Normal')]['Column'])
print('------------')
print('Plotting countplot')
sb.countplot(data=df_results, y='Normality', orient='v')
plt.show()
Outputs:
System: 3.7.2 (default, Feb 21 2019, 17:35:59) [MSC v.1915 64 bit (AMD64)]
Module pandas - version 0.24.1
Module numpy - version 1.16.2
Module scipy - version 1.2.1
Module seaborn - version 0.9.0
------------
Columns names probably not a normal dist:
Column Not_Normal p-value Normality
0 V True 0.0 Not Normal
0 W True 0.0 Not Normal
0 X True 0.0 Not Normal
0 Y True 0.0 Not Normal
0 Z True 0.0 Not Normal
------------
Plotting countplot
I really like Nathaniel's answer but I will add my two cents.
I would go for seaborn and in particular seaborn.distplot.
This will allow you to easily fit a normal distribution to each histogram plot and make the visualization easier.
import seaborn as sns
from scipy.stats import norm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
cols = 1000
df = pd.DataFrame(np.random.normal(0, 1, [50, cols]))
fig, ax = plt.subplots(figsize = (16, 10))
for i, col in enumerate(df.columns):
    # 25 x 40 grid so that all 1000 columns fit (25 x 4 only holds 100)
    ax = fig.add_subplot(25, 40, i+1)
    sns.distplot(df[col], fit=norm, kde=False, ax=ax)
plt.tight_layout()
Additionally, I am not sure if putting columns with the same name in your example was done on purpose. If that's the case the easiest solution to loop through the columns is to use .iloc and the code would look like this:
import seaborn as sns
from scipy.stats import norm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(50, 1000), columns=list('ABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDEDABCDABCDED'))
fig, ax = plt.subplots(figsize = (12, 10))
for i, col in enumerate(df.columns):
    plt.subplot(25, 40, i+1)
    sns.distplot(df.iloc[:,i],fit=norm, kde=False,ax=plt.gca())
    plt.axis('off')
plt.tight_layout()
Try something like this:
plt.figure(figsize=(26, 3 * len(df.columns)))
for i, col in enumerate(df.columns):
    plt.subplot(3, 4, i + 1)
    plt.hist(df[col], color='blue', bins=100)
    plt.title(col)
4 is the number of columns, 3 is the number of rows. Instead of a fixed 3, it is better to compute the number of rows from the number of dataframe columns, rounded up to a whole number (this needs import math):
plt.subplot(math.ceil(len(df.columns) / 4), 4, i + 1)
Try this - tight_layout ensures no overlap, figsize controls the size of each plot.
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 3*30), columns=list('ABC'*30))
df.hist(figsize=(20,20))
plt.tight_layout()
plt.show()
However, if you are after a normality test, I would suggest using something like this: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.normaltest.html instead of relying on visual inspection, especially if you have many variables.

Plotting CDF of a pandas series in python

Is there a way to do this? I cannot seem to find an easy way to interface a pandas series with plotting a CDF.
I believe the functionality you're looking for is in the hist method of a Series object which wraps the hist() function in matplotlib
Here's the relevant documentation
In [10]: import matplotlib.pyplot as plt
In [11]: plt.hist?
...
Plot a histogram.
Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
If `True`, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints. If `normed` is also `True`
then the histogram is normalized such that the last bin equals 1.
If `cumulative` evaluates to less than 0 (e.g., -1), the direction
of accumulation is reversed. In this case, if `normed` is also
`True`, then the histogram is normalized such that the first bin
equals 1.
...
For example
In [12]: import pandas as pd
In [13]: import numpy as np
In [14]: ser = pd.Series(np.random.normal(size=1000))
In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>
In [16]: plt.show()
In case you are also interested in the values, not just the plot.
import pandas as pd
# If you are in jupyter
%matplotlib inline
This will always work (discrete and continuous distributions)
# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series
# Frequency
stats_df = df \
.groupby('value') \
['value'] \
.agg('count') \
.pipe(pd.DataFrame) \
.rename(columns = {'value': 'frequency'})
# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df
# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and the table should be 'pmf'
# (Probability Mass Function) since the distribution is discrete.
# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
Alternative example with a sample drawn from a continuous distribution or you have a lot of individual values:
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
For continuous distributions only
Please note that if it is reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then the groupby() + agg('count') is not necessary (since the count is always 1).
In this case, a percent rank can be used to get to the cdf directly.
Use your best judgment when taking this kind of shortcut! :)
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)
# Get to the CDF directly
df['cdf'] = df['value'].rank(method = 'average', pct = True)
# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
A CDF or cumulative distribution function plot is basically a graph with on the X-axis the sorted values and on the Y-axis the cumulative distribution. So, I would create a new series with the sorted values as index and the cumulative distribution as values.
First create an example series:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()
Now, before proceeding, append again the last (and largest) value. This step is important especially for small sample sizes in order to get an unbiased CDF:
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as index and the cumulative distribution as values:
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')
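If only the numbers are needed, the same idea can be written with plain NumPy. Note that this sketch uses a slightly different convention from the steps above: instead of appending the largest value again, it assigns each sorted value the fraction of points less than or equal to it. The five sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series([3.0, 1.0, 2.0, 2.0, 5.0])

x = np.sort(ser.to_numpy())
y = np.arange(1, len(x) + 1) / len(x)  # fraction of points <= each sorted value

ecdf = pd.Series(y, index=x)
print(ecdf)
```

The resulting series can be plotted the same way, for example with ecdf.plot(drawstyle='steps-post').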
I came here looking for a plot like this with bars and a CDF line:
It can be achieved like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
# `normed` was removed from matplotlib; `density` is its replacement
n, bins, patches = ax.hist(series, bins=100, density=False)
n, bins, patches = ax2.hist(
    series, cumulative=1, histtype='step', bins=100, color='tab:orange')
plt.savefig('test.png')
If you want to remove the vertical line, then it's explained how to accomplish that here. Or you could just do:
ax.set_xlim((ax.get_xlim()[0], series.max()))
I also saw an elegant solution here on how to do it with seaborn.
This is the easiest way.
import pandas as pd
df = pd.Series([i for i in range(100)])
df.hist( cumulative = True )
I found another solution in "pure" Pandas, that does not require specifying the number of bins to use in a histogram:
import pandas as pd
import numpy as np # used only to create example data
series = pd.Series(np.random.normal(size=10000))
cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
To me, this seemed like a simple way to do it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
heights = pd.Series(np.random.normal(size=100))
# empirical CDF
def F(x,data):
    return float(len(data[data <= x]))/len(data)
vF = np.vectorize(F, excluded=['data'])
plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))
I really like the answer by Raphvanns. It is helpful because it not only produces the plot, but also helps me understand what the pdf, cdf, and ccdf are.
I have two things to add to Raphvanns's solution: (1) use collections.Counter wisely to make the process easier; (2) remember to sort value (ascending) before calculating the pdf, cdf, and ccdf.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
Generate random numbers:
s = pd.Series(np.random.randint(1000, size=(1000)))
Build a dataframe as Raphvanns suggested:
dic = dict(Counter(s))
# build the frame from the counts (dic, not s) and sort by value
df = pd.DataFrame(sorted(dic.items()), columns = ['value', 'frequency'])
Calculate PDF, CDF, and CCDF:
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
Plot:
df.plot(x = 'value', y = ['cdf', 'ccdf'], grid = True)
You may wonder why we have to sort the values before calculating the PDF, CDF, and CCDF. Well, let's see what the results would be if we don't sort them (note that dict(Counter(s)) does not sort the items by itself, it only keeps them in first-seen order; below we shuffle the order deliberately).
dic = dict(Counter(s))
df = pd.DataFrame(list(dic.items()), columns = ['value', 'frequency'])
# randomize the order of `value` (frac=1 shuffles all rows):
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
df.plot(x = 'value', y = ['cdf'], grid = True)
This is the plot:
Why did it happen? Well, the essence of CDF is "The number of data points we have seen so far", citing YY's lecture slides of his Data Visualization class. Therefore, if the order of value is not sorted (either ascending or descending is fine), then when you plot, where x axis is in ascending order, the y value of course will be just a mess.
If you apply a descending order, you can imagine that the CDF and CCDF will just swap their places:
I will leave a question to the readers of this post: if I randomize the order of value like above, will sorting value after (rather than before) calculating PDF, CDF, and CCDF solve the problem?
dic = dict(Counter(s))
df = pd.DataFrame(list(dic.items()), columns = ['value', 'frequency'])
# randomize the order of `value` (frac=1 shuffles all rows):
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
# Will this solve the problem?
df = df.sort_values(by='value')
df.plot(x = 'value', y = ['cdf'], grid = True)
Upgrading the answer of @wroscoe:
df[your_column].plot(kind = 'hist', histtype = 'step', density = True, cumulative = True)
You can also provide a number of desired bins.
If you're looking to plot a "true" empirical CDF, which jumps exactly at the values of your data set a, and with the jump at each value proportional to the frequency of the value, NumPy has builtin functions to do the work:
import matplotlib.pyplot as plt
import numpy as np
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    y = np.cumsum(counts)
    x = np.insert(x, 0, x[0])
    y = np.insert(y/y[-1], 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
The call to unique() returns the data values in sorted order along with their corresponding frequencies. The option drawstyle='steps-post' in the plot() call ensures that the jumps occur where they should. To force a jump at the smallest data value, the code inserts an additional element in front of x and y.
Example usage:
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
ecdf(xvec)
Another usage:
df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
ecdf(df['x'])
with output:
