I want to overlay some graphs from CSV data (two datasets).
The graph I got from my dataset is shown below.
Is there any way to align those datasets at specific points? I would like to overlay these plots using the "big drop" as an anchor, so I can compare them more easily.
The code used:
import pandas as pd
import matplotlib.pyplot as plt
# Read the data
data1 = pd.read_csv('data1.csv', delimiter=";", decimal=",")
data2 = pd.read_csv('data2.csv', delimiter=";", decimal=",")
data3 = pd.read_csv('data3.csv', delimiter=";", decimal=",")
data4 = pd.read_csv('data4.csv', delimiter=";", decimal=",")
# Plot the data
plt.plot(data1['Zeit'], data1['Kanal A'])
plt.plot(data2['Zeit'], data2['Kanal A'])
plt.plot(data3['Zeit'], data3['Kanal A'])
plt.plot(data4['Zeit'], data4['Kanal A'])
plt.show()
plt.close()
I would like to share some data with you here:
Link to data
Part 1: Anchor times
A simple way is to find the times of interest (lowest point) in each frame, then plot each series with x=t - t_peak instead of x=t. Two ways come to mind to find the desired anchor points:
Simply using the global minimum (in your plots, that would work fine), or
Using the most prominent local minimum, either from first principles, or using scipy's find_peaks().
But first of all, let us attempt to build a reproducible example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
def make_sample(t_peak, tmax_approx=17.5, n=100):
    # uneven times
    t = np.random.uniform(0, 2*tmax_approx/n, n).cumsum()
    y = -1 / (0.1 + 2 * np.abs(t - t_peak))
    trend = 4 * np.random.uniform(-1, 1) / n
    level = np.random.uniform(10, 12)
    y += np.random.normal(trend, 1/n, n).cumsum() + level
    return pd.DataFrame({'t': t, 'y': y})
poi = [2, 2.48, 2.6, 2.1]
np.random.seed(0)
frames = [make_sample(t_peak) for t_peak in poi]
plt.rcParams['figure.figsize'] = (6,2)
fig, ax = plt.subplots()
for df in frames:
    ax.plot(*df.values.T)
In this case, we made the problem maximally inconvenient by giving each time series its own, independent, unevenly distributed time sampling.
Now, retrieving the "maximum drop" by global minimum:
peaks = [df.loc[df['y'].idxmin(), 't'] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
fig, ax = plt.subplots()
for t_peak, df in zip(peaks, frames):
    ax.plot(df['t'] - t_peak, df['y'])
But imagine a case where the global minimum is not suitable. For example, add a large sine wave to each series:
frames = [df.assign(y=df['y'] + 5 * np.sin(df['t'])) for df in frames]
# just plotting the first series
df = frames[0]
plt.plot(*df.values.T)
Clearly, there are several local minima, and the one we want ("sharpest drop") is not the global one.
A simple way to find the desired sharpest drop time is by looking at the difference from each point to its two neighbors:
def arg_steepest_min(v):
    # simply find the minimum that is most separated from the surrounding points
    diff = np.diff(v)
    i = np.argmin(diff[:-1] - diff[1:]) + 1
    return i
peaks = [df['t'].iloc[arg_steepest_min(df['y'])] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
# just plotting the first curve and the peak found
df = frames[0]
plt.plot(*df.values.T)
plt.plot(*df.iloc[arg_steepest_min(df['y'])].T, 'x')
There are more complex cases where you want to bring the full power of find_peaks(). Here is an example that uses the most prominent minimum, using a certain number of samples (wlen) as the neighborhood:
from scipy.signal import find_peaks, peak_prominences
def arg_most_prominent_min(v, prominence=1, wlen=10):
    peaks, details = find_peaks(-v, prominence=prominence, wlen=wlen)
    i = peaks[np.argmax(details['prominences'])]
    return i
peaks = [df['t'].iloc[arg_most_prominent_min(df['y'])] for df in frames]
>>> peaks
[2.0209774600118764, 2.4932468358014157, 2.5835972003585472, 2.12438578790615]
In this case, the peaks found by both methods are the same. Aligning the curves gives:
fig, ax = plt.subplots()
for t_peak, df in zip(peaks, frames):
    ax.plot(df['t'] - t_peak, df['y'])
Part 2: aligning the time series for numeric operations
Having found the anchor times and plotted the time series by shifting the x-axis accordingly, suppose now that we want to align all the time series, for example to somehow compare them to one another (e.g.: differences, correlation, etc.). In this example we made up, the time samples are not equidistant and all series have their own sampling.
We can use resample() to achieve our goal. Let us convert the frames into actual time series, transforming the column t (assumed to be in seconds) into a DatetimeIndex, after shifting the time using the previously found t_peak and using an arbitrary "0" date:
frames = [
    pd.Series(
        df['y'].values,
        index=pd.Timestamp(0) + (df['t'] - t_peak) * pd.Timedelta(1, 's')
    ) for t_peak, df in zip(peaks, frames)]
>>> frames[0]
t
1969-12-31 23:59:58.171107267 11.244308
1969-12-31 23:59:58.421423545 12.387291
1969-12-31 23:59:58.632390727 13.268186
1969-12-31 23:59:58.823099841 13.942224
1969-12-31 23:59:58.971379021 14.359900
...
1970-01-01 00:00:14.022717327 10.422229
1970-01-01 00:00:14.227996854 9.504693
1970-01-01 00:00:14.235034496 9.489011
1970-01-01 00:00:14.525163506 8.388377
1970-01-01 00:00:14.526806922 8.383366
Length: 100, dtype: float64
At this point, the sampling is still uneven, so we use resample to get a fixed frequency. One strategy is to oversample and interpolate:
frames = [df.resample('100ms').mean().interpolate() for df in frames]
for df in frames:
    df.plot()
At this point, we can compare the Series. Here are the pairwise differences and correlations:
fig, axes = plt.subplots(nrows=len(frames), ncols=len(frames), figsize=(10, 5))
for axrow, a in zip(axes, frames):
    for ax, b in zip(axrow, frames):
        (b - a).plot(ax=ax)
        ax.set_title(fr'$\rho = {b.corr(a):.3f}$')
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
plt.tight_layout()
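Applied back to the question's own CSVs, a rough sketch could look like the following (a sketch only: it assumes the same file and column names as in the question, and that the global-minimum anchor from Part 1 is appropriate for that data):
import pandas as pd
import matplotlib.pyplot as plt
frames = [pd.read_csv(f'data{i}.csv', delimiter=';', decimal=',') for i in range(1, 5)]
fig, ax = plt.subplots()
for df in frames:
    # global-minimum anchor, as in Part 1
    t_peak = df.loc[df['Kanal A'].idxmin(), 'Zeit']
    ax.plot(df['Zeit'] - t_peak, df['Kanal A'])
plt.show()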
I have this dataframe and I want to line plot it, as I have done below.
The graph is:
The code to generate it is:
fig, ax = plt.subplots(figsize=(15, 5))
date_time = pd.to_datetime(df.Date)
df = df.set_index(date_time)
plt.xticks(rotation=90)
pd.DataFrame(df, columns=df.columns).plot.line(ax=ax,
                                               xticks=pd.to_datetime(df.Date))
I want a marker for InnovationScore with its value (where InnovationScore is not 0) on the Open and Close lines. I want to show that that is where the change happens when InnovationScore changes.
You have to address two problems: marking the corresponding spots on your curves, and using the dates on the x-axis. The first problem can be solved by identifying the dates where the score is not zero, then plotting markers on top of the curve at these dates. The second problem is more structural in nature: pandas often interferes with matplotlib when it comes to datetime objects. Using pandas' standard plotting functions is good because it addresses common problems. But mixing pandas with matplotlib plotting (and to set the markers, you have to fall back to matplotlib, as far as I know) is usually a bad idea, because they do not necessarily represent the datetimes in the same format.
import pandas as pd
from matplotlib import pyplot as plt
#fake data generation, the following code block is just for illustration
import numpy as np
np.random.seed(1234)
n = 50
date_range = pd.date_range("20180101", periods=n, freq="D")
choice = np.zeros(10)
choice[0] = 3
df = pd.DataFrame({"Date": date_range,
"Open": np.random.randint(100, 150, n),
"Close": np.random.randint(100, 150, n),
"Innovation Score": np.random.choice(choice, n)})
fig, ax = plt.subplots()
#plot the three curves
l = ax.plot(df["Date"], df[["Open", "Close", "Innovation Score"]])
ax.legend(iter(l), ["Open", "Close", "Innovation Score"])
#filter dataset for score not zero
IS = df[df["Innovation Score"] > 0]
#plot markers on these positions
ax.plot(IS["Date"], IS[["Open", "Close"]], "ro")
#and/or set vertical lines to indicate the position
ax.vlines(IS["Date"], 0, max(df[["Open", "Close"]].max()), ls="--")
#label x-axis score not zero
ax.set_xticks(IS["Date"])
#beautify the output
ax.set_xlabel("Month")
ax.set_ylabel("Artifical score people take seriously")
fig.autofmt_xdate()
plt.show()
Sample output:
You can achieve it like this:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], "ro-")  # r is red, o is a circle marker, - is a solid line
plt.plot([3, 2, 1], "b.-")  # b is blue, . is a small point marker, - is a solid line
plt.show()
Check out also example here for another approach:
https://matplotlib.org/3.3.3/gallery/lines_bars_and_markers/markevery_prop_cycle.html
I very often get inspiration from this list of examples:
https://matplotlib.org/3.3.3/gallery/index.html
I have a 1,000,000 x 2 DataFrame object consisting of data I'm trying to understand visually. It's basically a simulation of 1,000,000 events where a packet traveling along a network is either queued or dropped depending on the buffer's size. So, the two column values are Packets in Queue and Packets Dropped.
I'm trying to make a line plot using Python, Matplotlib and Jupyter Notebooks that has the ID of the event on the x-axis and the number of packets in the queue at a specific ID point on the y-axis. There should be two lines, the first representing the number of packets in the queue and the second representing the number of packets dropped. However, given that there are over 1,000,000 simulations, the graph isn't intelligible. The values are too squished together. Is it possible to make a readable graph with 1,000,000 event instances or do I need to dramatically trim the number of events?
With a million data points it will require a lot of effort and zooming in to see them in fine detail. Plotly has some nice tools for zooming in and out of plots as well as sliding your data window along the x-axis.
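As a minimal sketch of that Plotly route (plotly.express is my assumption here; the synthetic queued/dropped columns are placeholders for your simulation DataFrame):
import numpy as np
import pandas as pd
import plotly.express as px
# placeholder data standing in for the large simulation DataFrame
df = pd.DataFrame({
    'queued': np.random.randint(0, 5, 100_000).cumsum() % 100,
    'dropped': (np.random.random(100_000) < 0.01).cumsum(),
})
fig = px.line(df, y=['queued', 'dropped'])  # event ID (the index) on the x-axis
fig.update_xaxes(rangeslider_visible=True)  # range slider to pan/zoom along the events
fig.show()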
If you're okay with some averaging, you can plot a moving average and get close to a hundred thousand points. You can stack two subplots on each other to see both columns of data in reasonable detail. You can of course average them more, but you lose the ability to see fine details.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def moving_avg(x, N=30):
    return np.convolve(x, np.ones((N,))/N, mode='valid')
plt.figure(figsize = (16,12))
plt.subplot(3,1,1)
x = np.random.random(1000)
plt.plot(x, linewidth = 1, alpha = 0.5, label = 'linewidth = 1')
plt.plot(moving_avg(x, 10), 'C0', label = 'moving average, N = 10')
plt.xlim(0,len(x))
plt.legend(loc=2)
plt.subplot(3,1,2)
x = np.random.random(10000)
plt.plot(x, linewidth = 0.2, alpha = 0.5, label = 'linewidth = 0.2')
plt.plot(moving_avg(x, 100), 'C0', label = 'moving average, N = 100')
plt.xlim(0,len(x))
plt.legend(loc=2)
plt.subplot(3,1,3)
x = np.random.random(100000)
plt.plot(x, linewidth = 0.05, alpha = 0.5, label = 'linewidth = 0.05')
plt.plot(moving_avg(x, 500), 'C0', label = 'moving average, N = 500')
plt.xlim(0,len(x))
plt.legend(loc=2)
plt.tight_layout()
Try a histogram:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame()
df['x'] = np.random.rand(1000000)
plt.hist(df.index, weights=df.x, bins=1000)
plt.show()
Method 2: line plots
df['x'] = np.random.rand(1000000)
df['y'] = np.random.rand(1000000)
w = 1000
v1 = df['x'].rolling(min_periods=1, window=w).sum()[[i*w for i in range(1, int(len(df)/w))]]/w
v2 = df['y'].rolling(min_periods=1, window=w).sum()[[i*w for i in range(1, int(len(df)/w))]]/w
plt.plot(np.arange(len(v1)),v1, c='b')
plt.plot(np.arange(len(v1)),v2, c='r')
plt.show()
We are calculating the mean of w=1000 points, i.e. averaging w values together and plotting them.
It looks like the plot below when 1,000,000 points are bucketed at every 1,000-point interval.
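An equivalent, arguably simpler way to bucket every w rows and average them is a plain groupby on the integer index (a sketch, assuming the default RangeIndex):
w = 1000
v1 = df['x'].groupby(df.index // w).mean()
v2 = df['y'].groupby(df.index // w).mean()
plt.plot(v1.index, v1, c='b')
plt.plot(v1.index, v2, c='r')
plt.show()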
I have a csv file with 2 columns:
col1: Timestamp data (yyyy-mm-dd hh:mm:ss.ms, 8 months of data)
col2: Heat data (continuous variable)
Since there are almost 50k records, I would like to partition col1 (the timestamp column) into months or weeks and then apply a box plot on the heat data w.r.t. the timestamp.
I tried it in R, but it takes a long time. I need help doing it in Python; I think I need to use seaborn.boxplot.
Please guide.
Group by Frequency then plot groups
First, read your CSV data into a pandas DataFrame
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# assumes NO header line in csv
df = pd.read_csv('\file\path', names=['time','temp'], parse_dates=[0])
I will use some fake data, 30 days of hourly samples.
heat = np.random.random(24*30) * 100
dates = pd.date_range('1/1/2011', periods=24*30, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
Set the timestamps as the DataFrame's index
df = df.set_index('time')
Now group by the period you want, seven days for this example
gb = df.groupby(pd.Grouper(freq='7D'))
Now you can plot each group separately
for g, week in gb:
    #week.plot()
    week.boxplot()
    plt.title(f'Week Of {g.date()}')
    plt.show()
    plt.close()
And... I didn't realize you could do this but it is pretty cool
ax = gb.boxplot(subplots=False)
plt.setp(ax.xaxis.get_ticklabels(),rotation=30)
plt.show()
plt.close()
Using a larger fake dataset, 300 days of hourly samples:
heat = np.random.random(24*300) * 100
dates = pd.date_range('1/1/2011', periods=24*300, freq='H')
df = pd.DataFrame({'time':dates,'temp':heat})
df = df.set_index('time')
To partition the data in five time periods then get weekly boxplots of each:
Determine the total timespan; divide by five; create a frequency alias; then groupby
dt = df.index[-1] - df.index[0]
dt = dt/5
alias = f'{dt.total_seconds()}S'
gb = df.groupby(pd.Grouper(freq=alias))
Each group is a DataFrame so iterate over the groups; create weekly groups from each and boxplot them.
for g, d_frame in gb:
    gb_tmp = d_frame.groupby(pd.Grouper(freq='7D'))
    ax = gb_tmp.boxplot(subplots=False)
    plt.setp(ax.xaxis.get_ticklabels(), rotation=90)
    plt.show()
    plt.close()
There might be a better way to do this; if so, I'll post it, or maybe someone will feel free to edit this. Looks like this could lead to the last group not having a full set of data. ...
If you know that your data is periodic you can just use slices to split it up.
n = len(df) // 5
for tmp_df in (df[i:i+n] for i in range(0, len(df), n)):
    gb_tmp = tmp_df.groupby(pd.Grouper(freq='7D'))
    ax = gb_tmp.boxplot(subplots=False)
    plt.setp(ax.xaxis.get_ticklabels(), rotation=90)
    plt.show()
    plt.close()
Frequency aliases
pandas.read_csv()
pandas.Grouper()
I'm trying to smooth a graph line out, but since the x-axis values are dates I'm having great trouble doing this. Say we have a dataframe as follows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
startDate = '2015-05-15'
endDate = '2015-12-5'
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ['value']
df = pd.DataFrame(data, index=index, columns=cols)
Then we plot the data
fig, axs = plt.subplots(1,1, figsize=(18,5))
x = df.index
y = df.value
axs.plot(x, y)
fig.show()
we get
Now to smooth this line there are some useful Stack Overflow questions already, like:
Generating smooth line graph using matplotlib,
Plot smooth line with PyPlot
Creating numpy linspace out of datetime
But I just cant seem to get some code working to do this for my example, any suggestions?
You can use the interpolation functionality that is shipped with pandas. Because your dataframe already has a value for every index, you can reindex it with a denser index, which fills every previously non-existent index with NaN values. Then, after choosing one of the many interpolation methods available, interpolate and plot your data:
index_hourly = pd.date_range(startDate, endDate, freq='1H')
df_smooth = df.reindex(index=index_hourly).interpolate('cubic')
df_smooth = df_smooth.rename(columns={'value':'smooth'})
df_smooth.plot(ax=axs, alpha=0.7)
df.plot(ax=axs, alpha=0.7)
fig.show()
There is one workaround: we will create two plots - 1) non-smoothed/non-interpolated with date labels, 2) smoothed without date labels.
Plot 1) using the argument linestyle=" " and convert the dates to be plotted on the x-axis to string type.
Plot 2) using the argument linestyle="-", interpolating the x-axis and y-axis using np.linspace and make_interp_spline respectively.
Following is the use of the discussed workaround for your code.
# your initial code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline
%matplotlib inline
startDate = "2015-05-15"
endDate = "2015-07-5" #reduced the end date so smoothness is clearly seen
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ["value"]
df = pd.DataFrame(data, index=index, columns=cols)
fig, axs = plt.subplots(1, 1, figsize=(40, 12))
x = df.index
y = df.value
# workaround: create a linspace over the length of your x-axis
x_new = np.linspace(0, len(df.index), 300)
a_BSpline = make_interp_spline(
    [i for i in range(0, len(df.index))],
    df.value,
    k=5,
)
y_new = a_BSpline(x_new)
# plot this new data with linestyle = "-"
axs.plot(
    x_new[:-5],  # removing the last 5 entries to remove noise, because the interpolation outputs large values at the end
    y_new[:-5],
    "-",
    label="interpolated"
)
# to get the date on x axis we will keep our previous plot but linestyle will be None so it won't be visible
x = list(x.astype(str))
axs.plot(x, y, linestyle=" ", alpha=0.75, label="initial")
xt = [x[i] for i in range(0,len(x),5)]
plt.xticks(xt,rotation="vertical")
plt.legend()
fig.show()
Resulting Plot
Overlapped plot to see the smoothing.
Depending on what exactly you mean by "smoothing," the easiest way can be the use of savgol_filter or something similar. Unlike with interpolated splines, this method means that the smoothed line does not pass through the measured points, effectively filtering out higher-frequency noise.
from scipy.signal import savgol_filter
...
windowSize = 21  # length of the filter window (an odd number of samples)
polyOrder = 1    # order of the polynomial fitted within each window
# 'values' and 'datetimes' are the y and x data of the curve being smoothed
smoothed = savgol_filter(values, windowSize, polyOrder)
axes.plot(datetimes, smoothed, color=chart.color)
The higher the polynomial order value, the closer the smoothed line is to the raw data.
Here is an example.
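As a self-contained sketch applied to the DataFrame built in the question above (the window length and polynomial order are arbitrary choices for illustration):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter
index = pd.date_range('2015-05-15', '2015-12-05')
df = pd.DataFrame(np.random.normal(0, 1, size=len(index)), index=index, columns=['value'])
fig, axs = plt.subplots(1, 1, figsize=(18, 5))
axs.plot(df.index, df['value'], alpha=0.4, label='raw')
# the dates stay on the x-axis; only the y values are filtered
axs.plot(df.index, savgol_filter(df['value'], 21, 3), label='Savitzky-Golay, window=21, order=3')
axs.legend()
plt.show()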
Is there a way to do this? I cannot seem to find an easy way to interface a pandas Series with plotting a CDF.
I believe the functionality you're looking for is in the hist method of a Series object, which wraps the hist() function in matplotlib.
Here's the relevant documentation:
In [10]: import matplotlib.pyplot as plt
In [11]: plt.hist?
...
Plot a histogram.
Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
If `True`, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints. If `normed` is also `True`
then the histogram is normalized such that the last bin equals 1.
If `cumulative` evaluates to less than 0 (e.g., -1), the direction
of accumulation is reversed. In this case, if `normed` is also
`True`, then the histogram is normalized such that the first bin
equals 1.
...
For example
In [12]: import pandas as pd
In [13]: import numpy as np
In [14]: ser = pd.Series(np.random.normal(size=1000))
In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>
In [16]: plt.show()
In case you are also interested in the values, not just the plot.
import pandas as pd
# If you are in jupyter
%matplotlib inline
This will always work (discrete and continuous distributions)
# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series
# Frequency
stats_df = df \
    .groupby('value') \
    ['value'] \
    .agg('count') \
    .pipe(pd.DataFrame) \
    .rename(columns = {'value': 'frequency'})
# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df
# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and the table should be 'pmf'
# (Probability Mass Function) since the distribution is discrete.
# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
Alternative example with a sample drawn from a continuous distribution or you have a lot of individual values:
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
For continuous distributions only
Please note that if it is reasonable to assume that there is only one occurrence of each value in the sample (typically encountered in the case of continuous distributions), then the groupby() + agg('count') is not necessary (since the count is always 1).
In this case, a percent rank can be used to get to the cdf directly.
Use your best judgment when taking this kind of shortcut! :)
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)
# Get to the CDF directly
df['cdf'] = df.rank(method = 'average', pct = True)
# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
A CDF or cumulative distribution function plot is basically a graph with on the X-axis the sorted values and on the Y-axis the cumulative distribution. So, I would create a new series with the sorted values as index and the cumulative distribution as values.
First create an example series:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()
Now, before proceeding, append again the last (and largest) value. This step is important especially for small sample sizes in order to get an unbiased CDF:
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as index and the cumulative distribution as values:
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')
I came here looking for a plot like this with bars and a CDF line:
It can be achieved like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
n, bins, patches = ax.hist(series, bins=100, density=False)
n, bins, patches = ax2.hist(
    series, cumulative=1, histtype='step', bins=100, color='tab:orange')
plt.savefig('test.png')
If you want to remove the vertical line, then it's explained how to accomplish that here. Or you could just do:
ax.set_xlim((ax.get_xlim()[0], series.max()))
I also saw an elegant solution here on how to do it with seaborn.
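For reference, a minimal sketch of that seaborn route (assuming seaborn >= 0.11, which provides ecdfplot):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
sns.ecdfplot(series)  # empirical CDF drawn as a step function
plt.show()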
This is the easiest way.
import pandas as pd
df = pd.Series([i for i in range(100)])
df.hist(cumulative=True)
Image of cumulative histogram
I found another solution in "pure" Pandas, that does not require specifying the number of bins to use in a histogram:
import pandas as pd
import numpy as np # used only to create example data
series = pd.Series(np.random.normal(size=10000))
cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
To me, this seemed like a simple way to do it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
heights = pd.Series(np.random.normal(size=100))
# empirical CDF
def F(x, data):
    return float(len(data[data <= x])) / len(data)
vF = np.vectorize(F, excluded=['data'])
plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))
I really like the answer by Raphvanns. It is helpful because it not only produces the plot, but it also helped me understand what pdf, cdf, and ccdf are.
I have two things to add to Raphvanns's solution: (1) use collections.Counter wisely to make the process easier; (2) remember to sort (ascending) value before calculating the pdf, cdf, and ccdf.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
Generate random numbers:
s = pd.Series(np.random.randint(1000, size=(1000)))
Build a dataframe as Raphvanns suggested:
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
df = df.sort_values('value')  # sort ascending by value (point 2 above)
Calculate PDF, CDF, and CCDF:
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
Plot:
df.plot(x = 'value', y = ['cdf', 'ccdf'], grid = True)
You may wonder why we have to sort the value before calculating PDF, CDF, and CCDF. Well, let's see what the results would be if we don't sort them (below we deliberately randomize the order of value).
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# randomize the order of `value`:
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
df.plot(x = 'value', y = ['cdf'], grid = True)
This is the plot:
Why did it happen? Well, the essence of a CDF is "the number of data points we have seen so far", citing YY's lecture slides from his Data Visualization class. Therefore, if value is not sorted (either ascending or descending is fine), then when you plot, with the x-axis in ascending order, the y values will of course just be a mess.
If you apply a descending order, you can imagine that the CDF and CCDF will just swap their places:
I will leave a question to the readers of this post: if I randomize the order of value like above, will sorting value after (rather than before) calculating PDF, CDF, and CCDF solve the problem?
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# randomize the order of `value`:
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
# Will this solve the problem?
df = df.sort_values(by='value')
df.plot(x = 'value', y = ['cdf'], grid = True)
Extending the answer of @wroscoe:
df[your_column].plot(kind='hist', histtype='step', density=True, cumulative=True)
You can also provide a number of desired bins.
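A quick usage sketch with made-up data (the column name value is a placeholder):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'value': np.random.normal(size=1000)})  # placeholder data
df['value'].plot(kind='hist', histtype='step', density=True, cumulative=True, bins=100)
plt.show()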
If you're looking to plot a "true" empirical CDF, which jumps exactly at the values of your data set a, and with the jump at each value proportional to the frequency of the value, NumPy has builtin functions to do the work:
import matplotlib.pyplot as plt
import numpy as np
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    y = np.cumsum(counts)
    x = np.insert(x, 0, x[0])
    y = np.insert(y/y[-1], 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
The call to unique() returns the data values in sorted order along with their corresponding frequencies. The option drawstyle='steps-post' in the plot() call ensures that the jumps occur where they should. To force a jump at the smallest data value, the code inserts an additional element in front of x and y.
Example usage:
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
ecdf(xvec)
Another usage, with a pandas DataFrame column (assuming import pandas as pd):
df = pd.DataFrame({'x': [7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7]})
ecdf(df['x'])
with output: