matplotlib: drawing lines between points ignoring missing data


I have a set of data which I want plotted as a line graph. For each series, some data is missing (at different positions in each series). By default, matplotlib breaks the line at a missing value instead of connecting the points on either side of it. For example,
import matplotlib.pyplot as plt
xs = range(8)
series1 = [1, 3, 3, None, None, 5, 8, 9]
series2 = [2, None, 5, None, 4, None, 3, 2]
plt.plot(xs, series1, linestyle='-', marker='o')
plt.plot(xs, series2, linestyle='-', marker='o')
plt.show()
results in a plot with gaps in the lines. How can I tell matplotlib to draw lines through the gaps? (I'd rather not have to interpolate the data).

You can mask the NaN values this way:
import numpy as np
import matplotlib.pyplot as plt
xs = np.arange(8)
series1 = np.array([1, 3, 3, None, None, 5, 8, 9]).astype(np.double)
s1mask = np.isfinite(series1)
series2 = np.array([2, None, 5, None, 4, None, 3, 2]).astype(np.double)
s2mask = np.isfinite(series2)
plt.plot(xs[s1mask], series1[s1mask], linestyle='-', marker='o')
plt.plot(xs[s2mask], series2[s2mask], linestyle='-', marker='o')
plt.show()
This leads to a plot in which each series is drawn as one continuous line through its valid points.
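The masking step can be wrapped in a small helper so it is not repeated for every series. This is only a sketch: the name finite_pairs is made up for illustration and is not part of NumPy or matplotlib.

```python
import numpy as np

def finite_pairs(xs, ys):
    """Return the x and y values at positions where y is finite.

    Hypothetical helper, not a library function.
    """
    xs = np.asarray(xs, dtype=float)
    # Casting to float turns None entries into NaN
    ys = np.array(ys, dtype=float)
    mask = np.isfinite(ys)
    return xs[mask], ys[mask]

x_clean, y_clean = finite_pairs(range(8), [1, 3, 3, None, None, 5, 8, 9])
```

The two returned arrays can then be passed straight to plt.plot(x_clean, y_clean, linestyle='-', marker='o').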

Quoting @Rutger Kassies:
Matplotlib only draws a line between consecutive (valid) data points,
and leaves a gap at NaN values.
A solution if you are using pandas:
# pd.Series
s.dropna().plot() # masking (as @Thorsten Kranz suggested)
#pd.DataFrame
df['a_col_ffill'] = df['a_col'].ffill()
df['b_col_ffill'] = df['b_col'].ffill() # changed from a to b
df[['a_col_ffill','b_col_ffill']].plot()

A solution with pandas:
import matplotlib.pyplot as plt
import pandas as pd
def splitSerToArr(ser):
    # Series.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
    return [ser.index, ser.to_numpy()]
xs = range(8)
series1 = [1, 3, 3, None, None, 5, 8, 9]
series2 = [2, None, 5, None, 4, None, 3, 2]
s1 = pd.Series(series1, index=xs)
s2 = pd.Series(series2, index=xs)
plt.plot( *splitSerToArr(s1.dropna()), linestyle='-', marker='o')
plt.plot( *splitSerToArr(s2.dropna()), linestyle='-', marker='o')
plt.show()
The splitSerToArr function is very handy when plotting pandas Series with matplotlib. This is the output:

Without interpolation you'll need to remove the None values from the data, along with the x-values that correspond to them. Here's an (ugly) one-liner for doing that:
x1Clean,series1Clean = zip(* filter( lambda x: x[1] is not None , zip(xs,series1) ))
The lambda returns False for pairs whose series value is None, so filter drops those (x, y) pairs; the outer zip(*...) then splits the remaining pairs back into separate x and y sequences.
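The same filtering can be spelled out with list comprehensions, which some readers may find easier to follow than the one-liner. This is a sketch; the variable names are illustrative.

```python
xs = list(range(8))
series1 = [1, 3, 3, None, None, 5, 8, 9]

# Keep only the (x, y) pairs whose y value is present
pairs = [(x, y) for x, y in zip(xs, series1) if y is not None]

# Split the surviving pairs back into separate x and y lists
x1_clean = [x for x, _ in pairs]
series1_clean = [y for _, y in pairs]
```

Unlike the zip(*...) form, this version also copes with a series that contains only None values: it simply yields two empty lists.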

For what it's worth, after some trial and error I would like to add one clarification to Thorsten's solution, hopefully saving time for users who looked elsewhere after trying this approach.
I was unable to get success with an identical problem while using
from pyplot import *
and attempting to plot with
plot(abscissa[mask],ordinate[mask])
It seemed it was required to use import matplotlib.pyplot as plt to get the proper NaNs handling, though I cannot say why.

Another solution for pandas DataFrames:
plot = df.plot(style='o-') # draw the lines so they appear in the legend
colors = [line.get_color() for line in plot.lines] # get the colors of the markers
df = df.interpolate(limit_area='inside') # interpolate
lines = plot.plot(df.index, df.values) # add more lines (with a new set of colors)
for color, line in zip(colors, lines):
    line.set_color(color) # overwrite the new lines' colors with the colors of the old lines

I had the same problem, but masking removed the points in between and the line was still broken (the pink segments visible in the picture were the only consecutive non-NaN data, which is why they were drawn). Here is the code that masks the data (the result still has gaps):
xs = df['time'].to_numpy()
series1 = np.array(df['zz'].to_numpy()).astype(np.double)
s1mask = np.isfinite(series1)
fplt.plot(xs[s1mask], series1[s1mask], ax=ax_candle, color='#FF00FF', width = 1, legend='ZZ')
This may be because I was using finplot (to plot a candlestick chart). I therefore decided to compute the missing y-values with the line formula y2 - y1 = m(x2 - x1), and wrote a function that generates the y-values across each gap.
from math import isnan
import numpy as np

def fillYLine(y):
    # Fill every gap between two valid values with points on a straight line
    fi = 0        # index of the last valid value
    first = None  # last valid value
    next = None   # current valid value (None while inside a gap)
    for i in range(0, len(y), 1):
        ne = not(isnan(y[i]))
        next = y[i] if ne else next
        if not(next is None):
            if not(first is None):
                m = (first-next)/(i-fi) # m = (y1 - y2) / (x1 - x2)
                cant_points = np.abs(i-fi)-1 # number of missing points in the gap
                if (cant_points) > 0:
                    # Create the line between the two valid values to get the missing points
                    points = createLine(next, first, i, fi, cant_points)
                    x = 1
                    for p in points:
                        y[fi+x] = p
                        x = x + 1
            first = next
            fi = i
        next = None
    return y

def createLine(y2, y1, x2, x1, cant_points):
    m = (y2-y1)/(x2-x1) # slope
    points = []
    x = x1 + 1 # first point to assign
    for i in range(0, cant_points, 1):
        y = ((m*(x2-x))-y2)*-1
        points.append(y)
        x = x + 1 # the line uses integer positions, not the time values, but in the same order
    return points
Then I simply call the function to fill the gaps, as in y = fillYLine(y), and my finplot call became:
x = df['time'].to_numpy()
y = df['zz'].to_numpy().copy() # copy so that filling the gaps does not modify df['zz']
y = fillYLine(y)
fplt.plot(x, y, ax=ax_candle, color='#FF00FF', width = 1, legend='ZZ')
Keep in mind that the data in y is only for plotting. I still need the NaN values in other operations (or need to remove them from the list), which is why I created a separate y variable from the pandas column df['zz'].
Note: I concluded that the points are being dropped in my case because, if I don't mask x (xs), the values slide left in the graph: they become consecutive non-NaN values and a continuous line is drawn, but compressed to the left:
fplt.plot(xs, series1[s1mask], ax=ax_candle, color='#FF00FF', width = 1, legend='ZZ') #No xs masking (xs[masking])
This made me think that the mask works for some people because they are plotting only that line, or because there is little difference between the unmasked and masked data (few gaps, unlike my data, which has many).
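For comparison, the same interior gap filling can be sketched with NumPy's np.interp instead of a hand-written loop. This swaps in a different technique from the answer above; the function name fill_gaps_linear is made up for illustration, and it assumes the samples are evenly spaced by position, as the original function does.

```python
import numpy as np

def fill_gaps_linear(y):
    """Linearly fill NaN runs that lie between two valid values.

    Hypothetical helper. Leading and trailing NaNs are left untouched.
    Returns a new array; the input is not modified.
    """
    y = np.asarray(y, dtype=float)
    idx = np.arange(len(y))
    valid = ~np.isnan(y)
    filled = y.copy()
    if valid.sum() >= 2:
        first, last = idx[valid][0], idx[valid][-1]
        # Only positions strictly inside the valid range are interpolated
        inside = (idx > first) & (idx < last) & ~valid
        filled[inside] = np.interp(idx[inside], idx[valid], y[valid])
    return filled

result = fill_gaps_linear([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])
```

Because the function returns a copy, the column it was read from keeps its NaN values for later calculations.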

Perhaps I missed the point, but I believe Pandas now does this automatically. The example below is a little involved, and requires internet access, but the line for China has lots of gaps in the early years, hence the straight line segments.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read data from Maddison project
url = 'http://www.ggdc.net/maddison/maddison-project/data/mpd_2013-01.xlsx'
mpd = pd.read_excel(url, skiprows=2, index_col=0, na_values=[' '])
mpd.columns = map(str.rstrip, mpd.columns)
# select countries
countries = ['England/GB/UK', 'USA', 'Japan', 'China', 'India', 'Argentina']
mpd = mpd[countries].dropna()
mpd = mpd.rename(columns={'England/GB/UK': 'UK'})
mpd = np.log(mpd)/np.log(2) # convert to log2
# plots
ax = mpd.plot(lw=2)
ax.set_title('GDP per person', fontsize=14, loc='left')
ax.set_ylabel('GDP Per Capita (1990 USD, log2 scale)')
ax.legend(loc='upper left', fontsize=10, handlelength=2, labelspacing=0.15)
fig = ax.get_figure()
fig.show()

Related

Change x- and y-numbering in imshow

I would like to plot a function of two variables in python. Similar to this article, we can obtain a plot of the function using this code:
from numpy import exp,arange
from pylab import meshgrid,cm,imshow,contour,clabel,colorbar,axis,title,show
from matplotlib import pyplot
# the function that I'm going to plot
def z_func(x,y):
    return (1-(x**2+y**3))*exp(-(x**2+y**2)/2)
x = arange(-3.0,3.0,0.1)
y = arange(-3.0,3.0,0.1)
z = [[0] * y.__len__() for i in range(x.__len__())]
for i in range(0, x.__len__()):
    for j in range(0, y.__len__()):
        z[j][i] = z_func(x[i], y[j])
im = imshow(z,cmap=cm.RdBu, extent = [-3, 3, -3, 3], interpolation = "none", origin='lower') # drawing the function
# adding the Contour lines with labels
cset = contour(z,arange(-1,1.5,0.2),linewidths=2,cmap=cm.Set2)
clabel(cset,inline=True,fmt='%1.1f',fontsize=10)
colorbar(im) # adding the colorbar on the right
# latex fashion title
title('$z=(1-x^2+y^3) e^{-(x^2+y^2)/2}$')
show()
As you can see, the x- and y-labels go from 0 to 59 (which is the count of elements in x and y). How can I correct these values such that they range from -3 to 3?
A minor sub-question: Why do I need to "transpose" in z[j][i] = z_func(x[i], y[j])? Does Python treat the first dimension as "column" and the second as "row"?

You're plotting both the z-function image and the contour lines. You need to pass the "extent" parameter to matplotlib.pyplot.contour as well.
cset = contour(z, arange(-1,1.5,0.2),
extent = [-3, 3, -3, 3],
linewidths = 2,
cmap = cm.Set2)
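On the sub-question about the "transpose": imshow and contour treat the first array index as the row (the vertical, y direction) and the second as the column (the horizontal, x direction). A vectorized sketch with np.meshgrid makes this visible; the grid sizes below are chosen to differ on purpose and are not from the question.

```python
import numpy as np

def z_func(x, y):
    return (1 - (x**2 + y**3)) * np.exp(-(x**2 + y**2) / 2)

# Different lengths so the two axes cannot be confused
x = np.linspace(-3.0, 3.0, 60)
y = np.linspace(-2.0, 2.0, 40)

# X and Y both have shape (len(y), len(x)): rows follow y, columns follow x
X, Y = np.meshgrid(x, y)
Z = z_func(X, Y)

# The same extent can be passed to both imshow and contour
extent = [x.min(), x.max(), y.min(), y.max()]
```

Here Z[j, i] holds the value at (x[i], y[j]), which is the same layout the nested loop in the question builds by hand.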

plot a point within ridgeplots

having the following dataframe:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import joypy
sample1 = np.random.normal(5, 10, size = (200, 5))
sample2 = np.random.normal(40, 5, size = (200, 5))
sample3 = np.random.normal(10, 5, size = (200, 5))
b = []
for i in range(0, 3):
    a = "Sample" + "{}".format(i)
    lst = np.repeat(a, 200)
    b.append(lst)
b = np.asarray(b).reshape(600,1)
data_arr = np.vstack((sample1,sample2, sample3))
df1 = pd.DataFrame(data = data_arr, columns = ["foo", "bar", "qux", "corge", "grault"])
df1.insert(0, column="sampleNo", value = b)
I am able to produce the following ridgeplot:
fig, axes = joypy.joyplot(df1, column = ['foo'], by = 'sampleNo',
alpha=0.6,
linewidth=.5,
linecolor='w',
fade=True)
Now, let's say I have the following vector:
vectors = np.asarray([10, 40, 50])
How do I plot each one of those points into the density plots? E.g., on the distribution plot of sample 1, I'd like to have a single point (or line) on 10; sample 2 on 40, etc..
I've tried to use axvline, and I sort of expected this to work, but no luck:
for ax in axes:
    ax.axvline(vectors(ax))
I am not sure if what I want is possible at all...

You almost had the correct approach.
axes holds 4 axis objects, in order: the three stacked plots from top to bottom and the big one where all the other 3 live in. So,
for ax, v in zip(axes, vectors):
    ax.axvline(v)
zip() will only zip up to the shorter iterable, which is vectors. So, it will match each point from vectors with each axis from the stacked plots.
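The truncation behaviour of zip() can be checked without any plotting. In this sketch the strings are placeholders standing in for the four axes objects that joypy returns.

```python
# Placeholder names for the four axes: three stacked plots plus the container
axes_names = ["top", "middle", "bottom", "container"]
vectors = [10, 40, 50]

# zip stops at the end of the shorter iterable, so "container" is never paired
pairs = list(zip(axes_names, vectors))
```

The last axes object, the one holding the three stacked plots, is skipped, so no vertical line is drawn on it.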

Is it possible to have a given number (n>2) of y-axes in matplotlib?

prices = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
I have my prices dataframe, and it currently has 3 columns. But at other times, it could have more or fewer columns. Is there a way to use some sort of twinx() loop to create a line-chart of all the different timeseries with a (potentially) infinite number of y-axes?
I tried the double for loop below but I got a TypeError: 'AxesSubplot' object does not support item assignment
# for i in range(0,len(prices.columns)):
# for column in list(prices.columns):
# fig, ax[i] = plt.subplots()
# ax[i].set_xlabel(prices.index())
# ax[i].set_ylabel(column[i])
# ax[i].plot(prices.Date, prices[column])
# ax[i].tick_params(axis ='y')
#
# ax[i+1] = ax[i].twinx()
# ax[i+1].set_ylabel(column[i+1])
# ax[i+1].plot(prices.Date, column[i+1])
# ax[i+1].tick_params(axis ='y')
#
# fig.suptitle('matplotlib.pyplot.twinx() function \ Example\n\n', fontweight ="bold")
# plt.show()
# =============================================================================
I believe I understand why I got the error - the ax object does not allow the assignment of the i variable. I'm hoping there is some ingenious way to accomplish this.

It turned out that the main problem was mixing pandas plotting functions with matplotlib, which led to duplicated axes. Otherwise, the implementation is rather straightforward, adapted from this matplotlib example.
from mpl_toolkits.axes_grid1 import host_subplot
import mpl_toolkits.axisartist as AA
from matplotlib import pyplot as plt
from itertools import cycle
import pandas as pd
#fake data creation with different spread for different axes
#this entire block can be deleted if you import your df
from pandas._testing import rands_array
import numpy as np
fakencol=5
fakenrow=7
np.random.seed(20200916)
df = pd.DataFrame(np.random.randint(1, 10, fakenrow*fakencol).reshape(fakenrow, fakencol), columns=rands_array(2, fakencol))
df = df.multiply(np.power(np.asarray([10]), np.arange(fakencol)))
df.index = pd.date_range("20200916", periods=fakenrow)
#defining a color scheme with unique colors
#if you want to include more than 20 axes, well, what can I say
sc_color = cycle(plt.cm.tab20.colors)
#defining the size of the figure in relation to the number of dataframe columns
#might need adjustment for optimal data presentation
offset = 60
plt.rcParams['figure.figsize'] = 10+df.shape[1], 5
#host figure and first plot
host = host_subplot(111, axes_class=AA.Axes)
h, = host.plot(df.index, df.iloc[:, 0], c=next(sc_color), label=df.columns[0])
host.set_ylabel(df.columns[0])
host.axis["left"].label.set_color(h.get_color())
host.set_xlabel("time")
#plotting the rest of the axes
for i, cols in enumerate(df.columns[1:]):
    curr_ax = host.twinx()
    new_fixed_axis = curr_ax.get_grid_helper().new_fixed_axis
    curr_ax.axis["right"] = new_fixed_axis(loc="right",
                                           axes=curr_ax,
                                           offset=(offset*i, 0))
    curr_p, = curr_ax.plot(df.index, df[cols], c=next(sc_color), label=cols)
    curr_ax.axis["right"].label.set_color(curr_p.get_color())
    curr_ax.set_ylabel(cols)
    curr_ax.yaxis.label.set_color(curr_p.get_color())
plt.legend()
plt.tight_layout()
plt.show()
Coming to think of it - it would probably have been better to distribute the axes equally to the left and the right of the plot. Oh, well.

Replacing part of a plot with a dotted line

I would like to replace part of my plot where the function dips down to '-1' with a dashed line carrying on from the previous point (see plots below).
Here's some code I've written, along with its output:
import numpy as np
import matplotlib.pyplot as plt
y = [5,6,8,3,5,7,3,6,-1,3,8,5]
plt.plot(np.linspace(1,12,12),y,'r-o')
plt.show()
for i in range(1,len(y)):
    if y[i]!=-1:
        plt.plot(np.linspace(i-1,i,2),y[i-1:i+1],'r-o')
    else:
        y[i]=y[i-1]
        plt.plot(np.linspace(i-1,i,2),y[i-1:i+1],'r--o')
plt.ylim(-1,9)
plt.show()
Here's the original plot
Modified plot:
The code I've written works (it produces the desired output), but it's inefficient and takes a long time when I actually run it on my (much larger) dataset. Is there a smarter way to go about doing this?

You can achieve something similar without the loops:
import pandas as pd
import matplotlib.pyplot as plt
# Create a data frame from the list
a = pd.DataFrame([5,6,-1,-1, 8,3,5,7,3,6,-1,3,8,5])
# Prepare a boolean mask
mask = a > 0
# New data frame with missing values filled with the last element of
# the previous segment. Use bfill() instead to take the first element
# of the next segment. (fillna(method='ffill') is deprecated in recent pandas.)
a_masked = a[mask].ffill()
# Prepare the plot
fig, ax = plt.subplots()
line, = ax.plot(a_masked, ls = '--', lw = 1)
ax.plot(a[mask], color=line.get_color(), lw=1.5, marker = 'o')
plt.show()
You can also highlight the negative regions by choosing a different colour for the lines:
My answer is based on a great post from July, 2017. The latter also tackles the case when the first element is NaN or in your case a negative number:
Dotted lines instead of a missing value in matplotlib

I would use numpy functionality to cut your line into segments and then plot all solid and dashed lines separately. In the example below I added two additional -1s to your data to see that this works universally.
import numpy as np
import matplotlib.pyplot as plt
Y = np.array([5,6,-1,-1, 8,3,5,7,3,6,-1,3,8,5])
X = np.arange(len(Y))
idxs = np.where(Y==-1)[0]
sub_y = np.split(Y,idxs)
sub_x = np.split(X,idxs)
fig, ax = plt.subplots()
##replacing -1 values and plotting dotted lines
for i in range(1,len(sub_y)):
    val = sub_y[i-1][-1]
    sub_y[i][0] = val
    ax.plot([sub_x[i-1][-1], sub_x[i][0]], [val, val], 'r--')
##plotting rest
for x,y in zip(sub_x, sub_y):
    ax.plot(x, y, 'r-o')
plt.show()
The result looks like this:
Note, however, that this will fail if the first value is -1, as then your problem is not well defined (no previous value to copy from). Hope this helps.
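For large datasets, the forward fill itself can be done in NumPy without a Python loop. The sketch below is one way to do it under the assumption that -1 is the only sentinel value; the function name is made up, and it marks a leading sentinel as NaN instead of failing.

```python
import numpy as np

def forward_fill_sentinel(y, sentinel=-1):
    """Replace sentinel entries with the most recent valid value.

    Hypothetical helper. Returns (filled, valid), where valid marks the
    original, non-sentinel samples.
    """
    y = np.asarray(y, dtype=float)
    valid = y != sentinel
    # At each position, the index of the latest valid sample seen so far
    idx = np.where(valid, np.arange(len(y)), 0)
    idx = np.maximum.accumulate(idx)
    filled = y[idx]
    # Positions before the first valid sample have nothing to copy from
    filled[~np.logical_or.accumulate(valid)] = np.nan
    return filled, valid

filled, valid = forward_fill_sentinel([5, 6, -1, -1, 8, 3, -1, 3])
```

The filled array can be drawn dashed and np.where(valid, filled, np.nan) drawn solid on top, in the same spirit as the pandas answer above.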

Not too elegant, but here's something I came up with (based on the above answers) that works without loops. @KRKirov and @Thomas Kühn, thank you for your answers, I really appreciate them.
import pandas as pd
import matplotlib.pyplot as plt
# Create a data frame from the list
a = pd.DataFrame([5,6,-1,-1, 8,3,5,7,3,6,-1,3,8,5])
b=a.copy()
b[2]=b[0].shift(1,axis=0)
b[4]=(b[0]!=-1) & (b[2]==-1)
b[5]=b[4].shift(-1,axis=0)
b[0] = (b[5] | b[4])
c=b[0]
d=pd.DataFrame(c)
# Prepare a boolean mask
mask = a > 0
# New data frame with missing values filled with the last element of
# the previous segment. Use bfill() instead to take the first element
# of the next segment. (fillna(method='ffill') is deprecated in recent pandas.)
a_masked = a[mask].ffill()
# Prepare the plot
fig, ax = plt.subplots()
line, = ax.plot(a_masked, 'b:o', lw = 1)
ax.plot(a[mask], color=line.get_color(), lw=1.5, marker = 'o')
ax.plot(a_masked[d], color=line.get_color(), lw=1.5, marker = 'o')
plt.show()

Plotting CDF of a pandas series in python

Is there a way to do this? I cannot seem to find an easy way to interface a pandas series with plotting a CDF.

I believe the functionality you're looking for is in the hist method of a Series object which wraps the hist() function in matplotlib
Here's the relevant documentation
In [10]: import matplotlib.pyplot as plt
In [11]: plt.hist?
...
Plot a histogram.
Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
If `True`, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints. If `normed` is also `True`
then the histogram is normalized such that the last bin equals 1.
If `cumulative` evaluates to less than 0 (e.g., -1), the direction
of accumulation is reversed. In this case, if `normed` is also
`True`, then the histogram is normalized such that the first bin
equals 1.
...
For example
In [12]: import pandas as pd
In [13]: import numpy as np
In [14]: ser = pd.Series(np.random.normal(size=1000))
In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>
In [16]: plt.show()

In case you are also interested in the values, not just the plot.
import pandas as pd
# If you are in jupyter
%matplotlib inline
This will always work (discrete and continuous distributions)
# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series
# Frequency
stats_df = df \
.groupby('value') \
['value'] \
.agg('count') \
.pipe(pd.DataFrame) \
.rename(columns = {'value': 'frequency'})
# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df
# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and the table should be 'pmf'
# (Probability Mass Function) since the distribution is discrete.
# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
Alternative example for a sample drawn from a continuous distribution, or when you have a lot of individual values (this needs import numpy as np):
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
For continuous distributions only
Please note that if it is reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then the groupby() + agg('count') step is not necessary, since the count is always 1.
In this case, a percent rank can be used to get to the cdf directly.
Use your best judgment when taking this kind of shortcut! :)
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)
# Get to the CDF directly
df['cdf'] = df.rank(method = 'average', pct = True)
# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)

A CDF or cumulative distribution function plot is basically a graph with on the X-axis the sorted values and on the Y-axis the cumulative distribution. So, I would create a new series with the sorted values as index and the cumulative distribution as values.
First create an example series:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()
Now, before proceeding, append again the last (and largest) value. This step is important especially for small sample sizes in order to get an unbiased CDF:
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as index and the cumulative distribution as values:
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')

I came here looking for a plot like this with bars and a CDF line:
It can be achieved like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
n, bins, patches = ax.hist(series, bins=100, density=False) # 'normed' was removed from matplotlib; 'density' replaces it
n, bins, patches = ax2.hist(
series, cumulative=1, histtype='step', bins=100, color='tab:orange')
plt.savefig('test.png')
If you want to remove the vertical line, then it's explained how to accomplish that here. Or you could just do:
ax.set_xlim((ax.get_xlim()[0], series.max()))
I also saw an elegant solution here on how to do it with seaborn.

This is the easiest way.
import pandas as pd
df = pd.Series([i for i in range(100)])
df.hist( cumulative = True )

I found another solution in "pure" Pandas, that does not require specifying the number of bins to use in a histogram:
import pandas as pd
import numpy as np # used only to create example data
series = pd.Series(np.random.normal(size=10000))
cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
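Note that the cumulative sum of value_counts() gives cumulative counts, so the curve ends at the sample size. If a CDF that ends at 1 is wanted, value_counts accepts normalize=True. A small sketch with made-up data:

```python
import pandas as pd

series = pd.Series([3, 1, 2, 2, 3, 3])

# normalize=True returns relative frequencies, so the cumulative sum ends at 1
cdf = series.value_counts(normalize=True).sort_index().cumsum()
```

Calling cdf.plot(drawstyle='steps-post') would then draw it as a step function.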

To me, this seemed like a simple way to do it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
heights = pd.Series(np.random.normal(size=100))
# empirical CDF
def F(x,data):
    return float(len(data[data <= x]))/len(data)
vF = np.vectorize(F, excluded=['data'])
plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))

I really like the answer by Raphvanns. It is helpful because it not only produces the plot, but it also helps me understand what the pdf, cdf, and ccdf are.
I have two things to add to Raphvanns's solution: (1) use collections.Counter to make the process easier; (2) remember to sort the values (ascending) before calculating the pdf, cdf, and ccdf.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
Generate random numbers:
s = pd.Series(np.random.randint(1000, size=(1000)))
Build a dataframe as Raphvanns suggested:
dic = dict(Counter(s))
# Build the table from the counts (dic), not from the series itself, sorted by value
df = pd.DataFrame(sorted(dic.items()), columns = ['value', 'frequency'])
Calculate PDF, CDF, and CCDF:
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
Plot:
df.plot(x = 'value', y = ['cdf', 'ccdf'], grid = True)
You may wonder why we have to sort the values before calculating the PDF, CDF, and CCDF. Let's see what the results would be if we don't sort them. (A Counter keeps values in the order they were first seen, not in sorted order; to make the effect obvious, we shuffle the rows below.)
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# randomize the order of `value` (frac=1 shuffles every row):
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
df.plot(x = 'value', y = ['cdf'], grid = True)
This is the plot:
Why did this happen? The essence of a CDF is "the number of data points we have seen so far", citing YY's lecture slides from his Data Visualization class. Therefore, if the values are not sorted (either ascending or descending is fine) before the cumulative sum is taken, then when you plot with the x axis in ascending order, the y values will of course be a mess.
If you apply a descending order, you can imagine that the CDF and CCDF will just swap their places:
I will leave a question to the readers of this post: if I randomize the order of value like above, will sorting value after (rather than before) calculating PDF, CDF, and CCDF solve the problem?
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# randomize the order of `value` (frac=1 shuffles every row):
df = df.sample(frac=1)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
# Will this solve the problem?
df = df.sort_values(by='value')
df.plot(x = 'value', y = ['cdf'], grid = True)

Upgrading the answer of @wroscoe:
df[your_column].plot(kind = 'hist', histtype = 'step', density = True, cumulative = True)
You can also provide a number of desired bins.

If you're looking to plot a "true" empirical CDF, which jumps exactly at the values of your data set a, and with the jump at each value proportional to the frequency of the value, NumPy has builtin functions to do the work:
import matplotlib.pyplot as plt
import numpy as np
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    y = np.cumsum(counts)
    x = np.insert(x, 0, x[0])
    y = np.insert(y/y[-1], 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
The call to unique() returns the data values in sorted order along with their corresponding frequencies. The option drawstyle='steps-post' in the plot() call ensures that the jumps occur where they should. To force a jump at the smallest data value, the code inserts an additional element in front of x and y.
Example usage:
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
ecdf(xvec)
Another usage (this needs import pandas as pd):
df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
ecdf(df['x'])
with output:
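If the ECDF values themselves are needed rather than the figure, the same np.unique and np.cumsum steps can be kept in a function that only returns arrays. This is a sketch; ecdf_values is a made-up name, not a NumPy function.

```python
import numpy as np

def ecdf_values(a):
    """Return the sorted unique values of a and the ECDF evaluated at each one."""
    x, counts = np.unique(a, return_counts=True)
    # Cumulative counts divided by the sample size give the ECDF heights
    y = np.cumsum(counts) / counts.sum()
    return x, y

x, y = ecdf_values([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
```

For this sample the heights rise by 0.1 per occurrence, so a value that appears three times produces a jump of 0.3.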
