I have implemented a regression model and retrieved results. To evaluate the results, I want to create a plot where the MAE and its standard deviation are represented in the same figure, but with the data grouped into intervals and the statistics evaluated per interval. Although I can use sklearn metrics to calculate the mean absolute error, that works on the entire range of the data. Can someone give me an idea of how to group the data into intervals?
The data is very large and hence cannot be shared here. However, I am attaching random data and the implemented code for calculating the bias below.
import pandas as pd
import random
import matplotlib.pyplot as plt
yact = random.sample(range(1, 100), 50)
ypred = random.sample(range(1, 100), 50)
df = pd.DataFrame(yact, columns=['yact'])
df['ypred'] = ypred
df['bias'] = df['yact'] - df['ypred']
#groups=[20,40,60,80,100]
I want to create groups of ypred based on yact (similar to the groups given above).
A reference figure for what I am trying to plot is in the first quadrant of the figure attached below.
We could use only pandas/matplotlib, but seaborn makes this kind of plotting so much easier. First, we categorize the data with pd.cut based on the bins provided, then we plot them with seaborn's pointplot. The estimator mean is the default, but I wanted to point out that you can feed other functions into the plot here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#random data generation
rng = np.random.default_rng(123)
n=500
yact = rng.choice(range(1, 100), n)
ypred = rng.choice(range(1, 100), n)
df = pd.DataFrame({"yact": yact, "ypred": ypred})
df['bias'] = df['yact'] - df['ypred']
#binning of data
bins = [0, 30, 50, 80, 100]
labels = [f"({first}; {second}]" for first, second in zip(bins[:-1], bins[1:])]
df["cats"] = pd.cut(x=df['yact'], bins=bins, labels=labels, include_lowest=True)
#plotting with seaborn
sns.pointplot(x="cats", y="ypred", data=df, order=labels, estimator=np.mean, ci="sd", join=False)  # in seaborn >= 0.12, use errorbar="sd" instead of the deprecated ci="sd"
plt.show()
(Unsurprisingly uniform) sample output:
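If you also need the numbers behind such a plot - the MAE and its standard deviation per interval, as asked - here is a minimal sketch reusing the df and the cats column from above (the mean absolute error per bin is just the mean of |bias| within the bin):

# MAE and its standard deviation per interval
mae_stats = df.assign(abs_err=df["bias"].abs()).groupby("cats")["abs_err"].agg(["mean", "std"])
print(mae_stats)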
I have this dataframe and I want to plot it as a line plot, as I have done below.
The graph is:
The code to generate it is:
fig, ax = plt.subplots(figsize=(15, 5))
date_time = pd.to_datetime(df.Date)
df = df.set_index(date_time)
plt.xticks(rotation=90)
df.plot.line(ax=ax, xticks=pd.to_datetime(df.Date))
I want a marker with the InnovationScore value (where InnovationScore is not 0) on the Open and Close lines. I want to show that this is where the InnovationScore changes.
You have to address two problems - marking the corresponding spots on your curves, and using the dates on the x-axis. The first problem can be solved by identifying the dates where the score is not zero, then plotting markers on top of the curves at these dates. The second problem is more structural in nature - pandas often interferes with matplotlib when it comes to datetime objects. Using pandas' standard plotting functions is good because it addresses common problems. But mixing pandas with matplotlib plotting (and to set the markers, you have to revert to matplotlib, afaik) is usually a bad idea, because they do not necessarily represent the datetime in the same format.
import pandas as pd
from matplotlib import pyplot as plt
#fake data generation, the following code block is just for illustration
import numpy as np
np.random.seed(1234)
n = 50
date_range = pd.date_range("20180101", periods=n, freq="D")
choice = np.zeros(10)
choice[0] = 3
df = pd.DataFrame({"Date": date_range,
"Open": np.random.randint(100, 150, n),
"Close": np.random.randint(100, 150, n),
"Innovation Score": np.random.choice(choice, n)})
fig, ax = plt.subplots()
#plot the three curves
l = ax.plot(df["Date"], df[["Open", "Close", "Innovation Score"]])
ax.legend(iter(l), ["Open", "Close", "Innovation Score"])
#filter dataset for score not zero
IS = df[df["Innovation Score"] > 0]
#plot markers on these positions
ax.plot(IS["Date"], IS[["Open", "Close"]], "ro")
#and/or set vertical lines to indicate the position
ax.vlines(IS["Date"], 0, max(df[["Open", "Close"]].max()), ls="--")
#label x-axis score not zero
ax.set_xticks(IS["Date"])
#beautify the output
ax.set_xlabel("Month")
ax.set_ylabel("Artifical score people take seriously")
fig.autofmt_xdate()
plt.show()
Sample output:
You can achieve it like this:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], "ro-") # r is red, o is for larger marker, - is for line
plt.plot([3, 2, 1], "b.-") # b is blue, . is for small marker, - is for line
plt.show()
Also check out the example here for another approach:
https://matplotlib.org/3.3.3/gallery/lines_bars_and_markers/markevery_prop_cycle.html
I very often get inspiration from this list of examples:
https://matplotlib.org/3.3.3/gallery/index.html
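For instance, a tiny sketch of the markevery idea from that gallery page (the data here is made up purely for illustration):

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 100)
# draw the full line, but place a marker only on every 10th point
plt.plot(x, np.sin(x), "o-", markevery=10)
plt.show()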
I have data for a scatter plot (for reference, x values are labelled sm, y values are labelled bhm), and my three goals are to find the medians of binned data, create standard deviation bands, and create bands at the 90th and 10th percentiles. I've managed to do the first, and while I've been able to make vertical bars indicating the standard deviation, I can't figure out how to make filled-in bands: every time I try to set parameters with the fill_between function, it says the operations with sm/bhm are incompatible, since they're datasets and I'm comparing them to singular values (the mean line). I copied all of my code down below, and there's a comment pointing out the relevant stuff - I kept all of it since the variable names matter, and also because some parts of the plot don't show up properly without the seemingly extraneous code.
To create the bands at the 90th/10th percentiles, I tried the bit of code below, binning the mean as I did for the median and then filling the top and bottom of the line with ±90% of the data, but I keep getting
patsy.PatsyError: model is missing required outcome variables
#stuff that really doesn't work
model = smf.quantreg(bhm, sm)
quantiles = [0.1, 0.9]
fits = [model.fit(q=q) for q in quantiles]
figure, axes = plt.subplots()
_sm = np.linspace(min(sm), max(sm))
for index, quantile in enumerate(quantiles):
    _bhm = (fits[index].params['world'] * _sm
            + fits[index].params['Intercept'])
    axes.plot(_sm, _bhm, label = quantile)
axes.plot(_sm, _sm, 'g--', label = 'i guess this line is the mean')
#stuff that also doesn't really work
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.patches as mpatches
import h5py
import statistics as stat
import pandas as pd
import statsmodels.formula.api as smf
#my files and labels for things
f=h5py.File(r'C:\Users\hanna\Downloads\CatalogueGalsz0p0.hdf5', 'r')
sm = f['StellarMass']
bhm = f['BHMass']
bt = f['BtoT']
dt = f['DtoT']
nbins = 125
#titles and scaling for the plot
plt.title('Relationships Between Stellar Mass, Black Hole Mass, and Bulge to Total Ratios')
plt.xlabel('Stellar Mass')
plt.ylabel('Black Hole Mass')
plt.xscale('log')
plt.yscale('log')
axes = plt.gca()
axes.set_ylim([500000,max(bhm)])
axes.set_xlim([min(sm),max(sm)])
#labels for the legend and how I colored the points in the plot
DtoT = np.copy(f['DtoT'][()])  # f['DtoT'].value was removed in h5py 3.0
colour = np.zeros(len(DtoT),dtype=str)
for i in np.arange(0, len(bt)):
    if bt[i] >= 0.5:
        colour[i] = 'green'
    else:
        colour[i] = 'red'
redbt = mpatches.Patch(color = 'red', label = 'Bulge to Total Ratios Below 0.5')
greenbt = mpatches.Patch(color = 'green', label = 'Bulge to Total Ratios Above 0.5')
plt.legend(handles = [(redbt), (greenbt)])
#the important part - this is how I binned my data to make the median line, and this part works but not the standard deviation bands
bins = np.linspace(0, max(sm), nbins)
delta = bins[1]-bins[0]
idx = np.digitize(sm, bins)
runningmedian = [np.median(bhm[idx==k]) for k in range(nbins)]
runningstd = [bhm[idx==k].std() for k in range(nbins)]
plt.plot(bins-delta/2, runningmedian, c = 'b', lw=1)
plt.scatter(sm, bhm, c=colour, s=.2)
plt.show()
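In case it helps: the patsy error above comes from smf.quantreg, which expects a formula string such as 'bhm ~ sm' together with a dataframe, rather than raw arrays. For the filled bands themselves, a minimal sketch (reusing bins, delta, idx, nbins, runningmedian, and runningstd from the code above; fill_between wants plain 1-D arrays, so the h5py dataset is converted first) could look like this:

bhm_arr = np.asarray(bhm)
centers = bins - delta / 2
# 10th/90th percentile per bin; empty bins become NaN, so fill_between leaves a gap there
runningp10 = [np.percentile(bhm_arr[idx == k], 10) if np.any(idx == k) else np.nan for k in range(nbins)]
runningp90 = [np.percentile(bhm_arr[idx == k], 90) if np.any(idx == k) else np.nan for k in range(nbins)]
lower = np.array(runningmedian) - np.array(runningstd)
upper = np.array(runningmedian) + np.array(runningstd)
plt.fill_between(centers, lower, upper, color='b', alpha=0.2)  # +-1 standard deviation band
plt.fill_between(centers, runningp10, runningp90, color='gray', alpha=0.2)  # 10th-90th percentile band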
I'm trying to smooth a graph line out, but since the x-axis values are dates, I'm having great trouble doing this. Say we have a dataframe as follows
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline
startDate = '2015-05-15'
endDate = '2015-12-5'
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ['value']
df = pd.DataFrame(data, index=index, columns=cols)
Then we plot the data
fig, axs = plt.subplots(1,1, figsize=(18,5))
x = df.index
y = df.value
axs.plot(x, y)
fig.show()
we get
Now to smooth this line there are some useful Stack Overflow questions already, like:
Generating smooth line graph using matplotlib,
Plot smooth line with PyPlot
Creating numpy linspace out of datetime
But I just can't seem to get some code working to do this for my example, any suggestions?
You can use the interpolation functionality that is shipped with pandas. Because your dataframe already has a value for every index, you can reindex it with a denser index, which fills every previously non-existent index with a NaN value. Then, after choosing one of the many interpolation methods available, interpolate and plot your data:
index_hourly = pd.date_range(startDate, endDate, freq='1H')
df_smooth = df.reindex(index=index_hourly).interpolate('cubic')
df_smooth = df_smooth.rename(columns={'value':'smooth'})
df_smooth.plot(ax=axs, alpha=0.7)
df.plot(ax=axs, alpha=0.7)
fig.show()
There is one workaround: we will create two plots - 1) non-smoothed/non-interpolated, with date labels, and 2) smoothed, without date labels.
Plot 1) using the argument linestyle=" " and convert the dates to be plotted on the x-axis to string type.
Plot 2) using the argument linestyle="-", interpolating the x-axis and y-axis using np.linspace and make_interp_spline respectively.
Following is the use of the discussed workaround for your code.
# your initial code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.interpolate import make_interp_spline
%matplotlib inline
startDate = "2015-05-15"
endDate = "2015-07-5" #reduced the end date so smoothness is clearly seen
index = pd.date_range(startDate, endDate)
data = np.random.normal(0, 1, size=len(index))
cols = ["value"]
df = pd.DataFrame(data, index=index, columns=cols)
fig, axs = plt.subplots(1, 1, figsize=(40, 12))
x = df.index
y = df.value
# workaround by creating linspace for the length of your x axis
x_new = np.linspace(0, len(df.index), 300)
a_BSpline = make_interp_spline(
    [i for i in range(0, len(df.index))],
    df.value,
    k=5,
)
y_new = a_BSpline(x_new)
# plot this new plot with linestyle = "-"
axs.plot(
    x_new[:-5],  # removing last 5 entries to remove noise, because interpolation outputs large values at the end
    y_new[:-5],
    "-",
    label="interpolated"
)
# to get the date on x axis we will keep our previous plot but linestyle will be None so it won't be visible
x = list(x.astype(str))
axs.plot(x, y, linestyle=" ", alpha=0.75, label="initial")
xt = [x[i] for i in range(0,len(x),5)]
plt.xticks(xt,rotation="vertical")
plt.legend()
fig.show()
Resulting Plot
Overlapped plot to see the smoothing.
Depending on what exactly you mean by "smoothing," the easiest approach can be the use of savgol_filter or something similar. Unlike with interpolated splines, this method means that the smoothed line does not pass through the measured points, effectively filtering out higher-frequency noise.
from scipy.signal import savgol_filter
...
windowSize = 21
polyOrder = 1
smoothed = savgol_filter(values, windowSize, polyOrder)
axes.plot(datetimes, smoothed, color=chart.color)
The higher the polynomial order value, the closer the smoothed line is to the raw data.
Here is an example.
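And a self-contained sketch on daily data like the question's (the window length of 21 and polynomial order of 3 are arbitrary choices; the window must be odd and shorter than the series):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.signal import savgol_filter

index = pd.date_range("2015-05-15", "2015-12-05")
df = pd.DataFrame({"value": np.random.normal(0, 1, size=len(index))}, index=index)
smoothed = savgol_filter(df["value"], window_length=21, polyorder=3)
fig, ax = plt.subplots(figsize=(18, 5))
ax.plot(df.index, df["value"], alpha=0.5, label="raw")
ax.plot(df.index, smoothed, label="smoothed")
ax.legend()
plt.show()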
Is there a way to do this? I cannot seem to find an easy way to interface a pandas Series with plotting a CDF.
I believe the functionality you're looking for is in the hist method of a Series object, which wraps the hist() function in matplotlib.
Here's the relevant documentation:
In [10]: import matplotlib.pyplot as plt
In [11]: plt.hist?
...
Plot a histogram.
Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
If `True`, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints. If `normed` is also `True`
then the histogram is normalized such that the last bin equals 1.
If `cumulative` evaluates to less than 0 (e.g., -1), the direction
of accumulation is reversed. In this case, if `normed` is also
`True`, then the histogram is normalized such that the first bin
equals 1.
...
For example
In [12]: import pandas as pd
In [13]: import numpy as np
In [14]: ser = pd.Series(np.random.normal(size=1000))
In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>
In [16]: plt.show()
In case you are also interested in the values, not just the plot.
import pandas as pd
# If you are in jupyter
%matplotlib inline
This will always work (discrete and continuous distributions)
# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series
# Frequency
stats_df = df \
.groupby('value') \
['value'] \
.agg('count') \
.pipe(pd.DataFrame) \
.rename(columns = {'value': 'frequency'})
# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df
# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and the table should be 'pmf'
# (Probability Mass Function), since the distribution is discrete.
# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
An alternative example, with a sample drawn from a continuous distribution or when you have a lot of individual values:
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
For continuous distributions only
Please note that if it is very reasonable to assume that there is only one occurrence of each value in the sample (typically encountered in the case of continuous distributions), then the groupby() + agg('count') is not necessary (since the count is always 1).
In this case, a percent rank can be used to get to the cdf directly.
Use your best judgment when taking this kind of shortcut! :)
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)
# Get to the CDF directly
df['cdf'] = df.rank(method = 'average', pct = True)
# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
A CDF or cumulative distribution function plot is basically a graph with the sorted values on the X-axis and the cumulative distribution on the Y-axis. So, I would create a new series with the sorted values as index and the cumulative distribution as values.
First create an example series:
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()
Now, before proceeding, append the last (and largest) value again. This step is important, especially for small sample sizes, in order to get an unbiased CDF:
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as index and the cumulative distribution as values:
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')
I came here looking for a plot like this with bars and a CDF line:
It can be achieved like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
n, bins, patches = ax.hist(series, bins=100, density=False)  # 'normed' was removed in matplotlib 3.1; use 'density'
n, bins, patches = ax2.hist(
series, cumulative=1, histtype='step', bins=100, color='tab:orange')
plt.savefig('test.png')
If you want to remove the vertical line, then it's explained how to accomplish that here. Or you could just do:
ax.set_xlim((ax.get_xlim()[0], series.max()))
I also saw an elegant solution here on how to do it with seaborn.
This is the easiest way.
import pandas as pd
ser = pd.Series(range(100))
ser.hist(cumulative=True)
Image of cumulative histogram
I found another solution in "pure" Pandas, that does not require specifying the number of bins to use in a histogram:
import pandas as pd
import numpy as np # used only to create example data
series = pd.Series(np.random.normal(size=10000))
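# cumulative count per sorted value; divide by len(series) if you want the y-axis normalized to [0, 1]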
cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
To me, this seemed like a simple way to do it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
heights = pd.Series(np.random.normal(size=100))
# empirical CDF
def F(x, data):
    return float(len(data[data <= x])) / len(data)
vF = np.vectorize(F, excluded=['data'])
plt.plot(np.sort(heights), vF(x=np.sort(heights), data=heights))
I really like the answer by Raphvanns. It is helpful because it not only produces the plot, but it also helps me understand what the pdf, cdf, and ccdf are.
I have two things to add to Raphvanns's solution: (1) use collections.Counter wisely to make the process easier; (2) remember to sort the values (ascending) before calculating the pdf, cdf, and ccdf.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
Generate random numbers:
s = pd.Series(np.random.randint(1000, size=(1000)))
Build a dataframe as Raphvanns suggested:
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
df = df.sort_values('value')  # sort ascending before computing the cumulative columns
Calculate PDF, CDF, and CCDF:
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
Plot:
df.plot(x = 'value', y = ['cdf', 'ccdf'], grid = True)
You may wonder why we have to sort the values before calculating the PDF, CDF, and CCDF. Well, let's see what the results would be if we didn't sort them (note that dict(Counter(s)) keeps items in encounter order rather than sorted order, which is why we sorted above; in the following we will make the order explicitly random).
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# randomize the order of `value`:
df = df.sample(n=1000)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
df.plot(x = 'value', y = ['cdf'], grid = True)
This is the plot:
Why did it happen? Well, the essence of a CDF is "the number of data points we have seen so far", to cite YY's lecture slides from his Data Visualization class. Therefore, if the order of value is not sorted (either ascending or descending is fine), then when you plot, with the x-axis in ascending order, the y values will of course be just a mess.
If you apply a descending order, you can imagine that the CDF and CCDF will just swap their places:
I will leave a question to the readers of this post: if I randomize the order of value like above, will sorting value after (rather than before) calculating PDF, CDF, and CCDF solve the problem?
dic = dict(Counter(s))
df = pd.DataFrame(dic.items(), columns = ['value', 'frequency'])
# randomize the order of `value`:
df = df.sample(n=1000)
df['pdf'] = df.frequency/sum(df.frequency)
df['cdf'] = df['pdf'].cumsum()
df['ccdf'] = 1-df['cdf']
# Will this solve the problem?
df = df.sort_values(by='value')
df.plot(x = 'value', y = ['cdf'], grid = True)
Expanding on the answer of @wroscoe:
df[your_column].plot(kind = 'hist', histtype = 'step', density = True, cumulative = True)
You can also provide a number of desired bins.
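For example, with 50 bins (your_column remains a placeholder for one of your dataframe's columns):

df[your_column].plot(kind = 'hist', bins = 50, histtype = 'step', density = True, cumulative = True)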
If you're looking to plot a "true" empirical CDF, which jumps exactly at the values of your data set a, and with the jump at each value proportional to the frequency of the value, NumPy has built-in functions to do the work:
import matplotlib.pyplot as plt
import numpy as np
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    y = np.cumsum(counts)
    x = np.insert(x, 0, x[0])
    y = np.insert(y / y[-1], 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
The call to unique() returns the data values in sorted order along with their corresponding frequencies. The option drawstyle='steps-post' in the plot() call ensures that the jumps occur where they should. To force a jump at the smallest data value, the code inserts an additional element in front of x and y.
Example usage:
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
ecdf(xvec)
Another usage:
df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
ecdf(df['x'])
with output: