Cumulative distribution function in numpy not reaching 1? - python

I am trying to plot a CDF over a histogram using matplotlib with the following code:
values, base = np.histogram(df['0'], bins=50)
cumulative = np.cumsum(values) / df['0'].sum()
# plot the cumulative function
plt.hist(df['0'], bins=50, density=True)
plt.plot(base[:-1], cumulative, c='blue')
plt.show()
However my plot ends up looking like this, where the CDF looks like it is nearing .007 or thereabouts, when I would expect it to reach 1:
I'm not sure what I'm doing wrong, but I'd appreciate any help

I think the problem is that you are normalizing the cumulative sum of the bin counts by the sum of the values in your dataframe. The quantity stored in values is the number of occurrences of df['0'] that fall inside each bin.
If you want the cumulative sum of the bins to reach 1, you need to normalize it by the total number of elements of df['0']:
cumulative = np.cumsum(values)/df['0'].values.shape[0]
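A minimal sketch of the corrected normalization, using synthetic data in place of df['0'] (the sample and the name `data` are assumptions made for illustration). Dividing the cumulative counts by the number of observations makes the last value exactly 1:

```python
import numpy as np

# Synthetic stand-in for df['0']; any 1D numeric data behaves the same way.
rng = np.random.default_rng(0)
data = rng.exponential(scale=100.0, size=1000)

values, base = np.histogram(data, bins=50)

# Normalize by the number of observations, not by the sum of the data values.
cumulative = np.cumsum(values) / len(data)

# The last entry is the fraction of observations at or below the last edge.
print(cumulative[-1])
```

Since every observation lands in exactly one bin, `values.sum()` gives the same denominator as `len(data)`.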

Related

Why density histogram shows a bit weird values on y-axis?

I have a dataframe with values:
user value
1 0
2 1
3 4
4 2
5 1
When I try to plot a histogram with density=True, it shows a pretty weird result:
df.plot(kind='hist', density=True)
I know for certain that the first bin covers almost 100% of the values, so the density in this case should be more than 0.8. But the plot shows something around 0.04.
How could that happen? Maybe I get the meaning of density wrong.
By the way, there are about 800,000 values in the dataframe, in case it's related. Here is a describe of the dataframe:
count 795846.000000
mean 5.220350
std 20.600285
min -3.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 247.000000
If you are interested in probability and not probability density I think you want to use weights instead of density. Take a look at this example to see the difference:
df = pd.DataFrame({'x':np.random.normal(loc=5, scale=10, size=80000)})
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 4))
df.plot(kind='hist', density=True, bins=np.linspace(-100, 100, 30), ax=ax0)
df.plot(kind='hist', bins=np.linspace(-100, 100, 30), weights=np.ones(len(df))/len(df), ax=ax1)
If you use density, the histogram is normalized so that its total area is 1; if you use weights of 1/N instead, it is normalized so that the bar heights sum to 1.
You understood the meaning of density wrong. Refer to the documentation of numpy histogram (I couldn't find the exact pandas one, but it is the same mechanism):
https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html
"Density ... If True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1"
This means that the sum of the histogram areas is one, not the sum of the heights. In particular you will get the probability to be in a bin by multiplying the height by the width of the bin.
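The relation between the two normalizations can be checked numerically. The sketch below uses np.histogram directly rather than the pandas plotting call, and reuses the sample and bins from the example above; the random seed is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5, scale=10, size=80000)
bins = np.linspace(-100, 100, 30)

# density=True: heights are probability densities, so the area sums to 1.
density, edges = np.histogram(x, bins=bins, density=True)
widths = np.diff(edges)
area = np.sum(density * widths)

# Weights of 1/N: heights are probabilities (fraction of samples per bin).
weights = np.ones(len(x)) / len(x)
prob, _ = np.histogram(x, bins=bins, weights=weights)

# The two are related by the bin width: probability = density * width.
print(area, prob.sum(), np.allclose(prob, density * widths))
```

With wide bins the density can look small (as in the 0.04 case in the question) even though a single bin holds most of the data, because the height is divided by the bin width.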

Why is my matplotlib.pyplot.hist not binning my data

I am attempting to create a histogram out of an array I made. When I plot the histogram it does not plot like a regular histogram it just gives me lines where my data points are.
I have attempted to set bins = [0,10,20,30,40,50,60,70,80,90] including with 0 and 100 on the ends. I've tried bins = range() and bins= 'auto'
array2 = np.random.uniform(10.0,100.0,size=(1,100))
# create a random array uniformly distributed between 10 and 100
print(array2)
plt.hist(array2)
# plot a histogram
plt.title('Histogram of a Uniformly Distributed Sample between 10 and 100')
plt.xlim(0,100)
plt.show()
I'm really new and I'm not sure how to paste pictures. The plot is just a bunch of vertical lines at the data points instead of a binned histogram. With some of the choices I make for bins= I sometimes end up with a completely blank plot. I would like to apologize if this has been dealt with before; I have not been able to find any previous questions that helped.
You create a 2D array with one row and 100 columns. matplotlib treats each column of a 2D input as a separate dataset, so you get 100 histograms, each containing a single value.
Use a 1D vector of data instead.
array2 = np.random.uniform(10.0,100.0,size=100)
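If the 2D array already exists, flattening it gives the same result as generating a 1D one. A small sketch (the names `array2d` and `array1d` are made up here); note that np.histogram flattens its input by itself, so the shape only matters for plt.hist:

```python
import numpy as np

array2d = np.random.uniform(10.0, 100.0, size=(1, 100))
print(array2d.shape)   # (1, 100): one row, 100 columns

# Flatten to one dimension so the values form a single dataset.
array1d = array2d.ravel()
print(array1d.shape)   # (100,)

# Ten bins of width 10 covering 0 to 100.
counts, edges = np.histogram(array1d, bins=range(0, 101, 10))
print(counts.sum())
```

Passing `array1d` to plt.hist with the same bins should then give one ordinary binned histogram.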

Density plot using seaborn

I'm trying to make a density plot of the hourly demand:
data
The 'hr' means different hours, 'cnt' means demand.
I know how to make a density plot such as:
sns.kdeplot(bike['hr'])
However, this only works when the demand for each hour is not given explicitly, so that each row can be counted as one unit of demand for its hour. Now that I know the demand count for each hour, how can I make a density plot of such data?
A density plot aims to show an estimate of a distribution. To make a graph showing the density of hourly demand, we would really expect to see many iid samples of demand, with time-stamps, i.e. one row per sample. Then a density plot would make sense.
But in the type of data here, where the demand ('cnt') is sampled regularly and aggregated over that sample period (the hour), a density plot is not directly meaningful. But a bar graph as a histogram does make sense, using the hours as the bins.
Below I show how to use pandas functions to produce such a plot -- really simple. For reference I also show how we might produce a density plot, through a sort of reconstruction of "original" samples.
df = pd.read_csv("../data/hour.csv") # load dataset, inc cols hr, cnt, no NaNs
# using the bar plotter built in to pandas objects
fig, ax = plt.subplots(1,2)
df.groupby('hr').agg({'cnt':sum}).plot.bar(ax=ax[0])
# reconstructed samples - has df.cnt.sum() rows, each one containing an hour of a rental.
samples = np.hstack([ np.repeat(h, df.cnt.iloc[i]) for i, h in enumerate(df.hr)])
# plot a density estimate
sns.kdeplot(samples, bw=0.5, lw=3, c="r", ax=ax[1])
# to make a useful comparison with a density estimate, we need to have our bar areas
# sum up to 1, so we use groupby.apply to divide by the total of all counts.
tot = float(df.cnt.sum())
df.groupby('hr').apply(lambda x: x['cnt'].sum()/tot).plot.bar(ax=ax[1], color='C0')
Demand for bikes seems to be low during the night... But it is also apparent that they are probably used for commuting, with peaks at hours 8am and 5-6pm.
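The reconstruction step can also be written as a single np.repeat call, because np.repeat accepts an array of per-element repeat counts. A sketch on made-up numbers (not the bike dataset; `hr` and `cnt` here are small hypothetical arrays):

```python
import numpy as np

# Hypothetical aggregated data: hour of day and total demand in that hour.
hr = np.array([0, 1, 2, 3])
cnt = np.array([2, 0, 3, 1])

# One entry per rental: each hour is repeated as many times as its count.
samples = np.repeat(hr, cnt)
print(samples)          # [0 0 2 2 2 3]

# Normalized bar heights: the share of total demand in each hour.
share = cnt / cnt.sum()
print(share.sum())
```

With the dataframe from the answer, the equivalent call would presumably be np.repeat(df.hr, df.cnt).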

What is y axis in seaborn distplot?

I have some geometrically distributed data. When I want to take a look at it, I use
sns.distplot(data, kde=False, norm_hist=True, bins=100)
which results in this picture:
However, the bin heights don't add up to 1, which means the y axis doesn't show probability; it's something different. If instead we use
weights = np.ones_like(np.array(data))/float(len(np.array(data)))
plt.hist(data, weights=weights, bins = 100)
the y axis shall show probability, as bins heights sum up to 1:
It can be seen more clearly here: suppose we have a list
l = [1, 3, 2, 1, 3]
We have two 1s, two 3s and one 2, so their respective probabilities are 2/5, 2/5 and 1/5. When we use seaborn distplot with 3 bins:
sns.distplot(l, kde=False, norm_hist=True, bins=3)
we get:
As you can see, the 1st and the 3rd bin sum up to 0.6+0.6=1.2 which is already greater than 1, so y axis is not a probability. When we use
weights = np.ones_like(np.array(l))/float(len(np.array(l)))
plt.hist(l, weights=weights, bins = 3)
we get:
and the y axis is probability, as 0.4+0.4+0.2=1 as expected.
The number of bins is the same for both methods in each case: 100 bins for the geometrically distributed data, 3 bins for the small list l with 3 possible values. So the number of bins is not the issue.
My question is: in seaborn distplot called with norm_hist=True, what is the meaning of y axis?
From the documentation:
norm_hist : bool, optional
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
So you need to take into account your bin width as well, i.e. compute the area under the curve and not just the sum of the bin heights.
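For the small list from the question this can be worked out with np.histogram, which applies the same density normalization (this is a sketch of the arithmetic, not seaborn's own code path):

```python
import numpy as np

l = [1, 3, 2, 1, 3]
heights, edges = np.histogram(l, bins=3, density=True)
widths = np.diff(edges)

print(heights)                   # approximately [0.6 0.3 0.6]
print(widths)                    # each bin is 2/3 wide
print(heights * widths)          # per-bin probabilities, about [0.4 0.2 0.4]
print(np.sum(heights * widths))  # total area under the histogram
```

The heights 0.6, 0.3 and 0.6 are the probabilities 0.4, 0.2 and 0.4 divided by the bin width 2/3, which is why they add up to more than 1 while the area is still 1.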
The x-axis is the value of the variable just like in a histogram, but what exactly does the y-axis represent?
ANS-> The y-axis in a density plot is the probability density function for the kernel density estimation. However, we need to be careful to specify this is a probability density and not a probability. The difference is the probability density is the probability per unit on the x-axis. To convert to an actual probability, we need to find the area under the curve for a specific interval on the x-axis. Somewhat confusingly, because this is a probability density and not a probability, the y-axis can take values greater than one. The only requirement of the density plot is that the total area under the curve integrates to one. I generally tend to think of the y-axis on a density plot as a value only for relative comparisons between different categories.
from the reference of https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
This code will help you make something like this :
sns.set_style("whitegrid")
ax = sns.displot(data=df_p,
x='Volume_Tonnes', kind='kde', fill=True, height=5, aspect=2)
# Here you can define the x limit
ax.set(xlim=(-50,100))
ax.set(xlabel = 'Volume Tonnes', ylabel = 'Probability Density')
ax.fig.suptitle("Volume Tonnes Distribution",
fontsize=20, fontdict={"weight": "bold"})
plt.show()

Creating a matplotlib or seaborn histogram which uses percent rather than count?

Specifically I'm dealing with the Kaggle Titanic dataset. I've plotted a stacked histogram which shows ages that survived and died upon the titanic. Code below.
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Age'], data[data['Survived']==0]['Age']], stacked=True, bins=30, label=['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()
I would like to alter the chart to show a single chart per bin of the percentage in that age group that survived. E.g. if a bin contained the ages between 10-20 years of age and 60% of people aboard the titanic in that age group survived, then the height would line up 60% along the y-axis.
Edit: I may have given a poor explanation to what I'm looking for. Rather than alter the y-axis values, I'm looking to change the actual shape of the bars based on the percentage that survived.
The first bin on the graph shows roughly 65% survived in that age group. I would like this bin to line up against the y-axis at 65%. The following bins look to be 90%, 50%, 10% respectively, and so on.
The graph would end up actually looking something like this:
For Seaborn, use the parameter stat. According to the documentation, currently supported values for the stat parameter are:
count shows the number of observations
frequency shows the number of observations divided by the bin width
density normalizes counts so that the area of the histogram is 1
probability normalizes counts so that the sum of the bar heights is 1
The result with stat being count:
seaborn.histplot(
data=data,
x='variable',
discrete=True,
stat='count'
)
The result after stat is changed to probability:
seaborn.histplot(
data=data,
x='variable',
discrete=True,
stat='probability'
)
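The four stat values listed above can be reproduced from raw bin counts. The following is a sketch of the arithmetic implied by those definitions, on made-up data with unequal bin widths; it is not seaborn's implementation:

```python
import numpy as np

data = np.array([1, 1, 2, 3, 3, 3, 4, 6])   # made-up observations
counts, edges = np.histogram(data, bins=[0, 2, 4, 8])
widths = np.diff(edges)
n = counts.sum()

count = counts                      # number of observations per bin
frequency = counts / widths         # observations per unit of x
density = counts / (n * widths)     # area of the histogram is 1
probability = counts / n            # bar heights sum to 1

print(np.sum(density * widths), probability.sum())
```

Only density and frequency depend on the bin width, so with equal-width bins all four produce the same shape and differ only in the y axis scale.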
Perhaps the following will help ...
Split the dataframe based on 'Survived'
df_survived=df[df['Survived']==1]
df_not_survive=df[df['Survived']==0]
Create Bins
age_bins=np.linspace(0,80,21)
Use np.histogram to generate histogram data
survived_hist=np.histogram(df_survived['Age'],bins=age_bins,range=(0,80))
not_survive_hist=np.histogram(df_not_survive['Age'],bins=age_bins,range=(0,80))
Calculate survival rate in each bin
surv_rates=survived_hist[0]/(survived_hist[0]+not_survive_hist[0])
Plot
plt.bar(age_bins[:-1],surv_rates,width=age_bins[1]-age_bins[0])
plt.xlabel('Age')
plt.ylabel('Survival Rate')
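One thing to watch for in the per-bin division is an age bin with no passengers at all, where the rate is 0/0. A possible way to handle this, sketched on made-up ages and survival flags (not the Titanic data), is to divide only where the bin is non-empty:

```python
import numpy as np

# Made-up ages and survival flags, standing in for the Titanic columns.
age = np.array([5, 8, 25, 27, 29, 45, 47])
survived = np.array([1, 1, 0, 1, 0, 0, 1])
age_bins = np.linspace(0, 80, 5)            # edges: 0, 20, 40, 60, 80

surv_counts, _ = np.histogram(age[survived == 1], bins=age_bins)
total_counts, _ = np.histogram(age, bins=age_bins)

# Divide only where the bin is non-empty; empty bins stay NaN.
surv_rates = np.full(len(total_counts), np.nan)
np.divide(surv_counts, total_counts, out=surv_rates, where=total_counts > 0)
print(surv_rates)
```

plt.bar simply draws nothing for NaN heights, so the empty age range shows up as a gap rather than a misleading zero.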
pd.Series.hist uses np.histogram underneath.
Let's explore that
np.random.seed([3,1415])
s = pd.Series(np.random.randn(100))
# density=True replaces the older normed=True argument, which newer NumPy versions no longer accept
d = np.histogram(s, density=True)
print('\nthese are the normalized counts\n')
print(d[0])
print('\nthese are the bin edges\n')
print(d[1])
these are the normalized counts
[ 0.11552497 0.18483996 0.06931498 0.32346993 0.39278491 0.36967992
0.32346993 0.25415494 0.25415494 0.02310499]
these are the bin edges
[-2.25905503 -1.82624818 -1.39344133 -0.96063448 -0.52782764 -0.09502079
0.33778606 0.77059291 1.20339976 1.6362066 2.06901345]
We can plot these while calculating the bin centers (the mean of each pair of adjacent bin edges)
pd.Series(d[0], pd.Series(d[1]).rolling(2).mean().dropna().round(2).values).plot.bar()
ACTUAL ANSWER
OR
We could have simply passed density=True (normed=True in older versions) to the pd.Series.hist method, which passes it along to np.histogram
s.hist(density=True)
The library Dexplot is capable of returning relative frequencies of groups. Currently, you'll need to bin the age variable in pandas with the cut function. You can then, use Dexplot.
titanic['age2'] = pd.cut(titanic['age'], range(0, 110, 10))
Pass the variable you would like to count (age2) to the count function. Subdivide the counts with the split parameter and normalize by age2. Also, this might be a good time for a stacked bar plot
dxp.count('age2', data=titanic, split='survived', stacked=True, normalize='age2')
First of all it would be better if you create a function that splits your data in age groups
# This function splits a column of ages into predefined age groups
def cutDF(df):
    return pd.cut(
        df, [0, 10, 20, 30, 40, 50, 60, 70, 80],
        labels=['0-10', '11-20', '21-30', '31-40', '41-50', '51-60', '61-70', '71-80'])
data['AgeGroup'] = cutDF(data['Age'])
Then you can plot your graph as follows:
survival_per_age_group = data.groupby('AgeGroup')['Survived'].mean()
# Create the plot showing the survival rate per age group
ax = survival_per_age_group.plot(kind='bar', color='green')
ax.set_title("Survivors by Age Group", fontsize=14, fontweight='bold')
ax.set_xlabel("Age Groups")
ax.set_ylabel("Percentage")
# Hide the ticks on the top and right sides (booleans replace the old 'off' strings)
ax.tick_params(axis='x', top=False)
ax.tick_params(axis='y', right=False)
plt.xticks(rotation='horizontal')
# Import the relevant function to format the y axis as percentages
from matplotlib.ticker import FuncFormatter
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
plt.show()
