Normalizing a histogram with matplotlib - python

I want to plot a histogram with Matplotlib, but I'd like the bins' values to represent the percentage of the total observations. An MWE would look like this:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import seaborn as sns
import numpy
sns.set(style='dark')
imagen2 = plt.figure(1, figsize=(5, 2))
imagen2.suptitle('StackOverflow Matplotlib histogram demo')
luminance = numpy.random.randn(1000, 1000)
# "Luminance" should range from 0.0...1.0 so we normalize it
luminance = (luminance - luminance.min())/(luminance.max() - luminance.min())
top_left = plt.subplot(121)
top_left.imshow(luminance)
bottom_left = plt.subplot(122)
sns.distplot(luminance.flatten(), kde_kws={"cumulative": True})
# plt.savefig("stackoverflow.pdf", dpi=300)
plt.tight_layout(rect=(0, 0, 1, 0.95))
plt.show()
The CDF here is OK (range: [0, 1]), but the resulting histogram doesn't match my expectations:
Why are the histogram's results in the range [0, 4]? Is there any way to fix this?

What you think you want
Here's how to plot the histogram such that the bins sum to 1:
import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import seaborn as sns
import numpy as np
sns.set(style='dark')
imagen2 = plt.figure(1, figsize=(5, 2))
imagen2.suptitle('StackOverflow Matplotlib histogram demo')
luminance = np.random.randn(1000, 1000)
# "Luminance" should range from 0.0...1.0 so we normalize it
luminance = (luminance - luminance.min())/(luminance.max() - luminance.min())
# get the histogram values
heights,edges = np.histogram(luminance.flat, bins=30)
binCenters = (edges[:-1] + edges[1:])/2
# norm the heights
heights = heights/heights.sum()
# get the cdf
cdf = heights.cumsum()
left = plt.subplot(121)
left.imshow(luminance)
right = plt.subplot(122)
right.plot(binCenters, cdf, binCenters, heights)
# plt.savefig("stackoverflow.pdf", dpi=300)
plt.tight_layout(rect=(0, 0, 1, 0.95))
plt.show()
# confirm that the hist vals sum to 1
print('heights sum: %.2f' % heights.sum())
output:
heights sum: 1.00
The actual answer
This one is actually super easy. Just do
sns.distplot(luminance.flatten(), kde_kws={"cumulative": True}, norm_hist=True)
Here's what I get when I run your script with the above modification:
Surprise twist!
So it turns out that your histogram was normalized all along, as per the identity $\sum_i h_i \, w_i = 1$, where $h_i$ and $w_i$ are the height and width of bin $i$.
In plain(er) English, the general practice is to norm continuously valued histograms (i.e. those whose observations can be expressed as floating point numbers) in terms of their density. So in this case the sum of the bin widths times the bin heights will be 1.0, as you can see by running this simplified version of your script:
import matplotlib.pyplot as plt
import matplotlib.ticker as tck
import numpy as np
imagen2 = plt.figure(1, figsize=(4,3))
imagen2.suptitle('StackOverflow Matplotlib histogram demo')
luminance = np.random.randn(1000, 1000)
luminance = (luminance - luminance.min())/(luminance.max() - luminance.min())
heights,edges,patches = plt.hist(luminance.ravel(), density=True, bins=30)
widths = edges[1:] - edges[:-1]
totalWeight = (heights*widths).sum()
# plt.savefig("stackoverflow.pdf", dpi=300)
plt.tight_layout(rect=(0, 0, 1, 0.95))
plt.show()
print(totalWeight)
And totalWeight will indeed equal 1.0, give or take a smidge of rounding error.
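The difference between the two normalizations can be checked without plotting anything. The following is a minimal sketch, assuming synthetic data rescaled to [0, 1] in the same way as the question's luminance array; the seed, sample size, and variable names are illustrative choices, not part of the original scripts.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
values = rng.standard_normal(100_000)
# rescale to [0, 1], as the question does with its luminance array
values = (values - values.min()) / (values.max() - values.min())

counts, edges = np.histogram(values, bins=30)
widths = np.diff(edges)

# density normalization: bar AREAS sum to 1 (what density=True / norm_hist=True plot)
density = counts / (counts.sum() * widths)
# probability-mass normalization: bar HEIGHTS sum to 1
mass = counts / counts.sum()

print("area under density bars:", (density * widths).sum())
print("sum of mass bars:", mass.sum())
print("tallest density bar:", density.max())
```

Because each bin here is only about 1/30 wide, the density bars have to be taller than 1 for their areas to add up to 1, which is why the y-axis of the original plot runs well past 1.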

tel's answer is great! I just want to provide an alternative that gives you the histogram you want in fewer lines. The key idea is to use the weights argument of the matplotlib hist function to normalize the counts. You can replace your sns.distplot(luminance.flatten(), kde_kws={"cumulative": True}) with the following three lines of code:
lf = luminance.flatten()
sns.kdeplot(lf, cumulative=True)
sns.distplot(lf, kde=False,
hist_kws={'weights': numpy.full(len(lf), 1/len(lf))})
If you want to see the histogram on a second y-axis (which reads better), pass ax=bottom_left.twinx() to sns.distplot.
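The weights idea does not depend on seaborn. Here is a minimal sketch using Matplotlib's hist directly, assuming uniform random numbers as a stand-in for luminance.flatten() and a headless backend so it runs without a display; both are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, swap for your usual one
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
lf = rng.random(5000)  # stand-in for luminance.flatten()

# every observation contributes 1/N, so the bar heights add up to 1
weights = np.full(len(lf), 1 / len(lf))

fig, ax = plt.subplots()
heights, edges, patches = ax.hist(lf, bins=20, weights=weights)
print("sum of bar heights:", heights.sum())
```

The bin edges are unaffected by the weights; only the heights are rescaled.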

Related

How can you make a python histogram percentage sum to 100%?

I am struggling to make a histogram plot where the total percentage of events sums to 100%. Instead, for this particular example, it sums to approximately 3%. Will anyone be able to show me how I make the percentages of my events sum to 100% for any array used?
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
data = np.array([0,9,78,6,44,23,88,77,12,29])
length_of_data = len(data) # Length of data
bins = int(np.sqrt(length_of_data)) # Choose number of bins
y = data
plt.title('Histogram')
plt.ylabel('Percentage Of Events')
plt.xlabel('bins')
plt.hist(y,bins=bins, density = True)
plt.show()
print(bins)
One way of doing it is to take the bin heights that plt.hist returns and then reset the patch heights to the normalized values you want. It's not that involved once you know what to do, though it's not ideal either. Here's your case:
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
plt.gca().yaxis.set_major_formatter(PercentFormatter(100)) # <-- changed here
data = np.array([0,9,78,6,44,23,88,77,12,29])
length_of_data = len(data) # Length of data
bins = int(np.sqrt(length_of_data)) # Choose number of bins
y = data
plt.title('Histogram')
plt.ylabel('Percentage Of Events')
plt.xlabel('bins')
#### Setting new heights
n, bins, patches = plt.hist(y, bins=bins, density = True, edgecolor='k')
scaled_n = n / n.sum() * 100
for new_height, patch in zip(scaled_n, patches):
    patch.set_height(new_height)
####
# Setting cumulative sum as verification
plt.plot((bins[1:] + bins[:-1])/2, scaled_n.cumsum())
# If you want the cumsum to start from 0, uncomment the line below
#plt.plot(np.concatenate([[0], (bins[1:] + bins[:-1])/2]), np.concatenate([[0], scaled_n.cumsum()]))
plt.ylim(top=110)
plt.show()
This is the resulting picture:
As others said, you can use seaborn. Here's how to reproduce my code above. You'd still need to add all the labels and styling you want.
import seaborn as sns
sns.histplot(data, bins=int(np.sqrt(length_of_data)), stat='percent')
sns.histplot(data, bins=int(np.sqrt(length_of_data)), stat='percent', cumulative=True, element='poly', fill=False, color='C1')
This is the resulting picture:
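Why the original attempt summed to roughly 3% can be worked out by hand. With density=True each height is count / (N * bin_width), so the heights sum to 1 / bin_width rather than to 1. The sketch below reproduces that arithmetic with NumPy on the question's own data; the variable names are illustrative.

```python
import numpy as np

data = np.array([0, 9, 78, 6, 44, 23, 88, 77, 12, 29])
bins = int(np.sqrt(len(data)))  # 3 bins, as in the question

counts, edges = np.histogram(data, bins=bins)
width = edges[1] - edges[0]     # all bins share the same width here

density = counts / (counts.sum() * width)   # what density=True plots
percent = counts / counts.sum() * 100       # share of events per bin

print("bin width:", width)
print("sum of density heights:", density.sum())  # equals 1 / width, about 0.034
print("percent per bin:", percent)               # [60. 10. 30.]
```

The percentages depend only on the counts, so they add up to 100 for any array and any choice of bins.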

How to change yticks in PSD plot?

Download Zip with signal.csv
I can create a psd plot like this:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.read_csv('signal.csv')/1000
dt = df.iloc[1,0] - df.iloc[0,0]
data=df['1']
print(f'time delta: {dt*1e12:.0f} ps')
print(f'time: {(len(data)*dt*1e6):.3f} \u03BCs')
resolution = 2000
plt.psd(data,
resolution,
1/0.1,
lw=1,
color='red');
plt.xlim(0,5)
plt.xlabel('Frequency / GHz')
plt.ylabel('Power Spectral Density / dB/Hz')
plt.show()
How can I change its y-tick spacing? When I try adding:
plt.yticks(np.arange(-100,0,10))
the whole graph is morphed:
How can I change the psd plot so that the y-axis is ticked in steps of 10 without changing the plot?
You just have to set your y-lim again:
plt.ylim((-72,-25))
and you can adjust the limits to match your desired output.
Edit
If you want to make it automatic you can use axs.get_ylim():
fig, axs = plt.subplots(figsize=(10, 10),constrained_layout=True)
axs.psd(data,resolution,1/0.1,lw=1,color='red')
axs.set_xlim(0,5)
axs.set_xlabel('Frequency / GHz')
axs.set_ylabel('Power Spectral Density / dB/Hz')
ylim=axs.get_ylim()
axs.set_yticks(np.arange(-100,0,10))
axs.set_ylim(ylim)
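A different technique from the one above, offered only as an alternative sketch: a tick locator changes the tick spacing without touching the view limits, so there is nothing to restore afterwards. The example assumes a synthetic signal in place of signal.csv and a headless backend; the 1.5 GHz tone and noise level are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, swap for your usual one
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MultipleLocator

rng = np.random.default_rng(2)
fs = 10.0                      # sampling rate, matching the 1/0.1 used above
t = np.arange(20_000) / fs
# synthetic stand-in for the data in signal.csv
signal = np.sin(2 * np.pi * 1.5 * t) + 0.1 * rng.standard_normal(t.size)

fig, ax = plt.subplots()
ax.psd(signal, NFFT=2000, Fs=fs, lw=1, color="red")

ylim_before = ax.get_ylim()
# place a major tick at every multiple of 10 dB; the limits stay as they are
ax.yaxis.set_major_locator(MultipleLocator(10))
ylim_after = ax.get_ylim()

print(ylim_before, ylim_after)
```

Setting explicit ticks with set_yticks widens the axis to include every tick, which is what distorted the plot in the question; a locator only decides where ticks fall inside the current range.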

Normal distribution appears too dense when plotted in matplotlib

I am trying to estimate the probability density function of my data. In my case, the data is a satellite image with shape 8200 x 8100.
Below is the code for the PDF (the function 'is_outlier' is borrowed from someone who posted it here). As we can see in figure 1, the PDF is too dense. I guess this is due to the thousands of pixels that make up the satellite image. This is very ugly.
My question is: how can I plot a PDF that is not too dense? Something like figure 2, for example.
lst = 'satellite_img.tif' #import the image
lst_flat = lst.flatten() #create 1D array
#the function below removes the outliers
def is_outlier(points, thres=3.5):
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)
    modified_z_score = 0.6745 * diff / med_abs_deviation
    return modified_z_score > thres
lst_flat = np.r_[lst_flat]
lst_flat_filtered = lst_flat[~is_outlier(lst_flat)]
fit = stats.norm.pdf(lst_flat_filtered, np.mean(lst_flat_filtered), np.std(lst_flat_filtered))
plt.plot(lst_flat_filtered, fit)
plt.hist(lst_flat_filtered, bins=30, density=True)
plt.show()
figure 1
figure 2
The issue is that the x values in the PDF plot are not sorted, so the plotted line goes back and forth between random points, creating the mess you see.
Two options:
Don't plot the line, just plot points (not great if you have lots of points, but will confirm if what I said above is right or not):
plt.plot(lst_flat_filtered, fit, 'bo')
Sort the lst_flat_filtered array before calculating the PDF and plotting it:
lst_flat = np.r_[lst_flat]
lst_flat_filtered = np.sort(lst_flat[~is_outlier(lst_flat)]) # Changed this line
fit = stats.norm.pdf(lst_flat_filtered, np.mean(lst_flat_filtered), np.std(lst_flat_filtered))
plt.plot(lst_flat_filtered, fit)
Here are some minimal examples showing these behaviours:
Reproducing your problem:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
lst_flat_filtered = np.random.normal(7, 5, 1000)
fit = stats.norm.pdf(lst_flat_filtered, np.mean(lst_flat_filtered), np.std(lst_flat_filtered))
plt.hist(lst_flat_filtered, bins=30, density=True)
plt.plot(lst_flat_filtered, fit)
plt.show()
Plotting points
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
lst_flat_filtered = np.random.normal(7, 5, 1000)
fit = stats.norm.pdf(lst_flat_filtered, np.mean(lst_flat_filtered), np.std(lst_flat_filtered))
plt.hist(lst_flat_filtered, bins=30, density=True)
plt.plot(lst_flat_filtered, fit, 'bo')
plt.show()
Sorting the data
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
lst_flat_filtered = np.sort(np.random.normal(7, 5, 1000))
fit = stats.norm.pdf(lst_flat_filtered, np.mean(lst_flat_filtered), np.std(lst_flat_filtered))
plt.hist(lst_flat_filtered, bins=30, density=True)
plt.plot(lst_flat_filtered, fit)
plt.show()
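With millions of pixels, sorting the whole array just to draw a smooth curve is more work than needed. A possible variation, sketched here under the assumption that only the fitted normal curve matters for the line, is to evaluate the PDF on a small, evenly spaced grid that is increasing by construction. The grid size of 200 and the synthetic samples are illustrative choices.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(3)
samples = rng.normal(7, 5, 1000)  # unsorted, like the flattened image

mu, sigma = samples.mean(), samples.std()

# 200 evenly spaced x values spanning the data; already in increasing order
grid = np.linspace(samples.min(), samples.max(), 200)
fit = stats.norm.pdf(grid, mu, sigma)

# plt.hist(samples, bins=30, density=True) followed by plt.plot(grid, fit)
# would then draw the histogram with a single smooth curve on top
print(grid[:3], fit[:3])
```

The line then has 200 vertices regardless of how many pixels the image contains.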

matplotlib overlay a normal distribution with stddev axis onto another plot

I have a series of data that I'm reading in from a tutorial site.
I've managed to plot the distribution of the TV column in that data. However, I also want to overlay a normal distribution curve with StdDev ticks on a second x-axis (so I can compare the two curves). I'm struggling to work out how to do it.
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# draw distribution curve
h = sorted(data.TV)
hmean = np.mean(h)
hstd = np.std(h)
pdf = stats.norm.pdf(h, hmean, hstd)
plt.plot(h, pdf)
Here is a diagram close to what I'm after, where x is the StdDeviations. All this example needs is a second x axis to show the values of data.TV
Not sure what you really want, but you could probably use a second axis like this:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('Advertising.csv', index_col=0)
fig, ax1 = plt.subplots()
# draw distribution curve
h = sorted(data.TV)
ax1.plot(h,'b-')
ax1.set_xlabel('TV')
ax1.set_ylabel('Count', color='b')
for tl in ax1.get_yticklabels():
    tl.set_color('b')
hmean = np.mean(h)
hstd = np.std(h)
pdf = stats.norm.pdf(h, hmean, hstd)
ax2 = ax1.twinx()
ax2.plot(h, pdf, 'r.')
ax2.set_ylabel('pdf', color='r')
for tl in ax2.get_yticklabels():
    tl.set_color('r')
plt.show()
Ok, assuming that you want to plot the distribution of your data, the fitted normal distribution with two x-axes, one way to achieve this is as follows.
Plot the normalized data together with the standard normal distribution. Then use matplotlib's twiny() to add a second x-axis to the plot. Use the same tick positions as the original x-axis on the second axis, but scale the labels so that you get the corresponding original TV values. The result looks like this:
Code
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import matplotlib.mlab as mlab
import math
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
h = sorted(data.TV)
hmean = np.mean(h)
hstd = np.std(h)
h_n = (h - hmean) / hstd
pdf = stats.norm.pdf( h_n )
# plot data
f,ax1 = plt.subplots()
ax1.hist( h_n, 20, density=True )
ax1.plot( h_n , pdf, lw=3, c='r')
ax1.set_xlim( [h_n.min(), h_n.max()] )
ax1.set_xlabel( r'TV $[\sigma]$' )
ax1.set_ylabel( r'Relative Frequency')
ax2 = ax1.twiny()
ax2.grid( False )
ax2.set_xlim( ax1.get_xlim() )
ax2.set_ylim( ax1.get_ylim() )
ax2.set_xlabel( r'TV' )
ticklocs = ax2.xaxis.get_ticklocs()
ticklocs = [ round( t*hstd + hmean, 2) for t in ticklocs ]
ax2.xaxis.set_ticklabels( [ str(t) for t in ticklocs ] )
plt.show()
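Relabelling the ticks by hand works, but the labels go stale if the plot is zoomed or the limits change. As an alternative sketch, not the method used above, Matplotlib's secondary_xaxis takes a pair of conversion functions and keeps the second axis in sync on its own. The example assumes uniform random numbers in place of data.TV and a headless backend; the helper names z_to_tv and tv_to_z are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, swap for your usual one
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(4)
tv = rng.uniform(0, 300, 200)  # synthetic stand-in for data.TV
mean, std = tv.mean(), tv.std()
z = np.sort((tv - mean) / std)  # standardized and sorted for a clean line

def z_to_tv(zv):
    """Map standard deviations back to original TV units."""
    return np.asarray(zv) * std + mean

def tv_to_z(v):
    """Map original TV units to standard deviations."""
    return (np.asarray(v) - mean) / std

fig, ax = plt.subplots()
ax.hist(z, bins=20, density=True)
ax.plot(z, stats.norm.pdf(z), lw=3, c="r")
ax.set_xlabel(r"TV $[\sigma]$")
ax.set_ylabel("Relative Frequency")

# the top axis shows the same positions expressed in original units
top = ax.secondary_xaxis("top", functions=(z_to_tv, tv_to_z))
top.set_xlabel("TV")
```

The two functions must be inverses of each other, otherwise the top axis will not line up with the bottom one.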

How to locate the median in a (seaborn) KDE plot?

I am trying to do a Kernel Density Estimation (KDE) plot with seaborn and locate the median. The code looks something like this:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
sns.set_palette("hls", 1)
data = np.random.randn(30)
sns.kdeplot(data, shade=True)
# x_median, y_median = magic_function()
# plt.vlines(x_median, 0, y_median)
plt.show()
As you can see I need a magic_function() to fetch the median x and y values from the kdeplot. Then I would like to plot them with e.g. vlines. However, I can't figure out how to do that. The result should look something like this (obviously the black median bar is wrong here):
I guess my question is not strictly related to seaborn and also applies to other kinds of matplotlib plots. Any ideas are greatly appreciated.
You need to:
Extract the data of the kde line
Integrate it to calculate the cumulative distribution function (CDF)
Find the value that makes CDF equal 1/2, that is the median
import numpy as np
import scipy.integrate
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_palette("hls", 1)
data = np.random.randn(30)
# draw the KDE as a plain line so its data can be read back, then shade under it
p = sns.kdeplot(data)
x, y = p.get_lines()[0].get_data()
p.fill_between(x, y, alpha=0.25)
# careful with the argument order: y comes first
# initial=0 prepends a 0 so the result has the same length as x
cdf = scipy.integrate.cumulative_trapezoid(y, x, initial=0)
nearest_05 = np.abs(cdf-0.5).argmin()
x_median = x[nearest_05]
y_median = y[nearest_05]
plt.vlines(x_median, 0, y_median)
plt.show()
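The same three steps can be carried out without reading data back from the plot. The following is a minimal sketch that assumes scipy.stats.gaussian_kde with its default bandwidth as the density estimate, so the curve may differ slightly from the one seaborn draws; the grid size and padding are illustrative choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
data = rng.standard_normal(30)

# step 1: evaluate the KDE on a grid wide enough to cover both tails
kde = gaussian_kde(data)
pad = 3 * data.std()
x = np.linspace(data.min() - pad, data.max() + pad, 2000)
y = kde(x)

# step 2: integrate with the trapezoid rule to get the CDF (starts at 0)
cdf = np.concatenate([[0.0], np.cumsum((y[1:] + y[:-1]) / 2 * np.diff(x))])

# step 3: the median is where the CDF is closest to 1/2
idx = np.abs(cdf - 0.5).argmin()
x_median, y_median = x[idx], y[idx]
print(x_median, y_median)
```

The resulting x_median and y_median can be passed to plt.vlines(x_median, 0, y_median) exactly as in the code above. Note that this is the median of the estimated density, which need not coincide with np.median(data).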
