plt.hist's density argument does not work.
I tried to use the density argument in the plt.hist function to normalize stock returns in my plot, but it didn't work.
The following code worked fine for me and gave me the probability density function I wanted.
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(19680801)
# example data
mu = 100 # mean of distribution
sigma = 15 # standard deviation of distribution
x = mu + sigma * np.random.randn(437)
num_bins = 50
plt.hist(x, num_bins, density=1)
plt.show()
But when I tried it with stock data, it simply didn't work. The result gave the unnormalized data. I didn't find any abnormal data in my data array.
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
plt.hist(returns, 50,density = True)
plt.show()
# "returns" is a np array consisting of 360 days of stock returns
This is a known issue in Matplotlib.
As stated in Bug Report: The density flag in pyplot.hist() does not work correctly
When density = False, the histogram plot would have counts on the Y-axis. But when density = True, the Y-axis does not mean anything useful. I think a better implementation would plot the PDF as the histogram when density = True.
The developers view this as a feature, not a bug, since it maintains compatibility with numpy. They have already closed several bug reports about it on the grounds that it is working as intended. Adding to the confusion, the example on the matplotlib site appears to show this feature working, with the y-axis being assigned a meaningful value.
What you want to do with matplotlib is reasonable but matplotlib will not let you do it that way.
It is not a bug.
The area of the bars sums to 1.
The numbers only seem strange because your bin widths are small.
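To check the claim about areas numerically, here is a minimal sketch using np.histogram, which plt.hist relies on for binning. The sample mirrors the small-sigma example used elsewhere in this thread; the variable names are only illustrative.

```python
import numpy as np

np.random.seed(19680801)
sigma = 0.0000625                      # tiny spread, so the bins are very narrow
x = sigma * np.random.randn(437)

# density=True rescales the heights so that sum(height * bin_width) == 1
heights, edges = np.histogram(x, bins=50, density=True)
widths = np.diff(edges)
area = np.sum(heights * widths)

print(area)            # total area of the bars
print(heights.max())   # far above 1, because every bin width is tiny
```

The heights are densities (probability per unit of x), so they can exceed 1 whenever the bins are narrower than 1.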
Since this isn't resolved: @user14518925's response is actually correct. The density option treats the bin width as an actual number, whereas from my understanding you want each bin to have a width of 1, so that the frequencies sum to 1. More succinctly, what you're seeing is:
\sum_{i}y_{i}\times\text{bin size} =1
Whereas what you want is:
\sum_{i}y_{i} =1
Therefore, all you really need to change is the tick labels on the y-axis. One way to do this is to disable the density option:
density = False
and instead divide by the total sample size (shown here with your example):
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(19680801)
# example data
mu = 0 # mean of distribution
sigma = 0.0000625 # standard deviation of distribution
x = mu + sigma * np.random.randn(437)
fig = plt.figure()
plt.hist(x, 50, density=False)
locs, _ = plt.yticks()
print(locs)
plt.yticks(locs,np.round(locs/len(x),3))
plt.show()
Another approach, besides that of tvbc, is to change the yticks on the plot.
import matplotlib.pyplot as plt
import numpy as np
steps = 10
bins = np.arange(0, 101, steps)
data = np.random.random(100000) * 100
plt.hist(data, bins=bins, density=True)
yticks = plt.gca().get_yticks()
plt.yticks(yticks, np.round(yticks * steps, 2))
plt.show()
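If the goal is bar heights that sum to 1 without relabelling ticks, one further option (not taken from the answers above, so treat it as a suggestion) is the weights argument, which both np.histogram and plt.hist accept. A minimal sketch, reusing the small-sigma sample from earlier in the thread:

```python
import numpy as np

np.random.seed(19680801)
x = 0.0000625 * np.random.randn(437)

# Each sample contributes 1/N, so each bar height is a fraction of the sample
weights = np.ones_like(x) / len(x)
fractions, edges = np.histogram(x, bins=50, weights=weights)

print(fractions.sum())
# The same weights can be passed to plt.hist(x, 50, weights=weights)
```

With this approach the y-axis values themselves are the fractions, so no tick manipulation is needed.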
Related
Problem
I have a spectrum that can be downloaded here: https://www.dropbox.com/s/ax1b32aotuzx9f1/example_spectrum.npy?dl=0
Using Python, I am trying to use zero padding to increase the number of points in the frequency domain. To do so I rely on the scipy.fft and scipy.ifft functions. I do not obtain the desired result, and would be grateful to anyone who could explain why.
Code
Here is the code I have tried:
import numpy as np
from scipy.fft import fft, ifft
import matplotlib.pyplot as plt
spectrum = np.load('example_spectrum.npy')
spectrum_time = ifft(spectrum) # In time domain
spectrum_oversampled = fft(spectrum_time, len(spectrum)+1000) # FFT of zero padded spectrum
xaxis = np.linspace(0, len(spectrum)-1, len(spectrum_oversampled)) # to plot oversampled spectrum
fig, (ax1, ax2) = plt.subplots(2,1)
ax1.plot(spectrum, '.-')
ax1.plot(xaxis, spectrum_oversampled)
ax1.set_xlim(500, 1000)
ax1.set_xlabel('Arbitrary units')
ax1.set_ylabel('Normalized flux')
ax1.set_title('Frequency domain')
ax2.plot(spectrum_time)
ax2.set_ylim(-0.02, 0.02)
ax2.set_title('Time domain')
ax2.set_xlabel('bin number')
plt.tight_layout()
plt.show()
Results
Added figure to show results. Blue is original spectrum, orange is zero padded spectrum.
Expected behavior
I would expect the zero padding to result in a sort of sinc interpolation of the original spectrum. However, the orange curve does not go through the points of the original spectrum.
Does anyone have any idea why I obtain this behavior and/or how to fix this?
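One possible cause, offered as a sketch rather than a confirmed diagnosis since the data file is not reproduced here: ifft stores the negative "times" in the second half of its output, so appending zeros at the end pads only one side of the signal. Inserting the zeros in the middle keeps both halves in place. In addition, the oversampled point m sits at position m * N / M in original bin units, which is not quite what linspace(0, N-1, M) produces. The example below uses numpy.fft and a synthetic spectrum, both of which are assumptions made so the block is self-contained.

```python
import numpy as np

def interpolate_spectrum(spectrum, factor):
    """Resample `spectrum` on a grid `factor` times finer via zero padding."""
    n = len(spectrum)
    m = n * factor
    time_signal = np.fft.ifft(spectrum)
    half = (n + 1) // 2                        # entries at non-negative "times"
    padded = np.zeros(m, dtype=complex)
    padded[:half] = time_signal[:half]         # non-negative times stay at the start
    padded[m - (n - half):] = time_signal[half:]   # negative times stay at the end
    return np.fft.fft(padded)

# Synthetic stand-in for the real spectrum (odd length keeps the result real)
k = np.arange(101)
spectrum = np.exp(-0.5 * ((k - 50) / 5.0) ** 2)
factor = 4
fine = interpolate_spectrum(spectrum, factor)
fine_axis = np.arange(len(fine)) / factor      # positions in original bin units

print(np.allclose(fine[::factor], spectrum))
```

Using an integer oversampling factor means every original sample coincides with a point of the fine grid, which makes it easy to check that the interpolated curve passes through the original points.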
I am playing around with PP plots in statsmodels, and I wonder why comparing a normal distribution with scale = 5 and loc = 20 to the standard normal distribution results in a straight line on the PP plot when the distributions are so different. Sample code is below:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt  # plt.show() is called below
test = np.random.normal(20, 5, 100000)
pp = sm.ProbPlot(test, loc=0, scale=1)
fig = pp.ppplot()
plt.show()
You can try to reduce the sample size and you will see the effect.
test = np.random.normal(20, 5, 100)
pp = sm.ProbPlot(test, loc=0, scale=1, fit=False).ppplot(line='45')
plt.show()
If fit is false, loc, scale, and distargs are passed to the distribution. If fit is True then the parameters for dist are fit automatically using dist.fit. The quantiles are formed from the standardized data, after subtracting the fitted loc and dividing by the fitted scale. fit cannot be used if dist is a SciPy frozen distribution.
Could you please help me modify the code to get a histogram whose bin counts include the right bin edge, i.e. bins[i-1] < x <= bins[i] (and not the left edge, as is the default)?
import matplotlib.pyplot as plt
import numpy as np
data = [0,1,2,3,4]
binwidth = 1
plt.hist(data, bins=np.arange(min(data), max(data) + binwidth, binwidth))
plt.xlabel('Time')
plt.ylabel('Counts')
plt.show()
I do not think there is an option to do it explicitly in either matplotlib or numpy.
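However, you may use np.histogram() on the negated data (and negated bins), then negate the returned bin edges and plot the counts with the plt.bar() function.
data = np.asarray(data)  # a plain list cannot be negated
bins = np.arange(min(data), max(data) + binwidth, binwidth)
# Negating data and bins turns left-closed bins into right-closed ones
hist, bins_hist = np.histogram(-data, bins=sorted(-bins))
# Negate the edges back; the counts themselves stay positive
plt.bar(-bins_hist[1:], hist, width=np.diff(bins_hist), align='edge')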
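The negation trick can be checked without plotting. The helper below is a hypothetical wrapper (its name and return order are my own choices, not part of NumPy) that returns the counts in ascending bin order:

```python
import numpy as np

def right_closed_histogram(data, bins):
    """Count samples in bins (bins[i-1], bins[i]] by negating data and edges."""
    data = np.asarray(data)
    bins = np.asarray(bins)
    counts, neg_edges = np.histogram(-data, bins=np.sort(-bins))
    # Reverse so that counts and edges are in ascending order again
    return counts[::-1], -neg_edges[::-1]

data = [0, 1, 2, 3, 4]
bins = np.arange(0, 5, 1)                  # edges 0, 1, 2, 3, 4
counts, edges = right_closed_histogram(data, bins)
default_counts, _ = np.histogram(data, bins=bins)

print(counts)          # right-closed counts
print(default_counts)  # NumPy's default left-closed counts
```

Note that the first bin ends up closed on both sides, mirroring the way NumPy closes the last bin by default.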
I have started to use Python for analysis. I would like to do the following:
Get the distribution of dataset
Get the peaks in this distribution
I used gaussian_kde from scipy.stats to estimate the kernel density function. Does gaussian_kde make any assumptions about the data? My data changes over time, so if it follows one distribution now (e.g. Gaussian), it could follow another later. Does gaussian_kde have any drawbacks in this scenario? It was suggested in another question to fit the data against every distribution in order to identify the data distribution, so what is the difference between using gaussian_kde and the answer provided in that question? With the code below, I would also like to know whether gaussian_kde is a good way to estimate the pdf if the data changes over time. I know one advantage of gaussian_kde is that it calculates the bandwidth automatically by a rule of thumb, as described here. Also, how can I get its peaks?
import pandas as pd
import numpy as np
import pylab as pl
import scipy.stats
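df = pd.read_csv(r'D:\dataset.csv')  # raw string keeps the backslash literal
data_column = df.iloc[:, 0]  # assumption: the data is in the first column
pdf = scipy.stats.gaussian_kde(data_column)
x = np.linspace(data_column.min() - 1, data_column.max() + 1, len(data_column))
y = pdf(x)
pl.plot(x, y, color = 'r')
pl.hist(data_column, density=True)  # 'normed' was removed in newer Matplotlib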
pl.show(block=True)
I think you need to distinguish non-parametric density (the one implemented in scipy.stats.kde) from parametric density (the one in the StackOverflow question you mention). To illustrate the difference between these two, try the following code.
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(0)
gaussian1 = -6 + 3 * np.random.randn(1700)
gaussian2 = 4 + 1.5 * np.random.randn(300)
gaussian_mixture = np.hstack([gaussian1, gaussian2])
df = pd.DataFrame(gaussian_mixture, columns=['data'])
# non-parametric pdf
nparam_density = stats.gaussian_kde(df.values.ravel())
x = np.linspace(-20, 10, 200)
nparam_density = nparam_density(x)
# parametric fit: assume normal distribution
loc_param, scale_param = stats.norm.fit(df)
param_density = stats.norm.pdf(x, loc=loc_param, scale=scale_param)
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(df.values, bins=30, density=True)  # 'normed' was removed in newer Matplotlib
ax.plot(x, nparam_density, 'r-', label='non-parametric density (smoothed by Gaussian kernel)')
ax.plot(x, param_density, 'k--', label='parametric density')
ax.set_ylim([0, 0.15])
ax.legend(loc='best')
From the graph, we see that the non-parametric density is nothing but a smoothed version of the histogram. In a histogram, for a particular observation x=x0, we use a bar to represent it (putting all probability mass on that single point x=x0 and zero elsewhere), whereas in non-parametric density estimation we use a bell-shaped curve (the Gaussian kernel) to represent that point (spreading it over its neighbourhood). The result is a smoothed density curve. This internal Gaussian kernel has nothing to do with your distributional assumption on the underlying data x. Its sole purpose is smoothing.
To get the mode of the non-parametric density, we need to do an exhaustive search, as the density is not guaranteed to be unimodal. As shown in the example above, if a quasi-Newton optimization algorithm starts in [5, 10], it is very likely to end up at a local optimum rather than the global one.
# get mode: exhaustive search over the evaluation grid
x[np.argmax(nparam_density)]
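The exhaustive search above returns only the global mode. To list every peak, one option is to look for local maxima of the density evaluated on a grid, for example with scipy.signal.find_peaks. This is a sketch under assumptions: the bimodal sample is synthetic, and the prominence threshold of 0.01 is an arbitrary value chosen to ignore tiny bumps in the tails.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

np.random.seed(0)
# Hypothetical bimodal sample: two well-separated Gaussians
sample = np.hstack([-5 + np.random.randn(5000), 5 + np.random.randn(5000)])

grid = np.linspace(-10, 10, 401)
density = gaussian_kde(sample)(grid)

# prominence filters out tiny bumps in the tails; 0.01 is an arbitrary choice
peak_idx, _ = find_peaks(density, prominence=0.01)
peak_locations = grid[peak_idx]
print(peak_locations)
```

The resolution of the peak locations is limited by the grid spacing, so a finer grid gives more precise positions at the cost of more density evaluations.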
I am discovering wavelets in practice thanks to the Python module pywt.
I have browsed some examples of the pywt module usage, but I could not grasp the essential step: basically, I don't know how to display the multidimensional output of a wavelet analysis with matplotlib.
This is what I tried (given one pyplot axis ax):
import pywt
data_1_dimension_series = [0,0.1,0.2,0.4,-0.1,-0.1,-0.3,-0.4,1.0,1.0,1.0,0]
# indeed my data_1_dimension_series is much longer
cA, cD = pywt.dwt(data_1_dimension_series, 'haar')
ax.set_xlabel('seconds')
ax.set_ylabel('wavelet affinity by scale factor')
ax.plot(axe_wt_time, zip(cA,cD))
or also
data_wt_analysis = pywt.dwt(data_1_dimension_series, 'haar')
ax.plot(axe_wt_time, data_wt_analysis)
Neither ax.plot(axe_wt_time, data_wt_analysis) nor ax.plot(axe_wt_time, zip(cA,cD)) is appropriate; both raise the error: x and y must have the same first dimension
The thing is that data_wt_analysis contains several 1D series, one for each wavelet scale factor.
I could certainly display as many graphs as there are scale factors, but I want them all in the same graph.
How can I simply display such data in a single graph with matplotlib?
Something like the colourful square below:
You should extract the different 1D series from your array of interest, and use matplotlib as in the simplest example:
import matplotlib.pyplot as plt
plt.plot([1,2,3,4])
plt.ylabel('some numbers')
plt.show()
(This example is from the Matplotlib docs.)
You wish to superimpose 1D plots (or line plots). So, if you have lists l1, l2, l3, you will do
import matplotlib.pyplot as plt
plt.plot(l1)
plt.plot(l2)
plt.plot(l3)
plt.show()
For a scalogram: what I used was imshow(). This was not for wavelets, but it is the same idea: a colormap.
I have found this sample that uses imshow() with wavelets; I haven't tried it, though.
from pylab import *
import pywt
import scipy.io.wavfile as wavfile
# Find the highest power of two less than or equal to the input.
def lepow2(x):
    # Cast to int so the result can be used as a slice index
    return int(2 ** floor(log2(x)))
# Make a scalogram given an MRA tree.
def scalogram(data):
    bottom = 0
    vmin = min(map(lambda x: min(abs(x)), data))
    vmax = max(map(lambda x: max(abs(x)), data))
    gca().set_autoscale_on(False)
    for row in range(0, len(data)):
        scale = 2.0 ** (row - len(data))
        imshow(
            array([abs(data[row])]),
            interpolation = 'nearest',
            vmin = vmin,
            vmax = vmax,
            extent = [0, 1, bottom, bottom + scale])
        bottom += scale
# Load the signal, take the first channel, limit length to a power of 2 for simplicity.
rate, signal = wavfile.read('kitten.wav')
signal = signal[0:lepow2(len(signal)),0]
tree = pywt.wavedec(signal, 'db5')
# Plotting.
gray()
scalogram(tree)
show()
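An alternative to drawing one imshow() strip per level is to stretch every coefficient array to a common width and stack the rows into a single 2D array. The sketch below is NumPy-only; coefficients_to_image is a hypothetical helper, and the coeffs list is a hand-made stand-in for the list of arrays that pywt.wavedec returns.

```python
import numpy as np

def coefficients_to_image(coeffs, width):
    """Stack coefficient arrays of different lengths into one 2D array."""
    rows = []
    for c in coeffs:
        c = np.abs(np.asarray(c, dtype=float))
        # Map each of the `width` columns to the coefficient that covers it
        idx = np.arange(width) * len(c) // width
        rows.append(c[idx])
    return np.vstack(rows)

# Hand-made stand-in for the list that pywt.wavedec would return
coeffs = [np.array([1.0, -2.0]), np.array([1.0, 2.0, 3.0, 4.0])]
image = coefficients_to_image(coeffs, width=4)
print(image)
# The array can then be shown with plt.imshow(image, aspect='auto', interpolation='nearest')
```

Here every level gets a row of equal height, unlike the sample above, which scales the row height with the level.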