My aim:
I have x, y and z values as arrays. For example:
x=np.array([10,2,-4,12,3,6,8,14])
y=np.array([5,5,-6,8,20,10,2,2])
z=np.array([4,6,10,40,22,14,20,8])
I want to plot a heatmap where the z-values act as the intensity or 'weight' for every pair of (x, y), and the axes are the x and y values, so the plot lies on an x-y plane. I want to lay a 'grid' on top of the plot by dividing the x-y plane into bins, then calculate the mean of the z-values within every bin and use that mean as the color or intensity for that bin. I also want to make a second plot, but there I want to plot the variance of the z-values as the intensity within the bins.
What I have done:
I coded it the following way, but I think I am misinterpreting things. I don't think I understand bins etc. well (I am new to programming).
import numpy as np
import matplotlib.pyplot as plt
x=np.array([10,2,-4,12,3,6,8,14])
y=np.array([5,5,-6,8,20,10,2,2])
z=np.array([4,6,10,40,22,14,-20,8])
# Bin the data onto a 2x2 grid
# Have to reverse x & y due to row-first indexing
zi, yi, xi = np.histogram2d(y, x, bins=(2,2), weights=z)
counts, _, _ = np.histogram2d(y, x, bins=(2,2))
#to get mean divide by counts
zi = zi / counts
print(zi)
zi = np.ma.masked_invalid(zi)
fig, ax = plt.subplots()
sc=ax.pcolormesh(xi, yi, zi, edgecolors='black')
sct = ax.scatter(x, y, c=z, s=200) #shows the points in the bins
fig.colorbar(sc)
ax.margins(0.05)
plt.show()
Where I am stuck:
I am not even sure if the above code is doing the right thing. So, feel free to forget it and advise me on any other standard way of solving this problem.
With the above code I get a plot whose axis limits are determined automatically by the dataset, but I want to keep my axes fixed at xmin=-20, xmax=20, ymin=-20, ymax=20.
Also, I am not sure how to manipulate the z-values within the bins to calculate other statistical quantities such as the variance or standard deviation.
EDIT: I now have better code that computes the mean z-values in the bins with np.histogram2d and plots them, and I can set the axes to my liking. However, np.histogram2d gives H as the sum of the values in each bin; I can get the mean from that, but not other statistical quantities such as the variance. I want a way to code this so that I have access to the z-values inside each bin, so I can calculate their variance and use that result as the weight/intensity of the heatmap (one way to do this is sketched after the code below).
I am attaching the plot for mean z in bins.
import numpy as np
import matplotlib.pyplot as plt
x=np.array([10,2,4,12,3,6,8,14])
y=np.array([5,5,6,8,20,10,2,2])
z=np.array([4,6,10,40,22,14,20,8])
x_bins = np.linspace(0, 20, 3)
y_bins = np.linspace(0, 20, 3)
H, xedges, yedges = np.histogram2d(x, y, bins = [x_bins, y_bins], weights = z)
H_counts, xedges, yedges = np.histogram2d(x, y, bins = [x_bins, y_bins])
print(H)
H1 = H/H_counts
print(H1)
plt.xlabel("x")
plt.ylabel("y")
plt.imshow(H1.T, origin='lower', cmap='RdBu',
extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.colorbar().set_label('mean z', rotation=270)
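One way to get direct access to the z-values inside each bin (so any statistic can be computed, not just the mean) is to label every point with its bin index using np.digitize and collect the values per cell. This is only a minimal sketch, reusing x, y, z, x_bins and y_bins from the code above:
import numpy as np
# Bin index of every point along x and y (0-based)
ix = np.digitize(x, x_bins) - 1
iy = np.digitize(y, y_bins) - 1
# Collect the z-values of each cell and compute whatever statistic is needed
var_grid = np.full((len(x_bins) - 1, len(y_bins) - 1), np.nan)
for i in range(len(x_bins) - 1):
    for j in range(len(y_bins) - 1):
        in_cell = z[(ix == i) & (iy == j)]
        if in_cell.size:
            var_grid[i, j] = in_cell.var()  # or .std(), np.median(in_cell), ...
# Note: unlike histogram2d, digitize puts a point equal to the last edge into the
# bin past the end, so handle the right edge separately if that matters to you.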
EDIT 2: When I use scipy.stats for the standard deviation I get the following plot.
The deep red bin at the top right is actually empty and contains no z-values, so I want the standard deviation there to be NaN instead of being assigned a value of 0. How can I do that? (One possibility is sketched after the code below.)
My code for this plot is:
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
x=np.array([10,2,4,12,3,6,8,14])
y=np.array([5,5,6,8,20,10,2,2])
z=np.array([4,6,10,40,22,14,20,8])
x_bins = np.linspace(0, 20, 3)
y_bins = np.linspace(0, 20, 3)
H, xedges, yedges = np.histogram2d(x, y, bins = [x_bins, y_bins], weights = z)
#mean = stats.binned_statistic_2d(x,y,z,statistic='',bins=[x_bins,y_bins])
#mean.statistic
std = stats.binned_statistic_2d(x,y,z,statistic='std',bins=[x_bins,y_bins])
#std.statistic
#print(std.statistic)
plt.xlabel("x")
plt.ylabel("y")
plt.imshow(std.statistic.T, origin='lower', cmap='RdBu',
extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
#plt.clim(0, 20)
plt.colorbar().set_label('std z', rotation=270)
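One possible way (a sketch, not the only option) is to also ask binned_statistic_2d for the counts and blank out the empty bins before plotting:
# Count the points per bin and set the statistic of empty bins to NaN
counts = stats.binned_statistic_2d(x, y, z, statistic='count',
                                   bins=[x_bins, y_bins])
std_z = std.statistic.copy()
std_z[counts.statistic == 0] = np.nan  # imshow leaves NaN cells blank
plt.imshow(std_z.T, origin='lower', cmap='RdBu',
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.colorbar().set_label('std z', rotation=270)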
Your data need to be interpolated onto a regular grid, since your computer does not know the z-value where there is no data point. Luckily, there is already a function for that: scipy.interpolate.griddata.
import numpy as np
from scipy.interpolate import griddata
import matplotlib.pyplot as plt
# Dummy data
x=np.array([10,2,-4,12,3,6,8,14])
y=np.array([5,5,-6,8,20,10,2,2])
z=np.array([4,6,10,40,22,14,-20,8])
# Create a regular grid along x and y axis
grid_x, grid_y = np.mgrid[x.min():x.max()+1, y.min():y.max()+1]
# Linear interpolation
# But you could also use a cubic interpolation or whatever you want/need
z_interpolated = griddata((x,y), z, (grid_x, grid_y), method='linear')
# Plot the result:
plt.imshow(z_interpolated, cmap='plasma')
And we obtain:
Notice that there are no values on the image boundary: your spatial domain is not defined beyond the values contained in x and y, so with a linear interpolation your computer cannot guess what the value should be beyond those points. The heatmap is therefore restricted to the convex hull formed by your points; anything else would be extrapolation.
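If you do need values outside the convex hull, a couple of options (just a sketch, using the same grid_x and grid_y as above) are nearest-neighbour interpolation, which covers the whole grid, or keeping the linear interpolation and choosing a constant fill_value:
# Nearest-neighbour interpolation assigns every grid node the z of its closest point
z_nearest = griddata((x, y), z, (grid_x, grid_y), method='nearest')
# Linear interpolation with a constant value outside the convex hull
z_filled = griddata((x, y), z, (grid_x, grid_y), method='linear', fill_value=0.0)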
Edit:
If you need to compute a two-dimensional binned statistic you can use:
scipy.stats.binned_statistic_2d()
In your case, if we want to compute the standard deviation (or variance) and the mean:
from scipy import stats
std = stats.binned_statistic_2d(x,y,z,statistic='std',bins=[x_bins,y_bins])
mean = stats.binned_statistic_2d(x,y,z,statistic='mean',bins=[x_bins,y_bins])
where mean.statistic is equivalent to your H/H_counts.
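You can check that directly, assuming H, H_counts, x_bins and y_bins from your earlier snippet are still in scope:
import numpy as np
# Both give NaN for empty bins, hence equal_nan=True
print(np.allclose(mean.statistic, H / H_counts, equal_nan=True))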
Summary of Question:
Why is the density from my sample so different from the pmf, and how can I perform this simulation so that the pmf and the sample estimates are similar?
Question:
I have simulated a sample of independent Bernoulli trials using scipy. I am now trying to take a density histogram of the sample I created and compare it to the pmf (probability mass function). I would expect the density histogram to show two bars, each hovering near the corresponding pmf value, but instead the two bars sit at a height of about 5, well above the pmf values. Could someone please show me how to create a density histogram that does not do this for the Bernoulli distribution? I tried a similar simulation with a few other distributions and it seemed to work fine. What am I missing here, and could you show me how to adjust my code to make this work?
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
trials = 10**3
p = 0.5
sample_bernoulli = stats.bernoulli.rvs(p, size=trials) # Generate Bernoulli RVs
plt.plot((0,1), stats.bernoulli.pmf((0,1), p), 'bo', ms=8, label='bernoulli pmf')
# Density histogram of generated values
plt.hist(sample_bernoulli, density=True, alpha=0.5, color='steelblue', edgecolor='none')
plt.show()
I must apologize if this is a simple or trivial question but I couldn't find a solution online and found the issue interesting. Any help at all would be appreciated.
The reason is that plt.hist is primarily meant to work with continuous distributions. If you don't provide explicit bin boundaries, plt.hist just creates 10 equally spaced bins between the minimum and maximum value. Most of these bins will be empty. With only two possible data values, there should be just two bins, so 3 boundaries:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
trials = 10**3
p = 0.5
sample_bernoulli = stats.bernoulli.rvs(p, size=trials) # Generate Bernoulli RVs
plt.plot((0,1), stats.bernoulli.pmf((0,1), p), 'bo', ms=8, label='bernoulli pmf')
# Density histogram of generated values
plt.hist(sample_bernoulli, density=True, alpha=0.5, color='steelblue', edgecolor='none', bins=np.linspace(-0.5, 1.5, 3))
plt.show()
Here is a visualization of the default bin boundaries and how the samples fit into the bins. Note that with density=True, the histogram is normalized such that the area of all the bars sums to 1. In this case two bars are 0.1 wide and about 5.0 high, while 8 others have height zero. So, the total area is 2*0.1*5 + 8*0.0 = 1.
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
trials = 10 ** 3
p = 0.5
sample_bernoulli = stats.bernoulli.rvs(p, size=trials) # Generate Bernoulli RVs
# Density histogram of generated values with default bins
values, binbounds, bars = plt.hist(sample_bernoulli, density=True, alpha=0.2, color='steelblue', edgecolor='none')
# show the bin boundaries
plt.vlines(binbounds, 0, max(values) * 1.05, color='crimson', ls=':')
# show the sample values with a random displacement
plt.scatter(sample_bernoulli * 0.9 + np.random.uniform(0, 0.1, trials),
np.random.uniform(0, max(values), trials), color='lime')
# show the index of each bin
for i in range(len(binbounds) - 1):
    plt.text((binbounds[i] + binbounds[i + 1]) / 2, max(values) / 2, i,
             ha='center', va='center', fontsize=20, color='crimson')
plt.show()
In Python, I have estimated the parameters for the density model of my distribution and I would like to plot the density function on top of the histogram of the data. In R this is similar to using the option prop=TRUE.
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
# initialization of the list "data"
# estimation of the parameter, in my case, mean and variance of a normal distribution
plt.hist(data, bins="auto") # data is the list of data
# here I would like to draw the density above the histogram
plt.show()
I guess the trickiest part is to make it fit.
Edit: I have tried this according to the first answer:
mean = np.mean(logdata)
var = np.var(logdata)
std = np.sqrt(var) # standard deviation (normpdf expects sigma, not the variance)
plt.hist(logdata, bins="auto", alpha=0.5, label="empirical data")
x = np.linspace(min(logdata), max(logdata), 100)
plt.plot(x, mlab.normpdf(x, mean, std))
plt.xlabel("log(file size)")
plt.ylabel("number of files")
plt.legend(loc='upper right')
plt.grid(True)
plt.show()
But it doesn't fit the histogram; here is how it looks:
Edit 2: It works with the option normed=True in the histogram function.
If I understand you correctly, you have the mean and standard deviation of some data. You have plotted a histogram of this and would like to plot the normal distribution line over the histogram. This line can be generated using matplotlib.mlab.normpdf(); the documentation can be found here.
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
mean = 100
sigma = 5
data = np.random.normal(mean,sigma,1000) # generate fake data
x = np.linspace(min(data), max(data), 100)
plt.hist(data, bins="auto",normed=True)
plt.plot(x, mlab.normpdf(x, mean, sigma))
plt.show()
Which gives the following figure:
Edit: The above only works with normed = True. If this is not an option, we can define our own function:
def gauss_function(x, a, x0, sigma):
    return a * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
mean = 100
sigma = 5
data = np.random.normal(mean,sigma,1000) # generate fake data
x = np.linspace(min(data), max(data), 1000)
test = gauss_function(x, max(data), mean, sigma)
plt.hist(data, bins="auto")
plt.plot(x, test)
plt.show()
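Note that in more recent Matplotlib versions normed has been replaced by density=True and matplotlib.mlab.normpdf has been removed; scipy.stats.norm.pdf does the same job. A sketch of the first example with the current API:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
mean = 100
sigma = 5
data = np.random.normal(mean, sigma, 1000)  # generate fake data
x = np.linspace(min(data), max(data), 100)
plt.hist(data, bins="auto", density=True)   # density=True replaces normed=True
plt.plot(x, norm.pdf(x, mean, sigma))       # replaces mlab.normpdf
plt.show()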
Everything you are looking for is already in seaborn.
You just have to use distplot:
import seaborn as sns
import numpy as np
data = np.random.normal(5, 2, size=1000)
sns.distplot(data)
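Be aware that distplot is deprecated in recent seaborn releases; histplot (or the figure-level displot) is the suggested replacement, for example:
import seaborn as sns
import numpy as np
data = np.random.normal(5, 2, size=1000)
sns.histplot(data, kde=True, stat="density")  # histogram plus a KDE curve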
I am trying to cluster data according to the density of the data points.
I want to draw contours around these regions according to the density, like so:
I am trying to adapt the following code from here to get to this point:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
# Generate fake data
x = np.random.normal(size=1000)
y = x * 3 + np.random.normal(size=1000)
# Calculate the point density
xy = np.vstack([x,y])
z = gaussian_kde(xy)(xy)
# Sort the points by density, so that the densest points are plotted last
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]
fig, ax = plt.subplots()
img = ax.scatter(x, y, c=z, edgecolor='none')
plt.show()
To cluster by density, try an algorithm like DBSCAN. However, it looks like you actually want to estimate the density itself rather than cluster the points together, because you want to color your output by density. In that case, use a simple kernel density estimate (the density function in R), or an adaptive kernel density estimate if you have both broad and very sharp peaks at the same time. Example for MATLAB: Adaptive Kernel density estimate on the MATLAB File Exchange.
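If the contour plot itself is the goal, one possible sketch is to evaluate the gaussian_kde from the question's code on a regular grid and hand the result to contour:
# Evaluate the KDE on a regular grid and draw density contours
kde = gaussian_kde(xy)
xgrid, ygrid = np.mgrid[x.min():x.max():100j, y.min():y.max():100j]
density = kde(np.vstack([xgrid.ravel(), ygrid.ravel()])).reshape(xgrid.shape)
fig, ax = plt.subplots()
ax.scatter(x, y, c=z, s=10)
ax.contour(xgrid, ygrid, density, levels=5, colors='k')
plt.show()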
I have data as a list of floats and I want to plot it as a histogram. The hist() function does the job perfectly for plotting the absolute histogram. However, I cannot figure out how to represent it in a relative frequency format: I would like to have the fraction, or ideally the percentage, on the y-axis.
Here is the code:
fig = plt.figure()
ax = fig.add_subplot(111)
n, bins, patches = ax.hist(mydata, bins=100, normed=1, cumulative=0)
ax.set_xlabel('Bins', size=20)
ax.set_ylabel('Frequency', size=20)
ax.legend
plt.show()
I thought the normed=1 argument would do it, but it gives fractions that are too high, sometimes greater than 1. They also seem to depend on the bin size, as if they are not normalized by the bin width. Nevertheless, when I set cumulative=1, it nicely sums up to 1. So where is the catch? By the way, when I feed the same data into Origin and plot it, it gives me perfectly correct fractions. Thank you!
That is because the normed option of hist returns the density of points, i.e. dN/dx.
What you need is something like this:
# assuming that mydata is a numpy array
ax.hist(mydata, weights=np.zeros_like(mydata) + 1. / mydata.size)
# this will give you fractions
Or you can use set_major_formatter to adjust the scale of the y-axis, as follows:
from matplotlib import ticker as tick
def adjust_y_axis(x, pos):
    return x / (len(mydata) * 1.0)
ax.yaxis.set_major_formatter(tick.FuncFormatter(adjust_y_axis))
just register the formatter as shown above before calling plt.show().
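For reference, a self-contained version of the weights approach might look like the sketch below (mydata is just a placeholder array here):
import numpy as np
import matplotlib.pyplot as plt
mydata = np.random.normal(size=1000)  # placeholder for your list of floats
fig, ax = plt.subplots()
# each sample contributes 1/len(mydata), so the bar heights are fractions summing to 1
ax.hist(mydata, bins=100, weights=np.ones_like(mydata) / mydata.size)
ax.set_xlabel('Bins', size=20)
ax.set_ylabel('Relative frequency', size=20)
plt.show()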
For relative frequency format set the option density=True. The figure below shows a histogram for 1000 samples taken from a normal distribution with mean 5 and standard deviation 2.0.
The code is
import numpy as np
import matplotlib.pyplot as plt
# Generate data from a normal distribution
mu, sigma = 5, 2.0 # mean and standard deviation
mydata = np.random.normal(mu, sigma, 1000)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(mydata,bins=100,density=True);
plt.show()
If you want % on the y-axis you can use PercentFormatter as shown below
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
# Generate data from a normal distribution
mu, sigma = 5, 2.0 # mean and standard deviation
mydata = np.random.normal(mu, sigma, 1000)
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(mydata, bins=100, density=False)
# xmax is the value that corresponds to 100%; with raw counts that is the number of samples
ax.yaxis.set_major_formatter(PercentFormatter(xmax=len(mydata)))
plt.show()
You can use numpy.histogram to get the histogram values and bin edges, and then calculate the frequencies yourself. Finally, use a bar plot to draw the frequency histogram.
hist, edges = np.histogram(p_hat)
freq = hist / float(hist.sum())  # fraction of samples in each bin
width = np.diff(edges)           # bin widths from the bin edges
plt.bar(edges[:-1], freq, width=width, align="edge", ec="k")
plt.xlabel('x')
plt.ylabel('frequency')