Points outside histogram range are discarded from plot - Python

I am plotting a histogram of values and I want all histograms to have the same range of values for the bins, so the plots can be compared. To do so, I specify a vector x with the bin edges.
data = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.32])
x = np.linspace(0, 0.2, 9)
plt.hist(data, x)
What I notice is that if I specify the range of x to be between 0 and 0.2, then values larger than 0.2 (0.32 in the example) are discarded from the plot.
Is there a way of accumulating all values greater than 0.2 in the last bin and all values lower than 0.0 in the first bin?
Of course I can do something like
data[data>0.2] = 0.2
data[data<0.0] = 0.0
But I'd prefer not to modify my original array, and not to make a copy of it unless there is no other way.

You can pass the bins argument as an array with edges wherever you want; it does not have to be linearly spaced, though that will make the bars different widths. For your particular case, you can use the .clip method of the data array.
data = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.32])
x = np.linspace(0, 0.2, 9)
plt.hist(data.clip(min=0, max=0.2), x)
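Note that .clip returns a new array and leaves the original untouched, which addresses the concern about modifying the data. A quick check:
import numpy as np
data = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 0.32])
clipped = data.clip(min=0, max=0.2)  # returns a new array; data is not modified
print(data.max())     # 0.32 -- original unchanged
print(clipped.max())  # 0.2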

Related

Merging similar columns in NumPy, probability vector

I have a numpy array of the following form:
[[0.1 , 0.2 , 0.6 , 0.1 ],
 [0.  , 1.  , 0.  , 1.01]]
It is a probability vector, where the second row corresponds to a value and the first row to the probability that this value is realized. (e.g. the probability of getting 1.0 is 20%)
When two values are close to each other, I want to merge their columns by adding up the probabilities. In this example I want to have:
[[0.7, 0.3],
 [0. , 1. ]]
My current solution involves 3 loops and is really slow for larger arrays. Does someone know an efficient way to program this in NumPy?
While it won't do exactly what you want, you could try to use np.histogram to tackle the problem.
For example, say you just want two "bins" like in your example; you could do
import numpy as np
x = np.array([[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]])
hist, bin_edges = np.histogram(x[1, :], bins=[0, 1.0, 1.5], weights=x[0, :])
and then stack your histogram with the leading bin edges to get your output
print(np.stack([hist, bin_edges[:-1]]))
This will print
[[0.7 0.3]
 [0.  1. ]]
You can use the bins parameter to get your desired output. I hope this helps.
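A hypothetical alternative, not from the original answer: if the values to merge agree after rounding, you can group them with np.unique and accumulate the probabilities with np.add.at. A minimal sketch:
import numpy as np
x = np.array([[0.1, 0.2, 0.6, 0.1], [0.0, 1.0, 0.0, 1.01]])
# round to one decimal so 1.0 and 1.01 fall into the same group
vals, inv = np.unique(np.round(x[1], 1), return_inverse=True)
probs = np.zeros_like(vals)
np.add.at(probs, inv, x[0])      # sum the probabilities within each group
print(np.stack([probs, vals]))   # [[0.7 0.3] [0. 1. ]]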

Matplotlib: Probability Mass Graph

Why does the total probability exceed 1?
import matplotlib.pyplot as plt
figure, axes = plt.subplots(nrows = 1, ncols = 1)
axes.hist(x = [0.1, 0.2, 0.3, 0.4], density = True)
figure.show()
Expected y-values: [0.25, 0.25, 0.25, 0.25]
The following is my understanding as per the documentation. I don't claim to be an expert in matplotlib, nor am I one of its authors. Your question made me think, so I read the documentation and took some logical steps to understand it; this is not an expert opinion.
===================================================================
Since you have not passed bins information, matplotlib went ahead and created its own bins. In this case the bins are as below.
bins = [0.1 , 0.13, 0.16, 0.19, 0.22, 0.25, 0.28, 0.31, 0.34, 0.37, 0.4 ]
You can see the bin width is 0.03.
Now according to the documentation.
density : bool, optional
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations.
In order to make the area sum to 1, matplotlib normalizes the counts so that, when you multiply the normalized count in each bin by the bin width, the sum of the individual products becomes 1.
Your counts are as below for X = [0.1, 0.2, 0.3, 0.4] (the eleven edges give ten bins; 0.4 falls into the last bin because it is closed on the right):
OriginalCounts = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
As you can see, if you multiply the OriginalCounts array by the bin width and sum the results, you get 4*0.03 = 0.12, which is less than 1.
So, according to the documentation, we need to divide the OriginalCounts array by a factor: (number of observations * bin width).
In this case the number of observations is 4 and the bin width is 0.03, so the factor is 4*0.03 = 0.12. Dividing each element of OriginalCounts by 0.12 gives the normalized histogram values array.
That means that the revised counts are as below
NormalizedCounts = [8.33333333, 0. , 0. , 8.33333333, 0. , 0. , 8.33333333, 0. , 0. , 8.33333333]
Please note that, now, if you sum the normalized counts multiplied by the bin width, the result is 1. You can quickly check this: 8.333333 * 4 * 0.03 = 0.9999999..., which is very close to 1.
These normalized counts are what is finally shown in the graph. This is why the height of the bars in the histogram is close to 8.33 at four positions.
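You can verify this numerically with np.histogram, which applies the same density normalization. A minimal sketch:
import numpy as np
x = [0.1, 0.2, 0.3, 0.4]
counts, bins = np.histogram(x)              # default: 10 bins
density, _ = np.histogram(x, density=True)
widths = np.diff(bins)
print(counts)                    # [1 0 0 1 0 0 1 0 0 1]
print(density)                   # 8.333... at four positions, 0 elsewhere
print(np.sum(density * widths))  # 1.0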
Hope this helps.

Difference between "counts" and "number of observations" in matplotlib histogram

The matplotlib.pyplot.hist() documentation describes the parameter "density" (its deprecated name was "normed") as:
density : bool, optional
If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations.
By "the first element of the return tuple" it means the y-axis values. It says that it gets the area under the histogram to be 1 by dividing the count by the number of observations times the bin width.
What is the difference between count and number of observations? In my head they are the same thing: the number of instances (or counts, or observations) for which the variable's value falls into a certain bin. However, this would mean that the transformed count for each bin is just one over the bin width (since #/(# * bin_width) = 1/bin_width), which does not make any sense.
Could someone clarify this for me? Thank you for your help and sorry for the probably stupid question.
I think the wording in the documentation is a bit confusing. The count is the number of entries in a given bin (the height of the bin), and the number of observations is the total number of events that go into the histogram.
The documentation makes the distinction about how they normalized because there are generally two ways to do the normalization:
count / number of observations - in this case if you add up all the entries of the output array you would get 1.
count / (number of observations * bin width) - in this case the integral of the output array is 1 so it is a true probability density. This is what matplotlib does, and they just want to be clear in this distinction.
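A short sketch contrasting the two normalizations, with assumed example data:
import numpy as np
obs = np.array([0.1, 0.3, 0.6, 0.8])
counts, edges = np.histogram(obs, bins=[0, 0.5, 1.0])
frac = counts / len(obs)                     # [0.5 0.5] -- entries sum to 1
dens = counts / (len(obs) * np.diff(edges))  # [1. 1.]  -- entries sum to 2,
print(np.sum(dens * np.diff(edges)))         # but the integral is 1.0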
The count of all observations is the number of observations. But with a histogram you're interested in the counts per bin. So for each bin you divide the count of that bin by the total number of observations times the bin width.
import numpy as np
observations = [1.2, 1.5, 1.7, 1.9, 2.2, 2.3, 3.6, 4.1, 4.2, 4.4]
bin_edges = [0,1,2,3,4,5]
counts, edges = np.histogram(observations, bins=bin_edges)
print(counts) # prints [0 4 2 1 3]
density, edges = np.histogram(observations, bins=bin_edges, density=True)
print(density) # prints [0. 0.4 0.2 0.1 0.3]
# calculate density manually according to formula
man_density = counts/(len(observations)*np.diff(edges))
print(man_density) # prints [0. 0.4 0.2 0.1 0.3]
# Check that density == manually calculated density
assert np.all(man_density == density)

Finding standard deviations along x and y in 2D numpy array

If I have a 2D numpy array composed of points (x, y) that give some value z(x, y) at each point, can I find the standard deviation along the x-axis and along the y-axis? I know that np.std(data) will simply find the standard deviation of the entire dataset, but that's not what I want. Also, adding axis=0 or axis=1 computes the standard deviations along each axis, giving as many results as you have rows or columns. If I just want one standard deviation along the y-axis, and another along the x-axis, can I find these in a dataset like this? From my understanding, standard deviations along x and y normally make sense when you have points x with values y(x). But I need some sigma_x and sigma_y for a 2D Gaussian fit I'm trying to do. Is this possible?
Here is an oversimplified example, since my actual data is much larger.
import numpy as np
data = np.array([[1, 5, 0, 3], [3, 5, 1, 1], [41, 33, 9, 20], [11, 20, 4, 13]])
print(np.std(data)) #not what I want
>>> 11.78386
print(np.std(data, axis=0)) #this gives me as many results as there are rows/columns so it's not what I want
>>> [16.03 11.69 3.5 7.69]
I'm not sure how the output corresponding to what I want would look like, since I'm not even sure if it's possible in a 2D array with shape > nx2. But I want to know if it's possible to compute a standard deviation along the x-axis, and one along the y-axis. I'm not even sure if this makes sense for a 2D array... But if it doesn't, I'm not sure what to input as my sigma_x and sigma_y for a 2D Gaussian fit.
Standard deviation doesn't care whether y = f(x) or whether (x, y) are coordinates. It just measures how spread out a set of values is. If you have n points (x, y) making up an n x 2 array, then std(axis=0) is what you want. It creates a (2,)-shaped array, where the first element is the x-axis std and the second the y-axis std. Whether that is useful depends on what you want, and it ignores the correlation between x and y.
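For an n x 2 array of (x, y) points, that looks like the following (a minimal sketch with assumed random data):
import numpy as np
# 1000 points with true sigma_x = 1.0 and sigma_y = 2.0
pts = np.random.default_rng(0).normal(loc=0.0, scale=[1.0, 2.0], size=(1000, 2))
sigma_x, sigma_y = np.std(pts, axis=0)
print(sigma_x, sigma_y)  # roughly 1.0 and 2.0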
I think what you want is to separate the x axis in small intervals and compute the standard deviation of the y coordinates of the points within those intervals.
You could compute std(y_i), where y_i are the y coordinates for points x in the interval (x_min+i*delta_x, x_min+(i+1)*delta_x), choosing a small delta_x, such that enough points (x_j, y_j) lie within the interval.
import numpy as np
x = np.array([0, 0.11, 0.1, 0.01, 0.2, 0.22, 0.23])
y = np.array([1, 2, 3, 2, 2, 2.1, 2.2])
num_intervals = 3
#sort the arrays
sort_inds = np.argsort(x)
x = x[sort_inds]
y = y[sort_inds]
# create intervals
x_range = x.max() - x.min()
x_intervals = np.linspace(np.min(x)+x_range/num_intervals, x.max()-x_range/num_intervals, num_intervals)
print(x_intervals)
>> [0.07666667 0.115 0.15333333]
Next, we split the arrays y and x using these intervals:
# get indices of x where the elements of x_intervals
# should be inserted, in order to maintain the order
# for sufficiently large num_intervals it
# approximates the closest value in x to an element
# in x_intervals
split_indices = np.unique(np.searchsorted(x, x_intervals, side='left'))
ls_of_arrays_x = np.array_split(x, split_indices)
ls_of_arrays_y = np.array_split(y, split_indices)
print(ls_of_arrays_x)
print(ls_of_arrays_y)
>> [array([0. , 0.01]), array([0.1 , 0.11]), array([0.2 , 0.22, 0.23])]
>> [array([1., 2.]), array([3., 2.]), array([2. , 2.1, 2.2])]
Now compute the x coordinates and the corresponding y std:
y_stds = np.array([np.std(yi) for yi in ls_of_arrays_y])
x_mean = np.array([np.mean(xi) for xi in ls_of_arrays_x])
print(x_mean)
print(y_stds)
>> [0.005 0.105 0.21666667]
>> [0.5 0.5 0.08164966]
I hope it was what you were looking for.

Probability density function numpy histogram/scipy stats

We have the array a = range(10). Using numpy.histogram:
hist, bins = np.histogram(a, bins=int((np.max(a) - np.min(a)) / 1), range=(np.min(a), np.max(a)), density=True)
According to numpy tutorial:
If density=True, the result is the value of the probability density function at the bin, normalized such that the integral over the range is 1.
The result is:
array([ 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2])
I try to do the same using scipy.stats:
from scipy.stats import norm
mean = np.mean(a)
sigma = np.std(a)
norm.pdf(a, mean, sigma)
However the result is different:
array([0.04070852, 0.06610774, 0.09509936, 0.12118842, 0.13680528, 0.13680528, 0.12118842, 0.09509936, 0.06610774, 0.04070852])
I want to know why.
Update: I would like to ask a more general question. How can we obtain the probability density function of an array without using numpy.histogram with density=True?
If density=True, the result is the value of the probability density
function at the bin, normalized such that the integral over the
range is 1.
The "normalized" there does not mean that it will be transformed using a Normal Distribution. It simply says that each value in the bin will be divided by the total number of entries so that the total density would be equal to 1.
You can't compare numpy.histogram() and scipy.stats.norm() for this simple reason:
scipy.stats.norm() is a normal continuous random variable, while numpy.histogram() deals with actual (discrete) samples.
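One way to see the relationship is to overlay the fitted normal pdf on the density histogram; a minimal sketch (the two only coincide when the data really is normally distributed):
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
a = np.arange(10)
plt.hist(a, bins=9, density=True, alpha=0.5)       # empirical density
xs = np.linspace(np.min(a), np.max(a), 200)
plt.plot(xs, norm.pdf(xs, np.mean(a), np.std(a)))  # fitted normal pdf
plt.show()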
Plotting a Continuous Probability Function (PDF) from a Histogram – Solved in Python. Refer to this blog for a detailed explanation (http://howdoudoittheeasiestway.blogspot.com/2017/09/plotting-continuous-probability.html). Otherwise, you can use the code below.
import numpy as np
import matplotlib.pyplot as plt
A = np.random.randn(1000)  # A is assumed to be your data array; random data used here for illustration
n, bins, patches = plt.hist(A, 40, histtype='bar')
plt.show()
n = n / len(A)       # convert counts to fractions
n = np.append(n, 0)  # pad so n has the same length as bins
mu = np.mean(n)
sigma = np.std(n)
plt.bar(bins, n, width=(bins[-1] - bins[0]) / 40)
# the 0.03 factor is the original author's ad-hoc vertical scaling
y1 = (1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu)**2 / (2 * sigma**2))) * 0.03
plt.plot(bins, y1, 'r--', linewidth=2)
plt.show()
