How to find what points lie in each bin of a histogram? - python

I have a 2D histogram with bin size 10. I would like to know whether there is a NumPy function (or any faster method) to find which points lie in each bin of the 2D grid. Is there a way to access the bin elements?

I hope this solves your problem. However, I believe others can improve my code because I am new to Python.
Create Histogram with matplotlib
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.RandomState(10) # deterministic random data
a = np.hstack((rng.normal(size=100), rng.normal(loc=5, scale=2, size=1000)))
n ,bins ,patches = plt.hist(a, bins=10) # arguments are passed to np.histogram
plt.title("Histogram with '10' bins")
plt.show()
Reshape arrays and..
newbin = np.repeat(np.reshape(bins,(-1, len(bins))), a.shape, axis=0)
newa = np.repeat(np.reshape(a,(len(a),-1)),len(bins),axis=1)
#index_bin = (np.where(newbin[:,0] >np.reshape(a,(1,-1))[:,0] ) )[0][0]
index_bin = (newbin>newa).argmax(axis=1).T
test
print(a[0] , bins)
print(index_bin[0])
Output
1.331586504129518 [-2.13171211 -0.88255884 0.36659444 1.61574771 2.86490098 4.11405425
5.36320753 6.6123608 7.86151407 9.11066734 10.35982062]
3
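For the 2D case asked about in the question, one option is np.digitize, which returns the bin index of each point along one axis. The sketch below is not the answerer's method. It assumes the edges come from np.histogram2d, and the names x, y and points_in_bin are made up for illustration. Note also that the broadcasting approach above returns the index of the first edge greater than each value, so the largest value (which equals the last edge) appears to get index 0; the clipping step below avoids that.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(10)
x = rng.normal(size=1000)  # hypothetical sample data
y = rng.normal(size=1000)

# 10 bins per axis; H[i, j] counts the points in x-bin i and y-bin j
H, xedges, yedges = np.histogram2d(x, y, bins=10)

# np.digitize is 1-based for values inside the range, so subtract 1.
# Clipping puts a value equal to the last edge into the last bin.
ix = np.clip(np.digitize(x, xedges) - 1, 0, len(xedges) - 2)
iy = np.clip(np.digitize(y, yedges) - 1, 0, len(yedges) - 2)

# Map each (i, j) bin to the indices of the points that fall in it
points_in_bin = defaultdict(list)
for k, (i, j) in enumerate(zip(ix.tolist(), iy.tolist())):
    points_in_bin[(i, j)].append(k)
```

With this, x[points_in_bin[(i, j)]] and y[points_in_bin[(i, j)]] give the coordinates of the points in bin (i, j).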

Related

Simulate the compound random variable S

Let S=X_1+X_2+...+X_N where N is a nonnegative integer-valued random variable and X_1,X_2,... are i.i.d. random variables. (If N=0, we set S=0.)
Simulate S in the case where N ~ Poi(100) and X_i ~ Exp(0.5). (Draw histograms and use the numpy or scipy built-in functions.) Also check the equations E(S)=E(N)*E(X_1) and Var(S)=E(N)*Var(X_1)+E(X_1)^2*Var(N).
I was trying to solve it, but I'm not sure about everything yet and I also got stuck on the histogram part. Note: I'm new to Python and, more generally, new to programming.
My work:
import scipy.stats as stats
import matplotlib as plt
N = stats.poisson(100)
X = stats.expon(0.5)
arr = X.rvs(N.rvs())
S = 0
for i in arr:
    S=S+i
print(arr)
print("S=",S)
expected_S = (N.mean())*(X.mean())
variance_S = (N.mean()*X.var()) + (X.mean()*X.mean()*N.var())
print("E(S)=",expected_S)
print("Var(S)=",variance_S)
Your existing code mostly looks sensible, but I'd simplify:
arr = X.rvs(N.rvs())
S = 0
for i in arr:
    S=S+i
down to:
S = X.rvs(N.rvs()).sum()
To draw a histogram, you need many samples from this distribution, which is now easily accomplished via:
arr = []
for _ in range(10_000):
    arr.append(X.rvs(N.rvs()).sum())
or, equivalently, using a list comprehension:
arr = [X.rvs(N.rvs()).sum() for _ in range(10_000)]
To plot these in a histogram, you need the pyplot module from Matplotlib, so your import should be:
import matplotlib.pyplot as plt
plt.hist(arr, 50)
The 50 above says to use that number of "bins" when drawing the histogram. We can also compare these to the mean and variance you calculated by assuming the distribution is well approximated by a normal:
import numpy as np
approx = stats.norm(expected_S, np.sqrt(variance_S))
_, x, _ = plt.hist(arr, 50, density=True)
plt.plot(x, approx.pdf(x))
This works because the second value returned by matplotlib's hist function is the array of bin edges. I used density=True so I could work with probability densities, but another option would be to multiply the densities by the number of samples and the bin width to get expected counts, like the previous histogram.
Running this gives me a plot of the histogram with the normal approximation overlaid (image not included).
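One thing worth checking separately is the parametrization. In scipy.stats, the first positional argument of expon is loc, so stats.expon(0.5) is a standard exponential shifted by 0.5 rather than an exponential with parameter 0.5. The sketch below assumes that Exp(0.5) means rate 0.5 (so scale = 1/0.5 = 2); if the exercise intends 0.5 to be the mean, use scale=0.5 instead. It draws with NumPy's random generator for speed and compares the sample moments with the two formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 100.0        # N ~ Poi(100): E(N) = Var(N) = 100
scale = 1 / 0.5    # assumption: Exp(0.5) is rate 0.5, so scale = 2

# Draw N for each replicate, then sum that many exponential draws
n_draws = rng.poisson(lam, size=10_000)
samples = np.array([rng.exponential(scale, size=n).sum() for n in n_draws])

# Theoretical moments: E(X) = scale and Var(X) = scale**2 for an exponential
expected_S = lam * scale
variance_S = lam * scale**2 + scale**2 * lam

print("sample mean:", samples.mean(), "E(S):", expected_S)
print("sample var: ", samples.var(ddof=1), "Var(S):", variance_S)
```

Under this assumption E(S) = 200 and Var(S) = 800, and the sample values should land close to those numbers.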

logarithmic rebinning of 2D array

I have a 1D array containing data that looks like this (48000 points), spaced by one wavenumber (R = 1 cm-1). The x and y arrays both have shape (48000, 1), and I want to rebin both in the same way:
xarr=[50000,49999,49998,....,2000]
yarr=[0.1,0.02,0.8,0.5....0.1]
I wish to decrease the spectral resolution to, let's say, R = 10 cm-1, so I want ten times fewer points (4800), from 50000 to 2000, and the same for the y array.
How to start?
I tried taking the natural log of the wavelength scale and then rebinning onto a new log-wavelength scale generated using np.linspace():
xi=np.log(xarr[0])
xf=np.log(xarr[-1])
xnew=np.linspace(xi, xf, num=4800)
Now I need to recast the y array onto this xnew array. I am thinking of using a 2D rebin, but I am not sure how to use it. Any suggestions?
import numpy as np
arr1=[2,3,65,3,5...,32,2]
series=np.array(arr1)
print(series[:3])
I tried this and it seems to work!
import numpy as np
import scipy.stats as stats
#irregular x and y arrays
yirr= np.random.randint(1,101,10)
xirr=np.arange(10)
nbins=5
bin_means, bin_edges, binnumber = stats.binned_statistic(xirr,yirr, 'mean', bins=nbins)
yreg=bin_means # <== regularized yarr
xi=xirr[0]
xf=xirr[-1]
xreg=np.linspace(xi, xf, num=nbins)
print('yreg',yreg)
print('xreg',xreg) # <== regularized xarr
If anyone can find an improvement or see a problem with this, please post!
I'll try it on my logarithmically scaled data now
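One possible improvement, offered as a sketch rather than a definitive fix: np.linspace(xi, xf, num=nbins) returns points that run from the first to the last x value, not the centers of the bins that binned_statistic actually used. The returned bin_edges can give the centers directly, and passing logarithmically spaced edges (np.geomspace) yields a log rebinning in one step. The data below are a hypothetical stand-in for the real spectrum, and nbins is chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in: wavenumbers 50000 down to 2000 in steps of 1
x = np.arange(50000, 1999, -1, dtype=float)
y = rng.random(x.size)

nbins = 480
# Logarithmically spaced bin edges covering the full x range
edges = np.geomspace(x.min(), x.max(), nbins + 1)

# Mean of y in each bin; bins with no points would come back as NaN
y_binned, edges, _ = stats.binned_statistic(x, y, statistic="mean", bins=edges)

# Geometric bin centers, the natural midpoint on a log scale
x_centers = np.sqrt(edges[:-1] * edges[1:])
```

If some bins turn out empty (NaN in y_binned), reducing nbins or interpolating over the gaps are two options.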

Gaussian Mixture Model with discrete data

I have 136 numbers whose distribution is a superposition of 8 Gaussian distributions. I want to find the mean and variance of each Gaussian component. Can you find any mistakes in my code?
file = open("1.txt",'r') # the data in 1.txt look like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y=[int (i) for i in list((file.read()).split(','))] # build a list whose elements are the data above
x=list(range(1,len(y)+1)) # the x values
z=list(zip(x,y)) # z consists of the pairs (1, 0), (2, 0), ...
So, from the above, I obtain a list z of the 136 points (x, y) on the xy-plane, with the given data as the y values.
Now I want to obtain the mean and variance of each Gaussian distribution. The basic assumption is that the given data consist of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to write the code to print the 8 means and variances. Can anyone help me?
You can use this. I have written some sample code for your visualization:
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50 #how far the plotted x range extends beyond the data; larger values zoom out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
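To print the means and variances directly, the fitted model exposes means_, covariances_ and weights_. The sketch below also makes an assumption that the question does not state: that the numbers in 1.txt are counts observed at positions 1, 2, 3, ... (a histogram) rather than raw samples. In that case the counts can be expanded into samples with np.repeat before fitting, since GaussianMixture expects one row per observation. The names counts, positions and samples are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sample data from the answer above, treated here as counts per position (assumption)
counts = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 4, 4, 6, 14, 25, 43, 71, 93, 123, 194])
positions = np.arange(1, len(counts) + 1)

# Repeat each position as many times as it was observed: one row per observation
samples = np.repeat(positions, counts).reshape(-1, 1)

n_components = 2  # the question would use 8 with the full data set
model = GaussianMixture(n_components=n_components, random_state=0).fit(samples)

means = model.means_.ravel()
variances = model.covariances_.ravel()  # one variance per component for 1D data
weights = model.weights_

for k in range(n_components):
    print(f"component {k}: mean={means[k]:.3f}, variance={variances[k]:.3f}, weight={weights[k]:.3f}")
```

If the numbers really are raw samples, skip the np.repeat step and fit on np.array(y).reshape(-1, 1) instead.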

Matplotlib plot pmf from list of 2D numpy arrays

I have a dataset from my simulations where I combine the results from each simulation seed into a bigger list using bl.extend(df['column'].tolist()).
I'm also running several simulation scenarios, so I append each scenario to a list of lists.
Finally, I'm computing the Probability Mass Function (PMF) of each list as follows (from How to plot a PMF of a sample?)
for idx,sublist in enumerate(pmf_list):
    val, cnt = np.unique(sublist, return_counts=True)
    pmf = cnt / float(len(sublist))
    plot_pmf.append(np.column_stack((val, pmf)))
The issue is that I end up with a list of numpy arrays which I don't know how to plot. The minimum code to reproduce the problem is the following:
import numpy as np
list1 = np.empty([2, 2])
list2 = np.empty([2, 2])
list3 = np.empty([2, 2])
bl = [] # big list
bl.append(list1)
bl.append(list2)
bl.append(list3)
print(bl)
I can plot using plt.hist(bl[0]) but it doesn't give me the right results. See plot attached for the following list.
<type 'numpy.ndarray'>
[[0.00000000e+00 1.91734780e-01]
[1.00000000e+00 2.94277080e-02]
[2.00000000e+00 3.28276369e-01]
[3.00000000e+00 4.43357154e-01]
[4.00000000e+00 3.54294582e-03]
[5.00000000e+00 1.57306794e-03]
[6.00000000e+00 2.00530733e-03]
[7.00000000e+00 2.95245485e-05]
[8.00000000e+00 2.24386568e-05]
[9.00000000e+00 2.83435665e-05]
[1.00000000e+01 1.18098194e-06]
[1.20000000e+01 1.18098194e-06]]
Formatting the y-values I get:
0.1944084241
0.0415880165
0.3480178394
0.4031723062
0.0050902199
0.0033411939
0.0040175705
0.0001480127
0.0001031961
0.0001008373
0.0000058969
0.0000011794
0.0000047175
0.0000005897
These are very different from the y-values on the histogram plot.
Does the following graph look right?
import matplotlib.pyplot as plt
import numpy as np
X = np.array([[0.00000000e+00, 1.91734780e-01],
[1.00000000e+00, 2.94277080e-02],
[2.00000000e+00, 3.28276369e-01],
[3.00000000e+00, 4.43357154e-01],
[4.00000000e+00, 3.54294582e-03],
[5.00000000e+00, 1.57306794e-03],
[6.00000000e+00, 2.00530733e-03],
[7.00000000e+00, 2.95245485e-05],
[8.00000000e+00, 2.24386568e-05],
[9.00000000e+00, 2.83435665e-05],
[1.00000000e+01, 1.18098194e-06],
[1.20000000e+01, 1.18098194e-06],])
plt.bar(x=X[:, 0], height=X[:, 1])
plt.show()
If you already have the first column as the possible values of the random variable, and the second column as the corresponding probability values, you could use a bar plot to visualize the PMF.
The histogram plot function plt.hist is for a vector of observed values. For example,
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
np.random.seed(0)
plt.hist(np.random.normal(size=1000))
plt.show()
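To handle the whole list of arrays from the question, the same bar plot can be drawn once per scenario. This is a minimal sketch, assuming each entry of plot_pmf is a two-column array of (value, probability) as built in the question; the Poisson samples are a hypothetical stand-in for the real simulation output.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Hypothetical stand-in for the per-scenario sample lists
pmf_list = [rng.poisson(lam, size=1000) for lam in (2, 4)]

# Same PMF computation as in the question
plot_pmf = []
for sublist in pmf_list:
    val, cnt = np.unique(sublist, return_counts=True)
    plot_pmf.append(np.column_stack((val, cnt / len(sublist))))

# One bar chart per scenario, sharing the y axis for easy comparison
fig, axes = plt.subplots(1, len(plot_pmf), sharey=True)
for ax, pmf in zip(np.atleast_1d(axes), plot_pmf):
    ax.bar(pmf[:, 0], pmf[:, 1])
    ax.set_xlabel("value")
    ax.set_ylabel("probability")
```

Call plt.show() or fig.savefig(...) afterwards to display or save the figure.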

matplotlib: plot hist2d piecewise

I would like to plot a large sample stored in the arrays a and b with matplotlib's hist2d feature. However, generating H, xedges, yedges, img does not work directly for this data, as it uses too much memory. It works for half the number of samples, though, so I would like to do something like
H_1, xedges_1, yedges_1, img_1 = plt.hist2d(a[:len(a)//2], b[:len(b)//2], bins = 10)
followed by
H_2, xedges_2, yedges_2, img_2 = plt.hist2d(a[len(a)//2:], b[len(b)//2:], bins = 10)
while perhaps deleting the first half of the arrays after calculating the first set of variables. Is there a way to merge these two sets of variables and generate a combined plot for the data?
If (and only if!) you specify the bin edges manually, then your histograms will be compatible. You can simply add the occurrences of each bin for both subsets, and you'll recover the full histogram:
import numpy as np
import matplotlib.pyplot as plt
a=np.random.rand(200)*10
b=np.random.rand(200)*10
binmin=min(a.min(),b.min())
binmax=max(a.max(),b.max())
H_1, xedges_1, yedges_1, img_1 = plt.hist2d(a[:len(a)//2], b[:len(b)//2], bins = np.linspace(binmin,binmax,10+1))
H_2, xedges_2, yedges_2, img_2 = plt.hist2d(a[len(a)//2:], b[len(b)//2:], bins = np.linspace(binmin,binmax,10+1))
H_3, xedges_3, yedges_3, img_3 = plt.hist2d(a, b, bins = np.linspace(binmin,binmax,10+1))
Result:
In [150]: (H_1+H_2==H_3).all()
Out[150]: True
You can easily plot the summed histogram using plt.pcolor. That's what hist2d seems to use, albeit with an additional transpose of the data:
plt.figure()
plt.pcolor((H_1+H_2).T)
Figure: img_3 (left) vs (H_1+H_2).T (right).
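Since plt.hist2d draws an image on every call, a variation is to accumulate the counts with np.histogram2d, which only computes the array, and plot once at the end. This is a sketch under the same condition as above (fixed, manually specified bin edges); the chunk size and sample data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10_000) * 10  # hypothetical sample data in [0, 10)
b = rng.random(10_000) * 10

# Fixed edges shared by every chunk, so the per-chunk counts are compatible
xedges = np.linspace(0, 10, 10 + 1)
yedges = np.linspace(0, 10, 10 + 1)

chunk = 2_500
H = np.zeros((len(xedges) - 1, len(yedges) - 1))
for start in range(0, len(a), chunk):
    # Histogram one slice at a time and add it to the running total
    h, _, _ = np.histogram2d(a[start:start + chunk], b[start:start + chunk],
                             bins=[xedges, yedges])
    H += h

# Reference: histogram of the full data in one call
H_full, _, _ = np.histogram2d(a, b, bins=[xedges, yedges])
print(np.array_equal(H, H_full))
```

The accumulated array can then be drawn with plt.pcolormesh(xedges, yedges, H.T), where the transpose puts x on the horizontal axis.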
