I have a set of data, and a set of thresholds for creating bins:
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
thresholds = np.array([0,5,10])
bins = np.digitize(data, thresholds, right=True)
For each element of bins, I want to know the base percentile. For example, the smallest bin should start at the 0th percentile, the next bin at, say, the 20th percentile, and so on, so that if a value in data falls between the 0th and 20th percentile of data, it belongs in the first bin.
I've looked into pandas rank(pct=True) but can't seem to get this done correctly.
Suggestions?
You can calculate the percentile for each element in your data array as described in a previous StackOverflow question (Map each list value to its corresponding percentile).
import numpy as np
from scipy import stats
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
Method 1: Using scipy.stats.percentileofscore:
data_percentile = np.array([stats.percentileofscore(data, a) for a in data])
data_percentile
Out[1]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
36.36363636, 59.09090909, 59.09090909, 95.45454545,
95.45454545, 72.72727273, 81.81818182])
Method 2: Using scipy.stats.rankdata and normalising to 100 (faster):
ranked = stats.rankdata(data)
data_percentile = ranked/len(data)*100
data_percentile
Out[2]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
36.36363636, 59.09090909, 59.09090909, 95.45454545,
95.45454545, 72.72727273, 81.81818182])
Now that you have a list of percentiles, you can bin them as before using numpy.digitize:
bins_percentile = [0,20,40,60,80,100]
data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)
data_binned_indices
Out[3]:
array([1, 1, 2, 2, 2, 3, 3, 5, 5, 4, 5], dtype=int64)
This gives you the data binned according to the indices of your chosen list of percentiles. If desired, you could also return the actual (upper) percentiles using numpy.take:
data_binned_percentiles = np.take(bins_percentile, data_binned_indices)
data_binned_percentiles
Out[4]:
array([ 20, 20, 40, 40, 40, 60, 60, 100, 100, 80, 100])
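Since you mentioned pandas rank(pct=True): that approach can work too. As a rough sketch, it produces the same percentiles as Method 2 (average rank divided by the number of values), which you can then pass to numpy.digitize exactly as above:
import numpy as np
import pandas as pd
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
bins_percentile = [0, 20, 40, 60, 80, 100]
data_percentile = pd.Series(data).rank(pct=True).to_numpy() * 100  # average-rank percentiles, as in Method 2
data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)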
I want to segment a 1D dataset, where each value represents an error, into 2 segments:
A cluster with the smallest values
All the others
Example:
X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5, 21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)
In this small example, I would like to regroup the 4 first values in a cluster and forget about the others. I do not want a solution based on a threshold. The point is that the cluster of interest centroid will not always have the same value. It might be 1e-6, or it might be 1e-3, or it might be 1.
My idea was to use a k-means clustering algorithm, which would work fine if I knew how many clusters existed in my data. In the example above, the number is 3: one around 1 (the cluster of interest), one around 22, and one around 51. But sadly, I do not know the number of clusters... Simply searching for 2 clusters will not lead to a segmentation of the dataset as intended.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
This returns a cluster 1 that is far too large, which also includes the data from the cluster centered around 22.
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
I did find some interesting answers on methods for selecting k, but they complicate the algorithm, and I feel like there must be a far better way to solve this problem.
I'm open to any suggestions and example which could work on the X array provided.
You might find AffinityPropagation useful here, as it does not require you to specify the number of clusters to generate. You might, however, have to tune the damping factor and preference so that it produces the expected results.
On the provided example, the default parameters seem to do the job:
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5,
              21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)
ap = AffinityPropagation(random_state=12).fit(X)
y = ap.predict(X)
print(y)
# array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], dtype=int64)
To obtain individual clusters from X, you can index using y:
first_cluster = X[y==0].ravel()
first_cluster
# array([1. , 1.5, 0.4, 1.1])
second_cluster = X[y==1].ravel()
second_cluster
# array([23. , 24. , 22.5, 21. , 20. , 25. ])
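Since the question asks specifically for the cluster with the smallest values, you could also select it programmatically instead of hard-coding label 0. A small sketch using the fitted model's cluster_centers_ (cluster_of_interest is just an illustrative name):
# Label whose exemplar (cluster centre) has the smallest value
smallest_label = np.argmin(ap.cluster_centers_.ravel())
cluster_of_interest = X[y == smallest_label].ravel()
# should match first_cluster above: array([1. , 1.5, 0.4, 1.1])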
I have some data and corresponding labels like below:
data = [img1, img2, img3, ...] # each category has 1000 samples, total data is 10000
labels = [1, 1, 2, 2, 3, 3, 4, 4, ...] # total num of labels is 10
I want to make a new sub-dataset in which one category has 1000 samples and the other categories have 100 samples each, so the total number of samples in the sub-dataset will be 1900 (1000 vs 900).
(My intent is to make a sub-dataset for binary classification.)
So I need to randomly sample the required number of samples from each category.
I think this is similar to stratified sampling, so I tried to find a method in scikit-learn, but I couldn't.
How can I do this?
I couldn't find a function either so I made one.
Let's make a bogus dataset:
import numpy as np
x = np.random.choice(np.arange(10), 10_000)
Now, let's find indices that, when used to index x, return an equal number of samples from each class.
d = dict()
for val in np.unique(x):
    d[str(val)] = np.where(x == val)
    d[str(val)] = np.random.choice(d[str(val)][0], 100, replace=False)
ix = np.concatenate([values for values in d.values()])
Let's test it:
print(np.unique(x[ix], return_counts=True))
Out[64]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([100, 100, 100, 100, 100, 100, 100, 100, 100, 100], dtype=int64))
You can also use ix with your y, or any other array.
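If, as in your question, one category should keep all of its samples while the others contribute 100 each, the same idea works with a per-class sample size. A sketch, where positive_label is a hypothetical choice of the category that keeps all of its samples:
import numpy as np
x = np.random.choice(np.arange(10), 10_000)
positive_label = 3  # hypothetical category kept in full
parts = []
for val in np.unique(x):
    idx = np.where(x == val)[0]
    # keep everything for the positive class, 100 random samples otherwise
    n_keep = len(idx) if val == positive_label else 100
    parts.append(np.random.choice(idx, n_keep, replace=False))
ix = np.concatenate(parts)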
I have a dataset with an array of time bins of size 1/4096 seconds against the number of photons in each time bin. Now I want to change the resolution by making the time bins a factor of 2 larger, summing up pairs of bins and taking the mean, for both the times and the photon counts. I tried a couple of things like:
tnew = []
for n in range(int(len(t)/2)):
    tnew[n] = (t[2*n]+t[2*n+1])/2
and:
for l in range(int(len(t)/2)):
    np.append(t, (np.sum(t[2*l:4096*(2*l+1)]))/2)
but I can't seem to make this work. I'm really new to Python.
If you want to take the means of adjacent elements in a NumPy array, you can do the following:
In [2]: a = np.arange(10)
In [3]: a
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [4]: (a[:-1:2] + a[1::2])/2.
Out[4]: array([ 0.5, 2.5, 4.5, 6.5, 8.5])
Here, a[:-1:2] is all the elements at even indexes and a[1::2] is all the elements at odd indexes.
In your case, since your array's length is a power of 2, you might choose to allow binning by m = 2, 4, 8, etc. by reshaping and taking the mean along the corresponding axis:
In [5]: n = 1024
In [6]: a = np.arange(n)
In [7]: m = 8
In [8]: b = a.reshape((a.shape[0]//m, m))
In [9]: b.mean(axis=1)
Out[9]:
array([ 3.5, 11.5, 19.5, 27.5, 35.5, 43.5, 51.5,
59.5, 67.5, 75.5, 83.5, 91.5, 99.5, 107.5,
...
])
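Applied to your case, the same idea might look like the sketch below; t and counts are hypothetical stand-ins for your actual time bins and photon counts:
import numpy as np
t = np.arange(4096) / 4096.0             # hypothetical time-bin centres, width 1/4096 s
counts = np.random.poisson(5, t.size)    # hypothetical photon counts per bin
m = 2                                    # rebinning factor
t_new = t.reshape(-1, m).mean(axis=1)
counts_new = counts.reshape(-1, m).mean(axis=1)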
I am trying to sort values in a numpy array so that I can store all of the values that are in a certain range (that could probably be phrased better). Anyway, I'll give an example of what I am trying to do. I have an array called bins that looks like this:
bins = array([11,11.5,12,12.5,13,13.5,14])
I also have another array called avgs:
avgs = array([11.02, 13.67, 11.78, 12.34, 13.24, 12.98, 11.3, 12.56, 13.95, 13.56,
11.64, 12.45, 13.23, 13.64, 12.46, 11.01, 11.87, 12.34, 13,87, 13.04,
12.49, 12.5])
What I am trying to do is find the index values of the avgs array that fall in the ranges between the values of the bins array. For example, I was trying to make a while loop that would create new variables for each bin. The first bin would be everything that is between bins[0] and bins[1] and would look like:
bin1 = array([0, 6, 15])
Those index values would correspond to the values 11.02, 11.3, and 11.01 in avgs, i.e. the values of avgs that fall between bins[0] and bins[1]. I also need the other bins, so another example would be:
bin2 = array([2, 10, 16])
However, the challenging part for me is that the sizes of bins and avgs change based on other parameters, so I was trying to build something that can scale to larger or smaller bins and avgs arrays.
Numpy has some pretty powerful bin counting functions.
>>> binplace = np.digitize(avgs, bins)  # Returns which bin each average belongs to
>>> binplace
array([1, 6, 2, 3, 5, 4, 1, 4, 6, 6, 2, 3, 5, 6, 3, 1, 2, 3, 5, 7, 5, 3, 4])
>>> np.where(binplace == 1)
(array([ 0, 6, 15]),)
>>> np.where(binplace == 2)
(array([ 2, 10, 16]),)
>>> avgs[np.where(binplace == 1)]
array([ 11.02, 11.3 , 11.01])
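If you'd rather not create a separate variable per bin (especially since the number of bins can change), one option is to collect all of them at once, e.g. in a dictionary keyed by bin index (bin_indices is just an illustrative name):
>>> bin_indices = {b: np.where(binplace == b)[0] for b in np.unique(binplace)}
>>> bin_indices[1]
array([ 0,  6, 15])
>>> avgs[bin_indices[2]]
array([ 11.78,  11.64,  11.87])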
I am trying to shrink a numpy array by taking the average of groups of elements, e.g. taking the average of each 5x5 sub-array in a 100x100 array to create a 20x20 array. As I have a huge amount of data to manipulate, what is an efficient way to do that?
I have tried this on a smaller array, so test it with yours:
import numpy as np
nbig = 100
nsmall = 20
big = np.arange(nbig * nbig).reshape([nbig, nbig]) # 100x100
small = big.reshape([nsmall, nbig//nsmall, nsmall, nbig//nsmall]).mean(3).mean(1)
An example with 6x6 -> 3x3:
nbig = 6
nsmall = 3
big = np.arange(36).reshape([6,6])
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[12, 13, 14, 15, 16, 17],
[18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29],
[30, 31, 32, 33, 34, 35]])
small = big.reshape([nsmall, nbig//nsmall, nsmall, nbig//nsmall]).mean(3).mean(1)
array([[ 3.5, 5.5, 7.5],
[ 15.5, 17.5, 19.5],
[ 27.5, 29.5, 31.5]])
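The same trick packaged as a small helper, assuming a square array whose size is divisible by the shrink factor (block_mean is just an illustrative name):
import numpy as np
def block_mean(a, factor):
    # Average each factor x factor block of a square 2D array.
    n = a.shape[0]
    return a.reshape(n // factor, factor, n // factor, factor).mean(axis=(1, 3))
small = block_mean(np.arange(36).reshape(6, 6), 2)  # same 3x3 result as above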
This is pretty straightforward, although I feel like it could be faster:
from __future__ import division
import numpy as np
Norig = 100
Ndown = 20
step = Norig//Ndown
assert step == Norig/Ndown # ensure Ndown is an integer factor of Norig
x = np.arange(Norig*Norig).reshape((Norig,Norig)) #for testing
y = np.empty((Ndown,Ndown)) # for testing
for yr, xr in enumerate(np.arange(0, Norig, step)):
    for yc, xc in enumerate(np.arange(0, Norig, step)):
        y[yr, yc] = np.mean(x[xr:xr+step, xc:xc+step])
You might also find scipy.signal.decimate interesting. It applies a more sophisticated low-pass filter than simple averaging before downsampling the data, although you'd have to decimate one axis, then the other.
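A rough sketch of what that could look like; note that because decimate filters before downsampling, the values will differ from a plain block mean:
import numpy as np
from scipy import signal
x = np.arange(100 * 100, dtype=float).reshape(100, 100)
# Decimate by 5 along each axis in turn: 100x100 -> 20x20
y = signal.decimate(signal.decimate(x, 5, axis=0), 5, axis=1)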
Average a 2D array over subarrays of size NxN:
from numpy import average, split

height, width = data.shape
data = average(split(average(split(data, width // N, axis=1), axis=-1), height // N, axis=1), axis=-1)
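As a quick sanity check of the one-liner on a small hypothetical 4x4 array with N = 2:
import numpy as np
from numpy import average, split
N = 2
data = np.arange(16, dtype=float).reshape(4, 4)
height, width = data.shape
small = average(split(average(split(data, width // N, axis=1), axis=-1),
                      height // N, axis=1), axis=-1)
# small should be array([[ 2.5,  4.5], [10.5, 12.5]]), the 2x2 block means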
Note that eumiro's approach does not work for masked arrays as .mean(3).mean(1) assumes that each mean along axis 3 was computed from the same number of values. If there are masked elements in your array, this assumption does not hold any more. In that case, you have to keep track of the number of values used to compute .mean(3) and replace .mean(1) by a weighted mean. The weights are the normalized number of values used to compute .mean(3).
Here is an example:
import numpy as np
def gridbox_mean_masked(data, Nbig, Nsmall):
    # Reshape data
    rshp = data.reshape([Nsmall, Nbig//Nsmall, Nsmall, Nbig//Nsmall])
    # Compute mean along axis 3 and remember the number of values each mean
    # was computed from
    mean3 = rshp.mean(3)
    count3 = rshp.count(3)
    # Compute weighted mean along axis 1
    mean1 = (count3*mean3).sum(1)/count3.sum(1)
    return mean1
# Define test data
big = np.ma.array([[1, 1, 2],
                   [1, 1, 1],
                   [1, 1, 1]])
big.mask = [[0, 0, 0],
            [0, 0, 1],
            [0, 0, 0]]
Nbig = 3
Nsmall = 1
# Compute gridbox mean
print(gridbox_mean_masked(big, Nbig, Nsmall))