Making time bins bigger in a large dataset with Python

I have a dataset with an array of time bins of size 1/4096 seconds against the number of photons in each time bin. Now I want to coarsen the resolution by making the time bins a factor of 2 larger: summing up 2 of them and taking the mean, for both the times and the photon counts. I tried a couple of things like:
tnew = []
for n in range(int(len(t)/2)):
    tnew[n] = (t[2*n] + t[2*n+1])/2
and:
for l in range(int(len(t)/2)):
    np.append(t, (np.sum(t[2*l:4096*(2*l+1)]))/2)
but I can't seem to make this work. I'm really new to Python.

If you want to take the means of adjacent elements in a NumPy array, you can do the following:
In [2]: a = np.arange(10)
In [3]: a
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [4]: (a[:-1:2] + a[1::2])/2.
Out[4]: array([ 0.5, 2.5, 4.5, 6.5, 8.5])
Here, a[:-1:2] is the elements at even indexes and a[1::2] is the elements at odd indexes; the -1 stop just keeps the two slices the same length when the array has an odd number of elements.
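Applied to the question's data, the same pairwise slicing works for both arrays in one pass. A minimal sketch, where t and counts are hypothetical stand-ins for the real time and photon-count arrays and are assumed to have even length:

import numpy as np

# Hypothetical stand-ins for the real data.
t = np.arange(8) / 4096.0                    # bin times at 1/4096 s resolution
counts = np.array([3, 1, 4, 1, 5, 9, 2, 6])  # photons per bin

tnew = (t[::2] + t[1::2]) / 2.0              # mean of each adjacent pair of times
cnew = (counts[::2] + counts[1::2]) / 2.0    # mean of each adjacent pair of counts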
In your case, since your array's length is a power of 2, you might choose to allow binning by m = 2, 4, 8, etc. by reshaping and taking the mean along the corresponding axis:
In [5]: n = 1024
In [6]: a = np.arange(n)
In [7]: m = 8
In [8]: b = a.reshape((a.shape[0] // m, m))
In [9]: b.mean(axis=1)
Out[9]:
array([  3.5,  11.5,  19.5,  27.5,  35.5,  43.5,  51.5,
        59.5,  67.5,  75.5,  83.5,  91.5,  99.5, 107.5,
        ...
      ])
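If the array's length ever stops being an exact multiple of m, the reshape will fail; one common workaround (an aside, not part of the original answer) is to trim the leftover elements first:

m = 8
trimmed = a[:len(a) // m * m]            # drop the last len(a) % m elements
b = trimmed.reshape(-1, m).mean(axis=1)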

Related

How to find the median of different-sized lists

I have a list of numbers which I want to sort into bins and find the median of each bin. If the bins all had the same number of data points, this would be easy to do reasonably efficiently using numpy arrays:
import numpy as np
indices = np.array([0, 1, 0, 1, 1, 2, 3, 3, 3, 2, 0, 2])
length = np.max(indices) + 1
data = np.arange(len(indices))
binned = np.array([data[indices == i] for i in range(length)])
The binned data (in the array binned) is then
array([[ 0,  2, 10],
       [ 1,  3,  4],
       [ 5,  9, 11],
       [ 6,  7,  8]])
The median of each bin is:
np.median(binned, axis=1)
Result:
array([2., 3., 9., 7.])
However, if the list is such that there are different numbers of points in each bin (or no points in some of the bins), I can't create a numpy array or use np.median and instead have to do the heavy lifting in a for loop:
indices = np.array([0, 1, 1, 1, 3, 1, 1, 0, 0, 0, 3])
data = np.arange(len(indices))
The binned data is
[data[indices == i] for i in range(length)]
[array([0, 7, 8, 9]),
 array([1, 2, 3, 5, 6]),
 array([], dtype=int64),
 array([ 4, 10])]
But I can't take a median of the list of arrays. Instead, I can do
[np.median(data[indices == i]) for i in range(length)]
and get
[7.5, 3.0, nan, 7.0]
But that for loop is pretty slow. (I have a few million data points and tens or hundreds of thousands of bins in my real data.)
Is there a way to do this that avoids heavy reliance on for loops (or even gets rid of for loops altogether)?
Just put your two columns in a pandas DataFrame and you can easily compute your medians by grouping by 'indices'. Let's see it in practice:
import numpy as np
import pandas as pd
indices = [0,1,1,1,3,1,1,0,0,0,3]
data = np.arange(len(indices))
df = pd.DataFrame({"indices": indices, "data": data}) # Your DataFrame
df.head() # Take a look

   indices  data
0        0     0
1        1     1
2        1     2
3        1     3
4        3     4
medians = df.groupby("indices").median()  # median for each value of `indices`
medians

         data
indices
0         7.5
1         3.0
3         7.0
# Finding indices with no data point
desired_indices = pd.Series([0, 1, 10, -5, 2])
is_in_index = desired_indices.isin(medians.index)
has_no_data = desired_indices[~ is_in_index]
has_no_data
2 10
3 -5
4 2
dtype: int64
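If you also want the output to line up with the original for-loop result, with NaN for the empty bins, you can reindex the grouped medians over the full range of bin labels. A small sketch, reusing length = np.max(indices) + 1 from the question:

length = np.max(indices) + 1   # 4 for this example
medians_full = df.groupby("indices")["data"].median().reindex(range(length))
# medians_full is now [7.5, 3.0, NaN, 7.0], matching the list comprehension.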

Calculating percentile of bins from numpy digitize?

I have a set of data, and a set of thresholds for creating bins:
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
thresholds = np.array([0,5,10])
bins = np.digitize(data, thresholds, right=True)
For each of the elements in bins, I want to know the base percentile: the smallest bin should start at the 0th percentile, the next bin at, for example, the 20th percentile, so that if a value in data falls between the 0th and 20th percentiles of data, it belongs in the first bin.
I've looked into pandas rank(pct=True) but can't seem to get this done correctly.
Suggestions?
You can calculate the percentile for each element in your data array as described in a previous StackOverflow question (Map each list value to its corresponding percentile).
import numpy as np
from scipy import stats
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
Method 1: Using scipy.stats.percentileofscore:
data_percentile = np.array([stats.percentileofscore(data, a) for a in data])
data_percentile
Out[1]:
array([  9.09090909,  18.18181818,  36.36363636,  36.36363636,
        36.36363636,  59.09090909,  59.09090909,  95.45454545,
        95.45454545,  72.72727273,  81.81818182])
Method 2: Using scipy.stats.rankdata and normalising to 100 (faster):
ranked = stats.rankdata(data)
data_percentile = ranked/len(data)*100
data_percentile
Out[2]:
array([  9.09090909,  18.18181818,  36.36363636,  36.36363636,
        36.36363636,  59.09090909,  59.09090909,  95.45454545,
        95.45454545,  72.72727273,  81.81818182])
Now that you have a list of percentiles, you can bin them as before using numpy.digitize:
bins_percentile = [0,20,40,60,80,100]
data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)
data_binned_indices
Out[3]:
array([1, 1, 2, 2, 2, 3, 3, 5, 5, 4, 5], dtype=int64)
This gives you the data binned according to the indices of your chosen list of percentiles. If desired, you could also return the actual (upper) percentiles using numpy.take:
data_binned_percentiles = np.take(bins_percentile, data_binned_indices)
data_binned_percentiles
Out[4]:
array([ 20, 20, 40, 40, 40, 60, 60, 100, 100, 80, 100])
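If you need this repeatedly, the two steps collapse into a small helper. A sketch (the function name is mine, not from any library):

def percentile_bins(data, bins_percentile):
    # Rank-based percentile of each element, then digitize into percentile bins.
    percentiles = stats.rankdata(data) / len(data) * 100
    return np.digitize(percentiles, bins_percentile, right=True)

percentile_bins(data, [0, 20, 40, 60, 80, 100])
# -> array([1, 1, 2, 2, 2, 3, 3, 5, 5, 4, 5])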

Fast way to take average of every N rows in a .npy array

I have a very large masked NumPy array (originalArray) with many rows and two columns. I want to take the average of every two rows in originalArray and build a newArray in which each row is the average of two rows in originalArray (so newArray has half as many rows as originalArray). This should be a simple thing to do, but the script below is EXTREMELY slow. Any advice from the community would be greatly appreciated.
newList = []
for i in range(0, originalArray.shape[0], 2):
    r = originalArray[i:i+2, :].mean(axis=0)
    newList.append(r)
newArray = np.asarray(newList)
There must be a more elegant way of doing this. Many thanks!
The mean of two values a and b is 0.5*(a+b), so you can do it like this:
newArray = 0.5*(originalArray[0::2] + originalArray[1::2])
This sums each pair of consecutive rows and, at the end, multiplies every element by 0.5.
Since in the title you are asking for avg over N rows, here is a more general solution:
def groupedAvg(myArray, N=2):
    result = np.cumsum(myArray, 0)[N-1::N] / float(N)
    result[1:] = result[1:] - result[:-1]
    return result
The general form of the average over n elements is sum([x1, x2, ..., xn])/n.
The sum of elements m through m+n-1 of a vector v is the same as cumsum(v)[m+n-1] - cumsum(v)[m-1]; when m is 0 there is nothing to subtract, which is result[0].
That is what we take advantage of here. Also, since everything is linear, it does not matter where we divide by N, so we do it right at the beginning; that is just a matter of taste.
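To see the identity on concrete numbers (a hand-worked check, not from the original answer):

v = np.array([1, 2, 3, 4, 5, 6])
np.cumsum(v)               # [ 1,  3,  6, 10, 15, 21]
np.cumsum(v)[1::2] / 2.0   # [ 1.5,  5. , 10.5] (running sums of 2, 4, 6 elements)
# Differencing adjacent entries recovers the group means: [1.5, 3.5, 5.5]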
If the last group has fewer than N elements, it will be ignored completely.
If you don't want to ignore it, you have to treat the last group specially:
def avg(myArray, N=2):
    cum = np.cumsum(myArray, 0)
    result = cum[N-1::N] / float(N)
    result[1:] = result[1:] - result[:-1]
    remainder = myArray.shape[0] % N
    if remainder != 0:
        if remainder < myArray.shape[0]:
            lastAvg = (cum[-1] - cum[-1-remainder]) / float(remainder)
        else:
            lastAvg = cum[-1] / float(remainder)
        result = np.vstack([result, lastAvg])
    return result
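A quick hand-checked run of both functions (my example, using a 5x2 array so the last group is short):

myArray = np.arange(10).reshape(5, 2)
groupedAvg(myArray)   # [[1., 2.], [5., 6.]]             (last row dropped)
avg(myArray)          # [[1., 2.], [5., 6.], [8., 9.]]   (last row kept as its own group)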
Your problem (average of every two rows with two columns):
>>> a = np.reshape(np.arange(12),(6,2))
>>> a
array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])
>>> a.transpose().reshape(-1,2).mean(1).reshape(2,-1).transpose()
array([[ 1.,  2.],
       [ 5.,  6.],
       [ 9., 10.]])
Other dimensions (average of every four rows with three columns):
>>> a = np.reshape(np.arange(24),(8,3))
>>> a
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14],
       [15, 16, 17],
       [18, 19, 20],
       [21, 22, 23]])
>>> a.transpose().reshape(-1,4).mean(1).reshape(3,-1).transpose()
array([[ 4.5,  5.5,  6.5],
       [16.5, 17.5, 18.5]])
General formula for taking the average of r rows for a 2D array a with c columns (r must divide the number of rows evenly):
a.transpose().reshape(-1,r).mean(1).reshape(c,-1).transpose()
import numpy as np

def av(array):
    # Group the rows in pairs, sum within each pair, and divide by the
    # group size (2) -- not by the column count, which only coincides
    # with 2 for two-column input.
    return np.sum(array.reshape(array.shape[0] // 2, 2, array.shape[1]), axis=1) / 2.0

a = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])
print(av(a))
# [[1.5 1.5]
#  [3.5 3.5]]

How to make numpy.cumsum start after the first value

I have:
import numpy as np
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7, ..., 4])
x = (B/position**2)*dt
A = np.cumsum(x)
assert A[0] == 0 # I want this to be true.
Where B and dt are scalar constants. This is for a numerical integration problem with initial condition of A[0] = 0. Is there a way to set A[0] = 0 and then do a cumsum for everything else?
I don't understand what exactly your problem is, but here are some things you can do to have A[0] = 0.
You can create A to be longer by one index to have the zero as the first entry:
# initialize example data
import numpy as np
B = 1
dt = 1
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7])
# do calculation
A = np.zeros(len(position) + 1)
A[1:] = np.cumsum((B/position**2)*dt)
Result:
A = [ 0. 0.0625 0.11559096 0.16105356 0.20073547 0.23633533 0.26711403]
len(A) == len(position) + 1
Alternatively, you can manipulate the calculation to subtract the first entry of the result:
# initialize example data
import numpy as np
B = 1
dt = 1
position = np.array([4, 4.34, 4.69, 5.02, 5.3, 5.7])
# do calculation
A = np.cumsum((B/position**2)*dt)
A = A - A[0]
Result:
[ 0. 0.05309096 0.09855356 0.13823547 0.17383533 0.20461403]
len(A) == len(position)
As you see, the results have different lengths. Is one of them what you expect?
1D cumsum
A wrapper around np.cumsum that sets first element to 0:
def cumsum(pmf):
    cdf = np.empty(len(pmf) + 1, dtype=pmf.dtype)
    cdf[0] = 0
    np.cumsum(pmf, out=cdf[1:])
    return cdf
Example usage:
>>> np.arange(1, 11)
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> cumsum(np.arange(1, 11))
array([ 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55])
N-D cumsum
A wrapper around np.cumsum that sets first element to 0, and works with N-D arrays:
def cumsum(pmf, axis=None, dtype=None):
    if axis is None:
        pmf = pmf.reshape(-1)
        axis = 0
    if dtype is None:
        dtype = pmf.dtype
    idx = [slice(None)] * pmf.ndim
    # Create array with extra element along cumsummed axis.
    shape = list(pmf.shape)
    shape[axis] += 1
    cdf = np.empty(shape, dtype)
    # Set first element to 0.
    idx[axis] = 0
    cdf[tuple(idx)] = 0
    # Perform cumsum on remaining elements.
    idx[axis] = slice(1, None)
    np.cumsum(pmf, axis=axis, dtype=dtype, out=cdf[tuple(idx)])
    return cdf
Example usage:
>>> np.arange(1, 11).reshape(2, 5)
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10]])
>>> cumsum(np.arange(1, 11).reshape(2, 5), axis=-1)
array([[ 0,  1,  3,  6, 10, 15],
       [ 0,  6, 13, 21, 30, 40]])
I totally understand your pain; I wonder why NumPy doesn't allow this with np.cumsum. Anyway, though I'm really late and there's already another good answer, I prefer this one a bit more:
np.cumsum(np.pad(array, (1, 0), "constant"))
where array in your case is (B/position**2)*dt. You can change the order of np.pad and np.cumsum as well. I'm just adding a zero to the start of the array and calling np.cumsum.
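For reference, the reversed order would look like this; it is equivalent because the pad just prepends a zero either way:

np.pad(np.cumsum(array), (1, 0), "constant")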
You can use roll (shift right by 1) and then set the first entry to zero.
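A sketch of that idea (note it keeps the array the same length, so the grand total that wraps around to the front is discarded):

A = np.cumsum(x)
A = np.roll(A, 1)   # rotate right by one; the last entry wraps to the front
A[0] = 0            # overwrite the wrapped entry with the initial condition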

How to organize values in a numpy array into bins that contain a certain range of values?

I am trying to sort values in a numpy array so that I can store all of the values that are in a certain range (that could probably be phrased better). Anyway, I'll give an example of what I am trying to do. I have an array called bins that looks like this:
bins = array([11,11.5,12,12.5,13,13.5,14])
I also have another array called avgs:
avgs = array([11.02, 13.67, 11.78, 12.34, 13.24, 12.98, 11.3, 12.56, 13.95, 13.56,
              11.64, 12.45, 13.23, 13.64, 12.46, 11.01, 11.87, 12.34, 13.87, 13.04,
              12.49, 12.5])
What I am trying to do is to find the index values of the avgs array that are in the ranges between the values of the bins array. For example I was trying to make a while loop that would create new variables for each bin. The first bin would be everything that is between bins[0] and bins[1] and would look like:
bin1 = array([0, 6, 15])
Those index values would correspond to the values 11.02, 11.3, and 11.01 in avgs, i.e. the values of avgs that fall between bins[0] and bins[1]. I also need the other bins, so another example would be:
bin2 = array([2, 10, 16])
However, the challenging part for me is that the sizes of bins and avgs change based on other parameters, so I was trying to build something that can be expanded to larger or smaller bins and avgs arrays.
Numpy has some pretty powerful bin counting functions.
>>> binplace = np.digitize(avgs, bins)  # Returns which bin each average belongs to
>>> binplace
array([1, 6, 2, 3, 5, 4, 1, 4, 6, 6, 2, 3, 5, 6, 3, 1, 2, 3, 6, 5, 3, 4])
>>> np.where(binplace == 1)
(array([ 0, 6, 15]),)
>>> np.where(binplace == 2)
(array([ 2, 10, 16]),)
>>> avgs[np.where(binplace == 1)]
array([ 11.02, 11.3 , 11.01])
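Since the sizes of bins and avgs vary, you can collect every bin's index array in one go instead of creating a separate variable per bin. A minimal sketch building a dict keyed by bin number (my construction on top of the answer above):

bin_members = {i: np.where(binplace == i)[0] for i in range(1, len(bins))}
bin_members[1]   # array([ 0,  6, 15])
bin_members[2]   # array([ 2, 10, 16])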
