I have accelerometer data (x, y, z) which is updated every 50 ms. I need to store 80 values of the data in a 3D numpy array of shape (1, 80, 3). For example:
[[[x,y,z] (at 0ms)
[x,y,z] (at 50ms)
...
[x,y,z]]] (at 4000ms)
After getting the first 80 values, I need to update the array with upcoming values, for example:
[[[x,y,z] (at 50ms)
[x,y,z] (at 100ms)
...
[x,y,z]]] (at 4050ms)
I'm sure there is a way to update the array without needing to manually write 80 variables to store the data into, but I can't think of it. Would really appreciate some help here.
It sounds like you want your array to always be 80 long, so what I would suggest is to roll the array and then update the last value.
import numpy as np
data = np.arange(80*3).reshape(80, 3)
data
>>> array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
...,
[231, 232, 233],
[234, 235, 236],
[237, 238, 239]])
data = np.roll(data, -1, axis=0)
data
>>> array([[ 3, 4, 5], # this is second row (index 1) in above array
[ 6, 7, 8], # third row
[ 9, 10, 11], # etc.
...,
[234, 235, 236],
[237, 238, 239],
[ 0, 1, 2]]) # the first row has been rolled to the last position
# now update last position with new data
data[-1] = [x, y, z] # new x, y, z reading (76, 76, 76 in the output below)
data
>>> array([[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11],
...,
[234, 235, 236],
[237, 238, 239],
[ 76, 76, 76]]) # new data updates in correct position in array
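To produce the (1, 80, 3) shape from the question, the same idea wraps up into a small helper. Here is a minimal sketch; read_accelerometer() is a hypothetical stand-in for however you obtain each new (x, y, z) sample:
import numpy as np

data = np.zeros((80, 3))  # window of the last 80 samples

def push(window, sample):
    # drop the oldest row, shift the rest up, write the new sample at the end
    window = np.roll(window, -1, axis=0)
    window[-1] = sample
    return window

# called every 50 ms:
data = push(data, read_accelerometer())  # hypothetical data source
batch = data[None, ...]                  # view with shape (1, 80, 3)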
You can use vstack (initializing the array for the first iteration):
data = [x, y, z]                      # first iteration
data = np.vstack([data, [x, y, z]])   # for the rest
print(data)                           # you get an N x 3 array
For an update every N seconds, it is easier to use a FIFO queue or a ring buffer:
https://pypi.org/project/numpy_ringbuffer/
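If you would rather avoid a dependency, a hand-rolled ring buffer in plain NumPy is short. This is a sketch that overwrites in place instead of copying the whole window on every sample:
import numpy as np

class XYZRing:
    """Fixed-size circular buffer for (x, y, z) samples."""
    def __init__(self, size=80):
        self.buf = np.zeros((size, 3))
        self.idx = 0  # position of the next write

    def push(self, xyz):
        self.buf[self.idx] = xyz
        self.idx = (self.idx + 1) % len(self.buf)

    def window(self):
        # return the samples ordered oldest to newest, shaped (1, size, 3)
        ordered = np.roll(self.buf, -self.idx, axis=0)
        return ordered[None, ...]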
I am trying to combine many numpy files into one big numpy file. I tried to follow these two links: Append multiple numpy files to one big numpy file in python and Python append multiple files in given order to one big file. This is what I did:
import matplotlib.pyplot as plt
import numpy as np
import glob
import os, sys
fpath ="/home/user/Desktop/OutFileTraces.npy"
npyfilespath ="/home/user/Desktop/test"
os.chdir(npyfilespath)
with open(fpath,'wb') as f_handle:
for npfile in glob.glob("*.npy"):
# Find the path of the file
filepath = os.path.join(npyfilespath, npfile)
print filepath
# Load file
dataArray= np.load(filepath)
print dataArray
np.save(f_handle,dataArray)
dataArray= np.load(fpath)
print dataArray
An example of the result that I have:
/home/user/Desktop/Trace=96
[[ 0.01518007 0.01499514 0.01479736 ..., -0.00392216 -0.0039761
-0.00402747]]
[[-0.00824758 -0.0081808 -0.00811402 ..., -0.0077236 -0.00765425
-0.00762086]]
/home/user/Desktop/Trace=97
[[ 0.00614908 0.00581004 0.00549154 ..., -0.00814741 -0.00813457
-0.00809347]]
[[-0.00824758 -0.0081808 -0.00811402 ..., -0.0077236 -0.00765425
-0.00762086]]
/home/user/Desktop/Trace=98
[[-0.00291786 -0.00309509 -0.00329287 ..., -0.00809861 -0.00797789
-0.00784175]]
[[-0.00824758 -0.0081808 -0.00811402 ..., -0.0077236 -0.00765425
-0.00762086]]
/home/user/Desktop/Trace=99
[[-0.00379887 -0.00410453 -0.00438963 ..., -0.03497837 -0.0353842
-0.03575151]]
[[-0.00824758 -0.0081808 -0.00811402 ..., -0.0077236 -0.00765425
-0.00762086]]
This line represents the first trace:
[[-0.00824758 -0.0081808 -0.00811402 ..., -0.0077236 -0.00765425
-0.00762086]]
It is repeated all the time.
I asked the second question two days ago. At first I thought I had the best answer, but after trying to print and plot the final file 'OutFileTraces.npy' I found that my code:
1/ doesn't process the numpy files from the folder 'test' in their order (trace0, trace1, trace2, ...);
2/ saves only one trace in the file; I mean that when I print or plot OutFileTraces.npy, I find just one trace, the first one.
So I need to correct my code, because I am really stuck. I would be very grateful if you could help me.
Thanks in advance.
Glob produces an unordered list. You need to sort explicitly with an extra line, since list.sort() sorts in place and does not return the list.
npfiles = glob.glob("*.npy")
npfiles.sort()
for npfile in npfiles:
...
A NumPy .npy file contains a single array. If you want to store several arrays in a single file, have a look at .npz files written with np.savez: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html#numpy.savez I have not seen this format used widely, so you may want to consider alternatives.
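For instance, a minimal savez round trip (the file name 'traces.npz' and the keyword names are only illustrative) looks like this:
import numpy as np

a = np.arange(5)
b = np.ones((2, 3))
np.savez('traces.npz', first=a, second=b)  # keyword names become keys

archive = np.load('traces.npz')
print(archive['first'])   # [0 1 2 3 4]
print(archive['second'])  # the 2x3 array of ones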
If your arrays are all of the same shape and store related data, you can make a larger array. Say that the current shape is (N_1, N_2) and that you have N_0 such arrays. A loop with
all_arrays = []
for npfile in npfiles:
    all_arrays.append(np.load(os.path.join(npyfilespath, npfile)))
all_arrays = np.array(all_arrays)
np.save(fpath, all_arrays)  # one file holding one big array
will produce a file with a single array of shape (N_0, N_1, N_2).
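Loading that file back then recovers every trace at once, for example:
data = np.load(fpath)
print(data.shape)  # (N_0, N_1, N_2); data[i] is the i-th trace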
If you need per-name access to the arrays, HDF5 files are a good match. See http://www.h5py.org/ (a full intro is too much for an SO answer; see the quick start guide http://docs.h5py.org/en/latest/quick.html).
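As a taste, a minimal h5py sketch (the file name 'traces.h5' and the dataset names are only illustrative):
import h5py
import numpy as np

with h5py.File('traces.h5', 'w') as f:
    for i in range(3):
        # one named dataset per trace
        f.create_dataset('trace%d' % i, data=np.arange(10) * i)

with h5py.File('traces.h5', 'r') as f:
    print(list(f.keys()))  # ['trace0', 'trace1', 'trace2']
    print(f['trace1'][:])  # read a dataset back as a NumPy array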
As discussed in
loading arrays saved using numpy.save in append mode
it is possible to save multiple times to an open file, and it is possible to load multiple times. That's not documented, and probably not preferred, but it works. A savez archive is the preferred method for saving multiple arrays.
Here's a toy example:
In [777]: with open('multisave.npy','wb') as f:
     ...:     arr = np.arange(10)
     ...:     np.save(f, arr)
     ...:     arr = np.arange(20)
     ...:     np.save(f, arr)
     ...:     arr = np.ones((3,4))
     ...:     np.save(f, arr)
     ...:
In [778]: ll multisave.npy
-rw-rw-r-- 1 paul 456 Feb 13 08:38 multisave.npy
In [779]: with open('multisave.npy','rb') as f:
     ...:     arr = np.load(f)
     ...:     print(arr)
     ...:     print(np.load(f))
     ...:     print(np.load(f))
     ...:
[0 1 2 3 4 5 6 7 8 9]
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
Here's a simple example of saving a list of arrays of the same shape:
In [780]: traces = [np.arange(10),np.arange(10,20),np.arange(100,110)]
In [781]: traces
Out[781]:
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])]
In [782]: arr = np.array(traces)
In [783]: arr
Out[783]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[ 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109]])
In [785]: np.save('mult1.npy', arr)
In [786]: data = np.load('mult1.npy')
In [787]: data
Out[787]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[ 10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109]])
In [788]: list(data)
Out[788]:
[array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19]),
array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109])]
I have a set of data, and a set of thresholds for creating bins:
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
thresholds = np.array([0,5,10])
bins = np.digitize(data, thresholds, right=True)
For each of the elements in bins, I want to know the base percentile. For example, in bins, the smallest bin should start at the 0th percentile. Then the next bin, for example, the 20th percentile. So that if a value in data falls between the 0th and 20th percentile of data, it belongs in the first bin.
I've looked into pandas rank(pct=True) but can't seem to get this done correctly.
Suggestions?
You can calculate the percentile for each element in your data array as described in a previous StackOverflow question (Map each list value to its corresponding percentile).
import numpy as np
from scipy import stats
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
Method 1: Using scipy.stats.percentileofscore:
data_percentile = np.array([stats.percentileofscore(data, a) for a in data])
data_percentile
Out[1]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
36.36363636, 59.09090909, 59.09090909, 95.45454545,
95.45454545, 72.72727273, 81.81818182])
Method 2: Using scipy.stats.rankdata and normalising to 100 (faster):
ranked = stats.rankdata(data)
data_percentile = ranked/len(data)*100
data_percentile
Out[2]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
36.36363636, 59.09090909, 59.09090909, 95.45454545,
95.45454545, 72.72727273, 81.81818182])
Now that you have a list of percentiles, you can bin them as before using numpy.digitize:
bins_percentile = [0,20,40,60,80,100]
data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)
data_binned_indices
Out[3]:
array([1, 1, 2, 2, 2, 3, 3, 5, 5, 4, 5], dtype=int64)
This gives you the data binned according to the indices of your chosen list of percentiles. If desired, you could also return the actual (upper) percentiles using numpy.take:
data_binned_percentiles = np.take(bins_percentile, data_binned_indices)
data_binned_percentiles
Out[4]:
array([ 20, 20, 40, 40, 40, 60, 60, 100, 100, 80, 100])
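Since the question mentions pandas rank(pct=True): it produces the same percentile ranks as Method 2, because the default rank method is 'average'. A sketch, assuming pandas is available:
import pandas as pd

data_percentile = pd.Series(data).rank(pct=True).to_numpy() * 100
# same values as Method 2: average rank divided by the number of elements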
Say I have an array that looks like this:
a = np.array([0, 20, 40, 30, 60, 35, 15, 18, 2])
and I have an array of indices that I want to average between:
averaging_indices = np.array([2, 4, 7, 8])
What I want to do is to average the elements of array a according to the averaging_indices array. Just to make that clear I want to take the averages:
np.mean(a[0:2]), np.mean(a[2:4]), np.mean(a[4:7]), np.mean(a[7:8]), np.mean(a[8:])
and I want to return an array that then has the correct dimensions, in this case
result = [10, 35, 36.66, 18, 2]
Can anyone think of a neat way to do this? The only way I can imagine is by looping, which is very anti-numpy.
Here's a vectorized approach with np.bincount -
# Create "shifts array" and then IDs array for use with np.bincount later on
shifts_array = np.zeros(a.size,dtype=int)
shifts_array[averaging_indices] = 1
IDs = shifts_array.cumsum()
# Use np.bincount to get the summations for each tag and also tag counts.
# Thus, get tagged averages as final output.
out = np.bincount(IDs,a)/np.bincount(IDs)
Sample input, output -
In [60]: a
Out[60]: array([ 0, 20, 40, 30, 60, 35, 15, 18, 2])
In [61]: averaging_indices
Out[61]: array([2, 4, 7, 8])
In [62]: out
Out[62]: array([ 10. , 35. , 36.66666667, 18. , 2. ])
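An alternative sketch uses np.add.reduceat over the segment boundaries (the boundaries must include a leading 0 so the first segment starts at the beginning of the array):
boundaries = np.concatenate(([0], averaging_indices))
sums = np.add.reduceat(a, boundaries)            # per-segment sums
counts = np.diff(np.append(boundaries, a.size))  # per-segment lengths
out = sums / counts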
This might be a simple problem but I haven't come up with a solution.
Say I have an array such as np.array([0,1,0,1,0,0,0,1,0,1,0,0,1]) with peaks at indexes [1,3,7,9,12]. How can I replace those indexes with [2,8,12], that is, average indexes that are close together, given that peaks separated by a distance greater than 2 are treated as distinct in this example?
Please note that the binary values of the array are just for illustration, the peak value can be any real number.
You could use Raymond Hettinger's cluster function:
from __future__ import division

def cluster(data, maxgap):
    """Arrange data into groups where successive elements
    differ by no more than *maxgap*

    >>> cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
    [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

    >>> cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
    [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]
    """
    data.sort()
    groups = [[data[0]]]
    for item in data[1:]:
        val = abs(item - groups[-1][-1])
        if val <= maxgap:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups
peaks = [1,3,7,9,12]
print([sum(arr)/len(arr) for arr in cluster(peaks, maxgap=2)])
yields
[2.0, 8.0, 12.0]
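For a NumPy-only alternative, here is a sketch that splits the sorted peak indexes wherever the gap exceeds the threshold and averages each group:
import numpy as np

peaks = np.array([1, 3, 7, 9, 12])
# split after every position where the gap to the next peak exceeds 2
groups = np.split(peaks, np.where(np.diff(peaks) > 2)[0] + 1)
print([g.mean() for g in groups])  # [2.0, 8.0, 12.0]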