How to efficiently index into a 1D numpy array via slice ranges - python

I have a big 1D array of data. I have a starts array of indexes into that data where important things happened. I want to get an array of ranges so that I get windows of length L, one for each starting point in starts. Bogus sample data:
data = np.linspace(0,10,50)
starts = np.array([0,10,21])
length = 5
I want to instinctively do something like
data[starts:starts+length]
But really, I need to turn starts into 2D array of range "windows." Coming from functional languages, I would think of it as a map from a list to a list of lists, like:
np.apply_along_axis(lambda i: np.arange(i,i+length), 0, starts)
But that won't work because apply_along_axis only allows scalar return values.
You can do this:
pairs = np.vstack([starts, starts + length]).T
ranges = np.apply_along_axis(lambda p: np.arange(*p), 1, pairs)
data[ranges]
Or you can do it with a list comprehension:
data[np.array([np.arange(i,i+length) for i in starts])]
Or you can do it iteratively. (Bleh.)
Is there a concise, idiomatic way to slice into an array at certain start points like this? (Pardon the numpy newbie-ness.)

data = np.linspace(0,10,50)
starts = np.array([0,10,21])
length = 5
For a NumPy only way of doing this, you can use numpy.meshgrid() as described here
http://docs.scipy.org/doc/numpy/reference/generated/numpy.meshgrid.html
As hpaulj pointed out in the comments, meshgrid actually isn't needed for this problem as you can use array broadcasting.
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
# indices = sum(np.meshgrid(np.arange(length), starts))
indices = np.arange(length) + starts[:, np.newaxis]
# array([[ 0, 1, 2, 3, 4],
# [10, 11, 12, 13, 14],
# [21, 22, 23, 24, 25]])
data[indices]
returns
array([[ 0. , 0.20408163, 0.40816327, 0.6122449 , 0.81632653],
[ 2.04081633, 2.24489796, 2.44897959, 2.65306122, 2.85714286],
[ 4.28571429, 4.48979592, 4.69387755, 4.89795918, 5.10204082]])

If you need to do this many times, you can use as_strided() to create a sliding-window array over data
data = np.linspace(0,10,50000)
length = 5
starts = np.random.randint(0, len(data)-length, 10000)
from numpy.lib.stride_tricks import as_strided
sliding_window = as_strided(data, (len(data) - length + 1, length),
(data.itemsize, data.itemsize))
Then you can use:
sliding_window[starts]
to get what you want.
It's also faster than creating the index array.
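As a hedged alternative to the as_strided() call above: newer NumPy releases (1.20 and later, which is an assumption about your environment) ship sliding_window_view, which builds the same kind of window view but checks the shape for you and returns a read-only array. A minimal sketch using the small sample data from the question:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.linspace(0, 10, 50)
starts = np.array([0, 10, 21])
length = 5

# One row per possible start position; this is a view, no data is copied.
windows = sliding_window_view(data, length)

# Fancy-index the rows that begin at the requested start points.
result = windows[starts]

# Same values as the broadcasting approach.
expected = data[np.arange(length) + starts[:, np.newaxis]]
print(np.array_equal(result, expected))
```

Unlike a hand-written as_strided() call, a wrong window length here raises an error instead of silently reading past the end of the buffer.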

Related

Workaround for using a float slice indices in lists [duplicate]

I have an array of arbitrary length, and I want to select N elements of it, evenly spaced out (approximately, as N may be even, array length may be prime, etc) that includes the very first arr[0] element and the very last arr[len-1] element.
Example:
>>> arr = np.arange(17)
>>> arr
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
Then I want to make a function like the following to grab numElems evenly spaced out within the array, which must include the first and last element:
GetSpacedElements(numElems = 4)
>>> returns 0, 5, 11, 16
Does this make sense?
I've tried arr[0:len:numElems] (i.e. using the array start:stop:skip notation) and some slight variations, but I'm not getting what I'm looking for here:
>>> arr[0:len:numElems]
array([ 0, 4, 8, 12, 16])
or
>>> arr[0:len:numElems+1]
array([ 0, 5, 10, 15])
I don't care exactly what the middle elements are, as long as they're spaced evenly apart, off by an index of 1 let's say. But getting the right number of elements, including the index zero and last index, are critical.
To get a list of evenly spaced indices, use np.linspace:
idx = np.round(np.linspace(0, len(arr) - 1, numElems)).astype(int)
Next, index back into arr to get the corresponding values:
arr[idx]
Always use rounding before casting to integers. Internally, linspace calls astype when the dtype argument is provided. Therefore, this method is NOT equivalent to:
# this simply truncates the non-integer part
idx = np.linspace(0, len(arr) - 1, numElems).astype(int)
idx = np.linspace(0, len(arr) - 1, numElems, dtype='int')
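To make the difference concrete, here is a small sketch comparing the three variants on the 17-element array from the question (the variable names are just for illustration):

```python
import numpy as np

arr = np.arange(17)
num_elems = 4

raw = np.linspace(0, len(arr) - 1, num_elems)   # 0, 5.33..., 10.66..., 16

rounded = np.round(raw).astype(int)             # round first, then cast
truncated = raw.astype(int)                     # cast only: drops the fraction
via_dtype = np.linspace(0, len(arr) - 1, num_elems, dtype=int)

print(rounded)    # [ 0  5 11 16]
print(truncated)  # [ 0  5 10 16]
print(via_dtype)  # [ 0  5 10 16]
```

Only the rounded version picks index 11, which is the closest integer to 10.67.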
Your GetSpacedElements() function should also take in the array to avoid unfortunate side effects elsewhere in code. That said, the function would need to look like this:
import numpy as np
def GetSpacedElements(array, numElems = 4):
    out = array[np.round(np.linspace(0, len(array)-1, numElems)).astype(int)]
    return out
arr = np.arange(17)
print(arr)
spacedArray = GetSpacedElements(arr, 4)
print(spacedArray)
If you want to know more about finding indices that match values you seek, also have a look at numpy.argmin and numpy.where. Implementing the former:
import numpy as np
test = np.arange(17)
def nearest_index(array, value):
    return (np.abs(np.asarray(array) - value)).argmin()
def evenly_spaced_indices(array, steps):
    return [nearest_index(array, value) for value in np.linspace(np.min(array), np.max(array), steps)]
print(evenly_spaced_indices(test,4))
You should keep in mind that this is an unnecessary number of function calls for the question you asked, as swiftly demonstrated by coldspeed. np.round rounds each value to the closest integer, which then serves as the index; it implements a similar process, but in optimised compiled code. If you are interested in the indices too, you could have your function simply return both:
import numpy as np
def GetSpacedElements(array, numElems=4, returnIndices=False):
    # use the array passed in, not the global arr
    indices = np.round(np.linspace(0, len(array) - 1, numElems)).astype(int)
    values = array[indices]
    return (values, indices) if returnIndices else values
arr = np.arange(17) + 42
print(arr)
print(GetSpacedElements(arr, 4)) # values only
print(GetSpacedElements(arr, 4, returnIndices=True)[0]) # values only
print(GetSpacedElements(arr, 4, returnIndices=True)[1]) # indices only
To get N evenly spaced elements from list 'x':
x[::int(np.ceil( len(x) / N ))]
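A caveat worth checking before relying on the step-slice one-liner: it does not promise to include the last element, and for some N it returns fewer than N items. A quick sketch comparing it with the linspace approach (by_step and by_linspace are hypothetical helper names, not from the answers):

```python
import numpy as np

x = np.arange(17)

def by_step(x, n):
    # the step-slice one-liner from the answer above
    return x[::int(np.ceil(len(x) / n))]

def by_linspace(x, n):
    # rounded linspace indices; always includes first and last element
    idx = np.round(np.linspace(0, len(x) - 1, n)).astype(int)
    return x[idx]

print(by_step(x, 4))       # [ 0  5 10 15]  (last element 16 is missing)
print(by_linspace(x, 4))   # [ 0  5 11 16]
print(len(by_step(x, 7)))  # 6  (one fewer than requested)
```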

Retrieve intervals from array based on multiple ranges

Let's say I have a Numpy array called a:
a = np.array([2,3,8,11,30,39,44,49,55,61])
I would like to retrieve multiple intervals based on two other arrays:
l = np.array([2,5,42])
r = np.array([10,40,70])
Doing something equivalent to this:
a[(a > l) & (a < r)]
With this as the desired output:
Out[1]: [[3 8],[ 8 11 30 39],[44 49 55 61]]
Of course I could do a simple for loop iterating over l and r, but the real life dataset is huge, so I would like to prevent looping as much as possible.
You can't avoid looping, given the ragged nature of the output, but we should try to reduce the compute done while iterating. Here's one way that simply slices into the input array while iterating, since most of the compute goes into getting the start, stop indices per group with searchsorted (this assumes a is sorted) -
lidx = np.searchsorted(a,l,'right')
ridx = np.searchsorted(a,r,'left')
out = [a[i:j] for (i,j) in zip(lidx,ridx)]
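For reference, a self-contained run of the searchsorted approach on the sample data from the question, with the intermediate indices shown. Note that side='right' on the left bounds and side='left' on the right bounds is what makes both comparisons strict:

```python
import numpy as np

a = np.array([2, 3, 8, 11, 30, 39, 44, 49, 55, 61])  # must already be sorted
l = np.array([2, 5, 42])
r = np.array([10, 40, 70])

# First position whose value is strictly greater than each left bound.
lidx = np.searchsorted(a, l, side='right')
# First position whose value is greater than or equal to each right bound.
ridx = np.searchsorted(a, r, side='left')

print(lidx)  # [1 2 6]
print(ridx)  # [ 3  6 10]

# Each slice is a view into a, so the loop itself does very little work.
out = [a[i:j] for i, j in zip(lidx, ridx)]
print(out)
```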
Here's one approach, broadcasting to obtain the indexing arrays, and using np.split to split the array:
# boolean array of shape (len(a), 3); column j marks the elements of a inside window j
w = (a[:,None] > l) & (a[:,None] < r)
# row indices (positions in a) where the condition is satisfied
ix, _ = np.where(w)
# splits according to the per-window counts (sum along each column)
np.split(a[ix], np.cumsum(w.sum(0)))[:-1]
# [array([3, 8]), array([ 8, 11, 30, 39]), array([44, 49, 55, 61])]
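One thing to watch with the broadcasting version: np.where returns hits in row-major order, so a[ix] is ordered by position in a rather than grouped by window. That happens to coincide for the sample data, but not when the windows overlap and are listed in a different order. A possible variation (my own sketch, not from the answer) is to run np.where on the transposed mask so the hits come out grouped per window:

```python
import numpy as np

a = np.array([2, 3, 8, 11, 30, 39, 44, 49, 55, 61])
# Same first two windows as the question, but listed in reverse order.
l = np.array([5, 2])
r = np.array([40, 10])

# Shape (len(a), n_windows); column j marks elements inside window j.
w = (a[:, None] > l) & (a[:, None] < r)

# Transposing makes each row a window, so hits are grouped window by window.
_, ix = np.where(w.T)

counts = w.sum(axis=0)                       # elements per window
groups = np.split(a[ix], np.cumsum(counts)[:-1])
print(groups)
# [array([ 8, 11, 30, 39]), array([3, 8])]
```

Both variants build a len(a) by n_windows boolean array, so for a huge dataset the searchsorted approach is likely the lighter option.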

Loop over clump_masked indices

I have an array y_filtered that contains some masked values. I want to replace these values by some value I calculate based on their neighbouring values. I can get the indices of the masked values by using masked_slices = ma.clump_masked(y_filtered). This returns a list of slices, e.g. [slice(194, 196, None)].
I can easily get the values from my masked array, by using y_filtered[masked_slices], and even loop over them. However, I need to access the index of the values as well, so i can calculate its new value based on its neighbours. Enumerate (logically) returns 0, 1, etc. instead of the indices I need.
Here's the solution I came up with.
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
y_enum = [(i, y_i) for i, y_i in zip(range(len(y_filtered)), y_filtered)]
for sl in masked_slices:
    for i, y_i in y_enum[sl]:
        # simplified example calculation
        y_filtered[i] = np.average(y_filtered[i-2:i+2])
It is a very ugly method, IMO, and I think there has to be a better way to do this. Any suggestions?
Thanks!
EDIT:
I figured out a better way to achieve what I think you want to do. This code takes every window of 5 elements and computes its (masked) average, then uses those values to fill the gaps in the original array. If an index does not have any unmasked value close enough, it is simply left masked:
import numpy as np
from numpy.lib.stride_tricks import as_strided
SMOOTH_MARGIN = 2
x = np.ma.array(data=[1, 2, 3, 4, 5, 6, 8, 9, 10],
mask=[0, 1, 0, 0, 1, 1, 1, 1, 0])
print(x)
# [1 -- 3 4 -- -- -- -- 10]
pad_data = np.pad(x.data, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant')
pad_mask = np.pad(x.mask, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant',
constant_values=True)
k = 2 * SMOOTH_MARGIN + 1
isize = x.dtype.itemsize
msize = x.mask.dtype.itemsize
x_pad = np.ma.array(
data=as_strided(pad_data, (len(x), k), (isize, isize), writeable=False),
mask=as_strided(pad_mask, (len(x), k), (msize, msize), writeable=False))
x_avg = np.ma.average(x_pad, axis=1).astype(x_pad.dtype)
fill_mask = ~x_avg.mask & x.mask
result = x.copy()
result[fill_mask] = x_avg[fill_mask]
print(result)
# [1 2 3 4 3 4 10 10 10]
(note all the values are integers here because x was originally of integer type)
The originally posted code has a couple of errors. First, it both reads and writes values from y_filtered inside the loop, so the results for later indices are affected by earlier iterations; this could be fixed by reading from a copy of the original y_filtered. Second, [i-2:i+2] should probably be [max(i-2, 0):i+3], so that the window is symmetric and never starts below index zero.
You could do this:
from itertools import chain
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
for idx in chain.from_iterable(range(s.start, s.stop) for s in masked_slices):
    y_filtered[idx] = np.average(y_filtered[max(idx - 2, 0):idx + 3])
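Putting the two fixes together, here is a minimal sketch that reads from a copy so earlier fills don't leak into later windows. The sample array y and the name source are made up for illustration, and the masked mean stands in for whatever neighbour calculation you actually need:

```python
import numpy as np
import numpy.ma as ma

# Hypothetical sample data with one masked clump at indices 2 and 3.
y = ma.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], mask=[0, 0, 1, 1, 0, 0])

# Read from an untouched copy, write into y.
source = y.copy()

for s in ma.clump_masked(source):
    for idx in range(s.start, s.stop):
        window = source[max(idx - 2, 0):idx + 3]
        # .mean() on a masked array ignores the masked entries
        y[idx] = window.mean()

print(y)
```

If a window happens to contain only masked values, the mean is itself masked and the entry stays masked.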

Divide numpy array into multiple arrays using indices array (Python)

I have an array:
a = [1, 3, 5, 7, 29 ... 5030, 6000]
This array gets created from a previous process, and the length of the array could be different (it is depending on user input).
I also have an array:
b = [3, 15, 67, 78, 138]
(Which could also be completely different)
I want to use the array b to slice the array a into multiple arrays.
More specifically, I want the result arrays to be:
array1 = a[:3]
array2 = a[3:15]
...
arrayn = a[138:]
Where n = len(b) + 1.
My first thought was to create a 2D array slices with dimension (len(b), something). However we don't know this something beforehand so I assigned it the value len(a) as that is the maximum amount of numbers that it could contain.
I have this code:
slices = np.zeros((len(b), len(a)))
for i in range(1, len(b)):
    slices[i] = a[b[i-1]:b[i]]
But I get this error:
ValueError: could not broadcast input array from shape (518) into shape (2253412)
You can use numpy.split:
np.split(a, b)
Example:
np.split(np.arange(10), [3,5])
# [array([0, 1, 2]), array([3, 4]), array([5, 6, 7, 8, 9])]
b.insert(0,0)
result = []
for i in range(1,len(b)):
    sub_list = a[b[i-1]:b[i]]
    result.append(sub_list)
result.append(a[b[-1]:])
You are getting the error because you are attempting to create a ragged array. This is not allowed in numpy.
An improvement on @Bohdan's answer:
from itertools import zip_longest
result = [a[start:end] for start, end in zip_longest(np.r_[0, b], b)]
The trick here is that zip_longest makes the final slice go from b[-1] to None, which is equivalent to a[b[-1]:], removing the need for special processing of the last element.
Please do not select this. This is just a thing I added for fun. The "correct" answer is @Psidom's answer.
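A quick check, on made-up small inputs, that the zip_longest version produces the same pieces as np.split:

```python
import numpy as np
from itertools import zip_longest

a = np.arange(10)
b = [3, 5]

via_split = np.split(a, b)

# Starts are [0, 3, 5]; ends are [3, 5] padded with None by zip_longest,
# so the final slice is a[5:None], i.e. a[5:].
via_zip = [a[start:end] for start, end in zip_longest(np.r_[0, b], b)]

for left, right in zip(via_split, via_zip):
    print(left, right)
```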

NumPy Array Indexing

Simple question here about indexing an array to get a subset of its values. Say I have a recarray which holds ages in one space, and corresponding values in another. I also have an array which is my desired subset of ages. Here is what I mean:
ages = np.arange(100)
values = np.random.uniform(low=0, high= 1, size = ages.shape)
data = np.core.rec.fromarrays([ages, values], names='ages,values')
desired_ages = np.array([1,4, 16, 29, 80])
What I'm trying to do is something like this:
data.values[data.ages==desired_ages]
But, it's not working.
You want to create a subarray containing only the values whose indexes are in desired_ages.
Python doesn't have any syntax that directly corresponds to this, but list comprehensions can do a pretty nice job:
result = [value for index, value in enumerate(data.values) if index in desired_ages]
However, doing it this way results in Python scanning through desired_ages for each element in data.values, which is slow. If you could insert
desired_ages = set(desired_ages)
on the line before, this would improve performance. (You can determine whether a value is in a set in constant time, regardless of the set's size.)
Complete Example
import numpy as np
ages = np.arange(100)
values = np.random.uniform(low=0, high= 1, size = ages.shape)
data = np.core.rec.fromarrays([ages, values], names='ages,values')
desired_ages = np.array([1,4, 16, 29, 80])
result = [value for index, value in enumerate(data.values) if index in desired_ages]
print(result)
Output
[0.45852624094611272, 0.0099713014816563694, 0.26695859251958864, 0.10143425810157047, 0.93647796171383935]
I changed your example a little, shuffle the order of ages:
import numpy as np
np.random.seed(0)
ages = np.arange(3,103)
np.random.shuffle(ages)
values = np.random.uniform(low=0, high= 1, size = ages.shape)
data = np.core.rec.fromarrays([ages, values], names='ages,values')
desired_ages = np.array([4, 16, 29, 80])
If all the elements of desired_ages are in data.ages, you can sort data by the ages field first, and then use searchsorted() to find all the indices quickly:
data.sort(order="ages") # sort by ages
print(data.values[np.searchsorted(data.ages, desired_ages)])
or you can use np.in1d (np.isin in newer NumPy versions) to get a bool array and use it as index:
print(data.values[np.in1d(data.ages, desired_ages)])
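The two approaches return values in different orders, which is easy to miss. A sketch on plain arrays (no recarray, and np.isin instead of np.in1d, both my own substitutions) that also avoids sorting the data in place by passing a sorter to searchsorted:

```python
import numpy as np

rng = np.random.default_rng(0)
ages = np.arange(3, 103)
rng.shuffle(ages)                       # ages are unique but unordered
values = rng.uniform(size=ages.shape)
desired_ages = np.array([4, 16, 29, 80])

# Boolean mask: results come back in the order the ages appear in the data.
mask = np.isin(ages, desired_ages)
by_mask = values[mask]

# searchsorted via an argsort: results follow the order of desired_ages.
# This assumes every desired age is actually present in ages.
order = np.argsort(ages)
pos = np.searchsorted(ages, desired_ages, sorter=order)
by_search = values[order[pos]]

print(ages[mask])         # the four desired ages, in data order
print(ages[order[pos]])   # [ 4 16 29 80]
```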
This is a reasonable first approach:
>>> from functools import reduce  # reduce is no longer a builtin on Python 3
>>> bool_indices = reduce(numpy.logical_or,
...                       (data.ages == x for x in desired_ages))
>>> data.values[bool_indices]
array([ 0.63143784, 0.93852927, 0.0026815 , 0.66263594, 0.2603184 ])
But that uses python functions, so it's probably slower. We can translate it pretty easily into pure numpy, using ix_ to make the arrays broadcast against each other nicely. (meshgrid with swapped arguments would work too, but would use more memory.):
>>> bools_2d = numpy.equal(*numpy.ix_(desired_ages, data.ages))
>>> bool_indices = numpy.logical_or.reduce(bools_2d)
>>> data.ages[bool_indices]
array([ 1, 4, 16, 29, 80])
>>> data.values[bool_indices]
array([ 0.32324063, 0.65453647, 0.9300062 , 0.34534668, 0.12151951])
See also HYRY's answer for a potentially faster solution (using searchsorted) and a potentially more readable solution (using in1d).
