Numpy - compute all possible differences in an array at fixed distance - python

Suppose I have an array, and I want to compute differences between elements at a distance Delta. I can use numpy.diff(Array[::Delta-1]), but this will not give all possible differences (from each possible starting point). To get them, I can think of something like this:
for j in xrange(Delta-1):
    NewDiff = numpy.diff(Array[j::Delta-1])
    if j==0:
        Diff = NewDiff
    else:
        Diff = numpy.hstack((Diff,NewDiff))
But I would be surprised if this were the most efficient way to do it. Any ideas from those familiar with the most esoteric functionalities of numpy?

The following function returns a two-dimensional numpy array diff which contains the differences between all possible combinations of a list or numpy array a. For example, diff[3,2] would contain the result of a[3] - a[2] and so on.
import numpy as np

def difference_matrix(a):
    x = np.reshape(a, (len(a), 1))
    return x - x.transpose()
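For example, a quick usage sketch (not part of the original answer):
>>> difference_matrix([1, 3, 7])
array([[ 0, -2, -6],
       [ 2,  0, -4],
       [ 6,  4,  0]])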
Update
It seems I misunderstood the question, and you are only asking for the differences of array elements which are a certain distance d apart.1)
This can be accomplished as follows:
>>> a = np.array([1,3,7,11,13,17,19])
>>> d = 2
>>> a[d:] - a[:-d]
array([6, 8, 6, 6, 6])
Have a look at the documentation to learn more about this notation.
But the function for the difference matrix I've posted above shall not be in vain. In fact, the array you're looking for is a diagonal of the matrix that difference_matrix returns.
>>> a = [1,3,7,11,13,17,19]
>>> d = 2
>>> m = difference_matrix(a)
>>> np.diag(m, -d)
array([6, 8, 6, 6, 6])
1) Judging by your comment, this distance d is different than the Delta you seem to be using, with d = Delta - 1, so that the distance between an element and itself is 0, and its distance to the adjacent elements is 1.
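For completeness, here is a quick check (a sketch, using the question's Delta with d = Delta - 1) that the slice expression yields the same set of differences as the original loop, just in a different order:
import numpy as np

Array = np.array([1, 3, 7, 11, 13, 17, 19])
Delta = 3
d = Delta - 1

loop_version = np.hstack([np.diff(Array[j::d]) for j in range(d)])
slice_version = Array[d:] - Array[:-d]

print(np.array_equal(np.sort(loop_version), np.sort(slice_version)))  # True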

Related

Calculation on list of numpy array

I'm trying to do some calculation (mean, sum, etc.) on a list containing numpy arrays.
For example:
list = [array([2, 3, 4]),array([4, 4, 4]),array([6, 5, 4])]
How can I retrieve the mean (for example)?
As a list like [4,4,4], or a numpy array like array([4,4,4])?
Thanks in advance for your help!
EDIT: Sorry, I didn't explain properly what I was aiming to do: I would like to get the mean of the i-th index of the arrays. For example, for index 0:
(2+4+6)/3 = 4
I don't want this:
(2+3+4)/3 = 3
Therefore the end result will be [4, 4, 4], not [3, 4, 5].
If L were a list of scalars then calculating the mean could be done using the straightforward expression:
sum(L) / len(L)
Luckily, this works unchanged on lists of arrays:
L = [np.array([2, 3, 4]), np.array([4, 4, 4]), np.array([6, 5, 4])]
sum(L) / len(L)
# array([4., 4., 4.])
For this example this happens to be quite a bit faster than the numpy function np.mean:
timeit(lambda: np.mean(L, axis=0))
# 13.708808058872819
timeit(lambda: sum(L) / len(L))
# 3.4780975924804807
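For reference, these numbers come from something along the following lines (a sketch; the exact setup is an assumption, and absolute timings vary by machine):
from timeit import timeit
import numpy as np

L = [np.array([2, 3, 4]), np.array([4, 4, 4]), np.array([6, 5, 4])]
print(timeit(lambda: np.mean(L, axis=0)))
print(timeit(lambda: sum(L) / len(L)))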
You can use a for loop and iterate over the indices, if your list is not too big:
mean = []
for i in range(len(list[0])):
    # average the i-th element across all arrays in the list
    mean.append(np.mean([arr[i] for arr in list]))
# mean == [4.0, 4.0, 4.0]
Given a 1d array a, np.mean(a) should do the trick.
If you have a 2d array and want the means for each one separately, specify np.mean(a, axis=1).
There are equivalent functions for np.sum, etc.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html
You can use np.mean with axis=0:
import numpy as np
my_list = [np.array([2, 3, 4]), np.array([4, 4, 4]), np.array([6, 5, 4])]
np.mean(my_list, axis=0)  # array([4., 4., 4.])
Note: do not name your variable list, as it will shadow the built-in.

Loop over clump_masked indices

I have an array y_filtered that contains some masked values. I want to replace these values by some value I calculate based on their neighbouring values. I can get the indices of the masked values by using masked_slices = ma.clump_masked(y_filtered). This returns a list of slices, e.g. [slice(194, 196, None)].
I can easily get the values from my masked array by using y_filtered[masked_slices], and even loop over them. However, I need to access the index of the values as well, so I can calculate the new value based on its neighbours. enumerate (logically) returns 0, 1, etc. instead of the indices I need.
Here's the solution I came up with.
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
y_enum = [(i, y_i) for i, y_i in zip(range(len(y_filtered)), y_filtered)]
for sl in masked_slices:
    for i, y_i in y_enum[sl]:
        # simplified example calculation
        y_filtered[i] = np.average(y_filtered[i-2:i+2])
It is a very ugly method, IMO, and I think there has to be a better way to do this. Any suggestions?
Thanks!
EDIT:
I figured out a better way to achieve what I think you want to do. This code picks every window of 5 elements and computes its (masked) average, then uses those values to fill the gaps in the original array. If some index does not have any unmasked value close enough, it is simply left masked:
import numpy as np
from numpy.lib.stride_tricks import as_strided

SMOOTH_MARGIN = 2
x = np.ma.array(data=[1, 2, 3, 4, 5, 6, 8, 9, 10],
                mask=[0, 1, 0, 0, 1, 1, 1, 1, 0])
print(x)
# [1 -- 3 4 -- -- -- -- 10]

pad_data = np.pad(x.data, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant')
pad_mask = np.pad(x.mask, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant',
                  constant_values=True)
k = 2 * SMOOTH_MARGIN + 1
isize = x.dtype.itemsize
msize = x.mask.dtype.itemsize
# each row of x_pad is the window of k elements centred on one element of x
x_pad = np.ma.array(
    data=as_strided(pad_data, (len(x), k), (isize, isize), writeable=False),
    mask=as_strided(pad_mask, (len(x), k), (msize, msize), writeable=False))
x_avg = np.ma.average(x_pad, axis=1).astype(x_pad.dtype)
# fill only positions that were masked in x but have a valid window average
fill_mask = ~x_avg.mask & x.mask
result = x.copy()
result[fill_mask] = x_avg[fill_mask]
print(result)
# [1 2 3 4 3 4 10 10 10]
(note all the values are integers here because x was originally of integer type)
The originally posted code has a few errors. Firstly, it both reads and writes values of y_filtered in the loop, so the results at later indices are affected by previous iterations; this could be fixed by working on a copy of the original y_filtered. Secondly, [i-2:i+2] should probably be [max(i-2, 0):i+3], so that the window is symmetric and always starts at zero or later.
You could do this:
from itertools import chain

# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
# note: like the original, this reads values written in earlier iterations;
# work on a copy of y_filtered if that is undesired
for idx in chain.from_iterable(range(s.start, s.stop) for s in masked_slices):
    y_filtered[idx] = np.average(y_filtered[max(idx - 2, 0):idx + 3])
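A quick demonstration on a small made-up array (a sketch; ma is numpy.ma):
import numpy as np
import numpy.ma as ma
from itertools import chain

y_filtered = ma.array(data=[1., 2., 3., 4., 5., 6., 7.],
                      mask=[0, 0, 1, 1, 0, 0, 0])
masked_slices = ma.clump_masked(y_filtered)
for idx in chain.from_iterable(range(s.start, s.stop) for s in masked_slices):
    # np.average on a masked slice averages only the unmasked neighbours
    y_filtered[idx] = np.average(y_filtered[max(idx - 2, 0):idx + 3])
print(y_filtered)  # the two masked entries are now filled from their neighbourhoods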

Pythonic way to get the first AND the last element of the sequence

What is the easiest and cleanest way to get the first AND the last elements of a sequence? E.g., I have a sequence [1, 2, 3, 4, 5], and I'd like to get [1, 5] via some kind of slicing magic. What I have come up with so far is:
l = len(s)
result = s[0:l:l-1]
I actually need this for a bit more complex task. I have a 3D numpy array, which is cubic (i.e. is of size NxNxN, where N may vary). I'd like an easy and fast way to get a 2x2x2 array containing the values from the vertices of the source array. The example above is an oversimplified, 1D version of my task.
Use this:
result = [s[0], s[-1]]
Since you're using a numpy array, you may want to use fancy indexing:
a = np.arange(27)
indices = [0, -1]
b = a[indices] # array([0, 26])
For the 3d case:
vertices = [(0,0,0),(0,0,-1),(0,-1,0),(0,-1,-1),(-1,-1,-1),(-1,-1,0),(-1,0,0),(-1,0,-1)]
indices = tuple(zip(*vertices))  # can store this for later use; newer numpy requires a tuple here
a = np.arange(27).reshape((3,3,3))  # dummy array for testing; can be any shape
vertex_values = a[indices].reshape((2,2,2))
I first write down all the vertices (although I am willing to bet there is a clever way to do it using itertools which would let you scale this up to N dimensions; see the sketch below). The order in which you specify the vertices is the order they will be in the output array. Then I "transpose" the list of vertices (using zip) so that all the x indices are together, all the y indices are together, etc. (that's how numpy likes it). At this point, you can save that index tuple and use it to index your array whenever you want the corners of your box. You can easily reshape the result into a 2x2x2 array (although the order I have it in is probably not the order you want).
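As a sketch of that itertools idea (my own addition, not part of the original answer), itertools.product generates the corner indices for any number of dimensions:
import numpy as np
from itertools import product

a = np.arange(27).reshape((3, 3, 3))
vertices = list(product([0, -1], repeat=a.ndim))  # all 2**ndim corner index tuples
indices = tuple(zip(*vertices))
vertex_values = a[indices].reshape((2,) * a.ndim)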
This would give you a list of the first and last element in your sequence:
result = [s[0], s[-1]]
Alternatively, this would give you a tuple
result = s[0], s[-1]
With the particular case of a (N,N,N) ndarray X that you mention, would the following work for you?
s = slice(0,N,N-1)
X[s,s,s]
Example
>>> N = 3
>>> X = np.arange(N*N*N).reshape(N,N,N)
>>> s = slice(0,N,N-1)
>>> print X[s,s,s]
[[[ 0  2]
  [ 6  8]]

 [[18 20]
  [24 26]]]
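The same idea extends to an array of any dimensionality (a hedged generalization, not in the original answer):
>>> X[(s,) * X.ndim]  # equivalent to X[s, s, s] for a 3-D array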
>>> from operator import itemgetter
>>> first_and_last = itemgetter(0, -1)
>>> first_and_last([1, 2, 3, 4, 5])
(1, 5)
Why do you want to use a slice? Getting each element with
result = [s[0], s[-1]]
is better and more readable.
If you really need to use the slice, then your solution is the simplest working one that I can think of.
This also works for the 3D case you've mentioned.

Python/NumPy: implementing a running sum (but not quite)

Given are two arrays of equal length, one holding data, one holding the results but initially set to zero, e.g.:
a = numpy.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1])
b = numpy.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
I'd like to compute the sum of all possible subsets of three adjacent elements in a. If the sum is 0 or 1, the three corresponding elements in b are left unchanged; only if the sum exceeds 1 are the three corresponding elements in b set to 1, so that after the computation b becomes
array([0, 0, 0, 1, 1, 1, 0, 1, 1, 1])
A simple loop will accomplish this:
for x in range(len(a)-2):
    if a[x:x+3].sum() > 1:
        b[x:x+3] = 1
After this, b has the desired form.
I have to do this for a large amount of data, so speed is an issue. Is there a faster way in NumPy to carry out the operation above?
(I understand this is similar to a convolution, but not quite the same).
You can start with a convolution, choose the values that exceed 1, and finally use a "dilation":
b = numpy.convolve(a, [1, 1, 1], mode="same") > 1
b = b | numpy.r_[0, b[:-1]] | numpy.r_[b[1:], 0]
Since this avoids the Python loop, it should be faster than your approach, but I didn't do timings.
An alternative is to use a second convolution to dilate:
kernel = [1, 1, 1]
b = numpy.convolve(a, kernel, mode="same") > 1
b = numpy.convolve(b, kernel, mode="same") > 0
If you have SciPy available, yet another option for the dilation is
b = numpy.convolve(a, [1, 1, 1], mode="same") > 1
b = scipy.ndimage.morphology.binary_dilation(b)
Edit: By doing some timings, I found that this solution seems to be fastest for large arrays:
b = numpy.convolve(a, kernel) > 1
b[:-1] |= b[1:] # Shift and "smearing" to the *left* (smearing with b[1:] |= b[:-1] does not work)
b[:-1] |= b[1:] # … and again!
b = b[:-2]
For an array of one million entries, it was more than 200 times faster than your original approach on my machine. As pointed out by EOL in the comments, this solution might be considered a bit fragile, though, since it depends on implementation details of NumPy.
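A quick way to verify this variant against the original loop's output (a sketch; kernel is [1, 1, 1] as above):
import numpy

a = numpy.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1])
kernel = [1, 1, 1]
b = numpy.convolve(a, kernel) > 1
b[:-1] |= b[1:]
b[:-1] |= b[1:]
b = b[:-2]
print(b.astype(int))  # [0 0 0 1 1 1 0 1 1 1]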
You can calculate the "convolution" sums in an efficient way with:
>>> a0 = a[:-2]
>>> a1 = a[1:-1]
>>> a2 = a[2:]
>>> a_large_sum = a0 + a1 + a2 > 1
Updating b can then be done efficiently by writing something that means "at least one of the three neighbouring a_large_sum values is True": you first extend your a_large_sum array back to the same number of elements as a (padded to the right, to both sides, and to the left, respectively):
>>> a_large_sum_0 = np.hstack([a_large_sum, [False, False]])
>>> a_large_sum_1 = np.hstack([[False], a_large_sum, [False]])
>>> a_large_sum_2 = np.hstack([[False, False], a_large_sum])
You then obtain b in an efficient way:
>>> b = a_large_sum_0 | a_large_sum_1 | a_large_sum_2
This gives the result that you obtain, but in a very efficient way, by leveraging NumPy's fast internal loops.
PS: This approach is essentially the same as Sven's first solution, but is way more pedestrian than Sven's elegant code; it is as fast, however. Sven's second solution (double convolve()) is even more elegant, and it is twice as fast.
You might also like to have a look at NumPy's stride_tricks. Using Sven's timing setup (see link in Sven's answer), I found that for (very) large arrays, this is also a fast way to do what you want (i.e. with your definition of a):
shape = (len(a)-2, 3)
strides = a.strides + a.strides
a_strided = numpy.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
b = numpy.r_[numpy.sum(a_strided, axis=-1) > 1, False, False]
b[2:] |= b[1:-1] | b[:-2]
After edit (see comments below) it is no longer the fastest way.
This creates a specially strided view on your original array. The data in a is not copied, but is simply viewed in a new way. We want to basically make a new array in which the last index contains the sub-arrays that we want to sum (i.e. the three elements that you want to sum). This way, we can easily sum in the end with the last command.
The last element of this new shape therefore has to be 3, and the first element will be the length of the old a minus 2 (because we can only sum up to the -2nd element).
The strides list contains the strides, in bytes, that the new array a_strided needs to take to get to the next element in each of the dimensions of shape. If you set these equal, it means that a_strided[0,1] and a_strided[1,0] will both be a[1], which is exactly what we want. In a normal array this would not be the case (the first stride would be the item size times the length of a row, i.e. itemsize * shape[1]), but in this case we can make good use of it.
Not sure if I explained this all really well, but just print out a_strided and you'll see what the result is and how easy this makes the operation.
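As an aside, modern NumPy (1.20+) provides numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of view with less room for error; a sketch of the equivalent computation:
import numpy as np

a = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1])
windows = np.lib.stride_tricks.sliding_window_view(a, 3)  # shape (len(a)-2, 3)
b = np.r_[windows.sum(axis=-1) > 1, False, False]
b[2:] |= b[1:-1] | b[:-2]
print(b.astype(int))  # [0 0 0 1 1 1 0 1 1 1]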

compare two consecutive values in a numpy array

What is the best way to access two consecutive values in a numpy array?
example:
npdata = np.array([13, 15, 20, 25])
for i in range(len(npdata)):
    print npdata[i] - npdata[i+1]
This looks really messy, and additionally needs exception handling for the last iteration of the loop.
any ideas?
Thanks!
numpy provides a function diff for this basic use case
>>> import numpy
>>> x = numpy.array([1, 2, 4, 7, 0])
>>> numpy.diff(x)
array([ 1,  2,  3, -7])
Your snippet computes something closer to -numpy.diff(x).
How about range(len(npdata) - 1) ?
Here's code (using a simple array, but it doesn't matter):
>>> ar = [1, 2, 3, 4, 5]
>>> for i in range(len(ar) - 1):
...     print ar[i] + ar[i + 1]
...
3
5
7
9
As you can see it successfully prints the sums of all consecutive pairs in the array, without any exceptions for the last iteration.
You can use ediff1d to get differences of consecutive elements. More generally, a[1:] - a[:-1] will give the differences of consecutive elements and can be used with other operators as well.
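For example (a quick sketch, assuming import numpy as np):
>>> x = np.array([13, 15, 20, 25])
>>> np.ediff1d(x)
array([2, 5, 5])
>>> x[1:] - x[:-1]
array([2, 5, 5])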
