numpy: Compressing block matrix - python

Consider a matrix M1 giving values for all combinations x,y. Consider a partition f(x)->X and a partition g(y)->Y. Furthermore, consider an operation p(A) on a set A of numbers, e.g. max(A) or sum(A).
The mappings f,g can be used to create from M1 a block matrix M2 in which all x that are mapped to the same X are adjacent, and likewise for all y.
This matrix M2 has a block for each combination of the 'sets' X,Y.
Now I would like to condense this matrix M2 into another matrix M3 by applying p to each block separately. M3 has one value for each combination of X,Y.
Ideally, I would like to skip the explicit transformation of M1 into M2 and instead apply f and g on the fly.
What would be the most efficient way to perform such an operation, and would it be possible to use numpy or scipy for it?
Special case: in my case x and y are actually identical, and a single function f is applied to both of them. I only care about the part of M2 below the diagonal.

The most straightforward way I can think of to do this, although perhaps not the most efficient (especially if your matrix is huge), is to convert your matrix to a one-dimensional array, and then have corresponding arrays for the partition group indices X and Y. You can then group by the partition group indices and finally restructure the matrix back into its original form.
For example, if your matrix is
>>> M1 = np.arange(25).reshape((5,5))
>>> M1
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
and your partitions are
>>> def f(x):
...     return np.array([1,1,1,2,2])[x]
>>> def g(y):
...     return np.array([3,4,4,4,5])[y]
From that point, there are several ways to implement the reshaping and subsequent grouping. You can do it with Pandas, for instance, by constructing a DataFrame and using its stack() method to "stack" all the rows on top of each other in a single column, indexed by their original row and column indices.
>>> st = pd.DataFrame(M1).stack().to_frame('M1')
>>> st
     M1
0 0   0
  1   1
  2   2
  3   3
  4   4
1 0   5
...
4 3  23
  4  24
(I have truncated the output for readability, and I trust that you can evaluate the rest of these examples yourself if you want to see their output.) You can then add columns representing the partition group indices:
>>> st['X'] = f(st.index.get_level_values(0))
>>> st['Y'] = g(st.index.get_level_values(1))
Then you can group by those indices and apply your aggregation function of choice.
>>> stp = st.groupby(['X', 'Y']).agg(p)
You will have to define p (or find an existing definition) such that it takes a one-dimensional Numpy array and returns a single number. If you want to use something like sum(), you can just use st.groupby(...).sum() because Pandas has built-in support for that and a few other standard functions, but agg is general and works for any reduction function p you can provide.
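For instance, a peak-to-peak reduction (a made-up example; any function mapping a one-dimensional array to a scalar will do) could be defined as:
>>> def p(a):
...     return a.max() - a.min()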
Finally, the unstack() method will convert the DataFrame back into the proper 2D "matrix form", and then if you want you can use the to_numpy() method (as_matrix() in very old Pandas versions) to turn it back into a pure Numpy array.
>>> M3 = stp.unstack().to_numpy()
>>> M3
array([[ 15,  63,  27],
       [ 35, 117,  43]])
If you don't want to bring in Pandas, there are other libraries that do the same thing. You might look at numpy-groupies, for example. However I haven't found any library that does true two-dimensional grouping, which you might need if you are working with very large matrices, large enough that having an additional 2 or 3 copies of them would exhaust the available memory.
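For what it's worth, here is a pure-Numpy sketch of the same two-dimensional grouping for the special case p = sum, using np.bincount on a flattened per-cell group key (it assumes only the M1, f, and g from the example above):
import numpy as np

rows = f(np.arange(M1.shape[0]))  # row labels, here [1, 1, 1, 2, 2]
cols = g(np.arange(M1.shape[1]))  # column labels, here [3, 4, 4, 4, 5]
_, ri = np.unique(rows, return_inverse=True)  # remap row labels to 0..nX-1
_, ci = np.unique(cols, return_inverse=True)  # remap column labels to 0..nY-1
nX, nY = ri.max() + 1, ci.max() + 1
keys = (ri[:, None] * nY + ci[None, :]).ravel()  # one flat group key per cell
M3 = np.bincount(keys, weights=M1.ravel(),
                 minlength=nX * nY).reshape(nX, nY)
# M3 -> [[ 15.,  63.,  27.],
#        [ 35., 117.,  43.]]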

Let M1 be a numpy n x m array. You can start by determining which partitions you have. The set constructor removes repeated entries, but orders them arbitrarily. I sort them just to have a well-defined ordering:
xs = sorted(set(f(i) for i in range(n)))
ys = sorted(set(g(i) for i in range(m)))
To build a block matrix for each X,Y you can use numpy boolean indexing along with the grid-construction helper ix_ to select only rows and columns belonging to X and Y, respectively. Finally, apply p to the selected submatrix:
from numpy import zeros, arange, ix_
ii, jj = arange(n), arange(m)
M3 = zeros((len(xs), len(ys)))
for k, X in enumerate(xs):
    for l, Y in enumerate(ys):
        M3[k,l] = p(M1[ix_(f(ii) == X, g(jj) == Y)])
The partition functions f and g have to apply element-wise to numpy arrays for this to work; the numpy.vectorize wrapper can be used to achieve this.
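For instance, a partition that only accepts scalars can be wrapped like this (a minimal sketch):
import numpy as np

def f_scalar(x):  # accepts a single integer only
    return x // 3

f = np.vectorize(f_scalar)  # f now applies element-wise to index arrays
f(np.arange(5))             # -> array([0, 0, 0, 1, 1])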
To give an example:
import numpy as np

n = m = 5
M1 = np.arange(25).reshape(5,5)
f = lambda x: x // 3      # f(ii) = [0, 0, 0, 1, 1]
g = lambda x: (x+2) // 3  # g(jj) = [0, 1, 1, 1, 2]
p = np.sum
M3 = [[ 15.,  63.,  27.],
      [ 35., 117.,  43.]]

I encountered the same problem some years later, and in my opinion the best way to do this is as follows:
M2 = np.zeros((n,m))
for i in range(n):
    for j in range(m):
        M2[i,j] = p(M1[f(x) == i, :][:, g(y) == j])
This assumes that f takes values in [0,1,...,n-1] and that g takes values in [0,1,...,m-1].
An example would be
import numpy as np

M1 = np.random.random((4,6))
print(M1)
x = range(4)
y = range(6)
p = np.sum

def f(x):
    return np.array([0,0,1,2])[x]

def g(y):
    return np.array([0,1,1,0,1,0])[y]

n = 3  # number of groups in partition f
m = 2  # number of groups in partition g
M2 = np.zeros((n,m))
for i in range(n):
    for j in range(m):
        M2[i,j] = p(M1[f(x) == i, :][:, g(y) == j])
print(M2)
To automate n and m, you can use len(set(f(x))) and len(set(g(y))), as in the sketch below.
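If the partition labels are not guaranteed to be contiguous integers starting at 0, np.unique can count and remap them in one go (a small sketch building on the example above):
labels_x = f(np.asarray(x))                        # e.g. [0, 0, 1, 2]
labels_y = g(np.asarray(y))                        # e.g. [0, 1, 1, 0, 1, 0]
ux, fx = np.unique(labels_x, return_inverse=True)  # fx relabels rows to 0..n-1
uy, gy = np.unique(labels_y, return_inverse=True)  # gy relabels columns to 0..m-1
n, m = len(ux), len(uy)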

Related

Index an array with a ragged indexing list and perform sum/mean reductions

I have some 2D data where the first axis is time and the second axis is person ID. Thus the data entries are the persons' property values over time.
What I want to do is to group the persons and average the properties within each group at all time frames. Here is a sample with 6 time points, 5 persons and 2 groups:
import numpy as np
data = np.arange(30)
data.shape = 6, 5
groups = [[0, 1, 4], [2, 3]]
result = np.empty((6, 2))
for i, indices in enumerate(groups):
    result[:, i] = data[:, indices].mean(axis=1)
And the result is
array([[ 1.66666667,  2.5       ],
       [ 6.66666667,  7.5       ],
       [11.66666667, 12.5       ],
       [16.66666667, 17.5       ],
       [21.66666667, 22.5       ],
       [26.66666667, 27.5       ]])
Is this the best we can do in terms of efficiency? I was wondering if that looping over the groups could also be eliminated.
Approach #1 : Generic case
Here's an almost vectorized approach making use of np.add.reduceat -
g = np.concatenate(groups)             # all column indices, group by group
lens = list(map(len, groups))          # group sizes
cuts = np.r_[0, np.cumsum(lens)[:-1]]  # start offset of each group within g
out = np.add.reduceat(data[:, g], cuts, axis=1)/lens
Approach #2 : Specific case
If groups is a regular 2D array/list, we can simply do -
data[:, groups].mean(2)
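For instance, with equal-sized (hypothetical) groups the fancy index is rectangular and the one-liner applies directly -
groups_regular = [[0, 1], [2, 3]]      # every group has the same length
out = data[:, groups_regular].mean(2)  # data[:, idx] has shape (6, 2, 2)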
Approach #3 : Approach 1 + 2
Mixing approaches 1 and 2, we can come up with another generic-case method -
from itertools import zip_longest

# pad the ragged groups with -1 to form a rectangular index array
idx = np.vstack(list(zip_longest(*groups, fillvalue=-1)))
# count the pad entries per group
c = (idx == -1).sum(0)
# index -1 picks the last column, so subtract its contribution c times
sums = data[:, idx].sum(1) - c*(data[:,-1,None])
lens = list(map(len, groups))
out = sums/lens
Approach #4 : With matrix-multiplication
We will create a mask that, when matrix-multiplied with data, performs the sum-reduction for us; we then just divide by the group lengths to get the desired average values -
mask = np.zeros((data.shape[1], len(groups)), dtype=bool)
for i, indices in enumerate(groups):
    mask[indices, i] = 1
out = data.dot(mask)/list(map(len, groups))
Also, we might want to use float32 for faster matrix-multiplication -
data.dot(mask.astype(np.float32))
Approach #5 : Approach 2 + 3
We will pad data with an additional column of zeros, then index with the regular (rectangular) index array created with zip_longest, sum, and divide by the group lengths to get the mean values -
data0 = np.pad(data,((0,0),(0,1)))
idx = np.vstack(list(zip_longest(*groups, fillvalue=-1)))
out = data0[:, idx].sum(1)/list(map(len, groups))
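As a quick sanity check (a sketch, assuming out holds the result of one of the approaches above and result is the loop-based output from the question) -
assert np.allclose(out, result)  # the approaches reproduce the loop's averages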

Retrieve intervals from array based on multiple ranges

Let's say I have a Numpy array called a:
a = np.array([2,3,8,11,30,39,44,49,55,61])
I would like to retrieve multiple intervals based on two other arrays:
l = np.array([2,5,42])
r = np.array([10,40,70])
Doing something equivalent to this:
a[(a > l) & (a < r)]
With this as the desired output:
Out[1]: [[3 8],[ 8 11 30 39],[44 49 55 61]]
Of course I could do a simple for loop iterating over l and r, but the real life dataset is huge, so I would like to prevent looping as much as possible.
You can't avoid looping given the ragged nature of the output, but we should try to reduce the compute done while iterating. So, here's one way that simply slices into the input array in the loop, after front-loading most of the compute by getting the start/stop indices per group with searchsorted -
lidx = np.searchsorted(a, l, 'right')  # first index with a[i] > l
ridx = np.searchsorted(a, r, 'left')   # first index with a[j] >= r (exclusive stop)
out = [a[i:j] for (i,j) in zip(lidx, ridx)]
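Applied to the sample arrays above, this reproduces the requested intervals -
print(out)
# [array([3, 8]), array([ 8, 11, 30, 39]), array([44, 49, 55, 61])]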
Here's one approach, broadcasting to obtain the indexing arrays, and using np.split to split the array:
# generates a (3,len(a)) where the windows are found in each column
w = (a[:,None] > l) & (a[:,None] < r)
# indices where in the (3,len(a)) array condition is satisfied
ix, _ = np.where(w)
# splits according to the sum along the columns
np.split(a[ix], np.cumsum(w.sum(0)))[:-1]
# [array([3, 8]), array([ 8, 11, 30, 39]), array([44, 49, 55, 61])]

Loop over clump_masked indices

I have an array y_filtered that contains some masked values. I want to replace these values by some value I calculate based on their neighbouring values. I can get the indices of the masked values by using masked_slices = ma.clump_masked(y_filtered). This returns a list of slices, e.g. [slice(194, 196, None)].
I can easily get the values from my masked array by using y_filtered[masked_slices], and even loop over them. However, I need to access the index of the values as well, so I can calculate the new value based on its neighbours. enumerate (logically) returns 0, 1, etc. instead of the indices I need.
Here's the solution I came up with.
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
y_enum = [(i, y_i) for i, y_i in zip(range(len(y_filtered)), y_filtered)]
for sl in masked_slices:
    for i, y_i in y_enum[sl]:
        # simplified example calculation
        y_filtered[i] = np.average(y_filtered[i-2:i+2])
This is a very ugly method IMO, and I think there has to be a better way to do it. Any suggestions?
Thanks!
EDIT:
I figured out a better way to achieve what I think you want to do. This code picks every window of 5 elements and computes its (masked) average, then uses those values to fill the gaps in the original array. If some index does not have any unmasked value close enough, it is left masked:
import numpy as np
from numpy.lib.stride_tricks import as_strided
SMOOTH_MARGIN = 2
x = np.ma.array(data=[1, 2, 3, 4, 5, 6, 8, 9, 10],
                mask=[0, 1, 0, 0, 1, 1, 1, 1, 0])
print(x)
# [1 -- 3 4 -- -- -- -- 10]
pad_data = np.pad(x.data, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant')
pad_mask = np.pad(x.mask, (SMOOTH_MARGIN, SMOOTH_MARGIN), mode='constant',
                  constant_values=True)
k = 2 * SMOOTH_MARGIN + 1
isize = x.dtype.itemsize
msize = x.mask.dtype.itemsize
x_pad = np.ma.array(
    data=as_strided(pad_data, (len(x), k), (isize, isize), writeable=False),
    mask=as_strided(pad_mask, (len(x), k), (msize, msize), writeable=False))
x_avg = np.ma.average(x_pad, axis=1).astype(x_pad.dtype)
fill_mask = ~x_avg.mask & x.mask
result = x.copy()
result[fill_mask] = x_avg[fill_mask]
print(result)
# [1 2 3 4 3 4 10 10 10]
(note all the values are integers here because x was originally of integer type)
The original posted code has a couple of errors. First, it both reads and writes values from y_filtered in the loop, so the results for later indices are affected by previous iterations; this can be fixed by reading from a copy of the original y_filtered. Second, [i-2:i+2] should probably be [max(i-2, 0):i+3], so the window is symmetric and its start never goes negative.
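A minimal corrected version of that loop, applying both fixes (a sketch; np.ma.average is used so masked neighbours are ignored):
y_src = y_filtered.copy()  # read from a frozen copy so fills don't feed on each other
for sl in masked_slices:
    for i in range(sl.start, sl.stop):
        window = y_src[max(i - 2, 0):i + 3]  # symmetric 5-wide window, clipped at 0
        y_filtered[i] = np.ma.average(window)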
You could do this:
from itertools import chain
# get indices of masked data
masked_slices = ma.clump_masked(y_filtered)
for idx in chain.from_iterable(range(s.start, s.stop) for s in masked_slices):
    y_filtered[idx] = np.average(y_filtered[max(idx - 2, 0):idx + 3])

Fast algorithm to find indices where multiple arrays have the same value

I'm looking for ways to speed up (or replace) my algorithm for grouping data.
I have a list of numpy arrays. I want to generate a new numpy array, such that each element of this array is the same for each index where the original arrays are the same as well. (And different where this is not the case.)
This sounds kind of awkward, so have an example:
# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
#                             *           *
Note that the marked elements (indices 0 and 4) of the expected outcome have the same value (0) because the original two arrays were also the same there (namely 10 and 21). Similarly for the elements with indices 3 and 5 (value 3).
The algorithm has to deal with an arbitrary number of (equally-sized) input arrays, and also return, for each resulting number, the values of the original arrays it corresponds to. (So for this example, "3" refers to (11, 22).)
Here is my current algorithm:
import numpy as np

def groupify(values):
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1  # Magic number: -1 means ungrouped.
    group_meanings = {}
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        this_combo = {}
        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]
        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
            this_combo[curr_id] = needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        group_meanings[next_hash] = this_combo
        next_hash += 1
    return group, group_meanings
Note that the expression value_array[matching] == needed_value is evaluated many times for each individual index, which is where the slowness comes from.
I'm not sure if my algorithm can be sped up much more, but I'm also not sure if it's the optimal algorithm to begin with. Is there a better way of doing this?
Cracked it finally with a vectorized solution! It was an interesting problem. We have to tag each pair of values taken from the corresponding array elements of the list, and then tag each such pair based on its uniqueness among the other pairs. So, we can use np.unique, making full use of its optional arguments, and finally do some additional work to keep the order for the final output. Here's the implementation, done in three stages -
# Stack as a 2D array, with each pair from values forming a column.
# Convert to linear index equivalents, treating each column as an indexing tuple
arr = np.vstack(values)
idx = np.ravel_multi_index(arr, arr.max(1)+1)

# Do the heavy work with np.unique to give us:
# 1. Starting indices of unique elements,
# 2. Array that has unique IDs for each element in idx, and
# 3. Group ID counts
_, unq_start_idx, unqID, count = np.unique(idx, return_index=True,
                                           return_inverse=True, return_counts=True)

# Best part happens here: use a mask to ignore the repeated elements and re-tag
# each unqID using argsort() of the masked elements from idx
mask = ~np.in1d(unqID, np.where(count>1)[0])
mask[unq_start_idx] = 1
out = idx[mask].argsort()[unqID]
Runtime test
Let's compare the proposed vectorized approach against the original code. Since the proposed code gets us the group IDs only, for fair benchmarking let's trim off the parts of the original code that are not used for that. So, here are the function definitions -
def groupify(values):  # Original code
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]
        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        next_hash += 1
    return group

def groupify_vectorized(values):  # Proposed code
    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr, arr.max(1)+1)
    _, unq_start_idx, unqID, count = np.unique(idx, return_index=True,
                                               return_inverse=True, return_counts=True)
    mask = ~np.in1d(unqID, np.where(count>1)[0])
    mask[unq_start_idx] = 1
    return idx[mask].argsort()[unqID]
Runtime results on a list with large arrays -
In [345]: # Input list with random elements
     ...: values = [item for item in np.random.randint(10,40,(10,10000))]
In [346]: np.allclose(groupify(values),groupify_vectorized(values))
Out[346]: True
In [347]: %timeit groupify(values)
1 loops, best of 3: 4.02 s per loop
In [348]: %timeit groupify_vectorized(values)
100 loops, best of 3: 3.74 ms per loop
This should work, and should be considerably faster, since we're using broadcasting and numpy's inherently fast boolean comparisons:
import numpy as np
# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
# for every value in values, check where duplicate values occur
same_mask = [val[:,np.newaxis] == val[np.newaxis,:] for val in values]
# get the conjunction of all those tests
conjunction = np.logical_and.reduce(same_mask)
# ignore the diagonal
conjunction[np.diag_indices_from(conjunction)] = False
# initialize the labelled array with nans (used as flag)
labelled = np.empty(values[0].shape)
labelled.fill(np.nan)
# keep track of labelled value
val = 0
for k, row in enumerate(conjunction):
    if np.isnan(labelled[k]):  # this element has not been labelled yet
        labelled[k] = val      # so label it
        labelled[row] = val    # and label every element satisfying the test
        val += 1
print(labelled)
# outputs [ 0. 1. 2. 3. 0. 3. 4.]
It is about a factor of 1.5x faster than your version when dealing with the two arrays, but I suspect the speedup should be better for more arrays.
The numpy_indexed package (disclaimer: I am its author) contains generalized variants of the numpy array-set operations, which can be used to solve your problem in an elegant and efficient (vectorized) manner:
import numpy_indexed as npi
unique_values, labels = npi.unique(tuple(values), return_inverse=True)
The above will work for arbitrary type combinations, but alternatively, the below will be even more efficient if values is a list of many arrays of the same dtype:
unique_values, labels = npi.unique(np.asarray(values), axis=1, return_inverse=True)
If I understand correctly, you are trying to hash values according to columns. It's better to convert the columns into arbitrary values by themselves, and then find the hashes from them.
So you actually want to hash on list(np.array(values).T).
This functionality is already built into Pandas. You don't need to write it. The only problem is that it takes a list of values without further lists within it. In this case, you can just convert each inner array to a string with map(str, list(np.array(values).T)) (wrapped in list(...) on Python 3) and factorize that!
>>> import pandas as pd
>>> pd.factorize(list(map(str, np.array(values).T)))
(array([0, 1, 2, 3, 0, 3, 4]),
array(['[10 21]', '[11 21]', '[10 22]', '[11 22]', '[10 23]'], dtype=object))
I have converted your list of arrays into an array, and then each column into a string ...

Mask One 2D Numpy Array By Argmax Along Axis Of Another Array

I have a 2D numpy array that I need to take the max of along a specific axis. I then need to later know which indexes were selected for this operation as a mask for another operation which is only done on those same indexes but on another array of the same shape.
Right now I'm doing it with 2D array indexing, but it's slow and kind of convoluted, particularly the mgrid hack to generate the row indexes. It's just [0,1] for this example, but I need the robustness to work with arbitrary shapes.
a = np.array([[0,0,5],[0,0,5]])
b = np.array([[1,1,1],[1,1,1]])
columnIndexes = np.argmax(a, axis=1)
rowIndexes = np.mgrid[0:a.shape[0], 0:columnIndexes.size-1][0].flatten()
b[rowIndexes, columnIndexes] = b[rowIndexes, columnIndexes] + 1
b should now be array([[1,1,2],[1,1,2]]), since the operation was performed on b only at the indexes of the max along the columns of a.
Anyone know a better way? Preferably using just boolean masking arrays so that I can port this code to run on a GPU without too much hassle. Thanks!
I will suggest an answer but with slightly different data.
c = np.array([[0,1,1],[2,1,0]]) # note that this data has dupes for max in row 1
d = np.array([[0,10,10],[20,10,0]]) # data to be chaged
c_argmax = np.argmax(c,axis=1)[:,np.newaxis]
b_map1 = c_argmax == np.arange(c.shape[1])
# now use the bool map as you described
d[b_map1] += 1
d
[out]
array([[ 0, 11, 10],
       [21, 10,  0]])
Note that I created an original with a duplicate of the largest number. The above works with argmax as you requested, but you might instead want to increment all max values, as in:
c_max = np.max(c,axis=1)[:,np.newaxis]
b_map2 = c_max == c
d[b_map2] += 1
d
[out]
array([[ 0, 12, 11],
       [22, 10,  0]])
