numpy: make sub-arrays based on unique columns - python

I have an example array that looks like array = np.array([[1,1,0,1], [0,1,0,0], [1,1,1,0], [0,0,1,2], [0,1,3,2], [1,1,0,1], [0,1,0,0]]) ...
array([[1, 1, 0, 1],
[0, 1, 0, 0],
[1, 1, 1, 0],
[0, 0, 1, 2],
[0, 1, 3, 2],
[1, 1, 0, 1],
[0, 1, 0, 0]])
With this in mind, I want to reformat this array into subarrays based on the first two columns. Using How to split a numpy array based on a column? as a reference, I made this array into a list of arrays with ...
df = pd.DataFrame(array)
df['4'] = df[0].astype(str) + df[1].astype(str)
df['4'] = df['4'].astype(int)
arr = df.to_numpy()
y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])]
where y is ...
[array([[0, 0, 1, 2, 0]]),
array([[0, 1, 0, 0, 1],
[0, 1, 3, 2, 1],
[0, 1, 0, 0, 1]]),
array([[ 1, 1, 0, 1, 11],
[ 1, 1, 1, 0, 11],
[ 1, 1, 0, 1, 11]])]
This works fine but it takes far too long for y to run. The amount of time it takes increases exponentially with every row. I am playing around with hundreds of millions of rows and y = [arr[arr[:,4]==k] for k in np.unique(arr[:,4])] is not practical from a time standpoint.
Any ideas on how to speed this up?

What about using the numpy_indexed library:
import numpy as np
import numpy_indexed as npi
a = np.array([[1, 1, 0, 1],
[0, 1, 0, 0],
[1, 1, 1, 0],
[0, 0, 1, 2],
[0, 1, 3, 2],
[1, 1, 0, 1],
[0, 1, 0, 0]])
key = np.dot(a[:,:2], [1, 10])
y = npi.group_by(key).split_array_as_list(a)
Output
y
[array([[0, 0, 1, 2]]),
array([[0, 1, 0, 0],
[0, 1, 3, 2],
[0, 1, 0, 0]]),
array([[ 1, 1, 0, 1],
[ 1, 1, 1, 0],
[ 1, 1, 0, 1]])]
You can easily install the library with:
> pip install numpy-indexed
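As a side note (my addition, not from the original answer): the dot-based key can collide if the first column holds values of 10 or more; numpy_indexed should also be able to group on the two columns directly as a composite row key, something like:
y = npi.group_by(a[:, :2]).split_array_as_list(a)
which avoids constructing an integer key at all.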

Let me know if this performs better:
from collections import defaultdict
import numpy as np
outgen = defaultdict(list)
# arr: the input NumPy array (np.ndarray)
c = map(lambda x: ((x[0], x[1]), x), arr)
for key, val in c:
    outgen[key].append(val)
# outgen: the required output (list[np.ndarray])
outgen = [np.array(x) for x in outgen.values()]

You can use np.unique directly here.
unique, indexer = np.unique(arr[:, :2], axis=0, return_inverse=True)
{tuple(u): arr[indexer == i, :] for i, u in enumerate(unique)}
This is probably about as good as it gets for your desired output. However, instead of splitting it into a list of subarrays you could sort it by the unique key and then work with slices. This might be helpful if there are many unique values leading to a long list.
arr[:] = arr[np.argsort(indexer, kind='stable'), :]  # a stable sort preserves the original order within each group
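A minimal sketch of that slice-based variant (my addition, assuming arr and indexer from above): sort once, derive the group boundaries from the counts, and take slices, which are views rather than copies.
order = np.argsort(indexer, kind='stable')
sorted_arr = arr[order]
counts = np.bincount(indexer)
bounds = np.concatenate(([0], np.cumsum(counts)))
groups = [sorted_arr[bounds[i]:bounds[i + 1]] for i in range(len(counts))]  # each entry is a view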
EDIT:
Here is a powerful solution which I have been using for a sort of 2-D factorization. It takes 8ms for 1 million rows of single digit integers (vs > 100ms for np.unique).
import pandas as pd
from pandas.core.sorting import get_group_index  # internal pandas helper; its location may vary across versions

columns = x[:, 0], x[:, 1]  # x is the input array (the question's arr)
factored = map(pd.factorize, columns)
codes, unique_values = map(list, zip(*factored))
group_index = get_group_index(codes, list(map(len, unique_values)), sort=False, xnull=False)
It uses the internal algorithm of DataFrame.drop_duplicates.
Note that the ordering of the keys is not the sort order of the unique tuples.
There is also a new open source library, riptable, which emulates numpy and pandas in some ways but can be a lot more powerful. The creation of the key takes around 4 ms:
import riptable as rt
columns = [x[:, 0], x[:, 1]]
unique_values, key = rt.unique(columns, return_inverse=True)
Here, unique_values is a tuple containing two arrays, which can be zipped to get the unique tuples.
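For example (a small sketch based on that description):
unique_tuples = list(zip(*unique_values))  # pair up the two unique-value arrays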

Related

Replacing the values of a numpy array of zeros using an array of indexes

I'm working with numpy and I have a problem with indexing. I have a numpy array of zeros and a 2D array of indexes. What I need is to use these indexes to change the values of the array of zeros to 1. I tried something, but it's not working; here is what I tried.
import numpy as np
idx = np.array([[0, 3, 4],
                [1, 3, 5],
                [0, 4, 5]]) #Array of indexes
zeros = np.zeros(6) #Array of zeros [0, 0, 0, 0, 0, 0]
repeat = np.tile(zeros, (idx.shape[0], 1)) #This repeats the array of zeros to match the number of rows of the index array
res = []
for i, j in zip(repeat, idx):
    res.append(i[j] = 1) #Here I try to replace the matching index by the value of 1
output = np.array(res)
but I get the syntax error
expression cannot contain assignment, perhaps you meant "=="?
my desired output should be
output = [[1, 0, 0, 1, 1, 0],
[0, 1, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 1]]
This is just an example; the idx array can be bigger. I think the problem is the indexing, and I believe there is a much simpler way of doing this without repeating the array of zeros and using the zip function, but I can't figure it out. Any help would be appreciated, thank you!
EDIT: When I change the = to == I get a boolean array, which I don't need, so I don't know what's happening there either.
You can use np.put_along_axis to assign values into the array repeat based on indices in idx. This is more efficient than a loop (and easier).
import numpy as np
idx = np.array([[0, 3, 4],
[1, 3, 5],
[0, 4, 5]]) #Array of index
zeros = np.zeros(6).astype(int) #Array of zeros [0, 0, 0, 0, 0, 0]
repeat = np.tile(zeros, (idx.shape[0], 1))
np.put_along_axis(repeat, idx, 1, 1)
repeat will then be:
array([[1, 0, 0, 1, 1, 0],
[0, 1, 0, 1, 0, 1],
[1, 0, 0, 0, 1, 1]])
FWIW, you can also make the array of zeros directly by passing in the shape:
np.zeros([idx.shape[0], 6])
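An alternative sketch (my addition, not from the answer): plain integer fancy indexing gives the same result by broadcasting a column of row indices against idx.
out = np.zeros((idx.shape[0], 6), dtype=int)
out[np.arange(idx.shape[0])[:, None], idx] = 1  # each row i gets 1 at the columns listed in idx[i]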

Updating numpy 2-dimensional array according to conditions across different 2-D arrays

In the code that I am writing, I have three 2D numpy arrays with the same dimensions (m x n), with each 2D array containing info about a specific trait, but each corresponding cell (with a specific row/col value) across all three 2D arrays corresponding to a specific person. The three 2D arrays are trait1, trait2, and trait3. As an example, person (0, 0) will have traits 1 and 2, but not trait 3, if trait1 and trait2 have a value of 1 at location (0, 0) but trait3 does not.
What would be an efficient method of updating a 2D array at a specific location based on the values of other corresponding 2D arrays of the same dimension at the same location? That is, how can I efficiently update a 2D array at a specific location such that the other 2D arrays at this same location fulfill specific conditions?
I am currently trying to update the values of the 2D array trait1 and trait2 according to the current values of trait1 and trait2 (such that the corresponding trait1 value == 1, and the corresponding trait2 value == 0); I am also trying to update the values of trait3 according to the current values of trait1, and trait2 (under the same conditions as the previous). However, I am having trouble doing this without using nested for loops, which greatly slows down my program.
Below is my current approach, which works, but is much too slow for my purposes:
for i in range(0, m):
    for j in range(0, n):
        if trait1[i][j] == 1:
            if trait2[i][j] == 0:
                trait1[i][j] = 0
                trait2[i][j] = 1
                new_color(i, j, 1) #updates the color of the specific person on a grid
                trait3[i][j] = 0
        elif trait1[i][j] == 0:
            if trait2[i][j] <= 0:
                trait1[i][j] = 1
                trait2[i][j] = 0
                new_color(i, j, 0)
NumPy arrays are really slow if you use loops. If you can use matrix operations / numpy functions for everything, it will go much faster.
In your case, you could first extract the indices you're interested in, and then update your matrices like this:
import numpy as np
np.random.seed(1)
# Generate some sample data
trait1, trait2, trait3 = ( np.random.randint(0,2, [4,4]) for _ in range(3) )
In [4]: trait1
Out[4]:
array([[1, 1, 0, 0],
[1, 1, 1, 1],
[1, 0, 0, 1],
[0, 1, 1, 0]])
In [5]: trait2
Out[5]:
array([[0, 1, 0, 0],
[0, 1, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0]])
In [6]: trait3
Out[6]:
array([[1, 1, 1, 1],
[1, 0, 0, 0],
[1, 1, 1, 1],
[1, 1, 0, 1]])
And then:
cond1_idx = np.where((trait1 == 1) & (trait2==0))
cond2_idx = np.where((trait1 == 0) & (trait2<=0))
trait1[cond1_idx] = 0
trait2[cond1_idx] = 1
trait3[cond1_idx] = 0
[ new_color(i, j, 1) for i,j in zip(*cond1_idx) ]
trait1[cond2_idx] = 1
trait2[cond2_idx] = 0
[ new_color(i, j, 0) for i,j in zip(*cond2_idx) ]
Result:
In [2]: trait1
Out[2]:
array([[0, 1, 1, 1],
[0, 1, 0, 0],
[1, 1, 1, 0],
[0, 0, 0, 1]])
In [3]: trait2
Out[3]:
array([[1, 1, 0, 0],
[1, 1, 1, 1],
[1, 0, 0, 1],
[1, 1, 1, 0]])
In [4]: trait3
Out[4]:
array([[0, 1, 1, 1],
[0, 0, 0, 0],
[1, 1, 1, 0],
[1, 0, 0, 1]])
I cannot really test the new_color calls though, since I don't have the function.
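For completeness, a hedged variant of the same idea (my sketch): the np.where calls are only needed to feed new_color; the array updates themselves can use boolean masks directly, as long as both masks are computed before any update is applied.
cond1 = (trait1 == 1) & (trait2 == 0)
cond2 = (trait1 == 0) & (trait2 <= 0)
trait1[cond1], trait2[cond1], trait3[cond1] = 0, 1, 0
trait1[cond2], trait2[cond2] = 1, 0
# new_color still needs explicit coordinates, e.g. zip(*np.nonzero(cond1))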

Count occurrences of unique arrays in array

I have a numpy array of various one-hot encoded numpy arrays, e.g.:
x = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])
I would like to count the occurrences of each unique one-hot vector:
{[1, 0, 0]: 2, [0, 0, 1]: 1}
Approach #1
Seems like a perfect setup to use the new functionality of numpy.unique (v1.13 and newer) that lets us work along an axis of a NumPy array -
unq_rows, count = np.unique(x,axis=0, return_counts=1)
out = {tuple(i):j for i,j in zip(unq_rows,count)}
Sample outputs -
In [289]: unq_rows
Out[289]:
array([[0, 0, 1],
[1, 0, 0]])
In [290]: count
Out[290]: array([1, 2])
In [291]: {tuple(i):j for i,j in zip(unq_rows,count)}
Out[291]: {(0, 0, 1): 1, (1, 0, 0): 2}
Approach #2
For NumPy versions older than v1.13, we can make use of the fact that the input array is one-hot encoded array, like so -
_, idx, count = np.unique(x.argmax(1), return_counts=1, return_index=1)
out = {tuple(i):j for i,j in zip(x[idx],count)} # x[idx] is unq_rows
You could convert your arrays to tuples and use a Counter:
import numpy as np
from collections import Counter
x = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])
Counter([tuple(a) for a in x])
# Counter({(1, 0, 0): 2, (0, 0, 1): 1})
The fastest way given your data format is:
x.sum(axis=0)
which gives:
array([2, 0, 1])
Where the 1st result is the count of arrays whose 1st position is hot:
[1, 0, 0] -> 2
[0, 1, 0] -> 0
[0, 0, 1] -> 1
This exploits the fact that only one can be on at a time, so we can decompose the direct sum.
If you absolutely need it expanded to the same format, it can be converted via:
sums = x.sum(axis=0)
{tuple(int(k == i) for k in range(len(sums))): e for i, e in enumerate(sums)}
or, similarly to tarashypka:
{tuple(row): count for row, count in zip(np.eye(len(sums), dtype=np.int64), sums)}
yields:
{(1, 0, 0): 2, (0, 1, 0): 0, (0, 0, 1): 1}
Here is another interesting solution with sum
>>> {tuple(v): n for v, n in zip(np.eye(x.shape[1], dtype=int), np.sum(x, axis=0))
...  if n > 0}
{(0, 0, 1): 1, (1, 0, 0): 2}
Lists (including numpy arrays) are unhashable, i.e. they can't be keys of a dictionary. So your precise desired output, a dictionary with keys that look like [1, 0, 0] is never possible in Python. To deal with this you need to map your vectors to tuples.
from collections import Counter
import numpy as np
x = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0]])
counts = Counter(map(tuple, x))
That will get you:
In [12]: counts
Out[12]: Counter({(0, 0, 1): 1, (1, 0, 0): 2})

What is a faster way to get the location of unique rows in numpy

I have a list of unique rows and another larger array of data (called test_rows in example). I was wondering if there was a faster way to get the location of each unique row in the data. The fastest way that I could come up with is...
import numpy
uniq_rows = numpy.array([[0, 1, 0],
[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
test_rows = numpy.array([[0, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0, 1, 1],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
[1, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0]])
# this gives me the indexes of each group of unique rows
for row in uniq_rows.tolist():
    print row, numpy.where((test_rows == row).all(axis=1))[0]
This prints...
[0, 1, 0] [ 1 4 10]
[1, 1, 0] [ 3 8 12]
[1, 1, 1] [7 9]
[0, 1, 1] [0 5 6]
Is there a better or more numpythonic (not sure if that word exists) way to do this? I was searching for a numpy group function but could not find it. Basically for any incoming dataset I need the fastest way to get the locations of each unique row in that data set. The incoming dataset will not always have every unique row or the same number.
EDIT:
This is just a simple example. In my application the numbers would not be just zeros and ones, they could be anywhere from 0 to 32000. The size of uniq rows could be between 4 to 128 rows and the size of test_rows could be in the hundreds of thousands.
Numpy
From version 1.13 of numpy you can use numpy.unique like np.unique(test_rows, return_counts=True, return_index=True, axis=0)
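A hedged sketch of how the per-row locations could then be recovered (my addition, assuming NumPy >= 1.13), using return_inverse instead of return_index:
uniq, inv = np.unique(test_rows, axis=0, return_inverse=True)
locations = {tuple(row): np.flatnonzero(inv == i) for i, row in enumerate(uniq)}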
Pandas
df = pd.DataFrame(test_rows)
uniq = pd.DataFrame(uniq_rows)
uniq
0 1 2
0 0 1 0
1 1 1 0
2 1 1 1
3 0 1 1
Or you could generate the unique rows automatically from the incoming DataFrame
uniq_generated = df.drop_duplicates().reset_index(drop=True)
yields
0 1 2
0 0 1 1
1 0 1 0
2 0 0 0
3 1 1 0
4 1 1 1
and then look for it
d = dict()
for idx, row in uniq.iterrows():
    d[idx] = df.index[(df == row).all(axis=1)].values
This is about the same as your where method
d
{0: array([ 1, 4, 10], dtype=int64),
1: array([ 3, 8, 12], dtype=int64),
2: array([7, 9], dtype=int64),
3: array([0, 5, 6], dtype=int64)}
There are a lot of solutions here, but I'm adding one with vanilla numpy. In most cases numpy will be faster than list comprehensions and dictionaries, although the array broadcasting may cause memory to be an issue if large arrays are used.
np.where((uniq_rows[:, None, :] == test_rows).all(2))
Wonderfully simple, eh? This returns a tuple of unique row indices and the corresponding test row indices.
(array([0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 3]),
array([ 1, 4, 10, 3, 8, 12, 7, 9, 0, 5, 6]))
How it works:
(uniq_rows[:, None, :] == test_rows)
Uses array broadcasting to compare each element of test_rows with each row in uniq_rows. This results in a 4x13x3 array. all is used to determine which rows are equal (all comparisons returned true). Finally, where returns the indices of these rows.
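If a per-row mapping is wanted, the two index arrays can be regrouped, for instance (a sketch, my addition):
u_idx, t_idx = np.where((uniq_rows[:, None, :] == test_rows).all(2))
locations = {tuple(uniq_rows[i]): t_idx[u_idx == i] for i in range(len(uniq_rows))}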
With the np.unique from v1.13 (downloaded from the source link on the latest documentation, https://github.com/numpy/numpy/blob/master/numpy/lib/arraysetops.py#L112-L247)
In [157]: aset.unique(test_rows, axis=0,return_inverse=True,return_index=True)
Out[157]:
(array([[0, 0, 0],
[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]]),
array([2, 1, 0, 3, 7], dtype=int32),
array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32))
In [158]: a,b,c=_
In [159]: c
Out[159]: array([2, 1, 0, 3, 1, 2, 2, 4, 3, 4, 1, 0, 3], dtype=int32)
In [164]: from collections import defaultdict
In [165]: dd = defaultdict(list)
In [166]: for i,v in enumerate(c):
     ...:     dd[v].append(i)
     ...:
In [167]: dd
Out[167]:
defaultdict(list,
{0: [2, 11],
1: [1, 4, 10],
2: [0, 5, 6],
3: [3, 8, 12],
4: [7, 9]})
or indexing the dictionary with the unique rows (as hashable tuple):
In [170]: dd = defaultdict(list)
In [171]: for i,v in enumerate(c):
     ...:     dd[tuple(a[v])].append(i)
     ...:
In [172]: dd
Out[172]:
defaultdict(list,
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]})
This will do the job:
import numpy as np
uniq_rows = np.array([[0, 1, 0],
[1, 1, 0],
[1, 1, 1],
[0, 1, 1]])
test_rows = np.array([[0, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0],
[0, 1, 0],
[0, 1, 1],
[0, 1, 1],
[1, 1, 1],
[1, 1, 0],
[1, 1, 1],
[0, 1, 0],
[0, 0, 0],
[1, 1, 0]])
indices=np.where(np.sum(np.abs(np.repeat(uniq_rows,len(test_rows),axis=0)-np.tile(test_rows,(len(uniq_rows),1))),axis=1)==0)[0]
loc=indices//len(test_rows)
indices=indices-loc*len(test_rows)
res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
print(res)
[[1, 4, 10], [3, 8, 12], [7, 9], [0, 5, 6]]
This will work for all the cases including the cases in which not all the rows in uniq_rows are present in test_rows. However, if somehow you know ahead that all of them are present, you could replace the part
res=[[] for i in range(len(uniq_rows))]
for i in range(len(indices)):
    res[loc[i]].append(indices[i])
with just the row:
res=np.split(indices,np.where(np.diff(loc)>0)[0]+1)
Thus avoiding loops entirely.
Not very 'numpythonic', but for a bit of an upfront cost, we can make a dict with the keys as a tuple of your row, and a list of indices:
test_rowsdict = {}
for i,j in enumerate(test_rows):
    test_rowsdict.setdefault(tuple(j),[]).append(i)
test_rowsdict
{(0, 0, 0): [2, 11],
(0, 1, 0): [1, 4, 10],
(0, 1, 1): [0, 5, 6],
(1, 1, 0): [3, 8, 12],
(1, 1, 1): [7, 9]}
Then you can filter based on your uniq_rows, with a fast dict lookup: test_rowsdict[tuple(row)]:
out = []
for i in uniq_rows:
    out.append((i, test_rowsdict.get(tuple(i),[])))
For your data, I get 16us for just the lookup, and 66us for building and looking up, versus 95us for your np.where solution.
Approach #1
Here's one approach, not sure about the level of "NumPythonic-ness" though to such a tricky problem -
def get1Ds(a, b): # Get 1D views of each row from the two inputs
    # check that casting to void will create equal size elements
    assert a.shape[1:] == b.shape[1:]
    assert a.dtype == b.dtype
    # compute dtypes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    # convert to 1d void arrays
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    a_void = a.reshape(a.shape[0], -1).view(void_dt).ravel()
    b_void = b.reshape(b.shape[0], -1).view(void_dt).ravel()
    return a_void, b_void

def matching_row_indices(uniq_rows, test_rows):
    A, B = get1Ds(uniq_rows, test_rows)
    validA_mask = np.in1d(A,B)
    sidx_A = A.argsort()
    validA_mask = validA_mask[sidx_A]
    sidx = B.argsort()
    sortedB = B[sidx]
    split_idx = np.flatnonzero(sortedB[1:] != sortedB[:-1])+1
    all_split_indx = np.split(sidx, split_idx)
    match_mask = np.in1d(B,A)[sidx]
    valid_mask = np.logical_or.reduceat(match_mask, np.r_[0, split_idx])
    locations = [e for i,e in enumerate(all_split_indx) if valid_mask[i]]
    return uniq_rows[sidx_A[validA_mask]], locations
Scope(s) of improvement (on performance):
np.split could be replaced by a for-loop for splitting using slicing.
np.r_ could be replaced by np.concatenate.
Sample run -
In [331]: unq_rows, idx = matching_row_indices(uniq_rows, test_rows)
In [332]: unq_rows
Out[332]:
array([[0, 1, 0],
[0, 1, 1],
[1, 1, 0],
[1, 1, 1]])
In [333]: idx
Out[333]: [array([ 1, 4, 10]),array([0, 5, 6]),array([ 3, 8, 12]),array([7, 9])]
Approach #2
Another approach to beat the setup overhead from the previous one and making use of get1Ds from it, would be -
A, B = get1Ds(uniq_rows, test_rows)
idx_group = []
for row in A:
    idx_group.append(np.flatnonzero(B == row))
The numpy_indexed package (disclaimer: I am its author) was created to solve problems of this kind in an elegant and efficient manner:
import numpy_indexed as npi
indices = np.arange(len(test_rows))
unique_test_rows, index_groups = npi.group_by(test_rows, indices)
If you don't care about the indices of all rows, but only those present in test_rows, npi has a bunch of simple ways of tackling that problem too; for instance:
subset_indices = npi.indices(unique_test_rows, uniq_rows)
As a side note, it might be useful to take a look at the examples in the npi library; in my experience, most of the time people ask a question of this kind, these grouped indices are just a means to an end, and not the end goal of the computation. Chances are that using the functionality in npi you can reach that end goal more efficiently, without ever explicitly computing those indices. Do you care to give some more background to your problem?
EDIT: if your arrays are indeed this big, and always consist of a small number of columns with binary values, wrapping them with the following encoding might boost efficiency a lot further still:
def encode(rows):
    return (rows * [[2**i for i in range(rows.shape[1])]]).sum(axis=1, dtype=np.uint8)
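A possible usage sketch (my addition, assuming at most 8 binary columns since the keys are packed into uint8): encode both arrays once, then match the small 1-D integer keys instead of whole rows.
test_keys = encode(test_rows)
uniq_keys = encode(uniq_rows)
locations = [np.flatnonzero(test_keys == k) for k in uniq_keys]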

Efficient way to count unique elements in array in numpy/scipy in Python

I have a scipy array, e.g.
a = array([[0, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1]])
I want to count the number of occurrences of each unique element in the array. For example, for the above array a, I want to get out that there is 1 occurrence of [0, 0, 1], 2 occurrences of [1, 1, 1] and 1 occurrence of [1, 0, 1].
One way I thought of doing it is:
from collections import defaultdict
d = defaultdict(int)
for elt in a:
    d[elt] += 1
is there a better/more efficient way?
thanks.
If moving to Python 2.7 (or 3.1) is not an issue and either of these two Python versions is available to you, perhaps the new collections.Counter might be something for you, provided you stick to hashable elements like tuples:
>>> from collections import Counter
>>> c = Counter([(0,0,1), (1,1,1), (1,1,1), (1,0,1)])
>>> c
Counter({(1, 1, 1): 2, (0, 0, 1): 1, (1, 0, 1): 1})
I haven't done any performance testing on these two approaches, though.
You can sort the array lexicographically by rows and then look for points where the rows change:
In [1]: a = array([[0, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1]])
In [2]: b = a[lexsort(a.T)]
In [3]: b
Out[3]:
array([[0, 0, 1],
[1, 0, 1],
[1, 1, 1],
[1, 1, 1]])
...
In [5]: (b[1:] - b[:-1]).any(-1)
Out[5]: array([ True, True, False], dtype=bool)
The last array says that the first three rows differ and the third row is repeated twice.
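Continuing that idea (a sketch, my addition, in the same session where numpy's names are in the namespace and b is the lexsorted array), the change points can be turned into unique rows and counts:
starts = concatenate(([True], (b[1:] != b[:-1]).any(-1)))
unique_rows = b[starts]
counts = diff(flatnonzero(concatenate((starts, [True]))))
# unique_rows -> [[0, 0, 1], [1, 0, 1], [1, 1, 1]], counts -> [1, 1, 2]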
For arrays of ones and zeros you can encode the values:
In [6]: bincount(dot(a, array([4,2,1])))
Out[6]: array([0, 1, 0, 0, 0, 1, 0, 2])
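To recover which row each bin corresponds to, the codes can be decoded again, for example (a sketch, my addition, assuming three 0/1 columns as above):
counts = bincount(dot(a, array([4, 2, 1])))
present = flatnonzero(counts)
rows = (present[:, None] >> array([2, 1, 0])) & 1  # unpack the bit codes back into 0/1 rows
# dict(zip(map(tuple, rows), counts[present])) -> {(0, 0, 1): 1, (1, 0, 1): 1, (1, 1, 1): 2}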
Dictionaries can also be used. Which of the various methods will be fastest will depend on the sort of arrays you are actually working with.
For Python < 2.7 (before collections.Counter):
import itertools
data_array = [[0, 0, 1], [1, 1, 1], [1, 1, 1], [1, 0, 1]]
dict_ = {}
# itertools.groupby only groups consecutive equal elements, so sort the rows first
for list_, count in itertools.groupby(sorted(data_array)):
    dict_[tuple(list_)] = len(list(count))
The numpy_indexed package (disclaimer: I am its author) provides a solution similar to the one posted by chuck; which is a nicely vectorized one. But with tests, a nice interface, and many more related useful functions:
import numpy_indexed as npi
npi.count(a)
