Build a large numpy array from itertools.product - python

I want to build a numpy array from the result of itertools.product. My first approach was a simple:
from itertools import product
import numpy as np
max_init = 6
init_values = range(1, max_init + 1)
repetitions = 12
result = np.array(list(product(init_values, repeat=repetitions)))
This code works well for "small" repetitions (like <=4), but with "large" values (>= 12) it completely hogs the memory and crashes. I assumed that building the list was the thing eating all the RAM, so I searched how to make it directly with an array. I found Numpy equivalent of itertools.product and Using numpy to build an array of all combinations of two arrays.
So, I tested the following alternatives:
Alternative #1:
results = np.empty((max_init**repetitions, repetitions))
for i, row in enumerate(product(init_values, repeat=repetitions)):
    results[i] = row
Alternative #2:
init_values_args = [init_values] * repetitions
results = np.array(np.meshgrid(*init_values_args)).T.reshape(-1, repetitions)
Alternative #3:
results = np.indices([max_init] * repetitions).reshape(repetitions, -1).T + 1
#1 is extremely slow. I didn't have enough patience to let it finish (after a few minutes of processing on a 2017 MacBook Pro). #2 and #3 eat all the memory until the python interpreter crashes, as with the initial approach.
After that, I thought that I could express the same information in a different way that was still useful for me: a dict where the keys would be all the possible (sorted) combinations, and the values would be the counting of these combinations. So I tried:
Alternative #4:
from collections import Counter
def sorted_product(iterable, repeat=1):
    for el in product(iterable, repeat=repeat):
        yield tuple(sorted(el))

def count_product(repetitions=1, max_init=6):
    init_values = range(1, max_init + 1)
    sp = sorted_product(init_values, repeat=repetitions)
    counted_sp = Counter(sp)
    return np.array(list(counted_sp.values())), \
           np.array(list(counted_sp.keys()))

cnt, values = count_product(repetitions=repetitions, max_init=max_init)
But the line counted_sp = Counter(sp), which triggers getting all the values of the generators, is also too slow for my needs (it also took several minutes before I canceled it).
Is there another way to generate the same data (or a different data structure containing the same information) that does not have the mentioned shortcomings of being too slow or using too much memory?
PS: I tested all the implementations above with a small number of repetitions, and all the tests passed, so they give consistent results.
I hope that editing the question is the best way to expand it. Otherwise, let me know and I'll post it where I should.
After reading the first two answers below and thinking about it, I agree that I am approaching the issue from the wrong angle. Instead of going with a "brute force" approach, I should have used probabilities and worked with those.
My intention is, later on, for each combination:
- Count how many values are under a threshold X.
- Count how many values are equal to or over threshold X and below a threshold Y.
- Count how many values are equal to or over threshold Y.
And group the combinations that have the same counts.
As an illustrative example:
If I roll 12 dice of 6 sides, what's the probability of having M dice with a value < 3, N dice with a value >= 3 and < 5, and P dice with a value >= 5, for all possible combinations of M, N, and P?
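A minimal sketch of that probability-based route (my own illustration, not part of the original question; the helper name prob_bands and the band boundaries x and y are assumptions): the triple (M, N, P) follows a multinomial distribution, so each probability comes from a closed formula instead of enumerating the 6**12 outcomes.
from math import comb

def prob_bands(m, n, p, repetitions=12, sides=6, x=3, y=5):
    # Probability that exactly m dice land below x, n dice land in [x, y),
    # and p dice land at or above y.
    if m + n + p != repetitions:
        return 0.0
    p_low = (x - 1) / sides           # value < x
    p_mid = (y - x) / sides           # x <= value < y
    p_high = (sides - y + 1) / sides  # value >= y
    # Multinomial probability of the split (m, n, p).
    return comb(repetitions, m) * comb(repetitions - m, n) * p_low**m * p_mid**n * p_high**p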
So, I think that I'll close this question in a few days while I go with this new approach. Thank you for all the feedback and your time!

The number of tuples that list(product(range(1,7), repeat=12)) makes is 6**12 = 2,176,782,336. Whether stored as a list or an array, that's probably too large for most computers.
In [119]: len(list(product(range(1,7),repeat=12)))
....
MemoryError:
Trying to make an array of that size directly:
In [129]: A = np.ones((6**12,12),int)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-129-e833a9e859e0> in <module>()
----> 1 A = np.ones((6**12,12),int)
/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py in ones(shape, dtype, order)
190
191 """
--> 192 a = empty(shape, dtype, order)
193 multiarray.copyto(a, 1, casting='unsafe')
194 return a
ValueError: Maximum allowed dimension exceeded
Array memory size, at 4 bytes per item:
In [130]: 4*12*6**12
Out[130]: 104485552128
That's roughly 100 GB. Why do you need to generate 2 billion combinations of 12 numbers?
So with your Counter you reduce the number of items
In [138]: sp = sorted_product(range(1,7), 2)
In [139]: counter=Counter(sp)
In [140]: counter
Out[140]:
Counter({(1, 1): 1,
(1, 2): 2,
(1, 3): 2,
(1, 4): 2,
(1, 5): 2,
(1, 6): 2,
(2, 2): 1,
(2, 3): 2,
(2, 4): 2,
(2, 5): 2,
(2, 6): 2,
(3, 3): 1,
(3, 4): 2,
(3, 5): 2,
(3, 6): 2,
(4, 4): 1,
(4, 5): 2,
(4, 6): 2,
(5, 5): 1,
(5, 6): 2,
(6, 6): 1})
from 36 to 21 (for 2 repetitions). It shouldn't be hard to generalize this to more repetitions (combinations? permutations?), but it will still push time and/or memory boundaries.
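One possible way to do that generalization (a sketch of mine, not part of the answer above; the helper name counted_sorted_product is hypothetical): the sorted tuples are exactly what itertools.combinations_with_replacement yields, and each tuple's count is a multinomial coefficient, so for 12 repetitions the Counter can be built from C(17, 12) = 6188 entries instead of 6**12 products.
from itertools import combinations_with_replacement
from math import factorial
from collections import Counter
import numpy as np

def counted_sorted_product(max_init=6, repetitions=12):
    keys, counts = [], []
    for combo in combinations_with_replacement(range(1, max_init + 1), repetitions):
        # Number of orderings of this multiset: repetitions! / prod(k_i!)
        mult = factorial(repetitions)
        for k in Counter(combo).values():
            mult //= factorial(k)
        keys.append(combo)
        counts.append(mult)
    return np.array(keys), np.array(counts)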
A variant on meshgrid using mgrid:
In [175]: n=7; A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
In [176]: A.shape
Out[176]: (279936, 7)
In [177]: B=np.array(list(product(range(1,7),repeat=7)))
In [178]: B.shape
Out[178]: (279936, 7)
In [179]: A[:10]
Out[179]:
array([[1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 1, 1, 3],
[1, 1, 1, 1, 1, 1, 4],
[1, 1, 1, 1, 1, 1, 5],
[1, 1, 1, 1, 1, 1, 6],
[1, 1, 1, 1, 1, 2, 1],
[1, 1, 1, 1, 1, 2, 2],
[1, 1, 1, 1, 1, 2, 3],
[1, 1, 1, 1, 1, 2, 4]])
In [180]: B[:10]
Out[180]:
array([[1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1, 2],
[1, 1, 1, 1, 1, 1, 3],
[1, 1, 1, 1, 1, 1, 4],
[1, 1, 1, 1, 1, 1, 5],
[1, 1, 1, 1, 1, 1, 6],
[1, 1, 1, 1, 1, 2, 1],
[1, 1, 1, 1, 1, 2, 2],
[1, 1, 1, 1, 1, 2, 3],
[1, 1, 1, 1, 1, 2, 4]])
In [181]: np.allclose(A,B)
mgrid is quite a bit faster:
In [182]: timeit B=np.array(list(product(range(1,7),repeat=7)))
317 ms ± 3.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [183]: timeit A=np.mgrid[[slice(1,7)]*n].reshape(n,-1).T
13.9 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but, yes, it will have the same overall memory usage and limit.
With n=10,
ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

The right answer is: Don't. Whatever you want to do with all these combinations, adjust your approach so that you either generate them one at a time and use them immediately without storing them, or better yet, find a way to get the job done without inspecting every combination. Your current solution works for toy problems, but is not suitable for larger parameters. Explain what you are up to and maybe someone here can help.
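To illustrate the "generate them one at a time and use them immediately" option, here is a sketch of mine (the band boundaries x and y are assumptions): it tallies the (M, N, P) counts in constant memory, but it still enumerates every outcome, so for 12 repetitions the probability-based approach mentioned in the question's edit remains the better fix.
from itertools import product
from collections import Counter

def band_counts_stream(repetitions=4, max_init=6, x=3, y=5):
    # Tally (M, N, P) = (values < x, values in [x, y), values >= y) for every
    # outcome, consuming each tuple as it is generated instead of storing it.
    tally = Counter()
    for combo in product(range(1, max_init + 1), repeat=repetitions):
        m = sum(v < x for v in combo)
        p = sum(v >= y for v in combo)
        tally[(m, repetitions - m - p, p)] += 1
    return tally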

Related

Row frequency in numpy array [duplicate]

I have the following type of arrays:
a = array([[1,1,1],
           [1,1,1],
           [1,1,1],
           [2,2,2],
           [2,2,2],
           [2,2,2],
           [3,3,0],
           [3,3,0],
           [3,3,0]])
I would like to count the number of occurrences of each kind of row, such as
[1,1,1]: 3, [2,2,2]: 3, and [3,3,0]: 3
How could I achieve this in Python? Is it possible without using a for loop and counting into a dictionary? It has to be fast and should take less than 0.1 seconds or so. I looked into Counter, numpy bincount, etc., but those work on individual elements, not on rows.
Thanks.
If you don't mind mapping to tuples just to get the count, you can use a Counter dict, which runs in 28.5 µs on my machine using Python 3, well below your threshold:
In [5]: timeit Counter(map(tuple, a))
10000 loops, best of 3: 28.5 µs per loop
In [6]: c = Counter(map(tuple, a))
In [7]: c
Out[7]: Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
collections.Counter can do this conveniently, and almost like the example given.
>>> from collections import Counter
>>> c = Counter()
>>> for x in a:
... c[tuple(x)] += 1
...
>>> c
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
This converts each sub-list to a tuple, which can be used as a dictionary key since tuples are immutable. Lists are mutable, so they can't be used as dict keys.
Why do you want to avoid using for loops?
And similar to @padraic-cunningham's much cooler answer:
>>> Counter(tuple(x) for x in a)
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
>>> Counter(map(tuple, a))
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
You could convert each row to a single linear index by treating its elements as a multi-dimensional index with np.ravel_multi_index. Then use np.unique, which gives us the position of the first occurrence of each unique row and also has an optional argument return_counts to give us the counts. Thus, the implementation would look something like this -
def unique_rows_counts(a):
    # Calculate linear indices using rows from a
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)
    # Get the unique indices and their counts
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    # Return the unique rows from a and their respective counts
    return a[unq_idx], counts
Sample run -
In [64]: a
Out[64]:
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1],
[2, 2, 2],
[2, 2, 2],
[2, 2, 2],
[3, 3, 0],
[3, 3, 0],
[3, 3, 0]])
In [65]: unqrows, counts = unique_rows_counts(a)
In [66]: unqrows
Out[66]:
array([[1, 1, 1],
[2, 2, 2],
[3, 3, 0]])
In [67]: counts
Out[67]: array([3, 3, 3])
Benchmarking
Assuming you are okay with either numpy arrays or collections as outputs, one can benchmark the solutions provided thus far, like so -
Function definitions:
import numpy as np
from collections import Counter

def unique_rows_counts(a):
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    return a[unq_idx], counts

def map_Counter(a):
    return Counter(map(tuple, a))

def forloop_Counter(a):
    c = Counter()
    for x in a:
        c[tuple(x)] += 1
    return c
Timings:
In [53]: a = np.random.randint(0,4,(10000,5))
In [54]: %timeit map_Counter(a)
10 loops, best of 3: 31.7 ms per loop
In [55]: %timeit forloop_Counter(a)
10 loops, best of 3: 45.4 ms per loop
In [56]: %timeit unique_rows_counts(a)
1000 loops, best of 3: 1.72 ms per loop
The numpy_indexed package (disclaimer: I am its author) contains efficient vectorized functionality for these kind of operations:
import numpy_indexed as npi
unique_rows, row_count = npi.count(a, axis=0)
Note that this works for arrays of any dimensionality or datatype.
Since numpy-1.13.0, np.unique can be used with the axis argument:
>>> np.unique(a, axis=0, return_counts=True)
(array([[1, 1, 1],
[2, 2, 2],
[3, 3, 0]]), array([3, 3, 3]))

Count how many times each row is present in numpy.array

I am trying to count the number of times each row appears in a np.array, for example:
import numpy as np
my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],  # duplicate of row 0
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],  # duplicate of row 0
                     [1, 1, 1, 1, 1, 0]])
Row [1, 2, 0, 1, 1, 1] shows up 3 times.
A simple naive solution would involve converting all my rows to tuples, and applying collections.Counter, like this:
from collections import Counter
def row_counter(my_array):
    list_of_tups = [tuple(ele) for ele in my_array]
    return Counter(list_of_tups)
Which yields:
In [2]: row_counter(my_array)
Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3, (1, 1, 1, 1, 1, 0): 1, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1})
However, I am concerned about the efficiency of my approach. And maybe there is a library that provides a built-in way of doing this. I tagged the question as pandas because I think that pandas might have the tool I am looking for.
You can use the answer to this other question of yours to get the counts of the unique items.
In numpy 1.9 there is a return_counts optional keyword argument, so you can simply do:
>>> my_array
array([[1, 2, 0, 1, 1, 1],
[1, 2, 0, 1, 1, 1],
[9, 7, 5, 3, 2, 1],
[1, 1, 1, 0, 0, 0],
[1, 2, 0, 1, 1, 1],
[1, 1, 1, 1, 1, 0]])
>>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
>>> b = np.ascontiguousarray(my_array).view(dt)
>>> unq, cnt = np.unique(b, return_counts=True)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0],
[1, 2, 0, 1, 1, 1],
[9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
In earlier versions, you can do it as:
>>> unq, _ = np.unique(b, return_inverse=True)
>>> cnt = np.bincount(_)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0],
[1, 2, 0, 1, 1, 1],
[9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
I think just specifying axis in np.unique gives what you need.
import numpy as np
unq, cnt = np.unique(my_array, axis=0, return_counts=True)
Note: this feature is available only in numpy>=1.13.0.
(This assumes that the array is fairly small, e.g. fewer than 1000 rows.)
Here's a short NumPy way to count how many times each row appears in an array:
>>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
array([3, 3, 1, 1, 3, 1])
This counts how many times each row appears in my_array, returning an array where the first value shows how many times the first row appears, the second value shows how many times the second row appears, and so on.
A pandas approach might look like this
import pandas as pd
df = pd.DataFrame(my_array,columns=['c1','c2','c3','c4','c5','c6'])
df.groupby(['c1','c2','c3','c4','c5','c6']).size()
Note: supplying column names is not necessary
Your solution is not bad, but if your matrix is large you will probably want to use a more efficient hash (compared to the default one Counter uses) for the rows before counting. You can do that with joblib:
A = np.random.rand(5, 10000)
%timeit (A[:,np.newaxis,:] == A).all(axis=2).sum(axis=1)
10000 loops, best of 3: 132 µs per loop
%timeit Counter(joblib.hash(row) for row in A).values()
1000 loops, best of 3: 1.37 ms per loop
%timeit Counter(tuple(ele) for ele in A).values()
100 loops, best of 3: 3.75 ms per loop
%timeit pd.DataFrame(A).groupby(range(A.shape[1])).size()
1 loops, best of 3: 2.24 s per loop
The pandas solution is extremely slow (about 2 s per loop) with this many columns. For a small matrix like the one you showed, your method is faster than joblib hashing but slower than numpy:
numpy: 100000 loops, best of 3: 15.1 µs per loop
joblib:1000 loops, best of 3: 885 µs per loop
tuple: 10000 loops, best of 3: 27 µs per loop
pandas: 100 loops, best of 3: 2.2 ms per loop
If you have a large number of rows then you can probably find a better substitute for Counter to find hash frequencies.
Edit: Added numpy benchmarks from @acjr's solution on my system so that it is easier to compare. The numpy solution is the fastest one in both cases.
A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am its author)
import numpy_indexed as npi
npi.count(my_array)

Fastest way to remove identical sub-arrays in a nd-array? [duplicate]

I need to find unique rows in a numpy.array.
For example:
>>> a # I have
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0]])
I know that I can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe that there is a way to set the data type to void and then just use numpy.unique, but I couldn't figure out how to make it work.
As of NumPy 1.13, one can simply choose the axis for selection of unique values in any N-dim array. To get unique rows, one can do:
unique_rows = np.unique(original_array, axis=0)
Yet another possible solution
np.vstack({tuple(row) for row in a})
Another option to the use of structured arrays is using a view of a void type that joins the whole row into a single item:
a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)
unique_a = a[idx]
>>> unique_a
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
EDIT
Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.
EDIT
The above can be slightly sped up, perhaps at the cost of clarity, by doing:
unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])
Also, at least on my system, performance-wise it is on par with, or even better than, the lexsort method:
a = np.random.randint(2, size=(10000, 6))
%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
100 loops, best of 3: 3.17 ms per loop
%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
100 loops, best of 3: 5.93 ms per loop
a = np.random.randint(2, size=(10000, 100))
%timeit np.unique(a.view(np.dtype((np.void, a.dtype.itemsize*a.shape[1])))).view(a.dtype).reshape(-1, a.shape[1])
10 loops, best of 3: 29.9 ms per loop
%timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]
10 loops, best of 3: 116 ms per loop
If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.
The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.
As a quick example:
import numpy as np
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])
ncols = data.shape[1]
dtype = data.dtype.descr * ncols
struct = data.view(dtype)
uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)
print uniq
To understand what's going on, have a look at the intermediary results.
Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it's a similar data structure to a list of tuples.)
In [71]: struct
Out[71]:
array([[(1, 1, 1, 0, 0, 0)],
[(0, 1, 1, 1, 0, 0)],
[(0, 1, 1, 1, 0, 0)],
[(1, 1, 1, 0, 0, 0)],
[(1, 1, 1, 1, 1, 0)]],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
In [72]: struct[0]
Out[72]:
array([(1, 1, 1, 0, 0, 0)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
Once we run numpy.unique, we'll get a structured array back:
In [73]: np.unique(struct)
Out[73]:
array([(0, 1, 1, 1, 0, 0), (1, 1, 1, 0, 0, 0), (1, 1, 1, 1, 1, 0)],
dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8'), ('f3', '<i8'), ('f4', '<i8'), ('f5', '<i8')])
That we then need to view as a "normal" array (_ stores the result of the last calculation in ipython, which is why you're seeing _.view...):
In [74]: _.view(data.dtype)
Out[74]: array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0])
And then reshape back into a 2D array (-1 is a placeholder that tells numpy to calculate the correct number of rows, given the number of columns):
In [75]: _.reshape(-1, ncols)
Out[75]:
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
Obviously, if you wanted to be more concise, you could write it as:
import numpy as np
def unique_rows(data):
    uniq = np.unique(data.view(data.dtype.descr * data.shape[1]))
    return uniq.view(data.dtype).reshape(-1, data.shape[1])

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]])
print unique_rows(data)
Which results in:
[[0 1 1 1 0 0]
[1 1 1 0 0 0]
[1 1 1 1 1 0]]
np.unique when I run it on np.random.random(100).reshape(10,10) returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples:
array = #your numpy array of lists
new_array = [tuple(row) for row in array]
uniques = np.unique(new_array)
That is the only way I can see to change the types to do what you want, and I am not sure if iterating over the list to convert to tuples is okay with your "not looping through".
np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous. This can be done manually without flattening:
ind = np.lexsort(a.T)
a[ind[np.concatenate(([True],np.any(a[ind[1:]]!=a[ind[:-1]],axis=1)))]]
This method does not use tuples, and should be much faster and simpler than other methods given here.
NOTE: A previous version of this did not have the ind right after a[, which meant that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it:
b = a[np.lexsort(a.T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1],axis=1)))]
This is faster and uses less memory.
Also, if you want to find unique rows in an ndarray regardless of how many dimensions are in the array, the following will work:
b = a[np.lexsort(a.reshape((a.shape[0], -1)).T)]
b[np.concatenate(([True], np.any(b[1:] != b[:-1], axis=tuple(range(1, a.ndim)))))]
An interesting remaining issue would be if you wanted to sort/unique along an arbitrary axis of an arbitrary-dimension array, something that would be more difficult.
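A possible sketch of that generalization (my addition, not part of this answer): move the target axis to the front, flatten the rest, and reuse the same lexsort trick. Since NumPy 1.13, np.unique(a, axis=...) covers this directly.
import numpy as np

def unique_along_axis(a, axis=0):
    # Move the target axis to the front and flatten everything else so that
    # each slice along `axis` becomes one row, then apply the lexsort trick.
    b = np.moveaxis(a, axis, 0)
    flat = b.reshape(b.shape[0], -1)
    order = np.lexsort(flat.T)
    keep = np.concatenate(([True], np.any(flat[order[1:]] != flat[order[:-1]], axis=1)))
    return np.moveaxis(b[order[keep]], 0, axis)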
Edit:
To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn't too much of a difference, though this version is a bit faster:
In [87]: %timeit unique(a.view(dtype)).view('<i8')
10000 loops, best of 3: 48.4 us per loop
In [88]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True], np.any(a[ind[1:]]!= a[ind[:-1]], axis=1)))]
10000 loops, best of 3: 37.6 us per loop
In [89]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10000 loops, best of 3: 41.6 us per loop
With a larger a, however, this version ends up being much, much faster:
In [96]: a = np.random.randint(0,2,size=(10000,6))
In [97]: %timeit unique(a.view(dtype)).view('<i8')
10 loops, best of 3: 24.4 ms per loop
In [98]: %timeit b = [tuple(row) for row in a]; np.unique(b)
10 loops, best of 3: 28.2 ms per loop
In [99]: %timeit ind = np.lexsort(a.T); a[np.concatenate(([True],np.any(a[ind[1:]]!= a[ind[:-1]],axis=1)))]
100 loops, best of 3: 3.25 ms per loop
I've compared the suggested alternatives for speed and found that, surprisingly, the void view unique solution is even a bit faster than numpy's native unique with the axis argument. If you're looking for speed, you'll want
numpy.unique(
    a.view(numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1])))
).view(a.dtype).reshape(-1, a.shape[1])
I've implemented that fastest variant in npx.unique_rows.
There is a bug report on GitHub for this, too.
Code to reproduce the plot:
import numpy
import perfplot

def unique_void_view(a):
    return (
        numpy.unique(a.view(numpy.dtype((numpy.void, a.dtype.itemsize * a.shape[1]))))
        .view(a.dtype)
        .reshape(-1, a.shape[1])
    )

def lexsort(a):
    ind = numpy.lexsort(a.T)
    return a[
        ind[numpy.concatenate(([True], numpy.any(a[ind[1:]] != a[ind[:-1]], axis=1)))]
    ]

def vstack(a):
    return numpy.vstack([tuple(row) for row in a])

def unique_axis(a):
    return numpy.unique(a, axis=0)

perfplot.show(
    setup=lambda n: numpy.random.randint(2, size=(n, 20)),
    kernels=[unique_void_view, lexsort, vstack, unique_axis],
    n_range=[2 ** k for k in range(15)],
    xlabel="len(a)",
    equality_check=None,
)
Here is another variation on @Greg's pythonic answer
np.vstack(set(map(tuple, a)))
I didn’t like any of these answers because none handle floating-point arrays in a linear algebra or vector space sense, where two rows being “equal” means “within some 𝜀”. The one answer that has a tolerance threshold, https://stackoverflow.com/a/26867764/500207, took the threshold to be both element-wise and decimal precision, which works for some cases but isn’t as mathematically general as a true vector distance.
Here’s my version:
from scipy.spatial.distance import squareform, pdist

def uniqueRows(arr, thresh=0.0, metric='euclidean'):
    "Returns subset of rows that are unique, in terms of Euclidean distance"
    distances = squareform(pdist(arr, metric=metric))
    idxset = {tuple(np.nonzero(v)[0]) for v in distances <= thresh}
    return arr[[x[0] for x in idxset]]

# With this, unique columns are super-easy:
def uniqueColumns(arr, *args, **kwargs):
    return uniqueRows(arr.T, *args, **kwargs)
The public-domain function above uses scipy.spatial.distance.pdist to find the Euclidean (customizable) distance between each pair of rows. Then it compares each distance to a threshold to find the rows that are within thresh of each other, and returns just one row from each thresh-cluster.
As hinted, the distance metric needn’t be Euclidean—pdist can compute sundry distances including cityblock (Manhattan-norm) and cosine (the angle between vectors).
If thresh=0 (the default), then rows have to be bit-exact to be considered “unique”. Other good values for thresh use scaled machine-precision, i.e., thresh=np.spacing(1)*1e3.
Why not use drop_duplicates from pandas:
>>> timeit pd.DataFrame(image.reshape(-1,3)).drop_duplicates().values
1 loops, best of 3: 3.08 s per loop
>>> timeit np.vstack({tuple(r) for r in image.reshape(-1,3)})
1 loops, best of 3: 51 s per loop
The numpy_indexed package (disclaimer: I am its author) wraps the solution posted by Jaime in a nice and tested interface, plus many more features:
import numpy_indexed as npi
new_a = npi.unique(a) # unique elements over axis=0 (rows) by default
Based on the answer in this page I have written a function that replicates the capability of MATLAB's unique(input,'rows') function, with the additional feature to accept tolerance for checking the uniqueness. It also returns the indices such that c = data[ia,:] and data = c[ic,:]. Please report if you see any discrepancies or errors.
def unique_rows(data, prec=5):
    import numpy as np
    d_r = np.fix(data * 10 ** prec) / 10 ** prec + 0.0
    b = np.ascontiguousarray(d_r).view(np.dtype((np.void, d_r.dtype.itemsize * d_r.shape[1])))
    _, ia = np.unique(b, return_index=True)
    _, ic = np.unique(b, return_inverse=True)
    return np.unique(b).view(d_r.dtype).reshape(-1, d_r.shape[1]), ia, ic
Beyond @Jaime's excellent answer, another way to collapse a row is to use a.strides[0] (assuming a is C-contiguous), which is equal to a.dtype.itemsize*a.shape[1]. Furthermore, void(n) is a shortcut for dtype((void,n)). We finally arrive at this shortest version:
a[unique(a.view(void(a.strides[0])),1)[1]]
For the example array a, this gives:
[[0 1 1 1 0 0]
[1 1 1 0 0 0]
[1 1 1 1 1 0]]
np.unique works given a list of tuples:
>>> np.unique([(1, 1), (2, 2), (3, 3), (4, 4), (2, 2)])
Out[9]:
array([[1, 1],
[2, 2],
[3, 3],
[4, 4]])
With a list of lists it raises a TypeError: unhashable type: 'list'
For general purpose like 3D or higher multidimensional nested arrays, try this:
import numpy as np

def unique_nested_arrays(ar):
    origin_shape = ar.shape
    origin_dtype = ar.dtype
    ar = ar.reshape(origin_shape[0], np.prod(origin_shape[1:]))
    ar = np.ascontiguousarray(ar)
    unique_ar = np.unique(ar.view([('', origin_dtype)] * np.prod(origin_shape[1:])))
    return unique_ar.view(origin_dtype).reshape((unique_ar.shape[0],) + origin_shape[1:])
which satisfies your 2D dataset:
a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])
unique_nested_arrays(a)
gives:
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
But also 3D arrays like:
b = np.array([[[1, 1, 1], [0, 1, 1]],
              [[0, 1, 1], [1, 1, 1]],
              [[1, 1, 1], [0, 1, 1]],
              [[1, 1, 1], [1, 1, 1]]])
unique_nested_arrays(b)
gives:
array([[[0, 1, 1], [1, 1, 1]],
[[1, 1, 1], [0, 1, 1]],
[[1, 1, 1], [1, 1, 1]]])
None of these answers worked for me. I'm assuming it is because my rows contained strings and not numbers. However, this answer from another thread did work:
Source: https://stackoverflow.com/a/38461043/5402386
You can use the list methods .count() and .index()
coor = np.array([[10, 10], [12, 9], [10, 5], [12, 9]])
coor_tuple = [tuple(x) for x in coor]
unique_coor = sorted(set(coor_tuple), key=lambda x: coor_tuple.index(x))
unique_count = [coor_tuple.count(x) for x in unique_coor]
unique_index = [coor_tuple.index(x) for x in unique_coor]
We can actually turn an m x n numeric numpy array into an m x 1 numpy string array. Please try the following function; it provides counts, inverse indices, etc., just like numpy.unique:
import numpy as np
def uniqueRow(a):
    # This function turns an m x n numpy array into an m x 1 numpy array of
    # strings, so that np.unique can be used.
    # Input: an m x n numpy array (a)
    # Output: unique m' x n numpy array (unique), inverse_indx, and counts
    s = np.chararray((a.shape[0], 1))
    s[:] = '-'
    b = a.astype(np.str)
    s2 = np.expand_dims(b[:, 0], axis=1) + s + np.expand_dims(b[:, 1], axis=1)
    n = a.shape[1] - 2
    for i in range(0, n):
        s2 = s2 + s + np.expand_dims(b[:, i + 2], axis=1)
    s3, idx, inv_, c = np.unique(s2, return_index=True, return_inverse=True, return_counts=True)
    return a[idx], inv_, c
Example:
A = np.array([[3.17, 9.502, 3.291],
              [9.984, 2.773, 6.852],
              [1.172, 8.885, 4.258],
              [9.73, 7.518, 3.227],
              [8.113, 9.563, 9.117],
              [9.984, 2.773, 6.852],
              [9.73, 7.518, 3.227]])
B, inv_, c = uniqueRow(A)
Results:
B:
[[ 1.172 8.885 4.258]
[ 3.17 9.502 3.291]
[ 8.113 9.563 9.117]
[ 9.73 7.518 3.227]
[ 9.984 2.773 6.852]]
inv_:
[3 4 1 0 2 4 0]
c:
[2 1 1 1 2]
Let's get the entire numpy matrix as a list, then drop duplicates from this list, and finally return our unique list back into a numpy matrix:
matrix_as_list=data.tolist()
matrix_as_list:
[[1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 0, 0], [0, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 0]]
uniq_list=list()
uniq_list.append(matrix_as_list[0])
[uniq_list.append(item) for item in matrix_as_list if item not in uniq_list]
unique_matrix=np.array(uniq_list)
unique_matrix:
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0]])
The most straightforward solution is to make each row a single item by converting it to a string. Each row can then be compared as a whole for uniqueness using numpy. This solution is generalizable; you just need to reshape and transpose your array for other combinations. Here is the solution for the problem provided.
import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])
uniques, index = np.unique([str(i) for i in original], return_index=True)
cleaned = original[index]
print(cleaned)
Which gives:
array([[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
Send my nobel prize in the mail
import numpy as np
original = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [0, 1, 1, 1, 0, 0],
                     [1, 1, 1, 0, 0, 0],
                     [1, 1, 1, 1, 1, 0]])

# Create a view that treats each row as a single tuple, and return the unique indices.
_, unique_index = np.unique(original.view(original.dtype.descr * original.shape[1]),
                            return_index=True)
# Get the unique set.
print(original[unique_index])

repeat array in arbitrary length

Is it possible to create an array that looks like
0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1
having the following array in the beginning
4, 3, 5, 2
without using loops in Python/Numpy?
EDIT:
This is just an example and the information (4,3,5,2) may have any length or numbers.
>>> lengths = np.array([4, 3, 5, 2])
>>> np.concatenate(map(np.arange, lengths))
array([0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1])
Of course, this is cheating, because map is a loop in disguise. There's no NumPy idiom to do this any more directly, AFAIK.
The above creates len(lengths) temporaries. An alternative that does not construct these temporaries is to use fromiter and an adapted version of @jonrsharpe's answer:
>>> from itertools import chain, imap
>>> np.fromiter(chain.from_iterable(imap(xrange, lengths)), dtype=int,
... count=np.sum(lengths))
array([0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1])
Somewhat surprisingly, the fromiter idiom is faster, and it gets faster if you don't compute the count first:
>>> lengths = np.arange(30)
>>> %timeit np.concatenate(map(np.arange, lengths))
10000 loops, best of 3: 64.8 µs per loop
>>> %timeit np.fromiter(chain.from_iterable(imap(xrange, lengths)), dtype=int, count=np.sum(lengths))
10000 loops, best of 3: 28.3 µs per loop
>>> %timeit np.fromiter(chain.from_iterable(imap(xrange, lengths)), dtype=int)
10000 loops, best of 3: 25.8 µs per loop
(Timings of NumPy 1.8.1 and Python 2.7.6 on an x86-64 running Linux.)
You can do it without writing for or while, but I assure you there's a loop under there somewhere!
>>> from itertools import chain, imap
>>> list(chain.from_iterable(imap(xrange, (4, 3, 5, 2))))
[0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1]
Powered by itertools.
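For completeness, here is a loop-free NumPy alternative (a sketch of mine, not from the answers above): build a single global arange and subtract each block's starting offset.
import numpy as np

lengths = np.array([4, 3, 5, 2])
# Offset of the block each output element belongs to, repeated per element.
starts = np.repeat(np.cumsum(lengths) - lengths, lengths)
result = np.arange(lengths.sum()) - starts
# result -> array([0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1])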

Getting the indexes to the duplicate columns of a numpy array [duplicate]

This question already has answers here: Find unique columns and column membership
I have a numpy array with duplicate columns:
import numpy as np
A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])
I need to find the indexes to those duplicates or something like that:
[0, 4]
[1, 2, 5]
I have a hard time dealing with indexes in Python. I really don't know how to approach it.
Thanks
I tried identifying the unique columns first with this function:
def unique_columns(data):
    ind = np.lexsort(data)
    return data.T[ind[np.concatenate(([True], np.any(data.T[ind[1:]] != data.T[ind[:-1]], axis=1)))]].T
But I can't figure out the indexes from there.
There is not a simple way to do this, unfortunately, but we can adapt an np.unique-based answer. This method requires that the axis you want to deduplicate be contiguous in memory, and numpy's typical memory layout is C-contiguous, i.e. contiguous in rows. Fortunately numpy makes this conversion simple:
A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])

def unique_columns2(data):
    dt = np.dtype((np.void, data.dtype.itemsize * data.shape[0]))
    dataf = np.asfortranarray(data).view(dt)
    u, uind = np.unique(dataf, return_inverse=True)
    u = u.view(data.dtype).reshape(-1, data.shape[0]).T
    return (u, uind)
Our result:
u,uind = unique_columns2(A)
u
array([[0, 1, 1],
[0, 1, 2],
[0, 1, 3]])
uind
array([1, 2, 2, 0, 1, 2])
I am not really sure what you want to do from here, for example you can do something like this:
>>> [np.where(uind==x)[0] for x in range(u.shape[0])]
[array([3]), array([0, 4]), array([1, 2, 5])]
Some timings:
tmp = np.random.randint(0,4,(30000,500))
# @BiRico's and OP's answer
%timeit unique_columns(tmp)
1 loops, best of 3: 2.91 s per loop
%timeit unique_columns2(tmp)
1 loops, best of 3: 208 ms per loop
Here is an outline of how to approach it. Use numpy.lexsort to sort the columns, that way all the duplicates will be grouped together. Once the duplicates are all together, you can easily tell which columns are duplicates and the indices that correspond with those columns.
Here's an implementation of the method described above.
import numpy as np

def duplicate_columns(data, minoccur=2):
    ind = np.lexsort(data)
    diff = np.any(data.T[ind[1:]] != data.T[ind[:-1]], axis=1)
    edges = np.where(diff)[0] + 1
    result = np.split(ind, edges)
    result = [group for group in result if len(group) >= minoccur]
    return result

A = np.array([[1, 1, 1, 0, 1, 1],
              [1, 2, 2, 0, 1, 2],
              [1, 3, 3, 0, 1, 3]])
print(duplicate_columns(A))
# [array([0, 4]), array([1, 2, 5])]
