Numpy split array by grouping array - python

There are the following 2 arrays with equal length. My goal is to split the array B into groups defined by the array A. So finally there should be 3 arrays or an list of array. The final list of arrays should consists of the following rows of array B:
First and second
Third and fifth
Fourth
The order is not really relevant.
A = array([[-1],
[ 1],
[ 0],
[ 0],
[ 1]])
B = array([[ 624.5 , 548. ],
[ 912.8201, 564.3444],
[1564.5 , 764. ],
[1463.4163, 785.9251],
[1698.0757, 846.6306]])
The problem occured to me by using the dbscan clustering function. The A array describes the clusters (0, 1) of the points in array B. The values -1 declares the point as outlier. (The values used are not precise).
My goal is to calculate the compactness, ... of each found cluster

The numpy_indexed package (disclaimer: i am its author) was designed with these type of use cases in mind.
import numpy_indexed as npi
C = npi.group_by(A).split(B)
Not sure what you mean by compactness of each group; but rather than splitting and doing subsequent computations, it is typically more efficient to compute reductions over groups directly; whereby you can reuse the grouping object for increased efficiency:
groups = npi.group_by(A)
mean = groups.mean(B)
std = groups.std(B)

Keep is simple:
[data[labels == l] for l in np.unique(labels)]
Similarly, you can build a dict in a one-liner.

this is a bit lengthy but it should work.
final_dict = {}
for counter in range(0,len(A)):
if(A[counter] not in final_dict):
final_dict[A[counter]] = B[counter]
else:
final_dict[A[counter]] = final_dict[A[counter]] + B[counter]
final_array = []
for key,value in final_dict.items():
final_array.append(value)
Basically since you have odd values like -1 to work with you can set it as keys of a dictionary and then you iterate over the dictionary to get the groups of values which you can then append to a final output array

Related

Determining index each group duplicate values in an array in Python with the fastest way

I want to find an index of each group duplicate value like this:
s = [2,6,2,88,6,...]
The results must return the index from original s: [[0,2],[1,4],..] or the result can show another way.
I find many solutions so I find the fastest way to get duplicate group:
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
But after sort I got wrong index from original s.
In my case, I have ~ 200mil value on the list and I want to find the fastest way to do that. I use an array to store value because I want to use GPU to make it faster.
Using hash structures like dict helps.
For example:
import numpy as np
from collections import defaultdict
a=np.array([2,4,2,88,15,4])
table=defaultdict(list)
for ind,num in enumerate(a):
table[num]+=[ind]
Outputs:
{2: [0, 2], 4: [1, 5], 88: [3], 15: [4]}
If you want to show duplicated elements in the order from small to large:
for k,v in sorted(table.items()):
if len(v)>1:
print(k,":",v)
Outputs:
2 : [0, 2]
4 : [1, 5]
The speed is determined by how many different values in the number list.
See if this meets your performance requirements (here, s is your input array):
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = np.split(sorted_inds, cum_counts[:-1])
Notes:
The result would be a list of arrays.
Each of these arrays would contain indices of a repeated value in s. Eg, if the value 13 is repeated 7 times in s, there would be an array with 7 indices among the arrays of result
If you want to ignore singleton values of s (values that occur only once in s), you can pass minlength=2 to np.bincount()
(This is a variation of my other answer. Here, instead of splitting the large array sorted_inds, we take slices from it, so it's likely to have a different kind of performance characteristic)
If s is the input array:
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = [sorted_inds[:cum_counts[0]]] + [sorted_inds[cum_counts[i]:cum_counts[i+1]] for i in range(cum_counts.size-1)]

Generation of numpy arrays for permutations with constraints

To make a long story short, I'm trying to generate all the possible permutations of a set of numpy arrays. I have three numbers [j,k,m] and I would like to specify a maximum value for each one [J,K,M]. How would I then get all the combinations of arrays under these values? How could I force the k values to always be even as well? For instance:
So with the max values set to [1,2,2], the permutations would be: [0,0,0], [0,0,1], [0,0,2], [0,2,0], [0,2,1], [0,2,2], [1,0,0], [1,0,1] ...
I realise I don't have any example to code to show but I'm afraid I have literally no idea where to start with this.
From other answers it seems like sympy would be of some use?
I found answer that might be interested for you here and generalised it. So you can construct list of possible values for each item like so:
X = [[0, 1], [0, 1, 2], [0, 1, 2]]
And then use:
np.array(np.meshgrid(*X)).T.reshape(-1, len(X))
Output contains 18 items that you wanted. Actually, if you have only maximum values [J, K, L], you can construct X using X = [range(J+1), range(K+1), range(L+1)]

Check how many numpy array within a numpy array are equal to other numpy arrays within another numpy array of different size

My problem
Suppose I have
a = np.array([ np.array([1,2]), np.array([3,4]), np.array([5,6]), np.array([7,8]), np.array([9,10])])
b = np.array([ np.array([5,6]), np.array([1,2]), np.array([3,192])])
They are two arrays, of different sizes, containing other arrays (the inner arrays have same sizes!)
I want to count how many items of b (i.e. inner arrays) are also in a. Notice that I am not considering their position!
How can I do that?
My Try
count = 0
for bitem in b:
for aitem in a:
if aitem==bitem:
count+=1
Is there a better way? Especially in one line, maybe with some comprehension..
The numpy_indexed package contains efficient (nlogn, generally) and vectorized solutions to these types of problems:
import numpy_indexed as npi
count = len(npi.intersection(a, b))
Note that this is subtly different than your double loop, discarding duplicate entries in a and b for instance. If you want to retain duplicates in b, this would work:
count = npi.in_(b, a).sum()
Duplicate entries in a could also be handled by doing npi.count(a) and factoring in the result of that; but anyway, im just rambling on for illustration purposes since I imagine the distinction probably does not matter to you.
Here is a simple way to do it:
a = np.array([ np.array([1,2]), np.array([3,4]), np.array([5,6]), np.array([7,8]), np.array([9,10])])
b = np.array([ np.array([5,6]), np.array([1,2]), np.array([3,192])])
count = np.count_nonzero(
np.any(np.all(a[:, np.newaxis, :] == b[np.newaxis, :, :], axis=-1), axis=0))
print(count)
>>> 2
You can do what you want in one liner as follows:
count = sum([np.array_equal(x,y) for x,y in product(a,b)])
Explanation
Here's an explanation of what's happening:
Iterate through the two arrays using itertools.product which will create an iterator over the cartesian product of the two arrays.
Compare each two arrays in a tuple (x,y) coming from step 1. using np.array_equal
True is equal to 1 when using sum on a list
Full example:
The final code looks like this:
import numpy as np
from itertools import product
a = np.array([ np.array([1,2]), np.array([3,4]), np.array([5,6]), np.array([7,8]), np.array([9,10])])
b = np.array([ np.array([5,6]), np.array([1,2]), np.array([3,192])])
count = sum([np.array_equal(x,y) for x,y in product(a,b)])
# output: 2
You can convert the rows to dtype = np.void and then use np.in1d as on the resulting 1d arrays
def void_arr(a):
return np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
b[np.in1d(void_arr(b), void_arr(a))]
array([[5, 6],
[1, 2]])
If you just want the number of intersections, it's
np.in1d(void_arr(b), void_arr(a)).sum()
2
Note: if there are repeat items in b or a, then np.in1d(void_arr(b), void_arr(a)).sum() likely won't be equal to np.in1d(void_arr(a), void_arr(b)).sum(). I've reversed the order from my original answer to match your question (i.e. how many elements of b are in a?)
For more information, see the third answer here

NumPy: Select and sum data into array

I have a (large) data array and a (large) list of lists of (a few) indices, e.g.,
data = [1.0, 10.0, 100.0]
contribs = [[1, 2], [0], [0, 1]]
For each entry in contribs, I'd like to sum up the corresponding values of data and put those into an array. For the above example, the expected result would be
out = [110.0, 1.0, 11.0]
Doing this in a loop works,
c = numpy.zeros(len(contribs))
for k, indices in enumerate(contribs):
for idx in indices:
c[k] += data[idx]
but since data and contribs are large, it's taking too long.
I have the feeling this can be improved using numpy's fancy indexing.
Any hints?
One possibility would be
data = np.array(data)
out = [np.sum(data[c]) for c in contribs]
Should be faster than the double loop, at least.
Here's an almost vectorized * approach -
# Get lengths of list element in contribs and the cumulative lengths
# to be used for creating an ID array later on.
clens = np.cumsum([len(item) for item in contribs])
# Setup ID array that corresponds to same ID for same list element in contribs.
# These IDs would be used to accumulate values from a corresponnding array
# that is created by indexing into data array with a flattened contribs
id_arr = np.zeros(clens[-1],dtype=int)
id_arr[clens[:-1]] = 1
out = np.bincount(id_arr.cumsum(),np.take(data,np.concatenate(contribs)))
This approach involves some setting up work. So, the benefits would be hopefully seen when fed with decent sized input arrays and a decent number of list elements in contribs, which would correspond to the looping in an otherwise loopy solution.
*Please note that this coined as almost vectorized because the only looping performed here is at the start, where we are getting the lengths of the list elements. But that part not being so computationally demanding should have minimal effect on the total runtime.
I'm not certain that all cases work, but for your example, with data as a numpy.array:
# Flatten "contribs"
f = [j for i in contribs for j in i]
# Get the "ranges" of data[f] that will be summed in the next step
i = [0] + numpy.cumsum([len(i) for i in contribs]).tolist()[:-1]
# Take the required sums
numpy.add.reduceat(data[f], i)

Fast algorithm to find indices where multiple arrays have the same value

I'm looking for ways to speed up (or replace) my algorithm for grouping data.
I have a list of numpy arrays. I want to generate a new numpy array, such that each element of this array is the same for each index where the original arrays are the same as well. (And different where this is not the case.)
This sounds kind of awkward, so have an example:
# Test values:
values = [
np.array([10, 11, 10, 11, 10, 11, 10]),
np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
# * *
Note that elements I marked (indices 0 and 4) of the expected outcome have the same value (0) because the original two arrays were also the same (namely 10 and 21). Similar for elements with indices 3 and 5 (3).
The algorithm has to deal with an arbitrary number of (equally-size) input arrays, and also return, for each resulting number, what values of the original arrays they correspond to. (So for this example, "3" refers to (11, 22).)
Here is my current algorithm:
import numpy as np
def groupify(values):
group = np.zeros((len(values[0]),), dtype=np.int64) - 1 # Magic number: -1 means ungrouped.
group_meanings = {}
next_hash = 0
matching = np.ones((len(values[0]),), dtype=bool)
while any(group == -1):
this_combo = {}
matching[:] = (group == -1)
first_ungrouped_idx = np.where(matching)[0][0]
for curr_id, value_array in enumerate(values):
needed_value = value_array[first_ungrouped_idx]
matching[matching] = value_array[matching] == needed_value
this_combo[curr_id] = needed_value
# Assign all of the found elements to a new group
group[matching] = next_hash
group_meanings[next_hash] = this_combo
next_hash += 1
return group, group_meanings
Note that the expression value_array[matching] == needed_value is evaluated many times for each individual index, which is where the slowness comes from.
I'm not sure if my algorithm can be sped up much more, but I'm also not sure if it's the optimal algorithm to begin with. Is there a better way of doing this?
Cracked it finally for a vectorized solution! It was an interesting problem. The problem was we had to tag each pair of values taken from the corresponding array elements of the list. Then, we are supposed to tag each such pair based on their uniqueness among othet pairs. So, we can use np.unique abusing all its optional arguments and finally do some additional work to keep the order for the final output. Here's the implementation basically done in three stages -
# Stack as a 2D array with each pair from values as a column each.
# Convert to linear index equivalent considering each column as indexing tuple
arr = np.vstack(values)
idx = np.ravel_multi_index(arr,arr.max(1)+1)
# Do the heavy work with np.unique to give us :
# 1. Starting indices of unique elems,
# 2. Srray that has unique IDs for each element in idx, and
# 3. Group ID counts
_,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
return_inverse=True,return_counts=True)
# Best part happens here : Use mask to ignore the repeated elems and re-tag
# each unqID using argsort() of masked elements from idx
mask = ~np.in1d(unqID,np.where(count>1)[0])
mask[unq_start_idx] = 1
out = idx[mask].argsort()[unqID]
Runtime test
Let's compare the proposed vectorized approach against the original code. Since the proposed code gets us the group IDs only, so for a fair benchmarking, let's just trim off parts from the original code that are not used to give us that. So, here are the function definitions -
def groupify(values): # Original code
group = np.zeros((len(values[0]),), dtype=np.int64) - 1
next_hash = 0
matching = np.ones((len(values[0]),), dtype=bool)
while any(group == -1):
matching[:] = (group == -1)
first_ungrouped_idx = np.where(matching)[0][0]
for curr_id, value_array in enumerate(values):
needed_value = value_array[first_ungrouped_idx]
matching[matching] = value_array[matching] == needed_value
# Assign all of the found elements to a new group
group[matching] = next_hash
next_hash += 1
return group
def groupify_vectorized(values): # Proposed code
arr = np.vstack(values)
idx = np.ravel_multi_index(arr,arr.max(1)+1)
_,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
return_inverse=True,return_counts=True)
mask = ~np.in1d(unqID,np.where(count>1)[0])
mask[unq_start_idx] = 1
return idx[mask].argsort()[unqID]
Runtime results on a list with large arrays -
In [345]: # Input list with random elements
...: values = [item for item in np.random.randint(10,40,(10,10000))]
In [346]: np.allclose(groupify(values),groupify_vectorized(values))
Out[346]: True
In [347]: %timeit groupify(values)
1 loops, best of 3: 4.02 s per loop
In [348]: %timeit groupify_vectorized(values)
100 loops, best of 3: 3.74 ms per loop
This should work, and should be considerably faster, since we're using broadcasting and numpy's inherently fast boolean comparisons:
import numpy as np
# Test values:
values = [
np.array([10, 11, 10, 11, 10, 11, 10]),
np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
# for every value in values, check where duplicate values occur
same_mask = [val[:,np.newaxis] == val[np.newaxis,:] for val in values]
# get the conjunction of all those tests
conjunction = np.logical_and.reduce(same_mask)
# ignore the diagonal
conjunction[np.diag_indices_from(conjunction)] = False
# initialize the labelled array with nans (used as flag)
labelled = np.empty(values[0].shape)
labelled.fill(np.nan)
# keep track of labelled value
val = 0
for k, row in enumerate(conjunction):
if np.isnan(labelled[k]): # this element has not been labelled yet
labelled[k] = val # so label it
labelled[row] = val # and label every element satisfying the test
val += 1
print(labelled)
# outputs [ 0. 1. 2. 3. 0. 3. 4.]
It is about a factor of 1.5x faster than your version when dealing with the two arrays, but I suspect the speedup should be better for more arrays.
The numpy_indexed package (disclaimer: I am its author) contains generalized variants of the numpy arrayset operations, which can be used to solve your problem in an elegant and efficient (vectorized) manner:
import numpy_indexed as npi
unique_values, labels = npi.unique(tuple(values), return_inverse=True)
The above will work for arbitrary type combinations, but alternatively, the below will be even more efficient if values is a list of many arrays of the same dtype:
unique_values, labels = npi.unique(np.asarray(values), axis=1, return_inverse=True)
If I understand correctly, you are trying to hash values according to columns. Its better to convert the columns into arbitrary values by themselves, and then find the hashes from them.
So you actually want to hash on list(np.array(values).T).
This functionality is already built into Pandas. You dont need to write it. The only problem is that it takes a list of values without further lists within it. In this case, you can just convert the inner list to string map(str, list(np.array(values).T)) and factorize that!
>>> import pandas as pd
>>> pd.factorize(map(str, list(np.array(values).T)))
(array([0, 1, 2, 3, 0, 3, 4]),
array(['[10 21]', '[11 21]', '[10 22]', '[11 22]', '[10 23]'], dtype=object))
I have converted your list of arrays into an array, and then into a string ...

Categories

Resources