I have two numpy arrays, a and b, each with twenty million elements (floats). If the pair of elements at one index of those two arrays is the same as the pair at another index, we call it a duplicate, and it should be removed from both arrays. For instance,
a = numpy.array([1,3,6,3,7,8,3,2,9,10,14,6])
b = numpy.array([2,4,15,4,7,9,2,2,0,11,4,15])
From those two arrays, a[2]&b[2] is the same pair as a[11]&b[11], so we call it a duplicate and it should be removed. The same goes for a[1]&b[1] vs a[3]&b[3]. Although each array has duplicate elements on its own, those are not treated as duplicates. So I want the returned arrays to be:
a = numpy.array([1,3,6,7,8,3,2,9,10,14])
b = numpy.array([2,4,15,7,9,2,2,0,11,4])
Does anyone have a clever way to implement such a reduction?
First you have to pack a and b to identify duplicates.
If the values are positive integers (see the edit below for other cases), this can be achieved by:
base=a.max()+1
c=a+base*b
Then just find unique values in c:
val,ind=np.unique(c,return_index=True)
and retrieve the associated values in a and b.
ind.sort()
print(a[ind])
print(b[ind])
Note the disappearance of the duplicates (two here):
[ 1 3 6 7 8 3 2 9 10 14]
[ 2 4 15 7 9 2 2 0 11 4]
EDIT
Regardless of the datatype, the c array can be made as follows, packing the data to bytes:
ab = np.ascontiguousarray(np.vstack((a, b)).T)
dtype = 'S' + str(2 * a.itemsize)
c = ab.view(dtype=dtype)
This is done in one pass and without requiring any extra memory for the resulting arrays.
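Putting the pieces together, here is a minimal end-to-end sketch of the byte-packing variant, using the example arrays from the question:
import numpy as np

a = np.array([1, 3, 6, 3, 7, 8, 3, 2, 9, 10, 14, 6])
b = np.array([2, 4, 15, 4, 7, 9, 2, 2, 0, 11, 4, 15])

# View each (a[i], b[i]) pair as one opaque byte string so np.unique compares pairs.
ab = np.ascontiguousarray(np.vstack((a, b)).T)
c = ab.view(dtype='S' + str(2 * a.itemsize))

val, ind = np.unique(c, return_index=True)
ind.sort()
print(a[ind])  # [ 1  3  6  7  8  3  2  9 10 14]
print(b[ind])  # [ 2  4 15  7  9  2  2  0 11  4]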
Pair up the elements at each index and iterate over them. Keep track of which pairs have been seen so far and a counter for the index into the arrays. When a pair has not been seen before, the index increases by 1, effectively writing the pair back to its original place. For a duplicate pair, however, you don't increase the index, effectively shifting every new pair one position to the left. At the end, keep only the first index elements to shorten the arrays.
import itertools as it
def delete_duplicate_pairs(*arrays):
    unique = set()
    arrays = list(arrays)
    n = range(len(arrays))
    index = 0
    for pair in it.izip(*arrays):
        if pair not in unique:
            unique.add(pair)
            for i in n:
                arrays[i][index] = pair[i]
            index += 1
    return [a[:index] for a in arrays]
If you are on Python 2, zip() creates the list of pairs up front. If you have a lot of elements in your arrays, it'll be more efficient to use itertools.izip() which will create the pairs as you request them. However, zip() in Python 3 behaves like that by default.
For your case,
>>> import numpy as np
>>> a = np.array([1,3,6,3,7,8,3,2,9,10,14,6])
>>> b = np.array([2,4,15,4,7,9,2,2,0,11,4,15])
>>> a, b = delete_duplicate_pairs(a, b)
>>> a
array([ 1, 3, 6, 7, 8, 3, 2, 9, 10, 14])
>>> b
array([ 2, 4, 15, 7, 9, 2, 2, 0, 11, 4])
Now, it all comes down to what values your arrays hold. If you have only values 0-9, there are only 100 unique pairs and most elements will be duplicates, which saves you time. For 20 million elements in both a and b with values only between 0-9, the process completes in 6 seconds. For values between 0-999, it takes 12 seconds.
Related
I have a question: starting with a 1-indexed array of zeros and a list of operations, for each operation add a value to each array element between two given indices, inclusive. Once all operations have been performed, return the maximum value in the array.
Example: n = 10, Queries = [[1,5,3],[4,8,7],[6,9,1]]
The following is the resulting state after each operation is applied; indices 1-5 have 3 added to them, and so on:
[0,0,0, 0, 0,0,0,0,0, 0]
[3,3,3, 3, 3,0,0,0,0, 0]
[3,3,3,10,10,7,7,7,0, 0]
[3,3,3,10,10,8,8,8,1, 0]
Finally, you output the max value in the final list [3,3,3,10,10,8,8,8,1, 0], which is 10.
My current solution:
def Operations(size, Array):
    ResultArray = [0]*size
    Values = [[i.pop(2)] for i in Array]
    for index, i in enumerate(Array):
        # New slice values = sum of the current values in the results array
        # and the operation's value repeated to the same length
        ResultArray[i[0]-1:i[1]] = list(map(sum, zip(ResultArray[i[0]-1:i[1]], Values[index]*len(ResultArray[i[0]-1:i[1]]))))
    Result = max(ResultArray)
    return Result
def main():
    nm = input().split()
    n = int(nm[0])
    m = int(nm[1])
    queries = []
    for _ in range(m):
        queries.append(list(map(int, input().rstrip().split())))
    result = Operations(n, queries)

if __name__ == "__main__":
    main()
Example input: The first line contains two space-separated integers n and m, the size of the array and the number of operations.
Each of the next m lines contains three space-separated integers a,b and k, the left index, right index and summand.
5 3
1 2 100
2 5 100
3 4 100
Error at large sizes:
Runtime Error
Currently this solution works for smaller final lists of length 4000; however, in test cases where the length is 10,000,000 it is failing. I do not know why this is the case, and I cannot provide the example input since it is so massive. Is there anything obvious as to why it would fail in larger cases?
I think the problem is that you make too many intermediate throwaway lists here:
ResultArray[i[0]-1:i[1]] = list(map(sum, zip(ResultArray[i[0]-1:i[1]], Values[index]*len(ResultArray[i[0]-1:i[1]]))))
ResultArray[i[0]-1:i[1]] results in a list, and you build it twice, one copy just to get its size, which is a complete waste of resources. Then you make another list with Values[index]*len(...), and finally compile all that into yet another list that is also thrown away once it is assigned into the original. That is 4 throwaway lists; so if, say, the slice size is 5,000,000, you are making 4 of those, or 20,000,000 elements of extra space, 15,000,000 of which you don't really need. And if your original list has 10,000,000 elements, well, just do the math...
You can get the same result as your list(map(...)) with a list comprehension:
[v + Values[index][0] for v in ResultArray[i[0]-1:i[1]]]
Now we use two fewer lists, and we can drop one more by making it a generator expression, given that slice assignment does not require a list specifically, just something iterable:
(v + Values[index][0] for v in ResultArray[i[0]-1:i[1]])
I don't know whether the slice assignment internally makes a list first or not, but hopefully it doesn't, and with that we are down to just one extra list.
Here is an example:
>>> a=[0]*10
>>> a
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> a[1:5] = (3+v for v in a[1:5])
>>> a
[0, 3, 3, 3, 3, 0, 0, 0, 0, 0]
>>>
We can get down to zero extra lists (assuming that internally it doesn't make one) by using itertools.islice:
>>> import itertools
>>> a[3:7] = (1+v for v in itertools.islice(a,3,7))
>>> a
[0, 3, 3, 4, 4, 1, 1, 0, 0, 0]
>>>
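Putting that suggestion back into the original function, a minimal sketch could look like this (assuming each query still has the form [a, b, k]; the generator reads the old values through itertools.islice, so no intermediate list is built on the right-hand side):
import itertools

def Operations(size, Array):
    ResultArray = [0] * size
    for a, b, k in Array:
        # add k to every element in the inclusive, 1-indexed range [a, b]
        ResultArray[a-1:b] = (v + k for v in itertools.islice(ResultArray, a-1, b))
    return max(ResultArray)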
I have two numpy arrays; one is bigger than the other. I want my function to return the indices of the common elements with respect to the bigger one, and this returned list of indices should have unique values. For example:
search = np.array([1,3,4,5,8,10,7,3,4,5,8,7])
data = np.array([7,10,1,12,7,1,5,18,4,3,10,5,8,4])
my function output should look like :
result = [2,9,none,6,12,10,0,9,13,11,none,4]
So, the first element in result is 2, which means that the first element in search can be found at the third element of data.
However, we have two elements with value 4 in search and only one element with value 4 in data, so the first 4 (the 9th element of search) will be mapped to 13, and the other 4 should be assigned to a different index if one is available; if there isn't one, zero or None should be assigned at its position.
I found this code in a previous question. It does the job, but the resulting array of indices has duplicates:
x = np.array([3, 5, 7, 1, 9, 8, 6])
y = np.array([2, 1, 5, 10, 100, 6,6])
index = np.argsort(x)
sorted_x = x[index]
sorted_index = np.searchsorted(sorted_x, y)
yindex = np.take(index, sorted_index, mode="clip")
mask = x[yindex] != y
result = np.ma.array(yindex, mask=mask)
print (result)
# result = [-- 3 1 -- -- 6 6]
For my case there shouldn't be duplicates; instead of the second 6 in the result array, there should be zero or None.
I'm looking for ways to speed up (or replace) my algorithm for grouping data.
I have a list of numpy arrays. I want to generate a new numpy array, such that each element of this array is the same for each index where the original arrays are the same as well. (And different where this is not the case.)
This sounds kind of awkward, so have an example:
# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
#                             *           *
Note that elements I marked (indices 0 and 4) of the expected outcome have the same value (0) because the original two arrays were also the same (namely 10 and 21). Similar for elements with indices 3 and 5 (3).
The algorithm has to deal with an arbitrary number of (equally-sized) input arrays, and also return, for each resulting number, which values of the original arrays it corresponds to. (So for this example, "3" refers to (11, 22).)
Here is my current algorithm:
import numpy as np
def groupify(values):
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1  # Magic number: -1 means ungrouped.
    group_meanings = {}
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        this_combo = {}
        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]
        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
            this_combo[curr_id] = needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        group_meanings[next_hash] = this_combo
        next_hash += 1
    return group, group_meanings
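For reference, a quick usage sketch on the test values above; per the expected outcome, the group array is [0 1 2 3 0 3 4]:
group, group_meanings = groupify(values)
print(group)  # [0 1 2 3 0 3 4]
# group_meanings maps each group ID to the source value per input array,
# e.g. group 3 corresponds to {0: 11, 1: 22}, i.e. the pair (11, 22).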
Note that the expression value_array[matching] == needed_value is evaluated many times for each individual index, which is where the slowness comes from.
I'm not sure if my algorithm can be sped up much more, but I'm also not sure if it's the optimal algorithm to begin with. Is there a better way of doing this?
Cracked it finally for a vectorized solution! It was an interesting problem. We have to tag each pair of values taken from the corresponding array elements of the list, and then tag each such pair based on its uniqueness among the other pairs. So, we can use np.unique, abusing all its optional arguments, and finally do some additional work to keep the order for the final output. Here's the implementation, basically done in three stages -
# Stack as a 2D array with each pair from values as a column each.
# Convert to linear index equivalent considering each column as indexing tuple
arr = np.vstack(values)
idx = np.ravel_multi_index(arr,arr.max(1)+1)
# Do the heavy work with np.unique to give us :
# 1. Starting indices of unique elems,
# 2. Array that has unique IDs for each element in idx, and
# 3. Group ID counts
_,unq_start_idx,unqID,count = np.unique(idx,return_index=True, \
return_inverse=True,return_counts=True)
# Best part happens here : Use mask to ignore the repeated elems and re-tag
# each unqID using argsort() of masked elements from idx
mask = ~np.in1d(unqID,np.where(count>1)[0])
mask[unq_start_idx] = 1
out = idx[mask].argsort()[unqID]
Runtime test
Let's compare the proposed vectorized approach against the original code. Since the proposed code gets us the group IDs only, for fair benchmarking let's just trim off the parts of the original code that are not needed for that. So, here are the function definitions -
def groupify(values):  # Original code
    group = np.zeros((len(values[0]),), dtype=np.int64) - 1
    next_hash = 0
    matching = np.ones((len(values[0]),), dtype=bool)
    while any(group == -1):
        matching[:] = (group == -1)
        first_ungrouped_idx = np.where(matching)[0][0]
        for curr_id, value_array in enumerate(values):
            needed_value = value_array[first_ungrouped_idx]
            matching[matching] = value_array[matching] == needed_value
        # Assign all of the found elements to a new group
        group[matching] = next_hash
        next_hash += 1
    return group
def groupify_vectorized(values):  # Proposed code
    arr = np.vstack(values)
    idx = np.ravel_multi_index(arr, arr.max(1)+1)
    _, unq_start_idx, unqID, count = np.unique(idx, return_index=True,
                                               return_inverse=True, return_counts=True)
    mask = ~np.in1d(unqID, np.where(count > 1)[0])
    mask[unq_start_idx] = 1
    return idx[mask].argsort()[unqID]
Runtime results on a list with large arrays -
In [345]: # Input list with random elements
...: values = [item for item in np.random.randint(10,40,(10,10000))]
In [346]: np.allclose(groupify(values),groupify_vectorized(values))
Out[346]: True
In [347]: %timeit groupify(values)
1 loops, best of 3: 4.02 s per loop
In [348]: %timeit groupify_vectorized(values)
100 loops, best of 3: 3.74 ms per loop
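And a quick correctness check on the small example from the question, using the proposed groupify_vectorized from above:
values = [np.array([10, 11, 10, 11, 10, 11, 10]),
          np.array([21, 21, 22, 22, 21, 22, 23])]
print(groupify_vectorized(values))  # [0 1 2 3 0 3 4]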
This should work, and should be considerably faster, since we're using broadcasting and numpy's inherently fast boolean comparisons:
import numpy as np
# Test values:
values = [
    np.array([10, 11, 10, 11, 10, 11, 10]),
    np.array([21, 21, 22, 22, 21, 22, 23]),
]
# Expected outcome: np.array([0, 1, 2, 3, 0, 3, 4])
# for every value in values, check where duplicate values occur
same_mask = [val[:,np.newaxis] == val[np.newaxis,:] for val in values]
# get the conjunction of all those tests
conjunction = np.logical_and.reduce(same_mask)
# ignore the diagonal
conjunction[np.diag_indices_from(conjunction)] = False
# initialize the labelled array with nans (used as flag)
labelled = np.empty(values[0].shape)
labelled.fill(np.nan)
# keep track of labelled value
val = 0
for k, row in enumerate(conjunction):
    if np.isnan(labelled[k]):  # this element has not been labelled yet
        labelled[k] = val      # so label it
        labelled[row] = val    # and label every element satisfying the test
        val += 1
print(labelled)
# outputs [ 0. 1. 2. 3. 0. 3. 4.]
It is about a factor of 1.5x faster than your version when dealing with the two arrays, but I suspect the speedup should be better for more arrays.
The numpy_indexed package (disclaimer: I am its author) contains generalized variants of the numpy array set operations, which can be used to solve your problem in an elegant and efficient (vectorized) manner:
import numpy_indexed as npi
unique_values, labels = npi.unique(tuple(values), return_inverse=True)
The above will work for arbitrary type combinations, but alternatively, the below will be even more efficient if values is a list of many arrays of the same dtype:
unique_values, labels = npi.unique(np.asarray(values), axis=1, return_inverse=True)
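A quick usage sketch with the example values from the question (assuming the numpy_indexed package is installed); note that the labels identify equal index-pairs, though their numbering need not follow first-appearance order:
import numpy as np
import numpy_indexed as npi

values = [np.array([10, 11, 10, 11, 10, 11, 10]),
          np.array([21, 21, 22, 22, 21, 22, 23])]
unique_values, labels = npi.unique(tuple(values), return_inverse=True)
print(labels)  # one label per index; equal labels exactly where the value pairs repeat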
If I understand correctly, you are trying to hash values according to columns. It's better to convert the columns into arbitrary values by themselves, and then find the hashes from them.
So you actually want to hash on list(np.array(values).T).
This functionality is already built into Pandas; you don't need to write it. The only problem is that it takes a list of values without further lists within it. In this case, you can just convert the inner lists to strings with map(str, list(np.array(values).T)) and factorize that!
>>> import pandas as pd
>>> pd.factorize(map(str, list(np.array(values).T)))
(array([0, 1, 2, 3, 0, 3, 4]),
array(['[10 21]', '[11 21]', '[10 22]', '[11 22]', '[10 23]'], dtype=object))
I have converted your list of arrays into an array, and then into a string ...
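On Python 3, map() returns an iterator, so it helps to materialize the strings first; a minimal sketch, assuming values is the list of arrays from the question:
import numpy as np
import pandas as pd

values = [np.array([10, 11, 10, 11, 10, 11, 10]),
          np.array([21, 21, 22, 22, 21, 22, 23])]
labels, uniques = pd.factorize([str(col) for col in np.array(values).T])
print(labels)  # [0 1 2 3 0 3 4]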
I have an array of float values which I then pass to an equation to produce a corresponding array. However, I would like to keep the first n values of this array constant and have all values after that passed to the equation.
What is the best way to do this in Python?
Just slice the array to pass the values after the nth to your "equation" (which I assume is a function?).
def equation(l):
    return sum(l)  # for example

a = [1, 2, 3, 4, 5, 6, 7, 8]
n = 4
>>> equation(a[n:])
26
>>> equation(a[3:6])
15
This passes only those values after the fourth in list a. Actually it passes a copy of that part of the list after the fourth, so your function is free to change the values therein without side effects.
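Since the question mentions an array of float values, here is a small sketch of the same idea with numpy, where the transformed tail is written back and the first n values stay constant (the equation here is only a placeholder):
import numpy as np

def equation(x):
    return x ** 2  # placeholder elementwise equation

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n = 3
a[n:] = equation(a[n:])  # only the values after the first n are transformed
print(a)  # [ 1.  2.  3. 16. 25. 36.]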
I am trying to subtract the previous item in a list from the following item in the list, but I think my type is preventing me from doing so. The type of each item in the list is int. If I have a list of integers such as
1 2 3 4 5 6 7
How will I subtract 1 from 2, 2 from 3, 3 from 4, etc., and print this value after each operation?
My list is torcount, which I acquired from a numpy operation, and this is the code I tried:
TorCount = len(np.unique(TorNum))
for i in range(TorCount):
    TorCount = TorCount[i] - TorCount[i-1]
    print TorCount
Thank you
Use np.diff:
Example:
>>> xs = np.array([1, 2, 3, 4])
>>> np.diff(xs, n=1)
array([1, 1, 1])
numpy.diff(a, n=1, axis=-1)
Calculate the n-th order discrete difference along given axis.
The first order difference is given by out[n] = a[n+1] - a[n]
along the given axis, higher order differences are calculated
by using diff recursively.
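Applied to the question's example list, a minimal sketch that prints each difference as asked:
import numpy as np

torcount = np.array([1, 2, 3, 4, 5, 6, 7])
for d in np.diff(torcount):
    print(d)  # prints 1 six times: 2-1, 3-2, ..., 7-6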