Remove duplicates from array and elements in matching positions in another array - python

I have two NumPy arrays. I want to remove every value that occurs more than once in the first array (all occurrences, including the first one) and remove the items at the matching positions in the second array.
For example:
a = [1, 2, 2, 3]
b = ['a', 'd', 'f', 'c']
Becomes:
a = [1, 3]
b = ['a', 'c']
I need to do this efficiently and avoid the naive solution, which is too slow.

Here's one with np.unique -
unq,idx,c = np.unique(a, return_index=True, return_counts=True)
unq_idx = np.sort(idx[c==1])
a_out = a[unq_idx]
b_out = b[unq_idx]
Sample run -
In [34]: a
Out[34]: array([1, 2, 2, 3])
In [35]: b
Out[35]: array(['a', 'd', 'f', 'c'], dtype='|S1')
In [36]: unq,idx,c = np.unique(a, return_index=True, return_counts=True)
...: unq_idx = np.sort(idx[c==1])
...: a_out = a[unq_idx]
...: b_out = b[unq_idx]
In [37]: a_out
Out[37]: array([1, 3])
In [38]: b_out
Out[38]: array(['a', 'c'], dtype='|S1')
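For reuse, the np.unique approach above could be wrapped in a small function. This is only a minimal sketch: the name drop_repeated is hypothetical, and the second call is added just to show that np.sort keeps the surviving items in their original order.

```python
import numpy as np

def drop_repeated(a, b):
    """Keep only the positions whose value occurs exactly once in a."""
    a = np.asarray(a)
    b = np.asarray(b)
    # idx: first index of each unique value, counts: how often it occurs
    _, idx, counts = np.unique(a, return_index=True, return_counts=True)
    keep = np.sort(idx[counts == 1])  # sort to restore the original order
    return a[keep], b[keep]

a_out, b_out = drop_repeated([1, 2, 2, 3], ['a', 'd', 'f', 'c'])
print(a_out, b_out)  # [1 3] ['a' 'c']

a_out2, b_out2 = drop_repeated([3, 1, 1, 2], ['w', 'x', 'y', 'z'])
print(a_out2, b_out2)  # [3 2] ['w' 'z']
```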

Since you are open to NumPy, you may wish to consider Pandas, which uses NumPy internally:
import pandas as pd
a = pd.Series([1, 2, 2, 3])
b = pd.Series(['a', 'd', 'f', 'c'])
flags = ~a.duplicated(keep=False)
idx = flags[flags].index
a = a[idx].values
b = b[idx].values
Result:
print(a, b, sep='\n')
[1 3]
['a' 'c']
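If the inputs are already NumPy arrays, the duplicated flags can probably be used directly as a boolean mask, without going through the index. A minimal sketch, assuming a and b are one-dimensional arrays of equal length (the names mask, a_out and b_out are mine):

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 2, 3])
b = np.array(['a', 'd', 'f', 'c'])

# True where the value of a occurs exactly once
mask = ~pd.Series(a).duplicated(keep=False).to_numpy()

a_out = a[mask]
b_out = b[mask]
print(a_out, b_out)  # [1 3] ['a' 'c']
```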

Related

Numpy indexing using an array of booleans

I have the following array which indicates whether or not to take a certain item:
import numpy as np
test_array = np.array([[0, 0, 1],
[1, 1, 0],
[1, 1, 1]])
The array I want to index is this one:
classes = ['a', 'b', 'c']
This is what the result should be:
[['c'], ['a', 'b'], ['a', 'b', 'c']]
How can this be done?
The answers I've seen so far range from awkward to, quite frankly, baffling, so here is a straightforward solution.
import numpy as np
arr = np.array([[0, 0, 1], [1, 1, 0], [1, 1, 1]])
arr_bools = arr.astype(bool)  # the built-in bool works across NumPy versions (np.bool was removed in 1.24)
lookup_lst = np.array(['a', 'b', 'c'])
res = [lookup_lst[row].tolist() for row in arr_bools]
You could do the following:
import numpy as np
test_array = np.array([[0, 0, 1],
[1, 1, 0],
[1, 1, 1]])
classes = ['a', 'b', 'c']
lookup = dict(enumerate(classes))
result = [[lookup[i] for i, e in enumerate(arr) if e] for arr in test_array]
print(result)
Output
[['c'], ['a', 'b'], ['a', 'b', 'c']]
In one line you can do:
print ([[c for (x, c) in zip(l, classes) if x] for l in test_array])
I would do it as this:
result = []
for array in test_array:
    result.append([classes[i] for i, value in enumerate(array) if value])
I would start with something like that:
result = []
for row in test_array:
    partial_result = []
    for i in range(3):
        if row[i] == 1:
            partial_result.append(classes[i])
    result.append(partial_result)
print(result)
Results with:
[['c'], ['a', 'b'], ['a', 'b', 'c']]
In Python, we prefer list comprehension over loops, so time to improve:
print([[classes[i] for i, val in enumerate(row) if val] for row in test_array])
enumerate is a built-in function that takes an iterable as a parameter and returns an iterable of (index, element) tuples for all elements of the original iterable, so for the first row [0, 0, 1], enumerate(row) yields (0, 0), (1, 0) and (2, 1).
The condition in "for i, val in enumerate(row) if val" works because 1 is interpreted as True in Python and 0 as False.
[[classes[i] for i, val in enumerate(row) if val] for row in test_array]
Reading it from the outside in:
    the outer brackets create a list with one element per row of test_array;
    each element of that list is a list itself (the inner brackets);
    the elements of each inner list are objects from the classes list;
    for each pair (i, val) from enumerate(row), the i-th element of classes is taken, but only if val == 1.
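To make the enumerate step concrete, here is a small sketch for a single row (the variable names pairs and selected are only illustrative):

```python
classes = ['a', 'b', 'c']
row = [1, 1, 0]

# enumerate pairs each element of the row with its position
pairs = list(enumerate(row))
print(pairs)  # [(0, 1), (1, 1), (2, 0)]

# keep classes[i] only where the value is truthy (1)
selected = [classes[i] for i, val in pairs if val]
print(selected)  # ['a', 'b']
```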
This can be done by matrix multiplication:
[*map(list, test_array.astype('O') @ classes)]
# [['c'], ['a', 'b'], ['a', 'b', 'c']]

Combining 3 Arrays into 1 Matrix (Python 3)

I have 3 arrays of equal length (e.g.):
[a, b, c]
[1, 2, 3]
[i, ii, iii]
I would like to combine them into a matrix:
|a, 1, i |
|b, 2, ii |
|c, 3, iii|
The problem is that when I use functions such as dstack, hstack or concatenate, the arrays get numerically added or stacked in a fashion that I can't work with.
You could use zip(), which groups the elements at the same index of several containers so that they can be used as a single entity:
a1 = ['a', 'b', 'c']
b1 = ['1', '2', '3']
c1 = ['i', 'ii', 'iii']
print(list(zip(a1,b1,c1)))
OUTPUT:
[('a', '1', 'i'), ('b', '2', 'ii'), ('c', '3', 'iii')]
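If lists are needed as rows rather than tuples, each zipped tuple can be converted. A minimal sketch using the same three lists (the name rows is mine):

```python
a1 = ['a', 'b', 'c']
b1 = ['1', '2', '3']
c1 = ['i', 'ii', 'iii']

# zip yields one tuple per index; list() turns each tuple into a row
rows = [list(row) for row in zip(a1, b1, c1)]
print(rows)  # [['a', '1', 'i'], ['b', '2', 'ii'], ['c', '3', 'iii']]
```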
EDIT:
I just thought of taking this a step further: how about flattening the list afterwards and then using numpy.reshape?
res = list(zip(a1, b1, c1))
flattened_list = []
# flatten the list
for x in res:
    for y in x:
        flattened_list.append(y)
# print(flattened_list)
import numpy as np
data = np.array(flattened_list)
shape = (3, 3)
print(data.reshape(shape))
OUTPUT:
[['a' '1' 'i']
 ['b' '2' 'ii']
 ['c' '3' 'iii']]
OR
for the one-liner fans out there:
# flatten the zipped tuples
flattened_list = [y for x in zip(a1, b1, c1) for y in x]
# print(flattened_list)
print([flattened_list[i:i+3] for i in range(0, len(flattened_list), 3)])
OUTPUT:
[['a', '1', 'i'], ['b', '2', 'ii'], ['c', '3', 'iii']]
OR
As suggested by @norok2
print(list(zip(*zip(a1, b1, c1))))
OUTPUT:
[('a', 'b', 'c'), ('1', '2', '3'), ('i', 'ii', 'iii')]
Assuming that you have 3 numpy arrays:
>>> a, b, c = np.random.randint(0, 9, 9).reshape(3, 3)
>>> print(a, b, c)
[4 1 4] [5 8 5] [3 0 2]
then you can stack them vertically (i.e. along the first dimension), and then transpose the resulting matrix to get the order you need:
>>> np.vstack((a, b, c)).T
array([[4, 5, 3],
[1, 8, 0],
[4, 5, 2]])
A slightly more verbose example is to instead stack horizontally, but this requires that your arrays are made into 2D using reshape:
>>> np.hstack((a.reshape(3, 1), b.reshape(3, 1), c.reshape(3, 1)))
array([[4, 5, 3],
[1, 8, 0],
[4, 5, 2]])
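As an alternative to the two variants above (not part of the original answer), np.column_stack treats each 1-D array as a column, so neither the transpose nor the reshape should be needed. A minimal sketch with the same sample values hard-coded:

```python
import numpy as np

a = np.array([4, 1, 4])
b = np.array([5, 8, 5])
c = np.array([3, 0, 2])

# each 1-D input becomes one column of the result
m = np.column_stack((a, b, c))
print(m)
# [[4 5 3]
#  [1 8 0]
#  [4 5 2]]
```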
This gives you a list of tuples, which might not be what you want:
>>> list(zip([1,2,3],[4,5,6],[7,8,9]))
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]
This gives you a NumPy array:
>>> from numpy import array
>>> array([[1,2,3],[4,5,6],[7,8,9]]).transpose()
array([[1, 4, 7],
[2, 5, 8],
[3, 6, 9]])
If you have different data types in each array, then it would make sense to use pandas for this:
# Iterative approach, using concat
import pandas as pd
my_arrays = [['a', 'b', 'c'], [1, 2, 3], ['i', 'ii', 'iii']]
df1 = pd.concat([pd.Series(array) for array in my_arrays], axis=1)
# Named arrays
array1 = ['a', 'b', 'c']
array2 = [1, 2, 3]
array3 = ['i', 'ii', 'iii']
df2 = pd.DataFrame({'col1': array1,
'col2': array2,
'col3': array3})
Now you have the structure you desired, with appropriate data types for each column:
print(df1)
# 0 1 2
# 0 a 1 i
# 1 b 2 ii
# 2 c 3 iii
print(df2)
# col1 col2 col3
# 0 a 1 i
# 1 b 2 ii
# 2 c 3 iii
print(df1.dtypes)
# 0 object
# 1 int64
# 2 object
# dtype: object
print(df2.dtypes)
# col1 object
# col2 int64
# col3 object
# dtype: object
You can extract the numpy array with the .values attribute:
df1.values
# array([['a', 1, 'i'],
# ['b', 2, 'ii'],
# ['c', 3, 'iii']], dtype=object)

More Elegant way to number a list according to values

I would like to map a list into numbers according to the values.
For example:
['aa', 'b', 'b', 'c', 'aa', 'b', 'a'] -> [0, 1, 1, 2, 0, 1, 3]
I'm trying to achieve this by using numpy and a mapping dict.
def number(lst):
    x = np.array(lst)
    unique_names = list(np.unique(x))
    mapping = dict(zip(unique_names, range(len(unique_names))))  # Translating dict
    map_func = np.vectorize(lambda name: mapping[name])
    return map_func(x)
Is there a more elegant / faster way to do this?
Update: Bonus question -- do it with the order maintained.
You can use the return_inverse keyword:
x = np.array(['aa', 'b', 'b', 'c', 'aa', 'b', 'a'])
uniq, map_ = np.unique(x, return_inverse=True)
map_
# array([1, 2, 2, 3, 1, 2, 0])
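To see what the inverse array means: each entry is the position of the original element within the sorted unique values, so indexing uniq with it rebuilds the input. A small sketch (the name inverse is mine):

```python
import numpy as np

x = np.array(['aa', 'b', 'b', 'c', 'aa', 'b', 'a'])
uniq, inverse = np.unique(x, return_inverse=True)

print(uniq.tolist())     # ['a', 'aa', 'b', 'c']
print(inverse.tolist())  # [1, 2, 2, 3, 1, 2, 0]

# indexing the unique values with the inverse reconstructs x
rebuilt = uniq[inverse]
print((rebuilt == x).all())  # True
```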
Edit: Order preserving version:
x = np.array(['aa', 'b', 'b', 'c', 'aa', 'b', 'a'])
uniq, idx, map_ = np.unique(x, return_index=True, return_inverse=True)
mxi = idx.max()+1
mask = np.zeros((mxi,), bool)
mask[idx] = True
oidx = np.where(mask)[0]
iidx = np.empty_like(oidx)
iidx[map_[oidx]] = np.arange(oidx.size)
iidx[map_]
# array([0, 1, 1, 2, 0, 1, 3])
Here's a vectorized NumPy based solution -
def argsort_unique(idx):
    # Original idea : http://stackoverflow.com/a/41242285/3293881 by @Andras
    n = idx.size
    sidx = np.empty(n,dtype=int)
    sidx[idx] = np.arange(n)
    return sidx
def map_uniquetags_keep_order(a):
    arr = np.asarray(a)
    sidx = np.argsort(arr)
    s_arr = arr[sidx]
    m = np.concatenate(( [True], s_arr[1:] != s_arr[:-1] ))
    unq = s_arr[m]
    tags = np.searchsorted(unq, arr)
    rev_idx = argsort_unique(sidx[np.searchsorted(s_arr, unq)].argsort())
    return rev_idx[tags]
Sample run -
In [169]: a = ['aa', 'b', 'b', 'c', 'aa', 'b', 'a'] # String input
In [170]: map_uniquetags_keep_order(a)
Out[170]: array([0, 1, 1, 2, 0, 1, 3])
In [175]: a = [4, 7, 7, 5, 4, 7, 2] # Numeric input
In [176]: map_uniquetags_keep_order(a)
Out[176]: array([0, 1, 1, 2, 0, 1, 3])
Use sets to remove duplicates:
myList = ['a', 'b', 'b', 'c', 'a', 'b']
mySet = set(myList)
Then build your dictionary using comprehension:
mappingDict = {letter:number for number,letter in enumerate(mySet)}
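One caveat: a set has no guaranteed order, so the numbers assigned this way can differ from what you expect. If numbering by first appearance is wanted, a possible variation (my assumption, not part of the answer above) is to deduplicate with dict.fromkeys, which keeps insertion order in Python 3.7+:

```python
my_list = ['a', 'b', 'b', 'c', 'a', 'b']

# dict.fromkeys drops duplicates while keeping first-seen order
mapping = {letter: number
           for number, letter in enumerate(dict.fromkeys(my_list))}
numbers = [mapping[letter] for letter in my_list]

print(mapping)  # {'a': 0, 'b': 1, 'c': 2}
print(numbers)  # [0, 1, 1, 2, 0, 1]
```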
I did it using the ASCII values because it is easy and short.
def number(lst):
    # works only for single lowercase ASCII letters: ord('a') == 97
    return [ord(x) - 97 for x in lst]
l = ['a', 'b', 'b', 'c', 'a', 'b']
print(number(l))
Output:
[0, 1, 1, 2, 0, 1]
If the order is not a concern:
[sorted(set(x)).index(item) for item in x]
# returns:
[1, 2, 2, 3, 1, 2, 0]
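Note that the one-liner above re-sorts the set and does a linear search for every item. For longer lists it is probably cheaper to build the lookup once. A minimal sketch of that idea (the name order is mine):

```python
x = ['aa', 'b', 'b', 'c', 'aa', 'b', 'a']

# sort the distinct values once and remember each value's rank
order = {value: rank for rank, value in enumerate(sorted(set(x)))}
numbered = [order[item] for item in x]

print(numbered)  # [1, 2, 2, 3, 1, 2, 0]
```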

Merge two or more lists with given order of merging

To start with, I have two lists, plus a third list that says in what order I should merge those two.
For example, the first list is [a, b, c], the second list is [d, e], and the 'merging' list is [0, 1, 0, 0, 1].
That means: to build the merged list I first take an element from the first list, then the second, then the first, then the first, then the second, and I end up with [a, d, b, c, e].
To solve this I just used a for loop and two "pointers", but I was wondering whether I can do this in a more Pythonic way. I tried to find some functions that could help me, but without success.
You could create iterators from those lists, loop through the ordering list, and call next on one of the iterators:
i1 = iter(['a', 'b', 'c'])
i2 = iter(['d', 'e'])
# Select the iterator to advance: `i2` if `x` == 1, `i1` otherwise
print([next(i2 if x else i1) for x in [0, 1, 0, 0, 1]]) # ['a', 'd', 'b', 'c', 'e']
It's possible to generalize this solution to any number of lists as shown below
def ordered_merge(lists, selector):
    its = [iter(l) for l in lists]
    for i in selector:
        yield next(its[i])
In [4]: list(ordered_merge([[3, 4], [1, 5], [2, 6]], [1, 2, 0, 0, 1, 2]))
Out[4]: [1, 2, 3, 4, 5, 6]
If the ordering list contains strings, floats, or any other objects that can't be used as list indexes, use a dictionary:
def ordered_merge(mapping, selector):
    its = {k: iter(v) for k, v in mapping.items()}
    for i in selector:
        yield next(its[i])
In [6]: mapping = {'A': [3, 4], 'B': [1, 5], 'C': [2, 6]}
In [7]: list(ordered_merge(mapping, ['B', 'C', 'A', 'A', 'B', 'C']))
Out[7]: [1, 2, 3, 4, 5, 6]
Of course, you can use integers as dictionary keys as well.
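One thing to be aware of: if the selector asks for more items than a list holds, next() raises StopIteration inside the generator, which Python 3.7+ turns into a RuntimeError. A possible variation with an explicit error is sketched below; the name ordered_merge_checked and the error message are my own, not part of the answer.

```python
def ordered_merge_checked(lists, selector):
    """Like ordered_merge, but fails clearly when a list runs out."""
    iterators = [iter(lst) for lst in lists]
    sentinel = object()  # unique marker meaning "iterator exhausted"
    for i in selector:
        item = next(iterators[i], sentinel)
        if item is sentinel:
            raise ValueError(f"list {i} has no more elements")
        yield item

merged = list(ordered_merge_checked([['a', 'b', 'c'], ['d', 'e']],
                                    [0, 1, 0, 0, 1]))
print(merged)  # ['a', 'd', 'b', 'c', 'e']
```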
Alternatively, you could remove elements from the left side of each of the original lists one by one and add them to the resulting list. Quick example:
In [8]: A = ['a', 'b', 'c']
...: B = ['d', 'e']
...: selector = [0, 1, 0, 0, 1]
...:
In [9]: [B.pop(0) if x else A.pop(0) for x in selector]
Out[9]: ['a', 'd', 'b', 'c', 'e']
I would expect the first approach to be more efficient (list.pop(0) is slow).
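If you like the pop-based style but want to avoid the cost of list.pop(0), collections.deque offers popleft(), which removes from the left end in constant time. A minimal sketch of that swap (this uses deque instead of the lists shown above):

```python
from collections import deque

A = deque(['a', 'b', 'c'])
B = deque(['d', 'e'])
selector = [0, 1, 0, 0, 1]

# popleft() removes and returns the leftmost element in O(1)
merged = [B.popleft() if x else A.popleft() for x in selector]
print(merged)  # ['a', 'd', 'b', 'c', 'e']
```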
How about this,
list1 = ['a', 'b', 'c']
list2 = ['d', 'e']
options = [0,1,0,0,1]
list1_iterator = iter(list1)
list2_iterator = iter(list2)
new_list = [next(list2_iterator) if option else next(list1_iterator) for option in options]
print(new_list)
# Output
['a', 'd', 'b', 'c', 'e']

How to Order a multidimensional List using another list

Quick Summary:
need_to_reorder = [['a', 'b', 'c', 'd'], [1, 2, 3, 4]]
I want to set an order for the need_to_reorder[0][x] x values using my sorting array
sorting_array = [1, 3, 0, 2]
Required result: need_to_reorder will equal
[['b', 'd', 'a', 'c'], [2, 4, 1, 3]]
Searching for the answer, I tried using numPy:
import numpy as np
sorting_array = [1, 3, 0, 2]
i = np.array(sorting_array)
print i ## Results: [1 3 0 2] <-- No Commas?
need_to_reorder[:,i]
RESULTS:
TypeError: list indices must be integers, not tuple
I'm looking for a correction to the code above or an entirely different approach.
You can try a simple nested comprehension
>>> l = [['a', 'b', 'c', 'd'], [1, 2, 3, 4]]
>>> s = [1, 3, 0, 2]
>>> [[j[i] for i in s] for j in l]
[['b', 'd', 'a', 'c'], [2, 4, 1, 3]]
If you need this as a function you can have a very simple function as in
def reorder(need_to_reorder, sorting_array):
    return [[j[i] for i in sorting_array] for j in need_to_reorder]
Do note that this can also be solved using the map function. However, in this case a list comprehension is preferred, as the map variant would require a lambda function. The difference between map and a list comprehension is discussed at length in this answer
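Another option without a lambda might be operator.itemgetter, which picks several positions at once. This is only a sketch and assumes the sorting array has at least two entries (with a single index, itemgetter returns a bare item instead of a tuple):

```python
from operator import itemgetter

need_to_reorder = [['a', 'b', 'c', 'd'], [1, 2, 3, 4]]
sorting_array = [1, 3, 0, 2]

# itemgetter(1, 3, 0, 2)(row) returns (row[1], row[3], row[0], row[2])
pick = itemgetter(*sorting_array)
reordered = [list(pick(row)) for row in need_to_reorder]
print(reordered)  # [['b', 'd', 'a', 'c'], [2, 4, 1, 3]]
```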
def order_with_sort_array(arr, sort_arr):
    assert len(arr) == len(sort_arr)
    return [arr[i] for i in sort_arr]
sorting_array = [1, 3, 0, 2]
need_to_reorder = [['a', 'b', 'c', 'd'], [1, 2, 3, 4]]
# in Python 3, map returns an iterator, so wrap it in list()
after_reordered = list(map(lambda arr: order_with_sort_array(arr, sorting_array),
                           need_to_reorder))
This should work (note that NumPy converts the integers to strings when it builds the mixed array):
import numpy as np
ntr = np.array([['a', 'b', 'c', 'd'], [1, 2, 3, 4]])
sa = np.array([1, 3, 0, 2])
print(ntr[:, sa])  # reorder the columns of both rows at once
>> [['b' 'd' 'a' 'c']
 ['2' '4' '1' '3']]
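The attempt in the question fails because need_to_reorder is a plain list, and lists do not support the [:, i] form of indexing. Converting it to an array first makes that indexing work. A minimal sketch, assuming you want the integers to stay integers (hence dtype=object) and a nested list back at the end:

```python
import numpy as np

need_to_reorder = [['a', 'b', 'c', 'd'], [1, 2, 3, 4]]
sorting_array = [1, 3, 0, 2]

# dtype=object keeps the original Python objects (str and int) as they are
arr = np.array(need_to_reorder, dtype=object)

# select the columns in the requested order, for every row at once
reordered = arr[:, sorting_array].tolist()
print(reordered)  # [['b', 'd', 'a', 'c'], [2, 4, 1, 3]]
```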
