I have two flat lists where one of them contains duplicate values.
For example,
array1 = [1,4,4,7,10,10,10,15,16,17,18,20]
array2 = [4,6,7,8,9,10]
I need to find values in array1 that are also in array2, KEEPING THE DUPLICATES in array1.
Desired outcome will be
result = [4,4,7,10,10,10]
I want to avoid loops, as the actual arrays will contain millions of values.
I have tried various set and intersect combinations, but I just couldn't keep the duplicates.
What do you mean you don't want to use loops? You're going to have to iterate over it one way or another. Just take in each item individually and check if it's in array2 as you go:
items = set(array2)
found = [i for i in array1 if i in items]
Furthermore, depending on how you are going to use the result, consider using a generator instead:
found = (i for i in array1 if i in items)
so that you won't have to have the whole thing in memory all at once.
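For example, you can consume it lazily, one value at a time (a small sketch using the sample arrays from the question):
for value in found:
    print(value)   # prints 4, 4, 7, 10, 10, 10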
The following will do it:
array1 = [1,4,4,7,10,10,10,15,16,17,18,20]
array2 = [4,6,7,8,9,10]
set2 = set(array2)
print [el for el in array1 if el in set2]
It keeps the order and repetitions of elements in array1.
It turns array2 into a set for faster lookups. Note that this is only beneficial if array2 is sufficiently large; if array2 is small, it may be more performant to keep it as a list.
Following on from #Alex's answer, if you also want to extract the indices for each token, then here's how:
found = [[index,i] for index,i in enumerate(array1) if i in array2]
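With the sample arrays from the question, this gives:
print(found)   # [[1, 4], [2, 4], [3, 7], [4, 10], [5, 10], [6, 10]]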
Given a list like the following one:
foo_list = [[1,8],[2,7],[3,6]]
I've found in questions like Tuple pairs, finding minimum using python and
minimum of list of lists that the pair with the minimum value of a list of lists can be found using a generator like:
min(x for x in foo_list)
which returns
[1, 8]
But I was wondering if there is a similar way to return both minimum values of the "columns" of the list:
output = [1,6]
I know this can be achieved using numpy arrays:
output = np.min(np.array(foo_list), axis=0)
But I'm interested in finding such a way of doing so with generators (if possible).
Thanks in advance!
[min(l) for l in zip(*foo_list)]
returns [1, 6]
zip(*foo_list) transposes the list, and then we take the minimum of each of the resulting columns.
Thanks #mousetail for the suggestion.
You can use two min() calls for this, like:
min1 = min(a for a, _ in foo_list)
min2 = min(b for _, b in foo_list)
print([min1, min2])   # [1, 6]
Will this do? If you don't want to use a third-party library, you can also just use a plain old loop, which will be more efficient.
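For example, a minimal sketch of that single-pass loop (assuming foo_list is non-empty):
min1, min2 = foo_list[0]
for a, b in foo_list[1:]:
    if a < min1:
        min1 = a
    if b < min2:
        min2 = b
print([min1, min2])   # [1, 6]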
I want to find the indices of each group of duplicate values, like this:
s = [2,6,2,88,6,...]
The result must contain the indices from the original s, e.g. [[0,2],[1,4],..], although another output format would also be fine.
I looked at many solutions, and the fastest way I found to get the duplicated values is:
s_sorted = np.sort(s, axis=None)
s_sorted[:-1][s_sorted[1:] == s_sorted[:-1]]
But after sorting, the indices no longer refer to the original s.
In my case there are ~200 million values in the list and I want the fastest way to do this. I store the values in an array because I want to use the GPU to make it faster.
Using hash structures like dict helps.
For example:
import numpy as np
from collections import defaultdict
a=np.array([2,4,2,88,15,4])
table=defaultdict(list)
for ind, num in enumerate(a):
    table[num] += [ind]
Outputs:
{2: [0, 2], 4: [1, 5], 88: [3], 15: [4]}
If you want to show duplicated elements in the order from small to large:
for k, v in sorted(table.items()):
    if len(v) > 1:
        print(k, ":", v)
Outputs:
2 : [0, 2]
4 : [1, 5]
The speed is mainly determined by how many different values there are in the list.
See if this meets your performance requirements (here, s is your input array):
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = np.split(sorted_inds, cum_counts[:-1])
Notes:
The result would be a list of arrays.
Each of these arrays would contain the indices of one repeated value in s. E.g., if the value 13 is repeated 7 times in s, one of the arrays in result would contain its 7 indices.
If you want to ignore singleton values of s (values that occur only once in s), you can filter result afterwards and keep only the arrays that contain more than one index.
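For illustration, a small sketch with a toy array (np.bincount requires non-negative integers); empty and singleton groups are filtered out at the end:
import numpy as np
s = np.array([2, 6, 2, 88, 6])
counts = np.bincount(s)                 # occurrences of each value 0..max(s)
cum_counts = np.add.accumulate(counts)  # running totals mark the group boundaries
sorted_inds = np.argsort(s)             # indices of s, ordered by value
result = np.split(sorted_inds, cum_counts[:-1])
groups = [g for g in result if g.size > 1]   # keep only the duplicated values
print(groups)   # [array([0, 2]), array([1, 4])] -- the indices of 2 and of 6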
(This is a variation of my other answer. Here, instead of splitting the large array sorted_inds, we take slices from it, so it's likely to have a different kind of performance characteristic)
If s is the input array:
counts = np.bincount(s)
cum_counts = np.add.accumulate(counts)
sorted_inds = np.argsort(s)
result = [sorted_inds[:cum_counts[0]]] + [sorted_inds[cum_counts[i]:cum_counts[i+1]] for i in range(cum_counts.size-1)]
I created a large array that is similar to this:
data = [ [1,2,3], [0,1,3],[1,5,3]]
How can I make it so my new array sums up each individual array as shown?
data = [ [6],[4],[9] ]
List comprehensions are good for this:
[[sum(x)] for x in data] # [[6], [4], [9]]
List comprehensions provide a concise way to create lists. Common applications are to make new lists where each element is the result of some operations applied to each member of another sequence or iterable, or to create a subsequence of those elements that satisfy a certain condition.
You want to make a new list where each element is the result of some operation (a sum, in this case) applied to each member of another sequence or iterable (your list of lists).
This works.
a = [[1,2,3], [0,1,3], [1,5,3]]
b = []
for i in a:
    total = 0
    for j in i:
        total += j
    b.append([total])
print(b)   # [[6], [4], [9]]
I have a particular list of numbers (item_list) for which I need all the associated row indices inside a 2D array (C). Please find the code below:
# Sample code
item_list = [1, 2, 3]
C= [[0 for x in range(5)] for x in range(5)]
C[0][:]=[1,5,3,25,30]
C[1][:]=[7,9,15,2,45]
C[2][:]=[2,9,15,78,98]
C[3][:]=[3,90,15,1,98]
C[4][:]=[12,19,25,3,8]
rind = []
for item in item_list:
    v = [i for i in range(len(C)) for j in range(len(C[i])) if C[i][j] == item]
    rind.append(v)
My 2D array's size is ~7M x 4; could anyone help me make this faster?
For starters:
rind = [[i for i in range(len(C)) if item in C[i]]
for item in item_list]
The crucial change here is the use of in, which should be faster than your explicit inner loop over the columns.
This also means that if a number appears multiple times in one sublist of the input, its row index i appears only once in the corresponding output list, which I assume is what you really want.
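Applied to the sample data from the question, this gives:
item_list = [1, 2, 3]
C = [[1, 5, 3, 25, 30],
     [7, 9, 15, 2, 45],
     [2, 9, 15, 78, 98],
     [3, 90, 15, 1, 98],
     [12, 19, 25, 3, 8]]
rind = [[i for i in range(len(C)) if item in C[i]]
        for item in item_list]
print(rind)   # [[0, 3], [1, 2], [0, 3, 4]]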
I was doing some practice with lists, arrays and matrices in python and I got confused at something.
if I do:
list1 = [1,2,3,4]
list2 = [2,3,4,5]
print list1 + list2
Output:
I get [1,2,3,4,2,3,4,5]
I think it was like yesterday I was doing something similar but I got
Output2:
[3,5,7,9]
that is, the element-wise addition of the corresponding values of the two lists. I was expecting the first output back then as well, but it added the values instead.
I haven't done linear algebra or prob & stats in a while. What is the operation called that produces Output1? And Output2? I've confused myself badly.
edit: http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.add.html
If you look at the 2nd example, they add a 3x3 array and a 1x3 array. I thought you can't add them if they're not the same dimensions?
When using standard lists, addition is defined as concatenation of the two lists:
import numpy as np
list1 = [1,2,3,4]
list2 = [2,3,4,5]
print list1 + list2
# [1, 2, 3, 4, 2, 3, 4, 5]
When using numpy types, addition is defined as element-wise addition rather than list concatenation.
array1 = np.array(list1)
array2 = np.array(list2)
print array1 + array2
# [3 5 7 9]
This is often called a vectorized operation. In cases where the arrays are large, it can be faster than iterating over the structures, since the vectorized operation uses a highly optimized implementation provided by numpy.
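As for the edit in the question about adding a 3x3 array to a 1x3 array: numpy allows that through broadcasting, which stretches the smaller array across the larger one when their shapes are compatible. A small sketch:
import numpy as np
a = np.arange(9).reshape(3, 3)   # 3x3 array: [[0 1 2], [3 4 5], [6 7 8]]
b = np.array([10, 20, 30])       # shape (3,), i.e. a single row
print(a + b)                     # b is added to every row of a
# [[10 21 32]
#  [13 24 35]
#  [16 27 38]]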
If you do not understand zip or numpy:
Assuming both lists list1 and list2 have the same length, this will do the job:
[list1[i] + list2[i] for i in xrange(len(list1))]
PS: simply using list1 + list2 would only concatenate the two lists.
To add the corresponding elements you need to iterate over the lists.
You can get "Output2" canonically with sum() and zip():
result = [sum(item) for item in zip(list1, list2)]
If you put each list1, list2, etc. into a container (such as a tuple or another list, e.g. lists = [list1, list2]), you can instead use zip(*lists), and then you won't have to change that code for any number of lists.
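For example, a quick sketch with the two lists from the question:
list1 = [1, 2, 3, 4]
list2 = [2, 3, 4, 5]
lists = [list1, list2]
result = [sum(item) for item in zip(*lists)]
print(result)   # [3, 5, 7, 9]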