I've had multiple scenarios where I had to find one huge array's items in another huge array.
I usually solved it like this:
for i in range(0, len(arr1)):
    for k in range(0, len(arr2)):
        if arr1[i] == arr2[k]:
            print(arr1[i], arr2[k])
Which works fine, but it's kinda slow.
Can someone help me make the iteration faster?
arr1 = [1,2,3,4,5]
arr2 = [4,5,6,7]
same_items = set(arr1).intersection(arr2)
print(same_items)
Out[5]: {4,5}
Sets hash their items, so looking up any element is O(1) instead of the O(n) scan you get with a list. The items need to be hashable for this to work, and if they are not, I highly suggest you find a way to make them hashable.
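For example, if the items are themselves lists (which are unhashable), one common trick, assuming the order inside each item matters, is to convert them to tuples first:
arr1 = [[1, 2], [3, 4], [5, 6]]
arr2 = [[3, 4], [7, 8]]
# Tuples are hashable, so they can go into a set.
same_items = set(map(tuple, arr1)).intersection(map(tuple, arr2))
print(same_items)  # {(3, 4)}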
If you need to handle huge arrays, you may want to use Python's numpy library, which provides high-efficiency array operations and lets you avoid explicit loops entirely in most cases.
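For instance, numpy's intersect1d gives you the sorted, unique common values directly (a sketch assuming plain numeric data):
import numpy as np
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([4, 5, 6, 7])
# intersect1d returns the sorted, unique values present in both arrays.
print(np.intersect1d(arr1, arr2))  # [4 5]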
If your array has duplicates and you want to keep them all:
arr1 = [1,2,3,4,5,7,5,4]
arr2 = [4,5,6,7]
res = [i for i in arr1 if i in arr2]
>>> res
[4, 5, 7, 5, 4]
or using numpy:
import numpy as np
res = np.array(arr1)[np.isin(arr1, arr2)].tolist()
>>> res
[4, 5, 7, 5, 4]
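If arr2 is large, it can be worth converting it to a set first, so each membership test in the comprehension is O(1) instead of a scan of arr2 (the same duplicate-keeping idea, just sketched with a set):
arr1 = [1, 2, 3, 4, 5, 7, 5, 4]
arr2 = [4, 5, 6, 7]
arr2_set = set(arr2)  # one-time cost, then O(1) lookups
res = [i for i in arr1 if i in arr2_set]
# [4, 5, 7, 5, 4]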
Related
Is there a more Pythonic way to tell the list which parts of it have to stay and which parts have to be removed?
li = [1,2,3,4,5,6,7]
Wanted list:
[1,2,3,6,7]
I can do that this way:
wl = li[:-4]+li[-2:]
I'm looking for something like li[:-4,-2:] (in one statement/command)
Of course I can use remove or del, but the same kind of thing comes up in many situations, like:
Wanted list:
[3,4,5,6,7]
I can do del li[0:2]
But it's more common to do:
li[2:]
Unlike regular Python lists, NumPy arrays can be indexed by other sequence-like objects (other than tuples), e.g. by a regular Python list or by another NumPy array.
import numpy as np
li = np.arange(1, 8)
# array([1, 2, 3, 4, 5, 6, 7])
wl = li[[0,1,2,5,6]]
# array([1, 2, 3, 6, 7])
Of course, this leaves you with the problem of creating the index sequence (the regular Python list [0,1,2,5,6] in this example), which puts you back at square one. (Unless you need to access several NumPy arrays at the same indices, in which case you create this index list once and then re-use it.)
You should probably only consider this if you have additional reasons to use NumPy in general or specifically NumPy arrays.
If you want the output list to follow a certain logic, you can use the filter function.
filter(lambda x: x > 2, li)
or maybe
filter(lambda x: x < 4 or x > 5, li)
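One thing to keep in mind: in Python 3, filter returns a lazy iterator, so wrap it in list() if you want the list back:
li = [1, 2, 3, 4, 5, 6, 7]
# Keep 1, 2, 3, 6, 7 (i.e. drop 4 and 5).
wl = list(filter(lambda x: x < 4 or x > 5, li))
# [1, 2, 3, 6, 7]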
Suppose I have an array
a = np.array([1, 2, 1, 3, 3, 3, 0])
How can I (efficiently, Pythonically) find which elements of a are duplicates (i.e., non-unique values)? In this case the result would be array([1, 3, 3]) or possibly array([1, 3]) if efficient.
I've come up with a few methods that appear to work:
Masking
m = np.zeros_like(a, dtype=bool)
m[np.unique(a, return_index=True)[1]] = True
a[~m]
Set operations
a[~np.in1d(np.arange(len(a)), np.unique(a, return_index=True)[1], assume_unique=True)]
This one is cute but probably illegal (as a isn't actually unique):
np.setxor1d(a, np.unique(a), assume_unique=True)
Histograms
u, i = np.unique(a, return_inverse=True)
u[np.bincount(i) > 1]
Sorting
s = np.sort(a, axis=None)
s[:-1][s[1:] == s[:-1]]
Pandas
s = pd.Series(a)
s[s.duplicated()]
Is there anything I've missed? I'm not necessarily looking for a numpy-only solution, but it has to work with numpy data types and be efficient on medium-sized data sets (up to 10 million in size).
Conclusions
Testing with a 10 million size data set (on a 2.8GHz Xeon):
a = np.random.randint(10**7, size=10**7)
The fastest is sorting, at 1.1s. The dubious xor1d is second at 2.6s, followed by masking and Pandas Series.duplicated at 3.1s, bincount at 5.6s, and in1d and senderle's setdiff1d both at 7.3s. Steven's Counter is only a little slower, at 10.5s; trailing behind are Burhan's Counter.most_common at 110s and DSM's Counter subtraction at 360s.
I'm going to use sorting for performance, but I'm accepting Steven's answer because the performance is acceptable and it feels clearer and more Pythonic.
Edit: discovered the Pandas solution. If Pandas is available it's clear and performs well.
As of numpy version 1.9.0, np.unique has an argument return_counts which greatly simplifies your task:
u, c = np.unique(a, return_counts=True)
dup = u[c > 1]
This is similar to using Counter, except you get a pair of arrays instead of a mapping. I'd be curious to see how they perform relative to each other.
It's probably worth mentioning that even though np.unique is quite fast in practice due to its numpyness, it has worse algorithmic complexity than the Counter solution. np.unique is sort-based, so runs asymptotically in O(n log n) time. Counter is hash-based, so has O(n) complexity. This will not matter much for anything but the largest datasets.
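If you want to see how the two compare on your own data, here's a rough timing sketch (numbers will vary with data size and machine; the helper function names are just for this comparison):
import timeit
import numpy as np
from collections import Counter
a = np.random.randint(10**7, size=10**7)
def dupes_unique(a):
    # sort-based: O(n log n)
    u, c = np.unique(a, return_counts=True)
    return u[c > 1]
def dupes_counter(a):
    # hash-based: O(n), but with Python-level overhead per element
    return [item for item, count in Counter(a).items() if count > 1]
print(timeit.timeit(lambda: dupes_unique(a), number=1))
print(timeit.timeit(lambda: dupes_counter(a), number=1))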
I think this is most clearly done outside of numpy. You'll have to time it against your numpy solutions if you are concerned with speed.
>>> import numpy as np
>>> from collections import Counter
>>> a = np.array([1, 2, 1, 3, 3, 3, 0])
>>> [item for item, count in Counter(a).items() if count > 1]
[1, 3]
note: This is similar to Burhan Khalid's answer, but the use of items without subscripting in the condition should be faster.
People have already suggested Counter variants, but here's one which doesn't use a listcomp:
>>> from collections import Counter
>>> a = [1, 2, 1, 3, 3, 3, 0]
>>> (Counter(a) - Counter(set(a))).keys()
[1, 3]
[Posted not because it's efficient -- it's not -- but because I think it's cute that you can subtract Counter instances.]
For Python 2.7+
>>> import numpy
>>> from collections import Counter
>>> n = numpy.array([1,1,2,3,3,3,0])
>>> [x[0] for x in Counter(n).most_common() if x[1] > 1]
[3, 1]
Here's another approach using set operations that I think is a bit more straightforward than the ones you offer:
>>> indices = np.setdiff1d(np.arange(len(a)), np.unique(a, return_index=True)[1])
>>> a[indices]
array([1, 3, 3])
I suppose you're asking for numpy-only solutions, since if that's not the case, it's very difficult to argue with just using a Counter instead. I think you should make that requirement explicit though.
If a is made up of small integers you can use numpy.bincount directly:
import numpy as np
a = np.array([3, 2, 2, 0, 4, 3])
counts = np.bincount(a)
print(np.where(counts > 1)[0])
# array([2, 3])
This is very similar to your "histogram" method, which is the one I would use if a was not made up of small integers.
If the array is a sorted numpy array, then just do:
a = np.array([1, 2, 2, 3, 4, 5, 5, 6])
rep_el = a[1:][np.diff(a) == 0]
I'm adding my solution to the pile for this 3-year-old question because none of the existing solutions fit what I wanted, or they used libraries besides numpy. This method finds both the indices of duplicates and the values for distinct sets of duplicates.
import numpy as np
A = np.array([1,2,3,4,4,4,5,6,6,7,8])
# Record the indices where each unique element occurs.
list_of_dup_inds = [np.where(a == A)[0] for a in np.unique(A)]
# Filter out non-duplicates.
list_of_dup_inds = filter(lambda inds: len(inds) > 1, list_of_dup_inds)
for inds in list_of_dup_inds: print(inds, A[inds])
# >> [3 4 5] [4 4 4]
# >> [7 8] [6 6]
>>> import numpy as np
>>> a=np.array([1,2,2,2,2,3])
>>> uniques, uniq_idx, counts = np.unique(a,return_index=True,return_counts=True)
>>> duplicates = a[ uniq_idx[counts>=2] ] # <--- Get duplicates
If you also want to get the orphans:
>>> orphans = a[ uniq_idx[counts==1] ]
Combination of Pandas and NumPy (utilizing value_counts()):
import pandas as pd
import numpy as np
arr=np.array(('a','b','b','c','a'))
pd.Series(arr).value_counts()
OUTPUT:
a 2
b 2
c 1
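If you only want the duplicated values out of that, here is a sketch building on the same counts:
import numpy as np
import pandas as pd
arr = np.array(('a', 'b', 'b', 'c', 'a'))
counts = pd.Series(arr).value_counts()
# Keep only the values that occur more than once.
dup = counts[counts > 1].index.tolist()
print(dup)  # ['a', 'b'] (order of ties may vary)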
What is the best way to improve this code:
def my_func(x, y):
    ... do smth ...
    return cmp(x', y')
my_list = range(0, N)
my_list.sort(cmp=my_func)
A Python list takes a lot more memory than a NumPy array (6800 MB vs 700 MB),
but numpy's sort doesn't accept a cmp argument.
Are there other ways to improve memory usage, or to sort a NumPy array with my cmp function?
Update: my current solution is a C function (exposed to Python via SWIG) that sorts a huge array of integers and returns it to Python after sorting.
But I hope there is some way to implement memory-efficient sorting of huge datasets in Python. Any ideas?
If you can write a ufunc that converts your array (i.e. computes a sort key for every element), you can do a fast sort with argsort:
b = convert(a)
idx = np.argsort(b)
sort_a = a[idx]
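For example, with a made-up key (the question doesn't say what my_func actually compares, so take the last decimal digit as a stand-in):
import numpy as np
a = np.random.randint(0, 1000, size=10)
# Hypothetical key: sort by the last decimal digit. The key computation is
# vectorized, so there is no Python-level loop.
b = a % 10
idx = np.argsort(b)
sort_a = a[idx]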
As an alternative you can use the built-in sorted with a NumPy array (note that the cmp argument only exists in Python 2):
>>> a = np.arange(10, 1, -1)
>>> sorted(a, cmp=lambda a,b: cmp(a,b))
[2, 3, 4, 5, 6, 7, 8, 9, 10]
It is not in-place, so you have 1400 MB compared to 6800 MB.
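In Python 3 the rough equivalent (still not in-place) goes through functools.cmp_to_key; a sketch, since Python 3 also has no built-in cmp:
import functools
import numpy as np
def my_cmp(x, y):
    # Stand-in comparison function; replace with whatever my_func should do.
    return int(x > y) - int(x < y)
a = np.arange(10, 1, -1)
result = sorted(a, key=functools.cmp_to_key(my_cmp))
# result is sorted ascending: 2, 3, ..., 10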
Say I have an array with a couple hundred elements. I need to iterate over the array and replace one or more items with some other item. Which strategy is more efficient in Python in terms of speed (I'm not worried about memory)?
For example: I have an array
my_array = [1,2,3,4,5,6]
I want to replace the first 3 elements with a single element whose value is 123.
Option 1 (inline):
my_array = [1,2,3,4,5,6]
my_array.remove(0,3)
my_array.insert(0,123)
Option 2 (new array creation):
my_array = [1,2,3,4,5,6]
my_array = my_array[3:]
my_array.insert(0,123)
Both of the above options will give a result of:
>>> [123,4,5,6]
Any comments would be appreciated, especially if there are options I have missed.
If you want to replace an item or a set of items in a list, you should never use your first option. Removing from and inserting into the middle of a list is slow, since each such operation is O(n). Your second option is also fairly inefficient, since you're doing two operations for a single replacement.
Instead, just do slice assignment, as eiben's answer instructs. This will be significantly faster and more efficient than either of your methods:
>>> my_array = [1,2,3,4,5,6]
>>> my_array[:3] = [123]
>>> my_array
[123, 4, 5, 6]
arr[0] = x
replaces the 0th element with x. You can also replace whole slices.
>>> arr = [1, 2, 3, 4, 5, 6]
>>> arr[0:3] = [8, 9, 99]
>>> arr
[8, 9, 99, 4, 5, 6]
>>>
And generally it's unclear what you're trying to achieve. Please provide more information or an example.
OK, as for your update: the remove method doesn't work there (list.remove takes a single value to remove, not a range of indices). But the slicing I presented works for your case too:
>>> arr
[8, 9, 99, 4, 5, 6]
>>> arr[0:3] = [4]
>>> arr
[4, 4, 5, 6]
I would guess it's the fastest method, but do try it with timeit. According to my tests it's twice as fast as your "new array" approach.
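A quick way to check this on your own machine (a sketch; both statements copy the list first so the comparison is fair, and absolute numbers will differ):
import timeit
setup = "base = [1, 2, 3, 4, 5, 6]"
# Slice assignment vs. slicing plus insert ("new array" approach).
print(timeit.timeit("a = list(base); a[:3] = [123]", setup=setup))
print(timeit.timeit("a = list(base)[3:]; a.insert(0, 123)", setup=setup))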
If you're looking for speed efficiency and are manipulating sequences of integers, you should use the standard array module instead:
>>> import array
>>> my_array = array.array('i', [1,2,3,4,5,6])
>>> my_array = my_array[3:]
>>> my_array.insert(0,123)
>>> my_array
array('i', [123, 4, 5, 6])
The key thing is to avoid moving large numbers of list items more than you absolutely have to. Slice assignment, as far as I'm aware, still involves moving the items after the slice around, which is bad news.
How do you recognise when you have a sequence of items which need to be replaced? I'll assume you have a function like:
def replacement(objects, startIndex):
    "returns a pair (numberOfObjectsToReplace, replacementObject), or None if there should be no replacement"
I'd then do:
def replaceAll(objects):
    src = 0
    dst = 0
    while src < len(objects):
        replacementInfo = replacement(objects, src)
        if replacementInfo is not None:
            numberOfObjectsToReplace, replacementObject = replacementInfo
        else:
            numberOfObjectsToReplace = 1
            replacementObject = objects[src]
        objects[dst] = replacementObject
        src = src + numberOfObjectsToReplace
        dst = dst + 1
    del objects[dst:]
This code still does a few more loads and stores than it absolutely has to, but not many.
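To make this concrete, here's a toy usage (the replacement rule is purely hypothetical, just to show the calling convention):
def replacement(objects, startIndex):
    # Hypothetical rule: collapse the run 1, 2, 3 into a single 123.
    if objects[startIndex:startIndex + 3] == [1, 2, 3]:
        return (3, 123)
    return None
my_array = [1, 2, 3, 4, 5, 6]
replaceAll(my_array)
print(my_array)  # [123, 4, 5, 6]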
I'm sure there's a nice way to do this in Python, but I'm pretty new to the language, so forgive me if this is an easy one!
I have a list, and I'd like to pick out certain values from that list. The values I want to pick out are the ones whose indexes in the list are specified in another list.
For example:
indexes = [2, 4, 5]
main_list = [0, 1, 9, 3, 2, 6, 1, 9, 8]
the output would be:
[9, 2, 6]
(i.e., the elements with indexes 2, 4 and 5 from main_list).
I have a feeling this should be doable using something like list comprehensions, but I can't figure it out (in particular, I can't figure out how to access the index of an item when using a list comprehension).
[main_list[x] for x in indexes]
This will return a list of the objects, using a list comprehension.
t = []
for i in indexes:
    t.append(main_list[i])
# t is now [9, 2, 6]
list(map(lambda x: main_list[x], indexes))
If you're good with numpy:
import numpy as np
main_array = np.array(main_list) # converting to numpy array
out_array = main_array.take([2, 4, 5])
out_list = out_array.tolist() # if you want a list specifically
I think Yuval A's solution is pretty clear and simple. But if you actually want a one-line list comprehension:
[e for i, e in enumerate(main_list) if i in indexes]
As an alternative to a list comprehension, you can use map with list.__getitem__. For large lists you should see better performance:
import random
n = 10**7
L = list(range(n))
idx = random.sample(range(n), int(n/10))
x = [L[x] for x in idx]
y = list(map(L.__getitem__, idx))
assert all(i==j for i, j in zip(x, y))
%timeit [L[x] for x in idx] # 474 ms per loop
%timeit list(map(L.__getitem__, idx)) # 417 ms per loop
For a lazy iterator, you can just use map(L.__getitem__, idx). Note that in Python 2.7, map returns a list, so there is no need to pass it to list.
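Another standard-library option (not benchmarked above, just a sketch) is operator.itemgetter, which fetches several indices in one call:
from operator import itemgetter
indexes = [2, 4, 5]
main_list = [0, 1, 9, 3, 2, 6, 1, 9, 8]
# itemgetter(*indexes) builds a callable that pulls those positions at once;
# with more than one index it returns a tuple.
picked = list(itemgetter(*indexes)(main_list))
print(picked)  # [9, 2, 6]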
I have noticed that there are two candidate ways to do this job: either by a loop/list comprehension or by converting to np.array. I timed both methods, and the results show that when the dataset is large,
[main_list[x] for x in indexes] is about 3-5 times faster than
np.array(...).take().
So if your code is sensitive to computation time, the highest-voted answer is a good choice.