I'm trying to make my code faster by removing some for loops and using arrays. The slowest step right now is the generation of the random lists.
Context: I have a number of mutations in a chromosome, and I want to generate 1000 random "chromosomes" with the same length and the same number of mutations, but with their positions randomized.
Here is what I'm currently running to generate these randomized mutation positions:
import numpy as np

iterations = 1000
Chr_size = 1000000
num_mut = 500

randbps = []
for k in range(iterations):
    listed = np.random.choice(range(Chr_size), num_mut, replace=False)
    randbps.append(listed)
I want to do something similar to what they cover in this question
np.random.choice(range(Chr_size),size=(num_mut,iterations),replace=False)
however, "without replacement" then applies to the array as a whole rather than to each row separately.
Further context: later in the script I go through each randomized chromosome and count the number of mutations in a given window:
for l in range(len(randbps)):
    arr = np.asarray(randbps[l])
    for i in range(chr_last_window[f])[::step]:
        counter = ((i < arr) & (arr < i + window)).sum()
I don't know how np.random.choice is implemented, but I am guessing it is optimized for the general case. Your numbers, on the other hand, make repeated draws unlikely (500 picks out of 1,000,000 positions), so building sets from scratch may be more efficient for this case:
import random
import numpy as np

def gen_2d(iterations, Chr_size, num_mut):
    randbps = set()
    while len(randbps) < iterations:
        listed = set()
        while len(listed) < num_mut:
            listed.add(random.choice(range(Chr_size)))
        randbps.add(tuple(sorted(listed)))
    return np.array(list(randbps))
This function starts with an empty set, generates a single number in range(Chr_size) and adds it to the set. Because of the properties of sets, it cannot add the same number twice. It does the same thing for randbps as well, so each element of randbps is also unique.
Timing a single iteration of np.random.choice vs gen_2d:
iterations=1000
Chr_size=1000000
num_mut=500
%timeit np.random.choice(range(Chr_size),num_mut,replace=False)
10 loops, best of 3: 141 ms per loop
%timeit gen_2d(1, Chr_size, num_mut)
1000 loops, best of 3: 647 µs per loop
Based on the trick used in this solution, here's an approach that uses argsort/argpartition on an array of random elements to simulate numpy.random.choice without replacement to give us randbps as a 2D array -
np.random.rand(iterations,Chr_size).argpartition(num_mut)[:,:num_mut]
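Each row of the result holds num_mut distinct positions drawn uniformly from range(Chr_size); they are simply in arbitrary order within the row, so sort along the rows afterwards if you need them ordered. One caveat (my observation, not part of the original answer): at the full problem size the intermediate random matrix is iterations x Chr_size = 1000 x 1,000,000 float64 values, i.e. roughly 8 GB, so you may need to generate it in batches. A minimal follow-up, assuming the one-liner above is stored in randbps:
randbps = np.random.rand(iterations, Chr_size).argpartition(num_mut)[:, :num_mut]
randbps_sorted = np.sort(randbps, axis=1)   # sort positions within each chromosome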
Runtime test -
In [2]: def original_app(iterations,Chr_size,num_mut):
...:     randbps=[]
...:     for k in range(iterations):
...:         listed=np.random.choice(range(Chr_size),num_mut,replace=False)
...:         randbps.append(listed)
...:     return randbps
...:
In [3]: # Input params (scaled down version of params listed in question)
...: iterations=100
...: Chr_size=100000
...: num=50
...:
In [4]: %timeit original_app(iterations,Chr_size,num)
1 loops, best of 3: 1.53 s per loop
In [5]: %timeit np.random.rand(iterations,Chr_size).argpartition(num)[:,:num]
1 loops, best of 3: 424 ms per loop
I want to determine whether or not my list (actually a numpy.ndarray) contains duplicates in the fastest possible execution time. Note that I don't care about removing the duplicates, I simply want to know if there are any.
Note: I'd be extremely surprised if this is not a duplicate, but I've tried my best and can't find one. Closest are this question and this question, both of which are requesting that the unique list be returned.
Here are the four ways I thought of doing it.
TL;DR: if you expect very few (less than 1/1000) duplicates:
def contains_duplicates(X):
    return len(np.unique(X)) != len(X)
If you expect frequent (more than 1/1000) duplicates:
def contains_duplicates(X):
    seen = set()
    seen_add = seen.add
    for x in X:
        if (x in seen or seen_add(x)):
            return True
    return False
The first method is an early exit from this answer, which wants to return the unique values, and the second is the same idea applied to this answer.
>>> import numpy as np
>>> X = np.random.normal(0,1,[10000])
>>> def terhorst_early_exit(X):
...:     elems = set()
...:     for i in X:
...:         if i in elems:
...:             return True
...:         elems.add(i)
...:     return False
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 10.6 ms per loop
>>> def peterbe_early_exit(X):
...:     seen = set()
...:     seen_add = seen.add
...:     for x in X:
...:         if (x in seen or seen_add(x)):
...:             return True
...:     return False
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 9.35 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 4.54 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 967 µs per loop
Do things change if you start with an ordinary Python list, and not a numpy.ndarray?
>>> X = X.tolist()
>>> %timeit terhorst_early_exit(X)
100 loops, best of 3: 9.34 ms per loop
>>> %timeit peterbe_early_exit(X)
100 loops, best of 3: 8.07 ms per loop
>>> %timeit len(set(X)) != len(X)
100 loops, best of 3: 3.09 ms per loop
>>> %timeit len(np.unique(X)) != len(X)
1000 loops, best of 3: 1.83 ms per loop
Edit: what if we have a prior expectation of the number of duplicates?
The above comparison is functioning under the assumption that a) there are likely to be no duplicates, or b) we're more worried about the worst case than the average case.
>>> X = np.random.normal(0, 1, [10000])
>>> for n_duplicates in [1, 10, 100]:
>>>     print("{} duplicates".format(n_duplicates))
>>>     duplicate_idx = np.random.choice(len(X), n_duplicates, replace=False)
>>>     X[duplicate_idx] = 0
>>>     print("terhost_early_exit")
>>>     %timeit terhorst_early_exit(X)
>>>     print("peterbe_early_exit")
>>>     %timeit peterbe_early_exit(X)
>>>     print("set length")
>>>     %timeit len(set(X)) != len(X)
>>>     print("numpy unique length")
>>>     %timeit len(np.unique(X)) != len(X)
1 duplicates
terhost_early_exit
100 loops, best of 3: 12.3 ms per loop
peterbe_early_exit
100 loops, best of 3: 9.55 ms per loop
set length
100 loops, best of 3: 4.71 ms per loop
numpy unique length
1000 loops, best of 3: 1.31 ms per loop
10 duplicates
terhost_early_exit
1000 loops, best of 3: 1.81 ms per loop
peterbe_early_exit
1000 loops, best of 3: 1.47 ms per loop
set length
100 loops, best of 3: 5.44 ms per loop
numpy unique length
1000 loops, best of 3: 1.37 ms per loop
100 duplicates
terhost_early_exit
10000 loops, best of 3: 111 µs per loop
peterbe_early_exit
10000 loops, best of 3: 99 µs per loop
set length
100 loops, best of 3: 5.16 ms per loop
numpy unique length
1000 loops, best of 3: 1.19 ms per loop
So if you expect very few duplicates, the numpy.unique function is the way to go. As the number of expected duplicates increases, the early exit methods dominate.
Depending on how large your array is, and how likely duplicates are, the answer will be different.
For example, if you expect the average array to have around 3 duplicates, early exit will cut your average-case time (and space) by 2/3rds; if you expect only 1 in 1000 arrays to have any duplicates at all, it will just add a bit of complexity without improving anything.
Meanwhile, if the arrays are big enough that building a temporary set as large as the array is likely to be expensive, sticking a probabilistic test like a bloom filter in front of it will probably speed things up dramatically, but if not, it's again just wasted effort.
Finally, you want to stay within numpy if at all possible. Looping over an array of floats (or whatever) and boxing each one into a Python object is going to take almost as much time as hashing and checking the values, and of course storing things in a Python set instead of optimized numpy storage is wasteful as well. But you have to trade that off against the other issues—you can't do early exit with numpy, and there may be nice C-optimized bloom filter implementations a pip install away but not be any that are numpy-friendly.
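(As an aside, and purely as a sketch of mine rather than anything from the answers here: one middle ground is to scan the array in fixed-size chunks, checking each chunk with numpy and checking across chunks with a set, so a duplicate near the front is found without touching the rest of the array. The chunk size and helper name below are arbitrary.)
import numpy as np

def contains_duplicates_chunked(X, chunk=1000):
    seen = set()
    for start in range(0, len(X), chunk):
        block = X[start:start + chunk]
        # duplicates inside this block (vectorized check)
        if len(np.unique(block)) != len(block):
            return True
        # duplicates against earlier blocks
        block_set = set(block.tolist())
        if seen & block_set:
            return True
        seen.update(block_set)
    return False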
So, there's no one best solution for all possible scenarios.
Just to give an idea of how easy it is to write a bloom filter, here's one I hacked together in a couple minutes:
from bitarray import bitarray  # pip3 install bitarray

def dupcheck(X):
    # Hardcoded values to give about 5% false positives for 10000 elements
    size = 62352
    hashcount = 4
    bits = bitarray(size)
    bits.setall(0)
    def check(x, hash=hash):  # TODO: default-value bits, hashcount, size?
        for i in range(hashcount):
            if not bits[hash((x, i)) % size]: return False
        return True
    def add(x):
        for i in range(hashcount):
            bits[hash((x, i)) % size] = True
    seen = set()
    seen_add = seen.add
    for x in X:
        if check(x) or add(x):
            if x in seen or seen_add(x):
                return True
    return False
This only uses 12KB (a 62352-bit bitarray plus a 500-float set) instead of 80KB (a 10000-float set or np.array). Which doesn't matter when you're only dealing with 10K elements, but with, say, 10B elements that use up more than half of your physical RAM, it would be a different story.
Of course it's almost certainly going to be an order of magnitude or so slower than using np.unique, or maybe even set, because we're doing all that slow looping in Python. But if this turns out to be worth doing, it should be a breeze to rewrite in Cython (and to directly access the numpy array without boxing and unboxing).
My timing tests differ from Scott's for small lists. Using Python 3.7.3, set() is much faster than np.unique for a small numpy array from randint (length 8), but np.unique is faster for a larger array (length 1000).
Length 8
Timing test iterations: 10000
Function Min Avg Sec Conclusion p-value
---------- --------- ----------- ------------ ---------
set_len 0 7.73486e-06 Baseline
unique_len 9.644e-06 2.55573e-05 Slower 0
Length 1000
Timing test iterations: 10000
Function Min Avg Sec Conclusion p-value
---------- ---------- ----------- ------------ ---------
set_len 0.00011066 0.000270466 Baseline
unique_len 4.3684e-05 8.95608e-05 Faster 0
Then I tried my own implementation, but I think it would require optimized C code to beat set:
def check_items(key_rand, **kwargs):
    # Brute-force pairwise comparison; returns as soon as a duplicate is found.
    for i, vali in enumerate(key_rand):
        for j in range(i + 1, len(key_rand)):
            valj = key_rand[j]
            if vali == valj:
                return True
    return False
Length 8
Timing test iterations: 10000
Function Min Avg Sec Conclusion p-value
----------- ---------- ----------- ------------ ---------
set_len 0 6.74221e-06 Baseline
unique_len 0 2.14604e-05 Slower 0
check_items 1.1138e-05 2.16369e-05 Slower 0
(using my randomized compare_time() function from easyinfo)
What is the fastest way to check if a set contains at least one number within a given range?
For example, setA = {1, 4, 7, 9, 10}, lowerRange=6, upperRange=8 will return True because of 7.
Currently I am using:
filtered = filter(lambda x: lowerRange<=x<=upperRange,setA)
Then, if filtered is not empty, it returns True.
Assuming that setA can be a very large set, is this the optimal solution? Or is this iterating through the entire setA?
Since the membership check is approximately O(1) for sets, you can use a generator expression within the any() built-in function:
rng = range(6, 9)
any(i in setA for i in rng)
Note that for a short range you'll get better performance with set.intersection():
In [2]: a = {1,4,7,9,10}
In [3]: rng = range(6, 9)
In [8]: %timeit bool(a.intersection(rng))
1000000 loops, best of 3: 344 ns per loop
In [9]: %timeit any(i in a for i in rng)
1000000 loops, best of 3: 620 ns per loop
But for longer ranges you'd definitely want to go with any():
In [10]: rng = range(6, 9000)
In [11]: %timeit any(i in a for i in rng)
1000000 loops, best of 3: 620 ns per loop
In [12]: %timeit bool(a.intersection(rng))
1000 loops, best of 3: 233 µs per loop
Note that the reason any() performs better here is that it returns True as soon as it encounters an item that exists in your set, and since a matching number (7) appears right at the start of the range, any() finishes very quickly. Also, as mentioned in the comments, a more Pythonic way to check whether any element of an iterable is in a set is the isdisjoint() method. Here is a benchmark with this method for a small range:
In [26]: %timeit not a.isdisjoint(rng)
1000000 loops, best of 3: 153 ns per loop
In [27]: %timeit any(i in a for i in rng)
1000000 loops, best of 3: 609 ns per loop
And here is a benchmark with a longer range, which again shows that isdisjoint() performs much better:
In [29]: rng = range(8, 1000)
In [30]: %timeit any(i in a for i in rng)
1000000 loops, best of 3: 595 ns per loop
In [31]: %timeit not a.isdisjoint(rng)
10000000 loops, best of 3: 142 ns per loop
The fastest way is to work with a sorted list or tuple instead of a set. That way you can do the range searches using the bisect module.
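A minimal sketch of that idea, assuming the values are kept in a sorted list (the helper name is mine):
from bisect import bisect_left

def has_in_range(sorted_vals, lower, upper):
    # Find the first element >= lower; if it exists and is also <= upper,
    # then something falls inside [lower, upper].
    i = bisect_left(sorted_vals, lower)
    return i < len(sorted_vals) and sorted_vals[i] <= upper

vals = sorted({1, 4, 7, 9, 10})
print(has_in_range(vals, 6, 8))   # True, because of 7
Each query is then O(log n) instead of scanning the whole collection.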
Unless you plan to use those values, using the filter function is unnecessary, because it stores data that you won't end up using. It also keeps going even after it finds one that fits the criteria, slowing you down quite a bit.
My solution would have been to write and use the following function.
def check(list, lower, upper):
    for i in list:
        if i >= lower and i <= upper:
            return True
    return False
Like #Kasramvd's answer and your own idea, this is a brute-force search (an O(n) solution). That's impossible to beat unless there are some constraints on the data beforehand, such as it being sorted.
I know this subject is well discussed, but I've come across a case where I don't really understand why the recursive method is "slower" than a method using reduce, lambda and xrange.
def factorial2(x, rest=1):
    if x <= 1:
        return rest
    else:
        return factorial2(x-1, rest*x)

def factorial3(x):
    if x <= 1:
        return 1
    return reduce(lambda a, b: a*b, xrange(1, x+1))
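For reference, reduce just folds the multiplication left to right over the xrange, e.g.:
reduce(lambda a, b: a*b, xrange(1, 5))   # evaluates as ((1*2)*3)*4 == 24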
I know Python doesn't optimize tail recursion, so the question isn't about that. To my understanding, a generator should still generate n numbers using the +1 operator, so technically fact(n) should add a number n times just like the recursive one does. The lambda in the reduce will be called n times, just like the recursive method... So since we don't have tail call optimization in either case, stack frames will be created/destroyed and returned from n times. And an if in the generator should check when to raise a StopIteration exception.
This makes me wonder why the recursive method is still slower than the other one, since the recursive one uses simple arithmetic and doesn't use generators.
In a test, I replaced rest*x with just x in the recursive method, and the time spent dropped to be on par with the method using reduce.
Here are my timings for fact(400), 1000 times
factorial3 : 1.22370505333
factorial2 : 1.79896998405
Edit:
Making the method count up from 1 to n instead of down from n to 1 doesn't help either, so it's not overhead from the -1.
Also, can we make the recursive method faster? I tried multiple things, like global variables that I can change... using a mutable context by placing variables in cells (like an array) that I can modify while keeping the recursive method parameterless, and passing the function used for recursion as a parameter so we don't have to "dereference" it in our scope... But nothing makes it faster.
I'll point out that I have a version of the factorial that uses a for loop and is much faster than both of these methods, so there is clearly room for improvement, but I wouldn't expect anything faster than the for loop.
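(For reference, a for-loop version of the kind mentioned here could look like the following; the exact code isn't shown in the question.)
def factorial_loop(x):
    result = 1
    for i in xrange(2, x + 1):   # range() on Python 3
        result *= i
    return result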
The slowness of the recursive version comes from the need to resolve the named (keyword) arguments on each call. I have provided a different recursive implementation that has only one argument, and it works slightly faster.
$ cat fact.py
def factorial_recursive1(x):
    if x <= 1:
        return 1
    else:
        return factorial_recursive1(x-1)*x

def factorial_recursive2(x, rest=1):
    if x <= 1:
        return rest
    else:
        return factorial_recursive2(x-1, rest*x)

def factorial_reduce(x):
    if x <= 1:
        return 1
    return reduce(lambda a, b: a*b, xrange(1, x+1))

# Ignore the rest of the code for now, we'll get back to it later in the answer
def range_prod(a, b):
    if a + 1 < b:
        c = (a+b)//2
        return range_prod(a, c) * range_prod(c, b)
    else:
        return a

def factorial_divide_and_conquer(n):
    return 1 if n <= 1 else range_prod(1, n+1)
$ ipython -i fact.py
In [1]: %timeit factorial_recursive1(400)
10000 loops, best of 3: 79.3 µs per loop
In [2]: %timeit factorial_recursive2(400)
10000 loops, best of 3: 90.9 µs per loop
In [3]: %timeit factorial_reduce(400)
10000 loops, best of 3: 61 µs per loop
Since in your example very large numbers are involved, I initially suspected that the performance difference might be due to the order of multiplication. Multiplying a large partial product by the next number on every iteration takes time proportional to the number of digits/bits in the product, so the time complexity of such a method is O(n²), where n is the number of bits in the final product. Instead it is better to use a divide-and-conquer technique, where the final result is obtained as the product of two approximately equally long values, each of which is computed recursively in the same manner. So I implemented that version too (see factorial_divide_and_conquer(n) in the above code). As you can see below, it still loses to the reduce()-based version for small arguments (due to the same problem with named parameters), but it outperforms it for large arguments.
In [4]: %timeit factorial_divide_and_conquer(400)
10000 loops, best of 3: 90.5 µs per loop
In [5]: %timeit factorial_divide_and_conquer(4000)
1000 loops, best of 3: 1.46 ms per loop
In [6]: %timeit factorial_reduce(4000)
100 loops, best of 3: 3.09 ms per loop
UPDATE
Trying to run the factorial_recursive?() versions with x=4000 hits the default recursion limit, so the limit must be increased:
In [7]: sys.setrecursionlimit(4100)
In [8]: %timeit factorial_recursive1(4000)
100 loops, best of 3: 3.36 ms per loop
In [9]: %timeit factorial_recursive2(4000)
100 loops, best of 3: 7.02 ms per loop
Assume that I have two arrays A and B, where both A and B are m x n. My goal is now, for each row of A and B, to find where I should insert the elements of row i of A in the corresponding row of B. That is, I wish to apply np.digitize or np.searchsorted to each row of A and B.
My naive solution is to simply iterate over the rows. However, this is far too slow for my application. My question is therefore: is there a vectorized implementation of either algorithm that I haven't managed to find?
We can add to each row an offset relative to the previous row, using the same offset for both arrays. The idea is to then use np.searchsorted on flattened versions of the input arrays, so that each row of b is restricted to finding sorted positions in the corresponding row of a. Additionally, to make it work for negative numbers too, we just need to offset by the minimum values as well.
So, we would have a vectorized implementation like so -
def searchsorted2d(a, b):
    m, n = a.shape
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num*np.arange(a.shape[0])[:,None]
    p = np.searchsorted( (a+r).ravel(), (b+r).ravel() ).reshape(m,-1)
    return p - n*(np.arange(m)[:,None])
Runtime test -
In [173]: def searchsorted2d_loopy(a,b):
...:     out = np.zeros(a.shape,dtype=int)
...:     for i in range(len(a)):
...:         out[i] = np.searchsorted(a[i],b[i])
...:     return out
...:
In [174]: # Setup input arrays
...: a = np.random.randint(11,99,(10000,20))
...: b = np.random.randint(11,99,(10000,20))
...: a = np.sort(a,1)
...: b = np.sort(b,1)
...:
In [175]: np.allclose(searchsorted2d(a,b),searchsorted2d_loopy(a,b))
Out[175]: True
In [176]: %timeit searchsorted2d_loopy(a,b)
10 loops, best of 3: 28.6 ms per loop
In [177]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 13.7 ms per loop
The solution provided by #Divakar is ideal for integer data, but beware of precision issues for floating point values, especially if they span multiple orders of magnitude (e.g. [[1.0, 2.0, 3.0, 1.0e+20],...]). In some cases r may be so large that applying a+r and b+r wipes out the original values you're trying to run searchsorted on, and you end up just comparing r to r.
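A quick illustration of the precision point (the numbers are chosen for illustration only):
>>> r = 1.0e20
>>> (1.0 + r) == (3.0 + r)   # the original values are lost in the addition
True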
To make the approach more robust for floating-point data, you could embed the row information into the arrays as part of the values (as a structured dtype), and run searchsorted on these structured dtypes instead.
def searchsorted_2d(a, v, side='left', sorter=None):
    import numpy as np
    # Make sure a and v are numpy arrays.
    a = np.asarray(a)
    v = np.asarray(v)
    # Augment a with row id
    ai = np.empty(a.shape, dtype=[('row', int), ('value', a.dtype)])
    ai['row'] = np.arange(a.shape[0]).reshape(-1, 1)
    ai['value'] = a
    # Augment v with row id
    vi = np.empty(v.shape, dtype=[('row', int), ('value', v.dtype)])
    vi['row'] = np.arange(v.shape[0]).reshape(-1, 1)
    vi['value'] = v
    # Perform searchsorted on the augmented arrays.
    # The row information is embedded in the values, so only the equivalent rows
    # between a and v are considered.
    result = np.searchsorted(ai.flatten(), vi.flatten(), side=side, sorter=sorter)
    # Restore the original shape and decode the searchsorted indices so they apply to the original data.
    result = result.reshape(vi.shape) - vi['row']*a.shape[1]
    return result
Edit: The timing on this approach is abysmal!
In [21]: %timeit searchsorted_2d(a,b)
10 loops, best of 3: 92.5 ms per loop
You would be better off just using map over the array:
In [22]: %timeit np.array(list(map(np.searchsorted,a,b)))
100 loops, best of 3: 13.8 ms per loop
For integer data, #Divakar's approach is still the fastest:
In [23]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 7.26 ms per loop