Using a boolean Mask on large numpy array is very slow - python

I have a performance issue when coding with Python.
Let's say I have a very large (N, 2) array of strings, indices, and a length-N float array, costs, with N = 10,000,000, and two string variables label_a and label_b. Here is the code:
import numpy as np
import time
indices = np.array([np.random.choice(np.arange(5000).astype(str),size=10000000),np.random.choice(np.arange(5000).astype(str),size=10000000)]).T
costs = np.random.uniform(size=10000000)
label_a = '2'
label_b = '9'
t0 = time.time()
costs = costs[(indices[:,0]!=label_a)*(indices[:,0]!=label_b)*(indices[:,1]!=label_a)*(indices[:,1]!=label_b)]
indices = indices[(indices[:,0]!=label_a)*(indices[:,0]!=label_b)*(indices[:,1]!=label_a)*(indices[:,1]!=label_b)]
t1 = time.time()
toseq = t1-t0
print(toseq)
The above code segment takes 3 seconds every time it's run. I would like to achieve the same thing while reducing the computing cost.
I am using a boolean mask to retrieve only the rows of the costs and indices arrays whose values are not label_a or label_b.

As indicated in the comments, computing the masks you're after only once, and combining them only once, would save time.
(I've also changed the way of timing, just for brevity - the results are the same)
import numpy as np
from timeit import timeit

r = 5000
n = 10000000

indices = np.array([
    np.random.choice(np.arange(r).astype(str), size=n),
    np.random.choice(np.arange(r).astype(str), size=n)
]).T
costs = np.random.uniform(size=n)

label_a = '2'
label_b = '9'

n_indices = np.array([
    np.random.choice(np.arange(r), size=n),
    np.random.choice(np.arange(r), size=n)
]).T

def run():
    global indices
    global costs
    _ = costs[(indices[:, 0] != label_a)*(indices[:, 0] != label_b) *
              (indices[:, 1] != label_a)*(indices[:, 1] != label_b)]
    _ = indices[(indices[:, 0] != label_a)*(indices[:, 0] != label_b) *
                (indices[:, 1] != label_a)*(indices[:, 1] != label_b)]

def run_faster():
    global indices
    global costs
    # compute these only once
    not_a0 = indices[:, 0] != label_a
    not_b0 = indices[:, 0] != label_b
    not_a1 = indices[:, 1] != label_a
    not_b1 = indices[:, 1] != label_b
    _ = costs[not_a0 * not_b0 * not_a1 * not_b1]
    _ = indices[not_a0 * not_b0 * not_a1 * not_b1]

def run_even_faster():
    global indices
    global costs
    # also combine them only once
    cond = ((indices[:, 0] != label_a) * (indices[:, 0] != label_b) *
            (indices[:, 1] != label_a) * (indices[:, 1] != label_b))
    _ = costs[cond]
    _ = indices[cond]

def run_sep_mask():
    global indices
    global costs
    global cond
    # just the masking part of run_even_faster
    cond = ((indices[:, 0] != label_a) * (indices[:, 0] != label_b) *
            (indices[:, 1] != label_a) * (indices[:, 1] != label_b))

def run_sep_index():
    global indices
    global costs
    global cond
    # just the indexing part of run_even_faster
    _ = costs[cond]
    _ = indices[cond]

def run_even_faster_numerical():
    global indices
    global costs
    # use int values and n_indices instead of indices
    a = int(label_a)
    b = int(label_b)
    cond = ((n_indices[:, 0] != a) * (n_indices[:, 0] != b) *
            (n_indices[:, 1] != a) * (n_indices[:, 1] != b))
    _ = costs[cond]
    _ = indices[cond]

def run_all(funcs):
    for f in funcs:
        print('{:.4f} : {}()'.format(timeit(f, number=1), f.__name__))

run_all([run, run_faster, run_even_faster, run_sep_mask, run_sep_index, run_even_faster_numerical])
Note that I also added an example where the operation is not based on strings but on numbers instead. If you can avoid string values and work with numbers, you get a performance boost as well.
This boost becomes substantial when the labels get longer - in the end it might even be worth converting the strings to numbers before the filtering, if the strings are long enough.
These are my results:
0.9711 : run()
0.7065 : run_faster()
0.6983 : run_even_faster()
0.2657 : run_sep_mask()
0.4174 : run_sep_index()
0.4536 : run_even_faster_numerical()
The two sep entries show that, for run_even_faster, the indexing takes about twice as long as building the mask, so there is only so much improvement to be gained from tuning the mask further.
However, they also show that building the mask from integers adds less than 0.04 seconds on top of the actual indexing, compared to about 0.26 seconds for the string-based mask. So that's the room you have for improvement.
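For instance, here is a minimal sketch of that conversion-first approach (my own illustration, assuming the filter is applied repeatedly so the one-time astype cost is amortized):

# one-time conversion of the string data to integers
int_indices = indices.astype(np.int64)
a, b = int(label_a), int(label_b)

# build the mask on the integer view, then index all arrays with it
cond = ((int_indices[:, 0] != a) & (int_indices[:, 0] != b) &
        (int_indices[:, 1] != a) & (int_indices[:, 1] != b))
costs = costs[cond]
indices = indices[cond]
int_indices = int_indices[cond]  # keep the integer view in sync for the next filter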

Related

Implementing Smith-Waterman algorithm for local alignment in python

I have created a sequence alignment tool to compare two strands of DNA (X and Y) to find the best alignment of substrings from X and Y. The algorithm is summarized here (https://en.wikipedia.org/wiki/Smith–Waterman_algorithm). I have been able to generate a list of lists, filling them all with zeros, to represent my matrix. I created a scoring algorithm to return a numerical score for each kind of alignment between bases (e.g. plus 4 for a match). Then I created an alignment algorithm that should put a score in each coordinate of my "matrix". However, when I go to print the matrix, it only returns the original with all zeros (rather than actual scores).
I know there are other methods of implementing this method (with numpy for example), so could you please tell me why this specific code (below) does not work? Is there a way to modify it, so that it does work?
code:
def zeros(X: int, Y: int):
    lenX = len(X) + 1
    lenY = len(Y) + 1
    matrix = []
    for i in range(lenX):
        matrix.append([0] * lenY)

    def score(X, Y):
        if X[n] == Y[m]: return 4
        if X[n] == '-' or Y[m] == '-': return -4
        else: return -2

    def SmithWaterman(X, Y, score):
        for n in range(1, len(X) + 1):
            for m in range(1, len(Y) + 1):
                align = matrix[n-1, m-1] + (score(X[n-1], Y[m-1]))
                indelX = matrix[n-1, m] + (score(X[n-1], Y[m]))
                indelY = matrix[n, m-1] + (score(X[n], Y[m-1]))
        matrix[n, m] = max(align, indelX, indelY, 0)

    print(matrix)

zeros("ACGT", "ACGT")
output:
[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
The reason it's just printing out the zeroed out matrix is that the SmithWaterman function is never called, so the matrix is never updated.
You would need to do something like
# ...
SmithWaterman(X, Y, score)
print(matrix)
# ...
However, if you do this, you will find that this code is actually quite broken in many other ways. I've gone through and annotated some of the syntax errors and other issues with the code:
def zeros(X: int, Y: int):
#            ^       ^ incorrect type annotations. should be str
    lenX = len(X) + 1
    lenY = len(Y) + 1
    matrix = []
    for i in range(lenX):
        matrix.append([0] * lenY)
    # A more "pythonic" way of expressing the above would be:
    # matrix = [[0] * (len(Y) + 1) for _ in range(len(X) + 1)]

    def score(X, Y):
    #         ^  ^ shadowing variables from outer scope. this is not a bug per se but it's considered bad practice
        if X[n] == Y[m]: return 4
        #    ^       ^ variables not defined in scope
        if X[n] == '-' or Y[m] == '-': return -4
        #    ^            ^ variables not defined in scope
        else: return -2

    def SmithWaterman(X, Y, score):  # this function is never called
    #                       ^ unnecessary function passed as parameter. function is defined in scope
        for n in range(1, len(X) + 1):
            for m in range(1, len(Y) + 1):
                align = matrix[n-1, m-1] + (score(X[n-1], Y[m-1]))
                #             ^ invalid list lookup. should be: matrix[n-1][m-1]
                indelX = matrix[n-1, m] + (score(X[n-1], Y[m]))
                #                                         ^ out of bounds error when m == len(Y)
                indelY = matrix[n, m-1] + (score(X[n], Y[m-1]))
                #                                  ^ out of bounds error when n == len(X)
        matrix[n, m] = max(align, indelX, indelY, 0)
        # this should be nested in the inner for-loop. m, n, indelX, and indelY are not defined in scope here

    print(matrix)

zeros("ACGT", "ACGT")

How to make nested list behave like numpy array?

I'm trying to implement an algorithm to count subsets with a given sum in Python:
import numpy as np

maxN = 20
maxSum = 1000
minSum = 1000
base = 1000
dp = np.zeros((maxN, maxSum + minSum))
v = np.zeros((maxN, maxSum + minSum))

# Function to return the required count
def findCnt(arr, i, required_sum, n):
    # Base case
    if i == n:
        if required_sum == 0:
            return 1
        else:
            return 0
    # If the state has been solved before,
    # return the value of the state
    if v[i][required_sum + base]:
        return dp[i][required_sum + base]
    # Setting the state as solved
    v[i][required_sum + base] = 1
    # Recurrence relation
    dp[i][required_sum + base] = (findCnt(arr, i + 1, required_sum, n) +
                                  findCnt(arr, i + 1, required_sum - arr[i], n))
    return dp[i][required_sum + base]

arr = [2, 2, 2, 4]
n = len(arr)
k = 4
print(findCnt(arr, 0, k, n))
And it gives the expected result, but I was asked not to use numpy, so I replaced the numpy arrays with nested lists like this:
#dp = np.zeros((maxN, maxSum + minSum)) replaced by
dp = [[0]*(maxSum + minSum)]*maxN
#v = np.zeros((maxN, maxSum + minSum)) replaced by
v = [[0]*(maxSum + minSum)]*maxN
but now the program always outputs 0. I think this is because of some behavioral difference between numpy arrays and nested lists, but I don't know how to fix it.
EDIT:
Thanks to @venky__ who provided this solution in the comments:
[[0 for i in range(maxSum + minSum)] for i in range(maxN)]
and it worked, but I still don't understand the difference between it and what I was doing before. I tried:
print([[0 for i in range(maxSum + minSum)] for i in range(maxN)] == [[0]*(maxSum + minSum)]*maxN)
and the result is True, so how was this able to fix the problem?
It turns out that I was using nested lists the wrong way to represent 2D arrays: Python was not creating separate row objects, so every entry of the outer list referred to the same inner list, and writing to one row changed all of them. For a better explanation, please read this.
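A quick sketch of that aliasing with toy sizes:

rows = [[0] * 3] * 2                 # both entries are the SAME inner list
rows[0][0] = 1
print(rows)                          # [[1, 0, 0], [1, 0, 0]] -- every "row" changed

safe = [[0] * 3 for _ in range(2)]   # a fresh inner list per row
safe[0][0] = 1
print(safe)                          # [[1, 0, 0], [0, 0, 0]]

# == compares values, not identity, which is why the equality check printed True:
print([[0] * 3] * 2 == [[0] * 3 for _ in range(2)])   # True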

How can this function be vectorized?

I have a NumPy array with the following properties:
shape: (9986080, 2)
dtype: np.float32
I have a method that loops over the rows of the array, performs an operation, and writes the result into a new array:
def foo(arr):
    new_arr = np.empty(len(arr), dtype=np.uint64)
    for i in range(len(arr)):
        x, y = arr[i]
        e, n = '', ''
        if x < 0:
            e = '1'
        else:
            e = '2'
        if y > 0:
            n = '3'
        else:
            n = '4'
        new_arr[i] = int(f'{abs(x)}{e}{abs(y)}{n}'.replace('.', ''))
    return new_arr
I agree with Iguananaut's comment that this data structure seems a bit odd. My biggest problem with it is that it is really tricky to vectorize putting numbers together into a string and then converting that back to an integer. Still, this will certainly help speed up the function:
def foo(arr):
    x_values = arr[:, 0]
    y_values = arr[:, 1]
    ones = np.ones(arr.shape[0], dtype=np.uint64)
    e = np.char.array(np.where(x_values < 0, ones, ones * 2))
    n = np.char.array(np.where(y_values > 0, ones * 3, ones * 4))
    x_values = np.char.array(np.absolute(x_values))
    y_values = np.char.array(np.absolute(y_values))
    x_values = np.char.replace(x_values, '.', '')
    y_values = np.char.replace(y_values, '.', '')
    new_arr = np.char.add(np.char.add(x_values, e), np.char.add(y_values, n))
    return new_arr.astype(np.uint64)
Here, the x and y values of the input array are first split up. Then we use a vectorized computation to determine where e and n should be 1 or 2, 3 or 4. The np.char.add calls then do the string merging, which is still not cheap for super large arrays but much faster than a regular for loop. Vectorizing the previous computations should speed the function up hugely as well.
Edit:
I was mistaken before. Numpy does have a nice way of handling string concatenation using the np.char.add() method. This requires converting x_values and y_values to Numpy character arrays using np.char.array(). Also for some reason, the np.char.add() method only takes two arrays as inputs, so it is necessary to first concatenate x_values and e and y_values and n and then concatenate these results. Still, this vectorizes the computations and should be pretty fast. The code is still a bit clunky because of the rather odd operation you are after, but I think this will help you speed up the function greatly.
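A tiny toy illustration of that chaining (my own example, not the asker's data):

import numpy as np

x = np.char.array(['15', '20'])
e = np.char.array(['1', '2'])
y = np.char.array(['33', '44'])
n = np.char.array(['3', '4'])
# np.char.add takes exactly two arrays, so chain the calls:
print(np.char.add(np.char.add(x, e), np.char.add(y, n)))  # ['151333' '202444']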
You may use np.apply_along_axis. When you feed this function another function that takes a row (or column) as an argument, it does what you want.
For your case, you may rewrite the function as below:
def foo(row):
    x, y = row
    e, n = '', ''
    if x < 0:
        e = '1'
    else:
        e = '2'
    if y > 0:
        n = '3'
    else:
        n = '4'
    return int(f'{abs(x)}{e}{abs(y)}{n}'.replace('.', ''))

# Where you want to use it:
new_arr = np.apply_along_axis(foo, 1, arr)
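Note that np.apply_along_axis still calls foo once per row in a Python-level loop, so it tidies the code more than it speeds it up; for real gains, the vectorized approach above is the one to reach for.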

How could i optimize this code with list comprehension?

I have this method which creates a list of lists containing zeros and ones.
For example, the output for (unit = 3) is: [[1,0,0],[0,1,0],[0,0,1]]
How can I do it in fewer lines with a list comprehension? I think one line is enough.
major_list = []  # contains lists full of zeros and ones; converted to a matrix at the end
for i in range(unit):
    major_list.append([0] * unit)
    major_list[i][i] = 1
You can't get any faster than using numpy.identity():
np.identity(3)
Code:
import numpy as np
unit = 3
major_array = np.identity(unit)
With a list comprehension you can join three sublists:
major_list = [[0] * i + [1] + [0] * (unit - i - 1) for i in range(unit)]
print(major_list)
Or better, use the performant numpy way:
major_list = numpy.identity(3)
print(major_list)
Testing the performance of the different methods suggested here, and assuming the required final result is a list of lists (and not a numpy array), the fastest, at 2.091 seconds for unit = 10k, is:
major_list = [[0] * i + [1] + [0] * (unit - i - 1) for i in range(unit)]
The numpy method becomes:
major_list = numpy.identity(unit).astype(int).tolist()
And is second fastest with 2.359 sec.
My method:
major_list = [[1 if i == index else 0 for i in range(unit)]
              for index in range(unit)]
Is far behind with 6.960 sec.
And last:
major_list = [[int(c==r) for c in range(unit)] for r in range(unit)]
with 17.732 sec.
If by optimize you mean reduce the number of lines (not necessarily make it faster), you can use the following:
unit = 4
major_list = [
    [0] * i +
    [1] +
    [0] * (unit - i - 1)
    for i in range(unit)
]
for i in major_list:
    print(i)
[1, 0, 0, 0]
[0, 1, 0, 0]
[0, 0, 1, 0]
[0, 0, 0, 1]
The following makes it rather concise:
major_list = [[int(c==r) for c in range(unit)] for r in range(unit)]
This puts 1 where column index equals row index, 0 everywhere else.
You can try this:
def frame_matrix(unit):
    return [[1 if i == j else 0 for i in range(unit)] for j in range(unit)]

Vectorizing outer and inner loop when these contain calculations and deletes

I've been checking out how to vectorize an outer and an inner for loop. These contain some calculations and also a delete inside them, which seems to make vectorization much less straightforward.
How would this be vectorized best?
import numpy as np

flattenedArray = np.ndarray.tolist(someNumpyArray)
# flattenedArray is a python list of lists.
c = flattenedArray[:]
for a in range(len(flattenedArray)):
    for b in range(a + 1, len(flattenedArray)):
        if a == b:
            continue
        i0 = flattenedArray[a][0]
        j0 = flattenedArray[a][1]
        z0 = flattenedArray[a][2]
        i1 = flattenedArray[b][0]
        j1 = flattenedArray[b][1]
        z1 = flattenedArray[b][2]
        if np.square(z0 - z1) <= np.square(i0 - i1) + np.square(j0 - j1):
            if np.square(i0 - i1) + np.square(j0 - j1) <= np.square(z0 + z1):
                c.remove(flattenedArray[b])
@MSeifert is, of course, as so often, right. So the following full vectorisation is only to show "how it's done":
import numpy as np

N = 4
data = np.random.random((N, 3))

# vectorised code
j, i = np.tril_indices(N, -1)  # chose tril over triu to have contiguous columns, useful later
sqsum = np.square(data[i, 0] - data[j, 0]) + np.square(data[i, 1] - data[j, 1])
cond = np.square(data[i, 2] + data[j, 2]) >= sqsum
cond &= np.square(data[i, 2] - data[j, 2]) <= sqsum
# because equal 'b's are grouped together we can use reduceat:
cond = np.r_[False, np.logical_or.reduceat(
    cond, np.add.accumulate(np.arange(N - 1)))]
left = data[~cond, :]

# original code (modified to make it run)
flattenedArray = np.ndarray.tolist(data)
# flattenedArray is a python list of lists.
c = flattenedArray[:]
for a in range(len(flattenedArray)):
    for b in range(a + 1, len(flattenedArray)):
        if a == b:
            continue
        i0 = flattenedArray[a][0]
        j0 = flattenedArray[a][1]
        z0 = flattenedArray[a][2]
        i1 = flattenedArray[b][0]
        j1 = flattenedArray[b][1]
        z1 = flattenedArray[b][2]
        if np.square(z0 - z1) <= np.square(i0 - i1) + np.square(j0 - j1):
            if np.square(i0 - i1) + np.square(j0 - j1) <= np.square(z0 + z1):
                try:
                    c.remove(flattenedArray[b])
                except ValueError:
                    pass

# check they are the same
print(np.alltrue(c == left))
Vectorizing the inner loop isn't much of a problem if you work with a mask:
import numpy as np

# I'm using a random array
flattenedArray = np.random.randint(0, 100, (10, 3))

mask = np.zeros(flattenedArray.shape[0], bool)
for idx, row in enumerate(flattenedArray):
    # Calculate the broadcasted elementwise addition/subtraction of this row
    # with all following rows
    added_squared = np.square(row[None, :] + flattenedArray[idx + 1:])
    subtracted_squared = np.square(row[None, :] - flattenedArray[idx + 1:])
    # Check the conditions
    col1_col2_added = subtracted_squared[:, 0] + subtracted_squared[:, 1]
    cond1 = subtracted_squared[:, 2] <= col1_col2_added
    cond2 = col1_col2_added <= added_squared[:, 2]
    # Update the mask
    mask[idx + 1:] |= cond1 & cond2

# Apply the mask: keep the rows that were not flagged for removal
flattenedArray[~mask]
If you also want to vectorize the outer loop, you have to do it by broadcasting, which however uses O(n**2) memory instead of O(n). Given that the critical inner loop is already vectorized, there won't be much additional speedup from vectorizing the outer loop as well.
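For completeness, a minimal sketch of that fully broadcast variant (my own illustration on a small random array; the (n, n) intermediates are what drive the O(n**2) memory):

import numpy as np

arr = np.random.randint(0, 100, (10, 3))

# all pairwise differences/sums at once -> (n, n) intermediates
diff_sq = np.square(arr[:, None, :] - arr[None, :, :])
add_sq = np.square(arr[:, None, 2] + arr[None, :, 2])
col_sum = diff_sq[:, :, 0] + diff_sq[:, :, 1]
pair = (diff_sq[:, :, 2] <= col_sum) & (col_sum <= add_sq)
# only pairs with a < b matter, mirroring the original loops
pair &= np.tri(len(arr), k=-1, dtype=bool).T
mask = pair.any(axis=0)  # row b is dropped if any earlier row a flags it
kept = arr[~mask]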
