Is it actually possible to make this run faster? I need to generate half of all possible grids of size 4*Lx (each element is either -1 or 1) for counting energies in the Ising model.
from itertools import product
import numpy as np

def get_grid(Lx):
    a = list()
    count = 0
    t = list(product([1, -1], repeat=Lx))
    for i in range(len(t)):
        for j in range(len(t)):
            for k in range(len(t)):
                for l in range(len(t)):
                    count += 1
                    a.append([t[i], t[j], t[k], t[l]])
                    if count == 2**(Lx*4)/2:
                        return np.array(a)
I tried using Numba, but that didn't work out.
First of all, Numba does not like lists. If you want efficient code, you need to operate on arrays (except when you really do not know the output size at runtime and estimating it is hard/slow). Here the size of the output array is known in advance, so it is better to preallocate it and then fill it.

Numba also does not like high-level features such as generators; prefer basic loops, which are faster (as long as they are executed in a JITed function). The Cartesian product can be replaced by efficiently computing an array from the bits of an increasing integer.

The whole computation is mainly memory-bound, so it is better to use a small integer datatype like int8, which takes 4 times less space in RAM (and is thus about 4 times faster to fill). Here is the resulting code:
import numpy as np
import numba as nb

@nb.njit('int8[:,:,:](int64,)')
def get_grid_numba(Lx):
    t = np.empty((2**Lx, Lx), dtype=np.int8)
    for i in range(2**Lx):
        for j in range(Lx):
            t[i, Lx-1-j] = 1 - 2 * ((i >> j) & 1)

    outSize = 2**(Lx*4 - 1)
    out = np.empty((outSize, 4, Lx), dtype=np.int8)
    cur = 0

    for i in range(len(t)):
        for j in range(len(t)):
            for k in range(len(t)):
                for l in range(len(t)):
                    out[cur, 0, :] = t[i, :]
                    out[cur, 1, :] = t[j, :]
                    out[cur, 2, :] = t[k, :]
                    out[cur, 3, :] = t[l, :]
                    cur += 1
                    if cur == outSize:
                        return out

    return out
For Lx=4, the initial code takes 66.8 ms while this Numba code takes 0.36 ms on my i5-9600KF processor. It is thus about 185 times faster.
Note that the size of the output array grows exponentially, and very quickly. For Lx=7, the output shape is (134217728, 4, 7) and it takes 3.5 GiB of RAM. The Numba code takes 2.47 s to generate it, that is 1.4 GiB/s. If this is not fast enough for you, you can write a specific implementation for each Lx from 1 to 8, use plain loops for the out slice assignments, and even use multiple threads for Lx>=5. For small arrays, you can pre-compute them once. This should be an order of magnitude faster.
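As a quick sanity check (illustrative only; it assumes the get_grid_numba function defined above), the output shape and dtype for Lx=4 can be verified like this:
import numpy as np

grids = get_grid_numba(4)
print(grids.shape, grids.dtype)             # expected: (32768, 4, 4) int8
assert grids.shape == (2**(4*4 - 1), 4, 4)
assert grids.dtype == np.int8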
Suppose I have N items and a multi-hot vector of values {0, 1} that represents inclusion of these items in a result:
N = 4
# items 1 and 3 will be included in the result
vector = [0, 1, 0, 1]
# item 2 will be included in the result
vector = [0, 0, 1, 0]
I'm also provided a matrix of conflicts which indicates which items cannot be included in the result at the same time:
conflicts = [
    [0, 1, 1, 0], # any result that contains items 1 AND 2 is invalid
    [0, 1, 1, 1], # any result that contains AT LEAST 2 items from {1, 2, 3} is invalid
]
Given this matrix of conflicts, we can determine the validity of the earlier vectors:
# invalid as it triggers conflict 1: [0, 1, 1, 1]
vector = [0, 1, 0, 1]
# valid as it triggers no conflicts
vector = [0, 0, 1, 0]
A naive solution to detect whether a given vector is "valid" (i.e. does not trigger any conflicts) may be done via a dot product and summation operation in numpy:
violation = np.dot(conflicts, vector)
is_valid = np.max(violation) <= 1
Are there more efficient ways to perform this operation, perhaps via np.einsum or by bypassing numpy arrays entirely in favour of bit manipulation?
We assume that the number of vectors being checked can be very large (e.g. up to 2^N if we evaluate all possibilities) but that only one vector is likely being checked at a time (to avoid generating a matrix of shape up to (2^N, N) as input).
TL;DR: you can use Numba to optimize np.dot so that it operates only on binary values. More specifically, you can perform SIMD-like operations on 8 bytes at once using 64-bit views.
Converting lists to arrays
First of all, the lists can be efficiently converted to relatively-compact arrays using this approach:
vector = np.fromiter(vector, np.uint8)
conflicts = np.array([np.fromiter(conflicts[i], np.uint8) for i in range(len(conflicts))])
This is faster than letting np.array convert the lists automatically (there are fewer checks to perform internally, Numpy knows which type of array to build, and the resulting array is smaller in memory and thus faster to fill). This step can also be used to speed up your np.dot-based solution.
If the inputs are already Numpy arrays, then check that they are of type np.uint8 or np.int8. Otherwise, cast them to such a type, for example using conflicts = conflicts.astype(np.uint8).
First try
Then, one solution could be to use np.packbits to pack the input binary values as tightly as possible into an array of bits in memory, and then perform logical ANDs. But it turns out that np.packbits is pretty slow, so this solution is not a good idea in the end. In fact, any solution that creates temporary arrays with a shape similar to conflicts will be slow, since writing such an array to memory is generally slower than np.dot (which reads conflicts from memory only once).
Using Numba
Since np.dot is pretty well optimized, the only way to beat it is to use optimized native code. Numba can generate native executable code at runtime from Numpy-based Python code thanks to a just-in-time compiler. The idea is to perform logical ANDs between vector and each row of conflicts, block by block. Conflicts are checked after each block so the computation can stop as early as possible. Blocks can be compared efficiently in groups of 8 bytes by comparing the uint64 views of the two arrays (in a SIMD-friendly way).
import numpy as np
import numba as nb

@nb.njit('bool_(uint8[::1], uint8[:,::1])')
def check_valid(vector, conflicts):
    n, m = conflicts.shape
    assert vector.size == m
    for i in range(n):
        block_size = 128 # In the range: 8,16,...,248
        conflicts_row = conflicts[i,:]
        gsum = 0 # Global sum of conflicts
        m_limit = m // block_size * block_size

        for j in range(0, m_limit, block_size):
            vector_block = vector[j:j+block_size].view(np.uint64)
            conflicts_block = conflicts_row[j:j+block_size].view(np.uint64)

            # Matching
            lsum = np.uint64(0) # 8 local sums of conflicts
            for k in range(block_size//8):
                lsum += vector_block[k] & conflicts_block[k]

            # Trick to perform the reduction of all the bytes in lsum
            lsum += lsum >> 32
            lsum += lsum >> 16
            lsum += lsum >> 8
            gsum += lsum & 0xFF

            # Check if there is a conflict
            if gsum >= 2:
                return False

        # Remaining part
        for j in range(m_limit, m):
            gsum += vector[j] & conflicts_row[j]
            if gsum >= 2:
                return False

    return True
Results
This is about 9 times faster than np.dot on my machine for a large conflicts array of shape (16, 65536) (with no conflict triggered). The time needed to convert the lists is not included in either case. When there are conflicts, the provided solution is much faster still, since it can stop the computation early.
Theoretically, the computation should be even faster, but the Numba JIT does not succeed in vectorizing the loop with SIMD instructions. That being said, the same issue seems to appear for np.dot. If the arrays are even bigger, you can parallelize the computation of the blocks (at the expense of a slower computation when the function returns False).
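For reference, a driver along the lines of the benchmark described above could look as follows; the random data and the exact script are assumptions on my part, with the array shape simply mirroring the (16, 65536) case mentioned earlier:
import numpy as np

# Hypothetical test data mirroring the (16, 65536) case discussed above.
conflicts = np.random.randint(0, 2, size=(16, 65536)).astype(np.uint8)
vector = np.zeros(65536, dtype=np.uint8)   # triggers no conflict

# Reference result with np.dot (cast to int64 so the dot product cannot overflow uint8).
ref_valid = np.max(np.dot(conflicts.astype(np.int64), vector.astype(np.int64))) <= 1

# Numba-based check (check_valid is defined above).
assert check_valid(vector, conflicts) == ref_valid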
Given two matrices X1 (N,3136) and X2 (M,3136) (where every element in every row is a binary number), I am trying to calculate the Hamming distance so that each row of X1 is compared to all of the rows of X2, giving a result matrix of shape (N,M).
I have written two functions for it (the first one vectorized with numpy, the other one with explicit loops):
import numpy as np

def hamming_distance(X, X_train):
    array = np.array([np.sum(np.logical_xor(x, X_train), axis=1) for x in X])
    return array

def hamming_distance2(X, X_train):
    a = len(X[:,0])
    b = len(X_train[:,0])
    hamming_distance = np.zeros(shape=(a, b))
    for i in range(0, a):
        for j in range(0, b):
            hamming_distance[i,j] = np.count_nonzero(X[i,:] != X_train[j,:])
    return hamming_distance
My problem is that the first function is much slower than the second one, where I use two for loops. Is it possible to improve the first function so that I use only one loop?
PS. Sorry for my English, it isn't my first language, although I was trying to do my best!
Numpy only makes your code much faster if you use it to vectorize your work. In your case you can make use of array broadcasting to vectorize your problem: compare your two arrays and create an auxiliary array of shape (N,M,K) which you can sum along its third dimension:
hamming_distance = (X[:,None,:] != X_train).sum(axis=-1)
We inject a singleton dimension into the first array to make it of shape (N,1,K), the second array is implicitly compatible with shape (1,M,K), so the operation can be performed.
In the comments @ayhan noted that this will create a huge auxiliary array for large M and N, which is quite true. This is the price of vectorization: you gain CPU time at the cost of memory. If you have enough memory for the above to work, it will be very fast. If you don't, you have to reduce the scope of your vectorization and loop over either M or N (or both; this would be your current approach), as sketched below. But this doesn't concern numpy itself; it is about striking a balance between available resources and performance.
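Here is a minimal sketch (not from the original answer) of chunking over N to bound the size of the temporary array; the chunk size of 256 is an arbitrary choice:
import numpy as np

def hamming_distance_chunked(X, X_train, chunk=256):
    # Same broadcasting trick, applied to `chunk` rows of X at a time, so the
    # temporary array has shape (chunk, M, K) instead of (N, M, K).
    out = np.empty((X.shape[0], X_train.shape[0]), dtype=np.intp)
    for start in range(0, X.shape[0], chunk):
        stop = start + chunk
        out[start:stop] = (X[start:stop, None, :] != X_train).sum(axis=-1)
    return out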
What you are doing is very similar to dot product. Consider these two binary arrays:
1 0 1 0 1 1 0 0
0 0 1 1 0 1 0 1
We are trying to find the number of different pairs. If you directly take the dot product, it gives you the number of (1, 1) pairs. However, if you negate one of them, it will count the different ones. For example, a1.dot(1-a2) counts (1, 0) pairs. Since we also need the number of (0, 1) pairs, we will add a2.dot(1-a1) to that. The good thing about dot product is that it is pretty fast. However, you will need to convert your arrays to floats first, as Divakar pointed out.
Here's a demo:
prng = np.random.RandomState(0)
arr1 = prng.binomial(1, 0.3, (1000, 3136))
arr2 = prng.binomial(1, 0.3, (2000, 3136))
res1 = hamming_distance2(arr1, arr2)
arr1 = arr1.astype('float32'); arr2 = arr2.astype('float32')
res2 = (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
np.allclose(res1, res2)
Out: True
And timings:
%timeit hamming_distance(arr1, arr2)
1 loop, best of 3: 13.9 s per loop
%timeit hamming_distance2(arr1, arr2)
1 loop, best of 3: 5.01 s per loop
%timeit (1-arr1).dot(arr2.T) + arr1.dot(1-arr2.T)
10 loops, best of 3: 93.1 ms per loop
I have a numpy array embed_vec of length tot_vec in which each entry is a 3d vector:
[[ 0.52483319 0.78015841 0.71117216]
[ 0.53041481 0.79462171 0.67234534]
[ 0.53645428 0.80896727 0.63119403]
...,
[ 0.72283509 0.40070804 0.15220522]
[ 0.71277758 0.38498613 0.16141834]
[ 0.70221445 0.36918032 0.17370776]]
For each of the elements in this array, I want to find out the number of other entries which are "close" to that entry. By close, I mean that the distance between two vectors is less than a specified value R. For this, I must compare all the possible pairs in this array with each other and then find out the number of close vectors for each of the vectors in the array. So I am doing this:
p = np.zeros(tot_vec) # This contains the number of close vectors
for i in range(tot_vec-1):
    for j in range(i+1, tot_vec):
        if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
            p[i] += 1
However, this is extremely inefficient because I have two nested python loops and for larger array sizes, this takes forever. If this were in C++ or Fortran, it wouldn't have been a great issue. My question is, can one achieve the same thing using numpy efficiently using some vectorization method? As a side note, I don't mind a solution using Pandas also.
Approach #1 : Vectorized approach -
def vectorized_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    r,c = np.triu_indices(tot_vec,1)
    subs = embed_vec[r] - embed_vec[c]
    dists = np.einsum('ij,ij->i',subs,subs)
    return np.bincount(r,dists<R**2,minlength=tot_vec)
Approach #2 : With less loop complexity (for very large arrays) -
def loopy_less_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    Rsq = R**2
    out = np.zeros(tot_vec,dtype=int)
    for i in range(tot_vec):
        subs = embed_vec[i] - embed_vec[i+1:tot_vec]
        dists = np.einsum('ij,ij->i',subs,subs)
        out[i] = np.count_nonzero(dists < Rsq)
    return out
Benchmarking
Original approach -
def loopy_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    p = np.zeros(tot_vec) # This contains the number of close vectors
    for i in range(tot_vec-1):
        for j in range(i+1, tot_vec):
            if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
                p[i] += 1
    return p
Timings -
In [76]: # Sample random array
...: embed_vec = np.random.rand(3000,3)
...: R = 0.5
...:
In [77]: %timeit loopy_app(embed_vec, R)
1 loops, best of 3: 50.5 s per loop
In [78]: %timeit loopy_less_app(embed_vec, R)
10 loops, best of 3: 143 ms per loop
350x+ speedup there!
Going with much bigger array with the proposed loopy_less_app -
In [81]: # Sample random array
...: embed_vec = np.random.rand(20000,3)
...: R = 0.5
...:
In [82]: %timeit loopy_less_app(embed_vec, R)
1 loops, best of 3: 4.47 s per loop
I am intrigued by that question and attempted to solve it efficiently using scipy's cKDTree. However, this approach may run out of memory because internally a list of all pairs with distance <= R is maintained. If your R and tot_vec are small enough, it will work:
import numpy as np
from scipy.spatial import cKDTree as KDTree
tot_vec = 60000
embed_vec = np.random.randn(tot_vec, 3)
R = 0.1
tree = KDTree(embed_vec, leafsize=100)
p = np.zeros(tot_vec)
for pair in tree.query_pairs(R):
    p[pair[0]] += 1
    p[pair[1]] += 1
In case memory is an issue, with some effort it is possible to rewrite query_pairs as a generator function in Python at the cost of C performance.
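A memory-friendlier variant (an assumption on my part, not part of the answer above) is to query the tree point by point with query_ball_point and keep only the counts, never the full pair list:
import numpy as np
from scipy.spatial import cKDTree as KDTree

def count_close(embed_vec, R, leafsize=100):
    tree = KDTree(embed_vec, leafsize=leafsize)
    p = np.zeros(len(embed_vec))
    for i, x in enumerate(embed_vec):
        # query_ball_point returns the indices of all points within R of x,
        # including x itself, hence the -1.
        p[i] = len(tree.query_ball_point(x, R)) - 1
    return p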
First, broadcast the difference:
disp_vecs = embed_vec[:,None,:] - embed_vec[None,:,:]
Now, depending on how big your dataset is, you may want to do a first pass without all the math. If the distance is to be less than r, the absolute value of every component must be less than r:
first_mask = np.max(np.abs(disp_vecs), axis=-1) < r
Then do the actual calculation:
disps = np.linalg.norm(disp_vecs[first_mask], axis=-1)
second_mask = disps < r
Now reassign:
disps = disps[second_mask]
first_mask[first_mask] = second_mask
disps now holds the good values, and first_mask is a boolean mask of where they go. You can process from there.
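Putting those steps together (a sketch, assuming the array of vectors is called embed_vec as in the question and that per-row neighbour counts are the goal):
import numpy as np

def count_close_masked(embed_vec, r):
    # Pairwise displacement vectors, shape (N, N, 3); this needs O(N**2) memory.
    disp_vecs = embed_vec[:, None, :] - embed_vec[None, :, :]
    # Cheap first pass: a pair can only be within r if every |component| is below r.
    first_mask = np.max(np.abs(disp_vecs), axis=-1) < r
    # Exact distances only where the first pass allows it.
    disps = np.linalg.norm(disp_vecs[first_mask], axis=-1)
    first_mask[first_mask] = disps < r
    # The diagonal (each point vs itself) is always True, so subtract it.
    return first_mask.sum(axis=1) - 1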
I need to create a matrix starting from the values of a weight matrix. Which structure is best for holding the matrix in terms of speed, both when creating it and when iterating over it? I was thinking about a list of lists or a numpy 2D array, but they both seem slow to me.
What I need:
numpy array
A = np.zeros((dim, dim))
for r in range(A.shape[0]):
    for c in range(A.shape[0]):
        if r == c:
            A.itemset((r, c), node_degree[r])
        else:
            A.itemset((r, c), arc_weight[r,c])
or
list of lists
l = []
for r in range(dim):
    l.append([])
    for c in range(dim):
        if r == c:
            l[r].append(node_degree[r])
        else:
            l[r].append(arc_weight[r,c])
where dim can also be 20000, node_degree is a vector and arc_weight is another matrix. I wrote this in C++ and it takes less than 0.5 seconds, while the two versions above take more than 20 seconds in Python. I know Python is not C++, but I need to be as fast as possible.
Thank you all.
One thing is you shouldn't be appending to the list if you already know its size.
Preallocate the memory first using a list comprehension and generate the r, c values using xrange() instead of range(), since you are using Python < 3.x (see here):
l = [[0 for c in xrange(dim)] for r in xrange(dim)]
Better yet, you can build what you need in one shot using:
l = [[node_degree[r] if r == c else arc_weight[r,c]
for c in xrange(dim)] for r in xrange(dim)]
Compared to your original implementation, this should use less memory (because of the xrange() generators) and less time, because you remove the need to reallocate memory by specifying the dimensions up front.
Numpy matrices are generally faster as they know their dimensions and entry type.
In your particular situation, since you already have the arc_weight and node_degree matrices created, you can create your matrix directly from arc_weight and then replace the diagonal:
A = np.matrix(arc_weight)
np.fill_diagonal(A, node_degree)
Another option is to replace the double loop with a function that puts the right element in each position and create a matrix from the function:
def fill_matrix(r, c):
    return arc_weight[r,c] if r != c else node_degree[r]
A = np.fromfunction(fill_matrix, (dim, dim))
As a rule of thumb, with numpy you must avoid loops at all costs. The first method should be faster, but you should profile both to see which works for you. You should also take into account that you seem to be duplicating your data set in memory, so if it is really huge you might get into trouble. The best idea would be to create your matrix directly, avoiding arc_weight and node_degree altogether.
Edit: Some simple time comparisons between list comprehension and numpy matrix creation. Since I don't know how your arc_weight and node_degree are defined, I just made up two random functions. It seems that numpy.fromfunction complains a bit if the function has a conditional in it, so I construct the matrix in two steps.
import numpy as np

def arc_weight(a, b):
    return a + b

def node_degree(a):
    return a * a

def create_as_list(N):
    return [[arc_weight(c,r) if c!=r else node_degree(c) for c in xrange(N)] for r in xrange(N)]

def create_as_numpy(N):
    A = np.fromfunction(arc_weight, (N,N))
    np.fill_diagonal(A, node_degree(np.arange(N)))
    return A
And here the timings for N=2000:
time A = create_as_list(2000)
CPU times: user 839 ms, sys: 16.5 ms, total: 856 ms
Wall time: 845 ms
time A = create_as_numpy(2000)
CPU times: user 83.1 ms, sys: 12.9 ms, total: 96 ms
Wall time: 95.3 ms
Make a copy of arc_weight and fill the diagonal with values from node_degree. For a 20000-by-20000 output, it takes about 1.6 seconds on my machine:
>>> import numpy
>>> dim = 20000
>>> arc_weight = numpy.arange(dim**2).reshape([dim, dim])
>>> node_degree = numpy.arange(dim)
>>> import timeit
>>> timeit.timeit('''
... A = arc_weight.copy()
... A.flat[::dim+1] = node_degree
... ''', '''
... from __main__ import dim, arc_weight, node_degree''',
... number=1)
1.6081738501125764
Once you have your array, try not to iterate over it. Compared to broadcasted operators and NumPy built-in functions, Python-level loops are a performance disaster.
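As a toy illustration of that last point (the data here is made up and unrelated to the question), compare a built-in reduction with a Python-level loop over the same array:
import numpy as np

A = np.random.rand(20000, 200)

# Vectorised: a single call that runs in compiled code.
row_sums = A.sum(axis=1)

# Python-level loop over rows: same result, far slower for large arrays.
row_sums_loop = np.array([sum(row) for row in A])

assert np.allclose(row_sums, row_sums_loop)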
I have a function
import numpy as np
import matplotlib.mlab as mlab

def getSamples():
    p = lambda x : mlab.normpdf(x,3,2) + mlab.normpdf(x,-5,1)
    q = lambda x : mlab.normpdf(x,5,14)
    k = 30
    goodSamples = []
    rightCount = 0
    totalCount = 0
    while rightCount < 100000:
        z0 = np.random.normal(5, 14)
        u0 = np.random.uniform(0, k*q(z0))
        if p(z0) > u0:
            goodSamples.append(z0)
            rightCount += 1
        totalCount += 1
    return np.array(goodSamples)
My implementation takes much too long to generate 100000 samples. How can I make it faster, with itertools or something similar?
I would say that the secret to making this code faster does not lie in changing the loop syntax. Here are a few points:
np.random.normal has an additional parameter size that lets you get many values at once. I would suggest drawing an array of, say, 1E09 elements and then checking your condition on it to see how many samples are good. From that you can estimate how likely acceptance is.
To create your uniform samples, why not use sympy for symbolic evaluation of the pdf? (I don't know if this is faster but it could be since you already know the mean and variance.)
Again, for p could you use a symbolic function?
In general, performance problems are caused by doing things the "wrong way". Numpy can be very fast when used as it is designed to be used, that is, by exploiting its vector processing, where vectorized operations are handed off to compiled code. Two bad practices that come from other programming languages/approaches are:
Loops: Whenever you think you need a loop, stop and think. Most of the time you do not, and in fact do not even want one. It is much faster both to write and to run code without loops.
Memory allocation: Whenever you know the size of an object, preallocate space for it. Growing memory, particularly in Python lists, is very slow compared to the alternatives.
In this case it is easy to get (approximately) two orders of magnitude speedup; the tradeoff is more memory usage.
Below is some representative code, it is not meant to be blindly used. I have not even verified it produces the correct results. It is more or less a direct translation of your routine. It appears you are drawing random numbers from a probability distribution using the rejection method. There may be more efficient algorithms to do this for your probability distribution.
def getSamples2():
    p = lambda x : mlab.normpdf(x,3,2) + mlab.normpdf(x,-5,1)
    q = lambda x : mlab.normpdf(x,5,14)
    k = 30
    N = 100000                 # Total number of samples we want
    Ngood = 0                  # Current number of good samples
    goodSamples = np.zeros(N)  # Storage for the good samples
    while Ngood < N:           # Unfortunately a loop, ....
        z0 = np.random.normal(5, 14, size=N)
        u0 = np.random.uniform(size=N)*k*q(z0)
        ind, = np.where(p(z0) > u0)
        n = min(len(ind), N-Ngood)
        goodSamples[Ngood:Ngood+n] = z0[ind[:n]]
        Ngood += n
    return goodSamples
This generates random numbers in chunks and saves the good ones. I have not tried to optimize the chunk size (here I just use N, the total number we want, in principle this could/should be different and could even be adjusted based on the number we have left to generate). This still uses a loop, unfortunately, but now this will be run "tens" of times instead of 100,000 times. This also uses the where function and array slicing; these are good general tools to be comfortable with.
In one test with %timeit on my machine I found
In [27]: %timeit getSamples() # Original routine
1 loops, best of 3: 49.3 s per loop
In [28]: %timeit getSamples2()
1 loops, best of 3: 505 ms per loop
Here is a kind of itertools "magic", but I'm not sure it can help. It is probably much better for performance to prepare a numpy array (using zeros) and fill it without creating a Python auto-growing list. Below are both the itertools version and the zeros preparation. (Excuse me in advance for the untested code.)
from itertools import count, ifilter, imap, takewhile
import operator

import numpy as np

def getSamples():
    p = lambda x : mlab.normpdf(x, 3, 2) + mlab.normpdf(x, -5, 1)
    q = lambda x : mlab.normpdf(x, 5, 14)
    k = 30
    n = 100000
    samples_iter = imap(
        operator.itemgetter(1),
        takewhile(
            lambda pair: pair[0] < n,
            enumerate(
                ifilter(lambda z: p(z) > np.random.uniform(0, k*q(z)),
                        (np.random.normal(5, 14) for _ in count()))
            )))
    goodSamples = np.zeros(n)
    # set values from iterator, probably there is a better way for that
    for i, sample in enumerate(samples_iter):
        goodSamples[i] = sample
    return goodSamples