Any efficient way to update a very large matrix? - python

I have a very large matrix: a 2D array with a few hundred rows but around 2 million columns. My application needs to update this matrix row by row, with one painful constraint: to update a specific row, it must wait until all columns of the previous row have been updated. And the process is very slow.
For example:
matrix = [[0 for x in xrange(2000000)] for y in xrange(300)]
for i in xrange(1, 300):
    for j in xrange(2000000):
        k = a random column in row i-1
        matrix[i][j] = matrix[i-1][k] * simple_function(i)  # some calculation
To update matrix[i][j], I need to have the values of row i-1.
My initial thought was to use a multi-process approach to parallelize the j loop in each i round. However, the calculation in the j loop is too light: the process-creation cost is much higher than the calculation itself (I also tried a process pool).
My second thought was to use threads; it works, but with almost no performance gain due to the GIL limitation.
I would like to know if there's any other approach that can accelerate my code. Thank you.
BTW, I know Cython can work without the GIL. But the calculation function needs to access a Python object, and it would take a lot of work to modify the original code.

Use Numba for such tasks
https://numba.pydata.org/
If you want a matrix or an array, create one (and not a list of lists). I would also recommend taking a look at some beginner tutorials on numpy.
Example code:
import numpy as np
import numba as nb
import time

def main():
    matrix = np.random.rand(2000000, 300)
    t1 = time.time()
    k = 100
    Testing(matrix, k)
    print(time.time() - t1)

@nb.jit(cache=True)
def Testing(matrix, k):
    for i in xrange(1, matrix.shape[0]):
        # inline your simple function (in my case np.power(i, 2))
        res_of_func = np.power(i, 2)
        for j in xrange(matrix.shape[1]):
            matrix[i, j] = matrix[i-1, k] * res_of_func

if __name__ == "__main__":
    main()
The matrix calculation takes 0.5 seconds on my machine (Haswell i7).
On Python 2, parallelizing with numba on Windows isn't working today. But the jit-compiled code should be approximately 100 times faster.

Related

Multi-Processing nested loops in Python?

I have a multi-nested for loop and I'd like to parallelize it as much as possible, in Python.
Suppose I have some arbitrary function, which accepts two arguments, func(a,b), and I'd like to compute this function on all combinations of elements of M and N.
What I've done so far is 'flatten' the indices as a dictionary
idx_map = {}
count = 0
for i in range(n):
    for j in range(m):
        idx_map[count] = (i, j)
        count += 1
Now that my nested loop is flattened, I can use it like so:
arr = []
for idx in range(n*m):
    i, j = idx_map[idx]
    arr.append(func(M[i], N[j]))
Can I use this with Python's built-in multiprocessing to parallelize? Race conditions should not be an issue because I do not need to aggregate the func calls; rather, I just want to arrive at some final array which evaluates all func(a,b) combinations across M and N. (So async behavior and complexity should not be relevant here.)
What's the best way to accomplish this effect?
I saw this related question, but I don't understand what the author was trying to illustrate.
if 1: # multi-threaded
    pool = mp.Pool(28) # try 2X num procs and inc/dec until cpu maxed
    st = time.time()
    for x in pool.imap_unordered(worker, range(data_Y)):
        pass
    print 'Multiprocess total time is %4.3f seconds' % (time.time()-st)
    print
You can accomplish this, yes; however, the amount of work you do per function call needs to be quite substantial to overcome the overhead of the processes.
Vectorizing using something like numpy is typically easier, like Jérôme stated previously.
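As a minimal sketch of that vectorized route (assuming `func` can be written with numpy ufuncs; here it is taken to be simple addition purely for illustration):

```python
import numpy as np

# Broadcasting evaluates the function over every (i, j) combination in one
# C-level operation, with no Python loop and no index flattening.
# func(a, b) is assumed to be a + b purely for illustration.
M = np.arange(5)
N = np.arange(3)
result = M[:, None] + N[None, :]  # shape (5, 3); result[i, j] == M[i] + N[j]
```

When the function is a ufunc expression like this, broadcasting replaces both the flattening step and the loop entirely.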
I have altered your code so that you may observe the speed up you get by using multiprocessing.
Feel free to change the largNum variable to see how, as the amount of work per function call increases, the scaling for multiprocessing gets better, and how at low values multiprocessing is actually slower.
from concurrent.futures import ProcessPoolExecutor
import time

# Sums the integers from 0 up to (a+b)**2
def costlyFunc(theArgs):
    a = theArgs[0]
    b = theArgs[1]
    topOfRange = (a+b)**2
    sum = 0
    for i in range(topOfRange):
        sum += i
    return sum

# changed to list
idx_map = []
largNum = 200

# Your index flattening
for i in range(largNum):
    for j in range(largNum):
        idx_map.append((i, j))
I use the map function in the single-core version to call costlyFunc on every element of the list. Python's concurrent.futures module also has a similar map function; however, it distributes the work over multiple processes.
if __name__ == "__main__":
    # No multiprocessing
    oneCoreTimer = time.time()
    result = [x for x in map(costlyFunc, idx_map)]
    oneCoreTime = time.time() - oneCoreTimer
    print(oneCoreTime, " seconds to complete the function without multiprocessing")

    # Multiprocessing
    mpTimer = time.time()
    with ProcessPoolExecutor() as ex:
        mpResult = [x for x in ex.map(costlyFunc, idx_map)]
    mpTime = time.time() - mpTimer
    print(mpTime, " seconds to complete the function with multiprocessing")
    print(f"Multiprocessing is {oneCoreTime/mpTime} times faster than using one core")

python performance bottleneck with lil_matrix

I am currently working with sparse matrices in python. I chose lil_matrix for my problem because, as explained in the documentation, lil_matrix is intended to be used for constructing a sparse matrix. My sparse matrix has dimensions 2500x2500.
I have two pieces of code inside two loops (which iterate over the matrix elements) that have very different execution times, and I want to understand why. The first one is
current = lil_matrix_A[i,j]
lil_matrix_A[i, j] = current + 1
lil_matrix_A[j, i] = current + 1
Basically just taking every element of the matrix and incrementing its value by one.
And the second one is as below
value = lil_matrix_A[i, j]
temp = (value * 10000) / (dictionary[listA[i]] * dictionary[listB[j]])
lil_matrix_A[i, j] = temp
lil_matrix_A[j, i] = temp
Basically taking the value, making the calculation of a formula and inserting this new value to the matrix.
The first code is executed for around 0.4 seconds and the second piece of code is executed for around 32 seconds.
I understand that the second one has an extra calculation in the middle, but the time difference, in my opinion, does not make sense. The dictionary and list indexing have O(1) complexity, so they are not supposed to be a problem. Does anyone have a suggestion as to what is causing this difference in execution time?
Note: The number of elements in the list and dictionary is also 2500.
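One way to sidestep per-element lil_matrix indexing entirely is to apply the formula to all stored entries in a single vectorized pass over the COO representation. A minimal sketch, with made-up denominator arrays standing in for the `dictionary[listA[i]]` and `dictionary[listB[j]]` lookups:

```python
import numpy as np
from scipy.sparse import lil_matrix

n = 100  # 2500 in the question; kept small here
A = lil_matrix((n, n))
A[0, 1] = 5.0
A[1, 0] = 5.0

# Stand-ins for dictionary[listA[i]] and dictionary[listB[j]]
denomA = np.arange(1.0, n + 1)
denomB = np.arange(1.0, n + 1)

# Apply the formula to every stored entry at once via the COO arrays,
# instead of reading and writing the lil_matrix element by element.
coo = A.tocoo()
coo.data = (coo.data * 10000) / (denomA[coo.row] * denomB[coo.col])
A = coo.tolil()
```

This replaces one matrix access per element with three numpy array operations over all elements, which is usually the larger win regardless of what is slowing the scalar version down.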

How to make huge loops in python faster on mac?

I am a computer science student, and some of the things I do require me to run huge loops on a Macbook with a dual-core i5. Some of the loops take 5-6 hours to complete, but they only use 25% of my CPU. Is there a way to make this process faster? I can't change my loops, but is there a way to make them run faster?
Thank you
Mac OS 10.11
Python 2.7 (I have to use 2.7) with IDLE or Spyder on Anaconda
Here is a sample code that takes 15 minutes:
def test_false_pos():
    sumA = [0] * 1000
    for test in range(1000):
        counter = 0
        bf = BloomFilter(4095, 10)
        for i in range(600):
            bf.rand_inserts()
        for x in range(10000):
            randS = str(rnd.randint(0, 10**8))
            if bf.lookup(randS):
                counter += 1
        sumA[test] = counter / 10000.0
    avg = np.mean(sumA)
    return avg
Sure thing: Python 2.7 has to generate huge lists and wastes a lot of memory each time you use range(<a huge number>).
Try the xrange function instead. It doesn't create that gigantic list at once; it produces the members of the sequence lazily.
But if you were to use Python 3 (which is the modern version and the future of Python), you'd find that range there is even cooler and faster than xrange in Python 2.
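The laziness is easy to see directly. In Python 3 terms (where range plays xrange's role), the lazy object stays tiny no matter how long the sequence is:

```python
import sys

lazy = range(10**6)           # constant-size object; values produced on demand
eager = list(range(10**6))    # materializes a million ints up front

print(sys.getsizeof(lazy))            # a few dozen bytes
print(sys.getsizeof(eager) > 10**6)   # the list alone is megabytes
```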
You could split it up into 4 loops:
import multiprocessing

def test_false_pos(times, i, q):
    sumA = [0] * times
    for test in range(times):
        counter = 0
        bf = BloomFilter(4095, 10)
        for _ in range(600):  # `_` so the loop doesn't clobber the `i` parameter
            bf.rand_inserts()
        for x in range(10000):
            randS = str(rnd.randint(0, 10**8))
            if bf.lookup(randS):
                counter += 1
        sumA[test] = counter / 10000.0
    q.put([i, list(sumA)])

def full_test(pieces):
    threads = []
    q = multiprocessing.Queue()
    steps = 1000 // pieces  # integer division so each process gets a whole number of runs
    for i in range(pieces):
        threads.append(multiprocessing.Process(target=test_false_pos, args=(steps, i, q)))
    [thread.start() for thread in threads]
    results = [None] * pieces
    for i in range(pieces):
        i, result = q.get()
        results[i] = result
    # Flatten the array (`results` looks like this: [[...], [...], [...], [...]])
    # source: https://stackoverflow.com/a/952952/5244995
    sums = [val for result in results for val in result]
    return np.mean(np.array(sums))

if __name__ == '__main__':
    full_test(multiprocessing.cpu_count())
This will run n processes that each do 1/nth of the work, where n is the number of processors on your computer.
The test_false_pos function has been modified to take three parameters:
times is the number of times to run the loop.
i is passed through to the result.
q is a queue to add the results to.
The function loops times times, then places i and sumA into the queue for further processing.
The main process (full_test) waits for each worker's result to arrive on the queue, then places it in the appropriate position in the results list. Once the list is complete, it is flattened and the mean is calculated and returned.
Consider looking into Numba and its JIT (just-in-time compiler). It works for functions that are numpy-based. It can handle some python routines, but is mainly for speeding up numerical calculations, especially ones with loops (like doing cholesky rank-1 up/downdates). I don't think it would work with a BloomFilter, but it is generally super helpful to know about.
In cases where you must use other packages in your flow alongside numpy, separate out the heavy-lifting numpy routines into their own functions, and throw a @jit decorator on top of each such function. Then put them into your flows with normal python stuff.
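A hedged sketch of that structure (the function name and arrays are invented for illustration; numba is deliberately not imported, and the decorator is shown commented out, so the kernel still runs as plain Python):

```python
import numpy as np

# The numerically heavy routine is isolated in its own function so that a
# numba decorator could later be applied to it alone, leaving the rest of
# the flow free to use ordinary Python objects.
# @numba.njit(cache=True)  # uncomment once numba is installed
def weighted_sq_sum(arr, weights):
    total = 0.0
    for i in range(arr.shape[0]):
        total += weights[i] * arr[i] * arr[i]
    return total

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.5, 1.0])
print(weighted_sq_sum(x, w))  # 0.5*1 + 0.5*4 + 1*9 = 11.5
```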

numpy array multiplication slower than for loop with vector multiplication?

I have come across the following issue when multiplying numpy arrays. In the example below (which is slightly simplified from the real version I am dealing with), I start with a nearly empty array A and a full array C. I then use a recursive algorithm to fill in A.
Below, I perform this algorithm in two different ways. The first method involves the operations
n_array = np.arange(0,c-1)
temp_vec= C[c-n_array] * A[n_array]
A[c] += temp_vec.sum(axis=0)
while the second method involves the for loop
for m in range(0, c - 1):
    B[c] += C[c-m] * B[m]
Note that the arrays A and B are identical, but they are filled in using the two different methods.
In the example below I time how long it takes to perform the computation using each method. I find that, for example, with n_pix=2 and max_counts = 400, the first method is much faster than the second (that is, time_np is much smaller than time_for). However, when I then switch to, for example, n_pix=1000 and max_counts = 400, instead I find method 2 is much faster (time_for is much smaller than time_np). I would have thought that method 1 would always be faster since method 2 explicitly runs over a loop while method 1 uses np.multiply.
So, I have two questions:
Why does the timing behave this way as a function of n_pix for a fixed max_counts?
What is the optimal method for writing this code so that it performs quickly for all n_pix?
That is, can anyone suggest a method 3? In my project, it is very important for this piece of code to perform quickly over a range of large and small n_pix.
import numpy as np
import time

def return_timing(n_pix, max_counts):
    A = np.zeros((max_counts+1, n_pix))
    A[0] = np.random.random(n_pix)*1.8
    A[1] = np.random.random(n_pix)*2.3
    B = np.zeros((max_counts+1, n_pix))
    B[0] = A[0]
    B[1] = A[1]
    C = np.outer(np.random.random(max_counts+1), np.random.random(n_pix))*3.24
    time_np = 0
    time_for = 0
    for c in range(2, max_counts + 1):
        t0 = time.time()
        n_array = np.arange(0, c-1)
        temp_vec = C[c-n_array] * A[n_array]
        A[c] += temp_vec.sum(axis=0)
        time_np += time.time()-t0

        t0 = time.time()
        for m in range(0, c - 1):
            B[c] += C[c-m] * B[m]
        time_for += time.time()-t0
    return time_np, time_for
First of all, you can easily replace:
n_array = np.arange(0,c-1)
temp_vec= C[c-n_array] * A[n_array]
A[c] += temp_vec.sum(axis=0)
with:
A[c] += (C[c:1:-1] * A[:c-1]).sum(0)
This is much faster because indexing with an array is much slower than slicing. But temp_vec is still hidden in there, created before the summing is done. This leads to the idea of using einsum, which is the fastest option because it doesn't create the temp array at all.
A[c] = np.einsum('ij,ij->j', C[c:1:-1], A[:c-1])
Timing. For small arrays:
>>> return_timing(10,10)
numpy OP 0.000525951385498
loop OP 0.000250101089478
numpy slice 0.000246047973633
einsum 0.000170946121216
For large:
>>> return_timing(1000,100)
numpy OP 0.185983896255
loop OP 0.0458009243011
numpy slice 0.038364648819
einsum 0.0167834758759
It is probably because your numpy-only version requires creation/allocation of new ndarrays (temp_vec and n_array), while your other method does not.
Creation of new ndarrays is very slow, and if you can modify your code in such a way that it no longer has to continuously create them, I would expect that you could get better performance out of that method.
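A minimal sketch of that idea (shapes are illustrative, not the question's): preallocate one scratch buffer and let `np.multiply(..., out=...)` write the product into it each iteration instead of allocating a fresh temporary.

```python
import numpy as np

max_counts, n_pix = 100, 50
rng = np.random.default_rng(0)
A = rng.random((max_counts + 1, n_pix))
C = rng.random((max_counts + 1, n_pix))
A_init = A.copy()  # kept only so the result can be verified

buf = np.empty_like(A)  # allocated once, reused every iteration
for c in range(2, max_counts + 1):
    # Product goes into the reused buffer rather than a new temp_vec
    np.multiply(C[c:1:-1], A[:c-1], out=buf[:c-1])
    A[c] += buf[:c-1].sum(0)
```

The per-iteration allocation is gone from the multiply; einsum goes one step further by also skipping the buffer.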

I don't understand why/how one of these methods is faster than the others

I wanted to test the difference in time between implementations of some simple code. I decided to count how many values out of a random sample of 10,000,000 numbers are greater than 0.5. The random sample is drawn uniformly from the range [0.0, 1.0).
Here is my code:
from numpy.random import random_sample
import time

n = 10000000

t1 = time.clock()
t = 0
z = random_sample(n)
for x in z:
    if x > 0.5: t += 1
print t
t2 = time.clock()

t = 0
for _ in xrange(n):
    if random_sample() > 0.5: t += 1
print t
t3 = time.clock()

t = (random_sample(n) > 0.5).sum()
print t
t4 = time.clock()

print t2-t1
print t3-t2
print t4-t3
This is the output:
4999445
4999511
5001498
7.0348236652
1.75569394301
0.202538106332
I get that the first implementation sucks because creating a massive array and then counting it element-wise is a bad idea, so I thought that the second implementation would be the most efficient.
But how is the third implementation 10 times faster than the second method? Doesn't the third method also create a massive array in the form of random_sample(n) and then go through it checking values against 0.5?
How is this third method different from the first method and why is it ~35 times faster than the first method?
EDIT: #merlin2011 suggested that Method 3 probably doesn't create the full array in memory. So, to test that theory I tried the following:
z = random_sample(n)
t = (z > 0.5).sum()
print t
which runs in 0.197948451549 seconds, practically identical to Method 3. So this is probably not a factor.
Method 1 generates a full list in memory before using it. This is slow because the memory has to be allocated and then accessed, probably missing the cache multiple times.
Method 2 uses a generator, which never creates the list in memory but instead generates each element on demand.
Method 3 is probably faster because sum() is implemented as a loop in C but I am not 100% sure. My guess is that this is faster for the same reason that Matlab vectorization is faster than for loops in Matlab.
Update: Separating out each of three steps, I observe that method 3 is still equally fast, so I have to agree with utdemir that each individual operator is executing instructions closer to machine code.
z = random_sample(n)
z2 = z > 0.5
t = z2.sum()
In each of the first two methods, you are invoking Python's standard functionality to do a loop, and this is much slower than a C-level loop that is baked into the implementation.
AFAIK:
Function calls are heavy; in method two, you're calling random_sample() 10000000 times, but in the third method, you call it only once.
Numpy's > and .sum are optimized to their last bits in C, most probably also using SIMD instructions to avoid loops.
So,
In method 2, you are comparing and looping using Python; but in method 3, you're much closer to the processor, using optimized instructions to compare and sum.
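The equivalence (and the gap) is easy to check on a smaller sample; the two expressions below count the same thing, one at C level and one in the Python interpreter:

```python
import numpy as np

z = np.random.random(10**5)
t_vec = (z > 0.5).sum()                # C-level compare + C-level sum
t_loop = sum(1 for x in z if x > 0.5)  # Python-level loop over the same array
assert t_vec == t_loop                 # identical counts, very different speed
```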
