Multi-threaded Python application slower than single-threaded implementation

Multi-threaded Python application slower than single-threaded implementation - python

I wrote this program to properly learn how to use multi-threading. I want to implement something similar to this in my own program:
import numpy as np
import time
import os
import math
import random
from threading import Thread
def powExp(x, r):
for c in range(x.shape[1]):
x[r][c] = math.pow(100, x[r][c])
def main():
print()
rows = 100
cols = 100
x = np.random.random((rows, cols))
y = x.copy()
start = time.time()
threads = []
for r in range(x.shape[0]):
t = Thread(target = powExp, args = (x, r))
threads.append(t)
t.start()
for t in threads:
t.join()
end = time.time()
print("Multithreaded calculation took {n} seconds!".format(n = end - start))
start = time.time()
for r in range(y.shape[0]):
for c in range(y.shape[1]):
y[r][c] = math.pow(100, y[r][c])
end = time.time()
print("Singlethreaded calculation took {n} seconds!".format(n = end - start))
print()
randRow = random.randint(0, rows - 1)
randCol = random.randint(0, cols - 1)
print("Checking random indices in x and y:")
print("x[{rR}][{rC}]: = {n}".format(rR = randRow, rC = randCol, n = x[randRow][randCol]))
print("y[{rR}][{rC}]: = {n}".format(rR = randRow, rC = randCol, n = y[randRow][randCol]))
print()
for r in range(x.shape[0]):
for c in range(x.shape[1]):
if(x[r][c] != y[r][c]):
print("ERROR NO WORK WAS DONE")
print("x[{r}][{c}]: {n} == y[{r}][{c}]: {ny}".format(
r = r,
c = c,
n = x[r][c],
ny = y[r][c]
))
quit()
assert(np.array_equal(x, y))
if __name__ == main():
main()
As you can see from the code the goal here is to parallelize the operation math.pow(100, x[r][c]) by creating a thread for every column. However this code is extremely slow, a lot slower than single-threaded versions.
Output:
Multithreaded calculation took 0.026447772979736328 seconds!
Singlethreaded calculation took 0.006798267364501953 seconds!
Checking random indices in x and y:
x[58][58]: = 9.792315687115973
y[58][58]: = 9.792315687115973
I searched through stackoverflow and found some info about the GIL forcing python bytecode to be executed on a single core only. However I'm not sure that this is in fact what is limiting my parallelization. I tried rearranging the parallelized for-loop using pools instead of threads. Nothing seems to be working.
Python code performance decreases with threading
EDIT: This thread discusses the same issue. Is it completely impossible to increase performance using multi-threading in python because of the GIL? Is the GIL causing my slowdowns?
EDIT 2 (2017-01-18): So from what I can gather after searching for quite a bit online it seems like python is really bad for parallelism. What I'm trying to do is parellelize a python function used in a neural network implemented in tensorflow...it seems like adding a custom op is the way to go.

The number of issues here is quite... numerous. Too many (system!) threads that do too little work, the GIL, etc. This is what I consider a really good introduction to parallelism in Python:
https://www.youtube.com/watch?v=MCs5OvhV9S4
Live coding is awesome.

Related

Multiprocessing in Python: Parallelize a for loop to fill a Numpy array

I've been reading threads like this one but any of them seems to work for my case. I'm trying to parallelize the following toy example to fill a Numpy array inside a for loop using Multiprocessing in Python:
import numpy as np
from multiprocessing import Pool
import time
def func1(x, y=1):
return x**2 + y
def func2(n, parallel=False):
my_array = np.zeros((n))
# Parallelized version:
if parallel:
pool = Pool(processes=6)
for idx, val in enumerate(range(1, n+1)):
result = pool.apply_async(func1, [val])
my_array[idx] = result.get()
pool.close()
# Not parallelized version:
else:
for i in range(1, n+1):
my_array[i-1] = func1(i)
return my_array
def main():
start = time.time()
my_array = func2(60000)
end = time.time()
print(my_array)
print("Normal time: {}\n".format(end-start))
start_parallel = time.time()
my_array_parallelized = func2(60000, parallel=True)
end_parallel = time.time()
print(my_array_parallelized)
print("Time based on multiprocessing: {}".format(end_parallel-start_parallel))
if __name__ == '__main__':
main()
The lines in the code based on Multiprocessing seem to work and give you the right results. However, it takes far longer than the non parallelized version. Here is the output of both versions:
[2.00000e+00 5.00000e+00 1.00000e+01 ... 3.59976e+09 3.59988e+09
3.60000e+09]
Normal time: 0.01605963706970215
[2.00000e+00 5.00000e+00 1.00000e+01 ... 3.59976e+09 3.59988e+09
3.60000e+09]
Time based on multiprocessing: 2.8775112628936768
My intuition tells me that it should be a better way of capturing results from pool.apply_async(). What am I doing wrong? What is the most efficient way to accomplish this? Thx.

Creating processes is expensive. On my machine it take at leas several hundred of microsecond per process created. Moreover, the multiprocessing module copy the data to be computed between process and then gather the results from the process pool. This inter-process communication is very slow too. The problem is that your computation is trivial and can be done very quickly, likely much faster than all the introduced overhead. The multiprocessing module is only useful when you are dealing with quite small datasets and perform intensive computation (compared to the amount of computed data).
Hopefully, when it comes to numericals computations using Numpy, there is a simple and fast way to parallelize your application: the Numba JIT. Numba can parallelize a code if you explicitly use parallel structures (parallel=True and prange). It uses threads and not heavy processes that are working in shared memory. Numba can overcome the GIL if your code does not deal with native types and Numpy arrays instead of pure Python dynamic object (lists, big integers, classes, etc.). Here is an example:
import numpy as np
import numba as nb
import time
#nb.njit
def func1(x, y=1):
return x**2 + y
#nb.njit('float64[:](int64)', parallel=True)
def func2(n):
my_array = np.zeros(n)
for i in nb.prange(1, n+1):
my_array[i-1] = func1(i)
return my_array
def main():
start = time.time()
my_array = func2(60000)
end = time.time()
print(my_array)
print("Numba time: {}\n".format(end-start))
if __name__ == '__main__':
main()
Because Numba compiles the code at runtime, it is able to fully optimize the loop to a no-op resulting in a time close to 0 second in this case.

Here is the solution proposed by #thisisalsomypassword that improves my initial proposal. That is, "collecting the AsyncResult objects in a list within the loop and then calling AsyncResult.get() after all processes have started on each result object":
import numpy as np
from multiprocessing import Pool
import time
def func1(x, y=1):
time.sleep(0.1)
return x**2 + y
def func2(n, parallel=False):
my_array = np.zeros((n))
# Parallelized version:
if parallel:
pool = Pool(processes=6)
####### HERE COMES THE CHANGE #######
results = [pool.apply_async(func1, [val]) for val in range(1, n+1)]
for idx, val in enumerate(results):
my_array[idx] = val.get()
#######
pool.close()
# Not parallelized version:
else:
for i in range(1, n+1):
my_array[i-1] = func1(i)
return my_array
def main():
start = time.time()
my_array = func2(600)
end = time.time()
print(my_array)
print("Normal time: {}\n".format(end-start))
start_parallel = time.time()
my_array_parallelized = func2(600, parallel=True)
end_parallel = time.time()
print(my_array_parallelized)
print("Time based on multiprocessing: {}".format(end_parallel-start_parallel))
if __name__ == '__main__':
main()
Now it works. Time is reduced considerably with Multiprocessing:
Normal time: 60.107836008071
Time based on multiprocessing: 10.049324989318848
time.sleep(0.1) was added in func1 to cancel out the effect of being a super trivial task.

How to make 6 calculation as fast as possible based on one datastream?

I have one stream of data who is coming very fast, and when a new data arrive, I would like to make 6 different calculation based on it.
I would like to make those calculation as fast as possible so I can update as soon as I receive new data.
The data can arrive as fast as milliseconds so my calculation must be very fast.
So the best thing I was thinking of was to make those calculations on 6 different Threads at the same time.
I never used threads before so I don't know where to place it.
This is the code who describe my problem
What can I do from here?
import numpy as np
import time
np.random.seed(0)
def calculation_1(data, multiplicator):
r = np.log(data * (multiplicator+1))
return r
start = time.time()
for ii in range(1000000):
data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
# calculation that has to be done together
calc_1 = calculation_1(data=data_stream_main[0], multiplicator=2)
calc_2 = calculation_1(data=data_stream_main[0], multiplicator=3)
calc_3 = calculation_1(data=data_stream_main[1], multiplicator=2)
calc_4 = calculation_1(data=data_stream_main[1], multiplicator=3)
calc_5 = calculation_1(data=data_stream_main[2], multiplicator=2)
calc_6 = calculation_1(data=data_stream_main[2], multiplicator=3)
print(calc_1)
print(calc_2)
print(calc_3)
print(calc_4)
print(calc_5)
print(calc_6)
print("total time:", time.time() - start)

You can use either class multiprocessing.pool.Pool or concurrent.futures.ProcessPoolExecutor to create a multiprocessing pool of 6 processes to which you can submit your 6 tasks in your loop to execute in parallel and await the results. The following example uses multiprocessing.pool.Pool.
But, the result will be very disappointing.
The problem is that (1) There is overhead in initially creating the 6 processes and (2) overhead in queueing up each task to execute in the different address space that the subprocesses live. This means that for multiprocessing to be advantageous, your worker function, calculation_1 in this case, needs to be a less-trivial, longer-running, more-CPU-intensive function. If you were to add to your worker function the following "do-nothing", CPU-intensive loop ...
cnt = 0
for i in range(100000):
cnt += 1
... then the following multiprocessing code would run several times more quickly. As is, stick with what you have.
import numpy as np
import multiprocessing as mp
import time
def calculation_1(data, multiplicator):
r = np.log(data * (multiplicator+1))
"""
cnt = 0
for i in range(100000):
cnt += 1
"""
return r
# required for Windows and other platforms that use spawn for creating new processes:
if __name__ == '__main__':
np.random.seed(0)
# no point in using more processes than processors:
n_processors = min(6, mp.cpu_count())
pool = mp.Pool(n_processors)
start = time.time()
for ii in range(1000000):
data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
# calculation that has to be done together
# submit tasks:
result_1 = pool.apply_async(calculation_1, (data_stream_main[0], 2))
result_2 = pool.apply_async(calculation_1, (data_stream_main[0], 3))
result_3 = pool.apply_async(calculation_1, (data_stream_main[1], 2))
result_4 = pool.apply_async(calculation_1, (data_stream_main[1], 3))
result_5 = pool.apply_async(calculation_1, (data_stream_main[2], 2))
result_6 = pool.apply_async(calculation_1, (data_stream_main[2], 3))
# wait for results:
calc_1 = result_1.get()
calc_2 = result_2.get()
calc_3 = result_3.get()
calc_4 = result_4.get()
calc_5 = result_5.get()
calc_6 = result_6.get()
print(calc_1)
print(calc_2)
print(calc_3)
print(calc_4)
print(calc_5)
print(calc_6)
print("total time:", time.time() - start)

You could factorize the calculation by separating the log(data) from the log(multiplicator).
Given that np.log(data * (multiplicator+1)) is the same as np.log(data) + np.log(multiplicator+1), you can compute and store the 2 possible values of np.log(multiplicator+1) in global variables, then only compute log(data) once per index (thus saving 50%) on that part.
# global variables and calculation function:
multiplicator2 = np.log(3)
multiplicator3 = np.log(4)
def calculation_1(data):
logData = np.log(data)
return logData + multiplicator2, logData + multiplicator3
# in the loop:...
calc_1,calc_2 = calculation_1(data_stream_main[0])
calc_3,calc_4 = calculation_1(data_stream_main[1])
calc_5,calc_6 = calculation_1(data_stream_main[2])
If you can afford to buffer several rows of data into a numpy matrix before outputing the result, you may get some performance improvement by using numpy's parallelism to perform the calculation on the whole matrix (or chunk) and output the result in chunks instead of one row at a time. Separating reception of the data from computation and output is where the use of threads may provide a benefit.
For example:
start = time.time()
chunk = []
multiplicators = np.array([2,2,2,3,3,3])
for ii in range(1000000):
data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
chunk.append(data_stream_main*2)
if len(chunk)< 1000: continue
# process 1000 lines at a time and output results
calcs = np.log(np.array(chunk)*multiplicators)
calc_1,calc_4,calc_2,calc_5,calc_3,calc6 = calcs[-1,:]
chunk = [] # reset chunk
print("total time:", time.time() - start) # 2.7 (compared to 6.6)

Can Python threads work on the same process?

I am trying to come up with a way to have threads work on the same goal without interfering. In this case I am using 4 threads to add up every number between 0 and 90,000. This code runs but it ends almost immediately (Runtime: 0.00399994850159 sec) and only outputs 0. Originally I wanted to do it with a global variable but I was worried about the threads interfering with each other (ie. the small chance that two threads double count or skip a number due to strange timing of the reads/writes). So instead I distributed the workload beforehand. If there is a better way to do this please share. This is my simple way of trying to get some experience into multi threading. Thanks
import threading
import time
start_time = time.time()
tot1 = 0
tot2 = 0
tot3 = 0
tot4 = 0
def Func(x,y,tot):
tot = 0
i = y-x
while z in range(0,i):
tot = tot + i + z
# class Tester(threading.Thread):
# def run(self):
# print(n)
w = threading.Thread(target=Func, args=(0,22499,tot1))
x = threading.Thread(target=Func, args=(22500,44999,tot2))
y = threading.Thread(target=Func, args=(45000,67499,tot3))
z = threading.Thread(target=Func, args=(67500,89999,tot4))
w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}
total = tot1 + tot2 + tot3 + tot4
print total
print("--- %s seconds ---" % (time.time() - start_time))

You have a bug that makes this program end almost immediately. Look at while z in range(0,i): in Func. z isn't defined in the function and its only by luck (bad luck really) that you happen to have a global variable z = threading.Thread(target=Func, args=(67500,89999,tot4)) that masks the problem. You are testing whether the thread object is in a list of integers... and its not!
The next problem is with the global variables. First, you are absolutely right that using a single global variable is not thread safe. The threads would mess with each others calculations. But you misunderstand how globals work. When you do threading.Thread(target=Func, args=(67500,89999,tot4)), python passes the object currently referenced by tot4 to the function, but the function has no idea which global it came from. You only update the local variable tot and discard it when the function completes.
A solution is to use a global container to hold the calculations as shown in the example below. Unfortunately, this is actually slower than just doing all the work in one thread. The python global interpreter lock (GIL) only lets 1 thread run at a time and only slows down CPU-intensive tasks implemented in pure python.
You could look at the multiprocessing module to split this into multiple processes. That works well if the cost of running the calculation is large compared to the cost of starting the process and passing it data.
Here is a working copy of your example:
import threading
import time
start_time = time.time()
tot = [0] * 4
def Func(x,y,tot_index):
my_total = 0
i = y-x
for z in range(0,i):
my_total = my_total + i + z
tot[tot_index] = my_total
# class Tester(threading.Thread):
# def run(self):
# print(n)
w = threading.Thread(target=Func, args=(0,22499,0))
x = threading.Thread(target=Func, args=(22500,44999,1))
y = threading.Thread(target=Func, args=(45000,67499,2))
z = threading.Thread(target=Func, args=(67500,89999,3))
w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}
total = sum(tot)
print total
print("--- %s seconds ---" % (time.time() - start_time))

You can pass in a mutable object that you can add your results either with an identifier, e.g. dict or just a list and append() the results, e.g.:
import threading
def Func(start, stop, results):
results.append(sum(range(start, stop+1)))
rngs = [(0, 22499), (22500, 44999), (45000, 67499), (67500, 89999)]
results = []
jobs = [threading.Thread(target=Func, args=(start, stop, results)) for start, stop in rngs]
for j in jobs:
j.start()
for j in jobs:
j.join()
print(sum(results))
# 4049955000
# 100 loops, best of 3: 2.35 ms per loop

As others have noted you could look multiprocessing in order to split the work to multiple different processes that can run parallel. This would benefit especially in CPU-intensive tasks assuming that there isn't huge amount of data to pass between the processes.
Here's a simple implementation of the same functionality using multiprocessing:
from multiprocessing import Pool
POOL_SIZE = 4
NUMBERS = 90000
def func(_range):
tot = 0
for z in range(*_range):
tot += z
return tot
with Pool(POOL_SIZE) as pool:
chunk_size = int(NUMBERS / POOL_SIZE)
chunks = ((i, i + chunk_size) for i in range(0, NUMBERS, chunk_size))
print(sum(pool.imap(func, chunks)))
In above chunks is a generator that produces the same ranges that were hardcoded in original version. It's given to imap which works the same as standard map except that it executes the function in the processes within the pool.
Less known fact about multiprocessing is that you can easily convert the code to use threads instead of processes by using undocumented multiprocessing.pool.ThreadPool. In order to convert above example to use threads just change import to:
from multiprocessing.pool import ThreadPool as Pool

Python array sum vs MATLAB

I'm slowly switching to Python and I wanted to make a simple test for comparing the performance of a simple array summation. I generate a random 1000x1000 array and add one to each of the values in this array.
Here my script in Python :
import time
import numpy
from numpy.random import random
def testAddOne(data):
"""
Test addOne
"""
return data + 1
i = 1000
data = random((i,i))
start = time.clock()
for x in xrange(1000):
testAddOne(data)
stop = time.clock()
print stop - start
And my function in MATLAB:
function test
%parameter declaration
c=rand(1000);
tic
for t = 1:1000
testAddOne(c);
end
fprintf('Structure: \n')
toc
end
function testAddOne(c)
c = c + 1;
end
The Python takes 2.77 - 2.79 seconds, the same as the MATLAB function (I'm actually quite impressed by Numpy!). What would I have to change to my Python script to use multithreading? I can't in MATLAB since I don,t have the toolbox.

Multi threading in Python is only useful for situations where threads get blocked, e.g. on getting input, which is not the case here (see the answers to this question for more details). However, multi processing is easy to do in Python. Multiprocessing in general is covered here.
A program taking a similar approach to your example is below
import time
import numpy
from numpy.random import random
from multiprocessing import Process
def testAddOne(data):
return data + 1
def testAddN(data,N):
# print "testAddN", N
for x in xrange(N):
testAddOne(data)
if __name__ == '__main__':
matrix_size = 1000
num_adds = 10000
num_processes = 4
data = random((matrix_size,matrix_size))
start = time.clock()
if num_processes > 1:
processes = [Process(target=testAddN, args=(data,num_adds/num_processes))
for i in range(num_processes)]
for p in processes:
p.start()
for p in processes:
p.join()
else:
testAddN(data,num_adds)
stop = time.clock()
print "Elapsed", stop - start
A more useful example using a pool of worker processes to successively add 1 to different matrices is below.
import time
import numpy
from numpy.random import random
from multiprocessing import Pool
def testAddOne(data):
return data + 1
def testAddN(dataN):
data,N=dataN
for x in xrange(N):
data = testAddOne(data)
return data
if __name__ == '__main__':
num_matrices = 4
matrix_size = 1000
num_adds_per_matrix = 2500
num_processes = 4
inputs = [(random((matrix_size,matrix_size)), num_adds_per_matrix)
for i in range(num_matrices)]
#print inputs # test using, e.g., matrix_size = 2
start = time.clock()
if num_processes > 1:
proc_pool = Pool(processes=num_processes)
outputs = proc_pool.map(testAddN, inputs)
else:
outputs = map(testAddN, inputs)
stop = time.clock()
#print outputs # test using, e.g., matrix_size = 2
print "Elapsed", stop - start
In this case the code in testAddN actually does something with the result of calling testAddOne. And you can uncomment the print statements to check that some useful work is being done.
In both cases I've changed the total number of additions to 10000; with fewer additions the cost of starting up processes becomes more significant (but you can experiment with the parameters). And you can experiment with num_processes also. On my machine I found that compared to running in the same process with num_processes=1 I got just under a 2x speedup spawning four processes with num_processes=4.

Python threads don't run simultaneously

I'm brand new to multi-threaded processing, so please forgive me if I butcher terms or miss something obvious.
The code below doesn't offer any time advantage over different code that calls the same two functions one after the other.
import time
import threading
start_time = time.clock()
def fibonacci(nth): #can be ignored
first = 0
second = 1
for i in range(nth):
third = first + second
first = second
second = third
print "Fibonacci number", i + 1, "is", len(str(first)), "digits long"
def collatz(collatz_max): #can be ignored
for n in range(collatz_max):
n = n + 1 #avoid entering 0
solution = []
solution.append(n)
while n != 1:
if n % 2 == 0:
n = n / 2
else:
n = (n*3) + 1
solution.append(n)
print "Path for Collatz number", collatz_max, "is", solution
def scripts():
thread_fibonacci = threading.Thread(target=fibonacci, args = (800000,))
thread_collatz = threading.Thread(target=collatz, args = (400000,))
thread_fibonacci.start()
thread_collatz.start()
return thread_fibonacci, thread_collatz
all_scripts = scripts()
#wait until both threads are finished
for script in all_scripts:
script.join()
print time.clock() - start_time, "seconds"
What do I need to do to make the threads simultaneous? Does GIL mean concurrency can only be achieved through separate processes? If so, what is the point of multithreading?
Using Python 2.7.5 on Windows 8.1, quad-core processor. Any help would be appreciated.

There are good answers regarding the GIL you can look at.
In short, if your tasks are CPU-bound (like the ones you posted), threads are not going to help you. Python threads are good for IO-bound tasks, like retrieving a web page.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Multi-threaded Python application slower than single-threaded implementation - python

The number of issues here is quite... numerous. Too many (system!) threads that do too little work, the GIL, etc. This is what I consider a really good introduction to parallelism in Python: https://www.youtube.com/watch?v=MCs5OvhV9S4 Live coding is awesome.

Related

Multiprocessing in Python: Parallelize a for loop to fill a Numpy array

How to make 6 calculation as fast as possible based on one datastream?

Can Python threads work on the same process?

Python array sum vs MATLAB

Python threads don't run simultaneously

Categories

Resources