I was trying out the Python multiprocessing module to reduce the time of my filtering code. As a first step I ran some experiments, and the results are not promising.
I defined a function that runs a loop over a certain range, then ran it with and without threading and measured the time. Here is my code:
import time
from multiprocessing.pool import ThreadPool

def do_loop(i, j):
    l = []
    for i in range(i, j):
        l.append(i)
    return l

# loop variable
x = 7

# without threading
start_time = time.time()
c = do_loop(0, 10**x)
print("--- %s seconds ---" % (time.time() - start_time))

# with threading
def thread_work(n):
    # dividing loop size
    a = 0
    b = int(n/2)
    c = int(n/2)
    # multiprocessing
    pool = ThreadPool(processes=10)
    async_result1 = pool.apply_async(do_loop, (a, b))
    async_result2 = pool.apply_async(do_loop, (b, c))
    async_result3 = pool.apply_async(do_loop, (c, n))
    # get the result from all processes
    result = async_result1.get() + async_result2.get() + async_result3.get()
    return result

start_time = time.time()
ll = thread_work(10**x)
print("--- %s seconds ---" % (time.time() - start_time))
For x=7 the result is:
--- 1.0931916236877441 seconds ---
--- 1.4213247299194336 seconds ---
Without threading it takes less time. And there is another problem: for x=8 I usually get a MemoryError from the threaded version. Once I got this result:
--- 17.04124426841736 seconds ---
--- 32.871358156204224 seconds ---
The solution matters to me because I need to optimize a filtering task that currently takes 6 hours.
Depending on your task, multiprocessing may or may not take longer.
If you want to take advantage of your CPU cores and speed up your filtering process, then you should use multiprocessing.Pool, which
offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across
processes (data parallelism).
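A minimal sketch of that pattern, with a placeholder predicate standing in for your real filter (the names here are my own, not from your code):

from multiprocessing import Pool

def keep(item):
    # stand-in for your real filtering condition
    return item % 2 == 0

if __name__ == '__main__':
    data = list(range(1_000_000))
    with Pool() as pool:
        flags = pool.map(keep, data, chunksize=10_000)
    filtered = [x for x, ok in zip(data, flags) if ok]
    print(len(filtered))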
I've put together an example of data filtering and measured the timing of a simple approach against the timing of a multiprocessing approach (starting from your code).
# take only the sentences that end in "we are what we dream" and whose second word is "are"
import time
from multiprocessing.pool import Pool

LEN_FILTER_SENTENCE = len('we are what we dream')
num_process = 10

def do_loop(sentences):
    l = []
    for sentence in sentences:
        if sentence[-LEN_FILTER_SENTENCE:].lower() == 'we are what we dream' and sentence.split()[1] == 'are':
            l.append(sentence)
    return l

# with multiprocessing
def thread_work(sentences):
    pool = Pool(processes=num_process)
    pool_food = (sentences[i: i + num_process] for i in range(0, len(sentences), num_process))
    result = pool.map(do_loop, pool_food)
    return result

def test(data_size=5, sentence_size=100):
    to_be_filtered = ['we are what we doing'*sentence_size] * 10 ** data_size + ['we are what we dream'*sentence_size] * 10 ** data_size
    start_time = time.time()
    c = do_loop(to_be_filtered)
    simple_time = (time.time() - start_time)

    start_time = time.time()
    ll = [e for l in thread_work(to_be_filtered) for e in l]
    multiprocessing_time = (time.time() - start_time)

    assert c == ll
    return simple_time, multiprocessing_time
data_size represents the length of your data, and sentence_size is a multiplication factor for each data element; you can see that sentence_size is directly proportional to the number of CPU operations required for each item of your data.
data_size = [1, 2, 3, 4, 5, 6]
results = {i: {'simple_time': [], 'multiprocessing_time': []} for i in data_size}
sentence_size = list(range(1, 500, 100))

for size in data_size:
    for s_size in sentence_size:
        simple_time, multiprocessing_time = test(size, s_size)
        results[size]['simple_time'].append(simple_time)
        results[size]['multiprocessing_time'].append(multiprocessing_time)
import pandas as pd

df_small_data = pd.DataFrame({'simple_data_size_1': results[1]['simple_time'],
                              'simple_data_size_2': results[2]['simple_time'],
                              'simple_data_size_3': results[3]['simple_time'],
                              'multiprocessing_data_size_1': results[1]['multiprocessing_time'],
                              'multiprocessing_data_size_2': results[2]['multiprocessing_time'],
                              'multiprocessing_data_size_3': results[3]['multiprocessing_time'],
                              'sentence_size': sentence_size})

df_big_data = pd.DataFrame({'simple_data_size_4': results[4]['simple_time'],
                            'simple_data_size_5': results[5]['simple_time'],
                            'simple_data_size_6': results[6]['simple_time'],
                            'multiprocessing_data_size_4': results[4]['multiprocessing_time'],
                            'multiprocessing_data_size_5': results[5]['multiprocessing_time'],
                            'multiprocessing_data_size_6': results[6]['multiprocessing_time'],
                            'sentence_size': sentence_size})
Plotting the timing for small data:
ax = df_small_data.set_index('sentence_size').plot(figsize=(20, 10), title = 'Simple vs multiprocessing approach for small data')
ax.set_ylabel('Time in seconds')
Plotting the timing for (relatively) big data:
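The plot call for the big-data frame is not shown in the original post (only the resulting figure was); presumably it mirrors the small-data one:

ax = df_big_data.set_index('sentence_size').plot(figsize=(20, 10), title = 'Simple vs multiprocessing approach for big data')
ax.set_ylabel('Time in seconds')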
As you can see, the power of multiprocessing reveals itself when you have big data that requires a relatively significant amount of CPU work per data element.
The task here is so small that the parallelization overhead hugely dominates the benefits. This is a common FAQ.
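If you want to see the overhead itself, a rough sketch of my own (not from the answer above) is to time a pool running a trivial function against a plain loop:

import time
from multiprocessing import Pool

def noop(x):
    return x                      # trivial work: the overhead is all that is measured

if __name__ == '__main__':
    t0 = time.time()
    with Pool(4) as pool:
        pool.map(noop, range(100000))
    print("pool (overhead-dominated):", time.time() - t0)

    t0 = time.time()
    [noop(x) for x in range(100000)]
    print("plain loop:", time.time() - t0)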
It's better to use multiprocessing.Process(), because Python has the Global Interpreter Lock (GIL). Even if you create threads to speed up a CPU-bound task, it won't get faster; the threads run one at a time. See the Python docs on the GIL and threading. A minimal sketch is below.
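A minimal sketch of that approach (my own illustration; the ranges and the queue-based result collection are assumptions, not from the answer):

import multiprocessing as mp

def do_loop(i, j, queue):
    # tag the chunk with its start index so the parent can reassemble it in order
    queue.put((i, list(range(i, j))))

if __name__ == '__main__':
    queue = mp.Queue()
    procs = [mp.Process(target=do_loop, args=(a, b, queue))
             for a, b in [(0, 500_000), (500_000, 1_000_000)]]
    for p in procs:
        p.start()
    parts = [queue.get() for _ in procs]   # drain the queue before joining
    for p in procs:
        p.join()
    result = [x for _, chunk in sorted(parts) for x in chunk]
    print(len(result))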
Aroosh Rana may have the best answer, but when testing with that approach there are a couple of things to watch out for. The way you grow your array in the loop can be very inefficient; instead, consider allocating its full size up front. Also, look closely at the way you divided up the work: you have two loops that process half the array and one that goes from n/2 to n/2. Finally, as mentioned elsewhere, the work done is rather trivial and wouldn't benefit from parallel processing.
I've tried to improve upon your previous test.
import time
from multiprocessing.pool import ThreadPool
import math

def do_loop(array, i, j):
    for k in range(i, j):
        array[k] = math.cos(1/(1+k))
    return array

# loop variable
x = 7
array_size = 2*10**x

# without threading
start_time = time.time()
array = [0]*array_size
c = do_loop(array, 0, array_size)
print("--- %s seconds ---" % (time.time() - start_time))

# with threading
def thread_work(n):
    # dividing loop size
    array = [0]*n
    a = 0
    b = int(n/3)
    c = int(2*n/3)
    # thread pool
    pool = ThreadPool(processes=4)
    async_result1 = pool.apply_async(do_loop, (array, a, b))
    async_result2 = pool.apply_async(do_loop, (array, b, c))
    async_result3 = pool.apply_async(do_loop, (array, c, n))
    # get the result from all threads
    result1 = async_result1.get()
    result2 = async_result2.get()
    result3 = async_result3.get()
    start_time = time.time()
    result = result1 + result2 + result3
    print("--- %s seconds ---" % (time.time() - start_time))
    return result

start_time = time.time()
ll = thread_work(array_size)
print("--- %s seconds ---" % (time.time() - start_time))
Also keep in mind that with an approach like this you wouldn't have to combine the results at the end as each thread would be processing on the same array.
Why do you use multiprocessing for threads?
Best is to create multiple Thread instances, give each of them its task, then start all of them and wait until they finish, collecting the results into a list in the meantime.
From my experience (on one particular task), I have found that even creating a whole graph of threads up front adds less overhead than creating each one right before the next node in the graph starts its task. I mean this for 10, 100, 1000, 10000 threads. Just make sure the threads sleep during idle time, e.g. time.sleep(0.5), to avoid wasting CPU cycles.
With threads you can share lists, dictionaries, and queues between workers; their individual operations are thread-safe.
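A minimal sketch of that pattern (the worker function and the ranges are placeholders of my own, not from the answer):

import threading

def work(start, stop, out):
    out.append(sum(range(start, stop)))   # a single append is thread-safe

results = []
threads = [threading.Thread(target=work, args=(a, b, results))
           for a, b in [(0, 25_000), (25_000, 50_000), (50_000, 75_000)]]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(results))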
I'm trying to run a task in parallel using concurrent.futures. Please find below a piece of code for it:
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
import concurrent.futures

# num CPUs
cpu_num = len(os.sched_getaffinity(0))
print("Number of cpu available : ", cpu_num)

# max_Worker = cpu_num
max_Worker = 1

# A fake input array
n = 1000000
array = list(range(n))
results = []

# A fake function being applied to each element of array
def task(i):
    return i**2

x = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=max_Worker) as executor:
    features = {executor.submit(task, j) for j in array}

    # the real function is heavy and we need to be sure of completeness of each run
    for future in concurrent.futures.as_completed(features):
        results.append(future.result())

    results = [future.result() for future in features]
y = time.time()

print('=========================================')
print(f"Train data preparation time (s): {(y-x)}")
print('=========================================')
And now my questions:
Although there is no error, is it correct/optimized?
While playing with the number of workers, there seems to be no improvement in speed (e.g., 1 vs 16, no difference). What is the problem, and how can it be solved?
Thanks in advance.
See my comment to your question. To the overhead I mentioned in that comment you also need to add the overhead of creating the process pool itself.
The following is a benchmark with several results. The first is a timing of just calling the worker function task 100000 times, building a results list and printing out the last element of that list. It will become apparent why I have reduced the number of calls to task from 1000000 to 100000.
The next attempt uses multiprocessing to accomplish the same thing with a ProcessPoolExecutor and the submit method, then processes the Future instances that are returned.
The next attempt instead uses the map method with the default chunksize argument of 1. It is important to understand this argument. With a chunksize value of 1, each element of the iterable passed to the map method is written individually to the queue of tasks as a chunk to be processed by the processes in the pool. When a pool process becomes idle looking for work, it pulls the next chunk of tasks from the queue, processes each task in the chunk and then becomes idle again. When a lot of tasks are submitted via map, a chunksize value of 1 is inefficient; you would expect its performance to be equivalent to repeatedly issuing submit calls for each element of the iterable.
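To make the effect of chunksize concrete, here is a rough illustration of my own (not the pool's actual internal code) showing how many task-queue entries a given chunksize produces for 100000 tasks:

from itertools import islice

def make_chunks(iterable, chunksize):
    # group the iterable into lists of at most `chunksize` elements
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

print(len(list(make_chunks(range(100000), 1))))     # 100000 queue entries
print(len(list(make_chunks(range(100000), 3125))))  # 32 queue entries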
The next attempt specifies a chunksize value that approximates the value the map function of the Pool class in the multiprocessing package would use by default. As you can see, the improvement is dramatic, but still not an improvement over the non-multiprocessing case.
The final attempt uses the multiprocessing facility provided by the multiprocessing package and its multiprocessing.pool.Pool class. The difference in this benchmark is that its map function uses a more intelligent default chunksize when no chunksize argument is specified.
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
from multiprocessing import Pool

# A fake function being applied to each element of array
def task(i):
    return i**2

# required for Windows:
if __name__ == '__main__':
    n = 100000

    t1 = time.time()
    results = [task(i) for i in range(n)]
    print('Non-multiprocessing time:', time.time() - t1, results[-1])

    # num CPUs
    cpu_num = os.cpu_count()
    print("Number of CPUs available: ", cpu_num)

    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        futures = [executor.submit(task, i) for i in range(n)]
        results = [future.result() for future in futures]
    print('Multiprocessing time using submit:', time.time() - t1, results[-1])

    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n)))
    print('Multiprocessing time using map:', time.time() - t1, results[-1])

    t1 = time.time()
    chunksize = n // (4 * cpu_num)
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n), chunksize=chunksize))
    print(f'Multiprocessing time using map: {time.time() - t1}, chunksize: {chunksize}', results[-1])

    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n))
    print('Multiprocessing time using Pool.map:', time.time() - t1, results[-1])
Prints:
Non-multiprocessing time: 0.027019739151000977 9999800001
Number of CPUs available: 8
Multiprocessing time using submit: 77.34723353385925 9999800001
Multiprocessing time using map: 79.52981925010681 9999800001
Multiprocessing time using map: 0.30500149726867676, chunksize: 3125 9999800001
Multiprocessing time using Pool.map: 0.2799997329711914 9999800001
Update
The following benchmarks use a version of task that is very CPU-intensive and show the benefit of multiprocessing. For this small iterable size (100), forcing a chunksize value of 1 for the Pool.map case (which would by default compute a chunksize value of 4) also appears to be slightly more performant.
import os
import time
from concurrent.futures import ProcessPoolExecutor as PE
from multiprocessing import Pool

# A fake function being applied to each element of array
def task(i):
    for _ in range(1_000_000):
        result = i ** 2
    return result

def compute_chunksize(iterable_size, pool_size):
    chunksize, remainder = divmod(iterable_size, pool_size * 4)
    if remainder:
        chunksize += 1
    return chunksize

# required for Windows:
if __name__ == '__main__':
    n = 100
    cpu_num = os.cpu_count()
    chunksize = compute_chunksize(n, cpu_num)

    t1 = time.time()
    results = [task(i) for i in range(n)]
    t2 = time.time()
    print('Non-multiprocessing time:', t2 - t1, results[-1])

    # num CPUs
    print("Number of CPUs available: ", cpu_num)

    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        futures = [executor.submit(task, i) for i in range(n)]
        results = [future.result() for future in futures]
    t2 = time.time()
    print('Multiprocessing time using submit:', t2 - t1, results[-1])

    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n)))
    t2 = time.time()
    print('Multiprocessing time using map:', t2 - t1, results[-1])

    t1 = time.time()
    with PE(max_workers=cpu_num) as executor:
        results = list(executor.map(task, range(n), chunksize=chunksize))
    t2 = time.time()
    print(f'Multiprocessing time using map: {t2 - t1}, chunksize: {chunksize}', results[-1])

    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n))
    t2 = time.time()
    print('Multiprocessing time using Pool.map:', t2 - t1, results[-1])

    t1 = time.time()
    with Pool(cpu_num) as executor:
        results = executor.map(task, range(n), chunksize=1)
    t2 = time.time()
    print('Multiprocessing time using Pool.map (chunksize=1):', t2 - t1, results[-1])
Prints:
Non-multiprocessing time: 23.12758779525757 9801
Number of CPUs available: 8
Multiprocessing time using submit: 5.336004018783569 9801
Multiprocessing time using map: 5.364996671676636 9801
Multiprocessing time using map: 5.444890975952148, chunksize: 4 9801
Multiprocessing time using Pool.map: 5.400001287460327 9801
Multiprocessing time using Pool.map (chunksize=1): 4.698001146316528 9801
I have one stream of data that arrives very fast, and each time a new data point arrives I would like to make 6 different calculations based on it.
I would like to make those calculations as fast as possible so I can update as soon as I receive new data.
The data can arrive as fast as every millisecond, so my calculations must be very fast.
The best thing I could think of was to run those calculations on 6 different threads at the same time.
I have never used threads before, so I don't know where to put them.
This is the code that describes my problem.
What can I do from here?
import numpy as np
import time

np.random.seed(0)

def calculation_1(data, multiplicator):
    r = np.log(data * (multiplicator+1))
    return r

start = time.time()
for ii in range(1000000):
    data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]

    # calculation that has to be done together
    calc_1 = calculation_1(data=data_stream_main[0], multiplicator=2)
    calc_2 = calculation_1(data=data_stream_main[0], multiplicator=3)
    calc_3 = calculation_1(data=data_stream_main[1], multiplicator=2)
    calc_4 = calculation_1(data=data_stream_main[1], multiplicator=3)
    calc_5 = calculation_1(data=data_stream_main[2], multiplicator=2)
    calc_6 = calculation_1(data=data_stream_main[2], multiplicator=3)

    print(calc_1)
    print(calc_2)
    print(calc_3)
    print(calc_4)
    print(calc_5)
    print(calc_6)

print("total time:", time.time() - start)
You can use either the multiprocessing.pool.Pool class or concurrent.futures.ProcessPoolExecutor to create a multiprocessing pool of 6 processes, submit your 6 tasks in your loop to execute in parallel, and await the results. The following example uses multiprocessing.pool.Pool.
But the result will be very disappointing.
The problem is that (1) there is overhead in initially creating the 6 processes and (2) there is overhead in queueing up each task to execute in the different address spaces in which the subprocesses live. This means that for multiprocessing to be advantageous, your worker function, calculation_1 in this case, needs to be a less trivial, longer-running, more CPU-intensive function. If you were to add the following "do-nothing", CPU-intensive loop to your worker function ...
cnt = 0
for i in range(100000):
    cnt += 1
... then the following multiprocessing code would run several times more quickly. As is, stick with what you have.
import numpy as np
import multiprocessing as mp
import time

def calculation_1(data, multiplicator):
    r = np.log(data * (multiplicator+1))
    """
    cnt = 0
    for i in range(100000):
        cnt += 1
    """
    return r

# required for Windows and other platforms that use spawn for creating new processes:
if __name__ == '__main__':
    np.random.seed(0)
    # no point in using more processes than processors:
    n_processors = min(6, mp.cpu_count())
    pool = mp.Pool(n_processors)

    start = time.time()
    for ii in range(1000000):
        data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]

        # calculation that has to be done together
        # submit tasks:
        result_1 = pool.apply_async(calculation_1, (data_stream_main[0], 2))
        result_2 = pool.apply_async(calculation_1, (data_stream_main[0], 3))
        result_3 = pool.apply_async(calculation_1, (data_stream_main[1], 2))
        result_4 = pool.apply_async(calculation_1, (data_stream_main[1], 3))
        result_5 = pool.apply_async(calculation_1, (data_stream_main[2], 2))
        result_6 = pool.apply_async(calculation_1, (data_stream_main[2], 3))

        # wait for results:
        calc_1 = result_1.get()
        calc_2 = result_2.get()
        calc_3 = result_3.get()
        calc_4 = result_4.get()
        calc_5 = result_5.get()
        calc_6 = result_6.get()

        print(calc_1)
        print(calc_2)
        print(calc_3)
        print(calc_4)
        print(calc_5)
        print(calc_6)

    print("total time:", time.time() - start)
You could factorize the calculation by separating the log(data) part from the log(multiplicator+1) part.
Given that np.log(data * (multiplicator+1)) is the same as np.log(data) + np.log(multiplicator+1), you can compute and store the two possible values of np.log(multiplicator+1) in global variables, then only compute np.log(data) once per index (thus saving 50% on that part).
# global variables and calculation function:
multiplicator2 = np.log(3)
multiplicator3 = np.log(4)

def calculation_1(data):
    logData = np.log(data)
    return logData + multiplicator2, logData + multiplicator3

# in the loop: ...
calc_1, calc_2 = calculation_1(data_stream_main[0])
calc_3, calc_4 = calculation_1(data_stream_main[1])
calc_5, calc_6 = calculation_1(data_stream_main[2])
If you can afford to buffer several rows of data into a numpy matrix before outputting the results, you may get some performance improvement by using numpy's vectorization to perform the calculation on a whole matrix (or chunk) and output the results in chunks instead of one row at a time. Separating the reception of the data from the computation and output is where the use of threads may provide a benefit (see the sketch after the example below).
For example:
start = time.time()
chunk = []
multiplicators = np.array([3, 3, 3, 4, 4, 4])   # the multiplicator+1 values for each column
for ii in range(1000000):
    data_stream_main = [np.random.uniform(0, 2.0), np.random.uniform(10, 1000.0), np.random.uniform(0, 0.01)]
    chunk.append(data_stream_main*2)
    if len(chunk) < 1000: continue

    # process 1000 lines at a time and output results
    calcs = np.log(np.array(chunk)*multiplicators)
    calc_1, calc_3, calc_5, calc_2, calc_4, calc_6 = calcs[-1, :]
    chunk = []  # reset chunk

print("total time:", time.time() - start)  # 2.7 (compared to 6.6)
I'm trying to write some code in Python, and I know the size of the list beforehand.
Suppose the list is l and its size is n.
Should I first initialise a list of length n and then, in a for loop, set l[i] = i for i from 0 to n-1?
OR
Is it better to first initialise an empty list and then append items to it in a for loop?
I know that both approaches yield the same result.
I just mean to ask whether one is faster or more efficient than the other.
Appending is faster because lists are designed to be inherently mutable. You can initialise a fixed-size list with code like
x = [None]*n
but this in itself takes time. I wrote a quick program to compare the speeds as follows:
import time

def main1():
    x = [None]*10000
    for i in range(10000):
        x[i] = i
    print(x)

def main2():
    y = []
    for i in range(10000):
        y.append(i)
    print(y)

start_time = time.time()
main1()
print("1 --- %s seconds ---" % (time.time() - start_time))
#1 --- 0.03682112693786621 seconds ---

start_time = time.time()
main2()
print("2 --- %s seconds ---" % (time.time() - start_time))
#2 --- 0.01464700698852539 seconds ---
As you can see, the first way is actually significantly slower on my computer!
If you know the size of your list, your first approach is correct.
You can also use the numpy library to create an array of any size you want and convert it to a list, for example:
import numpy as np

n = 10                # size of your list
X = np.zeros(n)       # preallocate an array of length n
# use your for loop here to fill it
for i in range(n):
    X[i] = i + 1
result = X.tolist()   # convert back to a plain Python list
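For the specific case of consecutive integers, a shorter route (my addition, not part of the answer above) is to let numpy build the range directly:

import numpy as np
n = 10
result = np.arange(1, n + 1).tolist()   # [1, 2, ..., 10]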
I got the opposite result compared to another answer.
import time

count = 10000000

def main1():
    x = [None]*count
    for i in range(count):
        x[i] = i

def main2():
    y = []
    for i in range(count):
        y.append(i)

start_time = time.time()
main1()
end_time = time.time()
print("1:", end_time-start_time)  # 0.488

start_time = time.time()
main2()
end_time = time.time()
print("2:", end_time-start_time)  # 0.728
I wrote this program to properly learn how to use multi-threading. I want to implement something similar to this in my own program:
import numpy as np
import time
import os
import math
import random
from threading import Thread

def powExp(x, r):
    for c in range(x.shape[1]):
        x[r][c] = math.pow(100, x[r][c])

def main():
    print()
    rows = 100
    cols = 100

    x = np.random.random((rows, cols))
    y = x.copy()

    start = time.time()
    threads = []
    for r in range(x.shape[0]):
        t = Thread(target=powExp, args=(x, r))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    end = time.time()
    print("Multithreaded calculation took {n} seconds!".format(n=end - start))

    start = time.time()
    for r in range(y.shape[0]):
        for c in range(y.shape[1]):
            y[r][c] = math.pow(100, y[r][c])
    end = time.time()
    print("Singlethreaded calculation took {n} seconds!".format(n=end - start))
    print()

    randRow = random.randint(0, rows - 1)
    randCol = random.randint(0, cols - 1)
    print("Checking random indices in x and y:")
    print("x[{rR}][{rC}]: = {n}".format(rR=randRow, rC=randCol, n=x[randRow][randCol]))
    print("y[{rR}][{rC}]: = {n}".format(rR=randRow, rC=randCol, n=y[randRow][randCol]))
    print()

    for r in range(x.shape[0]):
        for c in range(x.shape[1]):
            if x[r][c] != y[r][c]:
                print("ERROR NO WORK WAS DONE")
                print("x[{r}][{c}]: {n} == y[{r}][{c}]: {ny}".format(
                    r=r,
                    c=c,
                    n=x[r][c],
                    ny=y[r][c]
                ))
                quit()

    assert(np.array_equal(x, y))

if __name__ == '__main__':
    main()
As you can see from the code, the goal here is to parallelize the operation math.pow(100, x[r][c]) by creating a thread for every row. However, this code is extremely slow, a lot slower than the single-threaded version.
Output:
Multithreaded calculation took 0.026447772979736328 seconds!
Singlethreaded calculation took 0.006798267364501953 seconds!
Checking random indices in x and y:
x[58][58]: = 9.792315687115973
y[58][58]: = 9.792315687115973
I searched through Stack Overflow and found some info about the GIL forcing Python bytecode to be executed on a single core only. However, I'm not sure that this is in fact what is limiting my parallelization. I tried rearranging the parallelized for loop using pools instead of threads. Nothing seems to be working.
Python code performance decreases with threading
EDIT: This thread discusses the same issue. Is it completely impossible to increase performance using multithreading in Python because of the GIL? Is the GIL causing my slowdowns?
EDIT 2 (2017-01-18): From what I can gather after quite a bit of searching online, it seems like Python is really bad at parallelism. What I'm trying to do is parallelize a Python function used in a neural network implemented in TensorFlow... it seems like adding a custom op is the way to go.
The number of issues here is quite... numerous. Too many (system!) threads that do too little work, the GIL, etc. This is what I consider a really good introduction to parallelism in Python:
https://www.youtube.com/watch?v=MCs5OvhV9S4
Live coding is awesome.
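To make the "too many threads doing too little work" point concrete, here is a hedged sketch of my own (not from the linked talk): hand a few large row blocks to a small process pool and let numpy transform each block in one vectorized call. For an array this small the single np.power call alone will usually win, so treat this purely as an illustration of the pattern.

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def pow_block(block):
    # 100 ** block for the whole block at once, computed in C by numpy
    return np.power(100.0, block)

if __name__ == '__main__':
    x = np.random.random((100, 100))
    n_workers = 4
    blocks = np.array_split(x, n_workers, axis=0)   # a few big blocks, not one worker per row
    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        result = np.vstack(list(ex.map(pow_block, blocks)))
    assert np.allclose(result, np.power(100.0, x))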
I am trying to come up with a way to have threads work on the same goal without interfering. In this case I am using 4 threads to add up every number between 0 and 90,000. This code runs, but it ends almost immediately (runtime: 0.00399994850159 sec) and only outputs 0. Originally I wanted to do it with a global variable, but I was worried about the threads interfering with each other (i.e. the small chance that two threads double-count or skip a number due to unlucky timing of the reads/writes). So instead I distributed the workload beforehand. If there is a better way to do this, please share. This is my simple way of trying to get some experience with multithreading. Thanks.
import threading
import time

start_time = time.time()

tot1 = 0
tot2 = 0
tot3 = 0
tot4 = 0

def Func(x, y, tot):
    tot = 0
    i = y-x
    while z in range(0, i):
        tot = tot + i + z

# class Tester(threading.Thread):
#     def run(self):
#         print(n)

w = threading.Thread(target=Func, args=(0, 22499, tot1))
x = threading.Thread(target=Func, args=(22500, 44999, tot2))
y = threading.Thread(target=Func, args=(45000, 67499, tot3))
z = threading.Thread(target=Func, args=(67500, 89999, tot4))

w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}

total = tot1 + tot2 + tot3 + tot4
print(total)
print("--- %s seconds ---" % (time.time() - start_time))
You have a bug that makes this program end almost immediately. Look at while z in range(0,i): in Func. z isn't defined in the function, and it's only by luck (bad luck, really) that you happen to have a global variable z = threading.Thread(target=Func, args=(67500,89999,tot4)) that masks the problem. You are testing whether that thread object is in a list of integers... and it's not!
The next problem is the global variables. First, you are absolutely right that using a single global variable is not thread-safe: the threads would mess with each other's calculations. But you misunderstand how globals work. When you do threading.Thread(target=Func, args=(67500,89999,tot4)), Python passes the object currently referenced by tot4 to the function, but the function has no idea which global it came from. You only update the local variable tot and discard it when the function completes.
A solution is to use a global container to hold the calculations, as shown in the example below. Unfortunately, this is actually slower than just doing all the work in one thread. The Python global interpreter lock (GIL) only lets one thread run at a time, so threading only slows down CPU-intensive tasks implemented in pure Python.
You could look at the multiprocessing module to split this into multiple processes. That works well if the cost of running the calculation is large compared to the cost of starting the process and passing it data.
Here is a working copy of your example:
import threading
import time

start_time = time.time()

tot = [0] * 4

def Func(x, y, tot_index):
    my_total = 0
    i = y-x
    for z in range(0, i):
        my_total = my_total + i + z
    tot[tot_index] = my_total

# class Tester(threading.Thread):
#     def run(self):
#         print(n)

w = threading.Thread(target=Func, args=(0, 22499, 0))
x = threading.Thread(target=Func, args=(22500, 44999, 1))
y = threading.Thread(target=Func, args=(45000, 67499, 2))
z = threading.Thread(target=Func, args=(67500, 89999, 3))

w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}

total = sum(tot)
print(total)
print("--- %s seconds ---" % (time.time() - start_time))
You can pass in a mutable object to which you can add your results, either with an identifier (e.g. a dict) or simply a list you append() the results to, e.g.:
import threading

def Func(start, stop, results):
    results.append(sum(range(start, stop+1)))

rngs = [(0, 22499), (22500, 44999), (45000, 67499), (67500, 89999)]
results = []
jobs = [threading.Thread(target=Func, args=(start, stop, results)) for start, stop in rngs]

for j in jobs:
    j.start()

for j in jobs:
    j.join()

print(sum(results))
# 4049955000
# 100 loops, best of 3: 2.35 ms per loop
As others have noted, you could look at multiprocessing in order to split the work across multiple processes that can run in parallel. This is especially beneficial for CPU-intensive tasks, assuming there isn't a huge amount of data to pass between the processes.
Here's a simple implementation of the same functionality using multiprocessing:
from multiprocessing import Pool

POOL_SIZE = 4
NUMBERS = 90000

def func(_range):
    tot = 0
    for z in range(*_range):
        tot += z
    return tot

with Pool(POOL_SIZE) as pool:
    chunk_size = int(NUMBERS / POOL_SIZE)
    chunks = ((i, i + chunk_size) for i in range(0, NUMBERS, chunk_size))
    print(sum(pool.imap(func, chunks)))
In the above, chunks is a generator that produces the same ranges that were hardcoded in the original version. It's given to imap, which works like the standard map except that it executes the function in the processes of the pool.
A lesser-known fact about multiprocessing is that you can easily convert the code to use threads instead of processes via the undocumented multiprocessing.pool.ThreadPool. To convert the above example to use threads, just change the import to:
from multiprocessing.pool import ThreadPool as Pool
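For completeness, here is the previous example with only the import swapped and everything else unchanged. Note that for this CPU-bound summation the thread-backed pool will not actually run faster because of the GIL; the swap mainly pays off when the work releases the GIL (e.g. I/O):

from multiprocessing.pool import ThreadPool as Pool   # the only changed line

POOL_SIZE = 4
NUMBERS = 90000

def func(_range):
    tot = 0
    for z in range(*_range):
        tot += z
    return tot

with Pool(POOL_SIZE) as pool:
    chunk_size = int(NUMBERS / POOL_SIZE)
    chunks = ((i, i + chunk_size) for i in range(0, NUMBERS, chunk_size))
    print(sum(pool.imap(func, chunks)))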