Why do python Processes behave differently with barrier.wait()?

Why do python Processes behave differently with barrier.wait()? - python

TLDR
Selecting the process that gets 0 back from barrier.wait() biases the order of the processes on the next iteration. Why?
Full Story
The below problem is a simple version of my actual problem.
Lets say I have the following multiprocessing problem:
I have a string (named output), n processes, n characters (not spaces), and N iterations.
Example: output, n, characters, N = '', 3, ['a', 'b', 'c'], 4
For each iteration, every process should append its character to the string (in any order). After each iteration, a space character should be appended to the string.
For the example, the output could be:
output = 'bca abc bac cab'
To get this functionality I use the multiprocessing library.
from multiprocessing import Process, Lock, Value, Manager, Barrier
import ctypes
def print_characters(lock, barrier, character, N, output):
for i in range(N):
lock.acquire()
output.value += character
lock.release()
_id = barrier.wait() # get an id for each thread (although not promised to be assigned randomly)
if _id == 0: # select 1 and get this thread to append the ' '
lock.acquire()
output.value += ' '
lock.release()
barrier.wait()
if __name__ == '__main__':
manager = Manager()
output = manager.Value(ctypes.c_char_p, "")
num_processes = 3
characters = 'abc'
N = 4
lock, barrier = Lock(), Barrier(num_processes)
processes = []
for i in range(num_processes):
p = Process(target=print_characters, args=(lock, barrier, characters[i], N, output))
p.start()
processes.append(p)
for p in processes:
p.join()
print(output.value)
Cool....
According to the documentation for Barrier.wait(),
When all the [processes] to the barrier have called this function, they are all released simultaneously.
So, if all n processes are released simultaneously, I wondered if this would cause a kind of "uniform random race condition" on the processes such that each process had an equal chance of getting to the lock.acquire() on the next iteration.
I don't really care about the order but it might be nice to know if I have the processes doing something different in the future.
To see this if this was the case I counted the number of times that each character came in which position:
output = output.value.split(' ')[:-1]
c = [[0]*num_processes for i in range(num_processes)]
for line in output:
for i, character in enumerate(line):
ind = characters.find(character)
c[ind][i] += 1
print(c)
For N = 1000, this is one output I got:
[[1000 0 0]
[ 0 948 52]
[ 0 52 948]]
:| ok?
So out of 1000 iterations, the process with the first character was always first. That didn't seem very random.
I then made one change in the print_characters function:
if _id == 1:
And I got:
[[254 254 492]
[489 489 22]
[257 257 486]]
Not uniform random but also not as biased.
Changing the number of processes (assuming the cpu can support it) and thus the number of characters shows this same effect:
Selecting the process that gets 0 back from barrier.wait() biases the order of the processes on the next iteration.
Why?

Related

separate the abnormal reads of DNA (A,T,C,G) templates

I have millions of DNA clone reads and few of them are misreads or error. I want to separate the clean reads only.
For non biological background:
DNA clone consist of only four characters (A,T,C,G) in various permutation/combination. Any character, symbol or sign other that "A","T","C", and "G" in DNA is an error.
Is there any way (fast/high throughput) in python to separate the clean reads only.
Basically I want to find a way through which I can separate a string which has nothing but "A","T","C","G" alphabet characters.
Edit
correct_read_clone: "ATCGGTTCATCGAATCCGGGACTACGTAGCA"
misread_clone: "ATCGGNATCGACGTACGTACGTTTAAAGCAGG" or "ATCGGTT#CATCGAATCCGGGACTACGTAGCA" or "ATCGGTTCATCGAA*TCCGGGACTACGTAGCA" or "AT?CGGTTCATCGAATCCGGGACTACGTAGCA" etc
I have tried the below for loop
check_list=['A','T','C','G']
for i in clone:
if i not in check_list:
continue
but the problem with this for loop is, it iterates over the string and match one by one which makes this process slow. To clean millions of clone this delay is very significant.

If these are the nucleotide sequences with an error in 2 of them,
a = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
b = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
c = 'ATUACTGAGTCAGTACGT'
d = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
e = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
f = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
g = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
test = a, b, c, d, e, f, g
try:
misread_counter = 0
correct_read_clone = []
for clone in test:
if len(set(list(clone))) <= 4:
correct_read_clone.append(clone)
else:
misread_counter +=1
print(f'Unclean sequences: {misread_counter}')
print(correct_read_clone)
Output:
Unclean sequences: 2
['ATACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT', 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT']
This way the for loop only has to attend each full sequence in a list of clones, rather than looping over each character of every sequence.
or if you want to know which ones have the errors you can make two lists:
misread_clone = []
correct_read_clone = []
for clone in test:
bases = len(set(list(clone)))
misread_clone.append(clone) if bases > 4 else correct_read_clone.append(clone)
print(f'misread sequences count: {len(misread_clone)}')
print(correct_read_clone)
Output:
misread sequences count: 2
['ATACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT', 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT', 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT']

I don't think you're going to get too many significant improvements for this. Most operations on a string are going to be O(N), and there isn't much you can do to get it to O(log(N)) or O(1). The checking for the values in ACTG is also O(N), leading to a worse case of O(n*m), where n and m are the lengths of the string and ACTG.
One thing you could do is cast the string into a set, which would be O(N), check if the length of set is more than 4 (which should be impossible if the only characters are ACTG) and if not, loop through the set and do the check against ACTG. I am assuming that it is possible that a clone could possibly be a string such as "AACCAA!!" which results in a set of ['A', 'C', '!'] in which case the length would be less than or equal to 4, but still be unclean/incorrect.
clones = [ "ACGTATCG", "AGCTGACGAT", "AGTACGATCAGT", "ACTGAGTCAGTACGT", "AGTACGTACGATCAGTACGT", "AAACCS", "AAACCCCCGGGGTTTS"]
for clone in clones:
if len(set(clone)) > 4:
print(f"unclean: {clone}")
else:
for char in clone:
if char not in "ACTG":
print(f"unclean: {clone}")
break
else:
print(f"clean: {clone}")
Since len(set) is O(1), that could potentially skip the need to check against ACTG. If it is less than or equal to 4, then the check would be O(n*m) again, but in this case the n is guaranteed to be less than 4 while your m stays the same at 4. The final process becomes O(n) rather than O(n*m), where n and m are the lengths of the set and ACTG. Since you are now checking against a set and anything other than ACTG will be unclean, n has a cap of 5. This means that no matter how large the original string is, doing the ACTG check on the set will be worst case O(5*4) and is thus essentially O(1) (Big O notation is about scale rather than exact values).
However, whether or not this is actually faster would depend on the length of the original string. It may end up taking more time if the original string is short. This would be unlikely, since the string would have to be very short, but can be the case.
You may get more time saved by tackling the amount of entries which you have noted is very large, if possible you may want to consider if you can split this into smaller groups to run them asynchronously. However, at the end of the day none of these are going to actively scale down your time. They would reduce the time taken since you'd be cutting out a constant scale from the time complexity or running a few at the same time, but at the end of the day it's still an O(N*M), with N and M being the number and length of strings, and there isn't anything that can really change that.

try this:
def is_clean_read(read):
for char in read:
if char not in ['A', 'T', 'C', 'G']:
return False
return True
reads = [ "ACGTATCG", "AGCTGACGAT", "AGTACGATCAGT", "ACTGAGTCAGTACGT", "AGTACGTACGATCAGTACGT"]
clean_reads = [read for read in reads if is_clean_read(read)]
print(clean_reads)

ok stealing from answer https://stackoverflow.com/a/75393987/9877065 by Shorn, tried to add multiprocessing, you can play with the lenght of my orfs list in the first part of the code and then try to change the number_of_processes = XXX to different values from 1 to your system max : multiprocessing.cpu_count(), code :
import time
from multiprocessing import Pool
from datetime import datetime
a = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
b = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
c = 'ATUACTGAGTCAGTACGT'
d = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
e = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
f = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
g = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
aa = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
bb = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
cc = 'ATUACTGAGTCAGTACGT'
dd = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
ee = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
ff = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
gg = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
aaa = 'ATACTGAGTCAGTACGTACTGAGTCAGTACGT'
bbb = 'AACTGAGTCAGTACGTACTGAGTCAAGTCAGTACGTSACTGAGTCAGTACGT'
ccc = 'ATUACTGAGTCAGTACGT'
ddd = 'AAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
eee = 'AACTGAGTCAGTAAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
fff = 'AAGTACGTACTGAGTCAGTACGTACTCAGTACGT'
ggg = 'ATCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGTACTGAGTCAGTACGT'
kkk = 'AAAAAAAAAAAAAAAAAAAAAAkkkkkkkkkkkkk'
clones = [a, b, c, d, e, f, g, aa, bb, cc, dd, ee, ff,gg, aaa, bbb, ccc, ddd, eee, fff, ggg, kkk]
clones_2 = clones
clones_2.extend(clones)
clones_2.extend(clones)
clones_2.extend(clones)
clones_2.extend(clones)
clones_2.extend(clones)
clones_2.extend(clones)
# clones_2.extend(clones)
# clones_2.extend(clones)
#print(clones_2, len(clones_2))
def check(clone):
# ATTENZIONE ALLUNGA TEMPO CPU vs I/O ##############################################################################################################
# time.sleep(1) ####################################################### !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
if len(set(clone)) > 4:
print(f"unclean: {clone}")
else:
for char in clone:
if char not in "ACTG":
print(f"unclean: {clone}")
break
else:
print(f"clean: {clone}")
begin = datetime.now()
number_of_processes = 4
p = Pool(number_of_processes)
list_a = []
cnt_atgc = 0
while True:
for i in clones_2 :
try:
list_a.append(i)
cnt_atgc += 1
if cnt_atgc == number_of_processes:
result = p.map(check, list_a)
p.close()
p.join()
p = Pool(number_of_processes)
cnt_atgc = 0
list_a = []
else:
continue
except:
print('SKIPPED !!!')
if len(list_a) > 0:
p = Pool(number_of_processes)
result = p.map(check, list_a)
p.close()
p.join()
break
else:
print('FINITO !!!!!!!!!!')
break
print('done')
print(datetime.now() - begin)
I have to pre load a list containing the orfs to be multiprocessed at each iteration, despite that can at least cut the execuition time by half on my machine, not sure how stdout influence the speed of the multiprocessing (and how to cope with result order see python multiprocess.Pool show results in order in stdout).

Why is my parallel code slower than my serial code?

Recently started learning parallel on my own and I have next to no idea what I'm doing. Tried applying what I have learnt but I think I'm doing something wrong because my parallel code is taking a longer time to execute than my serial code. My PC is running a i7-9700. This is the original serial code in question
def getMatrix(name):
matrixCreated = []
i = 0
while True:
i += 1
row = input('\nEnter elements in row %s of Matrix %s (separated by commas)\nOr -1 to exit: ' %(i, name))
if row == '-1':
break
else:
strList = row.split(',')
matrixCreated.append(list(map(int, strList)))
return matrixCreated
def getColAsList(matrixToManipulate, col):
myList = []
numOfRows = len(matrixToManipulate)
for i in range(numOfRows):
myList.append(matrixToManipulate[i][col])
return myList
def getCell(matrixA, matrixB, r, c):
matrixBCol = getColAsList(matrixB, c)
lenOfList = len(matrixBCol)
productList = [matrixA[r][i]*matrixBCol[i] for i in range(lenOfList)]
return sum(productList)
matrixA = getMatrix('A')
matrixB = getMatrix('B')
rowA = len(matrixA)
colA = len(matrixA[0])
rowB = len(matrixB)
colB = len(matrixB[0])
result = [[0 for p in range(colB)] for q in range(rowA)]
if (colA != rowB):
print('The two matrices cannot be multiplied')
else:
print('\nThe result is')
for i in range(rowA):
for j in range(colB):
result[i][j] = getCell(matrixA, matrixB, i, j)
print(result[i])
EDIT: This is the parallel code with time library. Initially didn't include it as I thought it was wrong so just wanted to see if anyone had ideas to parallize it instead
import multiprocessing as mp
pool = mp.Pool(mp.cpu_count())
def getMatrix(name):
matrixCreated = []
i = 0
while True:
i += 1
row = input('\nEnter elements in row %s of Matrix %s (separated by commas)\nOr -1 to exit: ' %(i, name))
if row == '-1':
break
else:
strList = row.split(',')
matrixCreated.append(list(map(int, strList)))
return matrixCreated
def getColAsList(matrixToManipulate, col):
myList = []
numOfRows = len(matrixToManipulate)
for i in range(numOfRows):
myList.append(matrixToManipulate[i][col])
return myList
def getCell(matrixA, matrixB, r, c):
matrixBCol = getColAsList(matrixB, c)
lenOfList = len(matrixBCol)
productList = [matrixA[r][i]*matrixBCol[i] for i in range(lenOfList)]
return sum(productList)
matrixA = getMatrix('A')
matrixB = getMatrix('B')
rowA = len(matrixA)
colA = len(matrixA[0])
rowB = len(matrixB)
colB = len(matrixB[0])
import time
start_time = time.time()
result = [[0 for p in range(colB)] for q in range(rowA)]
if (colA != rowB):
print('The two matrices cannot be multiplied')
else:
print('\nThe result is')
for i in range(rowA):
for j in range(colB):
result[i][j] = getCell(matrixA, matrixB, i, j)
print(result[i])
print (" %s seconds " % (time.time() - start_time))
results = [pool.apply(getMatrix, getColAsList, getCell)]
pool.close()

So I would agree that you are doing something wrong. I would say that your code is not parallelable.
For the code to be parallelable it has to be dividable into smaller pieces and it either has to be:
1, Independent, meaning when it runs it doesn't rely on other processes to do its job.
For example if I have a list with 1,000,000 objects that need to be processed. And I have 4 workers to process them with. Then give each worker 1/4 of the objects to process and then when they finish all objects have been processed. But worker 3 doesn't care if worker 1, 2 or 4 completed before or after it did. Nor does worker3 care about what worker 1, 2 or 4 returned or did. It actually shouldn't even know that there are any other workers out there.
2, Managed, meaning there is dependencies between workers but thats ok cause you have a main thread that coordinates the workers. Still though, workers shouldn't know or care about each other. Think of them as mindless muscle, they only do what you tell them to do. Not to think for themselves.
For example I have a list with 1,000,000 objects that need to be processed. First all objects need to go through func1 which returns something. Once ALL objects are done with func1 those results should then go into func2. So I create 4 workers, give each worker 1/4 of the objects and have them process them with func1 and return the results. I wait for all workers to finish processing the objects. Then I give each worker 1/4 of the results returned by func1 and have them process it with func2. And I can keep doing this as many times as I want. All I have to do is have the main thread coordinate the workers so they dont start when they aren't suppose too and tell them what and when to process.
Take this with a grain of salt as this is a simplified version of parallel processing.
Tip for parallel and concurrency
You shouldn't get user input in parallel. Only the main thread should handle that.
If your work load is light then you shouldn't use parallel processing.
If your task can't be divided up into smaller pieces then its not parallelable. But it can still be run on a background thread as a way of running something concurrently.
Concurrency Example:
If your task is long running and not parallelable, lets say it takes 10 minutes to complete. And it requires a user to give input. Then when the user gives input start the task on a worker. If the user gives input again 1 minute later then take that input and start the 2nd task on worker2. Input at 5 minutes start task3 on worker3. At the 10 minute mark task1 is complete. Because everything is running concurrently by the 15 minute mark all task are complete. That's 2x faster then running the tasks in serial which would take 30 minutes. However this is concurrency not parallel.

How to maintain global processes in a pool working recursively?

I want to implement a recursive parallel algorithm and I want a pool to be created only once and each time step do a job wait for all the jobs to finish and then call the processes again with inputs the previous outputs and then again the same at the next time step, etc.
My problem is that I have implemented a version where every time step I create and kill the pool, but this is extremely slow, even slower than the sequential version. When I try to implement a version where the pool is created only once at the beginning I got assertion error when I try to call join().
This is my code
def log_result(result):
tempx , tempb, u = result
X[:,u,np.newaxis], b[:,u,np.newaxis] = tempx , tempb
workers = mp.Pool(processes = 4)
for t in range(p,T):
count = 0 #==========This is only master's job=============
for l in range(p):
for k in range(4):
gn[count]=train[t-l-1,k]
count+=1
G = G*v + gn # gn.T#==================================
if __name__ == '__main__':
for i in range(4):
workers.apply_async(OULtraining, args=(train[t,i], X[:,i,np.newaxis], b[:,i,np.newaxis], i, gn), callback = log_result)
workers.join()
X and b are the matrices that I want to update directly at the master's memory.
What is wrong here and I get the assertion error?
Can I implement with the pool what I want or not?

You cannot join a pool that is not closed first, as join() will wait worker processes to terminate, not jobs to complete (https://docs.python.org/3.6/library/multiprocessing.html section 17.2.2.9).
But as this will close the pool, which is not what you want, you cannot use this. So join is out, and you need to implement a "wait until all jobs completed" by yourself.
One way of doing this without busy loops would be using a queue. You could also work with bounded semaphores, but they do not work on all operating systems.
counter = 0
lock_queue = multiprocessing.Queue()
counter_lock = multiprocessing.Lock()
def log_result(result):
tempx , tempb, u = result
X[:,u,np.newaxis], b[:,u,np.newaxis] = tempx , tempb
with counter_lock:
counter += 1
if counter == 4:
counter = 0
lock_queue.put(42)
workers = mp.Pool(processes = 4)
for t in range(p,T):
count = 0 #==========This is only master's job=============
for l in range(p):
for k in range(4):
gn[count]=train[t-l-1,k]
count+=1
G = G*v + gn # gn.T#==================================
if __name__ == '__main__':
counter = 0
for i in range(4):
workers.apply_async(OULtraining, args=(train[t,i], X[:,i,np.newaxis], b[:,i,np.newaxis], i, gn), callback = log_result)
lock_queue.get(block=True)
This resets a global counter before submitting jobs. As soon as a job is completed, you callback increments a global counter. When the counter hits 4 (your number of jobs), the callback knows it has processed the last result. Then a dummy message is sent in a queue. Your main program is waiting at Queue.get() for something to appear there.
This allows your main program to block until all jobs have completed, without closing down the pool.
If you replace multiprocessing.Pool with ProcessPoolExecutor from concurrent.futures, you can skip this part and use
concurrent.futures.wait(fs, timeout=None, return_when=ALL_COMPLETED)
to block until all submitted tasks have finished. From functional standpoint there is no difference between these. The concurrent.futures method is a couple of lines shorter but the result is exactly the same.

Can Python threads work on the same process?

I am trying to come up with a way to have threads work on the same goal without interfering. In this case I am using 4 threads to add up every number between 0 and 90,000. This code runs but it ends almost immediately (Runtime: 0.00399994850159 sec) and only outputs 0. Originally I wanted to do it with a global variable but I was worried about the threads interfering with each other (ie. the small chance that two threads double count or skip a number due to strange timing of the reads/writes). So instead I distributed the workload beforehand. If there is a better way to do this please share. This is my simple way of trying to get some experience into multi threading. Thanks
import threading
import time
start_time = time.time()
tot1 = 0
tot2 = 0
tot3 = 0
tot4 = 0
def Func(x,y,tot):
tot = 0
i = y-x
while z in range(0,i):
tot = tot + i + z
# class Tester(threading.Thread):
# def run(self):
# print(n)
w = threading.Thread(target=Func, args=(0,22499,tot1))
x = threading.Thread(target=Func, args=(22500,44999,tot2))
y = threading.Thread(target=Func, args=(45000,67499,tot3))
z = threading.Thread(target=Func, args=(67500,89999,tot4))
w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}
total = tot1 + tot2 + tot3 + tot4
print total
print("--- %s seconds ---" % (time.time() - start_time))

You have a bug that makes this program end almost immediately. Look at while z in range(0,i): in Func. z isn't defined in the function and its only by luck (bad luck really) that you happen to have a global variable z = threading.Thread(target=Func, args=(67500,89999,tot4)) that masks the problem. You are testing whether the thread object is in a list of integers... and its not!
The next problem is with the global variables. First, you are absolutely right that using a single global variable is not thread safe. The threads would mess with each others calculations. But you misunderstand how globals work. When you do threading.Thread(target=Func, args=(67500,89999,tot4)), python passes the object currently referenced by tot4 to the function, but the function has no idea which global it came from. You only update the local variable tot and discard it when the function completes.
A solution is to use a global container to hold the calculations as shown in the example below. Unfortunately, this is actually slower than just doing all the work in one thread. The python global interpreter lock (GIL) only lets 1 thread run at a time and only slows down CPU-intensive tasks implemented in pure python.
You could look at the multiprocessing module to split this into multiple processes. That works well if the cost of running the calculation is large compared to the cost of starting the process and passing it data.
Here is a working copy of your example:
import threading
import time
start_time = time.time()
tot = [0] * 4
def Func(x,y,tot_index):
my_total = 0
i = y-x
for z in range(0,i):
my_total = my_total + i + z
tot[tot_index] = my_total
# class Tester(threading.Thread):
# def run(self):
# print(n)
w = threading.Thread(target=Func, args=(0,22499,0))
x = threading.Thread(target=Func, args=(22500,44999,1))
y = threading.Thread(target=Func, args=(45000,67499,2))
z = threading.Thread(target=Func, args=(67500,89999,3))
w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}
total = sum(tot)
print total
print("--- %s seconds ---" % (time.time() - start_time))

You can pass in a mutable object that you can add your results either with an identifier, e.g. dict or just a list and append() the results, e.g.:
import threading
def Func(start, stop, results):
results.append(sum(range(start, stop+1)))
rngs = [(0, 22499), (22500, 44999), (45000, 67499), (67500, 89999)]
results = []
jobs = [threading.Thread(target=Func, args=(start, stop, results)) for start, stop in rngs]
for j in jobs:
j.start()
for j in jobs:
j.join()
print(sum(results))
# 4049955000
# 100 loops, best of 3: 2.35 ms per loop

As others have noted you could look multiprocessing in order to split the work to multiple different processes that can run parallel. This would benefit especially in CPU-intensive tasks assuming that there isn't huge amount of data to pass between the processes.
Here's a simple implementation of the same functionality using multiprocessing:
from multiprocessing import Pool
POOL_SIZE = 4
NUMBERS = 90000
def func(_range):
tot = 0
for z in range(*_range):
tot += z
return tot
with Pool(POOL_SIZE) as pool:
chunk_size = int(NUMBERS / POOL_SIZE)
chunks = ((i, i + chunk_size) for i in range(0, NUMBERS, chunk_size))
print(sum(pool.imap(func, chunks)))
In above chunks is a generator that produces the same ranges that were hardcoded in original version. It's given to imap which works the same as standard map except that it executes the function in the processes within the pool.
Less known fact about multiprocessing is that you can easily convert the code to use threads instead of processes by using undocumented multiprocessing.pool.ThreadPool. In order to convert above example to use threads just change import to:
from multiprocessing.pool import ThreadPool as Pool

Multiprocessing in python to speed up functions

I am confused with Python multiprocessing.
I am trying to speed up a function which process strings from a database but I must have misunderstood how multiprocessing works because the function takes longer when given to a pool of workers than with “normal processing”.
Here an example of what I am trying to achieve.
from time import clock, time
from multiprocessing import Pool, freeze_support
from random import choice
def foo(x):
TupWerteMany = []
for i in range(0,len(x)):
TupWerte = []
s = list(x[i][3])
NewValue = choice(s)+choice(s)+choice(s)+choice(s)
TupWerte.append(NewValue)
TupWerte = tuple(TupWerte)
TupWerteMany.append(TupWerte)
return TupWerteMany
if __name__ == '__main__':
start_time = time()
List = [(u'1', u'aa', u'Jacob', u'Emily'),
(u'2', u'bb', u'Ethan', u'Kayla')]
List1 = List*1000000
# METHOD 1 : NORMAL (takes 20 seconds)
x2 = foo(List1)
print x2[1:3]
# METHOD 2 : APPLY_ASYNC (takes 28 seconds)
# pool = Pool(4)
# Werte = pool.apply_async(foo, args=(List1,))
# x2 = Werte.get()
# print '--------'
# print x2[1:3]
# print '--------'
# METHOD 3: MAP (!! DOES NOT WORK !!)
# pool = Pool(4)
# Werte = pool.map(foo, args=(List1,))
# x2 = Werte.get()
# print '--------'
# print x2[1:3]
# print '--------'
print 'Time Elaspse: ', time() - start_time
My questions:
Why does apply_async takes longer than the “normal way” ?
What I am doing wrong with map?
Does it makes sense to speed up such tasks with multiprocessing at all?
Finally: after all I have read here, I am wondering if multiprocessing in python works on windows at all ?

So your first problem is that there is no actual parallelism happening in foo(x), you are passing the entire list to the function once.
1)
The idea of a process pool is to have many processes doing computations on separate bits of some data.
# METHOD 2 : APPLY_ASYNC
jobs = 4
size = len(List1)
pool = Pool(4)
results = []
# split the list into 4 equally sized chunks and submit those to the pool
heads = range(size/jobs, size, size/jobs) + [size]
tails = range(0,size,size/jobs)
for tail,head in zip(tails, heads):
werte = pool.apply_async(foo, args=(List1[tail:head],))
results.append(werte)
pool.close()
pool.join() # wait for the pool to be done
for result in results:
werte = result.get() # get the return value from the sub jobs
This will only give you an actual speedup if the time it takes to process each chunk is greater than the time it takes to launch the process, in the case of four processes and four jobs to be done, of course these dynamics change if you've got 4 processes and 100 jobs to be done. Remember that you are creating a completely new python interpreter four times, this isn't free.
2) The problem you have with map is that it applies foo to EVERY element in List1 in a separate process, this will take quite a while. So if you're pool has 4 processes map will pop an item of the list four times and send it to a process to be dealt with - wait for process to finish - pop some more stuff of the list - wait for the process to finish. This makes sense only if processing a single item takes a long time, like for instance if every item is a file name pointing to a one gigabyte text file. But as it stands map will just take a single string of the list and pass it to foo where as apply_async takes a slice of the list. Try the following code
def foo(thing):
print thing
map(foo, ['a','b','c','d'])
That's the built-in python map and will run a single process, but the idea is exactly the same for the multiprocess version.
Added as per J.F.Sebastian's comment: You can however use the chunksize argument to map to specify an approximate size of for each chunk.
pool.map(foo, List1, chunksize=size/jobs)
I don't know though if there is a problem with map on Windows as I don't have one available for testing.
3) yes, given that your problem is big enough to justify forking out new python interpreters
4) can't give you a definitive answer on that as it depends on the number of cores/processors etc. but in general it should be fine on Windows.

On question (2)
With the guidance of Dougal and Matti, I figured out what's went wrong.
The original foo function processes a list of lists, while map requires a function to process single elements.
The new function should be
def foo2 (x):
TupWerte = []
s = list(x[3])
NewValue = choice(s)+choice(s)+choice(s)+choice(s)
TupWerte.append(NewValue)
TupWerte = tuple(TupWerte)
return TupWerte
and the block to call it :
jobs = 4
size = len(List1)
pool = Pool()
#Werte = pool.map(foo2, List1, chunksize=size/jobs)
Werte = pool.map(foo2, List1)
pool.close()
print Werte[1:3]
Thanks to all of you who helped me understand this.
Results of all methods:
for List * 2 Mio records: normal 13.3 seconds , parallel with async: 7.5 seconds, parallel with with map with chuncksize : 7.3, without chunksize 5.2 seconds

Here's a generic multiprocessing template if you are interested.
import multiprocessing as mp
import time
def worker(x):
time.sleep(0.2)
print "x= %s, x squared = %s" % (x, x*x)
return x*x
def apply_async():
pool = mp.Pool()
for i in range(100):
pool.apply_async(worker, args = (i, ))
pool.close()
pool.join()
if __name__ == '__main__':
apply_async()
And the output looks like this:
x= 0, x squared = 0
x= 1, x squared = 1
x= 2, x squared = 4
x= 3, x squared = 9
x= 4, x squared = 16
x= 6, x squared = 36
x= 5, x squared = 25
x= 7, x squared = 49
x= 8, x squared = 64
x= 10, x squared = 100
x= 11, x squared = 121
x= 9, x squared = 81
x= 12, x squared = 144
As you can see, the numbers are not in order, as they are being executed asynchronously.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.