Complete a multithreading parallelize process with k threads

The 3SUM problem is defined as follows.
Given: A positive integer k ≤ 20, a positive integer n ≤ 10^4, and k arrays of size n containing integers from −10^5 to 10^5.
Return: For each array A[1..n], output three different indices 1 ≤ p < q < r ≤ n such that A[p] + A[q] + A[r] = 0 if they exist, and "-1" otherwise.
Sample Dataset
4 5
2 -3 4 10 5
8 -6 4 -2 -8
-5 2 3 2 -4
2 4 -5 6 8
Sample Output
-1
1 2 4
1 2 3
-1
However, I want to speed up the code using threads. This is my current Python code:
def TS(arr):
    original = arr[:]
    arr.sort()
    n = len(arr)
    for i in xrange(n-2):
        a = arr[i]
        j = i+1
        k = n-1
        while j < k:
            b = arr[j]
            c = arr[k]
            if a + b + c == 0:
                return sorted([original.index(a)+1,original.index(b)+1,original.index(c)+1])
            elif a + b + c > 0:
                k = k - 1
            else:
                j = j +1
    return [-1]
with open("dataset.txt") as dataset:
    k = int(dataset.readline().split()[0])
    for i in xrange(k):
        aux = map(int, dataset.readline().split())
        results = TS(aux)
        print ' ' . join(map(str, results))
I was thinking of creating k threads and a global array output, but I do not know how to continue developing the idea:
from threading import Thread
class thread_it(Thread):
    def __init__ (self,param):
        Thread.__init__(self)
        self.param = param
    def run(self):
        mutex.acquire()
        output.append(TS(aux))
        mutex.release()
threads = [] #k threads
output = [] #global answer
mutex = thread.allocate_lock()
with open("dataset.txt") as dataset:
    k = int(dataset.readline().split()[0])
    for i in xrange(k):
        aux = map(int, dataset.readline().split())
        current = thread_it(aux)
        threads.append(current)
        current.start()
for t in threads:
    t.join()
What would be the correct way to compute results = TS(aux) inside a thread, wait until all threads have finished, and then print ' '.join(map(str, results)) for all of them?
Update
Got this issue when running script from console

First, as @Cyphase said, because of the GIL you cannot speed up CPU-bound code with threading: only one thread executes Python bytecode at a time, no matter how many cores you have. Consider using multiprocessing to utilize multiple cores; multiprocessing has an API very similar to that of threading.
Second, even if we pretend the GIL doesn't exist, putting everything in a critical section protected by the mutex serializes all the threads. What you need to protect is access to output, so move the processing code out of the critical section so that the threads can run concurrently:
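def run(self):
    result = TS(self.param)  # compute outside the lock, on this thread's own input
    mutex.acquire()
    output.append(result)  # only the shared list needs protection
    mutex.release()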
But don't reinvent the wheel: the Python standard library provides a thread-safe Queue, so use that:
try:
    import Queue as queue # python2
except ImportError:
    import queue # python3
output = queue.Queue()
def run(self):
    result = TS(self.param)
    output.put(result)  # a Queue has put(), not append()
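To tie this back to the question (run the work in threads, wait for all of them, then print in input order), here is a minimal Python 3 sketch. The helper name run_in_threads and the trick of tagging each result with its position are illustrative assumptions, not part of the original code, and sum merely stands in for TS. Because of the GIL this shows the coordination pattern rather than a speedup.

```python
import queue
import threading


def run_in_threads(func, inputs):
    """Run func on each input in its own thread; return results in input order."""
    inputs = list(inputs)
    out = queue.Queue()  # thread-safe, so no explicit mutex is needed

    def worker(position, item):
        # Tag the result with its position: threads may finish in any order.
        out.put((position, func(item)))

    threads = [
        threading.Thread(target=worker, args=(position, item))
        for position, item in enumerate(inputs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has finished

    results = [None] * len(inputs)
    for _ in threads:
        # All workers are done, so get_nowait() never has to wait. It raises
        # queue.Empty if a worker died before producing a result.
        position, value = out.get_nowait()
        results[position] = value
    return results


if __name__ == "__main__":
    arrays = [[2, -3, 4], [8, -6, -2]]
    for total in run_in_threads(sum, arrays):
        print(total)
```

Passing TS instead of sum would give one result list per input line, ready to be printed with ' '.join(map(str, result)).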
With multiprocessing, the final code looks something like this:
from multiprocessing import Process, Queue
output = Queue()
class TSProcess(Process):
    def __init__ (self, param):
        Process.__init__(self)
        self.param = param
    def run(self):
        result = TS(self.param)
        output.put(result)
processes = []
with open("dataset.txt") as dataset:
    k = int(dataset.readline().split()[0])
    for i in xrange(k):
        aux = map(int, dataset.readline().split())
        current = TSProcess(aux)
        processes.append(current)
        current.start()
for p in processes:
    p.join()
# process result with output.get()
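As a shorter variant of the same idea, a process pool can hand one array to each worker and return the results in input order. The following is a Python 3 sketch under stated assumptions: three_sum is a rewrite of TS that carries the original indices through the sort (so duplicate values do not confuse the index lookup), and solve_all is a hypothetical helper name.

```python
from multiprocessing import Pool


def three_sum(arr):
    """Return 1-based indices p < q < r with arr[p]+arr[q]+arr[r] == 0, or [-1]."""
    # Sort (value, original_index) pairs so indices survive the sort.
    pairs = sorted((value, idx) for idx, value in enumerate(arr, start=1))
    n = len(pairs)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            total = pairs[i][0] + pairs[lo][0] + pairs[hi][0]
            if total == 0:
                return sorted([pairs[i][1], pairs[lo][1], pairs[hi][1]])
            if total > 0:
                hi -= 1
            else:
                lo += 1
    return [-1]


def solve_all(arrays, workers=4):
    """Solve every array in a pool of processes; results keep the input order."""
    with Pool(workers) as pool:
        return pool.map(three_sum, arrays)


if __name__ == "__main__":
    sample = [
        [2, -3, 4, 10, 5],
        [8, -6, 4, -2, -8],
        [-5, 2, 3, 2, -4],
        [2, 4, -5, 6, 8],
    ]
    for result in solve_all(sample):
        print(" ".join(map(str, result)))
```

pool.map blocks until every array has been processed, so no explicit join or output queue is needed for this pattern.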

Related

Why is my parallel code slower than my serial code?

I recently started learning parallel programming on my own and have next to no idea what I'm doing. I tried applying what I have learnt, but I think I'm doing something wrong because my parallel code takes longer to execute than my serial code. My PC is running an i7-9700. This is the original serial code in question:
def getMatrix(name):
    matrixCreated = []
    i = 0
    while True:
        i += 1
        row = input('\nEnter elements in row %s of Matrix %s (separated by commas)\nOr -1 to exit: ' %(i, name))
        if row == '-1':
            break
        else:
            strList = row.split(',')
            matrixCreated.append(list(map(int, strList)))
    return matrixCreated
def getColAsList(matrixToManipulate, col):
    myList = []
    numOfRows = len(matrixToManipulate)
    for i in range(numOfRows):
        myList.append(matrixToManipulate[i][col])
    return myList
def getCell(matrixA, matrixB, r, c):
    matrixBCol = getColAsList(matrixB, c)
    lenOfList = len(matrixBCol)
    productList = [matrixA[r][i]*matrixBCol[i] for i in range(lenOfList)]
    return sum(productList)
matrixA = getMatrix('A')
matrixB = getMatrix('B')
rowA = len(matrixA)
colA = len(matrixA[0])
rowB = len(matrixB)
colB = len(matrixB[0])
result = [[0 for p in range(colB)] for q in range(rowA)]
if (colA != rowB):
    print('The two matrices cannot be multiplied')
else:
    print('\nThe result is')
    for i in range(rowA):
        for j in range(colB):
            result[i][j] = getCell(matrixA, matrixB, i, j)
        print(result[i])
EDIT: This is the parallel code, timed with the time library. I initially didn't include it because I thought it was wrong and just wanted to see if anyone had ideas for how to parallelize it instead.
import multiprocessing as mp
pool = mp.Pool(mp.cpu_count())
def getMatrix(name):
    matrixCreated = []
    i = 0
    while True:
        i += 1
        row = input('\nEnter elements in row %s of Matrix %s (separated by commas)\nOr -1 to exit: ' %(i, name))
        if row == '-1':
            break
        else:
            strList = row.split(',')
            matrixCreated.append(list(map(int, strList)))
    return matrixCreated
def getColAsList(matrixToManipulate, col):
    myList = []
    numOfRows = len(matrixToManipulate)
    for i in range(numOfRows):
        myList.append(matrixToManipulate[i][col])
    return myList
def getCell(matrixA, matrixB, r, c):
    matrixBCol = getColAsList(matrixB, c)
    lenOfList = len(matrixBCol)
    productList = [matrixA[r][i]*matrixBCol[i] for i in range(lenOfList)]
    return sum(productList)
matrixA = getMatrix('A')
matrixB = getMatrix('B')
rowA = len(matrixA)
colA = len(matrixA[0])
rowB = len(matrixB)
colB = len(matrixB[0])
import time
start_time = time.time()
result = [[0 for p in range(colB)] for q in range(rowA)]
if (colA != rowB):
    print('The two matrices cannot be multiplied')
else:
    print('\nThe result is')
    for i in range(rowA):
        for j in range(colB):
            result[i][j] = getCell(matrixA, matrixB, i, j)
        print(result[i])
print (" %s seconds " % (time.time() - start_time))
results = [pool.apply(getMatrix, getColAsList, getCell)]
pool.close()

So I would agree that you are doing something wrong. I would say that your code is not parallelizable.
For code to be parallelizable it has to be divisible into smaller pieces, and those pieces have to be either:
1. Independent, meaning that when a piece runs it doesn't rely on other processes to do its job.
For example, say I have a list of 1,000,000 objects that need to be processed and 4 workers to process them with. I give each worker 1/4 of the objects, and when they all finish, all objects have been processed. Worker 3 doesn't care whether worker 1, 2 or 4 completed before or after it did, nor does it care about what worker 1, 2 or 4 returned or did. It actually shouldn't even know that there are any other workers out there.
2. Managed, meaning there are dependencies between workers, but that's OK because you have a main thread that coordinates the workers. Still, workers shouldn't know or care about each other. Think of them as mindless muscle: they only do what you tell them to do and do not think for themselves.
For example, say I have a list of 1,000,000 objects that need to be processed. First, all objects need to go through func1, which returns something. Once ALL objects are done with func1, those results should then go into func2. So I create 4 workers, give each worker 1/4 of the objects, and have them process them with func1 and return the results. I wait for all workers to finish processing the objects. Then I give each worker 1/4 of the results returned by func1 and have them process those with func2. I can keep doing this as many times as I want. All I have to do is have the main thread coordinate the workers so they don't start when they aren't supposed to, and tell them what to process and when.
Take this with a grain of salt as this is a simplified version of parallel processing.
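The "managed" case above can be sketched in Python 3 with concurrent.futures. The bodies of func1 and func2 below are made-up placeholders for the two stages, and run_stages is a hypothetical helper; the point is only that the main thread waits for all of stage 1 before handing out stage 2.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def func1(x):
    """Placeholder for the first processing stage."""
    return x * x


def func2(y):
    """Placeholder for the second stage, which consumes stage-1 results."""
    return y + 1


def run_stages(items, executor, chunksize=1):
    """Run func1 over all items, wait, then run func2 over the results."""
    # list() consumes the whole iterator, so the main thread blocks here
    # until every worker has finished stage 1.
    stage1 = list(executor.map(func1, items, chunksize=chunksize))
    # Only now is stage 2 handed out; workers never talk to each other.
    return list(executor.map(func2, stage1, chunksize=chunksize))


if __name__ == "__main__":
    # Four worker processes; chunksize batches items to cut down on
    # inter-process traffic.
    with ProcessPoolExecutor(max_workers=4) as executor:
        print(run_stages(range(10), executor, chunksize=3))
```

ThreadPoolExecutor is imported only because it accepts the same calls, which makes it easy to try run_stages without starting processes (it ignores chunksize).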
Tips for parallelism and concurrency
You shouldn't get user input in parallel. Only the main thread should handle that.
If your workload is light, you shouldn't use parallel processing.
If your task can't be divided into smaller pieces, it's not parallelizable. But it can still be run on a background thread as a way of running something concurrently.
Concurrency example:
Say your task is long-running and not parallelizable, takes 10 minutes to complete, and requires user input. When the user gives input, start the task on a worker. If the user gives input again 1 minute later, take that input and start the second task on worker 2. For input at 5 minutes, start task 3 on worker 3. At the 10-minute mark task 1 is complete. Because everything is running concurrently, by the 15-minute mark all tasks are complete. That's 2x faster than running the tasks in serial, which would take 30 minutes. However, this is concurrency, not parallelism.

How to maintain global processes in a pool working recursively?

I want to implement a recursive parallel algorithm in which the pool is created only once. At each time step the workers do a job, the master waits for all the jobs to finish, and then it calls the processes again with the previous outputs as inputs, and so on for the next time step.
My problem is that I have implemented a version where I create and kill the pool at every time step, but this is extremely slow, even slower than the sequential version. When I try to implement a version where the pool is created only once at the beginning, I get an assertion error when I call join().
This is my code:
def log_result(result):
    tempx , tempb, u = result
    X[:,u,np.newaxis], b[:,u,np.newaxis] = tempx , tempb
workers = mp.Pool(processes = 4)
for t in range(p,T):
    count = 0 #==========This is only master's job=============
    for l in range(p):
        for k in range(4):
            gn[count]=train[t-l-1,k]
            count+=1
    G = G*v + gn # gn.T#==================================
    if __name__ == '__main__':
        for i in range(4):
            workers.apply_async(OULtraining, args=(train[t,i], X[:,i,np.newaxis], b[:,i,np.newaxis], i, gn), callback = log_result)
        workers.join()
X and b are the matrices that I want to update directly in the master's memory.
What is wrong here, and why do I get the assertion error?
Can I implement what I want with the pool or not?

You cannot join a pool that has not been closed first, because join() waits for the worker processes to terminate, not for jobs to complete (https://docs.python.org/3.6/library/multiprocessing.html section 17.2.2.9).
But closing the pool is not what you want, so join() is out, and you need to implement a "wait until all jobs have completed" step yourself.
One way of doing this without busy loops would be using a queue. You could also work with bounded semaphores, but they do not work on all operating systems.
counter = 0
lock_queue = mp.Queue()
counter_lock = mp.Lock()
def log_result(result):
    global counter  # the callback rebinds the module-level counter
    tempx , tempb, u = result
    X[:,u,np.newaxis], b[:,u,np.newaxis] = tempx , tempb
    with counter_lock:
        counter += 1
        if counter == 4:
            counter = 0
            lock_queue.put(42)
workers = mp.Pool(processes = 4)
for t in range(p,T):
    count = 0 #==========This is only master's job=============
    for l in range(p):
        for k in range(4):
            gn[count]=train[t-l-1,k]
            count+=1
    G = G*v + gn # gn.T#==================================
    if __name__ == '__main__':
        counter = 0
        for i in range(4):
            workers.apply_async(OULtraining, args=(train[t,i], X[:,i,np.newaxis], b[:,i,np.newaxis], i, gn), callback = log_result)
        lock_queue.get(block=True)
This resets a global counter before submitting jobs. As soon as a job is completed, your callback increments the counter. When the counter hits 4 (your number of jobs), the callback knows it has processed the last result and puts a dummy message in the queue. Your main program is waiting at Queue.get() for something to appear there.
This allows your main program to block until all jobs have completed, without closing down the pool.
If you replace multiprocessing.Pool with ProcessPoolExecutor from concurrent.futures, you can skip this part and use
concurrent.futures.wait(fs, timeout=None, return_when=ALL_COMPLETED)
to block until all submitted tasks have finished. From a functional standpoint there is no difference between the two. The concurrent.futures method is a couple of lines shorter, but the result is exactly the same.
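A minimal Python 3 sketch of the concurrent.futures variant follows. Here train_one is a made-up stand-in for OULtraining that returns (index, new_value), and run_time_steps is a hypothetical helper; the executor is created once and reused for every time step.

```python
from concurrent.futures import (
    ALL_COMPLETED,
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    wait,
)


def train_one(value, index):
    """Stand-in for the real per-worker job: returns (index, new_value)."""
    return index, value + 1


def run_time_steps(executor, state, n_steps):
    """Reuse one executor for every step; each step waits for all its jobs."""
    state = list(state)
    for _ in range(n_steps):
        futures = [
            executor.submit(train_one, state[i], i) for i in range(len(state))
        ]
        # Blocks until every job of this time step has finished. The
        # executor itself stays alive for the next step.
        done, _not_done = wait(futures, return_when=ALL_COMPLETED)
        for future in done:
            index, new_value = future.result()
            state[index] = new_value  # update happens in the master process
    return state


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as executor:
        print(run_time_steps(executor, [0, 10, 20, 30], n_steps=3))
```

Because the results are written back in the main process after wait() returns, no callback, counter, or lock is needed. ThreadPoolExecutor is imported only as a drop-in alternative for quick experiments.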

Can Python threads work on the same process?

I am trying to come up with a way to have threads work on the same goal without interfering. In this case I am using 4 threads to add up every number between 0 and 90,000. This code runs, but it ends almost immediately (runtime: 0.00399994850159 sec) and only outputs 0. Originally I wanted to do it with a global variable, but I was worried about the threads interfering with each other (i.e. the small chance that two threads double-count or skip a number due to strange timing of the reads/writes). So instead I distributed the workload beforehand. If there is a better way to do this, please share. This is my simple way of trying to get some experience with multithreading. Thanks.
import threading
import time
start_time = time.time()
tot1 = 0
tot2 = 0
tot3 = 0
tot4 = 0
def Func(x,y,tot):
    tot = 0
    i = y-x
    while z in range(0,i):
        tot = tot + i + z
# class Tester(threading.Thread):
# def run(self):
# print(n)
w = threading.Thread(target=Func, args=(0,22499,tot1))
x = threading.Thread(target=Func, args=(22500,44999,tot2))
y = threading.Thread(target=Func, args=(45000,67499,tot3))
z = threading.Thread(target=Func, args=(67500,89999,tot4))
w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}
total = tot1 + tot2 + tot3 + tot4
print total
print("--- %s seconds ---" % (time.time() - start_time))

You have a bug that makes this program end almost immediately. Look at while z in range(0,i): in Func. z isn't defined in the function, and it's only by luck (bad luck, really) that you happen to have a global variable z = threading.Thread(target=Func, args=(67500,89999,tot4)) that masks the problem. You are testing whether the thread object is in a list of integers... and it's not!
The next problem is with the global variables. First, you are absolutely right that using a single global variable is not thread safe. The threads would mess with each other's calculations. But you misunderstand how globals work. When you do threading.Thread(target=Func, args=(67500,89999,tot4)), Python passes the object currently referenced by tot4 to the function, but the function has no idea which global it came from. You only update the local variable tot and discard it when the function completes.
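The difference between rebinding a local name and mutating a shared object can be seen in a few lines. This is an illustrative sketch; the function names rebind and mutate are made up for the demonstration.

```python
def rebind(total):
    # Assignment creates a new int and binds it to the local name only;
    # the caller's variable still refers to the old object.
    total = total + 1


def mutate(totals, index):
    # Item assignment changes the list object itself, which the caller
    # also holds a reference to.
    totals[index] += 1


tot1 = 0
rebind(tot1)

totals = [0, 0]
mutate(totals, 1)

print(tot1)    # unchanged by rebind()
print(totals)  # changed in place by mutate()
```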
A solution is to use a global container to hold the calculations, as shown in the example below. Unfortunately, this is actually slower than just doing all the work in one thread. The Python global interpreter lock (GIL) lets only one thread execute Python bytecode at a time, so for CPU-intensive tasks implemented in pure Python, threads add overhead without adding speed.
You could look at the multiprocessing module to split this into multiple processes. That works well if the cost of running the calculation is large compared to the cost of starting the process and passing it data.
Here is a working copy of your example:
import threading
import time
start_time = time.time()
tot = [0] * 4
def Func(x,y,tot_index):
    my_total = 0
    i = y-x
    for z in range(0,i+1):  # i+1 so that y itself is included
        my_total = my_total + x + z  # x+z is the actual number being added
    tot[tot_index] = my_total
# class Tester(threading.Thread):
# def run(self):
# print(n)
w = threading.Thread(target=Func, args=(0,22499,0))
x = threading.Thread(target=Func, args=(22500,44999,1))
y = threading.Thread(target=Func, args=(45000,67499,2))
z = threading.Thread(target=Func, args=(67500,89999,3))
w.start()
x.start()
y.start()
z.start()
w.join()
x.join()
y.join()
z.join()
# while (w.isAlive() == False | x.isAlive() == False | y.isAlive() == False | z.isAlive() == False): {}
total = sum(tot)
print(total)
print("--- %s seconds ---" % (time.time() - start_time))

You can pass in a mutable object to which each thread adds its result, either keyed by an identifier (e.g. a dict) or just a list that you append() to, e.g.:
import threading
def Func(start, stop, results):
    results.append(sum(range(start, stop+1)))
rngs = [(0, 22499), (22500, 44999), (45000, 67499), (67500, 89999)]
results = []
jobs = [threading.Thread(target=Func, args=(start, stop, results)) for start, stop in rngs]
for j in jobs:
    j.start()
for j in jobs:
    j.join()
print(sum(results))
# 4049955000
# 100 loops, best of 3: 2.35 ms per loop

As others have noted, you could look at multiprocessing to split the work across multiple processes that can run in parallel. This is especially beneficial for CPU-intensive tasks, assuming there isn't a huge amount of data to pass between the processes.
Here's a simple implementation of the same functionality using multiprocessing:
from multiprocessing import Pool
POOL_SIZE = 4
NUMBERS = 90000
def func(_range):
    tot = 0
    for z in range(*_range):
        tot += z
    return tot
if __name__ == '__main__':  # needed on platforms that spawn rather than fork
    with Pool(POOL_SIZE) as pool:
        chunk_size = int(NUMBERS / POOL_SIZE)
        chunks = ((i, i + chunk_size) for i in range(0, NUMBERS, chunk_size))
        print(sum(pool.imap(func, chunks)))
In the code above, chunks is a generator that covers the same ranges that were hardcoded in the original version, expressed as half-open (start, stop) pairs. It is passed to imap, which works like the standard map except that it executes the function in the pool's processes.
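To see what the generator yields, the chunks can be materialized as a list and summed serially. This is only a check of the chunking arithmetic, with no pool involved; the variable names mirror the example above.

```python
NUMBERS = 90000
POOL_SIZE = 4

chunk_size = NUMBERS // POOL_SIZE
# Half-open (start, stop) pairs, exactly what range(*pair) expects.
chunks = [(i, i + chunk_size) for i in range(0, NUMBERS, chunk_size)]
print(chunks)

# Summing each chunk and then the partial sums gives the overall total.
total = sum(sum(range(start, stop)) for start, stop in chunks)
print(total)
```

Note that when NUMBERS is not a multiple of POOL_SIZE, the last pair produced this way overshoots NUMBERS, so the stop value would need to be clamped with min(i + chunk_size, NUMBERS).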
A lesser-known fact about multiprocessing is that you can easily convert the code to use threads instead of processes by using the long-undocumented multiprocessing.pool.ThreadPool. To convert the example above to use threads, just change the import to:
from multiprocessing.pool import ThreadPool as Pool

Python Multiprocessing Duplicates

When I run this script on Linux, it prints 8 duplicates. How do I force Python to use all cores to compute different results, rather than duplicates?
from multiprocessing import Pool
def f():
    f = open("/path/to/10.txt", 'r')
    l = [s.strip('\n') for s in f]
    f.close()
    for a in range(0, len(l)):
        for b in range(0, len(l)):
            result = 0
            if (a == b):
                result = 1
            else:
                counter = 0
                for i in range(len(l[a])):
                    if (int(l[a][i]) == int(l[b][i]) == 1):
                        counter += 1
                result = counter / 10000
            print((a + 1), (b + 1), result)
if __name__ == '__main__':
    p = Process(target=f)
    p.start()
    p.join()

If you simply want to use more than one core, you will have to use multiple processes; here you are using just one.
You also need to break your routine f into independent units so that they can run in parallel and the whole task can be shared among multiple worker processes.
Here is sample two-process code that can use multiple cores on your machine:
from multiprocessing import Process
def task(arg):
    pass
if __name__ == '__main__':
    value = 'something'
    prc1 = Process(target=task, args=(value,))
    prc2 = Process(target=task, args=(value,))
    prc1.start()
    prc2.start()
    prc1.join()
    prc2.join()
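One possible way to split the routine from the question into independent units is to give each worker one row index a and let it compare that row against all others. The sketch below is written under assumptions: the data is a short made-up list rather than the file, the score divides by the line length instead of the hardcoded 10000, and shared_ones and compare_row are hypothetical helper names.

```python
from multiprocessing import Pool


def shared_ones(row_a, row_b):
    """Count positions where both strings contain the character '1'."""
    return sum(1 for x, y in zip(row_a, row_b) if x == "1" and y == "1")


def compare_row(args):
    """Independent unit of work: compare line a with every line b."""
    a, lines = args
    row = []
    for b in range(len(lines)):
        if a == b:
            score = 1
        else:
            score = shared_ones(lines[a], lines[b]) / len(lines[a])
        row.append((a + 1, b + 1, score))
    return row


if __name__ == "__main__":
    lines = ["1100", "1010", "0110"]  # made-up data standing in for the file
    tasks = [(a, lines) for a in range(len(lines))]
    with Pool() as pool:  # defaults to one worker per CPU core
        for row in pool.map(compare_row, tasks):
            for a, b, score in row:
                print(a, b, score)
```

Each task computes a different row of results, so no two workers produce the same output, and pool.map returns the rows in order.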

Multiprocessing and Queue with Dataframe

I am having trouble exchanging an object (a DataFrame) between two processes through a Queue.
The first process gets the data from the queue; the second puts data into the queue.
The put process is faster, so the get process should empty the queue by reading all the objects.
I see strange behaviour: my code works as expected, but only for 100 rows in the DataFrame. For 1000 rows, the get process always takes only one object.
import multiprocessing, time, sys
import pandas as pd
NR_ROWS = 1000
i = 0
def getDf():
    global i, NR_ROWS
    myheader = ["name", "test2", "test3"]
    myrow1 = [ i, i+400, i+250]
    df = pd.DataFrame([myrow1]*NR_ROWS, columns = myheader)
    i = i+1
    return df
def f_put(q):
    print "f_put start"
    while(1):
        data = getDf()
        q.put(data)
        print "P:", data["name"].iloc[0]
        sys.stdout.flush()
        time.sleep(1.55)
def f_get(q):
    print "f_get start"
    while(1):
        data = pd.DataFrame()
        while not q.empty():
            data = q.get()
            print "get"
        if not data.empty:
            print "G:", data["name"].iloc[0]
        else:
            print "nothing new"
        time.sleep(5.9)
if __name__ == "__main__":
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=f_put, args=(q,))
    p.start()
    while(1):
        f_get(q)
    p.join()
Output for a 100-row DataFrame: the get process takes all objects.
f_get start
nothing new
f_put start
P: 0 # put 1.object into the queue
P: 1 # put 2.object into the queue
P: 2 # put 3.object into the queue
P: 3 # put 4.object into the queue
get # get-process takes all 4 objects from the queue
get
get
get
G: 3
P: 4
P: 5
P: 6
get
get
get
G: 6
P: 7
P: 8
Output for a 1000-row DataFrame: the get process takes only one object.
f_get start
nothing new
f_put start
P: 0 # put 1.object into the queue
P: 1 # put 2.object into the queue
P: 2 # put 3.object into the queue
P: 3 # put 4.object into the queue
get <-- #!!! get-process takes ONLY 1 object from the queue!!!
G: 1
P: 4
P: 5
P: 6
get
G: 2
P: 7
P: 8
P: 9
P: 10
get
G: 3
P: 11
Any idea what I am doing wrong, and how I can also pass the bigger DataFrame through?

At the risk of not being able to provide a fully functional example, here is what goes wrong.
First of all, it's a timing issue.
I tried your code again with larger DataFrames (10000 or even 100000 rows) and I started to see the same thing you do. This means you see this behaviour as soon as the size of the arrays crosses a certain threshold, which will be system (CPU?) dependent.
I modified your code a bit to make it easier to see what happens. First, 5 DataFrames are put into the queue without any custom time.sleep. In the f_get function I added a counter (and a time.sleep(0), see below) to the loop (while not q.empty()).
The new code:
import multiprocessing, time, sys
import pandas as pd
NR_ROWS = 10000
i = 0
def getDf():
    global i, NR_ROWS
    myheader = ["name", "test2", "test3"]
    myrow1 = [ i, i+400, i+250]
    df = pd.DataFrame([myrow1]*NR_ROWS, columns = myheader)
    i = i+1
    return df
def f_put(q):
    print "f_put start"
    j = 0
    while(j < 5):
        data = getDf()
        q.put(data)
        print "P:", data["name"].iloc[0]
        sys.stdout.flush()
        j += 1
def f_get(q):
    print "f_get start"
    while(1):
        data = pd.DataFrame()
        loop = 0
        while not q.empty():
            data = q.get()
            print "get (loop: %s)" %loop
            time.sleep(0)
            loop += 1
        time.sleep(1.)
if __name__ == "__main__":
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=f_put, args=(q,))
    p.start()
    while(1):
        f_get(q)
    p.join()
Now, if you run this for different number of rows, you will see something like this:
N=100:
f_get start
f_put start
P: 0
P: 1
P: 2
P: 3
P: 4
get (loop: 0)
get (loop: 1)
get (loop: 2)
get (loop: 3)
get (loop: 4)
N=10000:
f_get start
f_put start
P: 0
P: 1
P: 2
P: 3
P: 4
get (loop: 0)
get (loop: 1)
get (loop: 0)
get (loop: 0)
get (loop: 0)
What does this tell us?
As long as the DataFrame is small, your assumption that the put process is faster than the get process seems true: we can fetch all 5 items within one pass of while not q.empty().
But as the number of rows increases, something changes. q.empty() evaluates to True (the queue appears empty), so the inner loop ends early and the outer while(1) cycles.
This could mean that put is now slower than get and we have to wait. But if we set the sleep time for the whole f_get to something like 15, we still get the same behaviour.
On the other hand, if we change the time.sleep(0) in the inner q.get() loop to 1,
while not q.empty():
    data = q.get()
    time.sleep(1)
    print "get (loop: %s)" %loop
    loop += 1
we get this:
f_get start
f_put start
P: 0
P: 1
P: 2
P: 3
P: 4
get (loop: 0)
get (loop: 1)
get (loop: 2)
get (loop: 3)
get (loop: 4)
This looks right! It means that get is doing something strange: while it is still processing one get, the queue reports itself as empty, and only after the get is done is the next item available.
I'm sure there is a reason for that, but I'm not familiar enough with multiprocessing to see it.
Depending on your application, you could just add an appropriate time.sleep to your inner loop and see if that's enough.
Or, if you want to solve it properly (instead of using a workaround such as time.sleep), you could look into the multiprocessing documentation for information on blocking, non-blocking, or asynchronous communication. I think the solution will be found there.
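One blocking-based approach can be sketched as follows (Python 3). The multiprocessing documentation describes Queue.empty() as not reliable, and put() hands each object to a background feeder thread that pickles it and writes it to a pipe, so a plausible explanation is that large DataFrames are still in transit when empty() is checked. Instead of polling empty(), the consumer can block briefly on get() and treat a timeout as "nothing more for now". The helper name drain and the default timeout are assumptions chosen for illustration.

```python
import queue


def drain(q, timeout=0.5):
    """Collect items from q until none arrives within `timeout` seconds.

    Works with queue.Queue and with multiprocessing.Queue, since both
    raise queue.Empty when get() times out.
    """
    items = []
    while True:
        try:
            # Block for up to `timeout` seconds so an item that is still
            # being transferred has a chance to arrive.
            items.append(q.get(timeout=timeout))
        except queue.Empty:
            return items


if __name__ == "__main__":
    demo = queue.Queue()
    for n in range(3):
        demo.put(n)
    print(drain(demo, timeout=0.1))
```

In f_get, the inner while not q.empty() loop could be replaced by a call such as drain(q), keeping only the last element of the returned list if just the newest DataFrame is needed. The timeout would have to be tuned to how long one DataFrame takes to transfer.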
