Multiprocessing with Python and Windows

I have code that works with Thread in Python, but I want to switch to Process because, if I have understood correctly, that will give me a speed-up.
Here is the code with Thread:
threads.append(Thread(target=getId, args=(my_queue, read)))
threads.append(Thread(target=getLatitude, args=(my_queue, read)))
The code works by putting the return values in the Queue; after a join on the threads list, I can retrieve the results.
After changing the code and the import statement, my code now looks like this:
threads.append(Process(target=getId, args=(my_queue, read)))
threads.append(Process(target=getLatitude, args=(my_queue, read)))
However, it does not execute anything and the Queue stays empty; with Thread the Queue is not empty, so I think the problem is related to Process.
I have read answers claiming that the Process class does not work on Windows. Is that true, or is there a way to make it work (adding freeze_support() does not help)?
If it is not, is multithreading on Windows actually executed in parallel on different cores?
ref:
Python multiprocessing example not working
Python code with multiprocessing does not work on Windows
Multiprocessing process does not join when putting complex dictionary in return queue
(which explains that fork does not exist on Windows)
EDIT:
To add some details: the code with Process actually works on CentOS.
EDIT 2:
Here is a simplified version of my code with processes, tested on CentOS:
import pandas as pd
from multiprocessing import Process, freeze_support
from multiprocessing import Queue

#%% Global variables
datasets = []
latitude = []

def fun(key, job):
    global latitude
    if(key == 'LAT'):
        latitude.append(job)

def getLatitude(out_queue, skip = None):
    latDict = {'LAT' : latitude}
    out_queue.put(latDict)

n = pd.read_csv("my.csv", sep =',', header = None).shape[0]
print("Number of baboon:" + str(n))
read = []
for i in range(0,n):
    threads = []
    my_queue = Queue()
    threads.append(Process(target=getLatitude, args=(my_queue, read)))
    for t in threads:
        freeze_support() # try both with and without this line
        t.start()
    for t in threads:
        t.join()
    while not my_queue.empty():
        try:
            job = my_queue.get()
            key = list(job.keys())
            fun(key[0], job[key[0]])
        except:
            print("END")
    read.append(i)

Per the documentation, you need the following after the function definitions. When Python creates the subprocesses, they import your script, so any code at the global level is run multiple times. The code you only want to run in the main process must go under the guard:
if __name__ == '__main__':
    n = pd.read_csv("my.csv", sep =',', header = None).shape[0]
    # etc.
Indent the rest of the code under this if, as in the sketch below.
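For illustration, a minimal sketch of how the loop from the question could sit under the guard (the worker body here is a placeholder, not your real getLatitude logic):

import pandas as pd
from multiprocessing import Process, Queue, freeze_support

def getLatitude(out_queue, skip=None):
    # worker: put a dictionary of results on the queue
    out_queue.put({'LAT': []})

if __name__ == '__main__':
    freeze_support()  # only matters if the script is frozen into an executable
    n = pd.read_csv("my.csv", sep=',', header=None).shape[0]
    read = []
    for i in range(n):
        my_queue = Queue()
        p = Process(target=getLatitude, args=(my_queue, read))
        p.start()
        job = my_queue.get()  # read the result before join to avoid blocking on a full pipe
        p.join()
        read.append(i)

On Windows (and with the spawn start method in general) every child re-imports the module, so only the function definitions should live at module level; everything else belongs under the guard.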

Related

issues with multithread and subprocess in python 3.x

I am trying to run a loop with different threads to speed up the process, and I cannot find any other way to do it than to create another script, call it with subprocess, and pass it the big array as an argument.
I have left comments in the code to explain the problem.
The script I am trying to call with subprocess:
import multiprocessing
from joblib import Parallel, delayed
import sys

num_cores = multiprocessing.cpu_count()
inputs = sys.argv[1:]
print(inputs)

def evaluate_individual(ind):
    ind.evaluate()
    ind.already_evaluate = True

if __name__ == "__main__":
    # I am trying to execute a loop with multiple threads,
    # but it needs to be inside "__main__", which I can't do in my main script, so I created an external script...
    Parallel(n_jobs=num_cores)(delayed(evaluate_individual)(i) for i in inputs)
The script that calls the other script:
import subprocess, sys
from clean import IndividualExamples

# the big array
inputs = []
for i in range(200):
    ind = IndividualExamples.simple_individual((None, None), None)
    inputs.append(ind)

# now I need to call this code from another script...
# but I pass a big array as an argument and I don't know if there is a better method
# AND... this doesn't work: the subprocess call returns no error, so I don't know what to do..
subprocess.Popen(['py', 'C:/Users/alexa/OneDrive/Documents/Programmes/Neronal Network/multi_thread.py', str(inputs)])
Thanks for your help; if you know another way to run a loop with multiple threads inside a function (not in main), please tell me as well.
And sorry for my approximate English.
EDIT: I tried with a Pool but have the same problem: it needs to be in "__main__", so I can't put it in a function in my script (which is how I need to use it).
Modified code with Pool:
if __name__ == "__main__":
    tps1 = time.time()
    with multiprocessing.Pool(12) as p:
        outputs = p.map(evaluate_individual, inputs)
    tps2 = time.time()
    print(tps2 - tps1)
    New_array = outputs
Another small question: I timed a simple loop against the Pool version and compared the two:
simple loop: 0.022951126098632812
multiprocessing (Pool): 0.9151067733764648
Why can multiprocessing on 12 cores be slower than a simple loop?
You can do it using a multiprocessing.Pool:
import subprocess, sys
from multiprocessing import Pool
from clean import IndividualExamples

def evaluate_individual(ind):
    ind.evaluate()

# the big array
pool = Pool()
for i in range(200):
    ind = IndividualExamples.simple_individual((None, None), None)
    pool.apply_async(func=evaluate_individual, args=(ind,))

pool.close()
pool.join()
Thanks, I just tried it, but it raises a runtime error:
raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
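That error is the Windows spawn start method complaining that the pool is created at import time. A minimal sketch of the same answer with the required guard added (clean / IndividualExamples are taken from the question and assumed to be importable; the worker is unchanged):

from multiprocessing import Pool
from clean import IndividualExamples

def evaluate_individual(ind):
    ind.evaluate()

if __name__ == "__main__":
    # build the big array in the parent process only
    inputs = [IndividualExamples.simple_individual((None, None), None)
              for _ in range(200)]
    with Pool() as pool:
        # map blocks until every individual has been evaluated
        results = pool.map(evaluate_individual, inputs)

Note that the individuals are pickled over to the workers, so attributes set inside evaluate_individual change the copies in the child processes, not the objects in the parent; if you need the results back, return them from evaluate_individual and use the return value of pool.map.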

Python multiprocessing.Pool does not start right away

I want to input text to Python and process it in parallel. For that purpose I use multiprocessing.Pool. The problem is that sometimes, though not always, I have to input text multiple times before anything is processed.
This is a minimal version of my code to reproduce the problem:
import multiprocessing as mp
import time

def do_something(text):
    print('Out: ' + text, flush=True)
    # do some awesome stuff here

if __name__ == '__main__':
    p = None
    while True:
        message = input('In: ')
        if not p:
            p = mp.Pool()
        p.apply_async(do_something, (message,))
What happens is that I have to input text multiple times before I get a result, no matter how long I wait after I have inputted something the first time. (As stated above, that does not happen every time.)
python3 test.py
In: a
In: a
In: a
In: Out: a
Out: a
Out: a
If I create the pool before the while loop or if I add time.sleep(1) after creating the pool, it seems to work every time. Note: I do not want to create the pool before I get an input.
Has someone an explanation for this behavior?
I'm running Windows 10 with Python 3.4.2
EDIT: Same behavior with Python 3.5.1
EDIT:
An even simpler example, with Pool and also with ProcessPoolExecutor. I think the problem is the call to input() right after applying/submitting, which only seems to be a problem the first time something is applied/submitted.
import concurrent.futures
import multiprocessing as mp
import time

def do_something(text):
    print('Out: ' + text, flush=True)
    # do some awesome stuff here

# ProcessPoolExecutor
# if __name__ == '__main__':
#     with concurrent.futures.ProcessPoolExecutor() as executor:
#         executor.submit(do_something, 'a')
#         input('In:')
#     print('done')

# Pool
if __name__ == '__main__':
    p = mp.Pool()
    p.apply_async(do_something, ('a',))
    input('In:')
    p.close()
    p.join()
    print('done')
Your code worked when I tried it on my Mac.
In Python 3, it might help to explicitly declare how many processes will be in your pool (i.e. the number of simultaneous processes).
Try using p = mp.Pool(1):
import multiprocessing as mp
import time

def do_something(text):
    print('Out: ' + text, flush=True)
    # do some awesome stuff here

if __name__ == '__main__':
    p = None
    while True:
        message = input('In: ')
        if not p:
            p = mp.Pool(1)
        p.apply_async(do_something, (message,))
I could not reproduce it on Windows 7, but there are a few long shots worth mentioning for your issue.
Your antivirus might be interfering with the newly spawned processes; try temporarily disabling it and see if the issue is still present.
Windows 10 might have a different IO caching algorithm; try inputting larger strings. If that works, it means the OS tries to be smart and sends data only once a certain amount has piled up.
As Windows has no fork() primitive, you might be seeing the delay caused by the spawn start method.
Python 3 added a new pool of workers called ProcessPoolExecutor; I'd recommend using it regardless of which issue you are suffering from, as in the sketch below.
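A minimal sketch of the same example using concurrent.futures.ProcessPoolExecutor from the standard library (essentially the commented-out block in your question, uncommented):

import concurrent.futures

def do_something(text):
    print('Out: ' + text, flush=True)

if __name__ == '__main__':
    with concurrent.futures.ProcessPoolExecutor() as executor:
        future = executor.submit(do_something, 'a')
        future.result()  # wait for the worker before the with-block shuts the pool down
    print('done')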

Python multiprocessing - Is it possible to introduce a fixed time delay between individual processes?

I have searched and cannot find an answer to this question elsewhere. Hopefully I haven't missed something.
I am trying to use Python multiprocessing to essentially batch run some proprietary models in parallel. I have, say, 200 simulations, and I want to batch run them ~10-20 at a time. My problem is that the proprietary software crashes if two models happen to start at the same / similar time. I need to introduce a delay between processes spawned by multiprocessing so that each new model run waits a little bit before starting.
So far, my solution has been to introduce a random time delay at the start of the child process before it fires off the model run. However, this only reduces the probability of any two runs starting at the same time, so I still run into problems when processing a large number of models. I therefore think the time delay needs to be built into the multiprocessing part of the code, but I haven't been able to find any documentation or examples of this.
Edit: I am using Python 2.7
This is my code so far:
from time import sleep
import numpy as np
import subprocess
import multiprocessing

def runmodels(arg):
    sleep(np.random.rand(1,1)*120) # this is my interim solution to reduce the probability that any two runs start at the same time, but it isn't a guaranteed solution
    subprocess.call(arg) # this line actually fires off the model run

if __name__ == '__main__':
    arguments = [big list of runs in here
                 ]
    count = 12
    pool = multiprocessing.Pool(processes = count)
    r = pool.imap_unordered(runmodels, arguments)
    pool.close()
    pool.join()
multiprocessing.Pool() already limits the number of processes running concurrently.
You could use a lock to separate the start times of the processes (not tested):
import threading
import multiprocessing

def init(lock):
    global starting
    starting = lock

def run_model(arg):
    starting.acquire() # no other process can get it until it is released
    threading.Timer(1, starting.release).start() # release in a second
    # ... start your simulation here

if __name__=="__main__":
    arguments = ...
    pool = multiprocessing.Pool(processes=12,
                                initializer=init, initargs=[multiprocessing.Lock()])
    for _ in pool.imap_unordered(run_model, arguments):
        pass
One way to do this with threads and a semaphore:
from time import sleep
import subprocess
import threading

def runmodels(arg):
    subprocess.call(arg)
    sGlobal.release() # release for next launch

if __name__ == '__main__':
    threads = []
    global sGlobal
    sGlobal = threading.Semaphore(12) # semaphore for max 12 threads
    arguments = [big list of runs in here
                 ]
    for arg in arguments:
        sGlobal.acquire() # block if more than 12 threads
        t = threading.Thread(target=runmodels, args=(arg,))
        threads.append(t)
        t.start()
        sleep(1)
    for t in threads:
        t.join()
The answer suggested by jfs caused problems for me as a result of starting a new thread with threading.Timer. If the worker just so happens to finish before the timer does, the timer is killed and the lock is never released.
I propose an alternative route, in which each successive worker will wait until enough time has passed since the start of the previous one. This seems to have the same desired effect, but without having to rely on another child process.
import multiprocessing as mp
import time

def init(shared_val):
    global start_time
    start_time = shared_val

def run_model(arg):
    with start_time.get_lock():
        wait_time = max(0, start_time.value - time.time())
        time.sleep(wait_time)
        start_time.value = time.time() + 1.0 # specify the interval here
    # ... start your simulation here

if __name__=="__main__":
    arguments = ...
    pool = mp.Pool(processes=12,
                   initializer=init, initargs=[mp.Value('d')])
    for _ in pool.imap_unordered(run_model, arguments):
        pass

Python threading return values

I am new to threading and I have an existing application that I would like to make a little quicker using threading.
I have several functions that return to a main Dict and would like to send these to separate threads so that they run at the same time rather than one at a time.
I have done a little googling but I can't seem to find something that fits my existing code and could use a little help.
I have around six functions that return to the main Dict like this:
parsed['cryptomaps'] = pipes.ConfigParse.crypto(parsed['split-config'], parsed['asax'], parsed['names'])
The issue here is with the return value. I understand that I would need to use a queue for this, but would I need a queue for each of these six functions or one queue for all of them? If it is the latter, how would I separate the returns from the threads and assign them to the correct Dict entries?
Any help on this would be great.
John
You can push tuples of (worker, data) to the queue to identify the source.
Also, please note that due to the Global Interpreter Lock, Python threading is not very useful for CPU-bound work. I suggest taking a look at the multiprocessing module, which offers an interface very similar to threading but will actually scale with the number of workers.
Edit:
Code sample.
import multiprocessing as mp
import Queue  # py2 module name; on py3 use "import queue"

# py 3 compatibility
try:
    from future_builtins import range, map
except ImportError:
    pass

data = [
    # input data
    # {split_config: ... }
]

def crypto(split_config, asax, names):
    # your code here
    pass

if __name__ == "__main__":
    terminate = mp.Event()
    input = mp.Queue()
    output = mp.Queue()

    def worker(id, terminate, input, output):
        # use the event here to exit gracefully;
        # using Process.terminate would leave the queues
        # in an undefined state
        while not terminate.is_set():
            try:
                x = input.get(True, timeout=1000)
                output.put((id, crypto(**x)))
            except Queue.Empty:
                pass

    workers = [mp.Process(target=worker, args=(i, terminate, input, output))
               for i in range(0, mp.cpu_count())]
    for worker in workers:
        worker.start()

    for x in data:
        input.put(x)

    # terminate workers
    terminate.set()

    # process results
    # make sure that queues are emptied otherwise Process.join can deadlock
    for worker in workers:
        worker.join()

Running two lines of code in python at the same time?

How would I run two lines of code at the exact same time in Python 2.7? I think it's called parallel processing or something like that but I can't be too sure.
You can use either multithreading or multiprocessing.
You will need to use queues for the task.
The sample code below will help you get started with multithreading.
import threading
import Queue
import datetime
import time

class myThread(threading.Thread):
    def __init__(self, in_queue, out_queue):
        threading.Thread.__init__(self)
        self.in_queue = in_queue
        self.out_queue = out_queue

    def run(self):
        while True:
            item = self.in_queue.get() # blocks till something is available in the queue
            # run your lines of code here
            processed_data = item + str(datetime.datetime.now()) + 'Processed'
            self.out_queue.put(processed_data)

IN_QUEUE = Queue.Queue()
OUT_QUEUE = Queue.Queue()

# starting 10 threads to do your work in parallel
for i in range(10):
    t = myThread(IN_QUEUE, OUT_QUEUE)
    t.setDaemon(True)
    t.start()

# now populate your input queue
for i in range(3000):
    IN_QUEUE.put("string to process")

while not IN_QUEUE.empty():
    print "Data left to process - ", IN_QUEUE.qsize()
    time.sleep(10)

# finally printing output
while not OUT_QUEUE.empty():
    print OUT_QUEUE.get()
This script starts 10 threads to process strings. It waits until the input queue has been processed, then prints the output along with the time of processing.
You can define multiple thread classes for different kinds of processing, or you can put function objects in the queue and have different functions running in parallel, as in the sketch below.
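For example, a minimal sketch of the "function objects in the queue" idea, using hypothetical tasks not taken from the question (Python 2 syntax, to match the example above):

import threading
import Queue

def greet():
    print "hello"

def count():
    print sum(range(10))

tasks = Queue.Queue()

def worker():
    while True:
        func = tasks.get()  # each queue item is a callable
        func()
        tasks.task_done()

for _ in range(2):
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

for func in (greet, count):
    tasks.put(func)
tasks.join()  # wait until both callables have run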
It depends on what you mean by at the exact same time. If you want something that doesn't stop while something else that takes a while runs, threads are a decent option. If you want to truly run two things in parallel, multiprocessing is the way to go: http://docs.python.org/2/library/multiprocessing.html
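To make the multiprocessing route concrete, here is a minimal sketch that runs two functions genuinely in parallel, assuming the two "lines" can be wrapped in functions (the function names here are placeholders):

from multiprocessing import Process

def task_one():
    print('first line of work')

def task_two():
    print('second line of work')

if __name__ == '__main__':
    p1 = Process(target=task_one)
    p2 = Process(target=task_two)
    p1.start()
    p2.start()  # both processes now run at (roughly) the same time, on separate cores
    p1.join()
    p2.join()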
If you mean, for example, starting a timer and starting a loop immediately after it, I think you can use ; like this: start_timer; start_loop in one line.
There is also a powerful package for running parallel jobs in Python: joblib.
