Issues with multithreading and subprocess in Python 3.x

I am trying to run a loop across several threads to speed up processing. The only way I have found is to create a second script, call it with subprocess, and pass it the big array as an argument.
I have commented the code to explain the problem.
The script I am trying to call with subprocess:
import multiprocessing
from joblib import Parallel, delayed
import sys
num_cores = multiprocessing.cpu_count()
inputs = sys.argv[1:]
print(inputs)
def evaluate_individual(ind):
    ind.evaluate()
    ind.already_evaluate = True
if __name__ == "__main__":
    # I am trying to execute a loop with multiple threads,
    # but it needs to be under "__main__", which I can't do in my main script, so I created an external script...
    Parallel(n_jobs=num_cores)(delayed(evaluate_individual)(i) for i in inputs)
The script that calls the other script:
import subprocess, sys
from clean import IndividualExamples
# the big array
inputs = []
for i in range(200):
    ind = IndividualExamples.simple_individual((None, None), None)
    inputs.append(ind)
# and now I need to call this code from another script...
# but I pass a big array as an argument and I don't know if there is a better method
# AND... the subprocess call does not work and returns no error, so I don't know what to do..
subprocess.Popen(['py','C:/Users/alexa/OneDrive/Documents/Programmes/Neronal Network/multi_thread.py', str(inputs)])
Thanks for your help. If you know another way to run a loop with multiple threads inside a function (not in the main block), please tell me as well.
And sorry for my approximate English.
Edit: I tried with a Pool, but I hit the same problem: it has to go under "__main__", so I can't put it in a function in my script (which is where I need it).
Modified code with Pool:
if __name__ == "__main__":
    tps1 = time.time()
    with multiprocessing.Pool(12) as p:
        outputs = p.map(evaluate_individual, inputs)
    tps2 = time.time()
    print(tps2 - tps1)
    New_array = outputs
Another small question: I ran the same work with a simple loop and with the multiprocessing Pool and compared the times of the two:
simple loop: 0.022951126098632812
multiprocessing Pool: 0.9151067733764648
Why can multiprocessing on 12 cores take longer than a simple loop?
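A likely explanation, offered as an assumption since the thread never answers this part: starting the worker processes and pickling every object to send it to them costs far more than a loop that already finishes in about 0.02 s. Here is a minimal sketch for measuring that overhead, where cheap_task is a hypothetical stand-in for evaluate_individual and the pool size of 4 is arbitrary:

```python
import multiprocessing
import time


def cheap_task(x):
    # Hypothetical stand-in for a very fast evaluate_individual().
    return x * x


def run_serial(inputs):
    # Plain loop: no process start-up and no pickling.
    return [cheap_task(x) for x in inputs]


def run_parallel(inputs, processes=4):
    # Every call pays for starting the workers and for pickling
    # arguments and results across the process boundary.
    with multiprocessing.Pool(processes) as pool:
        return pool.map(cheap_task, inputs)


if __name__ == "__main__":
    data = list(range(200))

    start = time.perf_counter()
    serial = run_serial(data)
    print("simple loop:", time.perf_counter() - start)

    start = time.perf_counter()
    parallel = run_parallel(data)
    print("pool:", time.perf_counter() - start)

    # Both approaches compute the same values.
    print(serial == parallel)
```

With tasks this small the pool version will usually be the slower one. Parallelism tends to pay off only once each task takes noticeably longer than the cost of shipping it to a worker.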

You can do it using a multiprocessing.Pool:
import subprocess, sys
from multiprocessing import Pool
from clean import IndividualExamples
def evaluate_individual(ind):
    ind.evaluate()
# the big array
pool = Pool()
for i in range(200):
    ind = IndividualExamples.simple_individual((None, None), None)
    pool.apply_async(func=evaluate_individual, args=(ind,))
pool.close()
pool.join()

Thanks, I just tried it, but it raises a runtime error:
    raise RuntimeError('''
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.
        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:
            if __name__ == '__main__':
                freeze_support()
                ...
        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
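The error message points at the missing guard: on Windows each worker re-imports the script, so a Pool created at module level gets created again inside every worker. A minimal sketch of one way around it keeps the pool inside a function and only starts the call chain under the guard. The Individual class here is a hypothetical stand-in for IndividualExamples.simple_individual, and returning the object from the worker is an assumption about what the caller needs, since a worker only ever changes its own copy:

```python
from multiprocessing import Pool


class Individual:
    # Hypothetical stand-in for IndividualExamples.simple_individual(...).
    def __init__(self, value):
        self.value = value
        self.fitness = None
        self.already_evaluate = False

    def evaluate(self):
        self.fitness = self.value * 2


def evaluate_individual(ind):
    ind.evaluate()
    ind.already_evaluate = True
    # The worker operates on a copy, so send the updated object back.
    return ind


def evaluate_all(individuals):
    # Can be called from any function, as long as the call chain
    # starts inside the __main__ guard of the script being run.
    with Pool() as pool:
        pending = [pool.apply_async(evaluate_individual, (ind,))
                   for ind in individuals]
        return [job.get() for job in pending]


if __name__ == "__main__":
    population = [Individual(i) for i in range(200)]
    population = evaluate_all(population)
    print(all(ind.already_evaluate for ind in population))
```

The function evaluate_all can live in any module. Only the script that is actually launched needs the if __name__ == "__main__": line.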

Related

Multiprocessing uses all my cores but does not compute anything

I'm trying to learn about multiprocessing, but the following simple example keeps running forever, using all my CPU, and never gives me the answer.
I have a file called preample.py:
def function(list):
    return list[0]+list[1]
And a file called main.py:
from preample.py import*
from multiprocessing import Pool
if __name__ == '__main__':
    parameters=[[1,2],[3,4]]
    # Multiprocessing
    with Pool() as p:
        results = p.map(function, parameters)
    print(results)
Any sort of help would be appreciated.
The first thing that I tried was to write the definition of the function in a different file (as suggested in other questions) but this didn't work.

Multiprocessing & Pool in __main__ - how to get the output outside the __main__?

Based on this answer (https://stackoverflow.com/a/20192251/9024698), I have to do this:
from multiprocessing import Pool
def process_image(name):
    sci=fits.open('{}.fits'.format(name))
    <process>
if __name__ == '__main__':
    pool = Pool() # Create a multiprocessing Pool
    pool.map(process_image, data_inputs) # process data_inputs iterable with pool
to multi-process a for loop.
However, I am wondering, how can I get the output of this and further process if I want?
It would have to look like this:
if __name__ == '__main__':
    pool = Pool() # Create a multiprocessing Pool
    output = pool.map(process_image, data_inputs) # process data_inputs iterable with pool
    # further processing
But then this means that I have to put all the rest of my code in __main__ unless I write everything in functions which are called by __main__?
The notion of __main__ has always been pretty confusing to me.
if __name__ == '__main__': literally just means "if this file is being run as a script, as opposed to being imported as a module, then do this". __name__ is a hidden variable that gets set to '__main__' when the file is run as a script. Why it works this way is beyond the scope of this discussion, but suffice it to say that it has to do with how Python evaluates source files top to bottom.
In other words, you can put the other two lines anywhere you want, most likely in a function that you call elsewhere in the program. You could return output from that function, do further processing on it, or whatever else you happen to need.
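As a minimal sketch of that layout: the pool lives in an ordinary function that returns its output, further processing lives in another function, and only the entry point sits under the guard. The body of process_image is a hypothetical stand-in for the real FITS processing, and process_all, summarize and main are names made up for this illustration:

```python
from multiprocessing import Pool


def process_image(name):
    # Hypothetical stand-in for the real FITS processing.
    return name.upper()


def process_all(names):
    # The pool lives inside an ordinary function and its result is returned.
    with Pool() as pool:
        return pool.map(process_image, names)


def summarize(results):
    # Further processing also lives in a normal function.
    return ", ".join(results)


def main():
    output = process_all(["a", "b", "c"])
    print(summarize(output))


if __name__ == "__main__":
    # Only the entry point needs the guard.
    main()
```

Nothing besides the single call to main() has to sit under the guard, so the rest of the program can be organized into functions and modules as usual.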

Multiprocessing with Python and Windows

I have code that works with Thread in Python, but I want to switch to Process because, if I have understood correctly, that will give me a speed-up.
Here is the code with Thread:
threads.append(Thread(target=getId, args=(my_queue, read)))
threads.append(Thread(target=getLatitude, args=(my_queue, read)))
The code works by putting the return values in the Queue; after a join on the threads list, I can retrieve the results.
After changing the code and the import statement, my code now looks like this:
threads.append(Process(target=getId, args=(my_queue, read)))
threads.append(Process(target=getLatitude, args=(my_queue, read)))
However, it does not execute anything and the Queue is empty. With Thread the Queue is not empty, so I think the problem is related to Process.
I have read answers claiming that the Process class does not work on Windows. Is that true, or is there a way to make it work (adding freeze_support() does not help)?
If it really does not work, is multithreading on Windows actually executed in parallel on different cores?
ref:
Python multiprocessing example not working
Python code with multiprocessing does not work on Windows
Multiprocessing process does not join when putting complex dictionary in return queue
(in which is described that fork does not exist on Windows)
EDIT:
To add some details:
the code with Process does work on CentOS.
EDIT2:
Here is a simplified version of my code with processes, tested on CentOS:
import pandas as pd
from multiprocessing import Process, freeze_support
from multiprocessing import Queue
#%% Global variables
datasets = []
latitude = []
def fun(key, job):
    global latitude
    if(key == 'LAT'):
        latitude.append(job)
def getLatitude(out_queue, skip = None):
    latDict = {'LAT' : latitude}
    out_queue.put(latDict)
n = pd.read_csv("my.csv", sep =',', header = None).shape[0]
print("Number of baboon:" + str(n))
read = []
for i in range(0,n):
    threads = []
    my_queue = Queue()
    threads.append(Process(target=getLatitude, args=(my_queue, read)))
    for t in threads:
        freeze_support() # try both with and without this line
        t.start()
    for t in threads:
        t.join()
    while not my_queue.empty():
        try:
            job = my_queue.get()
            key = list(job.keys())
            fun(key[0],job[key[0]])
        except:
            print("END")
    read.append(i)
Per the documentation, you need the following after the function definitions. When Python creates the child processes on Windows, they import your script, so any code at the global level runs again in every child. Put the code that should only run in the main process under this guard:
if __name__ == '__main__':
    n = pd.read_csv("my.csv", sep =',', header = None).shape[0]
    # etc.
Indent the rest of the code under this if.
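A minimal sketch of that structure, with made-up latitude values instead of the CSV file. It also makes two assumptions that go beyond the answer above: the data is passed to the child as an argument, because a child started on Windows does not see changes the parent made to its module-level lists, and the queue is read before join(), so a child is never left waiting to flush its data:

```python
from multiprocessing import Process, Queue


def get_latitude(out_queue, values):
    # Runs in the child: it only sees what is passed in as arguments.
    out_queue.put({"LAT": list(values)})


def collect(job, latitude):
    # Runs in the parent: merge one result dictionary into the list.
    for key, value in job.items():
        if key == "LAT":
            latitude.append(value)
    return latitude


def main():
    latitude = []
    my_queue = Queue()
    workers = [Process(target=get_latitude, args=(my_queue, [1.5, 2.5]))]
    for w in workers:
        w.start()
    # Read one result per worker before joining, so a child is never
    # blocked while flushing its data into the queue.
    for _ in workers:
        collect(my_queue.get(), latitude)
    for w in workers:
        w.join()
    print(latitude)


if __name__ == "__main__":
    main()
```

The names get_latitude, collect and main are illustrative only. The same shape should apply to the getId worker from the question.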

Is it possible to use multiprocessing in a module with windows?

I'm currently going through some pre-existing code with the goal of speeding it up. There are a few places that are extremely good candidates for parallelization. Since Python has the GIL, I thought I'd use the multiprocessing module.
However, from my understanding, the only way this will work on Windows is if I call the function that needs multiple processes from the highest-level script with the if __name__=='__main__' safeguard. This particular program was meant to be distributed and imported as a module, so it would be clunky to make the user copy and paste that safeguard, and it is something I'd really like to avoid.
Am I out of luck, or am I misunderstanding something about multiprocessing? Or is there any other way to do it on Windows?
For everyone still searching:
Inside the module:
from multiprocessing import Process
def printing(a):
    print(a)
def foo(name):
    var={"process":{}}
    if name == "__main__":
        for i in range(10):
            # args must be a tuple, hence the trailing comma
            var["process"][i] = Process(target=printing , args=(str(i),))
            var["process"][i].start()
        for i in range(10):
            var["process"][i].join()
inside main.py
import data
name = __name__
data.foo(name)
output:
>>2
>>6
>>0
>>4
>>8
>>3
>>1
>>9
>>5
>>7
I am a complete noob so please don't judge the coding OR presentation but at least it works.
As explained in the comments, perhaps you could do something like this:
#client_main.py
from mylib.mpSentinel import MPSentinel
#client logic
if __name__ == "__main__":
    MPSentinel.As_master()
# mpSentinel.py
class MPSentinel(object):
    _is_master = False
    @classmethod
    def As_master(cls):
        cls._is_master = True
    @classmethod
    def Is_master(cls):
        return cls._is_master
It's not ideal in that it's effectively a singleton/global, but it would work around Windows' lack of fork. You could still use MPSentinel.Is_master() to make multiprocessing optional, and it should prevent Windows from process bombing.
On MS Windows, you should be able to import the main module of a program without side effects such as starting a process.
When Python imports a module, it actually runs it.
So one way of doing that is to put the process-starting code in the if __name__ == '__main__' block.
Another way is to do it from within a function.
The following won't work on MS Windows:
from multiprocessing import Process
def foo():
    print('hello')
p = Process(target=foo)
p.start()
This is because it tries to start a process when importing the module.
The following example from the programming guidelines is OK:
from multiprocessing import Process, freeze_support, set_start_method
def foo():
    print('hello')
if __name__ == '__main__':
    freeze_support()
    set_start_method('spawn')
    p = Process(target=foo)
    p.start()
Because the code in the if block doesn't run when the module is imported.
But putting it in a function should also work:
from multiprocessing import Process
def foo():
    print('hello')
def bar():
    p = Process(target=foo)
    p.start()
When this module is run, it will define two new functions, not run them.
I've been developing an Instagram image scraper, and to make the download and save operations run faster I implemented multiprocessing in an auxiliary module. Note that this code is inside an auxiliary module and not inside the main module.
The solution I found is adding this line:
if __name__ != '__main__':
Pretty simple, but it's actually working!
import multiprocessing
import requests
def multi_proces(urls, profile):
    img_saved = 0
    if __name__ != '__main__': # line needed for the sake of getting this NOT to crash
        processes = []
        for url in urls:
            try:
                process = multiprocessing.Process(target=download_save, args=[url, profile, img_saved])
                processes.append(process)
                img_saved += 1
            except:
                continue
        for proce in processes:
            proce.start()
        for proce in processes:
            proce.join()
    return img_saved
def download_save(url, profile,img_saved):
    file = requests.get(url, allow_redirects=True) # Download
    # Backslashes are doubled so they are not read as escape sequences
    with open(f"scraped_data\\{profile}\\{profile}-{img_saved}.jpg", 'wb') as out: # Save
        out.write(file.content)

Python multiprocessing.Pool does not start right away

I want to input text to Python and process it in parallel. For that purpose I use multiprocessing.Pool. The problem is that sometimes, not always, I have to input text multiple times before anything is processed.
This is a minimal version of my code to reproduce the problem:
import multiprocessing as mp
import time
def do_something(text):
    print('Out: ' + text, flush=True)
    # do some awesome stuff here
if __name__ == '__main__':
    p = None
    while True:
        message = input('In: ')
        if not p:
            p = mp.Pool()
        p.apply_async(do_something, (message,))
What happens is that I have to input text multiple times before I get a result, no matter how long I wait after I have inputted something the first time. (As stated above, that does not happen every time.)
python3 test.py
In: a
In: a
In: a
In: Out: a
Out: a
Out: a
If I create the pool before the while loop, or if I add time.sleep(1) after creating the pool, it seems to work every time. Note: I do not want to create the pool before I get an input.
Does anyone have an explanation for this behavior?
I'm running Windows 10 with Python 3.4.2.
EDIT: Same behavior with Python 3.5.1.
EDIT:
An even simpler example with Pool and also ProcessPoolExecutor. I think the problem is the call to input() right after applying/submitting, which only seems to be a problem the first time something is applied/submitted.
import concurrent.futures
import multiprocessing as mp
import time
def do_something(text):
    print('Out: ' + text, flush=True)
    # do some awesome stuff here
# ProcessPoolExecutor
# if __name__ == '__main__':
#     with concurrent.futures.ProcessPoolExecutor() as executor:
#         executor.submit(do_something, 'a')
#         input('In:')
#     print('done')
# Pool
if __name__ == '__main__':
    p = mp.Pool()
    p.apply_async(do_something, ('a',))
    input('In:')
    p.close()
    p.join()
    print('done')
Your code worked when I tried it on my Mac.
In Python 3, it might help to explicitly declare how many worker processes will be in your pool (i.e. the number of simultaneous processes).
Try using p = mp.Pool(1):
import multiprocessing as mp
import time
def do_something(text):
    print('Out: ' + text, flush=True)
    # do some awesome stuff here
if __name__ == '__main__':
    p = None
    while True:
        message = input('In: ')
        if not p:
            p = mp.Pool(1)
        p.apply_async(do_something, (message,))
I could not reproduce it on Windows 7, but there are a few long shots worth mentioning for your issue.
Your antivirus might be interfering with the newly spawned processes. Try temporarily disabling it and see if the issue is still present.
Windows 10 might have a different I/O caching algorithm. Try inputting larger strings. If that works, it means the OS tries to be smart and sends data only once a certain amount has piled up.
As Windows has no fork() primitive, you might be seeing the delay caused by the spawn start method.
Python 3 added a new pool of workers called ProcessPoolExecutor. I'd recommend using it regardless of the issue you are facing.
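The last suggestion could look like the sketch below. It is not claimed to remove the delay described in the question; it only shows the ProcessPoolExecutor API with results returned to the parent instead of printed from the worker. The handle_messages helper and max_workers=2 are arbitrary choices for this illustration:

```python
import concurrent.futures


def do_something(text):
    # Return the result instead of printing it from the worker.
    return "Out: " + text


def handle_messages(messages):
    # Submit each message and collect the results in submission order.
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(do_something, m) for m in messages]
        return [f.result() for f in futures]


if __name__ == "__main__":
    for line in handle_messages(["a", "b", "c"]):
        print(line)
```

Because the parent does the printing, output order no longer depends on how worker output is interleaved with the input() prompt.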
