Why is the import statement not executed again in a Python Process? - python

I'm looking for a way to use multiprocessing to run scripts in parallel.
I have a function that launches 4 processes; each process executes a script through runpy.run_path() and I get a return value back.
Example :
import runpy
from multiprocessing import Process, Manager, Lock

def valorise(product, dico_valo):
    res = runpy.run_path(product + "/PyScript.py", run_name="__main__")
    dico_valo[product] = res["ret"]

def f(mutex, l, dico):
    while len(l) != 0:
        mutex.acquire()
        product = l.pop(0)
        mutex.release()
        p = Process(target=valorise, args=(product, dico))
        p.start()
        p.join()
def run_parallel_computations(valuationDate, list_scripts):
    if len(list_scripts) > 0:
        print '\n\nPARALLEL COMPUTATIONS BEGIN..........\n\n'
        manager = Manager()
        l = manager.list(list_scripts)
        dico = manager.dict()
        mutex = Lock()
        p1 = Process(target=f, args=(mutex, l, dico), name="script1")
        p2 = Process(target=f, args=(mutex, l, dico), name="script2")
        p3 = Process(target=f, args=(mutex, l, dico), name="script3")
        p4 = Process(target=f, args=(mutex, l, dico), name="script4")
        p1.start()
        p2.start()
        p3.start()
        p4.start()
        p1.join()
        p2.join()
        p3.join()
        p4.join()
        dico_isin = {}
        for i in iter(dico.keys()):
            dico_isin[i] = dico[i]
        print '\n\nPARALLEL COMPUTATIONS END..........'
        return dico
    else:
        print '\n\nNOTHING TO PRICE !'
Every PyScript.py imports a library, and each script has to import it again for itself. However, in this case it doesn't work the way I want and I don't understand why: my library is imported once, during the first process, and the same "import" is reused in the other processes.
Could you help me?
Thank you!

It might not be exactly what happens under multiprocessing (but it looks like it is).
When you try to import something more than once (e.g. import re in most of your modules), Python will not 're-import' it: it sees the module among those already imported (in sys.modules) and skips it.
To force reloading you can try reload(module_name) (importlib.reload in Python 3). Note that it cannot reload a single class or function imported from a module; you can reload a whole module or nothing.
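For instance, a minimal sketch of that idea (here 'mylib' is only a placeholder name for your library, which the question does not name), placed at the top of each PyScript.py:
import sys
try:
    from importlib import reload   # Python 3
except ImportError:
    pass                           # Python 2 has reload() as a builtin
if "mylib" in sys.modules:
    # Already imported (e.g. inherited from the parent process): reload it
    # so its top-level code runs again.
    mylib = reload(sys.modules["mylib"])
else:
    import mylib                   # hypothetical library name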

Related

Multiprocessing a for loop - got errors

I have some Python code and I want to run it with multiprocessing.
import multiprocessing as mp
from multiprocessing.sharedctypes import Value
import time
import math

resault_a = []
resault_b = []
resault_c = []

def make_calculation_one(numbers):
    for number in numbers:
        resault_a.append(math.sqrt(number**3))

def make_calculation_two(numbers):
    for number in numbers:
        resault_a.append(math.sqrt(number**4))

def make_calculation_three(numbers):
    for number in numbers:
        resault_c.append(math.sqrt(number**5))

number_list = list(range(1000000))

if __name__ == "__main__":
    mp.set_start_method("fork")
    p1 = mp.Process(target=make_calculation_one, args=(number_list))
    p2 = mp.Process(target=make_calculation_two, args=(number_list))
    p3 = mp.Process(target=make_calculation_three, args=(number_list))
    start = time.time()
    p1.start()
    p2.start()
    p3.start()
    end = time.time()
    print(end - start)
I got an empty array; where is the problem?
I also got some errors:
Process Process-1:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
How can I fix it?
Thanks
There are several issues with your code:
The major problem is that the args argument to the Process initializer requires a tuple or list. You are specifying args=(number_list). The parentheses around number_list do not make this a tuple; without a trailing comma you just have a parenthesized expression, i.e. just the list itself. So instead of passing a single argument that is a list, you are passing 1,000,000 arguments, while your "worker" functions only take 1 argument. You need: args=(number_list,) (a small illustrative sketch follows this list).
Your worker functions are doing calculations but neither printing nor returning the results of these calculations. Assuming you want to return the results, you need a mechanism for doing so. If you are using multiprocessing.Process then the usual solution is to pass to the worker function a multiprocessing.Queue instance to which the worker function can put the results (see below). You can also use a multiprocessing pool (also see below).
Your timing is not quite right. You have started the child processes and immediately set end without waiting for the tasks to complete. To get the actual time, end should only be set when the child processes have finished creating their results.
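To illustrate the first point with a minimal, self-contained sketch (the worker function here is a made-up stand-in, not taken from the question):
import multiprocessing as mp

def worker(numbers):
    # Expects exactly one positional argument: the whole list.
    print(len(numbers))

if __name__ == "__main__":
    number_list = list(range(10))
    # args=(number_list) would unpack into 10 separate arguments (the
    # parentheses are just grouping); the trailing comma below makes a
    # one-element tuple, so the whole list is passed as one argument.
    p = mp.Process(target=worker, args=(number_list,))
    p.start()
    p.join()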
Using Process with queues
import multiprocessing as mp
import time
import math

def make_calculation_one(numbers, out_q):
    out_q.put([math.sqrt(number**3) for number in numbers])

def make_calculation_two(numbers, out_q):
    out_q.put([math.sqrt(number**4) for number in numbers])

def make_calculation_three(numbers, out_q):
    out_q.put([math.sqrt(number**5) for number in numbers])

if __name__ == "__main__":
    # We only want one copy of `number_list`, i.e. in our main process.
    # But there is actually no need to convert to a list:
    number_list = range(1000000)
    mp.set_start_method("fork")
    out_q_1 = mp.Queue()
    out_q_2 = mp.Queue()
    out_q_3 = mp.Queue()
    # Create the 3 worker processes:
    p1 = mp.Process(target=make_calculation_one, args=(number_list, out_q_1))
    p2 = mp.Process(target=make_calculation_two, args=(number_list, out_q_2))
    p3 = mp.Process(target=make_calculation_three, args=(number_list, out_q_3))
    start = time.time()
    p1.start()
    p2.start()
    p3.start()
    results = []
    # Get return values:
    results.append(out_q_1.get())
    results.append(out_q_2.get())
    results.append(out_q_3.get())
    end = time.time()
    p1.join()
    p2.join()
    p3.join()
    print(end - start)
Using a shared memory array to pass the number list and to return the results
import multiprocessing as mp
import time
import math

def make_calculation_one(numbers, results):
    for idx, number in enumerate(numbers):
        results[idx] = math.sqrt(number**3)

def make_calculation_two(numbers, results):
    for idx, number in enumerate(numbers):
        results[idx] = math.sqrt(number**4)

def make_calculation_three(numbers, results):
    for idx, number in enumerate(numbers):
        results[idx] = math.sqrt(number**5)

if __name__ == "__main__":
    # We only want one copy of `number_list`, i.e. in our main process
    number_list = mp.RawArray('d', range(1000000))
    mp.set_start_method("fork")
    results_1 = mp.RawArray('d', len(number_list))
    results_2 = mp.RawArray('d', len(number_list))
    results_3 = mp.RawArray('d', len(number_list))
    # Create the 3 worker processes:
    p1 = mp.Process(target=make_calculation_one, args=(number_list, results_1))
    p2 = mp.Process(target=make_calculation_two, args=(number_list, results_2))
    p3 = mp.Process(target=make_calculation_three, args=(number_list, results_3))
    start = time.time()
    p1.start()
    p2.start()
    p3.start()
    p1.join()
    p2.join()
    p3.join()
    end = time.time()
    print(end - start)
Using a multiprocessing pool
import multiprocessing as mp
import time
import math

def make_calculation_one(numbers):
    return [math.sqrt(number**3) for number in numbers]

def make_calculation_two(numbers):
    return [math.sqrt(number**4) for number in numbers]

def make_calculation_three(numbers):
    return [math.sqrt(number**5) for number in numbers]

if __name__ == "__main__":
    # We only want one copy of `number_list`, i.e. in our main process
    number_list = range(1000000)
    mp.set_start_method("fork")
    # Create pool of size 3:
    pool = mp.Pool(3)
    start = time.time()
    async_results = []
    async_results.append(pool.apply_async(make_calculation_one, args=(number_list,)))
    async_results.append(pool.apply_async(make_calculation_two, args=(number_list,)))
    async_results.append(pool.apply_async(make_calculation_three, args=(number_list,)))
    # Now wait for results:
    results = [async_result.get() for async_result in async_results]
    end = time.time()
    pool.close()
    pool.join()
    print(end - start)
Conclusion
Since your calculations yield a type readily supported by shared memory, the second code example above should result in the best performance. You could also adapt the multiprocessing pool example to use shared memory.
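A rough sketch of that adaptation (not part of the original answer; it relies on the "fork" start method so that the module-level RawArray globals are inherited by the pool workers, and only two of the three calculations are shown):
import math
import multiprocessing as mp

# Shared, lock-free arrays created before the pool is forked; with the
# "fork" start method every pool worker inherits these globals instead of
# receiving pickled copies.
number_list = mp.RawArray('d', range(1000000))
results_1 = mp.RawArray('d', len(number_list))
results_2 = mp.RawArray('d', len(number_list))

def fill_one():
    for idx, number in enumerate(number_list):
        results_1[idx] = math.sqrt(number**3)

def fill_two():
    for idx, number in enumerate(number_list):
        results_2[idx] = math.sqrt(number**4)

if __name__ == "__main__":
    mp.set_start_method("fork")
    with mp.Pool(2) as pool:
        async_results = [pool.apply_async(fill_one), pool.apply_async(fill_two)]
        for r in async_results:
            r.get()  # wait for completion; the data is already in shared memory
    print(results_1[100], results_2[100])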
I'm getting a different error:
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
TypeError: make_calculation_one() takes 1 positional argument but 1000000 were given
but if I change these lines as follows, then it works:
p1 = mp.Process(target=make_calculation_one, args=([number_list]))
p2 = mp.Process(target=make_calculation_two, args=([number_list]))
p3 = mp.Process(target=make_calculation_three, args=([number_list]))
The function that is run in a worker Process cannot access data in the parent process.
If the "fork" start method is used, it would have access to the copy of that data in the forked process.
But modifying that would not alter the value in the parent process.
In this case, the easiest thing to do is to create a multiprocessing.Array and pass that to the process to use.
import math
import multiprocessing as mp

def make_calculation_one(numbers, res):
    for idx, number in enumerate(numbers):
        res[idx] = math.sqrt(number**3)

number_list = list(range(10000))

if __name__ == "__main__":
    result_a = mp.Array("d", len(number_list))
    p1 = mp.Process(target=make_calculation_one, args=(number_list, result_a))
    p1.start()
    p1.join()
    print(sum(result_a))
This code prints the value 3999500012.4745193.

Why is multiprocessing module not producing the desired result?

import multiprocessing as mp
import os

def cube(num):
    print(os.getpid())
    print("Cube is {}".format(num*num*num))

def square(num):
    print(os.getpid())
    print("Square is {}".format(num*num))

if __name__ == "__main__":
    p1 = mp.Process(target = cube, args = (3,))
    p2 = mp.Process(target = square, args = (4,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    print("Done")
I was using the multiprocessing module, but I am not able to print any output from a function called through it.
I even tried flushing stdout using the sys module.
Q : "Why is multiprocessing module not producing the desired result?"
Why?
Because it crashes.
The MWE/MCVE representation of the problem contains wrong code. It crashes, and that has nothing to do with sys.stdout.flush():
>>> cube( 4 )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in cube
NameError: global name 'os' is not defined
Solution :
>>> import os # be it in the __main__ or in the def()-ed functions...
>>> cube( 4 )
14165
Cube is 64
and your mp.Process()-based replicas of the Python process will stop crashing too.
MCVE that works :
(base) Fri May 29 14:29:33 $ conda activate py3
(py3) Fri May 29 14:34:55 $ python StackOverflow_mp.py
This is ____6745::__main__
This is ____6746::PID
This is ____6747::PID
Cube(__3) is _______27.
Square(__4) is _______16.
Done.
Works.
Q.E.D.
import multiprocessing as mp
import os
import sys
import time

def cube( num ):
    print( "This is {0:_>8d}::PID".format( os.getpid() ) )
    print( "Cube({0:_>3d}) is {1:_>9d}.".format( num, num*num*num ) )
    sys.stdout.flush()

def square( num ):
    print( "This is {0:_>8d}::PID".format( os.getpid() ) )
    print( "Square({0:_>3d}) is {1:_>9d}.".format( num, num*num ) )
    sys.stdout.flush()

if __name__ == "__main__":
    print( "This is {0:_>8d}::__main__".format( os.getpid() ) )
    p1 = mp.Process( target = cube, args = (3, ) )
    p2 = mp.Process( target = square, args = (4, ) )
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    time.sleep( 1 )
    print( "Done.\nWorks.\nQ.E.D." )
I copied and pasted your exact code. But I still didn't get the output from the called functions using the multiprocessing libraries. – Kartikeya Agarwal 47 mins ago
So,
- I opened a new Terminal process,
- I copied the conda activate py3 command and
- I hit Enter to let it run, so as to make the python3 ecosystem go live.
- I re-launched the proof-of-solution again: python StackOverflow_mp.py, and
- I hit Enter to let it run.
- I saw it working the very same way as it worked last time.
- I doubt the problem is on the side of the twice (re-)validated proof-of-solution, is it?
Q.E.D.
(py3) Fri May 29 19:53:58 $ python StackOverflow_mp.py
This is ___27202::__main__
This is ___27203::PID
Cube(__3) is _______27.
This is ___27204::PID
Square(__4) is _______16.
Done

python multiprocessing to create an excel file with multiple sheets [duplicate]

I am new to Python and I am trying to save the results of five different processes to one Excel file (each process writes to a different sheet). I have read different posts here, but still can't get it done as I'm very confused about pool.map, queues, and locks, and I'm not sure what is required here to fulfill this task.
This is my code so far:
list_of_days = ["2017.03.20", "2017.03.21", "2017.03.22", "2017.03.23", "2017.03.24"]
results = pd.DataFrame()

if __name__ == '__main__':
    global list_of_days
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    nr_of_cores = multiprocessing.cpu_count()
    l = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=nr_of_cores, initializer=init, initargs=(l,))
    pool.map(f, range(len(list_of_days)))
    pool.close()
    pool.join()

def init(l):
    global lock
    lock = l

def f(k):
    global results
    *** DO SOME STUFF HERE***
    results = results[ *** finished pandas dataframe *** ]
    lock.acquire()
    results.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    lock.release()
The result is that only one sheet gets created in Excel (I assume it is from the process that finishes last). Some questions about this code:
How to avoid defining global variables?
Is it even possible to pass around dataframes?
Should I move the locking to main instead?
I'd really appreciate some input here, as I consider mastering multiprocessing instrumental. Thanks
1) Why did you implement time.sleep in several places in your 2nd method?
In __main__, time.sleep(0.1) gives the started process a timeslice to start up.
In f2(fq, q), it gives the queue a timeslice to flush all buffered data to the pipe, since q.get_nowait() is used.
In w(q), it was only for testing, to simulate a long run of writer.to_excel(...); I removed this one.
2) What is the difference between pool.map and pool = [mp.Process( . )]?
Using pool.map needs no Queue and no parameters passed, and the code is shorter.
The worker process has to return the result immediately and then terminates.
pool.map keeps starting new processes until all iterations are done.
The results have to be processed after that.
Using pool = [mp.Process( . )] starts n processes.
A process terminates on queue.Empty.
Can you think of a situation where you would prefer one method over the other?
Method 1: Quick setup, serialized; you are only interested in the results before continuing.
Method 2: If you want to run the whole workload in parallel.
You can't use a global writer in the processes.
The writer instance has to belong to one process.
Usage of mp.Pool, for instance:
def f1(k):
    # *** DO SOME STUFF HERE***
    results = pd.DataFrame(df_)
    return results

if __name__ == '__main__':
    pool = mp.Pool()
    results = pool.map(f1, range(len(list_of_days)))
    writer = pd.ExcelWriter('../test/myfile.xlsx', engine='xlsxwriter')
    for k, result in enumerate(results):
        result.to_excel(writer, sheet_name=list_of_days[k])
    writer.save()
    pool.close()
This leads to .to_excel(...) being called sequentially in the __main__ process.
If you want parallel .to_excel(...) you have to use mp.Queue().
For instance:
The worker process:
# mp.Queue exceptions have to be imported from the queue module
try:
    # Python 3
    import queue
except ImportError:
    # Python 2
    import Queue as queue

def f2(fq, q):
    while True:
        try:
            k = fq.get_nowait()
        except queue.Empty:
            exit(0)
        # *** DO SOME STUFF HERE***
        results = pd.DataFrame(df_)
        q.put( (list_of_days[k], results) )
        time.sleep(0.1)
The writer process:
def w(q):
    writer = pd.ExcelWriter('myfile.xlsx', engine='xlsxwriter')
    while True:
        try:
            # Unpacking the 'STOP' sentinel string raises ValueError,
            # which is used here to shut the writer process down.
            titel, result = q.get()
        except ValueError:
            writer.save()
            exit(0)
        result.to_excel(writer, sheet_name=titel)
The __main__ process:
if __name__ == '__main__':
    w_q = mp.Queue()
    w_p = mp.Process(target=w, args=(w_q,))
    w_p.start()
    time.sleep(0.1)
    f_q = mp.Queue()
    for i in range(len(list_of_days)):
        f_q.put(i)
    pool = [mp.Process(target=f2, args=(f_q, w_q,)) for p in range(os.cpu_count())]
    for p in pool:
        p.start()
    time.sleep(0.1)
    for p in pool:
        p.join()
    w_q.put('STOP')
    w_p.join()
Tested with Python:3.4.2 - pandas:0.19.2 - xlsxwriter:0.9.6

multiprocessing does not save data

My program essentially scrapes images from websites (the site names here are made up). I have 3 functions, and each of them scrapes images from a specific website using a search parameter. My program contains the following code.
import requests
from bs4 import BeautifulSoup
from multiprocessing import Process

img1 = []
img2 = []
img3 = []

def my_func1(img_search):
    del img1[:]
    url1 = "http://www.somewebsite.com/" + str(img_search)
    r1 = requests.get(url1)
    soup1 = BeautifulSoup(r1.content)
    data1 = soup1.find_all("div", {"class": "img"})
    for item in data1:
        try:
            img1.append(item.contents[0].find('img')['src'])
        except:
            img1.append("img Unknown")
    return

def my_func2(img_search):
    del img2[:]
    url2 = "http://www.somewebsite2.com/" + str(img_search)
    r2 = requests.get(url2)
    soup2 = BeautifulSoup(r2.content)
    data2 = soup2.find_all("div", {"class": "img"})
    for item in data2:
        try:
            img2.append(item.contents[0].find('img')['src'])
        except:
            img2.append("img Unknown")
    return

def my_func3(img_search):
    del img3[:]
    url3 = "http://www.somewebsite3.com/" + str(img_search)
    r3 = requests.get(url3)
    soup3 = BeautifulSoup(r3.content)
    data3 = soup3.find_all("div", {"class": "img"})
    for item in data3:
        try:
            img3.append(item.contents[0].find('img')['src'])
        except:
            img3.append("img Unknown")
    return

my_func1("orange cat")
my_func2("blue cat")
my_func3("green cat")

print(*img1, sep='\n')
print(*img2, sep='\n')
print(*img3, sep='\n')
The scraping works just fine, but it is quite slow so I decided to use multiprocessing to speed it up, and multiprocessing did in fact speed it up. I essentially replaced the function calls with this
p = Process(target=my_func1, args=("orange cat",))
p.start()
p2 = Process(target=my_func2, args=("blue cat",))
p2.start()
p3 = Process(target=my_func3, args=("green cat",))
p3.start()
p.join()
p2.join()
p3.join()
However, when I print the img1, img2, and img3 lists, they are empty. How would I fix this?
When you use multiprocessing to distribute your work between several processes, each process will run in a separate namespace (a copy of the namespace of the main process). The changes you make in the child-process's namespace will not be reflected in the parent process's namespace. You'll need to use a multiprocessing.Queue or some other synchronization method to pass the data back from the worker processes.
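For example, a minimal sketch of the Queue approach (with a simplified stand-in body, since the real scraping code is not repeated here):
from multiprocessing import Process, Queue

def my_func1(img_search, out_q):
    # Stand-in for the real scraping; put the finished list on the queue
    # instead of appending to a global list.
    images = ["img_of_" + img_search]
    out_q.put(images)

if __name__ == "__main__":
    q = Queue()
    p = Process(target=my_func1, args=("orange cat", q))
    p.start()
    img1 = q.get()  # retrieve the result produced in the child process
    p.join()
    print(*img1, sep='\n')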
In your example code, your three functions are almost exactly the same, only the web site's domain and the variable names differ. If that's how your real functions look, I suggest using multiprocessing.Pool.map and passing the whole URL to a single function, rather than just passing search terms:
import multiprocessing
import requests
from bs4 import BeautifulSoup

def my_func(search_url):
    r = requests.get(search_url)
    soup = BeautifulSoup(r.content)
    data = soup.find_all("div", {"class": "img"})
    images = []
    for item in data:
        try:
            images.append(item.contents[0].find('img')['src'])
        except:
            images.append("img Unknown")
    return images

if __name__ == "__main__":
    searches = ['http://www.somewebsite1.com/?orange+cat',  # or whatever
                'http://www.somewebsite2.com/?blue+cat',
                'http://www.somewebsite3.com/?green+cat']
    pool = multiprocessing.Pool()  # will create as many processes as you have CPU cores
    results = pool.map(my_func, searches)
    pool.close()
    # do something with results, which will be a list of the function return values

Python - Fork Modules

My requirement is to do something like the below:
def task_a():
    ...
    ...
    return a1

def task_b():
    ...
    ...
    return b1
.
.
def task_z():
    ...
    ...
    return z1
Now in my main code I want to execute tasks a..z in parallel and then wait for the return values of all of the above:
a = task_a()
b = task_b()
z = task_z()
Is there a way to call the above functions in parallel in Python?
Thanks,
Manish
Reference:
Python: How can I run python functions in parallel?
Import:
from multiprocessing import Process
Add new function:
def runInParallel(*fns):
    proc = []
    for fn in fns:
        p = Process(target=fn)
        p.start()
        proc.append(p)
    for p in proc:
        p.join()
Pass your existing functions into the new function:
runInParallel(task_a, task_b, task_c...task_z)
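Note that runInParallel as written discards the return values. If you also need a1..z1 back, one possible variant (a sketch, not part of the original answer) uses a multiprocessing.Pool, which collects the results for you:
from multiprocessing import Pool

def run_in_parallel_with_results(*fns):
    # Run each function in its own worker process and gather the return
    # values in the same order the functions were passed in.
    with Pool(processes=len(fns)) as pool:
        async_results = [pool.apply_async(fn) for fn in fns]
        return [r.get() for r in async_results]

# Hypothetical usage, assuming task_a..task_z are defined as above:
# a, b, ..., z = run_in_parallel_with_results(task_a, task_b, ..., task_z)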
