For simplicity, let's suppose I'm downloading multiple large files from S3 to my local machine.
def get_file(name):
    # pull the file from S3 and return it as a DataFrame
    return df

if __name__ == "__main__":
    df1 = get_file("large_file_1.csv")
    df2 = get_file("large_file_2.csv")
    df3 = get_file("large_file_3.csv")
and I want to refactor this code to make these calls non-blocking (i.e. start pulling all of them from S3 at once and wait for them to finish). My first instinct is to use the threading module with something like
from threading import Thread

if __name__ == "__main__":
    t1 = Thread(target=get_file, args=("large_file_1.csv",))
    t2 = Thread(target=get_file, args=("large_file_2.csv",))
    t3 = Thread(target=get_file, args=("large_file_3.csv",))
    t1.start()
    t2.start()
    t3.start()
    t1.join()
    t2.join()
    t3.join()
However, Thread doesn't expose a way to assign the return value of the target function to a variable. What's the preferred way of going about this in Python?
A simple way to do the work concurrently, and get a response back from each thread, is to use a ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor

def get_file(f):
    # Do the real work (e.g. the S3 download) here
    return f + "1"  # Return a real result here

l = ["large_file_1.csv", "large_file_2.csv", "large_file_3.csv"]

pool = ThreadPoolExecutor(3)
out = pool.map(get_file, l)
print(list(out))
Output:
['large_file_1.csv1', 'large_file_2.csv1', 'large_file_3.csv1']
You could also keep using Thread directly and use a Queue to get the results back, but ThreadPoolExecutor abstracts that away for you, so there's really no need.
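For reference, here is a minimal sketch of that manual Thread-plus-Queue approach, with get_file standing in for the real S3 download (the worker helper and the placeholder body are illustrative, not from the original code):

from queue import Queue
from threading import Thread

def get_file(name):
    # placeholder for the real S3 download that would return a DataFrame
    return name + " (downloaded)"

def worker(name, results):
    # each thread puts a (name, result) pair onto the shared queue
    results.put((name, get_file(name)))

if __name__ == "__main__":
    results = Queue()
    names = ["large_file_1.csv", "large_file_2.csv", "large_file_3.csv"]
    threads = [Thread(target=worker, args=(n, results)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # drain the queue into a dict keyed by file name
    dfs = {}
    while not results.empty():
        name, df = results.get()
        dfs[name] = df
    print(dfs)

As noted above, ThreadPoolExecutor does all of this bookkeeping for you, so this is mainly useful if you already have Thread-based code you don't want to restructure.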
Related
I want to run two or more Python scripts simultaneously from a master script. Each of these scripts already has threads within it which run in parallel. For example, I run
script1.py
if __name__ == '__main__':
    pid_vav = PID_VAV('B2')
    t1 = threading.Thread(target=pid_vav.Controls)
    t1.daemon = False
    t1.start()
    t2 = threading.Thread(target=pid_vav.mqttConnection)
    t2.daemon = False
    t2.start()
script2.py
if __name__ == '__main__':
    pid_vav = PID_VAV('B4')
    t1 = threading.Thread(target=pid_vav.Controls)
    t1.daemon = False
    t1.start()
    t2 = threading.Thread(target=pid_vav.mqttConnection)
    t2.daemon = False
    t2.start()
I am running script1.py and script2.py separately; the only difference is the parameter which I am passing to the class. Is it possible to have a master script such that if I just run that, both of these scripts will run?
Thanks
Assuming you want the output of both scripts to be shown when you run the master script.
You can make use of the subprocess module to call the Python files, and the threading module to start them in separate threads:
from threading import Thread
import subprocess
t1 = Thread(target=subprocess.run, args=(["python", "script1.py"],))
t2 = Thread(target=subprocess.run, args=(["python", "script2.py"],))
t1.start()
t2.start()
t1.join()
t2.join()
If you want to trigger two scripts from a master script, you can use the method below.
It launches both scripts so that they run concurrently, and each script can still spawn its own threads internally. You can even make the scripts run independently of the master.
import subprocess
import sys

pid1 = subprocess.Popen([sys.executable, "script1.py"])
pid2 = subprocess.Popen([sys.executable, "script2.py"])
Yes, of course.
script_master.py:
from os import system

# 'start' is a Windows shell command, so this variant launches both scripts on Windows only
system('start script1.py && start script2.py')
But I think you could use this code instead:
script_together.py:
import threading

if __name__ == '__main__':
    # assumes PID_VAV is importable in this file
    todo = []
    todo.append(threading.Thread(target=lambda: PID_VAV('B2').Controls(), daemon=False))
    todo.append(threading.Thread(target=lambda: PID_VAV('B4').mqttConnection(), daemon=False))
    for th in todo:
        th.start()
    for th in todo:
        th.join()
If you're happy to have the code for both live in one file, you can use multiprocessing to run them concurrently on different CPU cores.
import multiprocessing as mp
from threading import Thread

def start_process(pid_vav_label):
    pid_vav, threads = PID_VAV(pid_vav_label), []
    threads.append(Thread(target=pid_vav.Controls))
    threads.append(Thread(target=pid_vav.mqttConnection))
    for thread in threads:
        thread.start()
    # Join if necessary
    for thread in threads:
        thread.join()

if __name__ == '__main__':
    processes = []
    for label in ['B2', 'B4']:
        processes.append(mp.Process(target=start_process, args=(label,)))
        processes[-1].start()
    # Again, can join if necessary
    for process in processes:
        process.join()
I am following a preceding question here: how to add more items to a multiprocessing queue while script in motion
The code I am working with now:
import multiprocessing

class MyFancyClass:
    def __init__(self, name):
        self.name = name

    def do_something(self):
        proc_name = multiprocessing.current_process().name
        print('Doing something fancy in {} for {}!'.format(proc_name, self.name))

def worker(q):
    while True:
        obj = q.get()
        if obj is None:
            break
        obj.do_something()

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(queue,))
    p.start()
    queue.put(MyFancyClass('Fancy Dan'))
    queue.put(MyFancyClass('Frankie'))
    # print(queue.qsize())
    queue.put(None)
    # Wait for the worker to finish
    queue.close()
    queue.join_thread()
    p.join()
Right now, there are two items in the queue. If I replace those two lines with a list of, say, 50 items, how do I set up a Pool so that a fixed number of processes are available? For example:
p = multiprocessing.Pool(processes=4)
where does that go? I'd like to be able to run multiple items at once, especially if the items run for a while.
Thanks!
As a rule, you either use Pool or Process(es) plus Queues. Mixing both is a misuse; the Pool already uses Queues (or a similar mechanism) behind the scenes.
If you want to do this with a Pool, change your code to the following (moving the code into a main function gives better performance and better resource cleanup than running it at global scope):
def main():
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your MyFancyClass instances here
    with multiprocessing.Pool(processes=4) as p:
        # Submit all the work
        futures = [p.apply_async(fancy.do_something) for fancy in myfancyclasses]
        # Done submitting, let workers exit as they run out of work
        p.close()
        # Wait until all the work is finished
        for f in futures:
            f.wait()

if __name__ == '__main__':
    main()
This could be simplified further, at the expense of purity, with the map family of methods on Pool (map, imap, imap_unordered); e.g. to minimize memory usage, redefine main as:
def main():
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your MyFancyClass instances here
    with multiprocessing.Pool(processes=4) as p:
        # No return value, so we ignore it, but we need to run out the result
        # or the work won't be done
        for _ in p.imap_unordered(MyFancyClass.do_something, myfancyclasses):
            pass
Yes, technically either approach has slightly higher overhead, because the return value you're not using still has to be serialized and sent back to the parent process. But in practice this cost is very low (your function has no return statement, so it returns None, which serializes to almost nothing). There's also an advantage to this style: you generally don't want to print from the child processes (their output will end up interleaved), so you can replace the prints with returns and let the parent do the printing, e.g.:
import multiprocessing

class MyFancyClass:
    def __init__(self, name):
        self.name = name

    def do_something(self):
        proc_name = multiprocessing.current_process().name
        # Changed from print to return
        return 'Doing something fancy in {} for {}!'.format(proc_name, self.name)

def main():
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your MyFancyClass instances here
    with multiprocessing.Pool(processes=4) as p:
        # Using the return value now to avoid interleaved output
        for res in p.imap_unordered(MyFancyClass.do_something, myfancyclasses):
            print(res)

if __name__ == '__main__':
    main()
Note how all of these solutions remove the need to write your own worker function, or manually manage Queues, because Pools do that grunt work for you.
Alternate approach using concurrent.futures to efficiently process results as they become available, while allowing you to choose to submit new work (either based on the results, or based on external information) as you go:
import concurrent.futures
from concurrent.futures import FIRST_COMPLETED

def main():
    allow_new_work = True  # Set to False to indicate we'll no longer allow new work
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your initial MyFancyClass instances here
    with concurrent.futures.ProcessPoolExecutor() as executor:
        remaining_futures = {executor.submit(fancy.do_something)
                             for fancy in myfancyclasses}
        while remaining_futures:
            done, remaining_futures = concurrent.futures.wait(remaining_futures,
                                                              return_when=FIRST_COMPLETED)
            for fut in done:
                result = fut.result()
                # Do stuff with result, maybe submit new work in response
            if allow_new_work:
                if should_stop_checking_for_new_work():
                    allow_new_work = False
                    # Let the workers exit when all remaining tasks are done,
                    # and reject submitting more work from now on
                    executor.shutdown(wait=False)
                elif has_more_work():
                    # Assumed to return a collection of new MyFancyClass instances
                    new_fanciness = get_more_fanciness()
                    remaining_futures |= {executor.submit(fancy.do_something)
                                          for fancy in new_fanciness}
                    myfancyclasses.extend(new_fanciness)
I am new to multiprocessing in Python, and I wrote the tiny script below:
import multiprocessing
import os

def task(queue):
    print(100)

def run(pool):
    queue = multiprocessing.Queue()
    for i in range(os.cpu_count()):
        pool.apply_async(task, args=(queue,))

if __name__ == '__main__':
    multiprocessing.freeze_support()
    pool = multiprocessing.Pool()
    run(pool)
    pool.close()
    pool.join()
I am wondering why the task() method is not executed and there is no output after running this script. Could anyone help me?
It is running, but it's dying with an error outside the main thread, and so you don't see the error. For that reason, it's always good to .get() the result of an async call, even if you don't care about the result: the .get() will raise the error that's otherwise invisible.
For example, change your loop like so:
tasks = []
for i in range(os.cpu_count()):
    tasks.append(pool.apply_async(task, args=(queue,)))
for t in tasks:
    t.get()
Then the new t.get() will blow up, ending with:
RuntimeError: Queue objects should only be shared between processes through inheritance
In short, passing Queue objects to Pool methods isn't supported.
But you can pass them to multiprocessing.Process(), or to a Pool initialization function. For example, here's a way to do the latter:
import multiprocessing
import os

def pool_init(q):
    global queue  # make queue global in workers
    queue = q

def task():
    # can use `queue` here if you like
    print(100)

def run(pool):
    tasks = []
    for i in range(os.cpu_count()):
        tasks.append(pool.apply_async(task))
    for t in tasks:
        t.get()

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(initializer=pool_init, initargs=(queue,))
    run(pool)
    pool.close()
    pool.join()
On Linux-y systems, you can - as the original error message suggested - use process inheritance instead (but that's not possible on Windows).
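For completeness, here is a minimal sketch of the other option mentioned above, passing the queue to multiprocessing.Process() directly (the task body is just illustrative):

import multiprocessing
import os

def task(queue):
    # the queue is passed explicitly as an argument, which Process() supports
    queue.put(100)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=task, args=(queue,))
             for _ in range(os.cpu_count())]
    for p in procs:
        p.start()
    results = [queue.get() for _ in procs]  # one item per worker process
    for p in procs:
        p.join()
    print(results)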
I am trying to process files in over 1000 directories. However, as the directories are very large, I have to run the processing concurrently to save time. The dir[:100] slice is there so that the function only has 100 directories to process. However, the second Thread does not start until the first Thread has finished.
def Sort(start, end):
    dir = os.listdir(filepath)
    dir = dir[start:end]
    for file in dir:
        ~process~
        print(result)

if __name__ == '__main__':
    Thread(target = Sort(0,100)).start()
    Thread(target = Sort(100,200)).start()
I have tried duplicating the Sort function as Sort2(), but this yields the same result:
Thread(target = Sort(0,100)).start()
Thread(target = Sort2(100,200)).start()
Any help or direction is greatly appreciated.
The problem is that Sort and Sort2 are being evaluated before you start your threads. When you say Sort(...), the () is an operator that executes the Sort function.
This answer could say more about your problem.
You'll have to use the args parameter for the Thread constructor. For example:
Thread(target=Sort, args=(0,100)).start()
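Putting that together with the code from the question, a minimal sketch of the fixed version (assuming the Sort function and filepath from the question are defined in the same file) would be:

from threading import Thread

if __name__ == '__main__':
    # pass the function object and its arguments separately;
    # each Thread then calls Sort(0, 100) / Sort(100, 200) concurrently
    t1 = Thread(target=Sort, args=(0, 100))
    t2 = Thread(target=Sort, args=(100, 200))
    t1.start()
    t2.start()
    t1.join()
    t2.join()

This keeps each range running in its own thread while the main thread waits for both to finish.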
import multiprocessing

def worker_sort(start, end):
    # include the code of your sorting function here
    pass

if __name__ == '__main__':
    jobs = []
    for start, end in [(0, 100), (100, 200)]:
        p = multiprocessing.Process(target=worker_sort, args=(start, end))
        jobs.append(p)
        p.start()
I'd recommend some reading about the multiprocessing module in the official documentation
Here is the code, which downloads 3 files and does something with each of them.
But Thread2 does not start until Thread1 has finished. How do I make them run together?
Please include some examples with commentary. Thanks
import threading
import urllib.request

def testForThread1():
    print('[Thread1]::Started')
    resp = urllib.request.urlopen('http://192.168.85.16/SOME_FILE')
    data = resp.read()
    # Do something with it
    return 'ok'

def testForThread2():
    print('[Thread2]::Started')
    resp = urllib.request.urlopen('http://192.168.85.10/SOME_FILE')
    data = resp.read()
    # Do something with it
    return 'ok'

if __name__ == "__main__":
    t1 = threading.Thread(name="Hello1", target=testForThread1())
    t1.start()
    t2 = threading.Thread(name="Hello2", target=testForThread2())
    t2.start()
    print(threading.enumerate())
    t1.join()
    t2.join()
    exit(0)
You are executing the target function for the thread in the thread instance creation.
if __name__ == "__main__":
t1 = threading.Thread(name="Hello1", target=testForThread1()) # <<-- here
t1.start()
This is equivalent to:
if __name__ == "__main__":
result = testForThread1() # == 'ok', this is the blocking execution
t1 = threading.Thread(name="Hello1", target=result)
t1.start()
It's Thread.start()'s job to run the target function in the new thread (a plain Thread simply discards the return value rather than handing it back to you). As you can see, the previous format was executing the blocking function in the main thread, which prevents any parallelism: the first function had to finish before execution even reached the line that creates the second thread.
The proper way to set the thread in a non-blocking fashion would be:
if __name__ == "__main__":
t1 = threading.Thread(name="Hello1", target=testForThread1) # tell thread what the target function is
# notice no function call braces for the function "testForThread1"
t1.start() # tell the thread to execute the target function
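Applying the same change to both threads, the corrected main block from the question might look like this:

if __name__ == "__main__":
    t1 = threading.Thread(name="Hello1", target=testForThread1)
    t2 = threading.Thread(name="Hello2", target=testForThread2)
    t1.start()  # both downloads now start right away
    t2.start()
    print(threading.enumerate())
    t1.join()   # the main thread then waits for both to finish
    t2.join()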
You could use threading for this, but note that because of the GIL, Python threads don't execute Python bytecode in parallel; for CPU-bound work like the test function below, the total time with threads would be roughly the sum of the individual run times. For that kind of workload, multiprocessing is the better way:
import multiprocessing

def test_function():
    # CPU-bound busy loop to simulate work
    for i in range(199999998):
        pass

if __name__ == '__main__':
    t1 = multiprocessing.Process(target=test_function)
    t2 = multiprocessing.Process(target=test_function)
    t1.start()
    t2.start()
    t1.join()
    t2.join()
This is the fastest solution here. You can check it using the following command:
time python3 filename.py
You will get output like this:
real 0m6.183s
user 0m12.277s
sys 0m0.009s
For a single-process program, real is normally about user + sys, where user is the CPU time spent executing the Python code.
Here that formula doesn't hold: each function takes roughly 6.14 seconds of CPU time, so user adds up to about 12.3 seconds, yet real (wall-clock) time is only 6.18 seconds, because multiprocessing ran both functions in parallel and cut the total time.
You can read more about real, user and sys time here.