terminate all processes in a Pool - python

I have a Python script that looks like the following:
import os
import tempfile
from multiprocessing import Pool

def runReport(a, b, c):
    # do task.
    temp_dir = tempfile.gettempdir()
    if (os.path.isfile(temp_dir + "/stop_check")):
        # How to terminate all processes in the pool here?

def runReports(args):
    return runReport(*args)

def main(argv):
    pool = Pool(4)
    args = []
    # Code to generate args. args is an array of tuples of form (a, b, c)
    pool.map(runReports, args)

if (__name__ == '__main__'):
    main(sys.argv[1:])
There is another python script that creates this file /tmp/stop_check.
When this file gets created, I need to terminate the Pool. How can I achieve this?

Only the parent process can terminate the pool. You're better off having the parent run a loop that checks for the existence of that file, rather than trying to have each child do it and then signal the parent somehow:
import os
import sys
import time
import tempfile
from multiprocessing import Pool

def runReport(*args):
    # do task
    pass

def runReports(args):
    return runReport(*args)

def main(argv):
    pool = Pool(4)
    args = []
    # Code to generate args. args is an array of tuples of form (a, b, c)
    result = pool.map_async(runReports, args)
    temp_dir = tempfile.gettempdir()
    while not result.ready():
        if os.path.isfile(temp_dir + "/stop_check"):
            pool.terminate()
            break
        result.wait(.5)  # Wait a bit to avoid pegging the CPU. You can tune this value as you see fit.

if (__name__ == '__main__'):
    main(sys.argv[1:])
By using map_async instead of map, you're free to have the parent use a loop to check for the existence of the file and then terminate the pool when necessary. Do note that using terminate to kill the children means they won't get to do any cleanup at all, so you need to make sure none of them access resources that could be left in an inconsistent state if the process dies while using them.
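If you'd rather let in-flight work finish cleanly instead of being killed, a cooperative shutdown is another option: share an Event with the workers through the Pool initializer and have each task check it before doing real work. A minimal sketch, assuming the real work still lives in runReports and that skipping the remaining queued tasks once the stop file appears is acceptable:
import os
import sys
import tempfile
from multiprocessing import Pool, Manager

def pool_init(event):
    global stop_event  # each worker process gets a reference to the shared event
    stop_event = event

def runReports(args):
    if stop_event.is_set():
        return None  # skip the work instead of being killed mid-task
    a, b, c = args
    # ... do the real report work here ...
    return (a, b, c)

def main(argv):
    manager = Manager()
    stop_event = manager.Event()
    args = []  # fill with (a, b, c) tuples as before
    with Pool(4, initializer=pool_init, initargs=(stop_event,)) as pool:
        result = pool.map_async(runReports, args)
        temp_dir = tempfile.gettempdir()
        while not result.ready():
            if os.path.isfile(os.path.join(temp_dir, "stop_check")):
                stop_event.set()  # running tasks finish; queued tasks return early
            result.wait(.5)

if __name__ == '__main__':
    main(sys.argv[1:])
Tasks that are already running complete normally, and everything still queued returns almost immediately, so the pool winds down without terminate().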

Related

How to run 10 python programs simultaneously?

I have a_1.py~a_10.py
I want to run 10 python programs in parallel.
I tried:
from multiprocessing import Process
import os

def info(title):
    # I want to execute python program
    pass

def f(name):
    for i in range(1, 11):
        subprocess.Popen(['python3', f'a_{i}.py'])

if __name__ == '__main__':
    info('main line')
    p = Process(target=f)
    p.start()
    p.join()
but it doesn't work
How do I solve this?
I would suggest using the subprocess module instead of multiprocessing:
import os
import subprocess
import sys

MAX_SUB_PROCESSES = 10

def info(title):
    print(title, flush=True)

if __name__ == '__main__':
    info('main line')
    # Create a list of subprocesses.
    processes = []
    for i in range(1, MAX_SUB_PROCESSES+1):
        pgm_path = f'a_{i}.py'  # Path to Python program.
        command = f'"{sys.executable}" "{pgm_path}" "{os.path.basename(pgm_path)}"'
        process = subprocess.Popen(command, bufsize=0)
        processes.append(process)
    # Wait for all of them to finish.
    for process in processes:
        process.wait()
    print('Done')
If you just need to call 10 external .py scripts (a_1.py ~ a_10.py) as separate processes, use the subprocess.Popen class:
import subprocess, sys

for i in range(1, 11):
    subprocess.Popen(['python3', f'a_{i}.py'])
# sys.exit() # optional
It's worth looking at the rich subprocess.Popen signature (you may find some useful params/options).
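For example, a few parameters that often come in handy when launching scripts this way (just a sketch; using sys.executable instead of 'python3', plus the cwd and output-capture options, are my additions rather than anything the scripts require):
import subprocess
import sys

procs = []
for i in range(1, 11):
    p = subprocess.Popen(
        [sys.executable, f'a_{i}.py'],  # sys.executable avoids depending on 'python3' being on PATH
        cwd='.',                        # working directory for the child process
        stdout=subprocess.PIPE,         # capture the child's output instead of sharing the console
        stderr=subprocess.STDOUT,
        text=True,
    )
    procs.append(p)

for p in procs:
    out, _ = p.communicate()  # wait for the child and read what it printed
    print(p.args, 'exited with code', p.returncode)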
You can use a multiprocessing pool to run them concurrently.
import multiprocessing as mp

def worker(module_name):
    """ Executes a module externally with python """
    __import__(module_name)
    return

if __name__ == "__main__":
    max_processes = 5
    module_names = [f"a_{i}" for i in range(1, 11)]
    print(module_names)
    with mp.Pool(max_processes) as pool:
        pool.map(worker, module_names)
The max_processes variable is the maximum number of workers to have working at any given time. In other words, it's the number of processes spawned by your program. The pool.map(worker, module_names) uses the available processes and calls worker on each item in your module_names list. We don't include the .py because we're running the module by importing it.
Note: This might not work if the code you want to run in your modules is contained inside if __name__ == "__main__" blocks. If that is the case, then my recommendation would be to move all the code in the if __name__ == "__main__" blocks of the a_{} modules into a main function. Additionally, you would have to change the worker to something like:
def worker(module_name):
    module = __import__(module_name)  # Kind of like 'import module_name as module'
    module.main()
    return
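For completeness, each a_{}.py would then look roughly like this (a hypothetical sketch; the print is only a stand-in for the module's real work):
# a_1.py (and likewise a_2.py ... a_10.py)
def main():
    # the code that used to live under `if __name__ == "__main__":`
    print("a_1 doing its work")

if __name__ == "__main__":
    main()  # the module stays runnable on its own with `python3 a_1.py`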

How to add a pool of processes available for a multiprocessing queue

I am following a preceding question here: how to add more items to a multiprocessing queue while script in motion
the code I am working with now:
import multiprocessing

class MyFancyClass:
    def __init__(self, name):
        self.name = name

    def do_something(self):
        proc_name = multiprocessing.current_process().name
        print('Doing something fancy in {} for {}!'.format(proc_name, self.name))

def worker(q):
    while True:
        obj = q.get()
        if obj is None:
            break
        obj.do_something()

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(queue,))
    p.start()
    queue.put(MyFancyClass('Fancy Dan'))
    queue.put(MyFancyClass('Frankie'))
    # print(queue.qsize())
    queue.put(None)
    # Wait for the worker to finish
    queue.close()
    queue.join_thread()
    p.join()
Right now, there are two items in the queue. If I replace those two lines with a list of, say, 50 items, how do I initiate a Pool to allow a number of processes to be available? For example:
p = multiprocessing.Pool(processes=4)
Where does that go? I'd like to be able to run multiple items at once, especially if the items run for a bit.
Thanks!
As a rule, you either use Pool or Process(es) plus Queues. Mixing both is a misuse; the Pool already uses Queues (or a similar mechanism) behind the scenes.
If you want to do this with a Pool, change your code to the following (moving the code into a main function gives better performance and better resource cleanup than running it in the global scope):
def main():
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your MyFancyClass instances here
    with multiprocessing.Pool(processes=4) as p:
        # Submit all the work
        futures = [p.apply_async(fancy.do_something) for fancy in myfancyclasses]
        # Done submitting, let workers exit as they run out of work
        p.close()
        # Wait until all the work is finished
        for f in futures:
            f.wait()

if __name__ == '__main__':
    main()
This could be simplified further at the expense of purity, with the .*map* methods of Pool, e.g. to minimize memory usage redefine main as:
def main():
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your MyFancyClass instances here
    with multiprocessing.Pool(processes=4) as p:
        # No return value, so we ignore it, but we need to run out the result
        # or the work won't be done
        for _ in p.imap_unordered(MyFancyClass.do_something, myfancyclasses):
            pass
Yes, technically either approach has slightly higher overhead, since the return value you're not using still has to be serialized and sent back to the parent process. But in practice this cost is pretty low (since your function has no return, it's returning None, which serializes to almost nothing). An advantage of this approach is that you generally don't want to print from the child processes (they'll end up interleaving output), so you can replace the printing with return values and let the parent do the printing, e.g.:
import multiprocessing

class MyFancyClass:
    def __init__(self, name):
        self.name = name

    def do_something(self):
        proc_name = multiprocessing.current_process().name
        # Changed from print to return
        return 'Doing something fancy in {} for {}!'.format(proc_name, self.name)

def main():
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your MyFancyClass instances here
    with multiprocessing.Pool(processes=4) as p:
        # Using the return value now to avoid interleaved output
        for res in p.imap_unordered(MyFancyClass.do_something, myfancyclasses):
            print(res)

if __name__ == '__main__':
    main()
Note how all of these solutions remove the need to write your own worker function, or manually manage Queues, because Pools do that grunt work for you.
Alternate approach using concurrent.futures to efficiently process results as they become available, while allowing you to choose to submit new work (either based on the results, or based on external information) as you go:
import concurrent.futures
from concurrent.futures import FIRST_COMPLETED

def main():
    allow_new_work = True  # Set to False to indicate we'll no longer allow new work
    myfancyclasses = [MyFancyClass('Fancy Dan'), ...]  # define your initial MyFancyClass instances here
    with concurrent.futures.ProcessPoolExecutor() as executor:
        remaining_futures = {executor.submit(fancy.do_something)
                             for fancy in myfancyclasses}
        while remaining_futures:
            done, remaining_futures = concurrent.futures.wait(remaining_futures,
                                                              return_when=FIRST_COMPLETED)
            for fut in done:
                result = fut.result()
                # Do stuff with result, maybe submit new work in response
            if allow_new_work:
                if should_stop_checking_for_new_work():
                    allow_new_work = False
                    # Let the workers exit when all remaining tasks done,
                    # and reject submitting more work from now on
                    executor.shutdown(wait=False)
                elif has_more_work():
                    # Assumed to return collection of new MyFancyClass instances
                    new_fanciness = get_more_fanciness()
                    remaining_futures |= {executor.submit(fancy.do_something)
                                          for fancy in new_fanciness}
                    myfancyclasses.extend(new_fanciness)

Issue with MultiProcessing in Python with BeautifulSoup 4

I'm having issues using most or all of the cores to process the files faster; it could be reading multiple files at a time or using multiple cores to read a single file.
I would prefer using multiple cores to read a single file before moving on to the next.
I tried the code below but can't seem to get all the cores used.
The following code basically retrieves the *.txt files in the directory, each of which contains HTML in JSON format.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
import json
import urlparse
import os
from bs4 import BeautifulSoup
from multiprocessing.dummy import Pool  # This is a thread-based Pool
from multiprocessing import cpu_count

def crawlTheHtml(htmlsource):
    htmlArray = json.loads(htmlsource)
    for eachHtml in htmlArray:
        soup = BeautifulSoup(eachHtml['result'], 'html.parser')
        if all(['another text to search' not in str(soup),
                'text to search' not in str(soup)]):
            try:
                gd_no = ''
                try:
                    gd_no = soup.find('input', {'id': 'GD_NO'})['value']
                except:
                    pass
                r = requests.post('domain api address', data={
                    'gd_no': gd_no,
                })
            except:
                pass

if __name__ == '__main__':
    pool = Pool(cpu_count() * 2)
    print(cpu_count())
    fileArray = []
    for filename in os.listdir(os.getcwd()):
        if filename.endswith('.txt'):
            fileArray.append(filename)
    for file in fileArray:
        with open(file, 'r') as myfile:
            htmlsource = myfile.read()
            results = pool.map(crawlTheHtml(htmlsource), f)
On top of that, I'm not sure what the ", f" in that last line represents.
Question 1:
What did I not do properly to fully utilize all the cores/threads?
Question 2:
Is there a better way to use try/except? Sometimes the value is not in the page, and that would cause the script to stop. When dealing with multiple variables, I end up with a lot of try/except statements.
Answer to question 1: your problem is this line:
from multiprocessing.dummy import Pool # This is a thread-based Pool
Answer taken from: multiprocessing.dummy in Python is not utilising 100% cpu
When you use multiprocessing.dummy, you're using threads, not processes:
multiprocessing.dummy replicates the API of multiprocessing but is no
more than a wrapper around the threading module.
That means you're restricted by the Global Interpreter Lock (GIL), and only one thread can actually execute CPU-bound operations at a time. That's going to keep you from fully utilizing your CPUs. If you want to get full parallelism across all available cores, you're going to need to address the pickling issue you're hitting with multiprocessing.Pool.
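A rough sketch of that restructuring, assuming crawlTheHtml stays defined as in the question (in the same script) and that handing each worker a filename, a plain picklable string, is acceptable:
import os
from multiprocessing import Pool, cpu_count

def crawlTheFile(filepath):
    # top-level function taking only a filename, so a process Pool can pickle the call
    with open(filepath, 'r') as myfile:
        htmlsource = myfile.read()
    crawlTheHtml(htmlsource)  # reuse the parsing function from the question
    return filepath

if __name__ == '__main__':
    files = [f for f in os.listdir(os.getcwd()) if f.endswith('.txt')]
    with Pool(cpu_count()) as pool:
        for done in pool.imap_unordered(crawlTheFile, files):
            print('finished', done)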
I had this problem. You need to do:
from multiprocessing import Pool
from multiprocessing import freeze_support
and at the end you need:
if __name__ == '__main__':
    freeze_support()
and then you can continue your script.
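Put together, the skeleton that advice describes looks like this (freeze_support() only matters if the script gets frozen into a Windows executable; the worker function here is just a placeholder):
from multiprocessing import Pool, freeze_support

def work(item):
    return item * 2  # placeholder for the real per-item work

if __name__ == '__main__':
    freeze_support()  # a no-op unless the script has been frozen into an executable
    with Pool() as pool:
        print(pool.map(work, range(10)))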
from multiprocessing import Pool, Queue
from os import getpid
from time import sleep
from random import random

MAX_WORKERS = 10

class Testing_mp(object):

    def __init__(self):
        """
        Initiates a queue, a pool and a temporary buffer, used only
        when the queue is full.
        """
        self.q = Queue()
        self.pool = Pool(processes=MAX_WORKERS, initializer=self.worker_main,)
        self.temp_buffer = []

    def add_to_queue(self, msg):
        """
        If the queue is full, put the message in a temporary buffer.
        If the queue is not full, add the message to the queue.
        If the buffer is not empty and the message queue is not full,
        put messages from the buffer back into the queue.
        """
        if self.q.full():
            self.temp_buffer.append(msg)
        else:
            self.q.put(msg)
            if len(self.temp_buffer) > 0:
                self.add_to_queue(self.temp_buffer.pop())

    def write_to_queue(self):
        """
        This function writes some messages to the queue.
        """
        for i in range(50):
            self.add_to_queue("First item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random()*2)
            self.add_to_queue("Second item for loop %d" % i)
            # Not really needed, just to show that some elements can be added
            # to the queue whenever you want!
            sleep(random()*2)

    def worker_main(self):
        """
        Waits indefinitely for an item to be written in the queue.
        Finishes when the parent process terminates.
        """
        print "Process {0} started".format(getpid())
        while True:
            # If queue is not empty, pop the next element and do the work.
            # If queue is empty, wait indefinitely until an element gets in the queue.
            item = self.q.get(block=True, timeout=None)
            print "{0} retrieved: {1}".format(getpid(), item)
            # simulate some random length operations
            sleep(random())

# Warning from Python documentation:
# Functionality within this package requires that the __main__ module be
# importable by the children. This means that some examples, such as the
# multiprocessing.Pool examples will not work in the interactive interpreter.
if __name__ == '__main__':
    mp_class = Testing_mp()
    mp_class.write_to_queue()
    # Waits a bit for the child processes to do some work
    # because when the parent exits, childs are terminated.
    sleep(5)

Issue with Pool and Queue of multiprocessing module in Python

I am new to multiprocessing in Python, and I wrote the tiny script below:
import multiprocessing
import os

def task(queue):
    print(100)

def run(pool):
    queue = multiprocessing.Queue()
    for i in range(os.cpu_count()):
        pool.apply_async(task, args=(queue, ))

if __name__ == '__main__':
    multiprocessing.freeze_support()
    pool = multiprocessing.Pool()
    run(pool)
    pool.close()
    pool.join()
I am wondering why the task() method is not executed and there is no output after running this script. Could anyone help me?
It is running, but it's dying with an error outside the main thread, and so you don't see the error. For that reason, it's always good to .get() the result of an async call, even if you don't care about the result: the .get() will raise the error that's otherwise invisible.
For example, change your loop like so:
tasks = []
for i in range(os.cpu_count()):
tasks.append(pool.apply_async(task, args=(queue,)))
for t in tasks:
t.get()
Then the new t.get() will blow up, ending with:
RuntimeError: Queue objects should only be shared between processes through inheritance
In short, passing Queue objects to Pool methods isn't supported.
But you can pass them to multiprocessing.Process(), or to a Pool initialization function. For example, here's a way to do the latter:
import multiprocessing
import os

def pool_init(q):
    global queue  # make queue global in workers
    queue = q

def task():
    # can use `queue` here if you like
    print(100)

def run(pool):
    tasks = []
    for i in range(os.cpu_count()):
        tasks.append(pool.apply_async(task))
    for t in tasks:
        t.get()

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(initializer=pool_init, initargs=(queue,))
    run(pool)
    pool.close()
    pool.join()
On Linux-y systems, you can - as the original error message suggested - use process inheritance instead (but that's not possible on Windows).
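For comparison, here is a minimal sketch of the multiprocessing.Process() route, where the queue is passed explicitly as an argument (this works on every platform, not only via fork inheritance):
import multiprocessing
import os

def task(queue):
    queue.put('hello from pid %d' % os.getpid())

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=task, args=(queue,))
             for _ in range(os.cpu_count())]
    for p in procs:
        p.start()
    for p in procs:
        print(queue.get())  # drain one message per worker before joining
    for p in procs:
        p.join()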

Using Python's Multiprocessing module to execute simultaneous and separate SEAWAT/MODFLOW model runs

I'm trying to complete 100 model runs on my 8-processor 64-bit Windows 7 machine. I'd like to run 7 instances of the model concurrently to decrease my total run time (approx. 9.5 min per model run). I've looked at several threads pertaining to the Multiprocessing module of Python, but am still missing something.
Using the multiprocessing module
How to spawn parallel child processes on a multi-processor system?
Python Multiprocessing queue
My Process:
I have 100 different parameter sets I'd like to run through SEAWAT/MODFLOW to compare the results. I have pre-built the model input files for each model run and stored them in their own directories. What I'd like to be able to do is have 7 models running at a time until all realizations have been completed. There needn't be communication between processes or display of results. So far I have only been able to spawn the models sequentially:
import os,subprocess
import multiprocessing as mp

ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'

files = []
for f in os.listdir(ws + r'\fieldgen\reals'):
    if f.endswith('.npy'):
        files.append(f)

## def work(cmd):
##     return subprocess.call(cmd, shell=False)

def run(f, def_param=ws):
    real = f.split('_')[2].split('.')[0]
    print 'Realization %s' % real

    mf2k = r'c:\modflow\mf2k.1_19\bin\mf2k.exe '
    mf2k5 = r'c:\modflow\MF2005_1_8\bin\mf2005.exe '
    seawatV4 = r'c:\modflow\swt_v4_00_04\exe\swt_v4.exe '
    seawatV4x64 = r'c:\modflow\swt_v4_00_04\exe\swt_v4x64.exe '

    exe = seawatV4x64
    swt_nam = ws + r'\reals\real%s\ss\ss.nam_swt' % real

    os.system( exe + swt_nam )

if __name__ == '__main__':
    p = mp.Pool(processes=mp.cpu_count()-1)  #-leave 1 processor available for system and other processes
    tasks = range(len(files))
    results = []
    for f in files:
        r = p.map_async(run(f), tasks, callback=results.append)
I changed the if __name__ == 'main': to the following in hopes it would fix the lack of parallelism I feel is being imparted on the above script by the for loop. However, the model fails to even run (no Python error):
if __name__ == '__main__':
    p = mp.Pool(processes=mp.cpu_count()-1)  #-leave 1 processor available for system and other processes
    p.map_async(run, ((files[f],) for f in range(len(files))))
Any and all help is greatly appreciated!
EDIT 3/26/2012 13:31 EST
Using the "Manual Pool" method in #J.F. Sebastian's answer below I get parallel execution of my external .exe. Model realizations are called up in batches of 8 at a time, but it doesn't wait for those 8 runs to complete before calling up the next batch and so on:
from __future__ import print_function
import os, subprocess, sys
import multiprocessing as mp
from Queue import Queue
from threading import Thread

def run(f, ws):
    real = f.split('_')[-1].split('.')[0]
    print('Realization %s' % real)
    seawatV4x64 = r'c:\modflow\swt_v4_00_04\exe\swt_v4x64.exe '
    swt_nam = ws + r'\reals\real%s\ss\ss.nam_swt' % real
    subprocess.check_call([seawatV4x64, swt_nam])

def worker(queue):
    """Process files from the queue."""
    for args in iter(queue.get, None):
        try:
            run(*args)
        except Exception as e:  # catch exceptions to avoid exiting the
                                # thread prematurely
            print('%r failed: %s' % (args, e,), file=sys.stderr)

def main():
    # populate files
    ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
    wdir = os.path.join(ws, r'fieldgen\reals')
    q = Queue()
    for f in os.listdir(wdir):
        if f.endswith('.npy'):
            q.put_nowait((os.path.join(wdir, f), ws))
    # start threads
    threads = [Thread(target=worker, args=(q,)) for _ in range(8)]
    for t in threads:
        t.daemon = True  # threads die if the program dies
        t.start()
    for _ in threads: q.put_nowait(None)  # signal no more files
    for t in threads: t.join()            # wait for completion

if __name__ == '__main__':
    mp.freeze_support()  # optional if the program is not frozen
    main()
No error traceback is available. The run() function performs its duty when called on a single model realization file, just as it does with multiple files. The only difference is that with multiple files it is called len(files) times, yet each of the instances immediately closes and only one model run is allowed to finish, at which time the script exits gracefully (exit code 0).
Adding some print statements to main() reveals some information about active thread counts as well as thread status (note that this is a test on only 8 of the realization files to make the screenshot more manageable; theoretically all 8 files should be run concurrently, however the behavior continues where they are spawned and immediately die except for one):
def main():
    # populate files
    ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
    wdir = os.path.join(ws, r'fieldgen\test')
    q = Queue()
    for f in os.listdir(wdir):
        if f.endswith('.npy'):
            q.put_nowait((os.path.join(wdir, f), ws))
    # start threads
    threads = [Thread(target=worker, args=(q,)) for _ in range(mp.cpu_count())]
    for t in threads:
        t.daemon = True  # threads die if the program dies
        t.start()
    print('Active Count a', threading.activeCount())
    for _ in threads:
        print(_)
        q.put_nowait(None)  # signal no more files
    for t in threads:
        print(t)
        t.join()  # wait for completion
    print('Active Count b', threading.activeCount())
**The line which reads "D:\\Data\\Users..." is the error information thrown when I manually stop the model from running to completion. Once I stop the model running, the remaining thread status lines get reported and the script exits.
EDIT 3/26/2012 16:24 EST
SEAWAT does allow concurrent execution as I've done this in the past, spawning instances manually using iPython and launching from each model file folder. This time around, I'm launching all model runs from a single location, namely the directory where my script resides. It looks like the culprit may be in the way SEAWAT is saving some of the output. When SEAWAT is run, it immediately creates files pertaining to the model run. One of these files is not being saved to the directory in which the model realization is located, but in the top directory where the script is located. This is preventing any subsequent threads from saving the same file name in the same location (which they all want to do since these filenames are generic and non-specific to each realization). The SEAWAT windows were not staying open long enough for me to read or even see that there was an error message, I only realized this when I went back and tried to run the code using iPython which directly displays the printout from SEAWAT instead of opening a new window to run the program.
I am accepting #J.F. Sebastian's answer as it is likely that once I resolve this model-executable issue, the threading code he has provided will get me where I need to be.
FINAL CODE
Added cwd argument in subprocess.check_call to start each instance of SEAWAT in its own directory. Very key.
from __future__ import print_function
import os, subprocess, sys
import multiprocessing as mp
from Queue import Queue
from threading import Thread
import threading

def run(f, ws):
    real = f.split('_')[-1].split('.')[0]
    print('Realization %s' % real)
    seawatV4x64 = r'c:\modflow\swt_v4_00_04\exe\swt_v4x64.exe '
    cwd = ws + r'\reals\real%s\ss' % real
    swt_nam = ws + r'\reals\real%s\ss\ss.nam_swt' % real
    subprocess.check_call([seawatV4x64, swt_nam], cwd=cwd)

def worker(queue):
    """Process files from the queue."""
    for args in iter(queue.get, None):
        try:
            run(*args)
        except Exception as e:  # catch exceptions to avoid exiting the
                                # thread prematurely
            print('%r failed: %s' % (args, e,), file=sys.stderr)

def main():
    # populate files
    ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
    wdir = os.path.join(ws, r'fieldgen\reals')
    q = Queue()
    for f in os.listdir(wdir):
        if f.endswith('.npy'):
            q.put_nowait((os.path.join(wdir, f), ws))
    # start threads
    threads = [Thread(target=worker, args=(q,)) for _ in range(mp.cpu_count()-1)]
    for t in threads:
        t.daemon = True  # threads die if the program dies
        t.start()
    for _ in threads: q.put_nowait(None)  # signal no more files
    for t in threads: t.join()            # wait for completion

if __name__ == '__main__':
    mp.freeze_support()  # optional if the program is not frozen
    main()
I don't see any computations in the Python code. If you just need to execute several external programs in parallel, it is sufficient to use subprocess to run the programs and the threading module to maintain a constant number of processes running, but the simplest code uses multiprocessing.Pool:
#!/usr/bin/env python
import os
import multiprocessing as mp

def run(filename_def_param):
    filename, def_param = filename_def_param  # unpack arguments
    ...  # call external program on `filename`

def safe_run(*args, **kwargs):
    """Call run(), catch exceptions."""
    try: run(*args, **kwargs)
    except Exception as e:
        print("error: %s run(*%r, **%r)" % (e, args, kwargs))

def main():
    # populate files
    ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
    workdir = os.path.join(ws, r'fieldgen\reals')
    files = ((os.path.join(workdir, f), ws)
             for f in os.listdir(workdir) if f.endswith('.npy'))

    # start processes
    pool = mp.Pool()  # use all available CPUs
    pool.map(safe_run, files)

if __name__=="__main__":
    mp.freeze_support()  # optional if the program is not frozen
    main()
If there are many files then pool.map() could be replaced by for _ in pool.imap_unordered(safe_run, files): pass.
There is also multiprocessing.dummy.Pool, which provides the same interface as multiprocessing.Pool but uses threads instead of processes; that might be more appropriate in this case.
You don't need to keep some CPUs free. Just use a command that starts your executables with a low priority (on Linux it is the nice program).
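For example, on Linux that is just an extra element at the front of the command list (a sketch; the executable and name-file paths here are placeholders, not the real SEAWAT paths):
import subprocess

# `nice -n 19` starts the model at the lowest priority, so it yields the CPU
# to interactive work instead of needing a reserved core.
subprocess.check_call(['nice', '-n', '19', './swt_v4x64', 'ss.nam_swt'])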
ThreadPoolExecutor example
concurrent.futures.ThreadPoolExecutor would be both simple and sufficient, but it requires a third-party dependency on Python 2.x (it has been in the stdlib since Python 3.2).
#!/usr/bin/env python
import os
import concurrent.futures

def run(filename, def_param):
    ...  # call external program on `filename`

# populate files
ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
wdir = os.path.join(ws, r'fieldgen\reals')
files = (os.path.join(wdir, f) for f in os.listdir(wdir) if f.endswith('.npy'))

# start threads
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    future_to_file = dict((executor.submit(run, f, ws), f) for f in files)

    for future in concurrent.futures.as_completed(future_to_file):
        f = future_to_file[future]
        if future.exception() is not None:
            print('%r generated an exception: %s' % (f, future.exception()))
        # run() doesn't return anything so `future.result()` is always `None`
Or if we ignore exceptions raised by run():
from itertools import repeat

...  # the same

# start threads
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(run, files, repeat(ws))
    # run() doesn't return anything so `map()` results can be ignored
subprocess + threading (manual pool) solution
#!/usr/bin/env python
from __future__ import print_function
import os
import subprocess
import sys
from Queue import Queue
from threading import Thread

def run(filename, def_param):
    ...  # define exe, swt_nam
    subprocess.check_call([exe, swt_nam])  # run external program

def worker(queue):
    """Process files from the queue."""
    for args in iter(queue.get, None):
        try:
            run(*args)
        except Exception as e:  # catch exceptions to avoid exiting the
                                # thread prematurely
            print('%r failed: %s' % (args, e,), file=sys.stderr)

# start threads
q = Queue()
threads = [Thread(target=worker, args=(q,)) for _ in range(8)]
for t in threads:
    t.daemon = True  # threads die if the program dies
    t.start()

# populate files
ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
wdir = os.path.join(ws, r'fieldgen\reals')
for f in os.listdir(wdir):
    if f.endswith('.npy'):
        q.put_nowait((os.path.join(wdir, f), ws))

for _ in threads: q.put_nowait(None)  # signal no more files
for t in threads: t.join()            # wait for completion
Here is my way to maintain a minimum number of threads in memory. It's a combination of the threading and multiprocessing modules. It may be unusual compared to the techniques respected fellow members have explained above, but it may be worth considering. For the sake of explanation, I am taking a scenario of crawling a minimum of 5 websites at a time.
So here it is:
# importing dependencies.
from multiprocessing import Process
from threading import Thread
import threading

# Crawler function
def crawler(domain):
    # define crawler technique here.
    output.write(scrapeddata + "\n")
    pass
Next is the threadController function. This function controls the flow of threads into main memory. It keeps activating threads to maintain the threadNum "minimum" limit, i.e. 5. It also won't exit until all active threads (activeCount) have finished.
It maintains a minimum of threadNum (5) startProcess function threads (these threads eventually start the processes from processList while joining them with a timeout of 60 seconds). After starting threadController, there are 2 threads that are not included in the above limit of 5, i.e. the main thread and the threadController thread itself. That's why threading.activeCount() != 2 is used.
def threadController():
    print "Thread count before child thread starts is:-", threading.activeCount(), len(processList)
    # starting first thread. This will make the activeCount=3
    Thread(target=startProcess).start()
    # loop while thread list is not empty OR active threads have not finished up.
    while len(processList) != 0 or threading.activeCount() != 2:
        if (threading.activeCount() < (threadNum + 2) and  # if count of active threads is less than the minimum AND
                len(processList) != 0):                    # processList is not empty
            Thread(target=startProcess).start()            # This line starts the startProcess function as a separate thread **
The startProcess function, run as a separate thread, starts processes from the processList. The purpose of this function (** started as a separate thread) is that it becomes a parent thread for those processes. So when it joins them with a timeout of 60 seconds, this stops the startProcess thread from moving ahead, but it doesn't stop threadController from performing. This way, threadController works as required.
def startProcess():
    pr = processList.pop(0)
    pr.start()
    pr.join(60.00)  # joining the thread with a time out of 60 seconds as a float.

if __name__ == '__main__':
    # a file holding a list of domains
    domains = open("Domains.txt", "r").read().split("\n")
    output = open("test.txt", "a")
    processList = []  # thread list
    threadNum = 5     # number of thread initiated processes to be run at one time

    # making process List
    for r in range(0, len(domains), 1):
        domain = domains[r].strip()
        p = Process(target=crawler, args=(domain,))
        processList.append(p)  # making a list of performer threads.

    # starting the threadController as a separate thread.
    mt = Thread(target=threadController)
    mt.start()
    mt.join()  # won't let go next until threadController thread finishes.

    output.close()
    print "Done"
Besides maintaining a minimum number of threads in memory, my aim was also to have something that could avoid stuck threads or processes in memory. I did this using the timeout.
My apologies for any typing mistakes.
I hope this construction helps anyone in this world.
Regards,
Vikas Gautam
