Terminate threads when main program exits - Python

I am trying to get started with multithreading in Python. I have multiple threads that acquire a lock, perform an operation, release the lock, and write the result to a CSV file. The threads should be terminated when the main thread finishes (e.g. after pressing Ctrl+C). How is this done?
I thought making the threads daemons would do the job, because "the entire Python program exits when only daemon threads are left" (official Python documentation). This is not sufficient. Following python: how to terminate a thread when main program ends, I then tried catching a KeyboardInterrupt and terminating the threads manually. This is not working either; it seems the KeyboardInterrupt is not caught correctly.
This makes me think there is something about multithreading in Python that I have misunderstood. Here is my code:
import random
import csv
import sys
from datetime import datetime
import threading

lock = threading.Lock()

def observer(obs):
    print("In " + str(obs))
    sys.stdout.flush()
    with open("data/test_" + str(obs) + ".csv", 'a', newline='') as csvfile:
        while True:
            datenow = datetime.today()
            lock.acquire()
            print(str(obs) + "acquired")
            sum = 0
            for i in range(10000):
                sum = sum + random.random()
            print(str(obs) + "released")
            sys.stdout.flush()
            lock.release()
            writer = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_ALL)
            writer.writerow([datenow.isoformat(), sum/10000])
            #print(str(obs) + ": " + datenow.isoformat() + " " + str(sum/1000))
            sys.stdout.flush()

if __name__ == "__main__":
    observe = [1, 2, 3, 4]
    processes = []
    for obs in observe:
        process = threading.Thread(target=observer, args=(obs,), daemon=True)
        processes.append(process)
    print("Start processes")
    for p in processes:
        p.start()
    print("Started")
    try:
        for p in processes:
            p.join()
    except KeyboardInterrupt:
        for p in processes:
            p.terminate()
        print("Keyboard interrupt")
    print("Finished")
Thanks!

This is wrong:
try:
    for p in processes:
        p.join()
except KeyboardInterrupt:
    for p in processes:
        p.terminate()
    print("Keyboard interrupt")
Your threads never exit the while True: loop, so you will be joining them forever. But that's not the point.
If Ctrl-C (aka KeyboardInterrupt) is not being caught, my best bet is:
You are running Python under Windows.
The shell under which you run your script manipulates the processes, and you never actually see the Ctrl-C because your program is terminated abruptly.
Your problem would then be:
Your program doesn't actually know it has terminated and that's why the threads hang.
If the assumptions made above are right, try this code (from Python - Windows - Exiting Child Process when "unrelated" parent dies/crashes)
import sys

def win_wait_for_parent(raise_exceptions=False):
    if not sys.platform == 'win32':
        return True
    # When started under cygwin, the parent process will die leaving a child
    # hanging around. The process has to be waited upon
    import ctypes
    from ctypes.wintypes import DWORD, BOOL, HANDLE
    import os
    import threading

    INFINITE = -1
    SYNCHRONIZE = 0x00100000

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.OpenProcess.argtypes = (DWORD, BOOL, DWORD)
    kernel32.OpenProcess.restype = HANDLE
    kernel32.WaitForSingleObject.argtypes = (HANDLE, DWORD)
    kernel32.WaitForSingleObject.restype = DWORD

    phandle = kernel32.OpenProcess(SYNCHRONIZE, 0, os.getppid())

    def check_parent():
        # Get a token with right access to parent and wait for it to be
        # signaled (die). Exit ourselves then
        kernel32.WaitForSingleObject(phandle, INFINITE)
        os._exit(0)

    if not phandle:
        if raise_exceptions:
            raise ctypes.WinError(ctypes.get_last_error())
        return False

    # Assumed final step (the excerpt above ends here): run check_parent()
    # in a background daemon thread so the wait happens off the main thread.
    watcher = threading.Thread(target=check_parent)
    watcher.daemon = True
    watcher.start()
    return True
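Independent of the Windows issue, the worker loop itself can be made stoppable: replace the unconditional while True: with a check on a shared threading.Event and keep the main thread in an interruptible wait instead of a bare join(). A minimal sketch of that pattern (illustrative names, not the original code):

import threading
import time

stop_event = threading.Event()

def observer(obs):
    while not stop_event.is_set():
        # ... acquire the lock, compute, write one csv row ...
        time.sleep(0.1)
    print("observer %s stopped" % obs)

if __name__ == "__main__":
    threads = [threading.Thread(target=observer, args=(i,), daemon=True)
               for i in range(4)]
    for t in threads:
        t.start()
    try:
        while any(t.is_alive() for t in threads):
            time.sleep(0.5)        # keep the main thread interruptible
    except KeyboardInterrupt:
        stop_event.set()           # ask workers to finish their current iteration
        for t in threads:
            t.join(timeout=5)      # daemon threads are abandoned if they ignore the flag
    print("Finished")

With this shape, Ctrl+C sets the flag, the workers finish their current iteration, and any daemon thread that ignores the flag is discarded when the interpreter exits.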

Related

python how to cleanup a subprocess

I'm running a Python script that restarts a subprocess every 2 hours. Code as follows:
import datetime
import os
import signal
import subprocess
import sys
import time

if __name__ == '__main__':
    while True:
        with subprocess.Popen([sys.executable, '/script.py']) as p:
            start_time = datetime.datetime.now()
            try:
                print(f'starting the process at {p.pid}')
                p.communicate(timeout=4800)
            except subprocess.TimeoutExpired:
                print('timeout terminate process')
                os.kill(p.pid, signal.SIGINT)
                p.terminate()
                p.wait(60)
                p.kill()
            dif = datetime.datetime.now() - start_time
            t = 4810 - dif.total_seconds()
            print(f'restarting after {t/60} mins')
            time.sleep(t)
In my script.py, I have a thread pool executor where each thread runs a Chrome webdriver instance. The issue I'm having is that when the process times out, the Python process is terminated but all the webdriver instances are still lingering. Is there a way for the parent process to kill all the spawned child processes, i.e. the chromedriver instances in my case? Running on my Mac.
Edit:
script.py
def run_with_config(user):
    i = WebDriverInstance()
    try:
        return i.run_loop()  # while true infinite loop running some ui actions
    except Exception:
        return False
    finally:
        i.quit()

def run(users):
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(user_configs)) as executor:
        f_to_user = {}
        for c in user_configs:
            f = executor.submit(run_with_config, c)
            f_to_user[f] = c
        for f in concurrent.futures.as_completed(f_to_user):
            res = f.result()
            print(res)
Run the subprocess in a new process group, then kill the entire group with os.killpg().
if __name__ == '__main__':
    while True:
        with subprocess.Popen([sys.executable, '/script.py'], creationflags=subprocess.CREATE_NEW_PROCESS_GROUP) as p:
            start_time = datetime.datetime.now()
            try:
                print(f'starting the process at {p.pid}')
                p.communicate(timeout=4800)
            except subprocess.TimeoutExpired:
                print('timeout terminate process')
                os.killpg(p.pid, signal.SIGINT)
                p.terminate()
                p.wait(60)
                p.kill()
            dif = datetime.datetime.now() - start_time
            t = 4810 - dif.total_seconds()
            print(f'restarting after {t/60} mins')
            time.sleep(t)
Passing preexec_fn=os.setpgrp to subprocess.Popen works for me.
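Note that subprocess.CREATE_NEW_PROCESS_GROUP exists only on Windows, while os.killpg is POSIX-only; on macOS/Linux the comment above is the right direction. A minimal sketch of that POSIX variant (illustrative, not the original answer's code):

import os
import signal
import subprocess
import sys

# start_new_session=True puts the child in its own process group (and session),
# similar in effect to preexec_fn=os.setpgrp for this purpose.
p = subprocess.Popen([sys.executable, '/script.py'], start_new_session=True)
try:
    p.communicate(timeout=4800)
except subprocess.TimeoutExpired:
    os.killpg(os.getpgid(p.pid), signal.SIGINT)       # signal the child and its children
    try:
        p.wait(60)
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(p.pid), signal.SIGKILL)  # escalate if the group is still alive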

python multiprocessing.Process.terminate - How to kill child processes

This code:
import multiprocessing as mp
from threading import Thread
import subprocess
import time

class WorkerProcess(mp.Process):
    def run(self):
        # Simulate long running task
        self.subprocess = subprocess.Popen(['python', '-c', 'import time; time.sleep(1000)'])
        self.code = self.subprocess.wait()

class ControlThread(Thread):
    def run(self):
        jobs = []
        for _ in range(2):
            job = WorkerProcess()
            jobs.append(job)
            job.start()
        # wait for a while and then kill jobs
        time.sleep(2)
        for job in jobs:
            job.terminate()

if __name__ == "__main__":
    controller = ControlThread()
    controller.start()
When I terminate the spawned WorkerProcess instances, they die just fine; however, the subprocesses python -c 'import time; time.sleep(1000)' run until completion. This is well documented in the official docs, but how do I kill the child processes of a killed process?
A possible solution might be:
Wrap the WorkerProcess.run() method in a try/except block that catches SIGTERM and terminates the subprocess call. But I am not sure how to catch SIGTERM in the WorkerProcess.
I also tried setting signal.signal(signal.SIGINT, handler) in the WorkerProcess, but I am getting a ValueError, because it is only allowed to be set in the main thread.
What do I do now?
EDIT: As @svalorzen pointed out in the comments, this doesn't really work since the reference to self.subprocess is lost.
Finally came to a clean, acceptable solution. Since mp.Process.terminate is a method, we can override it.
class WorkerProcess(mp.Process):
    def run(self):
        # Simulate long running task
        self.subprocess = subprocess.Popen(['python', '-c', 'import time; time.sleep(1000)'])
        self.code = self.subprocess.wait()

    # HERE
    def terminate(self):
        self.subprocess.terminate()
        super(WorkerProcess, self).terminate()
You can use queues to send messages to your subprocesses and ask them nicely to terminate their children before exiting themselves. You can't use signals anywhere but your main thread, so signals are not suitable for this.
Curiously, when I modify the code like this, even if I interrupt it with Ctrl+C, the subprocesses die as well. This may be an OS-related thing, though.
import multiprocessing as mp
from threading import Thread
import subprocess
import time
from Queue import Empty

class WorkerProcess(mp.Process):
    def __init__(self, que):
        super(WorkerProcess, self).__init__()
        self.queue = que

    def run(self):
        # Simulate long running task
        self.subprocess = subprocess.Popen(['python', '-c', 'import time; time.sleep(1000)'])
        while True:
            a = self.subprocess.poll()
            if a is None:
                time.sleep(1)
                try:
                    if self.queue.get(0) == "exit":
                        print "kill"
                        self.subprocess.kill()
                        self.subprocess.wait()
                        break
                    else:
                        pass
                except Empty:
                    pass
                print "run"
            else:
                print "exiting"

class ControlThread(Thread):
    def run(self):
        jobs = []
        queues = []
        for _ in range(2):
            q = mp.Queue()
            job = WorkerProcess(q)
            queues.append(q)
            jobs.append(job)
            job.start()
        # wait for a while and then kill jobs
        time.sleep(5)
        for q in queues:
            q.put("exit")
        time.sleep(30)

if __name__ == "__main__":
    controller = ControlThread()
    controller.start()
Hope this helps.
Hannu

python threading: running a process in the background

I was trying to generate a text report file from a data source, which takes an enormous amount of time; to simulate this I wrote the following code.
I planned to do it using a thread and thought t.daemon = True would do the trick, but the program doesn't exit till the operation is complete.
import random
import threading
import time
import logging

logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

def worker():
    """thread worker function"""
    t = threading.currentThread()
    tag = random.randint(1, 64)
    file_name = "/tmp/t-%d.txt" % (tag)
    logging.debug('started writing file - %s', file_name)
    f = open(file_name, 'w')
    for x in xrange(2 ** tag):  # total no of lines is 2**tag
        f.write("%d\n" % x)
    logging.debug('ending')
    f.close()
    return

# to simulate 5 files
for i in range(5):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

main_thread = threading.currentThread()
for t in threading.enumerate():
    if t is main_thread:
        continue
    logging.debug('joining %s', t.getName())
    t.join()
When I removed t.join(), some of the data got written before the program exited, and the program exited quickly; adding t.join() keeps the program running till the end. Is there any way to exit the program while the work keeps running in the background to complete the task?
You aren't looking for a daemon. In fact, you want to make sure your process isn't a daemon, because it will get killed once that's all that's left and your program exits. You are looking to detach the work into a separate process.
Note: I lowered the max in case I forgot to kill the processes (and so it won't take my entire disk). You will need to kill each process individually if you want them to stop! i.e. "kill 13345" if you had the message "exiting main 13345" (where that process is over 2**25).
Also note: thread joining will keep going until the end, because your program is not done running and is waiting to join the threads.
Here's what you want:
import logging
import random
import multiprocessing
import time
import sys

# Make sure you don't write to stdout after this program stopped running and sub-processes are logging!
logging.basicConfig(level=logging.DEBUG,
                    format='(%(threadName)-10s) %(message)s',
                    )

def detach():
    p = multiprocessing.current_process()
    name = "worker" + str(p.pid)
    cc = multiprocessing.Process(name=name, target=worker)
    cc.daemon = False
    cc.start()
    logging.debug('Detached process: %s %s', p.name, p.pid)
    sys.stdout.flush()

def worker():
    """thread worker function"""
    # Should probably make sure there isn't already a thread processing this file already...
    tag = random.randint(5, 33)  # Stop at 33 to make sure we don't take over the harddrive (8GB)
    file_name = "/tmp/t-%d.txt" % (tag)
    if tag > 26:
        logging.warning('\n\nThe detached process resulting from this may need to be killed by hand.\n')
    logging.debug('started writing file - %s', file_name)
    # Changed your code to use "with", available in any recent python version
    with open(file_name, 'w') as f:
        for x in xrange(2 ** tag):  # total no of lines is 2**tag
            f.write("%d\n" % x)
    return

# Stackoverflow: Keep scrolling to see more code!
# to simulate 5 files
for i in range(5):
    t = multiprocessing.Process(target=detach)
    t.daemon = False
    t.start()
    time.sleep(0.5)
    t.terminate()

logging.debug("Terminating main program")

Python - How to pass global variable to multiprocessing.Process?

I need to terminate some processes after a while, so I've used another process that sleeps for the waiting. But the new process doesn't seem to have access to the global variables from the main process. How could I solve this, please?
Code:
import os
from subprocess import Popen, PIPE
import time
import multiprocessing

log_file = open('stdout.log', 'a')
log_file.flush()
err_file = open('stderr.log', 'a')
err_file.flush()

processes = []

def processing():
    print "processing"
    global processes
    global log_file
    global err_file
    for i in range(0, 5):
        p = Popen(['java', '-jar', 'C:\\Users\\two\\Documents\\test.jar'], stdout=log_file, stderr=err_file)  # something long running
        processes.append(p)
    print len(processes)  # returns 5

def waiting_service():
    name = multiprocessing.current_process().name
    print name, 'Starting'
    global processes
    print len(processes)  # returns 0
    time.sleep(2)
    for i in range(0, 5):
        processes[i].terminate()
    print name, 'Exiting'

if __name__ == '__main__':
    processing()
    service = multiprocessing.Process(name='waiting_service', target=waiting_service)
    service.start()
You should be using synchronization primitives.
Possibly you want to set an Event that's triggered after a while by the main (parent) process.
You may also want to wait for the processes to actually complete and join them (like you would a thread).
If you have many similar tasks, you can use a processing pool like multiprocessing.Pool.
Here is a small example of how it's done:
import multiprocessing
import time

kill_event = multiprocessing.Event()

def work(_id):
    while not kill_event.is_set():
        print "%d is doing stuff" % _id
        time.sleep(1)
    print "%d quit" % _id

def spawn_processes():
    processes = []
    # spawn 10 processes
    for i in xrange(10):
        # spawn process
        process = multiprocessing.Process(target=work, args=(i,))
        processes.append(process)
        process.start()
        time.sleep(1)
    # kill all processes by setting the kill event
    kill_event.set()
    # wait for all processes to complete
    for process in processes:
        process.join()
    print "done!"

spawn_processes()
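For the multiprocessing.Pool option mentioned above, here is a minimal sketch (illustrative, not part of the original answer); the pool owns the worker processes, so terminating them on interrupt is a single call:

import multiprocessing

def work(_id):
    # ... do a bounded unit of work ...
    return _id * _id

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    try:
        print(pool.map(work, range(10)))  # blocks until every task finishes
        pool.close()                      # no more tasks will be submitted
    except KeyboardInterrupt:
        pool.terminate()                  # kill the worker processes immediately
    finally:
        pool.join()                       # wait for the workers to go away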
The whole problem was with multiprocessing on Windows: child processes are spawned rather than forked there, so they don't inherit the parent's module-level globals. I've switched to Linux, where fork() is used, and my script works OK.
Special thanks to @rchang for his comment:
When I tested it, in both cases the print statement came up with 5. Perhaps we have a version mismatch in some way? I tested it with Python 2.7.6 on Linux kernel 3.13.0 (Mint distribution).
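For a version that works on both Windows and Linux, one option (sketched below with illustrative names) is to pass the child process everything it needs as arguments instead of relying on globals, since a spawned child does not inherit the parent's module-level state:

import os
import signal
import time
import multiprocessing
from subprocess import Popen

def waiting_service(pids):
    time.sleep(2)
    for pid in pids:
        # os.kill with SIGTERM terminates the process on POSIX;
        # on Windows, any non-special signal value ends up calling TerminateProcess.
        os.kill(pid, signal.SIGTERM)

if __name__ == '__main__':
    processes = [Popen(['java', '-jar', 'C:\\Users\\two\\Documents\\test.jar'])
                 for _ in range(5)]
    service = multiprocessing.Process(name='waiting_service', target=waiting_service,
                                      args=([p.pid for p in processes],))
    service.start()
    service.join()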

Using Python's Multiprocessing module to execute simultaneous and separate SEAWAT/MODFLOW model runs

I'm trying to complete 100 model runs on my 8-processor 64-bit Windows 7 machine. I'd like to run 7 instances of the model concurrently to decrease my total run time (approx. 9.5 min per model run). I've looked at several threads pertaining to the Multiprocessing module of Python, but am still missing something.
Using the multiprocessing module
How to spawn parallel child processes on a multi-processor system?
Python Multiprocessing queue
My Process:
I have 100 different parameter sets I'd like to run through SEAWAT/MODFLOW to compare the results. I have pre-built the model input files for each model run and stored them in their own directories. What I'd like to be able to do is have 7 models running at a time until all realizations have been completed. There needn't be communication between processes or display of results. So far I have only been able to spawn the models sequentially:
import os, subprocess
import multiprocessing as mp

ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
files = []
for f in os.listdir(ws + r'\fieldgen\reals'):
    if f.endswith('.npy'):
        files.append(f)

## def work(cmd):
##     return subprocess.call(cmd, shell=False)

def run(f, def_param=ws):
    real = f.split('_')[2].split('.')[0]
    print 'Realization %s' % real

    mf2k = r'c:\modflow\mf2k.1_19\bin\mf2k.exe '
    mf2k5 = r'c:\modflow\MF2005_1_8\bin\mf2005.exe '
    seawatV4 = r'c:\modflow\swt_v4_00_04\exe\swt_v4.exe '
    seawatV4x64 = r'c:\modflow\swt_v4_00_04\exe\swt_v4x64.exe '

    exe = seawatV4x64
    swt_nam = ws + r'\reals\real%s\ss\ss.nam_swt' % real

    os.system(exe + swt_nam)

if __name__ == '__main__':
    p = mp.Pool(processes=mp.cpu_count()-1)  # leave 1 processor available for system and other processes
    tasks = range(len(files))
    results = []
    for f in files:
        r = p.map_async(run(f), tasks, callback=results.append)
I changed the if __name__ == 'main': to the following in hopes it would fix the lack of parallelism I feel is being imparted on the above script by the for loop. However, the model fails to even run (no Python error):
if __name__ == '__main__':
    p = mp.Pool(processes=mp.cpu_count()-1)  # leave 1 processor available for system and other processes
    p.map_async(run, ((files[f],) for f in range(len(files))))
Any and all help is greatly appreciated!
EDIT 3/26/2012 13:31 EST
Using the "Manual Pool" method in #J.F. Sebastian's answer below I get parallel execution of my external .exe. Model realizations are called up in batches of 8 at a time, but it doesn't wait for those 8 runs to complete before calling up the next batch and so on:
from __future__ import print_function
import os, subprocess, sys
import multiprocessing as mp
from Queue import Queue
from threading import Thread

def run(f, ws):
    real = f.split('_')[-1].split('.')[0]
    print('Realization %s' % real)
    seawatV4x64 = r'c:\modflow\swt_v4_00_04\exe\swt_v4x64.exe '
    swt_nam = ws + r'\reals\real%s\ss\ss.nam_swt' % real
    subprocess.check_call([seawatV4x64, swt_nam])

def worker(queue):
    """Process files from the queue."""
    for args in iter(queue.get, None):
        try:
            run(*args)
        except Exception as e:  # catch exceptions to avoid exiting the
                                # thread prematurely
            print('%r failed: %s' % (args, e,), file=sys.stderr)

def main():
    # populate files
    ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
    wdir = os.path.join(ws, r'fieldgen\reals')
    q = Queue()
    for f in os.listdir(wdir):
        if f.endswith('.npy'):
            q.put_nowait((os.path.join(wdir, f), ws))

    # start threads
    threads = [Thread(target=worker, args=(q,)) for _ in range(8)]
    for t in threads:
        t.daemon = True  # threads die if the program dies
        t.start()
    for _ in threads: q.put_nowait(None)  # signal no more files
    for t in threads: t.join()  # wait for completion

if __name__ == '__main__':
    mp.freeze_support()  # optional if the program is not frozen
    main()
No error traceback is available. The run() function performs its duty when called on a single model realization file, just as it does with multiple files. The only difference is that with multiple files, it is called len(files) times, though each of the instances immediately closes and only one model run is allowed to finish, at which time the script exits gracefully (exit code 0).
Adding some print statements to main() reveals some information about active thread counts as well as thread status (note that this is a test on only 8 of the realization files to make the screenshot more manageable; theoretically all 8 files should be run concurrently, however the behavior continues where they are spawned and immediately die except one):
def main():
    # populate files
    ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
    wdir = os.path.join(ws, r'fieldgen\test')
    q = Queue()
    for f in os.listdir(wdir):
        if f.endswith('.npy'):
            q.put_nowait((os.path.join(wdir, f), ws))

    # start threads
    threads = [Thread(target=worker, args=(q,)) for _ in range(mp.cpu_count())]
    for t in threads:
        t.daemon = True  # threads die if the program dies
        t.start()
    print('Active Count a', threading.activeCount())
    for _ in threads:
        print(_)
        q.put_nowait(None)  # signal no more files
    for t in threads:
        print(t)
        t.join()  # wait for completion
    print('Active Count b', threading.activeCount())
**The line which reads "D:\\Data\\Users..." is the error information thrown when I manually stop the model from running to completion. Once I stop the model running, the remaining thread status lines get reported and the script exits.
EDIT 3/26/2012 16:24 EST
SEAWAT does allow concurrent execution as I've done this in the past, spawning instances manually using iPython and launching from each model file folder. This time around, I'm launching all model runs from a single location, namely the directory where my script resides. It looks like the culprit may be in the way SEAWAT is saving some of the output. When SEAWAT is run, it immediately creates files pertaining to the model run. One of these files is not being saved to the directory in which the model realization is located, but in the top directory where the script is located. This is preventing any subsequent threads from saving the same file name in the same location (which they all want to do since these filenames are generic and non-specific to each realization). The SEAWAT windows were not staying open long enough for me to read or even see that there was an error message, I only realized this when I went back and tried to run the code using iPython which directly displays the printout from SEAWAT instead of opening a new window to run the program.
I am accepting @J.F. Sebastian's answer as it is likely that once I resolve this model-executable issue, the threading code he has provided will get me where I need to be.
FINAL CODE
Added cwd argument in subprocess.check_call to start each instance of SEAWAT in its own directory. Very key.
from __future__ import print_function
import os, subprocess, sys
import multiprocessing as mp
from Queue import Queue
from threading import Thread
import threading

def run(f, ws):
    real = f.split('_')[-1].split('.')[0]
    print('Realization %s' % real)
    seawatV4x64 = r'c:\modflow\swt_v4_00_04\exe\swt_v4x64.exe '
    cwd = ws + r'\reals\real%s\ss' % real
    swt_nam = ws + r'\reals\real%s\ss\ss.nam_swt' % real
    subprocess.check_call([seawatV4x64, swt_nam], cwd=cwd)

def worker(queue):
    """Process files from the queue."""
    for args in iter(queue.get, None):
        try:
            run(*args)
        except Exception as e:  # catch exceptions to avoid exiting the
                                # thread prematurely
            print('%r failed: %s' % (args, e,), file=sys.stderr)

def main():
    # populate files
    ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
    wdir = os.path.join(ws, r'fieldgen\reals')
    q = Queue()
    for f in os.listdir(wdir):
        if f.endswith('.npy'):
            q.put_nowait((os.path.join(wdir, f), ws))

    # start threads
    threads = [Thread(target=worker, args=(q,)) for _ in range(mp.cpu_count()-1)]
    for t in threads:
        t.daemon = True  # threads die if the program dies
        t.start()
    for _ in threads: q.put_nowait(None)  # signal no more files
    for t in threads: t.join()  # wait for completion

if __name__ == '__main__':
    mp.freeze_support()  # optional if the program is not frozen
    main()
I don't see any computations in the Python code. If you just need to execute several external programs in parallel, it is sufficient to use subprocess to run the programs and the threading module to maintain a constant number of processes running, but the simplest code uses multiprocessing.Pool:
#!/usr/bin/env python
import os
import multiprocessing as mp

def run(filename_def_param):
    filename, def_param = filename_def_param  # unpack arguments
    ...  # call external program on `filename`

def safe_run(*args, **kwargs):
    """Call run(), catch exceptions."""
    try: run(*args, **kwargs)
    except Exception as e:
        print("error: %s run(*%r, **%r)" % (e, args, kwargs))

def main():
    # populate files
    ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
    workdir = os.path.join(ws, r'fieldgen\reals')
    files = ((os.path.join(workdir, f), ws)
             for f in os.listdir(workdir) if f.endswith('.npy'))

    # start processes
    pool = mp.Pool()  # use all available CPUs
    pool.map(safe_run, files)

if __name__ == "__main__":
    mp.freeze_support()  # optional if the program is not frozen
    main()
If there are many files then pool.map() could be replaced by for _ in pool.imap_unordered(safe_run, files): pass.
There is also multiprocessing.dummy.Pool, which provides the same interface as multiprocessing.Pool but uses threads instead of processes; that might be more appropriate in this case.
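A minimal sketch of that thread-based pool (same interface, threads instead of processes), reusing safe_run and files from the code above:

from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(8)         # 8 worker threads; enough since the real work is the external .exe
pool.map(safe_run, files)    # same call as with mp.Pool
pool.close()
pool.join()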
You don't need to keep some CPUs free. Just use a command that starts your executables with a low priority (on Linux it is the nice program).
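On Linux, that low-priority suggestion can be as simple as prefixing the command with nice when launching the external program (sketch with placeholder paths):

import subprocess

exe = '/path/to/swt_v4x64.exe'     # placeholder paths for the executable and name file
swt_nam = '/path/to/ss.nam_swt'
# nice -n 19 gives the run the lowest scheduling priority, keeping the machine responsive
subprocess.check_call(['nice', '-n', '19', exe, swt_nam])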
ThreadPoolExecutor example
concurrent.futures.ThreadPoolExecutor would be both simple and sufficient, but it requires a 3rd-party dependency on Python 2.x (it has been in the stdlib since Python 3.2).
#!/usr/bin/env python
import os
import concurrent.futures

def run(filename, def_param):
    ...  # call external program on `filename`

# populate files
ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
wdir = os.path.join(ws, r'fieldgen\reals')
files = (os.path.join(wdir, f) for f in os.listdir(wdir) if f.endswith('.npy'))

# start threads
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    future_to_file = dict((executor.submit(run, f, ws), f) for f in files)

    for future in concurrent.futures.as_completed(future_to_file):
        f = future_to_file[future]
        if future.exception() is not None:
            print('%r generated an exception: %s' % (f, future.exception()))
        # run() doesn't return anything so `future.result()` is always `None`
Or if we ignore exceptions raised by run():
from itertools import repeat

...  # the same

# start threads
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    executor.map(run, files, repeat(ws))
    # run() doesn't return anything so `map()` results can be ignored
subprocess + threading (manual pool) solution
#!/usr/bin/env python
from __future__ import print_function
import os
import subprocess
import sys
from Queue import Queue
from threading import Thread

def run(filename, def_param):
    ...  # define exe, swt_nam
    subprocess.check_call([exe, swt_nam])  # run external program

def worker(queue):
    """Process files from the queue."""
    for args in iter(queue.get, None):
        try:
            run(*args)
        except Exception as e:  # catch exceptions to avoid exiting the
                                # thread prematurely
            print('%r failed: %s' % (args, e,), file=sys.stderr)

# start threads
q = Queue()
threads = [Thread(target=worker, args=(q,)) for _ in range(8)]
for t in threads:
    t.daemon = True  # threads die if the program dies
    t.start()

# populate files
ws = r'D:\Data\Users\jbellino\Project\stJohnsDeepening\model\xsec_a'
wdir = os.path.join(ws, r'fieldgen\reals')
for f in os.listdir(wdir):
    if f.endswith('.npy'):
        q.put_nowait((os.path.join(wdir, f), ws))

for _ in threads: q.put_nowait(None)  # signal no more files
for t in threads: t.join()  # wait for completion
Here is my way to maintain a minimum number of threads in memory. It is a combination of the threading and multiprocessing modules. It may be unusual compared to the techniques other members have explained above, but it may be worth considering. For the sake of explanation, I am taking a scenario of crawling a minimum of 5 websites at a time.
So here it is:
# importing dependencies.
from multiprocessing import Process
from threading import Thread
import threading

# Crawler function
def crawler(domain):
    # define crawler technique here.
    output.write(scrapeddata + "\n")
    pass
Next is the threadController function. This function controls the flow of threads into main memory. It keeps activating threads to maintain the threadNum "minimum" limit, i.e. 5. It also won't exit until all active threads (activeCount) have finished.
It maintains a minimum of threadNum (5) startProcess threads (these threads eventually start the Processes from the processList while joining them with a timeout of 60 seconds). After starting threadController, there are 2 threads which are not included in the above limit of 5, i.e. the main thread and the threadController thread itself; that's why threading.activeCount() != 2 is used.
def threadController():
    print "Thread count before child thread starts is:-", threading.activeCount(), len(processList)
    # starting the first thread. This will make the activeCount=3
    Thread(target=startProcess).start()
    # loop while the process list is not empty OR active threads have not finished up.
    while len(processList) != 0 or threading.activeCount() != 2:
        if (threading.activeCount() < (threadNum + 2) and  # if the count of active threads is less than the minimum AND
                len(processList) != 0):                    # processList is not empty
            Thread(target=startProcess).start()            # this starts the startProcess function as a separate thread **
The startProcess function, as a separate thread, starts Processes from the processList. The purpose of this function (**started as a different thread) is that it becomes a parent thread for the Processes. So when it joins them with a timeout of 60 seconds, this stops the startProcess thread from moving ahead, but it doesn't stop threadController from performing. This way, threadController works as required.
def startProcess():
    pr = processList.pop(0)
    pr.start()
    pr.join(60.00)  # joining the process with a timeout of 60 seconds as a float.

if __name__ == '__main__':
    # a file holding a list of domains
    domains = open("Domains.txt", "r").read().split("\n")
    output = open("test.txt", "a")
    processList = []  # process list
    threadNum = 5     # number of thread-initiated processes to be run at one time

    # making the process list
    for r in range(0, len(domains), 1):
        domain = domains[r].strip()
        p = Process(target=crawler, args=(domain,))
        processList.append(p)  # making a list of performer processes.

    # starting the threadController as a separate thread.
    mt = Thread(target=threadController)
    mt.start()
    mt.join()  # won't let go next until the threadController thread finishes.

    output.close()
    print "Done"
Besides maintaining a minimum number of threads in memory, my aim was also to have something that could avoid stuck threads or processes in memory. I did this using the timeout.
My apologies for any typing mistakes.
I hope this construction helps anyone in this world.
Regards,
Vikas Gautam
