I am working on a multithreaded Python script that takes a list of file names and puts them on a queue. Most of the time it works however I'll occasionally find it stuck and 'ps -efL' will show two threads open for python script. I followed it with strace and 5 out of 6 threads returned but one just hangs out in futex wait forever.
Here's the block of code in question.
threads = 6
for fileName in fileNames:
queue.put(fileName)
for i in range(threads):
t = threading.Thread(target=get_backup_list, args=(queue,dbCreds,arguments.verbose,arguments.vault))
activeThreads.append(t)
t.start()
for activeThread in activeThreads:
activeThread.join()
def get_backup_list(queue,dbCreds,verbosity,vault):
backupFiles = []
while True:
if queue.empty() == True:
return
fileName = queue.get()
try:
fileInfo = lookup_file_by_path(fileName,dbCreds,vault)
if not fileInfo:
start = time.time()
attributes = get_attributes(fileName,verbosity)
end = time.time() - start
if verbosity: print("finished in ") + str(end) + (" seconds")
insert_file(attributes,dbCreds,vault)
fileInfo = lookup_file_by_path(fileName,dbCreds,vault)
except Exception, e:
print("error on " + fileName + " " + str(e))
return
def lookup_file_by_path(path,dbCreds,vault):
attributes = {}
conn = mdb.connect(dbCreds['server'] , dbCreds['user'], dbCreds['password'], dbCreds['database'], cursorclass=MySQLdb.cursors.DictCursor);
c = conn.cursor()
c.execute('''SELECT * FROM {} where path = "%s" '''.format(vault) % ( path ) )
data = c.fetchone()
if data:
for key in data.keys():
attributes[key] = data[key]
conn.close
return attributes
Am I doing something fundamentally wrong here that's causing a race condition? Or is there something else I'm missing.
Thanks,
Thomas C
There is a race condition in your code:
while True:
if queue.empty() == True:
return
fileName = queue.get()
First the threads checks if the queue is empty. If it's not, it tries a blocking get. However, in the time between the call to queue.empty() and queue.get, another thread could have consumed the final item from the queue, meaning the get call will block forever. You should do this instead:
try:
fileName = queue.get_nowait()
except Queue.Empty:
return
If that doesn't solve it, you can just throw some print statements into the threaded method to identify exactly where it gets stuck, and go from there. However, there's no other concurrency issues jumping out at me.
Edit:
As an aside, what you're doing here could be more cleanly implemented as a ThreadPool or multiprocessing.Pool:
from multiprocessing.pool import ThreadPool
from functools import partial
def get_backup_list(dbCreds, verbosity, vault, fileName):
backupFiles = []
fileInfo = lookup_file_by_path(fileName,dbCreds,vault)
...
if __name__ == "__main__":
pool = ThreadPool(6) # You could use a multiprocessing.Pool, too
func = partial(get_backup_list, dbCreds, arguments.verbose, arguments.vault)
pool.map(func, fileNames)
pool.close()
pool.join()
Depending on how much work each call to get_backup_list is doing, you may find it performs better as a multiprocessing.Pool, because it is able to get around the Global Interpreter Lock (GIL), which prevents Python threads from executing across CPU cores concurrently. It looks like your code is probably I/O-bound, though, so ThreadPool might do just fine.
Related
TL;DR I want to collect the accumulated data in the globals of each worker when the pool is finished processing
Description of what I think I'm missing
As I'm new to multiprocessing, I don't know of all the features that exist. I am looking for a way to make a worker return the value it was initialized with (after manipulating that value a bunch of millions of times). Then, I hope I can collect and merge all these values at the end of the program when all the 'jobs' are done.
import multiprocessing as mp
from collections import defaultdict, Counter
from customtools import load_regexes #, . . .
import gzip
import nltk
result_dict = None
regexes = None
def create_worker():
global result_dict
global regexes
result_dict = defaultdict(Counter) # I want to return this at the end
# these are a bunch of huge regexes
regexes = load_regexes()
These functions represents the way I load and process data. The data is a big gzipfile with articles.
def load_data(semaphore):
with gzip.open('some10Gbfile') as f:
for line in file:
semaphore.acquire()
yield str(line, 'utf-8')
def worker_job(line):
global regexes
global result_dict
hits = defaultdict(Counter)
for sent in nltk.sent_tokenize(line[3:]):
for rename, regex in regex.items():
for hit in regex.finditer(sent):
hits[rename][hit.group(0)]+=1
# and more and more... results = _filter(_extract(hits))
# store some data in results_dict here . . .
return filtered_hits
Class ResultEater():
def __init__(self):
self.wordscounts=defaultdict(Counter)
self.filtered=Counter()
def eat_results(self, filte red_hits):
for k, v in filte.items():
for i, c in v.items():
self.wordscount[k][i]+=c
This is the main program
if __name__ == '__main__':
pool = mp.Pool(mp.cpu_count(), initializer=create_worker)
semaphore = mp.Semaphore(50)
loader = load_data(semaphore)
results = ResultEater()
for intermediate_result in pool.imap_unordered(worker_job, loader, chunksize=10):
results.eat_results(intermediate_result)
semaphore.release()
# results.eat_workers(the_leftover_workers_or_something)
results.print()
I don't really think I understand how exactly returning the data incrementally isn't sufficient, but it kinda seems like you need some sort of finalization function to send the data similar to how you have an initialization function. Unfortunately, I don't think this sort of thing exists for mp.Pool, so it'll require you to use a couple mp.Process's, and send input args, and return results with a couple mp.Queue's
On a side note your use of Semaphore is unncessary, as the call to the "load_data" iterator always happens on the main process. I have moved that to another "producer" process, which puts inputs to a queue, which is also already synchronized automatically by default. This allows you to have one process for gathering inputs, several processes for processing the inputs to outputs, and leaves the main (parent) process to gather outputs. If the "producer" generating the inputs is IO limited by file read speed (very likely), it could also be in a thread rather than a process, but in this case the difference is probably minimal.
I have created an example of a custom "Pool" which allows you to return some data at the end of each worker's "life" using aforementioned "producer-consumer" scheme. there are print statements to track what is going on in each process, but please also read the comments to track what's going on and why:
import multiprocessing as mp
from time import sleep
from queue import Empty
class ExitFlag:
def __init__(self, exit_value=None):
self.exit_value = exit_value #optionally pass value along with exit flag
def producer_func(input_q, n_workers):
for i in range(100): #100 lines of some long file
print(f"put {i}")
input_q.put(i) #put each line of the file to the work queue
print('stopping consumers')
for i in range(n_workers):
input_q.put(ExitFlag()) #send shut down signal to each of the workers
print('producer exiting')
def consumer_func(input_q, output_q, work_func):
counter = 0
while True:
try:
item = input_q.get(.1) #never wait forever on a "get". It's a recipe for deadlock.
except Empty:
continue
print(f"get {item}")
if isinstance(item, ExitFlag):
break
else:
counter += 1
output_q.put(work_func(item))
output_q.put(ExitFlag(exit_value=counter))
print('consumer exiting')
def work_func(number):
sleep(.1) #some heavy nltk work...
return number*2
if __name__ == '__main__':
input_q = mp.Queue(maxsize=10) #only bother limiting size if you have memory usage constraints
output_q = mp.Queue(maxsize=10)
n_workers = mp.cpu_count()
producer = mp.Process(target=producer_func, args=(input_q, n_workers)) #generate the input from another process. (this could just as easily be a thread as it seems it will be IO limited anyway)
producer.start()
consumers = [mp.Process(target=consumer_func, args=(input_q, output_q, work_func)) for _ in range(n_workers)]
for c in consumers: c.start()
total = 0
stop_signals = 0
exit_values = []
while True:
try:
item = output_q.get(.1)
except Empty:
continue
if isinstance(item, ExitFlag):
stop_signals += 1
if item.exit_value is not None:
exit_values.append(item.exit_value) #do something with the return at the end
if stop_signals >= n_workers: #stop waiting for more results once all consumers finish
break
else:
total += item #do something with the incremental return values
print(total)
print(exit_values)
#cleanup
producer.join()
print("producer joined")
for c in consumers: c.join()
print("consumers joined")
I want to build a tool that scan a website for sub domains, I know how to do his, but my function is slower, I looked up in the gobuster usage, and I saw that the gobuster can use many concurrent threads, how can I implement this too ?
I have asked Google many times, but I can't see anything about this, can someone give me an example ?
gobuster usage: -t Number of concurrent threads (default 10)
My current program:
def subdomaines(url, wordlist):
checks(url, wordlist) # just checking for valid args
num_lines = get_line_count(wordlist) # number of lines in a file
count = 0
for line in open(wordlist).readlines():
resp = requests.get(url + line) # resp
if resp.status_code in (301, 200):
print(f'Valid - {line}')
print(f'{count} / {num_lines}')
count += 1
Note* : gobuster is a very fast tool for searching subdomains in websites
If you're trying to use threading in python you should start from the basics and learn what's available. But here's a simple example taken from https://pymotw.com/2/threading/
import threading
def worker():
"""thread worker function"""
print 'Worker'
return
threads = []
for i in range(5):
t = threading.Thread(target=worker)
threads.append(t)
t.start()
To apply this to your task, a simple approach would be to spawn a thread for each request. Something like the code below. Note: if your wordlist is long this might be very expensive. Look into some of the thread pool libraries in python for better thread management that you won't need to explicitly control yourself.
import threading
def subdomains(url, wordlist):
checks(url, wordlist) # just checking for valid args
num_lines = get_line_count(wordlist) # number of lines in a file
count = 0
threads = []
for line in open(wordlist).readlines():
t = threading.Thread(target=checkUrl,args=(url,line))
threads.append(t)
t.start()
for thread in threads: #wait for all threads to complete
thread.join()
def checkUrl(url,line):
resp = requests.get(url + line)
if resp.status_code in (301, 200):
print(f'Valid - {line}')
To implement the counter you'll need to control shared access between threads to prevent race conditions (two threads accessing the variable at the same time resulting in... problems). A counter object with protected access is provided in the link above:
class Counter(object):
def __init__(self, start=0):
self.lock = threading.Lock()
self.value = start
def increment(self):
#Waiting for lock
self.lock.acquire()
try:
#Acquired lock
self.value = self.value + 1
finally:
#Release lock, so other threads can count
self.lock.release()
#usage:
#in subdomains()...
counter = Counter()
for ...
t = threading.Thread(target=checkUrl,args=(url,line,counter))
#in checkUrl()...
c.increment()
Final note: I have not compiled or tested any of this code.
Python have threading module.
The simplest way to use a Thread is to instantiate it with a target function and call start() to let it begin working.
import threading
def subdomains(url, wordlist):
checks(url, wordlist) # just checking for valid args
num_lines = get_line_count(wordlist) # number of lines in a file
count = 0
for line in open(wordlist).readlines():
resp = requests.get(url + line) # resp
if resp.status_code in (301, 200):
print(f'Valid - {line}')
print(f'{count} / {num_lines}')
count += 1
threads = []
for i in range(10):
t = threading.Thread(target=subdomains)
threads.append(t)
t.start()
I have a bunch of long running processes that I would like to split up into multiple processes. That part I can do no problem. The issue I run into is sometimes these processes go into a hung state. To address this issue I would like to be able to set a time threshold for each task that a process is working on. When that time threshold is exceeded I would like to restart or terminate the task.
Originally my code was very simple using a process pool, however with the pool I could not figure out how to retrieve the processes inside the pool, nevermind how to restart / terminate a process in the pool.
I have resorted to using a queue and process objects as is illustrated in this example (https://pymotw.com/2/multiprocessing/communication.html#passing-messages-to-processes with some changes.
My attempts to figure this out are in the code below. In its current state the process does not actually get terminated. Further to that I cannot figure out how to get the process to move onto the next task after the current task is terminated. Any suggestions / help appreciated, perhaps I’m going about this the wrong way.
Thanks
import multiprocess
import time
class Consumer(multiprocess.Process):
def __init__(self, task_queue, result_queue, startTimes, name=None):
multiprocess.Process.__init__(self)
if name:
self.name = name
print 'created process: {0}'.format(self.name)
self.task_queue = task_queue
self.result_queue = result_queue
self.startTimes = startTimes
def stopProcess(self):
elapseTime = time.time() - self.startTimes[self.name]
print 'killing process {0} {1}'.format(self.name, elapseTime)
self.task_queue.cancel_join_thread()
self.terminate()
# now want to get the process to start procesing another job
def run(self):
'''
The process subclass calls this on a separate process.
'''
proc_name = self.name
print proc_name
while True:
# pulling the next task off the queue and starting it
# on the current process.
task = self.task_queue.get()
self.task_queue.cancel_join_thread()
if task is None:
# Poison pill means shutdown
#print '%s: Exiting' % proc_name
self.task_queue.task_done()
break
self.startTimes[proc_name] = time.time()
answer = task()
self.task_queue.task_done()
self.result_queue.put(answer)
return
class Task(object):
def __init__(self, a, b, startTimes):
self.a = a
self.b = b
self.startTimes = startTimes
self.taskName = 'taskName_{0}_{1}'.format(self.a, self.b)
def __call__(self):
import time
import os
print 'new job in process pid:', os.getpid(), self.taskName
if self.a == 2:
time.sleep(20000) # simulate a hung process
else:
time.sleep(3) # pretend to take some time to do the work
return '%s * %s = %s' % (self.a, self.b, self.a * self.b)
def __str__(self):
return '%s * %s' % (self.a, self.b)
if __name__ == '__main__':
# Establish communication queues
# tasks = this is the work queue and results is for results or completed work
tasks = multiprocess.JoinableQueue()
results = multiprocess.Queue()
#parentPipe, childPipe = multiprocess.Pipe(duplex=True)
mgr = multiprocess.Manager()
startTimes = mgr.dict()
# Start consumers
numberOfProcesses = 4
processObjs = []
for processNumber in range(numberOfProcesses):
processObj = Consumer(tasks, results, startTimes)
processObjs.append(processObj)
for process in processObjs:
process.start()
# Enqueue jobs
num_jobs = 30
for i in range(num_jobs):
tasks.put(Task(i, i + 1, startTimes))
# Add a poison pill for each process object
for i in range(numberOfProcesses):
tasks.put(None)
# process monitor loop,
killProcesses = {}
executing = True
while executing:
allDead = True
for process in processObjs:
name = process.name
#status = consumer.status.getStatusString()
status = process.is_alive()
pid = process.ident
elapsedTime = 0
if name in startTimes:
elapsedTime = time.time() - startTimes[name]
if elapsedTime > 10:
process.stopProcess()
print "{0} - {1} - {2} - {3}".format(name, status, pid, elapsedTime)
if allDead and status:
allDead = False
if allDead:
executing = False
time.sleep(3)
# Wait for all of the tasks to finish
#tasks.join()
# Start printing results
while num_jobs:
result = results.get()
print 'Result:', result
num_jobs -= 1
I generally recommend against subclassing multiprocessing.Process as it leads to code hard to read.
I'd rather encapsulate your logic in a function and run it in a separate process. This keeps the code much cleaner and intuitive.
Nevertheless, rather than reinventing the wheel, I'd recommend you to use some library which already solves the issue for you such as Pebble or billiard.
For example, the Pebble library allows to easily set timeouts to processes running independently or within a Pool.
Running your function within a separate process with a timeout:
from pebble import concurrent
from concurrent.futures import TimeoutError
#concurrent.process(timeout=10)
def function(foo, bar=0):
return foo + bar
future = function(1, bar=2)
try:
result = future.result() # blocks until results are ready
except TimeoutError as error:
print("Function took longer than %d seconds" % error.args[1])
Same example but with a process Pool.
with ProcessPool(max_workers=5, max_tasks=10) as pool:
future = pool.schedule(function, args=[1], timeout=10)
try:
result = future.result() # blocks until results are ready
except TimeoutError as error:
print("Function took longer than %d seconds" % error.args[1])
In both cases, the timing out process will be automatically terminated for you.
A way simpler solution would be to continue using a than reimplementing the Pool is to design a mechanism which timeout the function you are running.
For instance:
from time import sleep
import signal
class TimeoutError(Exception):
pass
def handler(signum, frame):
raise TimeoutError()
def run_with_timeout(func, *args, timeout=10, **kwargs):
signal.signal(signal.SIGALRM, handler)
signal.alarm(timeout)
try:
res = func(*args, **kwargs)
except TimeoutError as exc:
print("Timeout")
res = exc
finally:
signal.alarm(0)
return res
def test():
sleep(4)
print("ok")
if __name__ == "__main__":
import multiprocessing as mp
p = mp.Pool()
print(p.apply_async(run_with_timeout, args=(test,),
kwds={"timeout":1}).get())
The signal.alarm set a timeout and when this timeout, it run the handler, which stop the execution of your function.
EDIT: If you are using a windows system, it seems to be a bit more complicated as signal does not implement SIGALRM. Another solution is to use the C-level python API. This code have been adapted from this SO answer with a bit of adaptation to work on 64bit system. I have only tested it on linux but it should work the same on windows.
import threading
import ctypes
from time import sleep
class TimeoutError(Exception):
pass
def run_with_timeout(func, *args, timeout=10, **kwargs):
interupt_tid = int(threading.get_ident())
def interupt_thread():
# Call the low level C python api using ctypes. tid must be converted
# to c_long to be valid.
res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
ctypes.c_long(interupt_tid), ctypes.py_object(TimeoutError))
if res == 0:
print(threading.enumerate())
print(interupt_tid)
raise ValueError("invalid thread id")
elif res != 1:
# "if it returns a number greater than one, you're in trouble,
# and you should call it again with exc=NULL to revert the effect"
ctypes.pythonapi.PyThreadState_SetAsyncExc(
ctypes.c_long(interupt_tid), 0)
raise SystemError("PyThreadState_SetAsyncExc failed")
timer = threading.Timer(timeout, interupt_thread)
try:
timer.start()
res = func(*args, **kwargs)
except TimeoutError as exc:
print("Timeout")
res = exc
else:
timer.cancel()
return res
def test():
sleep(4)
print("ok")
if __name__ == "__main__":
import multiprocessing as mp
p = mp.Pool()
print(p.apply_async(run_with_timeout, args=(test,),
kwds={"timeout": 1}).get())
print(p.apply_async(run_with_timeout, args=(test,),
kwds={"timeout": 5}).get())
For long running processes and/or long iterators, spawned workers might hang after some time. To prevent this, there are two built-in techniques:
Restart workers after they have delivered maxtasksperchild tasks from the queue.
Pass timeout to pool.imap.next(), catch the TimeoutError, and finish the rest of the work in another pool.
The following wrapper implements both, as a generator. This also works when replacing stdlib multiprocessing with multiprocess.
import multiprocessing as mp
def imap(
func,
iterable,
*,
processes=None,
maxtasksperchild=42,
timeout=42,
initializer=None,
initargs=(),
context=mp.get_context("spawn")
):
"""Multiprocessing imap, restarting workers after maxtasksperchild tasks to avoid zombies.
Example:
>>> list(imap(str, range(5)))
['0', '1', '2', '3', '4']
Raises:
mp.TimeoutError: if the next result cannot be returned within timeout seconds.
Yields:
Ordered results as they come in.
"""
with context.Pool(
processes=processes,
maxtasksperchild=maxtasksperchild,
initializer=initializer,
initargs=initargs,
) as pool:
it = pool.imap(func, iterable)
while True:
try:
yield it.next(timeout)
except StopIteration:
return
To catch the TimeoutError:
>>> import time
>>> iterable = list(range(10))
>>> results = []
>>> try:
... for i, result in enumerate(imap(time.sleep, iterable, processes=2, timeout=2)):
... results.append(result)
... except mp.TimeoutError:
... print("Failed to process the following subset of iterable:", iterable[i:])
Failed to process the following subset of iterable: [2, 3, 4, 5, 6, 7, 8, 9]
Is there a way that I can have a single variable across active threads like below
count = 0
threadA(count)
threadB(count)
threadA(count):
#do stuff
count += 1
threadB(count):
#do stuff
print count
so that count will print out 1? I changed the variable in thread A and it reflected across to the other thread?
Your variable count is already available to all your threads. But you need to synchronize access to it, or you will lose updates. Look into using a lock to protect access to the count.
If you want to use processes instead of threads, use multiprocessing. It has more features, including having a Manager objects which handles shared objects for you. As a perk, you can share objects across machines!
source
import multiprocessing, signal, time
def producer(objlist):
'''
add an item to list every sec
'''
while True:
try:
time.sleep(1)
except KeyboardInterrupt:
return
msg = 'ding: {:04d}'.format(int(time.time()) % 10000)
objlist.append( msg )
print msg
def scanner(objlist):
'''
every now and then, consume objlist & run calculation
'''
while True:
try:
time.sleep(3)
except KeyboardInterrupt:
return
print 'items: {}'.format( list(objlist) )
objlist[:] = []
def main():
# create obj sharable between all processes
manager = multiprocessing.Manager()
my_objlist = manager.list() # pylint: disable=E1101
multiprocessing.Process(
target=producer, args=(my_objlist,),
).start()
multiprocessing.Process(
target=scanner, args=(my_objlist,),
).start()
# kill everything after a few seconds
signal.signal(
signal.SIGALRM,
lambda _sig,_frame: manager.shutdown(),
)
signal.alarm(12)
try:
manager.join() # wait until both workers die
except KeyboardInterrupt:
pass
if __name__=='__main__':
main()
I notice this behavior in python for the pool allocation. Even though I have 20 processes in the pool, when I do a map_async for say 8 processes, instead of throwing all the processes to execute, I get only 4 executing. when those 4 finish, it sends two more, and then when those two finish is sends one.
When I throw more than 20 at it, it runs all 20, until it starts to get less than 20 in the queue, when the above behavior repeats.
I assume this is done on purpose, but it looks weird. My goal is to have the requests processed as soon as they come in and obviously this behavior does not fit.
Using python 2.6 with billiard for maxtasksperchild support
Any ideas how can I improve it?
Code:
mypool = pool.Pool(processes=settings['num-processes'], initializer=StartChild, maxtasksperchild=10)
while True:
lines = DbData.GetAll()
if len(lines) > 0:
print 'Starting to process: ', len(lines), ' urls'
Res = mypool.map_async(RunChild, lines)
Returns = Res.get(None)
print 'Pool returns: ', idx, Returns
else:
time.sleep(0.5)
One way I deal with multiprocessing in Python is the following:
I have data on which I want to use a function function().
First I create a multiprocessing subclass:
import multiprocessing
class ProcessThread(multiprocessing.Process):
def __init__(self, id_t, inputqueue, idqueue, function, resultqueue):
self.id_t = id_t
self.inputlist = inputqueue
self.idqueue = idqueue
self.function = function
self.resultqueue = resultqueue
multiprocessing.Process.__init__(self)
def run(self):
s = "process number: " + str(self.id_t) + " starting"
print s
result = []
while self.inputqueue.qsize() > 0
try:
inp = self.inputqueue.get()
except Exception:
pass
result = self.function(inp)
while 1:
try:
self.resultqueue.put([self.id,])
except Exception:
pass
else:
break
self.idqueue.put(id)
return
and the main function:
inputqueue = multiprocessing.Queue()
resultqueue = multiprocessing.Queue()
idqueue = multiprocessing.Queue()
def function(data):
print data # or what you want
for datum in data:
inputqueue.put(datum)
for i in xrange(nbprocess):
ProcessThread(i, inputqueue, idqueue, function, resultqueue).start()
and finally get results:
results = []
while idqueue.qsize() < nbprocess:
pass
while resultqueue.qsize() > 0:
results.append(resultqueue.get())
In this way you can control perfectly what is appended with process and other stuff.
Using a multiprocessing inputqueue is an efficient technique only if the computation for each datum is quite slow (< 1,2 seconds) because of the concurrent access of the different process to the queues (that why I use exception). If your function computes very quickly, consider splitting up your data only once at the begining and put chunks of the dataset for every process at the beginning.