Writing an object-oriented multithreaded job/result queue in Python

long time lurker here.
I have a thread controller object. This object takes in other objects called "Checks". These Checks pull in DB rows that match their criteria. The thread manager polls each check (asking it for its DB rows, aka work units) and then enqueues each row along with a reference to that check object. The idea is that N threads will each pull an item off the queue and execute the corresponding Check's do_work method. The do_work method will return Pass/Fail, and all passes will be enqueued for further processing.
The main script (not shown) instantiates the checks and adds them to the thread manager using add_check and then calls kick_off_work.
So far I am testing and it simply locks up:
    import Queue
    from threading import Thread
    class ThreadMan:
        def __init__(self, reporter):
            print "Initializing thread manager..."
            self.workQueue = Queue.Queue()
            self.resultQueue = Queue.Queue()
            self.checks = []
        def add_check(self, check):
            self.checks.append(check)
        def kick_off_work(self):
            for check in self.checks:
                for work_unit in check.populate_work():
                    #work unit is a DB row
                    self.workQueue.put({"object" : check, "work" : work_unit})
            threads = Thread(target=self.execute_work_unit)
            threads = Thread(target=self.execute_work_unit)
            threads.start()
            self.workQueue.join()
        def execute_work_unit(self):
            unit = self.workQueue.get()
            check_object = unit['object'] #Check object
            work_row = unit['work'] # DB ROW
            check_object.do_work(work_row)
            self.workQueue.task_done()
            print "Done with work!!"
The output is simply:
Initializing thread manager...
In check1's do_work method... Doing work
Done with work!!
(locked up)
I would like to run through the entire queue

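You only need to add a "while" loop to execute_work_unit; otherwise the thread stops after its first item:
    def execute_work_unit(self):
        while True:
            unit = self.workQueue.get()
            check_object = unit['object'] #Check object
            work_row = unit['work'] # DB ROW
            check_object.do_work(work_row)
            self.workQueue.task_done()
            print "Done with work!!"
Have a look at the Queue documentation:
http://docs.python.org/2/library/queue.html#module-Queue
EDIT: to let it finish, call threads.join() after self.workQueue.join() in kick_off_work. Note that join() on a thread only returns once execute_work_unit itself returns, so a "while True" worker also needs a way to leave its loop (for example a sentinel item on the queue); otherwise mark the thread as a daemon and do not join it.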
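A minimal sketch of one way to let the workers exit cleanly, assuming Python 3 (queue instead of Queue) and a sentinel value placed on the queue once per worker. The names worker, run_all, NUM_WORKERS, STOP and is_even are illustrative and not part of the original code; is_even stands in for a Check's do_work method.

```python
import queue
import threading

NUM_WORKERS = 4   # assumed worker count
STOP = None       # sentinel meaning "no more work"

def worker(work_queue, result_queue):
    while True:
        unit = work_queue.get()
        try:
            if unit is STOP:
                return                  # leave the loop so Thread.join() can return
            check, row = unit
            if check(row):              # stand-in for check_object.do_work(row)
                result_queue.put(row)   # passes go on for further processing
        finally:
            work_queue.task_done()      # exactly one task_done() per get()

def run_all(checks_and_rows):
    work_queue = queue.Queue()
    result_queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(work_queue, result_queue))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for item in checks_and_rows:
        work_queue.put(item)
    for _ in threads:
        work_queue.put(STOP)            # one sentinel per worker
    work_queue.join()                   # every item and sentinel was handled
    for t in threads:
        t.join()                        # workers have returned
    passed = []
    while not result_queue.empty():     # safe here: no producers are left
        passed.append(result_queue.get())
    return passed

def is_even(row):
    # Hypothetical check: passes for even rows
    return row % 2 == 0

passed_rows = run_all([(is_even, n) for n in range(10)])
```

Each worker consumes exactly one sentinel, so the number of sentinels has to match the number of worker threads.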

Related

Python thread with queue duplicates items inside queue

I am trying to implement a worker thread to go through a queue and add the items inside to a sql db.
But I am running into a weird issue: even though I am definitely putting different statements into the queue, they all become copies of each other inside the queue if I put them in within 2 seconds of each other.
This is the worker thread with the queue:
    import queue
    import threading
    class DBWriterThread(threading.Thread):
        def __init__(self):
            super().__init__()
            self.q = queue.Queue()
            self.put = self.q.put
            self.start()
        def run(self):
            db_conn = None
            while True:
                # Block for the first statement, then drain whatever else is queued
                statements = [self.q.get()]
                try:
                    while self.q.empty() is False:
                        statements.append(self.q.get(block=False))
                except queue.Empty:
                    pass
                try:
                    if statements[0] is None:
                        return
                    if not db_conn:
                        db_conn = connect_to_db()
                    try:
                        cursor = db_conn.cursor()
                        for statement in statements:
                            print(statement)
                            if statement is None:
                                return
                            work_order = statement[0][0]
                            data = statement[0][1]
                            if work_order == 'insertTick':
                                print(f"GOT ORDER TO INSERT DATA OF SYMBOL {data['symbol']}")
                                insertDataRow(data, cursor)
                            elif work_order == 'insertTrade':
                                insertTradeSignal(data, cursor)
                            else:
                                print("Unknown work order")
                                print(work_order)
                    finally:
                        db_conn.commit()
                finally:
                    for _ in statements:
                        self.q.task_done()
I instantiate this thread class in the __init__ method of my main program (which is also a class):
    self.db_writer = DBWriterThread()
and then throughout the program I call
    self.db_writer.put((['insertTick', self.peopleTick],))
to insert data into the queue.
I believe what is happening is that I receive multiple peopleTicks within a very short span (practically simultaneously), and then call self.db_writer.put((['insertTick', self.peopleTick],)) for the two different peopleTicks directly after each other.
Even though I put two different peopleTicks into the queue, when the worker thread reads the queue, the items inside are duplicates of the same tick.
I can avoid this with a time.sleep(2) (not less) between the calls to self.db_writer.put, but that would not work for my program and defeats the whole point of a queue. How do I solve this?
I have tried to figure out whether locks could help, but I don't know how to implement them or whether they would be the solution.
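One plausible cause, offered as an assumption since the code that updates self.peopleTick is not shown: queue.Queue stores a reference to the object passed to put(), not a copy. If the same dict is updated in place between two put() calls, both queue entries point at that one dict, and the worker sees the latest values twice. The sketch below reproduces that with a hypothetical tick dict and shows that putting a copy avoids it.

```python
import copy
import queue

q = queue.Queue()

# Hypothetical stand-in for self.peopleTick: one dict that is updated in place.
tick = {"symbol": "AAA", "price": 1.0}
q.put((["insertTick", tick],))
tick.update(symbol="BBB", price=2.0)   # also changes the entry already queued
q.put((["insertTick", tick],))
shared = [q.get()[0][1]["symbol"] for _ in range(2)]

# Same sequence, but every put() receives its own snapshot of the dict.
tick = {"symbol": "AAA", "price": 1.0}
q.put((["insertTick", copy.deepcopy(tick)],))
tick.update(symbol="BBB", price=2.0)
q.put((["insertTick", copy.deepcopy(tick)],))
copied = [q.get()[0][1]["symbol"] for _ in range(2)]

print(shared)   # ['BBB', 'BBB']
print(copied)   # ['AAA', 'BBB']
```

If the ticks are flat dicts, tick.copy() would be enough; deepcopy is used here only because the structure of peopleTick is unknown. This would also explain why a long sleep hides the problem: the worker has already consumed the first entry before the dict changes.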

How to add a pool of processes available for a multiprocessing queue

I am following a preceding question here: how to add more items to a multiprocessing queue while script in motion
the code I am working with now:
    import multiprocessing
    class MyFancyClass:
        def __init__(self, name):
            self.name = name
        def do_something(self):
            proc_name = multiprocessing.current_process().name
            print('Doing something fancy in {} for {}!'.format(proc_name, self.name))
    def worker(q):
        while True:
            obj = q.get()
            if obj is None:
                break
            obj.do_something()
    if __name__ == '__main__':
        queue = multiprocessing.Queue()
        p = multiprocessing.Process(target=worker, args=(queue,))
        p.start()
        queue.put(MyFancyClass('Fancy Dan'))
        queue.put(MyFancyClass('Frankie'))
        # print(queue.qsize())
        queue.put(None)
        # Wait for the worker to finish
        queue.close()
        queue.join_thread()
        p.join()
Right now, there are two items in the queue. If I replace those two lines with a list of, say, 50 items, how do I set up a Pool so that several processes are available? For example:
    p = multiprocessing.Pool(processes=4)
Where does that go? I'd like to be able to run multiple items at once, especially if the items run for a while.
Thanks!
As a rule, you either use Pool or Process(es) plus Queues. Mixing both is a misuse; the Pool already uses Queues (or a similar mechanism) behind the scenes.
If you want to do this with a Pool, change your code to (moving code to main function for performance and better resource cleanup than running in global scope):
    def main():
        myfancyclasses = [MyFancyClass('Fancy Dan'), ...] # define your MyFancyClass instances here
        with multiprocessing.Pool(processes=4) as p:
            # Submit all the work
            futures = [p.apply_async(fancy.do_something) for fancy in myfancyclasses]
            # Done submitting, let workers exit as they run out of work
            p.close()
            # Wait until all the work is finished
            for f in futures:
                f.wait()
    if __name__ == '__main__':
        main()
This could be simplified further at the expense of purity, with the .*map* methods of Pool, e.g. to minimize memory usage redefine main as:
    def main():
        myfancyclasses = [MyFancyClass('Fancy Dan'), ...] # define your MyFancyClass instances here
        with multiprocessing.Pool(processes=4) as p:
            # No return value, so we ignore it, but we need to run out the result
            # or the work won't be done
            for _ in p.imap_unordered(MyFancyClass.do_something, myfancyclasses):
                pass
Yes, technically either approach has slightly higher overhead, because the return value you're not using has to be serialized to give it back to the parent process. But in practice, this cost is pretty low (since your function has no return, it's returning None, which serializes to almost nothing). An advantage of this approach is that you generally don't want to print to the screen from the child processes (since they'll end up interleaving output), and you can replace the printing with returns to let the parent do the work, e.g.:
    import multiprocessing
    class MyFancyClass:
        def __init__(self, name):
            self.name = name
        def do_something(self):
            proc_name = multiprocessing.current_process().name
            # Changed from print to return
            return 'Doing something fancy in {} for {}!'.format(proc_name, self.name)
    def main():
        myfancyclasses = [MyFancyClass('Fancy Dan'), ...] # define your MyFancyClass instances here
        with multiprocessing.Pool(processes=4) as p:
            # Using the return value now to avoid interleaved output
            for res in p.imap_unordered(MyFancyClass.do_something, myfancyclasses):
                print(res)
    if __name__ == '__main__':
        main()
Note how all of these solutions remove the need to write your own worker function, or manually manage Queues, because Pools do that grunt work for you.
Alternate approach using concurrent.futures to efficiently process results as they become available, while allowing you to choose to submit new work (either based on the results, or based on external information) as you go:
    import concurrent.futures
    from concurrent.futures import FIRST_COMPLETED
    def main():
        allow_new_work = True # Set to False to indicate we'll no longer allow new work
        myfancyclasses = [MyFancyClass('Fancy Dan'), ...] # define your initial MyFancyClass instances here
        with concurrent.futures.ProcessPoolExecutor() as executor:
            remaining_futures = {executor.submit(fancy.do_something)
                                 for fancy in myfancyclasses}
            while remaining_futures:
                done, remaining_futures = concurrent.futures.wait(remaining_futures,
                                                                  return_when=FIRST_COMPLETED)
                for fut in done:
                    result = fut.result()
                    # Do stuff with result, maybe submit new work in response
                if allow_new_work:
                    if should_stop_checking_for_new_work():
                        allow_new_work = False
                        # Let the workers exit when all remaining tasks done,
                        # and reject submitting more work from now on
                        executor.shutdown(wait=False)
                    elif has_more_work():
                        # Assumed to return collection of new MyFancyClass instances
                        new_fanciness = get_more_fanciness()
                        remaining_futures |= {executor.submit(fancy.do_something)
                                              for fancy in new_fanciness}
                        myfancyclasses.extend(new_fanciness)
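The helper functions in the last example (should_stop_checking_for_new_work, has_more_work, get_more_fanciness) are left undefined. Below is a minimal runnable sketch of the same wait/submit loop under simplifying assumptions: later work arrives as a hypothetical list of batches (extra_batches), do_something is a plain function standing in for MyFancyClass.do_something, and ThreadPoolExecutor is swapped in for ProcessPoolExecutor so the sketch runs without pickling or start-method concerns. Both executors share the submit/wait interface used here.

```python
import concurrent.futures
from collections import deque
from concurrent.futures import FIRST_COMPLETED

def do_something(name):
    # Stand-in for MyFancyClass.do_something
    return 'Doing something fancy for {}!'.format(name)

def run(initial_names, extra_batches):
    pending_batches = deque(extra_batches)   # hypothetical source of later work
    results = []
    # Assumption: threads instead of processes, to keep the sketch portable.
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        remaining = {executor.submit(do_something, name)
                     for name in initial_names}
        while remaining:
            # Block until at least one future finishes
            done, remaining = concurrent.futures.wait(
                remaining, return_when=FIRST_COMPLETED)
            for fut in done:
                results.append(fut.result())
            # Submit one more batch, if any is left, while others still run
            if pending_batches:
                remaining |= {executor.submit(do_something, name)
                              for name in pending_batches.popleft()}
    return results

results = run(['Fancy Dan', 'Frankie'], [['Ada'], ['Grace', 'Linus']])
```

Results arrive in completion order, so the order of the returned list is not guaranteed.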

How to run Python custom objects in separate processes, all working on a shared events queue?

I have 4 different Python custom objects and an events queue. Each object has a method that retrieves an event from the shared events queue, processes it if the type is the desired one, and then puts a new event on the same events queue so that other processes can handle it.
Here's an example.
    import multiprocessing as mp
    class CustomObject:
        def __init__(self, events_queue: mp.Queue) -> None:
            self.events_queue = events_queue
        def process_events_queue(self) -> None:
            event = self.events_queue.get()
            if type(event) == SpecificEventDataTypeForThisClass:
                # do something and create a new_event
                self.events_queue.put(new_event)
            else:
                self.events_queue.put(event)
        # there are other methods specific to each object
These 4 objects have specific tasks to do, but they all share this same structure. Since I need to "simulate" the production conditions, I want them all to run at the same time, independently of each other.
Here's just an example of what I want to do, if possible.
    import multiprocessing as mp
    import CustomObject
    if __name__ == '__main__':
        events_queue = mp.Queue()
        data_provider = mp.Process(target=CustomObject, args=(events_queue,))
        portfolio = mp.Process(target=CustomObject, args=(events_queue,))
        engine = mp.Process(target=CustomObject, args=(events_queue,))
        broker = mp.Process(target=CustomObject, args=(events_queue,))
        while True:
            data_provider.process_events_queue()
            portfolio.process_events_queue()
            engine.process_events_queue()
            broker.process_events_queue()
My idea is to run each object in a separate process, allowing them to communicate with events shared through the events_queue. So my question is, how can I do that?
The problem is that obj = mp.Process(target=CustomObject, args=(events_queue,)) returns a Process instance and I can't access the CustomObject methods from it. Also, is there a smarter way to achieve what I want?
Processes require a function to run, which defines what the process is actually doing. Once this function exits (and there are no non-daemon threads) the process is done. This is similar to how Python itself always executes a __main__ script.
If you do mp.Process(target=CustomObject, args=(events_queue,)) that just tells the process to call CustomObject - which instantiates it once and then is done. This is not what you want, unless the class actually performs work when instantiated - which is a bad idea for other reasons.
Instead, you must define a main function or method that handles what you need: "communicate with events shared through the events_queue". This function should listen to the queue and take action depending on the events received.
A simple implementation looks like this:
    import os, time
    from multiprocessing import Queue, Process
    class Worker:
        # separate input and output for simplicity
        def __init__(self, commands: Queue, results: Queue):
            self.commands = commands
            self.results = results
        # our main function to be run by a process
        def main(self):
            # each process should handle more than one command
            while True:
                value = self.commands.get()
                # pick a well-defined signal to detect "no more work"
                if value is None:
                    self.results.put(None)
                    break
                # do whatever needs doing
                result = self.do_stuff(value)
                print(os.getpid(), ':', self, 'got', value, 'put', result)
                time.sleep(0.2) # pretend we do something
                # pass on more work if required
                self.results.put(result)
        # placeholder for what needs doing
        def do_stuff(self, value):
            raise NotImplementedError
This is a template for a class that just keeps on processing events. The do_stuff method must be overridden in a subclass to define what actually happens.
    class AddTwo(Worker):
        def do_stuff(self, value):
            return value + 2
    class TimesThree(Worker):
        def do_stuff(self, value):
            return value * 3
    class Printer(Worker):
        def do_stuff(self, value):
            print(value)
This already defines fully working process payloads: Process(target=TimesThree(in_queue, out_queue).main) schedules the main method in a process, listening for and responding to commands.
Running this mainly requires connecting the individual components:
    if __name__ == '__main__':
        # bookkeeping of resources we create
        processes = []
        start_queue = Queue()
        # connect our workers via queues
        queue = start_queue
        for element in (AddTwo, TimesThree, Printer):
            instance = element(queue, Queue())
            # we run the main method in processes
            processes.append(Process(target=instance.main))
            queue = instance.results
        # start all processes
        for process in processes:
            process.start()
        # send input, but do not wait for output
        start_queue.put(1)
        start_queue.put(248124)
        start_queue.put(-256)
        # send shutdown signal
        start_queue.put(None)
        # wait for processes to shutdown
        for process in processes:
            process.join()
Note that you do not need classes for this. You can also compose functions for a similar effect, as long as everything is pickle-able:
    import os, time
    from multiprocessing import Queue, Process
    def main(commands, results, do_stuff):
        while True:
            value = commands.get()
            if value is None:
                results.put(None)
                break
            result = do_stuff(value)
            print(os.getpid(), ':', do_stuff, 'got', value, 'put', result)
            time.sleep(0.2)
            results.put(result)
    def times_two(value):
        return value * 2
    if __name__ == '__main__':
        in_queue, out_queue = Queue(), Queue()
        worker = Process(target=main, args=(in_queue, out_queue, times_two))
        worker.start()
        for message in (1, 3, 5, None):
            in_queue.put(message)
        while True:
            reply = out_queue.get()
            if reply is None:
                break
            print('result:', reply)

python multithreaded application hanging

I have written a program that I am using to benchmark a mongodb database performing under multithreaded bulk write conditions.
The problem is that the program hangs and does not finish executing.
I am quite sure the problem comes from writing 530838 records to the database with 10 threads that each bulk write 50 records at a time. That leaves a remainder of 38 records. The run method always fetches 50 records from the queue, so the program hangs once 530800 records have been written and never writes the final 38, because the following loop never finishes:
    for object in range(50):
        objects.append(self.queue.get())
I would like the program to write 50 records at a time until fewer than 50 remain at which point it should write the remaining records in the queue and then exit the thread when no records remain in the queue.
Thanks in advance :)
    import threading
    import Queue
    import json
    from pymongo import MongoClient, InsertOne
    import datetime
    #Set the number of threads
    n_thread = 10
    #Create the queue
    queue = Queue.Queue()
    #Connect to the database
    client = MongoClient("mongodb://mydatabase.com")
    db = client.threads
    class ThreadClass(threading.Thread):
        def __init__(self, queue):
            threading.Thread.__init__(self)
            #Assign thread working with queue
            self.queue = queue
        def run(self):
            while True:
                objects = []
                #Get next 50 objects from queue
                for object in range(50):
                    objects.append(self.queue.get())
                #Insert the queued objects into the database
                db.threads.insert_many(objects)
                #signals to queue job is done
                self.queue.task_done()
    #Create number of processes
    threads = []
    for i in range(n_thread):
        t = ThreadClass(queue)
        t.setDaemon(True)
        #Start thread
        t.start()
    #Start timer
    starttime = datetime.datetime.now()
    #Read json object by object
    content = json.load(open("data.txt","r"))
    for jsonobj in content:
        #Put object into queue
        queue.put(jsonobj)
    #wait on the queue until everything has been processed
    queue.join()
    for t in threads:
        t.join()
    #Print the total execution time
    endtime = datetime.datetime.now()
    duration = endtime-starttime
    print(divmod(duration.days * 86400 + duration.seconds, 60))
From the docs on Queue.get you can see that the default settings are block=True and timeout=None, which results in blocked waiting on an empty queue to have a next item that can be taken.
You could use get_nowait or get(False) to ensure you're not blocking. If you want the blocking to be conditional on whether the queue has 50 items, whether it is empty, or other conditions, you can use Queue.empty and Queue.qsize, but note that they do not provide race-condition-proof guarantees of non-blocking behavior... they would merely be heuristics for whether to use block=False with get.
Something like this:
    def run(self):
        while True:
            objects = []
            #Get next 50 objects from queue
            #qsize is a method, so it must be called
            block = self.queue.qsize() >= 50
            for i in range(50):
                try:
                    item = self.queue.get(block=block)
                except Queue.Empty:
                    break
                objects.append(item)
            #Insert the queued objects into the database
            db.threads.insert_many(objects)
            #signals to queue job is done
            self.queue.task_done()
Another approach would be to set timeout and use a try ... except block to catch any Empty exceptions that are raised. This has the advantage that you can decide how long to wait, rather than heuristically guessing when to immediately return, but they are similar.
Also note that I changed your loop variable from object to i: you should avoid having a loop variable shadow the built-in object class.
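A sketch of the timeout approach, written for Python 3 (queue rather than Queue) with a list standing in for the MongoDB collection so it runs without a database. BATCH_SIZE, IDLE_TIMEOUT, batch_writer and fake_insert_many are illustrative names. It assumes the producer fills the queue faster than IDLE_TIMEOUT; with a slower producer the workers would exit early, and a sentinel item would be the safer signal. Unlike the snippets above, it calls task_done() once per item taken, which queue.join() needs in order to return.

```python
import queue
import threading

BATCH_SIZE = 50
IDLE_TIMEOUT = 0.5   # assumed: seconds to wait before treating the queue as drained

def batch_writer(work_queue, write_batch):
    while True:
        batch = []
        try:
            while len(batch) < BATCH_SIZE:
                batch.append(work_queue.get(timeout=IDLE_TIMEOUT))
        except queue.Empty:
            pass                         # fewer than BATCH_SIZE items were left
        if not batch:
            return                       # nothing arrived within the timeout: exit
        try:
            write_batch(batch)           # e.g. db.threads.insert_many(batch)
        finally:
            for _ in batch:
                work_queue.task_done()   # one task_done() per item taken

written = []                             # stand-in for the database
written_lock = threading.Lock()

def fake_insert_many(batch):
    with written_lock:
        written.append(list(batch))

work_queue = queue.Queue()
for record in range(138):                # 138 = 2 * 50 + 38, so a remainder exists
    work_queue.put({"n": record})

threads = [threading.Thread(target=batch_writer,
                            args=(work_queue, fake_insert_many))
           for _ in range(3)]
for t in threads:
    t.start()
work_queue.join()                        # returns once every item is accounted for
for t in threads:
    t.join()                             # workers exit after one idle timeout
```

How the 138 records are split into batches depends on thread scheduling, but every record is written exactly once and no batch exceeds BATCH_SIZE.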

Python Multiprocessing Queue: Reading queue from another module

I have an issue reading a multiprocessing queue when the function that reads the queue is called from another module.
Below is the class containing the function that starts a process (a multiprocessing.Process) running function_to_get_data. The class resides in its own file, which I will call one.py. function_to_get_data is in another file, two.py, and is an infinite loop which puts data into the queue (code snippet further down). The class also contains the function to read the queue. The Queue q is defined globally at the beginning.
    import multiprocessing
    from two import function_to_get_data
    q = multiprocessing.Queue()
    class Poller:
        def startPoller(self):
            pollerThread = multiprocessing.Process(target=function_to_get_data,args=(q,))
            pollerThread.start()
        def getPoller(self):
            if q.empty():
                print "queue is empty"
            else:
                pollResQueue = q.get()
                q.put(pollResQueue)
                return pollResQueue
    if __name__ == "__main__":
        startpoll = Poller()
        startpoll.startPoller()
Below is a snippet of function_to_get_data:
    def function_to_get_data(q):
        while 1:
            # performs actions #
            q.put(data_from_actions)
I have another module, three.py, which requires the data from the queue and requests it by calling the function from the initial class:
    from one import Poller
    externalPoller = Poller()
    data_this_module_needs = externalPoller.getPoller()
The issue is that the Queue is always empty.
I should add that the function in three.py is also called as a thread in one.py by a post from a web page:
    def POST(data):
        data = web.input()
        if data == 'Start':
            thread_two = multiprocessing.Process(target=function_in_three_py, args=(q,))
            thread_two.start()
If I use the python command line and enter the two Poller functions and call them, I get data from the queue no problem.
