I have written a program that I am using to benchmark a MongoDB database performing under multithreaded bulk-write conditions.
The problem is that the program hangs and never finishes executing.
I am fairly sure the problem is that I am writing 530838 records to the database using 10 threads that bulk write 50 records at a time. That leaves a remainder of 38 records, but the run method always fetches 50 records from the queue, so the process hangs once 530800 records have been written and the final 38 records are never written, because the following code never finishes executing:
for object in range(50):
    objects.append(self.queue.get())
I would like the program to write 50 records at a time until fewer than 50 remain, at which point it should write the remaining records in the queue and then exit the thread once no records remain in the queue.
Thanks in advance :)
import threading
import Queue
import json
from pymongo import MongoClient, InsertOne
import datetime

#Set the number of threads
n_thread = 10
#Create the queue
queue = Queue.Queue()
#Connect to the database
client = MongoClient("mongodb://mydatabase.com")
db = client.threads

class ThreadClass(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        #Assign thread working with queue
        self.queue = queue

    def run(self):
        while True:
            objects = []
            #Get next 50 objects from queue
            for object in range(50):
                objects.append(self.queue.get())
            #Insert the queued objects into the database
            db.threads.insert_many(objects)
            #signals to queue job is done
            self.queue.task_done()

#Create number of processes
threads = []
for i in range(n_thread):
    t = ThreadClass(queue)
    t.setDaemon(True)
    #Start thread
    t.start()

#Start timer
starttime = datetime.datetime.now()
#Read json object by object
content = json.load(open("data.txt","r"))
for jsonobj in content:
    #Put object into queue
    queue.put(jsonobj)

#wait on the queue until everything has been processed
queue.join()
for t in threads:
    t.join()

#Print the total execution time
endtime = datetime.datetime.now()
duration = endtime-starttime
print(divmod(duration.days * 86400 + duration.seconds, 60))
From the docs on Queue.get you can see that the default arguments are block=True and timeout=None, which means a get on an empty queue blocks until a next item becomes available.
You could use get_nowait or get(False) to ensure you're not blocking. If you want the blocking to be conditional on whether the queue has 50 items, whether it is empty, or other conditions, you can use Queue.empty and Queue.qsize, but note that they do not provide race-condition-proof guarantees of non-blocking behavior... they would merely be heuristics for deciding whether to use block=False with get.
Something like this:
def run(self):
    while True:
        objects = []
        #Get next 50 objects from queue
        block = self.queue.qsize() >= 50
        for i in range(50):
            try:
                item = self.queue.get(block=block)
            except Queue.Empty:
                break
            objects.append(item)
        #Insert the queued objects into the database
        #(skip empty batches, since insert_many raises on an empty list)
        if objects:
            db.threads.insert_many(objects)
            #signals to queue job is done
            self.queue.task_done()
Another approach would be to set a timeout and use a try ... except block to catch any Empty exceptions that are raised. This has the advantage that you can decide how long to wait rather than heuristically guessing when to return immediately, but the two approaches are similar.
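For illustration, here is a minimal sketch of that timeout-based variant (the 1-second timeout is an arbitrary choice, and note that it calls task_done() once per retrieved item, which Queue.join() needs in order to ever return):
def run(self):
    while True:
        objects = []
        #Collect up to 50 objects, waiting at most 1 second for each
        for i in range(50):
            try:
                objects.append(self.queue.get(timeout=1))
            except Queue.Empty:
                break
        if objects:
            #Insert the queued objects into the database
            db.threads.insert_many(objects)
            #Signal task_done once per retrieved item so queue.join() can return
            for _ in objects:
                self.queue.task_done()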
Also note that I changed your loop variable from object to i ... you should most likely avoid having your loop variable shadow the built-in object class.
I'm trying to write code in which there is a single queue and many workers (producer_consumer in the example) that process objects in the queue. I need to use multiprocessing, since the code the workers are going to execute is CPU bound. The setup is the following:
The queue is initialized by the parent process with some initial values (names in the example), then it starts the workers.
Workers start getting elements from the queue, and after processing an element each worker may produce a new object that is inserted into the queue (...and then processed by someone else).
All this goes on until the queue is empty. When that happens, I would like all the workers to stop and control to be given back to the parent to conclude the execution.
I wrote this example, in which the workers correctly process elements and produce new objects into the queue, but the problem is that the execution hangs when the queue is empty. Any suggestions?
Thanks in advance
import time
import os
import random
import string
from multiprocessing import Process, Queue, Lock

# Produces and Consumes names in and from the Queue,
def producer_consumer(queue, lock):
    # Synchronize access to the console
    with lock:
        print('Starting consumer => {}'.format(os.getpid()))
    while not queue.empty():
        time.sleep(random.randint(0, 10))
        # If the queue is empty, queue.get() will block until the queue has data
        name = queue.get()
        if random.random() < 0.7:
            product = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10))
            queue.put(product)
        else:
            product = 'nothing'
        # Synchronize access to the console
        with lock:
            print('{} got {}, produced {}'.format(os.getpid(), name, product))

if __name__ == '__main__':
    # Create the Queue object
    queue = Queue()
    # Create a lock object to synchronize resource access
    lock = Lock()
    producer_consumers = []
    names = ['Mario', 'Peppino', 'Francesco', 'Carlo', 'Ermenegildo']
    for name in names:
        queue.put(name)
    for _ in range(5):
        producer_consumers.append(Process(target=producer_consumer, args=(queue, lock)))
    for process in producer_consumers:
        process.start()
    for p in producer_consumers:
        p.join()
    print('Parent process exiting...')
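One common way to get the clean shutdown described above (a sketch only, not the poster's original code; the STOP sentinel and the switch to a JoinableQueue are assumptions of this sketch) is to let the parent wait until every queued item has been marked done and then send one sentinel per worker so each loop can exit:
import os
from multiprocessing import Process, JoinableQueue

STOP = 'STOP'  # sentinel; assumes real work items never equal this string

def producer_consumer(queue):
    while True:
        name = queue.get()
        if name == STOP:
            break
        # ...process `name` here, optionally queue.put(new_item) before task_done...
        print('{} got {}'.format(os.getpid(), name))
        queue.task_done()  # one task_done per work item actually processed

if __name__ == '__main__':
    queue = JoinableQueue()
    for name in ['Mario', 'Peppino', 'Francesco', 'Carlo', 'Ermenegildo']:
        queue.put(name)
    workers = [Process(target=producer_consumer, args=(queue,)) for _ in range(5)]
    for w in workers:
        w.start()
    queue.join()           # returns once every put item has a matching task_done
    for _ in workers:
        queue.put(STOP)    # one sentinel per worker
    for w in workers:
        w.join()
    print('Parent process exiting...')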
I would like to understand how a queue knows that it won't receive any new items. In the following example the queue waits indefinitely when the tputter thread is not started (I assume because nothing has been put into it so far). If tputter is started, the getter waits between puts until something new is there, and as soon as everything is finished it stops. But how does tgetter know whether something new will end up in the queue or not?
import threading
import queue
import time

q = queue.Queue()

def getter():
    for i in range(5):
        print('worker:', q.get())
        time.sleep(2)

def putter():
    for i in range(5):
        print('putter: ', i)
        q.put(i)
        time.sleep(3)

tgetter = threading.Thread(target=getter)
tgetter.start()

tputter = threading.Thread(target=putter)
#tputter.start()
A common way to do this is to use the "poison pill" pattern. Basically, the producer and consumer agree on a special "poison pill" object that the producer can load into the queue, which will indicate that no more items are going to be sent, and the consumer can shut down.
So, in your example, it'd look like this:
import threading
import queue
import time

q = queue.Queue()
END = object()

def getter():
    while True:
        item = q.get()
        if item == END:
            break
        print('worker:', item)
        time.sleep(2)

def putter():
    for i in range(5):
        print('putter: ', i)
        q.put(i)
        time.sleep(3)
    q.put(END)

tgetter = threading.Thread(target=getter)
tgetter.start()

tputter = threading.Thread(target=putter)
#tputter.start()
This is a little contrived, since the producer is hard-coded to always send five items, so you have to imagine that the consumer doesn't know ahead of time how many items the producer will send.
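If there were several consumer threads, the same pattern is usually extended by putting one poison pill per consumer, so that every thread gets its own shutdown signal. A sketch (the thread count of 3 is arbitrary and not part of the answer above):
import threading
import queue

q = queue.Queue()
END = object()
NUM_CONSUMERS = 3

def getter():
    while True:
        item = q.get()
        if item is END:
            break
        print('worker:', item)

consumers = [threading.Thread(target=getter) for _ in range(NUM_CONSUMERS)]
for t in consumers:
    t.start()

for i in range(5):
    q.put(i)
for _ in range(NUM_CONSUMERS):
    q.put(END)   # one pill per consumer so every thread can exit
for t in consumers:
    t.join()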
My program deadlocks after running for about half an hour. I use a global thread-safe queue, usernamesQueue = Queue(); the main thread produces items into the queue and waits for them to be handled:
while(True):
    print('processing file...')
    with open('usernames') as f:
        for line in f:
            usernamesQueue.put(line.strip())
    usernamesQueue.join()
I start the other threads like this:
for i in range(NUMBER_OF_WORKERS):
    threading.Thread(target=worker).start()
And handle values in the queue like this:
def worker():
    while True:
        time.sleep(1)
        item = None
        item = usernamesQueue.get()
        if item is not None:
            processUser(item)
        usernamesQueue.task_done()
        time.sleep(random.randint(1, 5))
processUser catches any exception that can be thrown, and I am sure there were no exceptions before the deadlock. What's wrong?
It turned out there was no problem with the queue itself; the deadlock was in HTTPSConnection, which by default has no timeout (or a very long one). I added a 15-second timeout and the program works fine even when the server doesn't want to respond.
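For reference, a minimal sketch of what that looks like, assuming the standard-library HTTPSConnection is used (the host name is a placeholder; in Python 2 the same timeout parameter exists on httplib.HTTPSConnection):
from http.client import HTTPSConnection  # httplib in Python 2

# Give up after 15 seconds instead of blocking forever on an unresponsive server
conn = HTTPSConnection('example.com', timeout=15)
conn.request('GET', '/')
response = conn.getresponse()
print(response.status)
conn.close()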
I have an iterator which contains a lot of data (larger than memory), and I want to be able to perform some actions on this data. To do this quickly I am using the multiprocessing module.
def __init__(self, poolSize, spaceTimeTweetCollection=None):
    super().__init__()
    self.tagFreq = {}
    if spaceTimeTweetCollection is not None:
        q = Queue()
        processes = [Process(target=self.worker, args=((q),)) for i in range(poolSize)]
        for p in processes:
            p.start()
        for tweet in spaceTimeTweetCollection:
            q.put(tweet)
        for p in processes:
            p.join()
The aim is that I create some processes which listen on the queue:
def worker(self, queue):
    tweet = queue.get()
    self.append(tweet) #performs some actions on data
I then loop over the iterator and add the data to the queue. Since the queue.get() in the worker method is blocking, the workers should start performing actions on the data as they receive it from the queue.
However, instead each worker on each processor runs once and that's it! So if poolSize is 8 it will read the first 8 items in the queue, perform the actions on 8 different processes, and then finish! Does anyone know why this is happening? I am running this on Windows.
Edit
I wanted to mention that, even though this is all being done in a class, the class is called in __main__ like so:
if __name__ == '__main__':
    tweetDatabase = Database()
    dataSet = tweetDatabase.read2dBoundingBox(boundaryBox)
    freq = TweetCounter(8, dataSet) # this is where the multiprocessing is done
Your worker is to blame, I believe. It just does one thing and then dies. Try:
def worker(self, queue):
    while True:
        tweet = queue.get()
        self.append(tweet)
(I'd take a look at Pool though)
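For comparison, a minimal sketch of the Pool-based approach the answer hints at (process_tweet and the placeholder iterator stand in for the poster's actual code; imap_unordered streams items from the iterator instead of loading everything into memory):
from multiprocessing import Pool

def process_tweet(tweet):
    # stand-in for whatever self.append(tweet) actually does
    return len(str(tweet))

if __name__ == '__main__':
    spaceTimeTweetCollection = (t for t in range(1000))  # placeholder iterator
    with Pool(8) as pool:
        # chunksize reduces IPC overhead when the iterator yields many small items
        for result in pool.imap_unordered(process_tweet, spaceTimeTweetCollection, chunksize=50):
            pass  # aggregate results here
The if __name__ == '__main__' guard matters on Windows, where child processes re-import the main module.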
Long-time lurker here.
I have a thread controller object. This object takes in other objects called "Checks". These Checks pull in DB rows that match their criteria. The thread manager polls each check (asking it for its DB rows, aka work units) and then enqueues each row along with a reference to that check object. The idea is that N threads will come in, pull an item off the queue, and execute the corresponding Check's do_work method. The do_work method will return Pass/Fail, and all passes will be enqueued for further processing.
The main script (not shown) instantiates the checks and adds them to the thread manager using add_check and then calls kick_off_work.
So far I am testing and it simply locks up:
import Queue
from threading import Thread

class ThreadMan:
    def __init__(self, reporter):
        print "Initializing thread manager..."
        self.workQueue = Queue.Queue()
        self.resultQueue = Queue.Queue()
        self.checks = []

    def add_check(self, check):
        self.checks.append(check)

    def kick_off_work(self):
        for check in self.checks:
            for work_unit in check.populate_work():
                #work unit is a DB row
                self.workQueue.put({"object" : check, "work" : work_unit})
        threads = Thread(target=self.execute_work_unit)
        threads = Thread(target=self.execute_work_unit)
        threads.start()
        self.workQueue.join();

    def execute_work_unit(self):
        unit = self.workQueue.get()
        check_object = unit['object'] #Check object
        work_row = unit['work'] # DB ROW
        check_object.do_work(work_row)
        self.workQueue.task_done();
        print "Done with work!!"
The output is simply:
Initializing thread manager...
In check1's do_work method... Doing work
Done with work!!
(locked up)
I would like it to run through the entire queue.
You just need to add a "while" loop in your execute_work_unit, otherwise it stops after the first iteration:
def execute_work_unit(self):
    while True:
        unit = self.workQueue.get()
        check_object = unit['object'] #Check object
        work_row = unit['work'] # DB ROW
        check_object.do_work(work_row)
        self.workQueue.task_done();
        print "Done with work!!"
Have a look here:
http://docs.python.org/2/library/queue.html#module-Queue
EDIT: to get it to finish, just add threads.join() after your self.workQueue.join() in
def kick_off_work(self):
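Putting those suggestions together, kick_off_work might end up looking roughly like this (a sketch; the None sentinel that lets the while True loop exit and the list of four worker threads are additions of this sketch, not part of the answer above):
def kick_off_work(self):
    for check in self.checks:
        for work_unit in check.populate_work():
            #work unit is a DB row
            self.workQueue.put({"object": check, "work": work_unit})
    threads = [Thread(target=self.execute_work_unit) for _ in range(4)]
    for t in threads:
        t.start()
    self.workQueue.join()        # wait until every queued unit is task_done
    for _ in threads:
        self.workQueue.put(None) # one sentinel per thread so each loop can exit
    for t in threads:
        t.join()

def execute_work_unit(self):
    while True:
        unit = self.workQueue.get()
        if unit is None:         # sentinel: no more work
            self.workQueue.task_done()
            break
        unit['object'].do_work(unit['work'])
        self.workQueue.task_done()
        print "Done with work!!"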