I've created a method that downloads a product from a web page and then stores it in an SQLite3 database.
The method works fine when it's called normally, but I want to create a pool and run it in parallel so the requests are sent concurrently (the web page allows bots to send 2000 requests per minute).
The problem is that when I run it through the pool, it does not store the data in the database, nor does it raise any error or exception.
Here is the code of the main function:
if __name__ == '__main__':
    pu = product_updater()  # class which handles almost everything; it also has a database manager class as an attribute
    pool = Pool(10)
    for line in lines[0:100]:  # lines is a list of URLs
        # pu.update_product(line[:-1])  # this works correctly
        pool.apply_async(pu.update_product, args=(line[:-1],))  # this runs, but does not store products in the database
    pool.close()
    pool.join()
def update_product(self, url):  # this method belongs to the product_updater class
    prod = self.parse_product(url)
    self.man.insert_product(prod)  # man is the class handling the database
I use this pool: from multiprocessing.pool import ThreadPool as Pool
Do you know what could be wrong?
EDIT: I think it could be caused by the single cursor that is shared between the workers, but if that were the problem I would expect it to raise some exception.
EDIT2: The weird thing is that I tried a Pool with only 1 worker, so there shouldn't be any concurrency problem, but I get the same result - no new rows in the database.
multiprocessing.Pool does not report exceptions raised within the workers unless you ask the tasks for their results.
This example will be silent.
from multiprocessing import Pool

def function():
    raise Exception("BOOM!")

p = Pool()
p.apply_async(function)
p.close()
p.join()
This example instead will show the exception.
from multiprocessing import Pool

def function():
    raise Exception("BOOM!")

p = Pool()
task = p.apply_async(function)
task.get()  # <---- you will get the exception here
p.close()
p.join()
The root cause of your issue is the sharing of a single cursor object, which is not thread/process safe. As multiple workers read and write through the same cursor, things break and the Pool silently eats the exception (om nom).
The first step is to collect the results of the tasks as I showed, so that problems become visible. Then you can give each worker its own dedicated cursor.
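For example, here is a minimal sketch of both fixes. The database file name, the table layout and the get_conn() helper are placeholders I made up, and parse_product and lines stand in for the question's parser and URL list:
import sqlite3
import threading
from multiprocessing.pool import ThreadPool as Pool

local = threading.local()  # one SQLite connection (and cursor) per worker thread

def get_conn():
    # Lazily open a dedicated connection for the current thread.
    if not hasattr(local, "conn"):
        local.conn = sqlite3.connect("products.db")
    return local.conn

def update_product(url):
    prod = parse_product(url)  # your existing parsing logic goes here
    conn = get_conn()
    conn.execute("INSERT INTO products (url, data) VALUES (?, ?)", prod)  # placeholder schema; adjust to your table
    conn.commit()

if __name__ == '__main__':
    pool = Pool(10)
    results = [pool.apply_async(update_product, args=(line[:-1],))
               for line in lines[0:100]]
    pool.close()
    pool.join()
    for r in results:
        r.get()  # re-raises any worker exception, so failures are no longer silent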
Related
I have noticed that if I use ThreadPoolExecutor(max_workers=5) to run a function 10 times, the thread-local data is maintained between iterations, as if the threads are reused. I have the following very basic code to test this:
import time
import uuid
import threading
import concurrent.futures

thread_data = threading.local()

def conc_func(i):
    if not hasattr(thread_data, 'x'):
        thread_data.x = uuid.uuid4()
        print('Setting Thread Data: ', end='')
    else:
        print('Reading Thread Data: ', end='')
    print(thread_data.x)
    time.sleep(1)

def conc_pool():
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        _ = executor.map(conc_func, range(10))

if __name__ == '__main__':
    conc_pool()
If I run this code, I get the following output
Setting Thread Data: e0101c90-1de7-4c5c-8358-05c6e5b5c89f
Setting Thread Data: 34d1c796-d9e0-4000-a560-2b0d0da47fb4
Setting Thread Data: 2a62e83e-0945-40d4-8af5-e5e77ee9531f
Setting Thread Data: b9bf8871-ea9d-4dca-88d2-c8916ff47e5d
Setting Thread Data: 961a7725-9ebb-4711-81c9-253ddc1c9c80
Reading Thread Data: e0101c90-1de7-4c5c-8358-05c6e5b5c89f
Reading Thread Data: 34d1c796-d9e0-4000-a560-2b0d0da47fb4
Reading Thread Data: 2a62e83e-0945-40d4-8af5-e5e77ee9531f
Reading Thread Data: b9bf8871-ea9d-4dca-88d2-c8916ff47e5d
Reading Thread Data: 961a7725-9ebb-4711-81c9-253ddc1c9c80
As you can see, there are only 5 unique UUIDs (the same as the number of workers). Because I allow at most 5 workers but pass an iterable with 10 elements, each worker processes two iterations. What surprised me, however, is that the thread-local data instance is shared between iterations, while I expected it to be CLEAN on every run.
Is there a way to automatically clean up the thread-local data, or should we take care of this cleaning ourselves at the end of each iteration?
A thread pool reuses threads, that's the whole point. Any thread-local data is attached to a thread. It will stay with the thread as long as it is alive, and it won't clean itself. You have to do it manually.
That being said, my advice is to avoid thread-local storage entirely. I have yet to see a case where it is necessary and you cannot simply pass the data as arguments, except perhaps when you are dealing with badly designed code. The drawbacks are that it is hard to debug and that it creates a hard dependency: if you ever want to switch to, say, single-threaded async code, you will have a problem again, just as you do now because threads are reused instead of being spawned fresh each time.
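If you do keep thread-local storage, here is a minimal sketch of cleaning it up manually at the end of each task, adapted from the question's conc_func (the try/finally placement is just one way to do it):
import uuid
import threading
import concurrent.futures

thread_data = threading.local()

def conc_func(i):
    thread_data.x = uuid.uuid4()  # set fresh data for this task
    try:
        print('Thread Data:', thread_data.x)
    finally:
        del thread_data.x  # manual cleanup so the reused thread starts the next task clean

def conc_pool():
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        list(executor.map(conc_func, range(10)))

if __name__ == '__main__':
    conc_pool()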
I am using the multiprocessing Python module to run independent jobs in parallel, with a function similar to the following example:
import numpy as np
from multiprocessing import Pool

def myFunction(arg1):
    name = "file_%s.npy" % arg1
    A = np.load(name)
    A[A < 0] = np.nan
    np.save(name, A)

if __name__ == "__main__":
    N = list(range(50))
    with Pool(4) as p:
        p.map_async(myFunction, N)
        p.close()  # I tried with and without this statement
        p.join()   # I tried with and without this statement
    DoOtherStuff()
My problem is that the function DoOtherStuff is never executed; the processes switch into sleep mode in top, and I need to kill the script with Ctrl+C to stop it.
Any suggestions?
You have at least a couple of problems. First, you are using map_async(), which does not block until the results of the tasks are complete. So what you're doing is starting the tasks with map_async(), but then immediately closing and terminating the pool (the with statement calls Pool.terminate() upon exiting).
When you add tasks to a process pool with methods like map_async(), they are put on a task queue that is handled by a worker thread, which takes tasks off that queue and farms them out to worker processes, possibly spawning new processes as needed (actually, a separate thread handles that part).
The point is, you have a race condition where you're likely terminating the Pool before any tasks have even started. If you want your script to block until all the tasks are done, just use map() instead of map_async(). For example, I rewrote your script like this:
import numpy as np
from multiprocessing import Pool

def myFunction(N):
    A = np.load(f'file_{N:02}.npy')
    A[A < 0] = np.nan
    np.save(f'file2_{N:02}.npy', A)

def DoOtherStuff():
    print('done')

if __name__ == "__main__":
    N = range(50)
    with Pool(4) as p:
        p.map(myFunction, N)
    DoOtherStuff()
I don't know what your use case is exactly, but if you do want to use map_async(), so that this task can run in the background while you do other stuff, you have to leave the Pool open, and manage the AsyncResult object returned by map_async():
result = pool.map_async(myFunction, N)
DoOtherStuff()
# Is my map done yet? If not, we should still block until
# it finishes before ending the process
result.wait()
pool.close()
pool.join()
You can see more examples in the linked documentation.
I don't know why you got a deadlock in your attempt; I was not able to reproduce it. It's possible there was a bug at some point that has since been fixed, though you were also possibly invoking undefined behavior with your race condition, as well as by calling terminate() on a pool after it had already been join()ed. As for why your own answer appeared to work at all, it's possible that with the multiple calls to apply_async() you managed to skirt around the race condition somewhat, but this is not at all guaranteed to work.
I am trying to use a Queue in Python in a multithreaded setup. I just want to know whether the approach I am using is correct, whether I am doing something redundant, or whether there is a better approach I should use.
I am trying to fetch new requests from a table and schedule them, using some logic, to perform some operation such as running a query.
So here, from the main thread, I spawn a separate thread for the queue:
if __name__ == '__main__':
    request_queue = SetQueue(maxsize=-1)
    worker = Thread(target=request_queue.process_queue)
    worker.setDaemon(True)
    worker.start()

    while True:
        try:
            # Connect to the database, get all the new requests to be verified
            db = Database(username_testschema, password_testschema, mother_host_testschema,
                          mother_port_testschema, mother_sid_testschema, 0)
            # Get new requests for verification
            verify_these = db.query("SELECT JOB_ID FROM %s.table WHERE JOB_STATUS='%s' ORDER BY JOB_ID" %
                                    (username_testschema, 'INITIATED'))
            # If there are some requests to be verified, put them in the queue.
            if len(verify_these) > 0:
                for row in verify_these:
                    print "verifying : %s" % row[0]
                    verify_id = row[0]
                    request_queue.put(verify_id)
        except Exception as e:
            logger.exception(e)
        finally:
            time.sleep(10)
Now, in the SetQueue class I have a process_queue function which is used to process the top 2 requests added to the queue on every run.
'''
Overriding the Queue class to use a set as all_items instead of a list,
to ensure unique items are added and processed all the time.
'''
class SetQueue(Queue.Queue):
    def _init(self, maxsize):
        Queue.Queue._init(self, maxsize)
        self.all_items = set()

    def _put(self, item):
        if item not in self.all_items:
            Queue.Queue._put(self, item)
            self.all_items.add(item)

    '''
    The multithreaded queue for the verification process. Takes the top two items,
    verifies them in separate threads and sleeps for 10 sec.
    This way at most two requests per run will be processed.
    '''
    def process_queue(self):
        while True:
            scheduler_obj = Scheduler()
            try:
                if self.qsize() > 0:
                    for i in range(2):
                        job_id = self.get()
                        t = Thread(target=scheduler_obj.verify_func, args=(job_id,))
                        t.start()
                    for i in range(2):
                        t.join(timeout=1)
                        self.task_done()
            except Exception as e:
                logger.exception(
                    "QUEUE EXCEPTION : Exception occurred while processing requests in the VERIFICATION QUEUE")
            finally:
                time.sleep(10)
I want to see if my understanding is correct and if there can be any issues with it.
So the main thread, running in the while True loop in the main function, connects to the database, gets new requests, and puts them in the queue. The worker (daemon) thread for the queue keeps getting new requests from the queue and forks non-daemon threads which do the processing; since the timeout for the join is 1, the worker thread will keep taking new requests without getting blocked, and its child threads will keep processing in the background. Correct?
So if the main process exits, these child threads won't be killed until they finish their work, but the worker daemon thread would exit.
Doubt: if the parent is a daemon and the child is non-daemon, and the parent exits, does the child exit too?
I also read this: David Beazley on multiprocessing.
In the 'Using a Pool as a Thread Coprocessor' section, David Beazley tries to solve a similar problem. So should I follow his steps:
1. Create a pool of processes.
2. Open a thread like I am doing for request_queue.
3. In that thread:
def process_verification_queue(self):
    while True:
        try:
            if self.qsize() > 0:
                job_id = self.get()
                pool.apply_async(Scheduler.verify_func, args=(job_id,))
        except Exception as e:
            logger.exception("QUEUE EXCEPTION : Exception occurred while processing requests in the VERIFICATION QUEUE")
4. Use a process from the pool to run verify_func in parallel. Will this give me better performance?
While it's possible to create a new independent thread for the queue and process the data separately the way you are doing, I believe it is more common for each independent worker thread to post messages to a queue that it already "knows" about, and for that queue to be processed by some other thread pulling messages off it.
Design Idea
The way I envision your application is with three threads: the main thread and two worker threads. One worker thread would get requests from the database and put them in the queue; the other worker thread would process the data from the queue.
The main thread would just wait for the other threads to finish by using the thread function .join().
You would protect the queue that the threads share and make it thread-safe by using a mutex (in Python, queue.Queue already does this locking for you). I have seen this pattern in many other designs in other languages as well. A minimal sketch of this design follows.
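Here is a minimal sketch of that three-thread design; the database poll is faked with a range, and the print stands in for Scheduler.verify_func:
import queue
import threading
import time

request_queue = queue.Queue()
STOP = object()  # sentinel telling the consumer to exit

def producer():
    # Stand-in for polling the database for new JOB_IDs.
    for job_id in range(10):
        request_queue.put(job_id)
        time.sleep(0.1)
    request_queue.put(STOP)

def consumer():
    while True:
        job_id = request_queue.get()
        if job_id is STOP:
            break
        print("verifying:", job_id)  # stand-in for Scheduler.verify_func(job_id)

if __name__ == '__main__':
    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # the main thread just waits for the workers to finish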
Suggested Reading
"Effective Python" by Brett Slatkin has a great example of this very question.
Instead of inheriting from Queue, he just creates a wrapper to it in his class
called MyQueue and adds a get() and put(message) function.
He even provides the source code at his Github repo
https://github.com/bslatkin/effectivepython/blob/master/example_code/item_39.py
I'm not affiliated with the book or its author, but I highly recommend it as I learned quite a few things from it :)
I like this explanation of the advantages & differences between using threads and processes -
".....But there's a silver lining: processes can make progress on multiple threads of execution simultaneously. Since a parent process doesn't share the GIL with its child processes, all processes can execute simultaneously (subject to the constraints of the hardware and OS)...."
He has some great explanations of how to get around the GIL and how to improve performance.
Read more here:
http://jeffknupp.com/blog/2013/06/30/pythons-hardest-problem-revisited/
I have written some code for testing the performance of a database when users are simultaneously running queries against it. The objective is to understand how the elapsed time increases with the number of users. The code contains a class User (shown below) whose objects are created by parsing XML files.
class User(object):
    def __init__(self, id, constr):
        self.id = id
        self.constr = constr
        self.queryid = list()
        self.queries = list()

    def openConn(self):
        self.cnxn = pyodbc.connect(self.constr)
        logDet.info("%s %s" % (self.id, "Open connection."))

    def closeConn(self):
        self.cnxn.close()
        logDet.info("%s %s" % (self.id, "Close connection."))

    def executeAll(self):
        self.openConn()
        for n, qry in enumerate(self.queries):
            try:
                cursor = self.cnxn.cursor()
                logTim.info("%s|%s|beg" % (self.id, self.queryid[n]))
                cursor.execute(qry)
                logTim.info("%s|%s|end" % (self.id, self.queryid[n]))
            except Exception:
                cursor.rollback()
                logDet.exception("Error while running query.")
        self.closeConn()
pyodbc is used for the connection to the database. Two logs are created: one detailed (logDet) and one with only the timings (logTim). The User objects are stored in a list. The queries for each user are also in a list (not in a thread-safe Queue).
To simulate parallel users, I have tried a couple of different approaches:
def worker(usr):
    usr.executeAll()
Option 1: multiprocessing.Pool
pool = Pool(processes=len(users))
pool.map(worker, users)
Option 2: threading.Thread
for usr in users:
    t = Thread(target=worker, args=(usr,))
    t.start()
Both approaches work. In my tests I tried #users = 2, 6, ..., 60, with each user running 4 queries. Given how the query times are captured, there should be less than a second of delay between the end of one query and the beginning of the next, i.e. the queries should be fired one after the other. That's exactly what happens with multiprocessing, but with threading a random delay is introduced before the next query. The delay can be over a minute (see below).
Using Python 3.4.1 and pyodbc 3.0.7; the clients running the code are on Windows 7 / RHEL 6.5.
I would really prefer to get this working with threading. Is this expected with the threading approach, or is there a command I am missing? Or how can it be rewritten? Thanks.
When you use the threading-based approach, you're starting one thread per user, all the way up to 60 threads. All of those threads must fight for access to the GIL in between their I/O operations. This introduces a ton of overhead. You would probably see better results if you used a ThreadPool limited to a smaller number of threads (maybe 2 * multiprocessing.cpu_count()?), even with a higher number of users:
from multiprocessing.pool import ThreadPool
from multiprocessing import cpu_count
pool = ThreadPool(processes=cpu_count()*2)
pool.map(worker, users)
You may want to limit the number of concurrent processes you run, too, for memory-usage reasons. Starting up sixty concurrent Python processes is pretty expensive.
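For the multiprocessing option, a similar cap could look like this; worker and users are the names from the question, and the pool size is just an illustrative choice:
from multiprocessing import Pool, cpu_count

# Cap the pool at the CPU count instead of one process per user;
# all users are still processed, just a few at a time.
pool = Pool(processes=cpu_count())
pool.map(worker, users)
pool.close()
pool.join()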
I have a problem running multiple processes in Python 3.
My program does the following:
1. Takes entries from an SQLite database and passes them to an input_queue.
2. Creates multiple processes that take items off the input_queue, run them through a function, and output the results to the output_queue.
3. Creates a thread that takes items off the output_queue and prints them (this thread is obviously started before the first two steps).
My problem is that currently the 'function' in step 2 is only run as many times as the number of processes set, so for example if you set the number of processes to 8, it only runs 8 times then stops. I assumed it would keep running until it took all items off the input_queue.
Do I need to rewrite the function that takes the entries out of the database (step 1) into another process and then pass its output queue as an input queue for step 2?
Edit:
Here is an example of the code. I used a list of numbers as a substitute for the database entries, as it behaves the same way. I have 300 items in the list and I would like all 300 to be processed, but at the moment only 10 are (the number of processes I have assigned).
#!/usr/bin/python3
from multiprocessing import Process, Queue
import multiprocessing
from threading import Thread


# This is the class that would be passed to the multi_processing function
class Processor:
    def __init__(self, out_queue):
        self.out_queue = out_queue

    def __call__(self, in_queue):
        data_entry = in_queue.get()
        result = data_entry * 2
        self.out_queue.put(result)


# Performs the multiprocessing
def perform_distributed_processing(dbList, threads, processor_factory, output_queue):
    input_queue = Queue()

    # Create the data processors.
    for i in range(threads):
        processor = processor_factory(output_queue)
        data_proc = Process(target=processor,
                            args=(input_queue,))
        data_proc.start()

    # Push entries to the queue.
    for entry in dbList:
        input_queue.put(entry)

    # Push stop markers to the queue, one for each thread.
    for i in range(threads):
        input_queue.put(None)

    data_proc.join()
    output_queue.put(None)


if __name__ == '__main__':
    output_results = Queue()

    def output_results_reader(queue):
        while True:
            item = queue.get()
            if item is None:
                break
            print(item)

    # Establish results collecting thread.
    results_process = Thread(target=output_results_reader, args=(output_results,))
    results_process.start()

    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]

    # Perform multiprocessing
    perform_distributed_processing(dbList, 10, Processor, output_results)

    # Wait for it all to finish.
    results_process.join()
A collection of processes that service an input queue and write to an output queue is pretty much the definition of a process pool.
If you want to know how to build one from scratch, the best way to learn is to look at the source code for multiprocessing.Pool, which is pretty simple Python, and very nicely written. But, as you might expect, you can just use multiprocessing.Pool instead of re-implementing it. The examples in the docs are very nice.
But really, you could make this even simpler by using an executor instead of a pool. It's hard to explain the difference (again, read the docs for both modules), but basically, a future is a "smart" result object, which means instead of a pool with a variety of different ways to run jobs and get results, you just need a dumb thing that doesn't know how to do anything but return futures. (Of course in the most trivial cases, the code looks almost identical either way…)
from concurrent.futures import ProcessPoolExecutor

def Processor(data_entry):
    return data_entry * 2

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        yield from executor.map(processor_factory, dbList)

if __name__ == '__main__':
    # Use this as a substitute for the database in the example
    dbList = [i for i in range(300)]
    for result in perform_distributed_processing(dbList, 8, Processor):
        print(result)
Or, if you want to handle them as they come instead of in order:
from concurrent.futures import Future, as_completed

def perform_distributed_processing(dbList, threads, processor_factory):
    with ProcessPoolExecutor(max_workers=threads) as executor:
        fs = (executor.submit(processor_factory, db) for db in dbList)
        yield from map(Future.result, as_completed(fs))
Notice that I also replaced your in-process queue and thread, because they weren't doing anything but providing a way to interleave "wait for the next result" and "process the most recent result", and yield (or yield from, in this case) does that without all the complexity, overhead, and potential for getting things wrong.
Don't try to rewrite the whole multiprocessing library. I think you can use any of the multiprocessing.Pool methods, depending on your needs; if this is a batch job you can even use the synchronous multiprocessing.Pool.map(). Instead of pushing entries onto an input queue yourself, you just hand the pool an iterable (or a generator) of inputs and let it distribute them to the workers. A minimal sketch is below.
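For instance, a minimal sketch of that idea; process and db_entries are hypothetical stand-ins for the real per-entry work and the database read:
from multiprocessing import Pool

def process(data_entry):
    # Stand-in for the real work done on each entry.
    return data_entry * 2

def db_entries():
    # Hypothetical generator standing in for rows read from the SQLite database.
    for i in range(300):
        yield i

if __name__ == '__main__':
    with Pool(processes=8) as pool:
        # imap consumes the iterable lazily and feeds entries to the workers.
        for result in pool.imap(process, db_entries()):
            print(result)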