Thread/Queue hanging issue - python

Novice to threading here. I'm borrowing a lot of the code from this thread while trying to build my first script using threading/queue:
import threading, urllib2
import Queue
import sys
from PIL import Image
import io, sys

def avhash(url, queue):
    if not isinstance(url, Image.Image):
        try:
            im = Image.open(url)
        except IOError:
            fd = urllib2.urlopen(url)
            image_file = io.BytesIO(fd.read())
            im = Image.open(image_file)
    im = im.resize((8, 8), Image.ANTIALIAS).convert('L')
    avg = reduce(lambda x, y: x + y, im.getdata()) / 64.
    hash = reduce(lambda x, (y, z): x | (z << y),
                  enumerate(map(lambda i: 0 if i < avg else 1, im.getdata())),
                  0)
    queue.put({url: hash})
    queue.task_done()

def fetch_parallel(job_list):
    q = Queue.Queue()
    threads = [threading.Thread(target=avhash, args=(job, q)) for job in job_list[0:50]]
    for t in threads:
        t.daemon = True
        t.start()
    for t in threads:
        t.join()
    return [q.get() for _ in xrange(len(job_list))]
In this case job_list is a list of URLs. I've found that this code works fine when the list has 50 or fewer items, but it hangs when there are more than 50. There must be something I'm fundamentally not understanding about how threading works?

Your problem is this line:
return [q.get() for _ in xrange(len(job_list))]
If job_list has more than 50 elements, then you try to read more results from your queue than you have put in. Therefore:
return [q.get() for _ in xrange(len(job_list[:50]))]
or, even better:
MAX_LEN = 50
...
threads = [... for job in job_list[:MAX_LEN]]
...
return [q.get() for _ in job_list[:MAX_LEN]]
[EDIT]
It seems you want your program to do something different from what it does. Your program takes the first 50 entries in job_list, handles each of them in a thread, and disregards all the other jobs. From your comment below I assume you want to handle all jobs, but only 50 at a time. For this, you should use a thread pool. In Python >= 3.2 you could use concurrent.futures.ThreadPoolExecutor [link].
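For illustration, here is a minimal, untested sketch (not from the original answer) of what fetch_parallel could look like on Python >= 3.2, assuming avhash is rewritten to return its {url: hash} result instead of writing to a queue:

from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 50

def fetch_parallel(job_list):
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        # map() keeps at most MAX_WORKERS jobs in flight and yields
        # the results in the same order as job_list
        return list(executor.map(avhash, job_list))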
In Python < 3.2 you have to roll your own:
CHUNK_SIZE = 50

def fetch_parallel(job_list):
    results = []
    queue = Queue.Queue()
    while job_list:
        threads = [threading.Thread(target=avhash, args=(job, queue))
                   for job in job_list[:CHUNK_SIZE]]
        job_list = job_list[CHUNK_SIZE:]
        for thread in threads:
            thread.daemon = True
            thread.start()
        for thread in threads:
            thread.join()
        results.extend(queue.get() for _ in threads)
    return results
(untested)
[/EDIT]

Related

processing very large text files in parallel using multiprocessing and threading

I have found several other questions that touch on this topic but none that are quite like my situation.
I have several very large text files (3+ gigabytes in size).
I would like to process them (say 2 documents) in parallel using multiprocessing. As part of my processing (within a single process) I need to make an API call, and because of this I would like each process to have its own threads to run asynchronously.
I have come up with a simplified example (I have commented the code to try to explain what I think it should be doing):
import multiprocessing
from threading import Thread
import threading
from queue import Queue
import time

def process_huge_file(*, file_, batch_size=250, num_threads=4):
    # create an APICaller instance for each process, each with its own Queue
    api_call = APICaller()
    batch = []
    # create threads that will run asynchronously to make API calls
    # I expect these to block immediately since there is nothing yet in the
    # Queue (which is what api_call.run depends on to make a call)
    threads = []
    for i in range(num_threads):
        thread = Thread(target=api_call.run)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    ####
    # start processing the file line by line
    for line in file_:
        # if we are at our batch size, add the batch to the api_call to let
        # the threads do their api calling
        if i % batch_size == 0:
            api_call.queue.put(batch)
        else:
            # add fake line to batch
            batch.append(fake_line)

class APICaller:
    def __init__(self):
        # thread-safe queue to feed the threads which point at instances
        # of these APICaller objects
        self.queue = Queue()

    def run(self):
        print("waiting for something to do")
        self.queue.get()
        print("processing item in queue")
        time.sleep(0.1)
        print("finished processing item in queue")

if __name__ == "__main__":
    # fake docs
    fake_line = "this is a fake line of some text"
    # two fake docs with line length == 1000
    fake_docs = [[fake_line] * 1000 for i in range(2)]
    ####
    num_processes = 2
    procs = []
    for idx, doc in enumerate(fake_docs):
        proc = multiprocessing.Process(target=process_huge_file, kwargs=dict(file_=doc))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()
As the code is now, "waiting for something to do" prints 8 times (which makes sense: 4 threads per process) and then it stops or "deadlocks", which is not what I expect. I expect it to start sharing time with the threads as soon as I start putting items in the Queue, but the code does not appear to make it that far. I would ordinarily step through to find the hang-up, but I still don't have a solid understanding of how best to debug with threads (another topic for another day).
In the meantime, can someone help me figure out why my code is not doing what it should be doing?
I have made a few adjustments and additions, and the code appears to do what it is supposed to now. The main adjustments are: adding a CloseableQueue class (from Brett Slatkin's Effective Python, Item 55), and ensuring that I call close and join on the queue so that the threads properly exit. Full code with these changes below:
import multiprocessing
from threading import Thread
import threading
from queue import Queue
import time
from concurrency_utils import CloseableQueue

def sync_process_huge_file(*, file_, batch_size=250):
    batch = []
    for idx, line in enumerate(file_):
        # do processing on the text
        if idx % batch_size == 0:
            time.sleep(0.1)
            batch = []
            # api_call.queue.put(batch)
        else:
            computation = 0
            for i in range(100000):
                computation += i
            batch.append(line)

def process_huge_file(*, file_, batch_size=250, num_threads=4):
    api_call = APICaller()
    batch = []
    # api call threads
    threads = []
    for i in range(num_threads):
        thread = Thread(target=api_call.run)
        threads.append(thread)
        thread.start()
    for idx, line in enumerate(file_):
        # do processing on the text
        if idx % batch_size == 0:
            api_call.queue.put(batch)
        else:
            computation = 0
            for i in range(100000):
                computation += i
            batch.append(line)
    for _ in threads:
        api_call.queue.close()
    api_call.queue.join()
    for thread in threads:
        thread.join()

class APICaller:
    def __init__(self):
        self.queue = CloseableQueue()

    def run(self):
        for item in self.queue:
            print("waiting for something to do")
            pass
            print("processing item in queue")
            time.sleep(0.1)
            print("finished processing item in queue")
        print("exiting run")

if __name__ == "__main__":
    # fake docs
    fake_line = "this is a fake line of some text"
    # two fake docs with 10000 lines each
    fake_docs = [[fake_line] * 10000 for i in range(2)]
    ####
    time_s = time.time()
    num_processes = 2
    procs = []
    for idx, doc in enumerate(fake_docs):
        proc = multiprocessing.Process(target=process_huge_file, kwargs=dict(file_=doc))
        proc.start()
        procs.append(proc)
    for proc in procs:
        proc.join()
    time_e = time.time()
    print(f"took {time_e-time_s} ")

class CloseableQueue(Queue):
    SENTINEL = object()

    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def close(self):
        self.put(self.SENTINEL)

    def __iter__(self):
        while True:
            item = self.get()
            try:
                if item is self.SENTINEL:
                    return  # exit thread
                yield item
            finally:
                self.task_done()
As expected, this is a big speedup over running synchronously: from 120 seconds down to 50 seconds.

Multiple threads in while loop

I have a simple problem / question about the below code.
ip = '192.168.0.'
count = 0
while count <= 255:
    print(count)
    count += 1
    for i in range(10):
        ipg = ip + str(count)
        t = Thread(target=conn, args=(ipg, 80))
        t.start()
I want to execute 10 threads at a time, wait for them to finish, and then continue with the next 10 threads until count reaches 255.
I understand my problem and why it executes 10 threads for every count increase, but not how to solve it. Any help would be appreciated.
This can easily be achieved using the concurrent.futures library.
Here's the example code:
from concurrent.futures import ThreadPoolExecutor

ip = '192.168.0.'
THREAD_COUNT = 10

def work_done(future):
    result = future.result()
    # work with your result here

def main():
    count = 0
    with ThreadPoolExecutor(THREAD_COUNT) as executor:
        while count <= 255:
            count += 1
            ipg = ip + str(count)
            executor.submit(conn, ipg, 80).add_done_callback(work_done)

if __name__ == '__main__':
    main()
Here the executor returns a future for every task it submits.
Keep in mind that with add_done_callback(), the finished task returns to the main thread (which would block your main thread). If you really want true parallelism, then you should wait for the future objects separately. Here's a code snippet for that:
from concurrent.futures import ThreadPoolExecutor, wait

futures = []
with ThreadPoolExecutor(THREAD_COUNT) as executor:
    while count <= 255:
        count += 1
        ipg = ip + str(count)
        futures.append(executor.submit(conn, ipg, 80))

done, not_done = wait(futures)
for future in done:
    result = future.result()
    # work with your result here
hope this helps!
There are two viable options: multiprocessing with ThreadPool, as #martineau suggested, and using a queue (a short ThreadPool sketch follows after the output below). Here's an example with a queue that executes requests concurrently in 10 different threads. Note that it doesn't do any kind of batching; as soon as a thread completes, it picks up the next task without caring about the status of the other workers:
import queue
import threading

def conn():
    try:
        while True:
            ip, port = que.get_nowait()
            print('Connecting to {}:{}'.format(ip, port))
            que.task_done()
    except queue.Empty:
        pass

que = queue.Queue()
for i in range(256):
    que.put(('192.168.0.' + str(i), 80))

# Start workers
threads = [threading.Thread(target=conn) for _ in range(10)]
for t in threads:
    t.start()

# Wait for the queue to empty
que.join()

# Wait for workers to die
for t in threads:
    t.join()
Output:
Connecting to 192.168.0.0:80
Connecting to 192.168.0.1:80
Connecting to 192.168.0.2:80
Connecting to 192.168.0.3:80
Connecting to 192.168.0.4:80
Connecting to 192.168.0.5:80
Connecting to 192.168.0.6:80
Connecting to 192.168.0.7:80
Connecting to 192.168.0.8:80
Connecting to 192.168.0.9:80
Connecting to 192.168.0.10:80
Connecting to 192.168.0.11:80
...
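For reference, here is a minimal sketch (not part of the original answer) of the other option mentioned above, multiprocessing.pool.ThreadPool (Python 3), using a stand-in conn() in place of the one from the question:

from multiprocessing.pool import ThreadPool

def conn(ip, port):
    # stand-in for the conn() used in the question
    print('Connecting to {}:{}'.format(ip, port))

pool = ThreadPool(processes=10)
# starmap() unpacks each (ip, port) tuple into conn's arguments
pool.starmap(conn, [('192.168.0.' + str(i), 80) for i in range(256)])
pool.close()
pool.join()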
I modified your code so that it has the correct logic to do what you want. Please note that I haven't run it, but I hope you'll get the general idea:
import time
from threading import Thread

ip = '192.168.0.'
count = 0
while count <= 255:
    print(count)
    # a list to keep your threads while they're running
    alist = []
    for i in range(10):
        # count must be increased here so the count runs up to 255
        count += 1
        ipg = ip + str(count)
        t = Thread(target=conn, args=(ipg, 80))
        t.start()
        alist.append(t)
    # check if threads are still running
    while len(alist) > 0:
        time.sleep(0.01)
        for t in alist:
            if not t.is_alive():
                # remove completed threads
                alist.remove(t)
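Alternatively, a simpler sketch (not part of the original answer) that waits for each batch of 10 by joining the threads instead of polling is_alive(); it assumes conn() is the function from the question:

from threading import Thread

ip = '192.168.0.'
count = 0
while count <= 255:
    batch = []
    # start up to 10 threads for this batch
    for i in range(10):
        count += 1
        if count > 255:
            break
        t = Thread(target=conn, args=(ip + str(count), 80))
        t.start()
        batch.append(t)
    # join() blocks until every thread in the batch has finished
    for t in batch:
        t.join()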

subprocess does not return control to main process after finishing

I have a Python application where I use processes for computing classification. For communication, the processes use Queues. Everything works fine except that after all sub-processes are done, the main process does not get control back. So, as I understand it, the sub-processes did not terminate. But why?
#!/usr/bin/python
from wraper import *
from multiprocessing import Process, Lock, Queue

def start_threads(data, counter, threads_num, reporter):
    threads = []
    d_lock = Lock()
    c_lock = Lock()
    r_lock = Lock()
    dq = Queue()
    rq = Queue()
    cq = Queue()
    dq.put(data)
    rq.put(reporter)
    cq.put(counter)
    for i in range(threads_num):
        t = Process(target=mule, args=(dq, cq, rq, d_lock, c_lock, r_lock))
        threads.append(t)
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rq.get()

def mule(dq, cq, rq, d_lock, c_lock, r_lock):
    c_lock.acquire()
    counter = cq.get()
    can_continue = counter.next_ok()
    idx = counter.get_features_indeces()
    cq.put(counter)
    c_lock.release()
    while can_continue:
        d_lock.acquire()
        data = dq.get()
        labels, features = data.get_features(idx)
        dq.put(data)
        d_lock.release()
        accuracy = test_classifier(labels, features)
        r_lock.acquire()
        reporter = rq.get()
        reporter.add_result(accuracy[0], idx)
        rq.put(reporter)
        r_lock.release()
        c_lock.acquire()
        counter = cq.get()
        can_continue = counter.next_ok()
        idx = counter.get_features_indeces()
        cq.put(counter)
        c_lock.release()
    print('done')
Each process prints that it did its job, and that's it...

python threading issue

I want to have the results of my threads in a list.
I have the following sample code:
def parallelizer_task(processor, input, callback):
    output = processor(input)
    if callback:
        callback(output)
    return output

class ThreadsParallelizer(Parallelizer):
    def parallelize(self, processors, input=None, callback=None):
        threads = []
        for processor in processors:
            t = threading.Thread(target=parallelizer_task, args=(processor, input, callback))
            threads.append(t)
            t.start()
        return threads

parallelizer = ThreadsParallelizer
But the output I get is the list of thread objects themselves:
* <Thread(Thread-1, started 4418719744)>
* <Thread(Thread-2, started 4425617408)>
* <Thread(Thread-3, started 4429950976)>
Is there a way to get the threads' results in the list?
Yes, for that you can use join(), for example. It forces the main thread to wait until the child threads finish their work. You can then store the results on the threading.Thread objects, something like this:
def parallelizer_task(processor, input, callback):
    output = processor(input)
    if callback:
        callback(output)
    # Attach result to current thread
    thread = threading.currentThread()
    thread.result = output

class ThreadsParallelizer(Parallelizer):
    def parallelize(self, processors, input=None, callback=None):
        threads = []
        for processor in processors:
            t = threading.Thread(...)
            threads.append(t)
            t.start()
        # wait for threads to finish
        for th in threads:
            th.join()
        # do something with results
        results = [th.result for th in threads]
        return results
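As a point of comparison, here is a sketch (not part of the original answer) of the same idea using concurrent.futures on Python 3, which collects results without attaching them to thread objects; it reuses the parallelizer_task function from above:

from concurrent.futures import ThreadPoolExecutor

# hypothetical helper, not from the original code
def parallelize_with_executor(processors, input=None, callback=None):
    with ThreadPoolExecutor(max_workers=len(processors)) as executor:
        futures = [executor.submit(parallelizer_task, p, input, callback)
                   for p in processors]
        # result() blocks until the corresponding task is done
        return [f.result() for f in futures]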

Threads not stop in python

The purpose of my program is to download a file with threads. I define a unit size and use len/unit threads, where len is the length of the file to be downloaded.
With my program the file does get downloaded, but the threads do not stop, and I can't find the reason why.
This is my code...
#! /usr/bin/python
import urllib2
import threading
import os
from time import ctime

class MyThread(threading.Thread):
    def __init__(self, func, args, name=''):
        threading.Thread.__init__(self);
        self.func = func;
        self.args = args;
        self.name = name;
    def run(self):
        apply(self.func, self.args);

url = 'http://ubuntuone.com/1SHQeCAQWgIjUP2945hkZF';
request = urllib2.Request(url);
response = urllib2.urlopen(request);
meta = response.info();
response.close();
unit = 1000000;
flen = int(meta.getheaders('Content-Length')[0]);
print flen;
if flen % unit == 0:
    bs = flen / unit;
else:
    bs = flen / unit + 1;
blocks = range(bs);
cnt = {};
for i in blocks:
    cnt[i] = i;

def getStr(i):
    try:
        print 'Thread %d start.' % (i,);
        fout = open('a.zip', 'wb');
        fout.seek(i * unit, 0);
        if (i + 1) * unit > flen:
            request.add_header('Range', 'bytes=%d-%d' % (i * unit, flen - 1));
        else:
            request.add_header('Range', 'bytes=%d-%d' % (i * unit, (i + 1) * unit - 1));
        #opener = urllib2.build_opener();
        #buf = opener.open(request).read();
        resp = urllib2.urlopen(request);
        buf = resp.read();
        fout.write(buf);
    except BaseException:
        print 'Error';
    finally:
        #opener.close();
        fout.flush();
        fout.close();
        del cnt[i];
        # filelen = os.path.getsize('a.zip');
        print 'Thread %d ended.' % (i),
        print cnt;
        # print 'progress : %4.2f' % (filelen * 100.0 / flen,), '%';

def main():
    print 'download at:', ctime();
    threads = [];
    for i in blocks:
        t = MyThread(getStr, (blocks[i],), getStr.__name__);
        threads.append(t);
    for i in blocks:
        threads[i].start();
    for i in blocks:
        # print 'this is the %d thread;' % (i,);
        threads[i].join();
    #print 'size:', os.path.getsize('a.zip');
    print 'download done at:', ctime();

if __name__ == '__main__':
    main();
Could someone please help me understand why the threads aren't stopping?
I can't really address your code example because it is quite messy and hard to follow, but a potential reason the threads never end is that a request stalls out and never finishes. urllib2 allows you to specify a timeout for how long you will allow a request to take.
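For example, a quick sketch of the timeout idea (urllib2, Python 2; the URL and the 10-second value are just placeholders for illustration):

import urllib2
import socket

try:
    # give up if the server does not respond within 10 seconds
    resp = urllib2.urlopen('http://example.com/somefile', timeout=10)
    buf = resp.read()
except (urllib2.URLError, socket.timeout):
    # the request timed out or failed; retry it or put the work back on the queue
    buf = None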
What I would recommend for your own code is that you split your work up into a queue, start a fixed number of threads (instead of a variable number), and let the worker threads pick up work until it is done. Make the HTTP requests with a timeout. If the timeout expires, try again or put the work back into the queue.
Here is a generic example of how to use a queue, a fixed number of workers and a sync primitive between them:
import threading
import time
from Queue import Queue

def worker(queue, results, lock):
    local_results = []
    while True:
        val = queue.get()
        if val is None:
            break
        # pretend to do work
        time.sleep(.1)
        local_results.append(val)
    with lock:
        results.extend(local_results)
    print threading.current_thread().name, "Done!"

num_workers = 4
threads = []
queue = Queue()
lock = threading.Lock()
results = []

for i in xrange(100):
    queue.put(i)

for _ in xrange(num_workers):
    # Use None as a sentinel to signal the threads to end
    queue.put(None)
    t = threading.Thread(target=worker, args=(queue, results, lock))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

print sorted(results)
print "All done"
