python threading in a loop - python

I have a project that requires a bunch of large matrices, which are stored in ~200 MB files, to be cross-correlated (i.e. FFT * conj(FFT)) with each other. The number of files is such that I can't just load them all up and then do my processing. On the other hand, reading in each file as I need it is slower than I'd like.
What I have so far is something like:
result = 0
for i in xrange(N_files):
    f1 = file_reader(file_list[i])
    ############################################################################
    # here I want to have file_reader go start reading the next file I'll need #
    ############################################################################
    in_place_processing(f1)
    for j in xrange(i+1, N_files):
        f2 = file_reader(file_list[j])
        ##################################################################
        # here I want to have file_reader go start reading the next file #
        ##################################################################
        in_place_processing(f2)
        result += processing_function(f1, f2)
So basically, I just want to have two threads that will each read a file, give it to me when I ask for it (or as soon as it's done after I ask for it), and then go start reading the next file for when I ask for it. The object the file_reader returns is rather large and complicated, so I'm not sure if multiprocessing is the way to go here...
I've read about threading and queues but can't seem to figure out the part where I ask the thread to go read the file and can proceed with the program while it does. I don't want the threads to simply go about their business in the background -- am I missing a detail here, or is threading not the way to go?

Below is an example using the multiprocessing module that spawns child processes to call your file_reader function and queue up their results. The queue blocks when full, so you can control how many files are read ahead with the QUEUE_SIZE constant.
This uses a standard producer/consumer model of multiprocess communication, with the child processes acting as producers and the main thread as the consumer. The join call in the class destructor ensures the child process resources are cleaned up properly. There are some print statements interspersed for demonstration purposes.
Additionally, for comparison, I added the ability for the QueuedFileReader class to offload work to a worker thread or to run in the main thread instead of using a child process. This is done by setting the mode parameter at class initialization to MODE_THREADS or MODE_SYNCHRONOUS, respectively.
import multiprocessing as mp
import Queue
import threading
import time
QUEUE_SIZE = 2 #buffer size of queue
## Placeholder for your functions and variables
N_files = 10
file_list = ['file %d' % i for i in range(N_files)]
def file_reader(filename):
    time.sleep(.1)
    result = (filename,'processed')
    return result
def in_place_processing(f):
    time.sleep(.2)
def processing_function(f1,f2):
    print f1, f2
    return id(f1) & id(f2)
MODE_SYNCHRONOUS = 0 #file_reader called in main thread synchronously
MODE_THREADS = 1 #file_reader executed in worker thread
MODE_PROCESS = 2 #file_reader executed in child_process
##################################################
## Class to encapsulate multiprocessing objects.
class QueuedFileReader():
    def __init__(self, idlist, mode=MODE_PROCESS):
        self.mode = mode
        self.idlist = idlist
        if mode == MODE_PROCESS:
            self.queue = mp.Queue(QUEUE_SIZE)
            self.process = mp.Process(target=QueuedFileReader.worker,
                                      args=(self.queue,idlist))
            self.process.start()
        elif mode == MODE_THREADS:
            self.queue = Queue.Queue(QUEUE_SIZE)
            self.thread = threading.Thread(target=QueuedFileReader.worker,
                                           args=(self.queue,idlist))
            self.thread.start()
    @staticmethod
    def worker(queue, idlist):
        for i in idlist:
            queue.put((i, file_reader(file_list[i])))
            print id(queue), 'queued', file_list[i]
        queue.put('done')
    def __iter__(self):
        if self.mode == MODE_SYNCHRONOUS:
            self.index = 0
        return self
    def next(self):
        if self.mode == MODE_SYNCHRONOUS:
            if self.index == len(self.idlist): raise StopIteration
            q = (self.idlist[self.index],
                 file_reader(file_list[self.idlist[self.index]]))
            self.index += 1
        else:
            q = self.queue.get()
            if q == 'done': raise StopIteration
        return q
    def __del__(self):
        if self.mode == MODE_PROCESS:
            self.process.join()
        elif self.mode == MODE_THREADS:
            self.thread.join()
#mode = MODE_PROCESS
mode = MODE_THREADS
#mode = MODE_SYNCHRONOUS
result = 0
for i, f1 in QueuedFileReader(range(N_files),mode):
    in_place_processing(f1)
    for j, f2 in QueuedFileReader(range(i+1,N_files),mode):
        in_place_processing(f2)
        result += processing_function(f1,f2)
If your intermediate values are too large to pass through the Queue, you can execute each iteration of the outer loop in its own process. A handy way to do that would be using the Pool class in multiprocessing as in the example below.
import multiprocessing as mp
import time
## Placeholder for your functions and variables
N_files = 10
file_list = ['file %d' % i for i in range(N_files)]
def file_reader(filename):
    time.sleep(.1)
    result = (filename,'processed')
    return result
def in_place_processing(f):
    time.sleep(.2)
def processing_function(f1,f2):
    print f1, f2
    return id(f1) & id(f2)
def file_task(file_index):
    print file_index
    f1 = file_reader(file_list[file_index])
    in_place_processing(f1)
    task_result = 0
    for j in range(file_index+1, N_files):
        f2 = file_reader(file_list[j])
        in_place_processing(f2)
        task_result += processing_function(f1,f2)
    return task_result
pool = mp.Pool(processes=None) #processes default to mp.cpu_count()
result = 0
for file_result in pool.map(file_task, range(N_files)):
    result += file_result
print 'result', result
#or simply
#result = sum(pool.map(file_task, range(N_files)))
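If you would rather stay with threads, the read-ahead idea can also be sketched in Python 3 with concurrent.futures. This is only an illustration and not the code above: prefetch, its read_ahead parameter and fake_reader are hypothetical names, and the sketch assumes the reader is dominated by file I/O so that a background thread can overlap with the processing in the main thread.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch(load, names, read_ahead=1):
    """Yield load(name) for each name, loading up to `read_ahead` items ahead of the caller."""
    names = list(names)
    with ThreadPoolExecutor(max_workers=max(1, read_ahead)) as pool:
        # Start the first few reads before the caller asks for anything.
        pending = [pool.submit(load, name) for name in names[:read_ahead]]
        for k in range(len(names)):
            nxt = k + read_ahead
            if nxt < len(names):
                # Queue the next read, then hand back the oldest one.
                pending.append(pool.submit(load, names[nxt]))
            yield pending.pop(0).result()


def fake_reader(name):
    # Hypothetical stand-in for file_reader.
    return name.upper()


loaded = list(prefetch(fake_reader, ["a", "b", "c"]))
```

In the nested loop from the question, the inner loop would become something like `for f2 in prefetch(file_reader, file_list[i+1:])`, so the next file is being read while processing_function works on the current one. With read_ahead=1, at most two loaded files from that sequence exist at once, which matters when each file is ~200 MB.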

Related

Python multithreading: memory leak when using a shared object (.so)

I have a Python program that leaks memory when it uses a third-party shared object (SO).
I simplified my code like this:
import time
import sys
import threading
import codecs
import ctypes
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
class TestThirdPartySo(object):
    def __init__(self):
        # this so uses thread-specific data
        self.call_stat_so = ctypes.CDLL("./third_party_fun.so")
        self.handle = self.call_stat_so._handle
    def test_fun(self):
        self.call_stat_so.fun_xxx()
def thread_fun():
    TestThirdPartySo().test_fun()
def test_main(num):
    count = 0
    while True:
        # create thread_num threads on each pass
        thread_num = 3
        thread_list = []
        for _ in range(thread_num):
            thread_list.append(threading.Thread(target=thread_fun))
        for thread in thread_list:
            thread.start()
        for thread in thread_list:
            thread.join()
        count += thread_num
        time.sleep(0.01)
        if count % 100 == 0:
            print("finished %s" % count)
        if count > num:
            break
    print("end !!!!")
if __name__ == '__main__':
    num = sys.argv[1]
    test_main(int(num))
Now, I know this shared object uses thread-specific data. I have tried to close the SO after calling it, like this:
class TestThirdPartySo(object):
    def __init__(self):
        # this so uses thread-specific data
        self.call_stat_so = ctypes.CDLL("./third_party_fun.so")
        self.handle = self.call_stat_so._handle
    def test_fun(self):
        self.call_stat_so.fun_xxx()
    def __del__(self):
        dlclose_func(self.handle)
def dlclose_func(_handle):
    dlclose_func_tmp = ctypes.cdll.LoadLibrary('libdl.so').dlclose
    dlclose_func_tmp.argtypes = [ctypes.c_void_p]
    dlclose_func_tmp(_handle)
But I failed to close the SO, and I'm also not sure whether the leaked memory would be freed after closing it.
If the program does not use multiple threads, or creates a fixed number of threads (a thread pool), it works fine.
For some reason, I need to create threads constantly in my program. What can I do to prevent this memory leak?

Can a python pool worker return values from initialization?

TL;DR I want to collect the accumulated data in the globals of each worker when the pool is finished processing
Description of what I think I'm missing
As I'm new to multiprocessing, I don't know of all the features that exist. I am looking for a way to make a worker return the value it was initialized with (after manipulating that value a bunch of millions of times). Then, I hope I can collect and merge all these values at the end of the program when all the 'jobs' are done.
import multiprocessing as mp
from collections import defaultdict, Counter
from customtools import load_regexes #, . . .
import gzip
import nltk
result_dict = None
regexes = None
def create_worker():
    global result_dict
    global regexes
    result_dict = defaultdict(Counter) # I want to return this at the end
    # these are a bunch of huge regexes
    regexes = load_regexes()
These functions represent the way I load and process data. The data is a big gzip file with articles.
def load_data(semaphore):
    with gzip.open('some10Gbfile') as f:
        for line in f:
            semaphore.acquire()
            yield str(line, 'utf-8')
def worker_job(line):
    global regexes
    global result_dict
    hits = defaultdict(Counter)
    for sent in nltk.sent_tokenize(line[3:]):
        for rename, regex in regexes.items():
            for hit in regex.finditer(sent):
                hits[rename][hit.group(0)]+=1
    # and more and more... results = _filter(_extract(hits))
    # store some data in result_dict here . . .
    return filtered_hits
class ResultEater():
    def __init__(self):
        self.wordscounts=defaultdict(Counter)
        self.filtered=Counter()
    def eat_results(self, filtered_hits):
        for k, v in filtered_hits.items():
            for i, c in v.items():
                self.wordscounts[k][i]+=c
This is the main program
if __name__ == '__main__':
    pool = mp.Pool(mp.cpu_count(), initializer=create_worker)
    semaphore = mp.Semaphore(50)
    loader = load_data(semaphore)
    results = ResultEater()
    for intermediate_result in pool.imap_unordered(worker_job, loader, chunksize=10):
        results.eat_results(intermediate_result)
        semaphore.release()
    # results.eat_workers(the_leftover_workers_or_something)
    results.print()
I don't fully understand why returning the data incrementally isn't sufficient, but it seems you need some sort of finalization function to send the data back, similar to how you have an initialization function. Unfortunately, I don't think this exists for mp.Pool, so you will need to use a few mp.Process instances, sending input arguments and returning results through a couple of mp.Queue objects.
On a side note, your use of Semaphore is unnecessary, as the "load_data" iterator is always consumed in the main process. I have moved that to a separate "producer" process, which puts inputs on a queue that is already synchronized. This gives you one process for gathering inputs, several processes for turning inputs into outputs, and leaves the main (parent) process to gather outputs. If the "producer" generating the inputs is IO limited by file read speed (very likely), it could also be a thread rather than a process, but in this case the difference is probably minimal.
I have created an example of a custom "Pool" which lets you return some data at the end of each worker's "life" using the aforementioned "producer-consumer" scheme. There are print statements to track what is going on in each process, but please also read the comments to see what is happening and why:
import multiprocessing as mp
from time import sleep
from queue import Empty
class ExitFlag:
    def __init__(self, exit_value=None):
        self.exit_value = exit_value #optionally pass value along with exit flag
def producer_func(input_q, n_workers):
    for i in range(100): #100 lines of some long file
        print(f"put {i}")
        input_q.put(i) #put each line of the file to the work queue
    print('stopping consumers')
    for i in range(n_workers):
        input_q.put(ExitFlag()) #send shut down signal to each of the workers
    print('producer exiting')
def consumer_func(input_q, output_q, work_func):
    counter = 0
    while True:
        try:
            #never wait forever on a "get". It's a recipe for deadlock.
            #timeout must be passed by keyword: the first positional argument is "block"
            item = input_q.get(timeout=.1)
        except Empty:
            continue
        print(f"get {item}")
        if isinstance(item, ExitFlag):
            break
        else:
            counter += 1
            output_q.put(work_func(item))
    output_q.put(ExitFlag(exit_value=counter))
    print('consumer exiting')
def work_func(number):
    sleep(.1) #some heavy nltk work...
    return number*2
if __name__ == '__main__':
    input_q = mp.Queue(maxsize=10) #only bother limiting size if you have memory usage constraints
    output_q = mp.Queue(maxsize=10)
    n_workers = mp.cpu_count()
    producer = mp.Process(target=producer_func, args=(input_q, n_workers)) #generate the input from another process. (this could just as easily be a thread as it seems it will be IO limited anyway)
    producer.start()
    consumers = [mp.Process(target=consumer_func, args=(input_q, output_q, work_func)) for _ in range(n_workers)]
    for c in consumers: c.start()
    total = 0
    stop_signals = 0
    exit_values = []
    while True:
        try:
            item = output_q.get(timeout=.1)
        except Empty:
            continue
        if isinstance(item, ExitFlag):
            stop_signals += 1
            if item.exit_value is not None:
                exit_values.append(item.exit_value) #do something with the return at the end
            if stop_signals >= n_workers: #stop waiting for more results once all consumers finish
                break
        else:
            total += item #do something with the incremental return values
    print(total)
    print(exit_values)
    #cleanup
    producer.join()
    print("producer joined")
    for c in consumers: c.join()
    print("consumers joined")
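To connect this back to the question, here is a rough sketch of a worker that keeps a Counter to itself and only sends it when it is told to stop. The names counting_worker, count_words and STOP are made up for this illustration, word splitting stands in for the regex work, and for brevity it uses a plain blocking get with a sentinel; the timeout loop from the example above could be substituted.

```python
import multiprocessing as mp
from collections import Counter

STOP = None  # hypothetical sentinel; assumes None is never a real input line


def counting_worker(input_q, output_q):
    """Accumulate counts locally and send them once, at shutdown."""
    totals = Counter()
    while True:
        line = input_q.get()
        if line is STOP:
            break
        totals.update(line.split())  # stand-in for the heavy per-line work
    output_q.put(totals)             # the "finalization" step


def count_words(lines, n_workers=2):
    """Feed lines to worker processes and merge their final Counters."""
    input_q, output_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=counting_worker, args=(input_q, output_q))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for line in lines:
        input_q.put(line)
    for _ in workers:
        input_q.put(STOP)            # one stop signal per worker
    merged = Counter()
    for _ in workers:
        merged.update(output_q.get())  # collect results before joining
    for w in workers:
        w.join()
    return merged
```

Called from under an `if __name__ == '__main__':` guard, `count_words(["a b a", "b c"])` should give a Counter with a: 2, b: 2, c: 1, regardless of which worker saw which line.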

Multiprocessing: Passing a class instance to pool.map

I swear I saw the following in an example somewhere, but now I can't find that example and this isn't working. The __call__ class function never gets called.
EDIT: Code updated
pool.map appears to start the QueueWriter instance and the __call__ function is reached. However, the workers never seem to start or at least no results are pulled from the queue. Is my queue set up the right way? Why do the workers not fire off?
import multiprocessing as mp
import os
import random
class QueueWriter(object):
    def __init__(self, **kwargs):
        self.grid = kwargs.get("grid")
        self.path = kwargs.get("path")
    def __call__(self, q):
        print self.path
        log = open(self.path, "a", 1)
        log.write("QueueWriter called.\n")
        while 1:
            res = q.get()
            if res == 'kill':
                self.log.write("QueueWriter received 'kill' message. Closing Writer.\n")
                break
            else:
                self.log.write("This is where I'd write: {0} to grid file.\n".format(res))
        log.close()
        log = None
class Worker(object):
    def __init__(self, **kwargs):
        self.queue = kwargs.get("queue")
        self.grid = kwargs.get("grid")
    def __call__(self, idx):
        res = self.workhorse(self, idx)
        self.queue.put((idx,res))
        return res
    def workhorse(self,idx):
        #in reality a fairly complex operation
        return self.grid[idx] ** self.grid[idx]
if __name__ == '__main__':
    # log = open(os.path.expanduser('~/minimal.log'), 'w',1)
    path = os.path.expanduser('~/minimal.log')
    pool = mp.Pool(mp.cpu_count())
    manager = mp.Manager()
    q = manager.Queue()
    grid = [random.random() for _ in xrange(10000)]
    # in actuality grid is a shared resource, read by Workers and written
    # to by QueueWriter
    qWriter = QueueWriter(grid=grid, path=path)
    watcher = pool.map(qWriter, (q,),1)
    wrkr = Worker(queue=q,grid=grid)
    result = pool.map(wrkr, range(10000), 1)
    result.get()
    q.put('kill')
    pool.close()
    pool.join()
So the log does indeed print the initialization message, but then __call__ function is never called. Is this one of those pickling issues I've seen discussed so often? I've found answers about class member functions, but what about class instances?
At the gentle and patient prodding of martineau (thanks!) I think I've ironed out the problems. I have yet to apply it to my original code, but it is working in the example above and I'll start new questions for future implementation problems.
So in addition to changing where in the code the target file (the log, in this example) gets opened, I also started the QueueWriter instance as a single multiprocessing process rather than using pool.map. As martineau pointed out the map call blocks until the qWriter.__call__() returns and this prevented the workers from being called.
There were some other bugs in the code above, but those were incidental and fixed below:
import multiprocessing as mp
import os
import random
class QueueWriter(object):
    def __init__(self, **kwargs):
        self.grid = kwargs.get("grid")
        self.path = kwargs.get("path")
    def __call__(self, q):
        print self.path
        log = open(self.path, "a", 1)
        log.write("QueueWriter called.\n")
        while 1:
            res = q.get()
            if res == 'kill':
                log.write("QueueWriter received 'kill' message. Closing Writer.\n")
                break
            else:
                log.write("This is where I'd write: {0} to grid file.\n".format(res))
        log.close()
        log = None
class Worker(object):
    def __init__(self, **kwargs):
        self.queue = kwargs.get("queue")
        self.grid = kwargs.get("grid")
    def __call__(self, idx):
        res = self.workhorse(idx)
        self.queue.put((idx,res))
        return res
    def workhorse(self,idx):
        #in reality a fairly complex operation
        return self.grid[idx] ** self.grid[idx]
if __name__ == '__main__':
    # log = open(os.path.expanduser('~/minimal.log'), 'w',1)
    path = os.path.expanduser('~/minimal.log')
    pool = mp.Pool(mp.cpu_count())
    manager = mp.Manager()
    q = manager.Queue()
    grid = [random.random() for _ in xrange(10000)]
    # in actuality grid is a shared resource, read by Workers and written
    # to by QueueWriter
    qWriter = QueueWriter(grid=grid, path=path)
    # watcher = pool.map(qWriter, (q,),1)
    # Start the writer as a single process rather than a pool
    p = mp.Process(target=qWriter, args=(q,))
    p.start()
    wrkr = Worker(queue=q,grid=grid)
    result = pool.map(wrkr, range(10000), 1)
    # result.get()
    # not required for pool
    q.put('kill')
    pool.close()
    p.join()
    pool.join()
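For anyone puzzled by the result.get() line: map blocks and returns a plain list, while map_async returns immediately with an AsyncResult, and only that object has .get(). A small Python 3 sketch of the difference follows. ThreadPool is used here purely so the snippet runs without a `__main__` guard (an assumption of this illustration, not something from the code above); mp.Pool exposes the same map and map_async methods. The function square is a made-up placeholder.

```python
from multiprocessing.pool import ThreadPool


def square(x):
    # Placeholder for the real per-item work.
    return x * x


with ThreadPool(2) as pool:
    # map() blocks until every item is done and returns a list.
    blocking = pool.map(square, range(5))
    # map_async() returns an AsyncResult straight away...
    pending = pool.map_async(square, range(5))
    # ...and .get() is how the finished list is retrieved from it.
    deferred = pending.get(timeout=10)
```

This is also why running the long-lived writer through pool.map stalled everything: the call could not return until the writer's loop ended.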

Python - start two processes to run indefinitely

I have a simple example script constructed that defines three separate processes using multiprocessing in python. My objective is to have one parent thread that spawns two smaller threads that will collect and process data.
Currently, my implementation looks like this:
from Queue import Queue,Empty
from multiprocessing import Process
import time
import hashlib
class FillQueue(Process):
    def __init__(self,q):
        Process.__init__(self)
        self.q = q
    def run(self):
        i = 0
        while i is not 5:
            print 'putting'
            self.q.put('foo')
            i+=1
        self.q.put('|STOP|')
class ConsumeQueue(Process):
    def __init__(self,q):
        Process.__init__(self)
        self.q = q
    def run(self):
        print 'Consume'
        while True:
            try:
                value = self.q.get(False)
                print value
                if value == '|STOP|':
                    print 'done'
                    break;
            except Empty:
                print 'Nothing to process atm'
class Ripper(Process):
    q = Queue()
    def __init__(self):
        self.fq = FillQueue(self.q)
        self.cq = ConsumeQueue(self.q)
        self.fq.daemon = True
        self.cq.daemon = True
    def run(self):
        try:
            self.fq.start()
            self.cq.start()
        except KeyboardInterrupt:
            print 'exit'
if __name__ == '__main__':
    r = Ripper()
    r.start()
As it runs presently, the output from the script on CLI looks like this:
putting
putting
putting
putting
putting
Consume
foo
foo
foo
foo
foo
|STOP|
done
Obviously, the way I am starting my two threads is blocking, since the consumer doesn't even begin to process the items in the queue until the filler finishes adding items.
How should I rewrite this to make both threads begin immediately and not block, so the consumer will simply pass to the Empty except block while there is no work to process, but will exit completely when it receives the stop message?
EDIT: typo, had the start and run methods mixed up
You seem to be starting multiple processes using multiprocessing.Process.
However, you are using Queue.Queue, which is only thread-safe and is not designed to be shared between processes.
shevek's answer is valid as well, but as a start, you should replace Queue.Queue with multiprocessing.Queue.
Try this:
from Queue import Empty
from multiprocessing import Process, Queue
import time
import hashlib
class FillQueue(object):
    def __init__(self, q):
        self.q = q
    def run(self):
        i = 0
        while i < 5:
            print 'putting'
            self.q.put('foo %d' % i )
            i+=1
            time.sleep(.5)
        self.q.put('|STOP|')
class ConsumeQueue(object):
    def __init__(self, q):
        self.q = q
    def run(self):
        while True:
            try:
                value = self.q.get(False)
                print value
                if value == '|STOP|':
                    print 'done'
                    break;
            except Empty:
                print 'Nothing to process atm'
                time.sleep(.2)
if __name__ == '__main__':
    q = Queue()
    f = FillQueue(q)
    c = ConsumeQueue(q)
    p1 = Process(target=f.run)
    p1.start()
    p2 = Process(target=c.run)
    p2.start()
    p1.join()
    p2.join()
I think your program works fine. The CPU processes only one thing at a time, for a short time. However, the time required to put all your stuff in the queue is very short. So there is no reason that the filler cannot do this in one time slice.
If you add some delays in the filler, I think you should see that it actually works as you expect.
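A Python 3 sketch of the same experiment, under a few assumptions of mine: the helper names fill and consume are invented, the consumer runs in the parent process rather than in a second child, and it waits with get(timeout=...) instead of spinning on get(False). With the delay in the producer, the consumer should report some idle polls between items.

```python
import multiprocessing as mp
import queue  # only for the Empty exception, which mp.Queue raises too
import time

STOP = "|STOP|"


def fill(q, n=5, delay=0.05):
    """Producer: put n items, pausing between them, then the stop marker."""
    for i in range(n):
        q.put("foo %d" % i)
        time.sleep(delay)  # slow producer, so the consumer gets to run in between
    q.put(STOP)


def consume(q, poll=0.01):
    """Consumer: return (items, idle_polls) once the stop marker arrives."""
    items, idle = [], 0
    while True:
        try:
            value = q.get(timeout=poll)  # wait briefly instead of busy-looping
        except queue.Empty:
            idle += 1                    # nothing to process at the moment
            continue
        if value == STOP:
            return items, idle
        items.append(value)


if __name__ == "__main__":
    q = mp.Queue()
    producer = mp.Process(target=fill, args=(q,))
    producer.start()
    print(consume(q))
    producer.join()
```

Setting delay to 0 should reproduce the behaviour from the question: the filler finishes almost instantly and the consumer sees few or no empty polls.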

Dynamic processes in Python

I have a question concerning Python multiprocessing. I am trying to take a dataset, break into chunks, and pass those chunks to concurrently running processes. I need to transform large tables of data using simple calculations (eg. electrical resistance -> temperature for a thermistor).
The code listed below almost works as desired, but it doesn't seem to be spawning any new processes (or if so only one at a time).
import datetime, math
from multiprocessing import Process
class Worker(Process):
    # example data transform
    def process(self, x): return (x * 2) / 3
    def __init__(self, list):
        self.data = list
        self.result = map(self.process, self.data)
        super(Worker, self).__init__()
if __name__ == '__main__':
    start = datetime.datetime.now()
    dataset = range(10000) # null dataset
    processes = 3
    for i in range(processes):
        chunk = int(math.floor(len(dataset) / float(processes)))
        if i + 1 == processes:
            remainder = len(dataset) % processes
        else: remainder = 0
        tmp = dataset[i * chunk : (i + 1) * chunk + remainder]
        exec('worker'+str(i)+' = Worker(tmp)')
        exec('worker'+str(i)+'.start()')
    for i in range(processes):
        exec('worker'+str(i)+'.join()')
        # just a placeholder to make sure the initial values of the set are as expected
        exec('print worker'+str(i)+'.result[0]')
No need to send the number of chunks to each process, just use get_nowait() and handle the eventual Queue.Empty exception. Every process will get different amounts of CPU time and this should keep them all busy.
import multiprocessing, Queue
class Worker(multiprocessing.Process):
    def process(self, x):
        for i in range(15):
            x += (float(i) / 2.6)
        return x
    def __init__(self, input, output):
        self.input = input
        self.output = output
        super(Worker, self).__init__()
    def run(self):
        try:
            while True:
                self.output.put(self.process(self.input.get_nowait()))
        except Queue.Empty:
            pass
if __name__ == '__main__':
    dataset = range(10)
    processes = multiprocessing.cpu_count()
    input = multiprocessing.Queue()
    output = multiprocessing.Queue()
    for obj in dataset:
        input.put(obj)
    for i in range(processes):
        Worker(input, output).start()
    for i in range(len(dataset)):
        print output.get()
You haven't overridden the run method. There are two ways with processes (or threads) to have it execute code:
Create a process specifying target
Subclass the process, overriding the run method.
Overriding __init__ just means your process is all dressed up with nowhere to go. It should be used to give it attributes that it needs to perform what it needs to perform, but it shouldn't specify the task to be performed.
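The two options can be sketched side by side in Python 3. I am using threading.Thread here only so the results are visible in the same interpreter (my choice for the illustration); multiprocessing.Process accepts the same target/args arguments and the same run override. The transform function and the results list are placeholder names.

```python
import threading

results = []


def transform(data):
    # Placeholder task, same arithmetic as the question's example transform.
    results.append([(x * 2) / 3 for x in data])


# 1. Create the thread (or process) specifying a target.
t1 = threading.Thread(target=transform, args=([3, 6],))


# 2. Subclass and override run(); __init__ only stores what run() needs.
class Worker(threading.Thread):
    def __init__(self, data):
        super().__init__()
        self.data = data
        self.result = None

    def run(self):
        self.result = [(x * 2) / 3 for x in self.data]


t2 = Worker([3, 6])
for t in (t1, t2):
    t.start()  # nothing is computed until start() is called
for t in (t1, t2):
    t.join()
```

Either way, the work happens in run() (directly or via target), not in the constructor.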
In your code, all the heavy lifting is done in this line:
exec('worker'+str(i)+' = Worker(tmp)')
and nothing is done here:
exec('worker'+str(i)+'.start()')
So checking the results with exec('print worker'+str(i)+'.result[0]') should give you something meaningful, but only because the code you want to be executed has been executed, but on process construction, not on process start.
Try this:
class Worker(Process):
    # example data transform
    def process(self, x): return (x * 2) / 3
    def __init__(self, list):
        self.data = list
        self.result = []
        super(Worker, self).__init__()
    def run(self):
        self.result = map(self.process, self.data)
EDIT:
Okay... so I was just flying based on my threading instincts here, and they were all wrong. What we both didn't understand about processes is that you can't directly share variables. Whatever you pass to a new process to start is read, copied, and gone forever. Unless you use one of the two standard ways to share data: queues and pipes. I've played around a little bit trying to get your code to work, but so far no luck. I think that will put you on the right track.
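A minimal Python 3 sketch of that point, assuming a queue is used to carry the values back (the transform helper and the sample numbers are made up for illustration): whatever run() assigns to self.result stays in the child's copy of the object, so the child has to send it to the parent explicitly.

```python
import multiprocessing as mp


def transform(x):
    # Same arithmetic as the example data transform.
    return (x * 2) / 3


class Worker(mp.Process):
    def __init__(self, data, output):
        super().__init__()
        self.data = data
        self.output = output
        self.result = []  # this attribute lives in the parent's copy

    def run(self):
        # Runs in the child: this assignment never reaches the parent...
        self.result = [transform(x) for x in self.data]
        # ...so the values are sent back through the queue instead.
        self.output.put(self.result)


if __name__ == "__main__":
    out = mp.Queue()
    w = Worker([3, 6, 9], out)
    w.start()
    received = out.get()  # read before join() so the child can flush and exit
    w.join()
    print(w.result)       # still [] in the parent
    print(received)       # [2.0, 4.0, 6.0]
```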
Ok, so it looks like the list was not thread safe, and I have moved to using a Queue (although it appears to be much slower). This code essentially accomplishes what I was trying to do:
import math, multiprocessing
class Worker(multiprocessing.Process):
    def process(self, x):
        for i in range(15):
            x += (float(i) / 2.6)
        return x
    def __init__(self, input, output, chunksize):
        self.input = input
        self.output = output
        self.chunksize = chunksize
        super(Worker, self).__init__()
    def run(self):
        for x in range(self.chunksize):
            self.output.put(self.process(self.input.get()))
if __name__ == '__main__':
    dataset = range(10)
    processes = multiprocessing.cpu_count()
    input = multiprocessing.Queue()
    output = multiprocessing.Queue()
    for obj in dataset:
        input.put(obj)
    for i in range(processes):
        chunk = int(math.floor(len(dataset) / float(processes)))
        if i + 1 == processes:
            remainder = len(dataset) % processes
        else: remainder = 0
        Worker(input, output, chunk + remainder).start()
    for i in range(len(dataset)):
        print output.get()
