Python multi-threading file processing

Python multi-threading file processing - python

I have few files that resides on a server, im trying to implement a multi threading process to improve the performance, I read a tutorial but have few questions implementing it,
Here are the files,
filelistread = ['h:\\file1.txt', \
'h:\\file2.txt', \
'h:\\file3.txt', \
'h:\\file4.txt']
filelistwrte = ['h:\\file1-out.txt','h:\\file2-out.txt','h:\\file3-out.txt','h:\\file4-out.txt']
def workermethod(inpfile, outfile):
f1 = open(inpfile,'r')
f2 = open(outfile,'w')
x = f1.readlines()
for each in x:
f2.write(each)
f1.close()
f2.close()
How do I implement using the thread class and queue?
I started with the below class but not sure how to pass the inpfile and outputfile to the run method..Any inputs are appreciated
class ThreadUrl(threading.Thread):
def __init__(self,queue):
threading.Thread.__init__(self)
self.queue = queue
def run(self):
while True:
item = self.queue.get()

You're mixing up two different solutions.
If you want to create a dedicated worker thread for each file, you don't need a queue for anything. If you want to create a threadpool and a queue of files, you don't want to pass inpfile and outfile to the run method; you want to put them in each job on the queue.
How do you choose between the two? Well, the first is obviously simpler, but if you have, say, 1000 files to copy, you'll end up creating 1000 threads, which is more threads than you ever want to create, and far more threads than the number of parallel copies the OS will be able to handle. A thread pool lets you create, say, 8 threads, and put 1000 jobs on a queue, and they'll be distributed to the threads as appropriate, so 8 jobs are running at a time.
Let's start with solution 1, a dedicated worker thread for each file.
First, if you aren't married to subclassing Thread, there's really no reason to do so here. You can pass a target function and an args tuple to the default constructor, and then the run method will just do target(*args), exactly as you want. So:
t = threading.Thread(target=workermethod, args=(inpfile, outfile))
That's all you need. When each thread runs, it will call workermethod(inpfile, outfile) and then exit.
However, if you do want to subclass Thread for some reason, you can. You can pass the inpfile and outfile in at construction time, and your run method would just be that workermethod modified to use self.inpfile and self.outfile instead of taking parameters. Like this:
class ThreadUrl(threading.Thread):
def __init__(self, inpfile, outfile):
threading.Thread.__init__(self)
self.inpfile, self.outfile = inpfile, outfile
def run(self):
f1 = open(self.inpfile,'r')
f2 = open(self.outfile,'w')
x = f1.readlines()
for each in x:
f2.write(each)
f1.close()
f2.close()
Either way, I'd suggest using with statements instead of explicit open and close, and getting rid of the readlines (which unnecessarily reads the entire file into memory), unless you need to deal with really old versions of Python:
def run(self):
with open(self.inpfile,'r') as f1, open(self.outfile,'w') as f2:
for line in f1:
f2.write(line)
Now, on to solution 2: a threadpool and a queue.
Again, you don't need a subclass here; the differences between the two ways of doing things are the same as in solution 1. But sticking with the subclass design you've started, you want something like this:
class ThreadUrl(threading.Thread):
def __init__(self,queue):
threading.Thread.__init__(self)
self.queue = queue
def run(self):
while True:
inpfile, outfile = self.queue.get()
workermethod(inpfile, outfile)
Then you start your threads by passing a single queue to all of them:
q = queue.Queue
threadpool = [ThreadUrl(q) for i in range(poolsize)]
And submit jobs like this:
q.put((inpfile, outfile))
If you're going to be doing serious work with threadpools, you may want to look into using a robust, flexible, simple, and optimized implementation instead of coding something up yourself. For example, you might want to be able to cancel jobs, shutdown the queue nicely, join the whole pool instead of joining threads one by one, do batching or smart load balancing, etc.
If you're using Python 3, you should look at the standard-library ThreadPoolExecutor. If you're stuck with Python 2, or can't figure out Futures, you might want to look at the ThreadPool class hidden inside the multiprocessing module. Both of these have the advantage that switching from multithreading to multiprocessing (if, say, it turns out that you have some CPU-bound work that needs to be parallelized along with your IO) is trivial. You can also search PyPI and you'll find multiple other good implementations.
As a side note, you don't want to call the queue queue, because that will shadow the module name. Also, it's a bit confusing to have something called workermethod that's actually a free function rather than a method.
Finally, if all you're doing is copying the files, you probably don't want to read in text mode, or go line by line. In fact, you probably don't want to implement it yourself at all; just use the appropriate copy function from shutil. You can do that with any of the above methods very easily. For example, instead of this:
t = threading.Thread(target=workermethod, args=(inpfile, outfile))
do this:
t = threading.Thread(target=shutil.copyfile, args=(inpfile, outfile))
In fact, it looks like your whole program can be replaced by:
threads = [threading.Thread(target=shutil.copyfile, args=(inpfile, outfile))
for (inpfile, outfile) in zip(filelistread, filelistwrte)]
for thread in threads:
thread.start()
for thread in threads:
thread.join()

Related

Shared pool map between processes with object-oriented python

(python2.7)
I'm trying to do a kind of scanner, that has to walk through CFG nodes, and split in different processes on branching for parallelism purpose.
The scanner is represented by an object of class Scanner. This class has one method traverse that walks through the said graph and splits if necessary.
Here how it looks:
class Scanner(object):
def __init__(self, atrb1, ...):
self.attribute1 = atrb1
self.process_pool = Pool(processes=4)
def traverse(self, ...):
[...]
if branch:
self.process_pool.map(my_func, todo_list).
My problem is the following:
How do I create a instance of multiprocessing.Pool, that is shared between all of my processes ? I want it to be shared, because since a path can be splitted again, I do not want to end with a kind of fork bomb, and having the same Pool will help me to limit the number of processes running at the same time.
The above code does not work, since Pool can not be pickled. In consequence, I have tried that:
class Scanner(object):
def __getstate__(self):
self_dict = self.__dict__.copy()
def self_dict['process_pool']
return self_dict
[...]
But obviously, it results in having self.process_pool not defined in the created processes.
Then, I tried to create a Pool as a module attribute:
process_pool = Pool(processes=4)
def my_func(x):
[...]
class Scanner(object):
def __init__(self, atrb1, ...):
self.attribute1 = atrb1
def traverse(self, ...):
[...]
if branch:
process_pool.map(my_func, todo_list)
It does not work, and this answer explains why.
But here comes the thing, wherever I create my Pool, something is missing. If I create this Pool at the end of my file, it does not see self.attribute1, the same way it did not see answer and fails with an AttributeError.
I'm not even trying to share it yet, and I'm already stuck with Multiprocessing way of doing thing.
I don't know if I have not been thinking correctly the whole thing, but I can not believe it's so complicated to handle something as simple as "having a worker pool and giving them tasks".
Thank you,
EDIT:
I resolved my first problem (AttributeError), my class had a callback as its attribute, and this callback was defined in the main script file, after the import of the scanner module... But the concurrency and "do not fork bomb" thing is still a problem.

What you want to do can't be done safely. Think about if you somehow had a single shared Pool shared across parent and worker processes, with, say, two worker processes. The parent runs a map that tries to perform two tasks, and each task needs to map two more tasks. The two parent dispatched tasks go to each worker, and the parent blocks. Each worker sends two more tasks to the shared pool and blocks for them to complete. But now all workers are now occupied, waiting for a worker to become free; you've deadlocked.
A safer approach would be to have the workers return enough information to dispatch additional tasks in the parent. Then you could do something like:
class MoreWork(object):
def __init__(self, func, *args):
self.func = func
self.args = args
pool = multiprocessing.Pool()
try:
base_task = somefunc, someargs
outstanding = collections.deque([pool.apply_async(*base_task)])
while outstanding:
result = outstanding.popleft().get()
if isinstance(result, MoreWork):
outstanding.append(pool.apply_async(result.func, result.args))
else:
... do something with a "final" result, maybe breaking the loop ...
finally:
pool.terminate()
What the functions are is up to you, they'd just return information in a MoreWork when there was more to do, not launch a task directly. The point is to ensure that by having the parent be solely responsible for task dispatch, and the workers solely responsible for task completion, you can't deadlock due to all workers being blocked waiting for tasks that are in the queue, but not being processed.
This is also not at all optimized; ideally, you wouldn't block waiting on the first item in the queue if other items in the queue were complete; it's a lot easier to do this with the concurrent.futures module, specifically with concurrent.futures.wait to wait on the first available result from an arbitrary number of outstanding tasks, but you'd need a third party PyPI package to get concurrent.futures on Python 2.7.

Multiprocesses python with shared memory

I have an object that connects to a websocket remote server. I need to do a parallel process at the same time. However, I don't want to create a new connection to the server. Since threads are the easier way to do this, this is what I have been using so far. However, I have been getting a huge latency because of GIL. Can I achieve the same thing as threads but with multiprocesses in parallel?
This is the code that I have:
class WebSocketApp(object):
def on_open(self):
# Create another thread to make sure the commands are always been read
print "Creating thread..."
try:
thread.start_new_thread( self.read_commands,() )
except:
print "Error: Unable to start thread"
Is there an equivalent way to do this with multiprocesses?
Thanks!

The direct equivalent is
import multiprocessing
class WebSocketApp(object):
def on_open(self):
# Create another process to make sure the commands are always been read
print "Creating process..."
try:
multiprocessing.Process(target=self.read_commands,).start()
except:
print "Error: Unable to start process"
However, this doesn't address the "shared memory" aspect, which has to be handled a little differently than it is with threads, where you can just use global variables. You haven't really specified what objects you need to share between processes, so it's hard to say exactly what approach you should take. The multiprocessing documentation does cover ways to deal with shared state, however. Do note that in general it's better to avoid shared state if possible, and just explicitly pass state between the processes, either as an argument to the Process constructor or via a something like a Queue.

You sure can, use something along the lines of:
from multiprocessing import Process
class WebSocketApp(object):
def on_open(self):
# Create another thread to make sure the commands are always been read
print "Creating thread..."
try:
p = Process(target = WebSocketApp.read_commands, args = (self, )) # Add other arguments to this tuple
p.start()
except:
print "Error: Unable to start thread"
It is important to note, however, that as soon as the object is sent to the other process the two objects self and self in the different threads diverge and represent different objects. If you wish to communicate you will need to use something like the included Queue or Pipe in the multiprocessing module.
You may need to keep a reference of all the processes (p in this case) in your main thread in order to be able to communicate that your program is terminating (As a still-running child process will appear to hang the parent when it dies), but that depends on the nature of your program.
If you wish to keep the object the same, you can do one of a few things:
Make all of your object properties either single values or arrays and then do something similar to this:
from multiprocessing import Process, Value, Array
class WebSocketApp(object):
def __init__(self):
self.my_value = Value('d', 0.3)
self.my_array = Array('i', [4 10 4])
# -- Snip --
And then these values should work as shared memory. The types are very restrictive though (You must specify their types)
A different answer is to use a manager:
from multiprocessing import Process, Manager
class WebSocketApp(object):
def __init__(self):
self.my_manager = Manager()
self.my_list = self.my_manager.list()
self.my_dict = self.my_manager.dict()
# -- Snip --
And then self.my_list and self.my_dict act as a shared-memory list and dictionary respectively.
However, the types for both of these approaches can be restrictive so you may have to roll your own technique with a Queue and a Semaphore. But it depends what you're doing.
Check out the multiprocessing documentation for more information.

Python Threading stdin/stdout

I have a file which contains a lot of data. Each row is a record. And I am trying to do some ETL work against the whole file. Right now I am using standard input to read the data line by line. The cool thing about this is your script could be very flexible to integrate with other script and shell commands. I write the result to standard output. For example.
$ cat input_file
line1
line2
line3
line4
...
My current python code looks like this - parse.py
import sys
for line in sys.stdin:
result = ETL(line) # ETL is some self defined function which takes a while to execute.
print result
The code below is how it is working right now:
cat input_file | python parse.py > output_file
I have looked at the Threading module of Python and I am wondering if the performance would be dramatically improved if I use that module.
Question1: How should I plan the quotas for each thread, why?
...
counter = 0
buffer = []
for line in sys.stdin:
buffer.append(line)
if counter % 5 == 0: # maybe assign 5 rows to each thread? if not, is there a rule of thumb to determine
counter = 0
thread = parser(buffer)
buffer = []
thread.start()
Question2: Multiple Threads might print the result back to stdout at the same time, how to organize them and avoid the situation below?
import threading
import time
class parser(threading.Thread):
def __init__ (self, data_input):
threading.Thread.__init__(self)
self.data_input = data_input
def run(self):
for elem in self.data_input:
time.sleep(3)
print elem + 'Finished'
work = ['a', 'b', 'c', 'd', 'e', 'f']
thread1 = parser(['a', 'b'])
thread2 = parser(['c', 'd'])
thread3 = parser(['e', 'f'])
thread1.start()
thread2.start()
thread3.start()
The output is really ugly, where one row contains the outputs from two threads.
aFinished
cFinishedeFinished
bFinished
fFinished
dFinished

Taking your second question first, this is what mutexes are for. You can get the cleaner output that you want by using a lock to coordinate among the parsers and ensure that only one thread has access to the output stream during a given period of time:
class parser(threading.Thread):
output_lock = threading.Lock()
def __init__ (self, data_input):
threading.Thread.__init__(self)
self.data_input = data_input
def run(self):
for elem in self.data_input:
time.sleep(3)
with self.output_lock:
print elem + 'Finished'
As regards your first question, note that it's probably the case that multi-threading will provide no benefit for your particular workload. It largely depends on whether the work you do with each input line (your ETL function) is primarily CPU-bound or IO-bound. If the former (which I suspect is likely), threads will be of no help, because of the global interpreter lock. In that case, you would want to use the multiprocessing module to distribute work among multiple processes instead of multiple threads.
But you can get the same results with an easier to implement workflow: Split the input file into n pieces (using, e.g., the split command); invoke the extract-and-transform script separately on each subfile; then concatenate the resulting output files.
One nitpick: "using standard input to read the data line by line because it won't load the whole file into memory" involves a misconception. You can read a file line by line from within Python by, e.g., replacing sys.stdin with a file object in a construct like:
for line in sys.stdin:
See also the readline() method of file objects, and note that read() can take as parameter the maximum number of bytes to read.

Whether threading will be helpful you is highly dependent on on your situation. In particular, if your ETL() function involves a lot of disk access, then threading would likely give you pretty significant speed improvement.
In response to your first question, I've always found that it just depends. There are a lot of factors at play when determining the ideal number of threads, and many of them are program-dependent. If you're doing a lot of disk access (which is pretty slow), for example, then you'll want more threads to take advantage of the downtime while waiting for disk access. If the program is CPU-bound, though, tons of threads may not be super helpful. So, while it may be possible to analyze all the factors to come up with an ideal number of threads, it's usually a lot faster to make an initial guess and then adjust from there.
More specifically, though, assigning a certain number of lines to each thread probably isn't the best way to go about divvying up the work. Consider, for example, if one line takes a particularly long time to process. It would be best if one thread could work away at that one line and the other threads could each do a few more lines in the meantime. The best way to handle this is to use a Queue. If you push each line into a Queue, then each thread can pull a line off the Queue, handle it, and repeat until the Queue is empty. This way, the work gets distributed such that no thread is ever without work to do (until the end, of course).
Now, the second question. You're definitely right that writing to stdout from multiple threads at once isn't an ideal solution. Ideally, you would arrange things so that the writing to stdout happens in only one place. One great way to do that is to use a Queue. If you have each thread write its output to a shared Queue, then you can spawn an additional thread whose sole task is to pull items out of that Queue and print them to stdout. By restricting the printing to just one threading, you'll avoid the issues inherent in multiple threads trying to print at once.

how to download multiple file simultaneously and join them in python?

I have some split files on a remote server.
I have tried downloading them one by one and join them. But it takes a lot of time. I googled and found that simultaneous download might speed up things. The script is on Python.
My pseudo is like this:
url1 = something
url2 = something
url3 = something
data1 = download(url1)
data2 = download(url2)
data3 = download(url3)
wait for all download to complete
join all data and save
Could anyone point me to a direction by which I can load files all simultaneously and wait till they are done.
I have tried by creating a class. But again I can't figure out how to wait till all complete.
I am more interested in Threading and Queue feature and I can import them in my platform.
I have tried with Thread and Queue with an example found on this site. Here is the code pastebin.com/KkiMLTqR . But it does not wait or waits forever..not sure

There are 2 ways to do things simultaneously. Or, really, 2-3/4 or so:
Multiple threads
Or multiple processes, especially if the "things" take a lot of CPU power
Or coroutines or greenlets, especially if there are thousands of "things"
Or pools of one of the above
Event loops (either coded manually)
Or hybrid greenlet/event loop systems like gevent.
If you have 1000 URLs, you probably don't want to do 1000 requests at the same time. For example, web browsers typically only do something like 8 requests at a time. A pool is a nice way to do only 8 things at a time, so let's do that.
And, since you're only doing 8 things at a time, and those things are primarily I/O bound, threads are perfect.
I'll implement it with futures. (If you're using Python 2.x, or 3.0-3.1, you will need to install the backport, futures.)
import concurrent.futures
urls = ['http://example.com/foo',
'http://example.com/bar']
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
result = b''.join(executor.map(download, urls))
with open('output_file', 'wb') as f:
f.write(result)
Of course you need to write the download function, but that's exactly the same function you'd write if you were doing these one at a time.
For example, using urlopen (if you're using Python 2.x, use urllib2 instead of urllib.request):
def download(url):
with urllib.request.urlopen(url) as f:
return f.read()
If you want to learn how to build a thread pool executor yourself, the source is actually pretty simple, and multiprocessing.pool is another nice example in the stdlib.
However, both of those have a lot of excess code (handling weak references to improve memory usage, shutting down cleanly, offering different ways of waiting on the results, propagating exceptions properly, etc.) that may get in your way.
If you look around PyPI and ActiveState, you will find simpler designs like threadpool that you may find easier to understand.
But here's the simplest joinable threadpool:
class ThreadPool(object):
def __init__(self, max_workers):
self.queue = queue.Queue()
self.workers = [threading.Thread(target=self._worker) for _ in range(max_workers)]
def start(self):
for worker in self.workers:
worker.start()
def stop(self):
for _ in range(self.workers):
self.queue.put(None)
for worker in self.workers:
worker.join()
def submit(self, job):
self.queue.put(job)
def _worker(self):
while True:
job = self.queue.get()
if job is None:
break
job()
Of course the downside of a dead-simple implementation is that it's not as friendly to use as concurrent.futures.ThreadPoolExecutor:
urls = ['http://example.com/foo',
'http://example.com/bar']
results = [list() for _ in urls]
results_lock = threading.Lock()
def download(url, i):
with urllib.request.urlopen(url) as f:
result = f.read()
with results_lock:
results[i] = url
pool = ThreadPool(max_workers=8)
pool.start()
for i, url in enumerate(urls):
pool.submit(functools.partial(download, url, i))
pool.stop()
result = b''.join(results)
with open('output_file', 'wb') as f:
f.write(result)

You can use an async framwork like twisted.
Alternatively this is one thing that Python's threads do ok at. Since you are mostly IO bound

How to parse a large file taking advantage of threading in Python?

I have a huge file and need to read it and process.
with open(source_filename) as source, open(target_filename) as target:
for line in source:
target.write(do_something(line))
do_something_else()
Can this be accelerated with threads? If I spawn a thread per line, will this have a huge overhead cost?
edit: To make this question not a discussion, How should the code look like?
with open(source_filename) as source, open(target_filename) as target:
?
#Nicoretti: In an iteration I need to read a line of several KB of data.
update 2: the file may be a bz2, so Python may have to wait for unpacking:
$ bzip2 -d country.osm.bz2 | ./my_script.py

You could use three threads: for reading, processing and writing. The possible advantage is that the processing can take place while waiting for I/O, but you need to take some timings yourself to see if there is an actual benefit in your situation.
import threading
import Queue
QUEUE_SIZE = 1000
sentinel = object()
def read_file(name, queue):
with open(name) as f:
for line in f:
queue.put(line)
queue.put(sentinel)
def process(inqueue, outqueue):
for line in iter(inqueue.get, sentinel):
outqueue.put(do_something(line))
outqueue.put(sentinel)
def write_file(name, queue):
with open(name, "w") as f:
for line in iter(queue.get, sentinel):
f.write(line)
inq = Queue.Queue(maxsize=QUEUE_SIZE)
outq = Queue.Queue(maxsize=QUEUE_SIZE)
threading.Thread(target=read_file, args=(source_filename, inq)).start()
threading.Thread(target=process, args=(inq, outq)).start()
write_file(target_filename, outq)
It is a good idea to set a maxsize for the queues to prevent ever-increasing memory consumption. The value of 1000 is an arbitrary choice on my part.

Does the processing stage take relatively long time, ie, is it cpu-intenstive? If not, then no, you dont win much by threading or multiprocessing it. If your processing is expensive, then yes. So, you need to profile to know for sure.
If you spend relatively more time reading the file, ie it is big, than processing it, then you can't win in performance by using threads, the bottleneck is just the IO which threads dont improve.

This is the exact sort of thing which you should not try to analyse a priori, but instead should profile.
Bear in mind that threading will only help if the per-line processing is heavy. An alternative strategy would be to slurp the whole file into memory, and process it in memory, which may well obviate threading.
Whether you have a thread per line is, once again, something for fine-tuning, but my guess is that unless parsing the lines is pretty heavy, you may want to use a fixed number of worker threads.
There is another alternative: spawn sub-processes, and have them do the reading, and the processing. Given your description of the problem, I would expect this to give you the greatest speed-up. You could even use some sort of in-memory caching system to speed up the reading, such as memcached (or any of the similar-ish systems out there, or even a relational database).

In CPython, threading is limited by the global interpreter lock — only one thread at a time can actually be executing Python code. So threading only benefits you if either:
you are doing processing that doesn't require the global interpreter lock; or
you are spending time blocked on I/O.
Examples of (1) include applying a filter to an image in the Python Imaging Library, or finding the eigenvalues of a matrix in numpy. Examples of (2) include waiting for user input, or waiting for a network connection to finish sending data.
So whether your code can be accelerated using threads in CPython depends on what exactly you are doing in the do_something call. (But if you are parsing the line in Python then it very unlikely that you can speed this up by launching threads.) You should also note that if you do start launching threads then you will face a synchronization problem when you are writing the results to the target file. There is no guarantee that threads will complete in the same order that they were started, so you will have to take care to ensure that the output comes out in the right order.
Here's a maximally threaded implementation that has threads for reading the input, writing the output, and one thread for processing each line. Only testing will tell you if this faster or slower than the single-threaded version (or Janne's version with only three threads).
from threading import Thread
from Queue import Queue
def process_file(f, source_filename, target_filename):
"""
Apply the function `f` to each line of `source_filename` and write
the results to `target_filename`. Each call to `f` is evaluated in
a separate thread.
"""
worker_queue = Queue()
finished = object()
def process(queue, line):
"Process `line` and put the result on `queue`."
queue.put(f(line))
def read():
"""
Read `source_filename`, create an output queue and a worker
thread for every line, and put that worker's output queue onto
`worker_queue`.
"""
with open(source_filename) as source:
for line in source:
queue = Queue()
Thread(target = process, args=(queue, line)).start()
worker_queue.put(queue)
worker_queue.put(finished)
Thread(target = read).start()
with open(target_filename, 'w') as target:
for output in iter(worker_queue.get, finished):
target.write(output.get())

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.