Python, read many files and merge the results

I might be asking a very basic question, but I really can't figure out how to build a simple parallel application in Python.
I am running my scripts on a machine with 16 cores and I would like to use all of them efficiently. I have 16 huge files to read, and I would like each core to read one file and then merge the results.
Here I give a quick example of what I would like to do:
parameter1_glob = []
parameter2_glob = []
for cpu in arange(0, 16):
    parameter1, parameter2 = loadtxt('file' + str(cpu) + '.dat', unpack=True)
    parameter1_glob.append(parameter1)
    parameter2_glob.append(parameter2)
I think that the multiprocessing module might help but I couldn't understand how to apply it to what I want to do.

I agree with what Colin Dunklau said in his comment: this process will bottleneck on reading and writing the files, and the CPU demands are minimal. Even if you had 17 dedicated drives, you wouldn't max out even one core. Additionally, though I realize this is tangential to your actual question, you'll likely run into memory limitations with these "huge" files - loading 16 files into memory as arrays and then combining them into another file will almost certainly take more memory than you have.
You may find better results looking into shell scripting this problem. In particular, GNU sort uses a memory-efficient merge sort to sort one or more files very rapidly - much faster than all but the most carefully written applications in Python or most other languages.
I would suggest avoiding any sort of multi-threading effort; it will dramatically add to the complexity, with minimal benefit. Be sure to keep as little of the file(s) in memory at a time as possible, or you'll run out quickly. In any case, you will absolutely want the reading and the writing running on two separate disks. The slowdown associated with reading and writing to the same disk simultaneously is tremendously painful.
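If, despite those caveats, you still want to try the parallel loads from the question, here is a minimal, untested sketch with multiprocessing.Pool (it assumes numpy's loadtxt and that the per-file arrays fit in memory):

from multiprocessing import Pool
from numpy import loadtxt

def load_one(cpu):
    # Each worker process reads one file and returns its two columns.
    return loadtxt('file' + str(cpu) + '.dat', unpack=True)

if __name__ == '__main__':
    pool = Pool(processes=16)
    results = pool.map(load_one, range(16))   # one task per file
    pool.close()
    pool.join()
    # Collect the per-file columns, as in the loop from the question.
    parameter1_glob = [p1 for p1, p2 in results]
    parameter2_glob = [p2 for p1, p2 in results]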

Do you want to merge line by line? Sometimes coroutines are more interesting for I/O-bound applications than classic multitasking. You can chain generators and coroutines for all sorts of routing, merging, and broadcasting. Blow your mind with this nice presentation by David Beazley.
You can use a coroutine as a sink (untested, please refer to dabeaz's examples):
# The usual priming decorator from dabeaz's coroutine examples
def coroutine(func):
    def start(*args, **kwargs):
        cr = func(*args, **kwargs)
        cr.next()                      # advance to the first yield
        return cr
    return start

# A sink that just prints the lines
@coroutine
def printer():
    while True:
        line = (yield)
        print line,

sources = [
    open('file1'),
    open('file2'),
    open('file3'),
    open('file4'),
    open('file5'),
    open('file6'),
    open('file7'),
]

output = printer()
while sources:
    for source in sources[:]:          # iterate over a copy so removal is safe
        line = source.readline()       # readline() returns '' at EOF
        if not line:                   # EOF
            sources.remove(source)
            source.close()
            continue
        output.send(line)

Assuming that the results from each file are smallish, you could do this with my package jug:
from numpy import arange, loadtxt   # loadtxt/arange as used in the question
from jug import TaskGenerator

loadtxt = TaskGenerator(loadtxt)

@TaskGenerator
def write_parameter(oname, ps):
    with open(oname, 'w') as output:
        for p in ps:
            print >>output, p

parameter1_glob = []
parameter2_glob = []

for cpu in arange(0, 16):
    ps = loadtxt('file' + str(cpu) + '.dat', unpack=True)
    parameter1_glob.append(ps[0])
    parameter2_glob.append(ps[1])

write_parameter('output1.txt', parameter1_glob)
write_parameter('output2.txt', parameter2_glob)
Now, you can run several jug execute jobs in parallel.

Related

Best Approach for I/O Bound Problems?

I am currently running code on an HPC cluster that writes several 16 MB files to disk (in the same directory) for a short period of time and then deletes them. They are written to disk and then deleted sequentially. However, the total number of I/O operations exceeds 20,000 * 12,000.
I am using the joblib module in python2.7 to take advantage of running my code on several cores. It is basically a nested-loop problem, with the outer loop parallelised by joblib and the inner loop run sequentially inside the function. In total it is a 20,000 * 12,000 loop.
The basic skeleton of my code is the following.
from joblib import Parallel, delayed
import subprocess

def f(a, b, c, d):
    cmds = 'path/to/a/bash_script_on_disk with arguments from a,b > \
save_file_to_disk'
    subprocess.check_output(cmds, shell=True)
    cmds1 = 'path/to/a/second_bash_script_on_disk > \
save_file_to_disk'
    subprocess.check_output(cmds1, shell=True)
    # The structure above is repeated several times.
    # However, I do delete the files as soon as I can using:
    cmds2 = 'rm -rf files'
    subprocess.check_output(cmds2, shell=True)
    # This is followed by the second/inner loop.
    for i in range(12000):
        # Do some computation, create and delete files in each iteration.
        pass

if __name__ == '__main__':
    num_cores = 48
    Parallel(n_jobs=num_cores)(delayed(f)(a, b, c, d) for i in range(20000))
    # The 20000 iterations are batched by a wrapper script that sends no
    # more than 48 jobs per node (max. cores available).
This code is extremely slow and the bottleneck is the I/O time. Is this a good use case for temporarily writing files to /dev/shm/? I have 34 GB of space available as tmpfs on /dev/shm/.
Things I already tested:
I tried to set up the same code on a smaller scale on my laptop which has 8 cores. However, writing to /dev/shm/ ran slower than writing to disk.
Side note: (The inner loop could be parallelised too; however, the number of cores I have available is far less than 20,000, which is why I am sticking to this configuration. Please let me know if there are better ways to do this.)
First, do not talk about the total number of I/O operations; on its own that is meaningless. Instead, talk about IOPS and throughput.
Second, it is almost impossible for writing to /dev/shm/ to be slower than writing to disk. Please provide more information. You can test I/O performance using fio, for example: sudo fio --name fio_test_file --rw=read --direct=1 --bs=4k --size=50M --numjobs=16 --group_reporting; my test result was bw=428901KB/s, iops=107225.
Third, you are really writing too many files; you should think about your structure.
It depends on your temporary data size.
If you have much more memory than you're using for the data, then yes - shm will be a good place for it. If you're going to write almost as much as you've got available, then you're likely going to start swapping - which would kill the performance of everything.
If you can fit your data in memory, then tmpfs by definition will always be faster than writing to a physical disk. If it isn't, then there are more factors impacting your environment. Running your code under a profiler would be a good idea in this case.
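If you do try tmpfs, here is a small, untested sketch of keeping the scratch files on /dev/shm/ and cleaning them up promptly; the directory layout and the echo command are placeholders for the bash scripts in the question:

import os
import subprocess
import tempfile

# Put the scratch directory on tmpfs if it exists, otherwise fall back
# to the default temporary directory.
base = '/dev/shm' if os.path.isdir('/dev/shm') else None
scratch = tempfile.mkdtemp(prefix='job_', dir=base)

try:
    out_file = os.path.join(scratch, 'intermediate.dat')
    # Placeholder for the real bash scripts that write intermediate files.
    subprocess.check_output('echo data > %s' % out_file, shell=True)
finally:
    # Delete the scratch directory as soon as possible so tmpfs (i.e. RAM)
    # is not exhausted by the many parallel jobs.
    subprocess.check_output('rm -rf %s' % scratch, shell=True)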

Optimization of successive processing of large audio collection with several programs

I was given a task to process a large collection of audio files. Each file must be processed in four steps:
conversion from .wav into raw PCM,
resampling,
quantization,
coding with one of three speech codecs.
Each step corresponds to a program that takes a file as input and returns a file as output. Processing the files one by one seems to take a long time. How can I optimize the procedure? E.g. parallel programming or something? I tried to make use of a ramdisk to reduce the time spent on file reading/writing, but it didn't give an improvement. (Why?)
I'm writing in Python under Ubuntu Linux. Thanks in advance.
Reading and writing to disk is pretty slow. If each program's result is being written to disk, then it would be better to stop that from happening. Sockets seem like a good fit to me. Read more here: http://docs.python.org/library/ipc.html
Parallel programming is nice... I need more info before I can say much more on this topic. I remember reading a while ago that Python doesn't handle threading very efficiently; as far as I recall, it just emulates parallel processing by switching between tasks really gosh darn quickly, so that won't help here. This may have changed since I last worked with threading. Extra processes, on the other hand, sound like a good idea.
If you need a less vague answer, please supply specifics in your question.
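For instance, here is a rough, untested sketch (the command names and output naming are placeholders for the four programs mentioned in the question) that runs the whole chain for each file in a pool of worker processes and pipes intermediate data between steps instead of writing it to disk:

import glob
import subprocess
from multiprocessing import Pool

def process_one(wav_path):
    # Placeholder commands; substitute the real converter, resampler,
    # quantizer and codec. Each stage reads stdin and writes stdout,
    # so intermediate results never touch the disk.
    pipeline = 'convert_to_pcm | resample | quantize | encode > %s.enc' % wav_path
    with open(wav_path, 'rb') as wav:
        subprocess.check_call(pipeline, stdin=wav, shell=True)
    return wav_path

if __name__ == '__main__':
    pool = Pool()                      # one worker per core by default
    for done in pool.imap_unordered(process_one, glob.glob('*.wav')):
        print done                     # report progress as files finish
    pool.close()
    pool.join()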
EDIT
The thing i read a while ago about threads looks like this: http://docs.python.org/2/glossary.html#term-global-interpreter-lock

file.read() multiprocessing and the GIL

I've read that certain Python functions implemented in C, which I assume includes file.read(), can release the GIL while they're working and then reacquire it on completion, and by doing so make use of multiple cores if they're available.
I'm using multiprocessing to parallelize some code and currently I've got three processes: the parent, one child that reads data from a file, and one child that generates a checksum from the data passed to it by the first child process.
Now if I'm understanding this right, it seems that creating a new process to read the file as I'm currently doing is unnecessary and I should just call it in the main process. The question is: am I understanding this right, and will I get better performance with the read kept in the main process or in a separate one?
So given my function to read and pipe the data to be processed:
def read(file_path, pipe_out):
    with open(file_path, 'rb') as file_:
        while True:
            block = file_.read(block_size)
            if not block:
                break
            pipe_out.send(block)
    pipe_out.close()
I reckon that this will definitely make use of multiple cores, but also introduces some overhead:
multiprocessing.Process(target=read, args=args).start()
But now I'm wondering if just doing this will also use multiple cores, minus the overhead:
read(*args)
Any insights anybody has as to which one would be faster and for what reason would be much appreciated!
Okay, as came out in the comments, the actual question is:
Does (C)Python create threads on its own, and if so, how can I make use of that?
Short answer: No.
But the reason why these C functions are nevertheless interesting for Python programmers is the following. By default, no two snippets of Python code running in the same interpreter can execute in parallel; this is due to the evil called the Global Interpreter Lock, aka the GIL. The GIL is held whenever the interpreter is executing Python code, which implies the statement above: no two pieces of Python code can run in parallel in the same interpreter.
Nevertheless, you can still make use of multithreading in Python, namely when you're doing a lot of I/O or making heavy use of external libraries like numpy, scipy, lxml and so on, which all know about the issue and release the GIL whenever they can (i.e. whenever they do not need to interact with the Python interpreter).
I hope that cleared up the issue a bit.
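As a small illustration of that point, here is a sketch (the file names are made up) that uses a thread pool for I/O-bound reads; the threads can overlap because read() releases the GIL while it waits on the disk:

from multiprocessing.dummy import Pool as ThreadPool   # thread-based pool, same API as Pool

def read_file(path):
    # I/O-bound work: the GIL is released while read() waits on the disk.
    with open(path, 'rb') as f:
        return path, len(f.read())

if __name__ == '__main__':
    paths = ['file1.dat', 'file2.dat', 'file3.dat']     # made-up names
    pool = ThreadPool(4)
    for path, size in pool.map(read_file, paths):
        print path, size
    pool.close()
    pool.join()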
I think this is the main part of your question:
The question is am I understanding this right and will I get better
performance with the read kept in the main process or in a separate
one?
I assume your goal is to read and process the file as fast as possible. File reading is in any case I/O bound and not CPU bound. You cannot process data faster than you are able to read it. So file I/O clearly limits the performance of your software. You cannot increase the read data rate by using concurrent threads/processes for file reading. Also 'low level' CPython is not doing this. As long as you read the file in one process or thread (even in case of CPython with its GIL a thread is fine), you will get as much data per time as you can get from the storage device. It is also fine if you do the file reading in the main thread as long as there are no other blocking calls that would actually slow down the file reading.
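To make that concrete, here is a sketch (the block size, hash choice and file name are arbitrary) that keeps the read in the main process and offloads only the CPU-bound checksumming to a pool of workers:

import hashlib
from multiprocessing import Pool

BLOCK_SIZE = 1024 * 1024        # 1 MiB blocks, an arbitrary choice

def checksum(block):
    # CPU-bound work, done in the worker processes.
    return hashlib.md5(block).hexdigest()

def blocks(path):
    # Single reader: only one place pulls data off the storage device.
    with open(path, 'rb') as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            yield block

if __name__ == '__main__':
    pool = Pool()
    # Note: imap may read ahead of the workers; for very large files you
    # may want to bound how far the reader runs ahead.
    for digest in pool.imap(checksum, blocks('big_file.dat')):
        print digest
    pool.close()
    pool.join()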

Multithreading / Multiprocessing in Python

I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at parsing large amounts of data.
Link to program: http://pastebin.com/KqEMMMCT
What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to be able to create a process/thread that searches each directory concurrently; that would greatly speed up my program.
What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords.)
I have done various experiments using the multiprocessing API and condensing all the various files into 3 files (one for each directory) and then mapping the files to memory via mmap(), but a: I'm not sure if this is the appropriate route to go, and b: my program kept dying at random points, and debugging was an absolute nightmare.
Yes, I have done extensive googling, but I'm getting pretty confused between pools/threads/subprocesses/multithreading/multiprocessing.
I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!
FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.
What to do depends on what's slowing down the process.
If you're reading on a single disk, and disk I/O is slowing you down, multiple threads/processes will probably just slow you down further, as the read head will now be jumping all over the place as different threads get control, and you'll be spending more time seeking than reading.
If you're reading on a single disk, and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek time delays (which are usually very long, multiple milliseconds).
If you're reading from multiple disks, and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is away from the processor, the GIL won't be in your way.
If you're reading from multiple disks, and processing is slowing you down, then you'll need to go with multiprocessing.
Multiprocessing is easier to understand/use than multithreading (IMO). For my reasons, I suggest reading this section of TAOUP. Basically, everything a thread does, a process can do too; the difference is that with threads the programmer has to handle everything that the OS would otherwise handle. Sharing resources (memory/files/CPU cycles)? Learn locking/mutexes/semaphores and so on for threads. The OS does this for you if you use processes.
I would suggest building 4+ processes: 1 to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stick it all together.
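A rough, untested sketch of that layout (the file names and keywords are placeholders, and a plain substring test stands in for the Boyer-Moore-Horspool scan), with one reader feeding several searcher processes through a multiprocessing.Queue:

from multiprocessing import Process, Queue

NUM_SEARCHERS = 3

def searcher(keywords, work_q):
    # Pull (path, line) pairs off the queue and report matches.
    while True:
        item = work_q.get()
        if item is None:                 # sentinel: no more work
            break
        path, line = item
        for kw in keywords:              # stand-in for the Boyer-Moore-Horspool scan
            if kw in line:
                print path, kw, line,

if __name__ == '__main__':
    keywords = ['keyword1', 'keyword2']              # placeholders
    paths = ['dir1/a.txt', 'dir2/b.txt']             # placeholders
    work_q = Queue(maxsize=1000)                     # bounded, so the reader can't run too far ahead

    workers = [Process(target=searcher, args=(keywords, work_q))
               for _ in range(NUM_SEARCHERS)]
    for w in workers:
        w.start()

    # Single reader in the main process: only one process touches the disk.
    for path in paths:
        with open(path) as f:
            for line in f:
                work_q.put((path, line))
    for _ in range(NUM_SEARCHERS):
        work_q.put(None)                 # one sentinel per searcher

    for w in workers:
        w.join()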
This naturally fits into generators. See the genfind example, along with the gengrep example that uses it.
Also on the same site, check out the coroutines section.
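In that spirit, here is a small generator pipeline (the file pattern, directory and keyword are made up) that walks a directory tree and greps the matching files, along the lines of dabeaz's genfind/gengrep examples:

import os
import fnmatch

def gen_find(pattern, top):
    # Yield file paths under 'top' matching the shell-style pattern.
    for dirpath, dirnames, filenames in os.walk(top):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)

def gen_lines(paths):
    # Yield every line of every file, one at a time.
    for path in paths:
        with open(path) as f:
            for line in f:
                yield path, line

def gen_grep(keyword, lines):
    # Keep only the lines containing the keyword.
    for path, line in lines:
        if keyword in line:
            yield path, line

if __name__ == '__main__':
    hits = gen_grep('keyword1', gen_lines(gen_find('*.txt', 'dir1')))
    for path, line in hits:
        print path, line,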

Facing "MemoryError" while doing multithread txt file I/Os, looking for better solution

I'm working with a single txt file of about 4 MB, and the file needs frequent I/O, such as appending new lines, searching for lines that contain specific phrases, replacing certain lines with other lines, etc.
In order to process the file "at the same time" from several threads, threading.RLock() is used to lock the resource while it is being operated on. As it's not a big file, I simply use readlines() to read it all into a list and do the search job, and I also use read() to read the whole file into a string FileContent and use FileContent.replace("demo", "test") to replace certain phrases with anything I want.
But the problem is that I'm occasionally facing a "MemoryError", I mean sometimes every 3 or 4 days, sometimes longer, like a week or so. I've checked my code carefully and there's no unclosed file object when each thread ends. As for file operations, I simply use:
CurrentFile = open("TestFile.txt", "r")
FileContent = CurrentFile.read()
CurrentFile.close()
I think maybe Python is not deleting unused variables as fast as I expected, which eventually results in running out of memory, so I'm considering using the with statement, which might be quicker at garbage collection. I'm not experienced with that statement; does anybody know if it would help? Or is there a better solution to my problem?
Thanks a lot.
Added: My script does a lot of replacements in a short period of time, so my guess is that maybe hundreds of threads each holding a FileContent = CurrentFile.read() string cause the out-of-memory condition if FileContent is not deleted quickly enough? How do I debug such a problem?
Without seeing more of your code, it's impossible to know why you are running out of memory. The with statement is the preferred way to open files and close them when done though:
with open("TestFile.txt", "r") as current_file:
file_content = current_file.read()
(sorry, UpperCamelCase for variables just doesn't look right to me...)
Frankly, I doubt this will solve your problem if you are really closing files as you show in the question, but it's still good practice.
Sounds like you are leaking memory. Python will use all available system memory before raising MemoryError, and 4 MB does not sound like much. Where you leak memory depends on your code, which you didn't include in your question.
Have you watched the memory usage in the task manager of the OS?
Here is a tool to debug Python memory usage (it needs a Python debug compilation):
http://guppy-pe.sourceforge.net/#Heapy
Use it to analyze your code memory usage and see what objects you are creating which don't get freed.
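For example, a minimal heapy session (where exactly you take the snapshots is up to you) to see which object types keep growing:

from guppy import hpy

hp = hpy()
hp.setrelheap()          # measure growth relative to this point

# ... let the file-processing threads run for a while ...

print hp.heap()          # live objects created since setrelheap(), largest first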
