Best Approach for I/O Bound Problems?

I am currently running code on an HPC cluster that writes several 16 MB files to disk (all in the same directory), keeps them for a short time, and then deletes them. The files are written and deleted sequentially. However, the total number of I/O operations exceeds 20,000 * 12,000.
I am using the joblib module in Python 2.7 to run my code on several cores. It's basically a nested-loop problem: the outer loop is parallelised by joblib and the inner loop runs sequentially inside the function. In total it is a 20,000 * 12,000 loop.
The basic skeleton of my code is the following:
from joblib import Parallel, delayed
import subprocess
def f(a, b, c, d):
    cmds = 'path/to/a/bash_script_on_disk with arguments from a,b > save_file_to_disk'
    subprocess.check_output(cmds, shell=True)
    cmds1 = 'path/to/a/second_bash_script_on_disk > save_file_to_disk'
    subprocess.check_output(cmds1, shell=True)
    # The structure above is repeated several times.
    # However, I do delete the files as soon as I can using:
    cmds2 = 'rm -rf files'
    subprocess.check_output(cmds2, shell=True)
    # This is followed by the second/inner loop.
    for i in range(12000):
        # Do some computation, create and delete files in each iteration.
        pass
if __name__ == '__main__':
    num_cores = 48
    # a, b, c, d stand for the real per-job arguments.
    Parallel(n_jobs=num_cores)(delayed(f)(a, b, c, d) for i in range(20000))
    # range(20000) is batched by a wrapper script that sends no more
    # than 48 jobs per node (the maximum number of cores available).
This code is extremely slow and the bottleneck is the I/O time. Is this a good use case to temporarily write files to /dev/shm/? I have 34GB of space available as tmpfs on /dev/shm/.
Things I already tested:
I tried to set up the same code on a smaller scale on my laptop which has 8 cores. However, writing to /dev/shm/ ran slower than writing to disk.
Side note: the inner loop can be parallelised too. However, the number of cores I have available is far smaller than 20,000, which is why I am sticking to this configuration. Please let me know if there are better ways to do this.

First, the total number of I/O operations is not a meaningful measure. Talk about IOPS and throughput instead.
Second, it is very unlikely that writing to /dev/shm/ is slower than writing to disk. Please provide more information. You can benchmark a location with fio. Example command: sudo fio --name fio_test_file --rw=read --direct=1 --bs=4k --size=50M --numjobs=16 --group_reporting (this one measures reads; use --rw=write to measure writes). My test result is: bw=428901KB/s, iops=107225.
Third, you are writing far too many files; you should rethink the structure of your program.

It depends on your temporary data size.
If you have much more memory than you're using for the data, then yes, /dev/shm will be a good place for it. If you're going to write almost as much as you've got available, then you're likely going to start swapping, which would kill the performance of everything.
If you can fit your data in memory, then tmpfs by definition will always be faster than writing to a physical disk. If it isn't faster in practice, then other factors in your environment are at play. Running your code under a profiler would be a good idea in this case.
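As a rough illustration of the tmpfs suggestion, here is a minimal sketch of a per-job scratch directory that lives under /dev/shm when that is available and falls back to the default temporary location otherwise. The names scratch_dir and run_job are made up for this example, it is written for Python 3, and it assumes the intermediate files of one job comfortably fit in free memory.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def scratch_dir(preferred="/dev/shm"):
    """Yield a private temporary directory and remove it again on exit.

    Uses `preferred` when it is an existing, writable directory and
    otherwise falls back to the system default temporary location.
    """
    usable = os.path.isdir(preferred) and os.access(preferred, os.W_OK)
    path = tempfile.mkdtemp(prefix="job_", dir=preferred if usable else None)
    try:
        yield path
    finally:
        # Deleting the whole directory replaces the separate 'rm -rf' call.
        shutil.rmtree(path, ignore_errors=True)


def run_job(payload):
    """Hypothetical job: write one intermediate file and return its size."""
    with scratch_dir() as tmp:
        fname = os.path.join(tmp, "intermediate.bin")
        with open(fname, "wb") as fh:
            fh.write(payload)
        return os.path.getsize(fname)
```

Giving every job its own directory also keeps parallel jobs from creating and deleting entries in one shared directory.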

Related

Having time-consuming object in memory

The code below is a part of my main function
def main():
    model = GoodPackage.load_file_format('hello.bin', binary=True)
    do_stuff_with_model(model)
def do_stuff_with_model(model):
    pass  # do something with the model
Assume that hello.bin is a few gigabytes in size and takes a while to load. The function do_stuff_with_model is still unstable, and I will need many iterations before I have a stable version. In other words, I have to run the main function many times to finish debugging. However, since it takes a few minutes to load the model every time I run the code, this is time consuming. Is there a way for me to store the model object somewhere else, so that I don't have to wait every time I run the code by typing python my_code.py in the console? I assume using pickle wouldn't help either, because the file would still be big.

How about creating a ramdisk? If you have enough memory, you can store the entire file in RAM. This will drastically speed things up, though you'll likely have to do this every time you restart your computer.
Creating a ramdisk is quite simple on Linux. Just create a directory:
mkdir ramdisk
and mount it as a tmpfs or ramfs filesystem (this normally requires root, and the size must be large enough to hold your file):
mount -t tmpfs -o size=512m tmpfs ./ramdisk
From there you can simply copy your large file to the ramdisk. This has the benefit that your code stays exactly the same, apart from changing the path to your big file. File access occurs just as it normally would, but now it's much faster, since the file is loaded from RAM.
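The copy step can also be done from Python, so that the script stages the file itself on its first run after a reboot. This is only a sketch in Python 3: stage_in_ram is a made-up helper name, and ram_dir is assumed to be an already mounted tmpfs directory such as the ./ramdisk created above.

```python
import os
import shutil


def stage_in_ram(src, ram_dir):
    """Copy `src` into `ram_dir` unless an up-to-date copy is already there.

    Returns the path of the copy inside `ram_dir`.
    """
    dst = os.path.join(ram_dir, os.path.basename(src))
    src_stat = os.stat(src)
    if os.path.exists(dst):
        dst_stat = os.stat(dst)
        same_size = dst_stat.st_size == src_stat.st_size
        not_older = dst_stat.st_mtime >= src_stat.st_mtime
        if same_size and not_older:
            return dst  # reuse the copy that is already in RAM
    shutil.copy2(src, dst)  # copy2 also preserves the modification time
    return dst
```

The loading call would then receive the returned path, for example stage_in_ram('hello.bin', './ramdisk'), instead of the original file name.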

Python, read many files and merge the results

I might be asking a very basic question but I really can't figure how to make a simple parallel application in python.
I am running my scripts on a machine with 16 cores and I would like to use all of them efficiently. I have 16 huge files to read and I would like each cpu to read one file and then merge the result.
Here I give a quick example of what I would like to do:
from numpy import loadtxt
parameter1_glob = []
parameter2_glob = []
for cpu in range(16):
    parameter1, parameter2 = loadtxt('file' + str(cpu) + '.dat', unpack=True)
    parameter1_glob.append(parameter1)
    parameter2_glob.append(parameter2)
I think that the multiprocessing module might help but I couldn't understand how to apply it to what I want to do.

I agree with what Colin Dunklau said in his comment: this process will bottleneck on reading and writing these files, and the CPU demands are minimal. Even if you had 17 dedicated drives, you wouldn't be maxing out even one core. Additionally, though I realize this is tangential to your actual question, you'll likely run into memory limitations with these "huge" files. Loading 16 files into memory as arrays and then combining them into another file will almost certainly take up more memory than you have.
You may get better results by shell scripting this problem. In particular, GNU sort uses a memory-efficient merge sort to sort one or more files very rapidly, much faster than all but the most carefully written applications in Python or most other languages.
I would suggest avoiding any sort of multi-threading effort; it will dramatically add to the complexity, with minimal benefit. Be sure to keep as little of the file(s) in memory at a time as possible, or you'll run out quickly. In any case, you will absolutely want the reading and writing to happen on two separate disks. The slowdown associated with reading and writing simultaneously to the same disk is tremendously painful.
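To make the "keep as little in memory as possible" advice concrete, here is a small Python 3 sketch of a streaming merge built on the standard library's heapq.merge. It is not GNU sort and only performs the merge step: it assumes every input file is already sorted line by line and that every line ends with a newline. The function name merge_sorted_files is made up for this example.

```python
import heapq
from contextlib import ExitStack


def merge_sorted_files(input_paths, output_path):
    """Merge already-sorted text files into one sorted file, line by line."""
    with ExitStack() as stack:
        sources = [stack.enter_context(open(p)) for p in input_paths]
        out = stack.enter_context(open(output_path, "w"))
        # heapq.merge is lazy: it holds only one pending line per input file.
        for line in heapq.merge(*sources):
            out.write(line)
```

Memory use stays at roughly one line per input file, regardless of how large the files are.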

Do you want to merge line by line? Sometimes coroutines are more interesting for I/O-bound applications than classic multitasking. You can chain generators and coroutines for all sorts of routing, merging and broadcasting. Blow your mind with this nice presentation by David Beazley.
You can use a coroutine as a sink (untested, please refer to dabeaz examples):
import sys
# A sink that just prints the lines
def printer():
    while True:
        line = (yield)
        sys.stdout.write(line)
sources = [
    open('file1'),
    open('file2'),
    open('file3'),
    open('file4'),
    open('file5'),
    open('file6'),
    open('file7'),
]
output = printer()
next(output)  # prime the coroutine so that it is ready to receive lines
while sources:
    for source in list(sources):  # iterate over a copy; the list changes below
        line = source.readline()
        if not line:  # EOF
            sources.remove(source)
            source.close()
            continue
        output.send(line)

Assuming that the results from each file are smallish, you could do this with my package jug:
from numpy import loadtxt
from jug import TaskGenerator
loadtxt = TaskGenerator(loadtxt)
@TaskGenerator
def write_parameter(oname, ps):
    with open(oname, 'w') as output:
        for p in ps:
            output.write(str(p) + '\n')
parameter1_glob = []
parameter2_glob = []
for cpu in range(16):
    ps = loadtxt('file' + str(cpu) + '.dat', unpack=True)
    parameter1_glob.append(ps[0])
    parameter2_glob.append(ps[1])
write_parameter('output1.txt', parameter1_glob)
write_parameter('output2.txt', parameter2_glob)
Now you can run several jug execute processes at the same time.

Multithreading / Multiprocessing in Python

I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at parsing large amounts of data.
Link to program: http://pastebin.com/KqEMMMCT
What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to create a process/thread for each directory so that they are searched concurrently, which should greatly speed up my program.
What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords.)
I have done various experiments using the multiprocessing API and condensing all the various files into 3 files (one for each directory) and then mapping the files to memory via mmap(), but a: I'm not sure if this is the appropriate route to go, and b: my program kept dying at random points, and debugging was an absolute nightmare.
Yes, I have done extensive googling, but I'm getting pretty confused between pools/threads/subprocesses/multithreading/multiprocessing.
I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!
FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.

What to do depends on what's slowing down the process.
If you're reading from a single disk, and disk I/O is slowing you down, multiple threads/processes will probably just slow you down further, as the read head will now be jumping all over the place as different threads get control, and you'll be spending more time seeking than reading.
If you're reading on a single disk, and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek time delays (which are usually very long, multiple milliseconds).
If you're reading from multiple disks, and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is away from the processor, the GIL won't be in your way.
If you're reading from multiple disks, and processing is slowing you down, then you'll need to go with multiprocessing.
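For the third case above (several disks, I/O-bound), a thread per directory could look roughly like the following Python 3 sketch. It uses concurrent.futures from the standard library and a plain bytes.count instead of the asker's Boyer-Moore-Horspool search; the function names are made up for this example, and each file is read into memory whole, which may not suit very large files.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_matches(directory, keyword):
    """Count non-overlapping occurrences of `keyword` (bytes) under `directory`."""
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            with open(os.path.join(root, name), "rb") as fh:
                total += fh.read().count(keyword)
    return total


def search_directories(directories, keyword):
    """Search each directory in its own thread and return {directory: count}."""
    workers = max(1, len(directories))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda d: count_matches(d, keyword), directories)
        return dict(zip(directories, counts))
```

If the directories turn out to sit on the same physical disk, the first case applies instead and this is likely to be slower than a single-threaded scan.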

Multiprocessing is easier to understand and use than multithreading (IMO). For my reasons, I suggest reading this section of TAOUP. Basically, everything a thread does, a process can do as well, except that with threads the programmer has to handle everything the OS would otherwise take care of. Sharing resources (memory/files/CPU cycles)? With threads you have to learn locking/mutexes/semaphores and so on. The OS does this for you if you use processes.
I would suggest building 4+ processes. 1 to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stick it all together.
This naturally fits into generators. See the genfind example, along with the gengrep example that uses it.
Also on the same site, check out the coroutines section.
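A generator pipeline in the spirit of those genfind/gengrep examples might look like the sketch below. It is written from scratch for Python 3 rather than copied from the presentation, so the function names and signatures here are illustrative only, and it uses a plain substring test in place of a real search algorithm.

```python
import fnmatch
import os


def gen_find(pattern, top):
    """Yield paths of files under `top` whose names match the shell `pattern`."""
    for root, _dirs, files in os.walk(top):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(root, name)


def gen_lines(paths):
    """Yield (path, line) pairs, opening one file at a time."""
    for path in paths:
        with open(path, errors="replace") as fh:
            for line in fh:
                yield path, line


def gen_grep(keyword, tagged_lines):
    """Keep only the (path, line) pairs whose line contains `keyword`."""
    return ((path, line) for path, line in tagged_lines if keyword in line)


def search(keyword, pattern, top):
    """Wire the stages together; nothing is read until the result is iterated."""
    return gen_grep(keyword, gen_lines(gen_find(pattern, top)))
```

Each stage pulls one item at a time from the previous one, so only a single line is held in memory at any moment.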

python parallel processing

I am new to Python. I have 2000 files, each about 100 MB. I have to read each of them and merge them into a big matrix (or table). Can I use parallel processing for this so that I can save some time? If yes, how? I tried searching and things seem very complicated. Currently, it takes about 8 hours to get this done serially. We have a really big server with one terabyte of RAM and a few hundred processors. How can I make efficient use of this?
Thank you for your help.

You may be able to preprocess the files in separate processes using the subprocess module; however, if the final table is kept in memory, then that process will end up being your bottleneck.
There is another possible approach using shared memory with mmap objects. Each subprocess can be responsible for loading the files into a subsection of the mapped memory.
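The following Python 3 sketch shows the core of the mmap idea: a file-backed mapping into which each worker writes its own, non-overlapping byte range. For brevity the workers are plain function calls here; the assumption is that in a real setup each fill_region call would run in a separate process (for example via multiprocessing). Both function names are made up for this example, and the backing file must have a non-zero size.

```python
import mmap


def create_table_file(path, size):
    """Create a zero-filled file of `size` bytes to back the shared mapping."""
    with open(path, "wb") as fh:
        fh.truncate(size)


def fill_region(path, offset, data):
    """Map the backing file and copy `data` (bytes) into it at `offset`.

    Each worker is given its own non-overlapping offset, so the workers
    do not need any locking between them.
    """
    with open(path, "r+b") as fh:
        mm = mmap.mmap(fh.fileno(), 0)  # length 0 maps the whole file
        try:
            mm[offset:offset + len(data)] = data
            mm.flush()
        finally:
            mm.close()
```

The offsets have to be computed up front, which works when the size of each file's contribution to the table is known before loading.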

Efficient writing to a Compact Flash in python

I'm writing a GUI to perform a glorified 'dd'.
I could just subprocess to 'dd' but I thought I might as well use python's open()/read()/write() if I can as it'll let me display progress much more easily.
Prompted by this link here I have:
input = open('filename.img', 'rb')
output = open("/dev/sdc", 'wb')
while True:
    buffer = input.read(1024)
    if buffer:
        output.write(buffer)
    else:
        break
input.close()
output.close()
...however it is horribly slow. Or at least far slower than dd. (around 4-5x slower)
I had a play and noticed that altering the number of bytes 'buffered' had a huge effect on the speed of completion. Raising it to 2048, for example, seems to halve the time taken. Perhaps going OT for SO here, but I guess the flash has an optimum number of bytes to be written at once? Could anyone suggest how I discover this?
The image and card are 1 GB, so I would very much like to return to the ~5 minutes dd took if possible. I appreciate that in all likelihood I won't match it.
Rather than trial and error, would anyone be able to suggest a way to optimise the above code, with reasoning as to why it works? In particular, what value should I pass to input.read()?
One restriction: python 2.4.3 on linux (centos5) (please don't hurt me)

The dependence of speed on buffer size is unrelated to the specific characteristics of compact flash. It is inherent to all I/O with (relatively) slow devices, and indeed to all kinds of system calls. You should make the buffer as large as possible without exhausting memory; 2 MiB should be enough for a flash drive.
You should use the time and strace utilities to determine why your program is slower. If time shows a large user/real ratio (large meaning greater than 0.1), the time is going into the Python side, which you can optimize: CPython 2.4 is pretty slow, and you're creating new objects all the time instead of writing into a preallocated buffer. If there is a significant difference in sys timings, analyze the syscalls made by both programs (with strace) and try to emit the ones dd does.
Also note that you must call fsync (or execute the sync program) afterwards to measure the real time it took to write the file to disk (or open the output file with O_DIRECT). Otherwise, the operating system will let your program exit and just keep all the written data in buffers that are then gradually written out to the actual disk. To test that you're doing it right, remove the disk immediately after your program has finished. Note that the speed difference can be staggering. This effect is less noticeable if your disk (CF card) is way larger than the available physical memory.
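Putting the points above together, a chunked copy with a large buffer, a progress hook, and an explicit flush and fsync could be sketched as follows. This is written for a current Python 3, not the asker's 2.4.3 (which lacks the with statement); copy_image and the progress callback are names made up for this example, and the 2 MiB default simply follows the figure suggested above.

```python
import os


def copy_image(src_path, dst_path, chunk_size=2 * 1024 * 1024, progress=None):
    """Copy src_path to dst_path in large chunks and force the data to disk.

    `progress`, if given, is called with (bytes_written, total_bytes)
    after every chunk. Returns the number of bytes written.
    """
    total = os.path.getsize(src_path)
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
            if progress is not None:
                progress(written, total)
        dst.flush()             # push Python's own buffer to the OS
        os.fsync(dst.fileno())  # ask the OS to write its cache to the device
    return written
```

Timing the call including the fsync gives a figure that can be compared fairly with dd followed by sync.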

So with a little help, I've removed the 'buffer' bit completely and added an os.fsync().
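import os
input = open('filename.img', 'rb')
output = open("/dev/sdc", 'wb')
# Note: this reads the whole image into memory at once.
output.write(input.read())
# flush and fsync must be called before the file is closed
output.flush()
os.fsync(output.fileno())
input.close()
output.close()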
