I have 10,000 binary files, named like this:
file0.bin
file1.bin
............
............
file10000.bin
Each of the above files contains exactly 391 float values (1564 bytes per file).
my goal is to read all of the files into a python array in the fastest way possible. If I open & close each file using a script, it takes a lot of time (about 8min!).
are there any other creative ways to read these files FAST?
I am using Ubuntu Linux and would prefer a solution that can work with Python. Thanks.
If you want even faster make ramdisk:
# mkfs -q /dev/ram1 $(( 2 * 10000)) ## roughly the size you need
# mkdir -p /ramcache
# mount /dev/ram1 /ramcache
# df -H | grep ramcache
now concat
# cat file{1..10000}.bin >> /ramcache/concat.bin ## thanks SiegeX
Then let your script on that file
Since I haven't tested I prefixed everything with '#' so that you wouldn't have any accidents. Just remove them if you want it to work.
This is an option but I would urge you to consider looking at the comments people have posted directly under your Q
You could probably get better results examining what you are doing wrong as I could not reproduce your speed problem of 8 mins.
Iterate over them and use optimise flag you might also want to parse them using pypy it compiles python via a JIT compiler allowing for a somewhat marked increase in speed.
You have 10001 files (0 to 10000 inclusive) and it takes 8 minutes to run the following?
try: xrange # python 2 - 3 compatibility
except NameError: xrange= range
import array
final= array.array('f')
for file_seq in xrange(10001):
with open("file%d.bin" % file_seq, "rb") as fp:
final.fromfile(fp, 391)
What's the underlying filesystem? How much RAM do you have? What's your processor and its speed?
Related
For my application, I am attempting to determine whether a data backup system missed any writes. I am doing this by writing an incrementing integer counter to a 1GB virtual disk, and to make sure no writes were missed I can look at the reverted snapshot and see if there were any gaps (i.e. if I see 1, 2, 3, 0, 0, 6, 7 I know that the backup didn't get writes 4 and 5 correctly). This is all on a CentOS 7 VM, with mostly Python 2.7 scripts for writes/reads (speed isn't a huge concern)
A big part of my issues has been caching: since I'm simulating random I/O, writes are often flushed from caches and written to disk out of order. This makes every test appear as a false positive, since it looks like some data is missing at the time of the snapshot. Again, I don't really care about efficiency at all, so I don't mind really slow writes. Reads can use caching, that's not a problem, but also doesn't matter much one way or the other
Here are the things I have done to try to disable caching:
disable the disk write cache with sudo hdparm -W 0 /dev/sdb where /dev/sdb
writing to a raw disk with no filesystem, so no filesystem caching
set the buffering flag on with open in the Python script to 0 (no Python write cache)
Is it basically an impossible task to make sure that my writes get put on the disk in sequential order? All I need is write #(n) to happen before write #(n+1), and #(n+1) before #(n+2), etc.
This is the Python script I'm using to write to disk (SIZE and PRIME change based on the size of the disk an a random seed):
from struct import pack, unpack
import sys
SIZE,PRIME = [x],[x]
# random I/O traversal iterator
def rand_index_generator(a,b):
ctr=0
while True:
yield (ctr%b)
ctr+=a
with open('/dev/sdb', 'rb+', buffering=0) as f:
index_gen = rand_index_generator(PRIME, SIZE)
# random traversal using iterator above, write counter to file
for counter in xrange(1, SIZE-16):
f.seek(index_gen.next()*4)
f.write(pack('>I', counter))
Then to validate I traverse in the same order and watch for gaps of unwritten data. This is after reverting the VM back to the snapshot. I know all the traversal and writing things work since validation will work smoothly with no missed writes before reverting, but I think some "written" data dies in RAM and doesn't make it to disk
Will take any suggestions to guarantee the write order I need for this application
Found out the answer to this question. I misunderstood the effect of writing to a raw disk, it did not eliminate OS caching since I was still calling the OS to write to my raw disk. Oops
To bypass OS caches you should use os.open and pass os.O_DIRECT and os.O_SYNC flags to make sure writes happen in the correct sequence (more info on those flags) and are not stuck in volatile memory. I used mmap and os file descriptors but you could also use the normal filehandles like this
Page size is specific to your operating system. For Linux it is 4096
The top section of the code stayed the same but here is the write loop:
PAGESIZE = 4096
filedesc = os.open('/dev/sdb', os.O_DIRECT|os.O_SYNC|os.O_RDWR)
for counter in xrange(1, SIZE-16):
write_loc = index_gen.next()*4
page_dist = (write_loc%PAGESIZE)
offset = write_loc - page_dist
bytemap = mmap.mmap(filedesc, PAGESIZE, offset=offset)
bytemap[page_dist:(page_dist+4)] = pack('>I', counter)
bytemap.flush()
bytemap.close()
I am trying to write a big list of numpy nd_arrays to disk.
The list is ~50000 elements long
Each element is a nd_array of size (~2048,2) of ints. The arrays have different shapes.
The method I am (curently) using is
#staticmethod
def _write_with_yaml(path, obj):
with io.open(path, 'w+', encoding='utf8') as outfile:
yaml.dump(obj, outfile, default_flow_style=False, allow_unicode=True)
I have also tried pickle which also give the same problem:
On small lists (~3400 long), this works fine, finishes fast enough (<30 sec).
On ~6000 long lists, this finishes after ~2 minutes.
When the list gets larger, the process seems not to do anything. No change in RAM or disk activity.
I stopped waiting after 30 minutes.
After force stopping the process, the file suddenly became of significant size (~600MB).
I can't know if it finished writing or not.
What is the correct way to write such large lists, know if he write succeeded, and, if possible, knowing when the write/read is going to finish?
How can I debug what's happening when the process seems to hang?
I prefer not to break and assemble the lists manually in my code, I expect the serialization libraries to be able to do that for me.
For the code
import numpy as np
import yaml
x = []
for i in range(0,50000):
x.append(np.random.rand(2048,2))
print("Arrays generated")
with open("t.yaml", 'w+', encoding='utf8') as outfile:
yaml.dump(x, outfile, default_flow_style=False, allow_unicode=True)
on my system (MacOSX, i7, 16 GiB RAM, SSD) with Python 3.7 and PyYAML 3.13 the finish time is 61min. During the save the python process occupied around 5 GBytes of memory and final file size is 2 GBytes. This also shows the overhead of the file format: as the size of the data is 50k * 2048 * 2 * 8 (the size of a float is generally 64 bits in python) = 1562 MBytes, means yaml is around 1.3 times worse (and serialisation/deserialisation is also taking time).
To answer your questions:
There is no correct or incorrect way. To have a progress update and
estimation of finishing time is not easy (ex: other tasks might
interfere with the estimation, resources like memory could be used
up, etc.). You can rely on a library that supports that or implement
something yourself (as the other answer suggested)
Not sure "debug" is the correct term, as in practice it might be that the process just slow. Doing a performance analysis is not easy, especially if
using multiple/different libraries. What I would start with is clear
requirements: what do you want from the file saved? Do they need to
be yaml? Saving 50k arrays as yaml does not seem the best solution
if you care about performance. Should you ask yourself first "which is the best format for what I want?" (but you did not give details so can't say...)
Edit: if you want something just fast, use pickle. The code:
import numpy as np
import yaml
import pickle
x = []
for i in range(0,50000):
x.append(np.random.rand(2048,2))
print("Arrays generated")
pickle.dump( x, open( "t.yaml", "wb" ) )
finishes in 9 seconds, and generates a file of 1.5GBytes (no overhead). Of course pickle format should be used in very different circumstances than yaml...
I cant say this is the answer, but it may be it.
When I was working on app that required fast cycles, I found out that something in the code is very slow. It was opening / closing yaml files.
It was solved by using JSON.
Dont use YAML for anything else than as some kind of config you dont open often.
Solution to your array saving:
np.save(path,array) # path = path+name+'.npy'
If you really need to save a list of arrays, I recommend you to save list with array paths(array themselfs you will save on disk with np.save). Saving python objects on disk is not really what you want. What you want is to save numpy arrays with np.save
Complete solution(Saving example):
for array_index in range(len(list_of_arrays)):
np.save(array_index+'.npy',list_of_arrays[array_index])
# path = array_index+'.npy'
Complete solution(Loading example):
list_of_array_paths = ['1.npy','2.npy']
list_of_arrays = []
for array_path in list_of_array_paths:
list_of_arrays.append(np.load(array_path))
Further advice:
Python cant really handle large arrays. Moreover if you have loaded several of them in the list. From the point of speed and memory, always work with one,two arrays at a time. The rest must be waiting on the disk. So instead of object reference, have reference as a path and when needed, load it from disk.
Also, you said you dont want to assemble the list manually.
Possible solution, which I dont advice, but is possibly exactly what you are looking for
>>> a = np.zeros(shape = [10,5,3])
>>> b = np.zeros(shape = [7,7,9])
>>> c = [a,b]
>>> np.save('data.npy',c)
>>> d = np.load('data.npy')
>>> d.shape
(2,)
>>> type(d)
<type 'numpy.ndarray'>
>>> d.shape
(2,)
>>> d[0].shape
(10, 5, 3)
>>>
I believe I dont need to comment above mentioned code. However, after loading back, you will lose list as the list will be transformed into numpy array.
I have seen some installation files (huge ones, install.sh for Matlab or Mathematica, for example) for Unix-like systems, they must have embedded quite a lot of binary data, such as icons, sound, graphics, etc, into the script. I am wondering how that can be done, since this can be potentially useful in simplifying file structure.
I am particularly interested in doing this with Python and/or Bash.
Existing methods that I know of in Python:
Just use a byte string: x = b'\x23\xa3\xef' ..., terribly inefficient, takes half a MB for a 100KB wav file.
base64, better than option 1, enlarge the size by a factor of 4/3.
I am wondering if there are other (better) ways to do this?
You can use base64 + compression (using bz2 for instance) if that suits your data (e.g., if you're not embedding already compressed data).
For instance, to create your data (say your data consist of 100 null bytes followed by 200 bytes with value 0x01):
>>> import bz2
>>> bz2.compress(b'\x00' * 100 + b'\x01' * 200).encode('base64').replace('\n', '')
'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
And to use it (in your script) to write the data to a file:
import bz2
data = 'QlpoOTFBWSZTWcl9Q1UAAABBBGAAQAAEACAAIZpoM00SrccXckU4UJDJfUNV'
with open('/tmp/testfile', 'w') as fdesc:
fdesc.write(bz2.decompress(data.decode('base64')))
Here's a quick and dirty way. Create the following script called MyInstaller:
#!/bin/bash
dd if="$0" of=payload bs=1 skip=54
exit
Then append your binary to the script, and make it executable:
cat myBinary >> myInstaller
chmod +x myInstaller
When you run the script, it will copy the binary portion to a new file specified in the path of=. This could be a tar file or whatever, so you can do additional processing (unarchiving, setting execute permissions, etc) after the dd command. Just adjust the number in "skip" to reflect the total length of the script before the binary data starts.
I'm new to python as well as MPI.
I have a huge data file, 10Gb, and I want to load it into, i.e., a list or whatever more efficient, please suggest.
Here is the way I load the file content into a list
def load(source, size):
data = [[] for _ in range(size)]
ln = 0
with open(source, 'r') as input:
for line in input:
ln += 1
data[ln%size].sanitize(line)
return data
Note:
source: is file name
size: is the number of concurrent process, I divide data into [size] of sublist.
for parallel computing using MPI in python.
Please advise how to load data more efficient and faster. I'm searching for days but I couldn't get any results matches my purpose and if there exists, please comment with a link here.
Regards
If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in continues blocks in the H.D.D then I don't know a way to read it faster than reading the file starting form the first bytes to the end.
But if the file is fragmented, create multiple threads each reading a part of the file. The must slow down the process of reading but modern HDDs implement a technique named NCQ (Native Command Queueing). It works by giving high priority to the read operation on sectors with addresses near the current position of the HDD head. Hence improving the overall speed of read operation using multiple threads.
To mention an efficient data structure in Python for your program, you need to mention what operations will you perform to the data? (delete, add, insert, search, append and so on) and how often?
By the way, if you use commodity hardware, 10GBs of RAM is expensive. Try reducing the need for this amount of RAM by loading the necessary data for computation then replacing the results with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.
(original) Solution using pickling
The strategy for your task can go this way:
split the large file to smaller ones, make sure they are divided on line boundaries
have Python code, which can convert smaller files into resulting list of records and save them as
pickled file
run the python code for all the smaller files in parallel (using Python or other means)
run integrating code, taking pickled files one by one, loading the list from it and appending it
to final result.
To gain anything, you have to be careful as overhead can overcome all possible gains from parallel
runs:
as Python uses Global Interpreter Lock (GIL), do not use threads for parallel processing, use
processes. As processes cannot simply pass data around, you have to pickle them and let the other
(final integrating) part to read the result from it.
try to minimize number of loops. For this reason it is better to:
do not split the large file to too many smaller parts. To use power of your cores, best fit
the number of parts to number of cores (or possibly twice as much, but getting higher will
spend too much time on swithing between processes).
pickling allows saving particular items, but better create list of items (records) and pickle
the list as one item. Pickling one list of 1000 items will be faster than 1000 times pickling
small items one by one.
some tasks (spliting the file, calling the conversion task in parallel) can be often done faster
by existing tools in the system. If you have this option, use that.
In my small test, I have created a file with 100 thousands lines with content "98-BBBBBBBBBBBBBB",
"99-BBBBBBBBBBB" etc. and tested converting it to list of numbers [...., 98, 99, ...].
For spliting I used Linux command split, asking to create 4 parts preserving line borders:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used following script, converting the content into file with
extension .pickle and containing pickled list.
# chunk2pickle.py
import pickle
import sys
def process_line(line):
return int(line.split("-", 1)[0])
def main(fname, pick_fname):
with open(pick_fname, "wb") as fo:
with open(fname) as f:
pickle.dump([process_line(line) for line in f], fo)
if __name__ == "__main__":
fname = sys.argv[1]
pick_fname = fname + ".pickled"
main(fname, pick_fname)
To convert one chunk of lines into pickled list of records:
$ python chunk2pickle xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used parallel tool (which has to be installed into
system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and I found new files with extension .pickled on the disk.
-j 4 asks to run 4 processes in parallel, adjust it to your system or leave it out and it will
default to number of cores you have.
parallel can also get list of parameters (input file names in our case) by other means like ls
command:
$ ls x?? |parallel -j 4 python chunk2pickle.py {}
To integrate the results, use script integrate.py:
# integrate.py
import pickle
def main(file_names):
res = []
for fname in file_names:
with open(fname, "rb") as f:
res.extend(pickle.load(f))
return res
if __name__ == "__main__":
file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
# here you have the list of records you asked for
records = main(file_names)
print records
In my answer I have used couple of external tools (split and parallel). You may do similar task
with Python too. My answer is focusing only on providing you an option to keep Python code for
converting lines to required data structures. Complete pure Python answer is not covered here (it
would get much longer and probably slower.
Solution using process Pool (no explicit pickling needed)
Following solution uses multiprocessing from Python. In this case there is no need to pickle results
explicitly (I am not sure, if it is done by the library automatically, or it is not necessary and
data are passed using other means).
# direct_integrate.py
from multiprocessing import Pool
def process_line(line):
return int(line.split("-", 1)[0])
def process_chunkfile(fname):
with open(fname) as f:
return [process_line(line) for line in f]
def main(file_names, cores=4):
p = Pool(cores)
return p.map(process_chunkfile, file_names)
if __name__ == "__main__":
file_names = ["xaa", "xab", "xac", "xad"]
# here you have the list of records you asked for
# warning: records are in groups.
record_groups = main(file_names)
for rec_group in record_groups:
print(rec_group)
This updated solution still assumes, the large file is available in form of four smaller files.
I will try my best to explain my problem and my line of thought on how I think I can solve it.
I use this code
for root, dirs, files in os.walk(downloaddir):
for infile in files:
f = open(os.path.join(root,infile),'rb')
filehash = hashlib.md5()
while True:
data = f.read(10240)
if len(data) == 0:
break
filehash.update(data)
print "FILENAME: " , infile
print "FILE HASH: " , filehash.hexdigest()
and using start = time.time() elapsed = time.time() - start I measure how long it takes to calculate an hash. Pointing my code to a file with 653megs this is the result:
root#Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.624
root#Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.373
root#Mars:/home/tiago# python algorithm-timer.py
FILENAME: freebsd.iso
FILE HASH: ace0afedfa7c6e0ad12c77b6652b02ab
12.540
Ok now 12 seconds +- on a 653mb file, my problem is I intend to use this code on a program that will run through multiple files, some of them might be 4/5/6Gb and it will take wayy longer to calculate. What am wondering is if there is a faster way for me to calculate the hash of the file? Maybe by doing some multithreading? I used a another script to check the use of the CPU second by second and I see that my code is only using 1 out of my 2 CPUs and only at 25% max, any way I can change this?
Thank you all in advance for the given help.
Hash calculation in your case will almost certanly be I/O bound (unless you'll be running it on a machine with a really slow processor), so multithreading or processing of multiple files at once probably won't yield you expected results.
Arraging files over multiple drives or on a faster (SSD) drive would probably help, even though that is probably not the solution you are looking for.
Aren't disk operations a bottleneck here?
Assuming 80MB/sec read speed (this is how my hard disk performs), it takes about 8 seconds to read the file.
For what it's worth, doing this:
c:\python\Python.exe c:\python\Tools\scripts\md5sum.py cd.iso
takes 9.671 seconds on my laptop (2GHz core2 duo with an 80 GB SATA laptop hard drive).
As others have mentioned, MD5s are disk-bound, but your 12 second benchmark is probably pretty close to the fastest you could get.
Also, python's md5sum.py uses 8096 for the buffer size (even though I'm sure they meant either 4096 or 8192).
It helped me to increase my buffer size, up to a point. I started with 1024 and multiplied it with 2^N, increasing N each time starting from 1. With this method, I found that on my system that a buffer size of 65536 seemed to be about as good as it would get. However, it only gave me about an 7% improvement in running time.
Profiling indicated that about 80% of the time is spent in the MD5 update method and the other 20% is reading in the file. Since MD5 is a serial algorithm and the Python algorithm is already implemented in C, I don't think that there is much that you can do to speed up the MD5 part. You can try calculating the MD5s of two different files in parallel, but as everyone has said, you're ultimately going to be limited by the disk access speed.