I have a large zip file that contains 1 file inside.
I want to unzip that file to a given directory for further processing and used this code:
from zipfile import ZipFile

def unzip(zipfile: ZipFile, filename: str, dest: str):
    ZipFile.extract(zipfile, filename, dest)
This function is called using:
with ZipFile(file_path, "r") as zip_source:
    unzip(zip_source, zip_source.infolist()[0], extract_path)  # extract_path is correctly defined earlier in the code
It seems like unzipping a large file takes a long time (file size > 500 MB), and I would like to optimize this solution.
All the optimizations I found were multiprocessing based in order to make the extraction of multiple files faster, however, my zip contains only a single file so multiprocessing doesn't seem to be the answer.
You cannot parallelize the decompression of a zip file that contains a single member, as long as that member is actually compressed with one of the usual algorithms (LZ77/LZW/LZSS). These algorithms are intrinsically sequential.
Moreover, these decompression methods are known to be slow (often much slower than reading the file from a storage device). This is mainly due to the algorithms themselves: their complexity, and the fact that most mainstream processors cannot speed the computation up by a large margin.
Thus, there is no way to decompress the file faster, although you might find a slightly faster implementation by using another library.
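The decompression itself stays sequential, but you can sometimes shave off a bit of overhead by streaming the member out yourself in large chunks instead of relying on extract(). A minimal sketch, assuming the archive holds exactly one member; the chunk size and paths are placeholders:

import shutil
from pathlib import Path
from zipfile import ZipFile

def unzip_streaming(zip_path: str, dest_dir: str, chunk_size: int = 16 * 1024 * 1024) -> Path:
    # stream the single member out of the archive in large chunks
    with ZipFile(zip_path, "r") as archive:
        member = archive.infolist()[0]          # the archive holds one file
        out_path = Path(dest_dir) / member.filename
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with archive.open(member, "r") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst, length=chunk_size)
    return out_path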
Related
I am working on a project where I am combining 300,000 small files to form a dataset for training a machine learning model. Because each of these files does not represent a single sample but rather a variable number of samples, the dataset I require can only be formed by iterating through each of these files and concatenating/appending their contents to a single, unified array. This means I unfortunately cannot avoid iterating through the files to form the dataset, and as a result the data loading prior to model training is very slow.
Therefore my question is this: would it be better to merge these small files into relatively larger ones, e.g. reducing the 300,000 files to 300 (merged) files? I assume that iterating through fewer (but larger) files would be faster than iterating through many (but smaller) files. Can someone confirm whether this is actually the case?
For context, my programs are written in Python and I am using PyTorch as the ML framework.
Thanks!
Usually working with one bigger file is faster than working with many small files.
It needs fewer open, read, close, etc. calls, each of which takes time to
check whether the file exists,
check whether you have the privileges to access the file,
get the file's metadata from disk (where the file begins on disk, what its size is, etc.),
seek to the beginning of the file on disk (when it has to read data),
create a system buffer for data from disk (the system reads more data into the buffer so that later read() calls can read partly from the buffer instead of partly from the disk).
With many files the system has to do all of this for every file, and the disk is much slower than a buffer in memory.
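If you do merge the files, a rough sketch of the merging step (the paths and shard size are made up, and each .npy is assumed to hold samples along axis 0):

import numpy as np
from pathlib import Path

def merge_into_shards(small_files, out_dir, files_per_shard=1000):
    # concatenate groups of small .npy files into larger shard files
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for start in range(0, len(small_files), files_per_shard):
        group = small_files[start:start + files_per_shard]
        # each file holds a variable number of samples, so concatenate along axis 0
        shard = np.concatenate([np.load(f) for f in group], axis=0)
        np.save(out_dir / f"shard_{start // files_per_shard:05d}.npy", shard)

The training loader then only has to open 300 shards instead of 300,000 individual files.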
I have a file with all of the pathnames for each .npy file. I have about 5 million files, so I would like to avoid unnecessary for loops.
What I need to do is to load them all into my data variable like this:
data = np.load( input_file_w_pathnames )
I know this will not work but I was wondering if someone knows of a clever way to do something similar, or at least a way to do this efficiently.
np.load takes a filename or a file object (a file that you opened). It uses standard Python file reading tools. It does not take multiple names or files.
np.stack([np.load(f) for f in ['x.npy', 'x.npy', 'x.npy']])
can join the arrays from each file into a larger array, but it is still doing a file-by-file load.
Keep in mind that numpy 'efficiency' is achieved by performing the task in compiled code - it's faster because of the compiling, not because it is getting around the serial nature of the task. And this task does not come up often enough to warrant special code.
I assume you can easily deal with loading the filenames into a list.
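Assuming the path file has one filename per line and every array has the same shape, a straightforward sketch (the file name is a placeholder):

import numpy as np

# read the list of .npy paths, one per line (the filename here is a placeholder)
with open("input_file_w_pathnames.txt") as fh:
    paths = [line.strip() for line in fh if line.strip()]

# load each file and stack into one array; this is still a file-by-file loop,
# there is no way around opening each of the 5 million files once
data = np.stack([np.load(p) for p in paths])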
I have around 60 files, each containing around 900,000 lines, where each line is 17 tab-separated float numbers. For each line I need to do some calculation using the corresponding lines from all 60 files, but because of their huge size (each file is 400 MB) and limited computation resources, it takes a very long time. Is there any way to do this faster?
It depends on how you process them. If you have enough memory, you can read all the files first, convert them into Python data structures, and then do the calculations.
If your files don't fit into memory, probably the easiest way is to use some distributed computing mechanism (Hadoop or a lighter alternative).
Another, smaller improvement could be to use the Linux fadvise system call to tell the operating system how you will be using the file (sequential reading or random access), so that it can optimize file access accordingly.
If the calculations fit into a common library like numpy or numexpr, which have a lot of optimizations, you can use them (this can help if your computations currently rely on unoptimized code).
If "corresponding lines" means "first lines of all files, then second lines of all files etc", you can use `itertools.izip:
# cat f1.txt
1.1
1.2
1.3
# cat f2.txt
2.1
2.2
2.3
# python
>>> from itertools import izip
>>> files = map(open, ("f1.txt", "f2.txt"))
>>> lines_iterator = izip(*files)
>>> for lines in lines_iterator:
... print lines
...
('1.1\n', '2.1\n')
('1.2\n', '2.2\n')
('1.3\n', '2.3\n')
>>>
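On Python 3 the same idea works with the built-in zip, and you can parse each line into its 17 floats as you go; a minimal sketch (file names are placeholders):

from contextlib import ExitStack

filenames = ["f1.txt", "f2.txt"]            # extend to all 60 files

with ExitStack() as stack:
    files = [stack.enter_context(open(name)) for name in filenames]
    for lines in zip(*files):               # one corresponding line from each file
        rows = [[float(x) for x in line.split("\t")] for line in lines]
        # ... do the per-line calculation on rows here ...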
A few options:
1. Just use the memory
You have 17x900000 = 15.3 M floats per file. Storing these as doubles (as numpy does by default) will take roughly 120 MB of memory per file. You can halve this by storing the floats as float32, so that each file takes roughly 60 MB. With 60 files at 60 MB each, you have 3.6 GB of data.
This amount is not unreasonable if you use 64-bit python. If you have less than, say, 6 GB of RAM in your machine, it will result in a lot of virtual memory swapping. Whether or not that is a problem depends on the way you access data.
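A minimal sketch of this option (the file names and delimiter are assumptions):

import numpy as np

filenames = [f"data_{i:02d}.txt" for i in range(60)]   # assumed naming scheme

# each array has shape (900000, 17); float32 halves the memory versus float64
arrays = [np.loadtxt(name, delimiter="\t", dtype=np.float32) for name in filenames]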
2. Do it row-by-row
If you can do it row by row, just read each file one row at a time. It is quite easy to keep 60 files open; that won't cause any problems. This is probably the most efficient method if you process the files sequentially. The memory usage is next to nothing, and the operating system takes care of reading the files.
The operating system and the underlying file system try very hard to be efficient in sequential disk reads and writes.
3. Preprocess your files and use mmap
You may also preprocess your files so that they are not in CSV but in a binary format. That way each row will take exactly 17x8 = 136 or 17x4 = 68 bytes in the file. Then you can use numpy.memmap to map each file into an array of shape [N, 17]. You can handle the arrays as usual arrays, and numpy plus the operating system will take care of optimal memory management.
The preprocessing is required because the record length (number of characters on a row) in a text file is not fixed.
This is probably the best solution, if your data access is not sequential. Then mmap is the fastest method, as it only reads the required blocks from the disk when they are needed. It also caches the data, so that it uses the optimal amount of memory.
Behind the scenes this is a close relative of solution #1, except that nothing is loaded into memory until it is required. The same limitation regarding 32-bit Python applies; it cannot do this, as it runs out of memory addresses.
The file conversion into binary is relatively fast and easy, almost a one-liner.
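Something along these lines for the preprocessing and the memmap (dtype, row count and file names are assumptions):

import numpy as np

def text_to_binary(txt_path, bin_path):
    # parse the tab-separated text once and dump it as raw float32 (fixed record length)
    data = np.loadtxt(txt_path, delimiter="\t", dtype=np.float32)   # shape (N, 17)
    data.tofile(bin_path)

def open_as_memmap(bin_path, n_rows):
    # map the binary file back as an (N, 17) array without reading it all into RAM
    return np.memmap(bin_path, dtype=np.float32, mode="r", shape=(n_rows, 17))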
I have a 3.3 GB file containing one long line. The values in the file are comma-separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 GB of memory, so I don't think it's even halfway done yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used HDF5 files, via h5py, of sizes well over 200 GB with very little processing time, on the order of a minute or two for file accesses. In addition, the HDF5 libraries support MPI and concurrent access.
This means that, assuming you can reformat your original one-line file as an appropriately hierarchical HDF5 file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF5 to process your data on multiple cores, using MPI to pass whatever data you need between them.
You need to be careful with your code and understand how MPI works in conjunction with HDF5, but it will speed things up no end.
Of course, all of this depends on putting the data into an HDF5 file in a way that allows you to take advantage of MPI... so maybe not the most practical suggestion.
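Ignoring the MPI part, a minimal h5py sketch that writes the data in chunks and reads back only a slice (the names, sizes and the block generator are assumptions):

import h5py

# write: stream values into a chunked, resizable dataset
with h5py.File("distance_matrix.h5", "w") as f:
    dset = f.create_dataset("values", shape=(0,), maxshape=(None,),
                            dtype="f8", chunks=(1_000_000,))
    for block in source_of_blocks():            # hypothetical generator of 1-D numpy blocks
        dset.resize(dset.shape[0] + len(block), axis=0)
        dset[-len(block):] = block

# read: only the requested slice is pulled from disk
with h5py.File("distance_matrix.h5", "r") as f:
    part = f["values"][10_000:20_000]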
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
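For example, a one-time conversion to .npy makes every later load skip the text parsing entirely (file names are placeholders):

import numpy as np

# one-time conversion: parse the comma-separated text once, then save as binary
distance_matrix = np.fromfile("distance_matrix.tmp", sep=",")
np.save("distance_matrix.npy", distance_matrix)

# later runs load the binary file directly, with no parsing
distance_matrix = np.load("distance_matrix.npy")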
If you can't change the file format (it's not the result of one of your own programs), then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8 GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.
I need to create ZIP archives on demand, using either Python zipfile module or unix command line utilities.
Resources to be zipped are often > 1GB and not necessarily compression-friendly.
How do I efficiently estimate its creation time / size?
Extract a bunch of small parts from the big file, say 64 randomly selected chunks of 64 KB each.
Concatenate the data, compress it, and measure the time and the compression ratio. Since you've randomly selected parts of the file, chances are that you have compressed a representative subset of the data.
Now all you have to do is to estimate the time for the whole file based on the time of your test-data.
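A rough sketch of that sampling idea using zlib, which uses the same DEFLATE compression as zip (the sample count and chunk size are arbitrary, and the time estimate ignores I/O):

import os
import random
import time
import zlib

def estimate_zip(path, n_chunks=64, chunk_size=64 * 1024):
    # compress randomly selected samples of the file and extrapolate
    total_size = os.path.getsize(path)
    sample = bytearray()
    with open(path, "rb") as f:
        for _ in range(n_chunks):
            f.seek(random.randrange(max(1, total_size - chunk_size)))
            sample += f.read(chunk_size)
    start = time.perf_counter()
    compressed = zlib.compress(bytes(sample), 6)          # level 6 is a typical zip default
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(sample)
    estimated_time = elapsed * (total_size / len(sample)) # rough lower bound, ignores disk I/O
    estimated_size = total_size * ratio
    return estimated_time, estimated_size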
I suggest you measure the average time it takes to produce a zip of a certain size, and then calculate the estimate from that measurement. However, I think the estimate will be very rough in any case if you don't know how well the data compresses. If the data you want to compress had a very similar "profile" each time, you could probably make better predictions.
If it's possible to get progress callbacks from the Python module, I would suggest finding out how many bytes are processed per second (by simply recording where in the file you were at the start of the second and where you are at the end). Once you know how fast the computer you're on is, you can of course save that figure and use it as a basis for your next zip file. (I normally collect about 5 samples before showing a time prognosis.)
This method can give you "Microsoft minutes", so as you get more samples you should average them out. That is especially true if you're making a zip file that contains a lot of files, as ZIP tends to slow down when compressing many small files compared to one large file.
If you're using the ZipFile.write() method to write your files into the archive, you could do the following:
Get a list of the files you want to zip and their respective sizes
Write one file to the archive and time how long it took
Calculate ETA based on the number of files written, their size, and how much is left.
This won't work if you're only zipping one really big file, though. I've never used the zip module myself, so I'm not sure if it would work, but for a small number of large files, maybe you could use the ZipFile.writestr() function and read in / zip up your files in chunks?
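ZipFile.writestr() needs the whole payload in memory, but since Python 3.6 ZipFile.open(name, "w") lets you write a member in chunks, which also gives a natural place to report progress. A hedged sketch (names and chunk size are assumptions):

import os
from zipfile import ZipFile, ZIP_DEFLATED

def zip_with_progress(src_path, zip_path, chunk_size=8 * 1024 * 1024):
    # compress one large file chunk by chunk, printing a crude percentage as it goes
    total = os.path.getsize(src_path)
    done = 0
    with ZipFile(zip_path, "w", compression=ZIP_DEFLATED) as archive:
        with open(src_path, "rb") as src, archive.open(os.path.basename(src_path), "w") as dst:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)
                done += len(chunk)
                print(f"{100 * done / total:.1f}% done")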