Working with multiple Large Files in Python

Working with multiple Large Files in Python - python

I have around 60 files each contains around 900000 lines which each line is 17 tab separated float numbers. Per each line i need to do some calculation using all corresponding lines from all 60 files, but because of their huge sizes (each file size is 400 MB) and limited computation resources, it takes so long time. I would like to know is there any solution to do this fast?

It depends on how you process them. If you have enough memory you can read all the files first and change them to python data structures. Then you can do calculations.
If your files don't fit into memory probably the easiest way is to use some distributed computing mechanism (hadoop or other lighter alternatives).
Another smaller improvements could be to use fadvice linux function call to say how you will be using the file (sequential reading or random access), it tells the operating system how to optimize file access.
If the calculations fit into some common libraries like numpy numexpr which has a lot of optimizations you can use them (this can help if your computations use not-optimized algorithms to process them).

If "corresponding lines" means "first lines of all files, then second lines of all files etc", you can use `itertools.izip:
# cat f1.txt
1.1
1.2
1.3
# cat f2.txt
2.1
2.2
2.3
# python
>>> from itertools import izip
>>> files = map(open, ("f1.txt", "f2.txt"))
>>> lines_iterator = izip(*files)
>>> for lines in lines_iterator:
... print lines
...
('1.1\n', '2.1\n')
('1.2\n', '2.2\n')
('1.3\n', '2.3\n')
>>>

A few options:
1. Just use the memory
If you have 17x900000 = 15.3 M floats/file. Storing this as doubles (as numpy usually does) will take roughly 120 MB of memory per file. You can reduce this by storing the floats as float32, so that each file will take roughly 60 MB. If you have 60 files and 60 MB/file, you have 3.6 GB of data.
This amount is not unreasonable if you use 64-bit python. If you have less than, say, 6 GB of RAM in your machine, it will result in a lot of virtual memory swapping. Whether or not that is a problem depends on the way you access data.
2. Do it row-by-row
If you can do it row-by-row, just read each file one row at a time. It is quite easy to have 60 open files, that'll not cause any problems. This is probably the most efficient method, if you process the files sequentially. The memory usage is next to nothing, and the operating system will take the trouble of reading the files.
The operating system and the underlying file system try very hard to be efficient in sequential disk reads and writes.
3. Preprocess your files and use mmap
You may also preprocess your files so that they are not in CSV but in a binary format. That way each row will take exactly 17x8 = 136 or 17x4 = 68 bytes in the file. Then you can use numpy.mmap to map the files into arrays of [N, 17] shape. You can handle the arrays as usual arrays, and numpy plus the operating system will take care of optimal memory management.
The preprocessing is required because the record length (number of characters on a row) in a text file is not fixed.
This is probably the best solution, if your data access is not sequential. Then mmap is the fastest method, as it only reads the required blocks from the disk when they are needed. It also caches the data, so that it uses the optimal amount of memory.
Behind the scenes this is a close relative to solution #1 with the exception that nothing is loaded into memory until required. The same limitations about 32-bit python apply; it is not able to do this as it runs out of memory addresses.
The file conversion into binary is relatively fast and easy, almost a one-liner.

Related

Whats the performance difference between a python accessing multiple small .npy files vs one large .npy files

I am currently working on a repository that has millions of small .npy numpy or .png image files. The code reads and writes these multiple small files.
It seems to be extremely slow, I was wondering if I merged all smaller .npy files to a larger files, would the code run faster? If yes, what would be the reason? Does it have something to do with disk I/O?

Opening each file require the operating system to fetch some data on the storage device. More specifically multiple requests regarding the target file system. Reading a block also require a request. This means many request per file. Each file is typically stored at different locations (that often looks random in practice). Performing random fetches on storage devices is known to be slow. Storage devices have a limited number of IO operations per second (IOPS). HDD have a very low number of IOPS (eg. ~75 IOPS for mainstream 7200 RPM HDDs). SSDs are much faster but neither the hardware nor the software stacks are optimized to perform many requests sequentially yet so multiple threads are often needed to reach a good IOPS.
Thus, yes, merging many files into one will certainly improve performance, especially if you use a HDD, because you can then read big contiguous chunks.
For more information about the expected performance, please read this.

How to read two larger 5GB csv files in local system Jupyter Notebook using python pandas? how to join two dataframes for data analysis in local?

How to upload two large (5GB) each csv file in local system Jupyter Notebook using python pandas. Please suggest any configuration to handle big csv files for data analysis ?
Local System Configuration:
OS: Windows 10
RAM: 16 GB
Processor: Intel-Core-i7
Code:
dpath = 'p_flg_tmp1.csv'
pdf = pd.read_csv(dpath, sep="|")
Error:
MemoryError: Unable to allocate array
or
pd.read_csv(po_cust_data, sep="|", low_memory=False)
Error:
ParserError: Error tokenizing data. C error: out of memory
How to handle two bigger csv file in local system for data analysis? please suggested better configuration if possible in local system using python pandas.

If you do not need to process everything at once you can use chunks:
reader = pd.read_csv('tmp.sv', sep='|', chunksize=4000)
for chunk in reader:
print(chunk)
see the Documentation of Pandas for further information.
If you need to process everything at once and chunking really isnt an option you have only two options left
Increase RAM of your system
Switch to another data storage type
A csv file takes an enormous amount of memory in RAM, see this article for more information even if it is for another software it gives a good idea about the problem:
Memory Usage
You can estimate the memory usage of your CSV file with this simple
formula:
memory = 25 * R * C + F
where R is the number of rows, C the number of columns and F the file size in bytes.
One of my test files is 524 MB large, contains 10 columns in 4.4
million rows. Using the formula from above the RAM usage will be about
1.6 GB:
memory = 25 * 4,400,000 * 10 + 524,000,000 = 1,624,000,000 bytes
While this file is opened in Tablecruncher the Activity Monitor
reports 1.4 GB RAM used, so the formula represents a rather accurate
guess.

Use chunk to read data partially.
dpath = 'p_flg_tmp1.csv'
for pdf in pd.read_csv(dpath, sep="|", chunksize=1000):
*do something here*

What's your overall goal here though? People are providing help with how to read it, but then what? You want to do a join/merge? You are gonna need more tricks to get through that.
But then what? Is the rest of your algorithm also chunkable? Are you going to have enough RAM left to process anything? And what about CPU performance? Is one little i7 enough? Do you plan on waiting hours or days for results? This might all be acceptable for your use case, sure, but we don't know that.
At a certain point, if you want to use big data, you need big computer(s). Do you really have to do this locally? Even if you aren't ready for distributed computing over clusters, you could just get an adequately sized VM instance. Your company will pay for it. They pay for themselves. It's much cheaper to give you a better computer than to pay you to wait around for a small one to finish. In India, the price ratios between labor / AWS costs is lower than in the US, sure, but it's still well worth it. Be like hey boss, do you want this to take 3 days or 3 weeks?
Realistically, your small computer problems are only going to get worse after reading in the CSV. I mean I don't know your use case, but that seems likely. You could spend a long time trying to engineer your way out of these problems, but much cheaper to just spin up an EC2 instance.

How to solve the memory error in Python

I am dealing with several large txt file, each of them has about 8000000 lines. A short example of the lines are:
usedfor zipper fasten_coat
usedfor zipper fasten_jacket
usedfor zipper fasten_pant
usedfor your_foot walk
atlocation camera cupboard
atlocation camera drawer
atlocation camera house
relatedto more plenty
The code to store them in a dictionary is:
dicCSK = collections.defaultdict(list)
for line in finCSK:
line=line.strip('\n')
try:
r, c1, c2 = line.split(" ")
except ValueError:
print line
dicCSK[c1].append(r+" "+c2)
It runs good in the first txt file, but when it runs to the second txt file, I got an error MemoryError.
I am using window 7 64bit with python 2.7 32bit, intel i5 cpu, with 8Gb memory. How can I solve the problem?
Further explaining:
I have four large files, each file contains different information for many entities. For example, I want to find all information for cat, its father node animal and its child node persian cat and so on. So my program first read all txt files in the dictionary, then I scan all dictionaries to find information for cat and its father and its children.

Simplest solution: You're probably running out of virtual address space (any other form of error usually means running really slowly for a long time before you finally get a MemoryError). This is because a 32 bit application on Windows (and most OSes) is limited to 2 GB of user mode address space (Windows can be tweaked to make it 3 GB, but that's still a low cap). You've got 8 GB of RAM, but your program can't use (at least) 3/4 of it. Python has a fair amount of per-object overhead (object header, allocation alignment, etc.), odds are the strings alone are using close to a GB of RAM, and that's before you deal with the overhead of the dictionary, the rest of your program, the rest of Python, etc. If memory space fragments enough, and the dictionary needs to grow, it may not have enough contiguous space to reallocate, and you'll get a MemoryError.
Install a 64 bit version of Python (if you can, I'd recommend upgrading to Python 3 for other reasons); it will use more memory, but then, it will have access to a lot more memory space (and more physical RAM as well).
If that's not enough, consider converting to a sqlite3 database (or some other DB), so it naturally spills to disk when the data gets too large for main memory, while still having fairly efficient lookup.

Assuming your example text is representative of all the text, one line would consume about 75 bytes on my machine:
In [3]: sys.getsizeof('usedfor zipper fasten_coat')
Out[3]: 75
Doing some rough math:
75 bytes * 8,000,000 lines / 1024 / 1024 = ~572 MB
So roughly 572 meg to store the strings alone for one of these files. Once you start adding in additional, similarly structured and sized files, you'll quickly approach your virtual address space limits, as mentioned in #ShadowRanger's answer.
If upgrading your python isn't feasible for you, or if it only kicks the can down the road (you have finite physical memory after all), you really have two options: write your results to temporary files in-between loading in and reading the input files, or write your results to a database. Since you need to further post-process the strings after aggregating them, writing to a database would be the superior approach.

How to read a big (3-4GB) file that doesn't have newlines into a numpy array?

I have a 3.3gb file containing one long line. The values in the file are comma separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?

This should probably be a comment... but I don't have enough reputation to put comments in.
I've used hdf files, via h5py, of sizes well over 200 gigs with very little processing time, on the order of a minute or two, for file accesses. In addition the hdf libraries support mpi and concurrent access.
This means that, assuming you can format your original one line file, as an appropriately hierarchic hdf file (e.g. make a group for every `large' segment of data) you can use the inbuilt capabilities of hdf to make use of multiple core processing of your data exploiting mpi to pass what ever data you need between the cores.
You need to be careful with your code and understand how mpi works in conjunction with hdf, but it'll speed things up no end.
Of course all of this depends on putting the data into an hdf file in a way that allows you to take advantage of mpi... so maybe not the most practical suggestion.

Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
If you can't change the file type (not the result of one of your programs) then there's not much you can do about it. Make sure your machine has lots of ram (at least 8GB) so that it doesn't need to use the swap at all. Defragmenting the harddrive might help as well, or using a SSD drive.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.

Is it possible to speed-up python IO?

Consider this python program:
import sys
lc = 0
for line in open(sys.argv[1]):
lc = lc + 1
print lc, sys.argv[1]
Running it on my 6GB text file, it completes in ~ 2minutes.
Question: is it possible to go faster?
Note that the same time is required by:
wc -l myfile.txt
so, I suspect the anwer to my quesion is just a plain "no".
Note also that my real program is doing something more interesting than just counting the lines, so please give a generic answer, not line-counting-tricks (like keeping a line count metadata in the file)
PS: I tagged "linux" this question, because I'm interested only in linux-specific answers. Feel free to give OS-agnostic, or even other-OS answers, if you have them.
See also the follow-up question

Throw hardware at the problem.
As gs pointed out, your bottleneck is the hard disk transfer rate. So, no you can't use a better algorithm to improve your time, but you can buy a faster hard drive.
Edit: Another good point by gs; you could also use a RAID configuration to improve your speed. This can be done either with hardware or software (e.g. OS X, Linux, Windows Server, etc).
Governing Equation
(Amount to transfer) / (transfer rate) = (time to transfer)
(6000 MB) / (60 MB/s) = 100 seconds
(6000 MB) / (125 MB/s) = 48 seconds
Hardware Solutions
The ioDrive Duo is supposedly the fastest solution for a corporate setting, and "will be available in April 2009".
Or you could check out the WD Velociraptor hard drive (10,000 rpm).
Also, I hear the Seagate Cheetah is a good option (15,000 rpm with sustained 125MB/s transfer rate).

The trick is not to make electrons move faster (that's hard to do) but to get more work done per unit of time.
First, be sure your 6GB file read is I/O bound, not CPU bound.
If It's I/O bound, consider the "Fan-Out" design pattern.
A parent process spawns a bunch of children.
The parent reads the 6Gb file, and deals rows out to the children by writing to their STDIN pipes. The 6GB read time will remain constant. The row dealing should involve as little parent processing as possible. Very simple filters or counts should be used.
A pipe is an in-memory channel for communication. It's a shared buffer with a reader and a writer.
Each child reads a row from STDIN, and does appropriate work. Each child should probably write a simple disk file with the final (summarized, reduce) results. Later, the results in those files can be consolidated.

You can't get any faster than the maximum disk read speed.
In order to reach the maximum disk speed you can use the following two tips:
Read the file in with a big buffer. This can either be coded "manually" or simply by using io.BufferedReader ( available in python2.6+ ).
Do the newline counting in another thread, in parallel.

plain "no".
You've pretty much reached maximum disk speed.
I mean, you could mmap the file, or read it in binary chunks, and use .count('\n') or something. But that is unlikely to give major improvements.

If you assume that a disk can read 60MB/s you'd need 6000 / 60 = 100 seconds, which is 1 minute 40 seconds. I don't think that you can get any faster because the disk is the bottleneck.

as others have said - "no"
Almost all of your time is spent waiting for IO. If this is something that you need to do more than once, and you have a machine with tons of ram, you could keep the file in memory. If your machine has 16GB of ram, you'll have 8GB available at /dev/shm to play with.
Another option:
If you have multiple machines, this problem is trivial to parallelize. Split the it among multiple machines, each of them count their newlines, and add the results.

2 minutes sounds about right to read an entire 6gb file. Theres not really much you can do to the algorithm or the OS to speed things up. I think you have two options:
Throw money at the problem and get better hardware. Probably the best option if this project is for your job.
Don't read the entire file. I don't know what your are trying to do with the data, so maybe you don't have any option but to read the whole thing. On the other hand if you are scanning the whole file for one particular thing, then maybe putting some metadata in there at the start would be helpful.

PyPy provides optimised input/output faster up to 7 times.

This is a bit of an old question, but one idea I've recently tested out in my petabyte project was the speed benefit of compressing data, then using compute to decompress it into memory. I used a gigabyte as a standard, but using zlib you can get really impressive file size reductions.
Once you've reduced your file size, when you go to iterate through this file you just:
Load the smaller file into memory (or use stream object).
Decompress it (as a whole, or using the stream object to get chunks of decompressed data).
Work on the decompressed file data as you wish.
I've found this process is 3x faster in the best best case than using native I/O bound tasks. It's a bit outside of the question, but it's an old one and people may find it useful.
Example:
compress.py
import zlib
with open("big.csv", "rb") as f:
compressed = zlib.compress(f.read())
open("big_comp.csv", "wb").write(compressed)
iterate.py
import zlib
with open("big_comp.csv", "rb") as f:
big = zlib.decompress(f.read())
for line in big.split("\n"):
line = reversed(line)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.