I am working with information from big models, which means I have a lot of big ASCII files with two float columns (let's say X and Y). Reading these files takes a long time, so I thought converting them to binary files might make the reading process much faster.
I converted my ASCII files into binary files using the uu.encode(ascii_file, binary_file) command, and it worked quite well (I tested the decode step and recovered the same files).
My question is: is there any way to read the binary files directly into Python and get the data into two variables (x and y)?
Thanks!
You didn't specify how your float columns are represented in Python. The cPickle module (plain pickle in Python 3) is a fast general solution, with the drawbacks that it creates files readable only from Python and that it should never be allowed to read untrusted data (for example, data received from the network). It is likely to just work with all regular datatypes, including numpy arrays.
If you can use numpy and store your data in numpy arrays, look into numpy.save and numpy.savetxt and the corresponding loading functions, which should offer performance superior to manually extracting the data.
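A minimal sketch of the numpy route, assuming the text files hold two whitespace-separated float columns. The file names and the helper names convert_to_npy and load_xy are made up for illustration:

```python
import os
import tempfile

import numpy as np


def convert_to_npy(text_path, npy_path):
    """Parse a two-column text file once and store it as a binary .npy file."""
    data = np.loadtxt(text_path)   # slow step: parses the text
    np.save(npy_path, data)        # binary file, quick to reload later


def load_xy(npy_path):
    """Load the binary file and split it into its two columns."""
    data = np.load(npy_path)
    return data[:, 0], data[:, 1]


# Small demonstration with temporary files (hypothetical names).
tmpdir = tempfile.mkdtemp()
text_path = os.path.join(tmpdir, "model.txt")
npy_path = os.path.join(tmpdir, "model.npy")
with open(text_path, "w") as f:
    f.write("1.0 2.0\n3.0 4.5\n5.5 6.0\n")

convert_to_npy(text_path, npy_path)
x, y = load_xy(npy_path)
```

The conversion is paid once per file; later runs only call load_xy.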
array.array also has methods for writing array data to file, with the drawback that the array data is written in the native format and cannot be read from a different architecture.
Check out python's struct module. It's probably what you'd want to be using for reading and writing your data.
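As a rough illustration of the struct approach, the sketch below stores each (x, y) pair as two little-endian 8-byte floats. The record layout and the names PAIR, write_pairs and read_pairs are assumptions, not something the question specifies:

```python
import os
import struct
import tempfile

# Assumed layout: each record is two little-endian doubles (x, then y).
PAIR = struct.Struct("<dd")


def write_pairs(path, pairs):
    """Write an iterable of (x, y) tuples as fixed-size binary records."""
    with open(path, "wb") as f:
        for x, y in pairs:
            f.write(PAIR.pack(x, y))


def read_pairs(path):
    """Read the whole file back and return two lists, xs and ys."""
    with open(path, "rb") as f:
        raw = f.read()
    xs, ys = [], []
    for x, y in PAIR.iter_unpack(raw):
        xs.append(x)
        ys.append(y)
    return xs, ys


path = os.path.join(tempfile.mkdtemp(), "xy.bin")
write_pairs(path, [(1.0, 2.0), (3.5, 4.25)])
x, y = read_pairs(path)
```

Fixing the byte order in the format string ("<") keeps the file readable on machines with a different native byte order.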
I suggest that, instead of the struct module, if your model is just floats/doubles (coordinates), you look at the array module, which should be much faster than any operation in the struct module. The downside is that the collection is homogeneous: you need to store the first values at odd indexes and the second ones at even indexes, or store the two columns one after the other.
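The interleaved layout described above could look roughly like this. It is a sketch using only the standard library; the function names are invented for the example, and the file is written in the machine's native format:

```python
import os
import tempfile
from array import array


def save_interleaved(path, xs, ys):
    """Store x0, y0, x1, y1, ... as native doubles."""
    flat = array("d")
    for x, y in zip(xs, ys):
        flat.extend((x, y))
    with open(path, "wb") as f:
        flat.tofile(f)


def load_interleaved(path):
    """Read every double in the file and split it back into two columns."""
    flat = array("d")
    count = os.path.getsize(path) // flat.itemsize
    with open(path, "rb") as f:
        flat.fromfile(f, count)
    return flat[0::2], flat[1::2]   # x values, y values


path = os.path.join(tempfile.mkdtemp(), "xy_array.bin")
save_interleaved(path, [1.0, 3.5, 5.0], [2.0, 4.25, 6.0])
x, y = load_interleaved(path)
```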
Related
I have a 20GB library of images stored as a high-dimensional numpy array. This library allows me to use these images without having to generate them anew each time. Now my problem is that np.load("mylibrary") takes as much time as it would take to generate a couple of those images. Therefore my question is: Is there a way to store a numpy array such that it is readily accessible without having to load it?
Edit: I am using PyCharm
I would suggest h5py which is a Pythonic interface to the HDF5 binary data format.
It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
You can also use PyTables. It is another HDF5 interface for Python and numpy.
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data. You can download PyTables and use it for free. You can access documentation, some examples of use and presentations here.
numpy.memmap is another option. However, it would be slower than HDF5. Another limitation is that an array should be limited to 2.5G.
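A small sketch of the memory-mapping option, assuming the library was written with np.save. Passing mmap_mode to np.load returns a numpy.memmap, so only the slices that are actually indexed get read from disk. The tiny array and the file name stand in for the real 20GB library:

```python
import os
import tempfile

import numpy as np

# Stand-in for the real image library: 4 "images" of shape 3x2.
path = os.path.join(tempfile.mkdtemp(), "library.npy")
np.save(path, np.arange(24, dtype=np.float64).reshape(4, 3, 2))

# Open the file without reading its contents into memory.
library = np.load(path, mmap_mode="r")

# Only the requested image is copied into an ordinary in-memory array.
second_image = np.array(library[1])
```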
I have a 3.3gb file containing one long line. The values in the file are comma separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used hdf files, via h5py, of sizes well over 200 gigs with very little processing time, on the order of a minute or two, for file accesses. In addition the hdf libraries support mpi and concurrent access.
This means that, assuming you can reformat your original one-line file as an appropriately hierarchical HDF file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF to process your data on multiple cores, using MPI to pass whatever data you need between the cores.
You need to be careful with your code and understand how mpi works in conjunction with hdf, but it'll speed things up no end.
Of course all of this depends on putting the data into an hdf file in a way that allows you to take advantage of mpi... so maybe not the most practical suggestion.
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
If you can't change the file type (because it is not the result of one of your programs) then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.
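The same "parse once, then dump binary" idea can also be sketched in plain Python rather than C++. The function below reads the single long line in fixed-size chunks, so the whole text never has to sit in memory at once. The name parse_single_line and the chunk size are illustrative choices, and this has not been timed against numpy.fromfile:

```python
import os
import tempfile
from array import array


def parse_single_line(path, chunk_size=1 << 20):
    """Parse a file of comma-separated numbers chunk by chunk."""
    values = array("d")
    leftover = ""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts = (leftover + chunk).split(",")
            leftover = parts.pop()   # last piece may be cut mid-number
            values.extend(float(p) for p in parts if p.strip())
    if leftover.strip():
        values.append(float(leftover))
    return values


# Tiny demonstration; a very small chunk size forces numbers to be split.
path = os.path.join(tempfile.mkdtemp(), "distance_matrix.tmp")
with open(path, "w") as f:
    f.write("10,10,2.5,3,10\n")
values = parse_single_line(path, chunk_size=4)
```

The resulting array can then be written once with values.tofile(...) and reloaded later without any parsing.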
I've looked all over for an answer to this one, but nothing really seems to fit the bill. I've got very large files that I'm trying to read with ATpy, and the data comes in the form of numpy arrays. For smaller files the following code has been sufficient:
sat = atpy.Table('satellite_data.tbl')
From there I build up a number of variables that I have to manipulate later for plotting purposes. It's lots of these kinds of operations:
w1 = np.array([sat['w1_column']])
w2 = np.array([sat['w2_column']])
w3 = np.array([sat['w3_column']])
colorw1w2 = w1 - w2 #just subtracting w2 values from w1 values for each element
colorw1w3 = w1 - w3
etc.
But for very large files the computer can't handle it. I think all the data is getting stored in memory before parsing begins, and that's not feasible for 2GB files. So, what can I use instead to handle these large files?
I've seen lots of posts where people are breaking up the data into chunks and using for loops to iterate over each line, but I don't think that's going to work for me here given the nature of these files, and the kinds of operations I need to do on these arrays. I can't just do a single operation on every line of the file, because each line contains a number of parameters that are assigned to columns, and in some cases I need to do multiple operations with figures from a single column.
Honestly I don't really understand everything going on behind the scenes with ATpy and numpy. I'm new to Python, so I appreciate answers that spell it out clearly (i.e. not relying on lots of implicit coding knowledge). There has to be a clean way of parsing this, but I'm not finding it. Thanks.
For very large arrays (larger than your memory capacity) you can use pytables which stores arrays on disk in some clever ways (using the HDF5 format) so that manipulations can be done on them without loading the entire array into memory at once. Then, you won't have to manually break up your datasets or manipulate them one line at a time.
I know nothing about ATpy so you might be better off asking on an ATpy mailing list or at least some astronomy python users mailing list, as it's possible that ATpy has another solution built in.
From the PyTables website:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package.
... fast, yet extremely easy to use tool for interactively browse, process and search very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space...
Look into using pandas. It's built for this kind of work. But the data files need to be stored in a well structured binary format like hdf5 to get good performance with any solution.
I recently came across Pytables and find it to be very cool. It is clear that they are superior to a csv format for very large data sets. I am running some simulations using python. The output is not so large, say 200 columns and 2000 rows.
If someone has experience with both, can you suggest which format would be more convenient in the long run for such data sets that are not very large. Pytables has data manipulation capabilities and browsing of the data with Vitables, but the browser does not have as much functionality as, say Excel, which can be used for CSV. Similarly, do you find one better than the other for importing and exporting data, if working mainly in python? Is one more convenient in terms of file organization? Any comments on issues such as these would be helpful.
Thanks.
Have you considered Numpy arrays?
PyTables is wonderful when your data is too large to fit in memory, but a 200x2000 matrix of 8 byte floats only requires about 3MB of memory. So I think PyTables may be overkill.
You can save numpy arrays to files using np.savetxt (text) or np.savez (binary; use np.savez_compressed for compression), and can read them from files with np.loadtxt or np.load.
If you have many such arrays to store on disk, then I'd suggest using a database instead of numpy .npz files. By the way, to store a 200x2000 matrix in a database, you only need 3 table columns: row, col, value:
import sqlite3
import numpy as np
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('''CREATE TABLE foo
(row INTEGER,
col INTEGER,
value FLOAT,
PRIMARY KEY (row,col))''')
ROWS=4
COLUMNS=6
matrix = np.random.random((ROWS,COLUMNS))
print(matrix)
# [[ 0.87050721 0.22395398 0.19473001 0.14597821 0.02363803 0.20299432]
# [ 0.11744885 0.61332597 0.19860043 0.91995295 0.84857095 0.53863863]
# [ 0.80123759 0.52689885 0.05861043 0.71784406 0.20222138 0.63094807]
# [ 0.01309897 0.45391578 0.04950273 0.93040381 0.41150517 0.66263562]]
# Store matrix in table foo
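cursor.executemany('INSERT INTO foo(row, col, value) VALUES (?,?,?) ',
                   ((r, c, value) for r, row in enumerate(matrix)
                    for c, value in enumerate(row)))
# Retrieve matrix from table foo
cursor.execute('SELECT value FROM foo ORDER BY row,col')
# Take the single column out of each returned row
# (zip(...)[0] only works on Python 2, where zip returns a list)
data = [row[0] for row in cursor.fetchall()]
# np.float was removed from NumPy; the built-in float is equivalent
matrix2 = np.fromiter(data, dtype=float).reshape((ROWS, COLUMNS))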
print(matrix2)
# [[ 0.87050721 0.22395398 0.19473001 0.14597821 0.02363803 0.20299432]
# [ 0.11744885 0.61332597 0.19860043 0.91995295 0.84857095 0.53863863]
# [ 0.80123759 0.52689885 0.05861043 0.71784406 0.20222138 0.63094807]
# [ 0.01309897 0.45391578 0.04950273 0.93040381 0.41150517 0.66263562]]
If you have many such 200x2000 matrices, you just need one more table column to specify which matrix.
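A sketch of that extra column, using plain lists so that it only needs the standard library. The table name matrices, the column name matrix_id and the helpers store and fetch are hypothetical:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE matrices
              (matrix_id INTEGER,
               row INTEGER,
               col INTEGER,
               value REAL,
               PRIMARY KEY (matrix_id, row, col))""")


def store(db, matrix_id, matrix):
    """Insert one matrix (a list of rows) under the given id."""
    db.executemany(
        "INSERT INTO matrices VALUES (?,?,?,?)",
        ((matrix_id, r, c, v)
         for r, row in enumerate(matrix)
         for c, v in enumerate(row)))


def fetch(db, matrix_id, ncols):
    """Rebuild the matrix with the given id as a list of rows."""
    cur = db.execute(
        "SELECT value FROM matrices WHERE matrix_id=? ORDER BY row, col",
        (matrix_id,))
    flat = [v for (v,) in cur]
    return [flat[i:i + ncols] for i in range(0, len(flat), ncols)]


store(db, 1, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
store(db, 2, [[0.5, 0.25, 0.125], [8.0, 9.0, 10.0]])
second = fetch(db, 2, ncols=3)
```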
As far as importing/exporting goes, PyTables uses a standardized file format called HDF5. Many scientific software packages (like MATLAB) have built-in support for HDF5, and the C API isn't terrible. So any data you need to export from or import to one of these languages can simply be kept in HDF5 files.
PyTables does add some attributes of its own, but these shouldn't hurt you. Of course, if you store Python objects in the file, you won't be able to read them elsewhere.
The one nice thing about CSV files is that they're human readable. However, if you need to store anything other than simple numbers in them and communicate with others, you'll have issues. I receive CSV files from people in other organizations, and I've noticed that humans aren't good at making sure things like string quoting are done correctly. It's good that Python's CSV parser is as flexible as it is. One other issue is that floating point numbers can't be stored exactly in text using decimal format. It's usually good enough, though.
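A small illustration of the floating point remark: whether a value survives a trip through text depends on how many digits are written. In the sketch below, a fixed six-decimal format loses information, while Python 3's repr() and a binary encoding both round-trip. The variable names are only for this example:

```python
import struct

value = 0.1 + 0.2   # not exactly 0.3 in binary floating point

as_short_text = "%.6f" % value         # fixed number of decimals
as_repr_text = repr(value)             # enough digits to round-trip
as_binary = struct.pack("<d", value)   # raw 8-byte representation

short_ok = float(as_short_text) == value
repr_ok = float(as_repr_text) == value
binary_ok = struct.unpack("<d", as_binary)[0] == value
```

So a CSV written with full-precision formatting can be exact in practice; the risk comes from tools or people that truncate the digits.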
One big plus for PyTables is the storage of metadata, like variables etc.
If you run the simulations more often with different parameters, you can store each result as an array entry in the h5 file.
We use it to store measurement data + experiment scripts to get the data so it is all self contained.
BTW: If you need to look quickly into a hdf5 file you can use HDFView. It's a Java app for free from the HDFGroup. It's easy to install.
I think it's very hard to compare PyTables and CSV. PyTables is a data structure, while CSV is an exchange format for data.
This is actually quite related to another answer I've provided regarding reading / writing csv files w/ numpy:
Python: how to do basic data manipulation like in R?
You should definitely use numpy, no matter what else! The ease of indexing, etc. far outweighs the cost of the additional dependency (well, I think so). PyTables, of course, relies on numpy too.
Otherwise, it really depends on your application, your hardware and your audience. I suspect that reading in csv files of the size you're talking about won't matter in terms of speed compared to PyTables. But if that's a concern, write a benchmark! Read and write some random data 100 times. Or, if read times matter more, write once, read 100 times, etc.
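A minimal "write once, read many times" benchmark along those lines might look like the sketch below. It compares CSV with pickle as a stand-in for a binary format, because both are in the standard library; swapping in PyTables or numpy readers would follow the same pattern. The data size and repeat count are kept small and are arbitrary:

```python
import csv
import os
import pickle
import random
import tempfile
import timeit

# Random stand-in for simulation output (smaller than 200 x 2000).
rows = [[random.random() for _ in range(200)] for _ in range(200)]

tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "sim.csv")
pkl_path = os.path.join(tmpdir, "sim.pkl")

# Write each format once.
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows(rows)
with open(pkl_path, "wb") as f:
    pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)


def read_csv():
    with open(csv_path, newline="") as f:
        return [[float(v) for v in row] for row in csv.reader(f)]


def read_pickle():
    with open(pkl_path, "rb") as f:
        return pickle.load(f)


# Read each format several times and report the total time.
csv_seconds = timeit.timeit(read_csv, number=10)
pickle_seconds = timeit.timeit(read_pickle, number=10)
print("csv: %.3fs  pickle: %.3fs" % (csv_seconds, pickle_seconds))
```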
I strongly suspect that PyTables will outperform SQL. SQL will rock on complex multi-table queries (especially if you do the same ones frequently), but even on single-table (so called "denormalized") table queries, pytables is hard to beat in terms of speed. I can't find a reference for this off-hand, but you may be able to dig something up if you mine the links here:
http://www.pytables.org/moin/HowToUse#HintsforSQLusers
I'm guessing execution performance for you at this stage will pale in comparison to coder performance. So, above all, pick something that makes the most sense to you!
Other points:
As with SQL, PyTables has an undo feature. CSV files won't have this, but you can keep them in version control, and your VCS doesn't need to be too smart (CSV files are text).
On a related note, CSV files will be much bigger than binary formats (you can certainly write your own tests for this too).
These are not "exclusive" choices.
You need both.
CSV is just a data exchange format. If you use pytables, you still need to import and export in CSV format.
I need to handle tens of Gigabytes data in one binary file. Each record in the data file is variable length.
So the file is like:
<len1><data1><len2><data2>..........<lenN><dataN>
The data contains integer, pointer, double value and so on.
I found that Python cannot really handle this situation. There is no problem reading the whole file into memory; that is fast. But the struct package does not seem to perform well: it almost gets stuck unpacking the bytes.
Any help is appreciated.
Thanks.
struct and array, which other answers recommend, are fine for the details of the implementation, and might be all you need if your needs are always to sequentially read all of the file or a prefix of it. Other options include buffer, mmap, even ctypes, depending on many details you don't mention regarding your exact needs. Maybe a little specialized Cython-coded helper can offer all the extra performance you need, if no suitable and accessible library (in C, C++, Fortran, ...) already exists that can be interfaced for the purpose of handling this humongous file as you need to.
But clearly there are peculiar issues here -- how can a data file contain pointers, for example, which are intrinsically a concept related to addressing memory? Are they maybe "offsets" instead, and, if so, how exactly are they based and coded? Are your needs at all more advanced than simply sequential reading (e.g., random access), and if so, can you do a first "indexing" pass to get all the offsets from start of file to start of record into a more usable, compact, handily-formatted auxiliary file? (That binary file of offsets would be a natural for array -- unless the offsets need to be longer than array supports on your machine!). What is the distribution of record lengths and compositions and number of records to make up the "tens of gigabytes"? Etc, etc.
You have a very large scale problem (and no doubt very large scale hardware to support it, since you mention that you can easily read all of the file into memory that means a 64bit box with many tens of GB of RAM -- wow!), so it's well worth the detailed care to optimize the handling thereof -- but we can't help much with such detailed care unless we know enough detail to do so!-).
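To make the "indexing pass" idea concrete, here is a sketch that assumes each length prefix is a 4-byte little-endian unsigned integer. The question does not say how the lengths are encoded, so that format, and the names build_index and read_record, are assumptions. The pass seeks over each payload instead of reading it, and stores the record offsets in a compact array:

```python
import os
import struct
import tempfile
from array import array

LEN = struct.Struct("<I")   # assumed: 4-byte little-endian length prefix


def build_index(path):
    """Return the file offset of every record without reading payloads."""
    offsets = array("Q")    # 8-byte unsigned integers
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            header = f.read(LEN.size)
            if len(header) < LEN.size:
                break
            (length,) = LEN.unpack(header)
            offsets.append(pos)
            f.seek(length, 1)   # skip the payload
    return offsets


def read_record(path, offset):
    """Random access: read the single record that starts at offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        (length,) = LEN.unpack(f.read(LEN.size))
        return f.read(length)


# Demonstration file with three records, one of them empty.
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for payload in (b"ab", b"", b"hello"):
        f.write(LEN.pack(len(payload)) + payload)

offsets = build_index(path)
third = read_record(path, offsets[2])
```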
Have a look at the array module, specifically at the array.fromfile method. This bit:
Each record in the data file is variable length.
is rather unfortunate, but you could handle it with a try-except clause.
For a similar task, I defined a class like this:
from ctypes import Structure, c_uint32, sizeof, addressof, memmove
class foo(Structure):
    _fields_ = [("myint", c_uint32)]
created an instance
bar = foo()
and did,
block = file.read(sizeof(bar))
memmove(addressof(bar), block, sizeof(bar))
In the event of variable-size records, you can use a similar method for retrieving lenN, and then read the corresponding data entries. Seems trivial to implement. However, I have no idea of how fast this method is compared to using pack() and unpack(), perhaps someone else has profiled both methods.
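A sketch of how the variable-size case might look with ctypes, assuming the length prefix is a little-endian 32-bit unsigned integer. The Header class and iter_records function are invented for this example, and it uses from_buffer_copy instead of memmove:

```python
import io
from ctypes import LittleEndianStructure, c_uint32, sizeof


class Header(LittleEndianStructure):
    # Assumed record header: just the payload length.
    _fields_ = [("length", c_uint32)]


def iter_records(stream):
    """Yield each record's payload from a binary file-like object."""
    while True:
        block = stream.read(sizeof(Header))
        if len(block) < sizeof(Header):
            return   # end of file (or a truncated header)
        header = Header.from_buffer_copy(block)
        yield stream.read(header.length)


# In-memory stand-in for the real file: <len1><data1><len2><data2>
raw = (2).to_bytes(4, "little") + b"ab" + (5).to_bytes(4, "little") + b"hello"
records = list(iter_records(io.BytesIO(raw)))
```

Because iter_records is a generator, records are produced one at a time and the file is never held in memory as a whole.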
For help with parsing the file without reading it into memory you can use the bitstring module.
Internally this uses the struct module and a bytearray, but an immutable Bits object can be initialised with a filename so it won't read it all into memory. In current versions of bitstring the read methods live on ConstBitStream, a subclass of Bits.
For example:
from bitstring import ConstBitStream
s = ConstBitStream(filename='your_file')
# pos and len are both measured in bits
while s.pos < s.len:
    # Read a byte and interpret as an unsigned integer
    length = s.read('uint:8')
    # Read 'length' bytes and convert to a Python bytes object
    data = s.read(length*8).bytes
    # Now do whatever you want with the data
Of course you can parse the data however you want.
You can also use slice notation to read the file contents, although note that the indices will be in bits rather than bytes so for example s[-800:] would be the final 100 bytes.
What if you dump the data file into an in-memory sqlite3 database?
import sqlite3
conn = sqlite3.connect(":memory:")
You can then use sql to process the data.
Besides, you might want to look at generators and iterators.
PyTables is a very good library to handle HDF5, a binary format used in astronomy and meteorology to handle very big datasets:
PyTables
It works more or less like a hierarchical database, where you can store multiple tables, each with its own columns. Have a look at it.