I periodically receive data (every 15 minutes) and have them in an array (numpy array to be precise) in python, that is roughly 50 columns, the number of rows varies, usually is somewhere around 100-200.
Before, I only analyzed this data and tossed it, but now I'd like to start saving it, so that I can create statistics later.
I have considered saving it in a csv file, but it did not seem right to me to save high amounts of such big 2D arrays to a csv file.
I've looked at serialization options, particularly pickle and numpy's .tobytes(), but in both cases I run into an issue - I have to track the amount of arrays stored. I've seen people write the number as the first thing in the file, but I don't know how I would be able to keep incrementing the number while having the file still opened (the program that gathers the data runs practically non-stop). Constantly opening the file, reading the number, rewriting it, seeking to the end to write new data and closing the file again doesn't seem very efficient.
I feel like I'm missing some vital information and have not been able to find it. I'd love it if someone could show me something I can not see and help me solve the problem.
Saving on a csv file might not be a good idea in this case, think about the accessibility and availability of your data. Using a database will be better, you can easily update your data and control the size amount of data you store.
I have many files with three million lines in identical tab delimited format. All I need to do is divide the number in the 14th "column" by the number in the 12th "column", then set the number in the 14th column to the result.
Although this is a very simple function I'm actually really struggling to work out how to achieve this. I've spent a good few hours searching this website but unfortunately the answers I've seen have completely gone over the top of my head as I'm a novice coder!
The tools I have Notepad++ and Ultraedit (which has the ability to use Javascript, although i'm not familiar with this), and Python 3.6 (I have very basic Python knowledge). Other answers have suggested using something called "awk", but when I looked this up it needs Unix - I only have Windows. What's the best tool for getting this done? I'm more than willing to learn something new.
In python there are a few ways to handle csv. For your particular use case
I think pandas is what you are looking for.
You can load your file with df = pandas.read_csv(), then performing your division and replacement will be as easy as df[13] /= df[11].
Finally you can write your data back in csv format with df.to_csv().
I leave it to you to fill in the missing details of the pandas functions, but I promise it is very easy and you'll probably benefit from learning it for a long time.
Hope this helps
If I have some X vs Y data saved in a Matlab .fig file, is there a way to extract that data in Python? I've tried using the method shown in a previous discussion, but this does not work for me. I have also tried to open the files using h5py and PyTables, since .mat files are actually HDF5 files now, but this results in an error where a valid file signature can't be found.
Currently I'm trying to do this with the Anaconda distribution of Python 3.4.
EDIT: I managed to figure out something that works, but I don't know why. This has me worried something might break in the future and I won't be able to debug it. If anyone can explain why this works, but the method in the old discussion doesn't I'd really appreciate it.
from scipy.io import loadmat
d = loadmat('linear.fig', squeeze_me=True, struct_as_record=False)
x = d['hgS_070000'].children.children.properties.XData
y = d['hgS_070000'].children.children.properties.YData
The best way I can think of is using any of the Matlab-Python bridge (such as pymatbridge).
You can call Matlab code directly on python files and transform the data from one to the other. You could use some Matlab code to load the fig and extract the data and then convert the numerical variables to python arrays (or numpy arrays) easily.
I've looked all over for an answer to this one, but nothing really seems to fit the bill. I've got very large files that I'm trying to read with ATpy, and the data comes in the form of numpy arrays. For smaller files the following code has been sufficient:
sat = atpy.Table('satellite_data.tbl')
From there I build up a number of variables that I have to manipulate later for plotting purposes. It's lots of these kinds of operations:
w1 = np.array([sat['w1_column']])
w2 = np.array([sat['w2_column']])
w3 = np.array([sat['w3_column']])
colorw1w2 = w1 - w2 #just subtracting w2 values from w1 values for each element
colorw1w3 = w1 - w3
etc.
But for very large files the computer can't handle it. I think all the data is getting stored in memory before parsing begins, and that's not feasible for 2GB files. So, what can I use instead to handle these large files?
I've seen lots of posts where people are breaking up the data into chunks and using for loops to iterate over each line, but I don't think that's going to work for me here given the nature of these files, and the kinds of operations I need to do on these arrays. I can't just do a single operation on every line of the file, because each line contains a number of parameters that are assigned to columns, and in some cases I need to do multiple operations with figures from a single column.
Honestly I don't really understand everything going on behind the scenes with ATpy and numpy. I'm new to Python, so I appreciate answers that spell it out clearly (i.e. not relying on lots of implicit coding knowledge). There has to be a clean way of parsing this, but I'm not finding it. Thanks.
For very large arrays (larger than your memory capacity) you can use pytables which stores arrays on disk in some clever ways (using the HDF5 format) so that manipulations can be done on them without loading the entire array into memory at once. Then, you won't have to manually break up your datasets or manipulate them one line at a time.
I know nothing about ATpy so you might be better off asking on an ATpy mailing list or at least some astronomy python users mailing list, as it's possible that ATpy has another solution built in.
From the pyables website:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
PyTables is built on top of the HDF5 library, using the Python language and the NumPy package.
... fast, yet extremely easy to use tool for interactively browse, process and search very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that data takes much less space...
Look into using pandas. It's built for this kind of work. But the data files need to be stored in a well structured binary format like hdf5 to get good performance with any solution.
I recently came across Pytables and find it to be very cool. It is clear that they are superior to a csv format for very large data sets. I am running some simulations using python. The output is not so large, say 200 columns and 2000 rows.
If someone has experience with both, can you suggest which format would be more convenient in the long run for such data sets that are not very large. Pytables has data manipulation capabilities and browsing of the data with Vitables, but the browser does not have as much functionality as, say Excel, which can be used for CSV. Similarly, do you find one better than the other for importing and exporting data, if working mainly in python? Is one more convenient in terms of file organization? Any comments on issues such as these would be helpful.
Thanks.
Have you considered Numpy arrays?
PyTables are wonderful when your data is too large to fit in memory, but a
200x2000 matrix of 8 byte floats only requires about 3MB of memory. So I think
PyTables may be overkill.
You can save numpy arrays to files using np.savetxt or np.savez (for compression), and can read them from files with np.loadtxt or np.load.
If you have many such arrays to store on disk, then I'd suggest using a database instead of numpy .npz files. By the way, to store a 200x2000 matrix in a database, you only need 3 table columns: row, col, value:
import sqlite3
import numpy as np
db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('''CREATE TABLE foo
(row INTEGER,
col INTEGER,
value FLOAT,
PRIMARY KEY (row,col))''')
ROWS=4
COLUMNS=6
matrix = np.random.random((ROWS,COLUMNS))
print(matrix)
# [[ 0.87050721 0.22395398 0.19473001 0.14597821 0.02363803 0.20299432]
# [ 0.11744885 0.61332597 0.19860043 0.91995295 0.84857095 0.53863863]
# [ 0.80123759 0.52689885 0.05861043 0.71784406 0.20222138 0.63094807]
# [ 0.01309897 0.45391578 0.04950273 0.93040381 0.41150517 0.66263562]]
# Store matrix in table foo
cursor.executemany('INSERT INTO foo(row, col, value) VALUES (?,?,?) ',
((r,c,value) for r,row in enumerate(matrix)
for c,value in enumerate(row)))
# Retrieve matrix from table foo
cursor.execute('SELECT value FROM foo ORDER BY row,col')
data=zip(*cursor.fetchall())[0]
matrix2 = np.fromiter(data,dtype=np.float).reshape((ROWS,COLUMNS))
print(matrix2)
# [[ 0.87050721 0.22395398 0.19473001 0.14597821 0.02363803 0.20299432]
# [ 0.11744885 0.61332597 0.19860043 0.91995295 0.84857095 0.53863863]
# [ 0.80123759 0.52689885 0.05861043 0.71784406 0.20222138 0.63094807]
# [ 0.01309897 0.45391578 0.04950273 0.93040381 0.41150517 0.66263562]]
If you have many such 200x2000 matrices, you just need one more table column to specify which matrix.
As far as importing/exporting goes, PyTables uses a standardized file format called HDF5. Many scientific software packages (like MATLAB) have built-in support for HDF5, and the C API isn't terrible. So any data you need to export from or import to one of these languages can simply be kept in HDF5 files.
PyTables does add some attributes of its own, but these shouldn't hurt you. Of course, if you store Python objects in the file, you won't be able to read them elsewhere.
The one nice thing about CSV files is that they're human readable. However, if you need to store anything other than simple numbers in them and communicate with others, you'll have issues. I receive CSV files from people in other organizations, and I've noticed that humans aren't good at making sure things like string quoting are done correctly. It's good that Python's CSV parser is as flexible as it is. One other issue is that floating point numbers can't be stored exactly in text using decimal format. It's usually good enough, though.
One big plus for PyTables is the storage of metadata, like variables etc.
If you run the simulations more often with different parameters you the store the results as an array entry in the h5 file.
We use it to store measurement data + experiment scripts to get the data so it is all self contained.
BTW: If you need to look quickly into a hdf5 file you can use HDFView. It's a Java app for free from the HDFGroup. It's easy to install.
i think its very hard to comapre pytables and csv.. pyTable is a datastructure ehile CSV is an exchange format for data.
This is actually quite related to another answer I've provided regarding reading / writing csv files w/ numpy:
Python: how to do basic data manipulation like in R?
You should definitely use numpy, no matter what else! The ease of indexing, etc. far outweighs the cost of the additional dependency (well, I think so). PyTables, of course, relies on numpy too.
Otherwise, it really depends on your application, your hardware and your audience. I suspect that reading in csv files of the size you're talking about won't matter in terms of speed compared to PyTables. But if that's a concern, write a benchmark! Read and write some random data 100 times. Or, if read times matter more, write once, read 100 times, etc.
I strongly suspect that PyTables will outperform SQL. SQL will rock on complex multi-table queries (especially if you do the same ones frequently), but even on single-table (so called "denormalized") table queries, pytables is hard to beat in terms of speed. I can't find a reference for this off-hand, but you may be able to dig something up if you mine the links here:
http://www.pytables.org/moin/HowToUse#HintsforSQLusers
I'm guessing execute performance for you at this stage will pale in comparison to coder performance. So, above all, pick something that makes the most sense to you!
Other points:
As with SQL, PyTables has an undo feature. CSV files won't have this, but you can keep them in version control, and you VCS doesn't need to be too smart (CSV files are text).
On a related note, CSV files will be much bigger than binary formats (you can certainly write your own tests for this too).
These are not "exclusive" choices.
You need both.
CSV is just a data exchange format. If you use pytables, you still need to import and export in CSV format.