I have a large file that I need to load into a dataframe, and I will need to work on it for a while. Is there a way of keeping it loaded in memory, so that if my script fails I will not need to load it again?
Here's one way to avoid having to reload the data from scratch between runs.
For persistent storage beyond RAM, I would recommend looking into HDF5. It's fast, simple, and allows for queries if necessary (see the docs).
pandas supports it through read_hdf() and to_hdf(), analogous to the read_csv()/to_csv() methods, but significantly faster.
A simple illustration of storage and retrieval, including a query (adapted from the docs), would be:
import pandas as pd

df = pd.DataFrame(dict(A=list(range(5)), B=list(range(5))))
df.to_hdf('store_tl.h5', key='table', append=True)  # append=True writes in the queryable 'table' format
pd.read_hdf('store_tl.h5', key='table', where=['index > 2'])
Related
I'm writing a program that requires running algorithms on a very large (~6GB) csv file, which is loaded with pandas using read_csv().
The issue I have now, is that anytime I tweak my algorithms and need to re-simulate (which is very often), I need to wait ~30s for the dataset to load into memory, and then another 30s afterward to load the same dataset into a graphing module so I can visually see what's going on. Once it's loaded however, operations are done very quickly.
So far I've tried using mmap, and loading the dataset into a RAM disk for access, with no improvement.
I'm hoping to find a way to load up the dataset once into memory with one process, and then access it in memory with the algorithm-crunching process, which gets re-run each time I make a change.
This thread seems to be close-ish to what I need, but uses multiprocessing which needs everything to be run within the same context.
I'm not a computer engineer (I'm electrical :), so I'm not sure what I'm asking for is even possible. Any help would be appreciated however.
Thanks,
Found a solution that worked, although it was not directly related to my original ask.
Instead of loading a large file into memory and sharing it between independent processes, I found that the bottleneck was really the parsing step in the pandas library.
In particular it was CSV parsing, since CSV is a notoriously inefficient format for data storage.
I started storing my files in the Python-native pickle format, which is supported by pandas through the to_pickle() and read_pickle() functions. This cut my load times drastically, from ~30s to ~2s.
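A minimal sketch of that workflow (the file names are illustrative): the slow CSV parse is paid once during conversion, and every later run loads the pickle instead.

import pandas as pd

# One-time conversion: pay the slow CSV parse once (file names are illustrative)
df = pd.read_csv('data.csv')
df.to_pickle('data.pkl')

# On every subsequent run, load the much faster pickle instead
df = pd.read_pickle('data.pkl')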
I am trying to read a decently large Parquet file (~2 GB with about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries which the read_parquet function uses as the engine for parquet files. Unfortunately, it seems that while reading, my computer freezes and eventually I get an error saying it ran out of memory (I don't want to repeat running the code since this will cause another freeze - I don't know the verbatim error message).
Is there a good way to somehow write some part of the parquet file to memory without this occurring? I know that parquet files are columnar and it may not be possible to store only a part of the records to memory, but I'd like to potentially split it up if there is a workaround or perhaps see if I am doing anything wrong while trying to read this in.
I do have a relatively weak computer in terms of specs, with only 6 GB of memory and an i3 CPU at 2.2 GHz (Turbo Boost available).
Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
A second possibility is to use an online machine (like Google Colab) to load the parquet file and then save it as HDF. Once you have it, you can use it in chunks.
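A rough sketch of the column-selection idea (the file and column names here are hypothetical):

import pandas as pd

# Load only the columns you actually use, which can cut memory use substantially
df = pd.read_parquet('example.parquet', columns=['user_id', 'user_address'])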
You can use Dask instead of pandas. It is built on pandas, so it has a similar API that you will likely be familiar with, and it is designed for larger-than-memory data.
https://examples.dask.org/dataframes/01-data-access.html
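A minimal sketch of what that could look like, assuming a hypothetical file and column name; Dask only pulls partitions into memory as results are computed:

import dask.dataframe as dd

# Lazily reference the Parquet file; nothing is read into RAM yet (file/column names are illustrative)
ddf = dd.read_parquet('example.parquet')

# Work is done partition by partition, so the full file never has to fit in memory
print(ddf['user_address'].value_counts().compute())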
It's possible to read parquet data in:
- batches
- certain row groups (or by iterating over row groups)
- only certain columns
This way you can reduce the memory footprint. Both fastparquet and pyarrow should allow you to do this.
In the case of pyarrow, iter_batches can be used to read streaming batches from a Parquet file.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('example.parquet')
for i in parquet_file.iter_batches(batch_size=1000):
    print("RecordBatch")
    print(i.to_pandas())
The example above simply reads 1000 records at a time. You can further limit this to certain row groups, or even to certain columns, as shown below.
for i in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0,2,3]):
    print(i.to_pandas())
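If you would rather materialize whole row groups at once instead of streaming batches, pyarrow's ParquetFile also offers read_row_group()/read_row_groups() with the same column selection. A small sketch (the column name is the same illustrative one as above):

# Read only the first row group, restricted to a single column
table = parquet_file.read_row_group(0, columns=['user_address'])
df = table.to_pandas()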
I wrote a Python wrapper for a big data R library. Its primary use is to:
- receive a potentially large amount of data as an R dataframe/tibble
- convert that to a Pandas dataframe
- convert that to a Koalas dataframe
So I am worried about running out of memory. I'm not really worried about how fast it is, because it's kind of a workaround, but it just has to be reliable. I'm also not looking to handle this in a formalized pipeline.
Will Python automatically swap my data onto disk if my users run out of RAM for some reason? Does the fact that it is running in a Docker env have any impact on that?
I have previously saved a dictionary which maps image_name -> list of feature vectors, with the file being ~32 GB. I have been using cPickle to load the dictionary in, but since I only have 8 GB of RAM, this process takes forever. Someone suggested using a database to store all the info and reading from that, but would that be a faster/better solution than reading a file from disk? Why?
Use a database because it allows you to query faster. I've done this before. I would suggest against using cPickle. What specific implementation are you using?
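As a rough illustration of the idea (this sketch picks SQLite from the Python standard library, with hypothetical file, table, and key names), storing each vector as its own row lets you fetch a single image's features without loading the entire 32 GB structure:

import sqlite3
import numpy as np

conn = sqlite3.connect('features.db')
conn.execute('CREATE TABLE IF NOT EXISTS features (image_name TEXT PRIMARY KEY, vec BLOB)')

# Store one feature vector as raw bytes (a real schema might keep several vectors per image)
vec = np.random.rand(512).astype(np.float32)
conn.execute('INSERT OR REPLACE INTO features VALUES (?, ?)', ('img_001.jpg', vec.tobytes()))
conn.commit()

# Fetch only the vector you need, instead of unpickling the whole dictionary
row = conn.execute('SELECT vec FROM features WHERE image_name = ?', ('img_001.jpg',)).fetchone()
vector = np.frombuffer(row[0], dtype=np.float32)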
I'm trying to write data collected from a data acquisition system to locations in memory, and then asynchronously perform further processing on the data or write it out to a file for offline processing. I'm structuring the architecture this way to isolate data acquisition from data analysis and transmittal, buying us some flexibility for future expansion and improvement, but it is definitely more complex than simply writing the data directly to a file.
Here is some exploratory code I wrote.
#io.BufferedRWPair test
from io import BufferedRWPair
# Samples of instrumentation data to be stored in RAM
test0 = {'Wed Aug 1 16:48:51 2012': ['20.0000', '0.0000', '13.5', '75.62', '8190',
'1640', '240', '-13', '79.40']}
test1 = {'Wed Aug 1 17:06:48 2012': ['20.0000', '0.0000', '13.5', '75.62', '8190',
'1640', '240', '-13', '79.40']}
# Attempt to create a RAM-resident object into which to read the data.
data = BufferedRWPair(' ', ' ', buffer_size=1024)
data.write(test0)
data.write(test1)
print data.getvalue()
data.close()
There are a couple of issues here (maybe more!):
-> 'data' is a variable name for a construct (defined outside of Python) that I'm trying to assemble: an array-like structure that should hold sequential records, each containing several process data measurements prefaced by a timestamp that can serve as a key for retrieval. I offered this as background to my design intent, in case the code was too vague to reflect my true questions.
-> This code does not work, because the 'data' object is not being created. I'm just trying to open an empty buffer, to be filled later, but Python is looking for two objects, one readable, one writeable, which are not present in my code. Because of this, I'm not sure I'm even using the right construct, which leads to these questions:
Is io.BufferedRWPair the best way to deal with this data? I've tried StringIO, since I'm on Python 2.7.2, but no luck. I like the idea of a record with a timestamp key, hence my choice of the dict structure, but I'd sure look at alternatives. Are there other io classes I should look at instead?
One alternative I've looked at is the DataFrame construct defined in the NumPy/SciPy/Pandas world. It looks interesting, but there seem to be a lot of additional modules required, so I've shied away from it. I have no experience with any of those modules -- should I be looking at these more complex modules to get what I need?
I'd welcome any suggestions or feedback, folks... Thanks for checking out this question!
If I understand what you are asking, using an in-memory SQLite database might be the way to go. SQLite allows you to create a fully functioning SQL database entirely in memory. Instead of reads and writes you would do selects and inserts.
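A minimal sketch of that approach, reusing the record layout from the question (a timestamp key plus a list of measurement strings); the special ':memory:' path keeps the whole database in RAM:

import sqlite3

conn = sqlite3.connect(':memory:')  # fully RAM-resident database
conn.execute('CREATE TABLE samples (timestamp TEXT PRIMARY KEY, readings TEXT)')

# Insert one record; the readings list is serialized as a comma-separated string for simplicity
readings = ['20.0000', '0.0000', '13.5', '75.62', '8190', '1640', '240', '-13', '79.40']
conn.execute('INSERT INTO samples VALUES (?, ?)', ('Wed Aug 1 16:48:51 2012', ','.join(readings)))
conn.commit()

# Retrieve by timestamp key, much like a dict lookup
row = conn.execute('SELECT readings FROM samples WHERE timestamp = ?',
                   ('Wed Aug 1 16:48:51 2012',)).fetchone()
print(row[0].split(','))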
Writing a mechanism to hold data in memory while it fits and only write it to a file if necessary is redundant – the operating system does this for you anyway. If you use a normal file and access it from the different parts of your application, the operating system will keep the file contents in the disk cache as long as enough memory is available.
If you want to have access to the file by memory addresses, you can memory-map it using the mmap module. However, my impression is that all you need is a standard database, or one of the simpler alternatives offered by the Python standard library, such as the shelve and anydbm modules.
Based on your comments, also check out key-value stores like Redis and memcached.
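For completeness, a minimal sketch of the shelve option mentioned above, using the same timestamp-keyed records (the file name is illustrative); a shelf behaves like a persistent dict without requiring a database server:

import shelve

readings = ['20.0000', '0.0000', '13.5', '75.62', '8190', '1640', '240', '-13', '79.40']

# Write: the shelf is a dict-like object backed by a file on disk
db = shelve.open('samples_shelf')
db['Wed Aug 1 16:48:51 2012'] = readings
db.close()

# Later (or from another process): look up a single record by its timestamp key
db = shelve.open('samples_shelf')
print(db['Wed Aug 1 16:48:51 2012'])
db.close()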