I have previously saved a dictionary which maps image_name -> list of feature vectors, with the file being ~32 GB. I have been using cPickle to load the dictionary, but since I only have 8 GB of RAM, this process takes forever. Someone suggested using a database to store all the info and reading from that, but would that be a faster/better solution than reading a file from disk? Why?
Use a database. Unpickling forces you to deserialize the entire 32 GB object before you can read anything, which with 8 GB of RAM means constant swapping; a database lets you look up a single image's vectors without touching the rest. I would suggest against cPickle for data this size. What specific implementation are you using?
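As a minimal sketch of what that could look like with the standard-library sqlite3 module (the file name, table schema, and example data are made up for illustration): each image's vectors are pickled per row, so a lookup only deserializes the one entry you ask for.

```python
import pickle
import sqlite3

# Hypothetical schema: one row per image; vectors are pickled per row,
# so a query only ever deserializes the single entry requested.
conn = sqlite3.connect("features.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS features (image_name TEXT PRIMARY KEY, vectors BLOB)"
)

def save_features(name, vectors):
    conn.execute(
        "INSERT OR REPLACE INTO features VALUES (?, ?)",
        (name, pickle.dumps(vectors)),
    )
    conn.commit()

def load_features(name):
    row = conn.execute(
        "SELECT vectors FROM features WHERE image_name = ?", (name,)
    ).fetchone()
    return pickle.loads(row[0]) if row else None

save_features("img_001.jpg", [[0.1, 0.2], [0.3, 0.4]])
print(load_features("img_001.jpg"))  # [[0.1, 0.2], [0.3, 0.4]]
```

Unlike loading the whole pickle, the cost here is proportional to what you query, not to the total size of the dataset.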
Related
I'm writing a program that requires running algorithms on a very large (~6GB) csv file, which is loaded with pandas using read_csv().
The issue I have now, is that anytime I tweak my algorithms and need to re-simulate (which is very often), I need to wait ~30s for the dataset to load into memory, and then another 30s afterward to load the same dataset into a graphing module so I can visually see what's going on. Once it's loaded however, operations are done very quickly.
So far I've tried using mmap, and loading the dataset into a RAM disk for access, with no improvement.
I'm hoping to find a way to load up the dataset once into memory with one process, and then access it in memory with the algorithm-crunching process, which gets re-run each time I make a change.
This thread seems to be close-ish to what I need, but uses multiprocessing which needs everything to be run within the same context.
I'm not a computer engineer (I'm electrical :), so I'm not sure what I'm asking for is even possible. Any help would be appreciated however.
Thanks,
Found a solution that worked, although it was not directly related to my original ask.
Instead of loading a large file into memory and sharing it between independent processes, I found that the real bottleneck was the parsing function in the pandas library.
CSV parsing in particular, since CSV is a notoriously inefficient format for data storage.
I started storing my files in the python-native pickle format, which is supported by pandas through the to_pickle() and read_pickle() functions. This cut my load times drastically from ~30s to ~2s.
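A minimal sketch of that workflow (the file names and columns are placeholders; a tiny frame stands in for the real ~6 GB CSV):

```python
import pandas as pd

# Tiny stand-in for the real CSV ("data.csv" is a placeholder path).
pd.DataFrame({"t": [0, 1, 2], "v": [1.0, 2.5, 4.0]}).to_csv("data.csv", index=False)

# One-time conversion: pay the CSV-parsing cost once, then save the
# already-parsed frame in pickle format.
df = pd.read_csv("data.csv")
df.to_pickle("data.pkl")

# On every re-run, reload from the pickle instead: no CSV parsing cost.
df = pd.read_pickle("data.pkl")
print(df.shape)  # (3, 2)
```

One caveat worth noting: pickle files are Python-specific and not guaranteed stable across pandas versions, so they work best as a local cache rather than a long-term storage format.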
Yes, this was asked seven years ago, but the 'answers' were not helpful in my opinion. So much open data uses JSON, so I'm asking this again to see if any better techniques are available. I'm loading a 28 MB JSON file (with 7,000 lines) and the memory used for json.loads is over 300 MB.
This statement is run repeatedly:
data_2_item = json.loads(data_1_item)
and is eating up the memory for the duration of the program. I've tried various other statements such as pd.read_json(in_file_name, lines=True)
with the same results. I've also tried the alternative packages simplejson and rapidjson.
As a commenter observed, json.loads is NOT the culprit. data_2_item can be very large -- sometimes 45K. Since it is appended to a list over 7,000 times, the list becomes huge (300 MB) and that memory is NEVER released. So, to me, the answer is that there is no solution with existing packages/loaders. The overall goal is to load a large JSON file into a Pandas DataFrame without using 300 MB (or more) of memory for intermediate processing, memory which never shrinks back. See also https://github.com/pandas-dev/pandas/issues/17048
If after loading your content you will be using only part of it then consider using ijson to load the JSON content in a streaming fashion, with low memory consumption, and only constructing the data you need to handle rather than the whole object.
I am trying to read a decently large Parquet file (~2 GB with about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries which the read_parquet function uses as the engine for parquet files. Unfortunately, it seems that while reading, my computer freezes and eventually I get an error saying it ran out of memory (I don't want to repeat running the code since this will cause another freeze - I don't know the verbatim error message).
Is there a good way to somehow write some part of the parquet file to memory without this occurring? I know that parquet files are columnar and it may not be possible to store only a part of the records to memory, but I'd like to potentially split it up if there is a workaround or perhaps see if I am doing anything wrong while trying to read this in.
I do have a relatively weak computer in terms of specs, with only 6 GB of memory and an i3 CPU at 2.2 GHz with Turbo Boost available.
Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
A second possibility is to use an online machine (like Google Colab) to load the parquet file and then save it as HDF. Once you have it, you can use it in chunks.
You can use Dask instead of pandas. It is built on pandas, so it has a similar API that you will likely be familiar with, and it is meant for larger data.
https://examples.dask.org/dataframes/01-data-access.html
It's possible to read parquet data in:
- batches
- certain row groups (or by iterating over row groups)
- only certain columns

This way you can reduce the memory footprint. Both fastparquet and pyarrow allow you to do this.
In case of pyarrow, iter_batches can be used to read streaming batches from a Parquet file.
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')
for batch in parquet_file.iter_batches(batch_size=1000):
    print("RecordBatch")
    print(batch.to_pandas())
The example above simply reads 1,000 records at a time. You can further limit this to certain row groups, or even certain columns, like below.

for batch in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0, 2, 3]):
    print(batch.to_pandas())
I wrote a Python wrapper for a big data R library. Its primary use is to:
receive a potentially large amount of data as an R dataframe/tibble
convert that to a Pandas dataframe
convert that to a Koalas dataframe
So I am worried about running out of memory. I'm not really worried about speed, because this is a workaround; it just has to be reliable. I'm also not looking to handle this in a formalized pipeline.
Will Python automatically swap my data onto disk if my users run out of RAM for some reason? Does the fact that it is running in a Docker env have any impact on that?
baseline - I have CSV data with 10,000 entries. I save this as 1 csv file and load it all at once.
alternative - I have CSV data with 10,000 entries. I save this as 10,000 CSV files and load them individually.
Approximately how much more inefficient is this computationally? I'm not hugely interested in memory concerns. The purpose of the alternative method is that I frequently need to access subsets of the data and don't want to have to read the entire array.
I'm using python.
Edit: I can use other file formats if needed.
Edit1: SQLite wins. Amazingly easy and efficient compared to what I was doing before.
SQLite is an ideal solution for your application.
Simply import your CSV file into a SQLite database table (it will be a single file), then add indexes as necessary.
To access your data, use python sqlite3 library. You can use this tutorial on how to use it.
Compared to many other solutions, SQLite will be the fastest way to select partial data sets locally - certainly much, much faster than accessing 10,000 files. Also read this answer, which explains why SQLite is so good.
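As a sketch of that workflow using only the standard library (the file names, the two-column schema, and the row count are invented for illustration):

```python
import csv
import sqlite3

# Tiny stand-in CSV; the real one would have 10,000 entries.
with open("data.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["id", "value"])
    w.writerows([(i, i * 10) for i in range(100)])

conn = sqlite3.connect("data.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY, value INTEGER)"
)

# Import the CSV rows into the table.
with open("data.csv", newline="") as f:
    rows = [(int(r["id"]), int(r["value"])) for r in csv.DictReader(f)]
conn.executemany("INSERT OR REPLACE INTO data VALUES (?, ?)", rows)
conn.commit()

# Subset queries now read only the rows you ask for; the PRIMARY KEY
# on id is already indexed, so this lookup avoids a full scan.
subset = conn.execute(
    "SELECT value FROM data WHERE id BETWEEN 10 AND 12"
).fetchall()
print(subset)  # [(100,), (110,), (120,)]
```

For columns you filter on besides the primary key, add an index with CREATE INDEX so those queries stay fast as the table grows.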
I would write all the lines to one file. For 10,000 lines, splitting them up is probably not worthwhile; instead, you can pad all the lines to the same length - say 1000 bytes.
Then it's easy to seek to the nth line: just multiply n by the line length.
10,000 files is going to be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing it will require a much larger number of seeks than would accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (Slow seek times aren't an issue for SSDs, but even then there is still the overhead of 10,000 files' worth of metadata for the operating system to deal with.)

Also, with a single file, the OS can speed things up for you by doing read-ahead buffering, as it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon. With multiple files, the OS can't do that.
My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file, and (if possible) pad all of the lines of your CSV file out with spaces so that they all have the same length. For example, say you make sure when writing out the CSV file to make all lines in the file exactly 80 bytes long. Then reading the (n)th line of the file becomes relatively fast and easy:
myFileObject.seek(n * 80)        # jump straight to the start of line n
theLine = myFileObject.read(80)  # read exactly one fixed-length line