How to open huge parquet file using Pandas without enough RAM - python

I am trying to read a decently large Parquet file (~2 GB, about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries, which the read_parquet function uses as the engine for parquet files. Unfortunately, while reading, my computer freezes and eventually I get an error saying it ran out of memory (I don't want to run the code again since this will cause another freeze, so I don't know the verbatim error message).
Is there a good way to somehow read only part of the parquet file into memory without this occurring? I know that parquet files are columnar and it may not be possible to hold only part of the records in memory, but I'd like to split it up if there is a workaround, or to see whether I am doing anything wrong while trying to read it in.
I do have a relatively weak computer in terms of specs, with only 6 GB of memory and an i3 CPU at 2.2 GHz with Turbo Boost available.

Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
A second possibility is to use an online machine (like Google Colab) to load the parquet file and then save it as HDF. Once you have it, you can read it in chunks.
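A minimal sketch of both ideas, assuming pyarrow or fastparquet is installed for the Parquet read and PyTables for the HDF step; the file and column names here are just placeholders:

import pandas as pd

# Load only the columns you actually need (names are placeholders).
df = pd.read_parquet('example.parquet', columns=['user_id', 'amount'])

# Optionally re-save as an HDF5 table; format='table' is what later
# allows reading it back in chunks.
df.to_hdf('example.h5', key='data', format='table')

# Iterate over the HDF file a chunk at a time instead of loading it all.
for chunk in pd.read_hdf('example.h5', key='data', chunksize=100_000):
    print(chunk.shape)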

You can use Dask instead of pandas. It is built on pandas, so it has a similar API that you will likely be familiar with, and it is meant for larger data.
https://examples.dask.org/dataframes/01-data-access.html
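A minimal sketch of what that looks like (this assumes Dask's dataframe extra is installed; the file path and column names are placeholders):

import dask.dataframe as dd

# Dask reads the Parquet metadata lazily and only materializes
# partitions when you actually compute something.
ddf = dd.read_parquet('example.parquet')

# Operations look like pandas but are evaluated partition by partition.
result = ddf.groupby('user_id')['amount'].sum().compute()
print(result.head())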

It's possible to read parquet data in:
- batches
- certain row groups (or by iterating over row groups)
- only certain columns
This way you can reduce the memory footprint. Both fastparquet and pyarrow should allow you to do this.
In the case of pyarrow, iter_batches can be used to read streaming batches from a Parquet file.
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')

# Stream the file 1000 records at a time instead of loading it all at once.
for batch in parquet_file.iter_batches(batch_size=1000):
    print("RecordBatch")
    print(batch.to_pandas())
The above example simply reads 1000 records at a time. You can further limit this to certain row groups or even certain columns, like below.
for batch in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0, 2, 3]):
    print(batch.to_pandas())
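If the filtered batches are small enough to hold at once, you could also stitch them back into a single pandas DataFrame; a rough sketch (the column name is just an example):

import pandas as pd
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')

# Collect only the column(s) you need, batch by batch, then concatenate.
frames = [
    batch.to_pandas()
    for batch in parquet_file.iter_batches(batch_size=100_000,
                                           columns=['user_address'])
]
df = pd.concat(frames, ignore_index=True)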

Related

Multiple External Processes Reading From the Same Data Source

I have a situation where multiple sources will need to read from the same (small in size) data source, possibly at the same time. For example, multiple different computers calling a function that needs to read from an external data source (e.g. an Excel file). Since multiple different sources are involved, I cannot simply read from the data source once and pass it into the function; it must be loaded inside the function.
Is there a data source that can handle this effectively? A pandas dataframe was an acceptable format for the information that needs to be read, so I tried storing that dataframe in an sqlite3 database since, according to the sqlite3 website, sqlite3 databases can handle concurrent reads. Unfortunately, it fails too often. I tried multiple different iterations and simply could not get it to work.
Is there another data format/source that would work/be effective? I tried scouring the internet for whether something as simple as an Excel file plus the pandas read_excel function could handle this type of concurrency, but I could not find any information. I ran an experiment using a multiprocessing pool to simultaneously load the same very large (i.e. 1-minute load) Excel file and it did not crash. But of course, that is not exactly a perfect experiment.
Thanks!
You can try using openpyxl's read-only mode. It uses a generator instead of loading the whole file.
Also take a look at processing large xlsx file in python
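A minimal sketch of the read-only mode (the file name and the process() callback are placeholders for your own data and handling):

from openpyxl import load_workbook

# read_only=True streams rows lazily instead of building the
# whole workbook in memory.
wb = load_workbook('shared_data.xlsx', read_only=True)
ws = wb.active

for row in ws.iter_rows(values_only=True):
    process(row)  # placeholder for your own handling

wb.close()  # read-only workbooks should be closed explicitly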

Does Python automatically use swap memory?

I wrote a Python wrapper for a big data R library. Its primary use is to:
- receive a potentially large amount of data as an R dataframe/tibble
- convert that to a Pandas dataframe
- convert that to a Koalas dataframe
So I am worried about running out of memory. I'm not really worried about how fast it is, because it's kind of a workaround; it just has to be reliable. I'm also not looking to handle this in a formalized pipeline.
Will Python automatically swap my data onto disk if my users run out of RAM for some reason? Does the fact that it is running in a Docker env have any impact on that?

Reading big json dataset using pandas with chunks

I want to read a 6 GB JSON file (and I have another of 1.5 GB), and I tried to read it normally with pandas (just with pd.read_json), and clearly memory dies.
Then, I tried with chunksize param, like:
with open('data/products.json', encoding='utf-8') as f:
    df = []
    df_reader = pd.read_json(f, lines=True, chunksize=1000000)
    for chunk in df_reader:
        df.append(chunk)
    data = pd.read_json(df)
But that doesn't work either, and my PC dies within the first minute of running (I have 8 GB of RAM).
Dask and PySpark have dataframe solutions that are nearly identical to pandas. PySpark is a Spark API and distributes workloads across JVMs. Dask specifically targets the out-of-memory-on-a-single-workstation use case and implements the dataframe API.
As shown here, Dask's read_json API mostly passes through from pandas.
As you port your example code from the question, I would note two things:
- I suspect you won't need the file context manager; simply passing the file path probably works.
- If you have multiple files, Dask supports globs like "path/to/files/*.json".
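A hedged sketch for the line-delimited JSON in the question (the blocksize value is just an example and assumes a reasonably recent Dask):

import dask.dataframe as dd

# blocksize splits the line-delimited JSON into ~100 MB partitions, so no
# single chunk has to hold the whole 6 GB file in memory at once.
ddf = dd.read_json('data/products.json', lines=True, blocksize=100_000_000)

# Work lazily, then compute only the (much smaller) result.
print(len(ddf))  # row count without loading everything at once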

Large data from SQL server to local hard disc to Tableau and Pandas

I am trying to export a large dataset from SQL Server to my local hard disc for some data analysis. The file size goes up to 30 GB, with over 6 million rows and about 10 columns.
This data will then be fed into Python pandas or Tableau for consumption. I am thinking the size of the file itself will give me poor performance during my analysis.
Any best practices to be shared for analyzing big-ish data on a local machine?
I am running an i7 4570 with 8 GB of RAM. I am hoping to be less reliant on SQL queries and to be able to run large analyses offline.
Due to the nature of the database, a fresh extract needs to happen and this process will have to repeat itself, meaning there will not be much appending happening.
I have explored HDFStores and also Tableau Data Extracts, but still curious whether I can get better performances by reading whole CSV files.
Is there a compression method of sorts that I might be missing out on? Again, the objective here is to run the analysis without constantly querying the server; the source itself (which I am optimizing) will refresh every morning, so when I get into the office I can just focus on getting coffee and some blazing-fast analytics done.
With Tableau you would want to take an extract of the CSV (an extract will be much quicker to query than the raw CSV). That should be fine since the extract sits on disk. However, as mentioned, you need to create a new extract once your data changes.
With Pandas I usually load everything into memory, but if it doesn't fit then you can read the CSV in chunks using chunksize (see this thread: How to read a 6 GB csv file with pandas)
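A minimal sketch of the chunked approach (the file and column names are placeholders for your extract):

import pandas as pd

# Process the big CSV one chunk at a time and keep only the
# aggregated (much smaller) result in memory.
totals = []
for chunk in pd.read_csv('daily_extract.csv', chunksize=500_000):
    totals.append(chunk.groupby('region')['sales'].sum())

result = pd.concat(totals).groupby(level=0).sum()
print(result)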

Using Pickle vs database for loading large amount of data?

I have previously saved a dictionary which maps image_name -> list of feature vectors, with the file being ~32 GB. I have been using cPickle to load the dictionary, but since I only have 8 GB of RAM, this process takes forever. Someone suggested using a database to store all the info and reading from that, but would that be a faster/better solution than reading a file from disk? Why?
Use a database because it allows you to query faster. I've done this before. I would suggest against using cPickle. What specific implementation are you using?
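For example, a rough sketch with sqlite3, assuming the feature vectors can be serialized per image (the table, file, and image names are illustrative):

import sqlite3
import numpy as np

conn = sqlite3.connect('features.db')
conn.execute(
    "CREATE TABLE IF NOT EXISTS features (image_name TEXT PRIMARY KEY, vectors BLOB)"
)

# Store each image's feature vectors as a serialized blob (made-up data).
vectors = np.random.rand(10, 128).astype(np.float32)
conn.execute(
    "INSERT OR REPLACE INTO features VALUES (?, ?)",
    ('img_0001.jpg', vectors.tobytes()),
)
conn.commit()

# Later, pull back only the image you need instead of unpickling 32 GB.
row = conn.execute(
    "SELECT vectors FROM features WHERE image_name = ?", ('img_0001.jpg',)
).fetchone()
loaded = np.frombuffer(row[0], dtype=np.float32).reshape(-1, 128)
conn.close()

The point is that a query like the SELECT above touches only the rows you ask for, instead of deserializing the entire 32 GB structure up front.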
