Reading big json dataset using pandas with chunks


I want to read a 6 GB JSON file (and I have another of 1.5 GB). I tried reading it normally with pandas (just pd.read_json), and it clearly runs out of memory.
Then I tried the chunksize parameter, like this:
with open('data/products.json', encoding='utf-8') as f:
    df = []
    df_reader = pd.read_json(f, lines=True, chunksize=1000000)
    for chunk in df_reader:
        df.append(chunk)
data = pd.read_json(df)
But that doesn't work either, and my PC dies within the first minute of running (it has 8 GB of RAM).

Dask and PySpark have dataframe solutions that are nearly identical to pandas. PySpark is the Python API for Spark and distributes workloads across JVMs. Dask specifically targets the out-of-memory-on-a-single-workstation use case and implements the dataframe API.
Dask's read_json mostly passes its arguments through to pandas.
As you port your example code from the question, I would note two things:
I suspect you won't need the file context manager, as simply passing the file path probably works.
If you have multiple files, Dask supports globs like "path/to/files/*.json".
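If installing Dask is not an option, the chunked reader from the question can still work, provided each chunk is reduced (or written out) before the next one is read instead of being appended to a list. The following is a minimal sketch in plain pandas, not the Dask approach described above. It assumes the file is in JSON Lines format (one object per line, which lines=True requires); the function name count_by_column and the idea of counting one column are illustrative assumptions, not part of the question.

```python
from collections import Counter

import pandas as pd


def count_by_column(path, column, chunksize=100_000):
    """Stream a JSON Lines file and count the values of one column.

    Only one chunk is held in memory at a time; the running totals
    are kept in a small Counter instead of a list of DataFrames.
    """
    totals = Counter()
    # Passing the path directly lets pandas open the file itself.
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        # Reduce the chunk to a small summary, then let it be freed.
        totals.update(chunk[column].value_counts().to_dict())
    return totals
```

A smaller chunksize than the 1000000 used in the question is probably a safer starting point on an 8 GB machine; the right value depends on how wide each record is.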

Related

Memory efficient data loading with Dask

What I have is a big CSV file with data and a server whose RAM is the bottleneck. There is also a dask-distributed cluster that looks like a solution for this case; dask-scheduler is running on the server. Here's what I have tried:
import dask.dataframe as dd
import pandas as pd
from dask.bag import from_sequence
cheques = dd.read_csv('cheque_data.csv') # not working because of distributed workers can't access file directly
cheques = from_sequence(pd.read_csv('cheque_data.csv',chunksize=10**4)).to_dataframe() # dask Bag from_sequence constructs from tuples or dict
So I am stuck here. Any ideas and clues would be great.

Your code snippet says dd.read_csv('cheque_data.csv') is "not working because of distributed workers can't access file directly".
I think you need to give your workers access to the file, for example by placing it on storage that every worker can reach under the same path.
I prefer splitting up huge files before they're read by Dask. That makes it easier for Dask to read the files in parallel.
It depends on the size of cheque_data.csv. 2GB is a lot easier to manage than 30GB.
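Splitting the file beforehand, as suggested above, can itself be done in a streaming fashion so that the server never holds the whole CSV in memory. This is a rough sketch using pandas' chunked reader; the function name split_csv, the part-file naming scheme, and the default rows_per_file are assumptions made for illustration.

```python
import os

import pandas as pd


def split_csv(src, out_dir, rows_per_file=1_000_000):
    """Write src out as several smaller CSV files, one chunk at a time."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    # chunksize makes read_csv yield DataFrames of at most rows_per_file rows.
    for i, chunk in enumerate(pd.read_csv(src, chunksize=rows_per_file)):
        part = os.path.join(out_dir, f"part-{i:05d}.csv")
        # Each part gets its own header row so it can be read on its own.
        chunk.to_csv(part, index=False)
        paths.append(part)
    return paths
```

Once the parts sit somewhere the workers can reach, a glob such as dd.read_csv('parts/part-*.csv') should let Dask read them in parallel.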

Multiple External Processes Reading From the Same Data Source

I have a situation where multiple sources need to read from the same (small) data source, possibly at the same time. For example, multiple different computers call a function that needs to read from an external data source (e.g. an Excel file). Since multiple different sources are involved, I cannot simply read from the data source once and pass it into the function; it must be loaded in the function.
Is there a data source that can handle this effectively? A pandas dataframe was an acceptable format for the information that needs to be read, so I tried storing that dataframe in an sqlite3 database since, according to the sqlite3 website, sqlite3 databases can handle concurrent reads. Unfortunately, it fails too often. I tried multiple different iterations and simply could not get it to work.
Is there another data format/source that would work/be effective? I tried scouring the internet for whether or not something as simple as an excel file + the pandas read_excel function could handle this type of concurrency but I could not find information. I tried an experiment of using a multiprocessing pool to simultaneously load the same very large (i.e. 1 minute load) excel file and it did not crash. But of course, that is not exactly a perfect experiment.
Thanks!

You can try using openpyxl's read-only mode. It uses a generator instead of loading the whole file.
Also take a look at processing large xlsx file in python
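If SQLite is still preferred, as in the question, one thing that may be worth trying is opening the database read-only from every reader, so that no reader can take a write lock. This is only a sketch under assumptions: the question does not say why the reads failed, and the function name read_table, the 30-second timeout, and the table name in the usage note are hypothetical.

```python
import sqlite3
from pathlib import Path

import pandas as pd


def read_table(db_path, table):
    """Load one table into a DataFrame over a read-only connection."""
    # mode=ro asks SQLite to open the file read-only (URI filename syntax).
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    # timeout: how long to wait on a locked database before giving up.
    con = sqlite3.connect(uri, uri=True, timeout=30)
    try:
        return pd.read_sql_query(f'SELECT * FROM "{table}"', con)
    finally:
        # Close promptly so the file handle is not held between calls.
        con.close()
```

Each caller would then run something like read_table('lookup.db', 'rates') inside the function that needs the data.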

How to open huge parquet file using Pandas without enough RAM

I am trying to read a decently large Parquet file (~2 GB with about 30 million rows) into my Jupyter Notebook (in Python 3) using the pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries, which the read_parquet function uses as the engine for Parquet files. Unfortunately, it seems that while reading, my computer freezes and eventually I get an error saying it ran out of memory (I don't want to rerun the code since this will cause another freeze, so I don't know the verbatim error message).
Is there a good way to somehow write some part of the parquet file to memory without this occurring? I know that parquet files are columnar and it may not be possible to store only a part of the records to memory, but I'd like to potentially split it up if there is a workaround or perhaps see if I am doing anything wrong while trying to read this in.
I do have a relatively weak computer in terms of specs, with only 6 GB memory and i3. The CPU is 2.2 GHz with Turbo Boost available.

Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
A second possibility is to use an online machine (like google colab) to load the parquet file and then save it as hdf. Once you have it, you can use it in chunks.

You can use Dask instead of pandas. It is built on pandas, so it has a similar API that you will likely be familiar with, and it is meant for larger data.
https://examples.dask.org/dataframes/01-data-access.html

It's possible to read Parquet data in
batches
read certain row groups or iterate over row groups
read only certain columns
This way you can reduce the memory footprint. Both fastparquet and pyarrow should allow you to do this.
In case of pyarrow, iter_batches can be used to read streaming batches from a Parquet file.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('example.parquet')
for batch in parquet_file.iter_batches(batch_size=1000):
    print("RecordBatch")
    print(batch.to_pandas())
The example above simply reads 1000 records at a time. You can further limit this to certain row groups or even certain columns, like below.
for batch in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0,2,3]):
    print(batch.to_pandas())

Does Python automatically use swap memory?

I wrote a Python wrapper for a big data R library. Its primary use is to:
receive a potentially large amount of data as an R dataframe/tibble
convert that to a Pandas dataframe
convert that to a Koalas dataframe
So I am worried about running out of memory. I'm not really worried about how fast it is because it's kind of a workaround, but it has to be reliable. I'm also not looking to handle this in a formalized pipeline.
Will Python automatically swap my data onto disk if my users run out of RAM for some reason? Does the fact that it is running in a Docker env have any impact on that?

Using pool to read multiple files in parallel takes forever on Jupyter Windows:

I want to read 22 files (stored on my hard disk) with around 300,000 rows each into a single pandas data frame. My code was able to do it in 15-25 minutes. My initial thought was that I should make it faster by using more CPUs. (Correct me if I am wrong here and the CPUs can't all read data from the same hard disk at the same time; however, we can assume the data might be on different hard disks later on, so this exercise is still useful.)
I found a few posts like this and this and tried the code below.
import os
import pandas as pd
from multiprocessing import Pool
def read_psv(filename):
    'reads one row of a file (pipe delimited) to a pandas dataframe'
    return pd.read_csv(filename,
                       delimiter='|',
                       skiprows=1, #need this as first row is junk
                       nrows=1, #Just one row for faster testing
                       encoding = "ISO-8859-1", #need this as well
                       low_memory=False
                       )
files = os.listdir('.') #getting all files, will use glob later
df1 = pd.concat((read_psv(f) for f in files[0:6]), ignore_index=True, axis=0, sort=False) #takes less than 1 second
pool = Pool(processes=3)
df_list = pool.map(read_psv, files[0:6]) #takes forever
#df2 = pd.concat(df_list, ignore_index=True) #cant reach this
This takes forever (more than 30-60 minutes, and it still had not finished when I killed the process). I also went through a question similar to mine, but it was of no use.
EDIT: I am using Jupyter on Windows.

Your task is IO-bound: the bottleneck is the hard drive. The CPU has to do only a little work to parse each line in the CSV.
Disk reads are fastest when they are sequential. If you want to read a large file, it's best to let the disk seek the beginning and then just read all of its bytes sequentially.
If you have multiple large files on the same hard-drive and read from them using multiple processes, then the disk head will have to jump back and forth between them, where each jump takes up to 10 ms.
Multiprocessing can still make your code faster, but you will need to store your files on multiple disks, so each disk head can focus on reading one file.
Another alternative is to buy an SSD. Disk seek time is much lower at 0.1 ms and throughput is around 5x faster.

So the issue is not related to bad performance or getting stuck on I/O. It is related to Jupyter and Windows. On Windows we need to include an if clause like this: if __name__ == '__main__': before initializing the Pool. For Jupyter, we need to save the worker function in a separate file and import it in the code. Jupyter is also problematic because it does not show the error log by default. I found out about the Windows issue when I ran the code in a Python shell, and about the Jupyter issue when I ran it in an IPython shell. The following posts helped me a lot:
For Jupyter
For Windows Issue
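The fix described above can be sketched as a standalone script. This is an illustration rather than the poster's exact code: the read_all helper and the "*.psv" file pattern are assumptions, and the nrows=1 test setting from the question is left out. In Jupyter, read_psv would additionally have to live in its own importable module (a .py file next to the notebook), because worker processes cannot import functions that were defined in a notebook cell.

```python
import glob
from multiprocessing import Pool

import pandas as pd


def read_psv(filename):
    """Read one pipe-delimited file into a DataFrame."""
    return pd.read_csv(
        filename,
        delimiter="|",
        skiprows=1,             # first row is junk, as in the question
        encoding="ISO-8859-1",
    )


def read_all(filenames, processes=3):
    """Read the files in worker processes and concatenate the results."""
    with Pool(processes=processes) as pool:
        frames = pool.map(read_psv, filenames)
    return pd.concat(frames, ignore_index=True, sort=False)


if __name__ == "__main__":
    # On Windows each worker re-imports this script; the guard stops the
    # workers from running this section and creating pools of their own.
    files = sorted(glob.glob("*.psv"))  # hypothetical file pattern
    if files:
        df = read_all(files)
        print(df.shape)
```

Running this from a terminal (python script.py) rather than from a notebook also makes any worker traceback visible.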
