Reading json file into pandas dataframe is very slow - python

I have a JSON file that is less than 1 GB in size. I am trying to read it on a server that has 400 GB of RAM using the following simple command:
df = pd.read_json('filepath.json')
However, this code takes forever (several hours) to execute. I tried several suggestions, such as
df = pd.read_json('filepath.json', low_memory=False)
or
df = pd.read_json('filepath.json', lines=True)
But none of them worked. How can reading a 1 GB file on a server with 400 GB of RAM be so slow?

Chunking can shrink memory use, and I recommend the Dask library, which can load the data in parallel. A short sketch of both approaches is below.
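A minimal sketch of both ideas, assuming the file is (or can be rewritten as) line-delimited JSON; 'filepath.json' is the placeholder path from the question and the chunk/block sizes are arbitrary:
import pandas as pd
import dask.dataframe as dd

# Option 1: stream the file in chunks with pandas.
# read_json only supports chunksize when lines=True (one JSON record per line).
chunks = pd.read_json('filepath.json', lines=True, chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)

# Option 2: let Dask read and parse the file in parallel, then materialise it.
ddf = dd.read_json('filepath.json', lines=True, blocksize=2**26)  # ~64 MB blocks
df = ddf.compute()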

Related

How to read a large file as Pandas dataframe?

I want to read a large file (4 GB) as a Pandas dataframe. Since using Dask directly still maxes out the CPU, I read the file as a pandas dataframe, then use dask_cudf, and then convert back to a pandas dataframe.
However, my code still uses maximum CPU on Kaggle, even with the GPU accelerator switched on.
import pandas as pd
import dask_cudf
from dask import dataframe as dd
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
df = pd.read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv", sep="\t", index_col=0)
ddf = dask_cudf.from_cudf(df, npartitions=2)
meth_sub_nt = ddf.infer_objects()
I have had a similar problem. After some research, I came across Vaex.
You can read about its performance here and here.
Essentially this is what you can try to do:
Read the csv file using Vaex and convert it to an HDF5 file (the file format Vaex is most optimised for):
vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', convert=True, chunk_size=5_000)
Open the HDF5 file using Vaex. Vaex will memory-map it and thus will not load the data into memory:
vaex_df = vaex.open('../input/subtype-nt/meth_subtype_normal_tumor.csv.hdf5')
Now you can perform operations on your Vaex dataframe just like you would with Pandas. It will be blazingly fast and you will certainly notice huge performance gains (lower CPU and memory usage).
You can also try to read your csv file directly into a Vaex dataframe without converting it to HDF5. I had read somewhere that Vaex works fastest with HDF5 files, which is why I suggested the approach above.
vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', chunk_size=5_000)
Right now your code suggests that you first attempt to load the data using pandas and then convert it to a dask-cuDF dataframe. That's not optimal (and might not even be feasible). Instead, you can use the dask_cudf.read_csv function directly (see the docs):
from dask_cudf import read_csv
ddf = read_csv('example_output/foo_dask.csv')
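As a hedged usage sketch, assuming the LocalCUDACluster and Client from the question are already running and reusing the placeholder path above, the resulting dask-cuDF dataframe stays lazy until you ask for results:
from dask_cudf import read_csv

# Lazy, GPU-backed dataframe; nothing is parsed until a result is requested.
ddf = read_csv('example_output/foo_dask.csv')

print(ddf.head())      # pulls a small sample to sanity-check the parse
gdf = ddf.compute()    # materialises the full result as a cudf DataFrame on the GPU
pdf = gdf.to_pandas()  # only convert back to pandas if you really need to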

how to share pandas dataframe between processes to save memory

I have a large csv file (around 10 GB).
I use different IPython notebooks to analyse it (using pd.read_csv() to load the file into a dataframe in each notebook).
My problem is that every time I read the file, 10 GB of memory is used.
I am wondering if there is a way to share dataframe data between processes so that I can optimise my memory usage.
An ideal solution would be like this:
in my server file,
def InitData():
    df = pd.read_csv(my.csv)
    share(df)
in other notebook files,
def loadingData():
    df = LoadingSharedData()
    result = df.sum()  # something like this
No matter how many notebooks I create, there would be only one copy of the dataframe in memory.
Using pickle is fast and efficient, provided you are confident that nobody can interfere with the pickled files (see the security considerations in the pickle documentation).
import pickle

with open('filename.pickle', 'wb') as file:
    pickle.dump(df, file)

with open('filename.pickle', 'rb') as file:
    df_test = pickle.load(file)

print(df.equals(df_test))

How to extract and save in .csv chunks of data from a large .csv file iteratively using Python?

I am new to Python and I am attempting to read a large .csv file (with hundreds of thousands or possibly a few million rows, and about 15,000 columns) using pandas.
What I thought I could do is create and save each chunk as a new .csv file, iterating across all chunks. I am currently using a laptop with relatively limited memory (about 4 GB, in the process of upgrading it), but I was wondering whether I could do this without changing my setup now. Alternatively, I could move this process to a PC with more RAM and attempt larger chunks, but I wanted to get this in place even for smaller row chunks.
I have seen that I can quickly process chunks of data (e.g. 10,000 rows and all columns) using the code below. But, being a Python beginner, I have only managed to handle the first chunk. I would like to loop iteratively across chunks and save each of them.
import pandas as pd
import os
print(os.getcwd())
print(os.listdir(os.getcwd()))
chunksize = 10000
data = pd.read_csv('ukb35190.csv', chunksize=chunksize)
df = data.get_chunk(chunksize)
print(df)
export_csv1 = df.to_csv(r'/home/user/PycharmProjects/PROJECT/export_csv_1.csv', index=None, header=True)
If you are not doing any processing on the data, you don't even need to store each chunk in a variable; you can write it out directly. The code below should help.
import pandas as pd

chunksize = 10000
batch_no = 1
for chunk in pd.read_csv(r'ukb35190.csv', chunksize=chunksize):
    chunk.to_csv(r'ukb35190.csv' + str(batch_no) + '.csv', index=False)
    batch_no += 1

import large dataset (4gb) in python using pandas

I'm trying to import a large (approximately 4 GB) csv dataset into Python using the pandas library. Of course the dataset cannot fit into memory all at once, so I used chunks of size 10000 to read the csv.
After this I want to concatenate all the chunks into a single dataframe in order to perform some calculations, but I ran out of memory (I use a desktop with 16 GB of RAM).
My code so far:
# Reading csv
chunks = pd.read_csv("path_to_csv", iterator=True, chunksize=1000)
# Concat the chunks
pd.concat([chunk for chunk in chunks])
pd.concat(chunks, ignore_index=True)
I searched many threads on Stack Overflow and all of them suggest one of these solutions. Is there a way to overcome this? I can't believe I can't handle a 4 GB dataset with 16 GB of RAM!
UPDATE: I still haven't found a way to import the csv file directly. I bypassed the problem by importing the data into PostgreSQL and then querying the database; a sketch of that workaround is below.
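A minimal sketch of that bypass, assuming a local PostgreSQL database named mydb and SQLAlchemy installed; the connection string, table name and column names are placeholders:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/mydb')

# Stream the csv into the database in chunks so the full file never sits in RAM.
for chunk in pd.read_csv('path_to_csv', chunksize=10000):
    chunk.to_sql('my_table', engine, if_exists='append', index=False)

# Later, pull back only the rows/columns a calculation actually needs.
subset = pd.read_sql('SELECT col_a, col_b FROM my_table WHERE col_a > 0', engine)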
I once dealt with this kind of situation using a generator in Python. I hope this will be helpful:
def read_big_file_in_chunks(file_object, chunk_size=1024):
    """Reading a whole big file in chunks."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('very_very_big_file.log')
for chunk in read_big_file_in_chunks(f):
    process_data(chunk)

How to read a Parquet file into Pandas DataFrame?

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples all seem to go through an external Hive runtime.
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The pandas documentation explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
Update: since I answered this, there has been a lot of work on Apache Arrow for better reading and writing of parquet. Also see: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a Python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It creates Python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
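A rough sketch of that flow, assuming the library exposes the DictReader interface shown in its README; the file path is a placeholder:
import parquet  # the parquet-python package
import pandas as pd

# Iterate over rows as plain Python dicts, then build a DataFrame from them.
with open('data.parquet', 'rb') as fo:
    rows = list(parquet.DictReader(fo))

df = pd.DataFrame(rows)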
Aside from pandas, Apache pyarrow also provides a way to turn parquet into a dataframe.
The code is simple, just type:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the Apache pyarrow documentation on Reading and Writing Single Files.
Parquet
Step 1: Data to play with
df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})
Step 2: Save as Parquet
df.to_parquet('sample.parquet')
Step 3: Read from Parquet
df = pd.read_parquet('sample.parquet')
When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction on an 8 GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy or pickle. Although pickle can handle tuples whereas parquet cannot.
df.to_parquet('df.parquet.brotli', compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
Parquet files are often too large to load into memory in one go, so read them using dask.
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df.compute()  # materialises the combined pandas DataFrame
Considering a .parquet file named data.parquet:
parquet_file = '../data.parquet'  # to_parquet below will create this file
Convert to Parquet
Assuming one has a dataframe parquet_df that one wants to save to the parquet file above, one can use DataFrame.to_parquet (this function requires either the fastparquet or pyarrow library) as follows:
parquet_df.to_parquet(parquet_file)
Read from Parquet
In order to read the parquet file back into a dataframe new_parquet_df, one can use pandas.read_parquet() as follows:
new_parquet_df = pd.read_parquet(parquet_file)
