How to read a large file as Pandas dataframe? - python

I want to read a large file (4GB) as a Pandas dataframe. Since using Dask directly still consumes maximum CPU, I read the file as a pandas dataframe, then use dask_cudf, and then convert back to a pandas dataframe.
However, my code still uses maximum CPU on Kaggle, even though the GPU accelerator is switched on.
import pandas as pd
import dask_cudf
from dask import dataframe as dd
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
df = pd.read_csv("../input/subtype-nt/meth_subtype_normal_tumor.csv", sep="\t", index_col=0)
ddf = dask_cudf.from_cudf(df, npartitions=2)
meth_sub_nt = ddf.infer_objects()

I have had a similar problem. After some research, I came to know about Vaex.
You can read about its performance here and here.
Essentially this is what you can try to do:
Read the csv file using Vaex and convert it to an hdf5 file (the file format most optimised for Vaex):
import vaex
vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', convert=True, chunk_size=5_000)
Open the hdf5 file using Vaex. Vaex will do the memory-mapping and thus will not load data into memory.
vaex_df = vaex.open('../input/subtype-nt/meth_subtype_normal_tumor.csv.hdf5')
Now you can perform operations on your Vaex dataframe just like you would be doing with Pandas. It will be blazingly fast and you will certainly notice huge performance gains (lower CPU and memory usage).
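For instance, a minimal sketch of typical operations (the column name beta_value is purely hypothetical):
filtered = vaex_df[vaex_df.beta_value > 0.5]   # lazy filter, no copy of the data is made
print(vaex_df.mean(vaex_df.beta_value))        # aggregations run out-of-core over the memory map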
You can also try to read your csv file directly into a Vaex dataframe without converting it to hdf5. I had read somewhere that Vaex works fastest with hdf5 files, hence the approach suggested above.
vaex_df = vaex.from_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', chunk_size=5_000)

Right now your code suggests that you first attempt to load the data using pandas and then convert it to a dask-cuDF dataframe. That's not optimal (or might not even be feasible). Instead, one can use the dask_cudf.read_csv function (see docs):
from dask_cudf import read_csv
ddf = read_csv('example_output/foo_dask.csv')
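If you eventually need a plain pandas dataframe again, as in the original code, one way (a sketch, assuming the tab-separated file from the question and that it fits in GPU memory) is:
import dask_cudf
ddf = dask_cudf.read_csv('../input/subtype-nt/meth_subtype_normal_tumor.csv', sep='\t')
gdf = ddf.compute()        # materialise a single cudf DataFrame on the GPU
df = gdf.to_pandas()       # copy back to host memory as a pandas DataFrame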

Related

What happens when pandas read_csv is run on a file that is too large

If a file fed into pandas read_csv is too large, will it raise an exception?
What I'm afraid of is that it will just read what it can, say the first 1,000,000 rows and proceed as if there was no problem.
Do there exist situations in which pandas will fail to read all records in a file but also fail to raise an exception (or print errors)?
If you have a large dataset that you want to read many times, I recommend using a .pkl file.
Or you can wrap the read in a try/except block.
However, if you still want to use a csv file, you can visit this question and find a solution: How do I read a large csv file with pandas?
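A minimal sketch of both suggestions, using hypothetical data.csv / data.pkl paths:
import os
import pandas as pd

csv_path = 'data.csv'   # hypothetical paths, for illustration only
pkl_path = 'data.pkl'

if os.path.exists(pkl_path):
    df = pd.read_pickle(pkl_path)     # later runs: much faster than re-parsing the csv
else:
    try:
        df = pd.read_csv(csv_path)
    except MemoryError:
        raise SystemExit("csv too large to load in one piece; consider reading it in chunks")
    df.to_pickle(pkl_path)            # cache the parsed dataframe for next time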
I'd recommend using dask, a high-level library that supports parallel computing.
You can easily import all your data, but it won't be loaded into memory:
import pandas as pd
import dask.dataframe as dd
df = dd.read_csv('data.csv')
df
and from there, you can compute only the selected columns/rows you are interested in:
df_selected = df[columns].loc[indices_to_select]
df_selected.compute()

How to speed up pandas dataframe creation from huge file?

I have a file bigger than 7GB. I am trying to place it into a dataframe using pandas, like this:
df = pd.read_csv('data.csv')
But it takes too long. Is there a better way to speed up the dataframe creation? I was considering changing the parameter engine='c', since it says in the documentation:
"engine{‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete."
But I don't see much gain in speed.
If the problem is that you are not able to create the dataframe because the large size makes the operation fail, you can check how to chunk it in this answer; a minimal sketch of the idea follows.
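Here 'value' is a hypothetical column, used only to show how each chunk can be reduced before concatenation:
import pandas as pd

pieces = []
for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
    # keep only what is needed from each chunk so the final result fits in memory
    pieces.append(chunk[chunk['value'] > 0])

df = pd.concat(pieces, ignore_index=True)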
If the dataframe does get created at some point but you consider it too slow, you can use datatable to read the file, convert it to pandas, and continue with your operations:
import pandas as pd
import datatable as dt
# Read with datatable
datatable_df = dt.fread('myfile.csv')
# Then convert the frame into a pandas dataframe
pandas_df = datatable_df.to_pandas()

Split a parquet file in smaller chunks using dask

I am trying to split a parquet file using Dask with the following piece of code:
import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df,output_path)
I have only one physical file in input, i.e. file.parquet
The output of this script is likewise only one file, i.e. part.0.parquet.
Based on the partition_size & chunksize parameters, I should have multiple files in the output.
Any help would be appreciated
df.repartition(partition_size="100MB") returns a Dask Dataframe.
You have to write:
df = df.repartition(partition_size="100MB")
You can check the number of partitions created by looking at df.npartitions.
Also, you can use the following to write your parquet files:
df.to_parquet(output_path)
Because Parquet files are meant to deal with large files, you should also consider using the argument compression= when writing your parquet files.
You should get what you expect.
NB: Writing import dask.dataframe as pd is misleading because import dask.dataframe as dd is the common convention.
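Putting those corrections together, a sketch of the full script (keeping the chunksize argument and the dataset_path / output_path names from the question):
import dask.dataframe as dd

df = dd.read_parquet(dataset_path, chunksize="100MB")
df = df.repartition(partition_size="100MB")       # reassign: repartition returns a new Dask DataFrame
print(df.npartitions)                             # sanity check on the number of partitions
df.to_parquet(output_path, compression="snappy")  # one part file per partition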

Convert 17GB JSON file to a numpy array

I have a big 17 GB JSON file placed in HDFS. I need to read that file and convert it into a numpy array, which is then passed into a K-Means clustering algorithm. I tried many ways, but the system slows down and I get a memory error, or the kernel dies.
The code I tried is:
from hdfs3 import HDFileSystem
import pandas as pd
import numpy as nm
import json
hdfs = HDFileSystem(host='hostname', port=8020)
with hdfs.open('/user/iot_all_valid.json/') as f:
    for line in f:
        data = json.loads(line)
df = pd.DataFrame(data)
dataset = nm.array(df)
I tried using ijson, but I am still not sure which is the right way to do this faster.
I would stay away from both numpy and Pandas, since you will get memory issues in both cases. I'd rather stick with SFrame or the Blaze ecosystem, which are designed specifically to handle this kind of "big data" case. Amazing tools!
Because the data types are all going to be different per column, a pandas dataframe would be a more appropriate data structure to keep it in. You can still manipulate the data with numpy functions.
import pandas as pd
data = pd.read_json('/user/iot_all_valid.json', dtype={<express converters for the different types here>})
In order to avoid the crashing issue, try running the k-means on a small sample of the data set. Make sure that works as expected. Then you can increase the data size till you feel comfortable with the whole data set.
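A small sketch of that sampling approach, assuming the columns loaded above are numeric and scikit-learn is available:
from sklearn.cluster import KMeans

sample = data.sample(frac=0.01, random_state=0)          # start with ~1% of the rows
kmeans = KMeans(n_clusters=8, random_state=0).fit(sample.to_numpy())
# if this runs comfortably, increase frac step by step towards the full data set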
In order to deal with a numpy array potentially larger than available ram I would use a memory mapped numpy array. On my machine ujson was 3.8 times faster than builtin json. Assuming rows is the number of lines of json:
from hdfs3 import HDFileSystem
import numpy as np
import ujson as json

rows = int(1e8)
columns = 4

hdfs = HDFileSystem(host='hostname', port=8020)

# 'w+' will overwrite any existing output.npy
out = np.memmap('output.npy', dtype='float32', mode='w+', shape=(rows, columns))

with hdfs.open('/user/iot_all_valid.json/') as f:
    for row, line in enumerate(f):
        data = json.loads(line)
        # convert data to numerical array
        out[row] = data

out.flush()
# memmap closes on delete.
del out
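To pass that memory-mapped array to K-Means without pulling it all into RAM, one possibility (a sketch, assuming scikit-learn) is to reopen the file read-only and train in mini-batches:
import numpy as np
from sklearn.cluster import MiniBatchKMeans

out = np.memmap('output.npy', dtype='float32', mode='r', shape=(rows, columns))
km = MiniBatchKMeans(n_clusters=8, random_state=0)
batch = 1_000_000
for start in range(0, rows, batch):
    km.partial_fit(out[start:start + batch])   # each slice reads only that part of the file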

How to read a Parquet file into Pandas DataFrame?

How to read a modestly sized Parquet data-set into an in-memory Pandas DataFrame without setting up a cluster computing infrastructure such as Hadoop or Spark? This is only a moderate amount of data that I would like to read in-memory with a simple Python script on a laptop. The data does not reside on HDFS. It is either on the local file system or possibly in S3. I do not want to spin up and configure other services like Hadoop, Hive or Spark.
I thought Blaze/Odo would have made this possible: the Odo documentation mentions Parquet, but the examples seem all to be going through an external Hive runtime.
pandas 0.21 introduces new functions for Parquet:
import pandas as pd
pd.read_parquet('example_pa.parquet', engine='pyarrow')
or
import pandas as pd
pd.read_parquet('example_fp.parquet', engine='fastparquet')
The pandas documentation explains:
These engines are very similar and should read/write nearly identical parquet format files. These libraries differ by having different underlying dependencies (fastparquet by using numba, while pyarrow uses a c-library).
Update: since the time I answered this there has been a lot of work in this area; have a look at Apache Arrow for better reading and writing of parquet. Also: http://wesmckinney.com/blog/python-parquet-multithreading/
There is a python parquet reader that works relatively well: https://github.com/jcrobak/parquet-python
It will create python objects which you then have to move into a Pandas DataFrame, so the process will be slower than pd.read_csv, for example.
Aside from pandas, Apache pyarrow also provides a way to transform parquet into a dataframe.
The code is simple, just type:
import pyarrow.parquet as pq
df = pq.read_table(source=your_file_path).to_pandas()
For more information, see the Apache pyarrow documentation on Reading and Writing Single Files.
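For the opposite direction, writing a pandas DataFrame to parquet with pyarrow looks roughly like this (df and the output file name are placeholders):
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(df)            # df is an existing pandas DataFrame
pq.write_table(table, 'your_file.parquet')  # writes a single parquet file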
Parquet
Step 1: Data to play with
df = pd.DataFrame({
    'student': ['personA007', 'personB', 'x', 'personD', 'personE'],
    'marks': [20, 10, 22, 21, 22],
})
Step 2: Save as Parquet
df.to_parquet('sample.parquet')
Step 3: Read from Parquet
df = pd.read_parquet('sample.parquet')
When writing to parquet, consider using brotli compression. I'm getting a 70% size reduction of an 8GB parquet file by using brotli compression. Brotli makes for a smaller file and faster reads/writes than gzip, snappy, or pickle. Although pickle can handle tuples whereas parquet does not.
df.to_parquet('df.parquet.brotli', compression='brotli')
df = pd.read_parquet('df.parquet.brotli')
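If you want to check the gain on your own data, a quick sketch comparing the sizes produced by the common codecs (standard pandas/pyarrow compression names):
import os

for comp in ['snappy', 'gzip', 'brotli']:
    path = f'df.parquet.{comp}'
    df.to_parquet(path, compression=comp)
    print(comp, os.path.getsize(path), 'bytes')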
Parquet files are often large, so read them using dask.
import dask.dataframe as dd
from dask import delayed
from fastparquet import ParquetFile
import glob

files = glob.glob('data/*.parquet')

@delayed
def load_chunk(path):
    return ParquetFile(path).to_pandas()

df = dd.from_delayed([load_chunk(f) for f in files])
df.compute()
Considering a .parquet file named data.parquet:
parquet_file = '../data.parquet'   # to_parquet below will create this file
Convert to Parquet
Assuming one has a dataframe parquet_df that one wants to save to the parquet file above, one can use pandas' DataFrame.to_parquet (this function requires either the fastparquet or pyarrow library) as follows:
parquet_df.to_parquet(parquet_file)
Read from Parquet
In order to read the parquet file into a dataframe new_parquet_df, one can use pandas.read_parquet() as follows
new_parquet_df = pd.read_parquet(parquet_file)
