I have to process HDF5 files. Each of them contains data that can be loaded into a pandas DataFrame with 100 columns and almost 5e5 rows, and each HDF5 file is approximately 130 MB.
So I want to fetch the data from the HDF5 file, apply some processing, and finally save the new data in a CSV file. In my case, the performance of the process is very important because I will have to repeat it.
So far I have focused on Pandas and Dask to get the job done. Dask is good for parallelization, and I will get good processing times with a stronger PC and more CPUs.
Has anyone already encountered this problem and found the best optimization?
As others have mentioned in the comments, unless you have to move it to CSV, I'd recommend keeping it in HDF5. However, below is a description of how you might do it if you do have to carry out the conversion.
It sounds like you have a function for loading the HDF5 file into a pandas DataFrame. I would suggest using dask's delayed API to create a list of delayed pandas DataFrames, and then convert them into a dask DataFrame. The snippet below is copied from the linked page, with an added line to save to CSV.
import dask.dataframe as dd
from dask.delayed import delayed
from my_custom_library import load

filenames = ...

# Build one delayed pandas DataFrame per file (nothing is loaded yet)
dfs = [delayed(load)(fn) for fn in filenames]

# Combine the delayed frames into a single dask DataFrame
df = dd.from_delayed(dfs)

df.to_csv(filename, **kwargs)
See dd.to_csv() documentation for info on options for saving to CSV.
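For reference, the load function referred to above might look roughly like this (a minimal sketch, assuming the HDF5 files can be opened with pandas.read_hdf and that the key name "data" is a placeholder for your actual dataset key):

import pandas as pd

def load(filename):
    # Read one HDF5 file into a pandas DataFrame;
    # "data" is a placeholder key, replace it with the key used in your files
    return pd.read_hdf(filename, key="data")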
I have been trying to merge small Parquet files, each with 10k rows, and for each set the number of small files will be 60-100, so the merged Parquet file ends up with around 600k rows at minimum.
I have been trying to use pandas concat. It works fine with a merge of around 10-15 small files.
But since a set may consist of 50-100 files, the process gets killed while running the Python script because the memory limit is breached.
So I am looking for a memory-efficient way to merge any number of small Parquet files in the range of a 100-file set.
I used pandas read_parquet to read each individual DataFrame and combined them with pd.concat(all dataframes).
Is there a better library than pandas, or, if it is possible in pandas, how can it be done efficiently?
Time is not a constraint; it can run for a long time as well.
For large data you should definitely use the PySpark library, split it into smaller pieces if possible, and then use Pandas.
PySpark is very similar to Pandas.
link
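A merge of that kind might look roughly like this in PySpark (a minimal sketch; the paths are placeholders and the SparkSession settings should be tuned for your machine):

from pyspark.sql import SparkSession

# Start a local Spark session
spark = SparkSession.builder.appName("merge_parquet").getOrCreate()

# Spark can read a whole directory of small Parquet files in one call
df = spark.read.parquet("path/to/small_files/")

# coalesce(1) produces a single output file; drop it to keep multiple parts
df.coalesce(1).write.mode("overwrite").parquet("path/to/merged/")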
You can open the files one by one and append them to the Parquet file. It is best to use pyarrow for this.
import pyarrow.parquet as pq

files = ["table1.parquet", "table2.parquet"]

# Use the schema of the first file for the writer; all files must share it
with pq.ParquetWriter("output.parquet", schema=pq.ParquetFile(files[0]).schema_arrow) as writer:
    for file in files:
        # Read each small file and append its contents to the output file
        writer.write_table(pq.read_table(file))
The data is stored in a database that I access through DBeaver. I would like to analyze my data through Python without creating a DataFrame, and Python is installed on my computer. As the data is huge, creating a DataFrame will consume my RAM and disk space.
So, is it possible to directly link my Python code to the database, do the necessary aggregation or data manipulation there, and gather only the output?
If you pull everything into Python directly, it will also consume a lot of RAM and space, and doing the data analysis entirely inside the database may lead to unexpected results.
Instead you can use a Dask DataFrame (see the Dask official documentation and the Dask Wikipedia article).
With a Dask DataFrame you can do data analysis even if you have a big dataset, as sketched below.
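One way to connect a Dask DataFrame directly to a database table is dask.dataframe.read_sql_table (a minimal sketch; the connection string, table name, and column names are placeholders for your own database):

import dask.dataframe as dd

# Placeholder connection string; replace with your own database URI
uri = "postgresql://user:password@host:5432/mydb"

# Dask reads the table in partitions instead of loading everything at once
df = dd.read_sql_table("orders", uri, index_col="order_id", npartitions=20)

# The aggregation is evaluated lazily; only the result comes into memory
totals = df.groupby("customer_id")["amount"].sum().compute()
print(totals.head())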
I don't know at what scale you want to work with your data or how big your dataset is, but if you are going to change the data at a large scale I would recommend creating a CSV file that contains your dataset and working with pandas DataFrames; reading CSV files is fairly fast and they are easy to work with. If you're interested, you can visit the page below and read the parts you need.
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
I am using pandas to read CSV file data, but the csv module is also there to manage CSV files.
So my questions are:
What is the difference between the two?
What are the cons of using pandas over the csv module?
Based upon benchmarks:
The csv module is faster at loading data for smaller datasets (< 1K rows).
Pandas is several times faster for larger datasets.
(See the linked code used to generate the benchmarks, and the benchmark results.)
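A benchmark of that kind might look roughly like this (a minimal sketch using timeit; the file name is a placeholder, and the numbers will depend on your machine and data):

import csv
import timeit
import pandas as pd

FILENAME = "data.csv"  # placeholder path

def load_with_csv():
    # Parse the whole file with the built-in csv module
    with open(FILENAME, newline="") as f:
        return list(csv.reader(f))

def load_with_pandas():
    # Parse the whole file with pandas
    return pd.read_csv(FILENAME)

# Time each loader several times and report the best run
print("csv module:", min(timeit.repeat(load_with_csv, number=3, repeat=3)))
print("pandas    :", min(timeit.repeat(load_with_pandas, number=3, repeat=3)))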
csv is a built-in module but pandas is not. If you only want to read a CSV file, you should not install pandas just for that, because increasing a project's dependencies is not best practice.
If you want to analyze the data in a CSV file with pandas, pandas converts the CSV file into the DataFrame needed for manipulating the data, and you should not use the csv module in those cases.
If you have big data or data with a large volume, you should consider libraries like numpy and pandas.
Pandas is better than csv for managing data and doing operations on the data. The csv module doesn't provide you with the scientific data manipulation tools that Pandas does.
If you are talking only about reading the file, it depends. You can simply look up both modules online, but generally I find it more comfortable to work with Pandas; it also provides better readability, since printing output is nicer too.
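To make the difference concrete, here is a rough sketch of the same task (the mean of a numeric column) with the csv module and with pandas; the file name and column name are placeholders:

import csv
import pandas as pd

# With the csv module you parse and aggregate the rows yourself
with open("sales.csv", newline="") as f:
    reader = csv.DictReader(f)
    values = [float(row["amount"]) for row in reader]
print(sum(values) / len(values))

# With pandas the same operation is a one-liner on a DataFrame
df = pd.read_csv("sales.csv")
print(df["amount"].mean())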
Below is my Python code:
import dask.dataframe as dd

# VALUEFY (columns to read), traintypes1 (dtype mapping) and index (grouping
# column) are defined earlier in the script
VALUE2015 = dd.read_csv('A/SKD - M2M by Salesman (value by uom) (NEWSALES)2015-2016.csv', usecols=VALUEFY, dtype=traintypes1)
REPORT = VALUE2015.groupby(index).agg({'JAN':'sum', 'FEB':'sum', 'MAR':'sum', 'APR':'sum', 'MAY':'sum', 'JUN':'sum', 'JUL':'sum', 'AUG':'sum', 'SEP':'sum', 'OCT':'sum', 'NOV':'sum', 'DEC':'sum'}).compute()
REPORT.to_csv('VALUE*.csv', header=True)
It takes 6 minutes to create a 100MB CSV file.
Looking through Dask documentation, it says there that, "generally speaking, Dask.dataframe groupby-aggregations are roughly same performance as Pandas groupby-aggregations." So unless you're using a Dask distributed client to manage workers, threads, etc., the benefit from using it over vanilla Pandas isn't always there.
Also, try to time each step in your code (as sketched below), because if the bulk of the 6 minutes is spent writing the CSV file to disk, then again Dask will be of no help (for a single file).
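A rough way to split the timing, reusing VALUE2015 and index from the question (a sketch only; the aggregation dict is abbreviated, and note that dd.read_csv is lazy, so the reading work actually happens inside .compute()):

import time

t0 = time.perf_counter()
# .compute() triggers both the CSV read and the groupby-aggregation
REPORT = VALUE2015.groupby(index).agg({'JAN': 'sum', 'FEB': 'sum'}).compute()
t1 = time.perf_counter()
# This is a plain pandas write of the computed result
REPORT.to_csv('VALUE.csv', header=True)
t2 = time.perf_counter()

print(f"read + groupby: {t1 - t0:.1f}s")
print(f"to_csv write:   {t2 - t1:.1f}s")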
Here's a nice tutorial from Dask on adding distributed schedulers for your tasks.
I would like to load large .csv (3.4m rows, 206k users) open sourced dataset from InstaCart https://www.instacart.com/datasets/grocery-shopping-2017
Basically, I have trouble loading orders.csv into Pandas DataFrame. I would like to learn best practices for loading large files into Pandas/Python.
The best option would be to read the data in chunks instead of loading the whole file into memory.
Luckily, the read_csv method accepts a chunksize argument.
for chunk in pd.read_csv("file.csv", chunksize=somesize):
    process(chunk)
Note: By specifying a chunksize to read_csv or read_table, the return value will be an iterable object of type TextFileReader.
Also see:
read_csv
Iterating through files chunk by chunk
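For instance, a chunked aggregation over orders.csv might look roughly like this (a minimal sketch; the user_id column name is assumed from the InstaCart dataset and the chunk size is arbitrary):

import pandas as pd

# Count orders per user without ever holding the full file in memory
counts = None
for chunk in pd.read_csv("orders.csv", chunksize=100_000):
    part = chunk["user_id"].value_counts()
    counts = part if counts is None else counts.add(part, fill_value=0)

print(counts.sort_values(ascending=False).head())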
When you have large data frames that might not fit in memory, dask is quite useful. The main page I've linked to has examples on how you can create a dask dataframe that has the same API as the pandas one but which can be distributed.
Depending on your machine, you may be able to read all of it into memory by specifying the data types while reading the CSV file. When a CSV is read by pandas, the default data types used may not be the best ones. Using dtype, you can specify the data types; it reduces the size of the DataFrame read into memory. For example:
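A sketch of that approach (the column names and types below assume the orders.csv schema from the InstaCart dataset; adjust them to your actual file):

import pandas as pd

# Smaller integer/category types can cut memory use considerably
dtypes = {
    "order_id": "int32",
    "user_id": "int32",
    "order_number": "int16",
    "order_dow": "int8",
    "order_hour_of_day": "int8",
    "days_since_prior_order": "float32",  # has missing values, so keep it float
    "eval_set": "category",
}

df = pd.read_csv("orders.csv", dtype=dtypes)
df.info(memory_usage="deep")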