Reading and processing multiple csv files with limited RAM in Python

I need to read thousands of csv files and output them as a single csv file in Python.
Each of the original files will be used to create a single row in the final output, with the columns being some operation on the rows of that original file.
Due to the combined size of the files, this takes many hours to process and the data cannot all be loaded into memory.
I am able to read in each csv and delete it from memory, which solves the RAM issue. However, I am currently reading and processing each csv iteratively (in Pandas) and appending the output row to the final csv, which seems slow. I believe I could use the multiprocessing library to have each process read and process its own csv (a rough sketch of this is shown after the example below), but wasn't sure if there is a better way.
What is the fastest way to complete this in Python while staying within the RAM limitations?
As an example, ABC.csv and DEF.csv would each be read and reduced to an individual row in the final output csv. (The actual files have tens of columns and hundreds of thousands of rows.)
ABC.csv:
id,col1,col2
abc,2.3,3
abc,3.7,5
abc,3.0,9
DEF.csv:
id,col1,col2
def,1.9,3
def,2.8,2
def,1.6,1
Final Output:
id,col1_avg,col2_max
abc,3.0,9
def,2.1,3
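For reference, a minimal sketch of the per-file multiprocessing idea mentioned above, assuming the example columns and that all input csvs sit in the working directory (file names and aggregations are illustrative only):

import glob
import multiprocessing as mp
import pandas as pd

def summarize(path):
    # Read one csv, reduce it to a single output row, then let the frame go out of scope
    df = pd.read_csv(path)
    return {"id": df["id"].iloc[0],
            "col1_avg": df["col1"].mean(),
            "col2_max": df["col2"].max()}

if __name__ == "__main__":
    files = glob.glob("*.csv")                 # thousands of input files
    with mp.Pool() as pool:
        rows = pool.map(summarize, files)      # each worker handles one csv at a time
    pd.DataFrame(rows).to_csv("final_output.csv", index=False)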

I would suggest using dask for this. It's a library that allows you to do parallel processing on large datasets.
import dask.dataframe as dd

# Lazily read every csv in the current directory as one logical dataframe
df = dd.read_csv('*.csv')
# One output row per id: mean of col1, max of col2
df = df.groupby('id').agg({'col1': 'mean', 'col2': 'max'})
# single_file=True writes one csv rather than one csv per partition
df.to_csv('output.csv', single_file=True)
Code explanation
dd.read_csv('*.csv') reads every csv file matching the pattern in the current directory and concatenates them into one (lazy) dataframe.
df.groupby('id').agg({'col1': 'mean', 'col2': 'max'}) groups the dataframe by the id column and then calculates the mean of col1 and the max of col2 for each group, i.e. one row per original file.
df.to_csv('output.csv', single_file=True) writes the result to a single csv file (by default dask writes one file per partition).
Performance
I tested this on my machine with a directory containing 10,000 csv files with 10,000 rows each. The code took about 2 minutes to run.
Installation
To install dask with dataframe support, run pip install "dask[dataframe]".

Related

Merge small parquet files into a single large parquet file

I have been trying to merge small parquet files, each with 10k rows; each set consists of 60-100 small files, so the merged parquet file ends up with around 600k rows at minimum.
I have been using pandas concat, and it works fine with merging around 10-15 small files.
But since a set may consist of 50-100 files, the python script gets killed while running because the memory limit is breached.
So I am looking for a memory-efficient way to merge any number of small parquet files, in the range of a 100-file set.
I used pandas read_parquet to read each individual dataframe and combined them with pd.concat(all dataframes).
Is there a better library than pandas, or, if it is possible in pandas, how can it be done efficiently?
Time is not a constraint. It can run for a long time as well.
For large data you should definitely use the PySpark library, split into smaller sizes if possible, and then use Pandas.
PySpark is very similar to Pandas.
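For illustration, a rough PySpark sketch of that idea (the paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge_small_parquet").getOrCreate()
# Read every small parquet file in the set as one Spark dataframe
df = spark.read.parquet("small_files/*.parquet")
# coalesce(1) funnels everything into a single part file inside the output directory
df.coalesce(1).write.mode("overwrite").parquet("merged_parquet")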
You can open files one by one and append them to the parquet file. Best to use pyarrow for this.
import pyarrow.parquet as pq

files = ["table1.parquet", "table2.parquet"]
# Reuse the schema of the first file and stream each table into a single writer
with pq.ParquetWriter("output.parquet", schema=pq.ParquetFile(files[0]).schema_arrow) as writer:
    for file in files:
        writer.write_table(pq.read_table(file))

Dask dataframe concatenate and repartitions large files for time series and correlation

I have 11 years of data with a record (row) every second, over about 100 columns. It's indexed with a series of datetime (created with Pandas to_datetime())
We need to be able to do some correlation analysis between the columns, which can work with just 2 columns loaded at a time. We may be resampling at a lower time cadence (e.g. 48 s, 1 hour, months, etc.) over up to 11 years, and visualizing those correlations over the 11 years.
The data are currently in 11 separate parquet files (one per year), individually generated with Pandas from 11 .txt files. Pandas did not partition any of those files. In memory, each of these parquet files loads up to about 20 GB. The intended target machine will only have 16 GB; loading even just 1 column over the 11 years takes about 10 GB, so 2 columns will not fit either.
Is there a more effective solution than working with Pandas for the correlation analysis over 2 columns at a time? For example, using Dask to (i) concatenate them, and (ii) repartition to some number of partitions so that Dask can work with 2 columns at a time without blowing up the memory?
I tried the latter solution following this post, and did:
# Read all 11 parquet files in `data/`
df = dd.read_parquet("/blah/parquet/", engine='pyarrow')
# Export to 20 `.parquet` files
df.repartition(npartitions=20).to_parquet("/mnt/data2/SDO/AIA/parquet/combined")
but at the 2nd step, Dask blew up my memory and I got a kernel shutdown.
As Dask is largely about working with larger-than-memory data, I am surprised this memory escalation happened.
----------------- UPDATE 1: ROW GROUPS -----------------
I reprocessed the parquet files with Pandas to create about 20 row groups per file (it had defaulted to just 1 row group per file). Now, regardless of setting split_row_groups to True or False, I am not able to resample with Dask (e.g. myseries = myseries.resample('48s').mean()). I have to call compute() on the Dask series first to get it as a Pandas series, which seems to defeat the purpose of working with the row groups within Dask.
When doing that resampling, I get instead:
ValueError: Can only resample dataframes with known divisions. See https://docs.dask.org/en/latest/dataframe-design.html#partitions for more information.
I did not have that problem when I used the default Pandas behavior to write the parquet files with just 1 row group.
dask.dataframe by default is structured a bit more toward reading smaller "hive" parquet files rather than chunking individual huge parquet files into manageable pieces. From the dask.dataframe docs:
By default, Dask will load each parquet file individually as a partition in the Dask dataframe. This is performant provided all files are of reasonable size.
We recommend aiming for 10-250 MiB in-memory size per file once loaded into pandas. Too large files can lead to excessive memory usage on a single worker, while too small files can lead to poor performance as the overhead of Dask dominates. If you need to read a parquet dataset composed of large files, you can pass split_row_groups=True to have Dask partition your data by row group instead of by file. Note that this approach will not scale as well as split_row_groups=False without a global _metadata file, because the footer will need to be loaded from every file in the dataset.
I'd try a few strategies here:
1. Only read in the columns you need. Since your files are so huge, you don't want dask to even try loading the first chunk to infer structure. You can provide the columns keyword to dd.read_parquet, which is passed through to the parsing engine. In this case: dd.read_parquet(filepath, columns=list_of_columns) (see the sketch after this list).
2. If your parquet files have multiple row groups, you can make use of the dd.read_parquet argument split_row_groups=True. This will create smaller chunks which are each smaller than the full file size.
3. If (2) works, you may be able to avoid repartitioning, or if you need to, repartition to a multiple of your original number of partitions (22, 33, etc.). When reading data from a file, dask doesn't know how large each partition is, and if you specify a number that is not a multiple of the current number of partitions, the partitioning behavior isn't very well defined. On some small tests I've run, repartitioning 11 --> 20 will leave the first 10 partitions as-is and split the last one into the remaining 10!
4. If your file is on disk, you may be able to read the file as a memory map to avoid loading the data prior to repartitioning. You can do this by passing memory_map=True to dd.read_parquet.
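A rough sketch combining (1) and (2); the column names are placeholders for whichever pair you are correlating:

import dask.dataframe as dd

cols = ["col_a", "col_b"]        # placeholder names: one correlation pair at a time

df = dd.read_parquet(
    "/blah/parquet/",
    engine="pyarrow",
    columns=cols,                # (1) only load the two columns needed
    split_row_groups=True,       # (2) partition by row group instead of by file
)
corr = df.corr().compute()       # 2x2 correlation matrix for this pair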
I'm sure you're not the only one with this problem. Please let us know how this goes and report back what works!

What is the fastest way to read large data from multiple files and aggregate data in python?

I have many files: 1.csv, 2.csv ... N.csv. I want to read them all and aggregate them into a DataFrame. But reading the files sequentially in one process will definitely be slow. So how can I improve it? Besides, a Jupyter notebook is used.
Also, I am a little confused about the "cost of passing parameters or return values between python processes".
I know the question may be a duplicate. But I found that most of the answers use multiprocessing to solve it. Multiprocessing does solve the GIL problem. But in my experience (maybe it is wrong): passing large data (like a DataFrame) as a parameter to a subprocess is slower than a for loop in a single process, because the procedure needs serializing and de-serializing. And I am not sure about returning large values from a subprocess.
Is it most efficient to use a Queue, joblib or Ray?
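For concreteness, a minimal sketch of the pattern in question using joblib (file names are the 1.csv ... N.csv above); the pd.concat at the end is where the per-process return values land after being deserialized:

from joblib import Parallel, delayed
import glob
import pandas as pd

files = sorted(glob.glob("*.csv"))        # 1.csv, 2.csv ... N.csv
# Each worker process reads one csv and pickles the resulting DataFrame back to the parent
dfs = Parallel(n_jobs=-1)(delayed(pd.read_csv)(f) for f in files)
main_df = pd.concat(dfs, axis=0)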
Reading csv is fast. I would read all the csvs into a list and then concat the list into one dataframe. Here is a bit of code from my use case: I find all .csv files in my path and save the file names in the variable result. I then loop over the file names, read each csv, and store it in a list which I later concat into one dataframe.
import glob
import pandas as pd

result = glob.glob("my_path/*.csv")   # all csv file names in my path
data = []
for item in result:
    data.append(pd.read_csv(item))
main_df = pd.concat(data, axis=0)
I am not saying this is the best approach, but this works great for me :)

Is Pandas appropriate for joining 120 large txt files?

I have 120 txt files, all around 150 MB in size, each with thousands of columns. Overall there are definitely more than 1 million columns.
When I try to concatenate them using pandas I get this error: "Unable to allocate 36.4 MiB for an array with shape (57, 83626) and data type object". I've tried Jupyter notebook and Spyder; neither works.
How can I join the data? Or is this data not suitable for Pandas?
Thanks!
You are running out of memory. Even if you manage to load all of them (with pandas or other package), your system will still run out of memory for every task you want to perform with this data.
Assuming that you want to perform different operations on different columns of all the tables, the best way is to perform each task separately, preferably batching your columns, since there are more than 1k per file, as you say.
Let's say you want to sum the values in the first column of each file (assuming they are numbers...) and store these results in a list:
import glob
import pandas as pd
import numpy as np

filelist = glob.glob('*.txt')  # Make sure you're working in the directory containing the files
sum_first_columns = []
for file in filelist:
    df = pd.read_csv(file, sep=' ')  # Adjust the separator for your case
    sum_temp = np.sum(df.iloc[:, 0])
    sum_first_columns.append(sum_temp)
You now have a list of length 120 (one sum per file).
For each operation, this is what I would do if I had to work with my own computer/system.
Please note that this process will be very time-consuming as well, given the size of your files. You can either try to reduce your data or use a cloud server to compute everything.
Saying you want to concat in pandas implies that you just want to merge all 120 files together into one file? If so, you can iterate through all the files in a directory, read them in as lists of tuples or something like that, and just combine them all into one list (a rough sketch follows below). Lists and tuples take magnitudes less memory than dataframes, but you won't be able to perform calculations and such unless you put them into a numpy array or dataframe.
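A rough sketch of that idea with the csv module (the delimiter is an assumption; adjust it to match the files):

import csv
import glob

rows = []
for path in glob.glob("*.txt"):                  # the 120 txt files
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")   # assumed delimiter
        rows.extend(tuple(r) for r in reader)
# rows is now one big list of tuples; wrap it in numpy/pandas only when a calculation needs it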
At a certain point, when there is too much data, it is appropriate to shift from pandas to Spark, since Spark can use the power and memory of a cluster instead of being restricted to your local machine's or server's resources.

How to write and read data efficiently in python?

My application needs to process data periodically: it needs to process new data and then merge it with the old data. The data may have billions of rows with only two columns, where the first column is the row name and the second one is the value. Here is an example:
a00001,12
a00002,2321
a00003,234
The new data may contain new row names or existing ones. I want to merge them, so in each processing run I need to read the old large data file, merge it with the new data, and then write the result to a new file.
I find that the most time-consuming part is reading and writing the data. I have tried several data I/O approaches:
Plain text read and write. This is the most time-consuming way.
The Python pickle package; however, it is not efficient for large data files.
Are there any other data formats or packages that can load and write large data efficiently in Python?
If you have such large amounts of data, it might be faster to try lowering the amount of data you have to read and write.
You could spread the data over multiple files instead of saving it all in one.
When processing your new data, check what old data has to be merged and just read and write those specific files.
Your data has rows of the form:
name1, data1
name2, data2
Files containing old data, with each file holding a contiguous range of names:
db_1.dat                db_2.dat                db_3.dat
name_1: data_1          name_1001: data_1001    name_2001: data_2001
...                     ...                     ...
name_1000: data_1000    name_2000: data_2000    name_3000: data_3000
Now you can check what data you need to merge and just read and write the specific files holding that data.
Not sure if what you are trying to achieve allows a system like this but it would speed up the process as there is less data to handle.
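As an illustration, a rough sketch of that idea, here using hash-based sharding rather than the name ranges shown above; the shard count, file naming and csv layout are all assumptions:

import csv
import zlib
from collections import defaultdict

N_SHARDS = 16                                     # assumed number of db_*.dat files

def shard_of(name):
    # Stable across runs (unlike the built-in hash(), which is seeded per process)
    return zlib.crc32(name.encode()) % N_SHARDS

def merge_new_data(new_rows):                     # new_rows: iterable of (name, value) pairs
    by_shard = defaultdict(dict)
    for name, value in new_rows:
        by_shard[shard_of(name)][name] = value
    for shard, updates in by_shard.items():       # only touch the shards that actually changed
        path = f"db_{shard}.dat"
        old = {}
        try:
            with open(path, newline="") as f:
                old = dict(csv.reader(f))         # each line is "name,value"
        except FileNotFoundError:
            pass
        old.update(updates)                       # add new names, overwrite existing ones
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(old.items())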
Maybe this article could help you. It seems like Feather and Parquet may be interesting.
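For example, a minimal round trip with pandas (both formats need pyarrow installed; the frame just mirrors the sample data above):

import pandas as pd

df = pd.DataFrame({"name": ["a00001", "a00002", "a00003"],
                   "value": [12, 2321, 234]})
df.to_feather("data.feather")             # or df.to_parquet("data.parquet")
df2 = pd.read_feather("data.feather")     # or pd.read_parquet("data.parquet")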
