I am trying to analyze dozens of very large CSV files, each with hundreds of thousands of rows of time series data and each roughly 5 GB in size.
My goal is to read each of these CSV files into a dataframe, perform calculations on it, append new columns based on those calculations, and then write the result to a unique output CSV file for each input CSV file. This whole process runs inside a for loop iterating over a folder containing all of these large CSV files. The process is very memory intensive, and when I run my code I get this error message: MemoryError: Unable to allocate XX MiB for an array with shape (XX,) and data type int64
So I want to explore a way to make reading in my CSVs much less memory intensive, which is why I want to try out the pickle module in Python.
To "pickle" each CSV and then read it in I try the following:
# Pickle the CSV, then read it back in from the pickle
import pickle
import pandas as pd

df = pd.read_csv(path_to_csv)
filename = "pickle.csv"
with open(filename, 'wb') as file:
    pickle.dump(df, file)
with open(filename, 'rb') as file:
    pickled_df = pickle.load(file)
print(pickled_df)
However, after adding this pickling code to my larger script, I get the same error message as above. I suspect this is because I am still reading the file in with pandas before pickling it and then reading that pickle. My question is: how do I avoid the memory-intensive step of reading my data into a pandas dataframe by reading the CSV in with pickle instead? Most instructions I find tell me to pickle the CSV and then read in that pickle, but I do not understand how to pickle the CSV without first reading it in with pandas, which is what causes my code to crash. I am also unsure whether reading in my data as a pickle would still give me a dataframe I can perform calculations on.
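Pickling cannot remove the initial read, because pickle.dump needs the dataframe to already exist in memory. One commonly used alternative is to have pandas read the file in pieces with the chunksize argument of read_csv and write each processed piece straight to the output file. The following is only a minimal sketch, assuming the calculations can be done chunk by chunk; the function name process_csv_in_chunks and the column names "value" and "value_doubled" are made up for illustration.

```python
import pandas as pd


def process_csv_in_chunks(in_path, out_path, chunksize=100_000):
    """Read in_path piece by piece, add a derived column, and append to out_path."""
    first = True
    for chunk in pd.read_csv(in_path, chunksize=chunksize):
        # Hypothetical calculation: "value" is an assumed column name.
        chunk["value_doubled"] = chunk["value"] * 2
        # Overwrite and write the header for the first chunk, then append.
        chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        first = False
```

Only one chunk is held in memory at a time, so peak memory depends on chunksize rather than on the file size. Calculations that need the whole series at once (for example a sort over the full file) would not fit this pattern as written.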
Related
I'm in a situation where I have to add a single row to the end of dataframe very frequently. Initially I used plain text .csv files and therefore appending a line at the end of the file was trivial and didn't require loading the dataframe to RAM.
line_to_add = '1,2,3\n'
with open('path/to/file.csv', 'a') as file_handle:
    file_handle.write(line_to_add)
To save disk space I would like to store my dataframe as a pickled and zipped file, but if I do that I lose the ability to easily append to the end of the file. Is this doable without having to load the dataframe into RAM every time?
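One possible workaround, swapping the pickle for a gzip-compressed CSV, relies on the fact that a gzip file may consist of several concatenated members and that Python's gzip module can both append a new member and read all members back as one stream. This is a sketch under that assumption, not a way to append to a pickle; the helper names append_row and read_lines are made up for illustration.

```python
import csv
import gzip


def append_row(path, row):
    """Append one CSV row to a gzip-compressed file without reading it."""
    # Mode "at" adds a new gzip member at the end of the existing file.
    with gzip.open(path, "at", encoding="utf-8", newline="") as fh:
        csv.writer(fh, lineterminator="\n").writerow(row)


def read_lines(path):
    """Yield the decompressed lines one at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield line.rstrip("\n")
```

Each append creates a small separate gzip member, so the compression ratio will be worse than compressing the whole file in one go; recompressing occasionally would recover the space.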
I have a large CSV file (around 10 GB).
I use different IPython notebooks to analyse it, using pd.read_csv() to load the file into a dataframe in each notebook.
My problem is that every time I read the file, 10 GB of memory is used.
I am wondering if there is a way to share dataframe data between processes so that I can optimize my memory usage.
An ideal solution would be like this:
In my server file:
def InitData():
    df = pd.read_csv('my.csv')
    share(df)  # hypothetical: publish the dataframe once
In the other notebook files:
def loadingData():
    df = LoadingSharedData()  # hypothetical: attach to the shared dataframe
    result = df.sum()  # something like this
No matter how many notebooks I create, there would be only one copy of the dataframe in memory.
Using pickle is fast and efficient if you are confident that nobody will be able to interfere with the pickled files (unpickling untrusted data can execute arbitrary code); see security considerations.
import pickle
with open('filename.pickle', 'wb') as file:
    pickle.dump(df, file)
with open('filename.pickle', 'rb') as file:
    df_test = pickle.load(file)
print(df.equals(df_test))
I am new to Python and I am attempting to read a large .csv file (with hundreds of thousands or possibly a few million rows, and about 15,000 columns) using pandas.
What I thought I could do is create and save each chunk in a new .csv file, iterating across all chunks. I am currently using a laptop with relatively limited memory (about 4 GB, which I am in the process of upgrading), but I was wondering whether I could do this without changing my setup now. Alternatively, I could move this process to a PC with more RAM and attempt larger chunks, but I wanted to get this in place even for smaller row chunks.
I have seen that I can quickly process chunks of data (e.g. 10,000 rows and all columns) using the code below. But as a Python beginner, I have only managed to export the first chunk. I would like to loop across all chunks and save each one.
import pandas as pd
import os
print(os.getcwd())
print(os.listdir(os.getcwd()))
chunksize = 10000
data = pd.read_csv('ukb35190.csv', chunksize=chunksize)
df = data.get_chunk(chunksize)
print(df)
df.to_csv(r'/home/user/PycharmProjects/PROJECT/export_csv_1.csv', index=False, header=True)
If you are not doing any processing on the data, you don't even have to store it in a variable; you can write each chunk out directly. Please find the code below. Hope this helps.
import pandas as pd

chunksize = 10000
batch_no = 1
for chunk in pd.read_csv(r'ukb35190.csv', chunksize=chunksize):
    # write each chunk to its own numbered CSV file
    chunk.to_csv(r'ukb35190.csv' + str(batch_no) + '.csv', index=False)
    batch_no += 1
I am new to Python. I am using Dask to read 5 large (>1 GB) CSV files and merge them (SQL-like) into a Dask dataframe. Now I am trying to write the merged result into a single CSV. I used compute() on the Dask dataframe to collect the data into a single df and then called to_csv. However, compute() is slow in reading data across all partitions. I tried calling to_csv directly on the Dask df, and it created multiple .part files (I didn't try merging those .part files into a CSV). Is there any alternative way to get a Dask df into a single CSV, or any parameter to compute() to gather the data? I am using 6 GB RAM with an HDD and an i5 processor.
Thanks
Dask.dataframe will not write to a single CSV file. As you mention it will write to multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into a Pandas dataframe, which might fill up memory.
One option is to avoid Pandas and Dask altogether and just read the lines from the multiple files and dump them into another file
with open(out_filename, 'w') as outfile:
    for in_filename in filenames:
        with open(in_filename, 'r') as infile:
            # if your csv files have headers then you might want to burn a line here with `next(infile)`
            for line in infile:
                outfile.write(line)  # each line already ends with a newline
If you don't need to do anything except for merge your CSV files into a larger one then I would just do this and not touch pandas/dask at all. They'll try to read the CSV data into in-memory data, which will take a while and which you don't need. If on the other hand you need to do some processing with pandas/dask then I would use dask.dataframe to read and process the data, write to many csv files, and then use the trick above to merge them afterwards.
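The merge trick above can be wrapped in a small function that also keeps only the first header row. This is just a sketch, assuming every file has the same columns and that no header cell contains an embedded newline; the name concat_csv and the has_header flag are made up for illustration.

```python
def concat_csv(filenames, out_filename, has_header=True):
    """Concatenate CSV files line by line, keeping only the first header."""
    with open(out_filename, "w", newline="") as outfile:
        for i, in_filename in enumerate(filenames):
            with open(in_filename, "r", newline="") as infile:
                if has_header and i > 0:
                    next(infile, None)  # skip the repeated header row
                last = ""
                for line in infile:
                    outfile.write(line)
                    last = line
                # Guard against a final line with no trailing newline,
                # which would otherwise fuse with the next file's first row.
                if last and not last.endswith(("\n", "\r")):
                    outfile.write("\n")
```

Only one line is held in memory at a time, so this works for inputs far larger than RAM, for example the .part files written by dask's to_csv.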
You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster. http://dask.pydata.org/en/latest/dataframe-create.html
As of Dask 2.4.0 you may now specify single_file=True when calling to_csv. Example: dask_df.to_csv('path/to/csv.csv', single_file=True)
Like @mrocklin said, I recommend using other file formats.
I have a large CSV file (about a million records). I want to process it and write each record into a DB.
Since loading the complete file into RAM makes no sense, I need to read the file in chunks (or in any other better way).
So I wrote this code.
import csv

with open('/home/praful/Desktop/a.csv') as csvfile:
    config_file = csv.reader(csvfile, delimiter=',', quotechar='|')
    print(config_file)
    for row in config_file:
        print(row)
I guess it loads everything into memory first and then processes it.
After looking at this thread and many others, I didn't see any difference between the OP's code and the solution. Kindly advise: is this the only method for efficient processing of CSV files?
No, the csv module produces an iterator; rows are produced on demand. Unless you keep references to row elsewhere the file will not be loaded into memory in its entirety.
Note that that is exactly what I am saying in the other answer you linked to; the problem there is that the OP was building a list (data) holding all rows after reading instead of processing the rows as they were being read.
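To make the point concrete, here is a minimal sketch of processing rows as they are read and writing them to a database in batches. It assumes an SQLite database and a three-column file; the table name records, its column names, and the function name load_csv_into_db are made up for illustration.

```python
import csv
import sqlite3
from itertools import islice


def load_csv_into_db(csv_path, conn, batch_size=1000):
    """Stream rows from csv_path into a table, batch_size rows at a time."""
    # Hypothetical three-column table; adapt to the real schema.
    conn.execute("CREATE TABLE IF NOT EXISTS records (a TEXT, b TEXT, c TEXT)")
    with open(csv_path, newline="") as csvfile:
        reader = csv.reader(csvfile, delimiter=",", quotechar="|")
        while True:
            # islice pulls at most batch_size rows from the iterator.
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            conn.executemany("INSERT INTO records VALUES (?, ?, ?)", batch)
            conn.commit()
```

At most batch_size rows are held in memory at any moment, because nothing keeps a reference to earlier batches once they have been inserted.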