how to share pandas dataframe between processes to save memory - python

I have a large csv file(around 10Gb).
I use different ipython notebooks to analyse it.(Using pd.read_csv() to load the file into dataframe in each notebook)
My problem is , every time I read the file, 10G memory is used.
I am wondering if there is a way to share dataframe data between processes so that I can optimize my memory usage.
An ideal solution would be like this:
in my server file,
def InitData():
df = pd.read_csv(my.csv)
share(df)
in other notebook files,
def loadingData():
df = LoadingSharedData()
result = df.sum() #something like this
No matter how many notebooks I create,there would be only one piece of dataframe in my memory.

Using pickle is fast and efficient if you are confident that nobody will be able to interfere with the pickled files, see security considerations.
import pickle
with open('filename.pickle', 'wb') as file:
pickle.dump(df, file)
with open('filename.pickle', 'rb') as file:
df_test = pickle.load(file)
print(df.equals(df_test))

Related

How to read large csv as pickle in python?

I am trying to perform analysis on dozens very large CSV files, each with hundreds of thousands of rows of time series data, with each file being about roughly 5GB in size.
My goal is to read in each of these CSV files as a dataframe, perform calculations on these dataframe, append some new columns to these dataframes based on these calculations, and then write these new dataframes to a unique output CSV file for each input CSV file. This whole process would occur within a for loop iterating through a folder containing all of these large CSV files. And so this whole process is very memory intensive, and when I try to run my code, I am met with this error message: MemoryError: Unable to allocate XX MiB for an array with shape (XX,) and data type int64
And so I want to explore a way to make the process of reading in my CSVs much loss memory intensive, which is why I want to try out the pickle module in python.
To "pickle" each CSV and then read it in I try the following:
#Pickle CSV and read in as pickle
df = pd.read_csv(path_to_csv)
filename = "pickle.csv"
file = open(filename, 'wb')
pickle.dump(df, file)
file = open(filename, 'rb')
pickled_df = pickle.load(file)
print(pickled_df)
However, after including this pickling code to read in my data in my larger script, I get the same error message as above. I suspect this is because I am still reading the file in with pandas to begin with before pickling and then reading that pickle. My question is, how to I avoid the memory-intensive process of reading my data into a pandas dataframe by just reading in the CSV with pickle? Most instruction I am finding tells me to pickle the CSV and then read in that pickle, but I do not understand how pickle the CSV without first reading in that CSV with pandas, which is what is causing my code to crash. I am also confused about whether reading in my data as a pickle would still provide me with a dataframe I can perform calculations on.

use Pandas HDFStore to open file in read only mode

I needed compatibility between Pandas versions, so pickle was not enough, and I stored a bunch of dataframes like this:
import pandas as pd
hdf = pd.HDFStore('storage.h5')
hdf.put('mydata', df_mydata)
...and brought them back like this:
df_mydata = hdf.get('df_mydata')
Thing is, in Python, you can usually open a file read-only like this:
f = open('workfile', 'r')
I saved the dataframes for local use as it takes too long and stresses out a server to pull them out of SQL otherwise. How can you open these .h5 files so as to not accidentally alter them?
Try:
hdf = pd.HDFStore('storage.h5', 'r')
this class comes from pytables. You can read the doc here:pytables

import large dataset (4gb) in python using pandas

I'm trying to import a large (approximately 4Gb) csv dataset into python using the pandas library. Of course the dataset cannot fit all at once in the memory so I used chunks of size 10000 to read the csv.
After this I want to concat all the chunks into a single dataframe in order to perform some calculations but I ran out of memory (I use a desktop with 16gb RAM).
My code so far:
# Reading csv
chunks = pd.read_csv("path_to_csv", iterator=True, chunksize=1000)
# Concat the chunks
pd.concat([chunk for chunk in chunks])
pd.concat(chunks, ignore_index=True)
I searched many threads on StackOverflow and all of them suggest one of these solutions. Is there a way to overcome this? I can't believe I can't handle a 4 gb dataset with 16 gb ram!
UPDATE: I still haven't come up with any solution to import the csv file. I bypassed the problem by importing the data into a PostgreSQL then querying the database.
I once deal with this kind of situation using generator in python. I hope this will be helpful:
def read_big_file_in_chunks(file_object, chunk_size=1024):
"""Reading whole big file in chunks."""
while True:
data = file_object.read(chunk_size)
if not data:
break
yield data
f = open('very_very_big_file.log')
for chunk in read_big_file_in_chunks(f):
process_data(chunck)

Merge csv files using dask

I am new to python. I am using dask to read 5 large (>1 GB) csv files and merge (SQL like) them into a dask dataframe. Now, I am trying to write the merged result into a single csv. I used compute() on dask dataframe to collect data into a single df and then call to_csv. However, compute() is slow in reading data across all partitions. I tried calling to_csv directly on dask df and it created multiple .part files (I didn't try merging those .part files into a csv). Is there any alternative to get dask df into a single csv or any parameter to compute() to gather data. I am using 6GB RAM with HDD and i5 processor.
Thanks
Dask.dataframe will not write to a single CSV file. As you mention it will write to multiple CSV files, one file per partition. Your solution of calling .compute().to_csv(...) would work, but calling .compute() converts the full dask.dataframe into a Pandas dataframe, which might fill up memory.
One option is to just avoid Pandas and Dask all-together and just read in bytes from multiple files and dump them to another file
with open(out_filename, 'w') as outfile:
for in_filename in filenames:
with open(in_filename, 'r') as infile:
# if your csv files have headers then you might want to burn a line here with `next(infile)
for line in infile:
outfile.write(line + '\n')
If you don't need to do anything except for merge your CSV files into a larger one then I would just do this and not touch pandas/dask at all. They'll try to read the CSV data into in-memory data, which will take a while and which you don't need. If on the other hand you need to do some processing with pandas/dask then I would use dask.dataframe to read and process the data, write to many csv files, and then use the trick above to merge them afterwards.
You might also consider writing to a datastore other than CSV. Formats like HDF5 and Parquet can be much faster. http://dask.pydata.org/en/latest/dataframe-create.html
As of Dask 2.4.0 you may now specify single_file=True when calling to_csv. Example: dask_df.to_csv('path/to/csv.csv', single_file=True)
Like #mrocklin said, I recommend using other file formats.

Processing large XLSX file in python

I have a large xlsx Excel file (56mb, 550k rows) from which I tried to read the first 10 rows. I tried using xlrd, openpyxl, and pyexcel-xlsx, but they always take more than 35 mins because it loads the whole file in memory.
I unzipped the Excel file and found out that the xml which contains the data I need is 800mb unzipped.
When you load the same file in Excel it takes 30 seconds. I'm wondering why it takes that much time in Python?
Use openpyxl's read-only mode to do this.
You'll be able to work with the relevant worksheet instantly.
Here is it, i found a solution. The fastest way to read an xlsx sheet.
56mb file with over 500k rows and 4 sheets took 6s to proceed.
import zipfile
from bs4 import BeautifulSoup
paths = []
mySheet = 'Sheet Name'
filename = 'xlfile.xlsx'
file = zipfile.ZipFile(filename, "r")
for name in file.namelist():
if name == 'xl/workbook.xml':
data = BeautifulSoup(file.read(name), 'html.parser')
sheets = data.find_all('sheet')
for sheet in sheets:
paths.append([sheet.get('name'), 'xl/worksheets/sheet' + str(sheet.get('sheetid')) + '.xml'])
for path in paths:
if path[0] == mySheet:
with file.open(path[1]) as reader:
for row in reader:
print(row) ## do what ever you want with your data
reader.close()
Enjoy and happy coding.
The load time you're experiencing is directly related to the io speed of your memory chip.
When pandas loads an excel file, it makes several copies of the file -- since the file structure isn't serialized (excel uses a binary encoding).
In terms of a solution: I'd suggest, as a workaround:
load your excel file through a virtual machine with specialized hardware (here's what AWS has to offer)
save your file to a csv format for local use.
For even better performance, use an optimized data structure such as parquet
For a deeper dive, check out this article I've written: Loading Ridiculously Large Excel Files in Python

Categories

Resources