File Size Reduction in Pandas and HDF5 - python

I am running a model that outputs data into multiple Pandas frames, and then saves those frames to an HDF5 file. The model is run several hundred times, each time adding new columns (multi-indexed) into the existing HDF5 file's frames. This is done with Pandas merge. Since the frames are different lengths for each run, there ends up being a large number of NaN values in the frames.
After enough model runs are completed, data are dropped from the frames if the rows or columns are associated with a model run that had an error. In that process, the new data frames are put into a new HDF5 file. The following pseudo-Python shows this process:
with pandas.HDFStore(filename) as store:
    # figure out which indices should be removed
    indices_to_drop = get_bad_indices(store)
    new_store = pandas.HDFStore(reduced_filename)
    for key in store.keys():
        df = store[key]
        for idx in indices_to_drop:
            df = df.drop(idx, <level and axis info>)
        new_store[key] = df
    new_store.close()
The new hdf5 file ends up being about 10% of the size of the original. The only difference in the files is that all the NaN values are no longer equal (but are all numpy float64 values).
My question is: how can this file size reduction (presumably through managing NaN values) be achieved on an existing HDF5 file? There are times when I don't need to do the above procedure, but I am doing it anyway just to get the reduction. Is there an existing Pandas or PyTables command that can do this? Thank you very much in advance.

See the docs here.
The warning says it all:

Warning: Please note that HDF5 DOES NOT RECLAIM SPACE in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again WILL TEND TO INCREASE THE FILE SIZE. To clean the file, use ptrepack.
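If all you need is to compact an existing file (with no further dropping of rows or columns), a minimal sketch using PyTables' own file-copy helper, which is essentially what ptrepack does from the shell; the file names here are placeholders:
import tables

# Copy every node into a fresh file; the rewrite is what actually reclaims the dead space.
# Roughly equivalent to running: ptrepack --chunkshape=auto --propindexes original.h5 reduced.h5
tables.copy_file('original.h5', 'reduced.h5', overwrite=True)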

Related

Read a partitioned parquet dataset from multiple files with PyArrow and add a partition key based on the filename

I have a bunch of parquet files, each containing a subset of my dataset. Let's say that the files are named data-N.parquet with N being an integer.
I can read them all and subsequently convert to a pandas dataframe:
import glob

import pyarrow.parquet as pq

files = glob.glob("data-**.parquet")
ds = pq.ParquetDataset(
    files,
    metadata_nthreads=64,
).read_table(use_threads=True)
df = ds.to_pandas()
This works just fine. What I would like to have is an additional column in the final data frame, indicating which file the data originated from.
As far as I understand, the ds data is partitioned, with one partition per file. So it would be a matter of including the partition key in the data frame.
Is this feasible?
The partition key can, at the moment, be included in the dataframe; however, all existing partitioning schemes use directory names for the key. So if your data were laid out as /N/data.parquet or /batch=N/data.parquet this would happen (you would need to supply a partitioning object when you read the dataset).
There is no way today (in pyarrow) to get the filename in the returned results.
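If the filename itself is all you need, a possible workaround (not part of the pyarrow API, just a sketch assuming the data-N.parquet naming from the question) is to read the files one by one and tag each partial frame before concatenating:
import glob

import pandas as pd
import pyarrow.parquet as pq

frames = []
for path in sorted(glob.glob("data-*.parquet")):
    part = pq.read_table(path).to_pandas()
    part["source_file"] = path   # hypothetical column carrying the originating filename
    frames.append(part)
df = pd.concat(frames, ignore_index=True)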

Changing type of really large columns - Python MemoryError

We are migrating from SAS to Python and I am having some trouble dealing with large dataframes.
I am dealing with a df with 15kk (15 million) rows and 44 columns, a pretty large boy. I need to replace commas with dots in some columns, delete some other columns, and convert some to dates.
To delete columns I found out that this works pretty well:
del df['column']
but when trying to replace using this:
df["column"] = (dfl["column"].replace('\.','', regex=True).replace(',','.', regex=True).astype(float))
I get:
MemoryError: Unable to allocate 14.2 MiB for an array with shape (14901054,) and data type uint8
Same happens when trying to convert to date using this:
df['column'] = pd.to_datetime(df['column'],errors='coerce')
I get:
MemoryError: Unable to allocate 114. MiB for an array with shape (14901054,) and data type datetime64[ns]
Is there any other way to do those things, only more memory efficient? Or is the only solution to split the df beforehand? Thanks!
ps. not all columns are giving me this problem, but I guess that is not important
I am not an expert with this library, but when you have to deal with a big amount of data you can store the information on disk and read it back in chunks.
A good (though probably not the best) solution is to store it in a temporary CSV file and then read that file back in chunks, so that fewer rows sit in memory at a time:
1. Start from the original dataframe.
2. Remove the unnecessary columns.
3. Store the dataframe in a temporary CSV file.
4. Read the CSV back in chunks.
5. For each chunk, perform the column modifications and append it to another, final CSV file (see the sketch after this list).
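A minimal sketch of those steps; the column names, file names and chunk size are placeholders, not taken from the question:
import pandas as pd

# Steps 2-3: drop what is not needed and park the frame on disk.
del df['unneeded_column']
df.to_csv("tmp.csv", index=False)

# Steps 4-5: stream it back in chunks and fix the columns piece by piece.
first = True
for chunk in pd.read_csv("tmp.csv", chunksize=1_000_000):
    # comma decimal separator -> float
    chunk["amount"] = (chunk["amount"].replace(r"\.", "", regex=True)
                                      .replace(",", ".", regex=True)
                                      .astype(float))
    # text -> datetime, unparseable values become NaT
    chunk["date"] = pd.to_datetime(chunk["date"], errors="coerce")
    chunk.to_csv("final.csv", mode="w" if first else "a", header=first, index=False)
    first = False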
More information over here (a few minutes googling):
Why and how to use pandas with large data
Dealing with large datasets in pandas

Workflow for modifying an hdf5 file in vaex

As a sort of follow-on to my previous question [1], is there a way to open an hdf5 dataset in vaex, perform operations, and then store the results back to the same dataset?
I tried the following:
import vaex as vx
vxframe = vx.open('somedata.hdf5')
vxframe = some_transformation(vxframe)
vxframe.export_hdf5('somedata.hdf5')
This results in the error OSError: Unable to create file (unable to truncate a file which is already open), so h5py can't write to the file while it is open. Is there another workflow to achieve this? I can write to another file as a workaround, but that seems quite inefficient as (I imagine) it has to copy all the data that has not changed as well.
[1] Convert large hdf5 dataset written via pandas/pytables to vaex
Copying to a new file would not be less efficient than writing in place (at least not for this example), since it will have to write the same number of bytes either way. I also would not recommend writing in place, since if you make a mistake you will mess up your data.
Exporting data is actually quite efficient, but even better, you can also choose to just export the columns you want:
import vaex

df = vaex.open('somedata.hdf5')
df2 = some_transformation(df)
df2[['new_column1', 'new_column2']].export('somedata_extra.hdf5')
...
# next time
df = vaex.open('somedata.hdf5')
df2 = vaex.open('somedata_extra.hdf5')
df = df.join(df2)  # join without a column name merges on a row basis
We used this approach a lot to create precomputed auxiliary datasets on disk. Joining them back (on a row basis) is instant; it takes essentially no time or memory.

How to efficiently read data from a large excel file, do the computation and then store results back in python?

Let us say I have an Excel file with 100k rows. My code reads it row by row and does a computation for each row (including a benchmark of how long each row takes). It then produces an array of results with 100k rows. My Python code works, but it is not efficient: it takes several days, and the benchmark results get worse over time, I guess due to high memory consumption. Please see my attempt and let me know how to improve it.
My code accumulates everything in results=[] and only writes it at the end. Also, at the start I load the whole Excel file into worksheet. I think this causes memory issues, since my Excel file has very large text in its cells (not only numbers).
import xlrd
import pandas as pd

ExcelFileName = 'Data.xlsx'
workbook = xlrd.open_workbook(ExcelFileName)
worksheet = workbook.sheet_by_name("Sheet1")  # We need to read the data
num_rows = worksheet.nrows  # Number of Rows
num_cols = worksheet.ncols  # Number of Columns
results = []
for curr_row in range(1, num_rows, 1):
    row_data = []
    for curr_col in range(0, num_cols, 1):
        data = worksheet.cell_value(curr_row, curr_col)  # Read the data in the current cell
        row_data.append(data)
    #### do computation here ####
    ## save results like results+=[]

### save results array in dataframe and then print it to excel
df = pd.DataFrame(results)
writer = pd.ExcelWriter("XX.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name='results')
writer.save()
What I would like is to read the first row from Excel into memory, do the calculation, get the result and save it to the output, then move on to the second row, and so on, without keeping memory so busy. By doing so, I will not have a results array containing 100k rows, since I clear it on each loop iteration.
To solve the issue of loading the input file into memory, I would look into using a generator. A generator works by iterating over any iterable, but only returning the next element instead of the entire iterable. In your case, this would return only the next row from your .xlsx file, instead of keeping the entire file in memory.
However, this will not solve the issue of having a very large "results" array. Unfortunately, updating a .csv or .xlsx file as you go will take a very long time, significantly longer than updating the object in memory. There is a trade-off here: you can either use up lots of memory by updating your "results" array and writing it all to a file at the end, or you can slowly update a file on disk with the results as you go, at the cost of much slower execution.
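A minimal sketch of the generator idea, staying with the xlrd API from the question and writing each result out as soon as it is computed (do_computation is a placeholder for the per-row work):
import csv

import xlrd

def excel_rows(path, sheet="Sheet1"):
    # Yield one row of cell values at a time instead of building a 100k-row list.
    worksheet = xlrd.open_workbook(path).sheet_by_name(sheet)
    for r in range(1, worksheet.nrows):
        yield worksheet.row_values(r)

with open("results.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for row_data in excel_rows("Data.xlsx"):
        result = do_computation(row_data)  # placeholder per-row computation
        writer.writerow([result])          # written immediately, so results never pile up in memory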
For this kind of operation you are probably better off loading the csv directly into a DataFrame; there are several methods for dealing with large files in pandas, detailed here: How to read a 6 GB csv file with pandas. Which method you choose will have a lot to do with the type of computation you need to do; since you seem to be processing one row at a time, using chunks will probably be the way to go.
Pandas has a lot of built-in optimization for operations on large sets of data, so the majority of the time you will find better performance working with data in a DataFrame or Series than in pure Python. For the best performance, consider vectorizing your function or looping using the apply method, which allows pandas to apply the function to all rows in the most efficient way possible.
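A minimal sketch of the chunked approach, assuming the sheet has already been exported to data.csv and that process_row holds the per-row computation (both are placeholders):
import pandas as pd

first = True
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    # apply the per-row computation across the whole chunk
    chunk["result"] = chunk.apply(process_row, axis=1)
    chunk[["result"]].to_csv("results.csv", mode="w" if first else "a",
                             header=first, index=False)
    first = False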

OverflowError with Pandas to_hdf

Python newbie here.
I am trying to save a large data frame into HDF file with lz4 compression using to_hdf.
I use Windows 10, Python 3, Pandas 0.20.2.
I get the error “OverflowError: Python int too large to convert to C long”.
None of the machine resources are close to their limits (RAM, CPU, SWAP usage)
Previous posts discuss the dtype, but the following example shows that there is some other problem, potentially related to the size?
import numpy as np
import pandas as pd

# sample dataframe to be saved, pardon my French
n = 500*1000*1000
df = pd.DataFrame({'col1': [999999999999999999]*n,
                   'col2': ['aaaaaaaaaaaaaaaaa']*n,
                   'col3': [999999999999999999]*n,
                   'col4': ['aaaaaaaaaaaaaaaaa']*n,
                   'col5': [999999999999999999]*n,
                   'col6': ['aaaaaaaaaaaaaaaaa']*n})

# works fine
lim = 200*1000*1000
df[:lim].to_hdf('df.h5', 'table', complib='blosc:lz4', mode='w')

# works fine
lim = 300*1000*1000
df[:lim].to_hdf('df.h5', 'table', complib='blosc:lz4', mode='w')

# Error
lim = 400*1000*1000
df[:lim].to_hdf('df.h5', 'table', complib='blosc:lz4', mode='w')
....
OverflowError: Python int too large to convert to C long
I experienced the same issue and it seems that it is indeed connected to the size of the data frame rather than to dtype (I had all the columns stored as strings and was able to store them to .h5 separately).
The solution that worked for me is to save the data frame in chunks using mode='a'.
As suggested in the pandas documentation: mode {'a', 'w', 'r+'}, default 'a'; 'a' means append: an existing file is opened for reading and writing, and if the file does not exist it is created.
So the sample code would look something like:
batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', 'table', complib='blosc:lz4', mode='a')
As @Giovanni Maria Strampelli pointed out, the answer of @Artem Snorkovenko only saves the last batch. The pandas documentation states the following:
In order to add another DataFrame or Series to an existing HDF file, please use append mode and a different key.
Here is a possible workaround to save all batches (adjusted from the answer of @Artem Snorkovenko):
for i in range(len(df)):
    sr = df.loc[i]  # pandas Series object for the given index
    sr.to_hdf('df.h5', key='table_%i' % i, complib='blosc:lz4', mode='a')
This code saves each Pandas Series object with a different key. Each key is indexed by i.
To load the existing .h5 file after saving, one can do the following:
i = 0
dfdone = False  # if True, all keys in the .h5 file are successfully loaded.
srl = []        # list of Series objects
while dfdone == False:
    # print(i)  # this is to see if the code is working properly.
    try:  # check whether the current i value exists among the keys of the .h5 file
        sdfr = pd.read_hdf('df.h5', key='table_%i' % i)  # current Series object
        srl.append(sdfr)  # append each Series to a list to build the dataframe at the end.
        i += 1  # increment i by 1 after loading the Series object
    except:  # if an error occurs, the current i value exceeds the number of keys; all keys are loaded.
        dfdone = True  # terminate the while loop.
df = pd.DataFrame(srl)  # generate the dataframe from the list of Series objects.
I used a while loop, assuming we do not know the exact length of the dataframe in the .h5 file. If the length is known, a for loop can also be used.
Note that I am not saving dataframes in chunks here, so the loading procedure in its current form is not suitable for saving in chunks, where the data type would be DataFrame for each chunk. In my implementation, each saved object is a Series, and the DataFrame is generated from a list of Series. The code I provided can be adjusted to work for saving in chunks and generating a DataFrame from a list of DataFrame objects (a nice starting point can be found in this Stack Overflow entry).
In addition to @tetrisforjeff's answer:
If the df contains object dtypes, reading it back this way can lead to errors. I would suggest pd.concat(srl) instead of pd.DataFrame(srl).
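Putting the two suggestions together, a rough sketch that stores DataFrame chunks (rather than single rows) under separate keys and rebuilds the frame with pd.concat; batch_size and the key pattern are only illustrative:
import numpy as np
import pandas as pd

batch_size = 1000 * 1000

# save: one key per chunk, all appended into the same file
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', key='chunk_%i' % i, complib='blosc:lz4', mode='a')

# load: read the chunks back in order and concatenate them
chunks = []
i = 0
while True:
    try:
        chunks.append(pd.read_hdf('df.h5', key='chunk_%i' % i))
        i += 1
    except KeyError:  # no more stored chunks
        break
df_loaded = pd.concat(chunks, ignore_index=True)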
