OverflowError with Pandas to_hdf - python

Python newbie here.
I am trying to save a large DataFrame to an HDF5 file with lz4 compression using to_hdf.
I am using Windows 10, Python 3, pandas 0.20.2.
I get the error "OverflowError: Python int too large to convert to C long".
None of the machine resources are close to their limits (RAM, CPU, swap usage).
Previous posts discuss dtype, but the following example shows that there is some other problem, potentially related to the size:
import numpy as np
import pandas as pd

# sample dataframe to be saved, pardon my French
n = 500*1000*1000
df = pd.DataFrame({'col1': [999999999999999999]*n,
                   'col2': ['aaaaaaaaaaaaaaaaa']*n,
                   'col3': [999999999999999999]*n,
                   'col4': ['aaaaaaaaaaaaaaaaa']*n,
                   'col5': [999999999999999999]*n,
                   'col6': ['aaaaaaaaaaaaaaaaa']*n})

# works fine
lim = 200*1000*1000
df[:lim].to_hdf('df.h5', 'table', complib='blosc:lz4', mode='w')

# works fine
lim = 300*1000*1000
df[:lim].to_hdf('df.h5', 'table', complib='blosc:lz4', mode='w')

# Error
lim = 400*1000*1000
df[:lim].to_hdf('df.h5', 'table', complib='blosc:lz4', mode='w')
....
OverflowError: Python int too large to convert to C long

I experienced the same issue, and it does indeed seem to be connected to the size of the data frame rather than to dtype (I had all the columns stored as strings and was able to store them to .h5 separately).
The solution that worked for me is to save the data frame in chunks using mode='a'.
As described in the pandas documentation for mode ({'a', 'w', 'r+'}, default 'a'): 'a' means append; an existing file is opened for reading and writing, and if the file does not exist it is created.
So the sample code would look something like:
batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', 'table', complib='blosc:lz4', mode='a')

As @Giovanni Maria Strampelli pointed out, @Artem Snorkovenko's answer only saves the last batch. Pandas documentation states the following:
In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.
Here is a possible workaround to save all batches (adjusted from @Artem Snorkovenko's answer):
for i in range(len(df)):
    sr = df.loc[i]  # pandas Series object for the given index
    sr.to_hdf('df.h5', key='table_%i' % i, complib='blosc:lz4', mode='a')
This code saves each Pandas Series object with a different key. Each key is indexed by i.
To load the existing .h5 file after saving, one can do the following:
i = 0
dfdone = False  # if True, all keys in the .h5 file have been loaded successfully
srl = []        # list of Series objects

while not dfdone:
    # print(i)  # uncomment to check that the loop is working properly
    try:
        # check whether the current i value exists among the keys of the .h5 file
        sdfr = pd.read_hdf('df.h5', key='table_%i' % i)  # current Series object
        srl.append(sdfr)  # append each Series to a list to build the dataframe at the end
        i += 1  # increment i by 1 after loading the Series object
    except KeyError:
        # the current i value exceeds the number of keys, so all keys have been loaded
        dfdone = True  # terminate the while loop

df = pd.DataFrame(srl)  # generate the dataframe from the list of Series objects
I used a while loop, assuming we do not know the exact number of keys in the .h5 file. If the number is known, a for loop can also be used.
Note that I am not saving dataframes in chunks here. Thus, the loading procedure in its current form is not suitable for saving in chunks, where the data type would be DataFrame for each chunk. In my implementation, each saved object is a Series, and the DataFrame is generated from a list of Series. The code I provided can be adjusted to work for saving in chunks and generating a DataFrame from a list of DataFrame objects (a nice starting point can be found in this Stack Overflow entry).
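As an alternative to saving each chunk under its own key, here is a minimal sketch of appending DataFrame chunks to a single table-format key (this is not the original answer's code; it relies on pandas' append=True and format='table'):

import numpy as np
import pandas as pd

batch_size = 1000

# start from a fresh file so repeated runs do not double-append
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    # format='table' is required for appending; complevel is set because
    # complib alone does not enable compression
    df_chunk.to_hdf('df.h5', key='table', format='table', append=True,
                    mode='a', complib='blosc:lz4', complevel=9)

# read back in one go, or pass chunksize= to read_hdf for chunked reads
df_all = pd.read_hdf('df.h5', 'table')

For string columns whose length varies between chunks, min_itemsize may also need to be passed to to_hdf so that later, longer strings still fit.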

In addition to @tetrisforjeff's answer:
If the df contains object dtypes, reading it back this way can lead to an error. I would suggest pd.concat(srl) instead of pd.DataFrame(srl).

Related

Insert data range into multi-dimensional array

I am trying to get data from an Excel file using xlwings (I am new to Python) and load it into a multi-dimensional array (or rather, a table) that I could then loop through later, row by row.
What I would like to do:
db = []
wdb = xw.Book(r'C:\temp\xlpython\db.xlsx')
db.append(wdb.sheets[0].range('A2:K2').expand('down'))
So this would load the data into my table 'db', and I could later loop through it using:
for i in range(len(db)):
    print(db[i][1])
if I wanted to retrieve the data originally in column B, for instance.
But instead of this, it loads the data in a single dimension, so if I run the code:
print(range(len(db)))
I get range(0, 1) instead of a range covering the 146 rows of data in the Excel file.
Is there a way to do this, other than loading the table line by line?
Thanks
Have a look at the documentation here on converting the range to a numpy array or specifying the dimensions.
import numpy as np
import xlwings as xw

db = []
wdb = xw.Book(r'C:\temp\xlpython\db.xlsx')
db.append(wdb.sheets[0].range('A2:K2').options(np.array, expand='down').value)
After looking at numpy arrays as suggested by Rawson, it seems they have the same behaviour as Python lists when appending a whole range, meaning a flat array is generated and the rows of the Excel range are not preserved in the array; at least I couldn't get it to work that way.
So finally I looked into the pandas DataFrame, and it does exactly what is needed; you can even import column titles, which is a plus.
import pandas as pd
import xlwings as xw

wdb = xw.Book(r'C:\temp\xlpython\db.xlsx')
db = pd.DataFrame(wdb.sheets[0].range('A2:K2').expand('down').value)
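For completeness, xlwings can also hand the range straight to pandas through its DataFrame converter, so the column titles become headers; a rough sketch, assuming the titles sit in row 1 and the contiguous data block therefore starts at A1:

import pandas as pd
import xlwings as xw

wdb = xw.Book(r'C:\temp\xlpython\db.xlsx')

# read the contiguous block starting at A1 into a DataFrame, using the
# first row as column headers and skipping the automatic index column
db = (wdb.sheets[0]
      .range('A1')
      .options(pd.DataFrame, header=1, index=False, expand='table')
      .value)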

Pandas read_hdf is giving "can only use an iterator or chunksize on a table" error

I have an h5 data file, which includes a key rawreport.
I can read rawreport and save it as a dataframe using read_hdf(filename, "rawreport") without any problems. But the data has 17 million rows and I'd like to use chunking.
When I ran this code
chunksize = 10**6
someval = 100
df = pd.DataFrame()
for chunk in pd.read_hdf(filename, 'rawreport', chunksize=chunksize, where='datetime < someval'):
    df = pd.concat([df, chunk], ignore_index=True)
I get "TypeError: can only use an iterator or chunksize on a table"
What does it mean that the rawreport isn't a table and how could I overcome this issue? I'm not the person who created the h5 file.
Chunking is only possible if your file was written in table format using PyTables. This must be specified when the file is first written:
df.to_hdf(filename, 'rawreport', format='table')
If this wasn't specified when you wrote the file, then pandas defaults to the fixed format. While a fixed-format file can be written and read quickly, it means the entire dataframe must be read into memory. Unfortunately, chunking and the other read_hdf options for selecting particular rows or columns can't be used here.
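If a one-time rewrite of the file is acceptable (and the frame fits in memory for that single pass), the fixed-format key can be converted to table format so that chunked and filtered reads work afterwards. A sketch, where 'rawreport_table.h5' is just a hypothetical output name and data_columns is only needed if you want to filter on that column:

import pandas as pd

# one-time conversion: read the fixed-format key, rewrite it as a PyTables table
df_full = pd.read_hdf(filename, 'rawreport')
df_full.to_hdf('rawreport_table.h5', key='rawreport', format='table',
               data_columns=['datetime'], mode='w')

# chunked reads are now possible
for chunk in pd.read_hdf('rawreport_table.h5', 'rawreport', chunksize=10**6):
    pass  # process each chunk here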

Save dictionary as a pyspark Dataframe and load it - Python, Databricks

I have a dictionary as follows:
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
I want to save this dictionary in Databricks so that I do not have to rebuild it every time I want to start working with it. Furthermore, I would like to know how to retrieve it and have it in its original form again.
I have tried doing the following:
from itertools import zip_longest
column_names, data = zip(*my_dict.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
and
column_names, data = zip(*dict_brands.items())
spark.createDataFrame(zip(*data), column_names).show()
However, I get the following error:
zip_longest argument #10342 must support iteration
I also do not know how to reload it or upload it. I tried with a sample dataframe (not the same one), as follows:
df.write.format("tfrecords").mode("overwrite").save('/data/tmp/my_df')
And the error is:
Attribute name "my_column" contains invalid character(s)
among " ,;{}()\n\t=". Please use alias to rename it.
Finally, in order to obtain it, I thought about:
my_df = spark.table("my_df") # Get table
df = my_df.toPandas() # Make pd dataframe
and then convert it to a dictionary, but maybe there is an easier way than making it a dataframe, retrieving it as a dataframe, and then converting it back into a dictionary again.
I would also like to know the computational cost for the solutions, since the actual dataset is very large.
Here is my sample code, addressing your needs step by step.
Convert a dictionary to a Pandas dataframe
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
import pandas as pd
pdf = pd.DataFrame(my_dict)
Convert a Pandas dataframe to a PySpark dataframe
df = spark.createDataFrame(pdf)
To save the PySpark dataframe to a file, use Parquet format. The tfrecords format is not supported here.
df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')
To load the saved file above as a PySpark dataframe.
df2 = spark.read.format("parquet").load('/data/tmp/my_df')
To convert the PySpark dataframe back to a dictionary.
my_dict2 = df2.toPandas().to_dict()
The computational cost of the code above depends on the memory footprint of your actual dataset.
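One small note: to_dict() as used above returns a nested {column: {row_index: value}} mapping. If the original {column: list} shape is wanted back, pandas' orient='list' option can be used instead (keeping in mind that Spark does not guarantee row order):

my_dict2 = df2.toPandas().to_dict(orient='list')
# e.g. {'a': [12.0, 15.2, 52.1], 'b': [2.5, 2.4, 5.2], 'c': [1.2, 5.3, 12.0]}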

Pandas ignoring headers while reading large (2GB) csv

I'm trying to read a rather large CSV (2 GB) with pandas in order to do some datatype manipulation and joining with other dataframes I have already loaded. As I want to be a little careful with memory, I decided to read it in chunks. For the purposes of the question, here is an extract of my CSV layout with dummy data (can't really share the real data, sorry!):
institution_id,person_id,first_name,last_name,confidence,institution_name
1141414141,4141414141,JOHN,SMITH,0.7,TEMP PLACE TOWN
10123131114,4141414141,JOHN,SMITH,0.7,TEMP PLACE CITY
1003131313188,4141414141,JOHN,SMITH,0.7,"TEMP PLACE,TOWN"
18613131314,1473131313,JOHN,SMITH,0.7,OTHER TEMP PLACE
192213131313152,1234242383,JANE,SMITH,0.7,"OTHER TEMP INC, LLC"
My pandas code to read the files:
inst_map = pd.read_csv("data/hugefile.csv",
                       engine="python",
                       chunksize=1000000,
                       index_col=False)
print("processing institution chunks")
chunk_list = [] # append each chunk df here
for chunk in inst_map:
# perform data filtering
chunk['person_id'] = chunk['person_id'].progress_apply(zip_check)
chunk['institution_id'] = chunk['institution_id'].progress_apply(zip_check)
# Once the data filtering is done, append the chunk to list
chunk_list.append(chunk)
ins_processed = pd.concat(chunk_list)
The zip check function that I'm applying is basically performing some datatype checks and then converting the value that it gets into an integer.
Whenever I read the CSV, it only ever reads the institution_id column and generates an index. The other columns in the CSV are just silently dropped.
When I don't use index_col=False as an option, it sets 1141414141/4141414141/JOHN/SMITH/0.7 (basically the first 5 values in the row) as the index, keeps only institution_id as the header, and reads only institution_name into the dataframe as a value.
I have honestly no clue what is going on here, and after 2 hours of SO / Google searching I decided to just ask this as a question. Hope someone can help me, thanks!
The issue turned out to be that something went wrong while transferring the large CSV file to my remote processing server (which has sufficient RAM to handle in-memory editing). Processing the chunks on my local computer worked.
After re-uploading the file, it worked fine on the remote server as well.
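For anyone hitting something similar: a quick way to confirm a corrupted transfer is to compare a checksum of the local and remote copies. A minimal sketch (the filename is taken from the question):

import hashlib

def md5sum(path, block_size=1 << 20):
    # stream the file in 1 MB blocks so multi-GB files need not fit in memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()

print(md5sum('data/hugefile.csv'))  # run on both machines and compare the output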

File Size Reduction in Pandas and HDF5

I am running a model that outputs data into multiple Pandas frames, and then saves those frames to an HDF5 file. The model is run several hundred times, each time adding new columns (multi-indexed) into the existing HDF5 file's frames. This is done with Pandas merge. Since the frames are different lengths for each run, there ends up being a large number of NaN values in the frames.
After enough model runs are completed, data are dropped from the frames if the rows or columns are associated with a model run that had an error. In that process, the new data frames are put into a new HDF5 file. The following pseudo-Python shows this process:
with pandas.HDFStore(filename) as store:
    # figure out which indices should be removed
    indices_to_drop = get_bad_indices(store)
    new_store = pandas.HDFStore(reduced_filename)
    for key in store.keys():
        df = store[key]
        for idx in indices_to_drop:
            df = df.drop(idx, <level and axis info>)
        new_store[key] = df
    new_store.close()
The new hdf5 file ends up being about 10% of the size of the original. The only difference in the files is that all the NaN values are no longer equal (but are all numpy float64 values).
My question is, how can this filesize reduction (presumably through managing NaN values) be achieved on an existing hdf5 file? There are times where I don't need to do the above procedure, but I am doing it anyway to get the reduction. Is there an existing Pandas or PyTables command that can do this? Thank you very much in advance.
See the docs here
The warning says it all:
Warning: Please note that HDF5 DOES NOT RECLAIM SPACE in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again WILL TEND TO INCREASE THE FILE SIZE. To clean the file, use ptrepack.
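A hedged example of invoking ptrepack (the PyTables command-line repacking tool) from Python; the flags and file names below are illustrative, so check ptrepack --help for the installed version:

import subprocess

# rewrite the file so that space freed by deleted nodes is actually reclaimed,
# applying compression while copying
subprocess.run([
    'ptrepack', '--chunkshape=auto', '--propindexes',
    '--complevel=9', '--complib=blosc',
    'original.h5', 'repacked.h5',  # hypothetical input/output file names
], check=True)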
