Python Pandas to_pickle cannot pickle large dataframes - python

I have a dataframe "DF" with 500,000 rows. Here are the data types per column:
ID int64
time datetime64[ns]
data object
Each entry in the "data" column is an array with size = [5,500].
When I try to save this dataframe using
DF.to_pickle("my_filename.pkl")
I get the following error:
12 """
13 with open(path, 'wb') as f:
---> 14 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
I also tried this method, but I get the same error:
import pickle
with open('my_filename.pkl', 'wb') as f:
    pickle.dump(DF, f)
I then tried saving just 10 rows of this dataframe:
DF.head(10).to_pickle('test_save.pkl')
and I got no error at all. So it can save a small DF but not a large one.
I am using Python 3 with IPython Notebook 3 on a Mac.
Please help me solve this problem. I really need to save this DF to a pickle file, and I cannot find a solution on the internet.

Until there is a fix on the pickle/pandas side of things, I'd say a better option is to use an alternative IO backend. HDF is suitable for large datasets (GBs), so you don't need to add any split/combine logic.
df.to_hdf('my_filename.hdf','mydata',mode='w')
df = pd.read_hdf('my_filename.hdf','mydata')

Probably not the answer you were hoping for, but this is what I did:
Split the dataframe into smaller chunks using np.array_split (numpy functions are not guaranteed to work on dataframes, but this one does now; there used to be a bug with it).
Then pickle the smaller dataframes.
When you unpickle them, use pandas.concat to glue everything back together (DataFrame.append also worked in older versions, but it was removed in pandas 2.0).
I agree it is a fudge and suboptimal. If anyone can suggest a "proper" answer I'd be interested in seeing it, but I think it is as simple as dataframes not being supposed to get above a certain size.
Split a large pandas dataframe
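The split, pickle, and concat steps above can be sketched as follows. This is a minimal illustration, not the answerer's exact code: the helper names save_in_chunks and load_chunks, the file naming scheme, and the tiny stand-in dataframe are all assumptions. It splits row positions with np.array_split and slices with iloc, rather than passing the dataframe itself to numpy.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def save_in_chunks(df, directory, n_chunks):
    """Pickle df as n_chunks smaller files and return their paths in order.

    Hypothetical helper: the name and file naming scheme are illustrative.
    """
    # Split the row positions (not the frame itself), then slice with iloc.
    position_groups = np.array_split(np.arange(len(df)), n_chunks)
    paths = []
    for i, rows in enumerate(position_groups):
        path = os.path.join(directory, f"part_{i:04d}.pkl")
        df.iloc[rows].to_pickle(path)
        paths.append(path)
    return paths


def load_chunks(paths):
    """Read the chunk files back and concatenate them in their saved order."""
    return pd.concat([pd.read_pickle(p) for p in paths])


# Small stand-in for the real frame: an int column plus an object column
# holding one array of shape (5, 500) per row.
df = pd.DataFrame({
    "ID": np.arange(10),
    "data": [np.zeros((5, 500)) for _ in range(10)],
})

with tempfile.TemporaryDirectory() as tmp:
    paths = save_in_chunks(df, tmp, n_chunks=3)
    restored = load_chunks(paths)
```

For a real 500,000-row frame you would pick n_chunks so that each file stays comfortably below the size at which the write fails.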

Try to use compression. It worked for me.
data_df.to_pickle('data_df.pickle.gzde', compression='gzip')
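A small round trip with the same option might look like the sketch below. The file name and the sample dataframe are made up for illustration. Whether compression avoids the original error will depend on what actually causes it on your system.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Illustrative data only; the real frame in the question is far larger.
df = pd.DataFrame({"ID": np.arange(100), "value": np.zeros(100)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_df.pkl.gz")
    # Write a gzip-compressed pickle and read it back with the same setting.
    df.to_pickle(path, compression="gzip")
    restored = pd.read_pickle(path, compression="gzip")
    compressed_size = os.path.getsize(path)
```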

I ran into this same issue and traced the cause to a memory issue. According to this resource, it's usually not caused by the memory itself but by the movement of too many resources into swap space. I was able to save the large pandas file by disabling swap altogether with the command (provided in that link):
swapoff -a

Related

Workflow for modifying an hdf5 file in vaex

As a sort of follow-on to my previous question [1], is there a way to open an HDF5 dataset in vaex, perform operations, and then store the results to the same dataset?
I tried the following:
import vaex as vx
vxframe = vx.open('somedata.hdf5')
vxframe = some_transformation(vxframe)
vxframe.export_hdf5('somedata.hdf5')
This results in the error OSError: Unable to create file (unable to truncate a file which is already open), so h5py can't write to the file while it is open. Is there another workflow to achieve this? I can write to another file as a workaround, but that seems quite inefficient as (I imagine) it has to copy all the data that has not changed as well.
[1] Convert large hdf5 dataset written via pandas/pytables to vaex
Copying to a new file would not be less efficient than writing to the same file (at least not for this example), since it has to write the same number of bytes. I also would not recommend overwriting the file in place: if you make a mistake, you will mess up your data.
Exporting data is actually quite efficient, but even better, you can also choose to just export the columns you want:
df = vaex.open('somedata.hdf5')
df2 = some_transformation(df)
df2[['new_column1', 'new_columns2']].export('somedata_extra.hdf5')
...
# next time
df = vaex.open('somedata.hdf5')
df2 = vaex.open('somedata_extra.hdf5')
df = df.join(df2) # a join without a column name joins on a row basis
We used this approach a lot to create precomputed auxiliary datasets on disk. Joining them back (on a row basis) is instant; it does not take any time or memory.

Proper way of writing and reading Dataframe to file in Python

I would like to write and later read a dataframe in Python.
df_final.to_csv(self.get_local_file_path(hash,dataset_name), sep='\t', encoding='utf8')
...
df_final = pd.read_table(self.get_local_file_path(hash,dataset_name), encoding='utf8',index_col=[0,1])
But then I get:
sys:1: DtypeWarning: Columns (7,17,28) have mixed types. Specify dtype
option on import or set low_memory=False.
I found this question, whose bottom line is that I should specify the field types when I read the file because "low_memory" is deprecated. I find that very inefficient.
Isn't there a simple way to write & later read a Dataframe? I don't care about the human-readability of the file.
You can pickle your dataframe:
df_final.to_pickle(self.get_local_file_path(hash,dataset_name))
Read it back later:
df_final = pd.read_pickle(self.get_local_file_path(hash,dataset_name))
If your dataframe is big and this gets too slow, you might have more luck using the HDF5 format (to_hdf requires a key naming the dataset inside the file; 'df' here is an arbitrary choice):
df_final.to_hdf(self.get_local_file_path(hash,dataset_name), key='df')
Read it back later:
df_final = pd.read_hdf(self.get_local_file_path(hash,dataset_name), key='df')
You might need to install PyTables first.
Both ways store the data along with their types. Therefore, this should solve your problem.
The warning appears because pandas has detected conflicting data types within those columns. You can specify the data types when you read the file by adding a dtype argument to the read call:
,dtype={'FIELD':int,'FIELD2':str}
Etc.
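As a minimal sketch of the dtype option, assuming it is passed to pd.read_csv (the same keyword exists on pd.read_table): the column names FIELD and FIELD2 come from the fragment above, and the sample data is invented.

```python
import io

import pandas as pd

# Invented sample: FIELD2 mixes digit-only and text values.
csv_text = "FIELD,FIELD2\n1,007\n2,abc\n"

# Declaring the types up front means pandas does not have to guess them,
# so FIELD2 stays text and "007" keeps its leading zeros.
df = pd.read_csv(io.StringIO(csv_text), dtype={"FIELD": int, "FIELD2": str})
```

For the columns named in the warning (7, 17, 28), you would pass their real names or positions as the dictionary keys.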

How to retrieve all lines with errors in pandas

For example, I can use
pd.read_csv('file.csv')
to load a csv file.
By default, it fails when there are any parsing errors. I understand that one can use error_bad_lines=False to skip the rows with errors.
But my question is:
How to get all the lines where errors occur? This way, I can potentially solve the problem for not only this particular file.csv but also other related files in a batch file1.csv, file2.csv, file3.csv ...
One easy way would be to prepend a row index number to each row. This can easily be done with Awk or Python before loading the data. You could even do it in-memory using StringIO or your own custom file-like object in Python which would "magically" prepend the row numbers.
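The in-memory variant of that idea could look roughly like this. It is a sketch under several assumptions: the file has a single header line, the separator is a comma, "bad" means a row with too many fields, and pandas is version 1.3 or newer, where on_bad_lines="skip" replaces error_bad_lines=False. The function name find_bad_line_numbers and the lineno column are made up for illustration.

```python
import io

import pandas as pd


def find_bad_line_numbers(text):
    """Return 1-based line numbers of data rows that pandas skipped.

    Hypothetical helper: assumes a one-line header and comma-separated fields.
    """
    lines = text.splitlines()
    # Prefix the header with a new column name and every data row with its
    # 1-based line number in the original text.
    numbered = ["lineno," + lines[0]]
    numbered += [f"{i},{line}" for i, line in enumerate(lines[1:], start=2)]
    df = pd.read_csv(io.StringIO("\n".join(numbered)), on_bad_lines="skip")
    # Any line number missing from the parsed frame belonged to a skipped row.
    kept = set(df["lineno"].tolist())
    return [i for i in range(2, len(lines) + 1) if i not in kept]


# Line 3 has three fields but the header declares only two.
sample = "a,b\n1,2\n3,4,5\n6,7\n"
bad = find_bad_line_numbers(sample)
```

Running the same function over file1.csv, file2.csv, and so on (after reading each file's text) would give the failing line numbers per file.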

Python, Memory Error in making dataframe

When I create a pandas DataFrame, I get a MemoryError.
data has 200,000 rows and 30 columns (type: list).
fieldnames1 holds the column names (type: list).
The error occurs at:
df = pd.DataFrame(data,columns=[fieldnames1])
What should I do?
(Python 2.7, 32-bit)
As indicated by Klaus, you're running out of memory. The problem occurs when you try to pull the entire text to memory in one go.
As pointed out in this post by Wes McKinney, "a solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) then concatenate then with pd.concat".
You can try this line of code:
data = pd.read_csv("train.csv", index_col=0, parse_dates=True)
This is the current equivalent of the older pd.DataFrame.from_csv("train.csv"), which was removed in pandas 1.0. That call returned a DataFrame object without giving me any memory error.
P.S. The training data is around 73 MB.
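The chunked read quoted above can be sketched like this. The CSV text is generated in memory purely for illustration; with a real file you would pass its path to read_csv instead. Note that the final concat still needs the whole result to fit in memory, so this only lowers the peak during parsing.

```python
import io

import pandas as pd

# Illustrative data: 2,500 rows with two integer columns.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(2500))

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of a single DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=1000)

# Collect the pieces in a list, then concatenate them once at the end.
pieces = [chunk for chunk in reader]
df = pd.concat(pieces, ignore_index=True)
```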

How to force a python function to return a particular type of object?

I am using pandas to read a csv file and convert it into a numpy array. Earlier I was loading the whole file and getting a memory error, so I went through this link and tried to read the file in chunks.
But now I am getting a different error, which says:
AssertionError: first argument must be a list-like of pandas objects, you passed an object of type "TextFileReader"
This is the code I am using:
>>> X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
>>> X = pd.concat(X_chunks, ignore_index=True)
The API reference for read_csv says that it returns either a DataFrame or a TextParser. The concat function works fine if X_chunks is a DataFrame, but here its type is TextParser.
Is there any way to force the return type of read_csv, or any workaround to load the whole file as a numpy array?
Since iterator=False is the default, and chunksize forces a TextFileReader object, may I suggest:
X_chunks = pd.read_csv('train_v2.csv')
But you don't want to materialize the list?
Final suggestion:
X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
for chunk in X_chunks:
    analyze(chunk)
Where analyze is whatever process you've broken up to analyze the chunks piece by piece, since you apparently can't load the entire dataset into memory.
You can't use concat the way you're trying to because it demands that the data be fully materialized, which makes sense: you can't concatenate something that isn't there yet.
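One possible shape for such a piece-by-piece analysis is a running aggregate that never holds more than one chunk at a time. In this sketch the column name x, the generated data, and the choice of a mean as the "analysis" are all assumptions made for illustration.

```python
import io

import pandas as pd

# Illustrative data: a single column holding the integers 0..2499.
csv_text = "x\n" + "\n".join(str(i) for i in range(2500))

total = 0
rows = 0
# Each iteration yields a DataFrame of at most 1,000 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1000):
    total += int(chunk["x"].sum())
    rows += len(chunk)

# The mean over the whole file, computed without loading it all at once.
mean_x = total / rows
```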
