Python, MemoryError when making a DataFrame - python

When I build a pandas DataFrame, I get a MemoryError.
data has 200000 rows and 30 columns (type: list).
fieldnames1 holds the column names (type: list).
The error occurs in:
df = pd.DataFrame(data,columns=[fieldnames1])
What should I do?
(Python version 2.7, 32-bit)

As indicated by Klaus, you're running out of memory. The problem occurs when you try to pull the entire text into memory in one go.
As pointed out in this post by Wes McKinney, "a solution is to read the file in smaller pieces (use iterator=True, chunksize=1000) then concatenate them with pd.concat".
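A minimal sketch of that approach, assuming the data lives in a CSV file (train.csv is a placeholder name, not one from the question):

import pandas as pd

# Read the file in 1000-row pieces instead of pulling everything in at once.
pieces = [chunk for chunk in pd.read_csv('train.csv', iterator=True, chunksize=1000)]

# Concatenate the pieces back into a single DataFrame.
df = pd.concat(pieces, ignore_index=True)

Note that on 32-bit Python the process address space is the hard limit, so if even the concatenated result does not fit, the fallback is to process each chunk separately or to move to 64-bit Python.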

You can try this line of code:
data = pd.DataFrame.from_csv("train.csv")
This is an alternative to read_csv, but it returns a DataFrame object without giving any memory error.
P.S. The size of the training data is around 73 MB.

Related

Changing type of really large columns - Python MemoryError

We are migrating from SAS to Python and I am having some trouble dealing with large dataframes.
I am dealing with a df with 15 million rows and 44 columns, a pretty large one. I need to replace commas with dots in some columns, delete some other columns, and convert some to dates.
To delete, I found out that this works pretty well:
del df['column']
but when trying to replace using this:
df["column"] = (df["column"].replace('\.','', regex=True).replace(',','.', regex=True).astype(float))
I get:
MemoryError: Unable to allocate 14.2 MiB for an array with shape (14901054,) and data type uint8
Same happens when trying to convert to date using this:
df['column'] = pd.to_datetime(df['column'],errors='coerce')
I get:
MemoryError: Unable to allocate 114. MiB for an array with shape (14901054,) and data type datetime64[ns]
Is there any other way to do those things, only more memory-efficient? Or is the only solution to split the df beforehand? Thanks!
P.S. Not all columns give me this problem, but I guess that is not important.
I am not an expert with this library, but when you have to deal with a big amount of data, you can store the information on disk and read it back in chunks.
A good (though probably not the best) solution is to store it in a temporary CSV file and then read that file in chunks, so fewer rows are held in memory at a time. The steps, with a sketch below:
1. Start from the original dataframe.
2. Remove the unnecessary columns.
3. Store the dataframe in a temporary CSV file.
4. Read the CSV back in chunks.
5. For each chunk, perform the column modifications and append it to another, final CSV file.
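A minimal sketch of those steps; the file names (temp.csv, final.csv), the chunk size, and the column names ('amount', 'date', the dropped columns) are placeholders, not names from the original question:

import pandas as pd

# Steps 1-3: drop the unneeded columns and spill the dataframe to disk.
df = df.drop(columns=['unwanted_col_1', 'unwanted_col_2'])
df.to_csv('temp.csv', index=False)
del df  # free the in-memory copy

# Steps 4-5: re-read the temporary file in chunks and transform each chunk separately.
first = True
for chunk in pd.read_csv('temp.csv', chunksize=100000):
    chunk['amount'] = (chunk['amount']
                       .replace(r'\.', '', regex=True)
                       .replace(',', '.', regex=True)
                       .astype(float))
    chunk['date'] = pd.to_datetime(chunk['date'], errors='coerce')
    # Write the header only once, then append.
    chunk.to_csv('final.csv', mode='w' if first else 'a', header=first, index=False)
    first = False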
More information over here (a few minutes googling):
Why and how to use pandas with large data
Dealing with large datasets in pandas

pandas to_csv writing keeps consuming more memory until it crashes

UPDATE: I have realized that every new run was creating a new Python console, which was causing more memory consumption. I had to turn off the setting that creates a new console for each run. This feature got enabled automatically when I upgraded to PyCharm Pro for some reason. Now memory consumption is steady.
My project creates a csv named 'pressure_drop' and I want to create a new pandas dataframe using the code below.
The pressure_drop.csv in this example has 10150 rows and 12 columns. As you can see, I am deleting some columns that don't need to be shown and then creating a data frame by assigning row and column indexes. Finally, it is written to a new, more readable .csv file that I will use to create interactive charts, etc.
The problem is that Python takes up more memory every time the code is run in the console, and Python ends up crashing if the code is run enough times. Can you help me understand why this is happening?
For example, Python takes up ~100 more MB every time the code is run for the data set above.
import numpy as np
import pandas as pd

def data_frame_creator(result_array):
    # results_csv_loader and columns.dataframe_columns are the asker's own helpers.
    array = results_csv_loader(result_array)
    array = np.delete(array, [3, 4, 5, 6, 7], 1)  # drop columns 3-7
    row_count = array.shape[0] + 1
    df = pd.DataFrame(data=array, index=[np.arange(1, row_count)],
                      columns=columns.dataframe_columns)
    df.to_csv('Output.csv')

data_frame_creator('pressure_drop.csv')
It's a little hard to know what you're trying to do without knowing what the dataframes look like and what columns you want. Perhaps the function you're looking for is read_csv? E.g.:
input_df = pd.read_csv('pressure_drop.csv', usecols=[1,2,8,9,10,11,12])

OverflowError with Pandas to_hdf

Python newbie here.
I am trying to save a large data frame into HDF file with lz4 compression using to_hdf.
I use Windows 10, Python 3, pandas 0.20.2.
I get the error “OverflowError: Python int too large to convert to C long”.
None of the machine resources are close to their limits (RAM, CPU, SWAP usage)
Previous posts discuss the dtype, but the following example shows that there is some other problem, potentially related to the size?
import numpy as np
import pandas as pd
# sample dataframe to be saved, pardon my French
n=500*1000*1000
df = pd.DataFrame({'col1': [999999999999999999]*n,
                   'col2': ['aaaaaaaaaaaaaaaaa']*n,
                   'col3': [999999999999999999]*n,
                   'col4': ['aaaaaaaaaaaaaaaaa']*n,
                   'col5': [999999999999999999]*n,
                   'col6': ['aaaaaaaaaaaaaaaaa']*n})
# works fine
lim=200*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# works fine
lim=300*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
# Error
lim=400*1000*1000
df[:lim].to_hdf('df.h5','table', complib= 'blosc:lz4', mode='w')
....
OverflowError: Python int too large to convert to C long
I experienced the same issue, and it seems that it is indeed connected to the size of the data frame rather than to the dtype (I had all the columns stored as strings and was able to store them to .h5 separately).
The solution that worked for me is to save the data frame in chunks using mode='a'.
As suggested in the pandas documentation: mode {'a', 'w', 'r+'}, default 'a'; 'a' means append: an existing file is opened for reading and writing, and if the file does not exist it is created.
So the sample code would look something like:
batch_size = 1000
for i, df_chunk in df.groupby(np.arange(df.shape[0]) // batch_size):
    df_chunk.to_hdf('df.h5', 'table', complib='blosc:lz4', mode='a')
As @Giovanni Maria Strampelli pointed out, the answer of @Artem Snorkovenko only saves the last batch. The pandas documentation states the following:
In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.
Here is a possible workaround to save all batches (adjusted from the answer of @Artem Snorkovenko):
for i in range(len(df)):
    sr = df.loc[i]  # pandas Series object for the given index
    sr.to_hdf('df.h5', key='table_%i' % i, complib='blosc:lz4', mode='a')
This code saves each Pandas Series object with a different key. Each key is indexed by i.
To load the existing .h5 file after saving, one can do the following:
i = 0
dfdone = False  # if True, all keys in the .h5 file were successfully loaded
srl = []  # list of Series objects
while dfdone == False:
    # print(i)  # this is to see if the code is working properly
    try:  # check whether the current i value exists among the keys of the .h5 file
        sdfr = pd.read_hdf('df.h5', key='table_%i' % i)  # current Series object
        srl.append(sdfr)  # append each Series to a list to build the dataframe at the end
        i += 1  # increment i by 1 after loading the Series object
    except:  # if an error occurs, the current i value exceeds the number of keys; all keys are loaded
        dfdone = True  # terminate the while loop
df = pd.DataFrame(srl)  # generate the dataframe from the list of Series objects
I used a while loop, assuming we do not know the exact length of the dataframe in the .h5 file. If the length is known, a for loop can also be used.
Note that I am not saving dataframes in chunks here, so the loading procedure in its current form is not suitable for saving in chunks, where the data type would be DataFrame for each chunk. In my implementation, each saved object is a Series, and the DataFrame is generated from a list of Series. The code I provided can be adjusted to work for saving in chunks and generating a DataFrame from a list of DataFrame objects (a nice starting point can be found in this Stack Overflow entry).
In addition to @tetrisforjeff's answer:
If the df contains object dtypes, the reading could lead to errors. I would suggest pd.concat(srl) instead of pd.DataFrame(srl).
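For reference, a rough sketch of the chunked DataFrame variant the two answers above describe; the batch size and the table_%i key naming are arbitrary choices, not anything mandated by pandas:

import numpy as np
import pandas as pd

batch_size = 1000

# Save: write each batch of rows as its own DataFrame under a distinct key.
for i, df_chunk in df.groupby(np.arange(len(df)) // batch_size):
    mode = 'w' if i == 0 else 'a'  # start a fresh file, then append subsequent keys
    df_chunk.to_hdf('df.h5', key='table_%i' % i, complib='blosc:lz4', mode=mode)

# Load: read the batches back until a key is missing, then concatenate.
chunks = []
i = 0
while True:
    try:
        chunks.append(pd.read_hdf('df.h5', key='table_%i' % i))
        i += 1
    except KeyError:  # no more keys; every batch has been read
        break
df = pd.concat(chunks)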

Python Pandas to_pickle cannot pickle large dataframes

I have a dataframe "DF" with 500,000 rows. Here are the data types per column:
ID int64
time datetime64[ns]
data object
Each entry in the "data" column is an array of size [5, 500].
When I try to save this dataframe using
DF.to_pickle("my_filename.pkl")
it returned me the following error:
12 """
13 with open(path, 'wb') as f:
---> 14 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
I also tried this method but got the same error:
import pickle
with open('my_filename.pkl', 'wb') as f:
    pickle.dump(DF, f)
I tried to save 10 rows of this dataframe:
DF.head(10).to_pickle('test_save.pkl')
and there was no error at all. So it can save a small DF but not a large one.
I am using Python 3 and IPython Notebook 3 on a Mac.
Please help me solve this problem. I really need to save this DF to a pickle file. I cannot find a solution on the internet.
Until there is a fix somewhere on the pickle/pandas side of things, I'd say a better option is to use an alternative IO backend. HDF is suitable for large datasets (GBs), so you don't need to add additional split/combine logic.
df.to_hdf('my_filename.hdf','mydata',mode='w')
df = pd.read_hdf('my_filename.hdf','mydata')
Probably not the answer you were hoping for, but this is what I did (a sketch follows below)......
Split the dataframe into smaller chunks using np.array_split (numpy functions are not guaranteed to work on dataframes, but this one does, although there used to be a bug in it).
Then pickle the smaller dataframes.
When you unpickle them, use pd.concat (or DataFrame.append) to glue everything back together.
I agree it is a fudge and suboptimal. If anyone can suggest a "proper" answer I'd be interested in seeing it, but I think it is as simple as: dataframes are not supposed to get above a certain size.
Split a large pandas dataframe
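A minimal sketch of that split/pickle/concat approach; the chunk count and file names are placeholders:

import numpy as np
import pandas as pd

# Split into roughly equal pieces and pickle each one separately.
n_chunks = 10  # arbitrary; pick something that keeps each piece comfortably small
for i, piece in enumerate(np.array_split(DF, n_chunks)):
    piece.to_pickle('my_filename_%02d.pkl' % i)

# Later: unpickle the pieces and glue them back together.
pieces = [pd.read_pickle('my_filename_%02d.pkl' % i) for i in range(n_chunks)]
DF = pd.concat(pieces)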
Try to use compression. It worked for me.
data_df.to_pickle('data_df.pickle.gzde', compression='gzip')
I ran into this same issue and traced the cause to a memory problem. According to this resource, it is usually not actually caused by the memory itself, but by the movement of too many resources into swap space. I was able to save the large pandas file by disabling swap altogether with the command (provided in that link):
swapoff -a

How to force a python function to return a particular type of object?

I am using pandas to read a csv file and convert it into a numpy array. Earlier I was loading the whole file and was getting a MemoryError, so I went through this link and tried to read the file in chunks.
But now I am getting a different error, which says:
AssertionError: first argument must be a list-like of pandas objects, you passed an object of type "TextFileReader"
This is the code I am using:
>>> X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
>>> X = pd.concat(X_chunks, ignore_index=True)
The API reference for read_csv says that it returns either a DataFrame or a TextParser. The problem is that pd.concat works fine if X_chunks is a DataFrame, but its type is TextParser here.
Is there any way to force the return type of read_csv, or any workaround to load the whole file as a numpy array?
Since iterator=False is the default, and chunksize forces a TextFileReader object, may I suggest:
X_chunks = pd.read_csv('train_v2.csv')
But you don't want to materialize the list?
Final suggestion:
X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
for chunk in X_chunks:
    analyze(chunk)
where analyze is whatever process you've written to work on the chunks piece by piece, since you apparently can't load the entire dataset into memory.
You can't use concat the way you're trying to: it requires the data to be fully materialized, which makes sense; you can't concatenate something that isn't there yet.
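For illustration, a minimal sketch of that per-chunk pattern, with a hypothetical aggregation (running column sums) standing in for analyze:

import pandas as pd

totals = None
X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
for chunk in X_chunks:
    # Aggregate each chunk as it arrives instead of holding everything in memory.
    chunk_sums = chunk.sum(numeric_only=True)
    totals = chunk_sums if totals is None else totals.add(chunk_sums, fill_value=0)
print(totals)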
