Python Pandas Merge Causing Memory Overflow

I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: how/why does it happen, and how can I work around it?
The data I load is only around 85 MB or so, but I often watch my Python session climb to nearly 10 GB of memory usage and then give a memory error.
I have no idea why this happens, but it's killing me as I can't even get started looking at the data the way I want to.
Here's what I've done:
Importing the Main data
import requests, zipfile, StringIO
import numpy as np
import pandas as pd
STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
STAR2013fileName = 'ca2013_all_csv_v3.txt'
r = requests.get(STAR2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STAR2013=pd.read_csv(z.open(STAR2013fileName))
Importing Some Cross-Referencing Tables
STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
STARentityList2013fileName = "ca2013entities_csv.txt"
r = requests.get(STARentityList2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName))
STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
STARlookUpTestID2013fileName = "Tests.txt"
r = requests.get(STARlookUpTestID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName))
STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
STARlookUpSubgroupID2013fileName = "Subgroups.txt"
r = requests.get(STARlookUpSubgroupID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))
Renaming a Column ID to Allow for Merge
STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'})
STARlookUpSubgroupID2013
Successful Merge
merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')
Try a second merge. This is where the memory overflow happens:
merged=pd.merge(merged, STARentityList2013, on='School Code')
I did all of this in an IPython notebook, but I don't think that changes anything.
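For what it's worth, the blow-up in the second merge is almost certainly key duplication: pd.merge produces one output row for every matching pair of keys, so repeated 'School Code' values on both sides multiply. A rough diagnostic sketch (not part of the original post, and assuming the frames above are loaded) to estimate the result size before running the merge:
# how often each join key appears on each side of the second merge
left_counts = merged['School Code'].value_counts()
right_counts = STARentityList2013['School Code'].value_counts()
shared = left_counts.index.intersection(right_counts.index)
# each shared key contributes (left count * right count) rows to the result
expected_rows = int((left_counts[shared] * right_counts[shared]).sum())
print(expected_rows)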

Although this is an old question, I recently came across the same problem.
In my case, duplicate keys are required in both dataframes, and I needed a method that could tell whether a merge would fit into memory ahead of the computation and, if not, switch to a different computation method.
The method I came up with is as follows:
Calculate merge size:
def merge_size(left_frame, right_frame, group_by, how='inner'):
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # count rows whose key is NaN (NaN != NaN); these are dropped by the groupby above
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)
Note:
At present with this method, the key can only be a label, not a list. Using a list for group_by currently returns a sum of merge sizes for each label in the list. This will result in a merge size far larger than the actual merge size.
If you are using a list of labels for the group_by, the final row size is:
min([merge_size(df1, df2, label, how) for label in group_by])
Check if this fits in memory
The merge_size function defined here returns the number of rows which will be created by merging two dataframes together.
By multiplying this by the combined number of columns from both dataframes, and then by the itemsize of np.float32/np.float64, you can get a rough idea of how large the resulting dataframe will be in memory. This can then be compared against psutil.virtual_memory().available to see whether your system can compute the full merge.
import numpy as np
import psutil

def mem_fit(df1, df2, key, how='inner'):
    rows = merge_size(df1, df2, key, how)
    cols = len(df1.columns) + (len(df2.columns) - 1)
    required_memory = (rows * cols) * np.dtype(np.float64).itemsize
    return required_memory <= psutil.virtual_memory().available
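A hypothetical usage sketch (not part of the original answer), guarding the second merge from the question with this check:
if mem_fit(merged, STARentityList2013, 'School Code', how='inner'):
    merged = pd.merge(merged, STARentityList2013, on='School Code')
else:
    # fall back to a chunked or on-disk strategy rather than an in-memory merge
    print('The merged result would not fit in available memory.')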
The merge_size method has been proposed as an extension to pandas in this issue: https://github.com/pandas-dev/pandas/issues/15068.

Related

Creating Cartesian Product DataFrame without maxing Memory

I have several dataframes, from which I'm creating a cartesian product (on purpose!)
After this, I'm exporting the result to disk.
I believe the size of the resulting dataframe could exceed my memory footprint, so I'm wondering whether there is a way to chunk this so that the whole dataframe doesn't need to be in memory at the same time?
Example Code:
import pandas as pd
def create_list_from_range(r1,r2):
    if (r1 == r2):
        return r1
    else:
        res = []
        while(r1 < r2+1):
            res.append(r1)
            r1 += 1
        return res
# make a list of options
color_opt = ['red','blue','green','orange']
dow_opt = create_list_from_range(1,7)
hod_opt = create_list_from_range(0,23)
# turn each list into a dataframe
df_color = pd.DataFrame({'color': color_opt})
df_day = pd.DataFrame({'day_of_week': dow_opt})
df_hour = pd.DataFrame({'hour_of_day': hod_opt})
# add a dummy column to everything so I can easily do a cartesian product
df_color['dummy']=1
df_day['dummy']=1
df_hour['dummy']=1
# now cartesian product... cascading
merge1 = pd.merge(df_day, df_hour, on='dummy')
FINAL = pd.merge(merge1, df_color, on='dummy')
FINAL.to_csv('FINAL_OUTPUT.csv', index=False)
You could try building up individual rows using itertools.product. In your example, you could do this as follows:
from itertools import product
prod = product(color_opt, dow_opt, hod_opt)
You can then take a number of rows at a time and append them to an existing CSV file using
df.to_csv("file", mode="a")
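A minimal sketch of that chunked approach (the chunk size is an arbitrary choice; the column names and output file are taken from the question, not from the original answer):
from itertools import product, islice
import pandas as pd

prod = product(color_opt, dow_opt, hod_opt)
columns = ['color', 'day_of_week', 'hour_of_day']
chunk_size = 10000  # rows held in memory at a time

first = True
while True:
    chunk = list(islice(prod, chunk_size))
    if not chunk:
        break
    # write the header only once, then keep appending
    pd.DataFrame(chunk, columns=columns).to_csv('FINAL_OUTPUT.csv', mode='w' if first else 'a', header=first, index=False)
    first = False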

How to optimize this pandas iterable

I have the following method in which I am eliminating overlapping intervals in a dataframe based on a set of hierarchical rules:
def disambiguate(arg):
    arg['length'] = (arg.end - arg.begin).abs()
    df = arg[['begin', 'end', 'note_id', 'score', 'length']].copy()
    data = []
    out = pd.DataFrame()
    for row in df.itertuples():
        test = df[df['note_id']==row.note_id].copy()
        # get overlapping intervals:
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin.apply(pd.to_numeric), test.end.apply(pd.to_numeric), closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        # filter out overlapping rows via hierarchy
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == minScore]
        data.append(fx)
    out = pd.concat(data, axis=0)
    # randomly reindex to keep random row when dropping remaining duplicates: https://gist.github.com/cadrev/6b91985a1660f26c2742
    out.reset_index(inplace=True)
    out = out.reindex(np.random.permutation(out.index))
    return out.drop_duplicates(subset=['begin', 'end', 'note_id'])
This works fine, except that the dataframes I am iterating over have well over 100K rows each, so it takes forever to complete. I timed the various methods using %prun in Jupyter, and the call that seemed to eat up the most processing time was series.py:3719(apply). NB: I tried using modin.pandas, but that caused more problems (I kept getting an error about Interval needing a left value less than right, which I couldn't figure out; I may file a GitHub issue there).
I am looking for a way to optimize this, such as vectorization, but honestly I don't have the slightest clue how to convert it to a vectorized form.
Here is a sample of my data:
begin,end,note_id,score
0,9,0365,1
10,14,0365,1
25,37,0365,0.7
28,37,0365,1
38,42,0365,1
53,69,0365,0.7857142857142857
56,60,0365,1
56,69,0365,1
64,69,0365,1
83,86,0365,1
91,98,0365,0.8333333333333334
101,108,0365,1
101,127,0365,1
112,119,0365,1
112,127,0365,0.8571428571428571
120,127,0365,1
163,167,0365,1
196,203,0365,1
208,216,0365,1
208,223,0365,1
208,231,0365,1
208,240,0365,0.6896551724137931
217,223,0365,1
217,231,0365,1
224,231,0365,1
246,274,0365,0.7692307692307693
252,274,0365,1
263,274,0365,0.8888888888888888
296,316,0365,0.7222222222222222
301,307,0365,1
301,316,0365,1
301,330,0365,0.7307692307692307
301,336,0365,0.78125
308,316,0365,1
308,323,0365,1
308,330,0365,1
308,336,0365,1
317,323,0365,1
317,336,0365,1
324,330,0365,1
324,336,0365,1
361,418,0365,0.7368421052631579
370,404,0365,0.7111111111111111
370,418,0365,0.875
383,418,0365,0.8285714285714286
396,404,0365,1
396,418,0365,0.8095238095238095
405,418,0365,0.8333333333333334
432,453,0365,0.7647058823529411
438,453,0365,1
438,458,0365,0.7222222222222222
I think I know what the issue was: I was filtering on note_id incorrectly, and was therefore iterating over the entire dataframe.
It should have been:
data = []
cases = set(df['note_id'].tolist())
for case in cases:
    test = df[df['note_id']==case].copy()
    for row in test.itertuples():
        # get overlapping intervals:
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]
        data.append(fx)
out = pd.concat(data, axis=0)
For testing on one note, before I stopped iterating over the entire, non-filtered dataframe, it was taking over 16 minutes. Now, it's at 28 seconds!
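If it needs to be even faster, one further tweak (my own suggestion, not something from the original post) would be to build the IntervalIndex once per note_id rather than once per row, since it is identical for every row of a given case:
for case in cases:
    test = df[df['note_id']==case].copy()
    # the intervals are the same for every row of this case, so build the index once
    iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
    for row in test.itertuples():
        fx = test[iix.overlaps(pd.Interval(row.begin, row.end))].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]
        data.append(fx)
out = pd.concat(data, axis=0)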

How to loop over multiple subsets, perform operations and take the results to the original dataframe in python?

I have a dataframe with millions of rows and about 100k unique ID numbers. I want to perform operations per unique ID. For now I generate a subset per unique ID and perform some operations accordingly. This loop works. But how do I efficiently combine the subsets into one dataframe?
Maybe there is a more efficient way to perform operations per subset of unique IDs.
Thanks
for ID in np.unique(df_fin['ID']):
    ID_subset = df_fin.loc[df_fin['ID'] == ID]
    for i in ID_subset.index:
        if ID_subset['date_diff'][i] > 0:
            for p in range(0,ID_subset['date_diff'][i]):
                if p == WIP:
                    sl.appendleft(ID_subset.return_bin[i-1])
                else:
                    sl.appendleft(0)
            lissa = list(sl)
            ID_subset.at[i,'list_stock'] = lissa
    frames = [ID_subset] # this does not work
    final_mod = pd.concat(frames) # this also does not work
THIS IS WORKING:
I also tried with groupby.apply. See the code below.
def create_stocklist(x):
    x['date_diff'] = x['dates'] - x['dates'].shift()
    x['date_diff'] = x['date_diff'].fillna(0)
    x['date_diff'] = (x['date_diff'] / np.timedelta64(1, 'D')).astype(int)
    x['list_stock'] = x['list_stock'].astype(object)
    x['stock_new'] = x['stock_new'].astype(object)
    var_stock = DOS*[0]          # DOS and WIP are constants not shown in this snippet
    sl = deque([0],maxlen=DOS)   # deque comes from collections
    for i in x.index:
        if x['date_diff'][i] > 0:
            for p in range(0,x['date_diff'][i]):
                if p == WIP:
                    sl.appendleft(x.return_bin[i-1])
                else:
                    sl.appendleft(0)
            lissa = list(sl)
            x.at[i,'list_stock'] = lissa
    return x
df_fin.groupby(by=['ID']).apply(create_stocklist)
An approach could be:
for _id, g in df_fin.groupby(by=['ID']):
    # do stuff with g
g is a dataframe containing all rows such that df_fin['ID'] == _id
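To also cover the part about combining the subsets back into one dataframe, a minimal sketch: collect each processed group in a list and concatenate once at the end (process_subset here is just a hypothetical placeholder for whatever per-ID logic you run):
frames = []
for _id, g in df_fin.groupby('ID'):
    frames.append(process_subset(g))  # process_subset stands in for the per-ID operations
final_mod = pd.concat(frames)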

Optimizing Python Code: Faster groupby and for loops

I want to make the for loop given below faster in Python.
import pandas as pd
import numpy as np
import scipy.stats
import math

np.random.seed(1)
xl = pd.DataFrame({'Concat' : np.arange(101,999), 'ships_x' : np.random.randint(1001,3000,size=898)})
yl = pd.DataFrame({'PickDate' : np.random.randint(1,8,size=10000),'Concat' : np.random.randint(101,999,size=10000), 'ships_x' : np.random.randint(101,300,size=10000), 'ships_y' : np.random.randint(1001,3000,size=10000)})
tempno = [np.random.randint(1,100,size=5)]
k=1
p = pd.DataFrame(0,index=np.arange(len(xl)),columns=['temp','cv']).astype(object)
for ib in [xb for xb in range(0,len(xl))]:
    tempno1 = np.append(tempno,ib)
    temp = list(set(tempno1))
    temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')['ships_x','ships_y'].sum().reset_index()
    temptab['contri'] = temptab['ships_x']/temptab['ships_y']
    p.ix[k-1,'cv'] = 1 if math.isnan(scipy.stats.variation(temptab['contri'])) else scipy.stats.variation(temptab['contri'])
    p.ix[k-1,'temp'] = temp
    k = k+1
where,
xl, yl - the two data frames I am working on, with columns like Concat, ships_x and ships_y.
tempno - an initial list of indices of the xl dataframe, referring to a list of 'Concat' values.
So in the for loop we add one extra index to tempno on each iteration and then subset the 'yl' dataframe based on 'Concat' values matching those of the 'xl' dataframe. Then we compute the coefficient of variation (scipy.stats.variation) and record it in the new dataframe 'p'.
The problem is that this takes too much time, since the number of loop iterations runs into the thousands, and the groupby line takes the most time. I have tried a few changes; the code now looks like the version below, with the changes noted in comments. There is a slight improvement, but it doesn't solve my problem. Please suggest the fastest way possible to implement this. Many thanks.
# Getting all tempno1 into a list with one step
tempno1 = [np.append(tempno,ib) for ib in [xb for xb in range(0,len(xl))]]
temp = [list(set(tempk)) for tempk in tempno1]
# Taking only needed columns from x and y dfs
xtemp = xl[['Concat']]
ytemp = yl[['Concat','ships_x','ships_y','PickDate']]
#Shortlisting y df and groupby in two diff steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')['ships_x','ships_y'].sum().reset_index() for ytempk in ytemp]
tempkcontri = [tempk['ships_x']/tempk['ships_y'] for tempk in temptab]
tempkcontri = [pd.DataFrame(tempkcontri[i],columns=['contri']) for i in range(0,len(tempkcontri))]
temptab = [temptab[i].join(tempkcontri[i]) for i in range(0,len(temptab))]
pcv = [1 if math.isnan(scipy.stats.variation(temptabkk['contri'])) else scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
p = pd.DataFrame({'temp' : temp,'cv': pcv})
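One further idea that might help (my own suggestion, not part of the original post): since the per-iteration groupby is recomputed from the raw yl rows every time, you could pre-aggregate yl by ('Concat', 'PickDate') once, so each iteration only filters and sums the much smaller pre-aggregated table:
# pre-aggregate once: ship totals per (Concat, PickDate) pair
pre = yl.groupby(['Concat', 'PickDate'])[['ships_x', 'ships_y']].sum().reset_index()
pcv2 = []
for tempnokk in tempno1:
    concats = np.array(xl['Concat'][tempnokk])
    sub = pre[pre['Concat'].isin(concats)].groupby('PickDate')[['ships_x', 'ships_y']].sum()
    contri = sub['ships_x'] / sub['ships_y']
    cv = scipy.stats.variation(contri)
    pcv2.append(1 if math.isnan(cv) else cv)
p = pd.DataFrame({'temp' : temp, 'cv': pcv2})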

pandas: setting last N rows of multi-index to NaN for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift, and thanks to this previous question and answer (How to speed up Pandas multilevel dataframe shift by group?) I can prove that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry of each multi-index group to NaN, so that I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use some similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0,groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',1,inplace=True)
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp,col,N,value):
    if (N > 0):
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

df = df.groupby(level=0).apply(replace_tail,'tmpShift',2,np.nan)
So the final code is:
def replace_tail(grp,col,N,value):
    if (N > 0):
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0,groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)
shiftBy=-1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail,'tmpShift',shiftBy,np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',1,inplace=True)
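For reference, a vectorized alternative I have not benchmarked (not part of the original answer): groupby.cumcount(ascending=False) numbers the rows of each group from the end, which gives a mask for the last abs(shiftBy) rows without any apply. It would go in place of the groupby(level=0).apply(replace_tail, ...) line above:
# 0 = last row of each level-0 group, 1 = second to last, and so on
pos_from_end = df.groupby(level=0).cumcount(ascending=False)
df.loc[(pos_from_end < abs(shiftBy)).values, 'tmpShift'] = np.nan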
