Python Pandas Merge Causing Memory Overflow
I'm new to Pandas and am trying to merge a few subsets of data. I'm giving a specific case where this happens, but the question is general: How/why is it happening and how can I work around it?
The data I load is only around 85 MB, but I often watch my Python session climb to nearly 10 GB of memory usage and then raise a memory error.
I have no idea why this happens, but it's killing me as I can't even get started looking at the data the way I want to.
Here's what I've done:
Importing the Main data
import requests, zipfile, StringIO
import numpy as np
import pandas as pd
STAR2013url="http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013_all_csv_v3.zip"
STAR2013fileName = 'ca2013_all_csv_v3.txt'
r = requests.get(STAR2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STAR2013=pd.read_csv(z.open(STAR2013fileName))
Importing some Cross-Referencing Tables
STARentityList2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/ca2013entities_csv.zip"
STARentityList2013fileName = "ca2013entities_csv.txt"
r = requests.get(STARentityList2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARentityList2013=pd.read_csv(z.open(STARentityList2013fileName))
STARlookUpTestID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/tests.zip"
STARlookUpTestID2013fileName = "Tests.txt"
r = requests.get(STARlookUpTestID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpTestID2013=pd.read_csv(z.open(STARlookUpTestID2013fileName))
STARlookUpSubgroupID2013url = "http://www3.cde.ca.gov/starresearchfiles/2013/p3/subgroups.zip"
STARlookUpSubgroupID2013fileName = "Subgroups.txt"
r = requests.get(STARlookUpSubgroupID2013url)
z = zipfile.ZipFile(StringIO.StringIO(r.content))
STARlookUpSubgroupID2013=pd.read_csv(z.open(STARlookUpSubgroupID2013fileName))
Renaming a Column ID to Allow for Merge
STARlookUpSubgroupID2013 = STARlookUpSubgroupID2013.rename(columns={'001':'Subgroup ID'})
STARlookUpSubgroupID2013
Successful Merge
merged = pd.merge(STAR2013,STARlookUpSubgroupID2013, on='Subgroup ID')
Try a second merge. This is where the Memory Overflow Happens
merged=pd.merge(merged, STARentityList2013, on='School Code')
I did all of this in an IPython notebook, but I don't think that changes anything.
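A quick way to estimate how much a merge will multiply rows, before running it, is to compare the key counts on both sides. A minimal sketch using the column names from the question:

# rows per 'School Code' on each side of the second merge
left_counts = merged['School Code'].value_counts()
right_counts = STARentityList2013['School Code'].value_counts()

# an inner merge produces, for each shared key, (left count) * (right count) rows
shared = left_counts.index.intersection(right_counts.index)
print((left_counts[shared] * right_counts[shared]).sum())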
Although this is an old question, I recently came across the same problem.
In my case, duplicate keys are required in both dataframes, and I needed a way to tell ahead of time whether a merge would fit in memory and, if not, to change the computation method.
The method I came up with is as follows:
Calculate merge size:
def merge_size(left_frame, right_frame, group_by, how='inner'):
    # number of rows per key on each side
    left_groups = left_frame.groupby(group_by).size()
    right_groups = right_frame.groupby(group_by).size()
    left_keys = set(left_groups.index)
    right_keys = set(right_groups.index)
    intersection = right_keys & left_keys
    left_diff = left_keys - intersection
    right_diff = right_keys - intersection

    # NaN keys never compare equal to themselves, so x != x counts the NaN rows
    left_nan = len(left_frame[left_frame[group_by] != left_frame[group_by]])
    right_nan = len(right_frame[right_frame[group_by] != right_frame[group_by]])
    left_nan = 1 if left_nan == 0 and right_nan != 0 else left_nan
    right_nan = 1 if right_nan == 0 and left_nan != 0 else right_nan

    # shared keys contribute the product of their counts on each side
    sizes = [(left_groups[group_name] * right_groups[group_name]) for group_name in intersection]
    sizes += [left_nan * right_nan]

    # keys present on only one side survive only in left/right/outer joins
    left_size = [left_groups[group_name] for group_name in left_diff]
    right_size = [right_groups[group_name] for group_name in right_diff]
    if how == 'inner':
        return sum(sizes)
    elif how == 'left':
        return sum(sizes + left_size)
    elif how == 'right':
        return sum(sizes + right_size)
    return sum(sizes + left_size + right_size)
Note:
At present with this method, the key can only be a label, not a list. Using a list for group_by currently returns a sum of merge sizes for each label in the list. This will result in a merge size far larger than the actual merge size.
If you are using a list of labels for the group_by, the final row size is:
min([merge_size(df1, df2, label, how) for label in group_by])
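For example, applied to the dataframes from the question, the check before the second merge could look like this:

# estimated row count of the second merge before actually running it
estimated_rows = merge_size(merged, STARentityList2013, 'School Code', how='inner')
print(estimated_rows)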
Check if this fits in memory
The merge_size function defined here returns the number of rows which will be created by merging two dataframes together.
Multiplying this by the number of columns from both dataframes, and then by the item size of np.float32/np.float64, gives a rough idea of how large the resulting dataframe will be in memory. This can then be compared against psutil.virtual_memory().available to see if your system can compute the full merge.
import psutil  # used to check available system memory

def mem_fit(df1, df2, key, how='inner'):
    rows = merge_size(df1, df2, key, how)
    cols = len(df1.columns) + (len(df2.columns) - 1)
    required_memory = (rows * cols) * np.dtype(np.float64).itemsize
    return required_memory <= psutil.virtual_memory().available
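Used together, a guard around the problematic merge from the question might look like the sketch below; the fallback branch is only illustrative, merging one school's rows at a time and appending to disk instead of holding the full result in memory:

if mem_fit(merged, STARentityList2013, 'School Code'):
    merged = pd.merge(merged, STARentityList2013, on='School Code')
else:
    # illustrative fallback: merge per school code and append each piece to a CSV
    first = True
    for code, chunk in merged.groupby('School Code'):
        part = pd.merge(chunk, STARentityList2013, on='School Code')
        part.to_csv('merged_output.csv', mode='a', header=first, index=False)
        first = False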
The merge_size method has been proposed as an extension to pandas in this issue: https://github.com/pandas-dev/pandas/issues/15068
Related
Creating Cartesian Product DataFrame without maxing Memory
I have several dataframes, from which I'm creating a cartesian product (on purpose!). After this, I'm exporting the result to disk. I believe the size of the resulting dataframe could exceed my memory footprint, so I'm wondering: is there a way that I can chunk this so that the dataframe doesn't need to all be in memory at the same time?

Example Code:

import pandas as pd

def create_list_from_range(r1,r2):
    if (r1 == r2):
        return r1
    else:
        res = []
        while(r1 < r2+1):
            res.append(r1)
            r1 += 1
        return res

# make a list of options
color_opt = ['red','blue','green','orange']
dow_opt = create_list_from_range(1,7)
hod_opt = create_list_from_range(0,23)

# turn each list into a dataframe
df_color = pd.DataFrame({'color': color_opt})
df_day = pd.DataFrame({'day_of_week': dow_opt})
df_hour = pd.DataFrame({'hour_of_day': hod_opt})

# add a dummy column to everything so I can easily do a cartesian product
df_color['dummy'] = 1
df_day['dummy'] = 1
df_hour['dummy'] = 1

# now cartesian product... cascading
merge1 = pd.merge(df_day, df_hour, on='dummy')
FINAL = pd.merge(merge1, df_color, on='dummy')
FINAL.to_csv('FINAL_OUTPUT.csv', index=False)
You could try building up individual rows using itertools.product. In your example, you could do this as follows:

from itertools import product
prod = product(color_opt, dow_opt, hod_opt)

You can then get a number of rows at a time and append them to an existing csv file using df.to_csv("file", mode="a").
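A minimal sketch of that chunked approach (the chunk size and output filename are illustrative):

from itertools import product, islice
import pandas as pd

color_opt = ['red', 'blue', 'green', 'orange']
dow_opt = list(range(1, 8))
hod_opt = list(range(0, 24))

prod = product(dow_opt, hod_opt, color_opt)
chunk_size = 1000  # illustrative value
header = True
while True:
    # pull the next chunk_size rows from the generator without materialising the rest
    chunk = list(islice(prod, chunk_size))
    if not chunk:
        break
    df = pd.DataFrame(chunk, columns=['day_of_week', 'hour_of_day', 'color'])
    df.to_csv('FINAL_OUTPUT.csv', mode='a', header=header, index=False)
    header = False  # only write the header once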
How to optimize this pandas iterable
I have the following method, in which I am eliminating overlapping intervals in a dataframe based on a set of hierarchical rules:

def disambiguate(arg):
    arg['length'] = (arg.end - arg.begin).abs()
    df = arg[['begin', 'end', 'note_id', 'score', 'length']].copy()
    data = []
    out = pd.DataFrame()
    for row in df.itertuples():
        test = df[df['note_id']==row.note_id].copy()
        # get overlapping intervals:
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin.apply(pd.to_numeric), test.end.apply(pd.to_numeric), closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        # filter out overlapping rows via hierarchy
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == minScore]
        data.append(fx)
    out = pd.concat(data, axis=0)
    # randomly reindex to keep random row when dropping remaining duplicates: https://gist.github.com/cadrev/6b91985a1660f26c2742
    out.reset_index(inplace=True)
    out = out.reindex(np.random.permutation(out.index))
    return out.drop_duplicates(subset=['begin', 'end', 'note_id'])

This works fine, except for the fact that the dataframes I am iterating over have well over 100K rows each, so this is taking forever to complete. I did a timing of various methods using %prun in Jupyter, and the method that seems to eat up processing time was series.py:3719(apply) ...

NB: I tried using modin.pandas, but that was causing more problems (I kept getting an error about Interval needing a value where left was less than right, which I couldn't figure out; I may file a GitHub issue there). I am looking for a way to optimize this, such as using vectorization, but honestly, I don't have the slightest clue how to convert this to a vectorized form. Here is a sample of my data:

begin,end,note_id,score
0,9,0365,1
10,14,0365,1
25,37,0365,0.7
28,37,0365,1
38,42,0365,1
53,69,0365,0.7857142857142857
56,60,0365,1
56,69,0365,1
64,69,0365,1
83,86,0365,1
91,98,0365,0.8333333333333334
101,108,0365,1
101,127,0365,1
112,119,0365,1
112,127,0365,0.8571428571428571
120,127,0365,1
163,167,0365,1
196,203,0365,1
208,216,0365,1
208,223,0365,1
208,231,0365,1
208,240,0365,0.6896551724137931
217,223,0365,1
217,231,0365,1
224,231,0365,1
246,274,0365,0.7692307692307693
252,274,0365,1
263,274,0365,0.8888888888888888
296,316,0365,0.7222222222222222
301,307,0365,1
301,316,0365,1
301,330,0365,0.7307692307692307
301,336,0365,0.78125
308,316,0365,1
308,323,0365,1
308,330,0365,1
308,336,0365,1
317,323,0365,1
317,336,0365,1
324,330,0365,1
324,336,0365,1
361,418,0365,0.7368421052631579
370,404,0365,0.7111111111111111
370,418,0365,0.875
383,418,0365,0.8285714285714286
396,404,0365,1
396,418,0365,0.8095238095238095
405,418,0365,0.8333333333333334
432,453,0365,0.7647058823529411
438,453,0365,1
438,458,0365,0.7222222222222222
I think I know what the issue was: I did my filtering on note_id incorrectly, and was thus iterating over the entire dataframe. It should have been:

data = []
cases = set(df['note_id'].tolist())
for case in cases:
    test = df[df['note_id']==case].copy()
    for row in test.itertuples():
        # get overlapping intervals:
        # https://stackoverflow.com/questions/58192068/is-it-possible-to-use-pandas-overlap-in-a-dataframe
        iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
        span_range = pd.Interval(row.begin, row.end)
        fx = test[iix.overlaps(span_range)].copy()
        maxLength = fx['length'].max()
        minLength = fx['length'].min()
        maxScore = abs(float(fx['score'].max()))
        minScore = abs(float(fx['score'].min()))
        if maxScore > minScore:
            fx = fx[fx['score'] == maxScore]
        elif maxLength > minLength:
            fx = fx[fx['length'] == maxLength]
        data.append(fx)
out = pd.concat(data, axis=0)

For testing on one note, before I stopped iterating over the entire, non-filtered dataframe, it was taking over 16 minutes. Now it's at 28 seconds!
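A further possible tidy-up (my own sketch, not part of the answer above) is to let groupby produce the per-note subsets and to build the IntervalIndex once per note instead of once per row. It assumes begin and end are already numeric:

import numpy as np
import pandas as pd

def disambiguate_by_note(df):
    # same hierarchy rules as the question, restructured around groupby('note_id')
    df = df.assign(length=(df.end - df.begin).abs())
    data = []
    for note_id, test in df.groupby('note_id'):
        # one IntervalIndex per note, reused for every row of that note
        iix = pd.IntervalIndex.from_arrays(test.begin, test.end, closed='neither')
        for row in test.itertuples():
            fx = test[iix.overlaps(pd.Interval(row.begin, row.end))].copy()
            maxScore, minScore = abs(float(fx['score'].max())), abs(float(fx['score'].min()))
            maxLength, minLength = fx['length'].max(), fx['length'].min()
            if maxScore > minScore:
                fx = fx[fx['score'] == maxScore]
            elif maxLength > minLength:
                fx = fx[fx['length'] == maxLength]
            data.append(fx)
    out = pd.concat(data, axis=0)
    # keep a random row among remaining duplicates, as in the question
    out = out.reset_index(drop=True)
    out = out.reindex(np.random.permutation(out.index))
    return out.drop_duplicates(subset=['begin', 'end', 'note_id'])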
How to loop over multiple subsets, perform operations and take the results to the original dataframe in python?
I have a dataframe with millions of rows, and about 100k unique ID numbers. I want to perform operations per unique ID. For now I generate a subset per unique ID and perform some operations accordingly. This loop works, but how do I efficiently combine the subsets into one dataframe? Maybe there is a more efficient way to perform operations per subset of unique IDs. Thanks

for ID in np.unique(df_fin['ID']):
    ID_subset = df_fin.loc[df_fin['ID'] == ID]
    for i in ID_subset.index:
        if ID_subset['date_diff'][i] > 0:
            for p in range(0,ID_subset['date_diff'][i]):
                if p == WIP:
                    sl.appendleft(ID_subset.return_bin[i-1])
                else:
                    sl.appendleft(0)
            lissa = list(sl)
            ID_subset.at[i,'list_stock'] = lissa
    frames = [ID_subset]           #this does not work
    final_mod = pd.concat(frames)  #this also does not work

THIS IS WORKING: I also tried with groupby.apply. See the code below.

def create_stocklist(x):
    x['date_diff'] = x['dates'] - x['dates'].shift()
    x['date_diff'] = x['date_diff'].fillna(0)
    x['date_diff'] = (x['date_diff'] / np.timedelta64(1, 'D')).astype(int)
    x['list_stock'] = x['list_stock'].astype(object)
    x['stock_new'] = x['stock_new'].astype(object)
    var_stock = DOS*[0]
    sl = deque([0],maxlen=DOS)
    for i in x.index:
        if x['date_diff'][i] > 0:
            for p in range(0,x['date_diff'][i]):
                if p == WIP:
                    sl.appendleft(x.return_bin[i-1])
                else:
                    sl.appendleft(0)
            lissa = list(sl)
            x.at[i,'list_stock'] = lissa
    return x

df_fin.groupby(by=['ID']).apply(create_stocklist)
An approach could be:

for _id, g in df_fin.groupby('ID'):
    # do stuff with g

g is a dataframe containing all rows such that df_fin['ID'] == _id.
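Building on that, a minimal sketch of recombining the per-ID results into one dataframe (process_group here is a hypothetical placeholder for whatever per-ID operations you need):

import pandas as pd

def process_group(g):
    g = g.copy()
    g['row_in_id'] = range(len(g))  # illustrative operation only
    return g

# collect each processed subset and concatenate them back together
pieces = [process_group(g) for _id, g in df_fin.groupby('ID')]
final_mod = pd.concat(pieces)

# or let pandas do the looping and concatenation in one call
final_mod = df_fin.groupby('ID', group_keys=False).apply(process_group)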
Optimizing Python Code: Faster groupby and for loops
I want to make the for loop given below faster in Python.

import pandas as pd
import numpy as np
import math
import scipy.stats

np.random.seed(1)
xl = pd.DataFrame({'Concat' : np.arange(101,999), 'ships_x' : np.random.randint(1001,3000,size=898)})
yl = pd.DataFrame({'PickDate' : np.random.randint(1,8,size=10000),'Concat' : np.random.randint(101,999,size=10000), 'ships_x' : np.random.randint(101,300,size=10000), 'ships_y' : np.random.randint(1001,3000,size=10000)})
tempno = [np.random.randint(1,100,size=5)]
k=1
p = pd.DataFrame(0,index=np.arange(len(xl)),columns=['temp','cv']).astype(object)
for ib in [xb for xb in range(0,len(xl))]:
    tempno1 = np.append(tempno,ib)
    temp = list(set(tempno1))
    temptab = yl[yl['Concat'].isin(np.array(xl['Concat'][tempno1]))].groupby('PickDate')['ships_x','ships_y'].sum().reset_index()
    temptab['contri'] = temptab['ships_x']/temptab['ships_y']
    p.ix[k-1,'cv'] = 1 if math.isnan(scipy.stats.variation(temptab['contri'])) else scipy.stats.variation(temptab['contri'])
    p.ix[k-1,'temp'] = temp
    k = k+1

Here, xl and yl are the two data frames I am working on, with columns like Concat, ships_x and ships_y. tempno is an initial list of indices of the xl dataframe, referring to a list of 'Concat' values. So, in the for loop we add one extra index to tempno in each iteration and then subset the 'yl' dataframe based on 'Concat' values matching those of the 'xl' dataframe. Then we compute the coefficient of variation (from scipy) and record it in the new dataframe 'p'. The problem is that it is taking too much time, as the number of iterations of the for loop runs into the thousands. The groupby line is taking the most time. I have tried and made a few changes; the code now looks like the version below, with the changes mentioned in comments. There is a slight improvement, but this doesn't solve my purpose. Please suggest the fastest way possible to implement this. Many thanks.

# Getting all tempno1 into a list with one step
tempno1 = [np.append(tempno,ib) for ib in [xb for xb in range(0,len(xl))]]
temp = [list(set(tempk)) for tempk in tempno1]

# Taking only needed columns from x and y dfs
xtemp = xl[['Concat']]
ytemp = yl[['Concat','ships_x','ships_y','PickDate']]

# Shortlisting y df and groupby in two diff steps
ytemp = [ytemp[ytemp['Concat'].isin(np.array(xtemp['Concat'][tempnokk]))] for tempnokk in tempno1]
temptab = [ytempk.groupby('PickDate')['ships_x','ships_y'].sum().reset_index() for ytempk in ytemp]
tempkcontri = [tempk['ships_x']/tempk['ships_y'] for tempk in temptab]
tempkcontri = [pd.DataFrame(tempkcontri[i],columns=['contri']) for i in range(0,len(tempkcontri))]
temptab = [temptab[i].join(tempkcontri[i]) for i in range(0,len(temptab))]
pcv = [1 if math.isnan(scipy.stats.variation(temptabkk['contri'])) else scipy.stats.variation(temptabkk['contri']) for temptabkk in temptab]
p = pd.DataFrame({'temp' : temp,'cv': pcv})
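One way to attack the bottleneck identified above (the repeated groupby inside the loop) is to group yl by Concat and PickDate once and then combine the precomputed sums for each candidate key. This is a rough, untested sketch of that idea rather than a drop-in answer; it assumes the xl, yl and tempno defined above:

import numpy as np
import pandas as pd
import scipy.stats

# group once: per-(Concat, PickDate) sums of ships_x and ships_y
per_key = yl.groupby(['Concat', 'PickDate'])[['ships_x', 'ships_y']].sum()

# per-PickDate sums for the fixed base keys, computed once
base_concats = set(xl['Concat'].iloc[tempno[0]])
base = (per_key[per_key.index.get_level_values('Concat').isin(base_concats)]
        .groupby(level='PickDate').sum())

rows = []
for ib in range(len(xl)):
    key = xl['Concat'].iloc[ib]
    if key in base_concats or key not in per_key.index.get_level_values('Concat'):
        totals = base  # the candidate key adds no new rows
    else:
        totals = base.add(per_key.xs(key, level='Concat'), fill_value=0)
    contri = totals['ships_x'] / totals['ships_y']
    cv = scipy.stats.variation(contri)
    rows.append({'temp': sorted(set(np.append(tempno[0], ib))),
                 'cv': 1 if np.isnan(cv) else cv})

p = pd.DataFrame(rows)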
pandas: setting last N rows of multi-index to Nan for speeding up groupby with shift
I am trying to speed up my groupby.apply + shift and, thanks to this previous question and answer (How to speed up Pandas multilevel dataframe shift by group?), I can prove that it does indeed speed things up when you have many groups.

From that question I now have the following code to set the first entry in each multi-index to NaN, and now I can do my shift globally rather than per group.

df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan

But I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use some similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge as I just can't figure it out.

I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?

# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]

Test setup (for backwards shift) if you want to try it:

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0,groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)

df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)

df['tmpShift'] = df['colB'].shift(1)

df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan

# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']

df.drop('tmp',1,inplace=True)

Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):

def replace_tail(grp,col,N,value):
    if (N > 0):
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

df = df.groupby(level=0).apply(replace_tail,'tmpShift',2,np.nan)

So the final code is:

def replace_tail(grp,col,N,value):
    if (N > 0):
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0,groups):
    tmpdf = pd.DataFrame({'date':rng1,'category':int(10000000*abs(np.random.randn())),'colA':np.random.randn(length),'colB':np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)

df.sort(columns=['category','date'],inplace=True)
df.set_index(['category','date'],inplace=True,drop=True)

shiftBy=-1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail,'tmpShift',shiftBy,np.nan)

# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift',1,inplace=True)
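For completeness, a possible alternative (my own suggestion, not part of the answer above) is to avoid the apply entirely by building a boolean mask with a reverse cumulative count per group and blanking the trailing rows in one vectorised assignment:

import numpy as np

N = 1  # number of trailing rows per group to set to NaN (illustrative)

# cumcount(ascending=False) numbers rows from the end of each group,
# so "< N" selects the last N rows of every group in the MultiIndex
mask = df.groupby(level=0).cumcount(ascending=False) < N
df.loc[mask, 'tmpShift'] = np.nan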