I have a pandas dataframe with approximately 3 million rows.
I want to partially aggregate the last column in separate spots based on another variable.
My solution was to separate the dataframe rows into a list of new dataframes based on that variable, aggregate the dataframes, and then join them again into a single dataframe. The problem is that after a few tens of thousands of rows, I get a memory error. What methods can I use to improve the efficiency of my function to prevent these memory errors?
An example of my code is below:
import pandas as pd

test = pd.DataFrame({"unneeded_var": [6,6,6,4,2,6,9,2,3,3,1,4,1,5,9],
                     "year": [0,0,0,0,1,1,1,2,2,2,2,3,3,3,3],
                     "month" : [0,0,0,0,1,1,1,2,2,2,3,3,3,4,4],
                     "day" : [0,0,0,1,1,1,2,2,2,2,3,3,4,4,5],
                     "day_count" : [7,4,3,2,1,5,4,2,3,2,5,3,2,1,3]})
test = test[["year", "month", "day", "day_count"]]
def agg_multiple(df, labels, aggvar, repl=None):
    if repl is None:
        repl = aggvar
    conds = df.duplicated(labels).tolist()  # returns boolean list of false for a unique (year,month) then true until next unique pair
    groups = []
    start = 0
    for i in range(len(conds)):  # When false, split previous to new df, aggregate count
        bul = conds[i]
        if i == len(conds) - 1:
            i += 1  # no false marking end of last group, special case
        if not bul and i > 0 or bul and i == len(conds):
            sample = df.iloc[start:i, :]
            start = i
            sample = sample.groupby(labels, as_index=False).agg({aggvar: sum}).rename(columns={aggvar: repl})
            groups.append(sample)
    df = pd.concat(groups).reset_index(drop=True)  # combine aggregated dfs into new df
    return df

test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
I suppose that I could potentially apply the function to small samples of the dataframe, to prevent a memory error and then combine those, but I'd rather improve the computation time of my function.
This function does the same, and is 10 times faster.
test.groupby(["year", "month"], as_index=False).agg({"day_count":sum}).rename(columns={"day_count":"month_count"})
There are almost always pandas methods that are pretty optimized for tasks that will vastly outperform iteration through the dataframe. If I understand correctly, in your case, the following will return the same exact output as your function:
test2 = (test.groupby(['year', 'month'])
             .day_count.sum()
             .to_frame('month_count')
             .reset_index())
>>> test2
   year  month  month_count
0     0      0           16
1     1      1           10
2     2      2            7
3     2      3            5
4     3      3            5
5     3      4            4
To check that it's the same:
# Your original function:
test = agg_multiple(test, ["year", "month"], "day_count", repl="month_count")
>>> test == test2
   year  month  month_count
0  True   True         True
1  True   True         True
2  True   True         True
3  True   True         True
4  True   True         True
5  True   True         True
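An element-wise `==` returns a whole frame of booleans that still has to be read by eye. A minimal sketch of a single-value check, assuming both frames have the same column order and dtypes, uses `DataFrame.equals` and `pd.testing.assert_frame_equal`. The name `expected` is only chosen here for illustration, and the named-aggregation call is a swapped-in spelling of the same groupby.

```python
import pandas as pd

test = pd.DataFrame({
    "year": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "month": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
    "day_count": [7, 4, 3, 2, 1, 5, 4, 2, 3, 2, 5, 3, 2, 1, 3],
})

# One spelling of the aggregation (named aggregation, hypothetical variable name).
expected = test.groupby(["year", "month"], as_index=False).agg(
    month_count=("day_count", "sum")
)

# The chained spelling from the answer above.
test2 = (test.groupby(["year", "month"])
             .day_count.sum()
             .to_frame("month_count")
             .reset_index())

# Raises AssertionError with a readable diff if the frames differ.
pd.testing.assert_frame_equal(expected, test2)

# Single boolean instead of a frame of booleans.
same = expected.equals(test2)
print(same)
```

`assert_frame_equal` is handy while refactoring because it reports which column or value differs, whereas `equals` only answers yes or no.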
I'm writing Python code for data analysis. I would like to mark, in a new column, the rows that have the same values in the EMP, RAZAO and ATRIB columns and whose MONTCALC values sum to zero. For example:
Example of data
In this image, the rows marked with a color form a subgroup, and if you add up the values in the MONTCALC column the result is 0.
My code:
conciliation_df_temp = conciliation_df.copy()
doc_clear = 1
for i in conciliation_df_temp.index:
    if conciliation_df_temp.loc[i,'DOC_COMP'] == "":
        company = conciliation_df_temp.loc[i,'EMP']
        gl_account = conciliation_df_temp.loc[i,'RAZAO']
        assignment = conciliation_df_temp.loc[i,'ATRIB']
        df_temp = conciliation_df_temp.loc[(conciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment)]
        if round(df_temp['MONTCALC'].sum(),2) == 0:
            conciliation_df_temp.loc[(conciliation_df_temp['EMP'] == company) & (conciliation_df_temp['RAZAO'] == gl_account) & (conciliation_df_temp['ATRIB'] == assignment),'DOC_COMP'] = doc_clear
            doc_clear += 1
With few rows (10,000) the performance is good: it runs in less than 1 minute, and that minute also includes reading a text file, file handling and conversion to a dataframe. But with a text file of more than 1 million lines the script doesn't finish; I waited 5 hours without a result.
How can I improve the performance of this code?
Regards!!
Sorry for my English.
I tried deleting rows from the dataframe to shrink it and make the search faster, but the execution was slower.
It seems like you could just check if the sum of the group is zero:
import pandas as pd
df = pd.DataFrame([
[3000,1131500040,8701731701,-156002.08],
[3000,1131500040,8701731701, 156002.08],
[3000,1131500040,"EA-17012.2.22", -3990],
[3000,1131500040,"EA-17012.2.22", 400],
[3000,1131500040,"000100000103", -35822.86],
[3000,1131500040,"000100000103", 35822.86],
[3000,1131500040,"000100000103", -35822.86],
[3000,1131500040,"000100000103", 35822.86]
], columns=['EMP','RAZAO','ATRIB','MONTCALC']
)
df['zero'] = df.groupby(['EMP','RAZAO','ATRIB'])['MONTCALC'].transform(lambda x: sum(x)==0)
print(df)
Output
    EMP       RAZAO          ATRIB   MONTCALC   zero
0  3000  1131500040     8701731701 -156002.08   True
1  3000  1131500040     8701731701  156002.08   True
2  3000  1131500040  EA-17012.2.22   -3990.00  False
3  3000  1131500040  EA-17012.2.22     400.00  False
4  3000  1131500040   000100000103  -35822.86   True
5  3000  1131500040   000100000103   35822.86   True
6  3000  1131500040   000100000103  -35822.86   True
7  3000  1131500040   000100000103   35822.86   True
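Since MONTCALC holds floats, an exact `== 0` test can miss groups whose sum is off by a tiny rounding error, which is presumably why the question rounds to 2 decimals first. A minimal sketch of the same check with that rounding kept, assuming two decimals is the right tolerance for these amounts; `keys` and `group_sum` are names made up for this example, and the data is a shortened copy of the frame above.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [3000, 1131500040, "8701731701", -156002.08],
        [3000, 1131500040, "8701731701", 156002.08],
        [3000, 1131500040, "EA-17012.2.22", -3990.0],
        [3000, 1131500040, "EA-17012.2.22", 400.0],
    ],
    columns=["EMP", "RAZAO", "ATRIB", "MONTCALC"],
)

keys = ["EMP", "RAZAO", "ATRIB"]

# transform("sum") broadcasts each group's total back onto its rows,
# using the built-in aggregation instead of a Python lambda.
group_sum = df.groupby(keys)["MONTCALC"].transform("sum")

# Round to 2 decimals (assumption: currency amounts) before comparing to zero.
df["zero"] = group_sum.round(2).eq(0)
print(df)
```

Passing the string `"sum"` rather than a lambda lets pandas use its own grouped sum, which tends to matter on frames with millions of rows.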
I want to extract the first two symbols when the first three symbols match a certain pattern: the first two symbols should each be any of those inside the brackets [ptkbdgG_fvsSxzZhmnNJlrwj], and the third symbol should be any of those inside the brackets [IEAOYye|aouKLM#)3*<!(#0~q^LMOEK].
The first two lines work correctly.
The last lines do not work and I do not understand why. The code doesn't give any errors; it just does nothing for those lines.
# extract the first three symbols and save them in a new column
df['first_three_symbols'] = df['ITEM'].str[0:3]
#create a boolean column on condition whether first three symbols contain symbols
df["ccv"] = df["first_three_symbols"].str.contains('[ptkbdgG_fvsSxzZhmnNJlrwj][ptkbdgG_fvsSxzZhmnNJlrwj][IEAOYye|aouKLM#)3*<!(#0~q^LMOEK]')
#create another column for True values in the previous column
if df["ccv"].item == True:
    df['first_two_symbols'] = df["ITEM"].str[0:2]
Here is my output:
          ID            ITEM  FREQ  first_three_symbols    ccv
0          0               a   563                    a  False
1          1       OlrMndmEn     1                  Olr  False
2          2  OlrMndSpOrtl#r     0                  Olr  False
3          3            AG#l    74                  AG#  False
4          4         AG#lbMm    24                  AG#  False
...      ...             ...   ...                  ...    ...
51723  51723         zytzWt#     8                  zyt  False
51724  51724       zytzytOst     0                  zyt  False
51725  51725          zYxtIx     5                  zYx  False
51726  51726       zYxtIxkWt     0                  zYx  False
51727  51727            zyZe     4                  zyZ  False

[51728 rows x 5 columns]
You can either write a function and use the apply method:
def f(row):
    if row["ccv"] == True:
        # row["ITEM"] is a plain string here, so slice it directly (no .str accessor)
        return row["ITEM"][0:2]
    else:
        return None

df['first_two_symbols'] = df.apply(f, axis=1)
or you can use the np.where function from the numpy package.
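The `np.where` variant could look roughly like the sketch below. The three sample items and the shortened character classes in `pattern` are made up for illustration and are not the full classes from the question. The point is only that the condition is evaluated per row, instead of once for the whole column as in the `if` statement.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real frame has about 51,000 rows.
df = pd.DataFrame({"ITEM": ["ptAx", "a", "zyt"]})

# Shortened, illustrative character classes (assumption, not the original ones).
pattern = r"^[ptkbdg][ptkbdg][IEAOY]"

# Boolean column: does the item start with the three-symbol pattern?
df["ccv"] = df["ITEM"].str.contains(pattern)

# Row-wise choice: first two symbols where ccv is True, otherwise None.
df["first_two_symbols"] = np.where(df["ccv"], df["ITEM"].str[:2], None)
print(df)
```

An equivalent pandas-only spelling is `df["ITEM"].str[:2].where(df["ccv"])`, which leaves NaN instead of None in the non-matching rows.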
I have a dataset with two columns: in the first column, the full directory path of a file. In the second column, the date the file was last modified. I am trying to figure out the number of files in each upper level folder ("dog", "feline", "mouse", "anteater") that were last modified later than 2004-06-23. Ultimately, I'd like something like this:
Here's my dataset:
import pandas as pd
# raw strings keep the backslashes literal (otherwise \r and \a are read as escape sequences)
data = {'FullName': [r"dog\cat\cow\rover.doc", r"feline\cat\cow\digger.doc", r"dog\cat\cow\whatamess.doc", r"mouse\cat\mouse\jude.doc", r"anteater\cat\mouse\sam.doc", r"dog\cat\owl\audrey.doc"],
        'LastWriteTime': ['2003-01-02', '2004-01-02', '2005-01-02', '2006-01-02', '2007-01-02', '2008-01-02']}
df1 = pd.DataFrame(data)
I can count the number of times the upper level folder recurs in the dataset:
df2 = df1['FullName'].apply(lambda x: x.split('\\')[0]).value_counts()
I can also count the number of times files with a date greater than '2004-06-23' recurs in the dataset:
df3 = df1['LastWriteTime'].apply(lambda x: pd.to_datetime((x),yearfirst=True) > pd.to_datetime('2004-06-23',yearfirst=True)).value_counts()
I tried to combine them as follows:
df2 = (df1['FullName'].apply(lambda x: x.split('\\')[0]) & pd.to_datetime((x),yearfirst=True) > pd.to_datetime('2004-06-23',yearfirst=True)).value_counts()
but I get the error code: x is not defined
Does anyone know how to combine them?
A lambda body can only be a single expression; refer to the Python docs.
Wrap each part in parentheses so that the whole expression stays inside the lambda and the Python interpreter is happy with it:
lambda x: (x.split('\\')[0]) & (pd.to_datetime(x, yearfirst=True) > pd.to_datetime('2004-06-23', yearfirst=True))
However, the two conditions apply to different columns ('FullName' and 'LastWriteTime'), so named functions suit this better. Create one evaluation function per column, store them in a dict keyed by column name, and pass the dict to aggregate().
def check_date(x):
    return pd.to_datetime(x) > pd.to_datetime('2004-06-23', yearfirst=True)

def files_count(x):
    return x.split('\\')[0]

df1 = pd.DataFrame(data)
func_dict = {'FullName': files_count, 'LastWriteTime': check_date}
df4 = df1.aggregate(func_dict)  # applying multiple function to multiple columns
print(df4, end='\n'*2)
# use df[df['a']] & df[df['b']] for filtering.
df3_filtered = df4[df4['LastWriteTime']]  # filtering.
print(df3_filtered, end='\n'*2)
df2 = df3_filtered['FullName'].value_counts()  # counting.
print(df2)
Result:
   FullName  LastWriteTime
0       dog          False
1    feline          False
2       dog           True
3     mouse           True
4  anteater           True
5       dog           True

   FullName  LastWriteTime
2       dog           True
3     mouse           True
4  anteater           True
5       dog           True

dog         2
anteater    1
mouse       1
Name: FullName, dtype: int64
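For comparison, the same count can be sketched with a boolean mask instead of `aggregate()`. This is only one possible spelling, assuming the paths always use backslashes as separators; the names `top_folder` and `recent` are invented for this example.

```python
import pandas as pd

data = {
    "FullName": [
        r"dog\cat\cow\rover.doc",
        r"feline\cat\cow\digger.doc",
        r"dog\cat\cow\whatamess.doc",
        r"mouse\cat\mouse\jude.doc",
        r"anteater\cat\mouse\sam.doc",
        r"dog\cat\owl\audrey.doc",
    ],
    "LastWriteTime": ["2003-01-02", "2004-01-02", "2005-01-02",
                      "2006-01-02", "2007-01-02", "2008-01-02"],
}
df1 = pd.DataFrame(data)

# First path component of every file, e.g. "dog".
top_folder = df1["FullName"].map(lambda p: p.split("\\")[0])

# True for files modified after the cutoff date.
recent = pd.to_datetime(df1["LastWriteTime"]) > pd.Timestamp("2004-06-23")

# Keep only the recent files, then count per top-level folder.
counts = top_folder[recent].value_counts()
print(counts)
```

Both conditions are built as whole-column Series first and combined by indexing, so there is no need to squeeze two columns into one lambda.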
I use Python and have a dataset of 35,000 rows. I need to change values in a loop, but it takes too much time.
PS: I have columns named succes_1, succes_2, succes_5, succes_7 ... succes_120, so I get the column name in a second loop; the values depend on the other column.
Example:
SK_1 Sk_2 Sk_5 .... SK_120 Succes_1 Succes_2 ... Succes_120
1 0 1 0 1 0 0
1 1 0 1 2 1 1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1+i
Is there a faster way to do this? I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it returns an error; maybe it doesn't accept a string index.
You can define the columns and then use loc to increment. It's not clear whether your columns are naturally ordered; if they aren't, you can use sorted with a custom key function. String-based sorting will cause '100' to come before '20'.
def splitter(x):
    return int(x.rsplit('_', maxsplit=1)[-1])

cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)
df.loc[df[sk_cols] == 1, succ_cols] += 1
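In case the `.loc` line does not accept a boolean DataFrame as a row indexer in your pandas version, here is a minimal sketch that works on the underlying NumPy arrays instead. It assumes the goal is what the loop in the question does: set `Succes_k` to the row position plus one wherever `SK_k` equals 1. The tiny frame and the `ids` list are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the real data (3 rows, ids 1 and 2 only).
df = pd.DataFrame({
    "SK_1": [1, 1, 0],
    "SK_2": [0, 1, 0],
    "Succes_1": [0, 0, 0],
    "Succes_2": [0, 0, 0],
})
ids = [1, 2]
sk_cols = [f"SK_{k}" for k in ids]
succ_cols = [f"Succes_{k}" for k in ids]

# Boolean array of shape (n_rows, n_ids): where is SK_k equal to 1?
mask = df[sk_cols].to_numpy() == 1

# Column vector holding 1 + i for every row position i.
row_values = np.arange(1, len(df) + 1)[:, None]

# Take 1 + i where the mask is True, otherwise keep the existing value.
df[succ_cols] = np.where(mask, row_values, df[succ_cols].to_numpy())
print(df)
```

The column vector broadcasts across all `Succes_` columns at once, so both Python loops disappear.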
I am currently running the following script, which uses fuzzy matching to replace some common words from a list. Dataframe df1 contains my default list of possible values. Dataframe df2 is the main dataframe, where the changes are made after referring to df1. The code is as follows:
import difflib
import numpy as np
import pandas as pd

df1 = pd.DataFrame(['one','two','three','four','five','tsst'])
df2 = pd.DataFrame({'not_shifted':[np.nan,'one','too','three','fours','five','six',np.nan,'test']})
# Replace NaN values with an empty string
df2 = pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))
The problem is the output is a dataframe which contains square brackets. To make matters worse, none of the texts within df2['not_shifted'] are viewable/ recallable:
Out[421]:
  not_shifted
0          []
1       [one]
2       [two]
3     [three]
4      [four]
5      [five]
6          []
7          []
8      [tsst]
Please help.
Use df2.not_shifted.apply(lambda x: x[0] if len(x) != 0 else "") or simply df2.not_shifted.str[0], as solved by @Psidom.
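A small sketch of the difference between the two spellings, using a hand-written column of lists in place of the real `get_close_matches` output (the sample values are made up): `apply` with the explicit length check yields an empty string for an empty list, while `.str[0]` yields a missing value.

```python
import pandas as pd

# Hypothetical stand-in for the lists returned by difflib.get_close_matches.
not_shifted = pd.Series([[], ["one"], ["two"]])

# Explicit check: first match, or "" when there was no match.
first_or_blank = not_shifted.apply(lambda x: x[0] if len(x) != 0 else "")

# .str[0] also works on lists; empty lists become a missing value.
first_or_nan = not_shifted.str[0]

print(first_or_blank.tolist())
print(first_or_nan.tolist())
```

Which one fits depends on whether unmatched rows should stay as empty strings or be treated as missing data later on.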
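def replace_all(eg):
    # characters and tokens to strip from the string form of each cell
    # note: the "u" entry removes every letter u, not only a u'' prefix
    rep = {"[":"",
           "]":"",
           "u":"",
           "}":"",
           "'":"",
           '"':"",
           "frozenset":""}
    for i,j in rep.items():
        eg = eg.replace(i,j)
    return eg

for each in df.columns:
    df[each] = df[each].apply(lambda x : replace_all(str(x)))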