I am using pandas version 0.23.0. I want to use the data frame groupby function to generate new aggregated columns using lambda functions.
My data frame looks like
ID Flag Amount User
1 1 100 123345
1 1 55 123346
2 0 20 123346
2 0 30 123347
3 0 50 123348
I want to generate a table which looks like
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM Flag0_User_Count Flag1_User_Count
1 0 2 0 155 0 2
2 2 0 50 0 2 0
3 1 0 50 0 1 0
here:
Flag0_Count is count of Flag = 0
Flag1_Count is count of Flag = 1
Flag0_Amount_SUM is SUM of Amount when Flag = 0
Flag1_Amount_SUM is SUM of Amount when Flag = 1
Flag0_User_Count is Count of Distinct User when Flag = 0
Flag1_User_Count is Count of Distinct User when Flag = 1
I have tried something like
df.groupby(["ID"])["Flag"].apply(lambda x: sum(x==0)).reset_index()
but it creates a new data frame. This means I will have to do this for all columns and then merge them together into a new data frame.
Is there an easier way to accomplish this?
Use DataFrameGroupBy.agg with a dictionary mapping column names to aggregation functions, then reshape with unstack, flatten the MultiIndex in the columns, rename the columns, and finally reset_index:
df = (df.groupby(["ID", "Flag"])
        .agg({'Flag':'size', 'Amount':'sum', 'User':'nunique'})
        .unstack(fill_value=0))
# Python 3.6+
df.columns = [f'{i}{j}' for i, j in df.columns]
# Python < 3.6
# df.columns = ['{}{}'.format(i, j) for i, j in df.columns]
d = {'Flag0':'Flag0_Count',
'Flag1':'Flag1_Count',
'Amount0':'Flag0_Amount_SUM',
'Amount1':'Flag1_Amount_SUM',
'User0':'Flag0_User_Count',
'User1':'Flag1_User_Count',
}
df = df.rename(columns=d).reset_index()
print (df)
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM \
0 1 0 2 0 155
1 2 2 0 50 0
2 3 1 0 50 0
Flag0_User_Count Flag1_User_Count
0 0 2
1 2 0
2 1 0
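For reference, a minimal end-to-end sketch of the recipe above, assuming the sample frame is built exactly from the table in the question:

import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3],
                   'Flag': [1, 1, 0, 0, 0],
                   'Amount': [100, 55, 20, 30, 50],
                   'User': [123345, 123346, 123346, 123347, 123348]})

out = (df.groupby(["ID", "Flag"])
         .agg({'Flag': 'size', 'Amount': 'sum', 'User': 'nunique'})
         .unstack(fill_value=0))
out.columns = [f'{i}{j}' for i, j in out.columns]
out = out.rename(columns=d).reset_index()  # d is the rename dictionary defined above
print(out)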
Related
I have 6 dataframes with the same column names.
The column names are:
"session_id", "player_id", "gersey_color","timestamp"
data in each frame looks like:
session_id  player_id  gersey_color  timestamp
123xyz      yellow     9             1347.85
I want to combine these dataframes into a single dataframe where the format would be like:
session_id  player_info  df1  df2  df3  df4  df5  df6  total_occurance  timestamp
123xyz      yellow9      0    1    0    3    3    0    7                1347.85
            green2       0    1    1    0    2    0    4
            blue5        1    1    1    1    1    1    6
523pqr      yellow1      2    1    0    0    0    0    3                747.45
            white2       0    1    0    0    0         1
205abd      green1       0    1    0    0    3    0    4                57.61
111mnz      yellow10     1    0    0    0    0    0    1                1821.21
            black2       0    1    0    1    1    0    3
Here I am using the timestamp as the unique identifier, and I want to get the frequency of each timestamp across all the dataframes, categorised by session_id and by player_id and gersey_color combined.
My current code can collect all the information, but it cannot produce the format I want:
for i, combo_row in combo_df.iterrows():
    value_in_combo = combo_row['timestamp']
    count = 0
    player_info = []
    session_id = []
    for id, df_path in enumerate(df_list):
        rule_df = pd.read_excel(df_path)
        sub_counter = 0
        for idx, entry in rule_df.iterrows():
            idr = list(rule_df.columns).index('timestamp')
            value = entry[idr]
            s_id = entry[list(rule_df.columns).index('session.id')]
            player_team = entry[list(rule_df.columns).index('gersey_color')]
            player_num = entry[list(rule_df.columns).index('player_id')]
            if value == value_in_combo:
                sub_counter += 1
                session_id.append(s_id)
                player_info.append(str(player_team) + str(player_num))
        combo_df.at[i, f'df{id+1}'] = ','.join(list(set(player_info)))
        combo_df.at[i, 'session_id'] = ','.join(list(set(session_id)))
        count += sub_counter
    combo_df.at[i, 'occurrence_across_rules'] = count
Here combo_df is the df I predefined to populate all the data.
current combo_df looks like:
session_id  player_info     df1       df2             df3  df4  df5  df6  timestamp  total_occurance
123xyz      yellow1                                                       623.15     1
423pqz      green1,yellow5  yellow55  green1,yellow5                      1347.85    5
...
But as I said my code does not generate the format I want.
Can anyone suggest how to do it?
Edit:
I solved the problem using:
combo_df.set_index(['session_id', 'player_team', 'player_num'], inplace=True)
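For what it's worth, here is a vectorized sketch of the same idea without the nested loops, assuming the six frames are already loaded into a list dfs and that player_info is gersey_color concatenated with player_id, as in the loop above:

import pandas as pd

# dfs is assumed to be the list of six dataframes, e.g. [pd.read_excel(p) for p in df_list]
combined = pd.concat(dfs, keys=[f'df{i+1}' for i in range(len(dfs))])
combined = combined.reset_index(level=0).rename(columns={'level_0': 'source'})

# Build player_info the same way as the loop: gersey_color followed by player_id
combined['player_info'] = combined['gersey_color'].astype(str) + combined['player_id'].astype(str)

# One row per (session_id, player_info, timestamp), one count column per source dataframe
wide = (combined.groupby(['session_id', 'player_info', 'timestamp', 'source'])
                .size()
                .unstack('source', fill_value=0))
wide['total_occurance'] = wide.sum(axis=1)
wide = wide.reset_index()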
I have a pandas data frame in python coming from a pd.concat with a recurring multiindex:
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
0 0 8880873
1 1000521
1 0 1135488
1 5388773
Now, I want to reset only the first level of the MultiIndex, so that I get a consecutive numbering on the outer level. Something like this:
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
2 0 8880873
1 1000521
3 0 1135488
1 5388773
I have around 5 million records and not the biggest machine, so I'm looking for a memory-efficient solution.
ignore_index=True in pd.concat does not work, because then I lose the MultiIndex.
Many thanks
You can get the first level with get_level_values and convert it to a Series with to_series, compare it with its shifted values, take the cumsum to build the counter, and finally rebuild the index with MultiIndex.from_arrays:
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
mux = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)], names=df.index.names)
df.index = mux
Or:
df = df.set_index(mux)
print (df)
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
2 0 8880873
1 1000521
3 0 1135488
1 5388773
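If memory is a concern with the ~5 million rows mentioned in the question, an equivalent sketch that stays in NumPy (same two-level index as above) is:

import numpy as np
import pandas as pd

# New outer level: increment whenever the old outer level changes value
lvl0 = df.index.get_level_values(0).values
new0 = np.r_[0, np.cumsum(lvl0[1:] != lvl0[:-1])]
df.index = pd.MultiIndex.from_arrays([new0, df.index.get_level_values(1)],
                                     names=df.index.names)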
I have a pandas dataframe with several columns. The bulk of the column names can be generated in a loop, so I have made a list of the column names like this:
ycols = ['{}_{}d pred'.format(ticker, i) for i in range(hm_days)]
Now I want to make a new pandas dataframe with only these columns, keeping the index of the parent dataframe. How do I do this?
OK, so you want to create a new dataframe with new column names, with the existing index of the original dataframe.
For some dataframe:
old_df = pd.DataFrame({'x':[0,1,2,3],'y':[10,9,8,7]})
>>>
x y
0 0 10
1 1 9
2 2 8
3 3 7
columns = list(old_df)
>>>
['x', 'y']
You can specify your new columns by doing:
y_cols = ['x_pred','y_pred']
>>> ['x_pred','y_pred']
Here, y_cols is the list of your new column names. In your code, you would replace this step with ycols = ['{}_{}d pred'.format(ticker, i) for i in range(hm_days)].
To get the new columns, you create new columns with a placeholder variable (in this case 0, as it looks like you are using numeric data), with the same index as your old dataframe:
# Iterate over all column names in y_cols
for i in y_cols:
    old_df[i] = 0
>>> old_df:
x y x_pred y_pred
0 0 10 0 0
1 1 9 0 0
2 2 8 0 0
3 3 7 0 0
Finally, slice your dataframe to get your new dataframe with new column names, maintaining the index of the old dataframe.
df_new = old_df[y_cols]
>>>
x_pred y_pred
0 0 0
1 0 0
2 0 0
3 0 0
This works even if you have a named index:
x y x_pred y_pred
Date
0 0 10 0 0
1 1 9 0 0
2 2 8 0 0
3 3 7 0 0
df_new = old_df[y_cols]
x_pred y_pred
Date
0 0 0
1 0 0
2 0 0
3 0 0
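As a side note, if you only need the new placeholder frame and do not want to add the extra columns to old_df first, a one-step sketch with the same names as above is:

import pandas as pd

# New frame filled with the 0 placeholder, reusing the old index and the new column names
df_new = pd.DataFrame(0, index=old_df.index, columns=y_cols)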
My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to know the maximum value for each row and append that info as a new column. I would also like to add another column to the existing dataframe containing the name of the column where that max value can be found.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis = 1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
a b c maxval
0 1 0 0 1
1 0 0 0 0
2 0 1 0 1
3 1 0 0 1
4 3 1 0 3
So far so good. Now for the "add another column containing the name of the column" part:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0 to get a boolean mask, then use apply with a lambda to join the names of the columns where the mask is True:
In [183]:
df['maxcol'] = (df.loc[:, :'c'].eq(df['maxval'], axis=0)
                  .apply(lambda x: ','.join(df.columns[:3][x == x.max()]), axis=1))
df
Out[183]:
a b c maxval maxcol
0 1 0 0 1 a
1 0 0 0 0 a,b,c
2 0 1 0 1 b
3 1 0 0 1 a
4 3 1 0 3 a
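A variant without apply relies on the trick that a boolean mask dotted with string column names concatenates the matching names; a sketch for the same frame, giving the same output as above:

# Columns that hold the row maximum, joined by commas
mask = df[['a', 'b', 'c']].eq(df['maxval'], axis=0)
df['maxcol'] = mask.dot(df.columns[:3] + ',').str.rstrip(',')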
I have a DataFrame where a combination of column values identify a unique address (A,B,C). I would like to identify all such rows and assign them a unique identifier that I increment per address.
For example
A B C D E
0 1 1 0 1
0 1 2 0 1
0 1 1 1 1
0 1 3 0 1
0 1 2 1 0
0 1 1 2 1
I would like to generate the following
A B C D E ID
0 1 1 0 1 0
0 1 2 0 1 1
0 1 1 1 1 0
0 1 3 0 1 2
0 1 2 1 0 1
0 1 1 2 1 0
I tried the following:
id = 0
def set_id(df):
    global id
    df['ID'] = id
    id += 1

df.groupby(['A','B','C']).transform(set_id)
This returns a NULL dataframe... this is definitely not the way to do it; I am new to pandas. The above should probably use df[['A','B','C']].drop_duplicates() to get all the unique values.
Thank you.
I think this is what you need:
df2 = df[['A','B','C']].drop_duplicates()  # get unique values of ABC
df2 = df2.reset_index(drop=True).reset_index()  # reset index to create a column named index
df2 = df2.rename(columns={'index': 'ID'})  # rename index to ID
df = pd.merge(df, df2, on=['A','B','C'], how='left')  # append ID column with merge
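To check this against the question's sample data, the frame can be reconstructed as follows:

import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 0],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': [1, 2, 1, 3, 2, 1],
                   'D': [0, 0, 1, 0, 1, 2],
                   'E': [1, 1, 1, 1, 0, 1]})

Running the four lines above on this frame yields the ID column 0, 1, 0, 2, 1, 0, matching the expected output, because drop_duplicates keeps the first occurrence of each (A, B, C) combination in order of appearance.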
# Create tuple triplet using values from columns A, B & C.
df['key'] = [triplet for triplet in zip(*[df[col].values.tolist() for col in ['A', 'B', 'C']])]
# Sort dataframe on new `key` column.
df.sort_values('key', inplace=True)
# Keep a running total of changes in the sorted key value to derive the ID.
df['ID'] = (df['key'] != df['key'].shift()).cumsum() - 1
# Clean up.
del df['key']
df.sort_index(inplace=True)
>>> df
A B C D E ID
0 0 1 1 0 1 0
1 0 1 2 0 1 1
2 0 1 1 1 1 0
3 0 1 3 0 1 2
4 0 1 2 1 0 1
5 0 1 1 2 1 0
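As a side note, on pandas 0.20.2 and later, GroupBy.ngroup numbers the groups directly, in order of first appearance when sort=False, so the whole thing reduces to a one-line sketch:

df['ID'] = df.groupby(['A', 'B', 'C'], sort=False).ngroup()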