Aggregating cells/column in pandas dataframe - python
I have a dataframe like this:
Index  Z1       Z2       Z3       Z4
0      A(Z1W1)  A(Z2W1)  A(Z3W1)  B(Z4W2)
1      A(Z1W3)  B(Z2W1)  A(Z3W2)  B(Z4W3)
2      B(Z1W1)           A(Z3W4)  B(Z4W4)
3      B(Z1W2)
I want to convert it to
Index  Z1            Z2       Z3                 Z4
0      A(Z1W1,Z1W3)  A(Z2W1)  A(Z3W1,Z3W2,Z3W4)  B(Z4W2,Z4W3,Z4W4)
1      B(Z1W1,Z1W2)  B(Z2W1)
Basically, I want to aggregate the values from different cells into one cell, as shown above.
Edit 1
The actual names are two- or three-word names, not single letters like A and B.
For example, Nut Butter instead of A.
Things are getting interesting :-)
s = df.stack().replace({'[(|)]': ' '}, regex=True).str.strip().str.split(' ', expand=True)
v = ('(' + s.groupby([s.index.get_level_values(1), s[0]])[1].apply(','.join) + ')').unstack().apply(lambda x: x.name + x.astype(str)).T
v[~v.apply(lambda x: x.str.contains('None'))].apply(lambda x: sorted(x, key=pd.isnull)).reset_index(drop=True)
Out[1865]:
Z1 Z2 Z3 Z4
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1) NaN NaN
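For readers new to dense method chains, the same stack-and-regroup idea can be unpacked into smaller steps. This is a sketch, not the answer's exact code: it uses a regex extract instead of the replace/split trick, and the intermediate names (s, result, final) are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Z1': ['A(Z1W1)', 'A(Z1W3)', 'B(Z1W1)', 'B(Z1W2)'],
    'Z2': ['A(Z2W1)', 'B(Z2W1)', np.nan, np.nan],
    'Z3': ['A(Z3W1)', 'A(Z3W2)', 'A(Z3W4)', np.nan],
    'Z4': ['B(Z4W2)', 'B(Z4W3)', 'B(Z4W4)', np.nan]})

# stack the frame into one labelled Series, then pull the name and the
# value out of each "name(value)" cell with a regex
s = df.stack().str.extract(r'(?P<name>[^(]+)\((?P<val>[^)]+)\)')

# join the values for every (original column, name) pair
result = (s.groupby([s.index.get_level_values(1), s['name']])['val']
            .agg(','.join)
            .reset_index())
result.columns = ['col', 'name', 'vals']

# rebuild the "name(v1,v2,...)" strings and number the rows per column
result['cell'] = result['name'] + '(' + result['vals'] + ')'
result['row'] = result.groupby('col').cumcount()

final = result.pivot(index='row', columns='col', values='cell')
```

The final pivot packs the regrouped cells back into the Z1..Z4 layout, with NaN where a column has fewer groups than the others.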
Update
Change
#s=df.stack().replace({'[(|)]':' '},regex=True).str.strip().str.split(' ',expand=True)
to
s=df.stack().str.split('(',expand=True)
s[1]=s[1].replace({'[(|)]':' '},regex=True).str.strip()
General idea:
split the string values
regroup and join the strings
apply to all columns
Update 1
# I had to add the parameter as_index=False to groupby(0)
# to get exactly the same output as asked
Let's try one column:
def str_regroup(s):
    return s.str.extract(r"(\w)\((.+)\)", expand=True).groupby(0, as_index=False).apply(
        lambda x: '{}({})'.format(x.name, ', '.join(x[1])))
str_regroup(df1.Z1)
output
A A(Z1W1, Z1W3)
B B(Z1W1, Z1W2)
then apply to all columns
df.apply(str_regroup)
output
Z1 Z2 Z3 Z4
0 A(Z1W1, Z1W3) A(Z2W1) A(Z3W1, Z3W2, Z3W4) B(Z4W2, Z4W3, Z4W4)
1 B(Z1W1, Z1W2) B(Z2W1)
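As the question's Edit 1 notes, real names can be multi-word (e.g. Nut Butter), which the single-character (\w) pattern above would miss. A sketch of a variant that captures everything before the parenthesis instead (the helper name and sample data are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Z1': ['Nut Butter(Z1W1)', 'Nut Butter(Z1W3)',
                          'B(Z1W1)', 'B(Z1W2)']})

def str_regroup(s):
    # capture everything before the '(' so multi-word names survive
    parts = s.str.extract(r'([^(]+)\((.+)\)').dropna()
    joined = parts.groupby(0, as_index=False).agg({1: ', '.join})
    return joined[0] + '(' + joined[1] + ')'

out = str_regroup(df['Z1'])
```

Note that groupby sorts the names, so B comes before Nut Butter in the output.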
Update 2
Performance on 100 000 sample rows
928 ms for this apply version
1.55 s for the stack() version by @Wen
You could use the following approach:
Melt df to get:
In [194]: melted = pd.melt(df, var_name='col'); melted
Out[194]:
col value
0 Z1 A(Z1W1)
1 Z1 A(Z1W3)
2 Z1 B(Z1W1)
3 Z1 B(Z1W2)
4 Z2 A(Z2W1)
5 Z2 B(Z2W1)
6 Z2
7 Z2
8 Z3 A(Z3W1)
9 Z3 A(Z3W2)
10 Z3 A(Z3W4)
11 Z3
12 Z4 B(Z4W2)
13 Z4 B(Z4W3)
14 Z4 B(Z4W4)
15 Z4
Use regex to extract row and value columns:
In [195]: melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True); melted
Out[195]:
col value row
0 Z1 Z1W1 A
1 Z1 Z1W3 A
2 Z1 Z1W1 B
3 Z1 Z1W2 B
4 Z2 Z2W1 A
5 Z2 Z2W1 B
6 Z2 NaN NaN
7 Z2 NaN NaN
8 Z3 Z3W1 A
9 Z3 Z3W2 A
10 Z3 Z3W4 A
11 Z3 NaN NaN
12 Z4 Z4W2 B
13 Z4 Z4W3 B
14 Z4 Z4W4 B
15 Z4 NaN NaN
Group by col and row and join the values together:
In [185]: result = melted.groupby(['col', 'row'])['value'].agg(','.join)
In [186]: result
Out[186]:
col row
Z1 A Z1W1,Z1W3
B Z1W1,Z1W2
Z2 A Z2W1
B Z2W1
Z3 A Z3W1,Z3W2,Z3W4
Z4 B Z4W2,Z4W3,Z4W4
Name: value, dtype: object
Move the row index level into a column with result = result.reset_index('row'), then prepend the row names to the joined value strings:
In [188]: result['value'] = result['row'] + '(' + result['value'] + ')'
In [189]: result
Out[189]:
row value
col
Z1 A A(Z1W1,Z1W3)
Z1 B B(Z1W1,Z1W2)
Z2 A A(Z2W1)
Z2 B B(Z2W1)
Z3 A A(Z3W1,Z3W2,Z3W4)
Z4 B B(Z4W2,Z4W3,Z4W4)
Overwrite the row column values with groupby/cumcount values to set up the upcoming pivot:
In [191]: result['row'] = result.groupby(level='col').cumcount()
In [192]: result
Out[192]:
row value
col
Z1 0 A(Z1W1,Z1W3)
Z1 1 B(Z1W1,Z1W2)
Z2 0 A(Z2W1)
Z2 1 B(Z2W1)
Z3 0 A(Z3W1,Z3W2,Z3W4)
Z4 0 B(Z4W2,Z4W3,Z4W4)
After moving col back into a column with reset_index(), pivoting produces the desired result:
result = result.reset_index().pivot(index='row', columns='col', values='value')
import pandas as pd

df = pd.DataFrame({
    'Z1': ['A(Z1W1)', 'A(Z1W3)', 'B(Z1W1)', 'B(Z1W2)'],
    'Z2': ['A(Z2W1)', 'B(Z2W1)', '', ''],
    'Z3': ['A(Z3W1)', 'A(Z3W2)', 'A(Z3W4)', ''],
    'Z4': ['B(Z4W2)', 'B(Z4W3)', 'B(Z4W4)', '']}, index=[0, 1, 2, 3])
melted = pd.melt(df, var_name='col').dropna()
melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True)
result = melted.groupby(['col', 'row'])['value'].agg(','.join)
result = result.reset_index('row')
result['value'] = result['row'] + '(' + result['value'] + ')'
result['row'] = result.groupby(level='col').cumcount()
result = result.reset_index()
result = result.pivot(index='row', columns='col', values='value')
print(result)
yields
col Z1 Z2 Z3 Z4
row
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1) NaN NaN
Related
Combining Pandas Dataframes
I am pretty new to Pandas, so please bear with me. I have a df like this one:

DF1
column1  column2(ids)
a        [1,2,13,4,9]
b        [20,14,10,18,17]
c        [6,8,12,16,19]
d        [11,3,15,7,5]

Each number in each list corresponds to the id column in a second dataframe:

DF2
id  value_to_change
1   x1
2   x2
3   x3
4   x4
5   x5
6   x6
7   x7
8   x8
9   x9
.   .
.   .
20  x20

STEP 1
I want to iterate over each list and select the rows in DF2 with the matching ids, AND create 4 dataframes, since I have 4 rows in DF1. How to do this? For instance, for the first row, after applying the logic I would get this back:

id  value_to_change
1   x1
2   x2
13  x13
4   x4
9   x9

The second row would give me:

id  value_to_change
20  x20
14  x14
10  x10
18  x18
17  x17

And so on...

STEP 2
Once I have these 4 dataframes, I pass them as arguments to some logic which returns me 4 dataframes. How could I combine them into a sorted final one?

DF3
id  new_value
1   y1
2   y2
3   y3
.   .
.   .
20  y20

How could I go about this?
It would be much easier and more efficient to use a single dataframe, like so:

Initialization

df1 = pd.DataFrame({'label': ['A', 'B', 'C', 'D'],
                    'ids': [[1,2,13,4,9], [20,14,10,18,17],
                            [6,8,12,16,19], [11,3,15,7,5]]})

# Some custom function for dataframe operations
def my_func(x):
    x['value_to_change'] = x.value_to_change.str.replace('x', 'y')
    return x

Dataframe operations

df1 = df1.explode('ids')
df1['value_to_change'] = df1['ids'].map(dict(zip(df2.ids, df2.val)))
df1['new_value'] = df1.groupby('label').apply(my_func)['value_to_change']

Output

  label  ids  value_to_change  new_value
0 A      1    x1               y1
0 A      2    x2               y2
0 A      13   x13              y13
0 A      4    x4               y4
0 A      9    x9               y9
1 B      20   x20              y20
1 B      14   x14              y14
1 B      10   x10              y10
1 B      18   x18              y18
1 B      17   x17              y17
2 C      6    x6               y6
2 C      8    x8               y8
2 C      12   x12              y12
2 C      16   x16              y16
2 C      19   x19              y19
3 D      11   x11              y11
3 D      3    x3               y3
3 D      15   x15              y15
3 D      7    x7               y7
3 D      5    x5               y5
This code will help with the first part of the problem:

import pandas as pd

df1 = pd.DataFrame([[[1,2,4,5]], [[3,4,1]]], columns=["column2(ids)"])
df2 = pd.DataFrame([[1,"x1"], [2,"x2"], [3,"x3"], [4,"x4"], [5,"x5"]],
                   columns=["id", "value_to_change"])
df3 = pd.DataFrame(columns=["id", "value_to_change"])

for row in df1.iterrows():
    s = row[1][0]
    for item in s:
        val = df2.loc[df2['id'] == item, 'value_to_change'].item()
        df_temp = pd.DataFrame([[item, val]], columns=["id", "value_to_change"])
        # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
        df3 = df3.append(df_temp, ignore_index=True)
df3

Note: in the line s = row[1][0], you need to choose the index according to your dataframe; in my case it was [1][0].

For the second part you can use pd.concat: Documentation
For sorting, use df.sort_values: Documentation
Use .loc and .isin to get a new dataframe with the required rows from df2
Do your logic on these 4 dataframes
Combine the resulting 4 dataframes using pandas.concat()
Sort the dataframe by ids using .sort_values()

Code:

import pandas as pd

df1 = pd.DataFrame({'column1': ['A', 'B', 'C', 'D'],
                    'ids': [[1,2,13,4,9], [20,14,10,18,17],
                            [6,8,12,16,19], [11,3,15,7,5]]})
df2 = pd.DataFrame({'ids': list(range(1, 21)),
                    'val': [f'x{x}' for x in range(1, 21)]})

df_list = []
for id_list in df1['ids'].values:
    df_list.append(df2.loc[df2['ids'].isin(id_list)])

# do logic on each DF in df_list
# assuming df_list now contains the resulting dataframes
df3 = pd.concat(df_list)
df3 = df3.sort_values('ids')
First things first, this code should do what you want:

import pandas as pd

idxs = [
    [0, 2],
    [1, 3],
]
df_idxs = pd.DataFrame({'idxs': idxs})
df = pd.DataFrame({'data': ['a', 'b', 'c', 'd']})

frames = []
for _, idx in df_idxs.iterrows():
    rows = idx['idxs']
    frame = df.loc[rows]
    # some logic
    print(frame)
    # collect
    frames.append(frame)
pd.concat(frames)

Note that pandas automatically creates a range index if none is passed. If you want to select on a different column, set that one as the index, or use df.loc[df.data.isin(rows)]. The pandas doc on split-apply-combine may also interest you: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
Using key while sorting values for just one column
Let's say we have a df like below:

df = pd.DataFrame({'A': ['y2','x3','z1','z1'], 'B': ['y2','x3','a2','z1']})

   A   B
0  y2  y2
1  x3  x3
2  z1  a2
3  z1  z1

If we wanted to sort the values on just the numbers in column A, we can do:

df.sort_values(by='A', key=lambda x: x.str[1])

   A   B
3  z1  z1
2  z1  a2
0  y2  y2
1  x3  x3

If we wanted to sort by both columns A and B, but have the key only apply to column A, is there a way to do that?

df.sort_values(by=['A','B'], key=lambda x: x.str[1])

Expected output:

   A   B
2  z1  a2
3  z1  z1
0  y2  y2
1  x3  x3
You can sort by B first, then sort by A with a stable sorting method:

(df.sort_values('B')
   .sort_values('A', key=lambda x: x.str[1], kind='mergesort')
)

Output:

   A   B
2  z1  a2
3  z1  z1
0  y2  y2
1  x3  x3
Pandas Count Group Number
Given the following dataframe:

df = pd.DataFrame({'col1': ['A','A','A','A','A','A','B','B','B','B','B','B'],
                   'col2': ['x','x','y','z','y','y','x','y','y','z','z','x'],
                   })
df

    col1  col2
0   A     x
1   A     x
2   A     y
3   A     z
4   A     y
5   A     y
6   B     x
7   B     y
8   B     y
9   B     z
10  B     z
11  B     x

I'd like to create a new column, col3, which classifies the values in col2 sequentially, grouped by the values in col1:

    col1  col2  col3
0   A     x     x1
1   A     x     x1
2   A     y     y1
3   A     z     z1
4   A     y     y2
5   A     y     y2
6   B     x     x1
7   B     y     y1
8   B     y     y1
9   B     z     z1
10  B     z     z1
11  B     x     x2

In the above example, col3[0:1] has a value of x1 because it's the first group of x values in col2 for col1 = A. col3[4:5] has values of y2 because it's the second group of y values in col2 for col1 = A, etc. I hope the description makes sense; I was unable to find an answer, partially because I can't find an elegant way to articulate what I'm looking for.
Here's my approach:

groups = (df.assign(s=df.groupby('col1')['col2']  # group col2 by col1
                      .shift().ne(df['col2'])     # check if col2 differs from the previous (shift)
                      .astype(int)                # convert to int
           )  # the new column s marks the beginning of consecutive blocks with 1
           .groupby(['col1', 'col2'])['s']        # group s by col1 and col2
           .cumsum()                              # cumsum by group
           .astype(str)
         )
df['col3'] = df['col2'] + groups

Output:

    col1  col2  col3
0   A     x     x1
1   A     x     x1
2   A     y     y1
3   A     z     z1
4   A     y     y2
5   A     y     y2
6   B     x     x1
7   B     y     y1
8   B     y     y1
9   B     z     z1
10  B     z     z1
11  B     x     x2
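The same approach as a self-contained, step-by-step sketch (intermediate names block_start and group_no are made up for readability):

```python
import pandas as pd

df = pd.DataFrame({'col1': list('AAAAAABBBBBB'),
                   'col2': ['x', 'x', 'y', 'z', 'y', 'y',
                            'x', 'y', 'y', 'z', 'z', 'x']})

# a 1 marks the start of each consecutive block of equal col2 values
# within a col1 group (the first row of a group always starts a block)
block_start = (df.groupby('col1')['col2'].shift()
                 .ne(df['col2'])
                 .astype(int))

# counting the block starts per (col1, col2) numbers the blocks sequentially
group_no = (df.assign(s=block_start)
              .groupby(['col1', 'col2'])['s']
              .cumsum()
              .astype(str))

df['col3'] = df['col2'] + group_no
```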
Pandas adding calculated vectors into df
My goal is to add formula-based vectors to my following df:

Day  Name  a  b  1  2  x1  x2
1    ijk   1  2  3  3  0   1
2    mno   2  1  1  3  1   1

Outcome:

Day  Name  a  b  1  2  x1  x2  y1       y2       z1               z2
1    ijk   1  2  3  3  0   1   (1*2)+3  (1*2)+3  (1+2)*(3*1+0*1)  (1+2)*(3*2+1*2)
2    mno   2  1  1  3  1   1   (2*1)+1  (2*1)+3  (2+1)*(1*1+1*1)  (2+1)*(3*2+1*2)

This is my tedious approach:

df[y1] = df[a]*df[b]+df[1]  # y1 = a*b + value of column 1
df[y2] = df[a]*df[b]+df[2]  # y2 = a*b + value of column 2

If column 3 and x3 were added in, then y3 = a*b + value of column 3; if column 4 and x4 were added in, then y4 = a*b + value of column 4, and so on...

df[z1] = (df[a]+df[b])*(df[1]*1+df[x1]*1)
# The "1" here is from the column names 1 and x1:
# z1 = (a+b)*[(value of column 1)*1 + (value of column x1)*1]
df[z2] = (df[a]+df[b])*(df[2]*2+df[x2]*2)
# The "2" here is from the column names 2 and x2:
# z2 = (a+b)*[(value of column 2)*2 + (value of column x2)*2]

If column 3 and x3 were added in, then z3 = (a+b)*[(value of column 3)*3 + (value of column x3)*3], and so on.

This works fine; however, it will get tedious if more columns are added in, e.g. "3 4, ... x3 x4, ...". I'm wondering if there's a better approach, maybe using a loop? Many thanks :)
This is one way:

import pandas as pd

df = pd.DataFrame([[1, 'ijk', 1, 2, 3, 3, 2, 0, 1],
                   [2, 'mno', 2, 1, 1, 3, 1, 1, 1]],
                  columns=['Day', 'Name', 'a', 'b', 1, 2, 3, 'x1', 'x2'])

for i in range(1, 4):
    df['y' + str(i)] = df['a'] * df['b'] + df[i]

# output
# Day  Name  a  b  1  2  3  x1  x2  y1  y2  y3
# 1    ijk   1  2  3  3  2  0   1   5   5   4
# 2    mno   2  1  1  3  1  1   1   3   5   3
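The z columns from the question follow the same looping pattern, since the multiplier is just the column number itself. A sketch using the question's original columns (without the extra column 3 the answer added for demonstration):

```python
import pandas as pd

df = pd.DataFrame([[1, 'ijk', 1, 2, 3, 3, 0, 1],
                   [2, 'mno', 2, 1, 1, 3, 1, 1]],
                  columns=['Day', 'Name', 'a', 'b', 1, 2, 'x1', 'x2'])

for i in (1, 2):
    # y_i = a*b + value of column i
    df['y' + str(i)] = df['a'] * df['b'] + df[i]
    # z_i = (a+b) * (value of column i * i + value of column x_i * i)
    df['z' + str(i)] = (df['a'] + df['b']) * (df[i] * i + df['x' + str(i)] * i)
```

Extending to columns 3, x3, 4, x4, ... only requires widening the range.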
Pandas randomly select n groups from a larger dataset
If I have a dataframe with groups like so:

val  label
x    A
x    A
x    B
x    B
x    C
x    C
x    D
x    D

how can I randomly pick out n groups without replacement?
You can use np.random.choice with loc:

N = 3
vals = np.random.choice(df['label'].unique(), N, replace=False)
print(vals)
['C' 'A' 'B']

df = df.set_index('label').loc[vals].reset_index()
print(df)

  label  val
0  C     x5
1  C     x6
2  A     x1
3  A     x2
4  B     x3
5  B     x4
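A variant sketch that keeps the original row order by filtering with isin instead of reindexing; the seeded np.random.default_rng generator and the sample data are assumptions for reproducibility, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8'],
                   'label': list('AABBCCDD')})

# pick 3 distinct labels without replacement, then filter with a boolean
# mask, which preserves the dataframe's original row order
rng = np.random.default_rng(42)
picked = rng.choice(df['label'].unique(), size=3, replace=False)
out = df[df['label'].isin(picked)]
```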