Aggregating cells/columns in a pandas dataframe - python

I have a dataframe that is like this
Index Z1       Z2       Z3       Z4
0     A(Z1W1)  A(Z2W1)  A(Z3W1)  B(Z4W2)
1     A(Z1W3)  B(Z2W1)  A(Z3W2)  B(Z4W3)
2     B(Z1W1)           A(Z3W4)  B(Z4W4)
3     B(Z1W2)
I want to convert it to
Index Z1            Z2       Z3                 Z4
0     A(Z1W1,Z1W3)  A(Z2W1)  A(Z3W1,Z3W2,Z3W4)  B(Z4W2,Z4W3,Z4W4)
1     B(Z1W1,Z1W2)  B(Z2W1)
Basically I want to aggregate the values of different cells into one cell, as shown above.
Edit 1
The actual names are either two- or three-word names, not single letters like A and B. For example, Nut Butter instead of A.

Things are getting interesting :-)
s=df.stack().replace({'[(|)]':' '},regex=True).str.strip().str.split(' ',expand=True)
v=('('+s.groupby([s.index.get_level_values(1),s[0]])[1].apply(','.join)+')').unstack().apply(lambda x : x.name+x.astype(str)).T
v[~v.apply(lambda x : x.str.contains('None'))].apply(lambda x : sorted(x,key=pd.isnull)).reset_index(drop=True)
Out[1865]:
             Z1       Z2                 Z3                 Z4
0  A(Z1W1,Z1W3)  A(Z2W1)  A(Z3W1,Z3W2,Z3W4)  B(Z4W2,Z4W3,Z4W4)
1  B(Z1W1,Z1W2)  B(Z2W1)                NaN                NaN
Update
Change
#s=df.stack().replace({'[(|)]':' '},regex=True).str.strip().str.split(' ',expand=True)
to
s=df.stack().str.split('(',expand=True)
s[1]=s[1].replace({'[(|)]':' '},regex=True).str.strip()
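Putting the update together with the rest of the pipeline gives the following sketch (the 'None' filter is kept from the original answer and assumes missing cells stringify that way on the pandas version used):
s = df.stack().str.split('(', expand=True)
s[1] = s[1].replace({'[(|)]': ' '}, regex=True).str.strip()
v = ('(' + s.groupby([s.index.get_level_values(1), s[0]])[1].apply(','.join) + ')').unstack()
v = v.apply(lambda x: x.name + x.astype(str)).T
v[~v.apply(lambda x: x.str.contains('None'))].apply(lambda x: sorted(x, key=pd.isnull)).reset_index(drop=True)
Splitting on the first '(' keeps multi-word names such as Nut Butter intact in the group key.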

General idea:
split the string values
regroup and join the strings
apply to all columns
Update 1
# I had to add the parameter as_index=False to groupby(0)
# to get exactly the same output as asked
Let's try it on one column:
def str_regroup(s):
    return s.str.extract(r"(\w)\((.+)\)", expand=True).groupby(0, as_index=False).apply(
        lambda x: '{}({})'.format(x.name, ', '.join(x[1])))
str_regroup(df.Z1)
output
A A(Z1W1, Z1W3)
B B(Z1W1, Z1W2)
then apply to all columns
df.apply(str_regroup)
output
              Z1       Z2                   Z3                   Z4
0  A(Z1W1, Z1W3)  A(Z2W1)  A(Z3W1, Z3W2, Z3W4)  B(Z4W2, Z4W3, Z4W4)
1  B(Z1W1, Z1W2)  B(Z2W1)
Update 2
Performance on 100,000 sample rows:
928 ms for this apply version
1.55 s for the stack() version by @Wen
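For context, a hypothetical harness for reproducing such a timing (the sample data and the %timeit call are assumptions, not the original benchmark):
import pandas as pd
n = 100_000
sample = pd.DataFrame({col: [f'A({col}W{i % 5})' for i in range(n)]
                       for col in ['Z1', 'Z2', 'Z3', 'Z4']})
# in IPython: %timeit sample.apply(str_regroup)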

You could use the following approach:
Melt df to get:
In [194]: melted = pd.melt(df, var_name='col'); melted
Out[194]:
col value
0 Z1 A(Z1W1)
1 Z1 A(Z1W3)
2 Z1 B(Z1W1)
3 Z1 B(Z1W2)
4 Z2 A(Z2W1)
5 Z2 B(Z2W1)
6 Z2
7 Z2
8 Z3 A(Z3W1)
9 Z3 A(Z3W2)
10 Z3 A(Z3W4)
11 Z3
12 Z4 B(Z4W2)
13 Z4 B(Z4W3)
14 Z4 B(Z4W4)
15 Z4
Use regex to extract row and value columns:
In [195]: melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True); melted
Out[195]:
col value row
0 Z1 Z1W1 A
1 Z1 Z1W3 A
2 Z1 Z1W1 B
3 Z1 Z1W2 B
4 Z2 Z2W1 A
5 Z2 Z2W1 B
6 Z2 NaN NaN
7 Z2 NaN NaN
8 Z3 Z3W1 A
9 Z3 Z3W2 A
10 Z3 Z3W4 A
11 Z3 NaN NaN
12 Z4 Z4W2 B
13 Z4 Z4W3 B
14 Z4 Z4W4 B
15 Z4 NaN NaN
Group by col and row and join the values together:
In [185]: result = melted.groupby(['col', 'row'])['value'].agg(','.join)
In [186]: result
Out[186]:
col row
Z1 A Z1W1,Z1W3
B Z1W1,Z1W2
Z2 A Z2W1
B Z2W1
Z3 A Z3W1,Z3W2,Z3W4
Z4 B Z4W2,Z4W3,Z4W4
Name: value, dtype: object
Move row out of the index with reset_index, then prepend the row letters to the value strings:
In [187]: result = result.reset_index('row')
In [188]: result['value'] = result['row'] + '(' + result['value'] + ')'
In [189]: result
Out[189]:
row value
col
Z1 A A(Z1W1,Z1W3)
Z1 B B(Z1W1,Z1W2)
Z2 A A(Z2W1)
Z2 B B(Z2W1)
Z3 A A(Z3W1,Z3W2,Z3W4)
Z4 B B(Z4W2,Z4W3,Z4W4)
Overwrite the row column values with groupby/cumcount values to set up the upcoming pivot:
In [191]: result['row'] = result.groupby(level='col').cumcount()
In [192]: result
Out[192]:
row value
col
Z1 0 A(Z1W1,Z1W3)
Z1 1 B(Z1W1,Z1W2)
Z2 0 A(Z2W1)
Z2 1 B(Z2W1)
Z3 0 A(Z3W1,Z3W2,Z3W4)
Z4 0 B(Z4W2,Z4W3,Z4W4)
Pivoting produces the desired result:
result = result.pivot(index='row', columns='col', values='value')
Putting it all together:
import pandas as pd

df = pd.DataFrame({
    'Z1': ['A(Z1W1)', 'A(Z1W3)', 'B(Z1W1)', 'B(Z1W2)'],
    'Z2': ['A(Z2W1)', 'B(Z2W1)', '', ''],
    'Z3': ['A(Z3W1)', 'A(Z3W2)', 'A(Z3W4)', ''],
    'Z4': ['B(Z4W2)', 'B(Z4W3)', 'B(Z4W4)', '']}, index=[0, 1, 2, 3])
melted = pd.melt(df, var_name='col').dropna()
melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True)
result = melted.groupby(['col', 'row'])['value'].agg(','.join)
result = result.reset_index('row')
result['value'] = result['row'] + '(' + result['value'] + ')'
result['row'] = result.groupby(level='col').cumcount()
result = result.reset_index()
result = result.pivot(index='row', columns='col', values='value')
print(result)
yields
col Z1 Z2 Z3 Z4
row
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1) NaN NaN

Combining Pandas Dataframes

I am pretty new to Pandas, so please bear with me. I have a df like this one:
DF1
column1  column2(ids)
a        [1,2,13,4,9]
b        [20,14,10,18,17]
c        [6,8,12,16,19]
d        [11,3,15,7,5]
Each number in each list corresponds to the column id in a second dataframe.
DF2
id value_to_change
1 x1
2 x2
3 x3
4 x4
5 x5
6 x6
7 x7
8 x8
9 x9
. .
. .
. .
20 x20
STEP 1
I want to iterate over each list and select the rows in DF2 with the matching ids, AND create 4 dataframes, since I have 4 rows in DF1.
How can I do this?
So for instance, for the first row, after applying the logic I would get this back:
id value_to_change
1  x1
2  x2
13 x13
4  x4
9  x9
The second row would give me
id value_to_change
20 x20
14 x14
10 x10
18 x18
17 x17
And so on...
STEP 2
Once I have these 4 dataframes, I pass them as arguments to a logic which returns me 4 dataframes.
2) How could I combine them into a sorted final one?
DF3
id new_value
1 y1
2 y2
3 y3
4 y4
5 y5
6 y6
7 y7
8 y8
9 y9
. .
. .
. .
20 y20
How could I go about this?
It would be much easier and more efficient to use a single dataframe, like so:
Initialization
df1 = pd.DataFrame({'label': ['A', 'B', 'C', 'D'],
                    'ids': [[1,2,13,4,9], [20,14,10,18,17],
                            [6,8,12,16,19], [11,3,15,7,5]]})

# Some custom function for dataframe operations
def my_func(x):
    x['value_to_change'] = x.value_to_change.str.replace('x', 'y')
    return x
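This assumes a df2 with columns ids and val built from the question's id/value_to_change table; a minimal sketch:
df2 = pd.DataFrame({'ids': list(range(1, 21)), 'val': [f'x{x}' for x in range(1, 21)]})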
Dataframe Operations
df1 = df1.explode('ids')
df1['value_to_change'] = df1['ids'].map(dict(zip(df2.ids, df2.val)))
df1['new_value'] = df1.groupby('label').apply(my_func)['value_to_change']
Output
label ids value_to_change new_value
0 A 1 x1 y1
0 A 2 x2 y2
0 A 13 x13 y13
0 A 4 x4 y4
0 A 9 x9 y9
1 B 20 x20 y20
1 B 14 x14 y14
1 B 10 x10 y10
1 B 18 x18 y18
1 B 17 x17 y17
2 C 6 x6 y6
2 C 8 x8 y8
2 C 12 x12 y12
2 C 16 x16 y16
2 C 19 x19 y19
3 D 11 x11 y11
3 D 3 x3 y3
3 D 15 x15 y15
3 D 7 x7 y7
3 D 5 x5 y5
This code will help with the first part of the problem.
import pandas as pd

df1 = pd.DataFrame([[[1,2,4,5]], [[3,4,1]]], columns=["column2(ids)"])
df2 = pd.DataFrame([[1,"x1"],[2,"x2"],[3,"x3"],[4,"x4"],[5,"x5"]], columns=["id", "value_to_change"])
df3 = pd.DataFrame(columns=["id", "value_to_change"])
for row in df1.iterrows():
    s = row[1][0]
    for item in s:
        val = df2.loc[df2['id'] == item, 'value_to_change'].item()
        df_temp = pd.DataFrame([[item, val]], columns=["id", "value_to_change"])
        # note: DataFrame.append was removed in pandas 2.0; on newer versions,
        # collect the pieces in a list and call pd.concat once instead
        df3 = df3.append(df_temp, ignore_index=True)
df3
Note that in the line s = row[1][0] you need to choose the index according to your dataframe; in my case it was [1][0].
For the second part you can use pd.concat, and for sorting, df.sort_values (see the pandas documentation for both).
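A tiny sketch of that combine-and-sort step, assuming the four processed dataframes sit in a hypothetical list called results:
df3 = pd.concat(results).sort_values('id').reset_index(drop=True)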
Use .loc and .isin to get a new dataframe with the required rows from df2
Do your logic on these 4 dataframes
Combine the resulting 4 dataframes using pandas.concat()
Sort the dataframe by ids using .sort_values()
Code:
import pandas as pd
df1 = pd.DataFrame({'column1 ': ['A', 'B', 'C', 'D'], 'ids': [[1,2,13,4,9], [20,14,10,18,17], [6,8,12,16,19],[11,3,15,7,5]]})
df2 = pd.DataFrame({'ids': list(range(1,21)), 'val': [f'x{x}' for x in range(1,21)]})
df_list = []
for id_list in df1['ids'].values:
    df_list.append(df2.loc[df2['ids'].isin(id_list)])
# do logic on each DF in df_list
# assuming df_list now contains the resulting dataframes
df3 = pd.concat(df_list)
df3 = df3.sort_values('ids')
First things first, this code should do what you want.
import pandas as pd

idxs = [
    [0, 2],
    [1, 3],
]
df_idxs = pd.DataFrame({'idxs': idxs})
df = pd.DataFrame(
    {'data': ['a', 'b', 'c', 'd']}
)
frames = []
for _, idx in df_idxs.iterrows():
    rows = idx['idxs']
    frame = df.loc[rows]
    # some logic
    print(frame)
    # collect
    frames.append(frame)
pd.concat(frames)
Note that pandas automatically creates a range index if none is passed. If you want to select on a different column, set that one as the index, or use df.loc[df.data.isin(rows)].
The pandas doc on split-apply-combine may also interest you: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
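For illustration, a minimal split-apply-combine sketch in the same spirit (the group column and the identity logic are hypothetical placeholders):
import pandas as pd
df = pd.DataFrame({'group': [0, 1, 0, 1], 'data': ['a', 'b', 'c', 'd']})
result = df.groupby('group', group_keys=False).apply(lambda g: g)  # swap the lambda for your per-group logic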

Using key while sorting values for just one column

Let's say we have a df like below:
df = pd.DataFrame({'A':['y2','x3','z1','z1'],'B':['y2','x3','a2','z1']})
A B
0 y2 y2
1 x3 x3
2 z1 a2
3 z1 z1
if we wanted to sort the values on just the numbers in column A, we can do:
df.sort_values(by='A',key=lambda x: x.str[1])
A B
3 z1 z1
2 z1 a2
0 y2 y2
1 x3 x3
If we wanted to sort by both columns A and B, but have the key only apply to column A, is there a way to do that?
df.sort_values(by=['A','B'],key=lambda x: x.str[1])
Expected output:
A B
2 z1 a2
3 z1 z1
0 y2 y2
1 x3 x3
You can sort by B, then sort by A with a stable method (a stable sort such as mergesort preserves the B ordering among ties in A):
(df.sort_values('B')
.sort_values('A', key=lambda x: x.str[1], kind='mergesort')
)
Output:
A B
2 z1 a2
3 z1 z1
0 y2 y2
1 x3 x3
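Alternatively, since sort_values applies the key function to each column in by independently, the key can branch on the Series name so that it only transforms A; a sketch:
df.sort_values(by=['A', 'B'], key=lambda s: s.str[1] if s.name == 'A' else s)
This produces the same expected output in a single call.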

Pandas Count Group Number

Given the following dataframe:
df = pd.DataFrame({'col1': ['A','A','A','A','A','A','B','B','B','B','B','B'],
                   'col2': ['x','x','y','z','y','y','x','y','y','z','z','x']})
df
df
col1 col2
0 A x
1 A x
2 A y
3 A z
4 A y
5 A y
6 B x
7 B y
8 B y
9 B z
10 B z
11 B x
I'd like to create a new column, col3 which classifies the values in col2 sequentially, grouped by the values in col1:
col1 col2 col3
0 A x x1
1 A x x1
2 A y y1
3 A z z1
4 A y y2
5 A y y2
6 B x x1
7 B y y1
8 B y y1
9 B z z1
10 B z z1
11 B x x2
In the above example, col3[0:1] has a value of x1 because it's the first group of x values in col2 for col1 = A. col3[4:5] has values of y2 because it's the second group of y values in col2 for col1 = A, etc.
I hope the description makes sense. I was unable to find an answer, partly because I couldn't find an elegant way to articulate what I'm looking for.
Here's my approach:
groups = (df.assign(s=df.groupby('col1')['col2']  # group col2 by col1
                      .shift().ne(df['col2'])     # check if col2 differs from the previous row (shift)
                      .astype(int))               # the new column s marks the beginning of consecutive blocks with 1
            .groupby(['col1', 'col2'])['s']       # group s by col1 and col2
            .cumsum()                             # cumulative sum by group
            .astype(str))
df['col3'] = df['col2'] + groups
Output:
col1 col2 col3
0 A x x1
1 A x x1
2 A y y1
3 A z z1
4 A y y2
5 A y y2
6 B x x1
7 B y y1
8 B y y1
9 B z z1
10 B z z1
11 B x x2
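For comparison, a compact sketch of the same idea (block starts via shift/ne, then a cumulative count per (col1, col2) pair):
block_start = df['col2'].ne(df.groupby('col1')['col2'].shift()).astype(int)
df['col3'] = df['col2'] + block_start.groupby([df['col1'], df['col2']]).cumsum().astype(str)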

Pandas adding calculated vectors into df

My goal is to add formula-based vectors to my following df:
Day Name a b 1 2 x1 x2
1   ijk  1 2 3 3 0  1
2   mno  2 1 1 3 1  1
outcome:
Day Name a b 1 2 x1 x2 y1      y2      z1               z2
1   ijk  1 2 3 3 0  1  (1*2)+3 (1*2)+3 (1+2)*(3*1+0*1)  (1+2)*(3*2+1*2)
2   mno  2 1 1 3 1  1  (2*1)+1 (2*1)+3 (2+1)*(1*1+1*1)  (2+1)*(3*2+1*2)
This is my tedious approach:
df['y1'] = df['a']*df['b'] + df[1]  # y1 = a*b + value of column 1
df['y2'] = df['a']*df['b'] + df[2]  # y2 = a*b + value of column 2
If column 3 and x3 were added, then y3 = a*b + value of column 3; if column 4 and x4 were added, then y4 = a*b + value of column 4, and so on.
df['z1'] = (df['a']+df['b'])*(df[1]*1 + df['x1']*1)  # z1 = (a+b)*[(value of column 1)*1 + (value of column x1)*1]; the 1 comes from the column names 1 and x1
df['z2'] = (df['a']+df['b'])*(df[2]*2 + df['x2']*2)  # z2 = (a+b)*[(value of column 2)*2 + (value of column x2)*2]; the 2 comes from the column names 2 and x2
If column 3 and x3 were added, then z3 = (a+b)*[(value of column 3)*3 + (value of column x3)*3], and so on.
This works fine; however, it will get tedious if more columns are added, for example 3, 4, ... and x3, x4, .... I'm wondering if there's a better approach, maybe using a loop?
Many thanks :)
This is one way:
import pandas as pd

df = pd.DataFrame([[1, 'ijk', 1, 2, 3, 3, 2, 0, 1],
                   [2, 'mno', 2, 1, 1, 3, 1, 1, 1]],
                  columns=['Day', 'Name', 'a', 'b', 1, 2, 3, 'x1', 'x2'])
for i in range(1, 4):
    df['y' + str(i)] = df['a'] * df['b'] + df[i]
#output
#Day Name a b 1 2 3 x1 x2 y1 y2 y3
#1 ijk 1 2 3 3 2 0 1 5 5 4
#2 mno 2 1 1 3 1 1 1 3 5 3
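The z columns from the question follow the same pattern (z_i = (a+b) * (value of column i * i + value of column x_i * i)), so the loop extends naturally; a sketch, assuming a matching x column exists for each numbered column (here only x1 and x2 do):
for i in range(1, 3):
    df['z' + str(i)] = (df['a'] + df['b']) * (df[i] * i + df['x' + str(i)] * i)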

Pandas randomly select n groups from a larger dataset

If I have a dataframe with groups like so
val label
x A
x A
x B
x B
x C
x C
x D
x D
how can I randomly pick out n groups without replacement?
You can use np.random.choice with loc:
N = 3
vals = np.random.choice(df['label'].unique(), N, replace=False)
print (vals)
['C' 'A' 'B']
df = df.set_index('label').loc[vals].reset_index()
print (df)
label val
0 C x5
1 C x6
2 A x1
3 A x2
4 B x3
5 B x4
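If the original row order should be kept instead of the sampled group order, a boolean mask is an alternative sketch:
df_subset = df[df['label'].isin(vals)]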
