Pandas Count Group Number - python

Given the following dataframe:
import pandas as pd

df = pd.DataFrame({'col1': ['A','A','A','A','A','A','B','B','B','B','B','B'],
                   'col2': ['x','x','y','z','y','y','x','y','y','z','z','x'],
                   })
df
col1 col2
0 A x
1 A x
2 A y
3 A z
4 A y
5 A y
6 B x
7 B y
8 B y
9 B z
10 B z
11 B x
I'd like to create a new column, col3, which numbers each consecutive run of col2 values sequentially, grouped by the values in col1:
col1 col2 col3
0 A x x1
1 A x x1
2 A y y1
3 A z z1
4 A y y2
5 A y y2
6 B x x1
7 B y y1
8 B y y1
9 B z z1
10 B z z1
11 B x x2
In the above example, rows 0-1 of col3 have the value x1 because they are the first run of x values in col2 for col1 = A; rows 4-5 have y2 because they are the second run of y values for col1 = A, and so on.
I hope the description makes sense. I was unable to find an answer, partly because I can't find an elegant way to articulate what I'm looking for.

Here's my approach:
groups = (df.assign(s=df.groupby('col1')['col2']   # group col2 by col1
                        .shift()                   # previous value within each group
                        .ne(df['col2'])            # True where col2 differs from the previous row
                        .astype(int))              # the new column s marks the start of each consecutive run with 1
            .groupby(['col1', 'col2'])['s']        # group s by col1 and col2
            .cumsum()                              # running count of runs per (col1, col2) group
            .astype(str))
df['col3'] = df['col2'] + groups
Output:
col1 col2 col3
0 A x x1
1 A x x1
2 A y y1
3 A z z1
4 A y y2
5 A y y2
6 B x x1
7 B y y1
8 B y y1
9 B z z1
10 B z z1
11 B x x2
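A hedged alternative sketch of the same idea (my paraphrase, not a second answer from the thread): take only the first row of each consecutive run, number those runs per (col1, col2) with cumcount, and forward-fill the run numbers back over the full frame.
# first row of every consecutive run within each col1 group
starts = df[df.groupby('col1')['col2'].shift().ne(df['col2'])]
# 1-based run number per (col1, col2)
run_no = starts.groupby(['col1', 'col2']).cumcount() + 1
# broadcast each run's number across its rows and build col3
df['col3'] = df['col2'] + run_no.reindex(df.index).ffill().astype(int).astype(str)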

How to crosstab a pandas dataframe when one variable (column) is a list of varying length

How can I generate a cross-tabulated table from the following dataframe:
import pandas as pd
dat = pd.read_csv('data.txt', sep=',')
dat.head(6)
Factor1 Factor2
0 A X
1 B X
2 A X|Y
3 B X|Y
4 A X|Y|Z
5 B X|Y|Z
dat[['Factor2']] = dat[['Factor2']].applymap(lambda x: x.split('|'))
dat.head(6)
Factor1 Factor2
0 A [X]
1 B [X]
2 A [X, Y]
3 B [X, Y]
4 A [X, Y, Z]
5 B [X, Y, Z]
The resulting pd.crosstab() should look like this:
X Y Z
A 3 2 1
B 3 2 1
We can use get_dummies to convert the Factor2 column to indicator variables (it splits on |, so it works on the original, unsplit strings), then group the indicators by Factor1 and aggregate with sum:
dat['Factor2'].str.get_dummies('|').groupby(dat['Factor1']).sum()
X Y Z
Factor1
A 3 2 1
B 3 2 1
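For reference, the intermediate indicator frame that get_dummies builds from the raw (unsplit) strings looks like this, using dat from the question:
dat['Factor2'].str.get_dummies('|')
#    X  Y  Z
# 0  1  0  0
# 1  1  0  0
# 2  1  1  0
# 3  1  1  0
# 4  1  1  1
# 5  1  1  1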
You would have to first split on | using Series.str.split, then explode using DataFrame.explode.
dat['Factor2'] = dat['Factor2'].str.split('|')
t = dat.explode('Factor2')
pd.crosstab(t['Factor1'], t['Factor2'])
# Factor2 X Y Z
# Factor1
# A 3 2 1
# B 3 2 1
# to remove the axis names.
# pd.crosstab(t['Factor1'], t['Factor2']).rename_axis(index=None, columns=None)
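An equivalent sketch without pd.crosstab, in case you prefer counting with groupby/size/unstack (again starting from the raw, unsplit dat):
t = dat.assign(Factor2=dat['Factor2'].str.split('|')).explode('Factor2')
t.groupby(['Factor1', 'Factor2']).size().unstack(fill_value=0)
#          X  Y  Z
# Factor1
# A        3  2  1
# B        3  2  1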

Aggregating cells/column in pandas dataframe

I have a dataframe like this:
Index  Z1       Z2       Z3       Z4
0      A(Z1W1)  A(Z2W1)  A(Z3W1)  B(Z4W2)
1      A(Z1W3)  B(Z2W1)  A(Z3W2)  B(Z4W3)
2      B(Z1W1)           A(Z3W4)  B(Z4W4)
3      B(Z1W2)
I want to convert it to:
Index  Z1            Z2       Z3                 Z4
0      A(Z1W1,Z1W3)  A(Z2W1)  A(Z3W1,Z3W2,Z3W4)  B(Z4W2,Z4W3,Z4W4)
1      B(Z1W1,Z1W2)  B(Z2W1)
Basically I want to aggregate the values from different cells into one cell, as shown above.
Edit 1
The actual names are two- or three-word names, not single letters like A and B; for example, Nut Butter instead of A.
Things are getting interesting :-)
# split each cell into the letter (column 0) and the code (column 1)
s = df.stack().replace({'[(|)]': ' '}, regex=True).str.strip().str.split(' ', expand=True)
# join the codes per (original column, letter), wrap them in parentheses,
# and prepend the letter taken from the column name
v = ('(' + s.groupby([s.index.get_level_values(1), s[0]])[1].apply(','.join) + ')')\
    .unstack().apply(lambda x: x.name + x.astype(str)).T
# mask the placeholder entries produced for missing cells and push NaNs down
v[~v.apply(lambda x: x.str.contains('None'))].apply(lambda x: sorted(x, key=pd.isnull)).reset_index(drop=True)
Out[1865]:
Z1 Z2 Z3 Z4
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1) NaN NaN
Update
Change
# s = df.stack().replace({'[(|)]': ' '}, regex=True).str.strip().str.split(' ', expand=True)
to
s = df.stack().str.split('(', expand=True)
s[1] = s[1].replace({'[(|)]': ' '}, regex=True).str.strip()
General idea:
split string values
regroup and join strings
apply to all columns
Update 1
# I had to add the parameter as_index=False to groupby(0)
# to get exactly the same output as asked.
Let's try it on one column:
def str_regroup(s):
    return s.str.extract(r"(\w)\((.+)\)", expand=True).groupby(0, as_index=False).apply(
        lambda x: '{}({})'.format(x.name, ', '.join(x[1])))
str_regroup(df.Z1)
Output:
A A(Z1W1, Z1W3)
B B(Z1W1, Z1W2)
Then apply it to all columns:
df.apply(str_regroup)
Output:
Z1 Z2 Z3 Z4
0 A(Z1W1, Z1W3) A(Z2W1) A(Z3W1, Z3W2, Z3W4) B(Z4W2, Z4W3, Z4W4)
1 B(Z1W1, Z1W2) B(Z2W1)
Update 2
Performance on 100,000 sample rows:
928 ms for this apply version
1.55 s for the stack() version by @Wen
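A rough sketch of how such a timing could be reproduced (df_big is a hypothetical 100,000-row frame in the same four-column format; absolute numbers will vary by machine and pandas version):
import timeit
# one full pass of the apply-based version
print(timeit.timeit(lambda: df_big.apply(str_regroup), number=1))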
You could use the following approach:
Melt df to get:
In [194]: melted = pd.melt(df, var_name='col'); melted
Out[194]:
col value
0 Z1 A(Z1W1)
1 Z1 A(Z1W3)
2 Z1 B(Z1W1)
3 Z1 B(Z1W2)
4 Z2 A(Z2W1)
5 Z2 B(Z2W1)
6 Z2
7 Z2
8 Z3 A(Z3W1)
9 Z3 A(Z3W2)
10 Z3 A(Z3W4)
11 Z3
12 Z4 B(Z4W2)
13 Z4 B(Z4W3)
14 Z4 B(Z4W4)
15 Z4
Use regex to extract row and value columns:
In [195]: melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True); melted
Out[195]:
col value row
0 Z1 Z1W1 A
1 Z1 Z1W3 A
2 Z1 Z1W1 B
3 Z1 Z1W2 B
4 Z2 Z2W1 A
5 Z2 Z2W1 B
6 Z2 NaN NaN
7 Z2 NaN NaN
8 Z3 Z3W1 A
9 Z3 Z3W2 A
10 Z3 Z3W4 A
11 Z3 NaN NaN
12 Z4 Z4W2 B
13 Z4 Z4W3 B
14 Z4 Z4W4 B
15 Z4 NaN NaN
Group by col and row and join the values together:
In [185]: result = melted.groupby(['col', 'row'])['value'].agg(','.join)
In [186]: result
Out[186]:
col row
Z1 A Z1W1,Z1W3
B Z1W1,Z1W2
Z2 A Z2W1
B Z2W1
Z3 A Z3W1,Z3W2,Z3W4
Z4 B Z4W2,Z4W3,Z4W4
Name: value, dtype: object
Move the row level back into a column, then prepend the row values to the value strings:
In [187]: result = result.reset_index('row')
In [188]: result['value'] = result['row'] + '(' + result['value'] + ')'
In [189]: result
Out[189]:
row value
col
Z1 A A(Z1W1,Z1W3)
Z1 B B(Z1W1,Z1W2)
Z2 A A(Z2W1)
Z2 B B(Z2W1)
Z3 A A(Z3W1,Z3W2,Z3W4)
Z4 B B(Z4W2,Z4W3,Z4W4)
Overwrite the row column values with groupby/cumcount values to set up the upcoming pivot:
In [191]: result['row'] = result.groupby(level='col').cumcount()
In [192]: result
Out[192]:
row value
col
Z1 0 A(Z1W1,Z1W3)
Z1 1 B(Z1W1,Z1W2)
Z2 0 A(Z2W1)
Z2 1 B(Z2W1)
Z3 0 A(Z3W1,Z3W2,Z3W4)
Z4 0 B(Z4W2,Z4W3,Z4W4)
After a reset_index to turn col back into a column, pivoting produces the desired result:
result = result.reset_index()
result = result.pivot(index='row', columns='col', values='value')
Putting it all together:
import pandas as pd

df = pd.DataFrame({
    'Z1': ['A(Z1W1)', 'A(Z1W3)', 'B(Z1W1)', 'B(Z1W2)'],
    'Z2': ['A(Z2W1)', 'B(Z2W1)', '', ''],
    'Z3': ['A(Z3W1)', 'A(Z3W2)', 'A(Z3W4)', ''],
    'Z4': ['B(Z4W2)', 'B(Z4W3)', 'B(Z4W4)', '']}, index=[0, 1, 2, 3])
melted = pd.melt(df, var_name='col').dropna()
melted[['row','value']] = melted['value'].str.extract(r'(.*)\((.*)\)', expand=True)
result = melted.groupby(['col', 'row'])['value'].agg(','.join)
result = result.reset_index('row')
result['value'] = result['row'] + '(' + result['value'] + ')'
result['row'] = result.groupby(level='col').cumcount()
result = result.reset_index()
result = result.pivot(index='row', columns='col', values='value')
print(result)
yields
col Z1 Z2 Z3 Z4
row
0 A(Z1W1,Z1W3) A(Z2W1) A(Z3W1,Z3W2,Z3W4) B(Z4W2,Z4W3,Z4W4)
1 B(Z1W1,Z1W2) B(Z2W1) NaN NaN
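For comparison, a compact sketch along the same lines (my own condensation, not one of the answers above; it assumes the df defined in the full code):
def regroup(col):
    # split 'A(Z1W1)' into letter and code; empty cells become NaN and are dropped
    parts = col.str.extract(r'(\w+)\((.+)\)').dropna()
    # join the codes belonging to each letter
    joined = parts.groupby(0)[1].agg(','.join)
    return pd.Series([f'{k}({v})' for k, v in joined.items()])

df.apply(regroup)  # DataFrame.apply aligns the ragged per-column results with NaN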

Pandas randomly select n groups from a larger dataset

If I have a dataframe with groups like so
val label
x A
x A
x B
x B
x C
x C
x D
x D
how can I randomly pick out n groups without replacement?
You can use numpy.random.choice with loc:
import numpy as np

N = 3
vals = np.random.choice(df['label'].unique(), N, replace=False)
print (vals)
['C' 'A' 'B']
df = df.set_index('label').loc[vals].reset_index()
print (df)
(the printed output below assumes the sample frame had distinct val values x1..x8, so the selected rows are distinguishable)
label val
0 C x5
1 C x6
2 A x1
3 A x2
4 B x3
5 B x4
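If the original row order should be preserved, a hedged variant filters with isin instead of reindexing through loc (same df; vals drawn as above):
import numpy as np

N = 3
vals = np.random.choice(df['label'].unique(), N, replace=False)
sampled = df[df['label'].isin(vals)]  # rows of the chosen groups, original order kept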

Create a new column based on other columns as indices for another dataframe

Let's suppose I have one dataframe with at least two columns, col1 and col2. I also have another dataframe whose column names are values in col2 and whose indices are values in col1.
import pandas as pd
df1 = pd.DataFrame({'col1': ['x1', 'x2', 'x2'], 'col2': ['y0', 'y1', 'y0']})
df2 = pd.DataFrame({'y0': [1, 2, 3], 'y1': [4, 5, 6]}, index=['x1', 'x2', 'x3'])
print(df1)
col1 col2
0 x1 y0
1 x2 y1
2 x2 y0
print(df2)
y0 y1
x1 1 4
x2 2 5
x3 3 6
Now I wish to add col3, which gives me the value of the second dataframe at the index from col1 and the column from col2.
The result should look like this:
col1 col2 col3
0 x1 y0 1
1 x2 y1 5
2 x2 y0 2
Thank you all!
You can use stack to build a new df, then merge:
df2 = df2.stack().reset_index()
df2.columns = ['col1','col2','col3']
print (df2)
col1 col2 col3
0 x1 y0 1
1 x1 y1 4
2 x2 y0 2
3 x2 y1 5
4 x3 y0 3
5 x3 y1 6
print (pd.merge(df1, df2, on=['col1','col2'], how='left'))
col1 col2 col3
0 x1 y0 1
1 x2 y1 5
2 x2 y0 2
Another solution is to create a new Series and use join:
s = df2.stack().rename('col3')
print (s)
x1  y0    1
    y1    4
x2  y0    2
    y1    5
x3  y0    3
    y1    6
Name: col3, dtype: int64
print (df1.join(s, on=['col1','col2']))
col1 col2 col3
0 x1 y0 1
1 x2 y1 5
2 x2 y0 2
Simple join
Pandas supports the join operation both on indexes and on columns, meaning you can do this:
df1.merge(df2, left_on='col1', right_index=True)
Produces
col1 col2 y0 y1
0 x1 y0 1 4
1 x2 y1 2 5
2 x2 y0 2 5
Getting the proper value into col3 is the next step.
Apply
This is a bit inefficient, but it is one way to get the correct data into a single column (df here is the merged frame from above):
# x[0] is the col2 label, e.g. 'y1'; its trailing digit selects the
# matching column by position (y0 is position 1, y1 is position 2)
df['col3'] = df[['col2', 'y0', 'y1']].apply(lambda x: x[int(x[0][1]) + 1], axis=1)
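A hedged alternative that skips the merge entirely: index df2's underlying array with positional indexers (this assumes every col1/col2 value actually appears in df2's index/columns):
rows = df2.index.get_indexer(df1['col1'])    # positions of col1 labels in df2's index
cols = df2.columns.get_indexer(df1['col2'])  # positions of col2 labels in df2's columns
df1['col3'] = df2.to_numpy()[rows, cols]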

how to convert column names into column values in pandas - python

df = pd.DataFrame(index=['x', 'y'], data={'a': [1, 2], 'b': [3, 4]})
How can I convert column names into values of a column? This is my desired output:
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
You can use:
print (df.T.unstack().reset_index(level=1, name='c1')
         .rename(columns={'level_1':'c2'})[['c1','c2']])
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
Or:
print (df.stack().reset_index(level=1, name='c1')
         .rename(columns={'level_1':'c2'})[['c1','c2']])
c1 c2
x 1 a
x 3 b
y 2 a
y 4 b
try this:
In [279]: df.stack().reset_index().set_index('level_0').rename(columns={'level_1':'c2',0:'c1'})
Out[279]:
c2 c1
level_0
x a 1
x b 3
y a 2
y b 4
Try:
df1 = df.stack().reset_index(-1).iloc[:, ::-1]
df1.columns = ['c1', 'c2']
df1
In [62]: (pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
            .set_index('index'))
Out[62]:
c2 c1
index
x a 1
y a 2
x b 3
y b 4
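If the row order of the desired output matters (both x rows before the y rows), a hedged tweak on the melt version: a stable sort on the index should line things up:
out = (pd.melt(df.reset_index(), var_name='c2', value_name='c1', id_vars='index')
         .set_index('index')
         .sort_index(kind='mergesort')[['c1', 'c2']])  # mergesort keeps the a-before-b order within each index
print(out)
#        c1 c2
# index
# x       1  a
# x       3  b
# y       2  a
# y       4  b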
