Doing group by on one hot encoded columns PANDAS

Doing group by on one hot encoded columns PANDAS - python

i have the following dataframe called df. I want for each sector column (sector_) to basically do a group by and get the unique ids for each sector. The sector is denoted as 1 for each row if the id is apart of that sector. How can i do this group by if the columns are one hot encoded?
id winner sector_food sector_learning sector_parenting sector_consumer
1 1 1 0 0 0
1 0 1 0 0 0
2 1 0 0 0 0
2 0 1 0 0 0
3 1 0 0 0 1
expected output
sector unique_id
sector_food 2
sector_learning 0
sector_parenting 0
sector_consumer 1

You can do something like this:
out = df.drop(["id", "winner"], 1).multiply(df.id, 0).nunique().subtract(1)
#sector_food 2
#sector_learning 0
#sector_parenting 0
#sector_consumer 1
#dtype: int64
To get your exact expected output you can add:
out = out.rename_axis("sector").to_frame("unique_id")
# unique_id
#sector
#sector_food 2
#sector_learning 0
#sector_parenting 0
#sector_consumer 1

Try this:
df.drop('winner',axis=1).groupby(level=0).sum().gt(0).sum().to_frame('unique_id')

Given
import pandas as pd
ids = [1, 1, 2, 2, 3]
winner = [1, 0, 1, 0, 1]
sector_food = [1, 1, 0, 1, 0]
sector_learning = [0, 0, 0, 0, 0]
sector_parenting = [0, 0, 0, 0, 0]
sector_consumer = [0, 0, 0, 0, 1]
df = pd.DataFrame({
'id': ids,
'winner': winner,
'sector_food': sector_food,
'sector_learning': sector_learning,
'sector_parenting': sector_parenting,
'sector_consumer': sector_consumer
})
print(df)
output
id winner sector_food sector_learning sector_parenting sector_consumer
0 1 1 1 0 0 0
1 1 0 1 0 0 0
2 2 1 0 0 0 0
3 2 0 1 0 0 0
4 3 1 0 0 0 1
You can do
_df = (df
# drop unused cols
.drop('winner', axis=1)
# melt with 'id' as index
.melt(id_vars='id')
# drop all duplicates
.drop_duplicates(['id', 'variable', 'value'])
# sum unique values
.groupby('variable').value.sum()
)
print(_df)
output
variable
sector_consumer 1
sector_food 2
sector_learning 0
sector_parenting 0
Name: value, dtype: int64

Related

Anonymizing column names

I have a dataframe like so
IsCool IsTall IsHappy Target
0 1 0 1
1 1 0 0
0 1 0 0
1 0 1 1
I want to anonymize the column names except for target.
How can I do this?
Expected output:
col1 col2 col3 Target
0 1 0 1
1 1 0 0
0 1 0 0
1 0 1 1
Source dataframe :
import pandas as pd
df = pd.DataFrame({"IsCool": [0, 1, 0, 1],
"IsTall": [1, 1, 1, 0],
"IsHappy": [0, 0, 0, 1],
"Target": [1, 0, 0, 1]})

What about:
cols = {
col: f"col{i + 1}" if col != "Target" else col
for i, col in enumerate(df.columns)
}
out = df.rename(columns=cols)
col1 col2 col3 Target
0 0 1 0 1
1 1 1 0 0
2 0 1 0 0
3 1 0 1 1
You can also do it in place:
cols = [
f"col{i + 1}" if col != "Target" else col
for i, col in enumerate(df.columns)
]
df.columns = cols

You can use:
# get all columns except excluded ones (here "Target")
cols = df.columns.difference(['Target'])
# give a new name
names = 'col' + pd.Series(range(1, len(cols)+1), index=cols).astype(str)
out = df.rename(columns=names)
Output:
col1 col2 col3 Target
0 0 1 0 1
1 1 1 0 0
2 0 1 0 0
3 1 0 1 1

Proposed code :
You can pass a dict to the rename() Pandas function with a dict like this in parameters :
columns={'IsCool': 'col0', 'IsTall': 'col1', 'IsHappy': 'col2'}
This dict is obtained by using of a zip function : dict(zip(keys, values))
import pandas as pd
df = pd.DataFrame({"IsCool": [0, 1, 0, 1],
"IsTall": [1, 1, 1, 0],
"IsHappy": [0, 0, 0, 1],
"Target": [1, 0, 0, 1]})
df = df.rename(columns = dict(zip(df.columns.drop('Target'),
["col%s"%i for i in range(len(df.columns)-1)])))
print(df)
Result :
col0 col1 col2 Target
0 0 1 0 1
1 1 1 0 0
2 0 1 0 0
3 1 0 1 1

How to convert column values into new columns showing frequency

I created a new dataframe by splitting a column and expanding it.
I now want to convert the dataframe to create new columns for every value and only display the frequency of the value.
I wrote an example below.
Example dataframe:
import pandas as pd
import numpy as np
df= pd.DataFrame({0:['cake','fries', 'ketchup', 'potato', 'snack'],
1:['fries', 'cake', 'potato', np.nan, 'snack'],
2:['ketchup', 'cake', 'potatos', 'snack', np.nan],
3:['potato', np.nan,'cake', 'ketchup',np.nan],
'index':['james','samantha','ashley','tim', 'mo']})
df.set_index('index')
Expected output:
output = pd.DataFrame({'cake': [1, 2, 1, 0, 0],
'fries': [1, 1, 0, 0, 0],
'ketchup': [1, 0, 1, 1, 0],
'potatoes': [1, 0, 2, 1, 0],
'snack': [0, 0, 0, 1, 2],
'index': ['james', 'samantha', 'asheley', 'tim', 'mo']})
output.set_index('index')

Based on the description of what you want, you would need a crosstab on the reshaped data:
df2 = df.reset_index().melt('index')
out = pd.crosstab(df2['index'], df2['value'].str.lower())
This, however, doesn't match the provided output.
Output:
value apple berries cake chocolate drink fries fruits ketchup potato potatoes snack
index
Ashley 0 0 0 0 0 0 0 1 1 0 1
James 0 1 1 0 0 1 1 0 0 0 0
Mo 0 0 0 1 0 0 1 1 0 1 0
samantha 1 0 0 1 0 1 0 0 0 0 0
tim 0 0 0 0 1 0 0 0 0 0 1

pandas series add with previous row on condition

I need to add a series with previous rows only if a condition matches in current cell. Here's the dataframe:
import pandas as pd
data = {'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0]}
df = pd.DataFrame(data, columns=['col1'])
df['continuous'] = df.col1
print(df)
I need to +1 a cell with previous sum if it's value > 0 else -1. So, result I'm expecting is;
col1 continuous
0 1 1//+1 as its non-zero
1 2 2//+1 as its non-zero
2 1 3//+1 as its non-zero
3 0 2//-1 as its zero
4 0 1
5 0 0
6 0 0// not to go less than 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
Case 2 : where I want instead of >0 , I need <-0.1
data = {'col1': [-0.097112634,-0.092674324,-0.089176841,-0.087302284,-0.087351866,-0.089226185,-0.092242213,-0.096446987,-0.101620036,-0.105940337,-0.109484752,-0.113515648,-0.117848816,-0.121133266,-0.123824577,-0.126030136,-0.126630895,-0.126015218,-0.124235003,-0.122715224,-0.121746573,-0.120794916,-0.120291174,-0.120323152,-0.12053229,-0.121491186,-0.122625851,-0.123819704,-0.125751858,-0.127676591,-0.129339428,-0.132342431,-0.137119556,-0.142040092,-0.14837848,-0.15439201,-0.159282645,-0.161271982,-0.162377701,-0.162838307,-0.163204393,-0.164095634,-0.165496071,-0.167224488,-0.167057078,-0.165706164,-0.163301617,-0.161423938,-0.158669389,-0.156508912,-0.15508329,-0.15365104,-0.151958972,-0.150317528,-0.149234892,-0.148259354,-0.14737422,-0.145958527,-0.144633388,-0.143120273,-0.14145652,-0.139930163,-0.138774126,-0.136710524,-0.134692221,-0.132534879,-0.129921444,-0.127974949,-0.128294058,-0.129241763,-0.132263506,-0.137828981,-0.145549768,-0.154244588,-0.163125109,-0.171814857,-0.179911465,-0.186223859,-0.190653162,-0.194761064,-0.197988536,-0.200500606,-0.20260121,-0.204797089,-0.208281065,-0.211846904,-0.215312626,-0.218696339,-0.221489975,-0.221375209,-0.220996031,-0.218558429,-0.215936558,-0.213933531,-0.21242896,-0.209682125,-0.208196607,-0.206243585,-0.202190476,-0.19913106,-0.19703291,-0.194244664,-0.189609518,-0.186600526,-0.18160171,-0.175875689,-0.170767095,-0.167453329,-0.163516985,-0.161168703,-0.158197984,-0.156378046,-0.154794499,-0.153236804,-0.15187487,-0.151623385,-0.150628282,-0.149039072,-0.14826268,-0.147535739,-0.145557646,-0.142223729,-0.139343068,-0.135355686,-0.13047743,-0.125999173,-0.12218752,-0.117021996,-0.111542982,-0.106409901,-0.101904095,-0.097910825,-0.094683375,-0.092079967,-0.088953862,-0.086268097,-0.082907394,-0.080723466,-0.078117426,-0.075431993,-0.072079536,-0.068962411,-0.064831759,-0.061257701,-0.05830671,-0.053889968,-0.048972414,-0.044763431,-0.042162829,-0.039328369,-0.038968862,-0.040450835,-0.041974942,-0.042161609,-0.04280523,-0.042702428,-0.042593856,-0.043166561,-0.043691795,-0.044093492,-0.043965231,-0.04263305,-0.040836102,-0.039605133,-0.037204273,-0.034368645,-0.032293737,-0.029037983,-0.025509509,-0.022704668,-0.021346266,-0.019881524,-0.018675734,-0.017509566,-0.017148129,-0.016671088,-0.016015011,-0.016241862,-0.016416445,-0.016548878,-0.016475455,-0.016405742,-0.015567737,-0.014190101,-0.012373151,-0.010370329,-0.008131459,-0.006729419,-0.005667607,-0.004883919,-0.004841328,-0.005403019,-0.005343759,-0.005377974,-0.00548823,-0.004889709,-0.003884973,-0.003149113,-0.002975268,-0.00283163,-0.00322658,-0.003546589,-0.004233582,-0.004448617,-0.004706967,-0.007400356,-0.010104064,-0.01230257,-0.014430498,-0.016499501,-0.015348355,-0.013974229,-0.012845464,-0.012688459,-0.012552231,-0.013719074,-0.014404172,-0.014611632,-0.013401283,-0.011807386,-0.007417753,-0.003321279,0.000363954,0.004908491,0.010151584,0.013223831,0.016746553,0.02106351,0.024571507,0.027588073,0.031313637,0.034419301,0.037016545,0.038172954,0.038237253,0.038094387,0.037783779,0.036482515,0.036080763,0.035476154,0.034107081,0.03237083,0.030934259,0.029317076,0.028236195,0.027850758,0.024612491,0.01964433,0.015153308,0.009684456,0.003336172]}
df = pd.DataFrame(data, columns=['col1'])
lim = float(-0.1)
s = df['col1'].lt(lim)
out = s.where(s, -1).cumsum()
df['sol'] = out - out.where((out < 0) & (~s)).ffill().fillna(0)
print(df)

The key problem here, to me, is to control the out not to go below zero. With that in mind, we can mask the output where it's negative and adjust accordingly:
# a little longer data for corner case
df = pd.DataFrame({'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0,0,0,0,2,3,4]})
s = df.col1.gt(0)
out = s.where(s,-1).cumsum()
df['continuous'] = out - out.where((out<0)&(~s)).ffill().fillna(0)
Output:
col1 continuous
0 1 1
1 2 2
2 1 3
3 0 2
4 0 1
5 0 0
6 0 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
12 0 0
13 0 0
14 0 0
15 2 1
16 3 2
17 4 3

You can do this using cumsum function on booleans:
Give me a +1 whenever col1 is not zero:
(df.col1 != 0 ).cumsum()
Give me a -1 whenever col1 is zero:
- (df.col1 == 0 ).cumsum()
Then just add them together!
df['continuous'] = (df.col1 != 0 ).cumsum() - (df.col1 == 0 ).cumsum()
However this does not resolve the dropping below zero criteria you mentioned

Pandas - Does row fall below a row with a column value and same id

I am new to Pandas. I have a Pandas data frame like so:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0]})
I want to add a column val2, that indicates whether an row falls below another row having the same id as itself where val1 == 1.
The result would be a data frame like:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0], 'val2': [0, 0, 1, 0, 0, 1, 1]})
My first thought was to use an apply statement, but these only go by row. And from my experience for loops are never the answer. Any help would be greatly appreciated!

Let's try shift + cumsum inside a groupby.
df['val2'] = df.groupby('id').val1.apply(
lambda x: x.shift().cumsum()
).ge(1).astype(int)
Or, in an attempt to avoid the lambda,
df['val2'] = (
df.groupby('id')
.val1.shift()
.groupby(df.id)
.cumsum()
.ge(1)
.astype(int)
)
df
id val1 val2
0 1 0 0
1 1 1 0
2 1 0 1
3 2 0 0
4 2 1 0
5 2 0 1
6 2 0 1

Using groupby + transform. Similar to coldspeed's but using bool conversion for non-zero cumsum values.
df['val2'] = df.groupby('id')['val1'].transform(lambda x: x.cumsum().shift())\
.fillna(0).astype(bool).astype(int)
print(df)
id val1 val2
0 1 0 0
1 1 1 0
2 1 0 1
3 2 0 0
4 2 1 0
5 2 0 1
6 2 0 1

pandas: Use if-else to populate new column

I have a DataFrame like this:
col1 col2
1 0
0 1
0 0
0 0
3 3
2 0
0 4
I'd like to add a column that is a 1 if col2 is > 0 or 0 otherwise. If I was using R I'd do something like
df1[,'col3'] <- ifelse(df1$col2 > 0, 1, 0)
How would I do this in python / pandas?

You could convert the boolean series df.col2 > 0 to an integer series (True becomes 1 and False becomes 0):
df['col3'] = (df.col2 > 0).astype('int')
(To create a new column, you simply need to name it and assign it to a Series, array or list of the same length as your DataFrame.)
This produces col3 as:
col2 col3
0 0 0
1 1 1
2 0 0
3 0 0
4 3 1
5 0 0
6 4 1
Another way to create the column could be to use np.where, which lets you specify a value for either of the true or false values and is perhaps closer to the syntax of the R function ifelse. For example:
>>> np.where(df['col2'] > 0, 4, -1)
array([-1, 4, -1, -1, 4, -1, 4])

I assume that you're using Pandas (because of the 'df' notation). If so, you can assign col3 a boolean flag by using .gt (greater than) to compare col2 against zero. Multiplying the result by one will convert the boolean flags into ones and zeros.
df1 = pd.DataFrame({'col1': [1, 0, 0, 0, 3, 2, 0],
'col2': [0, 1, 0, 0, 3, 0, 4]})
df1['col3'] = df1.col2.gt(0) * 1
>>> df1
Out[70]:
col1 col2 col3
0 1 0 0
1 0 1 1
2 0 0 0
3 0 0 0
4 3 3 1
5 2 0 0
6 0 4 1
You can also use a lambda expression to achieve the same result, but I believe the method above is simpler for your given example.
df1['col3'] = df1['col2'].apply(lambda x: 1 if x > 0 else 0)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Doing group by on one hot encoded columns PANDAS - python

Try this: df.drop('winner',axis=1).groupby(level=0).sum().gt(0).sum().to_frame('unique_id')

Related

Anonymizing column names

How to convert column values into new columns showing frequency

pandas series add with previous row on condition

Pandas - Does row fall below a row with a column value and same id

pandas: Use if-else to populate new column

Categories

Resources