I have a dataframe like so
IsCool IsTall IsHappy Target
0 1 0 1
1 1 0 0
0 1 0 0
1 0 1 1
I want to anonymize the column names except for target.
How can I do this?
Expected output:
col1 col2 col3 Target
0 1 0 1
1 1 0 0
0 1 0 0
1 0 1 1
Source dataframe :
import pandas as pd
df = pd.DataFrame({"IsCool": [0, 1, 0, 1],
"IsTall": [1, 1, 1, 0],
"IsHappy": [0, 0, 0, 1],
"Target": [1, 0, 0, 1]})
What about:
cols = {
col: f"col{i + 1}" if col != "Target" else col
for i, col in enumerate(df.columns)
}
out = df.rename(columns=cols)
col1 col2 col3 Target
0 0 1 0 1
1 1 1 0 0
2 0 1 0 0
3 1 0 1 1
You can also do it in place:
cols = [
f"col{i + 1}" if col != "Target" else col
for i, col in enumerate(df.columns)
]
df.columns = cols
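Both snippets above number the columns by their position in the frame, so the numbering would skip a value if Target were not the last column. A minimal sketch of a helper that numbers only the renamed columns follows; the name anonymize_columns and its keep and prefix parameters are hypothetical, not part of pandas.

```python
import pandas as pd


def anonymize_columns(df, keep=("Target",), prefix="col"):
    """Return a copy of df with every column not in `keep` renamed to prefix1, prefix2, ..."""
    to_rename = [c for c in df.columns if c not in keep]
    # Number only the columns being renamed, so kept columns never create gaps
    mapping = {old: f"{prefix}{i}" for i, old in enumerate(to_rename, start=1)}
    return df.rename(columns=mapping)


# Target deliberately placed in the middle to show the numbering stays contiguous
df = pd.DataFrame({"IsCool": [0, 1, 0, 1],
                   "Target": [1, 0, 0, 1],
                   "IsTall": [1, 1, 1, 0],
                   "IsHappy": [0, 0, 0, 1]})
out = anonymize_columns(df)
print(list(out.columns))  # ['col1', 'Target', 'col2', 'col3']
```

The original frame is left untouched because rename returns a new object.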
You can use:
# get all columns except excluded ones (here "Target")
# sort=False keeps the original column order (difference sorts by default)
cols = df.columns.difference(['Target'], sort=False)
# give a new name
names = 'col' + pd.Series(range(1, len(cols)+1), index=cols).astype(str)
out = df.rename(columns=names)
Output:
col1 col2 col3 Target
0 0 1 0 1
1 1 1 0 0
2 0 1 0 0
3 1 0 1 1
Proposed code :
You can pass a dict to the pandas rename() method through its columns parameter, for example:
columns={'IsCool': 'col0', 'IsTall': 'col1', 'IsHappy': 'col2'}
This dict is built with zip: dict(zip(keys, values)). Note that the numbering here starts at col0; use range(1, len(df.columns)) instead to start at col1.
import pandas as pd
df = pd.DataFrame({"IsCool": [0, 1, 0, 1],
"IsTall": [1, 1, 1, 0],
"IsHappy": [0, 0, 0, 1],
"Target": [1, 0, 0, 1]})
df = df.rename(columns = dict(zip(df.columns.drop('Target'),
["col%s"%i for i in range(len(df.columns)-1)])))
print(df)
Result :
col0 col1 col2 Target
0 0 1 0 1
1 1 1 0 0
2 0 1 0 0
3 1 0 1 1
I created a new dataframe by splitting a column and expanding it.
I now want to convert the dataframe to create new columns for every value and only display the frequency of the value.
I wrote an example below.
Example dataframe:
import pandas as pd
import numpy as np
df= pd.DataFrame({0:['cake','fries', 'ketchup', 'potato', 'snack'],
1:['fries', 'cake', 'potato', np.nan, 'snack'],
2:['ketchup', 'cake', 'potatos', 'snack', np.nan],
3:['potato', np.nan,'cake', 'ketchup',np.nan],
'index':['james','samantha','ashley','tim', 'mo']})
df = df.set_index('index')
Expected output:
output = pd.DataFrame({'cake': [1, 2, 1, 0, 0],
'fries': [1, 1, 0, 0, 0],
'ketchup': [1, 0, 1, 1, 0],
'potatoes': [1, 0, 2, 1, 0],
'snack': [0, 0, 0, 1, 2],
'index': ['james', 'samantha', 'asheley', 'tim', 'mo']})
output = output.set_index('index')
Based on the description of what you want, you would need a crosstab on the reshaped data:
df2 = df.reset_index().melt('index')
out = pd.crosstab(df2['index'], df2['value'].str.lower())
This, however, doesn't match the provided output.
Output:
value apple berries cake chocolate drink fries fruits ketchup potato potatoes snack
index
Ashley 0 0 0 0 0 0 0 1 1 0 1
James 0 1 1 0 0 1 1 0 0 0 0
Mo 0 0 0 1 0 0 1 1 0 1 0
samantha 1 0 0 1 0 1 0 0 0 0 0
tim 0 0 0 0 1 0 0 0 0 0 1
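As a self-contained sketch, here is the same melt + crosstab approach run on the example frame from the question. It assumes the result of set_index is assigned back to df, and the final reindex (an addition, not part of the answer above) only restores the original row order. The spellings "potato" and "potatos" stay as separate columns because they are different strings in the data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: ['cake', 'fries', 'ketchup', 'potato', 'snack'],
                   1: ['fries', 'cake', 'potato', np.nan, 'snack'],
                   2: ['ketchup', 'cake', 'potatos', 'snack', np.nan],
                   3: ['potato', np.nan, 'cake', 'ketchup', np.nan],
                   'index': ['james', 'samantha', 'ashley', 'tim', 'mo']})
df = df.set_index('index')

# Long format: one row per (person, value) pair
long = df.reset_index().melt('index')

# Count how often each value occurs per person; NaN values are dropped
counts = pd.crosstab(long['index'], long['value'].str.lower())

# crosstab sorts the row labels; restore the original order
counts = counts.reindex(df.index)
print(counts)
```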
I need to add a series that accumulates over the previous rows, where each step depends on a condition in the current cell. Here's the dataframe:
import pandas as pd
data = {'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0]}
df = pd.DataFrame(data, columns=['col1'])
df['continuous'] = df.col1
print(df)
I need to add 1 to the previous total if the cell's value is > 0, and subtract 1 otherwise, without ever going below 0. So the result I'm expecting is:
col1 continuous
0 1 1 // +1 as it's non-zero
1 2 2 // +1 as it's non-zero
2 1 3 // +1 as it's non-zero
3 0 2 // -1 as it's zero
4 0 1
5 0 0
6 0 0 // must not go below 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
Case 2: here the condition is < -0.1 instead of > 0:
data = {'col1': [-0.097112634,-0.092674324,-0.089176841,-0.087302284,-0.087351866,-0.089226185,-0.092242213,-0.096446987,-0.101620036,-0.105940337,-0.109484752,-0.113515648,-0.117848816,-0.121133266,-0.123824577,-0.126030136,-0.126630895,-0.126015218,-0.124235003,-0.122715224,-0.121746573,-0.120794916,-0.120291174,-0.120323152,-0.12053229,-0.121491186,-0.122625851,-0.123819704,-0.125751858,-0.127676591,-0.129339428,-0.132342431,-0.137119556,-0.142040092,-0.14837848,-0.15439201,-0.159282645,-0.161271982,-0.162377701,-0.162838307,-0.163204393,-0.164095634,-0.165496071,-0.167224488,-0.167057078,-0.165706164,-0.163301617,-0.161423938,-0.158669389,-0.156508912,-0.15508329,-0.15365104,-0.151958972,-0.150317528,-0.149234892,-0.148259354,-0.14737422,-0.145958527,-0.144633388,-0.143120273,-0.14145652,-0.139930163,-0.138774126,-0.136710524,-0.134692221,-0.132534879,-0.129921444,-0.127974949,-0.128294058,-0.129241763,-0.132263506,-0.137828981,-0.145549768,-0.154244588,-0.163125109,-0.171814857,-0.179911465,-0.186223859,-0.190653162,-0.194761064,-0.197988536,-0.200500606,-0.20260121,-0.204797089,-0.208281065,-0.211846904,-0.215312626,-0.218696339,-0.221489975,-0.221375209,-0.220996031,-0.218558429,-0.215936558,-0.213933531,-0.21242896,-0.209682125,-0.208196607,-0.206243585,-0.202190476,-0.19913106,-0.19703291,-0.194244664,-0.189609518,-0.186600526,-0.18160171,-0.175875689,-0.170767095,-0.167453329,-0.163516985,-0.161168703,-0.158197984,-0.156378046,-0.154794499,-0.153236804,-0.15187487,-0.151623385,-0.150628282,-0.149039072,-0.14826268,-0.147535739,-0.145557646,-0.142223729,-0.139343068,-0.135355686,-0.13047743,-0.125999173,-0.12218752,-0.117021996,-0.111542982,-0.106409901,-0.101904095,-0.097910825,-0.094683375,-0.092079967,-0.088953862,-0.086268097,-0.082907394,-0.080723466,-0.078117426,-0.075431993,-0.072079536,-0.068962411,-0.064831759,-0.061257701,-0.05830671,-0.053889968,-0.048972414,-0.044763431,-0.042162829,-0.039328369,-0.038968862,-0.040450835,-0.041974942,-0.042161609,-0.04280523,-0.042702428,-0.042593856,-0.043166561,-0.043691795,-0.044093492,-0.043965231,-0.04263305,-0.040836102,-0.039605133,-0.037204273,-0.034368645,-0.032293737,-0.029037983,-0.025509509,-0.022704668,-0.021346266,-0.019881524,-0.018675734,-0.017509566,-0.017148129,-0.016671088,-0.016015011,-0.016241862,-0.016416445,-0.016548878,-0.016475455,-0.016405742,-0.015567737,-0.014190101,-0.012373151,-0.010370329,-0.008131459,-0.006729419,-0.005667607,-0.004883919,-0.004841328,-0.005403019,-0.005343759,-0.005377974,-0.00548823,-0.004889709,-0.003884973,-0.003149113,-0.002975268,-0.00283163,-0.00322658,-0.003546589,-0.004233582,-0.004448617,-0.004706967,-0.007400356,-0.010104064,-0.01230257,-0.014430498,-0.016499501,-0.015348355,-0.013974229,-0.012845464,-0.012688459,-0.012552231,-0.013719074,-0.014404172,-0.014611632,-0.013401283,-0.011807386,-0.007417753,-0.003321279,0.000363954,0.004908491,0.010151584,0.013223831,0.016746553,0.02106351,0.024571507,0.027588073,0.031313637,0.034419301,0.037016545,0.038172954,0.038237253,0.038094387,0.037783779,0.036482515,0.036080763,0.035476154,0.034107081,0.03237083,0.030934259,0.029317076,0.028236195,0.027850758,0.024612491,0.01964433,0.015153308,0.009684456,0.003336172]}
df = pd.DataFrame(data, columns=['col1'])
lim = float(-0.1)
s = df['col1'].lt(lim)
out = s.where(s, -1).cumsum()
df['sol'] = out - out.where((out < 0) & (~s)).ffill().fillna(0)
print(df)
The key problem here, to me, is keeping the running total from going below zero. With that in mind, we can mask the output where it's negative and adjust accordingly:
# a little longer data for corner case
df = pd.DataFrame({'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0,0,0,0,2,3,4]})
s = df.col1.gt(0)
out = s.where(s,-1).cumsum()
df['continuous'] = out - out.where((out<0)&(~s)).ffill().fillna(0)
Output:
col1 continuous
0 1 1
1 2 2
2 1 3
3 0 2
4 0 1
5 0 0
6 0 0
7 3 1
8 2 2
9 2 3
10 0 2
11 0 1
12 0 0
13 0 0
14 0 0
15 2 1
16 3 2
17 4 3
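As a hedged alternative, the floor at zero can also be expressed with a running minimum: subtract the lowest value the raw cumulative sum has reached so far, capped at zero. The helper name clamped_counter is hypothetical, not a pandas function, and the sketch assumes a boolean Series marking the rows that should count as +1. This formulation also covers a zero that arrives after the counter has only partly recovered from a long run of zeros.

```python
import numpy as np
import pandas as pd


def clamped_counter(flags):
    """Add 1 where flags is True, subtract 1 where False, never dropping below zero."""
    steps = pd.Series(np.where(flags, 1, -1), index=flags.index)
    raw = steps.cumsum()
    # Lowest point reached so far; only values below zero need correcting
    floor = raw.cummin().clip(upper=0)
    return raw - floor


df = pd.DataFrame({'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0]})
df['continuous'] = clamped_counter(df['col1'].gt(0))
print(df['continuous'].tolist())  # [1, 2, 3, 2, 1, 0, 0, 1, 2, 3, 2, 1]

# Three misses, two hits, one miss: the last step goes from 2 down to 1
tricky = pd.Series([False, False, False, True, True, False])
print(clamped_counter(tricky).tolist())  # [0, 0, 0, 1, 2, 1]
```

For Case 2 in the question, the same helper could be called with df['col1'].lt(-0.1) as the condition.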
You can do this using the cumsum function on booleans:
Give me a +1 whenever col1 is not zero:
(df.col1 != 0 ).cumsum()
Give me a -1 whenever col1 is zero:
- (df.col1 == 0 ).cumsum()
Then just add them together!
df['continuous'] = (df.col1 != 0 ).cumsum() - (df.col1 == 0 ).cumsum()
However, this does not handle the requirement you mentioned that the total must not drop below zero.
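For comparison, an explicit loop makes the floor at zero easy to enforce, at the cost of speed on large frames. This is only a plain-Python sketch; the function name clamped_counter_loop is hypothetical.

```python
import pandas as pd


def clamped_counter_loop(flags):
    """Walk the flags once: +1 on True, -1 on False, with the total floored at zero."""
    total = 0
    out = []
    for flag in flags:
        total = total + 1 if flag else max(total - 1, 0)
        out.append(total)
    return out


df = pd.DataFrame({'col1': [1, 2, 1, 0, 0, 0, 0, 3, 2, 2, 0, 0]})
df['continuous'] = clamped_counter_loop(df['col1'] != 0)
print(df['continuous'].tolist())  # [1, 2, 3, 2, 1, 0, 0, 1, 2, 3, 2, 1]
```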
I am new to Pandas. I have a Pandas data frame like so:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0]})
I want to add a column val2 that indicates whether a row falls below another row with the same id where val1 == 1.
The result would be a data frame like:
df = pd.DataFrame(data={'id': [1, 1, 1, 2, 2, 2, 2], 'val1': [0, 1, 0, 0, 1, 0, 0], 'val2': [0, 0, 1, 0, 0, 1, 1]})
My first thought was to use an apply statement, but these only go by row. And from my experience for loops are never the answer. Any help would be greatly appreciated!
Let's try shift + cumsum inside a groupby.
# group_keys=False keeps the original row index so the result aligns with df
df['val2'] = df.groupby('id', group_keys=False).val1.apply(
    lambda x: x.shift().cumsum()
).ge(1).astype(int)
Or, in an attempt to avoid the lambda,
df['val2'] = (
df.groupby('id')
.val1.shift()
.groupby(df.id)
.cumsum()
.ge(1)
.astype(int)
)
df
id val1 val2
0 1 0 0
1 1 1 0
2 1 0 1
3 2 0 0
4 2 1 0
5 2 0 1
6 2 0 1
Using groupby + transform. Similar to coldspeed's but using bool conversion for non-zero cumsum values.
df['val2'] = df.groupby('id')['val1'].transform(lambda x: x.cumsum().shift())\
.fillna(0).astype(bool).astype(int)
print(df)
id val1 val2
0 1 0 0
1 1 1 0
2 1 0 1
3 2 0 0
4 2 1 0
5 2 0 1
6 2 0 1
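Another lambda-free variant, offered as a sketch: take the running maximum of val1 within each id, then shift it down one row inside the group so that only rows strictly below a 1 are flagged. It assumes val1 contains only 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2],
                   'val1': [0, 1, 0, 0, 1, 0, 0]})

# 1 from the first row in each group where val1 == 1 onwards
seen = df.groupby('id')['val1'].cummax()

# Shift within each group so the row holding the 1 itself is not flagged
df['val2'] = seen.groupby(df['id']).shift(fill_value=0).astype(int)
print(df['val2'].tolist())  # [0, 0, 1, 0, 0, 1, 1]
```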
I have a DataFrame like this:
col1 col2
1 0
0 1
0 0
0 0
3 3
2 0
0 4
I'd like to add a column that is a 1 if col2 is > 0 or 0 otherwise. If I was using R I'd do something like
df1[,'col3'] <- ifelse(df1$col2 > 0, 1, 0)
How would I do this in python / pandas?
You could convert the boolean series df.col2 > 0 to an integer series (True becomes 1 and False becomes 0):
df['col3'] = (df.col2 > 0).astype('int')
(To create a new column, you simply need to name it and assign it to a Series, array or list of the same length as your DataFrame.)
This produces col3 as:
col2 col3
0 0 0
1 1 1
2 0 0
3 0 0
4 3 1
5 0 0
6 4 1
Another way to create the column is np.where, which lets you specify the value to use where the condition is true and the value to use where it is false, and is perhaps closer to the syntax of the R function ifelse. For example:
>>> np.where(df['col2'] > 0, 4, -1)
array([-1, 4, -1, -1, 4, -1, 4])
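Applied to the question itself, a minimal self-contained sketch of the np.where form with 1 and 0 as the two outcomes could look like this. The frame is rebuilt from the values shown in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 0, 0, 3, 2, 0],
                   'col2': [0, 1, 0, 0, 3, 0, 4]})

# Same shape as R's ifelse(condition, value_if_true, value_if_false)
df['col3'] = np.where(df['col2'] > 0, 1, 0)
print(df['col3'].tolist())  # [0, 1, 0, 0, 1, 0, 1]
```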
I assume that you're using Pandas (because of the 'df' notation). If so, you can assign col3 a boolean flag by using .gt (greater than) to compare col2 against zero. Multiplying the result by one will convert the boolean flags into ones and zeros.
df1 = pd.DataFrame({'col1': [1, 0, 0, 0, 3, 2, 0],
'col2': [0, 1, 0, 0, 3, 0, 4]})
df1['col3'] = df1.col2.gt(0) * 1
>>> df1
Out[70]:
col1 col2 col3
0 1 0 0
1 0 1 1
2 0 0 0
3 0 0 0
4 3 3 1
5 2 0 0
6 0 4 1
You can also use a lambda expression to achieve the same result, but I believe the method above is simpler for your given example.
df1['col3'] = df1['col2'].apply(lambda x: 1 if x > 0 else 0)