For example, I have a dataframe:
          0         1         2         3         4         5         6
0  0.493212  0.586246       NaN  0.589289       NaN  0.629087  0.593872
1  0.568513  0.367722       NaN       NaN       NaN       NaN  0.423369
2  0.700540  0.735529       NaN       NaN  0.494135       NaN       NaN
3       NaN       NaN       NaN  0.338822  0.466331  0.765367  0.830820
4  0.512891       NaN  0.623782  0.642438       NaN  0.541117  0.929810
If I compare it like:
df >= 0.5
The result is:
   0  1  2  3  4  5  6
0  0  1  0  1  0  1  1
1  1  0  0  0  0  0  0
2  1  1  0  0  0  0  0
3  0  0  0  0  0  1  1
4  1  0  1  1  0  1  1
How can I keep the NaN cells? I mean, I need 0.5 > np.nan to evaluate to np.nan, not to False.
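For reference, a minimal sketch that rebuilds the frame above from the printed values and shows the default behaviour (any comparison involving NaN yields False):

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [0.493212, 0.586246, np.nan, 0.589289, np.nan, 0.629087, 0.593872],
    [0.568513, 0.367722, np.nan, np.nan, np.nan, np.nan, 0.423369],
    [0.700540, 0.735529, np.nan, np.nan, 0.494135, np.nan, np.nan],
    [np.nan, np.nan, np.nan, 0.338822, 0.466331, 0.765367, 0.830820],
    [0.512891, np.nan, 0.623782, 0.642438, np.nan, 0.541117, 0.929810],
])

print(np.nan >= 0.5)               # False -- comparisons with NaN are never True
print((df >= 0.5).dtypes.unique())  # plain bool, so the NaN positions are lost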
IIUC, you can use a mask (shown here with lt(0.5); the same pattern works with ge(0.5)):
df.lt(0.5).astype(int).mask(df.isna())
output:
0 1 2 3 4 5 6
0 1.0 0.0 NaN 0.0 NaN 0.0 0.0
1 0.0 1.0 NaN NaN NaN NaN 1.0
2 0.0 0.0 NaN NaN 1.0 NaN NaN
3 NaN NaN NaN 1.0 1.0 0.0 0.0
4 0.0 NaN 0.0 0.0 NaN 0.0 0.0
If you want to keep the integer type:
out = df.lt(0.5).astype(pd.Int64Dtype()).mask(df.isna())
output:
0 1 2 3 4 5 6
0 1 0 <NA> 0 <NA> 0 0
1 0 1 <NA> <NA> <NA> <NA> 1
2 0 0 <NA> <NA> 1 <NA> <NA>
3 <NA> <NA> <NA> 1 1 0 0
4 0 <NA> 0 0 <NA> 0 0
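The string alias works too: df.lt(0.5).astype('Int64').mask(df.isna()) is equivalent to passing pd.Int64Dtype().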
Use DataFrame.mask after converting the boolean values to integers:
df = (df >= 0.5).astype(int).mask(df.isna())
print (df)
0 1 2 3 4 5 6
0 0.0 1.0 NaN 1.0 NaN 1.0 1.0
1 1.0 0.0 NaN NaN NaN NaN 0.0
2 1.0 1.0 NaN NaN 0.0 NaN NaN
3 NaN NaN NaN 0.0 0.0 1.0 1.0
4 1.0 NaN 1.0 1.0 NaN 1.0 1.0
Details:
print ((df >= 0.5).astype(int))
0 1 2 3 4 5 6
0 0 1 0 1 0 1 1
1 1 0 0 0 0 0 0
2 1 1 0 0 0 0 0
3 0 0 0 0 0 1 1
4 1 0 1 1 0 1 1
Another idea with numpy.select:
df[:] = np.select([df.isna(), df >= 0.5], [None, 1], default=0)
print (df)
0 1 2 3 4 5 6
0 0.0 1.0 NaN 1.0 NaN 1.0 1.0
1 1.0 0.0 NaN NaN NaN NaN 0.0
2 1.0 1.0 NaN NaN 0.0 NaN NaN
3 NaN NaN NaN 0.0 0.0 1.0 1.0
4 1.0 NaN 1.0 1.0 NaN 1.0 1.0
Btw, if you need True/False alongside NaN, you can use the nullable boolean data type:
df = (df >= 0.5).astype(int).mask(df.isna()).astype('boolean')
print (df)
0 1 2 3 4 5 6
0 False True <NA> True <NA> True True
1 True False <NA> <NA> <NA> <NA> False
2 True True <NA> <NA> False <NA> <NA>
3 <NA> <NA> <NA> False False True True
4 True <NA> True True <NA> True True
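A shorter equivalent, assuming a pandas version with nullable dtypes (masking a nullable-boolean frame without an "other" argument fills with pd.NA directly):

out = (df >= 0.5).astype('boolean').mask(df.isna())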
Related
I want to fill all rows between two values by group. For each group, var1 has two values equal to 1, and I want to fill the missing rows between the two 1s. var1 is what I have, var2 is what I want, and var3 is what my code produces, which is not what I want (it differs from var2):
var1 group var2 var3
NaN 1 NaN NaN
NaN 1 NaN NaN
1 1 1 1
NaN 1 1 1
NaN 1 1 1
1 1 1 1
NaN 1 NaN 1
NaN 1 NaN 1
1 2 1 1
NaN 2 1 1
1 2 1 1
NaN 2 NaN 1
My code:
df['var3'] = df.groupby('group')['var1'].ffill()
Assuming the values are only 1 or NaN, you can groupby.ffill and groupby.bfill and only keep the values that are identical:
g = df.groupby('group')['var1']
s1 = g.ffill()
s2 = g.bfill()
df['var2'] = s1.where(s1.eq(s2))
Output:
var1 group var2
0 NaN 1 NaN
1 NaN 1 NaN
2 1.0 1 1.0
3 NaN 1 1.0
4 NaN 1 1.0
5 1.0 1 1.0
6 NaN 1 NaN
7 NaN 1 NaN
8 1.0 2 1.0
9 NaN 2 1.0
10 1.0 2 1.0
11 NaN 2 NaN
Intermediates:
var1 group var2 ffill bfill
0 NaN 1 NaN NaN 1.0
1 NaN 1 NaN NaN 1.0
2 1.0 1 1.0 1.0 1.0
3 NaN 1 1.0 1.0 1.0
4 NaN 1 1.0 1.0 1.0
5 1.0 1 1.0 1.0 1.0
6 NaN 1 NaN 1.0 NaN
7 NaN 1 NaN 1.0 NaN
8 1.0 2 1.0 1.0 1.0
9 NaN 2 1.0 1.0 1.0
10 1.0 2 1.0 1.0 1.0
11 NaN 2 NaN 1.0 NaN
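For reference, the intermediate columns above can be displayed with:

print(df.assign(ffill=s1, bfill=s2))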
I have the following panel dataset: "winner" = 1 if someone is a winner in a period (date), 0 if a loser.
ID date winner
A 2017Q4 NaN
A 2018Q4 1
A 2019Q4 0
A 2020Q4 0
A 2021Q4 1
B 2017Q4 NaN
B 2018Q4 1
B 2019Q4 1
B 2020Q4 0
B 2021Q4 0
C 2017Q4 NaN
C 2018Q4 0
C 2019Q4 0
C 2020Q4 0
C 2021Q4 0
D 2017Q4 NaN
D 2018Q4 0
D 2019Q4 1
D 2020Q4 1
D 2021Q4 1
I want to create four dummy variables: WW = 1 if someone is a winner in two consecutive periods; LL = 1 if a loser in two consecutive periods; WL = 1 if a winner in one period and a loser in the next; and LW vice versa.
UPDATE
When I apply the answers below, I get the following:
ID date winner WW LL WL LW
A 2017Q4 NaN
A 2018Q4 1 0 0 0 0
A 2019Q4 0 0 0 1 0
A 2020Q4 0 0 1 0 0
A 2021Q4 1 0 0 0 1
B 2017Q4 NaN
B 2018Q4 1 0 0 0 0
B 2019Q4 1 1 0 0 0
B 2020Q4 0 0 0 1 0
B 2021Q4 0 0 1 0 0
C 2017Q4 NaN
C 2018Q4 0 0 0 0 0
C 2019Q4 0 0 1 0 0
C 2020Q4 0 0 1 0 0
C 2021Q4 0 0 1 0 0
D 2017Q4 NaN
D 2018Q4 0 0 0 0 0
D 2019Q4 1 0 0 0 1
D 2020Q4 1 1 0 0 0
D 2021Q4 1 1 0 0 0
How do I make sure I get NaN when the previous value is NaN?
Desired output:
ID date winner WW LL WL LW
A 2017Q4 NaN
A 2018Q4 1 NaN NaN NaN NaN
A 2019Q4 0 0 0 1 0
A 2020Q4 0 0 1 0 0
A 2021Q4 1 0 0 0 1
B 2017Q4 NaN
B 2018Q4 1 NaN NaN NaN NaN
B 2019Q4 1 1 0 0 0
B 2020Q4 0 0 0 1 0
B 2021Q4 0 0 1 0 0
C 2017Q4 NaN
C 2018Q4 0 NaN NaN NaN NaN
C 2019Q4 0 0 1 0 0
C 2020Q4 0 0 1 0 0
C 2021Q4 0 0 1 0 0
D 2017Q4 NaN
D 2018Q4 0 NaN NaN NaN NaN
D 2019Q4 1 0 0 0 1
D 2020Q4 1 1 0 0 0
D 2021Q4 1 1 0 0 0
What is the simplest way to do this?
Here's one way: use groupby.shift to get the previous record, then use numpy.select to assign values, which get_dummies converts to dummy variables:
import numpy as np

df['previous'] = df.groupby('ID')['winner'].shift()
tmp = df[['previous', 'winner']]
dummy_vars = ['WW', 'LL', 'WL', 'LW']
out = (df.join(pd.get_dummies(np.select([tmp.eq(1).all(1),
                                         tmp.eq(0).all(1),
                                         tmp.eq([1, 0]).all(1),
                                         tmp.eq([0, 1]).all(1)],
                                        dummy_vars, ''))[dummy_vars + ['']]
               .mask(df['previous'].isna(), ''))
         .drop(columns=['previous', '']))
Output:
ID date winner WW LL WL LW
0 A 2018Q4 1
1 A 2019Q4 0 0 0 1 0
2 A 2020Q4 0 0 1 0 0
3 A 2021Q4 1 0 0 0 1
4 B 2018Q4 1
5 B 2019Q4 1 1 0 0 0
6 B 2020Q4 0 0 0 1 0
7 B 2021Q4 0 0 1 0 0
8 C 2018Q4 0
9 C 2019Q4 0 0 1 0 0
10 C 2020Q4 0 0 1 0 0
11 C 2021Q4 0 0 1 0 0
12 D 2018Q4 0
13 D 2019Q4 1 0 0 0 1
14 D 2020Q4 1 1 0 0 0
15 D 2021Q4 1 1 0 0 0
The idea:
1. map 1 and 0 to "W" and "L"
2. get the 2-period streak
3. get_dummies for the "streak"
4. join to the original DataFrame, ignoring the first row of each ID
wins = df["winner"].fillna(0).map({1:"W",0:"L"})
streaks = wins.shift() + wins
other = pd.get_dummies(streaks.where(df["ID"].eq(df["ID"].shift())))
output = df.join(other.where(df["ID"].duplicated()&df["winner"].shift().notna()))
>>> output
ID date winner LL LW WL WW
0 A 2017Q4 NaN NaN NaN NaN NaN
1 A 2018Q4 1.0 NaN NaN NaN NaN
2 A 2019Q4 0.0 0.0 0.0 1.0 0.0
3 A 2020Q4 0.0 1.0 0.0 0.0 0.0
4 A 2021Q4 1.0 0.0 1.0 0.0 0.0
5 B 2017Q4 NaN NaN NaN NaN NaN
6 B 2018Q4 1.0 NaN NaN NaN NaN
7 B 2019Q4 1.0 0.0 0.0 0.0 1.0
8 B 2020Q4 0.0 0.0 0.0 1.0 0.0
9 B 2021Q4 0.0 1.0 0.0 0.0 0.0
10 C 2017Q4 NaN NaN NaN NaN NaN
11 C 2018Q4 0.0 NaN NaN NaN NaN
12 C 2019Q4 0.0 1.0 0.0 0.0 0.0
13 C 2020Q4 0.0 1.0 0.0 0.0 0.0
14 C 2021Q4 0.0 1.0 0.0 0.0 0.0
15 D 2017Q4 NaN NaN NaN NaN NaN
16 D 2018Q4 0.0 NaN NaN NaN NaN
17 D 2019Q4 1.0 0.0 1.0 0.0 0.0
18 D 2020Q4 1.0 0.0 0.0 0.0 1.0
19 D 2021Q4 1.0 0.0 0.0 0.0 1.0
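For reference, a more explicit sketch that produces exactly the NaN rows requested in the update, assuming df holds the ID/date/winner columns shown:

prev = df.groupby('ID')['winner'].shift()
pairs = {'WW': (1, 1), 'LL': (0, 0), 'WL': (1, 0), 'LW': (0, 1)}
for name, (p, c) in pairs.items():
    hit = (prev == p) & (df['winner'] == c)
    # keep NaN where either period is unknown, per the desired output
    df[name] = hit.astype(int).mask(prev.isna() | df['winner'].isna())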
I am trying to find when a price crosses above a high. I can find the high, but when I compare it to the current price it gives me all 1s.
My code:
peak = df['price'][(df['price'] > df['price'].shift(-1)) & (df['price'] > df['price'].shift(1))]
df['peak'] = peak
df['breakout'] = df['price'] > df['peak']
print(df)
Out:
    price   peak  breakout
1       2    NaN         1
2       2    NaN         1
3       4    NaN         1
4       5    NaN         1
5       6    6.0         1
6       5    NaN         1
7       4    NaN         1
8       3    NaN         1
9      12   12.0         1
10     10    NaN         1
11     50    NaN         1
12    100    NaN         1
13    110  110.0         1
14     84    NaN         1
Expected:
    price   peak  high  breakout
1       2    NaN     0         0
2       2    NaN     0         0
3       4    NaN     0         0
4       5    NaN     0         0
5       6    6.0     1         1
6       5    NaN     0         0
7       4    NaN     0         0
8       3    NaN     0         0
9      12   12.0     1         1
10     10    NaN     0         0
11     50    NaN     0         1
12    100    NaN     0         1
13    110  110.0     1         1
14     84    NaN     0         0
With fillna:
price peak look breakout
0 2 NaN NaN False
1 4 NaN NaN False
2 5 NaN NaN False
3 6 6.0 6.0 False
4 5 NaN 6.0 False
5 4 NaN 6.0 False
6 3 NaN 6.0 False
7 12 12.0 12.0 False ----> this should be True because it is higher than 6 and it is also the high for shift(-1) and shift(1)
8 10 NaN 12.0 False
9 50 NaN 12.0 True
10 100 100.0 100.0 False
11 40 NaN 100.0 False
12 45 45.0 45.0 False
13 30 NaN 45.0 False
14 200 NaN 45.0 True
Try forward-filling the peaks with pandas.DataFrame.ffill:
df["breakout"] = df["price"] >= df["peak"].ffill()
If you want it as 1s and 0s, add the line:
df["breakout"] = df["breakout"].replace([True, False],[1,0])
Note that df["peak"].ffill() returns:
0 NaN
1 NaN
2 NaN
3 NaN
4 6.0
5 6.0
6 6.0
7 6.0
8 12.0
9 12.0
10 12.0
11 12.0
12 110.0
13 110.0
Name: peak, dtype: float64
So you can compare it easily with the price column.
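Putting it together, a minimal end-to-end sketch using the price column from the expected table above (with >= so that a price equal to the last peak counts as a breakout, per the row flagged in the update):

import pandas as pd

df = pd.DataFrame({"price": [2, 2, 4, 5, 6, 5, 4, 3, 12, 10, 50, 100, 110, 84]})
is_peak = (df["price"] > df["price"].shift(-1)) & (df["price"] > df["price"].shift(1))
df["peak"] = df["price"].where(is_peak)  # NaN everywhere except local highs
df["breakout"] = (df["price"] >= df["peak"].ffill()).astype(int)
print(df)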
I'm having a bit of trouble with this. My dataframe looks like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 nan nan
1 nan nan
2 nan 0
2 50 0
2 20 1
2 nan nan
2 nan nan
So, after dummy takes the value 1, I need to fill the amount variable with zeroes for each id, like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 0 nan
1 0 nan
2 nan 0
2 50 0
2 20 1
2 0 nan
2 0 nan
I'm guessing I'll need some combination of groupby('id'), fillna(method='ffill'), and maybe a .loc or a shift(), but everything I tried either had some problem or was very slow. Any suggestions?
The way I would do it:
s = df.groupby('id')['dummy'].ffill().eq(1)
df.loc[s & df.dummy.isna(), 'amount'] = 0
You can do this much more easily with a single .loc assignment:
data.loc[data['dummy'].isna(), 'amount'] = 0
This selects all the rows where dummy is NaN and fills the amount column with 0. (The chained form data[data['dummy'].isna()]['amount'] = 0 would assign to a copy and leave data unchanged.)
IIUC, ffill() and then zero out the still-NaN amounts:
s = df.groupby('id')['amount'].ffill().notnull()
df.loc[df['amount'].isna() & s, 'amount'] = 0
Output:
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
Could you please try the following:
df.loc[df['dummy'].isnull(), 'amount'] = 0
df
Output will be as follows.
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
I want to merge static data with time-varying data.
First dataframe
a_columns = pd.MultiIndex.from_product([["A","B","C"],["1","2"]])
a_index = pd.date_range("20100101","20110101",freq="BM")
a = pd.DataFrame(columns=a_columns,index=a_index)#A
Second dataframe
b_columns = ["3","4","5"]
b_index = ["A","B","C"]
b = pd.DataFrame(columns=b_columns,index=b_index)
How do I join these two? My desired dataframe has the same form as a, but with additional columns.
Thanks!
I think you need to reshape b with stack and then create a one-row DataFrame with to_frame. For concat you need a DatetimeIndex, so the new row's label comes from the first value of a's index.
Last, concat + sort_index:
#added some data - 2
a_columns = pd.MultiIndex.from_product([["A","B","C"],["1","2"]])
a_index = pd.date_range("20100101","20110101",freq="BM")
a = pd.DataFrame(2,columns=a_columns,index=a_index)#A
#added some data - 1
b_columns = ["3","4","5"]
b_index = ["A","B","C"]
b = pd.DataFrame(1,columns=b_columns,index=b_index)
c = b.stack().to_frame(a.index[0]).T
print (c)
A B C
3 4 5 3 4 5 3 4 5
2010-01-29 1 1 1 1 1 1 1 1 1
d = pd.concat([a,c], axis=1).sort_index(axis=1)
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-02-26 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-03-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-04-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-05-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-06-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-07-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-08-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-09-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-10-29 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-11-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-12-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
Last, if you need to replace the NaNs only in the added columns, forward-fill from the first row:
d[c.columns] = d[c.columns].ffill()
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-02-26 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-03-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-04-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-05-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-06-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-07-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-08-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-09-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-10-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-11-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-12-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
Similar solution with reindex:
c = b.stack().to_frame(a.index[0]).T.reindex(a.index, method='ffill')
print (c)
A B C
3 4 5 3 4 5 3 4 5
2010-01-29 1 1 1 1 1 1 1 1 1
2010-02-26 1 1 1 1 1 1 1 1 1
2010-03-31 1 1 1 1 1 1 1 1 1
2010-04-30 1 1 1 1 1 1 1 1 1
2010-05-31 1 1 1 1 1 1 1 1 1
2010-06-30 1 1 1 1 1 1 1 1 1
2010-07-30 1 1 1 1 1 1 1 1 1
2010-08-31 1 1 1 1 1 1 1 1 1
2010-09-30 1 1 1 1 1 1 1 1 1
2010-10-29 1 1 1 1 1 1 1 1 1
2010-11-30 1 1 1 1 1 1 1 1 1
2010-12-31 1 1 1 1 1 1 1 1 1
d = pd.concat([a,c], axis=1).sort_index(axis=1)
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-02-26 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-03-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-04-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-05-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-06-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-07-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-08-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-09-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-10-29 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-11-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-12-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
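An alternative sketch that skips the reindex/ffill step by tiling b's values across a's index (same a and b as above; np.tile is the only new piece):

import numpy as np

s = b.stack()
c = pd.DataFrame(np.tile(s.to_numpy(), (len(a), 1)), index=a.index, columns=s.index)
d = pd.concat([a, c], axis=1).sort_index(axis=1)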