How to apply a custom function with rolling on groups in pandas? - python

import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({'date': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                             'open': [4, 5, 3, 4, 5, 6, 7, 8, 9, 10],
                             'close': [4, 5, 6, 7, 8, 1, 2, 9, 10, 11],
                             'stock': ['A'] * 5 + ['B'] * 5})
df['flag'] = np.select([df['close']>df['open'],df['close']<df['open']],['up','down'],default='flat')
df
   date  open  close stock  flag
0     1     4      4     A  flat
1     2     5      5     A  flat
2     3     3      6     A    up
3     4     4      7     A    up
4     5     5      8     A    up
5     6     6      1     B  down
6     7     7      2     B  down
7     8     8      9     B    up
8     9     9     10     B    up
9    10    10     11     B    up
I tried the following. None of them work; they all give me a "No numeric types to aggregate" error:
# flag if the previous 3 days (t-2, t-1, and t) were all 'up', for each stock
df['3days_up'] = df.groupby('stock')['flag'].rolling(3).apply(lambda x: 'Yes' if all(x['flag']=='up') else 'No')
df['3days_up'] = df.groupby('stock')[['flag']].rolling(3).apply(lambda x: 'Yes' if all(x['flag']=='up') else 'No')
df['3days_up'] = df.groupby('stock').rolling(3).apply(lambda x: 'Yes' if all(x['flag']=='up') else 'No')
Expected output:
   date  open  close stock  flag 3days_up
0     1     4      4     A  flat       No
1     2     5      5     A  flat       No
2     3     3      6     A    up       No
3     4     4      7     A    up       No
4     5     5      8     A    up      Yes
5     6     6      1     B  down       No
6     7     7      2     B  down       No
7     8     8      9     B    up       No
8     9     9     10     B    up       No
9    10    10     11     B    up      Yes

Convert the up values to True and everything else to False as a starting point:
df['3days_up'] = np.where(df.assign(is_up=df['flag'] == 'up')
                            .groupby('stock').rolling(3)['is_up']
                            .sum() >= 3, 'Yes', 'No')
print(df)
# Output
date open close stock flag 3days_up
0 1 4 4 A flat No
1 2 5 5 A flat No
2 3 3 6 A up No
3 4 4 7 A up No
4 5 5 8 A up Yes
5 6 6 1 B down No
6 7 7 2 B down No
7 8 8 9 B up No
8 9 9 10 B up No
9 10 10 11 B up Yes
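As a variation on the same idea: rolling only aggregates numeric data (which is why the attempts on the string column raise "No numeric types to aggregate"), so you can also take the rolling minimum of the boolean column, since a window minimum of 1 means every value in the window was True. A minimal sketch, assuming the same df as above:
is_up = df['flag'].eq('up')
# per-stock rolling min is 1 only when all 3 values in the window are True;
# incomplete windows are NaN, which .eq(1) maps to False
all_up = is_up.groupby(df['stock']).rolling(3).min().reset_index(level=0, drop=True)
df['3days_up'] = np.where(all_up.eq(1), 'Yes', 'No')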

How to add multiindex columns to existing df, preserving original index

I start with:
df
0 1 2 3 4
0 5 0 0 2 6
1 9 6 5 8 6
2 8 9 4 2 1
3 2 5 8 9 6
4 8 8 8 0 8
and want to end up with:
df
0 1 2 3 4
A B C
1 2 0 5 0 0 2 6
1 9 6 5 8 6
2 8 9 4 2 1
3 2 5 8 9 6
4 8 8 8 0 8
where A and B are known after df creation, and C is the original
index of the df.
MWE:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df_a = 1
df_b = 2
breakpoint()
What I have in mind, but it gives an unhashable type error:
df.reindex([df_a, df_b, df.index])
Try with pd.MultiIndex.from_product:
df.index = pd.MultiIndex.from_product(
    [[df_a], [df_b], df.index], names=['A', 'B', 'C'])
df
Out[682]:
0 1 2 3 4
A B C
1 2 0 7 0 1 9 9
1 0 4 7 3 2
2 7 2 0 0 4
3 5 5 6 8 4
4 1 4 9 8 1
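An alternative one-liner, as a hedged sketch: pd.concat accepts tuple keys, which prepends both new levels while keeping the original index as the innermost level, and its names argument labels all three resulting levels:
# keys=[(df_a, df_b)] adds levels A and B; the original RangeIndex becomes level C
df = pd.concat([df], keys=[(df_a, df_b)], names=['A', 'B', 'C'])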

Calculate count of a column based on other column in python dataframe

I have a dataframe like the one below, holding patients' stays in the ICU (in hours), shown by the ICULOS column.
df # Main dataframe
dfy = df.copy()
dfy
P_ID  ICULOS  Count
   1       1      5
   1       2      5
   1       3      5
   1       4      5
   1       5      5
   2       1      9
   2       2      9
   2       3      9
   2       4      9
   2       5      9
   2       6      9
   2       7      9
   2       8      9
   2       9      9
   3       1      3
   3       2      3
   3       3      3
   4       1      7
   4       2      7
   4       3      7
   4       4      7
   4       5      7
   4       6      7
   4       7      7
I calculated their ICULOS count and placed it in a new column named Count using the code:
dfy['Count'] = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
Now, I want to remove those patients, based on P_ID, whose Count is less than 8 (note: I want to remove the whole patient record). After removing the patients with Count < 8, only P_ID = 2 will remain, since its count is 9.
The desired output:
P_ID  ICULOS  Count
   2       1      9
   2       2      9
   2       3      9
   2       4      9
   2       5      9
   2       6      9
   2       7      9
   2       8      9
   2       9      9
I tried the following code, but for some reason it is not working for me. It worked once, but when I re-ran it after a few days it returned an empty result. Can someone suggest a better approach? Thanks.
from tqdm.notebook import tqdm_notebook

dfy = dfy.drop_duplicates(subset=['P_ID'], keep='first')
lis1 = dfy['P_ID'].tolist()
Icu_less_8 = dfy.loc[dfy['Count'] < 8]
lis2 = Icu_less_8.P_ID.to_list()
lis_3 = [k for k in tqdm_notebook(lis1) if k not in lis2]
# removing those patients who have an ICULOS of less than 8 hours
df_1 = pd.DataFrame()
for l in tqdm_notebook(lis_3, desc='Progress'):
    df_1 = df_1.append(df.loc[df['P_ID'] == l])
You can filter the rows directly on the result of transform, using Series.ge:
In [1521]: dfy[dfy.groupby(['P_ID'])['ICULOS'].transform('count').ge(8)]
Out[1521]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
EDIT after OP's comment: For multiple conditions, do:
In [1533]: x = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
In [1539]: dfy.loc[x[x.ge(8) & x.le(72)].index]
Out[1539]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
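Equivalently, the two bounds can be expressed with Series.between and used as a plain boolean mask; a minimal sketch on the same dfy:
counts = dfy.groupby('P_ID')['ICULOS'].transform('count')
# keep patients whose row count is between 8 and 72, inclusive
out = dfy[counts.between(8, 72)]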

I want to generate a new column in a pandas dataframe, counting "edges" in another column

I have a dataframe looking like this:
A B....X
1 1 A
2 2 B
3 3 A
4 6 K
5 7 B
6 8 L
7 9 M
8 1 N
9 7 B
1 6 A
7 7 A
That is, some "rising edges" occur from time to time in column X (in this example the edge is X == 'B').
What I need is a new column Y which increments every time the value B occurs in X:
A B....X Y
1 1 A 0
2 2 B 1
3 3 A 1
4 6 K 1
5 7 B 2
6 8 L 2
7 9 M 2
8 1 N 2
9 7 B 3
1 6 A 3
7 7 A 3
In SQL I would use a trick like sum(case when X = 'B' then 1 else 0) over ... rows between first and previous. How can I do it in pandas?
Use cumsum
df['Y'] = (df.X == 'B').cumsum()
Out[8]:
A B X Y
0 1 1 A 0
1 2 2 B 1
2 3 3 A 1
3 4 6 K 1
4 5 7 B 2
5 6 8 L 2
6 7 9 M 2
7 8 1 N 2
8 9 7 B 3
9 1 6 A 3
10 7 7 A 3
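If consecutive B rows should count as a single edge (a true rising edge rather than every occurrence of B), a minimal sketch that only counts transitions into B:
is_b = df['X'].eq('B')
# increment only when X becomes 'B' after not being 'B' on the previous row
df['Y'] = (is_b & ~is_b.shift(fill_value=False)).cumsum()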

How to dynamically add values of some of the columns in a dataframe?

The dataframe is in the image (not reproduced here).
year = 2020 (the max year column)
lastFifthYear = year - 4
input = '2001509-00'
I want to add all the values between year (2020) and lastFifthYear (2016) where INPUT_PARTNO equals input, so for that input value I should get 4+6+2+3+2 (2016+2017+2018+2019+2020), i.e. 17.
Please give me some code.
Here is some code that should work, but you definitely need to improve the way you ask questions here :-)
Assuming df is the table you pasted as an image above:
>>> year = 2016
>>> df_new = df.query('INPUT_PARTNO == "2001509-00"').melt(
...     ['ACTUAL_GI_YEAR', 'INPUT_PARTNO'], var_name='year', value_name='number')
>>> df_new.year = df_new.year.astype(int)
>>> df_new[df_new.year >= year].groupby(['ACTUAL_GI_YEAR', 'INPUT_PARTNO']).agg({'number': sum})
number
ACTUAL_GI_YEAR INPUT_PARTNO
0 2001509-00 17
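A more direct sketch, assuming the year columns are literal labels 2016 through 2020 alongside an INPUT_PARTNO column (as in the image): select the last five year columns and sum them row-wise for the given part.
year = 2020
last_fifth_year = year - 4
# pick the columns whose labels parse as years inside the window
year_cols = [c for c in df.columns
             if str(c).isdigit() and last_fifth_year <= int(c) <= year]
total = df.loc[df['INPUT_PARTNO'] == '2001509-00', year_cols].sum(axis=1)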
Example Setup
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, (10, 10)),
                  columns=list('ab') + list(range(2, 10)))
Solved
# row-wise sum of the columns with labels between 3 and 6,
# only for the rows where a == 9
df['number'] = df.loc[df['a'].eq(9),
                      pd.to_numeric(df.columns, errors='coerce')
                        .to_series()
                        .between(3, 6)
                        .values].sum(axis=1)
print(df)
a b 2 3 4 5 6 7 8 9 number
0 1 9 9 2 6 0 6 1 4 2 NaN
1 2 3 4 8 7 2 4 0 0 6 NaN
2 2 2 7 4 9 6 7 1 0 0 NaN
3 0 3 5 3 0 4 2 7 2 6 NaN
4 7 7 1 4 7 7 9 7 4 2 NaN
5 9 9 9 0 3 3 3 8 7 7 9.0
6 9 0 5 5 7 9 6 6 5 7 27.0
7 2 1 9 1 9 3 3 4 4 9 NaN
8 4 0 5 9 6 7 3 9 1 6 NaN
9 5 5 0 8 6 4 5 4 7 4 NaN
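For clarity, the column selector in the snippet above coerces the column labels to numbers ('a' and 'b' become NaN) and tests which fall in the 3..6 range; a tiny sketch of just that piece:
# boolean mask over the columns: True for numeric labels between 3 and 6
col_mask = (pd.to_numeric(df.columns, errors='coerce')
              .to_series()
              .between(3, 6)
              .values)
print(df.columns[col_mask])  # the columns labeled 3, 4, 5 and 6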

Holding a first value in a column while another column equals a value?

I would like to hold the first value in a column while another column does not equal zero. In column B, values alternate between -1, 0, and 1. In column C, values can be any integer. The objective is to hold the first value of column C while column B equals zero. The current DataFrame is as follows:
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 8
5 5 0 9
6 6 0 1
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 8
11 5 0 9
12 6 0 10
The resulting DataFrame should be as follows:
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
You first need to create NaNs in column C by condition, and then fill the values forward with ffill:
# keep C only where B is nonzero now or was nonzero on the previous row,
# i.e. the first row of each zero-run keeps its original value
mask = df['B'].shift(fill_value=0).ne(0) | df['B'].ne(0)
df['C'] = df.loc[mask, 'C']
df['C'] = df['C'].ffill().astype(int)
print(df)
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
Or use where; if all the values are integers, add astype:
mask = df['B'].shift(fill_value=0).ne(0) | df['B'].ne(0)
df['C'] = df['C'].where(mask).ffill().astype(int)
print(df)
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
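The same result can also be reached by labeling each run of zeros in B explicitly and broadcasting the first C of the run; a minimal sketch, assuming the original df:
nonzero = df['B'].ne(0)
# a new run id starts every time B switches between zero and nonzero
run_id = nonzero.ne(nonzero.shift()).cumsum()
# where B == 0, replace C with the first C value of its run
df['C'] = df['C'].where(nonzero, df.groupby(run_id)['C'].transform('first'))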
