How can I rank based on a condition in Pandas - python

Suppose I have a Pandas DataFrame that looks like the one below:
Cluster  Variable  Group  Ratio  Value
1        GDP_M3    GDP    20%    70%
1        HPI_M6    HPI    40%    80%
1        GDP_lg2   GDP    35%    50%
2        CPI_M9    CPI    10%    50%
2        HPI_lg6   HPI    15%    65%
3        CPI_lg12  CPI    15%    90%
3        CPI_lg1   CPI    20%    95%
I would like to rank Variable based on Ratio and Value in separate columns: Ratio should be ranked from the lowest to the highest, while Value should be ranked from the highest to the lowest.
There are some variables that I do not want to rank. In this example CPI is not preferred, so any CPI variable (e.g. CPI_M9) should be excluded from the ranking. The exception is when a Cluster contains only CPI variables, in which case they should still be ranked.
The result of the conditions above should look like the table below:
Cluster  Variable  Group  Ratio  Value  RankRatio  RankValue
1        GDP_M3    GDP    20%    70%    1          2
1        HPI_M6    HPI    40%    80%    3          1
1        GDP_lg2   GDP    35%    50%    2          3
2        CPI_M9    CPI    10%    50%    NaN        NaN
2        HPI_lg6   HPI    15%    65%    1          1
3        CPI_lg12  CPI    15%    90%    1          2
3        CPI_lg1   CPI    20%    95%    2          1
For Cluster 1, GDP_M3 has the lowest Ratio at 20%, while HPI_M6 has the highest Value at 80%, so both are assigned rank 1 and the others follow.
For Cluster 2, CPI_M9 has the lowest Ratio, but CPI is not preferred, so rank 1 goes to HPI_lg6.
For Cluster 3, all variables belong to the CPI Group and there is no other option, so CPI_lg12 and CPI_lg1 are still ranked by the lowest Ratio and the highest Value.
df['RankRatio'] = df.groupby(['Cluster'])['Ratio'].rank(method = 'first', ascending = True)
df['RankValue'] = df.groupby(['Cluster'])['Value'].rank(method = 'first', ascending = False)
The code above handles the general case, but it cannot handle the specific case with a non-preferred group of variables.
Please help or suggest an approach. Thank you.

Use:
#convert columns to numeric
df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)
Remove CPI rows by condition - but only in Clusters that also contain non-CPI rows:
m = df['Group'].ne('CPI')
m1 = ~df['Cluster'].isin(df.loc[m, 'Cluster']) | m
df['RankRatio'] = df[m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
print (df)
Cluster Variable Group Ratio Value RankRatio RankValue
0 1 GDP_M3 GDP 20.0 70.0 1.0 2.0
1 1 HPI_M6 HPI 40.0 80.0 3.0 1.0
2 1 GDP_lg2 GDP 35.0 50.0 2.0 3.0
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN
4 2 HPI_lg6 HPI 15.0 65.0 1.0 1.0
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0
How it works:
mask1 (m) is True for every non-CPI row. For mask2, the Cluster values of those non-CPI rows are looked up against the original Cluster column with isin and the result is inverted with ~, so mask2 is True only for Clusters that contain nothing but CPI. Chaining both with | (bitwise OR) keeps all non-CPI rows plus the CPI rows in CPI-only Clusters, i.e. CPI is dropped only where the Cluster offers another Group:
print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[m, 'Cluster']), both = m1))
Cluster Variable Group Ratio Value mask1 mask2 both
0 1 GDP_M3 GDP 20.0 70.0 True False True
1 1 HPI_M6 HPI 40.0 80.0 True False True
2 1 GDP_lg2 GDP 35.0 50.0 True False True
3 2 CPI_M9 CPI 10.0 50.0 False False False
4 2 HPI_lg6 HPI 15.0 65.0 True False True
5 3 CPI_lg12 CPI 15.0 90.0 False True True
6 3 CPI_lg1 CPI 20.0 95.0 False True True
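If it helps, the same keep-mask can also be written with a groupby transform (a hedged sketch on the same df, not part of the original answer): flag Clusters that contain anything other than CPI and drop CPI rows only in those Clusters.
is_cpi = df['Group'].eq('CPI')
#True when the row's Cluster contains at least one non-CPI Group
has_non_cpi = (~is_cpi).groupby(df['Cluster']).transform('any')
#keep everything except CPI rows in mixed Clusters
keep = ~(is_cpi & has_non_cpi)
df['RankRatio'] = df[keep].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[keep].groupby('Cluster')['Value'].rank(method='first', ascending=False)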
EDIT: if both CPI and HPI should be excluded, except in Clusters that contain only a single Group:
df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)
m = df['Group'].isin(['CPI','HPI'])
m2 = df.groupby('Cluster')['Group'].transform('nunique').ne(1)
m1 = (~df['Cluster'].isin(df.loc[~m, 'Cluster']) | m) & m2
df['RankRatio'] = df[~m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[~m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
print (df)
Cluster Variable Group Ratio Value RankRatio RankValue
0 1 GDP_M3 GDP 20.0 70.0 1.0 1.0
1 1 HPI_M6 HPI 40.0 80.0 NaN NaN
2 1 GDP_lg2 GDP 35.0 50.0 2.0 2.0
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN
4 2 HPI_lg6 HPI 15.0 65.0 NaN NaN
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0
print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[~m, 'Cluster']), m2=m2, all = ~m1))
Cluster Variable Group Ratio Value RankRatio RankValue mask1 mask2 \
0 1 GDP_M3 GDP 20.0 70.0 1.0 1.0 False False
1 1 HPI_M6 HPI 40.0 80.0 NaN NaN True False
2 1 GDP_lg2 GDP 35.0 50.0 2.0 2.0 False False
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN True True
4 2 HPI_lg6 HPI 15.0 65.0 NaN NaN True True
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0 True True
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0 True True
m2 all
0 True True
1 True False
2 True True
3 True False
4 True False
5 False True
6 False True

Related

Subtract one column from another in pandas - with a condition

I have this code that will subtract, for each person (AAC or AAB), timepoint 1 from time point 2 data.
i.e this is the original data:
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 1 2.0 NaN 4.0
1 4 3 2.0 6.0 NaN
2 4 3 NaN 6.0 NaN
3 4 5 2.0 6.0 NaN
This is the code:
import sys
import numpy as np
from sklearn.metrics import auc
import pandas as pd
from numpy import trapz
#read in file
df = pd.DataFrame([[0,1,2,np.nan,4],[4,3,2,6,np.nan],[4,3,np.nan,6,np.nan],[4,5,2,6,np.nan]],columns=['pep_seq','AAC-T01','AAC-T02','AAB-T01','AAB-T02'])
#standardise the data by taking T0 away from each sample
df2 = df.drop(['pep_seq'],axis=1)
df2 = df2.apply(lambda x: x.sub(df2[x.name[:4]+"T01"]))
df2.insert(0,'pep_seq',df['pep_seq'])
print(df)
print(df2)
This is the output (i.e. df2)
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN NaN
1 4 0 -1.0 0.0 NaN
2 4 0 NaN 0.0 NaN
3 4 0 -3.0 0.0 NaN
...but what I actually wanted was to subtract the T01 columns from all the others EXCEPT for when the T01 value is NaN in which case keep the original value, so the desired output was (see the 4.0 in AAB-T02):
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN 4.0
1 4 0 -1.0 0 NaN
2 4 0 NaN 0 NaN
3 4 0 -3.0 0 NaN
Could someone show me where I went wrong? Note that in real life, there are ~100 timepoints per person, not just two.
You can fill the NaN with 0 when doing the subtraction:
df2 = df2.apply(lambda x: x.sub(df2[x.name[:4]+"T01"].fillna(0)))
# ^^^^ Changes here
df2.insert(0,'pep_seq',df['pep_seq'])
print(df2)
pep_seq AAC-T01 AAC-T02 AAB-T01 AAB-T02
0 0 0 1.0 NaN 4.0
1 4 0 -1.0 0.0 NaN
2 4 0 NaN 0.0 NaN
3 4 0 -3.0 0.0 NaN
I hope that I understand you correctly, but numpy.where() should do it for you.
Have a look here: condition based subtraction
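A minimal sketch of that numpy.where idea, assuming the same df as in the question: keep the original value wherever the person's T01 is NaN, otherwise subtract T01.
import numpy as np
import pandas as pd

df2 = df.drop(['pep_seq'], axis=1)
# keep x where the matching T01 column is NaN, otherwise subtract it
df2 = df2.apply(lambda x: pd.Series(
    np.where(df2[x.name[:4] + "T01"].isna(), x, x - df2[x.name[:4] + "T01"]),
    index=x.index))
df2.insert(0, 'pep_seq', df['pep_seq'])
print(df2)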

Python Pandas Groupby Alternative (Time Series Analysis)

Hi guys, I'm using pandas to conduct several time series analyses; here is the sample df:
data = {'ticker': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'],
        'high': [10.2,10.5,11,12,15,16,10.2,5,6,6.2,5.3,5.6,7.8,6],
        'low': [10,10.4,10.5,11,14,15,10,4.8,5.5,6,5,5,7.5,5.8]}
df = pd.DataFrame(data)
I want to compute the 5-day beta for these two stocks with the BETA function in TA-Lib and groupby in pandas; here is my code:
def beta(df):
    df['beta'] = talib.BETA(df.high, df.low, timeperiod=5)
    df['beta_above_1'] = np.where(df.beta > float(1), True, False)
    return df
df = df.groupby(['ticker']).apply(beta)
It works and returns this result:
ticker high low beta beta_above_1
0 A 10.2 10.0 NaN False
1 A 10.5 10.4 NaN False
2 A 11.0 10.5 NaN False
3 A 12.0 11.0 NaN False
4 A 15.0 14.0 NaN False
5 A 16.0 15.0 1.151536 True
6 A 10.2 10.0 0.952395 False
7 B 5.0 4.8 NaN False
8 B 6.0 5.5 NaN False
9 B 6.2 6.0 NaN False
10 B 5.3 5.0 NaN False
11 B 5.6 5.0 NaN False
12 B 7.8 7.5 1.182857 True
13 B 6.0 5.8 1.177912 True
However, it takes quite a long time when I apply this approach to a dataframe with over a million rows. I've researched vectorization to speed up the calculation but have no idea how to improve it.
May I know if there are other, faster ways to conduct the same analysis? Thanks a lot!
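Not a definitive answer, but one hedged sketch that may reduce the groupby overhead (assuming TA-Lib is installed and the column names above): return only the beta Series from the grouped apply instead of rebuilding the whole frame inside the function, and derive the flag column vectorised afterwards. Whether this is faster on millions of rows depends on your data.
import pandas as pd
import talib

def group_beta(g):
    # rolling 5-period beta for a single ticker, keeping the group's index
    return pd.Series(talib.BETA(g['high'].values, g['low'].values, timeperiod=5), index=g.index)

df['beta'] = df.groupby('ticker', group_keys=False).apply(group_beta)
df['beta_above_1'] = df['beta'] > 1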

Create a new column based on two others and conditionals

I have a two column data frame of the form:
Death HEALTH
0 other 0.0
1 other 1.0
2 vascular 0.0
3 other 0.0
4 other 0.0
5 vascular 0.0
6 NaN 0.0
7 NaN 0.0
8 NaN 0.0
9 vascular 1.0
I would like to create a new column following the steps:
wherever the value 'other' appears, write 'No'
wherever NaN appears, leave it as it is
wherever 'vascular' appears in the first column and 1.0 in the second, write 'Yes'
wherever 'vascular' appears in the first column and 0.0 in the second, write 'No'
The output should be:
Death HEALTH New
0 other 0.0 No
1 other 1.0 No
2 vascular 0.0 No
3 other 0.0 No
4 other 0.0 No
5 vascular 0.0 No
6 NaN 0.0 NaN
7 NaN 0.0 NaN
8 NaN 0.0 NaN
9 vascular 1.0 Yes
Is there a pythonic way to achieve this? I'm all lost between loops and conditionals.
You can create conditions for No and Yes, and for all other values keep the original value as the default in numpy.select:
m1 = df['Death'].eq('other') | (df['Death'].eq('vascular') & df['HEALTH'].eq(0))
m2 = (df['Death'].eq('vascular') & df['HEALTH'].eq(1))
df['new'] = np.select([m1, m2], ['No','Yes'], default=df['Death'])
Another idea is to also test for missing values, so if no condition matches, the original values are kept:
m1 = df['Death'].eq('other') | (df['Death'].eq('vascular') & df['HEALTH'].eq(0))
m2 = (df['Death'].eq('vascular') & df['HEALTH'].eq(1))
m3 = df['Death'].isna()
df['new'] = np.select([m1, m2, m3], ['No','Yes', np.nan], default=df['Death'])
print (df)
Death HEALTH new
0 another val 0.0 another val
1 other 1.0 No
2 vascular 0.0 No
3 other 0.0 No
4 other 0.0 No
5 vascular 0.0 No
6 NaN 0.0 NaN
7 NaN 0.0 NaN
8 NaN 0.0 NaN
9 vascular 1.0 Yes
(Row 0's Death appears to have been changed to 'another val' here to show the default keeping the original value.)
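A hedged compact variant of the same idea (assuming the df above): compute Yes/No with numpy.where first, then use mask so the rows with a missing Death keep a real NaN.
import numpy as np
import pandas as pd

yes = df['Death'].eq('vascular') & df['HEALTH'].eq(1)
df['New'] = pd.Series(np.where(yes, 'Yes', 'No'), index=df.index).mask(df['Death'].isna())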
A simple way to do this is to implement your conditional logic using if/else inside a function, and apply this function row-wise to the dataframe.
def function(row):
    if row['Death']=='other':
        return 'No'
    if row['Death']=='vascular':
        if row['Health']==1:
            return 'Yes'
        elif row['Health']==0:
            return 'No'
    return np.nan
# axis = 1 to apply it row-wise
df['New'] = df.apply(function, axis=1)
It produces the following output as required:
Death Health New
0 other 0 No
1 other 1 No
2 vascular 0 No
3 other 0 No
4 other 0 No
5 vascular 0 No
6 NaN 0 NaN
7 NaN 0 NaN
8 NaN 0 NaN
9 vascular 1 Yes

How to interpolate in Pandas using only previous values?

This is my dataframe:
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
id value
0 1 5
1 1 6
2 1 NaN
3 2 NaN
4 2 8
5 2 4
6 2 NaN
7 2 10
8 3 NaN
This is my expected output:
id value
0 1 5
1 1 6
2 1 7
3 2 NaN
4 2 8
5 2 4
6 2 2
7 2 10
8 3 NaN
This is my current output using this code:
df.value.interpolate(method="krogh")
0 5.000000
1 6.000000
2 9.071429
3 10.171429
4 8.000000
5 4.000000
6 2.357143
7 10.000000
8 36.600000
Basically, I want to do two important things here:
group by id, then interpolate using only the values above each row, not the values in the rows below it.
This should do the trick:
df["value_interp"]=df.value.combine_first(df.groupby("id")["value"].apply(lambda y: y.expanding().apply(lambda x: x.interpolate(method="krogh").to_numpy()[-1], raw=False)))
Outputs:
id value value_interp
0 1.0 5.0 5.0
1 1.0 6.0 6.0
2 1.0 NaN 7.0
3 2.0 NaN NaN
4 2.0 8.0 8.0
5 2.0 4.0 4.0
6 2.0 NaN 0.0
7 2.0 10.0 10.0
8 3.0 NaN NaN
(It interpolates based only on the previous values within the group - hence index 6 will return 0 not 2)
You can group by id and then loop over the groups to interpolate. For id = 2, interpolation will not give you the value 2:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
data = []
for name, group in df.groupby('id'):
    group_interpolation = group.interpolate(method='krogh', limit_direction='forward', axis=0)
    data.append(group_interpolation)
df = (pd.concat(data)).round(1)
Output:
id value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 4.7
7 2.0 10.0
8 3.0 NaN
The current pandas.Series.interpolate does not support what you want, so to achieve your goal you need two groupby's that account for your desire to use only previous rows. The idea is as follows: combine each missing value (!!!) into one group with the previous rows (this may have limitations if you have several missing values in a row, but it serves well for your toy example).
Suppose we have a df:
print(df)
ID Value
0 1 5.0
1 1 6.0
2 1 NaN
3 2 NaN
4 2 8.0
5 2 4.0
6 2 NaN
7 2 10.0
8 3 NaN
Then we will combine any missing values within a group with previous rows:
df["extrapolate"] = df.groupby("ID")["Value"].apply(lambda grp: grp.isnull().cumsum().shift().bfill())
print(df)
ID Value extrapolate
0 1 5.0 0.0
1 1 6.0 0.0
2 1 NaN 0.0
3 2 NaN 1.0
4 2 8.0 1.0
5 2 4.0 1.0
6 2 NaN 1.0
7 2 10.0 2.0
8 3 NaN NaN
You may see that, when grouped by ["ID", "extrapolate"], the missing value falls into the same group as the non-null values of the previous rows.
Now we are ready to do the extrapolation (with a spline of order=1):
df.groupby(["ID","extrapolate"], as_index=False).apply(lambda grp:grp.interpolate(method="spline",order=1)).drop("extrapolate", axis=1)
ID Value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 0.0
7 2.0 10.0
8 NaN NaN
Hope this helps.

Forward fill missing values by group after condition is met in pandas

I'm having a bit of trouble with this. My dataframe looks like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 nan nan
1 nan nan
2 nan 0
2 50 0
2 20 1
2 nan nan
2 nan nan
So, what I need to do is: after dummy takes the value 1, fill the amount variable with zeroes for each id, like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 0 nan
1 0 nan
2 nan 0
2 50 0
2 20 1
2 0 nan
2 0 nan
I'm guessing I'll need some combination of groupby('id'), fillna(method='ffill'), maybe a .loc or a shift(), but everything I tried has had some problem or is very slow. Any suggestions?
The way I would do it:
s = df.groupby('id')['dummy'].ffill().eq(1)
df.loc[s&df.dummy.isna(),'amount']=0
You can do this much more easily:
data.loc[data['dummy'].isna(), 'amount'] = 0
This selects all the rows where dummy is NaN and fills the amount column with 0 (use .loc rather than chained indexing so the assignment actually modifies the frame).
IIUC, ffill() and mask the still-nan:
s = df.groupby('id')['amount'].ffill().notnull()
df.loc[df['amount'].isna() & s, 'amount'] = 0
Output:
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
Could you please try the following.
df.loc[df['dummy'].isnull(),'amount']=0
df
Output will be as follows.
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
