I'm struggling with a simple pandas algo for stock market trading. Nothing serious or complicated, I just want to learn how to do it in python.
What I am trying to do is:
buy stocks when the signal turns True, and pay for them with cash (so cash goes down and the stock position goes up);
when the signal turns False, sell the stocks and add the proceeds to the cash.
But I can't get this to work. I could get it working with a loop, but that would be too slow. Any suggestions?
## data set
import pandas as pd

close = [21.02, 21.05, 21.10, 21.22, 22.17, 22.13, 22.07]
signal = [False, True, True, True, False, True, True]
data = {'close': close, 'signal': signal}
df = pd.DataFrame.from_dict(data)
df['cash'] = 1000
df['trade'] = 0
df['pos'] = 0
## if signal turns True, buy stocks
buysubset = ((df.signal==True) & (df.signal.shift(1)==False))
sellsubset = ((df.signal==False) & (df.signal.shift(1)==True))
df.loc[buysubset,'trade']=(df.cash/df.close).astype(int)
df.loc[buysubset,'cash']=df.cash-(df.trade*df.close)
df.loc[sellsubset,'trade']=-df.pos.shift(1)
## if previous row has position, keep the position if the signal is still True
df['pos']=df.trade.mask(df.signal & (df.trade == 0)).ffill().astype(int)
I get this as a result:
close signal cash trade pos
0 21.02 False 1000.00 0.0 0
1 21.05 True 10.65 47.0 47
2 21.10 True 1000.00 0.0 47
3 21.22 True 1000.00 0.0 47
4 22.17 False 1000.00 -0.0 0
5 23.34 True 4.15 45.0 45
But would like to get this :
close signal cash trade pos
0 21.02 False 1000.00 0.0 0
1 21.05 True 10.65 47.0 47
2 21.10 True 10.65 0.0 47
3 21.22 True 10.65 0.0 47
4 22.17 False 1052.62 -47.0 0
5 23.34 True 2.57 45.0 45
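Because the cash available for each buy depends on the proceeds of the previous sell, the amounts compound from trade to trade, so a fully vectorized single pass is awkward. One common compromise is to loop once per trade cycle instead of once per row. Below is a minimal sketch of that idea using the sample data above; it is my own sketch rather than a drop-in fix for the code in the question, and it assumes the signal starts False so buys and sells alternate.

import pandas as pd

close = [21.02, 21.05, 21.10, 21.22, 22.17, 22.13, 22.07]
signal = [False, True, True, True, False, True, True]
df = pd.DataFrame({'close': close, 'signal': signal})

# Rows where the signal flips False -> True (buy) and True -> False (sell).
buys = df.index[df['signal'] & ~df['signal'].shift(fill_value=False)]
sells = df.index[~df['signal'] & df['signal'].shift(fill_value=False)]

cash = 1000.0
df['cash'], df['trade'], df['pos'] = cash, 0, 0

# One iteration per buy/sell cycle, not per row.
for buy, sell in zip(buys, list(sells) + [None]):
    qty = int(cash // df.at[buy, 'close'])     # whole shares only
    cash -= qty * df.at[buy, 'close']
    df.loc[buy:, 'cash'] = cash
    df.loc[buy:, 'pos'] = qty
    df.loc[buy, 'trade'] = qty
    if sell is not None:                       # close the position on the sell row
        cash += qty * df.at[sell, 'close']
        df.loc[sell:, 'cash'] = cash
        df.loc[sell:, 'pos'] = 0
        df.loc[sell, 'trade'] = -qty

print(df)

The loop runs only as many times as there are trades, which is usually far fewer than the number of rows, so it stays fast even on long price histories.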
I have a dataframe with a list of events, a column for an indicator for a criterion, and a column for a timestamp.
For each event, if the indicator is true, I want to see if the event lasted more than one period, and for how long.
In terms of an expected output, I have provided an example below.
For the duration column: A is true for only one time period, so it is coded as 1. Then A is false for the next period, so that is coded as 0. Then A is true for two time periods, so the duration is 2; the next entry can be coded as 0 since I am only interested in the first entry, and so on.
id target time duration
0 A True 2023-01-22 11:00:00 1
3 A False 2023-01-22 11:05:00 0
6 A True 2023-01-22 11:10:00 2
9 A True 2023-01-22 11:15:00 0
12 A False 2023-01-22 11:20:00 0
But I have no idea how to do this.
A sample dataframe is included below
import pandas as pd
time_test = pd.DataFrame({'id':[
'A','B','C','A','B','C',
'A','B','C','A','B','C',
'A','B','C','A','B','C'],
'target':[
'True','True','True','False','True','True',
'True','False','True','True','True','True',
'False','True','False','True','False','True'],
'time':[
'11:00','11:00','11:00','11:05','11:05','11:05',
'11:10','11:10','11:10','11:15','11:15','11:15',
'11:20','11:20','11:20','11:25','11:25','11:25']})
time_test = time_test.sort_values(['id','time'])
time_test['time'] = pd.to_datetime(time_test['time'])
time_test
EDIT: I need to provide some clarification about the expected output
Let's take group B as an example. An event occurs for B at 11:00, indicated by the "True" under target. At 11:05 the event is still occurring, so the duration should be 2 for the row 1 B True 2023-01-22 11:00:00. I am not interested in the row following, so that can be coded as 0. So in a sense, 0 represents both "already accounted for" and the absence of an event.
At 11:10 that event is not occurring, so the count resets.
At 11:15 another event is occurring, and at 11:20 that event is still going, so the value for the first row should be 2.
In the end, the values for B should be 2,0,0,2,0,0.
I can see why this method would be confusing, but I hope my explanation makes sense. My data is in 5-minute chunks, so I figured I could just count the number of chunks to see how long an event lasted, instead of using a start and end time to calculate the elapsed time (but maybe that would be easier?)
Annotated code
# Convert the target column to boolean
mask = time_test['target'].eq('True')
# Create subgroups to identify blocks of consecutive True's
time_test['subgrps'] = (~mask).cumsum()[mask]
# Group the target mask by id and subgrps
g = mask.groupby([time_test['id'], time_test['subgrps']])
# Create a boolean mask to identify dupes per id and subgrps
dupes = time_test.duplicated(subset=['id', 'subgrps'])
# Sum the True value per group and mask the duplicates
time_test['duration'] = g.transform('sum').mask(dupes).fillna(0)
Result
id target time subgrps duration
0 A True 2023-01-22 11:00:00 0.0 1.0
3 A False 2023-01-22 11:05:00 NaN 0.0
6 A True 2023-01-22 11:10:00 1.0 2.0
9 A True 2023-01-22 11:15:00 1.0 0.0
12 A False 2023-01-22 11:20:00 NaN 0.0
15 A True 2023-01-22 11:25:00 2.0 1.0
1 B True 2023-01-22 11:00:00 2.0 2.0
4 B True 2023-01-22 11:05:00 2.0 0.0
7 B False 2023-01-22 11:10:00 NaN 0.0
10 B True 2023-01-22 11:15:00 3.0 2.0
13 B True 2023-01-22 11:20:00 3.0 0.0
16 B False 2023-01-22 11:25:00 NaN 0.0
2 C True 2023-01-22 11:00:00 4.0 4.0
5 C True 2023-01-22 11:05:00 4.0 0.0
8 C True 2023-01-22 11:10:00 4.0 0.0
11 C True 2023-01-22 11:15:00 4.0 0.0
14 C False 2023-01-22 11:20:00 NaN 0.0
17 C True 2023-01-22 11:25:00 5.0 1.0
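If the helper column is not wanted in the final frame, a small optional follow-up (assuming the annotated code above has already run, so duration contains no NaN) is to drop it and store duration as plain integers:

# Remove the helper column and cast duration back to int
time_test = time_test.drop(columns='subgrps')
time_test['duration'] = time_test['duration'].astype(int)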
I have a dataframe that looks like this (link to csv)
id time value approved
1 0:00 10 false
1 0:01 20 true
1 0:02 30 true
1 0:03 20 true
1 0:04 40 false
1 0:05 35 false
1 0:06 60 false
2 0:07 20 true
2 0:08 30 true
2 0:09 50 false
2 0:10 45 false
2 0:11 70 false
2 0:12 62 false
and I want to create two more columns that will keep the max approved values with a tolerance of 2 secs and the time of the respective max values. So I want it to look like this
id time value approved max_approved max_time
1 0:00 10 false NaN NaN
1 0:01 20 true 20 0:01
1 0:02 30 true 30 0:02
1 0:03 20 true 30 0:02
1 0:04 40 false 40 0:04
1 0:05 35 false 40 0:04
1 0:06 60 false 40 0:04
2 0:07 20 true 20 0:07
2 0:08 30 true 30 0:08
2 0:09 50 false 50 0:09
2 0:10 45 false 50 0:09
2 0:11 70 false 50 0:09
How can I do this? Thanks
The logic or output is not fully clear, but if I guess correctly, you can try:
df['td'] = pd.to_timedelta('0:'+df['time'])
df[['max_approved', 'max_time']] = (
    df.assign(value=df['value'].where(df['approved']),
              last_time=lambda d: d['td'].dt.total_seconds().where(df['approved']),
              )
      .set_index('td').groupby('id')[['value', 'last_time']]
      .apply(lambda s: s.rolling('2s').max().ffill())
      .to_numpy()
)
output:
id time value approved td max_approved max_time
0 1 0:00 10 False 0 days 00:00:00 NaN NaN
1 1 0:01 20 True 0 days 00:00:01 20.0 1.0
2 1 0:02 30 True 0 days 00:00:02 30.0 2.0
3 1 0:03 20 True 0 days 00:00:03 30.0 3.0
4 1 0:04 40 False 0 days 00:00:04 20.0 3.0
5 1 0:05 35 False 0 days 00:00:05 20.0 3.0
6 1 0:06 60 False 0 days 00:00:06 20.0 3.0
7 2 0:07 20 True 0 days 00:00:07 20.0 7.0
8 2 0:08 30 True 0 days 00:00:08 30.0 8.0
9 2 0:09 50 False 0 days 00:00:09 30.0 8.0
10 2 0:10 45 False 0 days 00:00:10 30.0 8.0
11 2 0:11 70 False 0 days 00:00:11 30.0 8.0
12 2 0:12 62 False 0 days 00:00:12 30.0 8.0
You could use iterrows to do so
max_value = 0
for index, row_data in df.iterrows():
    # your logic, e.g.
    if row_data.approved and row_data.value > max_value:
        max_value = row_data.value
    df.loc[index, 'max_approved'] = max_value
    ...
Does this help to get started?
If you want an exact solution, please provide code that builds the DataFrame (so we don't have to parse the data out of your question), or your code and where your problems are.
After a few days of research I managed to do it this way:
canBeTop = (df['approved'].rolling(window = 3, min_periods=1).max() == True)
df['max_approved'] = df.groupby(['id', canBeTop])['value'].transform('cummax').where(canBeTop).ffill()
df['max_time'] = df.where((canBeTop == True) & (df['value'] == df['max_approved']))['time']
df['max_time'] = df.groupby('id', group_keys=False)['max_time'].apply(lambda x: x.ffill())
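For comparison, the 2-second rule can also be spelled out literally with a per-id pass: remember the time of the most recent approved row and let a value update the running max only while it falls within the tolerance. This is only a sketch of my own reading of the rule (slower than the vectorized versions above); it rebuilds the sample data inline rather than reading the linked csv, and the function name is invented for illustration.

import pandas as pd

# Sample data from the question (times as m:ss strings)
df = pd.DataFrame({
    'id':       [1]*7 + [2]*6,
    'time':     ['0:00','0:01','0:02','0:03','0:04','0:05','0:06',
                 '0:07','0:08','0:09','0:10','0:11','0:12'],
    'value':    [10, 20, 30, 20, 40, 35, 60, 20, 30, 50, 45, 70, 62],
    'approved': [False, True, True, True, False, False, False,
                 True, True, False, False, False, False],
})
df['td'] = pd.to_timedelta('0:' + df['time'])   # m:ss -> Timedelta

def tolerant_cummax(g, tol=pd.Timedelta(seconds=2)):
    """Running max of values that are approved, or that fall within
    `tol` of the most recent approved row of the same id."""
    best_val, best_time, last_ok = float('nan'), None, None
    vals, times = [], []
    for _, row in g.iterrows():
        if row['approved']:
            last_ok = row['td']
        eligible = last_ok is not None and row['td'] - last_ok <= tol
        if eligible and not best_val >= row['value']:   # NaN compares False, so the first eligible value is taken
            best_val, best_time = row['value'], row['time']
        vals.append(best_val)
        times.append(best_time)
    return pd.DataFrame({'max_approved': vals, 'max_time': times}, index=g.index)

out = df.groupby('id', group_keys=False)[['time', 'value', 'approved', 'td']].apply(tolerant_cummax)
df = df.join(out)
print(df)

With the sample data this reproduces the expected table from the question, including the rows where an unapproved value within 2 seconds of the last approved row still becomes the new max.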
Suppose I have a Pandas DataFrame that looks like the one below:
Cluster  Variable  Group  Ratio  Value
1        GDP_M3    GDP    20%    70%
1        HPI_M6    HPI    40%    80%
1        GDP_lg2   GDP    35%    50%
2        CPI_M9    CPI    10%    50%
2        HPI_lg6   HPI    15%    65%
3        CPI_lg12  CPI    15%    90%
3        CPI_lg1   CPI    20%    95%
I would like to rank Variable based on Ratio and Value in two separate columns. Ratio should be ranked from lowest to highest, while Value should be ranked from highest to lowest.
There are some Groups that I do not want to rank. In this example, CPI is not preferred, so any CPI variable (e.g., CPI_M9) should be excluded from the ranking. The exception is when a Cluster contains nothing but that Group; in that case it is still ranked.
With these conditions, the result should look like the table below:
Cluster  Variable  Group  Ratio  Value  RankRatio  RankValue
1        GDP_M3    GDP    20%    70%    1          2
1        HPI_M6    HPI    40%    80%    3          1
1        GDP_lg2   GDP    35%    50%    2          3
2        CPI_M9    CPI    10%    50%    NaN        NaN
2        HPI_lg6   HPI    15%    65%    1          1
3        CPI_lg12  CPI    15%    90%    1          2
3        CPI_lg1   CPI    20%    95%    2          1
For Cluster 1, GDP_M3 has the lowest Ratio at 20%, while HPI_M6 has the highest Value at 80%. Thus both are assigned rank 1 and the others follow.
For Cluster 2, even though CPI_M9 has the lowest Ratio, CPI is not preferred, so rank 1 is assigned to HPI_lg6.
For Cluster 3, there are only variables from the CPI Group and no other options to rank, so CPI_lg12 and CPI_lg1 are ranked by lowest Ratio and highest Value.
df['RankRatio'] = df.groupby(['Cluster'])['Ratio'].rank(method = 'first', ascending = True)
df['RankValue'] = df.groupby(['Cluster'])['Value'].rank(method = 'first', ascending = False)
My code handles the general case, but it cannot handle the specific case with a non-preferred group of variables.
Please help or suggest something. Thank you.
Use:
#convert columns to numeric
df[['Ratio','Value']]=df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)
Exclude CPI rows by condition, but only for Clusters that contain something other than CPI:
m = df['Group'].eq('CPI')
m1 = ~df['Cluster'].isin(df.loc[~m, 'Cluster']) | ~m
df['RankRatio'] = df[m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
print (df)
Cluster Variable Group Ratio Value RankRatio RankValue
0 1 GDP_M3 GDP 20.0 70.0 1.0 2.0
1 1 HPI_M6 HPI 40.0 80.0 3.0 1.0
2 1 GDP_lg2 GDP 35.0 50.0 2.0 3.0
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN
4 2 HPI_lg6 HPI 15.0 65.0 1.0 1.0
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0
How it works:
mask2 tests whether a row's Cluster contains only CPI: the Cluster column is matched against the Clusters of the non-CPI rows selected by ~m, and the result is inverted with ~. This is then chained with ~m (rows that are not CPI) by | for bitwise OR, so a row is kept if it is not CPI, or if its Cluster holds nothing but CPI:
print (df.assign(mask1 = ~m, mask2 = ~df['Cluster'].isin(df.loc[~m, 'Cluster']), both = m1))
   Cluster  Variable Group  Ratio  Value  mask1  mask2   both
0        1    GDP_M3   GDP   20.0   70.0   True  False   True
1        1    HPI_M6   HPI   40.0   80.0   True  False   True
2        1   GDP_lg2   GDP   35.0   50.0   True  False   True
3        2    CPI_M9   CPI   10.0   50.0  False  False  False
4        2   HPI_lg6   HPI   15.0   65.0   True  False   True
5        3  CPI_lg12   CPI   15.0   90.0  False   True   True
6        3   CPI_lg1   CPI   20.0   95.0  False   True   True
EDIT: If more Groups are non-preferred (e.g. both CPI and HPI), extend the mask and additionally test whether a Cluster contains more than one Group, so non-preferred rows are only ranked when their Cluster holds a single Group:
df[['Ratio','Value']] = df[['Ratio','Value']].apply(lambda x: x.str.strip('%')).astype(float)
# non-preferred Groups
m = df['Group'].isin(['CPI','HPI'])
# Clusters that contain more than one Group
m2 = df.groupby('Cluster')['Group'].transform('nunique').ne(1)
# rows excluded from the ranking
m1 = (~df['Cluster'].isin(df.loc[~m, 'Cluster']) | m) & m2
df['RankRatio'] = df[~m1].groupby('Cluster')['Ratio'].rank(method='first', ascending=True)
df['RankValue'] = df[~m1].groupby('Cluster')['Value'].rank(method='first', ascending=False)
print (df)
Cluster Variable Group Ratio Value RankRatio RankValue
0 1 GDP_M3 GDP 20.0 70.0 1.0 1.0
1 1 HPI_M6 HPI 40.0 80.0 NaN NaN
2 1 GDP_lg2 GDP 35.0 50.0 2.0 2.0
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN
4 2 HPI_lg6 HPI 15.0 65.0 NaN NaN
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0
print (df.assign(mask1 = m, mask2 = ~df['Cluster'].isin(df.loc[~m, 'Cluster']), m2=m2, all = ~m1))
Cluster Variable Group Ratio Value RankRatio RankValue mask1 mask2 \
0 1 GDP_M3 GDP 20.0 70.0 1.0 1.0 False False
1 1 HPI_M6 HPI 40.0 80.0 NaN NaN True False
2 1 GDP_lg2 GDP 35.0 50.0 2.0 2.0 False False
3 2 CPI_M9 CPI 10.0 50.0 NaN NaN True True
4 2 HPI_lg6 HPI 15.0 65.0 NaN NaN True True
5 3 CPI_lg12 CPI 15.0 90.0 1.0 2.0 True True
6 3 CPI_lg1 CPI 20.0 95.0 2.0 1.0 True True
m2 all
0 True True
1 True False
2 True True
3 True False
4 True False
5 False True
6 False True
I am scraping multiple tables from multiple pages of a website. The issue is there is a row missing from the initial table. Basically, this is how the dataframe looks.
First page:

             mar2018  feb2018  jan2018  dec2017  nov2017
balls faced      345      561      295        0      645
runs scored      156      281      183        0      389
strike rate     52.3     42.6     61.1        0     52.2
dot balls        223      387      173        0      476
fours              8       12       19        0       22
doubles           20       38       16        0       36
notout             2        0        0        0        4

Subsequent pages:

             oct2017  sep2017  aug2017
balls faced      200       58        0
runs scored       50       20        0
strike rate       25       34        0
dot balls        125       34        0
sixes              2        0        0
fours              4        2        0
doubles            2        0        0
notout             4        2        0
The 'sixes' row is missing from the first page's table but present in the subsequent pages. So I am trying to move the rows from 'fours' to 'notout' down one position and leave NaNs in the 'sixes' position for the first five columns (mar2018 to nov2017).
I tried the following code but it isn't working. This is moving the values horizontally but not vertically downward.
df.iloc[4][0:6] = df.iloc[4][0:6].shift(1)
and also
df2 = pd.DataFrame(index = 4)
df = pd.concat([df.iloc[:], df2, df.iloc[4:]]).reset_index(drop=True)
did not work.
df['mar2018'] = df['mar2018'].shift(1)
But this moves all the values of that column down by 1 row.
So, I was wondering if it is possible to shift down rows of specific columns from a specific index?
I think you need to reindex both DataFrames by the union of all index values, using numpy.union1d:
import numpy as np

idx = np.union1d(df1.index, df2.index)
df1 = df1.reindex(idx)
df2 = df2.reindex(idx)
print (df1)
mar2018 feb2018 jan2018 dec2017 nov2017
balls faced 345.0 561.0 295.0 0.0 645.0
dot balls 223.0 387.0 173.0 0.0 476.0
doubles 20.0 38.0 16.0 0.0 36.0
fours 8.0 12.0 19.0 0.0 22.0
notout 2.0 0.0 0.0 0.0 4.0
runs scored 156.0 281.0 183.0 0.0 389.0
sixes NaN NaN NaN NaN NaN
strike rate 52.3 42.6 61.1 0.0 52.2
print (df2)
oct2017 sep2017 aug2017
balls faced 200 58 0
dot balls 125 34 0
doubles 2 0 0
fours 4 2 0
notout 4 2 0
runs scored 50 20 0
sixes 2 0 0
strike rate 25 34 0
If there are multiple DataFrames in a list, you can use a list comprehension:
from functools import reduce
dfs = [df1, df2]
idx = reduce(np.union1d, [x.index for x in dfs])
dfs1 = [df.reindex(idx) for df in dfs]
print (dfs1)
[ mar2018 feb2018 jan2018 dec2017 nov2017
balls faced 345.0 561.0 295.0 0.0 645.0
dot balls 223.0 387.0 173.0 0.0 476.0
doubles 20.0 38.0 16.0 0.0 36.0
fours 8.0 12.0 19.0 0.0 22.0
notout 2.0 0.0 0.0 0.0 4.0
runs scored 156.0 281.0 183.0 0.0 389.0
sixes NaN NaN NaN NaN NaN
strike rate 52.3 42.6 61.1 0.0 52.2, oct2017 sep2017 aug2017
balls faced 200 58 0
dot balls 125 34 0
doubles 2 0 0
fours 4 2 0
notout 4 2 0
runs scored 50 20 0
sixes 2 0 0
strike rate 25 34 0]
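The union index from np.union1d comes back sorted alphabetically, as shown above. If the original row order should be kept instead, one option (a small sketch, assuming df2 already contains every row label, including 'sixes') is to reindex the incomplete frame against the complete one:

# Align the first-page table to the full row order of the later pages
df1 = df1.reindex(df2.index)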
Say I have a data frame like this:
Open Close Split
144 144 False
142 143 False
... ... ...
138 139 False
72 73 True
72 74 False
75 76 False
... ... ...
79 78 False
Obviously the dataframe can be quite large, and may contain other columns, but this is the core.
My end goal is to adjust all of the data to account for the split, and so far I've been able to identify the split (that's the "Split" column).
Now, I'm looking for an elegant way to divide everything before the split by 2, or multiply everything after the split by 2.
I thought the best way might be to spread the True down towards the bottom, and then multiply all rows that contain a True in the "Split" column, but is there a more Pythonic way to do it?
Assuming Split is the only boolean column, and that everything else is numeric in nature, you can just take the cumsum and set values with loc accordingly -
m = df.pop('Split').cumsum()
df.loc[m.eq(0)] /= 2 # division before split
df.loc[m.eq(1)] *= 2 # multiplication after split
df
Open Close
0 72.0 72.0
1 71.0 71.5
2 69.0 69.5
3 144.0 146.0
4 144.0 148.0
5 150.0 152.0
6 158.0 156.0
This is by far the most performant option. Another possible option involves np.where -
df[:] = np.where(m.eq(0).to_numpy()[:, None], df / 2, df * 2)
df
Open Close
0 72.0 72.0
1 71.0 71.5
2 69.0 69.5
3 144.0 146.0
4 144.0 148.0
5 150.0 152.0
6 158.0 156.0
Or, with df.where/df.mask -
(df / 2).where(m.eq(0), df * 2)
Or, equivalently,
(df / 2).mask(m.ne(0), df * 2)
    Open  Close
0   72.0   72.0
1   71.0   71.5
2   69.0   69.5
3  144.0  146.0
4  144.0  148.0
5  150.0  152.0
6  158.0  156.0
These are nowhere near as efficient as the indexing option with loc, because they involve a lot of redundant computation.
Another cumsum-based solution:
columns = ['Open','Close']
df[columns] = df[columns].mul(df.Split.cumsum() + 1, axis=0)
# Open Close Split
#0 144 144 False
#1 142 143 False
#2 138 139 False
#3 144 146 True
#4 144 148 False
#5 150 152 False
#6 158 156 False
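The cumsum approach above handles a single split; if several 2-for-1 splits can appear in the same frame, the factor can be compounded instead of just adding 1. A minimal sketch, assuming every True in Split marks a 2:1 split:

import pandas as pd

df = pd.DataFrame({
    'Open':  [144, 142, 138, 72, 72, 75, 79],
    'Close': [144, 143, 139, 73, 74, 76, 78],
    'Split': [False, False, False, True, False, False, False],
})

# Each split seen so far doubles the adjustment applied to that row and later rows.
factor = 2 ** df['Split'].cumsum()
df[['Open', 'Close']] = df[['Open', 'Close']].mul(factor, axis=0)

With a single split this reduces to the same result as multiplying by cumsum() + 1.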
# Find the index of the first row where Split is True ...
split_true = df[df['Split'] == True].index[0]
# ... and select every row from the split onwards (e.g. to multiply by 2)
df.iloc[split_true:, :]