How to find the maximum date value with conditions in Python?

I have a three-column dataframe, shown below. I want to calculate each fund's three-month return for every day, so for each row I need to find the most recent date, at least three months earlier, that has recorded NAV data. Should I use the max() function with filter() to solve this? If so, how? If not, could you please suggest a better way?
fund code   date         NAV
fund 1      2021-01-04   1.0000
fund 1      2021-01-05   1.0001
fund 1      2021-01-06   1.0023
...         ...          ...
fund 2      2020-02-08   1.0000
fund 2      2020-02-09   0.9998
fund 2      2020-02-10   1.0001
...         ...          ...
fund 3      2022-05-04   2.0021
fund 3      2022-05-05   2.0044
fund 3      2022-05-06   2.0305
I tried to combine the max() function with filter() as follows:
max(filter(lambda x: x<=df['date']-timedelta(days=91)))
But it didn't work.
Were this in Excel, I know I could use the following array formulas to solve the problem:
{max(if(B:B<=B2-91,B:B))}
{max(if(B:B<=B3-91,B:B))}
{max(if(B:B<=B4-91,B:B))}
....
But with Python, I don't know what to do; I only started learning it three days ago. Please help me.
The picture (not shown here) is what I want if it were in Excel: the yellow area is the original data, the white part is the intermediate step I need for the calculation, and the red part is the result I want. To get this result, I need to divide the 3rd column by the 5th column.
I know that I could use pct_change(periods=7) to get the same results as in that picture. But here is the tricky part: the row 7 positions earlier is not necessarily the data from 7 days before, and not all funds are recorded daily; some are recorded weekly, some monthly. So I first need to check whether the data used for the division exists.

What you need is an implementation of a sliding-window maximum (for your example, a 1-week / 7-day window).
I recreated your example as follows (to build the dataframe you have):
import pandas as pd
import datetime
from random import randint

# build the sample frame; DataFrame.append is deprecated, so collect rows in a list first
date = datetime.datetime.strptime("2021-01-04", '%Y-%m-%d')
rows = []
for i in range(10):
    rows.append({"fund code": "fund 1", "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": "fund 1", "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
for i in range(20, 25):
    rows.append({"fund code": "fund 2", "date": date + datetime.timedelta(i), "NAV": randint(0, 10)})
df = pd.DataFrame(rows, columns=["fund code", "date", "NAV"])
This will look like your example, with non-continuous dates and two different funds.
The sliding-window maximum (for a variable window length in days) looks like this:
from collections import deque

class max_queue:
    def __init__(self, win=7):
        self.win = win
        self.queue = deque()
        self.date = None

    def append(self, date, value):
        # values smaller than the new one can never be the window max: drop them
        while self.queue and value > self.queue[-1][1]:
            self.queue.pop()
        # drop entries that have fallen out of the time window
        while self.queue and date - self.queue[0][0] >= datetime.timedelta(self.win):
            self.queue.popleft()
        self.queue.append((date, value))
        self.date = date

    def get_max(self):
        return self.queue[0][1]
Now you can simply iterate over the rows and get the max value in the timeframe you are interested in:
mq = max_queue(7)
pre_code = ''
for idx, row in df.iterrows():
    code, date, nav, *_ = row
    if code != pre_code:
        mq = max_queue(7)  # reset the window when a new fund starts
        pre_code = code
    mq.append(date, nav)
    df.at[idx, 'max'] = mq.get_max()
The result will look like this, with an added max column. This assumes each fund's rows are contiguous in the frame, but you could just as well modify it to keep a separate max_queue per fund.
Using a max queue to track only the maximum inside the window gives the correct complexity, O(n). That matters if you are dealing with huge datasets, and especially with wider date ranges (months instead of a week).
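If you would rather stay vectorized in pandas, pd.merge_asof can do the backward-looking lookup for you: for each row it finds the most recent earlier row within the same fund. A minimal sketch, assuming the column names from the question; cutoff, past_date, past_NAV and ret_3m are names made up for illustration:
import pandas as pd

# for each (fund, date), find the latest row whose date is <= date - 91 days
lookup = df.rename(columns={'date': 'past_date', 'NAV': 'past_NAV'})
merged = pd.merge_asof(
    df.assign(cutoff=df['date'] - pd.Timedelta(days=91)).sort_values('cutoff'),
    lookup.sort_values('past_date'),
    left_on='cutoff', right_on='past_date',
    by='fund code',        # only match rows of the same fund
    direction='backward')  # latest past_date at or before the cutoff
merged['ret_3m'] = merged['NAV'] / merged['past_NAV'] - 1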

Related

Improve running time when using inflation method on pandas

I'm trying to get real (inflation-adjusted) prices for my data in pandas. Right now I am just playing with one year's worth of data (3,962,050 rows), and it took 443 seconds to inflate the values using the code below. Is there a quicker way to find the real value? Is it possible to use pooling? I have many more years, and it would take too long to wait every time.
Portion of df:
   year  quarter    fare
0  1994        1  213.98
1  1994        1  214.00
2  1994        1  214.00
3  1994        1  214.50
4  1994        1  214.50
import time
import cpi
import pandas as pd

def inflate_column(data, column):
    """
    Adjust the series of values in `column` of the dataframe
    `data` for inflation, using the cpi library.
    """
    print('Beginning to inflate ' + column)
    start_time = time.time()
    df = data.apply(lambda x: cpi.inflate(x[column], x.year), axis=1)
    print("Inflating process took", time.time() - start_time, "seconds to run")
    return df

df['real_fare'] = inflate_column(df, 'fare')
You have many rows for each year, so you can call cpi.inflate just once per distinct year, store the factors in a dict, and then look the value up instead of calling cpi.inflate every time.
all_years = df["year"].unique()
dict_years = {}
for year in all_years:
    dict_years[year] = cpi.inflate(1.0, year)
df['real_fare'] = # apply here: dict_years[row['year']]*row['fare']
You can fill in the last line using apply, or do it some other way, e.g. df['real_fare'] = df['fare'] * ...
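For instance, a minimal way to fill in that last line without apply (a sketch; since cpi.inflate is linear in the amount, fare * inflate(1.0, year) equals inflate(fare, year)):
df['real_fare'] = df['year'].map(dict_years) * df['fare']  # look up each row's cached factor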

Make a function to return length based on a condition

I have 2 DataFrames. One contains stock tickers and a maximum/minimum price range, along with other columns.
The other DataFrame has dates as its index and is grouped by ticker, with various metrics like open, close, high, low, etc. I want to count the days on which, for a given stock, the close price was higher than the minimum price.
For example, I want to find how many days AMZN traded below the period max price. In general, I want a count of days from the second dataframe, based on values from the first, where the closing price was less than the period max price or greater than the period min price.
I have added the code and data below to reproduce the DataFrames.
import pandas as pd
import datetime
from dateutil.relativedelta import relativedelta
import yfinance as yf

start = datetime.datetime.today() - relativedelta(years=2)
end = datetime.datetime.today()
us_stock_list = 'FB AMZN BABA'
data_metric = yf.download(us_stock_list, start=start, end=end, group_by='column', auto_adjust=True)
data_ticker = yf.download(us_stock_list, start=start, end=end, group_by='ticker', auto_adjust=True)
stock_list = [stock for stock in data_ticker.stack()]
# max price
max_values = pd.DataFrame(data_ticker.max().unstack()['High'])
# min price
min_values = pd.DataFrame(data_ticker.min().unstack()['Low'])
# latest price
latest_day = pd.DataFrame(data_ticker.tail(1).unstack())
latest_day = latest_day.unstack().unstack().unstack().reset_index()
# latest_day = latest_day.unstack().reset_index()
latest_day = latest_day.drop(columns=['level_0', 'Date'])
latest_day.set_index('level_3', inplace=True)
latest_day.rename(columns={0: 'Values'}, inplace=True)
latest_day = latest_day.groupby(by=['level_3', 'level_2']).max().unstack()
latest_day.columns = ['_'.join(x) for x in latest_day.columns]
latest_day = latest_day.join(max_values, how='inner')
latest_day = latest_day.join(min_values, how='inner')
latest_day.rename(columns={'High': 'Period_High_Max', 'Low': 'Period_Low_Min'}, inplace=True)
close_price_data = pd.DataFrame(data_metric['Close'].unstack().reset_index())
close_price_data = close_price_data.rename(columns={'level_0': 'Stock', 0: 'Close_price'})
close_price_data.set_index('Stock', inplace=True)
Use this to reproduce:
{"Values_Close":{"AMZN":2286.0400390625,"BABA":194.4799957275,"FB":202.2700042725},"Values_High":{"AMZN":2362.4399414062,"BABA":197.3800048828,"FB":207.2799987793},"Values_Low":{"AMZN":2258.1899414062,"BABA":192.8600006104,"FB":199.0500030518},"Values_Open":{"AMZN":2336.8000488281,"BABA":195.75,"FB":201.6000061035},"Values_Volume":{"AMZN":9754900.0,"BABA":22268800.0,"FB":30399600.0},"Period_High_Max":{"AMZN":2475.0,"BABA":231.1399993896,"FB":224.1999969482},"Period_Low_Min":{"AMZN":1307.0,"BABA":129.7700042725,"FB":123.0199966431},"%_Position":{"AMZN":0.8382192115,"BABA":0.6383544892,"FB":0.7832576338}}
{"Stock":{
"0":"AMZN",
"1":"AMZN",
"2":"AMZN",
"3":"AMZN",
"4":"AMZN",
"5":"AMZN",
"6":"AMZN",
"7":"AMZN",
"8":"AMZN",
"9":"AMZN",
"10":"AMZN",
"11":"AMZN",
"12":"AMZN",
"13":"AMZN",
"14":"AMZN",
"15":"AMZN",
"16":"AMZN",
"17":"AMZN",
"18":"AMZN",
"19":"AMZN"},
"Date":{
"0":1525305600000,
"1":1525392000000,
"2":1525651200000,
"3":1525737600000,
"4":1525824000000,
"5":1525910400000,
"6":1525996800000,
"7":1526256000000,
"8":1526342400000,
"9":1526428800000,
"10":1526515200000,
"11":1526601600000,
"12":1526860800000,
"13":1526947200000,
"14":1527033600000,
"15":1527120000000,
"16":1527206400000,
"17":1527552000000,
"18":1527638400000,
"19":1527724800000 },
"Close_price":{
"0":1572.0799560547,
"1":1580.9499511719,
"2":1600.1400146484,
"3":1592.3900146484,
"4":1608.0,
"5":1609.0799560547,
"6":1602.9100341797,
"7":1601.5400390625,
"8":1576.1199951172,
"9":1587.2800292969,
"10":1581.7600097656,
"11":1574.3699951172,
"12":1585.4599609375,
"13":1581.4000244141,
"14":1601.8599853516,
"15":1603.0699462891,
"16":1610.1500244141,
"17":1612.8699951172,
"18":1624.8900146484,
"19":1629.6199951172}}
Do a merge between both dataframes, group by company (index level 0), and apply a custom function:
df_merge = close_price_data.merge(
    latest_day[['Period_High_Max', 'Period_Low_Min']],
    left_index=True,
    right_index=True)

def fun(df):
    d = {}
    d['days_above_min'] = (df.Close_price > df.Period_Low_Min).sum()
    d['days_below_max'] = (df.Close_price < df.Period_High_Max).sum()
    return pd.Series(d)

df_merge.groupby(level=0).apply(fun)
Note that Period_Low_Min and Period_High_Max are the period min and max respectively, so every closing price will fall inside that range; if this is not what you are trying to accomplish, let me know.

I need help making pandas perform better with dataframe interactions

I'm a newbie and have been studying pandas for a few days, and started my first project with it. I wanted to use it to create a product stock prediction timeline for the current month.
Basically I get the stock and predicted daily reduction and trace a line from today to the end of the month with the predicted stock. Also, if there is a purchase order to be delivered on day XYZ, I add the delivery amount on that day.
I have a dataframe that contains the stock for today and the predicted daily deduction for this month:
ITEM  STOCK  DAILY_DEDUCTION
A      1000               20
B      2000               15
C       800                8
D     10000              100
And another dataframe that contains pending purchase orders and the amounts that will be delivered:
ITEM  DATE        RECEIVING_AMOUNT
A     2018-05-16                20
B     2018-05-23                15
A     2018-05-17                 8
D     2018-05-29               100
I created this loop to iterate through the dataframe and do the following:
subtract the DAILY_DEDUCTION for the item
if the date is the same as a purchase order date, then add the RECEIVING_AMOUNT
df_dates = pd.date_range(start=today, end=endofmonth, freq='D')
temptable = []
for row in df_stock.itertuples(index=True):
    predicted_stock = getattr(row, "STOCK")
    item = getattr(row, "ITEM")
    for date in df_dates:
        date_format = date.strftime('%Y-%m-%d')
        predicted_stock = predicted_stock - getattr(row, "DAILY_DEDUCTION")
        order_qty = df_purchase_orders.loc[(df_purchase_orders['DATE'] == date_format)
                                           & (df_purchase_orders['ITEM'] == item), 'RECEIVING_AMOUNT']
        if len(order_qty.index) > 0:  # only add when an order is actually due that day
            predicted_stock = predicted_stock + order_qty.sum()
        lista = [date_format, item, int(predicted_stock)]
        temptable.append(lista)
And... well, it did the job, but it's quite slow. I run this on 100k rows, give or take, and was hoping to find some insight into how to solve this in a way that performs better.
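One way to avoid the nested loop entirely is to build the full item-by-date grid and use cumulative sums. This is only a sketch under the column names above (it assumes pandas >= 1.2 for how='cross'; grid, po, DAY_N and PREDICTED are made-up names), not a tested drop-in:
import pandas as pd

# one row per (item, date): cross-join the stock table with the date range
dates = pd.DataFrame({'DATE': pd.date_range(start=today, end=endofmonth, freq='D')})
grid = df_stock.merge(dates, how='cross')

# cumulative deduction: day 1 subtracts one DAILY_DEDUCTION, day 2 two, and so on
grid['DAY_N'] = grid.groupby('ITEM').cumcount() + 1
grid['PREDICTED'] = grid['STOCK'] - grid['DAY_N'] * grid['DAILY_DEDUCTION']

# deliveries: once an order is received, the amount stays in stock from that day on
po = df_purchase_orders.copy()
po['DATE'] = pd.to_datetime(po['DATE'])
po = po.groupby(['ITEM', 'DATE'], as_index=False)['RECEIVING_AMOUNT'].sum()
grid = grid.merge(po, on=['ITEM', 'DATE'], how='left')
grid['RECEIVING_AMOUNT'] = grid['RECEIVING_AMOUNT'].fillna(0)
grid['PREDICTED'] += grid.groupby('ITEM')['RECEIVING_AMOUNT'].cumsum()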

How do I avoid a loop with Python/Pandas to build an equity curve?

I am trying to build an equity curve in Python using pandas. For those not in the know, an equity curve is a cumulative tally of investing profits/losses, day by day. The code below works, but it is incredibly slow. I've tried to build an alternative using pandas .iloc and such, but nothing is working. I'm not sure it is possible to do this outside of a loop, given how I have to reference the prior row(s).
for today in range(len(f1)):  # a loop that runs the length of the "f1" dataframe
    if today == 0:  # first row in the dataframe
        f1.loc[today, 'StartAUM'] = StartAUM  # set initial assets
        f1.loc[today, 'Shares'] = 0  # dummy placeholder for shares; no trading on day 1
        f1.loc[today, 'PnL'] = 0  # dummy placeholder for P&L; no trading on day 1
        f1.loc[today, 'EndAUM'] = StartAUM  # ending AUM equals beginning AUM since no trades
        continue  # and on to the second row in the dataframe
    yesterday = today - 1  # used to reference the prior row (see below)
    f1.loc[today, 'StartAUM'] = f1.loc[yesterday, 'EndAUM']  # today's starting assets are yesterday's ending assets
    f1.loc[today, 'Shares'] = f1.loc[yesterday, 'EndAUM'] // f1.loc[yesterday, 'Shareprice']  # shares to trade = yesterday's assets // yesterday's share price
    f1.loc[today, 'PnL'] = f1.loc[today, 'Shares'] * f1.loc[today, 'Outcome1']  # P&L = shares traded * outcome for one share
    # Outcome1 came from the dataframe before this loop; its exact value is irrelevant here
    f1.loc[today, 'EndAUM'] = f1.loc[today, 'StartAUM'] + f1.loc[today, 'PnL']  # ending assets = starting assets + today's P&L
There is a good example here: http://www.pythonforfinance.net/category/basic-data-analysis/ and I know that there is an example in Wes McKinney's book Python for Data Analysis. You might be able to find it here: http://wesmckinney.com/blog/python-for-financial-data-analysis-with-pandas/
Have you tried using iterrows() to construct the for loop?
for today, row in f1.iterrows():
    if today == 0:
        f1.loc[today, 'StartAUM'] = StartAUM  # set initial assets
        f1.loc[today, 'Shares'] = 0  # dummy placeholder for shares; no trading on day 1
        f1.loc[today, 'PnL'] = 0  # dummy placeholder for P&L; no trading on day 1
        f1.loc[today, 'EndAUM'] = StartAUM  # ending AUM equals beginning AUM since no trades
        continue  # and on to the second row in the dataframe
    yesterday = today - 1  # used to reference the prior row
    f1.loc[today, 'StartAUM'] = f1.loc[yesterday, 'EndAUM']  # today's starting assets are yesterday's ending assets
    f1.loc[today, 'Shares'] = f1.loc[yesterday, 'EndAUM'] // f1.loc[yesterday, 'Shareprice']  # shares to trade = yesterday's assets // yesterday's share price
    f1.loc[today, 'PnL'] = f1.loc[today, 'Shares'] * row['Outcome1']  # P&L = shares traded * outcome for one share
    f1.loc[today, 'EndAUM'] = f1.loc[today, 'StartAUM'] + f1.loc[today, 'PnL']  # ending assets = starting assets + today's P&L
Note that the rows yielded by iterrows() are copies, so assignments such as row['StartAUM'] = ... would not write back to f1; the writes still have to go through f1.loc. iterrows() mainly saves the repeated .loc lookups when reading the current row.
See more details about iterrows() here.
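A quick demonstration of that write-back caveat (a minimal sketch):
import pandas as pd

f = pd.DataFrame({'x': [1, 2]})
for i, row in f.iterrows():
    row['x'] = 0  # modifies the copied row Series, not the frame
print(f['x'].tolist())  # typically [1, 2] -- unchanged; use f.loc[i, 'x'] = 0 instead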
You need to vectorize the operations (don't iterate with a for loop; compute whole columns at once):
# fill the initial values
f1['StartAUM'] = StartAUM  # set initial assets
f1['Shares'] = 0  # dummy placeholder for shares; no trading on day 1
f1['PnL'] = 0  # dummy placeholder for P&L; no trading on day 1
f1['EndAUM'] = StartAUM  # ending AUM starts at the initial assets

# do the computations (vectorized)
f1['StartAUM'].iloc[1:] = f1['EndAUM'].iloc[:-1]
f1['Shares'].iloc[1:] = f1['EndAUM'].iloc[:-1] // f1['Shareprice'].iloc[:-1]
f1['PnL'] = f1['Shares'] * f1['Outcome1']
f1['EndAUM'] = f1['StartAUM'] + f1['PnL']
EDIT: this will not work correctly since StartAUM, EndAUM, Shares depend on each other and cannot be computed one without another. I didn't notice that before.
Can you try the following:
#import relevant modules
import pandas as pd
import numpy as np
from pandas_datareader import data
import matplotlib.pyplot as plt
#download data into DataFrame and create moving averages columns
f1 = data.DataReader('AAPL', 'yahoo',start='1/1/2017')
StartAUM = 1000000
#populate DataFrame with starting values
f1['Shares'] = 0
f1['PnL'] = 0
f1['EndAUM'] = StartAUM
#Set shares held to be the previous day's EndAUM divided by the previous day's closing price
f1['Shares'] = f1['EndAUM'].shift(1) / f1['Adj Close'].shift(1)
#Set the day's PnL to be the number of shares held multiplied by the change in closing price from yesterday to today's close
f1['PnL'] = f1['Shares'] * (f1['Adj Close'] - f1['Adj Close'].shift(1))
#Set day's ending AUM to be previous days ending AUM plus daily PnL
f1['EndAUM'] = f1['EndAUM'].shift(1) + f1['PnL']
#Plot the equity curve
f1['EndAUM'].plot()
Does the above solve your issue?
The solution was to use the Numba package. It performs the loop task in a fraction of the time.
https://numba.pydata.org/
The arguments/dataframe can be passed to the numba module/function. I will try to write up a more detailed explanation with code when time permits.
Thanks to all
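A rough illustration of that Numba approach (a sketch, not the poster's actual code; it assumes the Shareprice/Outcome1 columns from the question and compiles the row-by-row recurrence over plain NumPy arrays):
import numpy as np
from numba import njit

@njit
def equity_curve(shareprice, outcome1, start_aum):
    n = shareprice.shape[0]
    start = np.empty(n)
    shares = np.empty(n)
    pnl = np.empty(n)
    end = np.empty(n)
    start[0] = start_aum
    shares[0] = 0.0
    pnl[0] = 0.0
    end[0] = start_aum
    for t in range(1, n):
        start[t] = end[t - 1]  # today's start = yesterday's end
        shares[t] = end[t - 1] // shareprice[t - 1]  # yesterday's assets // yesterday's price
        pnl[t] = shares[t] * outcome1[t]  # P&L from today's outcome
        end[t] = start[t] + pnl[t]
    return start, shares, pnl, end

f1['StartAUM'], f1['Shares'], f1['PnL'], f1['EndAUM'] = equity_curve(
    f1['Shareprice'].to_numpy(float), f1['Outcome1'].to_numpy(float), StartAUM)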
In case others come across this, you can definitely make an equity curve without loops.
Dummy up some data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (13, 10)

# Some data to work with
np.random.seed(1)
stock = pd.DataFrame(
    np.random.randn(100).cumsum() + 10,
    index=pd.date_range('1/1/2020', periods=100, freq='D'),
    columns=['Close']
)
stock['ma_5'] = stock['Close'].rolling(5).mean()
stock['ma_15'] = stock['Close'].rolling(15).mean()
Holdings: simple long/short based on moving average crossover signals
longs = stock['Close'].where(stock['ma_5'] > stock['ma_15'], np.nan)
shorts = stock['Close'].where(stock['ma_5'] < stock['ma_15'], np.nan)
# Quick plot
stock.plot()
longs.plot(lw=5, c='green')
shorts.plot(lw=5, c='red')
EQUITY CURVE:
Identify which side (long/short) holds the first trade (here, short). For that first side, keep the initial trade price and then cumulatively sum the daily changes (there would normally be more NaNs in the series if you also had exit rules for when you are out of the market); finally, forward-fill over the NaN values and fill any remaining NaNs with zeros. The second, opposite side (here, long) is handled the same way, except that the starting price is not kept. The other important thing is to invert the short side's daily changes (declines should be positive contributions to the PnL).
lidx = np.where(longs > 0)[0][0]
sidx = np.where(shorts > 0)[0][0]
startdx = min(lidx, sidx)

# For the first holding side: keep the first trade price, then compute daily changes and ffill NaNs.
# For the second holding side: cumsum the daily changes, ffill and fillna(0) (short changes inverted).
if lidx == startdx:
    lcurve = longs.diff()  # daily changes
    lcurve.iloc[lidx] = longs.iloc[lidx]  # put back the initial starting price
    lcurve = lcurve.cumsum().ffill()  # accumulate daily changes / ffill to build the curve
    scurve = -shorts.diff().cumsum().ffill().fillna(0)  # daily changes (declines become positive)
else:
    scurve = -shorts.diff()  # daily changes (declines become positive)
    scurve.iloc[sidx] = shorts.iloc[sidx]  # put back the initial starting price
    scurve = scurve.cumsum().ffill()  # accumulate daily changes / ffill to build the curve
    lcurve = longs.diff().cumsum().ffill().fillna(0)  # daily changes
Add the two long/short curves together to get the final equity curve:
eq_curve = lcurve + scurve
# quick plot
stock.iloc[:, :3].plot()
longs.plot(lw=5, c='green', label='Long')
shorts.plot(lw=5, c='red', label='Short')
eq_curve.plot(lw=2, ls='dotted', c='orange', label='Equity Curve')
plt.legend()

Python - time series alignment and "to date" functions

I have a dataset whose first three columns are Basket ID (a unique identifier), Sale amount (dollars), and the date of the transaction. I want to calculate the following columns for each row of the dataset, and I would like to do it in Python:
Previous Sale of the same basket (if any); Sale Count to date for the current basket; Mean To Date for the current basket (if available); Max To Date for the current basket (if available)
Basket  Sale  Date        PrevSale  SaleCount  MeanToDate  MaxToDate
88      $15   3/01/2012             1
88      $30   11/02/2012  $15       2          $23         $30
88      $16   16/08/2012  $30       3          $20         $30
123     $90   18/06/2012            1
477     $77   19/08/2012            1
477     $57   11/12/2012  $77       2          $67         $77
566     $90   6/07/2012             1
I'm pretty new to Python, and I'm struggling to find a clean way to do this. I've sorted the data (as above) by Basket and Date, so I can get the previous sale in bulk by shifting down by one within each basket. But I have no clue how to get MeanToDate and MaxToDate efficiently, apart from looping... any ideas?
This should do the trick:
from pandas import concat
from pandas.stats.moments import expanding_mean, expanding_count

def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    # se is the (ordered) time series of sales restricted to a single basket;
    # we can now create a dataframe by combining different metrics --
    # pandas has a function for each of the ones you are interested in!
    return concat(
        {
            'MeanToDate': expanding_mean(se),  # cumulative mean
            'MaxToDate': se.cummax(),          # cumulative max
            'SaleCount': expanding_count(se),  # cumulative count
            'Sale': se,                        # simple copy
            'PrevSale': se.shift(1)            # previous sale
        },
        axis=1
    )

# we then apply this handler to all the groups and pandas combines them
# back into a single dataframe indexed by (Basket, Date);
# we simply need to reset the index to get the shape mentioned in the question
new_df = df.groupby('Basket').apply(handler).reset_index()
You can read more about grouping/aggregating here.
An updated version for modern pandas (the pandas.stats.moments module has been removed in favor of the .expanding() accessor):
import pandas as pd
pd.__version__  # u'0.24.2'
from pandas import concat

def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    return concat(
        {
            'MeanToDate': se.expanding().mean(),  # cumulative mean
            'MaxToDate': se.expanding().max(),    # cumulative max
            'SaleCount': se.expanding().count(),  # cumulative count
            'Sale': se,                           # simple copy
            'PrevSale': se.shift(1)               # previous sale
        },
        axis=1
    )

# reproduce the sample data from the question
from datetime import datetime
df = pd.DataFrame({'Basket': [88, 88, 88, 123, 477, 477, 566],
                   'Sale': [15, 30, 16, 90, 77, 57, 90],
                   'Date': [datetime.strptime(ds, '%d/%m/%Y')
                            for ds in ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
                                       '19/08/2012', '11/12/2012', '6/07/2012']]})

new_df = df.groupby('Basket').apply(handler).reset_index()
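On that sample frame, new_df should come out roughly as below (column order can vary across pandas versions):
   Basket       Date  MeanToDate  MaxToDate  SaleCount  Sale  PrevSale
0      88 2012-01-03   15.000000       15.0        1.0    15       NaN
1      88 2012-02-11   22.500000       30.0        2.0    30      15.0
2      88 2012-08-16   20.333333       30.0        3.0    16      30.0
3     123 2012-06-18   90.000000       90.0        1.0    90       NaN
...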
