Context
I am trying to build a portfolio dashboard following this example, only instead of Excel I am using Python. I am not sure how to reproduce the part from 3:47 onwards, where each period's balance is cross-calculated to arrive at the next period's balance.
Problem
Is there a way to do this in Python? I tried a for loop, but it just returned the same number repeated over the number of forward periods. Below is my attempt:
date_range = pd.date_range(start=today, periods=period_of_investments, freq=contribution_periods)
returns_port = 12
rs = []
balance_total = []
for one in range(len(date_range)):
    return_loss = (returns_port/period_of_investments)*capital_insert
    rs.append(return_loss)
    period_one_balance = capital_insert+return_loss
    period_two_return_loss = (returns_port/period_of_investments)*(period_one_balance + capital_insert)
    period_two_balance = period_one_balance + capital_insert + period_two_return_loss
    balance_total.append(period_two_balance)
I did not watch the video, but I will explain how to write Python code for the following problem, which is similar to the one in the video.
Suppose you want to calculate the return on a fixed monthly deposit invested over the next 20 years at a fixed interest rate.
The first step is understanding how pd.date_range() works. If you started at the beginning of this month, the whole period would be pd.date_range(start='4-1-2021', periods=240, freq='1m') (240 comes from 20 years, 12 months each). Basically, we are calculating the return at the end of each month.
import pandas as pd

portfolio = pd.DataFrame(columns=['Date', 'Investment', 'Return/Loss', 'Balance'])
interest_rate = 0.121          # annual rate
monthly_deposit = 500
dates = pd.date_range(start="3-31-2021", periods=240, freq='1m')
investment = [monthly_deposit]*len(dates)
return_losses = []
balances = []
current_balance = 500          # the first deposit is already in the account
for date in dates:
    # interest earned on the balance this month
    current_return_loss = (interest_rate/12)*current_balance
    return_losses.append(round(current_return_loss, 2))
    balances.append(round(current_balance + current_return_loss))
    # carry the new balance plus next month's deposit into the next period
    current_balance += (current_return_loss + monthly_deposit)
portfolio['Date'] = pd.to_datetime(dates)
portfolio['Investment'] = investment
portfolio['Return/Loss'] = return_losses
portfolio['Balance'] = balances
balance_at_end = balances[-1]
print(portfolio.head(10))
print(balance_at_end)
You will get the following result, which is identical to the one in the video:
Date Investment Return/Loss Balance
0 2021-03-31 500 5.04 505
1 2021-04-30 500 10.13 1015
2 2021-05-31 500 15.28 1530
3 2021-06-30 500 20.47 2051
4 2021-07-31 500 25.72 2577
5 2021-08-31 500 31.02 3108
6 2021-09-30 500 36.38 3644
7 2021-10-31 500 41.79 4186
8 2021-11-30 500 47.25 4733
9 2021-12-31 500 52.77 5286
506397
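As a side note (not part of the original answer), you can sanity-check the final balance with the closed-form future value of an annuity due, since each month the loop compounds the running balance and then adds the next deposit. A minimal sketch, reusing the variables defined above; small differences from the loop are expected because of the per-period rounding:

r = interest_rate / 12                      # monthly rate
n = 240                                     # number of monthly deposits
fv = monthly_deposit * ((1 + r)**n - 1) / r * (1 + r)
print(round(fv))                            # roughly 506,397, matching the loop above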
Related
I have a dataset consisting of days in which the average temperature is above 10 degrees Celsius for some states in the US.
I want to define the different percentiles (1st to 99th, in 5-percentile increments) for each year and state, and remove the rows that are larger than the percentile from each data frame.
x = pd.read_csv('C:/data/data_1.csv')
percentile1 = x.groupby(['STATE', 'YEAR']).quantile(0.01)
percentile5 = x.groupby(['STATE', 'YEAR']).quantile(0.05)
percentile10 = x.groupby(['STATE', 'YEAR']).quantile(0.1)
percentile15 = x.groupby(['STATE', 'YEAR']).quantile(0.15)
percentile20 = x.groupby(['STATE', 'YEAR']).quantile(0.2)
...
percentile85 = x.groupby(['STATE', 'YEAR']).quantile(0.85)
percentile90 = x.groupby(['STATE', 'YEAR']).quantile(0.90)
percentile99 = x.groupby(['STATE', 'YEAR']).quantile(0.99)
print(percentile1)
ID doy
STATE YEAR
AK 2001 1193.40 190.56
2002 1903.48 138.24
2003 2104.40 143.66
2004 1946.40 132.00
2005 2221.08 121.24
... ... ...
WY 2015 156.79 78.70
2016 114.60 83.68
2017 102.60 111.10
2018 115.04 114.51
2019 115.01 114.02
#### Calculate the annual ignition timing quantiles per ecoregion
AK01 = x[(x["STATE"] == 'AK') & (x["YEAR"] == 2001)]
AK01 = AK01[AK01["doy"] >= percentile1.doy[0]]
So far I have done it like this, but it would take forever to do it like this per state, per year.
I would love to loop over this in a way so that it subsets per STATE and per YEAR.
Something like:
if
x.STATE == percentile1.index[0] and x.YEAR == percentile1.index[1]
then
x[x["doy"] >= percentile1.doy[0]]
I would eventually end up with something like a data frame with
print(df_percentile1)
STATE YEAR ID DOY
AK 2001 1 191
AK 2001 2 200
AK 2001 3 200
... ... ... ...
AK 2019 17 185
WY 2019 209 99
WY 2019 210 100
How should I incorporate all this in a for-loop?
Edit:
I think that I basically want to do this after I have reset_index() for all the percentiles:
percentile1 = percentile1.reset_index()
x = np.where((x['STATE'] == percentile1['STATE']) & (x['YEAR'] == percentile1['YEAR']) & (x['doy'] <= percentile1['doy']), 'TRUE', 'FALSE')
but I get the following error message
ValueError: Can only compare identically-labeled Series objects
However, everything is labelled identically. How should I deal with this?
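For what it's worth, that error comes from comparing two Series with different indexes element-wise: x has one row per observation, while percentile1 has one row per STATE/YEAR group, so pandas refuses the comparison. One way around it, sketched below assuming the column names STATE, YEAR and doy shown above (keep or flip the >= depending on which side of the percentile you want), is to merge the per-group threshold back onto the rows and filter, which also avoids writing one block per state and year:

import numpy as np
import pandas as pd

x = pd.read_csv('C:/data/data_1.csv')

# one filtered frame per percentile level (1%, 5%, 10%, ..., 95%, 99%)
levels = [0.01] + list(np.arange(0.05, 1.0, 0.05).round(2)) + [0.99]
filtered = {}
for q in levels:
    thresh = (x.groupby(['STATE', 'YEAR'])['doy']
                .quantile(q)
                .rename('doy_thresh')
                .reset_index())
    merged = x.merge(thresh, on=['STATE', 'YEAR'], how='left')
    # keep rows at or above the per-group threshold, as in the AK 2001 example
    filtered[q] = merged[merged['doy'] >= merged['doy_thresh']].drop(columns='doy_thresh')

df_percentile1 = filtered[0.01]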
I'm trying to calculate daily returns using the time weighted rate of return formula:
(Ending Value-(Beginning Value + Net Additions)) / (Beginning value + Net Additions)
My DF looks like:
Account # Date Balance Net Additions
1 9/1/2022 100 0
1 9/2/2022 115 10
1 9/3/2022 117 0
2 9/1/2022 50 0
2 9/2/2022 52 0
2 9/3/2022 40 -15
It should look like:
Account # Date Balance Net Additions Daily TWRR
1 9/1/2022 100 0
1 9/2/2022 115 10 0.04545
1 9/3/2022 117 0 0.01739
2 9/1/2022 50 0
2 9/2/2022 52 0 0.04
2 9/3/2022 40 -15 0.08108
After calculating the daily returns for each account, I want to link all the returns throughout the month to get the monthly return:
((1 + return) * (1 + return)) - 1
The final result should look like:
Account # Monthly Return
1 0.063636
2 0.12432
Through research (and trial and error), I was able to get the output I am looking for but as a new python user, I'm sure there is an easier/better way to accomplish this.
DF["Numerator"] = DF.groupby("Account #")[Balance].diff() - DF["Net Additions"]
DF["Denominator"] = ((DF["Numerator"] + DF["Net Additions"] - DF["Balance"]) * -1) + DF["Net Additions"]
DF["Daily Returns"] = (DF["Numerator"] / DF["Denominator"]) + 1
DF = DF.groupby("Account #")["Daily Returns"].prod() - 1
Any help is appreciated!
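One way to tighten this up, sketched below under the assumption that the columns are named exactly as shown above and that each account's rows are already sorted by date, is to shift the balance within each account and compute the return directly, which skips the intermediate Numerator/Denominator columns:

# previous day's balance within each account
prev_balance = DF.groupby("Account #")["Balance"].shift()

# Daily TWRR: (End - (Begin + Net Additions)) / (Begin + Net Additions)
denom = prev_balance + DF["Net Additions"]
DF["Daily TWRR"] = (DF["Balance"] - denom) / denom

# geometrically link the daily returns into a monthly return per account
monthly = (DF.groupby("Account #")["Daily TWRR"]
             .apply(lambda r: (1 + r.dropna()).prod() - 1)
             .rename("Monthly Return")
             .reset_index())

On the sample data this reproduces the returns shown above (0.063636 for account 1 and 0.124324 for account 2).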
I'm trying to use Python to get the time taken, as well as the average speed, for an object traveling between points.
The data looks somewhat like this,
location initialtime id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9km
2 2020-09-18T12:10:14.485952Z car_uno 83 8km
3 2020-09-18T11:59:14.484781Z car_duo 70 9km
7 2020-09-18T12:00:14.484653Z car_trio 85 8km
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5km
The function I'm using currently is essentially like this,
Speeds.index = pd.to_datetime(Speeds.index)
..etc
Now, if I were doing this the usual way, I would just take the unique values of the ids:
for x in Speeds.id.unique():
    Speeds[Speeds.id == x]...
But this method really isn't working.
What is the best approach to simply check whether an id has multiple points over time and, if so, take the average of the speeds over that time, otherwise just return the speed itself if there are not multiple values?
Is there a simpler pandas filter I could use?
Expected output is simply,
area - id - initial time - journey time - average speed.
The point is to get the journey time and the average speed for a vehicle going past two points.
To get the average speed and journey times you can use groupby() and pass in the columns that determine one complete journey, like id or area.
import pandas as pd
from io import StringIO
data = StringIO("""
area initialtime id speed
1 2020-09-18T12:03:14.485952Z car_uno 72
2 2020-09-18T12:10:14.485952Z car_uno 83
3 2020-09-18T11:59:14.484781Z car_duo 70
7 2020-09-18T12:00:14.484653Z car_trio 85
8 2020-09-18T12:12:14.484653Z car_trio 70
""")
df = pd.read_csv(data, delim_whitespace=True)
df["initialtime"] = pd.to_datetime(df["initialtime"])
# change to ["id", "area"] if need more granular aggregation
group_cols = ["id"]
time = df.groupby(group_cols)["initialtime"].agg([max, min]).eval('max-min').reset_index(name="journey_time")
speed = df.groupby(group_cols)["speed"].mean().reset_index(name="average_speed")
pd.merge(time, speed, on=group_cols)
id journey_time average_speed
0 car_duo 00:00:00 70.0
1 car_trio 00:12:00 77.5
2 car_uno 00:07:00 77.5
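As a small aside (not from the original answer), the same journey time can be computed without eval() if you prefer plain aggregation; an equivalent sketch using the same df and group_cols as above:

# equivalent to the .agg([max, min]).eval('max-min') step above
time = (df.groupby(group_cols)["initialtime"]
          .agg(lambda s: s.max() - s.min())
          .reset_index(name="journey_time"))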
I tried a fairly intuitive solution. I'm assuming the data has already been loaded into df.
df['initialtime'] = pd.to_datetime(df['initialtime'])

result = []
for car in df['id'].unique():
    _df = df[df['id'] == car].sort_values('initialtime', ascending=True)

    # Where the car is leaving "from" and where it's heading "to"
    _df['From'] = _df['location']
    _df['To'] = _df['location'].shift(-1, fill_value=_df['location'].iloc[0])

    # Auxiliary columns
    _df['end_time'] = _df['initialtime'].shift(-1, fill_value=_df['initialtime'].iloc[0])
    _df['end_speed'] = _df['speed'].shift(-1, fill_value=_df['speed'].iloc[0])

    # Desired columns
    _df['journey_time'] = _df['end_time'] - _df['initialtime']
    _df['avg_speed'] = (_df['speed'] + _df['end_speed']) / 2

    _df = _df[_df['journey_time'] >= pd.Timedelta(0)]
    _df.drop(['location', 'distance', 'speed', 'end_time', 'end_speed'],
             axis=1, inplace=True)
    result.append(_df)

final_df = pd.concat(result).reset_index(drop=True)
The final DataFrame is as follows:
initialtime id From To journey_time avg_speed
0 2020-09-18 12:03:14.485952+00:00 car_uno 1 2 0 days 00:07:00 77.5
1 2020-09-18 11:59:14.484781+00:00 car_duo 3 3 0 days 00:00:00 70.0
2 2020-09-18 12:00:14.484653+00:00 car_trio 7 8 0 days 00:12:00 77.5
Here is another approach. My results are different from those in other posts, so I may have misunderstood the requirements. In brief, I calculated each average speed as total distance divided by total time (for each car).
from io import StringIO
import pandas as pd
# speed in km / hour; distance in km
data = '''location initial-time id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9
2 2020-09-18T12:10:14.485952Z car_uno 83 8
3 2020-09-18T11:59:14.484781Z car_duo 70 9
7 2020-09-18T12:00:14.484653Z car_trio 85 8
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5
'''
Now create the data frame and perform the calculations:
# create data frame
df = pd.read_csv(StringIO(data), delim_whitespace=True)
df['elapsed-time'] = df['distance'] / df['speed'] # in hours
# utility function
def hours_to_hms(elapsed):
    '''Convert `elapsed` (in hours) to hh:mm:ss (round to nearest sec).'''
    h, m = divmod(elapsed, 1)
    m *= 60
    _, s = divmod(m, 1)
    s *= 60
    hms = '{:02d}:{:02d}:{:02d}'.format(int(h), int(m), int(round(s, 0)))
    return hms

# perform calculations
start_time = df.groupby('id')['initial-time'].min()
journey_hrs = df.groupby('id')['elapsed-time'].sum().rename('elapsed-hrs')
hms = journey_hrs.apply(lambda x: hours_to_hms(x)).rename('hh:mm:ss')
ave_speed = ((df.groupby('id')['distance'].sum()
              / df.groupby('id')['elapsed-time'].sum())
             .rename('ave speed (km/hr)')
             .round(2))

# assemble results
result = pd.concat([start_time, journey_hrs, hms, ave_speed], axis=1)
print(result)
initial-time elapsed-hrs hh:mm:ss \
id
car_duo 2020-09-18T11:59:14.484781Z 0.128571 00:07:43
car_trio 2020-09-18T12:00:14.484653Z 0.201261 00:12:05
car_uno 2020-09-18T12:03:14.485952Z 0.221386 00:13:17
ave speed (km/hr)
id
car_duo 70.00
car_trio 77.01
car_uno 76.79
You should provide a better dataset (i.e. with identical time points) so that we better understand the inputs, and an example of the expected output so that we understand how the average speed is computed.
Thus I'm just guessing that you may be looking for df.groupby('initialtime')['speed'].mean() if df is a dataframe containing your input data.
I'm using featuretools and I need to create a feature that uses the cutoff time in its calculation.
My entityset consists of a client table and a subscription table (it has more, but only these are necessary for the question):
import datetime as dt

import featuretools as ft
import pandas as pd

client_table = pd.DataFrame({'client_id': (1, 2, 3),
                             'start_date': (dt.date(2015,1,1), dt.date(2017,10,15), dt.date(2011,1,10))})
subscription_table = pd.DataFrame({'client_id': (1, 3, 1, 2),
                                   'start_plan_date': (dt.date(2015,1,1), dt.date(2011,1,10), dt.date(2018,2,1), dt.date(2017,10,15)),
                                   'end_plan_date': (dt.date(2018,2,1), dt.date(2019,1,10), dt.date(2021,2,1), dt.date(2019,10,15))})
client table
client_id start_date
0 1 2015-01-01
1 2 2017-10-15
2 3 2011-01-10
subscription table
subscription_id client_id start_plan_date end_plan_date
0 1 1 2015-01-01 2018-02-01
1 2 3 2011-01-10 2019-01-10
2 3 1 2018-02-01 2021-02-01
3 4 2 2017-10-15 2019-10-15
I created the entity set using client_id as key and setting start_date as time_index
es = ft.EntitySet()
es = es.entity_from_dataframe(entity_id="client",
                              dataframe=client_table,
                              index="client_id",
                              time_index="start_date")
es = es.entity_from_dataframe(entity_id="subscription",
                              dataframe=subscription_table,
                              index="subscription_id",
                              time_index="start_plan_date",
                              variable_types={"client_id": ft.variable_types.Index,
                                              "end_plan_date": ft.variable_types.Datetime})
relation = ft.Relationship(es["client"]["client_id"], es["subscription"]["client_id"])
es = es.add_relationship(relation)
print(es)
Out:
Entityset: None
Entities:
subscription [Rows: 4, Columns: 4]
client [Rows: 3, Columns: 2]
Relationships:
subscription.client_id -> client.client_id
Now, I need to create a feature that estimates the time between the cutoff time (i.e. 01/01/2018) and the closest end_plan_date for each client. In algebraic form the calculation should be
time_remaining_in_plan = max(subscription.end_plan_date - cutoff_time)
Also I need to calculate the amount of time since the client started:
time_since_start = cutoff_time - client.start_date
In my example, the expected output for those features should look like this (I'm assuming the time differences are in days, but they could also be in months; I'm also using a range of cutoff times):
client_id cutoff_time time_remaining_in_plan time_since_start
0 3 2018-10-31 71 2851
1 3 2018-11-30 41 2881
2 1 2018-10-31 824 1399
3 1 2018-11-30 794 1429
4 2 2018-10-31 349 381
5 2 2018-11-30 319 411
Is there a way to use featuretools to create custom primitives (aggregation or transformation) or seed features that can generate this result?
Thanks!
This can be done with custom primitives that use the uses_calc_time parameter. This parameter sets up the primitive so that the cutoff time gets passed to it during calculation.
In your case, we need to define two primitives:
from featuretools.primitives import make_trans_primitive
from featuretools.variable_types import Datetime, Numeric


def time_until(array, time):
    diff = pd.DatetimeIndex(array) - time
    return diff.days

TimeUntil = make_trans_primitive(function=time_until,
                                 input_types=[Datetime],
                                 return_type=Numeric,
                                 uses_calc_time=True,
                                 description="Calculates time until the cutoff time in days",
                                 name="time_until")


def time_since(array, time):
    diff = time - pd.DatetimeIndex(array)
    return diff.days

TimeSince = make_trans_primitive(function=time_since,
                                 input_types=[Datetime],
                                 return_type=Numeric,
                                 uses_calc_time=True,
                                 description="Calculates time since the cutoff time in days",
                                 name="time_since")
Then we can use the primitives in a call to ft.dfs
cutoff_times = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 3, 3],
    "cutoff_time": pd.to_datetime([dt.date(2018,10,31), dt.date(2018,11,30)]*3)
})

fm, fl = ft.dfs(entityset=es,
                target_entity="client",
                cutoff_time=cutoff_times,
                agg_primitives=["max"],
                trans_primitives=[TimeUntil, TimeSince],
                cutoff_time_in_index=True)

# these columns correspond to time_remaining_in_plan and time_since_start
fm = fm[["MAX(subscription.TIME_UNTIL(end_plan_date))", "TIME_SINCE(start_date)"]]
This returns:
MAX(subscription.TIME_UNTIL(end_plan_date)) TIME_SINCE(start_date)
client_id time
1 2018-10-31 -272 1399
2 2018-10-31 349 381
3 2018-10-31 71 2851
1 2018-11-30 -302 1429
2 2018-11-30 319 411
3 2018-11-30 41 2881
This matches the result you're looking for in your question, with the exception of time_remaining_in_plan for client id 1. I double-checked the numbers Featuretools came up with, and I believe they are right for this dataset.
I have a time series of water levels for which I need to calculate monthly and annual statistics in relation to several arbitrary flood stages. Specifically, I need to determine the duration per month that the water exceeded flood stage, as well as the number of times these excursions occurred. Additionally, because of the noise associated with the dataloggers, I need to exclude floods that lasted less than 1 hour as well as floods with less than 1 hour between events.
Mock up data:
import datetime
import numpy as np
import pandas as pd

start = datetime.datetime(2014, 9, 5, 12, 0)
daterange = pd.date_range(start, periods=10000, freq='30min', name="Datetime")
data = np.random.random_sample((len(daterange), 3)) * 10
columns = ["Pond_A", "Pond_B", "Pond_C"]
df = pd.DataFrame(data=data, index=daterange, columns=columns)
flood_stages = [('Stage_1', 4.0), ('Stage_2', 6.0)]
My desired output is:
Pond_A_Stage_1_duration Pond_A_Stage_1_events \
2014-09-30 12:00:00 35.5 2
2014-10-31 12:00:00 40.5 31
2014-11-30 12:00:00 100 16
2014-12-31 12:00:00 36 12
etc. for the duration and events at each flood stage, at each reservoir.
I've tried grouping by month, iterating through the ponds and then iterating through each row like:
grouper = pd.TimeGrouper(freq="1MS")
month_groups = df.groupby(grouper)
for name, group in month_groups:
    flood_stage_a = group.sum()[1]
    flood_stage_b = group.sum()[2]
    inundation_a = False
    inundation_30_a = False
    inundation_hour_a = False
    change_inundation_a = 0
    for level in group.values:
        if level[1]:
            inundation_a = True
        else:
            inundation_a = False
        if inundation_hour_a == False and inundation_a == True and inundation_30_a == True:
            change_inundation_a += 1
        inundation_hour_a = inundation_30_a
        inundation_30_a = inundation_a
But this is a caveman solution and the heuristics are getting messy, since I don't want to count a new event if a flood started in one month and continued into the next. This also doesn't combine events with less than one hour between their start and end. Is there a better way to compare a record to its previous and next?
My other thought is to create new columns with the series shifted t+1, t+2, t-1, t-2, so I can evaluate each row once, but this still seems inefficient. Is there a smarter way to do this by mapping a function?
Let me give a quick, partial answer since no one has answered yet, and maybe someone else can do something better later on if this does not suffice for you.
You can do the time spent above flood stage pretty easily. I divided by 48 so the units are in days.
df[ df > 4 ].groupby(pd.TimeGrouper( freq = "1MS" )).count() / 48
Pond_A Pond_B Pond_C
Datetime
2014-09-01 15.375000 15.437500 14.895833
2014-10-01 18.895833 18.187500 18.645833
2014-11-01 17.937500 17.979167 18.666667
2014-12-01 18.104167 18.354167 18.958333
2015-01-01 18.791667 18.645833 18.708333
2015-02-01 16.583333 17.208333 16.895833
2015-03-01 18.458333 18.458333 18.458333
2015-04-01 0.458333 0.520833 0.500000
Counting distinct events is a little harder, but something like this will get you most of the way. (Note that this produces an unrealistically high number of flooding events, but that's just because of how the sample data is set up and not reflective of a typical pond, though I'm not an expert on pond flooding!)
for c in df.columns:
    df[c+'_events'] = ((df[c] > 4) & (df[c].shift() <= 4))

df.iloc[:,-3:].groupby(pd.TimeGrouper(freq="1MS")).sum()
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 306 291 298
2014-10-01 381 343 373
2014-11-01 350 346 357
2014-12-01 359 352 361
2015-01-01 355 335 352
2015-02-01 292 337 316
2015-03-01 344 360 386
2015-04-01 9 10 9
A couple things to note. First, an event can span months and this method will group it with the month where the event began. Second, I'm ignoring the duration of the event here, but you can adjust that however you want. For example, if you want to say the event doesn't start unless there are 2 consecutive periods below flood level followed by 2 consecutive periods above flood level, just change the relevant line above to:
df[c+'_events'] = ((df[c] > 4) & (df[c].shift(1) <= 4) &
                   (df[c].shift(-1) > 4) & (df[c].shift(2) <= 4))
That produces a pretty dramatic reduction in the count of distinct events:
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 70 71 72
2014-10-01 91 85 81
2014-11-01 87 75 91
2014-12-01 88 87 77
2015-01-01 91 95 94
2015-02-01 79 90 83
2015-03-01 83 78 85
2015-04-01 0 2 2
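The original question also asks to ignore floods shorter than one hour and to merge floods separated by dry gaps of less than one hour, which the shift-based count above doesn't handle. Below is one possible sketch of that cleanup for a single pond and stage, using the same 30-minute mock-up data (Pond_A and the 4.0 threshold are just the Stage_1 values from the question, and pd.Grouper is the current name for pd.TimeGrouper); it labels consecutive runs with cumsum, fills short dry gaps, then drops short floods before counting:

above = df['Pond_A'] > 4.0                      # Stage_1 threshold from the mock-up

# label consecutive runs of identical True/False values and measure their length (in 30-min samples)
run_id = (above != above.shift()).cumsum()
run_len = above.groupby(run_id).transform('size')

# any dry run shorter than 1 hour (< 2 samples) is treated as flooded, merging floods separated by sub-hour gaps
above = above | (~above & (run_len < 2))

# relabel the merged runs and drop floods shorter than 1 hour
run_id = (above != above.shift()).cumsum()
run_len = above.groupby(run_id).transform('size')
above = above & (run_len >= 2)

# an event starts on a rising edge; duration is reported in days (48 half-hour samples per day)
events = above & ~above.shift(fill_value=False)
monthly = pd.DataFrame({
    'Pond_A_Stage_1_duration': above.groupby(pd.Grouper(freq='1MS')).sum() / 48,
    'Pond_A_Stage_1_events': events.groupby(pd.Grouper(freq='1MS')).sum(),
})
print(monthly.head())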