Date subtraction not working with date range function - python

I am trying to subtract two datetimes to obtain the difference whenever both T1 and T2 hold valid values. The difference is calculated by counting only the weekdays between the two dates, excluding Saturdays and Sundays.
The code works for only some rows. How can this be fixed?
T1 T2 Diff
0 2017-12-04 05:48:15 2018-01-05 12:15:22 NaN
1 2017-07-10 08:23:11 2018-01-05 15:28:22 NaN
2 2017-09-11 05:10:37 2018-01-29 15:02:07 NaN
3 2017-12-21 04:51:12 2018-01-29 16:06:43 NaN
4 2017-10-13 10:11:00 2018-02-22 16:19:04 NaN
5 2017-09-28 21:44:31 2018-01-29 12:42:02 NaN
6 2018-01-23 20:00:58 2018-01-29 14:40:33 NaN
7 2017-11-28 15:39:38 2018-01-31 11:57:04 NaN
8 2017-12-21 12:44:00 2018-01-31 13:12:37 30.0
9 2017-11-09 05:52:29 2018-01-22 11:42:01 53.0
10 2018-02-12 04:21:08 NaT NaN
df[['T1','T2','diff']].dtypes
T1 datetime64[ns]
T2 datetime64[ns]
diff float64
df['T1'] = pd.to_datetime(df['T1'])
df['T2'] = pd.to_datetime(df['T2'])

def fun(row):
    if row.isnull().any():
        return np.nan
    ts = pd.DataFrame(pd.date_range(row["T1"], row["T2"]), columns=["date"])
    ts["dow"] = ts["date"].dt.weekday
    return (ts["dow"] < 5).sum()

df["diff"] = df.apply(lambda x: fun(x), axis=1)

Instead of checking for a null value in the row, use a try/except to capture the error raised when the calculation hits a null value.
This worked for me in, I believe, the manner you want:
import pandas as pd
import numpy as np

df = pd.read_csv("/home/rightmire/Downloads/test.csv", sep=",")
# df = df[["m1","m2"]]
print(df)
# print(df[['m1','m2']].dtypes)
df['m1'] = pd.to_datetime(df['m1'])
df['m2'] = pd.to_datetime(df['m2'])
print(df[['m1','m2']].dtypes)

# for index, row in df.iterrows():
def fun(row):
    try:
        ts = pd.DataFrame(pd.date_range(row["m1"], row["m2"]), columns=["date"])
        # print(ts)
        ts["dow"] = ts["date"].dt.weekday
        result = (ts["dow"] < 5).sum()
        # print("Result = ", result)
        return result
    except Exception as e:
        # print("ERROR:{}".format(str(e)))
        result = np.nan
        # print("Result = ", result)
        return result

df["diff"] = df.apply(lambda x: fun(x), axis=1)
print(df["diff"])
OUTPUT OF INTEREST:
0 275.0
1 147.0
2 58.0
3 28.0
4 95.0
5 87.0
6 4.0
7 46.0
8 30.0
9 96.0
10 NaN
11 27.0
12 170.0
13 158.0
14 79.0
Name: diff, dtype: float64
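For completeness, a vectorized alternative (a sketch, assuming the T1/T2 column names from the question): NumPy's np.busday_count counts weekdays directly, which avoids building a date_range per row. Note that busday_count excludes the end date, so one day is added to match the inclusive date_range count above, and NaT rows are masked out first.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "T1": pd.to_datetime(["2017-12-21 12:44:00", "2018-02-12 04:21:08"]),
    "T2": pd.to_datetime(["2018-01-31 13:12:37", None]),  # second row has NaT
})

valid = df["T1"].notna() & df["T2"].notna()
df["diff"] = np.nan
# busday_count treats the end date as exclusive, so shift it by one day
df.loc[valid, "diff"] = np.busday_count(
    df.loc[valid, "T1"].values.astype("datetime64[D]"),
    df.loc[valid, "T2"].values.astype("datetime64[D]") + np.timedelta64(1, "D"),
)
print(df["diff"])  # 30.0 for the first row, NaN for the NaT row
```

This matches the 30.0 shown for row 8 of the question's table while handling NaT without try/except.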


Calculate specific function from the Date column and input parameter specified by user in Pandas

I have a df as shown below.
Date t_factor
2020-02-01 5
2020-02-03 23
2020-02-06 14
2020-02-09 23
2020-02-10 23
2020-02-11 23
2020-02-13 30
2020-02-20 29
2020-02-29 100
2020-03-01 38
2020-03-10 38
2020-03-11 38
2020-03-26 70
2020-03-29 70
From that I would like to create a function that calculates a column called t_function based on the calculated values t1, t2 and t3, where the user enters the following parameters.
Step1:
Enter start_date1 = 2020-02-01
Enter end_date1 = 2020-02-06
Enter a0 = 3
Enter a1 = 1
Enter a2 = 0
calculate t1 as the number of days from start_date1 (2020-02-01) to each value in the Date column, up to end_date1.
t_function = a0 + a1*t1 + a2*(t1)**2
Step2:
Enter start_date2 = 2020-02-13
Enter end_date2 = 2020-02-29
Enter a0 = 2
Enter a1 = 0
Enter a2 = 1
calculate the time in days as t2, which is 1 on start_date2 (2020-02-13), and so on until end_date2
t_function = a0 + a1*t2 + a2*(t2)**2
Step3:
Enter start_date3 = 2020-03-11
Enter end_date3 = 2020-03-29
Enter a0 = 4
Enter a1 = 0
Enter a2 = 0
calculate the time in days as t3, which is 1 on start_date3 (2020-03-11), and so on until end_date3
t_function = a0 + a1*t3 + a2*(t3)**2
Expected output:
Date t_factor t1 t2 t3 t_function
2020-02-01 5 1 NaN NaN 4
2020-02-03 23 3 NaN NaN 6
2020-02-06 14 6 NaN NaN 9
2020-02-09 23 NaN NaN NaN NaN
2020-02-10 23 NaN NaN NaN NaN
2020-02-11 23 NaN NaN NaN NaN
2020-02-13 30 NaN 1 NaN 3
2020-02-20 29 NaN 8 NaN 66
2020-02-29 100 NaN 17 NaN 291
2020-03-01 38 NaN NaN NaN NaN
2020-03-10 38 NaN NaN NaN NaN
2020-03-11 38 NaN NaN 1 4
2020-03-26 70 NaN NaN 15 4
2020-03-29 70 NaN NaN 18 4
Note:
The initial start_date, i.e. start_date1, should be the first date of the Date column.
The final end_date, i.e. end_date3, should be the last date of the Date column.
The column t_factor is not used.
I tried the code below to calculate t1, but after that I am confused, since I am new to Python and pandas.
df['t1'] = (df['Date'] - df.at[0, 'Date']).dt.days + 1
Here is how I would go about it:
import pandas as pd
from io import StringIO
from datetime import datetime, timedelta
import numpy as np
df = pd.read_csv(StringIO("""Date t_factor
2020-02-01 5
2020-02-03 23
2020-02-06 14
2020-02-09 23
2020-02-13 30
2020-02-20 29
2020-02-29 100
2020-03-11 38
2020-03-26 70
2020-03-29 70 """), sep="\s+", parse_dates=[0])
df
def fun(x, start="2020-02-01", end="2020-02-06", a0=3, a1=1, a2=0):
    start = datetime.strptime(start, "%Y-%m-%d")
    end = datetime.strptime(end, "%Y-%m-%d")
    if start <= x.Date <= end:
        t = (x.Date - start)/np.timedelta64(1, 'D') + 1
        diff = a0 + a1*t + a2*t**2
    else:
        diff = np.NaN
    return diff
df["t1"] = df.apply(lambda x: fun(x), axis=1)
df["t2"] = df.apply(lambda x: fun(x, "2020-02-13", "2020-02-29", 2, 0, 1), axis=1)
df["t3"] = df.apply(lambda x: fun(x, "2020-03-11", "2020-03-29", 4, 0, 0), axis=1)
df["t_function"] = df["t1"].fillna(0) + df["t2"].fillna(0) + df["t3"].fillna(0)
df
Here is the output:
Date t_factor t1 t2 t3 t_function
0 2020-02-01 5 4.0 NaN NaN 4.0
1 2020-02-03 23 6.0 NaN NaN 6.0
2 2020-02-06 14 9.0 NaN NaN 9.0
3 2020-02-09 23 NaN NaN NaN 0.0
4 2020-02-13 30 NaN 3.0 NaN 3.0
5 2020-02-20 29 NaN 66.0 NaN 66.0
6 2020-02-29 100 NaN 291.0 NaN 291.0
7 2020-03-11 38 NaN NaN 4.0 4.0
8 2020-03-26 70 NaN NaN 4.0 4.0
9 2020-03-29 70 NaN NaN 4.0 4.0
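The row-wise apply can also be replaced by boolean masking with Series.between — a sketch assuming the same Date column and the step-1 parameters from the question (the variable names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(
    ["2020-02-01", "2020-02-03", "2020-02-06", "2020-02-09"])})

# step-1 parameters from the question: start_date1, end_date1, a0, a1, a2
start, end = pd.Timestamp("2020-02-01"), pd.Timestamp("2020-02-06")
a0, a1, a2 = 3, 1, 0

mask = df["Date"].between(start, end)
t = (df.loc[mask, "Date"] - start).dt.days + 1
df.loc[mask, "t1"] = a0 + a1 * t + a2 * t**2
print(df)  # t1 is 4, 6, 9 for the in-range rows and NaN for 2020-02-09
```

Repeating this once per (start, end, a0, a1, a2) set gives the t1/t2/t3 columns without any apply.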

Taking first and last value in a rolling window

Initial problem statement
Using pandas, I would like to apply function available for resample() but not for rolling().
This works:
df1 = df.resample(to_freq,
                  closed='left',
                  kind='period',
                  ).agg(OrderedDict([('Open', 'first'),
                                     ('Close', 'last'),
                                     ]))
This doesn't:
df2 = df.rolling(my_indexer).agg(OrderedDict([('Open', 'first'),
                                              ('Close', 'last')]))
>>> AttributeError: 'first' is not a valid function for 'Rolling' object
df3 = df.rolling(my_indexer).agg(OrderedDict([('Close', 'last')]))
>>> AttributeError: 'last' is not a valid function for 'Rolling' object
What would be your advice to keep the first and last value of a rolling window, to be put into two different columns?
EDIT 1 - with usable input data
import pandas as pd
from random import seed
from random import randint
from collections import OrderedDict
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0,10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
# First & last work with resample
resampled_first = df.resample('3H',
                              closed='left',
                              kind='period',
                              ).agg(OrderedDict([('Values', 'first')]))
resampled_last = df.resample('3H',
                             closed='left',
                             kind='period',
                             ).agg(OrderedDict([('Values', 'last')]))
# They don't with rolling
rolling_first = df.rolling(3).agg(OrderedDict([('Values', 'first')]))
rolling_last = df.rolling(3).agg(OrderedDict([('Values', 'last')]))
Thanks for your help!
Bests,
You can use your own function to get the first or last element in a rolling window:
rolling_first = df.rolling(3).agg(lambda rows: rows.iloc[0])
rolling_last = df.rolling(3).agg(lambda rows: rows.iloc[-1])
Example
import pandas as pd
from random import seed, randint
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows.iloc[0])
df['last'] = df['Values'].rolling(3).agg(lambda rows: rows.iloc[-1])
print(df)
Result
Values first last
2020-01-01 00:00:00+00:00 2 NaN NaN
2020-01-01 01:00:00+00:00 9 NaN NaN
2020-01-01 02:00:00+00:00 1 2.0 1.0
2020-01-01 03:00:00+00:00 4 9.0 4.0
2020-01-01 04:00:00+00:00 1 1.0 1.0
2020-01-01 05:00:00+00:00 7 4.0 7.0
2020-01-01 06:00:00+00:00 7 1.0 7.0
2020-01-01 07:00:00+00:00 7 7.0 7.0
2020-01-01 08:00:00+00:00 10 7.0 10.0
2020-01-01 09:00:00+00:00 6 7.0 6.0
2020-01-01 10:00:00+00:00 3 10.0 3.0
2020-01-01 11:00:00+00:00 1 6.0 1.0
2020-01-01 12:00:00+00:00 7 3.0 7.0
2020-01-01 13:00:00+00:00 0 1.0 0.0
2020-01-01 14:00:00+00:00 6 7.0 6.0
2020-01-01 15:00:00+00:00 6 0.0 6.0
2020-01-01 16:00:00+00:00 9 6.0 9.0
2020-01-01 17:00:00+00:00 0 6.0 0.0
2020-01-01 18:00:00+00:00 7 9.0 7.0
2020-01-01 19:00:00+00:00 4 0.0 4.0
2020-01-01 20:00:00+00:00 3 7.0 3.0
2020-01-01 21:00:00+00:00 9 4.0 9.0
2020-01-01 22:00:00+00:00 1 3.0 1.0
2020-01-01 23:00:00+00:00 5 9.0 5.0
2020-01-02 00:00:00+00:00 0 1.0 0.0
EDIT:
When using a dictionary you have to pass the lambda directly, not a string:
result = df['Values'].rolling(3).agg({'first': lambda rows: rows.iloc[0], 'last': lambda rows: rows.iloc[-1]})
print(result)
The same goes for your own function - you have to pass the function itself, not a string with its name:
def first(rows):
    return rows.iloc[0]

def last(rows):
    return rows.iloc[-1]

result = df['Values'].rolling(3).agg({'first': first, 'last': last})
print(result)
Example
import pandas as pd
from random import seed, randint
# DataFrame
ts_1h = pd.date_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in ts_1h]
df = pd.DataFrame({'Values' : values}, index=ts_1h)
result = df['Values'].rolling(3).agg({'first': lambda rows: rows.iloc[0], 'last': lambda rows: rows.iloc[-1]})
print(result)
def first(rows):
    return rows.iloc[0]

def last(rows):
    return rows.iloc[-1]

result = df['Values'].rolling(3).agg({'first': first, 'last': last})
print(result)
In case anyone else needs the difference between the first and last value in a rolling window: I used this on stock market data, where I wanted to know the price difference from the beginning to the end of the window. So I created a new column that subtracts the 'open' value taken from 60 rows above (via .shift()) from the current row's 'close' value:
df[windowColumn] = df["close"] - (df["open"].shift(60))
I think it's a very quick method for large datasets.
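The shift trick generalizes: for a window of size n, the first element of each window is just the value from n-1 rows earlier, so the rolling agg above can be replaced by a much faster shift. A small sketch with made-up numbers:

```python
import pandas as pd

s = pd.Series([2, 9, 1, 4, 1, 7], dtype=float)

n = 3
first = s.shift(n - 1)        # first value of each n-row window
last = s                      # last value is the current row itself
rolling_first = s.rolling(n).agg(lambda rows: rows.iloc[0])

print(first.equals(rolling_first))  # True
```

Both approaches produce NaN for the first n-1 rows, where no full window exists yet.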

pandas group by date, assign value to a column

I have a DataFrame with columns = ['date', 'id', 'value'], where id represents different products; assume we have n products. I am looking to create a new dataframe with columns = ['date', 'valueid1', ..., 'valueidn'], where values are assigned to the corresponding date row if they exist, and NaN is assigned if they don't. Many thanks.
Assuming you have the following DF:
In [120]: df
Out[120]:
date id value
0 2001-01-01 1 10
1 2001-01-01 2 11
2 2001-01-01 3 12
3 2001-01-02 3 20
4 2001-01-03 1 20
5 2001-01-04 2 30
you can use pivot_table() method:
In [121]: df.pivot_table(index='date', columns='id', values='value')
Out[121]:
id 1 2 3
date
2001-01-01 10.0 11.0 12.0
2001-01-02 NaN NaN 20.0
2001-01-03 20.0 NaN NaN
2001-01-04 NaN 30.0 NaN
or
In [122]: df.pivot_table(index='date', columns='id', values='value', fill_value=0)
Out[122]:
id 1 2 3
date
2001-01-01 10 11 12
2001-01-02 0 0 20
2001-01-03 20 0 0
2001-01-04 0 30 0
I think you need pivot:
df = df.pivot(index='date', columns='id', values='value')
Sample:
df = pd.DataFrame({'date':pd.date_range('2017-01-01', periods=5),
'id':[4,5,6,4,5],
'value':[7,8,9,1,2]})
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-04 4 1
4 2017-01-05 5 2
df = df.pivot(index='date', columns='id', values='value')
#alternative solution
#df = df.set_index(['date','id'])['value'].unstack()
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-04 1.0 NaN NaN
2017-01-05 NaN 2.0 NaN
but if you get:
ValueError: Index contains duplicate entries, cannot reshape
it is necessary to use an aggregating function like mean, sum, etc. with groupby or pivot_table:
df = pd.DataFrame({'date':['2017-01-01', '2017-01-02',
'2017-01-03','2017-01-05','2017-01-05'],
'id':[4,5,6,4,4],
'value':[7,8,9,1,2]})
df.date = pd.to_datetime(df.date)
print (df)
date id value
0 2017-01-01 4 7
1 2017-01-02 5 8
2 2017-01-03 6 9
3 2017-01-05 4 1 <- duplicity 2017-01-05 4
4 2017-01-05 4 2 <- duplicity 2017-01-05 4
df = df.groupby(['date', 'id'])['value'].mean().unstack()
#alternative solution (another answer same as groupby only slowier in big df)
#df = df.pivot_table(index='date', columns='id', values='value', aggfunc='mean')
print (df)
id 4 5 6
date
2017-01-01 7.0 NaN NaN
2017-01-02 NaN 8.0 NaN
2017-01-03 NaN NaN 9.0
2017-01-05 1.5 NaN NaN <- 1.5 is mean (1 + 2)/2
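If summing the duplicates is more appropriate than averaging, the same pattern works with a different aggfunc — a sketch reusing the duplicated sample data from above:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03',
                                           '2017-01-05', '2017-01-05']),
                   'id': [4, 5, 6, 4, 4],
                   'value': [7, 8, 9, 1, 2]})

wide = df.pivot_table(index='date', columns='id', values='value', aggfunc='sum')
print(wide.loc[pd.Timestamp('2017-01-05'), 4])  # 3.0 — the duplicates 1 + 2 are summed
```

Any reducing function accepted by groupby ('min', 'max', 'count', ...) can be used the same way.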

Assistance needed in python pandas to reduce lines of code and cycle time

I have a DF where I am calculating and filling the EMI value in these fields:
account Total Start Date End Date EMI
211829 107000 05/19/17 01/22/19 5350
320563 175000 08/04/17 10/30/18 12500
648336 246000 02/26/17 08/25/19 8482.7586206897
109996 175000 11/23/17 11/27/19 7291.6666666667
121213 317000 09/07/17 04/12/18 45285.7142857143
Then, based on the date range, I create new fields like Jan17, Feb17, Mar17, etc. and fill them with the code below.
jant17 = pd.to_datetime('2017-01-01')
febt17 = pd.to_datetime('2017-02-01')
mart17 = pd.to_datetime('2017-03-01')
jan17 = pd.to_datetime('2017-01-31')
feb17 = pd.to_datetime('2017-02-28')
mar17 = pd.to_datetime('2017-03-31')
df.ix[(df['Start Date'] <= jan17) & (df['End Date'] >= jant17) , 'Jan17'] = df['EMI']
But the drawback is that when I have to forecast until 2019 or 2020, there are too many lines of code to write, and when there is any update I need to modify too many lines. To reduce the lines of code I tried an alternate method using for loops, but the code started taking very long to execute.
monthend = {'Jan17': pd.to_datetime('2017-01-31'),
            'Feb17': pd.to_datetime('2017-02-28'),
            'Mar17': pd.to_datetime('2017-03-31')}
monthbeg = {'Jant17': pd.to_datetime('2017-01-01'),
            'Febt17': pd.to_datetime('2017-02-01'),
            'Mart17': pd.to_datetime('2017-03-01')}
for mend in monthend.values():
    for mbeg in monthbeg.values():
        for coln in colnames:
            df.ix[(df['Start Date'] <= mend) & (df['End Date'] >= mbeg), coln] = df['EMI']
This greatly reduced the number of lines of code but increased the execution time from 3-4 minutes to over an hour. Is there a better way to code this with fewer lines and less processing time?
I think you can create a helper df with the start dates, end dates and column names, loop over its rows and create the new columns of the original df:
dates = pd.DataFrame({'start':pd.date_range('2017-01-01', freq='MS', periods=10),
'end':pd.date_range('2017-01-01', freq='M', periods=10)})
dates['names'] = dates.start.dt.strftime('%b%y')
print (dates)
end start names
0 2017-01-31 2017-01-01 Jan17
1 2017-02-28 2017-02-01 Feb17
2 2017-03-31 2017-03-01 Mar17
3 2017-04-30 2017-04-01 Apr17
4 2017-05-31 2017-05-01 May17
5 2017-06-30 2017-06-01 Jun17
6 2017-07-31 2017-07-01 Jul17
7 2017-08-31 2017-08-01 Aug17
8 2017-09-30 2017-09-01 Sep17
9 2017-10-31 2017-10-01 Oct17
#if necessary convert to datetimes
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
def f(x):
    df.loc[(df['Start Date'] <= x.start) & (df['End Date'] >= x.end), x.names] = df['EMI']

dates.apply(f, axis=1)
print (df)
account Total Start Date End Date EMI Jan17 Feb17 \
0 211829 107000 2017-05-19 2019-01-22 5350.000000 NaN NaN
1 320563 175000 2017-08-04 2018-10-30 12500.000000 NaN NaN
2 648336 246000 2017-02-26 2019-08-25 8482.758621 NaN NaN
3 109996 175000 2017-11-23 2019-11-27 7291.666667 NaN NaN
4 121213 317000 2017-09-07 2018-04-12 45285.714286 NaN NaN
Mar17 Apr17 May17 Jun17 Jul17 \
0 NaN NaN NaN 5350.000000 5350.000000
1 NaN NaN NaN NaN NaN
2 8482.758621 8482.758621 8482.758621 8482.758621 8482.758621
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
Aug17 Sep17 Oct17
0 5350.000000 5350.000000 5350.000000
1 NaN 12500.000000 12500.000000
2 8482.758621 8482.758621 8482.758621
3 NaN NaN NaN
4 NaN NaN 45285.714286
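A self-contained sketch of the helper-df approach, with hypothetical sample rows, using pd.offsets.MonthEnd to derive each month's end from its start (so it does not rely on a second frequency alias):

```python
import pandas as pd

df = pd.DataFrame({'account': [211829, 648336],
                   'Start Date': pd.to_datetime(['2017-05-19', '2017-02-26']),
                   'End Date': pd.to_datetime(['2019-01-22', '2019-08-25']),
                   'EMI': [5350.0, 8482.76]})

# helper frame: one row per calendar month
starts = pd.date_range('2017-01-01', freq='MS', periods=6)
dates = pd.DataFrame({'start': starts,
                      'end': starts + pd.offsets.MonthEnd(0)})
dates['names'] = dates['start'].dt.strftime('%b%y')

# one vectorized assignment per month instead of per account
for row in dates.itertuples():
    active = (df['Start Date'] <= row.end) & (df['End Date'] >= row.start)
    df.loc[active, row.names] = df['EMI']

print(df[['account', 'Jan17', 'Mar17', 'May17']])
```

Extending the forecast horizon then only means changing `periods`, not adding code.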

Removing incomplete seasons from multi-index dataframe (pandas)

I am trying to apply the method from here to a multi-index dataframe, but it doesn't seem to work.
Take a data-frame:
import pandas as pd
import numpy as np
dates = pd.date_range('20070101',periods=3200)
df = pd.DataFrame(data=np.random.randint(0,100,(3200,1)), columns =list('A'))
df.loc[5:13, 'A'] = np.nan  # add missing data points
df['date'] = dates
df = df[['date','A']]
Define a season function based on the month of the date column
def get_season(row):
    if row['date'].month >= 3 and row['date'].month <= 5:
        return '2'
    elif row['date'].month >= 6 and row['date'].month <= 8:
        return '3'
    elif row['date'].month >= 9 and row['date'].month <= 11:
        return '4'
    else:
        return '1'
Apply the function
df['Season'] = df.apply(get_season, axis=1)
Create a 'Year' column for indexing
df['Year'] = df['date'].dt.year
Multi-index by Year and Season
df = df.set_index(['Year', 'Season'], inplace=False)
Count datapoints in each season
count = df.groupby(level=[0, 1]).count()
Drop the seasons with fewer than 75 days in them
count = count.drop(count[count.A < 75].index)
Create a variable for seasons with more than 75 days
complete = count[count['A'] >= 75].index
The isin function turns up False for everything, while I want it to select all the seasons that have more than 75 days of valid data in 'A'
df = df.isin(complete)
df
Every value comes up false, and I can't see why.
I hope this is concise enough, I need this to work on a multi-index using seasons so I included it!
EDIT
Another method based on multi-index reindexing not working (which also produces a blank dataframe) from here
df3 = df.reset_index().groupby('Year').apply(lambda x: x.set_index('Season').reindex(count,method='pad'))
EDIT 2
Also tried this
seasons = count[count['A'] >= 75].index
df = df[df['A'].isin(seasons)]
Again, blank output
I think you can use Index.isin:
complete = count[count['A'] >= 75].index
idx = df.index.isin(complete)
print(idx)
[ True  True  True ..., False False False]
print(df[idx])
date A
Year Season
2007 1 2007-01-01 24.0
1 2007-01-02 92.0
1 2007-01-03 54.0
1 2007-01-04 91.0
1 2007-01-05 91.0
1 2007-01-06 NaN
1 2007-01-07 NaN
1 2007-01-08 NaN
1 2007-01-09 NaN
1 2007-01-10 NaN
1 2007-01-11 NaN
1 2007-01-12 NaN
1 2007-01-13 NaN
1 2007-01-14 NaN
1 2007-01-15 18.0
1 2007-01-16 82.0
1 2007-01-17 55.0
1 2007-01-18 64.0
1 2007-01-19 89.0
1 2007-01-20 37.0
1 2007-01-21 45.0
1 2007-01-22 4.0
1 2007-01-23 34.0
1 2007-01-24 35.0
1 2007-01-25 90.0
1 2007-01-26 17.0
1 2007-01-27 29.0
1 2007-01-28 58.0
1 2007-01-29 7.0
1 2007-01-30 57.0
... ... ...
2015 3 2015-08-02 42.0
3 2015-08-03 0.0
3 2015-08-04 31.0
3 2015-08-05 39.0
3 2015-08-06 25.0
3 2015-08-07 1.0
3 2015-08-08 7.0
3 2015-08-09 97.0
3 2015-08-10 38.0
3 2015-08-11 59.0
3 2015-08-12 28.0
3 2015-08-13 84.0
3 2015-08-14 43.0
3 2015-08-15 63.0
3 2015-08-16 68.0
3 2015-08-17 0.0
3 2015-08-18 19.0
3 2015-08-19 61.0
3 2015-08-20 11.0
3 2015-08-21 84.0
3 2015-08-22 75.0
3 2015-08-23 37.0
3 2015-08-24 40.0
3 2015-08-25 66.0
3 2015-08-26 50.0
3 2015-08-27 74.0
3 2015-08-28 37.0
3 2015-08-29 19.0
3 2015-08-30 25.0
3 2015-08-31 15.0
[3106 rows x 2 columns]
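An alternative that skips building the count frame entirely, assuming the same (Year, Season) MultiIndex idea: GroupBy.filter keeps only the groups whose non-null count passes the threshold. A sketch with synthetic data:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
dates = pd.date_range('2007-01-01', periods=400)
df = pd.DataFrame({'date': dates,
                   'A': np.random.randint(0, 100, 400).astype(float)})
df.loc[5:13, 'A'] = np.nan  # some missing data points

# season '1' = Dec-Feb, '2' = Mar-May, '3' = Jun-Aug, '4' = Sep-Nov
season = ((df['date'].dt.month % 12) // 3 + 1).astype(str)
df = df.set_index([df['date'].dt.year.rename('Year'), season.rename('Season')])

# keep only (Year, Season) groups with at least 75 non-null 'A' values
complete = df.groupby(level=['Year', 'Season']).filter(lambda g: g['A'].count() >= 75)
print(len(complete), 'of', len(df), 'rows kept')
```

Note this season key groups December with the following months of the same calendar year, matching the question's get_season labels.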
