Error when writing a function to create new pandas column - python

I have a dataframe
id  | start | stop | join_date
233 |   0   |  12  | 2015-01-01
234 |   0   |  12  | 2013-03-04
235 |  10   |  23  | 2014-01-10
GOAL:
I want to create another column, stop_date, that offsets the join_date based on whether or not start is 0.
If start is 0, then stop_date is the join_date offset by the months in stop.
If start is not 0, then stop_date is the join_date offset by the months in stop plus the months in start.
I wrote the following function:
def stop_date(x):
    if x['start'] == 0:
        return x['join_date'] + x['stop'].astype('timedelta64[M]')
    elif x['start'] != 0:
        return x['join_date'] + x['start'].astype('timedelta64[M]') + x['stop'].astype('timedelta64[M]')
    else:
        return x
I tried to apply it to the dataframe with:
df['stop_date'] = df.apply(stop_date, axis = 1)
I keep getting an error: AttributeError: ("'int' object has no attribute 'astype'", 'occurred at index 0')
I cannot figure out how to achieve this.

Because when start is 0, adding start and stop doesn't change the number of months to add, you can sum both columns, convert the sum with astype, and add it to join_date:
df['stop_date'] = (pd.to_datetime(df['join_date'])
                   + df[['start', 'stop']].sum(axis=1).astype('timedelta64[M]')
                   ).dt.date
print (df)
id start stop join_date stop_date
0 233 0 12 2015-01-01 2016-01-01
1 234 0 12 2013-03-04 2014-03-04
2 235 10 23 2014-01-10 2016-10-10
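One caveat (not from the original answer): numpy's timedelta64[M] is an average-length month, and recent pandas versions have removed the integer-to-timedelta64[M] cast, so the astype line may raise there. A calendar-exact sketch using pd.DateOffset instead, under that assumption:
# Hypothetical alternative: calendar-exact month offsets via pd.DateOffset.
# Assumes the same df as above.
df['stop_date'] = [d + pd.DateOffset(months=int(m))
                   for d, m in zip(pd.to_datetime(df['join_date']),
                                   df['start'] + df['stop'])]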

Convert the columns to the desired dtype before you apply the function. Inside apply, x['stop'] is a scalar of the column's datatype (e.g., 12), so it has no DataFrame or Series methods such as astype.
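A minimal sketch of that advice (an adaptation, not the answerer's code: join_date is converted once up front, and pd.DateOffset handles the month arithmetic instead of the fragile integer-to-timedelta cast):
import pandas as pd

df['join_date'] = pd.to_datetime(df['join_date'])  # convert once, before apply

def stop_date(row):
    # start is 0 when unused, so start + stop covers both branches
    return row['join_date'] + pd.DateOffset(months=int(row['start'] + row['stop']))

df['stop_date'] = df.apply(stop_date, axis=1)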

Related

Pandas - compare day and month only against a datetime?

I want to compare a timestamp of datatype datetime64[ns] with a datetime.date, and I only want the comparison to be based on day and month.
df
timestamp last_price
0 2023-01-22 14:15:06.033314 100.0
1 2023-01-25 14:15:06.213591 101.0
2 2023-01-30 14:15:06.313554 102.0
3 2023-03-31 14:15:07.018540 103.0
cu_date = datetime.datetime.now().date()
cu_year = cu_date.year
check_end_date = datetime.datetime.strptime(f'{cu_year}-11-05', '%Y-%m-%d').date()
check_start_date = datetime.datetime.strptime(f'{cu_year}-03-12', '%Y-%m-%d').date()
# this is incorrect as the day can be greater than check_start_date while the month might be less.
daylight_off_df = df.loc[((df.timestamp.dt.month >= check_end_date.month) & (df.timestamp.dt.day >= check_end_date.day)) |
                         ((df.timestamp.dt.month <= check_start_date.month) & (df.timestamp.dt.day <= check_start_date.day))]
daylight_on_df = df.loc[((df.timestamp.dt.month <= check_end_date.month) & (df.timestamp.dt.day <= check_end_date.day)) &
                        ((df.timestamp.dt.month >= check_start_date.month) & (df.timestamp.dt.day >= check_start_date.day))]
I am trying to come up with the logic to do this, but failing.
Expected output:
daylight_off_df
timestamp last_price
0 2023-01-22 14:15:06.033314 100.0
1 2023-01-25 14:15:06.213591 101.0
2 2023-01-30 14:15:06.313554 102.0
daylight_on_df
timestamp last_price
3 2023-03-31 14:15:07.018540 103.0
In summary: separate the dataframe by a day-and-month comparison while ignoring the year.
I would break out these values and then just query:
df['day'] = df['timestamp'].dt.day_name()
df['month'] = df['timestamp'].dt.month_name()
then whatever you're looking for:
df.groupby('month').mean()
The following expressions could be helpful if you don't want an additional column in your table:
check_end_date.timetuple().tm_yday # returns day of the year
#output 309
check_start_date.timetuple().tm_yday
#output 71
df['timestamp'].dt.is_leap_year.astype(int) # returns 1 if the year is a leap year
#output 0 | 1
df['timestamp'].dt.dayofyear #returns day of the year
#output
#0 22
#1 25
#2 30
#3 90
df['timestamp'].dt.dayofyear.between(a,b) #returns true if day is between a,b
There are several possible solutions now; I think using between is the nicest-looking one.
daylight_on_df4 = df.loc[df['timestamp'].dt.dayofyear.between(
    check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int),
    check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int))]
daylight_off_df4 = df.loc[~df['timestamp'].dt.dayofyear.between(
    check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int),
    check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int))]
or the code could look like this:
daylight_on_df3 = df.loc[((check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) - df['timestamp'].dt.dayofyear > 0)
                         & (df['timestamp'].dt.dayofyear - (df['timestamp'].dt.is_leap_year.astype(int) + check_start_date.timetuple().tm_yday) > 0)]
daylight_off_df3 = df.loc[((check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) - df['timestamp'].dt.dayofyear < 0)
                          | (df['timestamp'].dt.dayofyear - (check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) < 0)]
All daylight_on/off are doing now is checking whether the day of the year falls in between your ranges or not (leap years included).
This formula would probably have to be rewritten if your start date / end date crossed a year boundary (e.g. 2022-11-19, 2023-02-22), but I think it provides the general idea.
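For completeness, a hedged sketch of that year-crossing case (not from the original answer; the leap-year adjustment is omitted for brevity): when the start day-of-year is greater than the end day-of-year, the range wraps, so the membership test inverts.
# Sketch: day-of-year window that also handles ranges crossing a year
# boundary (e.g. Nov 19 .. Feb 22). Assumes df and the check dates above.
start_doy = check_start_date.timetuple().tm_yday
end_doy = check_end_date.timetuple().tm_yday
doy = df['timestamp'].dt.dayofyear

if start_doy <= end_doy:
    in_range = doy.between(start_doy, end_doy)
else:
    # wrapped window: inside if on/after the start OR on/before the end
    in_range = (doy >= start_doy) | (doy <= end_doy)

daylight_on_df = df.loc[in_range]
daylight_off_df = df.loc[~in_range]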

Converting Object to Time in python Pandas

I have a dataset that contains a column with "Time" values, but it's showing as object, and I want to convert them to time so I can do a for loop to see if each time is between two times.
for i in df['Time']:
    if i >= dt.time(21,0,0) and i <= dt.time(7,30,0) or i >= dt.time(3,0,0) and i <= dt.time(10,0,0) or i >= dt.time(10,30,0) and i <= dt.time(14,0,0):
        df['In/Out'] = 'In'
    else:
        df['In/Out'] = 'Out'
I want the code to set the value in a new column to "In" if the time is between two times.
The first pair is (21:00) & (07:30), the second is (03:00) & (10:00), and the third is (10:30) & (14:00).
If the time is not in those ranges, it should set the value in the new column to "Out".
You can simplify the first two ranges:
(21:00) & (07:30) and (03:00) & (10:00)
to one overnight range:
(21:00) & (10:00)
so the solution is to use Series.between with numpy.where:
import datetime as dt
import numpy as np
import pandas as pd

df = pd.DataFrame({'Time':['0:01:00','8:01:00','2021-08-13 10:19:10','12:01:00',
                           '14:01:00','18:01:01','23:01:00']})
df['Time'] = pd.to_datetime(df['Time']).dt.time

m = (df['Time'].between(dt.time(21,0,0), dt.time(23,59,59)) |  # through end of day
     df['Time'].between(dt.time(0,0,0), dt.time(10,0,0)) |
     df['Time'].between(dt.time(10,30,0), dt.time(14,0,0)))
df['In/Out'] = np.where(m, 'In', 'Out')
print (df)
Time In/Out
0 00:01:00 In
1 08:01:00 In
2 10:19:10 Out
3 12:01:00 In
4 14:01:00 Out
5 18:01:01 Out
6 23:01:00 In
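Why the overnight range is split in two: Series.between(left, right) assumes left <= right, so an interval crossing midnight has to be expressed as (21:00, 23:59:59) plus (00:00, 10:00). If you would rather keep the original three ranges, a small hypothetical helper (a sketch, assuming the df and imports above) can hide the split:
# Sketch: window test that tolerates ranges wrapping past midnight.
def in_window(s, start, end):
    if start <= end:
        return s.between(start, end)
    return (s >= start) | (s <= end)   # overnight window wraps past midnight

m = (in_window(df['Time'], dt.time(21,0,0), dt.time(7,30,0)) |
     in_window(df['Time'], dt.time(3,0,0), dt.time(10,0,0)) |
     in_window(df['Time'], dt.time(10,30,0), dt.time(14,0,0)))
df['In/Out'] = np.where(m, 'In', 'Out')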

Dataframe column: locating min and max values depending on an ID

I'm wondering how to optimize a part of code to remove a loop which takes forever since I have around 350 000 IDs.
Here is the current code, which is not optimal and takes quite a while.
I'm trying to get it working better and if possible removing a loop.
The dataset is made of 4 columns: IDs, start_dates, end_dates, and amounts. We can have multiple rows with the same ID but not the same amount. The main thing is that in some rows the dates are not saved in the dataset. In that case we have to find the earliest start_date and the latest end_date for that ID and add them to the row where they are missing.
ID  start_date  end_date    value
ABC 12/10/2010  12/12/2020  8
ABC 01/01/2020  01/04/2021  9
ABC                         43
BCD 14/02/2020  14/03/2020  8
So on the third row we should have 12/10/2010 as the start_date and 01/04/2021 as the end_date. You can't see it in this sample, but don't forget that BCD's start_date could be earlier than ABC's; you would still use 12/10/2010 because it is linked to the ID.
for x in df['ID'].unique():
    tmp = df.loc[df['ID'] == x].reset_index()
    df.loc[(df['ID'] == x) & (df['start_date'].isna()), 'start_date'] = tmp['start_date'].min()
    df.loc[(df['ID'] == x) & (df['end_date'].isna()), 'end_date'] = tmp['end_date'].max()
I suppose the code is quite clear about what I am trying to do.
But if you have any questions, don't hesitate to post them; I'll do my best to answer.
set up the job
import pandas as pd
data = {'ID': ['ABC', 'ABC', 'ABC', 'BCD'],
        'start_date': ['12/10/2010', '01/01/2020', None, '14/02/2020'],
        'end_date': ['12/12/2020', '01/01/2021', None, '14/03/2020'],
        'value': [8, 9, 43, 8]}
df = pd.DataFrame(data)
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
we get this result
ID start_date end_date value
0 ABC 2010-12-10 2020-12-12 8
1 ABC 2020-01-01 2021-01-01 9
2 ABC NaT NaT 43
3 BCD 2020-02-14 2020-03-14 8
do the work
df.start_date = df.groupby('ID')['start_date'].apply(lambda x: x.fillna(x.min()))
df.end_date = df.groupby('ID')['end_date'].apply(lambda x: x.fillna(x.max()))
we get this result
ID start_date end_date value
0 ABC 2010-12-10 2020-12-12 8
1 ABC 2020-01-01 2021-01-01 9
2 ABC 2010-12-10 2021-01-01 43
3 BCD 2020-02-14 2020-03-14 8
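A note on newer pandas versions (an addition, not part of the original answer): groupby(...).apply returning a Series can come back with a group-keyed MultiIndex that no longer aligns with the original rows. A transform-based equivalent sidesteps that and gives the same result on the sample above:
# Sketch: fill missing dates from the per-ID min/max; transform keeps the original index.
df['start_date'] = df['start_date'].fillna(df.groupby('ID')['start_date'].transform('min'))
df['end_date'] = df['end_date'].fillna(df.groupby('ID')['end_date'].transform('max'))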

How to count business days per month for the whole year with different weekmask every week?

I'm trying to make a program that will equally distribute employees' days off. There are 4 groups and each group has its own weekmask for each week of the month. So far I've written code that changes the weekmask when it locates a 0 in the DataFrame (Sunday). I'm stuck on structuring the np.busday_count(start, end, weekmask=) call so that the start and end dates change automatically.
My DataFrame was shown as an image in the original post. Here's my code:
a: int = 0
week_mask: str = '1100111'

def _change_week_mask():
    global a, week_mask
    a += 1
    if a == 1:
        week_mask = '1111000'
    elif a == 2:
        week_mask = '1111111'
    elif a == 3:
        week_mask = '0011111'
    else:
        a = 0

for line in rows['Workday']:
    if line == '0':
        _change_week_mask()
Edit: changed the value of start week from 6 to 0.
Ok, so to answer your problem I have created a sample data frame with the code below, and added the following columns to it:
dayofweek - to get data similar to what you created by setting every Sunday to zero; here Monday is set as zero and Sunday as six.
weeknum - week of the year.
week - instead of counting and then changing the week mask, I have assigned each row a week value from 0 to 3, and the mask is looked up from it.
weekmask - the mask calculated from the value of week; you might need to align this with your own logic.
weekenddate - the end date, calculated by adding 7 days to the start date; if the month changes mid-week, this holds the month-end date instead.
After this we can create a new data frame holding only the end-of-week entries (Monday is 0 here, so I have taken 0). Then you can apply the function and store the result in the data frame.
import datetime
import pandas as pd
import numpy as np
df_ = pd.DataFrame({'startdate':pd.date_range(pd.to_datetime('2018-10-01'), pd.to_datetime('2018-11-30'))})
df_['dayofweek'] = df_.startdate.dt.dayofweek
df_['remaining_days_in_month'] = df_.startdate.dt.days_in_month - df_.startdate.dt.day
df_['week'] = df_.startdate.dt.isocalendar().week % 4  # Series.dt.week was removed in pandas 2.x
df_['day'] = df_.startdate.dt.day
df_['weekmask'] = df_.week.map({0 : '1100111', 1 : '1111000' , 2 : '1111111', 3: '0011111'})
df_['weekenddate'] = [x[0] + datetime.timedelta(days=(7-x[1])) if x[2] > 7-x[1] else x[0] + datetime.timedelta(days=(x[2])) for x in df_[['startdate','dayofweek','remaining_days_in_month']].values]
final_df = df_[(df_['dayofweek']==0) | ( df_['day']==1)][['startdate','weekenddate','weekmask']]
final_df['numberofdays'] = [ np.busday_count((x[0]).astype('<M8[D]'), x[1].astype('<M8[D]'), weekmask=x[2]) for x in final_df.values.astype(str)]
Output:
startdate weekenddate weekmask numberofdays
0 2018-10-01 2018-10-08 1100111 5
7 2018-10-08 2018-10-15 1111000 4
14 2018-10-15 2018-10-22 1111111 7
21 2018-10-22 2018-10-29 0011111 5
28 2018-10-29 2018-10-31 1100111 2
31 2018-11-01 2018-11-05 1100111 3
35 2018-11-05 2018-11-12 1111000 4
42 2018-11-12 2018-11-19 1111111 7
49 2018-11-19 2018-11-26 0011111 5
56 2018-11-26 2018-11-30 1100111 2
let me know if this needs some changes as per your requirement.
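If business days per month is the figure you ultimately need (a small follow-up sketch, not part of the original answer; assumes final_df from above), the weekly counts roll up cleanly because the rows are already split at month boundaries:
# Sketch: aggregate the weekly counts into business days per month.
per_month = final_df.groupby(final_df.startdate.dt.to_period('M'))['numberofdays'].sum()
print(per_month)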

Python performance improvements and coding style

Question
Let's assume the following sparse table is given indicating the listing of a security on an index.
identifier from thru
AAPL 1964-03-31 --
ABT 1999-01-03 2003-12-31
ABT 2005-12-31 --
AEP 1992-01-15 2017-08-31
KO 2014-12-31 --
ABT, for example, is on the index from 1999-01-03 to 2003-12-31 and again from 2005-12-31 until today (-- indicates today). In between, it is not listed on the index.
How can I efficiently transform this sparse table to a dense table of the following form
date AAPL ABT AEP KO
1964-03-31 1 0 0 0
1964-04-01 1 0 0 0
... ... ... ... ...
1999-01-03 1 1 1 0
1999-01-04 1 1 1 0
... ... ... ... ...
2003-12-31 1 1 1 0
2004-01-01 1 0 1 0
... ... ... ... ...
2017-09-04 1 1 0 1
In the section My Solution you will find my solution to the problem. Unfortunately, the code seems to perform very badly. It took about 22 seconds to process 1648 entries.
As I am new to python I wondered how to efficiently program problems like these.
I do not expect anyone to provide me with a solution to my problem (unless you wish to do so). My primary goal is to understand how to efficiently solve problems like these in Python. I used the functionality of pandas to match the respective entries. Should I have used numpy and indexing instead? Should I have used other toolboxes? How can I gain performance improvements?
Please find my approach to this problem in the section below (if it is of interest to you).
Thank you very much for your help
My Solution
I have tried to resolve the problem by looping through every row entry in the first table. During every single loop, I specify a Boolean matrix for the specific from-thru-interval with all elements set to True. This matrix is appended to a list. At the end, I pd.concat the list and unstack and reindex the resulting DataFrame.
import pandas as pd
import numpy as np


def get_ts_data(data, start_date, end_date, attribute=None, identifier=None, frequency=None):
    """
    Transform sparse table to dense table.

    Parameters
    ----------
    data: pd.DataFrame
        sparse table with minimal column specification ['identifier', 'from', 'thru']
    start_date: pd.Timestamp, str
        start date of the dense matrix
    end_date: pd.Timestamp, str
        end date of the dense matrix
    attribute: str
        column name of the value of the dense matrix
    identifier: str
        column name of the identifier
    frequency: str
        frequency of the dense matrix
    kwargs:
        Allows overwriting the naming of the 'from' and 'thru' variables,
        e.g. {'from': 'start', 'thru': 'end'}

    Returns
    -------
    """
    if attribute is None:
        attribute = ['on_index']
    elif not isinstance(attribute, list):
        attribute = [attribute]
    if identifier is None:
        identifier = ['identifier']
    elif not isinstance(identifier, list):
        identifier = [identifier]
    if frequency is None:
        frequency = 'B'

    # copy data for security reasons
    data_mod = data.copy()
    data_mod['on_index'] = True

    # specify start date and check type
    if not isinstance(start_date, pd.Timestamp):
        start_date = pd.Timestamp(start_date)
    # specify end date and check type
    if not isinstance(end_date, pd.Timestamp):
        end_date = pd.Timestamp(end_date)
    # specify output date range
    date_range = pd.date_range(start_date, end_date, freq=frequency)

    # overwrite null 'thru' values, indicating the entry is valid until today
    missing = data_mod['thru'].isnull()
    data_mod.loc[missing, 'thru'] = data_mod.loc[missing, 'from'].apply(lambda d: max(d, end_date))

    # preallocate frms
    frms = []
    # add dataframe to frms with time specific entries
    for index, row in data_mod.iterrows():
        # date range index
        d_range = pd.date_range(row['from'], row['thru'], freq=frequency)
        # multi index with date and identifier
        d_index = pd.MultiIndex.from_product([d_range] + [[x] for x in row[identifier]],
                                             names=['date'] + identifier)
        # add DataFrame with repeated values to list
        frms.append(pd.DataFrame(data=np.repeat(row[attribute].values, d_index.size),
                                 index=d_index, columns=attribute))

    out_frame = pd.concat(frms)
    out_frame = out_frame.unstack(identifier)
    out_frame = out_frame.reindex(date_range)
    return out_frame


if __name__ == "__main__":
    data = pd.DataFrame({'identifier': ['AAPL', 'ABT', 'ABT', 'AEP', 'KO'],
                         'from': [pd.Timestamp('1964-03-31'),
                                  pd.Timestamp('1999-01-03'),
                                  pd.Timestamp('2005-12-31'),
                                  pd.Timestamp('1992-01-15'),
                                  pd.Timestamp('2014-12-31')],
                         'thru': [np.nan,
                                  pd.Timestamp('2003-12-31'),
                                  np.nan,
                                  pd.Timestamp('2017-08-31'),
                                  np.nan]})
    transformed_data = get_ts_data(data, start_date='1964-03-31', end_date='2017-09-04',
                                   attribute='on_index', identifier='identifier', frequency='B')
    print(transformed_data)
# Ensure dates are Pandas timestamps.
df['from'] = pd.DatetimeIndex(df['from'])
df['thru'] = pd.DatetimeIndex(df['thru'].replace('--', np.nan))

# Get sorted list of all unique dates and create index for full range.
dates = sorted(set(df['from'].tolist() + df['thru'].dropna().tolist()))
dti = pd.date_range(start=dates[0], end=dates[-1], freq='B')  # business-day index over the full range

# Create new target dataframe based on symbols and full date range. Initialize to zero.
df2 = pd.DataFrame(0, columns=df['identifier'].unique(), index=dti)

# Find all active symbols and set their values to one from their respective `from` dates.
for _, row in df[df['thru'].isnull()].iterrows():
    df2.loc[df2.index >= row['from'], row['identifier']] = 1

# Find all other symbols and set their values to one between their respective `from` and `thru` dates.
for _, row in df[df['thru'].notnull()].iterrows():
    df2.loc[(df2.index >= row['from']) & (df2.index <= row['thru']), row['identifier']] = 1
>>> df2.head(3)
AAPL ABT AEP KO
1964-03-31 1 0 0 0
1964-04-01 1 0 0 0
1964-04-02 1 0 0 0
>>> df2.tail(3)
AAPL ABT AEP KO
2017-08-29 1 1 1 1
2017-08-30 1 1 1 1
2017-08-31 1 1 1 1
>>> df2.loc[:'2004-01-02', 'ABT'].tail()
2003-12-29 1
2003-12-30 1
2003-12-31 1
2004-01-01 0
2004-01-02 0
Freq: B, Name: ABT, dtype: int64
>>> df2.loc['2005-12-30':, 'ABT'].head(3)
2005-12-30 0
2006-01-02 1
2006-01-03 1
Freq: B, Name: ABT, dtype: int64
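A further idea in the spirit of the question (a hedged sketch, not from either solution above; assumes df and dti as defined in the previous answer): both row loops can be replaced by broadcasting every date against every (from, thru) interval at once and collapsing duplicate identifiers with a max.
# Sketch: fully vectorized membership test via one broadcast comparison.
starts = df['from'].values[None, :]                # shape (1, n_rows)
ends = df['thru'].fillna(dti[-1]).values[None, :]  # open-ended listings run to the last date
active = (dti.values[:, None] >= starts) & (dti.values[:, None] <= ends)

df2 = pd.DataFrame(active.astype(int), index=dti, columns=df['identifier'])
df2 = df2.T.groupby(level=0).max().T               # merge the two ABT listing periods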
