New pandas DataFrame column from datetime calculation - python

I am trying to calculate the number of days that have elapsed since the launch of a marketing campaign. I have one row per date for each marketing campaign in my DataFrame (df) and all dates start from the same day (though there is not a data point for each day for each campaign). In column 'b' I have the date relating to the data points of interest (datetime64[ns]) and in column 'c' I have the launch date of the marketing campaign (datetime64[ns]). I would like the resulting calculation to return n/a (or np.NaN or a suitable alternative) when column 'b' is earlier than column 'c'; otherwise I would like the calculation to return the difference between the two dates.
Campaign  Date        Launch Date  Desired Column
A         2019-09-01  2022-12-01   n/a
A         2019-09-02  2022-12-01   n/a
B         2019-09-01  2019-09-01   0
B         2019-09-25  2019-09-01   24
When I try:
df['Days Since Launch'] = df['Date'] - df['Launch Date']
What I would hope returns a negative value actually returns a positive one, thus leading to duplicate values when I have dates that are 10 days prior and 10 days after the launch date.
When I try:
df['Days Since Launch'] = np.where(df['Date'] < df['Launch Date'], XXX, df['Date'] - df['Launch Date'])
Where XXX has to be the same data type as the two input columns: I can't use np.NaN because the calculation will fail, nor can I use a date, as that still leaves the same issue I want to solve. if statements do not work because the "truth value of a Series is ambiguous". Any ideas?

You can use a direct subtraction and conversion to days with dt.days, then mask the negative values with where:
s = pd.to_datetime(df['Date']).sub(pd.to_datetime(df['Launch Date'])).dt.days
# or, if already datetime:
#s = df['Date'].sub(df['Launch Date']).dt.days
df['Desired Column'] = s.where(s.ge(0))
Alternative closer to your initial attempt, using mask:
df['Desired Column'] = (df['Date'].sub(df['Launch Date'])
                          .dt.days
                          .mask(df['Date'] < df['Launch Date'])
                       )
Output:
Campaign Date Launch Date Desired Column
0 A 2019-09-01 2022-12-01 NaN
1 A 2019-09-02 2022-12-01 NaN
2 B 2019-09-01 2019-09-01 0.0
3 B 2019-09-25 2019-09-01 24.0
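Note that where upcasts the result to float (NaN requires it). If you would rather keep whole-number days alongside missing values, a small optional tweak, assuming the same s as above, is pandas' nullable integer dtype:
df['Desired Column'] = s.where(s.ge(0)).astype('Int64')  # days stay integers, missing values become <NA>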

Add Series.dt.days to convert the timedeltas to days:
df['Days Since Launch'] = np.where(df['Date'] < df['Launch Date'],
                                   np.nan,
                                   (df['Date'] - df['Launch Date']).dt.days)
print (df)
Campaign Date Launch Date Desired Column Days Since Launch
0 A 2019-09-01 2022-12-01 NaN NaN
1 A 2019-09-02 2022-12-01 NaN NaN
2 B 2019-09-01 2019-09-01 0.0 0.0
3 B 2019-09-25 2019-09-01 24.0 24.0

Another alternative:
df["Date"] = pd.to_datetime(df["Date"])
df["Launch Date"] = pd.to_datetime(df["Launch Date"])
df["Desired Column"] = df.apply(lambda x: x["Date"] - x["Launch Date"] if x["Date"] >= x["Launch Date"] else None, axis=1)

Related

Insert new row in a DataFrame if condition is met

I have a dataframe with two date columns:
Brand Start Date Finish Date Check
0 1 2013-03-16 2014-03-02 Consecutive
1 2 2014-03-03 2015-09-05 Consecutive
2 3 2015-12-12 2016-12-12 Non Consecutive
3 4 2017-01-01 2017-06-01 Non consecutive
4 5 2017-06-02 2019-02-20 Consecutive
I created a new column (the Check column) indicating whether each row's start date is consecutive with the previous row's finish date, populated with 'Consecutive' and 'Non consecutive'.
I want to insert a new row wherever the value of the Check column is 'Non consecutive'. The inserted row should contain, as Start Date, the previous row's Finish Date + 1 day (consecutive with the previous row) and, as Finish Date, the current row's Start Date - 1 day (consecutive with the next row). So indexes 2 and 4 will be the new rows:
Brand Start Date Finish Date
0 1 2013-03-16 2014-03-02
1 2 2014-03-03 2015-09-05
2 3 2015-09-06 2015-12-11
3 3 2015-12-12 2016-12-12
4 4 2016-12-13 2016-12-31
5 4 2017-01-01 2017-06-01
6 5 2017-06-02 2019-02-20
How can I achieve this?
import pandas as pd
from datetime import datetime, timedelta

date_format = '%Y-%m-%d'
# assumes a boolean column 'Consecutive' (False where Check is 'Non consecutive')
rows = df.index[~df['Consecutive']]
df2 = pd.DataFrame(columns=df.columns, index=rows)
for row in rows:
    df2.loc[row, :] = df.loc[row, :].copy()
    # the gap row starts one day after the previous row's Finish Date ...
    df2.loc[row, 'Start Date'] = datetime.strftime(
        datetime.strptime(df.loc[row - 1, 'Finish Date'], date_format) + timedelta(days=1), date_format)
    # ... and ends one day before the current row's Start Date
    df2.loc[row, 'Finish Date'] = datetime.strftime(
        datetime.strptime(df.loc[row, 'Start Date'], date_format) - timedelta(days=1), date_format)
df3 = pd.concat([df, df2]).sort_values(['Brand', 'Start Date']).reset_index(drop=True)
This uses sorting to put the rows in the correct place. If your df is big, the sorting could be the slowest part, and you could consider inserting the rows one at a time into the correct place; see here.
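If you want to avoid the Python-level loop entirely, a vectorized sketch is also possible (again assuming 'Consecutive' is a boolean column): each gap row can be derived from the current row's Start Date and the previous row's Finish Date via shift.
d = df.copy()
d['Start Date'] = pd.to_datetime(d['Start Date'])
d['Finish Date'] = pd.to_datetime(d['Finish Date'])
gaps = d[~d['Consecutive']].copy()
# the gap row ends one day before the current row starts ...
gaps['Finish Date'] = gaps['Start Date'] - pd.Timedelta(days=1)
# ... and starts one day after the previous row finished
gaps['Start Date'] = d['Finish Date'].shift()[~d['Consecutive']] + pd.Timedelta(days=1)
df3 = pd.concat([d, gaps]).sort_values(['Brand', 'Start Date']).reset_index(drop=True)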

Number of rows between two dates

Let's say I have a pandas df with a Date column (datetime64[ns]):
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
I want to get a column (rows_num in the above example) with the number of rows I need to claw back to find the current row date minus 365 days (1 year before).
So, in the above example, for index 7 (date 2021-01-26) I want to know how many rows before I can find the date 2020-01-26.
If a perfect match is not available (like in the example df), I should reference the closest available date (or the closest smaller/larger date: it doesn't really matter in my case).
Any idea? Thanks
Edited to reflect the OP's original question. Created a demo dataframe and a row_count column to hold the result. Then, for each row, build a filter that grabs all rows between the start date and 365 days later; the shape[0] of that filtered dataframe is the number of rows in that window, which we write into the row_count column of the df.
# Import packages
import pandas as pd
from datetime import datetime, timedelta
# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2],
                   'date': ['2020-08-09', '2020-08-25', '2020-09-05',
                            '2020-09-12', '2020-09-29', '2020-10-15',
                            '2020-11-21', '2020-12-02', '2020-12-10',
                            '2020-12-18']})
# Create the column for the row count
df.insert(2, "row_count", '')
# Convert the date to datetime64
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
for row in range(len(df['date'])):
    start_date = str(df['date'].iloc[row])
    end_date = str(df['date'].iloc[row] + timedelta(days=365))  # set the end date for the filter
    # Filter data between the two dates
    filtered_df = df.loc[(df['date'] >= start_date) & (df['date'] < end_date)]
    # fill row_count with the number of rows returned by the filter (.loc avoids chained assignment)
    df.loc[row, 'row_count'] = filtered_df.shape[0]
You can use pd.merge_asof, which performs the exact nearest-match lookup you describe. You can even choose to use backward (smaller), forward (larger), or nearest search types.
# setup
import pandas as pd
from io import StringIO

text = StringIO(
"""
Date
2020-01-01
2020-02-25
2020-04-23
2020-06-28
2020-08-17
2020-10-11
2020-12-06
2021-01-26
2021-03-17
"""
)
data = pd.read_csv(text, delim_whitespace=True, parse_dates=["Date"])
# calculate the reference date from 1 year (365 days) ago
one_year_ago = data["Date"] - pd.Timedelta("365D")
# we only care about the index values for the original and matched dates
merged = pd.merge_asof(
    one_year_ago.reset_index(),
    data.reset_index(),
    on="Date",
    suffixes=("_original", "_matched"),
    direction="backward",
)
data["rows_num"] = merged["index_original"] - merged["index_matched"]
Result:
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
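If you prefer to stay in NumPy, an equivalent sketch using searchsorted reproduces the same backward match, assuming the Date column is sorted (as it is here):
import numpy as np

dates = data["Date"].to_numpy()
targets = dates - np.timedelta64(365, "D")
# index of the last date <= each target (the 'backward' match)
pos = np.searchsorted(dates, targets, side="right") - 1
data["rows_num"] = np.where(pos >= 0, np.arange(len(dates)) - pos, np.nan)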

how to convert monthly data to weekly data keeping the other columns constant

I have a data frame as follows.
pd.DataFrame({'Date':['2020-08-01','2020-08-01','2020-09-01'],'value':[10,12,9],'item':['a','d','b']})
I want to convert this to weekly data keeping all the columns apart from the Date column constant.
Expected output
pd.DataFrame({'Date':['2020-08-01','2020-08-08','2020-08-15','2020-08-22','2020-08-29','2020-08-01','2020-08-08','2020-08-15','2020-08-22','2020-08-29','2020-09-01','2020-09-08','2020-09-15','2020-09-22','2020-09-29'],
'value':[10,10,10,10,10,12,12,12,12,12,9,9,9,9,9],'item':['a','a','a','a','a','d','d','d','d','d','b','b','b','b','b']})
It should be able to convert any month data to weekly data. Date in the input data frame is always the first day of that month.
How do I make this happen?
Thanks in advance.
Since the desired new datetime index is irregular (re-starts at the 1st of each month), an iterative creation of the index is an option:
df = pd.DataFrame({'Date':['2020-08-01','2020-09-01'],'value':[10,9],'item':['a','b']})
df = df.set_index(pd.to_datetime(df['Date'])).drop(columns='Date')
dti = pd.to_datetime([]) # start with an empty datetime index
for month in df.index: # for each month, add a 7-day-step datetime index to the previous
    dti = dti.union(pd.date_range(month, month+pd.DateOffset(months=1), freq='7d'))
# just reindex and forward-fill, no resampling needed
df = df.reindex(dti).ffill()
df
value item
2020-08-01 10.0 a
2020-08-08 10.0 a
2020-08-15 10.0 a
2020-08-22 10.0 a
2020-08-29 10.0 a
2020-09-01 9.0 b
2020-09-08 9.0 b
2020-09-15 9.0 b
2020-09-22 9.0 b
2020-09-29 9.0 b
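One caveat: the question's frame can hold several rows for the same month (two rows for 2020-08-01), and reindex raises on a duplicate index. A per-row sketch with explode covers that case (inclusive='left' needs pandas >= 1.4; older versions spell it closed='left'):
df = pd.DataFrame({'Date':['2020-08-01','2020-08-01','2020-09-01'],'value':[10,12,9],'item':['a','d','b']})
df['Date'] = pd.to_datetime(df['Date'])
# one 7-day-spaced range per row, from the 1st up to (not including) the next month's 1st
df['Date'] = df['Date'].apply(lambda d: pd.date_range(d, d + pd.DateOffset(months=1), freq='7D', inclusive='left'))
df = df.explode('Date').reset_index(drop=True)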
I added one more date to your data and then used resample:
df = pd.DataFrame({'Date':['2020-08-01', '2020-09-01'],'value':[10, 9],'item':['a', 'b']})
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df = df.resample('W').ffill().reset_index()
print(df)
Date value item
0 2020-08-02 10 a
1 2020-08-09 10 a
2 2020-08-16 10 a
3 2020-08-23 10 a
4 2020-08-30 10 a
5 2020-09-06 9 b

Date Offset Pandas Field Based Off Another Field

I have a data frame with a timestamp column time, and another column period. How can I add a number of days to time based on period?
Current Output:
time period
------------------------------
2020-04-28 10:00:00 1
2020-04-27 12:34:56 3
Expected Output
time
---------------
2020-04-29 10:00:00
2020-04-30 12:34:56
If I try df['time'] = df['time'] + pd.DateOffset(df['period']) I get TypeError: argument must be an integer, got <class 'pandas.core.series.Series'>, because it is trying to pass the whole column into a function that expects an integer. How can this be accomplished?
Because days can be converted to timedeltas with to_timedelta, it is possible to use:
df['time'] = df['time'] + pd.to_timedelta(df['period'], unit='d')
print (df)
time period
0 2020-04-29 10:00:00 1
1 2020-04-30 12:34:56 3
But if you want to add months, it is necessary to use:
df['time'] = df['time'] + df['period'].apply(lambda x: pd.DateOffset(months=x))
print (df)
time period
0 2020-05-28 10:00:00 1
1 2020-07-27 12:34:56 3
If you use month timedeltas, pandas works with a 'default' average month length, so the precision is different (and unit='M' is deprecated in newer pandas versions):
df['time'] = df['time'] + pd.to_timedelta(df['period'], unit='M')
print (df)
time period
0 2020-05-28 20:29:06 1
1 2020-07-27 20:02:14 3
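If the row-wise apply is too slow on a big frame, one possible optimization (a sketch, not part of the original answer) is to add a single DateOffset per unique period value instead of per row:
# group the rows by period so each distinct offset is added once, vectorized
for p, idx in df.groupby('period').groups.items():
    df.loc[idx, 'time'] = df.loc[idx, 'time'] + pd.DateOffset(months=int(p))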

How to check if any holiday is included in a time period in python

I have two columns in a pandas dataframe, a start date and an end date.
I want to know if any holiday is included in the time period of each row.
I want to create a new column to show yes or no.
id Start Date End Date
0 2019-09-27 2019-10-06
1 2019-10-09 2019-10-22
2 2019-05-04 2019-05-15
3 2019-09-18 2019-09-29
I know how to check if a specific date is holiday or not
But how can I check the duration of each row?
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

df = pd.DataFrame({'Start Date': ['2019-09-27', '2019-10-09', '2019-05-04', '2019-09-18'],
                   'End Date': ['2019-10-06', '2019-10-22', '2019-05-15', '2019-09-29']})
# convert to datetime64 so the dates compare correctly against the holiday calendar
df['Start Date'] = pd.to_datetime(df['Start Date'])
df['End Date'] = pd.to_datetime(df['End Date'])
# To check if a specific date is a holiday or not
holidays = calendar().holidays(start=df['Start Date'].min(), end=df['Start Date'].max())
df['Holiday'] = df['Start Date'].isin(holidays)
# This can only check whether the start date itself is a holiday
id Start Date Holiday
0 2019-09-27 False
1 2019-10-09 False
2 2019-05-04 False
3 2019-09-18 False
# But how can I check the duration between df['Start Date'] and df['End Date'] of each row?
I expect another Boolean column indicating whether each row (id) includes any holiday between its start date and end date:
id Start Date End Date Holiday
0 2019-09-27 2019-10-06 True
1 2019-10-09 2019-10-22 False
2 2019-05-04 2019-05-15 True
3 2019-09-18 2019-09-29 False
What I will do:
holidays = calendar().holidays(start=df['Start Date'].min(), end=df['End Date'].max())  # recomputed over the full span
l = [any(x <= z <= y for z in holidays.tolist()) for x, y in zip(df['Start Date'], df['End Date'])]
[False, True, False, False]
df['Holiday'] = l
Also check When should I ever want to use pandas apply() in my code?
Apply the checking function to each row of the dataframe:
df['Holiday'] = df.apply(lambda x: calendar().holidays(start=x['Start Date'],
                                                       end=x['End Date']).size,
                         axis=1).astype(bool)  # convert the holiday count to a boolean
#0 False
#1 True
#2 False
#3 False
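A loop-free variant of the same interval test is also possible; a sketch that broadcasts every holiday against every row (assuming the date columns are datetime64 and holidays covers the full span, as above):
import numpy as np

starts = df['Start Date'].to_numpy()
ends = df['End Date'].to_numpy()
h = holidays.to_numpy()[:, None]  # shape (n_holidays, 1), broadcasts against the rows
df['Holiday'] = ((h >= starts) & (h <= ends)).any(axis=0)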
