Insert new row in a DataFrame if condition is met - python

I have a dataframe with two date columns:
   Brand  Start Date Finish Date            Check
0      1  2013-03-16  2014-03-02      Consecutive
1      2  2014-03-03  2015-09-05      Consecutive
2      3  2015-12-12  2016-12-12  Non Consecutive
3      4  2017-01-01  2017-06-01  Non consecutive
4      5  2017-06-02  2019-02-20      Consecutive
I created a new column (the Check column) that flags whether each row's start date is consecutive with the previous row's finish date, populated with 'Consecutive' and 'Non consecutive'.
Wherever the Check column says 'Non consecutive', I want to insert a new row that fills the gap: its Start Date is the previous row's Finish Date + 1 day (so it is consecutive with the previous row), and its Finish Date is that row's Start Date - 1 day (so it is consecutive with the row that follows). So indexes 2 and 4 will be the new rows:
   Brand  Start Date Finish Date
0      1  2013-03-16  2014-03-02
1      2  2014-03-03  2015-09-05
2      3  2015-09-06  2015-12-11
3      3  2015-12-12  2016-12-12
4      4  2016-12-13  2016-12-31
5      4  2017-01-01  2017-06-01
6      5  2017-06-02  2019-02-20
How can I achieve this?

from datetime import datetime, timedelta
import pandas as pd

date_format = '%Y-%m-%d'
# Rows whose Check value is not 'Consecutive' (case-insensitive)
rows = df.index[df['Check'].str.lower() != 'consecutive']
df2 = pd.DataFrame(columns=df.columns, index=rows)
for row in rows:
    # Copy the row (keeps the Brand), then overwrite the dates to fill the gap
    df2.loc[row, :] = df.loc[row, :].copy()
    # Start one day after the previous row's finish date
    df2.loc[row, 'Start Date'] = datetime.strftime(
        datetime.strptime(df.loc[row - 1, 'Finish Date'], date_format) + timedelta(days=1), date_format)
    # Finish one day before this row's start date
    df2.loc[row, 'Finish Date'] = datetime.strftime(
        datetime.strptime(df.loc[row, 'Start Date'], date_format) - timedelta(days=1), date_format)
df3 = pd.concat([df, df2]).sort_values(['Brand', 'Start Date']).reset_index(drop=True)
This uses sorting to put the rows in the correct place. If your DataFrame is big, the sorting could be the slowest part, and you could instead consider inserting the rows one at a time into the correct positions.
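If the Check column is already in place, a vectorized sketch along the same lines avoids the Python-level loop (assuming, as above, that the date columns are '%Y-%m-%d' strings):
import pandas as pd

# Rows that need a gap row inserted before them
gaps = df[df['Check'].str.lower() != 'consecutive'].copy()
prev_finish = pd.to_datetime(df['Finish Date']).shift(1)
# Finish one day before the flagged row's start date (computed first,
# since the start date is overwritten on the next line)...
gaps['Finish Date'] = (pd.to_datetime(gaps['Start Date']) - pd.Timedelta(days=1)).dt.strftime('%Y-%m-%d')
# ...and start one day after the previous row's finish date
gaps['Start Date'] = (prev_finish.loc[gaps.index] + pd.Timedelta(days=1)).dt.strftime('%Y-%m-%d')
df3 = pd.concat([df, gaps]).sort_values(['Brand', 'Start Date']).reset_index(drop=True)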


New pandas DataFrame column from datetime calculation

I am trying to calculate the number of days that have elapsed since the launch of a marketing campaign. I have one row per date for each marketing campaign in my DataFrame (df) and all dates start from the same day (though there is not a data point for each day for each campaign). In column 'b' I have the date relating to the data points of interest (datetime64[ns]) and in column 'c' I have the launch date of the marketing campaign (datetime64[ns]). I would like the resulting calculation to return n/a (or np.NaN, or a suitable alternative) when column 'b' is earlier than column 'c'; otherwise I would like the calculation to return the difference between the two dates.
Campaign        Date Launch Date Desired Column
       A  2019-09-01  2022-12-01            n/a
       A  2019-09-02  2022-12-01            n/a
       B  2019-09-01  2019-09-01              0
       B  2019-09-25  2019-09-01             24
When I try:
df['Days Since Launch'] = df['Date'] - df['Launch Date']
What I would hope returns a negative value actually returns a positive one, thus leading to duplicate values when I have dates that are 10 days prior and 10 days after the launch date.
When I try:
df['Days Since Launch'] = np.where(df['Date'] < df['Launch Date'], XXX, df['Date'] - df['Launch Date'])
Where XXX has to be the same data type as the two input columns: I can't enter np.NaN because the calculation will fail, nor can I enter a date, as that still leaves the same issue I want to solve. Plain if statements do not work either, as the "truth value of a Series is ambiguous". Any ideas?
You can use a direct subtraction and conversion to days with dt.days, then mask the negative values with where:
s = pd.to_datetime(df['Date']).sub(pd.to_datetime(df['Launch Date'])).dt.days
# or, if already datetime:
#s = df['Date'].sub(df['Launch Date']).dt.days
df['Desired Column'] = s.where(s.ge(0))
Alternative closer to your initial attempt, using mask:
df['Desired Column'] = (df['Date'].sub(df['Launch Date'])
                          .mask(df['Date'] < df['Launch Date'])
                        )
Output:
  Campaign        Date Launch Date  Desired Column
0        A  2019-09-01  2022-12-01             NaN
1        A  2019-09-02  2022-12-01             NaN
2        B  2019-09-01  2019-09-01             0.0
3        B  2019-09-25  2019-09-01            24.0
Add Series.dt.days to convert the timedeltas to days:
df['Days Since Launch'] = np.where(df['Date'] < df['Launch Date'],
                                   np.nan,
                                   (df['Date'] - df['Launch Date']).dt.days)
print(df)
  Campaign        Date Launch Date  Desired Column  Days Since Launch
0        A  2019-09-01  2022-12-01             NaN                NaN
1        A  2019-09-02  2022-12-01             NaN                NaN
2        B  2019-09-01  2019-09-01             0.0                0.0
3        B  2019-09-25  2019-09-01            24.0               24.0
Another alternative, with a row-wise apply (slower on large frames):
df["Date"] = pd.to_datetime(df["Date"])
df["Launch Date"] = pd.to_datetime(df["Launch Date"])
# Returns Timedelta objects (or None) rather than day counts
df["Desired Column"] = df.apply(lambda x: x["Date"] - x["Launch Date"] if x["Date"] >= x["Launch Date"] else None, axis=1)

How to calculate number of days between 2 months in Python

I have a requirement where I have to find the number of days between two dates, where the first date is constant and the second is present in a DataFrame column.
I have to subtract the values present in the DataFrame from 24th Feb.
import datetime as dt
from datetime import date

past_2_month = date.today()

def to_integer(dt_time):
    return 1 * dt_time.month

past_2_month = to_integer(past_2_month)
past_2_month_num = past_2_month - 2
year = date.today().year  # `year` was not defined in the original snippet
day = 24
date_2 = dt.date(year, past_2_month_num, day)
date_2
Output of above code: datetime.date(2022, 2, 24)
The other values present in the DataFrame are below:
import numpy as np
import pandas as pd

dict_1 = {'Col1': ['2017-05-01', np.NaN, '2017-11-01', np.NaN, '2016-10-01']}
a = pd.DataFrame(dict_1)
How do I subtract these two values so that I can get the difference in days between them?
If you need the number of days between the datetime column and its values shifted by 2 months, use offsets.DateOffset and convert the timedeltas to days with Series.dt.days:
a['Col1'] = pd.to_datetime(a['Col1'])
a['new'] = (a['Col1'] - (a['Col1'] - pd.DateOffset(months=2))).dt.days
print(a)
        Col1   new
0 2017-05-01  61.0
1        NaT   NaN
2 2017-11-01  61.0
3        NaT   NaN
4 2016-10-01  61.0
If you need the difference from another datetime, the solution is simpler - subtract and convert the values to days:
a['Col1'] = pd.to_datetime(a['Col1'])
a['new'] = (pd.to_datetime('2022-02-24') - a['Col1']).dt.days
print(a)
        Col1     new
0 2017-05-01  1760.0
1        NaT     NaN
2 2017-11-01  1576.0
3        NaT     NaN
4 2016-10-01  1972.0
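Tying this back to the question, the date_2 built in the first snippet can be subtracted directly (a sketch, assuming date_2 is the datetime.date(2022, 2, 24) from above):
a['Col1'] = pd.to_datetime(a['Col1'])
a['new'] = (pd.Timestamp(date_2) - a['Col1']).dt.days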

Python: assign session column if date is within range of dates of other table

I have a sessions table with three columns: session #, start datetime and end datetime.
And I have a table of users' actions with datestamps.
To the users' actions table, I would like to add a column that states in which session # each action took place, based on the dates.
I was thinking of making a for loop that takes each row of the users' actions table and, in an inner loop, checks whether the date of that row falls inside row 1 of the sessions table (session 1); if not, the next iteration checks row 2 of the sessions table, then the third row, and so on until it matches, at which point we move back to the outer loop, take the next row of users' actions, and repeat the process.
But is there a faster computational way to do it? My tables are more than 10 million rows.
You can do it this way:
(Assuming df1 is the sessions table, df2 is the table of user's actions)
1. Convert datetime format
df1['start datetime'] = pd.to_datetime(df1['start datetime'], dayfirst=True)
df1['end datetime'] = pd.to_datetime(df1['end datetime'], dayfirst=True)
df2['datestamp'] = pd.to_datetime(df2['datestamp'], dayfirst=True)
2. Cross join the 2 dataframes:
df3 = df2.merge(df1, how='cross')
Or, if your Pandas version is older than 1.2.0 (December 2020 version), you can use:
df3 = df2.assign(key=1).merge(df1.assign(key=1), on='key').drop('key', axis=1)
3. Filter rows whose datestamp is within the range from start datetime to end datetime:
df_out = df3.loc[df3['datestamp'].between(df3['start datetime'], df3['end datetime'])]
Result:
(modified the time of the last row to make it fall within range)
print(df_out)
   user_id action           datestamp  session #      start datetime        end datetime
0        1      A 2021-01-15 08:21:00          1 2021-01-15 05:21:00 2021-01-15 20:22:00
4        1      A 2021-01-23 11:50:00          2 2021-01-23 11:21:00 2021-01-23 12:21:00
8        1      B 2021-03-02 14:44:00          3 2021-03-02 14:43:00 2021-03-02 14:45:00
Optionally, you can remove the unwanted columns, as follows:
df_out = df_out.drop(['start datetime', 'end datetime'], axis=1)
Result:
print(df_out)
   user_id action           datestamp  session #
0        1      A 2021-01-15 08:21:00          1
4        1      A 2021-01-23 11:50:00          2
8        1      B 2021-03-02 14:44:00          3
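With tables of more than 10 million rows, the cross join itself can exhaust memory, since it materializes len(df1) * len(df2) rows before filtering. A join-free sketch with pd.IntervalIndex, assuming the sessions do not overlap one another:
import numpy as np

# One interval per session; closed='both' keeps actions on the boundary timestamps
idx = pd.IntervalIndex.from_arrays(df1['start datetime'], df1['end datetime'], closed='both')
pos = idx.get_indexer(df2['datestamp'])  # -1 where an action falls in no session
sessions = df1['session #'].to_numpy()
df2['session #'] = np.where(pos >= 0, sessions[pos], np.nan)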

How to replace missing dates (NaT) with dates in increasing order in Python

I have a date column.
The missing values (NaT in Python) need to be filled in a loop, each incremented by one day from the last valid date, i.e. 1/1/2015, 1/2/2015, 1/3/2015.
Can anyone help me out?
This will add an incremental date to your dataframe.
import pandas as pd
import datetime as dt

ddict = {
    'Date': ['2014-12-29', '2014-12-30', '2014-12-31', '', '', '', '']
}
data = pd.DataFrame(ddict)
data['Date'] = pd.to_datetime(data['Date'])

def fill_dates(data_frame, date_col='Date'):
    ### Seconds in a day (3600 seconds per hour x 24 hours per day)
    day_s = 3600 * 24
    ### Create a timedelta variable for adding 1 day
    _day = dt.timedelta(seconds=day_s)
    ### Get the max non-null date
    max_dt = data_frame[date_col].max()
    ### Get the index of the missing date values
    NaT_index = data_frame[data_frame[date_col].isnull()].index
    ### Loop through the index; set the incremental date value; increment by 1 day
    for i in NaT_index:
        ### .loc avoids assigning through a chained index
        data_frame.loc[i, date_col] = max_dt + _day
        _day += dt.timedelta(seconds=day_s)

### Execute function
fill_dates(data, 'Date')
Initial data frame:
Date
0 2014-12-29
1 2014-12-30
2 2014-12-31
3 NaT
4 NaT
5 NaT
6 NaT
After running the function:
Date
0 2014-12-29
1 2014-12-30
2 2014-12-31
3 2015-01-01
4 2015-01-02
5 2015-01-03
6 2015-01-04
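For longer runs of missing dates, a loop-free sketch does the same thing with pd.date_range (assuming, as above, that the NaT rows continue from the last valid date):
n_missing = data['Date'].isna().sum()
start = data['Date'].max() + pd.Timedelta(days=1)
data.loc[data['Date'].isna(), 'Date'] = pd.date_range(start, periods=n_missing, freq='D')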

Pandas: get days between two dates from a particular month

I have a pandas dataframe with three columns: a start date, an end date, and a month.
I would like to add a column for how many days within the month fall between the two dates. I started doing something with apply, the calendar library and some math, but it started to get really complex. I bet pandas has a simple solution, but I am struggling to find it.
Input:
import pandas as pd

df1 = pd.DataFrame(data=[['2017-01-01', '2017-06-01', '2016-01-01'],
                         ['2015-03-02', '2016-02-10', '2016-02-01'],
                         ['2011-01-02', '2018-02-10', '2016-03-01']],
                   columns=['start date', 'end date date', 'Month'])
Desired Output:
   start date end date date       Month  Days in Month
0  2017-01-01    2017-06-01  2016-01-01              0
1  2015-03-02    2016-02-10  2016-02-01             10
2  2011-01-02    2018-02-10  2016-03-01             31
There is a solution: get a list of dates with pd.date_range between the start and end dates, then count how many of them share the year and month of the target month.
def overlap(x):
    md = pd.to_datetime(x[2])
    cand = [(ad.year, ad.month) for ad in pd.date_range(x[0], x[1])]
    return len([d for d in cand if d == (md.year, md.month)])

df1["Days in Month"] = df1.apply(overlap, axis=1)
You'll get:
   start date end date date       Month  Days in Month
0  2017-01-01    2017-06-01  2016-01-01              0
1  2015-03-02    2016-02-10  2016-02-01             10
2  2011-01-02    2018-02-10  2016-03-01             31
You can convert the cells to datetimes with
df1 = df1.applymap(pd.to_datetime)
Then find the intersecting days with a function
def intersectionDaysInMonth(start, end, month):
    # DateOffset handles a December month, where month.replace(month=13) would fail
    end_month = month + pd.DateOffset(months=1)
    if month <= start <= end_month:
        return end_month - start
    if month <= end <= end_month:
        # Counts one endpoint exclusively, so this can come out one day
        # lower than the inclusive count of the previous answer
        return end - month
    if start <= month < end_month <= end:
        return end_month - month
    return pd.to_timedelta(0)
Then apply
df1['Days in Month'] = df1.apply(lambda row: intersectionDaysInMonth(*row).days, axis=1)
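If apply is too slow, here is a vectorized sketch of the same inclusive count (assuming all three columns are already datetimes, and counting both endpoints the way the first answer does):
import numpy as np

month_start = df1['Month']
month_end = df1['Month'] + pd.offsets.MonthEnd(0)  # last day of the target month
lo = np.maximum(df1['start date'], month_start)    # the overlap cannot begin before the month
hi = np.minimum(df1['end date date'], month_end)   # ...nor end after it
df1['Days in Month'] = (hi - lo).dt.days.add(1).clip(lower=0)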
