Number of rows between two dates - python

Let's say I have a pandas df with a Date column (datetime64[ns]):
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
I want to get a column (rows_num in the example above) with the number of rows I need to go back to find the current row's date minus 365 days (one year earlier).
So, in the example above, for index 7 (date 2021-01-26) I want to know how many rows back I can find the date 2020-01-26.
If an exact match is not available (as in the example df), I should use the closest available date (or the closest smaller/larger date: it doesn't really matter in my case).
Any ideas? Thanks

Edited to reflect the OP's original question. Create a demo dataframe and a column to hold the row count. Then, for each row, build a filter that grabs all rows between that row's date and 365 days later; the shape[0] of the filtered dataframe is the number of rows in that window, which we write back into the appropriate field of the df.
# Import packages
import pandas as pd
from datetime import timedelta

# Create a sample dataframe
df = pd.DataFrame({'num_posts': [4, 6, 3, 9, 1, 14, 2, 5, 7, 2],
                   'date': ['2020-08-09', '2020-08-25', '2020-09-05',
                            '2020-09-12', '2020-09-29', '2020-10-15',
                            '2020-11-21', '2020-12-02', '2020-12-10',
                            '2020-12-18']})
# Create the column for the row count
df.insert(2, "row_count", 0)
# Convert the date to datetime64
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
for row in range(len(df)):
    start_date = df['date'].iloc[row]
    end_date = start_date + timedelta(days=365)  # set the end date for the filter
    # Filter data between the two dates
    filtered_df = df.loc[(df['date'] >= start_date) & (df['date'] < end_date)]
    # Fill in row_count with the number of rows returned by the filter
    df.loc[row, 'row_count'] = filtered_df.shape[0]
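For the original request (counting rows back to the date one year earlier), a vectorized sketch is also possible with np.searchsorted, assuming the Date column is sorted ascending; matching the closest smaller date reproduces the rows_num column from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime([
    '2020-01-01', '2020-02-25', '2020-04-23', '2020-06-28', '2020-08-17',
    '2020-10-11', '2020-12-06', '2021-01-26', '2021-03-17'])})

targets = df['Date'] - pd.Timedelta(days=365)
# position of the closest date <= the one-year-ago target (Date must be sorted)
pos = np.searchsorted(df['Date'].to_numpy(), targets.to_numpy(), side='right') - 1
rows_back = np.arange(len(df)) - pos
# rows with no date one year back get NaN
df['rows_num'] = np.where(pos >= 0, rows_back, np.nan)
print(df)
```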

You can use pd.merge_asof, which performs the exact nearest-match lookup you describe. You can even choose to use backward (smaller), forward (larger), or nearest search types.
# setup
from io import StringIO

import pandas as pd

text = StringIO(
"""
Date
2020-01-01
2020-02-25
2020-04-23
2020-06-28
2020-08-17
2020-10-11
2020-12-06
2021-01-26
2021-03-17
"""
)
data = pd.read_csv(text, delim_whitespace=True, parse_dates=["Date"])
# calculate the reference date from 1 year (365 days) ago
one_year_ago = data["Date"] - pd.Timedelta("365D")
# we only care about the index values for the original and matched dates
merged = pd.merge_asof(
    one_year_ago.reset_index(),
    data.reset_index(),
    on="Date",
    suffixes=("_original", "_matched"),
    direction="backward",
)
data["rows_num"] = merged["index_original"] - merged["index_matched"]
Result:
Date rows_num
0 2020-01-01 NaN
1 2020-02-25 NaN
2 2020-04-23 NaN
3 2020-06-28 NaN
4 2020-08-17 NaN
5 2020-10-11 NaN
6 2020-12-06 NaN
7 2021-01-26 7.0
8 2021-03-17 7.0
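If the closest larger date is preferred instead, the same lookup works with direction="forward" (or "nearest" for whichever side is closer). With forward search every row gets a match here, since the first date is an upper bound for every one-year-ago target; a sketch:

```python
from io import StringIO
import pandas as pd

text = StringIO("""Date
2020-01-01
2020-02-25
2020-04-23
2020-06-28
2020-08-17
2020-10-11
2020-12-06
2021-01-26
2021-03-17
""")
data = pd.read_csv(text, parse_dates=["Date"])
one_year_ago = data["Date"] - pd.Timedelta("365D")

# forward search: match the closest date >= the one-year-ago target
merged = pd.merge_asof(
    one_year_ago.reset_index(),
    data.reset_index(),
    on="Date",
    suffixes=("_original", "_matched"),
    direction="forward",
)
data["rows_num"] = merged["index_original"] - merged["index_matched"]
print(data)
```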

Related

calc churn rate in pandas

I have a sales dataset (simplified) with sales from existing customers (first_order = 0):
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date': ['2020-06-30 00:00:00', '2020-05-05 00:00:00',
                            '2020-04-10 00:00:00', '2020-02-26 00:00:00'],
                   'email': ['1#abc.de', '2#abc.de', '3#abc.de', '1#abc.de'],
                   'first_order': [1, 1, 1, 1],
                   'Last_Order_Date': ['2020-06-30 00:00:00', '2020-05-05 00:00:00',
                                       '2020-04-10 00:00:00', '2020-02-26 00:00:00']})
I would like to analyze how many existing customers we lose per month.
My idea is to
group (count) by month, and
then count how many made their last purchase in the following months, which gives me a churn cross table where I can see that e.g. we had 300 purchases in January, and 10 of those customers bought for the last time in February.
Like this:
Column B is the total number of repeating customers, and column C onwards is the last month they bought something.
E.g. we had 2400 customers in January; 677 of them made their last purchase in that month, 203 more followed in February, etc.
I guess I could first group the total number of sales per month and then group a second dataset by Last_Order_Date and filter by month, but I guess there is a handy Python way?! :)
Any ideas?
Thanks!
The code below identifies how many purchases were made in each month.
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.groupby(pd.Grouper(freq="M")).size()
O/P:
Date
2020-02-29 1
2020-03-31 0
2020-04-30 1
2020-05-31 1
2020-06-30 1
Freq: M, dtype: int64
I couldn't fully work out the required data and explanation, but this could be your starting point. Please let me know if it helped you in any way.
Update-1:
df.pivot_table(index='Date', columns='email', values='Last_Order_Date', aggfunc='count')
Output:
email 1#abc.de 2#abc.de 3#abc.de
Date
2020-02-26 00:00:00 1.0 NaN NaN
2020-04-10 00:00:00 NaN NaN 1.0
2020-05-05 00:00:00 NaN 1.0 NaN
2020-06-30 00:00:00 1.0 NaN NaN
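The cross table the question asks for (purchase month vs. month of that customer's last order) can be sketched with pd.crosstab. With this toy data every order is also the customer's last one, so the table comes out diagonal:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2020-06-30 00:00:00', '2020-05-05 00:00:00',
                            '2020-04-10 00:00:00', '2020-02-26 00:00:00'],
                   'email': ['1#abc.de', '2#abc.de', '3#abc.de', '1#abc.de'],
                   'first_order': [1, 1, 1, 1],
                   'Last_Order_Date': ['2020-06-30 00:00:00', '2020-05-05 00:00:00',
                                       '2020-04-10 00:00:00', '2020-02-26 00:00:00']})
df['Date'] = pd.to_datetime(df['Date'])
df['Last_Order_Date'] = pd.to_datetime(df['Last_Order_Date'])

# rows: month of purchase, columns: month of that customer's last order
churn = pd.crosstab(df['Date'].dt.to_period('M'),
                    df['Last_Order_Date'].dt.to_period('M'))
print(churn)
```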

How to calculate number of days between 2 months in Python

I have a requirement where I have to find the number of days between two dates, where the first date is fixed (the 24th of two months ago) and the second date is present in a DataFrame.
I have to subtract the values present in the DataFrame from 24th Feb.
import datetime as dt
from datetime import date

past_2_month = date.today()

def to_integer(dt_time):
    return 1 * dt_time.month

past_2_month = to_integer(past_2_month)
past_2_month_num = past_2_month - 2
year = date.today().year  # year was undefined in the original snippet
day = 24
date_2 = dt.date(year, past_2_month_num, day)
date_2
Output of above code: datetime.date(2022, 2, 24)
Other values present in the Data frame is below:
import numpy as np

dict_1 = {'Col1': ['2017-05-01', np.nan, '2017-11-01', np.nan, '2016-10-01']}
a = pd.DataFrame(dict_1)
How to subtract this 2 values so that I can get difference in days between these 2 values?
If you need the number of days between the datetime column and the same values shifted back 2 months, use offsets.DateOffset and convert the timedeltas to days with Series.dt.days:
a['Col1'] = pd.to_datetime(a['Col1'])
a['new'] = (a['Col1'] - (a['Col1'] - pd.DateOffset(months=2))).dt.days
print (a)
Col1 new
0 2017-05-01 61.0
1 NaT NaN
2 2017-11-01 61.0
3 NaT NaN
4 2016-10-01 61.0
If you need the difference from another datetime, the solution is simpler: subtract and convert the values to days:
a['Col1'] = pd.to_datetime(a['Col1'])
a['new'] = (pd.to_datetime('2022-02-24') - a['Col1']).dt.days
print (a)
Col1 new
0 2017-05-01 1760.0
1 NaT NaN
2 2017-11-01 1576.0
3 NaT NaN
4 2016-10-01 1972.0

Keep only rows from first X hours with starting point from another dataframe

I have a DataFrame (df1) with patients, where each patient (with unique id) has an admission timestamp:
admission_timestamp id
0 2020-03-31 12:00:00 1
1 2021-01-13 20:52:00 2
2 2020-04-02 07:36:00 3
3 2020-04-05 16:27:00 4
4 2020-03-21 18:51:00 5
I also have a DataFrame (df2) with for each patient (with unique id), data for a specific feature. For example:
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
2 1 temperature 2020-04-03 13:04:33 36.51
3 2 temperature 2020-04-02 07:44:12 36.45
4 2 temperature 2020-04-08 08:36:00 36.50
Where effective_timestamp is of type: datetime64[ns], for both columns. The ids for both dataframes link to the same patients.
In reality there is a lot more data with +- 1 value per minute. What I want is for each patient, only the data for the first X (say 24) hours after the admission timestamp from df1. So the above would result in:
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
3 2 temperature 2020-04-02 07:44:12 36.45
This would thus involve first finding the admission timestamp, and then dropping all rows for that patient where the effective_timestamp is not within X hours of it. Here, X should be variable (could be 7, 24, 72, etc.). I could not find a similar question on SO. I tried pandas' date_range, but I don't know how to apply it per patient with a variable value for X. Any help is appreciated.
Edit: I could also merge the dataframes so that each row in df2 has the admission_timestamp, subtract the two columns to get the time difference, and then drop all rows where the difference > X. But this sounds very cumbersome.
Let's use pd.DateOffset.
First, get the value of admission_timestamp for a given patient id and convert it to a pandas datetime. Let's say id = 1:
>>> admissionTime = pd.to_datetime(df1[df1['id'] == 1]['admission_timestamp'].values[0])
>>> admissionTime
Timestamp('2020-03-31 12:00:00')
Now use pd.DateOffset to add 24 hours to it:
>>> admissionTime += pd.DateOffset(hours=24)
Finally, look for the rows where id == 1 and effective_timestamp < admissionTime:
>>> df2[(df2['id'] == 1) & (df2['effective_timestamp'] < admissionTime)]
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
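To do the above for every patient at once with a variable window X, the merge-then-filter idea from the question's edit can be sketched as follows (assuming only measurements at or after admission should count; the function name is illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'admission_timestamp': pd.to_datetime(
        ['2020-03-31 12:00:00', '2021-01-13 20:52:00', '2020-04-02 07:36:00',
         '2020-04-05 16:27:00', '2020-03-21 18:51:00']),
    'id': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'id': [1, 1, 1, 2, 2],
    'name': ['temperature'] * 5,
    'effective_timestamp': pd.to_datetime(
        ['2020-03-31 13:00:00', '2020-03-31 13:04:33', '2020-04-03 13:04:33',
         '2020-04-02 07:44:12', '2020-04-08 08:36:00']),
    'numerical_value': [36.47, 36.61, 36.51, 36.45, 36.50]})

def first_x_hours(df1, df2, x):
    # attach each patient's admission time, keep rows within x hours after it
    merged = df2.merge(df1[['id', 'admission_timestamp']], on='id')
    delta = merged['effective_timestamp'] - merged['admission_timestamp']
    keep = (delta >= pd.Timedelta(0)) & (delta <= pd.Timedelta(hours=x))
    return merged.loc[keep, df2.columns]

print(first_x_hours(df1, df2, 24))
```

Note that with this sample data only patient 1's first two rows survive the 24-hour window, because patient 2's sample measurements predate that patient's admission timestamp.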

How to convert monthly data to weekly data keeping the other columns constant

I have a data frame as follows.
pd.DataFrame({'Date':['2020-08-01','2020-08-01','2020-09-01'],'value':[10,12,9],'item':['a','d','b']})
I want to convert this to weekly data keeping all the columns apart from the Date column constant.
Expected output
pd.DataFrame({'Date':['2020-08-01','2020-08-08','2020-08-15','2020-08-22','2020-08-29','2020-08-01','2020-08-08','2020-08-15','2020-08-22','2020-08-29','2020-09-01','2020-09-08','2020-09-15','2020-09-22','2020-09-29'],
'value':[10,10,10,10,10,12,12,12,12,12,9,9,9,9,9],'item':['a','a','a','a','a','d','d','d','d','d','b','b','b','b','b']})
It should be able to convert any month data to weekly data. Date in the input data frame is always the first day of that month.
How do I make this happen?
Thanks in advance.
Since the desired new datetime index is irregular (re-starts at the 1st of each month), an iterative creation of the index is an option:
df = pd.DataFrame({'Date': ['2020-08-01', '2020-09-01'], 'value': [10, 9], 'item': ['a', 'b']})
df = df.set_index(pd.to_datetime(df['Date'])).drop(columns='Date')

dti = pd.to_datetime([])  # start with an empty datetime index
# for each month, add a 7-day-step datetime index to the previous ones
for month in df.index:
    dti = dti.union(pd.date_range(month, month + pd.DateOffset(months=1), freq='7d'))

# just reindex and forward-fill, no resampling needed
df = df.reindex(dti).ffill()
df
df
value item
2020-08-01 10.0 a
2020-08-08 10.0 a
2020-08-15 10.0 a
2020-08-22 10.0 a
2020-08-29 10.0 a
2020-09-01 9.0 b
2020-09-08 9.0 b
2020-09-15 9.0 b
2020-09-22 9.0 b
2020-09-29 9.0 b
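A loop-free alternative sketch of the same idea: build a 7-day date range per row and explode it into one row per generated date (pd.offsets.MonthEnd(0) rolls the 1st forward to that month's last day):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2020-08-01', '2020-09-01'],
                   'value': [10, 9], 'item': ['a', 'b']})
df['Date'] = pd.to_datetime(df['Date'])

# one 7-day-step range per row, then one output row per generated date
df['Date'] = df['Date'].apply(
    lambda d: list(pd.date_range(d, d + pd.offsets.MonthEnd(0), freq='7D')))
out = df.explode('Date').reset_index(drop=True)
print(out)
```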
I added one more date to your data and then used resample:
df = pd.DataFrame({'Date':['2020-08-01', '2020-09-01'],'value':[10, 9],'item':['a', 'b']})
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df = df.resample('W').ffill().reset_index()
print(df)
Date value item
0 2020-08-02 10 a
1 2020-08-09 10 a
2 2020-08-16 10 a
3 2020-08-23 10 a
4 2020-08-30 10 a
5 2020-09-06 9 b

Best approach to group the differences between each row by month, year, etc?

I have a dataframe like the following:
Index Diff
2019-03-14 11:32:21.583000+00:00 0
2019-03-14 11:32:21.583000+00:00 2
2019-04-14 11:32:21.600000+00:00 13
2019-04-14 11:32:21.600000+00:00 14
2019-05-14 11:32:21.600000+00:00 19
2019-05-14 11:32:21.600000+00:00 27
What would be the best approach to group by the month and take the difference inside of those months?
Using the .diff() option I am able to find the difference between each row, but I am trying to use the df.groupby(pd.Grouper(freq='M')) with no success.
Expected Output:
Index Diff
0 2019-03-31 00:00:00+00:00 2.0
1 2019-04-30 00:00:00+00:00 1.0
2 2019-05-31 00:00:00+00:00 8.0
Any help would be much appreciated!!
Depending on whether or not your date is on the index, you can comment out df1 = df.reset_index(). Also check that your index is a DatetimeIndex if the date is on the index; if it is not in the correct format, convert it with df.index = pd.to_datetime(df.index). Then you can compute the within-month differences with df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].diff(), and afterwards group the full dataframe by month:
input:
import pandas as pd
# build the index from a list so the duplicate timestamps are kept
# (a dict keyed by timestamp would collapse them)
df = pd.DataFrame({'Diff': [0, 2, 13, 14, 19, 27]},
                  index=['2019-03-14 11:32:21.583000+00:00', '2019-03-14 11:32:21.583000+00:00',
                         '2019-04-14 11:32:21.600000+00:00', '2019-04-14 11:32:21.600000+00:00',
                         '2019-05-14 11:32:21.600000+00:00', '2019-05-14 11:32:21.600000+00:00'])
df.index.name = 'Index'
df.index = pd.to_datetime(df.index)
code:
df1 = df.reset_index()
df1['Diff'] = df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].diff()
df1 = df1.groupby(pd.Grouper(key='Index', freq='M'))['Diff'].max().reset_index()
df1
output:
Index Diff
0 2019-03-31 00:00:00+00:00 2.0
1 2019-04-30 00:00:00+00:00 1.0
2 2019-05-31 00:00:00+00:00 8.0
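With two rows per month, as in this data, the within-month diff amounts to the gap between the month's smallest and largest values, so the same result can be sketched as a per-month spread in one groupby:

```python
import pandas as pd

df = pd.DataFrame({'Diff': [0, 2, 13, 14, 19, 27]},
                  index=pd.to_datetime(
                      ['2019-03-14 11:32:21.583000+00:00', '2019-03-14 11:32:21.583000+00:00',
                       '2019-04-14 11:32:21.600000+00:00', '2019-04-14 11:32:21.600000+00:00',
                       '2019-05-14 11:32:21.600000+00:00', '2019-05-14 11:32:21.600000+00:00']))
df.index.name = 'Index'

# per-month spread: largest minus smallest Diff value in each month
out = df.groupby(pd.Grouper(freq='M'))['Diff'].agg(lambda s: s.max() - s.min())
print(out)
```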
