Pandas groupby date range - python

I have a table where one of the columns is the date of occurrence (the dataframe is not indexed by date).
I want to group the table by date wherein all items which occurred prior to a certain date are grouped into one bucket. This would need to be cumulative, so later buckets will include all datapoints from earlier ones.
Here's the daterange object I need to group by:
date_rng = date_range('28/02/2010','31/08/2014',freq='3M')
Here's an example of a few datapoints in the table:
df_raw.head()
Ticker FY Periodicity Measure Val Date
0 BP9DL90 2009 ANN CPX 1000.00 2008-03-31 00:00:00
1 BP9DL90 2010 ANN CPX 600.00 2009-03-25 00:00:00
2 BP9DL90 2010 ANN CPX 600.00 2009-09-16 00:00:00
3 BP9DL90 2011 ANN CPX 570.00 2010-03-17 00:00:00
4 BP9DL90 2011 ANN GRM 57.09 2010-09-06 00:00:00
[5 rows x 6 columns]
Any input would be much appreciated.
Thanks

You could create a function that returns 1 if the date is in the range you want, and then use it to group by:
# convert the Date column to datetime type
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d %H:%M:%S')

def is_in_range(d):
    # 1 if the date falls inside the target range, 0 otherwise
    if pd.Timestamp('2010-02-28') < d < pd.Timestamp('2014-08-31'):
        return 1
    else:
        return 0

df.groupby(df['Date'].map(is_in_range))
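The function above only flags whether a row is in range; for the cumulative buckets described in the question, a minimal sketch (assuming df_raw['Date'] has already been converted to datetime) could be:
import pandas as pd

date_rng = pd.date_range('2010-02-28', '2014-08-31', freq='3M')

# Each cutoff's bucket holds every row dated on or before that cutoff,
# so later buckets include all datapoints from earlier ones.
buckets = {cutoff: df_raw[df_raw['Date'] <= cutoff] for cutoff in date_rng}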


How do I convert the month and day in my dataset to datetime so that I can index it?

End goal: I want to create a graph where the x-axis is the date and there are two y-axes, one for Ontario and one for CMB. Ultimately there would be 5 graphs (Ontario,2 vs CMB,2 & Ontario,3 vs CMB,3, etc.).
Ideal Graph
However, my datasets store the date as "mm-dd" in the Datestamp column, which is object type. Also, the two tables do not share all the same dates.
dfplot_ont.tail()
Datestamp Ontario,2 Ontario,3 Ontario,4 Ontario,5 Ontario,7
18 12-29 -0.664715 0.245738 0.668187 0.016819 -0.493384
19 12-30 0.491311 0.302230 1.140404 1.421685 1.552911
20 01-02 1.213827 0.471704 1.400124 1.599767 1.621120
21 01-03 1.502834 0.048018 0.927907 0.956694 1.052705
22 01-04 -1.965244 -2.917788 0.597355 0.234474 -0.857170
dfplot_cmb.tail()
Datestamp CMB,2 CMB,3 CMB,4 CMB,5 CMB,7
15 12-28 0.907092 0.937362 0.991568 1.030808 1.139708
16 12-29 0.900410 0.919994 0.992267 0.991359 1.034978
17 12-30 1.181259 1.193806 1.272700 1.283576 1.265860
18 01-03 0.751646 0.752037 0.681900 0.686982 0.600167
19 01-04 0.606714 0.532544 0.339825 0.282894 0.127186
I need to change this to datetime, but it seems like I need to include a year to do so. How do I code "if the month is 12, then the year is 2022, and if the month is 1, then the year is 2023"? I will also need to swap this out so that the year is always 2023 once there is data at the end of that year.
I have tried this, but it does not change Datestamp to datetime type:
dfplot_ont['Datestamp'] = pd.to_datetime(dfplot_ont['Datestamp'], format='%m-%d').dt.strftime('%m-%d')
I have also tried this, but then the index ends up not being mm-dd:
dfplot_ont = dfplot_ont.set_index(pd.to_datetime(dfplot_ont['Datestamp'], format='%MM-%dd'))
Datestamp Ontario,2 Ontario,3 Ontario,4 Ontario,5 Ontario,7
Datestamp
1900-01-22 00:12:00 12-22 0.708066 -0.149703 -0.724853 -1.200072 -0.356965
1900-01-23 00:12:00 12-23 -0.520212 0.415213 -1.362347 -1.140712 -0.970853
1900-01-26 00:12:00 12-26 -0.014450 0.612933 -1.149849 -0.952737 -0.925380
1900-01-27 00:12:00 12-27 0.202305 0.669425 -1.102627 -0.893376 -0.925380
1900-01-28 00:12:00 12-28 -0.953721 0.302230 -0.394301 -0.042542 0.302397
I tried this as well, but similar to above, datestamp is not correct:
dfplot_cmb['Datestamp'] = pd.to_datetime(dfplot_cmb['Datestamp'], format='%M-%d')
dfplot_cmb.set_index('Datestamp', inplace=True)
dfplot_cmb.head()
CMB,2 CMB,3 CMB,4 CMB,5 CMB,7
Datestamp
1900-01-19 00:12:00 -1.559724 -1.663136 -1.719869 -1.771499 -1.778253
1900-01-20 00:12:00 -1.311374 -1.250774 -1.156484 -1.076946 -1.038540
1900-01-21 00:12:00 -1.220269 -1.156733 -1.106780 -1.077736 -1.057990
1900-01-22 00:12:00 -0.554371 -0.517907 -0.513658 -0.517146 -0.498735
1900-01-23 00:12:00 0.298617 0.252807 0.218531 0.167709 0.205619
How do I code "if the month is 12, then year is 2022 and if the month is 1, then year is 2023"?
First, create a function to implement the above logic:
import datetime

def create_dt(ds):
    mm = ds[:2]
    dd = ds[-2:]
    # Modify the logic accordingly if you have other months
    if mm == '01':
        dt = f'2023-{mm}-{dd}'
    else:
        dt = f'2022-{mm}-{dd}'
    return datetime.datetime.strptime(dt, '%Y-%m-%d').date()
Then apply this function to the Datestamp column to create a new column:
dfplot_ont['datetime'] = dfplot_ont['Datestamp'].apply(create_dt)
Use the new column datetime as your x-axis.
Here's a method that will add an offset for each year.
import pandas as pd

data = {'date': ['12-29', '12-30', '12-31', '01-02', '01-03', '01-04']}
df = pd.DataFrame(data)
Set a start year. In this case we'll start at 2022.
start_year = 2022
df['date'] = pd.to_datetime(
    df['date'].str.slice(0, 2) + '-'
    + df['date'].str.slice(3, 5) + '-'
    + str(start_year),
    format='%m-%d-%Y'
)
Next we need to create a date offset to increment each year accordingly. Whenever the current month is less than the previous month, we know we've started a new year. We can turn this into a Boolean and cumulatively sum it to get each year's offset.
df['offset'] = df['date'].dt.month.lt(df['date'].dt.month.shift()).cumsum()
Without cumsum(), the data looks like this. Each True marks a point where the year rolls over; cumulatively summing these flags yields each year's offset.
date offset
0 2022-12-29 False
1 2022-12-30 False
2 2022-12-31 False
3 2022-01-02 True
4 2022-01-03 False
5 2022-01-04 False
Finally, we add the offset to the date as calendar years via pd.DateOffset (casting integers with astype('timedelta64[Y]') no longer works in recent pandas).
df['date'] = df['date'] + df['offset'].map(lambda y: pd.DateOffset(years=int(y)))
date offset
0 2022-12-29 0
1 2022-12-30 0
2 2022-12-31 0
3 2023-01-02 1
4 2023-01-03 1
5 2023-01-04 1
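Wrapped into a reusable helper for the question's two frames, the whole method might look like this sketch (add_year is a hypothetical name; it assumes the Datestamp column holds mm-dd strings):
import pandas as pd

def add_year(df, col='Datestamp', start_year=2022):
    # Parse mm-dd with a fixed start year, then roll the year forward
    # whenever the month sequence wraps around (e.g. 12 -> 01).
    out = df.copy()
    dates = pd.to_datetime(out[col] + '-' + str(start_year), format='%m-%d-%Y')
    offset = dates.dt.month.lt(dates.dt.month.shift()).cumsum()
    out[col] = [d + pd.DateOffset(years=int(y)) for d, y in zip(dates, offset)]
    return out

dfplot_ont = add_year(dfplot_ont)
dfplot_cmb = add_year(dfplot_cmb)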

Calculate churn rate in pandas

I have a sales dataset (simplified) with sales from existing customers (first_order = 0):
import pandas as pd
import datetime as dt
df = pd.DataFrame({'Date':['2020-06-30 00:00:00','2020-05-05 00:00:00','2020-04-10 00:00:00','2020-02-26 00:00:00'],
'email':['1#abc.de','2#abc.de','3#abc.de','1#abc.de'],
'first_order':[1,1,1,1],
'Last_Order_Date':['2020-06-30 00:00:00','2020-05-05 00:00:00','2020-04-10 00:00:00','2020-02-26 00:00:00']
})
I would like to analyze how many existing customers we lose per month.
My idea is to
group (count) by month, and
then count how many made their last purchase in the following months, which gives me a churn cross table where I can see that, e.g., we had 300 purchases in January and 10 of them bought for the last time in February.
like this:
Column B is the total number of repeat customers, and column C onward holds the last month they bought something.
E.g. we had 2400 customers in January; 677 of them made their last purchase that month, 203 more followed in February, etc.
I guess I could first group the total number of sales per month, and then group a second dataset by Last_Order_Date and filter by month,
but I suspect there is a handy Python way?! :)
Any ideas?
Thanks!
The code below counts how many purchases were made in each month.
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
df.groupby(pd.Grouper(freq="M")).size()
O/P:
Date
2020-02-29 1
2020-03-31 0
2020-04-30 1
2020-05-31 1
2020-06-30 1
Freq: M, dtype: int64
I couldn't fully work out the required output from the description, but this could be a starting point. Please let me know if it helps you in any way.
Update-1:
df.pivot_table(index='Date', columns='email', values='Last_Order_Date', aggfunc='count')
Output:
email 1#abc.de 2#abc.de 3#abc.de
Date
2020-02-26 00:00:00 1.0 NaN NaN
2020-04-10 00:00:00 NaN NaN 1.0
2020-05-05 00:00:00 NaN 1.0 NaN
2020-06-30 00:00:00 1.0 NaN NaN
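For the cross table described in the question (rows: month of purchase, columns: month of the customer's last purchase), a hedged sketch with pd.crosstab, starting again from the raw frame and assuming both date columns are parsed, could be:
df['Date'] = pd.to_datetime(df['Date'])
df['Last_Order_Date'] = pd.to_datetime(df['Last_Order_Date'])

# Rows: purchase month; columns: month of that customer's last purchase.
churn = pd.crosstab(df['Date'].dt.to_period('M'),
                    df['Last_Order_Date'].dt.to_period('M'))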

Pandas extract week of year and year from date

I ran into this scenario and don't know how to solve it.
I have a data frame where I am trying to add "week_of_year" and "year" columns based on the "date" column, which works fine.
import pandas as pd
df = pd.DataFrame({'date': ['2018-12-31', '2019-01-01', '2019-12-31', '2020-01-01']})
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].apply(lambda x: x.weekofyear)
df['year'] = df['date'].apply(lambda x: x.year)
print(df)
Current Output
date week_of_year year
0 2018-12-31 1 2018
1 2019-01-01 1 2019
2 2019-12-31 1 2019
3 2020-01-01 1 2020
Expected Output
What I am expecting: for 2018 and 2019, the last date falls in the first week of the new year (2019 and 2020 respectively), so I want to add logic where, if the week is 1 but the date belongs to the previous calendar year, the year column records the new year, as in the expected output.
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
Try:
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].dt.isocalendar().week
df['year'] = (df['date'] + pd.to_timedelta(6 - df['date'].dt.weekday, unit='d')).dt.year
Outputs:
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
A few things: generally avoid .apply(...).
For datetime columns you can interact with the dates directly through the df[col].dt accessor.
Then, to get the last day of the week, add 6 - weekday days to the date, where weekday runs from 0 (Monday) to 6 (Sunday).
TLDR CODE
To get the week number as a series:
df['DATE'].dt.isocalendar().week
To add the week as a new column, use the same function and assign the returned series to the column:
df['WEEK'] = df['DATE'].dt.isocalendar().week
TLDR EXPLANATION
Use pd.Series.dt.isocalendar().week to get the week for a given series object.
Note:
column "DATE" must be stored as a datetime column

How to convert this forloop to pandas lambda function, to increase speed

This forloop will take 3 days to complete. How can I increase the speed?
for i in range(df.shape[0]):
    df.loc[df['Creation date'] >= pd.to_datetime(str(df['Original conf GI dte'].iloc[i])), 'delivered'] += df['Sale order item'].iloc[i]
I think the for loop is enough to understand?
If Creation date is on or after Original conf GI date, then add the Sale order item value to the delivered column.
Each row's date is "Date Accepted" (Date Delivered is a future date). Input is Order Quantity, Date Accepted & Date Delivered; output is the Delivered column.
Order Quantity Date Accepted Date Delivered Delivered
20 01-05-2010 01-02-2011 0
10 01-11-2010 01-03-2011 0
300 01-12-2010 01-09-2011 0
5 01-03-2011 01-03-2012 30
20 01-04-2012 01-11-2013 335
10 01-07-2013 01-12-2014 335
Convert the values to NumPy arrays with Series.to_numpy, compare them with broadcasting, pick the order values with numpy.where, and finally sum:
import numpy as np

date1 = df['Date Accepted'].to_numpy()
date2 = df['Date Delivered'].to_numpy()
order = df['Order Quantity'].to_numpy()
# older pandas versions
#date1 = df['Date Accepted'].values
#date2 = df['Date Delivered'].values
#order = df['Order Quantity'].values
df['Delivered1'] = np.where(date1[:, None] >= date2, order, 0).sum(axis=1)
print(df)
Order Quantity Date Accepted Date Delivered Delivered Delivered1
0 20 2010-01-05 2011-01-02 0 0
1 10 2010-01-11 2011-01-03 0 0
2 300 2010-01-12 2011-01-09 0 0
3 5 2011-01-03 2012-01-03 30 30
4 20 2012-01-04 2013-01-11 335 335
5 10 2013-01-07 2014-01-12 335 335
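One caveat: the broadcast above materializes an n x n matrix, which can exhaust memory on the large frames the question implies. A sort-plus-searchsorted sketch avoids that cost (Delivered2 is a hypothetical result column; both date columns are assumed to be datetime):
import numpy as np

# Sort orders by delivery date so quantities can be cumulatively summed.
by_delivery = df.sort_values('Date Delivered')
delivered_dates = by_delivery['Date Delivered'].to_numpy()
cum_qty = np.concatenate(([0], by_delivery['Order Quantity'].to_numpy().cumsum()))

# For each accepted date, total the quantities delivered on or before it.
idx = np.searchsorted(delivered_dates, df['Date Accepted'].to_numpy(), side='right')
df['Delivered2'] = cum_qty[idx]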
If I understand correctly, you can use np.where() for speed. Currently you are looping on the dataframe rows whereas numpy operations are designed to operate on the entire column:
cond = df['Creation date'].ge(pd.to_datetime(df['Original conf GI dte']))
df['delivered'] = np.where(cond, df['delivered'] + df['Sale order item'], df['delivered'])

Count unique dates in pandas dataframe

I have a dataframe of surface weather observations (fzraHrObs) organized by a station identifier code and date. fzraHrObs has several columns of weather data. The station code and date (datetime objects) look like:
usaf dat
716270 2014-11-23 12:00:00
2015-12-20 08:00:00
2015-12-20 09:00:00
2015-12-21 04:00:00
2015-12-28 03:00:00
716280 2015-12-19 08:00:00
2015-12-19 08:00:00
I would like to get a count of the number of unique dates (days) per year for each station - i.e. the number of days of obs per year at each station. In my example above this would give me:
usaf Year Count
716270 2014 1
2015 3
716280 2014 0
2015 1
I've tried using groupby and grouping by station, year, and date:
grouped = fzraHrObs['dat'].groupby([fzraHrObs['usaf'], fzraHrObs.dat.dt.year, fzraHrObs.dat.dt.date])
Count, size, nunique, etc. on this just gives me the number of obs on each date, not the number of dates themselves per year. Any suggestions on getting what I want here?
It could be something like this: group the dates by usaf and year, and then count the number of unique values:
import pandas as pd
df.dat.apply(lambda dt: dt.date()).groupby([df.usaf, df.dat.apply(lambda dt: dt.year)]).nunique()
# usaf dat
# 716270 2014 1
# 2015 3
# 716280 2015 1
# Name: dat, dtype: int64
The following should work:
df.groupby(['usaf', df.dat.dt.year])['dat'].apply(lambda s: s.dt.date.nunique())
What I did differently is group by two levels only, then use the nunique method of pandas series to count the number of unique dates in each group.
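Neither snippet produces the zero row for 716280 in 2014 that the expected output shows; one hedged way to fill in missing station/year combinations is to unstack with a fill value and restack:
counts = df.groupby(['usaf', df.dat.dt.year])['dat'].apply(lambda s: s.dt.date.nunique())
# Insert 0 for station/year combinations with no observations.
counts = counts.unstack(fill_value=0).stack()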
