I would like to count how many customers were active in each month over the past years. My dataframe contains a customer ID, a start date (when the customer became a customer) and an end date (when they stopped being a customer):
Customer_ID  StartDate   EndDate
1            01/01/2019  NAT
2            25/10/2017  01/06/2020
2            13/06/2012  15/07/2015
2            20/12/2015  03/01/2016
2            25/03/2016  14/06/2017
3            05/06/2018  05/06/2019
3            12/12/2019  NAT
The result I would like is a count of the customers that were "active" in each month-year combination:
MONTH YEAR NUMB_CUSTOMERS
01 2013 1
02 2013 1
03 2013 1
04 2013 1
...
01 2019 2
...
09 2020 2
I would like to avoid for-loops, as they take too long (I have a table of over 100 000 rows).
Does anyone have an idea for doing this neatly and quickly?
Thanks!
First, read the data and reduce each date to a "mm/YYYY" key the program can group on:
import pandas as pd
import datetime

df = pd.read_csv("table.csv")

# keep only the "mm/YYYY" part of each date, since we group by month
func = lambda x: x.split('/', maxsplit=1)[1]
df["StartDate"] = df["StartDate"].apply(func)
# convert only real end dates; leave the "NAT" placeholder untouched
mask = df["EndDate"] != "NAT"
df.loc[mask, "EndDate"] = df.loc[mask, "EndDate"].apply(func)
Then, count the changes in the number of clients per month (you basically take a derivative of your data):
customers_gained = df[["Customer_ID", "StartDate"]].groupby("StartDate").agg("count")
customers_lost = df[["Customer_ID", "EndDate"]].groupby("EndDate").agg("count")
# drop the placeholder bucket for customers that never left
customers_lost.drop("NAT", inplace=True)
Make a monthly time table covering all changes in the number of clients:
def make_time_table(start, end):
    start_date = datetime.datetime.strptime(start, "%d/%m/%Y")
    end_date = datetime.datetime.strptime(end, "%d/%m/%Y")
    # one zero-initialised entry per month ("M" is the month-end frequency)
    data_range = pd.date_range(start_date, end_date, freq="M")
    string_range = [el.strftime("%m/%Y") for el in data_range]
    ser = pd.Series([0] * data_range.size, index=string_range)
    return ser
Next, write the changes into the time table and "integrate" by accumulating:
time_table = make_time_table("01/01/2012", "01/12/2020")
time_table[customers_gained.index] = customers_gained["Customer_ID"]
time_table[customers_lost.index] -= customers_lost["Customer_ID"]
result = time_table.cumsum()
print(result)
Outputs:
01/2012 0
02/2012 0
03/2012 0
04/2012 0
05/2012 0
06/2012 1
07/2012 1
...
10/2019 2
11/2019 2
12/2019 3
01/2020 3
02/2020 3
03/2020 3
04/2020 3
05/2020 3
06/2020 2
07/2020 2
08/2020 2
09/2020 2
10/2020 2
11/2020 2
dtype: int64
table.csv
Customer_ID,StartDate,EndDate
1,01/01/2019,NAT
2,25/10/2017,01/06/2020
2,13/06/2012,15/07/2015
2,20/12/2015,03/01/2016
2,25/03/2016,14/06/2017
3,25/03/2016,05/06/2019
3,12/12/2019,NAT
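For reference, a more compact alternative (a sketch, not the approach above) is to expand each row into the months it covers with pd.period_range and count rows per month. This assumes the same table.csv and caps still-active customers (NAT) at 01/12/2020:

import pandas as pd

df = pd.read_csv("table.csv")
df["StartDate"] = pd.to_datetime(df["StartDate"], format="%d/%m/%Y")
# the "NAT" placeholders become real NaT values via errors="coerce"
df["EndDate"] = pd.to_datetime(df["EndDate"], format="%d/%m/%Y", errors="coerce")
# treat still-active customers as active through the end of the report window
df["EndDate"] = df["EndDate"].fillna(pd.Timestamp("2020-12-01"))

# one monthly Period per active month per row, then count rows per month
active = df.apply(lambda r: pd.period_range(r["StartDate"], r["EndDate"], freq="M"), axis=1)
print(active.explode().value_counts().sort_index())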
Related
I'm wondering how to optimize part of my code to remove a loop that takes forever, since I have around 350 000 IDs.
Here is the current code, which is not optimal and takes quite a while.
I'm trying to make it work better and, if possible, to remove the loop.
The dataset is made of 4 columns: IDs, start_dates, end_dates and amount. We can have multiple rows with the same ID, but not with the same amount. The main thing is that in some rows the dates are not saved in the dataset. In that case we have to find the earliest start_date and the latest end_date of that ID and add them to the row where they are missing:
ID   start_date  end_date    value
ABC  12/10/2010  12/12/2020  8
ABC  01/01/2020  01/04/2021  9
ABC                          43
BCD  14/02/2020  14/03/2020  8
So on the third row we should have 12/10/2010 as the start_date and 01/04/2021 as the end_date. It's not visible in this sample, but don't forget that BCD's start_date could be earlier than ABC's; you would still use 12/10/2010 here, because the fill is linked to the ID.
for x in df['ID'].unique():
    tmp = df.loc[df['ID'] == x].reset_index()
    df.loc[(df['ID'] == x) & (df['start_date'].isna()), 'start_date'] = tmp['start_date'].min()
    df.loc[(df['ID'] == x) & (df['end_date'].isna()), 'end_date'] = tmp['end_date'].max()
I suppose the code is quite clear about what I am trying to do.
But if you have any questions, don't hesitate to post them; I'll do my best to answer.
Set up the job:
import pandas as pd

data = {'ID': ['ABC', 'ABC', 'ABC', 'BCD'],
        'start_date': ['12/10/2010', '01/01/2020', None, '14/02/2020'],
        'end_date': ['12/12/2020', '01/01/2021', None, '14/03/2020'],
        'value': [8, 9, 43, 8]}
df = pd.DataFrame(data)
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
We get this result:
ID start_date end_date value
0 ABC 2010-12-10 2020-12-12 8
1 ABC 2020-01-01 2021-01-01 9
2 ABC NaT NaT 43
3 BCD 2020-02-14 2020-03-14 8
Do the work:
# fill each missing date with that ID's earliest start / latest end
df.start_date = df.groupby('ID')['start_date'].apply(lambda x: x.fillna(x.min()))
df.end_date = df.groupby('ID')['end_date'].apply(lambda x: x.fillna(x.max()))
We get this result:
ID start_date end_date value
0 ABC 2010-12-10 2020-12-12 8
1 ABC 2020-01-01 2021-01-01 9
2 ABC 2010-12-10 2021-01-01 43
3 BCD 2020-02-14 2020-03-14 8
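With ~350 000 IDs, the same fill can also be expressed with groupby().transform, which broadcasts each group's min/max back onto its rows and usually runs faster than a per-group lambda; a sketch on the same df:

# broadcast each ID's earliest start / latest end to its rows, fill only the gaps
df['start_date'] = df['start_date'].fillna(df.groupby('ID')['start_date'].transform('min'))
df['end_date'] = df['end_date'].fillna(df.groupby('ID')['end_date'].transform('max'))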
I currently have a df in pandas called astrology that contains two columns. One column, called birthdate, has dates from which I would like to create two new datetime variables (one variable to record the month and day, and another variable to record the year).
My current df looks like this:
birthdate howMuch
1/1/95 8
3/15/80 7
5/28/86 1
11/16/61 5
12/15/88 2
Desired df:
month-day year howMuch
1-1 1995 8
3-15 1980 7
5-28 1986 1
11-16 1961 5
12-15 1988 2
The current code I tried is:
astrology['year'] = pd.to_datetime(astrology['birthdate'])
And I get the error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 7545-07-14 00:00:00
First, it is possible to test whether the cleaning is correct: look for datetimes that cannot be parsed by calling to_datetime with the parameter errors='coerce', so unparsed datetimes become NaT, and filter them with Series.isna and boolean indexing:
print (astrology[pd.to_datetime(astrology['birthdate'], errors='coerce').isna()])
Then convert to datetimes. For the month-day format use Series.dt.strftime, and for the years Series.dt.year; but it is necessary to subtract 100 to avoid years above the current year (a two-digit year like 61 would otherwise parse as 2061):
dates = pd.to_datetime(astrology['birthdate'])
y = dates.dt.year
now = pd.to_datetime('now').year
astrology = astrology.assign(monthday=dates.dt.strftime('%m/%d'),
                             year=y.mask(y > now, y - 100))
print(astrology)
birthdate howMuch monthday year
0 1/1/95 8 01/01 1995
1 3/15/80 7 03/15 1980
2 5/28/86 1 05/28 1986
3 11/16/61 5 11/16 1961
4 12/15/88 2 12/15 1988
If you want the month-day column without padding zeros, use Series.str.rsplit and select the first element of each split list with str[0]:
md = astrology['birthdate'].str.rsplit('/', n=1).str[0]
dates = pd.to_datetime(astrology['birthdate'])
y = dates.dt.year
now = pd.to_datetime('now').year
astrology = astrology.assign(monthday=md,
                             year=y.mask(y > now, y - 100))
print(astrology)
birthdate howMuch monthday year
0 1/1/95 8 1/1 1995
1 3/15/80 7 3/15 1980
2 5/28/86 1 5/28 1986
3 11/16/61 5 11/16 1961
4 12/15/88 2 12/15 1988
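If the original OutOfBoundsDatetime still appears, it usually means some rows do not match the expected layout; passing an explicit format together with errors='coerce' flags them instead of raising (a sketch, assuming month/day/two-digit-year input as in the sample):

dates = pd.to_datetime(astrology['birthdate'], format='%m/%d/%y', errors='coerce')
print(astrology[dates.isna()])  # rows that failed to parse under the expected layout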
I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
Use Grouper with GroupBy.last, forward fill missing values with ffill, and turn the index back into a column with Series.reset_index:
#if necessary
#df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='m',key='date'))['totalShrs'].last().ffill().reset_index()
#alternative
#df = df.resample('m',on='date')['totalShrs'].last().ffill().reset_index()
print (df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
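Note that in recent pandas (2.2 and later) the 'M' alias for month-end frequency is deprecated in favour of 'ME', so the equivalent resample call would be:

df = df.resample('ME', on='date')['totalShrs'].last().ffill().reset_index()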
The following gives you the information you want, i.e. end-of-month values, though the format is not exactly what you asked for:
df['month'] = df['date'].str.split('-', expand=True)[1]  # split date column to get month column
newdf = pd.DataFrame(columns=df.columns)  # create a new dataframe for output
grouped = df.groupby('month')  # get grouped values
for g in grouped:  # for each group, get last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf), :] = gdf.iloc[-1, :]  # fill new dataframe with last row obtained
newdf = newdf.drop('date', axis=1)  # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08
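One caveat: grouping on the month string alone merges the same month from different years. If the data spans several years, grouping on the 'YYYY-MM' prefix keeps them apart; a sketch on the same df:

df['month'] = df['date'].str.slice(0, 7)  # 'YYYY-MM' keeps 2009-04 separate from 2010-04
newdf = df.sort_values('date').groupby('month', as_index=False).last()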
I'm trying to make a program that will equally distribute employees' days off. There are 4 groups and each group has its own weekmask for each week of the month. So far I've made code that changes the weekmask when it locates a 0 in the DataFrame (Sunday). I'm stuck on structuring the np.busday_count(start, end, weekmask=...) call so that the start and end dates change automatically.
My Dataframe looks like this:
And here's my code:
a: int = 0
week_mask: str = '1100111'

def _change_week_mask():
    global a, week_mask
    a += 1
    if a == 1:
        week_mask = '1111000'
    elif a == 2:
        week_mask = '1111111'
    elif a == 3:
        week_mask = '0011111'
    else:
        a = 0

for line in rows['Workday']:
    if line == '0':  # '==' compares values; 'is' would test object identity
        _change_week_mask()
Edit: changed the value of start week from 6 to 0.
OK, so to answer your problem I have created a sample data frame with the code below.
Then I added the following columns to the data frame:
dayofweek - to get data similar to what you created by setting every Sunday to zero; in this case Monday is set to zero and Sunday is six.
weeknum - week of the year.
week - instead of counting and then changing the week mask, I assigned each week a value from 0 to 3 and calculate the mask from it.
weekmask - calculated from the value of week; you might need to align this with your logic.
weekenddate - the end date, calculated by adding 7 days to the start date; if the month changes mid-week, this holds the month-end date instead.
After this we can create a new data frame holding only the end-of-week entries; in this case Monday is 0, so I filter on 0.
Then you can apply the function and store the result in the data frame.
import datetime
import pandas as pd
import numpy as np

df_ = pd.DataFrame({'startdate': pd.date_range(pd.to_datetime('2018-10-01'),
                                               pd.to_datetime('2018-11-30'))})
df_['dayofweek'] = df_.startdate.dt.dayofweek
df_['remaining_days_in_month'] = df_.startdate.dt.days_in_month - df_.startdate.dt.day
df_['week'] = df_.startdate.dt.week % 4  # newer pandas: .dt.isocalendar().week
df_['day'] = df_.startdate.dt.day
df_['weekmask'] = df_.week.map({0: '1100111', 1: '1111000', 2: '1111111', 3: '0011111'})
# end of week is start + (7 - dayofweek) days, capped at the end of the month
df_['weekenddate'] = [x[0] + datetime.timedelta(days=(7 - x[1])) if x[2] > 7 - x[1]
                      else x[0] + datetime.timedelta(days=(x[2]))
                      for x in df_[['startdate', 'dayofweek', 'remaining_days_in_month']].values]
final_df = df_[(df_['dayofweek'] == 0) | (df_['day'] == 1)][['startdate', 'weekenddate', 'weekmask']]
final_df['numberofdays'] = [np.busday_count(x[0].astype('<M8[D]'), x[1].astype('<M8[D]'), weekmask=x[2])
                            for x in final_df.values.astype(str)]
Output:
startdate weekenddate weekmask numberofdays
0 2018-10-01 2018-10-08 1100111 5
7 2018-10-08 2018-10-15 1111000 4
14 2018-10-15 2018-10-22 1111111 7
21 2018-10-22 2018-10-29 0011111 5
28 2018-10-29 2018-10-31 1100111 2
31 2018-11-01 2018-11-05 1100111 3
35 2018-11-05 2018-11-12 1111000 4
42 2018-11-12 2018-11-19 1111111 7
49 2018-11-19 2018-11-26 0011111 5
56 2018-11-26 2018-11-30 1100111 2
Let me know if this needs some changes for your requirements.
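For reference, the core call everything above feeds into is np.busday_count with a custom weekmask; a minimal standalone check against the first output row:

import numpy as np

# weekmask positions run Monday..Sunday; '1100111' counts Mon, Tue, Fri, Sat, Sun
print(np.busday_count('2018-10-01', '2018-10-08', weekmask='1100111'))  # 5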
I have time-series data from a yearly sports tournament, with the date when each game was played. I want to group the games by the season (year) they were played in. Each season starts in August and ends in July of the NEXT year.
How would I go about grouping the games by season, like -
season(2016-2017), season(2017-2018), etc..
This Answer involving df.resample() may be related, but I'm not sure how I'd go about doing it.
This is what the date column looks like:
DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
...
29/09/18
07/10/18
28/10/18
03/11/18
I want to group by seasons so that I can perform visualization operations over the aggregated data.
UPDATE: For the time being my solution is to split up the dataframe into groups of 32 as I know each season has 32 games. This is the code I've used:
split_df = np.array_split(df, np.arange(0, len(df),32))
But I'd prefer something more elegant that actually uses the time-series data, so I'll keep the question open.
The key to success is proper grouping, in your case pd.Grouper(key='DATA', freq='AS-AUG').
Note that freq='AS-AUG' states that your groups should start from the start of
August each year.
Look at the following script:
import numpy as np
import pandas as pd

# Source columns
dates = ['01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09',
         '25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
         '25/12/10', '20/04/11', '31/07/11']
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))
If you print the results:
for name, group in gr:
    print()
    print(name)
    print(group)
you will get:
2008-08-01 00:00:00
DATA SCORE_X SCORE_Y
0 2009-04-01 16 11
1 2009-07-31 10 7
2009-08-01 00:00:00
DATA SCORE_X SCORE_Y
2 2009-08-01 19 6
3 2009-09-26 14 5
4 2009-10-04 8 11
5 2009-12-17 12 19
6 2010-01-25 0 0
7 2010-04-20 17 6
8 2010-07-31 18 2
2010-08-01 00:00:00
DATA SCORE_X SCORE_Y
9 2010-08-01 15 18
10 2010-10-28 2 4
11 2010-11-03 8 16
12 2010-12-25 13 1
13 2011-04-20 19 7
14 2011-07-31 8 3
As you can see, each group starts on the 1st of August and ends on the 31st of July.
Then you can do whatever you want with your groups.
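If you want per-season totals rather than the printed groups, the same grouper feeds straight into an aggregation; a sketch reusing gr from above:

season_totals = gr[['SCORE_X', 'SCORE_Y']].sum()
print(season_totals)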
Use -
df.groupby(df['DATE'].dt.year).count()
Output
DATE
DATE
2009 5
2018 4
Custom Season Grouping
min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()
Output
DATE
DATE
(2009-07-31, 2010-07-31] 3
(2010-07-31, 2011-07-31] 0
(2011-07-31, 2012-07-31] 0
(2012-07-31, 2013-07-31] 0
(2013-07-31, 2014-07-31] 0
(2014-07-31, 2015-07-31] 0
(2015-07-31, 2016-07-31] 0
(2016-07-31, 2017-07-31] 0
(2017-07-31, 2018-07-31] 1
Resampling using 'A-JUL' as an anchored offset alias should do the trick:
>>> df
SAMPLE
DATE
2009-01-30 1
2009-07-10 4
2009-11-20 3
2010-01-01 5
2010-05-13 1
2010-08-01 1
>>> df.resample('A-JUL').sum()
SAMPLE
DATE
2009-07-31 5
2010-07-31 9
2011-07-31 1
'A' indicates a yearly interval and '-JUL' indicates that it ends in July.
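The same anchored alias also works with pd.Grouper if you want the season groups themselves rather than an aggregate; a sketch on the frame above, where DATE is the index:

for season_end, games in df.groupby(pd.Grouper(freq='A-JUL')):
    print(season_end.date(), len(games))  # 2009-07-31 2, 2010-07-31 3, 2011-07-31 1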
You could build a season column and group by that. In the code below, I used pandas.DateOffset() to move all dates back 7 months, so a game that happened in August looks as if it happened in January; this aligns the season year with the calendar year. Building the season string is fairly straightforward after that.
import pandas as pd
from datetime import date

dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])
# copy the date column to a separate dataframe to do the work
# (.copy() avoids a SettingWithCopyWarning when adding columns)
df_tmp = df[['date']].copy()
# shift back 7 months so an August game lands in the season's starting calendar year
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)
# copy season column to the main dataframe
df['season'] = df_tmp['season']
df.groupby('season').count()
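A quick check of the labels (the dates follow from the 17-day frequency used above):

print(df.head())
#         date     season
# 0 2009-08-01  2009-2010
# 1 2009-08-18  2009-2010
# 2 2009-09-04  2009-2010
# 3 2009-09-21  2009-2010
# 4 2009-10-08  2009-2010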