Pandas: Group by bi-monthly date field - python

I am trying to group hospital staff working hours bi-monthly. I have raw data on a daily basis, which looks like this:
date        hours_spent  emp_id
9/11/2016   8            1
15/11/2016  8            1
22/11/2016  8            2
23/11/2016  8            1
What I want after grouping is:
cycle                  hours_spent  emp_id
1/11/2016-15/11/2016   16           1
16/11/2016-30/11/2016  8            2
16/11/2016-30/11/2016  8            1
I am trying to do this with pd.Grouper and a frequency in pandas, something like below:
data.set_index('date', inplace=True)
print(data.head())
dt = data.groupby(['emp_id', pd.Grouper(key='date', freq='MS')])['hours_spent'].sum().reset_index().sort_values('date')
# df.resample('10d').mean().interpolate(method='linear', axis=0)
print(dt.resample('SMS').sum())
I also tried resampling
df1 = dt.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
data.set_index('date',inplace=True)
df1 = data.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
But this gives plain 15-day intervals, not 1st-to-15th and 16th-to-end-of-month buckets.
Please let me know what I am doing wrong here.

You were almost there. This will do it -
dt = df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent'].sum().reset_index().sort_values('date')
emp_id       date  hours_spent
     1 2016-10-31            8
     1 2016-11-15           16
     2 2016-11-15            8
freq='SM' is the semi-month end frequency: bins are anchored on the 15th and the last day of every month.
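A quick way to see those anchor points (a minimal sketch):
import pandas as pd

# 'SM' anchors on the 15th and the last calendar day of each month
print(pd.date_range('2016-11-01', '2016-12-31', freq='SM'))
# DatetimeIndex(['2016-11-15', '2016-11-30', '2016-12-15', '2016-12-31'], ...)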

Put DateTime-Values into Bins
If I got you right, you basically want to put the values in your date column into bins. For this, pandas provides the pd.cut() function, which does exactly what you want.
Here's an approach which might help you:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'hours': 8,
    'emp_id': [1, 1, 2, 1],
    'date': [datetime(2016, 11, 9),
             datetime(2016, 11, 15),
             datetime(2016, 11, 22),
             datetime(2016, 11, 23)]
})
bins_dt = pd.date_range('2016-10-16', freq='SM', periods=3)
cycle = pd.cut(df.date, bins_dt)
df.groupby([cycle, 'emp_id']).sum()
Which gets you:
cycle                     emp_id  hours
------------------------  ------  -----
(2016-10-31, 2016-11-15]  1       16
                          2       NaN
(2016-11-15, 2016-11-30]  1       8
                          2       8
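Note the NaN row: grouping by the categorical that pd.cut returns produces every category/emp_id combination by default. If you'd rather drop the empty combinations, passing observed=True to groupby should do it (a small sketch):
df.groupby([cycle, 'emp_id'], observed=True).sum()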

Had a similar question, here was my solution:
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
The construction "df1['Date'] + pd.DateOffset(days=-1)" will take whatever is in the date column and -1 day.
The construction "+ pd.offsets.SemiMonthEnd()" converts it to a bimonthly basket, but its off by a day unless you reduce the reference date by 1.
The construction "df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')" cleans out the time so you just have days.

Related

count values of groups by consecutive days

I have data with 3 columns: date, id, sales.
My first task was filtering sales above 100; I did that.
The second task is grouping ids by consecutive days.
index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205
The result should be:
index  id  count
0      03  3
1      07  1
2      07  1
I need to do this task without using pandas/dataframe, but right now I can't imagine from which side to attack this problem.
Just for effort, I tried the suggestion for a solution here: count consecutive days python dataframe,
but the ids were not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that new_frame has a count column, because afterwards I need to count ids by ranges of those consecutive-day counts in the count column, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc. But that's not part of my question.
Thank you a lot.
Your code is close, but needs some fine-tuning, as follows:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
   id  count
0   3      3
1   7      1
2   7      1
Summary of changes:
- As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 would be read as 2018-02-01 instead of 2018-01-02, and the day diff with adjacent entries would be around 30 as opposed to 1.
- A sort step was added on columns id and date to simplify the later grouping when creating the series s.
- In the last groupby(), reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
- In the last groupby(), an extra .reset_index(name='count') turns the Pandas Series back into a DataFrame and names the new column count.
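For intuition on why date.diff().dt.days.ne(1).cumsum() labels runs of consecutive days, here is the intermediate series s for the (sorted) sample data - the label only increments where the gap to the previous row of the same id is not exactly one day:
print(df2.assign(s=s)[['id', 'date', 's']])
#    id       date  s
# 0   3 2018-01-01  1
# 2   3 2018-01-02  1
# 3   3 2018-01-03  1
# 1   7 2018-01-01  2
# 4   7 2018-01-05  3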

Function to change Days to Years

I would like to create a function to convert days to years.
In my data set, the Age of my customers is given in days instead of years.
I do not have any other information about their Age. (No DOB)
This is one of the solutions I have tried:
def convert(age_in_days):
    Age = age_in_days / 365
    return Age
and then calling the function with
convert(data['age_in_days'])
My output is:
[id
1 31.041096
2 83.038356
3 44.024658
4 65.021918
5 53.041096
...
79849 70.013699
79850 46.019178
79851 68.041096
79852 30.024658
79853 54.027397
Name: age_in_days, Length: 79853, dtype: float64]
As you can see, I have a total of 79853 observations.
When I try to convert to a data frame, so I can append it to my main data set, with
pd.DataFrame(Age, columns=['age'])
this is what I get:
age
age_in_days NaN
And nothing else. I need to add the calculated age values to my data set.
I'm not sure there is any need to write a separate function to convert days to years; we can do the same directly in pandas, as shown below.
Sample Input DF
df = pd.DataFrame({
    'id': [11, 12],
    'age_in_day': [22643, 10262]
})
id age_in_day
0 11 22643
1 12 10262
Two Options
Option 1 - Code - Pandas way
import numpy as np
df['age_in_years'] = pd.to_timedelta(df.age_in_day, unit='D') / np.timedelta64(1, 'Y')
Output
id age_in_day age_in_years
0 11 22643 61.994428
1 12 10262 28.096402
Option 2 - Code - float div
df['age_in_years'] = df['age_in_day'] / 365
df
Output
id age_in_day age_in_years
0 11 22643 62.035616
1 12 10262 28.115068
Other options
If we have a separate list/Series of ages in years (generated by an external function) that we want to add to the existing df, then:
Series
0 62.035616
1 28.115068
Name: age_in_day, dtype: float64
df['age_in_years'] = age_Series
List
age_list=[62.035616438356165, 28.115068493150684]
df['age_in_years'] = age_list
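If a separate function is still desired, a minimal corrected version of the attempt from the question might look like this (dividing by 365.25 to roughly account for leap years is an assumption, not part of the original):
def convert(age_in_days):
    # element-wise division works on a pandas Series as well as on a scalar
    return age_in_days / 365.25

data['age'] = convert(data['age_in_days'])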

grouping time-series data based on starting and ending date

I have time-series data for a yearly sports tournament, with the date when each game was played. I want to group the games by the season (year) they were played in. Each season starts in August and ends the NEXT year in July.
How would I go about grouping the games by season, like -
season(2016-2017), season(2017-2018), etc..
This Answer involving df.resample() may be related, but I'm not sure how I'd go about doing it.
This is what the date column looks like:
DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
...
29/09/18
07/10/18
28/10/18
03/11/18
I want to group by seasons so that I can perform visualization operations over the aggregated data.
UPDATE: For the time being, my solution is to split the dataframe into groups of 32, as I know each season has 32 games. This is the code I've used:
split_df = np.array_split(df, np.arange(0, len(df), 32))
But I'd prefer something more elegant and better suited to time-series data, so I'll keep the question open.
The key to success is proper grouping, in your case pd.Grouper(key='DATA', freq='AS-AUG').
Note that freq='AS-AUG' states that your groups should start from the start of August each year.
Look at the following script:
import pandas as pd
import numpy as np
# Source columns
dates = ['01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09',
         '25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
         '25/12/10', '20/04/11', '31/07/11']
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))
If you print the results:
for name, group in gr:
    print()
    print(name)
    print(group)
you will get:
2008-08-01 00:00:00
DATA SCORE_X SCORE_Y
0 2009-04-01 16 11
1 2009-07-31 10 7
2009-08-01 00:00:00
DATA SCORE_X SCORE_Y
2 2009-08-01 19 6
3 2009-09-26 14 5
4 2009-10-04 8 11
5 2009-12-17 12 19
6 2010-01-25 0 0
7 2010-04-20 17 6
8 2010-07-31 18 2
2010-08-01 00:00:00
DATA SCORE_X SCORE_Y
9 2010-08-01 15 18
10 2010-10-28 2 4
11 2010-11-03 8 16
12 2010-12-25 13 1
13 2011-04-20 19 7
14 2011-07-31 8 3
As you can see, each group starts on the 1st of August and ends on the 31st of July.
Then you can do with your groups whatever you want, e.g. aggregate them (see the sketch below).
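For instance, summing the scores per season (a minimal sketch):
seasons = gr[['SCORE_X', 'SCORE_Y']].sum()
print(seasons)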
Use -
df.groupby(df['DATE'].dt.year).count()
Output
DATE
DATE
2009 5
2018 4
Custom Season Grouping
min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()
Output
DATE
DATE
(2009-07-31, 2010-07-31] 3
(2010-07-31, 2011-07-31] 0
(2011-07-31, 2012-07-31] 0
(2012-07-31, 2013-07-31] 0
(2013-07-31, 2014-07-31] 0
(2014-07-31, 2015-07-31] 0
(2015-07-31, 2016-07-31] 0
(2016-07-31, 2017-07-31] 0
(2017-07-31, 2018-07-31] 1
Resampling using 'A-JUL' as an anchored offset alias should do the trick:
>>> df
SAMPLE
DATE
2009-01-30 1
2009-07-10 4
2009-11-20 3
2010-01-01 5
2010-05-13 1
2010-08-01 1
>>> df.resample('A-JUL').sum()
SAMPLE
DATE
2009-07-31 5
2010-07-31 9
2011-07-31 1
A indicates it is a yearly interval, -JUL indicates it ends in July.
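The same anchored alias should also work with the groupby approach used elsewhere in this thread (a sketch, assuming DATE is a datetime column rather than the index; otherwise use level='DATE'):
df.groupby(pd.Grouper(key='DATE', freq='A-JUL')).sum()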
You could build a season column and group by that. In the code below, I used pandas.DateOffset() to move all dates 7 months back, so a game that happened in August looks like it happened in January, aligning the season year with the calendar year. Building the season string is fairly straightforward after that.
import pandas as pd
from datetime import date
dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])
# copy the date column to a separate dataframe to do the work
df_tmp = df[['date']].copy()
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)
# copy season column to the main dataframe
df['season'] = df_tmp['season']
df.groupby('season').count()

Slice, combine, and map fiscal year dates to calendar year dates to new column

I have the following pandas data frame:
Shortcut_Dimension_4_Code Stage_Code
10225003 2
8225003 1
8225004 3
8225005 4
It is part of a much larger dataset that I need to be able to filter by month and year. I need to pull the fiscal year from the first two digits of the Shortcut_Dimension_4_Code column for values larger than 9999999, and from the first digit for values less than or equal to 9999999. That value needs to be prefixed with "20" to produce a year, i.e. "20" + "8" = 2008, "20" + "10" = 2010.
That year (2008, 2010) needs to be combined with the stage code value (1-12) to produce a month/year, i.e. 02/2010.
The date 02/2010 then needs to be converted from a fiscal year date to a calendar year date, i.e. fiscal year date 02/2010 = calendar year date 08/2009. The resulting date needs to be presented in a new column. The resulting df would end up looking like this:
Shortcut_Dimension_4_Code Stage_Code Date
10225003 2 08/2009
8225003 1 07/2007
8225004 3 09/2007
8225005 4 10/2007
I am new to pandas and python and could use some help. I am beginning with this:
Shortcut_Dimension_4_Code Stage_Code CY_Month Fiscal_Year
0 10225003 2 8.0 10
1 8225003 1 7.0 82
2 8225003 1 7.0 82
3 8225003 1 7.0 82
4 8225003 1 7.0 82
I used .map and .str methods to produce this df, but have not been able to figure out how to get the FY's right, for fy 2008-2009.
In below code, I'll assume Shortcut_Dimension_4_Code is an integer. If it's a string you can convert it or slice it like this: df['Shortcut_Dimension_4_Code'].str[:-6]. More explanations in comments alongside the code.
That should work as long as you don't have to deal with empty values.
import pandas as pd
import numpy as np
from datetime import date
from dateutil.relativedelta import relativedelta
fiscal_month_offset = 6
input_df = pd.DataFrame(
    [[10225003, 2],
     [8225003, 1],
     [8225004, 3],
     [8225005, 4]],
    columns=['Shortcut_Dimension_4_Code', 'Stage_Code'])
# make a copy of input dataframe to avoid modifying it
df = input_df.copy()
# numpy will help us with numeric operations on large collections
df['fiscal_year'] = 2000 + np.floor_divide(df['Shortcut_Dimension_4_Code'], 1000000)
# loop with `apply` to create `date` objects from available columns
# day is a required field in date, so we'll just use 1
df['fiscal_date'] = df.apply(lambda row: date(row['fiscal_year'], row['Stage_Code'], 1), axis=1)
df['calendar_date'] = df['fiscal_date'] - relativedelta(months=fiscal_month_offset)
# by default python dates will be stored as object dtype in pandas; you can verify with `df.info()`
# to use the clever things pandas can do with dates, we need to convert it
df['calendar_date'] = pd.to_datetime(df['calendar_date'])
# I would just keep date as datetime type so I could access year and month
# but to create same representation as in question, let's format it as string
df['Date'] = df['calendar_date'].dt.strftime('%m/%Y')
# copy important columns into output dataframe
output_df = df[['Shortcut_Dimension_4_Code', 'Stage_Code', 'Date']].copy()
print(output_df)
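A vectorized alternative that avoids the row-wise apply might look like this (a sketch under the same assumptions; pd.to_datetime can assemble dates from year/month/day columns):
df['fiscal_year'] = 2000 + df['Shortcut_Dimension_4_Code'] // 1000000
fiscal = pd.to_datetime(dict(year=df['fiscal_year'], month=df['Stage_Code'], day=1))
df['Date'] = (fiscal - pd.DateOffset(months=6)).dt.strftime('%m/%Y')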

How to convert this to a for-loop with an output to CSV

I'm trying to put together a generic piece of code that would:
Take a time series for some price data and divide it into deciles, e.g. take the past 18m of gold prices and divide it into deciles [DONE, see below]
date        4. close  decile
2017-01-03    1158.2       0
2017-01-04    1166.5       1
2017-01-05    1181.4       2
2017-01-06    1175.7       1
...              ...     ...
2018-04-23    1326.0       7
2018-04-24    1333.2       8
2018-04-25    1327.2       7

[374 rows x 2 columns]
Pull out the dates for a particular decile, then create a secondary datelist with an added 30 days
#So far only for a single decile at a time
firstdecile = gold.loc[gold['decile'] == 1]
datelist = list(pd.to_datetime(firstdecile.index))
datelist2 = list(pd.to_datetime(firstdecile.index) + pd.DateOffset(months=1))
Take an average of those 30-day price returns for each decile
level1 = gold.loc[datelist]
level2 = gold.loc[datelist2]
level2.index = level2.index - pd.DateOffset(months=1)
result = pd.merge(level1,level2, how='inner', left_index=True, right_index=True)
def ret(one, two):
    return (two - one) / one
pricereturns = result.apply(lambda x: ret(x['4. close_x'], x['4. close_y']), axis=1)
mean = pricereturns.mean()
Return the list of all 10 averages in a single CSV file
So far I've been able to put together something functional that does steps 1-3, but only for a single decile, and I'm struggling to expand this into a loop over all 10 deciles with a clean CSV output.
First append the close price at t + 1 month as a new column on the whole dataframe.
gold2_close = gold['close'].reindex(gold.index + pd.DateOffset(months=1))
gold2_close.index = gold.index
gold['close+1m'] = gold2_close
In practice, however, what matters is the number of trading days, i.e. you won't have prices for weekends or holidays. So I'd suggest shifting by a number of rows rather than by a date range, i.e. the next 20 trading days:
gold['close+20'] = gold['close'].shift(periods=-20)
Now calculate the expected return for each row
gold['ret'] = (gold['close+20'] - gold['close']) / gold['close']
You can also combine steps 1 and 2 directly so you don't need the additional column (only when shifting by a number of rows, not by a fixed date range, due to the reindexing):
gold['ret'] = (gold['close'].shift(periods=-20) - gold['close']) / gold['close']
Since you already have your deciles, you just need to groupby the deciles and aggregate the returns with mean()
gold_grouped = gold.groupby(by="decile").mean()
Putting in some random data, you get something like the dataframe below, where close and ret are the averages for each decile. You can write a dataframe to CSV via pandas.DataFrame.to_csv (see the sketch after the table).
close ret
decile
0 1238.343597 -0.018290
1 1245.663315 0.023657
2 1254.073343 -0.025934
3 1195.941312 0.009938
4 1212.394511 0.002616
5 1245.961831 -0.047414
6 1200.676333 0.049512
7 1181.179956 0.059099
8 1214.438133 0.039242
9 1203.060985 0.029938
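Writing the aggregated result to disk is then a one-liner (the file name is illustrative):
gold_grouped.to_csv('decile_returns.csv')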
