Trouble adding season column based on a datetime object - python

I'm trying to finish my work project but I'm getting stuck at a certain point.
Part of the dataframe I have is this:
year_month  year  month
2007-01     2007      1
2009-07     2009      7
2010-03     2010      3
However, I want to add a "season" column. The data describes soccer seasons, and the season column needs to show which season each row belongs to: if the month is 3 or smaller, "season" should be (year - 1) + "/" + year, and if it is larger, year + "/" + (year + 1).
The table should look like this:
year_month  year  month  season
2007-01     2007      1  2006/2007
2009-07     2009      7  2009/2010
2010-03     2010      3  2009/2010
Hopefully someone else can help me with this problem.
Here is the code to create the first table:
import pandas as pd

df = pd.DataFrame({'year_month': ["2007-01", "2009-07", "2010-03"],
                   'year': [2007, 2009, 2010],
                   'month': [1, 7, 3]})

# convert the 'year_month' column to datetime format
df['year_month'] = pd.to_datetime(df['year_month'])
Thanks in advance!

You can use np.where() to specify the condition and get the corresponding strings according to whether the condition is True or False, as follows:
import numpy as np

df['season'] = np.where(df['month'] <= 3,
                        (df['year'] - 1).astype(str) + '/' + df['year'].astype(str),
                        df['year'].astype(str) + '/' + (df['year'] + 1).astype(str))
Result:
year_month year month season
0 2007-01-01 2007 1 2006/2007
1 2009-07-01 2009 7 2009/2010
2 2010-03-01 2010 3 2009/2010

You can use a lambda function with conditionals and axis=1 to apply it to each row. Using f-strings reduces the code needed to turn the year values into the strings required for your new season column.
df['season'] = df.apply(lambda x: f"{x['year']-1}/{x['year']}" if x['month'] <= 3 else f"{x['year']}/{x['year']+1}", axis=1)
Output:
year_month year month season
0 2007-01 2007 1 2006/2007
1 2009-07 2009 7 2009/2010
2 2010-03 2010 3 2009/2010

Related

Return fiscal quarter from dataframe date column with custom string in Python

import datetime as dt

quarter = pd.Timestamp(dt.date(2020, 1, 1)).quarter
assert quarter == 1
df['quarter'] = df['date'].dt.quarter
This returns 1, 2, 3 or 4 in df['quarter'], depending on the date in column df['date'].
What I would like to have is this format in column df['quarter']:
Qx-2019 or Qx-2020 depending on the year, where x is the quarter found with the script above.
How can I get the specific year together with the quarter and build the format Qx-year?
Thank you.
Try with to_period (here s is the date column, i.e. df['date']):
s.dt.to_period('Q')
Out[159]:
0 2020Q4
1 2019Q1
dtype: period[Q-DEC]
Update
'Q' + df['date'].dt.quarter.astype(str) + '-' + df['date'].dt.year.astype(str)
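Putting both ideas together, here is a minimal, self-contained sketch; the toy DataFrame and its 'date' values are made up for illustration and are not from the original question:

import pandas as pd

# hypothetical sample data for illustration
df = pd.DataFrame({'date': pd.to_datetime(['2020-01-15', '2019-03-02', '2019-11-30'])})

# period-style labels such as 2020Q1
df['quarter_period'] = df['date'].dt.to_period('Q').astype(str)

# custom "Qx-YYYY" labels as asked in the question
df['quarter'] = 'Q' + df['date'].dt.quarter.astype(str) + '-' + df['date'].dt.year.astype(str)

print(df)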

Column in pandas dataframe in the form D/M/YY to two datetime variables

I currently have a df in pandas called astrology that contains two columns. The column called birthdate has dates from which I would like to create two new variables: one to record the month and day, and another to record the year.
My current df looks like this:
birthdate howMuch
1/1/95 8
3/15/80 7
5/28/86 1
11/16/61 5
12/15/88 2
Desired df:
month-day year howMuch
1-1 1995 8
3-15 1980 7
5-28 1986 1
11-16 1961 5
12-15 1988 2
The current code I tried is:
astrology['year'] = pd.to_datetime(astrology['birthdate'])
And I get the error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 7545-07-14 00:00:00
First, it is worth testing whether the cleaning is correct: parse with to_datetime and the parameter errors='coerce' so that datetimes which cannot be parsed become NaT, then filter them with Series.isna and boolean indexing:
print(astrology[pd.to_datetime(astrology['birthdate'], errors='coerce').isna()])
Then convert to datetimes; Series.dt.strftime is used for the month-and-day format and Series.dt.year for the years, but it is necessary to subtract 100 to avoid parsing years above the current year:
dates = pd.to_datetime(astrology['birthdate'])
y = dates.dt.year
now = pd.to_datetime('now').year

astrology = astrology.assign(monthday=dates.dt.strftime('%m/%d'),
                             year=y.mask(y > now, y - 100))
print(astrology)
birthdate howMuch monthday year
0 1/1/95 8 01/01 1995
1 3/15/80 7 03/15 1980
2 5/28/86 1 05/28 1986
3 11/16/61 5 11/16 1961
4 12/15/88 2 12/15 1988
If you want the month-day column without zero padding, use Series.str.rsplit and select the first part of the split lists by indexing with str[0]:
md = astrology['birthdate'].str.rsplit('/', n=1).str[0]
dates = pd.to_datetime(astrology['birthdate'])
y = dates.dt.year
now = pd.to_datetime('now').year

astrology = astrology.assign(monthday=md,
                             year=y.mask(y > now, y - 100))
print(astrology)
birthdate howMuch monthday year
0 1/1/95 8 1/1 1995
1 3/15/80 7 3/15 1980
2 5/28/86 1 5/28 1986
3 11/16/61 5 11/16 1961
4 12/15/88 2 12/15 1988

grouping time-series data based on starting and ending date

I have time-series data for a yearly sports tournament, with the date when each game was played. I want to group the games by the season (year) they were played in. Each season starts in August and ends in July of the next year.
How would I go about grouping the games by season, like
season(2016-2017), season(2017-2018), etc.?
This Answer involving df.resample() may be related, but I'm not sure how I'd go about doing it.
This is what the date column looks like:
DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
...
29/09/18
07/10/18
28/10/18
03/11/18
I want to group by seasons so that I can perform visualization operations over the aggregated data.
UPDATE: For the time being, my solution is to split the dataframe into groups of 32, since I know each season has 32 games. This is the code I've used:
split_df = np.array_split(df, np.arange(0, len(df), 32))
But I'd prefer something more elegant and better suited to time-series data, so I'll keep the question open.
The key to success is proper grouping; in your case, pd.Grouper(key='DATA', freq='AS-AUG').
Note that freq='AS-AUG' states that your groups should start from the start of August each year.
Look at the following script:
import pandas as pd
import numpy as np

# Source columns
dates = ['01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09',
         '25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
         '25/12/10', '20/04/11', '31/07/11']
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))

# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})

# Convert string dates to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')

# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))
If you print the results:
for name, group in gr:
    print()
    print(name)
    print(group)
you will get:
2008-08-01 00:00:00
DATA SCORE_X SCORE_Y
0 2009-04-01 16 11
1 2009-07-31 10 7
2009-08-01 00:00:00
DATA SCORE_X SCORE_Y
2 2009-08-01 19 6
3 2009-09-26 14 5
4 2009-10-04 8 11
5 2009-12-17 12 19
6 2010-01-25 0 0
7 2010-04-20 17 6
8 2010-07-31 18 2
2010-08-01 00:00:00
DATA SCORE_X SCORE_Y
9 2010-08-01 15 18
10 2010-10-28 2 4
11 2010-11-03 8 16
12 2010-12-25 13 1
13 2011-04-20 19 7
14 2011-07-31 8 3
As you can see, each group starts on the 1st of August and ends on the 31st of July.
Then you can do with your groups whatever you want.
Use -
df.groupby(df['DATE'].dt.year).count()
Output
DATE
DATE
2009 5
2018 4
Custom Season Grouping
min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()
Output
DATE
DATE
(2009-07-31, 2010-07-31] 3
(2010-07-31, 2011-07-31] 0
(2011-07-31, 2012-07-31] 0
(2012-07-31, 2013-07-31] 0
(2013-07-31, 2014-07-31] 0
(2014-07-31, 2015-07-31] 0
(2015-07-31, 2016-07-31] 0
(2016-07-31, 2017-07-31] 0
(2017-07-31, 2018-07-31] 1
Resampling using 'A-JUL' as an anchored offset alias should do the trick:
>>> df
SAMPLE
DATE
2009-01-30 1
2009-07-10 4
2009-11-20 3
2010-01-01 5
2010-05-13 1
2010-08-01 1
>>> df.resample('A-JUL').sum()
SAMPLE
DATE
2009-07-31 5
2010-07-31 9
2011-07-31 1
A indicates it is a yearly interval, -JUL indicates it ends in July.
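If you also want labels in the season(2016-2017) style the question asks for, here is a hedged follow-up sketch; the sample frame mirrors the toy data shown above and is not from the original answer:

import pandas as pd

# same toy data as above: one numeric column indexed by game date
df = pd.DataFrame({'SAMPLE': [1, 4, 3, 5, 1, 1]},
                  index=pd.to_datetime(['2009-01-30', '2009-07-10', '2009-11-20',
                                        '2010-01-01', '2010-05-13', '2010-08-01']))
df.index.name = 'DATE'

totals = df.resample('A-JUL').sum()

# each bucket ends on 31 July, so a season label is "(end_year - 1)-(end_year)"
totals.index = [f'season({y - 1}-{y})' for y in totals.index.year]
print(totals)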
You could build a season column and group by that. In the code below, I used pandas.DateOffset() to move all dates back 7 months, so a game that happened in August looks like it happened in January; this aligns the season's starting year with the calendar year. Building the season string is fairly straightforward after that.
import pandas as pd
from datetime import date

dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])

# copy the date column to a separate dataframe to do the work
df_tmp = df[['date']].copy()
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)

# copy the season column to the main dataframe
df['season'] = df_tmp['season']
df.groupby('season').count()

Pandas: Group by bi-monthly date field

I am trying to group hospital staff working hours bi-monthly. I have raw data on a daily basis, which looks like this:
date         hours_spent  emp_id
9/11/2016              8       1
15/11/2016             8       1
22/11/2016             8       2
23/11/2016             8       1
This is how I want it grouped:
cycle                  hours_spent  emp_id
1/11/2016-15/11/2016            16       1
16/11/2016-30/11/2016            8       2
16/11/2016-30/11/2016            8       1
I am trying to do the same with a grouper and frequency in pandas, something like this:
data.set_index('date', inplace=True)
print(data.head())
dt = data.groupby(['emp_id', pd.Grouper(key='date', freq='MS')])['hours_spent'].sum().reset_index().sort_values('date')
# df.resample('10d').mean().interpolate(method='linear', axis=0)
print(dt.resample('SMS').sum())
I also tried resampling
df1 = dt.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
data.set_index('date',inplace=True)
df1 = data.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
But this gives 15-day intervals, not buckets running from the 1st to the 15th and from the 16th to the end of the month.
Please let me know what I am doing wrong here.
You were almost there. This will do it -
dt = df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent'].sum().reset_index().sort_values('date')
emp_id date hours_spent
1 2016-10-31 8
1 2016-11-15 16
2 2016-11-15 8
freq='SM' is the semi-month frequency, which anchors on the 15th and the last day of every month.
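As a self-contained sketch of that one-liner (the sample rows are taken from the question; parsing the day-first strings with dayfirst=True is my assumption about the date format):

import pandas as pd

# sample data from the question, with day-first date strings
df = pd.DataFrame({
    'date': pd.to_datetime(['9/11/2016', '15/11/2016', '22/11/2016', '23/11/2016'],
                           dayfirst=True),
    'hours_spent': [8, 8, 8, 8],
    'emp_id': [1, 1, 2, 1],
})

# freq='SM' anchors the buckets on the 15th and the last day of each month
dt = (df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent']
        .sum()
        .reset_index()
        .sort_values('date'))
print(dt)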
Put DateTime-Values into Bins
If I understood you right, you basically want to put the values in the date column into bins. For this, pandas includes the pd.cut() function, which does exactly what you want.
Here's an approach which might help you:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'hours': 8,
    'emp_id': [1, 1, 2, 1],
    'date': [datetime(2016, 11, 9),
             datetime(2016, 11, 15),
             datetime(2016, 11, 22),
             datetime(2016, 11, 23)]
})

bins_dt = pd.date_range('2016-10-16', freq='SM', periods=3)
cycle = pd.cut(df.date, bins_dt)

df.groupby([cycle, 'emp_id']).sum()
Which gets you:
cycle emp_id hours
------------------------ ------ ------
(2016-10-31, 2016-11-15] 1 16
2 NaN
(2016-11-15, 2016-11-30] 1 8
2 8
Had a similar question, here was my solution:
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
The expression df1['Date'] + pd.DateOffset(days=-1) takes whatever is in the Date column and subtracts one day.
Adding pd.offsets.SemiMonthEnd() then rolls the date forward into a bi-monthly basket; without reducing the reference date by one day first, dates on the 15th would end up one basket too late.
Finally, df1['BiMonth'].dt.to_period('D') strips out the time so you just have days.

Groupby and plot bar graph

I want to plot a bar graph of sales over the years, with 'year' on the x-axis and the sum of weekly sales per year on the y-axis. While plotting I am getting KeyError: 'year'. I guess it's because 'year' became the index during the group by.
Below is the sample content from csv file:
Store year Weekly_Sales
1 2014 24924.5
1 2010 46039.49
1 2015 41595.55
1 2010 19403.54
1 2015 21827.9
1 2010 21043.39
1 2014 22136.64
1 2010 26229.21
1 2014 57258.43
1 2010 42960.91
Below is the code I used to group by:
import numpy as np
import pandas as pd

storeDetail_df = pd.read_csv('Details.csv')
result_group_year = storeDetail_df.groupby(['year'])
total_by_year = result_group_year['Weekly_Sales'].agg([np.sum])
total_by_year.plot(kind='bar', x='year', y='sum', rot=0)
I updated the code and below is the output.
DataFrame output:
year sum
0 2010 42843534.38
1 2011 45349314.40
2 2012 35445927.76
3 2013 0.00
Below is the graph I am getting:
While reading your csv file, you need to use whitespace as the delimiter by passing delim_whitespace=True, and then reset the index after summing up the Weekly_Sales. Below is the working code:
import numpy as np
import pandas as pd

storeDetail_df = pd.read_csv('Details.csv', delim_whitespace=True)
result_group_year = storeDetail_df.groupby(['year'])
total_by_year = result_group_year['Weekly_Sales'].agg([np.sum]).reset_index()
total_by_year.plot(kind='bar', x='year', y='sum', rot=0, legend=False)
Output
In case the groupby made year your index, you need to remove it from the index before plotting.
Try
total_by_year = total_by_year.reset_index(drop=False)
You might want to try this
storeDetail_df = pd.read_csv('Details.csv')
result_group_year= storeDetail_df.groupby(['year'])['Weekly_Sales'].sum()
result_group_year = result_group_year.reset_index(drop=False)
result_group_year.plot.bar(x='year', y='Weekly_Sales')
