Problem with tuple indices in loop in Python Pandas?


I am trying to calculate the number of days until the next holiday and the number of days since the last holiday. My calculation is shown below:
holidays = pd.Series(pd.to_datetime(["01.01.2013", "06.01.2013", "14.02.2013","29.03.2013",
"31.03.2013", "01.04.2013", "01.05.2013", "03.05.2013",
"19.05.2013", "26.05.2013", "30.05.2013", "23.06.2013",
"15.07.2013", "27.10.2013", "01.11.2013", "11.11.2013",
"24.12.2013", "25.12.2013", "26.12.2013", "31.12.2013",
"01.01.2014", "06.01.2014", "14.02.2014", "30.03.2014",
"18.04.2014", "20.04.2014", "21.04.2014", "01.05.2014",
"03.05.2014", "03.05.2014", "26.05.2014", "08.06.2014",
"19.06.2014", "23.06.2014", "15.08.2014", "26.10.2014",
"01.11.2014", "11.11.2014", "24.12.2014", "25.12.2014",
"26.12.2014", "31.12.2014",
"01.01.2015", "06.01.2015", "14.02.2015", "29.03.2015",
"03.04.2015", "05.04.2015", "06.04.2015", "01.05.2015",
"03.05.2015", "24.05.2015", "26.05.2015", "04.06.2015",
"23.06.2015", "15.08.2015", "25.10.2015", "01.11.2015",
"11.11.2015", "24.12.2015", "25.12.2015", "26.12.2015",
"31.12.2015"], dayfirst=True))
#Number of days until next holiday
d_until_next_holiday = []
#Number of days since last holiday
d_since_last_holiday = []
for row in data.itertuples():
    next_special_date = holidays[holidays >= row["Date"]].iloc[0]
    d_until_next_holiday.append((next_special_date - row["Date"])/pd.Timedelta('1D'))
    previous_special_date = holidays[holidays <= row.index].iloc[-1]
    d_since_last_holiday.append((row["Date"] - previous_special_date)/pd.Timedelta('1D'))
#Add new cols to DF
sto2STG14["d_until_next_holiday"] = d_until_next_holiday
sto2STG14["d_since_last_holiday"] = d_since_last_holiday
However, I get the following error:
TypeError: tuple indices must be integers or slices, not str
Why do I get this error? I know that row is a tuple, but I use .iloc[0] and .iloc[-1] in my code. What can I do?

With pandas, you rarely need to loop. In this case, the .shift method allows you to compute everything in one go:
import pandas
holidays = pandas.Series(pandas.to_datetime([
"01.01.2013", "06.01.2013", "14.02.2013","29.03.2013",
"31.03.2013", "01.04.2013", "01.05.2013", "03.05.2013",
"19.05.2013", "26.05.2013", "30.05.2013", "23.06.2013",
"15.07.2013", "27.10.2013", "01.11.2013", "11.11.2013",
"24.12.2013", "25.12.2013", "26.12.2013", "31.12.2013",
"01.01.2014", "06.01.2014", "14.02.2014", "30.03.2014",
"18.04.2014", "20.04.2014", "21.04.2014", "01.05.2014",
"03.05.2014", "03.05.2014", "26.05.2014", "08.06.2014",
"19.06.2014", "23.06.2014", "15.08.2014", "26.10.2014",
"01.11.2014", "11.11.2014", "24.12.2014", "25.12.2014",
"26.12.2014", "31.12.2014",
"01.01.2015", "06.01.2015", "14.02.2015", "29.03.2015",
"03.04.2015", "05.04.2015", "06.04.2015", "01.05.2015",
"03.05.2015", "24.05.2015", "26.05.2015", "04.06.2015",
"23.06.2015", "15.08.2015", "25.10.2015", "01.11.2015",
"11.11.2015", "24.12.2015", "25.12.2015", "26.12.2015",
"31.12.2015"
], dayfirst=True)
)
results = (
    holidays
    .sort_values()
    .to_frame('holiday')
    .assign(
        days_since_prev=lambda df: df['holiday'] - df['holiday'].shift(1),
        days_until_next=lambda df: df['holiday'].shift(-1) - df['holiday'],
    )
)
results.head(10)
And I get:
     holiday days_since_prev days_until_next
0 2013-01-01             NaT          5 days
1 2013-01-06          5 days         39 days
2 2013-02-14         39 days         43 days
3 2013-03-29         43 days          2 days
4 2013-03-31          2 days          1 days
5 2013-04-01          1 days         30 days
6 2013-05-01         30 days          2 days
7 2013-05-03          2 days         16 days
8 2013-05-19         16 days          7 days
9 2013-05-26          7 days          4 days
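On the TypeError itself: itertuples() yields namedtuples, so row["Date"] fails no matter what follows it. The .iloc calls apply to the holidays Series, not to row. Below is a minimal sketch of the loop using attribute access (row.Date), followed by a vectorized variant based on numpy.searchsorted. The three-row data frame and the shortened holiday list are made up for illustration, and the sketch assumes every date lies between the first and the last holiday.

```python
import numpy as np
import pandas as pd

# Shortened holiday list and hypothetical sample dates (the question's
# `data` frame is not shown, so these values are assumptions).
holidays = pd.Series(pd.to_datetime(
    ["01.01.2013", "06.01.2013", "14.02.2013", "29.03.2013"], format="%d.%m.%Y"
)).sort_values().reset_index(drop=True)
data = pd.DataFrame({"Date": pd.to_datetime(["2013-01-03", "2013-01-06", "2013-02-01"])})

# Loop version: namedtuple fields are read as attributes, not with ["..."].
d_until_next_holiday = []
d_since_last_holiday = []
for row in data.itertuples():
    next_special_date = holidays[holidays >= row.Date].iloc[0]
    previous_special_date = holidays[holidays <= row.Date].iloc[-1]
    d_until_next_holiday.append((next_special_date - row.Date) / pd.Timedelta("1D"))
    d_since_last_holiday.append((row.Date - previous_special_date) / pd.Timedelta("1D"))

# Vectorized version: binary search in the sorted holiday array.
h = holidays.to_numpy()
dates = data["Date"].to_numpy()
next_idx = np.searchsorted(h, dates, side="left")       # first holiday >= date
prev_idx = np.searchsorted(h, dates, side="right") - 1  # last holiday <= date
until_vec = (h[next_idx] - dates) / np.timedelta64(1, "D")
since_vec = (dates - h[prev_idx]) / np.timedelta64(1, "D")

data["d_until_next_holiday"] = until_vec
data["d_since_last_holiday"] = since_vec
```

A date that is itself a holiday gets 0 in both columns. Dates before the first or after the last holiday would need extra handling, because the index lookups would go out of range.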

Related

Add column to dataframe based on date range

I want to add a column to my data frame prod_data based on a range of dates. This is an example of the data in the column ['Mount Time'] that I want to derive the new column from:
0 2022-08-17 06:07:00
1 2022-08-17 06:12:00
2 2022-08-17 06:40:00
3 2022-08-17 06:45:00
4 2022-08-17 06:47:00
The new column is named ['Week'] and I want it to run from M-S, with week 1 starting on 9/5/22, running through 9/11/22 and then week 2 the next M-S, and so on until the last week which would be 53. I would also like weeks previous to 9/5 to have negative week numbers, so 8/29/22 would be the start of week -1 and so on.
The only thing I could think of was to create 2 massive lists and use np.select to define the parameters of the column, but there has to be a cleaner way of doing this, right?

You can use pandas datetime objects to figure out how many days away a date is from your start date, 9/5/2022, and then use floor division to convert that to week numbers. I made the "mount_time" column just to emphasize that the original column should be a datetime object.
prod_data["mount_time"] = pd.to_datetime( prod_data[ "Mount Time" ] )
start_date = pd.to_datetime( "9/5/2022" )
days_away = prod_data.mount_time - start_date
prod_data["Week"] = ( days_away.dt.days // 7 ) + 1
As intended, 9/5/2022 through 9/11/2022 will have a value of 1. 8/29/2022 would start week 0 (not -1 as you wrote) unless you want 9/5/2022 to start as week 0 (in which case just delete the + 1 from the code). Some more examples:
>>> test[ ["date", "Week" ] ]
        date  Week
0 2022-08-05    -4
1 2022-08-14    -3
2 2022-08-28    -1
3 2022-08-29     0
4 2022-08-30     0
5 2022-09-05     1
6 2022-09-11     1
7 2022-09-12     2
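For reference, here is a self-contained sketch that reproduces the table above. The test frame is rebuilt from the dates shown, and the extra Week_no_zero column is a hypothetical variant (not part of the answer's code) that skips week 0 so that 8/29/2022 starts week -1, as the question asked.

```python
import pandas as pd

# Dates taken from the example table above.
test = pd.DataFrame({"date": pd.to_datetime([
    "2022-08-05", "2022-08-14", "2022-08-28", "2022-08-29",
    "2022-08-30", "2022-09-05", "2022-09-11", "2022-09-12",
])})

start_date = pd.Timestamp("2022-09-05")  # the Monday that starts week 1
days_away = (test["date"] - start_date).dt.days

# Floor division rounds toward negative infinity, so earlier dates
# fall into weeks 0, -1, -2, ...
test["Week"] = days_away // 7 + 1

# Hypothetical variant: shift weeks <= 0 down by one so there is no week 0.
test["Week_no_zero"] = test["Week"].where(test["Week"] >= 1, test["Week"] - 1)
```

The floor division matters here: truncating toward zero instead would put 8/30/2022 and 9/5/2022 in the same week.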

How to count business days per month for the whole year with different weekmask every week?

I'm trying to write a program that distributes employees' days off equally. There are 4 groups, and each group has its own weekmask for each week of the month. So far I've written code that changes the weekmask when it finds a 0 (Sunday) in the DataFrame. I'm stuck on structuring the call np.busday_count(start, end, weekmask=) so that the start and end dates change automatically.
My Dataframe looks like this:
And here's my code:
a: int = 0
week_mask: str = '1100111'
def _change_week_mask():
    global a, week_mask
    a += 1
    if a == 1:
        week_mask = '1111000'
    elif a == 2:
        week_mask = '1111111'
    elif a == 3:
        week_mask = '0011111'
    else:
        a = 0
for line in rows['Workday']:
    # compare by value; `is` tests object identity and is unreliable for strings
    if line == '0':
        _change_week_mask()

Edit: changed the value of start week from 6 to 0.
To answer your question, I created a sample data frame with the code below.
Then I added the following columns to the data frame:
dayofweek - reproduces data similar to yours, where every Sunday was set to zero. Here Monday is zero and Sunday is six.
weeknum - week of the year.
week - instead of counting and then changing the week mask, I assigned week a value from 0 to 3, and the mask is derived from it.
weekmask - calculated from the value of week; you may need to align this with your own logic.
weekenddate - the end date, calculated by adding 7 days to the start date; if the month changes mid-week, it holds the month-end date instead.
After this, we can create a new data frame that keeps one entry per week; Monday is 0 here, so I filter on 0 (and also keep the first day of each month).
Then you can apply the function and store the result in the data frame.
import datetime
import pandas as pd
import numpy as np
df_ = pd.DataFrame({'startdate':pd.date_range(pd.to_datetime('2018-10-01'), pd.to_datetime('2018-11-30'))})
df_['dayofweek'] = df_.startdate.dt.dayofweek
df_['remaining_days_in_month'] = df_.startdate.dt.days_in_month - df_.startdate.dt.day
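# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df_['week'] = df_.startdate.dt.isocalendar().week.astype(int) % 4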
df_['day'] = df_.startdate.dt.day
df_['weekmask'] = df_.week.map({0 : '1100111', 1 : '1111000' , 2 : '1111111', 3: '0011111'})
df_['weekenddate'] = [x[0] + datetime.timedelta(days=(7-x[1])) if x[2] > 7-x[1] else x[0] + datetime.timedelta(days=(x[2])) for x in df_[['startdate','dayofweek','remaining_days_in_month']].values]
final_df = df_[(df_['dayofweek']==0) | ( df_['day']==1)][['startdate','weekenddate','weekmask']]
final_df['numberofdays'] = [ np.busday_count((x[0]).astype('<M8[D]'), x[1].astype('<M8[D]'), weekmask=x[2]) for x in final_df.values.astype(str)]
Output:
     startdate  weekenddate  weekmask  numberofdays
0   2018-10-01   2018-10-08   1100111             5
7   2018-10-08   2018-10-15   1111000             4
14  2018-10-15   2018-10-22   1111111             7
21  2018-10-22   2018-10-29   0011111             5
28  2018-10-29   2018-10-31   1100111             2
31  2018-11-01   2018-11-05   1100111             3
35  2018-11-05   2018-11-12   1111000             4
42  2018-11-12   2018-11-19   1111111             7
49  2018-11-19   2018-11-26   0011111             5
56  2018-11-26   2018-11-30   1100111             2
Let me know if this needs changes to fit your requirements.
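To see what np.busday_count does with a weekmask in isolation, here is a small sketch that counts the working days of four consecutive full weeks, rotating through the four masks from the question. The start date is the first Monday from the output above, and the assumption that each mask applies to exactly one Monday-to-Sunday week is mine.

```python
import numpy as np

# The four masks from the question; each string is Mon..Sun, 1 = working day.
week_masks = ["1100111", "1111000", "1111111", "0011111"]

start = np.datetime64("2018-10-01")  # a Monday
counts = []
for i, mask in enumerate(week_masks):
    begin = start + np.timedelta64(7 * i, "D")
    end = begin + np.timedelta64(7, "D")  # the end date is exclusive
    counts.append(int(np.busday_count(begin, end, weekmask=mask)))

print(counts)  # [5, 4, 7, 5]
```

Because the end date is exclusive, passing the following Monday as end covers exactly Monday through Sunday, which matches the first four rows of the output above.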

grouping time-series data based on starting and ending date

I have time-series data of a yearly sports tournament, with the date when each game was played. I want to group the games by the season (year) they were played in. Each season starts in August and ends the NEXT year in July.
How would I go about grouping the games by season, like -
season(2016-2017), season(2017-2018), etc..
This Answer involving df.resample() may be related, but I'm not sure how I'd go about doing it.
This is what the date column looks like:
DATE
26/09/09
04/10/09
17/10/09
25/10/09
31/10/09
...
29/09/18
07/10/18
28/10/18
03/11/18
I want to group by seasons so that I can perform visualization operations over the aggregated data.
UPDATE: For the time being my solution is to split up the dataframe into groups of 32 as I know each season has 32 games. This is the code I've used:
split_df = np.array_split(df, np.arange(0, len(df),32))
But I'd rather prefer something more elegant and more inclusive of time-series data so I'll keep the question open.

The key to success is proper grouping, in your case pd.Grouper(key='DATA', freq='AS-AUG').
Note that freq='AS-AUG' states that your groups should start from the start of
August each year.
Look at the following script:
import numpy as np
import pandas as pd
# Source columns
dates = [ '01/04/09', '31/07/09', '01/08/09', '26/09/09', '04/10/09', '17/12/09',
'25/01/10', '20/04/10', '31/07/10', '01/08/10', '28/10/10', '03/11/10',
'25/12/10', '20/04/11', '31/07/11' ]
scores_x = np.random.randint(0, 20, len(dates))
scores_y = np.random.randint(0, 20, len(dates))
# Source DataFrame
df = pd.DataFrame({'DATA': dates, 'SCORE_X': scores_x, 'SCORE_Y': scores_y})
# Convert string date to datetime
df.DATA = pd.to_datetime(df.DATA, format='%d/%m/%y')
# Grouping
gr = df.groupby(pd.Grouper(key='DATA', freq='AS-AUG'))
If you print the results:
for name, group in gr:
    print()
    print(name)
    print(group)
you will get:
2008-08-01 00:00:00
        DATA  SCORE_X  SCORE_Y
0 2009-04-01       16       11
1 2009-07-31       10        7
2009-08-01 00:00:00
        DATA  SCORE_X  SCORE_Y
2 2009-08-01       19        6
3 2009-09-26       14        5
4 2009-10-04        8       11
5 2009-12-17       12       19
6 2010-01-25        0        0
7 2010-04-20       17        6
8 2010-07-31       18        2
2010-08-01 00:00:00
         DATA  SCORE_X  SCORE_Y
9  2010-08-01       15       18
10 2010-10-28        2        4
11 2010-11-03        8       16
12 2010-12-25       13        1
13 2011-04-20       19        7
14 2011-07-31        8        3
As you can see, each group starts on the 1st of August and ends on the 31st of July.
Then you can do whatever you want with your groups.

Use -
df.groupby(df['DATE'].dt.year).count()
Output
DATE
DATE
2009 5
2018 4
Custom Season Grouping
min_year = df['DATE'].dt.year.min()
max_year = df['DATE'].dt.year.max()
rng = pd.date_range(start='{}-07'.format(min_year), end='{}-08'.format(max_year), freq='12M').to_series()
df.groupby(pd.cut(df['DATE'], rng)).count()
Output
DATE
DATE
(2009-07-31, 2010-07-31] 3
(2010-07-31, 2011-07-31] 0
(2011-07-31, 2012-07-31] 0
(2012-07-31, 2013-07-31] 0
(2013-07-31, 2014-07-31] 0
(2014-07-31, 2015-07-31] 0
(2015-07-31, 2016-07-31] 0
(2016-07-31, 2017-07-31] 0
(2017-07-31, 2018-07-31] 1

Resampling using 'A-JUL' as an anchored offset alias should do the trick:
>>> df
SAMPLE
DATE
2009-01-30 1
2009-07-10 4
2009-11-20 3
2010-01-01 5
2010-05-13 1
2010-08-01 1
>>> df.resample('A-JUL').sum()
SAMPLE
DATE
2009-07-31 5
2010-07-31 9
2011-07-31 1
A indicates it is a yearly interval, -JUL indicates it ends in July.

You could build a season column and group by that. In the code below, I use pandas.DateOffset() to move all dates 7 months back, so a game played in August looks as if it was played in January. This aligns the season year with the calendar year. Building the season string is fairly straightforward after that.
import pandas as pd
from datetime import date
dates = pd.date_range(date(2009, 8, 1), date(2018, 7, 30), freq='17d')
df = pd.DataFrame(dates, columns=['date'])
# copy the date column to a separate dataframe to do the work
df_tmp = df[['date']].copy()  # explicit copy avoids SettingWithCopyWarning on the assignments below
df_tmp['season_start_year'] = (df_tmp['date'] - pd.DateOffset(months=7)).dt.year
df_tmp['season_end_year'] = df_tmp['season_start_year'] + 1
df_tmp['season'] = df_tmp['season_start_year'].map(str) + '-' + df_tmp['season_end_year'].map(str)
# copy season column to the main dataframe
df['season'] = df_tmp['season']
df.groupby('season').count()
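As an alternative sketch, the season start year can also be computed directly from the month, without shifting the dates: games from August onward belong to the season starting that year, and games from January to July belong to the season that started the year before. The sample below uses a few dates from the question plus two invented boundary dates (31/07/10 and 01/08/10) to show where the cut falls.

```python
import pandas as pd

# A few dates in the question's DD/MM/YY format; the two boundary
# dates around 1 August 2010 are added for illustration.
df = pd.DataFrame({"DATE": pd.to_datetime(
    ["26/09/09", "04/10/09", "31/07/10", "01/08/10", "03/11/18"],
    format="%d/%m/%y",
)})

# January..July games belong to the season that started the previous year.
start_year = df["DATE"].dt.year - (df["DATE"].dt.month < 8).astype(int)
df["season"] = start_year.astype(str) + "-" + (start_year + 1).astype(str)

counts = df.groupby("season").size()
print(counts)
```

Grouping by the season string keeps the labels readable in plots, and seasons without any games simply do not appear.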

Time arithmetic on pandas series

I have a pandas DataFrame with a column "StartTime" that could be any datetime value. I would like to create a second column that gives the StartTime relative to the beginning of the week (i.e., 12am on the previous Sunday). For example, this post is 5 days, 14 hours since the beginning of this week.
StartTime
1 2007-01-19 15:59:24
2 2007-03-01 04:16:08
3 2006-11-08 20:47:14
4 2008-09-06 23:57:35
5 2007-02-17 18:57:32
6 2006-12-09 12:30:49
7 2006-11-11 11:21:34
I can do this, but it's pretty dang slow:
def time_since_week_beg(x):
    y = x.to_datetime()
    return pd.Timedelta(days=y.weekday(),
                        hours=y.hour,
                        minutes=y.minute,
                        seconds=y.second
                        )
df['dt'] = df.StartTime.apply(time_since_week_beg)
What I want is something like this, that doesn't result in an error:
df['dt'] = pd.Timedelta(days=df.StartTime.dt.dayofweek,
                        hours=df.StartTime.dt.hour,
                        minute=df.StartTime.dt.minute,
                        second=df.StartTime.dt.second
                        )
TypeError: Invalid type <class 'pandas.core.series.Series'>. Must be int or float.
Any thoughts?

You can use a list comprehension:
df['dt'] = [pd.Timedelta(days=ts.dayofweek,
                         hours=ts.hour,
                         minutes=ts.minute,
                         seconds=ts.second)
            for ts in df.StartTime]
>>> df
            StartTime              dt
0 2007-01-19 15:59:24 4 days 15:59:24
1 2007-03-01 04:16:08 3 days 04:16:08
2 2006-11-08 20:47:14 2 days 20:47:14
3 2008-09-06 23:57:35 5 days 23:57:35
4 2007-02-17 18:57:32 5 days 18:57:32
5 2006-12-09 12:30:49 5 days 12:30:49
6 2006-11-11 11:21:34 5 days 11:21:34
Depending on the format of StartTime, you may need:
...for ts in pd.to_datetime(df.StartTime)

Average hourly week profile for a year excluding weekend days and holidays

With Pandas I have created a DataFrame from an imported .csv file (this file is generated through simulation). The DataFrame consists of half-hourly energy consumption data for a single year. I have already created a DateTimeindex for the dates.
I would like to be able to reformat this data into average hourly week and weekend profile results. With the week profile excluding holidays.
DataFrame:
Date_Time Equipment:Electricity:LGF Equipment:Electricity:GF
01/01/2000 00:30 0.583979872 0.490327348
01/01/2000 01:00 0.583979872 0.490327348
01/01/2000 01:30 0.583979872 0.490327348
01/01/2000 02:00 0.583979872 0.490327348
I found an example (Getting the average of a certain hour on weekdays over several years in a pandas dataframe) that explains doing this for several years, but not explicitly for a week (without holidays) and weekend.
I realised that there are no resampling techniques in Pandas that do this directly. I used several aliases (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases) for creating Monthly and Daily profiles.
I was thinking of using the business day frequency and create a new dateindex with working days and compare that to my DataFrame datetimeindex for every half hour. Then return values for working days and weekend days when true or false respectively to create a new dataset, but am not sure how to do this.
PS; I am just getting into Python and Pandas.

Dummy data (for future reference, more likely to get an answer if you post some in a copy-paste-able form)
df = pd.DataFrame(data={'a':np.random.randn(1000)},
                  index=pd.date_range(start='2000-01-01', periods=1000, freq='30T'))
Here's an approach. First define a US (or modify as appropriate) business day offset with holidays, and generate a range covering your dates.
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
bday_over_df = pd.date_range(start=df.index.min().date(),
end=df.index.max().date(), freq=bday_us)
Then, develop your two grouping columns. An hour column is easy.
df['hour'] = df.index.hour
For weekday/weekend/holiday, define a function to group the data.
def group_day(date):
    if date.weekday() in [5,6]:
        return 'weekend'
    elif date.date() in bday_over_df:
        return 'weekday'
    else:
        return 'holiday'
df['day_group'] = df.index.map(group_day)
Then, just group by the two columns as you wish.
In [140]: df.groupby(['day_group', 'hour']).sum()
Out[140]:
a
day_group hour
holiday 0 1.890621
1 -0.029606
2 0.255001
3 2.837000
4 -1.787479
5 0.644113
6 0.407966
7 -1.798526
8 -0.620614
9 -0.567195
10 -0.822207
11 -2.675911
12 0.940091
13 -1.601885
14 1.575595
15 1.500558
16 -2.512962
17 -1.677603
18 0.072809
19 -1.406939
20 2.474293
21 -1.142061
22 -0.059231
23 -0.040455
weekday 0 9.192131
1 2.759302
2 8.379552
3 -1.189508
4 3.796635
5 3.471802
... ...
18 -5.217554
19 3.294072
20 -7.461023
21 8.793223
22 4.096128
23 -0.198943
weekend 0 -2.774550
1 0.461285
2 1.522363
3 4.312562
4 0.793290
5 2.078327
6 -4.523184
7 -0.051341
8 0.887956
9 2.112092
10 -2.727364
11 2.006966
12 7.401570
13 -1.958666
14 1.139436
15 -1.418326
16 -2.353082
17 -1.381131
18 -0.568536
19 -5.198472
20 -3.405137
21 -0.596813
22 1.747980
23 -6.341053
[72 rows x 1 columns]
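Since the question asks for average profiles while the output above shows sums, here is a hedged sketch of the same idea using mean(). It classifies the days with vectorized index operations instead of a per-row function and uses the '30min' frequency alias. The names business_days and profile are mine, and the US federal calendar is only a stand-in for whichever holiday calendar applies to your data.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

# Dummy half-hourly data, as in the answer above.
rng = np.random.default_rng(0)
idx = pd.date_range(start="2000-01-01", periods=1000, freq="30min")
df = pd.DataFrame({"a": rng.standard_normal(len(idx))}, index=idx)

# Business days (Mon-Fri minus US federal holidays) covering the data.
bday_us = CustomBusinessDay(calendar=USFederalHolidayCalendar())
business_days = pd.date_range(df.index.min().normalize(),
                              df.index.max().normalize(), freq=bday_us)

# Classify every timestamp by its calendar day.
is_weekend = df.index.dayofweek >= 5
is_business = df.index.normalize().isin(business_days)
df["day_group"] = np.select([is_weekend, is_business],
                            ["weekend", "weekday"], default="holiday")
df["hour"] = df.index.hour

# Average hourly profile: one row per hour, one column per day group.
profile = df.groupby(["day_group", "hour"])["a"].mean().unstack("day_group")
```

With the dummy range, 17 January 2000 (Martin Luther King Jr. Day) is the only day that ends up in the holiday group. Both half-hour readings within an hour are averaged together, so each profile has 24 values.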
