Change the year based on month in a dataframe in Python - python

I have a dataset called weather and it contains one column 'Date' that looks like this.
Date
2020-01-01
2020-01-02
2020-02-01
2020-02-04
2020-03-01
2020-04-01
2020-04-02
2020-04-03
2020-04-04
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-01-01
2020-02-01
2020-04-01
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2020-12-01
2020-01-01
The problem is that the year is always 2020, when it should be 2020, 2021, and 2022.
The desired column looks like this
Date
2020-01-01
2020-01-02
2020-02-01
2020-02-04
2020-03-01
2020-04-01
2020-04-02
2020-04-03
2020-04-04
2020-05-01
2020-06-01
2020-07-01
2020-08-01
2020-09-01
2020-10-01
2020-11-01
2021-01-01
2021-02-01
2021-04-01
2021-05-01
2021-06-01
2021-07-01
2021-08-01
2021-09-01
2021-10-01
2021-11-01
2021-12-01
2022-01-01
Each year's last month is not necessarily 12, but the new year starts with month 01.
Here is my code:
month = ['01','02','03','04','05','06','07','08','09','10','11','12']
for i in range(len(weather['Date'])):
    year = 2022
    for j in range(len(month)):
        if weather['Date'][i][5:7] == '01':
            weather['Date'][i] = weather['Date'][i].apply(lambda x: 'year' + x[5:])
Is there any suggestion for fixing my code and getting the desired column?

Here's one approach:
Turn the date strings in column Date into datetime using pd.to_datetime, apply Series.diff, and chain Series.dt.days.
Since each negative value (i.e. day difference) in our Series represents the start of a new year, apply Series.lt(0) to turn all values below 0 into True and the rest into False.
At this stage, chain Series.cumsum to end up with a Series containing 0, ..., 1, ..., 2. These are the values that need to be added to the year 2020 to get the correct years.
Finally, create the correct dates by passing year (= original year + addition), month, and day back to pd.to_datetime (cf. this SO answer).
df['Date'] = pd.to_datetime(df['Date'])
df['Date'] = pd.to_datetime(dict(year=(df['Date'].dt.year
                                       + df['Date'].diff().dt.days.lt(0).cumsum()),
                                 month=df['Date'].dt.month,
                                 day=df['Date'].dt.day))
df['Date']
0 2020-01-01
1 2020-01-02
2 2020-02-01
3 2020-02-04
4 2020-03-01
5 2020-04-01
6 2020-04-02
7 2020-04-03
8 2020-04-04
9 2020-05-01
10 2020-06-01
11 2020-07-01
12 2020-08-01
13 2020-09-01
14 2020-10-01
15 2020-11-01
16 2021-01-01
17 2021-02-01
18 2021-04-01
19 2021-05-01
20 2021-06-01
21 2021-07-01
22 2021-08-01
23 2021-09-01
24 2021-10-01
25 2021-11-01
26 2021-12-01
27 2022-01-01
Name: Date, dtype: datetime64[ns]
You don't have to convert to datetime, of course. You can also recreate the date strings directly, starting from the following line:
df['Date'].str[5:7].astype(int).diff().lt(0).cumsum()
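For example, a minimal sketch of that string-only route (assuming the dates keep the YYYY-MM-DD format shown above; add is a hypothetical helper name):
add = df['Date'].str[5:7].astype(int).diff().lt(0).cumsum()
# bump the 4-character year prefix by the cumulative count of month resets
df['Date'] = (df['Date'].str[:4].astype(int) + add).astype(str) + df['Date'].str[4:]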

Similar to @ouroboros1, but using numpy to get the number of years to add to each date, and then pd.offsets.DateOffset(years=...) for the addition.
import numpy as np
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
s = df['Date'].values
y = np.r_[0, (s[:-1] > s[1:]).cumsum()]
At this point, it would be tempting to do:
df['Date'] += y * pd.offsets.DateOffset(years=1)
But then we would get a warning: PerformanceWarning: Adding/subtracting object-dtype array to DatetimeArray not vectorized.
So instead, we group by number of years to add, and add the relevant offset to all the dates in the group.
def add_years(g):
    return g['Date'] + pd.offsets.DateOffset(years=g['y'].iloc[0])

df['Date'] = df.assign(y=y).groupby('y', sort=False, group_keys=False).apply(add_years)
This is reasonably fast (4.25 ms for 1000 rows and 10 distinct y values), and, for situations other than yours, is a bit more general than @ouroboros1's answer:
It handles date changes due to leap years. That isn't an issue in your example, where all dates fall on the first of a month, but if one of the dates were '2020-02-29' and we tried to add 1 year to it using the construct dt = df['Date'].dt; pd.to_datetime(dict(year=dt.year + y, month=dt.month, ...)), we would get a ValueError: cannot assemble the datetimes: day is out of range for month.
It preserves any time of day and timezone information (again, not in your case, but in the general case one would retain those).
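For instance, a quick check of the leap-year point (a made-up timestamp, not from the question):
ts = pd.Timestamp('2020-02-29')
print(ts + pd.offsets.DateOffset(years=1))  # 2021-02-28 00:00:00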

Related

Pandas: How to set hour of a datetime from another column?

I have a dataframe with a datetime column for the date and a column for the hour, like this:
min  hour  date
0    0     2020-12-01
1    5     2020-12-02
2    6     2020-12-01
I need a datetime column including both date and hour, like this:
min  hour  date        datetime
0    0     2020-12-01  2020-12-01 00:00:00
0    5     2020-12-02  2020-12-02 05:00:00
0    6     2020-12-01  2020-12-01 06:00:00
How can I do it?
Use pd.to_datetime and pd.to_timedelta:
In [393]: df['date'] = pd.to_datetime(df['date'])
In [396]: df['datetime'] = df['date'] + pd.to_timedelta(df['hour'], unit='h')
In [405]: df
Out[405]:
   min  hour       date            datetime
0    0     0 2020-12-01 2020-12-01 00:00:00
1    1     5 2020-12-02 2020-12-02 05:00:00
2    2     6 2020-12-01 2020-12-01 06:00:00
You could also try using apply and np.timedelta64:
import numpy as np

df['datetime'] = df['date'] + df['hour'].apply(lambda x: np.timedelta64(x, 'h'))
print(df)
Output:
   min  hour       date            datetime
0    0     0 2020-12-01 2020-12-01 00:00:00
1    1     5 2020-12-02 2020-12-02 05:00:00
2    2     6 2020-12-01 2020-12-01 06:00:00
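If you also want to fold the min column into the result (an assumption; the question only asks about the hour), the same to_timedelta pattern extends naturally:
df['datetime'] = (pd.to_datetime(df['date'])
                  + pd.to_timedelta(df['hour'], unit='h')
                  + pd.to_timedelta(df['min'], unit='m'))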
In the first question it isn't clear what the data types of the columns are, so I assumed they are plain date objects (not pandas datetimes) and that the asker wants the datetime version.
If that is the case, the solution is similar to the previous one, but uses a different constructor.
from datetime import datetime
df['datetime'] = df.apply(lambda x: datetime(x.date.year, x.date.month, x.date.day,
                                             int(x['hour']), int(x['min'])), axis=1)

How do I get difference between two dates but also have a way to sum up hours by day?

I'm a little new to Python and I'm not sure where to start.
I have a python dataframe that contains shift information like the below:
EmployeeID  ShiftType  BeginDate            EndDate
10          Holiday    2020-01-01 21:00:00  2020-01-02 07:00:00
10          Regular    2020-01-02 21:00:00  2020-01-03 07:00:00
10          Regular    2020-01-03 21:00:00  2020-01-04 07:00:00
10          Regular    2020-01-04 21:00:00  2020-01-05 07:00:00
20          Regular    2020-02-01 09:00:00  2020-02-01 17:00:00
20          Regular    2020-02-02 09:00:00  2020-02-02 17:00:00
20          Regular    2020-02-03 09:00:00  2020-02-03 17:00:00
20          Regular    2020-02-04 09:00:00  2020-02-04 17:00:00
I'd like to be able to break each shift down and summarize hours worked each day.
The desired output is below:
EmployeeID  ShiftType  Date        HoursWorked
10          Holiday    2020-01-01  3
10          Regular    2020-01-02  10
10          Regular    2020-01-03  10
10          Regular    2020-01-04  10
10          Regular    2020-01-05  7
20          Regular    2020-02-01  10
20          Regular    2020-02-02  10
20          Regular    2020-02-03  10
20          Regular    2020-02-04  10
I know how to get the work hours like below. This does get me hours for each shift, but I would like to be able to break out each calendar day and hours for that day. Retaining 'ShiftType' is not that important here.
df['HoursWorked'] = ((pd.to_datetime(df['EndDate']) - pd.to_datetime(df['BeginDate'])).dt.total_seconds() / 3600)
Any suggestion would be appreciated.
You could calculate the working hours for each date (BeginDate and EndDate) separately, which would give you three pd.Series: two for the parts of the night shifts that fall on different dates, and one for the day shifts that fall on a single date. Use the corresponding date as the index for those Series.
# make sure dtype is correct
df["BeginDate"] = pd.to_datetime(df["BeginDate"])
df["EndDate"] = pd.to_datetime(df["EndDate"])

# get a mask where there is a night shift; end date = start date + 1
m = df["BeginDate"].dt.date != df["EndDate"].dt.date

# extract the night shifts...
s0 = pd.Series((df["BeginDate"][m].dt.ceil('d') - df["BeginDate"][m]).values,
               index=df["BeginDate"][m].dt.floor('d'))
s1 = pd.Series((df["EndDate"][m] - df["EndDate"][m].dt.floor('d')).values,
               index=df["EndDate"][m].dt.floor('d'))

# ...and the day shifts
s2 = pd.Series((df["EndDate"][~m] - df["BeginDate"][~m]).values,
               index=df["BeginDate"][~m].dt.floor('d'))
Now concat and sum them, giving you a pd.Series with the working hours for each date:
working_hours = pd.concat([s0, s1, s2], axis=1).sum(axis=1)
# working_hours
# 2020-01-01 0 days 03:00:00
# 2020-01-02 0 days 10:00:00
# 2020-01-03 0 days 10:00:00
# ...
# Freq: D, dtype: timedelta64[ns]
To join with your original df, you can reindex that and add the working_hours:
new_index = pd.concat([df["BeginDate"].dt.floor('d'),
                       df["EndDate"].dt.floor('d')]).drop_duplicates()
df_out = df.set_index(df["BeginDate"].dt.floor('d')).reindex(new_index, method='nearest')
df_out.index = df_out.index.set_names('date')
df_out = df_out.drop(['BeginDate', 'EndDate'], axis=1).sort_index()
df_out['HoursWorked'] = working_hours.dt.total_seconds() / 3600
# df_out
# EmployeeID ShiftType HoursWorked
# date
# 2020-01-01 10 Holiday 3.0
# 2020-01-02 10 Regular 10.0
# 2020-01-03 10 Regular 10.0
# 2020-01-04 10 Regular 10.0
# 2020-01-05 10 Regular 7.0
# 2020-02-01 20 Regular 8.0
# 2020-02-02 20 Regular 8.0
# 2020-02-03 20 Regular 8.0
# 2020-02-04 20 Regular 8.0
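An alternative sketch that also keeps EmployeeID and ShiftType per day (assuming the same df as above): clip each shift against the midnight boundaries it crosses and collect one row per calendar day. It loops with iterrows, so it is slower than the vectorized approach above, but it reads close to the problem statement:
import pandas as pd

df['BeginDate'] = pd.to_datetime(df['BeginDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])

rows = []
for _, r in df.iterrows():
    day = r['BeginDate'].normalize()  # midnight of the shift's first day
    while day <= r['EndDate'].normalize():
        # intersect the shift with the 24-hour window [day, day + 1 day)
        start = max(r['BeginDate'], day)
        end = min(r['EndDate'], day + pd.Timedelta(days=1))
        rows.append((r['EmployeeID'], r['ShiftType'], day.date(),
                     (end - start).total_seconds() / 3600))
        day += pd.Timedelta(days=1)

out = pd.DataFrame(rows, columns=['EmployeeID', 'ShiftType', 'Date', 'HoursWorked'])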

How to calculate the difference between 2 consecutive dataframes using pandas

I am fairly new to pandas. I have the following dataframe:
Date
2019-06-01 195.585770
2019-07-01 210.527466
2019-08-01 206.278168
2019-09-01 222.169479
2019-10-01 246.760193
2019-11-01 265.101562
2019-12-01 292.163818
2020-01-01 307.943604
2020-02-01 271.976532
2020-03-01 253.603500
2020-04-01 293.006836
2020-05-01 317.081665
2020-06-01 331.500000
2020-06-05 331.500000
Name: AAPL, dtype: float64
How can I quickly calculate the difference between 2 dates in days? In the end I want to calculate the average monthly increase percentage-wise.
The result should be that the difference is alternately 30 and 31 days. There must be a quick command to calculate the difference between two consecutive dates but I can't seem to find it.
We can do pct_change and mean:
df['AAPL'].pct_change().mean()
Or, in case your data is a Series:
s.pct_change().mean()
If you want to find out the daily percentage change:
s.pct_change()/s.index.to_series().diff().dt.days
Output:
Date
2019-06-01 NaN
2019-07-01 0.002546
2019-08-01 -0.000651
2019-09-01 0.002485
2019-10-01 0.003689
2019-11-01 0.002398
2019-12-01 0.003403
2020-01-01 0.001742
2020-02-01 -0.003768
2020-03-01 -0.002329
2020-04-01 0.005012
2020-05-01 0.002739
2020-06-01 0.001467
2020-06-05 0.000000
dtype: float64
I see that you already have your answer, but this may be useful in some other instances.
You can use df.iterrows() to easily grab the difference in dates. First, make sure you convert your Date column to datetime. Then iterate over the rows and grab the difference.
df = pd.DataFrame(data={'Date': ['2020-04-01', '2020-05-01', '2020-06-01', '2020-06-05'],
                        'Col2': ['195.585770', '210.527466', '331.500000', '331.500000']})
df['Date'] = pd.to_datetime(df['Date'])  # convert to datetime
for index, row in df.iterrows():
    if not index == 0:
        print(row['Date'] - df['Date'][index - 1])
The output will look like this:
30 days 00:00:00
31 days 00:00:00
4 days 00:00:00
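That said, the same differences come out of a single vectorized call, without the loop:
print(df['Date'].diff().dropna())
# 1   30 days
# 2   31 days
# 3    4 days
# Name: Date, dtype: timedelta64[ns]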

How can I get different statistics for a rolling datetime range up to the current value in a pandas dataframe?

I have a dataframe that has four different columns and looks like the table below:
index_example  column_a  column_b  column_c  datetime_column
1              A         1,000     1         2020-01-01 11:00:00
2              A         2,000     2         2019-11-01 10:00:00
3              A         5,000     3         2019-12-01 08:00:00
4              B         1,000     4         2020-01-01 05:00:00
5              B         6,000     5         2019-01-01 01:00:00
6              B         7,000     6         2019-04-01 11:00:00
7              A         8,000     7         2019-11-30 07:00:00
8              B         500       8         2020-01-01 05:00:00
9              B         1,000     9         2020-01-01 03:00:00
10             B         2,000     10        2020-01-01 02:00:00
11             A         1,000     11        2019-05-02 01:00:00
Purpose:
For each row, get the different rolling statistics for column_b based on a window of time in the datetime_column, defined as the last N months. The window of time to look at, however, is filtered by the value in column_a.
Code example using a for loop, which is not feasible given the size of the data:
mean_dict = {}
for index, value in enumerate(df.datetime_column):
    test_date = value
    test_column_a = df.column_a[index]
    subset_df = df[(df.datetime_column < test_date) &
                   (df.datetime_column >= test_date - timedelta(days=180)) &
                   (df.column_a == test_column_a)]
    mean_dict[index] = subset_df.column_b.mean()
For example for row #1:
Target date = 2020-01-01 11:00:00
Target value in column_a = A
Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
Average would be the mean of rows 2,3,7
If I wanted average for row #2 then it would be:
Target date = 2019-11-01 10:00:00
Target value in column_a = A
Date Range: from 2019-05-01 10:00:00 to 2019-11-01 10:00:00
Average would be the mean of row 11
and so on...
I cannot use the grouper since in reality I do not have dates but datetimes.
Has anyone encountered this before?
Thanks!
EDIT
The dataframe is big (~2M rows), which means that looping is not an option. I already tried looping and creating a subset based on conditional values, but it takes too long.
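A possible vectorized sketch (an assumption, not a definitive answer: it relies on pandas' time-based rolling windows and on column_b being numeric). Sort by column_a and datetime_column, then compute a 180-day rolling mean per group; closed='left' excludes the current row, matching the strict < test_date in the loop above:
df['datetime_column'] = pd.to_datetime(df['datetime_column'])
df = df.sort_values(['column_a', 'datetime_column']).reset_index(drop=True)
df['rolling_mean'] = (
    df.set_index('datetime_column')
      .groupby('column_a')['column_b']
      .rolling('180D', closed='left')
      .mean()
      .to_numpy()  # row order already matches the frame after the sort above
)
Other rolling statistics (sum, count, max, ...) follow the same pattern.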

Adding months to a pandas object in Python

I have this dataframe object:
Date
2018-12-14
2019-01-11
2019-01-25
2019-02-08
2019-02-22
2019-07-26
What I want, if it's possible, is to add for example: 3 months to the dates, and then 3 months to the new date (original date + 3 months) and repeat this x times. I am using pd.offsets.MonthOffset but this just adds the months one time and I need to do it more times.
I don't know if it is possible but any help would be perfect.
Thank you so much for taking your time.
The expected output is (for 1 month adding 2 times):
[[2019-01-14, 2019-02-11, 2019-02-25, 2019-03-08, 2019-03-22, 2019-08-26],[2019-02-14, 2019-03-11, 2019-03-25, 2019-04-08, 2019-04-22, 2019-09-26]]
I believe you need a loop with f-strings for the new column names:
for i in range(1, 4):
    df[f'Date_added_{i}_months'] = df['Date'] + pd.offsets.MonthBegin(i)
print(df)
        Date Date_added_1_months Date_added_2_months Date_added_3_months
0 2018-12-14          2019-01-01          2019-02-01          2019-03-01
1 2019-01-11          2019-02-01          2019-03-01          2019-04-01
2 2019-01-25          2019-02-01          2019-03-01          2019-04-01
3 2019-02-08          2019-03-01          2019-04-01          2019-05-01
4 2019-02-22          2019-03-01          2019-04-01          2019-05-01
5 2019-07-26          2019-08-01          2019-09-01          2019-10-01
Or:
for i in range(1, 4):
    df[f'Date_added_{i}_months'] = df['Date'] + pd.offsets.MonthOffset(i)
print(df)
        Date Date_added_1_months Date_added_2_months Date_added_3_months
0 2018-12-14          2019-01-14          2019-02-14          2019-03-14
1 2019-01-11          2019-02-11          2019-03-11          2019-04-11
2 2019-01-25          2019-02-25          2019-03-25          2019-04-25
3 2019-02-08          2019-03-08          2019-04-08          2019-05-08
4 2019-02-22          2019-03-22          2019-04-22          2019-05-22
5 2019-07-26          2019-08-26          2019-09-26          2019-10-26
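If pd.offsets.MonthOffset is not available in your pandas version (it was removed in newer releases), pd.DateOffset(months=i) gives the same day-preserving shift:
for i in range(1, 4):
    df[f'Date_added_{i}_months'] = df['Date'] + pd.DateOffset(months=i)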
I hope this helps:
from dateutil.relativedelta import relativedelta

month_offset = [3, 6, 9]
for i in month_offset:
    df[f'Date_plus_{i}_months'] = df['Date'].map(lambda x: x + relativedelta(months=i))
If your dates are date objects, it should be pretty easy. You can create a relativedelta of 3 months (a plain datetime.timedelta has no month unit) and add it to each date; see the sketch below.
Alternatively, you can convert date strings to date objects with .strptime(), do the arithmetic as suggested above, and convert them back to strings with .strftime().
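For example, a minimal sketch with a plain datetime.date object (made-up value from the question's data):
from datetime import date
from dateutil.relativedelta import relativedelta

d = date(2018, 12, 14)
print(d + relativedelta(months=3))  # 2019-03-14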
