I have a df of ids and dates. What I'd like to do is set the same date for all rows that fall within a 2-day time period. I'm having trouble writing a function for this. It's like the equivalent of a SQL OVER PARTITION BY.
Input:
d1 = {'id': ['a','a','a','a','b','a','b'], 'datetime': ['10/25/2021 0:00','10/26/2021 0:00','11/28/2021 0:00','11/29/2021 0:00','11/29/2021 0:00', '11/30/2021 0:00', '11/30/2021 0:00']}
df1 = pd.DataFrame(d1)
df1['datetime'] = pd.to_datetime(df1['datetime'])
Desired Output:
d3 = {'id': ['a','a','a','a','a','b','b'], 'datetime': ['10/25/2021 0:00','10/25/2021 0:00','11/28/2021 0:00','11/28/2021 0:00', '11/30/2021 0:00','11/29/2021 0:00','11/29/2021 0:00']}
df3 = pd.DataFrame(d3)
The solution I'm looking for should group by id sorted by datetime. With the first datetime value in that group, create a group of all rows within a 2 day time period and assign those rows with that first datetime value, then move on to the next date and repeat. Then move on to the next id.
Try this:
df1['datetime'] = pd.to_datetime(df1['datetime'])  # no-op if already converted
oldest = {}  # start of the current window for each id
# assumes the rows of each id appear in chronological order, as in the sample
for t in range(df1.shape[0]):
    key = df1.iloc[t, 0]
    current = df1.iloc[t, 1]
    # start a new window for an unseen id or when more than one day has passed
    if key not in oldest or (current - oldest[key]).days > 1:
        oldest[key] = current
    df1.iloc[t, 1] = oldest[key]
The output would be:
id datetime
0 a 2021-10-25 00:00:00
1 a 2021-10-25 00:00:00
2 a 2021-11-28 00:00:00
3 a 2021-11-28 00:00:00
4 b 2021-11-29 00:00:00
5 a 2021-11-30 00:00:00
6 b 2021-11-29 00:00:00
Try with groupby:
df["datetime"] = pd.to_datetime(df["datetime"])
output = df.groupby("id").apply(lambda x: x.iloc[::2].reindex(x.index).ffill()).sort_values(["id", "datetime"])
>>> output
id datetime
0 a 2021-10-25
1 a 2021-10-25
2 a 2021-11-28
3 a 2021-11-28
5 a 2021-11-30
4 b 2021-11-29
6 b 2021-11-29
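The rule described in the question (take the first date of an id, assign it to every row less than 2 days after it, then start again from the next date) can also be written as a small helper applied per group. This is only a sketch: the helper name anchor_windows and its days parameter are made up for illustration, and it assumes the frame has been sorted by id and datetime first.

```python
import pandas as pd

def anchor_windows(dates: pd.Series, days: int = 2) -> pd.Series:
    """Replace each date with the first date of its window.

    Hypothetical helper: expects `dates` sorted in ascending order.
    """
    anchored = []
    start = None
    for d in dates:
        # open a new window once `days` full days have passed since its start
        if start is None or d - start >= pd.Timedelta(days=days):
            start = d
        anchored.append(start)
    return pd.Series(anchored, index=dates.index)

d1 = {'id': ['a', 'a', 'a', 'a', 'b', 'a', 'b'],
      'datetime': ['10/25/2021 0:00', '10/26/2021 0:00', '11/28/2021 0:00',
                   '11/29/2021 0:00', '11/29/2021 0:00', '11/30/2021 0:00',
                   '11/30/2021 0:00']}
df1 = pd.DataFrame(d1)
df1['datetime'] = pd.to_datetime(df1['datetime'])

# sort so that each group is chronological, then anchor within each id
df1 = df1.sort_values(['id', 'datetime'])
df1['datetime'] = df1.groupby('id')['datetime'].transform(anchor_windows)
print(df1)
```

Unlike pairing rows by position, this compares actual dates, so it also copes with ids that have one, three, or more rows inside a window.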
Python Q. How to parse an object index in a data frame into its date, time, and time zone when it has multiple time zones?
The format is "YYYY-MM-DD HH:MM:SS-HH:MM", where the trailing "HH:MM" is the time zone offset.
Example: Midnight Jan 1st, 2020 in Mountain Time, counting up:
2020-01-01 00:00:00-07:00
2020-01-01 01:00:00-07:00
2020-01-01 02:00:00-07:00
2020-01-01 04:00:00-06:00
I've got code that works for one time zone, but it breaks when a second timezone is introduced.
df['Date'] = pd.to_datetime(df.index)
df['year']= df['Date'].dt.year
df['month']= df['Date'].dt.month
df['month_n']= df['Date'].dt.month_name()
df['day']= df['Date'].dt.day
df['day_n']= df['Date'].dt.day_name()
df['h']= df['Date'].dt.hour
df['mn']= df['Date'].dt.minute
df['s']= df['Date'].dt.second
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc="True"
Use pandas.Series.apply instead:
df['Date'] = pd.to_datetime(df.index)
# build one row of date parts per timestamp
df_info = df['Date'].apply(lambda t: pd.Series({
    'date': t.date(),
    'year': t.year,
    'month': t.month,
    'month_n': t.strftime("%B"),
    'day': t.day,
    'day_n': t.strftime("%A"),
    'h': t.hour,
    'mn': t.minute,
    's': t.second,
}))
df = pd.concat([df, df_info], axis=1)
# Output :
print(df)
Date date year month month_n day day_n h mn s
col
2020-01-01 00:00:00-07:00 2020-01-01 00:00:00-07:00 2020-01-01 2020 1 January 1 Wednesday 0 0 0
2020-01-01 01:00:00-07:00 2020-01-01 01:00:00-07:00 2020-01-01 2020 1 January 1 Wednesday 1 0 0
2020-01-01 02:00:00-07:00 2020-01-01 02:00:00-07:00 2020-01-01 2020 1 January 1 Wednesday 2 0 0
2020-01-01 04:00:00-06:00 2020-01-01 04:00:00-06:00 2020-01-01 2020 1 January 1 Wednesday 4 0 0
@abokey's answer is great if you aren't sure of the actual time zone or cannot work with UTC. However, you lose the dt accessor and the performance of a vectorized approach.
So if you can use UTC or set a time zone (at the moment you only have UTC offsets), e.g. "America/Denver", everything works as expected:
import pandas as pd
df = pd.DataFrame({'v': [999, 999, 999, 999]},
                  index=["2020-01-01 00:00:00-07:00",
                         "2020-01-01 01:00:00-07:00",
                         "2020-01-01 02:00:00-07:00",
                         "2020-01-01 04:00:00-06:00"])
df['Date'] = pd.to_datetime(df.index, utc=True)
print(df.Date.dt.hour)
# 2020-01-01 00:00:00-07:00 7
# 2020-01-01 01:00:00-07:00 8
# 2020-01-01 02:00:00-07:00 9
# 2020-01-01 04:00:00-06:00 10
# Name: Date, dtype: int64
# Note: hour changed since we converted to UTC !
or
df['Date'] = pd.to_datetime(df.index, utc=True).tz_convert("America/Denver")
print(df.Date.dt.hour)
# 2020-01-01 00:00:00-07:00 0
# 2020-01-01 01:00:00-07:00 1
# 2020-01-01 02:00:00-07:00 2
# 2020-01-01 04:00:00-06:00 3
# Name: Date, dtype: int64
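If the original offset of each row still matters after converting to UTC, one option is to store it in its own column. The following is a minimal sketch, not the only way to do it: the column names utc, offset_hours and local_hour are invented here, and it assumes every index entry is an ISO-style string that datetime.fromisoformat can parse.

```python
from datetime import datetime

import pandas as pd

stamps = ["2020-01-01 00:00:00-07:00",
          "2020-01-01 01:00:00-07:00",
          "2020-01-01 02:00:00-07:00",
          "2020-01-01 04:00:00-06:00"]
df = pd.DataFrame({'v': [999, 999, 999, 999]}, index=stamps)

# parse each string on its own so the per-row offset is preserved
parsed = [datetime.fromisoformat(s) for s in df.index]

# vectorized, tz-aware column in UTC (the dt accessor works on this one)
df['utc'] = pd.to_datetime(df.index, utc=True)

# hypothetical helper columns: the offset and the wall-clock hour as written
df['offset_hours'] = [p.utcoffset().total_seconds() / 3600 for p in parsed]
df['local_hour'] = [p.hour for p in parsed]

print(df[['utc', 'offset_hours', 'local_hour']])
```

The utc column stays fast to work with, while the two extra columns keep what the original strings said.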
I have a large data set from which I'm trying to produce a time series using ARIMA. However, some of the data in the date column has multiple rows with the same date.
The dates were entered this way because the exact date of the event was not known, so unknown dates were entered as the first of that month (biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not the same. I have the code to randomise the date but I cannot find a way to replace the data with the date ranges.
import random
import time
def str_time_prop(start, end, time_format, prop):
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))
def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d', prop)
# check if the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
I use the code above to generate a random date within a date range, but I'm struggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
    {
        "date": [
            "2016-01-01",
            "2015-01-01",
            "2013-01-01",
            "2014-01-01",
            "2017-01-01",
            "2011-01-01",
        ],
        "value": [10035, 5397, 4567, 4343, 3981, 2049],
    }
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Output
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the data in the date column has multiple rows with the same date, and you want to randomize the dates within the month, you could group by year and month and pick out only the rows whose day equals 1. Then, use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np
np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group multiple rows with the same year, month and day equal 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day==1])
FIRST_DAY = 2 # set for the desired range
df_list = []
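for n, g in grouped:
    g = g.copy()  # work on a copy rather than a slice of df
    last_day = calendar.monthrange(n[0], n[1])[1]  # get last day for this month and year
    if n[2]:  # placeholder dates (day == 1): draw a random day
        g['New_Date'] = g['date'].apply(lambda d:
            d.replace(day=np.random.randint(FIRST_DAY, last_day + 1))
        )
    else:  # known dates are kept unchanged
        g['New_Date'] = g['date']
    df_list.append(g)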
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12
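The same idea can be sketched without an explicit loop, using dt.days_in_month and one random day offset per row. Treat this as an illustration under assumptions: the new_date column name is invented, the 2016-03-15 row is a made-up "known" date added to show that such rows are left alone, and, unlike FIRST_DAY = 2 above, the 1st itself can be drawn again here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01", "2015-01-01", "2013-01-01",
                            "2016-03-15"]),  # last row: hypothetical known date
    "value": [10035, 5397, 4567, 1],
})

# one random offset per row, from 0 up to (days in that month - 1)
days_in_month = df["date"].dt.days_in_month.to_numpy()
offsets = pd.Series(rng.integers(0, days_in_month), index=df.index)

# shift every date, then keep the shifted value only for placeholder rows
randomised = df["date"] + pd.to_timedelta(offsets, unit="D")
df["new_date"] = randomised.where(df["date"].dt.day == 1, df["date"])
print(df)
```

Because the upper bound passed to rng.integers is exclusive, the offset never pushes a date past the end of its month.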
This is a follow-up question to the accepted solution here.
I have a pandas dataframe:
In one column 'time' is the time stored in the following format: 'HHMMSS' (e.g. 203412 means 20:34:12).
In another column 'date' the date is stored in the following format: 'YYmmdd' (e.g 200712 means 2020-07-12). YY represents the addon to the year 2000.
Example:
import pandas as pd
data = {'time': ['123455', '000010', '100000'],
'date': ['200712', '210601', '190610']}
df = pd.DataFrame(data)
print(df)
# time date
#0 123455 200712
#1 000010 210601
#2 100000 190610
I need a third column which contains the combined datetime format (e.g. 2020-07-12 12:34:55) of the two other columns. So far, I can only modify the time but I do not know how to add the date.
df['datetime'] = pd.to_datetime(df['time'], format='%H%M%S')
print(df)
# time date datetime
#0 123455 200712 1900-01-01 12:34:55
#1 000010 210601 1900-01-01 00:00:10
#2 100000 190610 1900-01-01 10:00:00
How can I add in column df['datetime'] the date from column df['date'], so that the dataframe is:
time date datetime
0 123455 200712 2020-07-12 12:34:55
1 000010 210601 2021-06-01 00:00:10
2 100000 190610 2019-06-10 10:00:00
I found this question, but I am not exactly sure how to use it for my purpose.
You can join the columns first and then specify the format:
df['datetime'] = pd.to_datetime(df['date'] + df['time'], format='%y%m%d%H%M%S')
print(df)
time date datetime
0 123455 200712 2020-07-12 12:34:55
1 000010 210601 2021-06-01 00:00:10
2 100000 190610 2019-06-10 10:00:00
If the columns may be stored as integers, convert them to strings first:
df['datetime'] = pd.to_datetime(df['date'].astype(str) + df['time'].astype(str), format='%y%m%d%H%M%S')
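One thing to watch with integer columns: a time such as 000010 is stored as the integer 10, so astype(str) yields '10' and the fixed-width format no longer matches. A minimal sketch of a workaround, assuming both columns are always meant to be six digits wide, is to pad with zeros before joining:

```python
import pandas as pd

# same sample as in the question, but stored as integers
df = pd.DataFrame({'time': [123455, 10, 100000],
                   'date': [200712, 210601, 190610]})

# restore the leading zeros that were lost in the integer representation
date_str = df['date'].astype(str).str.zfill(6)
time_str = df['time'].astype(str).str.zfill(6)

df['datetime'] = pd.to_datetime(date_str + time_str, format='%y%m%d%H%M%S')
print(df)
```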
I am dealing with a very large dataframe. A small sample is below:
import pandas as pd
df = pd.DataFrame({'nodes': ['A', 'B', 'C'],
'dept': ['20:00', '02:00', '21:00'],
'arrv': ['20:00', '17:00', '21:00'],
'dept_offset_day': [0, 1, 0],
'arrv_offset_day': [0, 1, 0],
'stop_num':[0,1,2]})
print(df)
nodes dept arrv dept_offset_day arrv_offset_day
0 A 20:00 20:00 0 0
1 B 02:00 17:00 1 1
2 C 21:00 21:00 0 0
I am trying to 1) add a date to the start and end times, taking the day offsets into account, and 2) split the nodes column into two columns, nodes_start and nodes_end, i.e. point to point. Something like:
nodes_start nodes_end start_datetime end_datetime
A B 2019-5-9 20:00 2019-5-10 02:00
B C 2019-5-10 17:00 2019-5-10 21:00
I tried using pd.offsets.Day() and loop through each line, but it makes the exec time very slow and I get wrong dates. Thanks for your help.
Try constructing a new data-frame, with new columns (copied columns really :D):
df2 = pd.DataFrame()
df2['nodes_start'] = df['nodes'][:2]
df2['nodes_end'] = df['nodes'][-2:].reset_index(drop=True)
df2['start_datetime'] = pd.to_datetime(df['arrv'][:2])
df2['end_datetime'] = pd.to_datetime(df['dept'][-2:].reset_index(drop=True))
df2['start_datetime'] = [df2['start_datetime'][0] - pd.Timedelta(days=1)] + [df2['start_datetime'][1]]
print(df2)
Output:
nodes_start nodes_end start_datetime end_datetime
0 A B 2019-05-09 20:00:00 2019-05-10 02:00:00
1 B C 2019-05-10 17:00:00 2019-05-10 21:00:00
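For a large frame, the same pairing can be sketched without hard-coded slices by using shift(-1) and pd.to_timedelta. Several things here are assumptions rather than facts from the question: the base date 2019-05-09 is read off the desired output, the helper name stamp is invented, and, following the answer above, arrv of the current node is used as the start and dept of the next node as the end.

```python
import pandas as pd

df = pd.DataFrame({'nodes': ['A', 'B', 'C'],
                   'dept': ['20:00', '02:00', '21:00'],
                   'arrv': ['20:00', '17:00', '21:00'],
                   'dept_offset_day': [0, 1, 0],
                   'arrv_offset_day': [0, 1, 0],
                   'stop_num': [0, 1, 2]})

base = pd.Timestamp('2019-05-09')  # assumed service date

def stamp(times, day_offsets):
    # hypothetical helper: base date + day offset + 'HH:MM' time of day
    return base + pd.to_timedelta(day_offsets, unit='D') + pd.to_timedelta(times + ':00')

arrv_dt = stamp(df['arrv'], df['arrv_offset_day'])
dept_dt = stamp(df['dept'], df['dept_offset_day'])

# pair each node with the next one; the last row has no successor
legs = pd.DataFrame({
    'nodes_start': df['nodes'],
    'nodes_end': df['nodes'].shift(-1),
    'start_datetime': arrv_dt,
    'end_datetime': dept_dt.shift(-1),
}).iloc[:-1]
print(legs)
```

With the sample offsets exactly as posted, the last end_datetime comes out as 2019-05-09 21:00 rather than the 2019-05-10 21:00 in the desired output, because node C carries a day offset of 0. If C is really reached on the following day, its offset columns would need to say so.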
I'm trying to figure out how to add 3 months to a date in a Pandas dataframe, while keeping it in the date format, so I can use it to lookup a range.
This is what I've tried:
#create dataframe
df = pd.DataFrame([pd.Timestamp('20161011'),
pd.Timestamp('20161101') ], columns=['date'])
#create a future month period
plus_month_period = 3
#calculate date + future period
df['future_date'] = plus_month_period.astype("timedelta64[M]")
However, I get the following error:
AttributeError: 'int' object has no attribute 'astype'
You could use pd.DateOffset
In [1756]: df.date + pd.DateOffset(months=plus_month_period)
Out[1756]:
0 2017-01-11
1 2017-02-01
Name: date, dtype: datetime64[ns]
Details
In [1757]: df
Out[1757]:
date
0 2016-10-11
1 2016-11-01
In [1758]: plus_month_period
Out[1758]: 3
Suppose you have a dataframe of the following format, where you have to add integer months to a date column.
Start_Date  Months_to_add
2014-06-01  23
2014-06-01  4
2000-10-01  10
2016-07-01  3
2017-12-01  90
2019-01-01  2
In such a scenario, Zero's or mattblack's code won't help, because the number of months differs from row to row. You have to apply a lambda function over the rows, where the function takes 2 arguments:
1. A date to which the months are added
2. The number of months, as an integer
You can use the following function:
# Importing required modules
from dateutil.relativedelta import relativedelta
# Defining the function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
After this you can use the following code snippet to add months to the Start_Date column. It uses progress_apply, which tqdm adds to pandas once tqdm.pandas() has been called. Refer to this Stack Overflow answer on progress_apply: Progress indicator during pandas operations.
from tqdm import tqdm
tqdm.pandas()
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
Here's the full code, starting from dataset creation, for your reference:
import pandas as pd
from dateutil.relativedelta import relativedelta
from tqdm import tqdm
tqdm.pandas()
# Initialize a new dataframe
df = pd.DataFrame()
# Add Start Date column
df["Start_Date"] = ['2014-06-01T00:00:00.000000000',
'2014-06-01T00:00:00.000000000',
'2000-10-01T00:00:00.000000000',
'2016-07-01T00:00:00.000000000',
'2017-12-01T00:00:00.000000000',
'2019-01-01T00:00:00.000000000']
# To convert the date column to a datetime format
df["Start_Date"] = pd.to_datetime(df["Start_Date"])
# Add months column
df["Months_to_add"] = [23, 4, 10, 3, 90, 2]
# Defining the Add Months function
def add_months(start_date, delta_period):
    end_date = start_date + relativedelta(months=delta_period)
    return end_date
# Apply function on the dataframe using lambda operation.
df["End_Date"] = df.progress_apply(lambda row: add_months(row["Start_Date"], row["Months_to_add"]), axis = 1)
You will have the final output dataframe as follows.
Start_Date  Months_to_add  End_Date
2014-06-01  23             2016-05-01
2014-06-01  4              2014-10-01
2000-10-01  10             2001-08-01
2016-07-01  3              2016-10-01
2017-12-01  90             2025-06-01
2019-01-01  2              2019-03-01
Please add to comments if there are any issues with the above code.
All the best!
I believe the simplest and fastest way to solve this is to convert the dates to monthly periods with to_period('M'), add the values of the Months_to_add column, and then convert the result back to datetime with .dt.to_timestamp().
Using sample data based on the example by @Aruparna Maity:
Start_Date  Months_to_add
2014-06-01  23
2014-06-20  4
2000-10-01  10
2016-07-05  3
2017-12-15  90
2019-01-01  2
df['End_Date'] = ((df['Start_Date'].dt.to_period('M')) + df['Months_to_add']).dt.to_timestamp()
df.head(6)
#output
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-01
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-01
4 2017-12-15 90 2025-06-01
5 2019-01-01 2 2019-03-01
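If the exact day is needed, repeat the process with daily periods. Note that this rolls over into the following month when the start day does not exist in the target month (e.g. the 31st):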
df['End_Date'] = ((df['End_Date'].dt.to_period('D')) + df['Start_Date'].dt.day -1).dt.to_timestamp()
#output:
Start_Date Months_to_add End_Date
0 2014-06-01 23 2016-05-01
1 2014-06-20 4 2014-10-20
2 2000-10-01 10 2001-08-01
3 2016-07-05 3 2016-10-05
4 2017-12-15 90 2025-06-15
5 2019-01-01 2 2019-03-01
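If rolling past the end of a shorter month is a concern, the day can be clamped to the length of the target month before it is added back. This is a sketch of that variant only: the 2019-01-31 row is a made-up example added to show the clamping, and the intermediate names month_start and day are chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2014-06-20", "2019-01-31"]),  # second row is hypothetical
    "Months_to_add": [4, 1],
})

# first day of the target month, computed with monthly periods
month_start = (df["Start_Date"].dt.to_period("M") + df["Months_to_add"]).dt.to_timestamp()

# keep the original day, but never exceed the length of the target month
day = np.minimum(df["Start_Date"].dt.day, month_start.dt.days_in_month)

df["End_Date"] = month_start + pd.to_timedelta(day - 1, unit="D")
print(df)
```

Here 2019-01-31 plus one month lands on 2019-02-28 instead of spilling into March.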
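Another way is to use a NumPy timedelta64. Note that the 'M' unit is not calendar-aware: the output below matches an average month length of about 30.44 days, which is why the results drift off midnight and land a day early. Some pandas versions may refuse this unit altogether.
df['date'] + np.timedelta64(plus_month_period, 'M')
0 2017-01-10 07:27:18
1 2017-01-31 07:27:18
Name: date, dtype: datetime64[ns]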