Pandas - Datetime Manipulation - python

I have a dataframe like so:
CREATED_AT COUNT
'1990-01-01' '2022-01-01 07:30:00' 5
'1990-01-02' '2022-01-01 07:30:00' 10
...
Where the index is a date and the CREATED_AT column is a datetime that is the same value for all rows.
How can I update the CREATED_AT column so that it takes its date portion from the index?
The result should look like:
CREATED_AT COUNT
'1990-01-01' '1990-01-01 07:30:00' 5
'1990-01-02' '1990-01-02 07:30:00' 10
...
Attempts at this result in errors like:
cannot add DatetimeArray and DatetimeArray

You can use df.reset_index() to turn the index into a column and then do a simple manipulation to get the output you want, like this:
# Creating a test df
import pandas as pd
from datetime import datetime, timedelta, date
df = pd.DataFrame.from_dict({
"CREATED_AT": [datetime.now(), datetime.now() + timedelta(hours=1)],
"COUNT": [5, 10]
})
df_with_index = df.set_index(pd.Index([date.today() - timedelta(days=10), date.today() - timedelta(days=9)]))
# Creating the column with the result
df_result = df_with_index.reset_index()
df_result["NEW_CREATED_AT"] = pd.to_datetime(df_result["index"].astype(str) + ' ' + df_result["CREATED_AT"].dt.time.astype(str))
Result:
index CREATED_AT COUNT NEW_CREATED_AT
0 2022-11-11 2022-11-21 16:15:31.520960 5 2022-11-11 16:15:31.520960
1 2022-11-12 2022-11-21 17:15:31.520965 10 2022-11-12 17:15:31.520965

You can use:
# ensure CREATED_AT is a datetime
s = pd.to_datetime(df['CREATED_AT'])
# subtract the date to only get the time, add to the index
# ensuring the index is of datetime type
df['CREATED_AT'] = s.sub(s.dt.normalize()).add(pd.to_datetime(df.index))
If everything is already of datetime type, this simplifies to:
df['CREATED_AT'] = (df['CREATED_AT']
.sub(df['CREATED_AT'].dt.normalize())
.add(df.index)
)
Output:
CREATED_AT COUNT
1990-01-01 1990-01-01 07:30:00 5
1990-01-02 1990-01-02 07:30:00 10
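As a self-contained check of the approach above, here is a minimal sketch. The way df is built is an assumption, since the question only shows its printed form. Two datetimes cannot be added, which is what the error in the question reports, but a datetime plus a timedelta can, so the time of day is first extracted as a timedelta.

```python
import pandas as pd

# Assumed reconstruction of the frame from the question
df = pd.DataFrame(
    {
        "CREATED_AT": pd.to_datetime(["2022-01-01 07:30:00"] * 2),
        "COUNT": [5, 10],
    },
    index=pd.to_datetime(["1990-01-01", "1990-01-02"]),
)

# Time of day as a timedelta: datetime minus the same datetime at midnight
time_of_day = df["CREATED_AT"] - df["CREATED_AT"].dt.normalize()

# Timedelta + datetime index is a valid operation and yields datetimes
df["CREATED_AT"] = time_of_day + df.index
print(df)
```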

Related

How to tackle a dataset that has multiple same date values

I have a large data set from which I'm trying to produce a time series using ARIMA. However,
some of the data in the date column has multiple rows with the same date.
The dates were entered this way because the exact date of the event was not known, so unknown dates were entered as the first of that month (biased). Known dates have been entered correctly in the data set.
2016-01-01 10035
2015-01-01 5397
2013-01-01 4567
2014-01-01 4343
2017-01-01 3981
2011-01-01 2049
Ideally I want to randomise the dates within the month so they are not the same. I have the code to randomise the date but I cannot find a way to replace the data with the date ranges.
import random
import time
def str_time_prop(start, end, time_format, prop):
    stime = time.mktime(time.strptime(start, time_format))
    etime = time.mktime(time.strptime(end, time_format))
    ptime = stime + prop * (etime - stime)
    return time.strftime(time_format, time.localtime(ptime))

def random_date(start, end, prop):
    return str_time_prop(start, end, '%Y-%m-%d', prop)
# check if the random function works
print(random_date("2021-01-02", "2021-01-11", random.random()))
I use the code above to generate a random date within a date range, but I'm struggling to find a way to replace the dates.
Any help/guidance would be great.
Thanks
With the following toy dataframe:
import random
import time
import pandas as pd
df = pd.DataFrame(
{
"date": [
"2016-01-01",
"2015-01-01",
"2013-01-01",
"2014-01-01",
"2017-01-01",
"2011-01-01",
],
"value": [10035, 5397, 4567, 4343, 3981, 2049],
}
)
print(df)
# Output
date value
0 2016-01-01 10035
1 2015-01-01 5397
2 2013-01-01 4567
3 2014-01-01 4343
4 2017-01-01 3981
5 2011-01-01 2049
Here is one way to do it:
df["date"] = [
random_date("2011-01-01", "2022-04-17", random.random()) for _ in range(df.shape[0])
]
print(df)
# Output
date value
0 2013-12-30 10035
1 2016-06-17 5397
2 2018-01-26 4567
3 2012-02-14 4343
4 2014-06-26 3981
5 2019-07-03 2049
Since the date column has multiple rows with the same date and you want to randomize the dates within the month, you could group by year and month, flagging the rows whose day equals 1. Then use calendar.monthrange to find the last day of the month for that particular year, and use that information when replacing the timestamp's day. Change the FIRST_DAY and last_day values to match your desired range.
import pandas as pd
import calendar
import numpy as np
np.random.seed(42)
df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'])
# group multiple rows with the same year, month and day equal 1
grouped = df.groupby([df['date'].dt.year, df['date'].dt.month, df['date'].dt.day==1])
FIRST_DAY = 2 # set for the desired range
df_list = []
for n, g in grouped:
    g = g.copy()  # work on a copy so the assignment below does not hit a view of df
    last_day = calendar.monthrange(n[0], n[1])[1]  # get last day for this month and year
    g['New_Date'] = g['date'].apply(lambda d:
        d.replace(day=np.random.randint(FIRST_DAY, last_day + 1))
    )
    df_list.append(g)
new_df = pd.concat(df_list)
print(new_df)
Output from new_df
date num New_Date
2 2013-01-01 4567 2013-01-08
3 2014-01-01 4343 2014-01-21
1 2015-01-01 5397 2015-01-30
0 2016-01-01 10035 2016-01-16
4 2017-01-01 3981 2017-01-12
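The same idea can also be written without a Python loop. The following is only a sketch under a few assumptions: the sample rows are made up, only rows dated on the first of a month are treated as placeholders, and the random day may also land on the 1st (unlike FIRST_DAY = 2 above). It draws one random day offset per row, bounded by the length of that row's month.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Made-up sample: three placeholder dates and one known date
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-01-01", "2016-01-01", "2015-02-01", "2015-02-17"]
        ),
        "value": [1, 2, 3, 4],
    }
)

# Rows entered as the first of the month are assumed to be placeholders
is_placeholder = df["date"].dt.day == 1

# One offset per row, from 0 up to (days in that month - 1); high is exclusive
offsets = rng.integers(0, df["date"].dt.days_in_month.to_numpy())

# Keep known dates as they are, shift placeholders by their random offset
shifted = df["date"] + pd.to_timedelta(offsets, unit="D")
df["new_date"] = df["date"].where(~is_placeholder, shifted)
print(df)
```

Because the offset is capped by days_in_month, a shifted date never spills into the next month.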

Python: Add Weeks to Date from df

How would I add two df columns together (date + weeks):
This works for me:
df['Date'] = pd.to_datetime(startDate, format='%Y-%m-%d') + datetime.timedelta(weeks = 3)
But when I try to add weeks from a column, I get a type error: unsupported type for timedelta weeks component: Series
df['Date'] = pd.to_datetime(startDate, format='%Y-%m-%d') + datetime.timedelta(weeks = df['Duration (weeks)'])
Would appreciate any help thank you!
You can use the pandas to_timedelta function to transform the number of weeks column to a timedelta, like this:
import pandas as pd
import numpy as np
# create a DataFrame with a `date` column
df = pd.DataFrame(
pd.date_range(start='1/1/2018', end='1/08/2018'),
columns=["date"]
)
# add a column `weeks` with a random number of weeks
df['weeks'] = np.random.randint(1, 6, df.shape[0])
# use `pd.to_timedelta` to transform the number of weeks column to a timedelta
# and add it to the `date` column
df["new_date"] = df["date"] + pd.to_timedelta(df["weeks"], unit="W")
df.head()
date weeks new_date
0 2018-01-01 5 2018-02-05
1 2018-01-02 2 2018-01-16
2 2018-01-03 2 2018-01-17
3 2018-01-04 4 2018-02-01
4 2018-01-05 3 2018-01-26
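Applied to the shape of the original question, where a single start date is combined with a column of week counts, a minimal sketch could look like this. The value of start_date and the durations are made up for illustration, and the weeks are converted to days by multiplying by 7 rather than relying on a week unit.

```python
import pandas as pd

start_date = "2022-03-01"  # hypothetical start date
df = pd.DataFrame({"Duration (weeks)": [1, 3, 10]})

# A scalar Timestamp plus a Series of timedeltas broadcasts row by row
df["Date"] = pd.Timestamp(start_date) + pd.to_timedelta(
    df["Duration (weeks)"] * 7, unit="D"
)
print(df)
```

datetime.timedelta only accepts scalars, which is why passing a whole column to it raises the type error from the question.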

Convert a Date Object excel column to Datetime string by adding a given hour column

Can anyone help with this problem? I am trying to convert a date object column to a datetime string with Python, from 'YY-mm-dd' to 'YY/mm/dd 00:00' format. The dataset is shown below. I have tried every option I could find, such as:
energy_df['Date'] = pd.to_datetime(energy_df['Date'])
energy_df['month'] = energy_df['Date'].dt.month.astype(int)
energy_df['day_of_month'] = energy_df['Date'].dt.day.astype(int)
energy_df['day_of_week'] = energy_df['Date'].dt.dayofweek.astype(int)
energy_df['hour_of_day'] = energy_df['Hours']
selected_columns = ['Date', 'day_of_week', 'hour_of_day', 'Avg Specific Humidity[g/Kg]']
energy_df = energy_df[selected_columns]
[Dataset image not available]
Convert the 'date' column to dtype datetime, the 'hour' column to dtype timedelta, add them together, and format to string.
Ex:
import pandas as pd
# some dummy input...
df = pd.DataFrame({'date': ['2015-01-01', '2015-01-01', '2015-01-01'],
'hour': [1, 2, 3]})
# to datetime / timedelta...
df['datetime'] = pd.to_datetime(df['date']) + pd.to_timedelta(df['hour'], unit='h')
# and format to string...
df['timestamp'] = df['datetime'].dt.strftime('%Y/%m/%d %H:%M')
# will give you:
df
date hour datetime timestamp
0 2015-01-01 1 2015-01-01 01:00:00 2015/01/01 01:00
1 2015-01-01 2 2015-01-01 02:00:00 2015/01/01 02:00
2 2015-01-01 3 2015-01-01 03:00:00 2015/01/01 03:00

python - Convert timezone and retrieve hour

I am looking to add three columns to my current dataframe (utc_date, apac_date, and hour).
I successfully obtain two of the three columns, however hour should be corresponding to apac_date (17) but it is returning the hour for utc_date (9).
Any help would be greatly appreciated!
This is the starting dataframe:
import pandas as pd
from tzlocal import get_localzone
from pytz import timezone
raw_data = {
'id': ['123456'],
'start_date': [pd.Timestamp(2017, 9, 21, 5, 30, 0)]}  # pd.datetime is no longer available in recent pandas
df = pd.DataFrame(raw_data, columns = ['id', 'start_date'])
df
Result:
id start_date
123456 2017-09-21 05:30:00
Next, I convert the timezones for UTC and APAC based on the user's current region.
local_tz = get_localzone()
df['utc_date'] = df['start_date'].apply(lambda x: x.tz_localize(local_tz).astimezone(timezone('utc')))
df['apac_date'] = df['utc_date'].apply(lambda x: x.tz_localize('utc').astimezone(timezone('Asia/Hong_Kong')))
df
Result:
id start_date utc_date apac_date
123456 2017-09-21 05:30:00 2017-09-21 09:30:00+00:00 2017-09-21 17:30:00+08:00
Next, I retrieve the hour for the apac_date (it is giving me utc hour instead):
df['hour'] = df['apac_date'].apply(lambda x: int(x.strftime('%H')))
df
Result:
id start_date utc_date apac_date hour
123456 2017-09-21 05:30:00 2017-09-21 09:30:00+00:00 2017-09-21 17:30:00+08:00 9
Can you try using:
df['apac_date'] = df['utc_date'].apply(lambda x: x.tz_convert('Asia/Hong_Kong'))
Your code raised errors for me because it calls tz_localize() on a timestamp that has already been localized.
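For reference, here is a sketch of the whole pipeline using the vectorized .dt accessor instead of apply. The local timezone is hard-coded to America/New_York as an assumption (the question reads it from tzlocal), chosen because it reproduces the 09:30 UTC value shown in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"id": ["123456"], "start_date": [pd.Timestamp(2017, 9, 21, 5, 30)]}
)

local_tz = "America/New_York"  # assumption: stands in for get_localzone()

# tz_localize attaches a zone to naive timestamps; tz_convert changes the zone
df["utc_date"] = df["start_date"].dt.tz_localize(local_tz).dt.tz_convert("UTC")
df["apac_date"] = df["utc_date"].dt.tz_convert("Asia/Hong_Kong")

# .dt.hour reads the hour in the column's own timezone
df["hour"] = df["apac_date"].dt.hour
print(df)
```

The rule of thumb is to call tz_localize once on naive values and use tz_convert for every later change of zone.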

Changing date value in Pandas to another

I have a dataframe like this:
I am trying to change the '2001-01-01' value in a column to today's date, but this approach does not work:
date = dt.date.today()
df.loc[df['dat_csz_opzione_tech'] == '2001-01-01', 'dat_csz_opzione_tech'] = date
How can I do this?
Try this
import pandas as pd
import time
df = pd.DataFrame({ 'dat_csz_opzione_tech' :['2001-02-01','2001-01-01','2001-03-01','2001-04-01']})
todaysdate = time.strftime("%Y-%m-%d")
df.loc[df['dat_csz_opzione_tech'] == '2001-01-01', 'dat_csz_opzione_tech'] = todaysdate
print(df)
Output
dat_csz_opzione_tech
0 2001-02-01
1 2017-02-14
2 2001-03-01
3 2001-04-01
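If the column holds real datetimes rather than strings, a similar sketch works with a Timestamp comparison. The sample values are made up, and whether your column is datetime or string is an assumption you should check with df.dtypes first.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "dat_csz_opzione_tech": pd.to_datetime(
            ["2001-02-01", "2001-01-01", "2001-03-01"]
        )
    }
)

# Today's date at midnight, as a pandas Timestamp
today = pd.Timestamp.today().normalize()

# Compare against a Timestamp so the mask matches datetime values
mask = df["dat_csz_opzione_tech"] == pd.Timestamp("2001-01-01")
df.loc[mask, "dat_csz_opzione_tech"] = today
print(df)
```

Assigning a Timestamp keeps the column as a datetime type, whereas assigning a formatted string would not.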
