Python - Parse object index with multiple time zones

How can I parse an object index in a DataFrame into its date, time, and time zone when it contains multiple time zones?
The format is "YYYY-MM-DD HH:MM:SS-HH:MM", where the trailing "-HH:MM" is the time zone, given as a UTC offset.
Example: Midnight Jan 1st, 2020 in Mountain Time, counting up:
2020-01-01 00:00:00-07:00
2020-01-01 01:00:00-07:00
2020-01-01 02:00:00-07:00
2020-01-01 04:00:00-06:00
I've got code that works for one time zone, but it breaks when a second timezone is introduced.
df['Date'] = pd.to_datetime(df.index)
df['year']= df['Date'].dt.year
df['month']= df['Date'].dt.month
df['month_n']= df['Date'].dt.month_name()
df['day']= df['Date'].dt.day
df['day_n']= df['Date'].dt.day_name()
df['h']= df['Date'].dt.hour
df['mn']= df['Date'].dt.minute
df['s']= df['Date'].dt.second
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc="True"

Use pandas.Series.apply instead:
df['Date'] = pd.to_datetime(df.index)
df_info = df['Date'].apply(lambda t: pd.Series({
'date': t.date(),
'year': t.year,
'month': t.month,
'month_n': t.strftime("%B"),
'day': t.day,
'day_n': t.strftime("%A"),
'h': t.hour,
'mn': t.minute,
's': t.second,
}))
df = pd.concat([df, df_info], axis=1)
# Output :
print(df)
Date date year month month_n day day_n h mn s
col
2020-01-01 00:00:00-07:00 2020-01-01 00:00:00-07:00 2020-01-01 2020 1 January 1 Wednesday 0 0 0
2020-01-01 01:00:00-07:00 2020-01-01 01:00:00-07:00 2020-01-01 2020 1 January 1 Wednesday 1 0 0
2020-01-01 02:00:00-07:00 2020-01-01 02:00:00-07:00 2020-01-01 2020 1 January 1 Wednesday 2 0 0
2020-01-01 04:00:00-06:00 2020-01-01 04:00:00-06:00 2020-01-01 2020 1 January 1 Wednesday 4 0 0
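
If pd.to_datetime(df.index) itself rejects or warns about the mixed offsets (newer pandas versions may be stricter about this unless utc=True is passed), the same per-row idea can be sketched with the standard library. This is a minimal sketch, assuming every index label is an ISO-like string as in the question; the column names hour and utc_offset are illustrative only.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame(
    {"v": [999, 999, 999, 999]},
    index=[
        "2020-01-01 00:00:00-07:00",
        "2020-01-01 01:00:00-07:00",
        "2020-01-01 02:00:00-07:00",
        "2020-01-01 04:00:00-06:00",
    ],
)

# Parse each label on its own so every timestamp keeps its own UTC offset.
parsed = [datetime.fromisoformat(label) for label in df.index]

df["date"] = [t.date() for t in parsed]
df["hour"] = [t.hour for t in parsed]
df["utc_offset"] = [t.strftime("%z") for t in parsed]  # e.g. "-0700"

print(df)
```

Like the apply version, this is a plain Python loop, so it trades speed for keeping the original wall-clock values.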

@abokey's answer is great if you aren't sure of the actual time zone or cannot work with UTC. However, you lose the dt accessor and the performance of a vectorized approach.
So if you can use UTC or set a time zone (at the moment you only have UTC offsets), e.g. "America/Denver", everything works as expected:
import pandas as pd
df = pd.DataFrame({'v': [999,999,999,999]},
index = ["2020-01-01 00:00:00-07:00",
"2020-01-01 01:00:00-07:00",
"2020-01-01 02:00:00-07:00",
"2020-01-01 04:00:00-06:00"])
df['Date'] = pd.to_datetime(df.index, utc=True)
print(df.Date.dt.hour)
# 2020-01-01 00:00:00-07:00 7
# 2020-01-01 01:00:00-07:00 8
# 2020-01-01 02:00:00-07:00 9
# 2020-01-01 04:00:00-06:00 10
# Name: Date, dtype: int64
# Note: hour changed since we converted to UTC !
or
df['Date'] = pd.to_datetime(df.index, utc=True).tz_convert("America/Denver")
print(df.Date.dt.hour)
# 2020-01-01 00:00:00-07:00 0
# 2020-01-01 01:00:00-07:00 1
# 2020-01-01 02:00:00-07:00 2
# 2020-01-01 04:00:00-06:00 3
# Name: Date, dtype: int64
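
If the original wall-clock hour and the offset of each row both matter, one possible vectorized sketch is to parse to UTC and then add back the offset taken from the text. It assumes every label ends in a "+HH:MM" or "-HH:MM" offset, as in the question; the names labels, offset, and local are made up for this example.

```python
import pandas as pd

labels = pd.Series([
    "2020-01-01 00:00:00-07:00",
    "2020-01-01 01:00:00-07:00",
    "2020-01-01 02:00:00-07:00",
    "2020-01-01 04:00:00-06:00",
])

# One unambiguous instant per row.
utc = pd.to_datetime(labels, utc=True)

# Rebuild the offset from the last six characters, e.g. "-07:00".
sign = labels.str[-6].map({"+": 1, "-": -1})
offset = pd.to_timedelta(labels.str[-5:] + ":00") * sign

# Local wall-clock time = UTC instant + offset, stored as naive datetimes.
local = utc.dt.tz_localize(None) + offset

result = pd.DataFrame({
    "local": local,
    "hour": local.dt.hour,
    "offset_hours": offset.dt.total_seconds() / 3600,
})
print(result)
```

The local column is time zone naive, so the offset column has to travel with it for the values to stay unambiguous.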

Related

Reshape dataframe into several columns based on date column

I want to rearrange my example dataframe (df.csv) below based on the date column. Each row represents one hour of data; for instance, for each of the dates 2002-01-01 and 2002-01-02 there are 5 rows, each representing 1 hour.
date,symbol
2002-01-01,A
2002-01-01,A
2002-01-01,A
2002-01-01,B
2002-01-01,A
2002-01-02,B
2002-01-02,B
2002-01-02,A
2002-01-02,A
2002-01-02,A
My expected output is as below.
date,hour1, hour2, hour3, hour4, hour5
2002-01-01,A,A,A,B,A
2002-01-02,B,B,A,A,A
I have tried the code below, as explained here: https://pandas.pydata.org/docs/user_guide/reshaping.html, but it doesn't work in my case because the symbol column contains duplicates.
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv')
pivoted = df.pivot(index="date", columns="symbol")
print(pivoted)
The data does not have the timestamps but only the date. However, each row for the same date represents an hourly interval, for instance the output could also be represented as below:
date,01:00, 02:00, 03:00, 04:00, 05:00
2002-01-01,A,A,A,B,A
2002-01-02,B,B,A,A,A
where hour1 represents 01:00, hour2 represents 02:00, etc.
You had the correct pivot approach, but you were missing a column 'time', so let's split the datetime into date and time:
s = pd.to_datetime(df['date'])
df['date'] = s.dt.date
df['time'] = s.dt.time
df2 = df.pivot(index='date', columns='time', values='symbol')
output:
time 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00
date
2002-01-01 A A A B A
2002-01-02 B B A A A
Alternatively for having a HH:MM time, use df['time'] = s.dt.strftime('%H:%M')
used input:
date,symbol
2002-01-01 01:00,A
2002-01-01 02:00,A
2002-01-01 03:00,A
2002-01-01 04:00,B
2002-01-01 05:00,A
2002-01-02 01:00,B
2002-01-02 02:00,B
2002-01-02 03:00,A
2002-01-02 04:00,A
2002-01-02 05:00,A
No time in the input:
If you really have no time in the input dates and need to 'invent' increasing ones, you could use groupby.cumcount (adding 1 because cumcount starts at 0, so that the first row of each date maps to 01:00):
df['time'] = pd.to_datetime(df.groupby('date').cumcount().add(1), format='%H').dt.strftime('%H:%M')
df2 = df.pivot(index='date', columns='time', values='symbol')
output:
time 01:00 02:00 03:00 04:00 05:00
date
2002-01-01 A A A B A
2002-01-02 B B A A A
To label each entry as hour1, hour2, and so on:
k = df.groupby("date").cumcount().add(1).astype(str).radd("hour")
out = df.pivot_table('symbol','date',k,aggfunc='first')
print(out)
hour1 hour2 hour3 hour4 hour5
date
2002-01-01 A A A B A
2002-01-02 B B A A A
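
The cumcount idea from the answers above can be put together as a self-contained sketch. It assumes the rows of each date are already in hourly order; the column name slot is invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2002-01-01"] * 5 + ["2002-01-02"] * 5,
    "symbol": list("AAABA") + list("BBAAA"),
})

# Number the rows within each date (1, 2, 3, ...) and turn that into a label.
df["slot"] = "hour" + df.groupby("date").cumcount().add(1).astype(str)

# Each (date, slot) pair is now unique, so a plain pivot needs no aggregation.
wide = df.pivot(index="date", columns="slot", values="symbol")
print(wide)
```

The columns are sorted as text, so with ten or more rows per date "hour10" would sort before "hour2"; zero-padded labels would avoid that.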
Here is one approach. It is probably not the most elegant, since I have to rename both the index and the columns, but it does the job.
new_cols = ['01:00', '02:00', '03:00', '04:00', '05:00']
df1 = df.loc[df['date']=='2002-01-01', :].T.drop('date').set_axis(new_cols, axis=1).set_axis(['2002-01-01'])
df2 = df.loc[df['date']=='2002-01-02', :].T.drop('date').set_axis(new_cols, axis=1).set_axis(['2002-01-02'])
result = pd.concat([df1,df2])
print(result)
Output:
01:00 02:00 03:00 04:00 05:00
2002-01-01 A A A B A
2002-01-02 B B A A A

Function to change dates based on time period?

I have a df of ids and dates. I'd like to set the same date for every row within a 2-day time period, but I'm having trouble writing a function for this. It's like the equivalent of a SQL OVER (PARTITION BY ...) clause.
Input:
d1 = {'id': ['a','a','a','a','b','a','b'], 'datetime': ['10/25/2021 0:00','10/26/2021 0:00','11/28/2021 0:00','11/29/2021 0:00','11/29/2021 0:00', '11/30/2021 0:00', '11/30/2021 0:00']}
df1 = pd.DataFrame(d1)
df1['datetime'] = pd.to_datetime(df1['datetime'])
Desired Output:
d3 = {'id': ['a','a','a','a','a','b','b'], 'datetime': ['10/25/2021 0:00','10/25/2021 0:00','11/28/2021 0:00','11/28/2021 0:00', '11/30/2021 0:00','11/29/2021 0:00','11/29/2021 0:00']}
df1 = pd.DataFrame(d3)
The solution I'm looking for should group by id sorted by datetime. With the first datetime value in that group, create a group of all rows within a 2 day time period and assign those rows with that first datetime value, then move on to the next date and repeat. Then move on to the next id.
Try this:
# Assumes df1['datetime'] was already converted with pd.to_datetime, as in the question.
oldest = {}  # id -> first datetime of that id's current 2-day window
# Visit rows ordered by id and datetime, but write back by label so the row order is kept.
for idx in df1.sort_values(by=['id', 'datetime']).index:
    key, ts = df1.at[idx, 'id'], df1.at[idx, 'datetime']
    # Start a new window for an unseen id, or once more than 1 full day has passed.
    if key not in oldest or (ts - oldest[key]).days > 1:
        oldest[key] = ts
    df1.at[idx, 'datetime'] = oldest[key]
The output would be:
id datetime
0 a 2021-10-25 00:00:00
1 a 2021-10-25 00:00:00
2 a 2021-11-28 00:00:00
3 a 2021-11-28 00:00:00
4 b 2021-11-29 00:00:00
5 a 2021-11-30 00:00:00
6 b 2021-11-29 00:00:00
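Try with groupby (this keeps every second row per id and forward-fills the rest, so it assumes the rows pair up two at a time within each id):
df1["datetime"] = pd.to_datetime(df1["datetime"])
output = df1.groupby("id", group_keys=False).apply(lambda x: x.iloc[::2].reindex(x.index).ffill()).sort_values(["id", "datetime"])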
>>> output
id datetime
0 a 2021-10-25
1 a 2021-10-25
2 a 2021-11-28
3 a 2021-11-28
5 a 2021-11-30
4 b 2021-11-29
6 b 2021-11-29
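
The rule stated in the question (take the first datetime of a group, cover everything within 2 days of it, then start again) can also be sketched as a small helper applied per id. This is only one possible reading of the requirement; window_starts and its days parameter are hypothetical names, not part of pandas.

```python
import pandas as pd


def window_starts(times, days=2):
    """Map each timestamp (sorted ascending) to the start of its window."""
    starts, current = [], None
    for ts in times:
        # Open a new window at the first row, or once `days` days have passed.
        if current is None or ts - current >= pd.Timedelta(days=days):
            current = ts
        starts.append(current)
    return pd.Series(starts, index=times.index)


df1 = pd.DataFrame({
    "id": ["a", "a", "a", "a", "b", "a", "b"],
    "datetime": pd.to_datetime(
        ["10/25/2021 0:00", "10/26/2021 0:00", "11/28/2021 0:00",
         "11/29/2021 0:00", "11/29/2021 0:00", "11/30/2021 0:00",
         "11/30/2021 0:00"],
        format="%m/%d/%Y %H:%M",
    ),
})

df1 = df1.sort_values(["id", "datetime"])
# Process each id separately; the assignment aligns on the row labels.
df1["datetime"] = pd.concat(
    [window_starts(group) for _, group in df1.groupby("id")["datetime"]]
)
print(df1)
```

Unlike the positional iloc[::2] trick, this keeps working when a window contains one row or more than two rows.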

Pandas datetime column increment day when reach midnight timestamp

I have a pandas column that contains only times of day, in increasing order.
I use to_datetime() to work with that column, but it assigns the same date to the whole column and does not increment it when the times pass midnight.
How can I tell it to increment the day when it crosses midnight?
rail[8].iloc[121]
rail[8].iloc[100]
printing these values outputs:
TIME 2020-11-19 00:18:00
Name: DSG, dtype: datetime64[ns]
TIME 2020-11-19 21:12:27
Name: KG, dtype: datetime64[ns]
whereas iloc[121] should be 2020-11-20
Sample data is like:
df1.columns = df1.iloc[0]
ids = df1.loc['TRAIN NO'].unique()
df1.drop('TRAIN NO',axis=0,inplace=True)
rail = {}
for i in range(len(ids)):
    rail[i] = df1.filter(like=ids[i])
    rail[i] = rail[i].reset_index()
    rail[i].rename(columns={0:'TRAIN NO'},inplace=True)
    rail[i] = pd.melt(rail[i],id_vars='TRAIN NO',value_name='TIME',var_name='trainId')
    rail[i].drop(columns='trainId',inplace=True)
    rail[i].rename(columns={'TRAIN NO': 'CheckPoints'},inplace=True)
    rail[i].set_index('CheckPoints',inplace=True)
    rail[i].dropna(inplace=True)
    rail[i]['TIME'] = pd.to_datetime(rail[i]['TIME'],infer_datetime_format=True)
CheckPoints TIME
DEPOT 2020-11-19 05:10:00
KG 2020-11-19 05:25:00
RI 2020-11-19 05:51:11
RI 2020-11-19 06:00:00
KG 2020-11-19 06:25:44
... ...
DSG 2020-11-19 23:41:50
ATHA 2020-11-19 23:53:56
NBAA 2020-11-19 23:58:00
NBAA 2020-11-19 00:01:00
DSG 2020-11-19 00:18:00
Could someone help me out..!
You can check where the timedelta of subsequent timestamps is less than 0 (= date changes). Use the cumsum of that and add it as a timedelta (days) to your datetime column:
import pandas as pd
df = pd.DataFrame({'time': ["23:00", "00:00", "12:00", "23:00", "01:00"]})
# cast time string to datetime, will automatically add today's date by default
df['datetime'] = pd.to_datetime(df['time'])
# get timedelta between subsequent timestamps in the column; df['datetime'].diff()
# compare to get a boolean mask where the change in time is negative (= new date)
m = df['datetime'].diff() < pd.Timedelta(0)
# m
# 0 False
# 1 True
# 2 False
# 3 False
# 4 True
# Name: datetime, dtype: bool
# the cumulated sum of that mask accumulates the booleans as 0/1:
# m.cumsum()
# 0 0
# 1 1
# 2 1
# 3 1
# 4 2
# Name: datetime, dtype: int32
# ...so we can use that as the date offset, which we add as timedelta to the datetime column:
df['datetime'] += pd.to_timedelta(m.cumsum(), unit='d')
df
time datetime
0 23:00 2020-11-19 23:00:00
1 00:00 2020-11-20 00:00:00
2 12:00 2020-11-20 12:00:00
3 23:00 2020-11-20 23:00:00
4 01:00 2020-11-21 01:00:00
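
Because pd.to_datetime fills in today's date, the output above depends on the day the code is run. A reproducible variant, sketched under the assumption that the first row belongs to a known start date (base is a made-up name), treats the times as durations since midnight:

```python
import pandas as pd

times = pd.Series(["23:00", "00:00", "12:00", "23:00", "01:00"])
base = pd.Timestamp("2020-11-19")  # assumed date of the first row

# Time of day as a duration since midnight ("23:00" -> "23:00:00").
time_of_day = pd.to_timedelta(times + ":00")

# A negative step between consecutive rows means midnight was crossed.
rollover = time_of_day.diff() < pd.Timedelta(0)

# The running count of rollovers is the number of days to add.
stamps = base + time_of_day + pd.to_timedelta(rollover.cumsum(), unit="D")
print(stamps)
```

As with the answer above, this only detects a rollover when the time of day actually decreases, so a gap of more than 24 hours between two rows would go unnoticed.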

Incorrect order of M/D in datetimes

I have a date column in my csv file
This is my Date column data
14/3/18
28/3/18
9/4/2018
How can I make all the years become 2018?
I have tried this
df['DateTime'] = pd.to_datetime(df['Date'])
print (df['DateTime'])
but it returns
1 2018-03-14
2 2018-03-28
3 2018-09-04
In the last row, 09 became the month, but 04 is supposed to be the month.
Add parameter dayfirst=True:
df['DateTime'] = pd.to_datetime(df['Date'], dayfirst=True)
print (df)
Date DateTime
0 14/3/18 2018-03-14
1 28/3/18 2018-03-28
2 9/4/2018 2018-04-09
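
The sample mixes two-digit and four-digit years, so a single inferred format may not fit every row in all pandas versions. A version-independent sketch is to try the explicit day-first formats one by one; parse_day_first and DAY_FIRST_FORMATS are hypothetical helper names for this example.

```python
from datetime import datetime

import pandas as pd

# Four-digit years are tried first, then two-digit years.
DAY_FIRST_FORMATS = ("%d/%m/%Y", "%d/%m/%y")


def parse_day_first(text):
    """Parse a day/month/year string using the first format that fits."""
    for fmt in DAY_FIRST_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


df = pd.DataFrame({"Date": ["14/3/18", "28/3/18", "9/4/2018"]})
df["DateTime"] = pd.to_datetime(df["Date"].map(parse_day_first))
print(df)
```

This runs row by row in Python, so for large files passing an explicit format to pd.to_datetime is likely faster when all rows share one format.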
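You can use .dt.strftime (note that this returns strings in year-day-month order; it does not change how the dates were parsed):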
df['DateTime'] = pd.to_datetime(df['DateTime']).dt.strftime("%Y-%d-%m")
Output:
0 2018-14-03
1 2018-28-03
2 2018-04-09
Name: A, dtype: object

How to split a datetime column efficiently to have timezone in a new column?

So, I am creating a yearly time series data taking DST into consideration as follows:
import pandas as pd
sd = '2020-01-01'
ed = '2021-01-01'
df = pd.date_range(sd, ed, freq='0.25H', tz='Europe/Berlin')
df = df.to_frame().reset_index(drop=True)
df.rename(columns={0:'dates'}, inplace=True)
The dates column also contains the UTC offset (+01:00 for CET and +02:00 for CEST). I want to split the dates column so that it holds only the timestamp in the format YYYY-MM-DD HH:MM, and create a new column named tz that holds the offset as a string, either +01 or +02.
I did:
df['dates'] = df['dates'].apply(lambda t: str(t))
df['tz'] = df['dates'].str.split('+').str[1]
df['tz'] = df['tz'].str.split(':').str[0]
df['dates'] = pd.to_datetime(df['dates'])
df['dates'] = df['dates'].apply(lambda t: t.strftime('%Y-%m-%d %H:%M'))
and this gives me the output as follows:
dates tz
2020-01-01 00:00 01
2020-01-01 00:15 01
2020-01-01 00:30 01
2020-01-01 00:45 01
2020-01-01 01:00 01
2020-01-01 01:15 01
2020-01-01 01:30 01
Now, I need help with a couple of things:
In the tz column as you can see the values are only 01, I want to know how can I include the '+' sign in the tz column while splitting it?
I know I can do it by doing:
df['tz'] = '+' + df['tz'].str.split(':').str[0]
But it seems very messy.
Is there a more efficient way of splitting the column after creating the original time-series (pd.date_range(sd, ed, freq='0.25H', tz='Europe/Berlin')) into the desired output?
Desired output
dates tz
2020-01-01 00:00 +01
2020-01-01 00:15 +01
2020-01-01 00:30 +01
2020-01-01 00:45 +01
2020-01-01 01:00 +01
2020-01-01 01:15 +01
2020-01-01 01:30 +01
In general, I'd advise against storing datetime type as string, especially those of non-standard format. However, if you insist, you can do:
# from the original dataframe
df['tz'] = df['dates'].astype(str).str.extract(r'(\+\d{2})')[0]
df['dates'] = df['dates'].dt.strftime('%Y-%m-%d %H:%M')
Or only one extract with more complex regex:
df['tz'] = ''
df[['dates', 'tz']] = df['dates'].astype(str).str.extract(r'([\d\- \:]+):\d{2}(.+):')
Output (head):
dates tz
0 2020-01-01 00:00 +01
1 2020-01-01 00:15 +01
2 2020-01-01 00:30 +01
3 2020-01-01 00:45 +01
4 2020-01-01 01:00 +01
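
A possible alternative that avoids regular expressions is to format the offset directly with the %z directive and keep its first three characters. The sketch below uses a short range around the 2020 spring DST change in Berlin so that both offsets appear; the "15min" frequency is equivalent to the question's '0.25H'.

```python
import pandas as pd

# Four quarter-hour steps across the switch from CET (+01) to CEST (+02).
dates = pd.date_range(
    "2020-03-29 01:30", periods=4, freq="15min", tz="Europe/Berlin"
)

out = pd.DataFrame({
    "dates": dates.strftime("%Y-%m-%d %H:%M"),
    # "%z" yields e.g. "+0100"; the first three characters are sign and hours.
    "tz": dates.strftime("%z").str[:3],
})
print(out)
```

Keeping only the hours would drop the minutes of time zones with half-hour offsets, which does not matter for Europe/Berlin.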
