I have a pandas column containing only timestamps (no dates), in increasing order.
I use to_datetime() to work with that column, but it automatically assigns the same date to every row and does not advance the day when the time crosses midnight.
How can I tell it to increment the day whenever the time crosses midnight?
Printing these values outputs:

rail[8].iloc[121]
TIME    2020-11-19 00:18:00
Name: DSG, dtype: datetime64[ns]

rail[8].iloc[100]
TIME    2020-11-19 21:12:27
Name: KG, dtype: datetime64[ns]

whereas iloc[121] should be dated 2020-11-20.
The code I use to build the data:

import pandas as pd

df1.columns = df1.iloc[0]
ids = df1.loc['TRAIN NO'].unique()
df1.drop('TRAIN NO', axis=0, inplace=True)

rail = {}
for i in range(len(ids)):
    rail[i] = df1.filter(like=ids[i])
    rail[i] = rail[i].reset_index()
    rail[i].rename(columns={0: 'TRAIN NO'}, inplace=True)
    rail[i] = pd.melt(rail[i], id_vars='TRAIN NO', value_name='TIME', var_name='trainId')
    rail[i].drop(columns='trainId', inplace=True)
    rail[i].rename(columns={'TRAIN NO': 'CheckPoints'}, inplace=True)
    rail[i].set_index('CheckPoints', inplace=True)
    rail[i].dropna(inplace=True)
    rail[i]['TIME'] = pd.to_datetime(rail[i]['TIME'], infer_datetime_format=True)

Sample data looks like this:
CheckPoints TIME
DEPOT 2020-11-19 05:10:00
KG 2020-11-19 05:25:00
RI 2020-11-19 05:51:11
RI 2020-11-19 06:00:00
KG 2020-11-19 06:25:44
... ...
DSG 2020-11-19 23:41:50
ATHA 2020-11-19 23:53:56
NBAA 2020-11-19 23:58:00
NBAA 2020-11-19 00:01:00
DSG 2020-11-19 00:18:00
Could someone help me out?
You can check where the timedelta of subsequent timestamps is less than 0 (= date changes). Use the cumsum of that and add it as a timedelta (days) to your datetime column:
import pandas as pd
df = pd.DataFrame({'time': ["23:00", "00:00", "12:00", "23:00", "01:00"]})
# cast time string to datetime, will automatically add today's date by default
df['datetime'] = pd.to_datetime(df['time'])
# the difference between subsequent timestamps, df['datetime'].diff(), is negative
# exactly where the time-of-day wraps around, i.e. where a new date starts
m = df['datetime'].diff() < pd.Timedelta(0)
# m
# 0 False
# 1 True
# 2 False
# 3 False
# 4 True
# Name: datetime, dtype: bool
# the cumulated sum of that mask accumulates the booleans as 0/1:
# m.cumsum()
# 0 0
# 1 1
# 2 1
# 3 1
# 4 2
# Name: datetime, dtype: int32
# ...so we can use that as the date offset, which we add as timedelta to the datetime column:
df['datetime'] += pd.to_timedelta(m.cumsum(), unit='d')
df
time datetime
0 23:00 2020-11-19 23:00:00
1 00:00 2020-11-20 00:00:00
2 12:00 2020-11-20 12:00:00
3 23:00 2020-11-20 23:00:00
4 01:00 2020-11-21 01:00:00
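Applied to the data from the question, the same idea would look roughly like this (a sketch, assuming rail[i]['TIME'] is sorted in travel order as shown above):

m = rail[i]['TIME'].diff() < pd.Timedelta(0)              # True wherever the clock wraps past midnight
rail[i]['TIME'] += pd.to_timedelta(m.cumsum(), unit='d')  # shift every later block by one more day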
Related
Python Q: How can I parse an object index in a data frame into its date, time, and time zone when it has multiple time zones?
The format is "YYYY-MM-DD HH:MM:SS-HH:MM", where the trailing "HH:MM" is the UTC offset.
Example: Midnight Jan 1st, 2020 in Mountain Time, counting up:
2020-01-01 00:00:00-07:00
2020-01-01 01:00:00-07:00
2020-01-01 02:00:00-07:00
2020-01-01 04:00:00-06:00
I've got code that works for one time zone, but it breaks when a second timezone is introduced.
df['Date'] = pd.to_datetime(df.index)
df['year']= df['Date'].dt.year
df['month']= df['Date'].dt.month
df['month_n']= df['Date'].dt.month_name()
df['day']= df['Date'].dt.day
df['day_n']= df['Date'].dt.day_name()
df['h']= df['Date'].dt.hour
df['mn']= df['Date'].dt.minute
df['s']= df['Date'].dt.second
ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc="True"
Use pandas.Series.apply instead:
df['Date'] = pd.to_datetime(df.index)
df_info = df['Date'].apply(lambda t: pd.Series({
'date': t.date(),
'year': t.year,
'month': t.month,
'month_n': t.strftime("%B"),
'day': t.day,
'day_n': t.strftime("%A"),
'h': t.hour,
'mn': t.minute,
's': t.second,
}))
df = pd.concat([df, df_info], axis=1)
# Output :
print(df)
Date date year month month_n day day_n h mn s
col
2020-01-01 00:00:00-07:00 2020-01-01 00:00:00-07:00 2020-01-01 2020 1 January 1 Wednesday 0 0 0
2020-01-01 01:00:00-07:00 2020-01-01 01:00:00-07:00 2020-01-01 2020 1 January 1 Wednesday 1 0 0
2020-01-01 02:00:00-07:00 2020-01-01 02:00:00-07:00 2020-01-01 2020 1 January 1 Wednesday 2 0 0
2020-01-01 04:00:00-06:00 2020-01-01 04:00:00-06:00 2020-01-01 2020 1 January 1 Wednesday 4 0 0
@abokey's answer is great if you aren't sure of the actual time zone or cannot work with UTC. However, you don't get the dt accessor and you lose the performance of a vectorized approach.
So if you can use UTC or set a proper time zone (at the moment you only have a UTC offset), e.g. "America/Denver", everything works as expected:
import pandas as pd
df = pd.DataFrame({'v': [999,999,999,999]},
index = ["2020-01-01 00:00:00-07:00",
"2020-01-01 01:00:00-07:00",
"2020-01-01 02:00:00-07:00",
"2020-01-01 04:00:00-06:00"])
df['Date'] = pd.to_datetime(df.index, utc=True)
print(df.Date.dt.hour)
# 2020-01-01 00:00:00-07:00 7
# 2020-01-01 01:00:00-07:00 8
# 2020-01-01 02:00:00-07:00 9
# 2020-01-01 04:00:00-06:00 10
# Name: Date, dtype: int64
# Note: the hour changed since we converted to UTC!
or
df['Date'] = pd.to_datetime(df.index, utc=True).tz_convert("America/Denver")
print(df.Date.dt.hour)
# 2020-01-01 00:00:00-07:00 0
# 2020-01-01 01:00:00-07:00 1
# 2020-01-01 02:00:00-07:00 2
# 2020-01-01 04:00:00-06:00 3
# Name: Date, dtype: int64
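With a proper time zone set, the dt accessor is available again, so the component extraction from the question works vectorized. A minimal sketch, assuming the "America/Denver" conversion above:

df['year'] = df['Date'].dt.year
df['month'] = df['Date'].dt.month
df['month_n'] = df['Date'].dt.month_name()
df['day'] = df['Date'].dt.day
df['day_n'] = df['Date'].dt.day_name()
df['h'] = df['Date'].dt.hour
df['mn'] = df['Date'].dt.minute
df['s'] = df['Date'].dt.second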
I am trying to convert an object-type variable to datetime. The 'Time' column looks like this:

0    13:08:00
1    10:29:00
2    13:23:00
3    20:33:00
4    10:37:00

but pd.to_datetime(df['Time']) raises:

Error: <class 'datetime.time'> is not convertible to datetime

How can I convert this object column to datetime and merge it with a date variable?
What you have are datetime.time objects, as the error tells you. You can use their string representation and parse it to pandas datetime or timedelta, depending on your needs. Here are three options, for example:
import datetime
import pandas as pd
df = pd.DataFrame({'Time': [datetime.time(13,8), datetime.time(10,29), datetime.time(13,23)]})
# 1)
# use string representation and parse to datetime:
pd.to_datetime(df['Time'].astype(str))
# 0 2022-01-19 13:08:00
# 1 2022-01-19 10:29:00
# 2 2022-01-19 13:23:00
# Name: Time, dtype: datetime64[ns]
# 2)
# add as timedelta to a certain date:
pd.Timestamp('2020-1-1') + pd.to_timedelta(df['Time'].astype(str))
# 0 2020-01-01 13:08:00
# 1 2020-01-01 10:29:00
# 2 2020-01-01 13:23:00
# Name: Time, dtype: datetime64[ns]
# 3)
# add the cumulated sum of the timedelta to a starting date:
pd.Timestamp('2020-1-1') + pd.to_timedelta(df['Time'].astype(str)).cumsum()
# 0 2020-01-01 13:08:00
# 1 2020-01-01 23:37:00
# 2 2020-01-02 13:00:00
# Name: Time, dtype: datetime64[ns]
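If the goal is to combine the times with an actual date column, one option is to concatenate the string representations and parse them in one go. A sketch, assuming a hypothetical 'Date' column of date strings alongside 'Time':

df['Date'] = ['2020-01-01', '2020-01-02', '2020-01-03']  # hypothetical date column
df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'].astype(str))
# 0   2020-01-01 13:08:00
# 1   2020-01-02 10:29:00
# 2   2020-01-03 13:23:00
# Name: DateTime, dtype: datetime64[ns]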
df['col'] = df['col'].astype('datetime64[ns]')
This worked for me.
Name    Gun_time   Net_time   Pace
John    28:48:00   28:47:00   4:38:00
George  29:11:00   29:10:00   4:42:00
Mike    29:38:00   29:37:00   4:46:00
Sarah   29:46:00   29:46:00   4:48:00
Roy     30:31:00   30:30:00   4:55:00
Q1. How can I add another column with the difference between Gun_time and Net_time?
Q2. How can I calculate the mean of Gun_time and Net_time? Please help!
I have tried the following, but it doesn't work:
df['Difference'] = df['Gun_time'] - df['Net_time']
For the mean value I tried df['Gun_time'].mean, but that doesn't work either.
Q3. What if the times are in 28:48 (minutes and seconds) format instead of 28:48:00? Then the function raises a value error:
ValueError: expected hh:mm:ss format
Convert your columns to dtype timedelta, e.g. like this:
for col in ("Gun_time", "Net_time", "Pace"):
df[col] = pd.to_timedelta(df[col])
Now you can do calculations like
df['Gun_time'].mean()
# Timedelta('1 days 05:34:48')
or
df['Difference'] = df['Gun_time'] - df['Net_time']
#df['Difference']
# 0 0 days 00:01:00
# 1 0 days 00:01:00
# 2 0 days 00:01:00
# 3 0 days 00:00:00
# 4 0 days 00:01:00
# Name: Difference, dtype: timedelta64[ns]
If you need nicer output to string, you can use
def timedeltaToString(td):
hours, remainder = divmod(td.total_seconds(), 3600)
minutes, seconds = divmod(remainder, 60)
return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}"
df['diffString'] = df['Difference'].apply(timedeltaToString)
# df['diffString']
# 0 00:01:00
# 1 00:01:00
# 2 00:01:00
# 3 00:00:00
# 4 00:01:00
#Name: diffString, dtype: object
See also Format timedelta to string.
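Regarding Q3 (values like 28:48 that mean minutes and seconds): pd.to_timedelta expects hh:mm:ss, so one option is to prepend "00:" before parsing. A minimal sketch, assuming all values really are mm:ss strings:

s = pd.Series(['28:48', '29:11', '29:38'])
td = pd.to_timedelta('00:' + s)  # '00:' is broadcast onto every string in the Series
# 0   0 days 00:28:48
# 1   0 days 00:29:11
# 2   0 days 00:29:38
# dtype: timedelta64[ns]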
I am trying to create a proper bin for a timestamp interval column,
using code such as
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00', '00:30:00',
                                         '00:40:00', '00:50:00', '00:60:00']))
The resulting df looks like:
time_interval | bin
00:17:00 (0 days 00:10:00, 0 days 00:20:00]
01:42:00 NaN
00:15:00 (0 days 00:10:00, 0 days 00:20:00]
00:00:00 NaN
00:06:00 (0 days 00:00:00, 0 days 00:10:00]
This is a little off: I want just the time value, not the days, and I also want the upper limit of the last bin to be 60 minutes or inf (or more).
Desired Output:
time_interval | bin
00:17:00 (00:10:00,00:20:00]
01:42:00 (00:60:00,inf]
00:15:00 (00:10:00,00:20:00]
00:00:00 (00:00:00,00:10:00]
00:06:00 (00:00:00,00:10:00]
Thanks for looking!
inf does not exist for timedeltas in pandas, so the maximal timedelta value is used instead. The parameter include_lowest=True is also used so the lowest values are included. If you want the bins filled with timedeltas:
b = pd.to_timedelta(['00:00:00','00:10:00','00:20:00',
'00:30:00','00:40:00',
'00:50:00','00:60:00'])
b = b.append(pd.Index([pd.Timedelta.max]))
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b)
print (df)
time_interval Bin
0 00:17:00 (0 days 00:10:00, 0 days 00:20:00]
1 01:42:00 (0 days 01:00:00, 106751 days 23:47:16.854775]
2 00:15:00 (0 days 00:10:00, 0 days 00:20:00]
3 00:00:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
4 00:06:00 (-1 days +23:59:59.999999, 0 days 00:10:00]
If you want strings instead of timedeltas, append 'inf' to the values and use zip to create the labels:
vals = ['00:00:00','00:10:00','00:20:00',
'00:30:00','00:40:00', '00:50:00','00:60:00']
b = pd.to_timedelta(vals).append(pd.Index([pd.Timedelta.max]))
vals.append('inf')
labels = ['{}-{}'.format(i, j) for i, j in zip(vals[:-1], vals[1:])]
df['Bin'] = pd.cut(df['time_interval'], include_lowest=True, bins=b, labels=labels)
print (df)
time_interval Bin
0 00:17:00 00:10:00-00:20:00
1 01:42:00 00:60:00-inf
2 00:15:00 00:10:00-00:20:00
3 00:00:00 00:00:00-00:10:00
4 00:06:00 00:00:00-00:10:00
You could just use labels to solve it:
df['Bin'] = pd.cut(df['interval_length'],
                   bins=pd.to_timedelta(['00:00:00', '00:10:00', '00:20:00', '00:30:00',
                                         '00:40:00', '00:50:00', '00:60:00', '24:00:00']),
                   labels=['(00:00:00,00:10:00]', '(00:10:00,00:20:00]', '(00:20:00,00:30:00]',
                           '(00:30:00,00:40:00]', '(00:40:00,00:50:00]', '(00:50:00,00:60:00]',
                           '(00:60:00,inf]'])
I have a date column where the missing values (NaT in Python) need to be filled in a loop, incrementing by one day each time, i.e. 1/1/2015, 1/2/2015, 1/3/2015, ...
Can anyone help me out?
This will add an incremental date to your dataframe.
import pandas as pd
import datetime as dt

ddict = {
    'Date': ['2014-12-29', '2014-12-30', '2014-12-31', '', '', '', '']
}
data = pd.DataFrame(ddict)
data['Date'] = pd.to_datetime(data['Date'])

def fill_dates(data_frame, date_col='Date'):
    ### Seconds in a day (3600 seconds per hour x 24 hours per day)
    day_s = 3600 * 24
    ### Create a timedelta variable for adding 1 day
    _day = dt.timedelta(seconds=day_s)
    ### Get the max non-null date
    max_dt = data_frame[date_col].max()
    ### Get the index of missing date values
    NaT_index = data_frame[data_frame[date_col].isnull()].index
    ### Loop through the index; set the incremental date value; grow the offset by 1 day
    for i in NaT_index:
        data_frame.loc[i, date_col] = max_dt + _day
        _day += dt.timedelta(seconds=day_s)

### Execute function
fill_dates(data, 'Date')
Initial data frame:
Date
0 2014-12-29
1 2014-12-30
2 2014-12-31
3 NaT
4 NaT
5 NaT
6 NaT
After running the function:
Date
0 2014-12-29
1 2014-12-30
2 2014-12-31
3 2015-01-01
4 2015-01-02
5 2015-01-03
6 2015-01-04
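A loop-free alternative (a sketch, applied to the original frame before running fill_dates, and assuming the NaT values form one block at the end as in the sample): build a date_range that starts one day after the last known date and assign it to the missing positions:

nat_mask = data['Date'].isna()
fill = pd.date_range(start=data['Date'].max() + pd.Timedelta(days=1),
                     periods=nat_mask.sum(), freq='D')
data.loc[nat_mask, 'Date'] = fill
# 3   2015-01-01
# 4   2015-01-02
# 5   2015-01-03
# 6   2015-01-04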