Converting Pandas column from days to days, hours, minutes

I am trying to convert a column df["time_ro_reply"], which contains only days as decimals, to a timedelta format that shows days, hours and minutes. This makes it more human-readable.
I am reading about pd.to_timedelta, but I am struggling to implement it:
pd.to_timedelta(df["time_ro_reply"])
This returns only 0.
Sample input:
df["time_ro_reply"]
1.881551
0.903264
2.931560
2.931560
Expected output:
df["time_ro_reply"]
1 days 19 hours 4 minutes
0 days 23 hours 2 minutes
2 days 2 hours 23 minutes
2 days 2 hours 23 minutes

I suggest using a custom function as follows:
import pandas as pd
# creating the provided dataframe
df = pd.DataFrame([1.881551, 0.903264, 2.931560, 2.931560],
                  columns=["time_ro_reply"])
# this function converts a time as a decimal of days into the desired format
def convert_time(time):
    # calculate the days and remaining time
    days, remaining = divmod(time, 1)
    # calculate the hours and remaining time
    hours, remaining = divmod(remaining * 24, 1)
    # calculate the minutes
    minutes = divmod(remaining * 60, 1)[0]
    # a list of the strings, rounding the time values
    strings = [str(round(days)), 'days',
               str(round(hours)), 'hours',
               str(round(minutes)), 'minutes']
    # return the strings concatenated to a single string
    return ' '.join(strings)
# add a new column to the dataframe by applying the function
# to all values of the column 'time_ro_reply' using .apply()
df["desired_output"] = df["time_ro_reply"].apply(convert_time)
This yields the following dataframe:
   time_ro_reply              desired_output
0       1.881551   1 days 21 hours 9 minutes
1       0.903264  0 days 21 hours 40 minutes
2       2.931560  2 days 22 hours 21 minutes
3       2.931560  2 days 22 hours 21 minutes
However, this yields different outputs from the ones you described. If the 'time_ro_reply' values are indeed to be interpreted as decimal days, I don't see how you got your expected results. Do you mind sharing how you got them?
I hope the comments explain the code well enough. If not, and you are unfamiliar with syntax such as divmod() or apply(), I suggest looking them up in the Python / pandas documentation.
Let me know if this helps.
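As a possible alternative, pd.to_timedelta can do the conversion itself if it is told that the numbers are days. This is only a sketch, assuming the column really holds decimal days. For numeric input pandas defaults to unit="ns", which would explain why the call in the question appears to return only 0.

```python
import pandas as pd

df = pd.DataFrame({"time_ro_reply": [1.881551, 0.903264, 2.931560, 2.931560]})

# unit="D" tells pandas the floats are days (the default for numbers is nanoseconds)
td = pd.to_timedelta(df["time_ro_reply"], unit="D")

# .dt.components splits each timedelta into integer days, hours, minutes, ...
parts = td.dt.components

# build the human-readable string from the integer components
df["desired_output"] = (
    parts["days"].astype(str) + " days "
    + parts["hours"].astype(str) + " hours "
    + parts["minutes"].astype(str) + " minutes"
)
print(df)
```

Keeping the intermediate timedelta Series (td here) around is useful if you still want to sort or do arithmetic on the values, since the string column is for display only.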

Related

Fix pandas wrongly interpreting timestamps hh:mm:ss into yyyy-dd-mmThh:mm:ss

I need to import a .xlsx sheet into pandas which has a column for the processing time of an associated activity. All entries in this column look somewhat like this:
01:20:34
12:22:30
25:01:02
155:20:56
These say how many hours, minutes and seconds were needed. When I use pd.read_excel, pandas correctly interprets each of the timestamps with less than 24 hours and reads them as above in the first two cases. The timestamps with more than 24 hours (the last two), on the other hand, are converted into datetime objects, which look like this: 1900-01-02T14:58:03 instead of 62:58:03.
Is there a simple solution?
I think that part of the problem is not in Python/pandas, but in Excel. The date '1900-01-01' is the base date used by Excel, represented by the number 1. You can check this: if you write 0 in a cell and then format that cell as a date, you get '1900-01-00', and with 1 you get '1900-01-01'.
So, try to export your Excel file to a CSV file before importing to pandas and then import this way:
import pandas as pd
df1 = pd.read_csv('sample_data.csv')
In this case, you get this DataFrame with the column duration as a string (I added a column id for reference).
    duration  id
0   01:20:34   1
1   12:22:30   2
2   25:01:02   3
3  155:20:56   4
Then, for your purpose, I suggest that you do not try to convert those values to a datetime type, but to a timedelta. One strategy is to split the strings on the colons and build a timedelta instance from the three fields: hours, minutes, and seconds.
import datetime as dt
def converter1(x):
    # split 'hh:mm:ss' into its integer fields
    vals = [int(val) for val in x.split(':')]
    return dt.timedelta(hours=vals[0], minutes=vals[1], seconds=vals[2])
df1['deltat'] = df1['duration'].apply(converter1)
    duration  id          deltat
0   01:20:34   1 0 days 01:20:34
1   12:22:30   2 0 days 12:22:30
2   25:01:02   3 1 days 01:01:02
3  155:20:56   4 6 days 11:20:56
If you need to convert those values to decimal hours or other new fields, use the total_seconds() method of timedelta:
df1['deltat_hr'] = df1['deltat'].apply(lambda x: x.total_seconds()/3600)
    duration  id          deltat   deltat_hr
0   01:20:34   1 0 days 01:20:34    1.342778
1   12:22:30   2 0 days 12:22:30   12.375000
2   25:01:02   3 1 days 01:01:02   25.017222
3  155:20:56   4 6 days 11:20:56  155.348889
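The same split-and-build idea can also be written without a per-row function. The following is a sketch under the assumption that every entry has exactly three colon-separated integer fields. The DataFrame is rebuilt inline here instead of being read from sample_data.csv, and the column names h, m and s are made up for the illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"duration": ["01:20:34", "12:22:30", "25:01:02", "155:20:56"]})

# split 'hh:mm:ss' into three integer columns (names are illustrative)
parts = df1["duration"].str.split(":", expand=True).astype(int)
parts.columns = ["h", "m", "s"]

# total seconds per row, then one vectorized conversion to timedelta
total_seconds = parts["h"] * 3600 + parts["m"] * 60 + parts["s"]
df1["deltat"] = pd.to_timedelta(total_seconds, unit="s")

# decimal hours via the .dt accessor instead of apply()
df1["deltat_hr"] = df1["deltat"].dt.total_seconds() / 3600
print(df1)
```

This gives a column with a real timedelta64 dtype, so the .dt accessor is available on it afterwards.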

Counting consecutive days of temperature data

So I have some sea surface temperature anomaly data. These data have been filtered down so that these are the values that are below a certain threshold. However, I am trying to identify cold spells - that is, to isolate events that last longer than 5 consecutive days. A sample of my data is below (I've been working between xarray datasets/dataarrays and pandas dataframes). Note, the 'day' is the day number of the month I am looking at (eventually will be expanded to the whole year). I have been scouring SO/the internet for ways to extract these 5-day-or-longer events based on the 'day' column, but I haven't gotten anything to work. I'm still relatively new to coding so my first thought was looping over the rows of the 'day' column but I'm not sure. Any insight is appreciated.
Here's what some of my data look like as a pandas df:
          lat      lon        time  day      ssta
5940   24.125  262.375  1984-06-03    3 -1.233751
21072  24.125  262.375  1984-06-04    4 -1.394495
19752  24.125  262.375  1984-06-05    5 -1.379742
10223  24.125  262.375  1984-06-27   27 -1.276407
47355  24.125  262.375  1984-06-28   28 -1.840763
...       ...      ...         ...  ...       ...
16738  30.875  278.875  2015-06-30   30 -1.345640
3739   30.875  278.875  2020-06-16   16 -1.212824
25335  30.875  278.875  2020-06-17   17 -1.446407
41891  30.875  278.875  2021-06-01    1 -1.714249
27740  30.875  278.875  2021-06-03    3 -1.477497
64228 rows × 5 columns
As a filtered xarray:
xarray.Dataset
Dimensions: lat: 28, lon: 68, time: 1174
Coordinates:
time (time) datetime64[ns] 1982-06-01 ... 2021-06-04
lon (lon) float32 262.1 262.4 262.6 ... 278.6 278.9
lat (lat) float32 24.12 24.38 24.62 ... 30.62 30.88
day (time) int64 1 2 3 4 5 6 7 ... 28 29 30 1 2 3 4
Data variables:
ssta (time, lat, lon) float32 nan nan nan nan ... nan nan nan nan
Attributes: (0)
TLDR; I want to identify (and retain the information of) events that are 5+ consecutive days, ie if there were a day 3 through day 8, or day 21 through day 30, etc.
I think that, rather than filtering your original data, you should do it the pandas way, which in this case means obtaining a Series of True/False values that reflects your condition.
Your data seems not to include temperatures so here is my example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'temp':np.random.randint(10,high=40,size=64228,dtype='int64')})
This generates a DataFrame with a single column containing random integer temperatures from 10 to 39 degrees (the upper bound of np.random.randint is exclusive). Notice that I can just work with the auto-generated index, but you might have to switch it to a column like time or date using .set_index. Say we are interested in the consecutive days with more than 30 degrees.
is_over_30 = df['temp'] > 30
will give us a True/False Series with that information. This format is very useful since we can index with it, e.g. df[is_over_30] gives the rows of the dataframe for days where the temperature is over 30 degrees. Now we want to shift the True/False values in is_over_30 by one position, so that each day is paired with the following day, and generate a new series that is True only where both are True, like so:
is_over_30 & np.roll(is_over_30, -1)
Basically we are done here and could chain 3 more of those & rolls. But there is a more concise way to write it:
from functools import reduce
is_consecutively_over_30 = reduce(lambda a,b: a&b, [np.roll(is_over_30, -i) for i in range(5)])
Keep in mind that np.roll wraps around: the first values are moved to the end, so the last 4 positions can be flagged even though a 5-day run cannot start there. You can set the last 4 values to False to resolve this.
is_consecutively_over_30[-4:] = False
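If the wrap-around is a concern, a rolling window is one way to get the same flags without it. This is a sketch, not a drop-in replacement: the temperature values below are made up, and the names window, ends and starts are only illustrative.

```python
import pandas as pd

# made-up temperatures: days 0-4 are all over 30, later runs are shorter
temps = pd.Series([31, 32, 33, 34, 35, 20, 31, 32, 33, 20])
is_over_30 = temps > 30

window = 5
# True on the last day of every window in which all `window` days were over 30
ends = is_over_30.astype(int).rolling(window).sum().eq(window)
# move the flag back to the first day of the window; nothing wraps around
starts = ends.shift(-(window - 1), fill_value=False)
print(starts)
```

Here starts is True only at position 0, which matches what the np.roll version marks (the first day of each 5-day run) once its last 4 values have been reset.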
You can pull the day ranges of the spells using this approach:
import numpy as np
import pandas as pd
min_spell_days = 6
days = {'day': [1,2,5,6,7,8,9,10,17,19,21,22,23,24,25,26,27,31]}
df = pd.DataFrame(days)
Find number of days between consecutive entries:
diff = df['day'].diff()
Mark the last day of a spell:
df['last'] = (diff == 1) & (diff.shift(-1) > 1)
Accumulate the number of days in each spell:
df['diff0'] = np.where(diff > 1, 0, diff)
df['cs'] = df['diff0'].eq(0).cumsum()
df['spell_days'] = df.groupby('cs')['diff0'].transform('cumsum')
Mark the last entry as the last day of a spell if applicable:
if diff.iat[-1] == 1:
    # assign through df.loc to avoid chained assignment, which may silently do nothing
    df.loc[df.index[-1], 'last'] = True
Select the last day of all qualifying spells:
df_spells = (df[df['last'] & (df['spell_days'] >= (min_spell_days-1))]).copy()
Identify the start, end and duration of each spell:
df_spells['end_day'] = df_spells['day']
df_spells['start_day'] = (df_spells['day'] - df['spell_days'])
df_spells['spell_days'] = df['spell_days'] + 1
Resulting df:
df_spells[['start_day','end_day','spell_days']].astype('int')
    start_day  end_day  spell_days
7           5       10           6
16         21       27           7
Also, instead of the day of the month, 'day' could hold a serial day number relative to some base date, such as 1/1/1900. That way, spells that span month and year boundaries could be handled, and the serial number is trivial to convert back to a date using date arithmetic.
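Working directly on the 'time' column is another way to handle month and year boundaries. The sketch below assumes a DataFrame shaped like the one in the question (lat, lon, time) that has already been filtered to the cold days. The sample rows are invented, the spell_id and spell_len columns are hypothetical helper names, and the 5-day threshold follows the TLDR in the question.

```python
import pandas as pd

# invented sample for one grid cell: June 3-8 is a 6-day run, June 27-28 a 2-day run
df = pd.DataFrame({
    "lat": 24.125,
    "lon": 262.375,
    "time": pd.to_datetime(
        ["1984-06-03", "1984-06-04", "1984-06-05", "1984-06-06",
         "1984-06-07", "1984-06-08", "1984-06-27", "1984-06-28"]
    ),
})

df = df.sort_values(["lat", "lon", "time"]).reset_index(drop=True)

# gap to the previous row within the same grid cell (NaT for the first row of a cell)
gap = df.groupby(["lat", "lon"])["time"].diff()
# a new spell starts wherever the gap is not exactly one day
df["spell_id"] = (gap != pd.Timedelta(days=1)).cumsum()

# number of days in the spell each row belongs to
df["spell_len"] = df.groupby("spell_id")["time"].transform("count")

# keep only rows that are part of a spell of 5 or more consecutive days
cold_spells = df[df["spell_len"] >= 5]

# one line per spell: first day, last day, number of days
summary = cold_spells.groupby(["lat", "lon", "spell_id"])["time"].agg(["min", "max", "count"])
print(summary)
```

Because the comparison is done on real dates, a run from June 29 to July 3 would be counted as one spell once the data covers the whole year.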

pd.PeriodIndex is only giving me 12 hours result when I use freq='H'

I'm using this code to group observations by hour into a dataframe. I first format the date/time column labels the way I want.
timedf.columns = pd.to_datetime(timedf.columns, dayfirst=True, errors='ignore').strftime('%Y/%m/%d %H:%M:%S')
timedf = timedf.groupby(pd.PeriodIndex(timedf.columns, freq='H'), axis=1).sum()
Even though I have observations for all 24 hours and expect a (24,1) dataframe, I actually get a (13,1) dataframe, covering hours 0 to 12 instead of 0 to 23.
What could be the cause?
Thank you.

Python Pandas Column to Minutes

I've subtracted two datetimes from each other, like so:
df['Time Difference'] = df['Time 1'] - df['Time 2']
resulting in a timedelta object. I need the total number of minutes from this object, but I can't for the life of me figure it out. Currently, the "Time Difference" column looks like this:
1 0 days 00:01:00.000000000
2 0 days 00:04:00.000000000
3 0 days 00:03:00.000000000
4 0 days 00:01:00.000000000
5 0 days 00:03:00.000000000
I've tried dividing by a numpy timedelta (which seems to be the most common suggestion) as well as by pandas timedelta, as well as a few other things. Operations such as df['Time Difference'].seconds, or .seconds(), or .total_seconds, (all suggestions I've seen for this), all give errors. I'm really at a loss for what to do here. I need this in minutes in order to make graphs in matplotlib, and I'm kind of stuck until I figure this out, so any suggestions are very much appreciated. Thanks!
Use dt.total_seconds() and divide by 60 to get the minutes:
import pandas as pd
df = pd.DataFrame({'td': pd.to_timedelta(['0 days 00:01:00.000000000',
                                          '0 days 00:04:00.000000000',
                                          '0 days 00:03:00.000000000',
                                          '1 days 00:01:00.000000000',
                                          '0 days 00:03:00.000000000'])})
df['delta_min'] = df['td'].dt.total_seconds() / 60
# df['delta_min']
# 0 1.0
# 1 4.0
# 2 3.0
# 3 1441.0
# 4 3.0
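Dividing by a one-minute Timedelta is an equivalent way to get the same numbers. This is a sketch with a shortened version of the data above. The pd.to_timedelta call on the column is an assumption on my part: it only matters if the column has ended up as object dtype, which is one possible reason why the .dt accessor or the division raises an error.

```python
import pandas as pd

df = pd.DataFrame({"td": ["0 days 00:01:00", "0 days 00:04:00", "1 days 00:01:00"]})

# make sure the column is timedelta64 and not strings/objects
df["td"] = pd.to_timedelta(df["td"])

# timedelta / timedelta gives a plain float: the number of minutes
df["delta_min"] = df["td"] / pd.Timedelta(minutes=1)
print(df)
```

The resulting column is float64, so it can be passed straight to matplotlib.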

How to call previous two week same day value in python

I am trying to fetch the data for the same weekday in previous weeks and then take the average of the value ("current_demand") as today's forecast (prediction).
For example: today is Monday, so I want to fetch the data from the Mondays of the last two weeks, for the same time or block, and then average the value ["current_demand"] to predict today's value.
Input Data:
current_demand Date Blockno weekday
18839 01-06-2018 1 4
18836 01-06-2018 2 4
12256 02-06-2018 1 5
12266 02-06-2018 2 5
17957 08-06-2018 1 4
17986 08-06-2018 2 4
18491 09-06-2018 1 5
18272 09-06-2018 2 5
Expecting result:
18398 15-06-2018 1 4
Something like that. I want to take the value for the same block and the same weekday from the previous two weeks, then average them to get the next value.
I have tried something:
def forecast(DATA):
    df = DATA
    day = {0:'Monday',1:'Tuesday',2:'Wednesday',3:'Thursday',4:'Friday',5:'Saturday',6:'Sunday'}
    df.friday = day - timedelta(days=day.weekday() + 3)
    print df
forecast(DATA)
Please suggest me something. Thank you in advance
I like relativedelta for this kind of job:
import datetime
from dateutil.relativedelta import relativedelta
(datetime.datetime.today() + relativedelta(weeks=-2)).date()
Output (the exact date depends on the day the code is run):
datetime.date(2018, 7, 23)
Without the actual structure of your df, it's hard to provide a solution tailored to your needs.
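For what it's worth, here is a rough sketch of the two-week average based on the sample input in the question. It assumes the Date strings are in day-month-year order. The forecast signature (df, target_date, weeks) is something I made up for the illustration, not taken from the question.

```python
import pandas as pd

# sample input from the question
data = pd.DataFrame({
    "current_demand": [18839, 18836, 12256, 12266, 17957, 17986, 18491, 18272],
    "Date": ["01-06-2018", "01-06-2018", "02-06-2018", "02-06-2018",
             "08-06-2018", "08-06-2018", "09-06-2018", "09-06-2018"],
    "Blockno": [1, 2, 1, 2, 1, 2, 1, 2],
})
# assumption: dates are written as day-month-year
data["Date"] = pd.to_datetime(data["Date"], format="%d-%m-%Y")

def forecast(df, target_date, weeks=2):
    """Average current_demand per block over the same weekday of the previous `weeks` weeks."""
    target = pd.Timestamp(target_date)
    # the same weekday 1, 2, ... `weeks` weeks before the target date
    past_dates = [target - pd.Timedelta(weeks=k) for k in range(1, weeks + 1)]
    past = df[df["Date"].isin(past_dates)]
    out = past.groupby("Blockno", as_index=False)["current_demand"].mean()
    out["Date"] = target
    return out

prediction = forecast(data, "2018-06-15")
print(prediction)
```

For 15-06-2018 and block 1 this averages 18839 and 17957, which gives the 18398 from the expected result.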
