pd.Timedelta adds an extra day when calculating a difference between dates - python

I have the following pandas data frame df:
Actual Scheduled
2017-01-01 04:03:00.000 2017-01-01 04:25:00.000
2017-01-01 04:56:00.000 2017-01-01 04:55:00.000
2017-01-01 04:36:00.000 2017-01-01 05:05:00.000
2017-01-01 06:46:00.000 2017-01-01 06:55:00.000
2017-01-01 06:46:00.000 2017-01-01 07:00:00.000
I need to create an additional column DIFF_MINUTES that contains the difference (in minutes) between Actual and Scheduled (Actual - Scheduled).
This is how I tried to solve this task:
import pandas as pd
import datetime
df["Actual"] = df.apply(lambda row: datetime.datetime.strptime(str(row["Actual"]),"%Y-%m-%d %H:%M:%S.%f"), axis=1)
df["Scheduled"] = df.apply(lambda row: datetime.datetime.strptime(str(row["Scheduled"]),"%Y-%m-%d %H:%M:%S.%f"), axis=1)
df["DIFF_MINUTES"] = df.apply(lambda row: (pd.Timedelta(row["Actual"]-row["Scheduled"]).seconds)/60, axis=1)
However, I got wrong results for the negative-difference cases (e.g. 04:03:00 - 04:25:00 should give -22 minutes instead of 1418 minutes):
Actual Scheduled DIFF_MINUTES
2017-01-01 04:03:00 2017-01-01 04:25:00 1418.0
2017-01-01 04:56:00 2017-01-01 04:55:00 1.0
2017-01-01 04:36:00 2017-01-01 05:05:00 1411.0
2017-01-01 06:46:00 2017-01-01 06:55:00 1431.0
2017-01-01 06:46:00 2017-01-01 07:00:00 1426.0
How to fix it?
Expected result:
Actual Scheduled DIFF_MINUTES
2017-01-01 04:03:00 2017-01-01 04:25:00 -22.0
2017-01-01 04:56:00 2017-01-01 04:55:00 1.0
2017-01-01 04:36:00 2017-01-01 05:05:00 -29.0
2017-01-01 06:46:00 2017-01-01 06:55:00 -9.0
2017-01-01 06:46:00 2017-01-01 07:00:00 -14.0

Use dt.total_seconds() (also check whether the day or the month comes first in your columns):
df['Actual'] = pd.to_datetime(df['Actual'], dayfirst=True)
df['Scheduled'] = pd.to_datetime(df['Scheduled'], dayfirst=True)
df['DIFF_MINUTES'] = (df['Actual']-df['Scheduled']).dt.total_seconds()/60
print(df)
Actual Scheduled DIFF_MINUTES
0 2017-01-01 04:03:00 2017-01-01 04:25:00 -22.0
1 2017-01-01 04:56:00 2017-01-01 04:55:00 1.0
2 2017-01-01 04:36:00 2017-01-01 05:05:00 -29.0
3 2017-01-01 06:46:00 2017-01-01 06:55:00 -9.0
4 2017-01-01 06:46:00 2017-01-01 07:00:00 -14.0
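The 1418 in the question comes from Timedelta.seconds, which is the non-negative seconds component (0-86399) of the normalized timedelta, not the signed total: a -22-minute delta is stored as -1 day plus 85080 seconds, and 85080 / 60 = 1418. total_seconds() returns the signed value. A quick check:

```python
import pandas as pd

delta = pd.Timedelta(minutes=-22)
print(delta)                  # -1 days +23:38:00
print(delta.seconds)          # 85080 -> 85080 / 60 = 1418.0
print(delta.total_seconds())  # -1320.0 -> -1320 / 60 = -22.0
```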

Assuming that both columns are already datetime, just run:
df['DIFF_MINUTES'] = (df.Actual - df.Scheduled).dt.total_seconds() / 60
(a one-liner).
If you read this DataFrame from e.g. an Excel or CSV file, add the
parse_dates=[0, 1] parameter so these columns are parsed as dates and
there is no need to cast them in your code.
And if for some reason you have these columns as text, convert them by running:
df.Actual = pd.to_datetime(df.Actual)
df.Scheduled = pd.to_datetime(df.Scheduled)
(this is quicker than "plain Python" functions).
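A sketch of the parse_dates route; an in-memory CSV stands in for whatever file the data was actually saved to:

```python
import io
import pandas as pd

# stand-in for a real file, e.g. pd.read_csv("flights.csv", parse_dates=[0, 1])
csv_data = io.StringIO(
    "Actual,Scheduled\n"
    "2017-01-01 04:03:00.000,2017-01-01 04:25:00.000\n"
    "2017-01-01 04:56:00.000,2017-01-01 04:55:00.000\n"
)
df = pd.read_csv(csv_data, parse_dates=[0, 1])

# both columns arrive as datetime64, so the subtraction works directly
df["DIFF_MINUTES"] = (df["Actual"] - df["Scheduled"]).dt.total_seconds() / 60
print(df["DIFF_MINUTES"].tolist())  # [-22.0, 1.0]
```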

Related

Replace nan with zero or linear interpolation

I have a dataset with a lot of NaNs and numeric values with the following form:
PV_Power
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 NaN
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 NaN
... ...
2017-12-31 20:00:00 NaN
2017-12-31 21:00:00 NaN
2017-12-31 22:00:00 NaN
2017-12-31 23:00:00 NaN
2018-01-01 00:00:00 NaN
What I need to do is to replace a NaN value with either 0 if it is between other NaN values or with the result of interpolation if it is between numeric values. Any idea of how I can achieve that?
Use DataFrame.interpolate with limit_area='inside' to interpolate only between numeric values, then replace the remaining missing values:
print (df)
PV_Power
date
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 4.0
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 5.0
2017-01-01 05:00:00 NaN
2017-01-01 06:00:00 NaN
df = df.interpolate(limit_area='inside').fillna(0)
print (df)
PV_Power
date
2017-01-01 00:00:00 0.000000
2017-01-01 01:00:00 4.000000
2017-01-01 02:00:00 4.333333
2017-01-01 03:00:00 4.666667
2017-01-01 04:00:00 5.000000
2017-01-01 05:00:00 0.000000
2017-01-01 06:00:00 0.000000
You could reindex your dataframe:
idx = df.index
df = df.dropna().reindex(idx, fill_value=0)
or just set values where PV_Power is NaN:
df.loc[pd.isna(df.PV_Power), ["PV_Power"]] = 0
You can use fillna(0):
df['PV_Power'].fillna(0, inplace=True)
or you can replace the NaNs:
df['PV_Power'] = df['PV_Power'].replace(np.nan, 0)
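Note that plain fillna(0) (or replace) zeroes out every NaN, including the gaps between numeric values, so it does not interpolate; only the accepted limit_area='inside' approach does both. A small sketch contrasting the two behaviours on a frame like the one in the accepted answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"PV_Power": [np.nan, 4.0, np.nan, np.nan, 5.0, np.nan, np.nan]},
    index=pd.date_range("2017-01-01", periods=7, freq="h"),
)

# interpolate only inside runs bounded by numbers, then zero the edges
interpolated = df.interpolate(limit_area="inside").fillna(0)

# plain fillna zeroes *every* NaN, inner gaps included
zeroed = df.fillna(0)

print(interpolated["PV_Power"].tolist())
print(zeroed["PV_Power"].tolist())
```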

Add hours to year-month-day data in pandas data frame

I have the following data frame with hourly resolution
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 20.96
1 2017-01-01 20.90
2 2017-01-01 18.13
3 2017-01-01 16.03
4 2017-01-01 16.43
... ...
8756 2017-12-31 25.56
8757 2017-12-31 11.02
8758 2017-12-31 7.32
8759 2017-12-31 1.86
type(day_ahead_DK1)
Out[28]: pandas.core.frame.DataFrame
But the current column DateStamp is missing the hours. How can I add hour 00:00:00 to 2017-01-01 at index 0 (giving 2017-01-01 00:00:00), hour 01:00:00 to 2017-01-01 at index 1 (giving 2017-01-01 01:00:00), and so on, so that every day has hours from 0 to 23? Thank you!
The expected output:
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
... ...
8756 2017-12-31 20:00:00 25.56
8757 2017-12-31 21:00:00 11.02
8758 2017-12-31 22:00:00 7.32
8759 2017-12-31 23:00:00 1.86
Use GroupBy.cumcount as a per-day counter, convert it to hours with to_timedelta, and add it to the DateStamp column:
df['DateStamp'] = pd.to_datetime(df['DateStamp'])
df['DateStamp'] += pd.to_timedelta(df.groupby('DateStamp').cumcount(), unit='h')
print (df)
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
8756 2017-12-31 20:00:00 25.56
8757 2017-12-31 21:00:00 11.02
8758 2017-12-31 22:00:00 7.32
8759 2017-12-31 23:00:00 1.86
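The counter logic can be checked on a tiny hypothetical frame (three rows per day instead of 24):

```python
import pandas as pd

# hypothetical two-day frame with three rows per day
df = pd.DataFrame({
    "DateStamp": ["2017-01-01"] * 3 + ["2017-01-02"] * 3,
    "DK1": [20.96, 20.90, 18.13, 25.56, 11.02, 7.32],
})

df["DateStamp"] = pd.to_datetime(df["DateStamp"])
# cumcount() numbers rows 0, 1, 2, ... within each date, i.e. the hour offset
df["DateStamp"] += pd.to_timedelta(df.groupby("DateStamp").cumcount(), unit="h")
print(df["DateStamp"].tolist())
```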

RMSE between all values with the same hour with pandas

I have two dataframes: the first represents the output of a model simulation and the second the real values. I would like to compute the RMSE between all the values with the same hour. Basically I should compute 24 RMSE values, one for each hour.
These are the first columns of my dataframes:
date;model
2017-01-01 00:00:00;53
2017-01-01 01:00:00;52
2017-01-01 02:00:00;51
2017-01-01 03:00:00;47.27
2017-01-01 04:00:00;45.49
2017-01-01 05:00:00;45.69
2017-01-01 06:00:00;48.07
2017-01-01 07:00:00;45.67
2017-01-01 08:00:00;45.48
2017-01-01 09:00:00;42.06
2017-01-01 10:00:00;46.86
2017-01-01 11:00:00;48.02
2017-01-01 12:00:00;49.57
2017-01-01 13:00:00;48.69
2017-01-01 14:00:00;46.91
2017-01-01 15:00:00;49.43
2017-01-01 16:00:00;50.45
2017-01-01 17:00:00;53.3
2017-01-01 18:00:00;59.07
2017-01-01 19:00:00;61.71
2017-01-01 20:00:00;56.26
2017-01-01 21:00:00;55
2017-01-01 22:00:00;54
2017-01-01 23:00:00;52
2017-01-02 00:00:00;53
and
date;real
2017-01-01 00:00:00;55
2017-01-01 01:00:00;55
2017-01-01 02:00:00;55
2017-01-01 03:00:00;48.27
2017-01-01 04:00:00;48.49
2017-01-01 05:00:00;48.69
2017-01-01 06:00:00;49.07
2017-01-01 07:00:00;49.67
2017-01-01 08:00:00;49.48
2017-01-01 09:00:00;50.06
2017-01-01 10:00:00;50.86
2017-01-01 11:00:00;50.02
2017-01-01 12:00:00;33.57
2017-01-01 13:00:00;33.69
2017-01-01 14:00:00;33.91
2017-01-01 15:00:00;33.43
2017-01-01 16:00:00;33.45
2017-01-01 17:00:00;33.3
2017-01-01 18:00:00;33.07
2017-01-01 19:00:00;33.71
2017-01-01 20:00:00;33.26
2017-01-01 21:00:00;33
2017-01-01 22:00:00;33
2017-01-01 23:00:00;33
2017-01-02 00:00:00;33
Since I am considering one year, each RMSE computation covers 365 values.
Up to now, I am only able to read the dataframes. One option could be to set up a loop over hours 1-24 and create 24 new dataframes by means of dfr[dfr.index.hour == i].
Do you have a more elegant and efficient solution?
Thanks
RMSE depends on the pairing order, so you should join the model to the real data first, then group by hour and calculate your RMSE:
import numpy as np

def rmse(group):
    if len(group) == 0:
        return np.nan
    s = (group['model'] - group['real']).pow(2).sum()
    return np.sqrt(s / len(group))

result = (
    df1.merge(df2, on='date')
       .assign(hour=lambda x: x['date'].dt.hour)
       .groupby('hour')
       .apply(rmse)
)
Result:
hour
0 14.21267
1 3.00000
2 4.00000
3 1.00000
4 3.00000
5 3.00000
6 1.00000
7 4.00000
8 4.00000
9 8.00000
10 4.00000
11 2.00000
12 16.00000
13 15.00000
14 13.00000
15 16.00000
16 17.00000
17 20.00000
18 26.00000
19 28.00000
20 23.00000
21 22.00000
22 21.00000
23 19.00000
dtype: float64
Explanation
Here is what the code does:
merge: combine the two data frames together based on the date index
assign: create a new column hour, extracted from the date index
groupby: group rows based on their hour values
apply allows you to write a custom aggregator. All the rows with hour = 0 will be sent into the rmse function (our custom function), all the rows with hour = 1 will be sent next. As an illustration:
date hour model real
2017-01-01 00:00:00 0 ... ...
2017-01-02 00:00:00 0 ... ...
2017-01-03 00:00:00 0 ... ...
2017-01-04 00:00:00 0 ... ...
--------------------------------------
2017-01-01 01:00:00 1 ... ...
2017-01-02 01:00:00 1 ... ...
2017-01-03 01:00:00 1 ... ...
2017-01-04 01:00:00 1 ... ...
--------------------------------------
2017-01-01 02:00:00 2 ... ...
2017-01-02 02:00:00 2 ... ...
2017-01-03 02:00:00 2 ... ...
2017-01-04 02:00:00 2 ... ...
--------------------------------------
2017-01-01 03:00:00 3 ... ...
2017-01-02 03:00:00 3 ... ...
2017-01-03 03:00:00 3 ... ...
2017-01-04 03:00:00 3 ... ...
Each chunk is then sent to our custom function: rmse(group=<a chunk>). Within the function, we reduce that chunk down into a single number: its RMSE. That's how you get the 24 RMSE numbers back as a result.
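The same per-hour RMSE can also be computed without a custom apply, by squaring the paired errors and taking the grouped mean. A sketch on a small hypothetical merged frame (two days, two hours each):

```python
import pandas as pd

# hypothetical paired data after the merge step
merged = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-01-01 00:00:00", "2017-01-01 01:00:00",
        "2017-01-02 00:00:00", "2017-01-02 01:00:00",
    ]),
    "model": [53.0, 52.0, 53.0, 52.0],
    "real": [55.0, 55.0, 33.0, 55.0],
})

# mean of squared errors per hour, then square root: vectorised RMSE
rmse = (
    (merged["model"] - merged["real"]).pow(2)
    .groupby(merged["date"].dt.hour)
    .mean()
    .pow(0.5)
)
print(rmse)
```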
You can also provide to by= a function that takes an index label and extracts the hour:
import pandas as pd
from time import strptime

df = pd.DataFrame([
    ['2017-01-01 00:00:00', 53],
    ['2017-01-01 01:00:00', 52],
    ['2017-01-02 00:00:00', 53],
    ['2017-01-03 01:00:00', 50],
    ['2017-01-04 00:00:00', 53]
], columns=['date', 'model'])

def group_fun(ix):
    return strptime(df['date'][ix], '%Y-%m-%d %H:%M:%S').tm_hour

print(df.groupby(by=group_fun).std())
model
0 0.000000
1 1.414214

How do I group a time series by hour of day?

I have a time series and I want to group the rows by hour of day (regardless of date) and visualize these as boxplots. So I'd want 24 boxplots starting from hour 1, then hour 2, then hour 3 and so on.
The way I see this working is splitting the dataset up into 24 series (1 for each hour of the day), creating a boxplot for each series and then plotting this on the same axes.
The only way I can think of to do this is to manually select all the values between each hour. Is there a faster way?
some sample data:
Date Actual Consumption
2018-01-01 00:00:00 47.05
2018-01-01 00:15:00 46
2018-01-01 00:30:00 44
2018-01-01 00:45:00 45
2018-01-01 01:00:00 43.5
2018-01-01 01:15:00 43.5
2018-01-01 01:30:00 43
2018-01-01 01:45:00 42.5
2018-01-01 02:00:00 43
2018-01-01 02:15:00 42.5
2018-01-01 02:30:00 41
2018-01-01 02:45:00 42.5
2018-01-01 03:00:00 42.04
2018-01-01 03:15:00 41.96
2018-01-01 03:30:00 44
2018-01-01 03:45:00 44
2018-01-01 04:00:00 43.54
2018-01-01 04:15:00 43.46
2018-01-01 04:30:00 43.5
2018-01-01 04:45:00 43
2018-01-01 05:00:00 42.04
This is what I've tried so far:
zero = df.between_time('00:00', '00:59')
one = df.between_time('01:00', '01:59')
two = df.between_time('02:00', '02:59')
and then I would plot a boxplot for each of these on the same axes. However it's very tedious to do this for all 24 hours in a day.
This is the kind of output I want:
https://www.researchgate.net/figure/Boxplot-of-the-NOx-data-by-hour-of-the-day_fig1_24054015
There are two steps to achieve this:
Convert Actual to datetime:
df.Actual = pd.to_datetime(df.Actual)
Group by the hour:
df.groupby([df.Date, df.Actual.dt.hour+1]).Consumption.sum().reset_index()
I assumed you wanted to sum the Consumption (if you want the mean or anything else, just change it). One note: hour+1 makes the hours start from 1 rather than 0 (remove it if you want 0 to be midnight).
Desired result:
Date Actual Consumption
0 2018-01-01 1 182.05
1 2018-01-01 2 172.50
2 2018-01-01 3 169.00
3 2018-01-01 4 172.00
4 2018-01-01 5 173.50
5 2018-01-01 6 42.04
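For the boxplots themselves you don't need 24 separate between_time slices: grouping by the hour of the index gives all 24 per-hour series at once, and DataFrame.boxplot can draw them in one call. A minimal sketch, assuming the timestamps are in the index and the value column is named Consumption:

```python
import pandas as pd

# hypothetical quarter-hourly data over two days
idx = pd.date_range("2018-01-01", periods=192, freq="15min")
df = pd.DataFrame({"Consumption": range(192)}, index=idx)

# one group per hour of day, regardless of date
by_hour = df.groupby(df.index.hour)["Consumption"]
print(by_hour.size().to_dict())  # 8 samples per hour: 4 per day x 2 days

# for the 24 boxplots on one axes (requires matplotlib):
# df.assign(hour=df.index.hour).boxplot(column="Consumption", by="hour")
```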

How can I efficiently convert hourly data into dates and times for every day of the year using Python pandas?

I have a pandas DataFrame that represents a value for every hour of a day and I want to report each value of each day for a year. I have written the 'naive' way to do it. Is there a more efficient way?
Naive way (that works correctly, but takes a lot of time):
dfConsoFrigo = pd.read_csv("../assets/datas/refregirateur.csv", sep=';')
dataframe = pd.DataFrame(columns=['Puissance'])
iterator = 0
for day in pd.date_range("01 Jan 2017 00:00", "31 Dec 2017 23:00", freq='1H'):
    iterator = iterator % 24
    dataframe.loc[day] = dfConsoFrigo.iloc[iterator]['Puissance']
    iterator += 1
Input (time;value) 24 rows:
Heure;Puissance
00:00;48.0
01:00;47.0
02:00;46.0
03:00;46.0
04:00;45.0
05:00;46.0
...
19:00;55.0
20:00;53.0
21:00;51.0
22:00;50.0
23:00;49.0
Expected Output (8760 rows):
Puissance
2017-01-01 00:00:00 48
2017-01-01 01:00:00 47
2017-01-01 02:00:00 46
2017-01-01 03:00:00 46
2017-01-01 04:00:00 45
...
2017-12-31 20:00:00 53
2017-12-31 21:00:00 51
2017-12-31 22:00:00 50
2017-12-31 23:00:00 49
I think you need numpy.tile:
np.random.seed(10)
df = pd.DataFrame({'Puissance':np.random.randint(100, size=24)})
rng = pd.date_range("01 Jan 2017 00:00", "31 Dec 2017 23:00", freq='1H')
df = pd.DataFrame({'a':np.tile(df['Puissance'].values, 365)}, index=rng)
print (df.head(30))
a
2017-01-01 00:00:00 9
2017-01-01 01:00:00 15
2017-01-01 02:00:00 64
2017-01-01 03:00:00 28
2017-01-01 04:00:00 89
2017-01-01 05:00:00 93
2017-01-01 06:00:00 29
2017-01-01 07:00:00 8
2017-01-01 08:00:00 73
2017-01-01 09:00:00 0
2017-01-01 10:00:00 40
2017-01-01 11:00:00 36
2017-01-01 12:00:00 16
2017-01-01 13:00:00 11
2017-01-01 14:00:00 54
2017-01-01 15:00:00 88
2017-01-01 16:00:00 62
2017-01-01 17:00:00 33
2017-01-01 18:00:00 72
2017-01-01 19:00:00 78
2017-01-01 20:00:00 49
2017-01-01 21:00:00 51
2017-01-01 22:00:00 54
2017-01-01 23:00:00 77
2017-01-02 00:00:00 9
2017-01-02 01:00:00 15
2017-01-02 02:00:00 64
2017-01-02 03:00:00 28
2017-01-02 04:00:00 89
2017-01-02 05:00:00 93
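An alternative sketch that avoids numpy.tile: treat the 24-row table as a lookup keyed by hour and index it with the hour of each target timestamp (the df24 frame here is a hypothetical stand-in for the 24 hourly Puissance values):

```python
import pandas as pd

# hypothetical 24 hourly values, one per hour 0..23
df24 = pd.DataFrame({"Puissance": range(48, 72)})

rng = pd.date_range("2017-01-01 00:00", "2017-12-31 23:00", freq="h")
# look up each timestamp's hour in the 24-row table
out = pd.DataFrame(
    {"Puissance": df24["Puissance"].to_numpy()[rng.hour]}, index=rng
)
print(len(out))  # 8760
```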
