I have the following data frame with hourly resolution
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 20.96
1 2017-01-01 20.90
2 2017-01-01 18.13
3 2017-01-01 16.03
4 2017-01-01 16.43
... ...
8756 2017-12-31 25.56
8757 2017-12-31 11.02
8758 2017-12-31 7.32
8759 2017-12-31 1.86
type(day_ahead_DK1)
Out[28]: pandas.core.frame.DataFrame
But the DateStamp column is missing the hours. How can I add 00:00:00 to 2017-01-01 at index 0 so it becomes 2017-01-01 00:00:00, then 01:00:00 at index 1 so it becomes 2017-01-01 01:00:00, and so on, so that every day has hours from 0 to 23? Thank you!
The expected output:
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
... ...
8756 2017-12-31 20:00:00 25.56
8757 2017-12-31 21:00:00 11.02
8758 2017-12-31 22:00:00 7.32
8759 2017-12-31 23:00:00 1.86
Use GroupBy.cumcount as a per-day row counter, convert it to hours with to_timedelta, and add the result to the DateStamp column:
df['DateStamp'] = pd.to_datetime(df['DateStamp'])
df['DateStamp'] += pd.to_timedelta(df.groupby('DateStamp').cumcount(), unit='h')
print (df)
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
8756 2017-12-31 00:00:00 25.56
8757 2017-12-31 01:00:00 11.02
8758 2017-12-31 02:00:00 7.32
8759 2017-12-31 03:00:00 1.86
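As a quick check of the approach above, here is a minimal sketch on a toy frame with two complete days. The name `toy` and its values are made up for illustration, and the sketch assumes the rows are already in chronological order within each day.

```python
import pandas as pd

# Hypothetical toy data: two days, 24 hourly rows each, dates without a time part.
toy = pd.DataFrame({
    "DateStamp": ["2017-01-01"] * 24 + ["2017-01-02"] * 24,
    "DK1": range(48),
})

toy["DateStamp"] = pd.to_datetime(toy["DateStamp"])
# cumcount numbers the rows within each day 0, 1, 2, ...; use that as the hour offset.
hour_offset = pd.to_timedelta(toy.groupby("DateStamp").cumcount(), unit="h")
toy["DateStamp"] = toy["DateStamp"] + hour_offset

# Show the first and last row of each day.
print(toy.iloc[[0, 23, 24, 47]])
```

The counter only depends on how many rows each date has, so a date with just four rows (as in a truncated sample) gets hours 00:00 to 03:00, and a date with 23 or 25 rows will not end at 23:00.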
Say I have a Dataframe called Data with shape (71067, 4):
StartTime EndDateTime TradeDate Values
0 2018-12-31 23:00:00 2018-12-31 23:30:00 2019-01-01 -44.676
1 2018-12-31 23:30:00 2019-01-01 00:00:00 2019-01-01 -36.113
2 2019-01-01 00:00:00 2019-01-01 00:30:00 2019-01-01 -19.229
3 2019-01-01 00:30:00 2019-01-01 01:00:00 2019-01-01 -23.606
4 2019-01-01 01:00:00 2019-01-01 01:30:00 2019-01-01 -25.899
... ... ... ... ...
2023-01-30 20:30:00 2023-01-30 21:00:00 2023-01-30 -27.198
2023-01-30 21:00:00 2023-01-30 21:30:00 2023-01-30 -13.221
2023-01-30 21:30:00 2023-01-30 22:00:00 2023-01-30 -12.034
2023-01-30 22:00:00 2023-01-30 22:30:00 2023-01-30 -16.464
2023-01-30 22:30:00 2023-01-30 23:00:00 2023-01-30 -25.441
71067 rows × 4 columns
When running Data.isna().sum().sum() I realise I have some NaN values in the dataset:
Data.isna().sum().sum()
> 1391
Shown here:
Data[Data['Values'].isna()].reset_index(drop = True).sort_values(by = 'StartTime')
0 2019-01-01 03:30:00 2019-01-01 04:00:00 2019-01-01 NaN
1 2019-01-04 02:30:00 2019-01-04 03:00:00 2019-01-04 NaN
2 2019-01-04 03:00:00 2019-01-04 03:30:00 2019-01-04 NaN
3 2019-01-04 03:30:00 2019-01-04 04:00:00 2019-01-04 NaN
4 2019-01-04 04:00:00 2019-01-04 04:30:00 2019-01-04 NaN
... ... ... ... ...
1386 2022-12-06 13:00:00 2022-12-06 13:30:00 2022-12-06 NaN
1387 2022-12-06 13:30:00 2022-12-06 14:00:00 2022-12-06 NaN
1388 2022-12-22 11:00:00 2022-12-22 11:30:00 2022-12-22 NaN
1389 2023-01-25 11:00:00 2023-01-25 11:30:00 2023-01-25 NaN
1390 2023-01-25 11:30:00 2023-01-25 12:00:00 2023-01-25 NaN
Is there any way of replacing each of the NaN values in the dataset with the mean value of the corresponding half hour across the 70,000-plus rows? See below:
Data['HH'] = pd.to_datetime(Data['StartTime']).dt.time
Data.groupby(['HH'], as_index=False)[['Values']].mean().head(10)
# Only showing first 10 means
HH Values
0 00:00:00 5.236811
1 00:30:00 2.056571
2 01:00:00 4.157455
3 01:30:00 2.339253
4 02:00:00 2.658238
5 02:30:00 0.230557
6 03:00:00 0.217599
7 03:30:00 -0.630243
8 04:00:00 -0.989919
9 04:30:00 -0.494372
For example, if a value is missing against 04:00, can it be replaced with the 04:00 mean value (-0.989919) as per the above table of means?
Any help greatly appreciated.
Group the DataFrame by HH, then transform the Values with mean to broadcast the group means back to the shape of the original column, and finally use fillna to fill the null values:
avg = Data.groupby('HH')['Values'].transform('mean')
Data['Values'] = Data['Values'].fillna(avg)
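To make the broadcast step concrete, here is a small self-contained sketch of the same groupby/transform/fillna pattern. The frame `data` and its numbers are invented for illustration and are not taken from the question's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical half-hourly data over three days with two missing values.
data = pd.DataFrame({
    "StartTime": pd.to_datetime([
        "2019-01-01 04:00", "2019-01-01 04:30",
        "2019-01-02 04:00", "2019-01-02 04:30",
        "2019-01-03 04:00", "2019-01-03 04:30",
    ]),
    "Values": [1.0, 10.0, np.nan, 20.0, 3.0, np.nan],
})

data["HH"] = data["StartTime"].dt.time
# transform('mean') returns one value per row: the mean of that row's HH group,
# computed with NaNs skipped.
avg = data.groupby("HH")["Values"].transform("mean")
data["Values"] = data["Values"].fillna(avg)
print(data)
```

Here the missing 04:00 value becomes 2.0 (the mean of 1.0 and 3.0) and the missing 04:30 value becomes 15.0 (the mean of 10.0 and 20.0). A half hour whose values are all NaN would stay NaN, since its group mean is NaN too.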
I am working with a dataset where I have dates in datetime format in the first column and hours as float as separate columns like this:
date 1.0 2.0 3.0 ... 21.0 22.0 23.0 24.0
0 2021-01-01 24.95 24.35 23.98 ... 27.32 26.98 26.44 25.64
1 2021-01-02 25.59 24.91 24.74 ... 27.38 26.96 26.85 25.94
and what I want to achieve is this:
Date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-01 02:00:00 24.35
2 2021-01-01 03:00:00 23.98
3 2021-01-01 04:00:00 ...
I figured that the first step should be to change the hours into datetime format.
I have been trying this code, for example: df[1.0] = pd.to_datetime(df[1.0], format='%h')
This gives me: "ValueError: 'h' is a bad directive in format '%h'"
The next step would be to rearrange the columns and rows. I have been thinking about doing this with pandas pivot_table and transform. Any help would be appreciated. Thank you.
First use DataFrame.set_index, convert the column labels to timedeltas, reshape with DataFrame.unstack, and finally add the timedeltas to the dates:
df['date'] = pd.to_datetime(df['date'])
f = lambda x: pd.to_timedelta(float(x), unit='h')
df1 = (df.set_index('date')
.rename(columns=f)
.unstack()
.reset_index(name='Price')
.assign(date=lambda x: x['date'] + x.pop('level_0')))
print (df1)
date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-02 01:00:00 25.59
2 2021-01-01 02:00:00 24.35
3 2021-01-02 02:00:00 24.91
4 2021-01-01 03:00:00 23.98
5 2021-01-02 03:00:00 24.74
6 2021-01-01 21:00:00 27.32
7 2021-01-02 21:00:00 27.38
8 2021-01-01 22:00:00 26.98
9 2021-01-02 22:00:00 26.96
10 2021-01-01 23:00:00 26.44
11 2021-01-02 23:00:00 26.85
12 2021-01-02 00:00:00 25.64
13 2021-01-03 00:00:00 25.94
Or use DataFrame.melt and then add the melted column labels, converted to timedeltas, to the dates:
df['date'] = pd.to_datetime(df['date'])
df1 = (df.melt('date', value_name='Price')
.assign(date = lambda x: x['date'] +
pd.to_timedelta(x.pop('variable').astype(float), unit='h'))
.sort_values('date', ignore_index=True))
print (df1)
date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-01 02:00:00 24.35
2 2021-01-01 03:00:00 23.98
3 2021-01-01 21:00:00 27.32
4 2021-01-01 22:00:00 26.98
5 2021-01-01 23:00:00 26.44
6 2021-01-02 00:00:00 25.64
7 2021-01-02 01:00:00 25.59
8 2021-01-02 02:00:00 24.91
9 2021-01-02 03:00:00 24.74
10 2021-01-02 21:00:00 27.38
11 2021-01-02 22:00:00 26.96
12 2021-01-02 23:00:00 26.85
13 2021-01-03 00:00:00 25.94
So I have a dataset that has electricity load over 24 hours:
Time_of_Day = loadData.groupby(loadData.index.hour).mean()
Time_of_Day
Time Load
2019-01-01 01:00:00 38.045
2019-01-01 02:00:00 30.675
2019-01-01 03:00:00 22.570
2019-01-01 04:00:00 22.153
2019-01-01 05:00:00 21.085
... ...
2019-12-31 20:00:00 65.565
2019-12-31 21:00:00 53.513
2019-12-31 22:00:00 49.096
2019-12-31 23:00:00 44.409
2020-01-01 00:00:00 45.744
How do I plot a random day (24 hours) from the 8760 hours, please?
With the following toy dataframe:
import pandas as pd
import random
df = pd.DataFrame({"Time": pd.date_range(start="1/1/2019", end="12/31/2019", freq="h")})
df["Load"] = [round(random.random() * 100, 2) for _ in range(df.shape[0])]
Time Load
0 2019-01-01 00:00:00 53.36
1 2019-01-01 01:00:00 34.20
2 2019-01-01 02:00:00 64.19
3 2019-01-01 03:00:00 89.18
4 2019-01-01 04:00:00 27.82
... ... ...
8732 2019-12-30 20:00:00 38.26
8733 2019-12-30 21:00:00 49.66
8734 2019-12-30 22:00:00 64.15
8735 2019-12-30 23:00:00 23.97
8736 2019-12-31 00:00:00 3.72
[8737 rows x 2 columns]
Here is one way to do it, using the choice function from the random module of the Python standard library:
# In Jupyter cell
df[
(df["Time"].dt.month == random.choice(df["Time"].dt.month))
& (df["Time"].dt.day == random.choice(df["Time"].dt.day))
].plot(x="Time")
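Because the month and the day are drawn independently above, the pair can occasionally name a date that does not exist in the data (for example day 31 in a 30-day month), which leaves nothing to plot. A possible variation, sketched here under the assumption that the toy `df` has a `Time` and a `Load` column, is to draw one calendar date from those actually present:

```python
import random
import pandas as pd

# Hypothetical toy data: one year of hourly load values.
df = pd.DataFrame({"Time": pd.date_range("2019-01-01", "2019-12-31 23:00", freq="h")})
df["Load"] = [round(random.random() * 100, 2) for _ in range(len(df))]

# Strip the time part so each row carries its calendar date at midnight.
days = df["Time"].dt.normalize()
# Pick one date that really occurs in the data, then keep only its rows.
chosen = random.choice(list(days.drop_duplicates()))
one_day = df[days == chosen]

# one_day.plot(x="Time", y="Load")  # needs matplotlib; uncomment to draw the plot
print(chosen.date(), len(one_day))
```

With complete hourly data, `one_day` always holds the 24 rows of a single date, whichever date is drawn.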
Output: (plot of Load against Time for the selected day; image not preserved)
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
missing data point 1
6 17111 2018-01-01 07:00:00 1.835
7 17112 2018-01-01 00:00:00 1.095
8 17112 2018-01-01 01:00:00 1.129
missing data point 2
9 17112 2018-01-01 03:00:00 1.833
10 17112 2018-01-01 04:00:00 1.697
11 17112 2018-01-01 05:00:00 1.835
For every customer, I have hourly data, but some data points are missing in between. For each customer, I want to find the min and max of UsageDate, fill in the missing UsageDate values in that interval (all values are hourly), and set EnergyConsumed to zero for them. I can later use ffill or backfill to take care of this.
Not every customer's max UsageDate is 2018-01-31 23:00:00, so the series should only be extended up to each customer's own max date.
missing point 1 is replaced by
17111 2018-01-01 06:00:00 0
missing point 2 is replaced by
17112 2018-01-01 02:00:00 0
My main trouble is finding the min and max date of every customer and then generating the missing dates in between.
I have tried indexing by date and resampling, but that hasn't got me to a solution.
Also, I was wondering if there is a way to directly find the CustIDs that have missing values in the pattern described above. My data is very large and the solution provided by @Vaishali is computationally heavy. Any input would be helpful!
You can group the DataFrame by CustID and build an index with the desired date range for each customer, then use this index to reindex the data:
df['UsageDate'] = pd.to_datetime(df['UsageDate'])
idx = df.groupby('CustID')['UsageDate'].apply(lambda x: pd.Series(index = pd.date_range(x.min(), x.max(), freq = 'h'), dtype = float)).index
df.set_index(['CustID', 'UsageDate']).reindex(idx).fillna(0).reset_index().rename(columns = {'level_1':'UsageDate'})
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
6 17111 2018-01-01 06:00:00 0.000
7 17111 2018-01-01 07:00:00 1.835
8 17112 2018-01-01 00:00:00 1.095
9 17112 2018-01-01 01:00:00 1.129
10 17112 2018-01-01 02:00:00 0.000
11 17112 2018-01-01 03:00:00 1.833
12 17112 2018-01-01 04:00:00 1.697
13 17112 2018-01-01 05:00:00 1.835
Explanation: Since the UsageDate values have to cover every hour between the minimum and maximum date for that CustID, we group the data by CustID and create an empty series per customer whose index is the date_range from the min to the max date. The dates are set as the index of the series rather than as its values. The result of the groupby is a MultiIndex with CustID as level 0 and the usage date as level 1. We then use this MultiIndex to reindex the original DataFrame, which keeps the values where the index matches and assigns NaN everywhere else. Finally, fillna converts the NaN values to 0.
First create DatetimeIndex and then use asfreq in apply:
df['UsageDate'] = pd.to_datetime(df['UsageDate'])
df = (df.set_index('UsageDate')
.groupby('CustID')['EnergyConsumed']
.apply(lambda x: x.asfreq('h'))
.fillna(0)
.reset_index()
)
print (df)
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
6 17111 2018-01-01 06:00:00 0.000
7 17111 2018-01-01 07:00:00 1.835
8 17112 2018-01-01 00:00:00 1.095
9 17112 2018-01-01 01:00:00 1.129
10 17112 2018-01-01 02:00:00 0.000
11 17112 2018-01-01 03:00:00 1.833
12 17112 2018-01-01 04:00:00 1.697
13 17112 2018-01-01 05:00:00 1.835
It is also possible to use the method parameter with ffill or bfill:
df = (df.set_index('UsageDate')
.groupby('CustID')['EnergyConsumed']
.apply(lambda x: x.asfreq('h', method='ffill'))
.reset_index()
)
print (df)
CustID UsageDate EnergyConsumed
0 17111 2018-01-01 00:00:00 1.095
1 17111 2018-01-01 01:00:00 1.129
2 17111 2018-01-01 02:00:00 1.165
3 17111 2018-01-01 03:00:00 1.833
4 17111 2018-01-01 04:00:00 1.697
5 17111 2018-01-01 05:00:00 1.835
6 17111 2018-01-01 06:00:00 1.835
7 17111 2018-01-01 07:00:00 1.835
8 17112 2018-01-01 00:00:00 1.095
9 17112 2018-01-01 01:00:00 1.129
10 17112 2018-01-01 02:00:00 1.129
11 17112 2018-01-01 03:00:00 1.833
12 17112 2018-01-01 04:00:00 1.697
13 17112 2018-01-01 05:00:00 1.835
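On the follow-up about finding the customers with gaps directly, without rebuilding the full series: one possible sketch is to compare, per CustID, the number of distinct timestamps with the number of hours spanned by the min and max UsageDate. The sample frame and the names `stats`, `expected` and `with_gaps` are hypothetical, and the check assumes every timestamp falls exactly on the hour.

```python
import pandas as pd

# Hypothetical sample: 17111 lacks 06:00, 17112 lacks 02:00, 17113 is complete.
df = pd.DataFrame({
    "CustID": [17111] * 7 + [17112] * 5 + [17113] * 3,
    "UsageDate": pd.to_datetime([
        "2018-01-01 00:00", "2018-01-01 01:00", "2018-01-01 02:00", "2018-01-01 03:00",
        "2018-01-01 04:00", "2018-01-01 05:00", "2018-01-01 07:00",
        "2018-01-01 00:00", "2018-01-01 01:00", "2018-01-01 03:00",
        "2018-01-01 04:00", "2018-01-01 05:00",
        "2018-01-01 00:00", "2018-01-01 01:00", "2018-01-01 02:00",
    ]),
    "EnergyConsumed": 1.0,
})

# One row per customer: first stamp, last stamp, number of distinct stamps.
stats = df.groupby("CustID")["UsageDate"].agg(["min", "max", "nunique"])
# Number of hourly stamps a gap-free series would contain between min and max.
expected = (stats["max"] - stats["min"]) // pd.Timedelta(hours=1) + 1
# Customers with fewer distinct stamps than expected have at least one gap.
with_gaps = stats.index[stats["nunique"] < expected].tolist()
print(with_gaps)
```

This uses a single groupby aggregation, so the reindexing step only has to be run for the customers listed in `with_gaps`.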
I have the DataFrame below and I want to make an hourly mean DataFrame,
with the condition that each hour's mean is calculated only from the values at 00:15:00~00:45:00.
The date/time columns form a MultiIndex.
aaa
date time
2017-01-01 00:00:00 146.88
00:15:00 143.28
00:30:00 143.28
00:45:00 141.12
01:00:00 134.64
01:15:00 132.48
01:30:00 136.80
01:45:00 138.24
02:00:00 131.76
02:15:00 131.04
02:30:00 134.64
02:45:00 139.68
03:00:00 136.08
03:15:00 132.48
03:30:00 132.48
03:45:00 139.68
04:00:00 134.64
04:15:00 131.04
04:30:00 160.56
04:45:00 177.12
...
The result should look like the one below. How can I do it?
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
...
It seems you only need to select the rows whose time ends with 00:00:
df2 = df1[df1.index.get_level_values(1).astype(str).str.endswith('00:00')]
print (df2)
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
But if you need the mean of only the 00:15-00:45 values, it is more complicated:
lvl1 = pd.Series(df1.index.get_level_values(1))
m = ~lvl1.astype(str).str.endswith('00:00')
lvl1new = lvl1.mask(m).ffill()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
lvl1new.where(m)], names=df1.index.names)
print (df1)
aaa
date time
2017-01-01 NaN 146.88
00:00:00 143.28
00:00:00 143.28
00:00:00 141.12
NaN 134.64
01:00:00 132.48
01:00:00 136.80
01:00:00 138.24
NaN 131.76
02:00:00 131.04
02:00:00 134.64
02:00:00 139.68
NaN 136.08
03:00:00 132.48
03:00:00 132.48
03:00:00 139.68
NaN 134.64
04:00:00 131.04
04:00:00 160.56
04:00:00 177.12
df = df1['aaa'].groupby(level=[0,1]).mean()
print (df)
date time
2017-01-01 00:00:00 142.56
01:00:00 135.84
02:00:00 135.12
03:00:00 134.88
04:00:00 156.24
Name: aaa, dtype: float64
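An alternative route to the same means, sketched here as an assumption rather than as part of the answer above, is to rebuild full timestamps from the two index levels, drop the on-the-hour rows, and group the rest by the hour they fall in. The toy `df1` below reuses the first two hours of the question's values; the names `ts`, `s`, `quarters` and `hourly` are illustrative.

```python
import pandas as pd

# Toy data: two hours of 15-minute readings with a (date, time) MultiIndex.
stamps = pd.date_range("2017-01-01 00:00", periods=8, freq="15min")
df1 = pd.DataFrame(
    {"aaa": [146.88, 143.28, 143.28, 141.12, 134.64, 132.48, 136.80, 138.24]},
    index=pd.MultiIndex.from_arrays([stamps.date, stamps.time], names=["date", "time"]),
)

# Rebuild full timestamps from the two levels
# (assumes the levels hold date/time objects or equivalent strings).
ts = pd.to_datetime([f"{d} {t}" for d, t in df1.index])
s = pd.Series(df1["aaa"].to_numpy(), index=ts)

# Keep only the :15, :30 and :45 readings, then average them within each hour.
quarters = s[s.index.minute != 0]
hourly = quarters.groupby(quarters.index.floor("h")).mean()
print(hourly)
```

For these values the two hourly means are 142.56 and 135.84, which agrees with the first two rows of the result above. The output has a plain DatetimeIndex instead of the original (date, time) MultiIndex.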