df1 looks like this:
week_date Values
21-04-2019 00:00:00 10
28-04-2019 00:00:00 20
df2 looks like this:
hourly_date hour_val
21-04-2019 00:00:00 a
21-04-2019 01:00:00 b
21-04-2019 02:00:00 c
21-04-2019 03:00:00 d
28-04-2019 00:00:00 e
The resultant dataset should look like this:
week_date Values hourly_date hour_val
21-04-2019 00:00:00 10 21-04-2019 00:00:00 a
21-04-2019 00:00:00 10 21-04-2019 01:00:00 b
21-04-2019 00:00:00 10 21-04-2019 02:00:00 c
21-04-2019 00:00:00 10 21-04-2019 03:00:00 d
28-04-2019 00:00:00 20 28-04-2019 00:00:00 e
I have hundreds of weekly rows and thousands of hourly rows.
I tried merging but am not getting the desired output:
merged = pd.merge(df1, df2, how='outer', left_index=True, right_index=True)
You can merge on year and week in this case; try something like:
import pandas as pd
df1 = pd.DataFrame(
    {
        "week_date": ["21-04-2019 00:00:00", "28-04-2019 00:00:00"],
        "Values": [10, 20],
    }
)
df2 = pd.DataFrame(
    {
        "hourly_date": [
            "21-04-2019 00:00:00",
            "21-04-2019 01:00:00",
            "21-04-2019 02:00:00",
            "21-04-2019 03:00:00",
            "28-04-2019 00:00:00",
        ],
        "hour_val": ["a", "b", "c", "d", "e"],
    }
)
df1.week_date = pd.to_datetime(df1.week_date)
df1 = df1.set_index("week_date", drop=False)
df2.hourly_date = pd.to_datetime(df2.hourly_date)
df2 = df2.set_index("hourly_date", drop=False)

# .dt.week is deprecated in recent pandas; .dt.isocalendar().week is the replacement
pd.merge(
    df1, df2,
    left_on=[df1.week_date.dt.isocalendar().week, df1.week_date.dt.year],
    right_on=[df2.hourly_date.dt.isocalendar().week, df2.hourly_date.dt.year],
)[["week_date", "Values", "hourly_date", "hour_val"]].set_index("week_date")
This outputs:
Values hourly_date hour_val
week_date
2019-04-21 10 2019-04-21 00:00:00 a
2019-04-21 10 2019-04-21 01:00:00 b
2019-04-21 10 2019-04-21 02:00:00 c
2019-04-21 10 2019-04-21 03:00:00 d
2019-04-28 20 2019-04-28 00:00:00 e
This doesn't give the desired result; my original data sets look like this:
data-1:
week_date value
2019-04-19 20:00:00 10
2019-04-26 20:00:00 20
data-2:
hourly_date hour_val
2019-04-26 01:00:00 a
2019-04-26 02:00:00 b
2019-04-26 03:00:00 c
2019-04-26 20:00:00 d
2019-04-26 21:00:00 e
and the desired output should be:
Values hourly_date hour_val
week_date
2019-04-19 20:00:00 10 2019-04-26 01:00:00 a
2019-04-19 20:00:00 10 2019-04-26 02:00:00 b
2019-04-19 20:00:00 10 2019-04-26 03:00:00 c
2019-04-26 20:00:00 20 2019-04-26 20:00:00 d
2019-04-26 20:00:00 20 2019-04-26 21:00:00 e
That is, the weekly date-time changes only when it equals the hourly date-time; otherwise week_date carries the previous date-time value forward, as sketched below.
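One way to get exactly that carry-forward behaviour is a nearest-key merge rather than a week-number merge. Below is a minimal sketch with pd.merge_asof, rebuilt on the follow-up data above (assuming both frames are parsed as datetimes; merge_asof's default direction='backward' assigns each hourly row the most recent week_date at or before it):

import pandas as pd

df1 = pd.DataFrame({
    "week_date": pd.to_datetime(["2019-04-19 20:00:00", "2019-04-26 20:00:00"]),
    "Values": [10, 20],
})
df2 = pd.DataFrame({
    "hourly_date": pd.to_datetime([
        "2019-04-26 01:00:00", "2019-04-26 02:00:00", "2019-04-26 03:00:00",
        "2019-04-26 20:00:00", "2019-04-26 21:00:00",
    ]),
    "hour_val": ["a", "b", "c", "d", "e"],
})

# merge_asof requires both frames to be sorted on the merge keys
out = pd.merge_asof(
    df2.sort_values("hourly_date"),
    df1.sort_values("week_date"),
    left_on="hourly_date",
    right_on="week_date",
    direction="backward",  # take the most recent week_date <= hourly_date
)
print(out)

This reproduces the desired pairing, including exact matches (allow_exact_matches=True is the default).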
I am working with a dataset where I have dates in datetime format in the first column and hours as floats in separate columns, like this:
date 1.0 2.0 3.0 ... 21.0 22.0 23.0 24.0
0 2021-01-01 24.95 24.35 23.98 ... 27.32 26.98 26.44 25.64
1 2021-01-02 25.59 24.91 24.74 ... 27.38 26.96 26.85 25.94
and what I want to achieve is this:
Date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-01 02:00:00 24.35
2 2021-01-01 03:00:00 23.98
3 2021-01-01 04:00:00 ...
So I have been figuring that the first step should be to change the hours into datetime format.
I have been trying this code, for example:
df[1.0] = pd.to_datetime(df[1.0], format='%h')
where I get: "ValueError: 'h' is a bad directive in format '%h'"
And then I would rearrange the columns and rows. I have been thinking about doing this with pandas pivot_table and transform. Any help would be appreciated. Thank you.
Use DataFrame.set_index first, convert all column labels to timedeltas, reshape with DataFrame.unstack, and finally add the timedeltas to the dates:
df['date'] = pd.to_datetime(df['date'])

# map each column label (an hour as a float) to a timedelta
f = lambda x: pd.to_timedelta(float(x), unit='h')

df1 = (df.set_index('date')
         .rename(columns=f)
         .unstack()
         .reset_index(name='Price')
         .assign(date=lambda x: x['date'] + x.pop('level_0')))
print (df1)
date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-02 01:00:00 25.59
2 2021-01-01 02:00:00 24.35
3 2021-01-02 02:00:00 24.91
4 2021-01-01 03:00:00 23.98
5 2021-01-02 03:00:00 24.74
6 2021-01-01 21:00:00 27.32
7 2021-01-02 21:00:00 27.38
8 2021-01-01 22:00:00 26.98
9 2021-01-02 22:00:00 26.96
10 2021-01-01 23:00:00 26.44
11 2021-01-02 23:00:00 26.85
12 2021-01-02 00:00:00 25.64
13 2021-01-03 00:00:00 25.94
Or use DataFrame.melt and then add the variable column, converted to timedeltas, to the dates:
df['date'] = pd.to_datetime(df['date'])

df1 = (df.melt('date', value_name='Price')
         .assign(date=lambda x: x['date'] +
                 pd.to_timedelta(x.pop('variable').astype(float), unit='h'))
         .sort_values('date', ignore_index=True))
print (df1)
date Price
0 2021-01-01 01:00:00 24.95
1 2021-01-01 02:00:00 24.35
2 2021-01-01 03:00:00 23.98
3 2021-01-01 21:00:00 27.32
4 2021-01-01 22:00:00 26.98
5 2021-01-01 23:00:00 26.44
6 2021-01-02 00:00:00 25.64
7 2021-01-02 01:00:00 25.59
8 2021-01-02 02:00:00 24.91
9 2021-01-02 03:00:00 24.74
10 2021-01-02 21:00:00 27.38
11 2021-01-02 22:00:00 26.96
12 2021-01-02 23:00:00 26.85
13 2021-01-03 00:00:00 25.94
I have the following data frame with hourly resolution
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 20.96
1 2017-01-01 20.90
2 2017-01-01 18.13
3 2017-01-01 16.03
4 2017-01-01 16.43
... ...
8756 2017-12-31 25.56
8757 2017-12-31 11.02
8758 2017-12-31 7.32
8759 2017-12-31 1.86
type(day_ahead_DK1)
Out[28]: pandas.core.frame.DataFrame
But the DateStamp column is missing the hours. How can I add 00:00:00 to 2017-01-01 at index 0 (so it becomes 2017-01-01 00:00:00), then 01:00:00 at index 1 (so it becomes 2017-01-01 01:00:00), and so on, so that all my days have hours from 0 to 23? Thank you!
The expected output:
day_ahead_DK1
Out[27]:
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
... ...
8756 2017-12-31 20:00:00 25.56
8757 2017-12-31 21:00:00 11.02
8758 2017-12-31 22:00:00 7.32
8759 2017-12-31 23:00:00 1.86
Use GroupBy.cumcount as a per-day hour counter, convert it with to_timedelta, and add it to the DateStamp column:
df['DateStamp'] = pd.to_datetime(df['DateStamp'])
df['DateStamp'] += pd.to_timedelta(df.groupby('DateStamp').cumcount(), unit='h')
print (df)
DateStamp DK1
0 2017-01-01 00:00:00 20.96
1 2017-01-01 01:00:00 20.90
2 2017-01-01 02:00:00 18.13
3 2017-01-01 03:00:00 16.03
4 2017-01-01 04:00:00 16.43
8756 2017-12-31 00:00:00 25.56
8757 2017-12-31 01:00:00 11.02
8758 2017-12-31 02:00:00 7.32
8759 2017-12-31 03:00:00 1.86
Note: this was run on the truncated sample above, where 2017-12-31 has only four rows, so its hours count up from 00:00:00; on the full data they run 20:00:00 through 23:00:00, matching the expected output.
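If the frame is known to be complete and in order (exactly one row per hour, days in sequence, no gaps), a simpler alternative is to assign a full hourly range directly. A sketch, valid only under that assumption:

import pandas as pd

# assumes one row per hour starting 2017-01-01 00:00, in order, with no gaps
df['DateStamp'] = pd.date_range('2017-01-01', periods=len(df), freq='h')

The cumcount approach above is safer, since it only relies on each day's rows appearing in hour order.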
I'm trying to execute conditional statements on time series data. Is there a way to set the "t_value" to zero if the time is between 1) 00:00:00 and 02:00:00, or 2) 04:00:00 and 06:00:00?
t_value
2019-11-24 00:00:00 4.0
2019-11-24 01:00:00 7.8
2019-11-24 02:00:00 95.1
2019-11-24 03:00:00 78.4
2019-11-24 04:00:00 8.0
2019-11-24 05:00:00 17.50
2019-11-24 06:00:00 55.00
2019-11-24 07:00:00 66.00
2019-11-25 00:00:00 21.00
2019-11-25 01:00:00 12.40
if-else and np.where are probable options, but I'm unsure how to implement the conditions on hours.
Use between_time to get the datetimes between the specified times, then use loc to assign the new values. I'll use @Ben.T's sample data:
df = pd.DataFrame({'t_value': range(1, 11)},
                  index=pd.date_range('2020-05-17 00:00:00', periods=10, freq='h'))

# get the time indices for the different ranges (endpoints are inclusive)
m1 = df.between_time('00:00:00', '02:00:00').index
m2 = df.between_time('04:00:00', '06:00:00').index

# assign 0 to the t_value column for the matches
# (Index | Index set union is deprecated in newer pandas; use .union instead)
df.loc[m1.union(m2), 't_value'] = 0
print(df)
t_value
2020-05-17 00:00:00 0
2020-05-17 01:00:00 0
2020-05-17 02:00:00 0
2020-05-17 03:00:00 4
2020-05-17 04:00:00 0
2020-05-17 05:00:00 0
2020-05-17 06:00:00 0
2020-05-17 07:00:00 8
2020-05-17 08:00:00 9
2020-05-17 09:00:00 10
You can access the times from your datetime index with .time and create a mask for each condition. Then use loc, combining the masks with | (or):
# sample data
df = pd.DataFrame({'t_value': range(1, 11)},
                  index=pd.date_range('2020-05-17 00:00:00', periods=10, freq='h'))

# masks
m1 = ((df.index.time >= pd.to_datetime('00:00:00').time())
      & (df.index.time <= pd.to_datetime('02:00:00').time()))
m2 = ((df.index.time >= pd.to_datetime('04:00:00').time())
      & (df.index.time <= pd.to_datetime('06:00:00').time()))

# set the value to 0
df.loc[m1 | m2, 't_value'] = 0
print (df)
t_value
2020-05-17 00:00:00 0
2020-05-17 01:00:00 0
2020-05-17 02:00:00 0
2020-05-17 03:00:00 4
2020-05-17 04:00:00 0
2020-05-17 05:00:00 0
2020-05-17 06:00:00 0
2020-05-17 07:00:00 8
2020-05-17 08:00:00 9
2020-05-17 09:00:00 10
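Since the timestamps here fall exactly on the hour, a compact variant (a sketch assuming minutes and seconds are always zero) is to mask on the integer hour instead of comparing time objects:

# hours 0-2 and 4-6 are exactly the two windows when minutes are always zero
m = df.index.hour.isin([0, 1, 2, 4, 5, 6])
df.loc[m, 't_value'] = 0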
I have two data sets, one with weekly date-times and the other with hourly date-times. My data sets look like this:
df1
Week_date w_values
21-04-2019 20:00:00 10
28-04-2019 20:00:00 20
05-05-2019 20:00:00 30
df2
hour_date h_values
19-04-2019 08:00:00 a
21-04-2019 07:00:00 b
21-04-2019 20:00:00 c
22-04-2019 06:00:00 d
23-04-2019 05:00:00 e
28-04-2019 19:00:00 f
28-04-2019 20:00:00 g
28-04-2019 21:00:00 h
29-04-2019 20:00:00 i
05-05-2019 20:00:00 j
06-05-2019 23:00:00 k
I tried merging but failed to get the desired output.
The output data set should look like this:
week_date w_values hour_date h_values
21-04-2019 20:00:00 10 21-04-2019 20:00:00 c
21-04-2019 20:00:00 10 22-04-2019 06:00:00 d
21-04-2019 20:00:00 10 23-04-2019 05:00:00 e
21-04-2019 20:00:00 10 28-04-2019 19:00:00 f
28-04-2019 20:00:00 20 28-04-2019 20:00:00 g
28-04-2019 20:00:00 20 28-04-2019 21:00:00 h
28-04-2019 20:00:00 20 29-04-2019 20:00:00 i
05-05-2019 20:00:00 30 05-05-2019 20:00:00 j
05-05-2019 20:00:00 30 06-05-2019 23:00:00 k
The weekly date should change only when the week date equals the hour date; otherwise each hourly row takes the previous week date.
Use the merge_asof function. From the pandas documentation: "This merge is similar to a left-join except that we match on nearest key rather than equal keys." Its default direction='backward' assigns each hourly row the most recent weekly row at or before it, which is exactly the carry-the-previous-week behaviour you describe.
# the dates are DD-MM-YYYY strings, so parse them day-first
df_week['Week_date'] = pd.to_datetime(df_week['Week_date'], dayfirst=True)
df_hour['hour_date'] = pd.to_datetime(df_hour['hour_date'], dayfirst=True)

# merge_asof requires both frames to be sorted on the key
df_week_sort = df_week.sort_values(by='Week_date')
df_hour_sort = df_hour.sort_values(by='hour_date')

df_week_sort.rename(columns={'Week_date': 'Merge_date'}, inplace=True)
df_hour_sort.rename(columns={'hour_date': 'Merge_date'}, inplace=True)

df_merged = pd.merge_asof(df_hour_sort, df_week_sort, on='Merge_date')
Make sure that the two frames are sorted by the date stamp. Hourly rows earlier than the first weekly date have no backward match, so they come out with NaN and can be dropped with dropna() if unwanted.
The following should do (provided Week_date and hour_date are parsed as datetimes; with DD-MM-YYYY strings, pass dayfirst=True to pd.to_datetime):
(df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
    .ffill()
    .dropna())
The way it works:
Make sure both dfs are sorted
>>> df1 = df1.sort_values('Week_date')
>>> df2 = df2.sort_values('hour_date')
Do the merge
>>> df3 = df2.merge(df1, how='left', right_on='Week_date', left_on='hour_date')
>>> df3
hour_date h_values Week_date w_values
0 2019-04-19 08:00:00 a NaT NaN
1 2019-04-21 07:00:00 b NaT NaN
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d NaT NaN
4 2019-04-23 05:00:00 e NaT NaN
5 2019-04-28 19:00:00 f NaT NaN
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h NaT NaN
8 2019-04-29 20:00:00 i NaT NaN
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-05-06 23:00:00 k NaT NaN
Forward fill the gaps
>>> df3 = df3.ffill()
>>> df3
hour_date h_values Week_date w_values
0 2019-04-19 08:00:00 a NaT NaN
1 2019-04-21 07:00:00 b NaT NaN
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d 2019-04-21 20:00:00 10.0
4 2019-04-23 05:00:00 e 2019-04-21 20:00:00 10.0
5 2019-04-28 19:00:00 f 2019-04-21 20:00:00 10.0
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h 2019-04-28 20:00:00 20.0
8 2019-04-29 20:00:00 i 2019-04-28 20:00:00 20.0
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-05-06 23:00:00 k 2019-05-05 20:00:00 30.0
Remove the remaining NaNs
>>> df3 = df3.dropna()
>>> df3
hour_date h_values Week_date w_values
2 2019-04-21 20:00:00 c 2019-04-21 20:00:00 10.0
3 2019-04-22 06:00:00 d 2019-04-21 20:00:00 10.0
4 2019-04-23 05:00:00 e 2019-04-21 20:00:00 10.0
5 2019-04-28 19:00:00 f 2019-04-21 20:00:00 10.0
6 2019-04-28 20:00:00 g 2019-04-28 20:00:00 20.0
7 2019-04-28 21:00:00 h 2019-04-28 20:00:00 20.0
8 2019-04-29 20:00:00 i 2019-04-28 20:00:00 20.0
9 2019-05-05 20:00:00 j 2019-05-05 20:00:00 30.0
10 2019-05-06 23:00:00 k 2019-05-05 20:00:00 30.0
I have the DataFrame below and I want to make an hourly-mean DataFrame,
with the condition that for every hour the mean is calculated from the 00:15:00~00:45:00 values.
date/time form a MultiIndex.
aaa
date time
2017-01-01 00:00:00 146.88
00:15:00 143.28
00:30:00 143.28
00:45:00 141.12
01:00:00 134.64
01:15:00 132.48
01:30:00 136.80
01:45:00 138.24
02:00:00 131.76
02:15:00 131.04
02:30:00 134.64
02:45:00 139.68
03:00:00 136.08
03:15:00 132.48
03:30:00 132.48
03:45:00 139.68
04:00:00 134.64
04:15:00 131.04
04:30:00 160.56
04:45:00 177.12
...
The results should be as below. How can I do it?
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
...
It seems you need only to select the rows whose times end with 00:00:
df2 = df1[df1.index.get_level_values(1).astype(str).str.endswith('00:00')]
print (df2)
aaa
date time
2017-01-01 00:00:00 146.88
01:00:00 134.64
02:00:00 131.76
03:00:00 136.08
04:00:00 134.64
But if you need the mean of only the 00:15-00:45 values, it is more complicated:
lvl1 = pd.Series(df1.index.get_level_values(1))

# mark the rows that are NOT on the hour
m = ~lvl1.astype(str).str.endswith('00:00')

# label each 15/30/45-minute row with the preceding full hour
lvl1new = lvl1.mask(m).ffill()

# keep that label only on the 15/30/45 rows; on-the-hour rows become NaN,
# so groupby will exclude them from the means
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       lvl1new.where(m)],
                                      names=df1.index.names)
print (df1)
aaa
date time
2017-01-01 NaN 146.88
00:00:00 143.28
00:00:00 143.28
00:00:00 141.12
NaN 134.64
01:00:00 132.48
01:00:00 136.80
01:00:00 138.24
NaN 131.76
02:00:00 131.04
02:00:00 134.64
02:00:00 139.68
NaN 136.08
03:00:00 132.48
03:00:00 132.48
03:00:00 139.68
NaN 134.64
04:00:00 131.04
04:00:00 160.56
04:00:00 177.12
# groupby drops the NaN time labels, leaving only the 00:15-00:45 means per hour
df = df1['aaa'].groupby(level=[0,1]).mean()
print (df)
date time
2017-01-01 00:00:00 142.56
01:00:00 135.84
02:00:00 135.12
03:00:00 134.88
04:00:00 156.24
Name: aaa, dtype: float64
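An alternative sketch, assuming the two index levels of the original df1 (before its index is relabelled) can be combined into a single DatetimeIndex: keep only the 15/30/45-minute readings with a minute mask, then resample hourly:

import pandas as pd

# combine the date and time levels into a single DatetimeIndex
idx = pd.to_datetime(df1.index.get_level_values(0).astype(str) + ' '
                     + df1.index.get_level_values(1).astype(str))
s = pd.Series(df1['aaa'].to_numpy(), index=idx)

# keep only the quarter-hour readings, then average within each hour
out = s[s.index.minute.isin([15, 30, 45])].resample('h').mean()
print(out)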