Suppose I have a multi-index Pandas data frame with two index levels: month_begin and month_end
import numpy as np
import pandas as pd
multi_index = pd.MultiIndex.from_tuples([("2022-03-01", "2022-03-31"),
("2022-04-01", "2022-04-30"),
("2022-05-01", "2022-05-31"),
("2022-06-01", "2022-06-30")])
multi_index.names = ['month_begin', 'month_end']
df = pd.DataFrame(np.random.rand(4,100), index=multi_index)
df
0 1 ... 98 99
month_begin month_end ...
2022-03-01 2022-03-31 0.322032 0.205307 ... 0.975128 0.673460
2022-04-01 2022-04-30 0.113813 0.278981 ... 0.951049 0.090765
2022-05-01 2022-05-31 0.777918 0.842734 ... 0.667831 0.274189
2022-06-01 2022-06-30 0.221407 0.555711 ... 0.745158 0.648246
I would like to resample the data to have the value in a month at every hour in the respective month:
0 1 ... 98 99
...
2022-03-01 00:00 0.322032 0.205307 ... 0.975128 0.673460
2022-03-01 01:00 0.322032 0.205307 ... 0.975128 0.673460
2022-03-01 02:00 0.322032 0.205307 ... 0.975128 0.673460
...
2022-06-30 22:00 0.221407 0.555711 ... 0.745158 0.648246
2022-06-30 23:00 0.221407 0.555711 ... 0.745158 0.648246
I know I can use resample(), but I am struggling with how to do this. Does anybody have a clue?
IIUC, try this using a list comprehension and explode with pd.date_range:
# One hourly timestamp range per (month_begin, month_end) pair
df['Date'] = [pd.date_range(s, e, freq='H') for s, e in df.index]
# Expand each range into rows and use the timestamps as the new index
df_out = df.explode('Date').set_index('Date')
Output:
0 1 ... 98 99
Date ...
2022-03-01 00:00:00 0.396311 0.138263 ... 0.637640 0.106366
2022-03-01 01:00:00 0.396311 0.138263 ... 0.637640 0.106366
2022-03-01 02:00:00 0.396311 0.138263 ... 0.637640 0.106366
2022-03-01 03:00:00 0.396311 0.138263 ... 0.637640 0.106366
2022-03-01 04:00:00 0.396311 0.138263 ... 0.637640 0.106366
... ... ... ... ... ...
2022-06-29 20:00:00 0.129921 0.654878 ... 0.619212 0.142297
2022-06-29 21:00:00 0.129921 0.654878 ... 0.619212 0.142297
2022-06-29 22:00:00 0.129921 0.654878 ... 0.619212 0.142297
2022-06-29 23:00:00 0.129921 0.654878 ... 0.619212 0.142297
2022-06-30 00:00:00 0.129921 0.654878 ... 0.619212 0.142297
[2836 rows x 100 columns]
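One caveat: as written, pd.date_range(s, e, freq='H') stops at month_end 00:00, so the last day of each month gets only a single row (which is why the output above ends at 2022-06-30 00:00:00 with 2836 rows rather than 2928). If you do want every hour up to 23:00 of month_end, as in the expected output, a minimal variation (assuming that is the intent) would be:
# Extend each range to 23:00 of month_end so the last day is fully covered
df['Date'] = [pd.date_range(s, pd.Timestamp(e) + pd.Timedelta(hours=23), freq='H')
              for s, e in df.index]
df_out = df.explode('Date').set_index('Date')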
I have two years' worth of data in a DataFrame called df, with an additional column called dayNo which labels which day of the year it is. See below:
Code which handles dayNo:
df['dayNo'] = pd.to_datetime(df['TradeDate'], dayfirst=True).dt.day_of_year
I would like to amend dayNo so that when 2023 begins, dayNo doesn't reset to 1, but continues to 366, 367 and so on. Expected output below:
Maybe a completely different approach will have to be taken from what I've done above. Any help greatly appreciated, thanks!
You could define a start day to count days from, and use the number of days from that point forward as your column. An example using self-generated data to illustrate the point:
df = pd.DataFrame({"dates": pd.date_range("2022-12-29", "2023-01-03", freq="8H")})
start = pd.Timestamp("2021-12-31")
df["dayNo"] = df["dates"].sub(start).dt.days
dates dayNo
0 2022-12-29 00:00:00 363
1 2022-12-29 08:00:00 363
2 2022-12-29 16:00:00 363
3 2022-12-30 00:00:00 364
4 2022-12-30 08:00:00 364
5 2022-12-30 16:00:00 364
6 2022-12-31 00:00:00 365
7 2022-12-31 08:00:00 365
8 2022-12-31 16:00:00 365
9 2023-01-01 00:00:00 366
10 2023-01-01 08:00:00 366
11 2023-01-01 16:00:00 366
12 2023-01-02 00:00:00 367
13 2023-01-02 08:00:00 367
14 2023-01-02 16:00:00 367
15 2023-01-03 00:00:00 368
You are nearly there with your solution; just apply this final adjustment:
df['dayNo'] = df['dayNo'].apply(lambda x: x if x >= df.loc[0].dayNo else x + df.loc[0].dayNo)
df
Out[108]:
dates TradeDate dayNo
0 2022-12-31 00:00:00 2022-12-31 365
1 2022-12-31 01:00:00 2022-12-31 365
2 2022-12-31 02:00:00 2022-12-31 365
3 2022-12-31 03:00:00 2022-12-31 365
4 2022-12-31 04:00:00 2022-12-31 365
.. ... ... ...
68 2023-01-02 20:00:00 2023-01-02 367
69 2023-01-02 21:00:00 2023-01-02 367
70 2023-01-02 22:00:00 2023-01-02 367
71 2023-01-02 23:00:00 2023-01-02 367
72 2023-01-03 00:00:00 2023-01-03 368
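If the frame is large, the same adjustment can be done without apply. A rough vectorized equivalent of the lambda above (a sketch only; like the lambda, it assumes the first row's dayNo is the offset to add):
import numpy as np
# Vectorized form of: x if x >= base else x + base
base = df.loc[0, 'dayNo']
df['dayNo'] = np.where(df['dayNo'] >= base, df['dayNo'], df['dayNo'] + base)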
Let's suppose we have a pandas dataframe built with the following script (inspired by Chrysophylaxs' dataframe):
import pandas as pd
df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})
The dataframe then contains dates from 2022 to 2030:
TradeDate
0 2022-12-29 00:00:00
1 2022-12-29 08:00:00
2 2022-12-29 16:00:00
3 2022-12-30 00:00:00
4 2022-12-30 08:00:00
... ...
7682 2030-01-01 16:00:00
7683 2030-01-02 00:00:00
7684 2030-01-02 08:00:00
7685 2030-01-02 16:00:00
7686 2030-01-03 00:00:00
[7687 rows x 1 columns]
I propose the following code, commented inline, to reach our target:
import pandas as pd
df = pd.DataFrame({'TradeDate': pd.date_range("2022-12-29", "2030-01-03", freq="8H")})
# Initialize Days counter
dyc = df['TradeDate'].iloc[0].dayofyear
# Initialize Previous day of Year
prv_dof = dyc
def func(row):
    global dyc, prv_dof
    # Get the day of the year
    dof = row.iloc[0].dayofyear
    # If new day then increment days counter
    if dof != prv_dof:
        dyc += 1
        prv_dof = dof
    return dyc
df['dayNo'] = df.apply(func, axis=1)
Resulting dataframe :
TradeDate dayNo
0 2022-12-29 00:00:00 363
1 2022-12-29 08:00:00 363
2 2022-12-29 16:00:00 363
3 2022-12-30 00:00:00 364
4 2022-12-30 08:00:00 364
... ... ...
7682 2030-01-01 16:00:00 2923
7683 2030-01-02 00:00:00 2924
7684 2030-01-02 08:00:00 2924
7685 2030-01-02 16:00:00 2924
7686 2030-01-03 00:00:00 2925
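For what it's worth, the same running counter can usually be produced without apply or globals. A sketch of a vectorized alternative, assuming TradeDate is sorted in ascending order:
# factorize() numbers each distinct calendar day 0, 1, 2, ... in order of
# appearance, so adding the first row's day-of-year reproduces the counter.
codes, _ = pd.factorize(df['TradeDate'].dt.normalize())
df['dayNo'] = codes + df['TradeDate'].iloc[0].dayofyear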
So I have a dataset that has electricity load over 24 hours:
Time_of_Day = loadData.groupby(loadData.index.hour).mean()
Time_of_Day
Time Load
2019-01-01 01:00:00 38.045
2019-01-01 02:00:00 30.675
2019-01-01 03:00:00 22.570
2019-01-01 04:00:00 22.153
2019-01-01 05:00:00 21.085
... ...
2019-12-31 20:00:00 65.565
2019-12-31 21:00:00 53.513
2019-12-31 22:00:00 49.096
2019-12-31 23:00:00 44.409
2020-01-01 00:00:00 45.744
How do I plot a random day (24 hrs) out of the 8760 hours, please?
With the following toy dataframe:
import pandas as pd
import random
df = pd.DataFrame({"Time": pd.date_range(start="1/1/2019", end="12/31/2019", freq="H")})
df["Load"] = [round(random.random() * 100, 2) for _ in range(df.shape[0])]
Time Load
0 2019-01-01 00:00:00 53.36
1 2019-01-01 01:00:00 34.20
2 2019-01-01 02:00:00 64.19
3 2019-01-01 03:00:00 89.18
4 2019-01-01 04:00:00 27.82
... ... ...
8732 2019-12-30 20:00:00 38.26
8733 2019-12-30 21:00:00 49.66
8734 2019-12-30 22:00:00 64.15
8735 2019-12-30 23:00:00 23.97
8736 2019-12-31 00:00:00 3.72
[8737 rows x 2 columns]
Here is one way to do it, using the choice function from the Python standard library's random module:
# In Jupyter cell
df[
(df["Time"].dt.month == random.choice(df["Time"].dt.month))
& (df["Time"].dt.day == random.choice(df["Time"].dt.day))
].plot(x="Time")
Output: a plot of Load against Time for the randomly selected day.
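Note that choosing the month and the day independently can occasionally give an empty selection (e.g. a day number drawn from one row combined with a month from another row in which that day does not exist). A variant that samples one actual date present in the data, as a sketch:
# Pick one calendar date that actually occurs in the data and plot that day
day = random.choice(df["Time"].dt.date.unique())
df[df["Time"].dt.date == day].plot(x="Time", y="Load")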
This is basically my code:
for index in range(scen_1['DateTime'].size):
    DateandTime = scen_1['DateTime'][index][1:3] + '-' + scen_1['DateTime'][index][4:6] + ' ' + scen_1['DateTime'][index][8:]
    if '24:00:00' in DateandTime:
        DateandTime = DateandTime.replace('24:00:00', '00:00:00')
    scen_1['DateTime'][index] = DateandTime
scen_1['Date'] = pd.to_datetime(scen_1['DateTime'], format='%m-%d %H:%M:%S')
So I get this result:
DateTime OutdoorTemp ... HeadOfficeOcc Date
0 01-01 00:15:00 NaN ... 0 1900-01-01 00:15:00
1 01-01 00:30:00 NaN ... 0 1900-01-01 00:30:00
2 01-01 00:45:00 NaN ... 0 1900-01-01 00:45:00
3 01-01 01:00:00 5.2875 ... 0 1900-01-01 01:00:00
4 01-01 01:15:00 NaN ... 0 1900-01-01 01:15:00
... ... ... ... ... ...
17371 06-30 23:00:00 19.9875 ... 0 1900-06-30 23:00:00
17372 06-30 23:15:00 NaN ... 0 1900-06-30 23:15:00
17373 06-30 23:30:00 NaN ... 0 1900-06-30 23:30:00
17374 06-30 23:45:00 NaN ... 0 1900-06-30 23:45:00
17375 06-30 00:00:00 17.8250 ... 0 1900-06-30 00:00:00
Any help is very much appreciated. I tried dt.date and dt.time and I don't know what else to try. Thanks!
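One thing that stands out in the output is that the rows originally stamped 24:00:00 end up at 00:00:00 of the same day (the last row shows 06-30 00:00:00 instead of rolling over to 07-01). If 24:00:00 is meant to be midnight of the following day, a possible vectorized rework of the loop above (a sketch, run on the raw DateTime strings instead of the loop) would be:
# Sketch: same slicing as the loop, but remember which rows said 24:00:00
# and push those timestamps forward by one day after parsing.
s = scen_1['DateTime'].astype(str)
formatted = s.str[1:3] + '-' + s.str[4:6] + ' ' + s.str[8:]
rollover = formatted.str.contains('24:00:00')
formatted = formatted.str.replace('24:00:00', '00:00:00')
scen_1['Date'] = pd.to_datetime(formatted, format='%m-%d %H:%M:%S')
scen_1.loc[rollover, 'Date'] += pd.Timedelta(days=1)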
To download German bank holidays via a web API and convert the JSON data into a pandas dataframe, I use the following code (Python 3):
import datetime
import requests
import pandas as pd
now = datetime.datetime.now()
year = now.year
URL = 'https://feiertage-api.de/api/?jahr=' + str(year)
r = requests.get(URL)
df = pd.DataFrame(r.json())
The goal is a pandas dataframe looking like (picture = section of the dataframe):
The Problem: "columns" are pandas.core.series.Series and I cannot figure out how to extract the date using various versions of
df['BW'].str.split(", ", n = 0, expand = True)
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
Please, can anyone help me to turn df into a "proper" dataframe with columns that only contain dates?
One approach would be to do df.applymap(lambda x: '' if pd.isna(x) else x['datum']):
In [21]: df.applymap(lambda x: '' if pd.isna(x) else x['datum'])
Out[21]:
BW BY BE BB HB ... SN ST SH TH NATIONAL
1. Weihnachtstag 2019-12-25 2019-12-25 2019-12-25 2019-12-25 2019-12-25 ... 2019-12-25 2019-12-25 2019-12-25 2019-12-25 2019-12-25
2. Weihnachtstag 2019-12-26 2019-12-26 2019-12-26 2019-12-26 2019-12-26 ... 2019-12-26 2019-12-26 2019-12-26 2019-12-26 2019-12-26
Allerheiligen 2019-11-01 2019-11-01 ...
Augsburger Friedensfest 2019-08-08 ...
Buß- und Bettag 2019-11-20 ... 2019-11-20
Christi Himmelfahrt 2019-05-30 2019-05-30 2019-05-30 2019-05-30 2019-05-30 ... 2019-05-30 2019-05-30 2019-05-30 2019-05-30 2019-05-30
Frauentag 2019-03-08 ...
Fronleichnam 2019-06-20 2019-06-20 ... 2019-06-20 2019-06-20
Gründonnerstag 2019-04-18 ...
Heilige Drei Könige 2019-01-06 2019-01-06 ... 2019-01-06
Karfreitag 2019-04-19 2019-04-19 2019-04-19 2019-04-19 2019-04-19 ... 2019-04-19 2019-04-19 2019-04-19 2019-04-19 2019-04-19
Mariä Himmelfahrt 2019-08-15 ...
Neujahrstag 2019-01-01 2019-01-01 2019-01-01 2019-01-01 2019-01-01 ... 2019-01-01 2019-01-01 2019-01-01 2019-01-01 2019-01-01
Ostermontag 2019-04-22 2019-04-22 2019-04-22 2019-04-22 2019-04-22 ... 2019-04-22 2019-04-22 2019-04-22 2019-04-22 2019-04-22
Ostersonntag 2019-04-21 ...
Pfingstmontag 2019-06-10 2019-06-10 2019-06-10 2019-06-10 2019-06-10 ... 2019-06-10 2019-06-10 2019-06-10 2019-06-10 2019-06-10
Pfingstsonntag 2019-06-09 ...
Reformationstag 2019-10-31 2019-10-31 2019-10-31 ... 2019-10-31 2019-10-31 2019-10-31 2019-10-31
Tag der Arbeit 2019-05-01 2019-05-01 2019-05-01 2019-05-01 2019-05-01 ... 2019-05-01 2019-05-01 2019-05-01 2019-05-01 2019-05-01
Tag der Deutschen Einheit 2019-10-03 2019-10-03 2019-10-03 2019-10-03 2019-10-03 ... 2019-10-03 2019-10-03 2019-10-03 2019-10-03 2019-10-03
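As a side note, if you are on pandas 2.1 or newer, applymap is deprecated in favour of the element-wise DataFrame.map, so the equivalent call would be:
# Element-wise map (pandas >= 2.1); same result as applymap above
df.map(lambda x: '' if pd.isna(x) else x['datum'])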
You could try to fix the shape of the input (i.e. the JSON response) before constructing the DataFrame, and then reshape as needed.
example:
import datetime
import requests
import pandas as pd
now = datetime.datetime.now()
year = now.year
URL = 'https://feiertage-api.de/api/?jahr=' + str(year)
r = requests.get(URL)
df = pd.DataFrame(
    [(k1, k2, k3, v3)
     for k1, v1 in r.json().items()
     for k2, v2 in v1.items()
     for k3, v3 in v2.items()]
)
df.head()
# Outputs:
0 1 2 3
0 BW Neujahrstag datum 2019-01-01
1 BW Neujahrstag hinweis
2 BW Heilige Drei Könige datum 2019-01-06
3 BW Heilige Drei Könige hinweis
4 BW Gründonnerstag datum 2019-04-18
# it is easier to see what is happening if we
# fix the column names
df.columns = ['State', 'Holiday', 'value_type', 'value']
pivoted = df[df.value_type == 'datum'].set_index(['Holiday', 'State']).value.unstack(-1)
pivoted.head()
# Outputs:
State BB BE BW ... SN ST TH
Holiday ...
1. Weihnachtstag 2019-12-25 2019-12-25 2019-12-25 ... 2019-12-25 2019-12-25 2019-12-25
2. Weihnachtstag 2019-12-26 2019-12-26 2019-12-26 ... 2019-12-26 2019-12-26 2019-12-26
Allerheiligen NaN NaN 2019-11-01 ... NaN NaN NaN
Augsburger Friedensfest NaN NaN NaN ... NaN NaN NaN
Buß- und Bettag NaN NaN NaN ... 2019-11-20 NaN NaN
[5 rows x 17 columns]
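The final reshape can also be spelled with pivot instead of set_index/unstack, if you prefer; the result should be the same since each (Holiday, State) pair occurs only once among the 'datum' rows:
pivoted = (df[df.value_type == 'datum']
           .pivot(index='Holiday', columns='State', values='value'))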
I have a multi-year timeseries with half-hourly resolution with some gaps and would like to impute them based on averages of the values of other years, but at the same time. E.g. if a value is missing at 2005-1-1 12:00, I'd like to take all the values at the same time, but from all other years and average them, then impute the missing value by the average. Here's what I got:
import pandas as pd
import numpy as np
idx = pd.date_range('2000-1-1', '2010-1-1', freq='30T')
df = pd.DataFrame({'somedata': np.random.rand(175345)}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = None
grouped = df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]).mean()
Which gives me the averages I need, but I don't know how to plug them back into the original timeseries.
You are almost there. Just use .transform to fill the NaNs.
import pandas as pd
import numpy as np
# your data
# ==================================================
np.random.seed(0)
idx = pd.date_range('2000-1-1', '2010-1-1', freq='30T')
df = pd.DataFrame({'somedata': np.random.rand(175345)}, index=idx)
df.loc[df['somedata'] > 0.7, 'somedata'] = np.nan
somedata
2000-01-01 00:00:00 0.5488
2000-01-01 00:30:00 NaN
2000-01-01 01:00:00 0.6028
2000-01-01 01:30:00 0.5449
2000-01-01 02:00:00 0.4237
2000-01-01 02:30:00 0.6459
2000-01-01 03:00:00 0.4376
2000-01-01 03:30:00 NaN
... ...
2009-12-31 20:30:00 0.4983
2009-12-31 21:00:00 0.4282
2009-12-31 21:30:00 NaN
2009-12-31 22:00:00 0.3306
2009-12-31 22:30:00 0.3021
2009-12-31 23:00:00 0.2077
2009-12-31 23:30:00 0.2965
2010-01-01 00:00:00 0.5183
[175345 rows x 1 columns]
# processing
# ==================================================
result = df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute], as_index=False).transform(lambda g: g.fillna(g.mean()))
somedata
2000-01-01 00:00:00 0.5488
2000-01-01 00:30:00 0.2671
2000-01-01 01:00:00 0.6028
2000-01-01 01:30:00 0.5449
2000-01-01 02:00:00 0.4237
2000-01-01 02:30:00 0.6459
2000-01-01 03:00:00 0.4376
2000-01-01 03:30:00 0.3957
... ...
2009-12-31 20:30:00 0.4983
2009-12-31 21:00:00 0.4282
2009-12-31 21:30:00 0.4784
2009-12-31 22:00:00 0.3306
2009-12-31 22:30:00 0.3021
2009-12-31 23:00:00 0.2077
2009-12-31 23:30:00 0.2965
2010-01-01 00:00:00 0.5183
[175345 rows x 1 columns]
# take a look at a particular sample
# ======================================
x = list(df.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]))[0][1]
somedata
2000-01-01 0.5488
2001-01-01 0.1637
2002-01-01 0.3245
2003-01-01 NaN
2004-01-01 0.5654
2005-01-01 0.5729
2006-01-01 0.4740
2007-01-01 0.1728
2008-01-01 0.2577
2009-01-01 NaN
2010-01-01 0.5183
x.mean() # output: 0.3998
list(result.groupby([df.index.month, df.index.day, df.index.hour, df.index.minute]))[0][1]
somedata
2000-01-01 0.5488
2001-01-01 0.1637
2002-01-01 0.3245
2003-01-01 0.3998
2004-01-01 0.5654
2005-01-01 0.5729
2006-01-01 0.4740
2007-01-01 0.1728
2008-01-01 0.2577
2009-01-01 0.3998
2010-01-01 0.5183
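A slightly shorter variant of the same idea is to compute the per-group means with transform('mean') (NaNs are skipped by default) and use them only where the original values are missing; a sketch with the same grouping keys:
# Fill NaNs with the group means instead of transforming every group via a lambda
group_keys = [df.index.month, df.index.day, df.index.hour, df.index.minute]
result = df.fillna(df.groupby(group_keys).transform('mean'))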