Choosing time from 2300-0000 for different days

I'm having an issue with the 23:00-00:00 times across different days in Python.
times A B C D
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-08 23:00:00 NaN 0.060930 0.059008 0.057293
2003-01-09 23:00:00 NaN 0.102374 0.101441 0.100743
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-10 23:00:00 NaN 0.207653 0.205911 0.202886
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
I mainly want to select the 00:00:00 row for each day, which is why I applied df = df.reset_index().groupby(df.index.date).first().set_index('times'). If that row doesn't exist, the 23:00:00 row of the previous day should be used as the 00:00:00 of the next day. The result I currently get is wrong:
times A B C D
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-09 23:00:00 NaN 0.102374 0.101441 0.100743
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
How do I get it to fall back to the 23:00:00 row of the previous day whenever the 00:00:00 row of the next day is missing, so that I get this result?
2003-01-08 00:00:00 NaN 0.086215 0.086135 0.090659
2003-01-08 23:00:00 NaN 0.060930 0.059008 0.057293
2003-01-10 00:00:00 NaN 0.078799 0.077739 0.076138
2003-01-11 00:00:00 NaN 0.203436 0.201588 0.197515
...
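One possible approach, sketched here under the assumption that the data is hourly and only the 00:00 and 23:00 rows matter: map every row to the midnight it can stand in for (a 00:00 row maps to itself, a 23:00 row to the following midnight), then keep one row per midnight and prefer the real 00:00 row. The helper columns target and is_midnight are illustrative names, not part of the original data, and only columns A and B of the sample are reproduced.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (columns A and B only).
times = pd.to_datetime([
    "2003-01-08 00:00", "2003-01-08 23:00", "2003-01-09 23:00",
    "2003-01-10 00:00", "2003-01-10 23:00", "2003-01-11 00:00",
])
df = pd.DataFrame(
    {"A": np.nan,
     "B": [0.086215, 0.060930, 0.102374, 0.078799, 0.207653, 0.203436]},
    index=pd.Index(times, name="times"),
)

# Keep only candidate rows (assumption: timestamps fall exactly on the hour).
cand = df[df.index.hour.isin([0, 23])].reset_index()

# The midnight each row can stand in for: 00:00 stays put,
# 23:00 is rounded up to the next day's midnight.
cand["target"] = cand["times"].dt.ceil("D")
cand["is_midnight"] = cand["times"] == cand["target"]

# One row per target midnight, preferring the genuine 00:00 row.
result = (
    cand.sort_values(["target", "is_midnight"], ascending=[True, False])
        .drop_duplicates("target", keep="first")
        .drop(columns=["target", "is_midnight"])
        .set_index("times")
)
print(result)
```

On the sample data this keeps 2003-01-08 00:00, 2003-01-08 23:00 (standing in for the missing 2003-01-09 00:00), 2003-01-10 00:00 and 2003-01-11 00:00.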

Related

Combining dataframes with differing dates column

I have a dataset of hourly prices where I have produced a dataframe that contains the minimum price from the previous day using:
df_min = df_hour_0[['Price_REG1', 'Price_REG2', 'Price_REG3',
'Price_REG4']].between_time('00:00', '23:00').resample('d').min()
This gives me:
Price_REG1 Price_REG2 Price_REG3 Price_REG4
date
2020-01-01 00:00:00 25.07 25.07 25.07 25.07
2020-01-02 00:00:00 12.07 12.07 12.07 12.07
2020-01-03 00:00:00 0.14 0.14 0.14 0.14
2020-01-04 00:00:00 3.83 3.83 3.83 3.83
2020-01-05 00:00:00 25.77 25.77 25.77 25.77
Now, I want to combine this df with 24 other df's, one for each hour (hour_0 below):
Price_REG1 Price_REG2 ... Price_24_3 Price_24_4
date ...
2020-01-01 00:00:00 30.83 30.83 ... NaN NaN
2020-01-02 00:00:00 24.81 24.81 ... 25.88 25.88
2020-01-03 00:00:00 24.39 24.39 ... 27.69 27.69
2020-01-04 00:00:00 22.04 22.04 ... 25.70 25.70
2020-01-05 00:00:00 25.77 25.77 ... 27.37 27.37
Which I do this way:
df_hour_0 = pd.concat([df_hour_0, df_min, df_max], axis=1)
This works fine for the df from the first hour, since the dates match. But for the other df's the timestamps are "2020-01-01 01:00:00", "2020-01-01 02:00:00" etc.
Since the dates don't match, pd.concat gives me twice as many observations, where every other observation is null:
Price_REG1 Price_REG2 ... Price_3_min Price_4_min
date ...
2020-01-01 00:00:00 NaN NaN ... NaN NaN
2020-01-01 01:00:00 28.78 28.78 ... NaN NaN
2020-01-02 00:00:00 NaN NaN ... 30.83 30.83
2020-01-02 01:00:00 12.07 12.07 ... NaN NaN
2020-01-03 00:00:00 NaN NaN ... 31.20 31.20
I tried to fix this by:
df_max = df_max.reset_index()
df_max = df_max.drop(['date'], axis=1)
But this gives me a similar problem: instead of every other row being null, the whole df_min is just appended at the bottom of the first df.
I want to keep the date; otherwise I guess I could reset the index in both df's and combine them by position instead of by date.
Thank you.

One option is to normalize each index to its date (midnight) before concatenating:
dfs = [df_hour_0, df_min, df_max]
pd.concat([d.set_axis(d.index.normalize()) for d in dfs], axis=1)
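To see the effect of normalizing on a small case, here is a minimal sketch with two made-up frames: df_hour_1 is stamped at 01:00 and df_min at 00:00. The frame contents and the column name Price_1_min are assumptions for illustration only.

```python
import pandas as pd

# Hour-1 prices, stamped at 01:00 on each day (illustrative values).
df_hour_1 = pd.DataFrame(
    {"Price_REG1": [28.78, 12.07]},
    index=pd.to_datetime(["2020-01-01 01:00", "2020-01-02 01:00"]),
)
# Daily minima, stamped at midnight (illustrative values).
df_min = pd.DataFrame(
    {"Price_1_min": [25.07, 12.07]},
    index=pd.to_datetime(["2020-01-01", "2020-01-02"]),
)

# Without normalizing, the indexes do not line up and rows interleave.
misaligned = pd.concat([df_hour_1, df_min], axis=1)

# After normalizing, both frames share one midnight-stamped index per day.
dfs = [df_hour_1, df_min]
combined = pd.concat([d.set_axis(d.index.normalize()) for d in dfs], axis=1)
print(combined)
```

Note that normalizing discards the hour from the index, so this fits best when each frame holds a single row per day.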

Replace nan with zero or linear interpolation

I have a dataset with a lot of NaNs and numeric values with the following form:
PV_Power
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 NaN
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 NaN
... ...
2017-12-31 20:00:00 NaN
2017-12-31 21:00:00 NaN
2017-12-31 22:00:00 NaN
2017-12-31 23:00:00 NaN
2018-01-01 00:00:00 NaN
What I need is to replace each NaN with 0 if it lies between other NaN values, or with the interpolated value if it lies between numeric values. How can I achieve that?

Use DataFrame.interpolate with limit_area='inside' to interpolate only between numeric values, then replace the remaining missing values with 0:
print (df)
PV_Power
date
2017-01-01 00:00:00 NaN
2017-01-01 01:00:00 4.0
2017-01-01 02:00:00 NaN
2017-01-01 03:00:00 NaN
2017-01-01 04:00:00 5.0
2017-01-01 05:00:00 NaN
2017-01-01 06:00:00 NaN
df = df.interpolate(limit_area='inside').fillna(0)
print (df)
PV_Power
date
2017-01-01 00:00:00 0.000000
2017-01-01 01:00:00 4.000000
2017-01-01 02:00:00 4.333333
2017-01-01 03:00:00 4.666667
2017-01-01 04:00:00 5.000000
2017-01-01 05:00:00 0.000000
2017-01-01 06:00:00 0.000000
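The example above can be reproduced end to end with the following self-contained sketch. It rebuilds the seven-row sample, interpolates linearly only inside the span bounded by valid values, and fills the leading and trailing NaNs with 0. The variable name filled is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame: hourly timestamps starting 2017-01-01 00:00.
df = pd.DataFrame(
    {"PV_Power": [np.nan, 4.0, np.nan, np.nan, 5.0, np.nan, np.nan]},
    index=pd.date_range("2017-01-01", periods=7, freq=pd.Timedelta(hours=1)),
)
df.index.name = "date"

# limit_area='inside' only fills NaNs surrounded by valid values;
# NaNs before the first or after the last valid value are left alone
# and then set to 0 by fillna.
filled = df.interpolate(limit_area="inside").fillna(0)
print(filled)
```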

You could drop the NaN rows and reindex with a fill value (this sets every NaN to 0, without interpolating):
idx = df.index
df = df.dropna().reindex(idx, fill_value=0)
or just set values where PV_Power is NaN:
df.loc[pd.isna(df.PV_Power), ["PV_Power"]] = 0

You can use fillna(0):
df['PV_Power'] = df['PV_Power'].fillna(0)
or you can replace the values (this needs import numpy as np):
df['PV_Power'] = df['PV_Power'].replace(np.nan, 0)

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date':['11/10/2017 0:00','11/10/2017 03:00','11/10/2017 06:00','11/10/2017 09:00','11/10/2017 12:00',
'11/11/2017 0:00','11/11/2017 03:00','11/11/2017 06:00','11/11/2017 09:00','11/11/2017 12:00',
'11/12/2017 00:00','11/12/2017 03:00','11/12/2017 06:00','11/12/2017 09:00','11/12/2017 12:00'],
'value':[850,np.nan,np.nan,np.nan,np.nan,500,650,780,np.nan,800,350,690,780,np.nan,np.nan],
'consecutive_hour': [ 3,0,0,0,0,3,6,9,0,3,3,6,9,0,0]})
With this DataFrame, I want a third column, consecutive_hour, such that if the value at a particular timestamp is less than 1000, the row gets 3 hours, and consecutive such occurrences accumulate to 6, 9 and so on, as above.
Lastly, I want to summarize the table by counting how often each consecutive-hours run occurs, such that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours':[3,6,9,12],
'number_of_day':[2,0,2,0]})
I tried several online solutions and methods like shift(), diff() etc., as mentioned in "How to groupby consecutive values in pandas DataFrame" and elsewhere, and spent several days on it, but no luck yet.
I would highly appreciate help on this issue.
Thanks!

Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by @jezrael:
Python pandas cumsum with reset everytime there is a 0
cumcount_reset = \
lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
.groupby(pd.Grouper(freq="D")) \
.apply(lambda b: cumcount_reset(b)).mul(3) \
.reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"] \
.apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
"consecutive_hour"] \
.value_counts().reindex([3, 6, 9, 12], fill_value=0) \
.rename("number_of_day") \
.rename_axis("consecutive_hour") \
.reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
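As an alternative sketch of the same idea, the run counter and the summary can be built without groupby.apply, whose index handling can differ between pandas versions. This is not the answer's method: it labels each run with a counter that increases on every row that fails the condition, then takes a cumulative sum within each day and run. The names below, day, run_id, nxt and run_end are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the sample: three days, observations every 3 hours up to 12:00.
dates = [f"2017-11-{d} {h:02d}:00" for d in (10, 11, 12) for h in (0, 3, 6, 9, 12)]
df = pd.DataFrame({
    "date": pd.to_datetime(dates),
    "value": [850, np.nan, np.nan, np.nan, np.nan,
              500, 650, 780, np.nan, 800,
              350, 690, 780, np.nan, np.nan],
})

below = df["value"] < 1000            # NaN compares as False
day = df["date"].dt.normalize()       # runs restart on each new day
run_id = (~below).cumsum()            # increases on every failing row

# Cumulative count of hits within each (day, run), times 3 hours per step.
df["consecutive_hour"] = below.astype(int).groupby([day, run_id]).cumsum() * 3

# A run ends where the counter is larger than the next value of the same day.
nxt = df.groupby(day)["consecutive_hour"].shift(-1).fillna(0)
run_end = df["consecutive_hour"] > nxt

df_summary = (
    df.loc[run_end, "consecutive_hour"]
      .value_counts()
      .reindex([3, 6, 9, 12], fill_value=0)
      .rename_axis("consecutive_hour")
      .reset_index(name="number_of_day")
)
print(df_summary)
```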

count contiguous NaN values by unique values

I have contiguous periods of NaN values per code. I want to count the NaN values in each contiguous period per code, and I also want the start and end date of each contiguous period of NaN values.
df :
CODE TMIN
1998-01-01 00:00:00 12 2.5
1999-01-01 00:00:00 12 NaN
2000-01-01 00:00:00 12 NaN
2001-01-01 00:00:00 12 2.2
2002-01-01 00:00:00 12 NaN
1998-01-01 00:00:00 41 NaN
1999-01-01 00:00:00 41 NaN
2000-01-01 00:00:00 41 5.0
2001-01-01 00:00:00 41 9.0
2002-01-01 00:00:00 41 8.0
1998-01-01 00:00:00 52 2.0
1999-01-01 00:00:00 52 NaN
2000-01-01 00:00:00 52 NaN
2001-01-01 00:00:00 52 NaN
2002-01-01 00:00:00 52 1.0
1998-01-01 00:00:00 91 NaN
Expected results :
Start_Date End date CODE number of contiguous missing values
1999-01-01 00:00:00 2000-01-01 00:00:00 12 2
2002-01-01 00:00:00 2002-01-01 00:00:00 12 1
1998-01-01 00:00:00 1999-01-01 00:00:00 41 2
1999-01-01 00:00:00 2001-01-01 00:00:00 52 3
1998-01-01 00:00:00 1998-01-01 00:00:00 91 1
How can I solve this? Thanks!

You can try grouping by the cumulative sum of the non-null flags, which stays constant across each run of NaNs:
df['group'] = df.TMIN.notna().cumsum()
(df[df.TMIN.isna()]
.groupby(['group','CODE'])
.agg(Start_Date=('group', lambda x: x.index.min()),
End_Date=('group', lambda x: x.index.max()),
cont_missing=('TMIN', 'size')
)
)
Output:
Start_Date End_Date cont_missing
group CODE
1 12 1999-01-01 00:00:00 2000-01-01 00:00:00 2
2 12 2002-01-01 00:00:00 2002-01-01 00:00:00 1
41 1998-01-01 00:00:00 1999-01-01 00:00:00 2
6 52 1999-01-01 00:00:00 2001-01-01 00:00:00 3
7 91 1998-01-01 00:00:00 1998-01-01 00:00:00 1
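A self-contained variant of this approach, rebuilt on the question's sample, might look as follows. It moves the dates into a regular column so that plain 'min' and 'max' aggregations can be used instead of lambdas on the index. The column names date and run_id are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Rebuild the sample: yearly rows for codes 12, 41, 52 and one row for 91.
years = pd.to_datetime([f"{y}-01-01" for y in range(1998, 2003)])
df = pd.DataFrame(
    {"CODE": [12] * 5 + [41] * 5 + [52] * 5 + [91],
     "TMIN": [2.5, np.nan, np.nan, 2.2, np.nan,
              np.nan, np.nan, 5.0, 9.0, 8.0,
              2.0, np.nan, np.nan, np.nan, 1.0,
              np.nan]},
    index=list(years) * 3 + [years[0]],
)
df.index.name = "date"

tmp = df.reset_index()
# The counter only increases on valid values, so it is constant within a NaN run.
tmp["run_id"] = tmp["TMIN"].notna().cumsum()

gaps = (
    tmp[tmp["TMIN"].isna()]
    .groupby(["CODE", "run_id"], sort=False)   # CODE splits runs that cross codes
    .agg(Start_Date=("date", "min"),
         End_Date=("date", "max"),
         cont_missing=("date", "size"))
    .reset_index()
    .drop(columns="run_id")
)
print(gaps)
```

Grouping by CODE as well as the counter matters here: the NaN at the end of code 12 and the NaNs at the start of code 41 share the same counter value but belong to different periods.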

start time series data when there are no NaN's

I currently have some time series data that I applied a rolling mean on with a window of 17520.
Thus before the head of my data looked like this:
SETTLEMENTDATE ==
0 2006/01/01 00:30:00 8013.27833 ... 5657.67500 20.03
1 2006/01/01 01:00:00 7726.89167 ... 5460.39500 18.66
2 2006/01/01 01:30:00 7372.85833 ... 5766.02500 20.38
3 2006/01/01 02:00:00 7071.83333 ... 5503.25167 18.59
4 2006/01/01 02:30:00 6865.44000 ... 5214.01500 17.53
And now it looks like this:
SETTLEMENTDATE =
0 2006/01/01 00:30:00 NaN ... NaN NaN
1 2006/01/01 01:00:00 NaN ... NaN NaN
2 2006/01/01 01:30:00 NaN ... NaN NaN
3 2006/01/01 02:00:00 NaN ... NaN NaN
4 2006/01/01 02:30:00 NaN ... NaN NaN
How can I make my data begin only where there are no NaNs (while making sure the dates still match)?

You can try rolling with min_periods=1:
data['NSW DEMAND'] = data['NSW DEMAND'].rolling(17520, min_periods=1).mean()
You can also use a for loop, so you do not need to write out the columns one by one:
youcols = ['xxx', ..., 'xxx1']  # placeholder: list your column names here
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
Based on your comments:
for x in youcols:
    data[x] = data[x].rolling(17520, min_periods=1).mean()
then:
data = data.dropna(subset=youcols, thresh=1)
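If the goal is instead to keep only full-window averages and have the data start at the first row where one exists, a hedged sketch is to roll with the default min_periods (equal to the window) and then drop the leading NaN rows. The example uses a window of 3 as a stand-in for 17520 and made-up demand values; only the column names SETTLEMENTDATE and NSW DEMAND come from the question and answer.

```python
import pandas as pd

window = 3  # stand-in for 17520 so the effect is visible on six rows

# Illustrative half-hourly data (values are made up).
data = pd.DataFrame({
    "SETTLEMENTDATE": pd.date_range("2006-01-01 00:30", periods=6, freq="30min"),
    "NSW DEMAND": [8013.0, 7726.0, 7372.0, 7071.0, 6865.0, 6700.0],
})
cols = ["NSW DEMAND"]

rolled = data.copy()
# Default min_periods equals the window, so the first window - 1 rows are NaN.
rolled[cols] = data[cols].rolling(window).mean()

# Drop the incomplete leading rows; each date stays attached to its own row.
trimmed = rolled.dropna(subset=cols).reset_index(drop=True)
print(trimmed)
```

Because the rows are dropped from the whole frame at once, the SETTLEMENTDATE column stays aligned with the rolling means.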
