Pandas: complex condition on datetime - python

I have a dataframe with a datetime type column and a float type column.
date value
0 2010-01-01 01:23:00 21.2
1 2010-01-02 01:33:00 63.4
2 2010-01-03 06:02:00 80.6
3 2010-01-04 06:05:00 50.1
4 2010-01-05 06:20:00 346.5
5 2010-01-06 07:44:00 111.8
6 2010-01-07 08:00:00 113.1
7 2010-01-08 08:22:00 10.6
8 2010-01-09 09:00:00 287.2
9 2010-01-10 09:14:00 1652.6
I want to create a new column recording the mean of the values that fall within the hour before each row's timestamp.
[UPDATE] Example:
If the current row is 4 2010-01-05 06:20:00 346.5, I need to calculate (50.1 + 80.6) / 2, i.e. the mean of the values in the range 2010-01-05 05:20:00 to 2010-01-05 06:20:00.
date value before_1hr_mean
4 2010-01-05 06:20:00 346.5 65.35
I solved this with iterrows() as in the following code, but this method is really slow, and iterrows() is generally discouraged in pandas.
[UPDATE]
df['before_1hr_mean'] = np.nan
for index, row in df.iterrows():
    df.loc[index, 'before_1hr_mean'] = df[(df['date'] < row['date']) & \
        (df['date'] >= row['date'] - pd.Timedelta(hours=1))]['value'].mean()
Is there a better way to deal with this situation?

I took the liberty of changing your data to make it all the same day. It's the only way I could make sense of your question.
df.join(
    df.set_index('date').value.rolling('H').mean().rename('before_1hr_mean'),
    on='date'
)
date value before_1hr_mean
0 2010-01-01 01:23:00 21.2 21.200000
1 2010-01-01 01:33:00 63.4 42.300000
2 2010-01-01 06:02:00 80.6 80.600000
3 2010-01-01 06:05:00 50.1 65.350000
4 2010-01-01 06:20:00 346.5 159.066667
5 2010-01-01 07:44:00 111.8 111.800000
6 2010-01-01 08:00:00 113.1 112.450000
7 2010-01-01 08:22:00 10.6 78.500000
8 2010-01-01 09:00:00 287.2 148.900000
9 2010-01-01 09:14:00 1652.6 650.133333
If you want to exclude the current row, you have to track the sum and count of the rolling hour and back out what the average is after adjusting for the current value.
s = df.set_index('date')
sagg = s.rolling('H').agg(['sum', 'count']).value.rename(columns=str.title)
agged = df.join(sagg, on='date')
agged
date value Sum Count
0 2010-01-01 01:23:00 21.2 21.2 1.0
1 2010-01-01 01:33:00 63.4 84.6 2.0
2 2010-01-01 06:02:00 80.6 80.6 1.0
3 2010-01-01 06:05:00 50.1 130.7 2.0
4 2010-01-01 06:20:00 346.5 477.2 3.0
5 2010-01-01 07:44:00 111.8 111.8 1.0
6 2010-01-01 08:00:00 113.1 224.9 2.0
7 2010-01-01 08:22:00 10.6 235.5 3.0
8 2010-01-01 09:00:00 287.2 297.8 2.0
9 2010-01-01 09:14:00 1652.6 1950.4 3.0
Then do some math and assign a new column:
df.assign(before_1hr_mean=agged.eval('(Sum - value) / (Count - 1)'))
date value before_1hr_mean
0 2010-01-01 01:23:00 21.2 NaN
1 2010-01-01 01:33:00 63.4 21.20
2 2010-01-01 06:02:00 80.6 NaN
3 2010-01-01 06:05:00 50.1 80.60
4 2010-01-01 06:20:00 346.5 65.35
5 2010-01-01 07:44:00 111.8 NaN
6 2010-01-01 08:00:00 113.1 111.80
7 2010-01-01 08:22:00 10.6 112.45
8 2010-01-01 09:00:00 287.2 10.60
9 2010-01-01 09:14:00 1652.6 148.90
Notice that you get nulls when there isn't an hour's worth of prior data to calculate over.
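As a side note, newer pandas versions also accept a closed parameter on time-based rolling windows, which can likely express the exclude-the-current-row case in one pass. A minimal sketch, assuming the dates are sorted:
s = df.set_index('date')['value']
# closed='left' makes each window [t - 1h, t), excluding the row at t itself
df['before_1hr_mean'] = s.rolling('1H', closed='left').mean().to_numpy()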

Drop all rows for the month if a column has more than one value that crossed the threshold

I have a dataframe with time data in the format:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999
3 2013-01-01 03:00:00 -9999
4 2013-01-01 04:00:00 0.0
.. ... ...
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999
8759 2016-12-31 23:00:00 0.0
Suppose the value -9999 appears 200 times in the month of January and the threshold is 150. In that case the entire month of January must be deleted, i.e. all of its rows dropped. Counting the repeats per month gives:
date repeated
1 2013-02 0
2 2013-03 2
4 2013-05 0
5 2013-06 0
6 2013-07 66
7 2013-08 0
8 2013-09 7
With this I think I can drop the rows that repeat, but I want to drop the whole month.
import numpy as np
df['month'] = df['date'].dt.to_period('M')
df['new_value'] = np.where((df['values'] == -9999) & (df['n_missing'] > 150), np.nan, df['values'])
df.dropna()
How can I do that?
One way is to use pandas.to_datetime with pandas.DataFrame.groupby.filter.
Here's a sample with months that have -9999 repeated 2, 1, 0, 2 times each:
date values
0 2013-01-01 00:00:00 0.0
1 2013-01-01 01:00:00 0.0
2 2013-01-01 02:00:00 -9999.0
3 2013-01-01 03:00:00 -9999.0
4 2013-01-01 04:00:00 0.0
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
8754 2016-12-31 18:00:00 427.5
8755 2016-12-31 19:00:00 194.9
8756 2016-12-31 20:00:00 -9999.0
8757 2016-12-31 21:00:00 237.6
8758 2016-12-31 22:00:00 -9999.0
8759 2016-12-31 23:00:00 0.0
Then we do filtering:
date = pd.to_datetime(df["date"]).dt.strftime("%Y-%m")
new_df = df.groupby(date).filter(lambda x: x["values"].eq(-9999).sum() < 2)
print(new_df)
Output:
date values
5 2013-02-01 12:00:00 -9999.0
6 2013-03-01 12:00:00 0.0
You can see the months with 2 or more repeats are deleted.
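Applied back to the original problem with its threshold of 150, the same pattern would look something like this sketch (assuming the date column is parseable as datetimes):
month = pd.to_datetime(df["date"]).dt.to_period("M")
# keep only the months in which -9999 occurs at most 150 times
new_df = df.groupby(month).filter(lambda g: g["values"].eq(-9999).sum() <= 150)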

Mapping ranges of date in pandas dataframe

I would like to map values defined in a dictionary of date: value into a DataFrame of dates.
Consider the following example:
import pandas as pd
df = pd.DataFrame(range(19), index=pd.date_range(start="2010-01-01", end="2010-01-10", freq="12H"))
dct = {
    "2009-01-01": 1,
    "2010-01-05": 2,
    "2020-01-01": 3,
}
I would like to get something like this:
df
0 test
2010-01-01 00:00:00 0 1.0
2010-01-01 12:00:00 1 1.0
2010-01-02 00:00:00 2 1.0
2010-01-02 12:00:00 3 1.0
2010-01-03 00:00:00 4 1.0
2010-01-03 12:00:00 5 1.0
2010-01-04 00:00:00 6 1.0
2010-01-04 12:00:00 7 1.0
2010-01-05 00:00:00 8 2.0
2010-01-05 12:00:00 9 2.0
2010-01-06 00:00:00 10 2.0
2010-01-06 12:00:00 11 2.0
2010-01-07 00:00:00 12 2.0
2010-01-07 12:00:00 13 2.0
2010-01-08 00:00:00 14 2.0
2010-01-08 12:00:00 15 2.0
2010-01-09 00:00:00 16 2.0
2010-01-09 12:00:00 17 2.0
2010-01-10 00:00:00 18 2.0
I have tried the following, but I get only NaN values:
df["test"] = pd.Series(df.index.map(dct), index=df.index).ffill()
Any suggestions?
There are missing values because the types do not match: the dictionary keys are strings, while the DataFrame has datetimes in a DatetimeIndex, and mapping needs the same type on both sides. So build a helper Series from the dictionary with datetime keys, and use Series.asfreq to fill in the dates in between:
dct = {
    "2009-01-01": 1,
    "2010-01-05": 2,
    "2020-01-01": 3,
}
s = pd.Series(dct).rename(lambda x: pd.to_datetime(x)).asfreq('d', method='ffill')
df["test"] = df.index.to_series().dt.normalize().map(s)
print(df)
0 test
2010-01-01 00:00:00 0 1
2010-01-01 12:00:00 1 1
2010-01-02 00:00:00 2 1
2010-01-02 12:00:00 3 1
2010-01-03 00:00:00 4 1
2010-01-03 12:00:00 5 1
2010-01-04 00:00:00 6 1
2010-01-04 12:00:00 7 1
2010-01-05 00:00:00 8 2
2010-01-05 12:00:00 9 2
2010-01-06 00:00:00 10 2
2010-01-06 12:00:00 11 2
2010-01-07 00:00:00 12 2
2010-01-07 12:00:00 13 2
2010-01-08 00:00:00 14 2
2010-01-08 12:00:00 15 2
2010-01-09 00:00:00 16 2
2010-01-09 12:00:00 17 2
2010-01-10 00:00:00 18 2
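An alternative sketch uses pd.merge_asof, which matches every timestamp to the most recent dictionary date at or before it and avoids materializing a daily series spanning 2009 to 2020; it assumes both frames are sorted by date, which holds here:
# direction='backward' (the default) picks the last key <= each timestamp
keys = pd.DataFrame({"date": pd.to_datetime(list(dct)), "test": list(dct.values())})
out = pd.merge_asof(df.rename_axis("date").reset_index(), keys, on="date")
Here out is a flat frame carrying the matched test column alongside the original data, rather than an in-place assignment.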

Add a column with the hourly difference of the Datetime Index [duplicate]

I have a DataFrame with a DatetimeIndex, and I need to create a column that contains the difference in time between consecutive rows of the index, expressed in hours. This is what I have:
Datetime Numbers
2020-11-27 08:30:00 1
2020-11-27 13:00:00 2
2020-11-27 15:15:00 3
2020-11-27 20:45:00 4
2020-11-28 08:45:00 5
2020-11-28 10:45:00 6
2020-12-01 04:00:00 7
2020-12-01 08:15:00 8
2020-12-01 12:45:00 9
2020-12-01 14:45:00 10
2020-12-01 17:15:00 11
...
This is what I need:
Datetime Numbers Delta
2020-11-27 08:30:00 1 NaN
2020-11-27 13:00:00 2 4.5
2020-11-27 15:15:00 3 2.25
2020-11-27 20:45:00 4 5.5
2020-11-28 08:45:00 5 12
2020-11-28 10:45:00 6 2
2020-12-01 04:00:00 7 65.25
2020-12-01 08:15:00 8 4.25
2020-12-01 12:45:00 9 4.5
2020-12-01 14:45:00 10 2
2020-12-01 17:15:00 11 2.5
...
The DataFrame has thousands of rows, so I can't use a for loop. Thanks in advance!
EDIT: I found a solution:
import numpy as np
df = df.reset_index()
df['Time'] = df['Datetime'].astype(np.int64) // 10**9  # nanoseconds to seconds since the epoch
df['Delta'] = df['Time'].diff() / 3600  # seconds to hours
df.drop(columns=['Time'], inplace=True)
df.set_index('Datetime', inplace=True)
I assume that Datetime is set as index:
df.reset_index(inplace=True)
df['Delta'] = df['Datetime'].diff().dt.total_seconds()/3600
df.set_index('Datetime', inplace=True)
OUTPUT:
Numbers Delta
Datetime
2020-11-27 08:30:00 1 NaN
2020-11-27 13:00:00 2 4.50
2020-11-27 15:15:00 3 2.25
2020-11-27 20:45:00 4 5.50
2020-11-28 08:45:00 5 12.00
2020-11-28 10:45:00 6 2.00
2020-12-01 04:00:00 7 65.25
2020-12-01 08:15:00 8 4.25
2020-12-01 12:45:00 9 4.50
2020-12-01 14:45:00 10 2.00
2020-12-01 17:15:00 11 2.50
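A minimal variant that skips the reset_index round trip, assuming Datetime stays as the index:
# diff the index directly; dividing a timedelta by pd.Timedelta(hours=1) yields hours as floats
df['Delta'] = df.index.to_series().diff() / pd.Timedelta(hours=1)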

Filtering out another dataframe based on selected hours

I'm trying to filter my dataframe down to a 3-hourly frequency, keeping only the timestamps at 0000hr, 0300hr, 0600hr, 0900hr, 1200hr, 1500hr, 1800hr, 2100hr, and so on.
A sample of my dataframe would look like this
Time A
2019-05-25 03:54:00 1
2019-05-25 03:57:00 2
2019-05-25 04:00:00 3
...
2020-05-25 03:54:00 4
2020-05-25 03:57:00 5
2020-05-25 04:00:00 6
Desired output:
Time A
2019-05-25 06:00:00 1
2019-05-25 09:00:00 2
2019-05-25 12:00:00 3
...
2020-05-25 00:00:00 4
2020-05-25 03:00:00 5
2020-05-25 06:00:00 6
2020-05-25 09:00:00 6
2020-05-25 12:00:00 6
2020-05-25 15:00:00 6
2020-05-25 18:00:00 6
2020-05-25 21:00:00 6
2020-05-26 00:00:00 6
...
You can define a date range with a 3-hour interval using pd.date_range() and then filter your dataframe with .loc and isin(), as follows:
date_rng_3H = pd.date_range(start=df['Time'].dt.date.min(), end=df['Time'].dt.date.max() + pd.DateOffset(days=1), freq='3H')
df_out = df.loc[df['Time'].isin(date_rng_3H)]
Input data:
date_rng = pd.date_range(start='2019-05-25 03:54:00', end='2020-05-25 04:00:00', freq='3T')
np.random.seed(123)
df = pd.DataFrame({'Time': date_rng, 'A': np.random.randint(1, 6, len(date_rng))})
Time A
0 2019-05-25 03:54:00 3
1 2019-05-25 03:57:00 5
2 2019-05-25 04:00:00 3
3 2019-05-25 04:03:00 2
4 2019-05-25 04:06:00 4
... ... ...
175678 2020-05-25 03:48:00 2
175679 2020-05-25 03:51:00 1
175680 2020-05-25 03:54:00 2
175681 2020-05-25 03:57:00 2
175682 2020-05-25 04:00:00 1
175683 rows × 2 columns
Output:
print(df_out)
Time A
42 2019-05-25 06:00:00 4
102 2019-05-25 09:00:00 2
162 2019-05-25 12:00:00 1
222 2019-05-25 15:00:00 3
282 2019-05-25 18:00:00 5
... ... ...
175422 2020-05-24 15:00:00 1
175482 2020-05-24 18:00:00 5
175542 2020-05-24 21:00:00 2
175602 2020-05-25 00:00:00 3
175662 2020-05-25 03:00:00 3
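An alternative sketch tests the timestamp components directly instead of building the full 3-hourly date range:
# keep rows that land exactly on a 3-hour boundary
mask = (df['Time'].dt.hour % 3 == 0) & (df['Time'].dt.minute == 0) & (df['Time'].dt.second == 0)
df_out = df.loc[mask]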

How to find occurrence of consecutive events in python timeseries data frame?

I have got a time series of meteorological observations with date and value columns:
df = pd.DataFrame({'date': ['11/10/2017 0:00', '11/10/2017 03:00', '11/10/2017 06:00', '11/10/2017 09:00', '11/10/2017 12:00',
                            '11/11/2017 0:00', '11/11/2017 03:00', '11/11/2017 06:00', '11/11/2017 09:00', '11/11/2017 12:00',
                            '11/12/2017 00:00', '11/12/2017 03:00', '11/12/2017 06:00', '11/12/2017 09:00', '11/12/2017 12:00'],
                   'value': [850, np.nan, np.nan, np.nan, np.nan, 500, 650, 780, np.nan, 800, 350, 690, 780, np.nan, np.nan],
                   'consecutive_hour': [3, 0, 0, 0, 0, 3, 6, 9, 0, 3, 3, 6, 9, 0, 0]})
With this DataFrame, I want a third column consecutive_hour such that whenever the value at a timestamp is below 1000 it counts as 3 hours, and consecutive such occurrences accumulate to 6, 9, and so on, as shown above (the desired column is already included in the constructor for reference).
Lastly, I want to summarize the table by counting, for each consecutive-hours value, the number of days on which such a run occurred, so that the summary table looks like:
df_summary = pd.DataFrame({'consecutive_hours': [3, 6, 9, 12],
                           'number_of_day': [2, 0, 2, 0]})
I tried several online solutions and methods like shift(), diff(), etc., as mentioned in How to groupby consecutive values in pandas DataFrame and more. I have spent several days on this but no luck yet.
I would highly appreciate help on this issue.
Thanks!
Input data:
>>> df
date value
0 2017-11-10 00:00:00 850.0
1 2017-11-10 03:00:00 NaN
2 2017-11-10 06:00:00 NaN
3 2017-11-10 09:00:00 NaN
4 2017-11-10 12:00:00 NaN
5 2017-11-11 00:00:00 500.0
6 2017-11-11 03:00:00 650.0
7 2017-11-11 06:00:00 780.0
8 2017-11-11 09:00:00 NaN
9 2017-11-11 12:00:00 800.0
10 2017-11-12 00:00:00 350.0
11 2017-11-12 03:00:00 690.0
12 2017-11-12 06:00:00 780.0
13 2017-11-12 09:00:00 NaN
14 2017-11-12 12:00:00 NaN
The cumcount_reset function is adapted from this answer by jezrael:
Python pandas cumsum with reset everytime there is a 0
cumcount_reset = \
    lambda b: b.cumsum().sub(b.cumsum().where(~b).ffill().fillna(0)).astype(int)
df["consecutive_hour"] = (df.set_index("date")["value"] < 1000) \
    .groupby(pd.Grouper(freq="D")) \
    .apply(lambda b: cumcount_reset(b)).mul(3) \
    .reset_index(drop=True)
Output result:
>>> df
date value consecutive_hour
0 2017-11-10 00:00:00 850.0 3
1 2017-11-10 03:00:00 NaN 0
2 2017-11-10 06:00:00 NaN 0
3 2017-11-10 09:00:00 NaN 0
4 2017-11-10 12:00:00 NaN 0
5 2017-11-11 00:00:00 500.0 3
6 2017-11-11 03:00:00 650.0 6
7 2017-11-11 06:00:00 780.0 9
8 2017-11-11 09:00:00 NaN 0
9 2017-11-11 12:00:00 800.0 3
10 2017-11-12 00:00:00 350.0 3
11 2017-11-12 03:00:00 690.0 6
12 2017-11-12 06:00:00 780.0 9
13 2017-11-12 09:00:00 NaN 0
14 2017-11-12 12:00:00 NaN 0
Summary table:
df_summary = df.loc[df.groupby(pd.Grouper(key="date", freq="D"))["consecutive_hour"]
                      .apply(lambda h: (h - h.shift(-1).fillna(0)) > 0),
                    "consecutive_hour"] \
               .value_counts().reindex([3, 6, 9, 12], fill_value=0) \
               .rename("number_of_day") \
               .rename_axis("consecutive_hour") \
               .reset_index()
>>> df_summary
consecutive_hour number_of_day
0 3 2
1 6 0
2 9 2
3 12 0
