I was looking through the pandas.query documentation but couldn't find anything specific about this.
Is it possible to perform a query on a date based on the closest date to the one given, instead of a specific date?
For example, let's say we use the wine dataset and create some random dates.
import pandas as pd
import numpy as np
from sklearn import datasets
dir(datasets)
df = pd.DataFrame(datasets.load_wine().data)
df.columns = datasets.load_wine().feature_names
df.columns=df.columns.str.strip()
def random_dates(start, end, n, unit='D'):
    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start
np.random.seed(0)
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2022-01-01')
datelist=random_dates(start, end, 178)
df['Dates'] = datelist
If you perform a simple query on hue
df.query('hue == 0.6')
you'll receive three rows with three random dates. Is it possible to pick the query result that's closest to let's say 2017-1-1?
so something like
df.query('hue==0.6').query('Date ~2017-1-1')
I hope this makes sense!
You can use something like:
df.query("('2018-01-01' < Dates) & (Dates < '2018-01-31')")
# Output
alcohol malic_acid ... proline Dates
6 14.39 1.87 ... 1290.0 2018-01-24 08:21:14.665824000
41 13.41 3.84 ... 1035.0 2018-01-22 22:15:56.547561600
51 13.83 1.65 ... 1265.0 2018-01-26 22:37:26.812156800
131 12.88 2.99 ... 530.0 2018-01-01 18:58:05.118441600
139 12.84 2.96 ... 590.0 2018-01-08 13:38:26.117376000
142 13.52 3.17 ... 520.0 2018-01-19 22:37:10.170825600
[6 rows x 14 columns]
Or using local variables, referenced inside query with @:
date = pd.to_datetime('2018-01-01')
offset = pd.DateOffset(days=10)
start = date - offset
end = date + offset
df.query("Dates.between(#start, #end)")
# Output
alcohol malic_acid ... proline Dates
131 12.88 2.99 ... 530.0 2018-01-01 18:58:05.118441600
139 12.84 2.96 ... 590.0 2018-01-08 13:38:26.117376000
Given a series, find the entries closest to a given date:
def closest_to_date(series, date, n=5):
    date = pd.to_datetime(date)
    return abs(series - date).nsmallest(n)
Then we can use the index of the returned series to select further rows (or adapt the API to suit your needs):
(df.loc[df.hue == 0.6]
.loc[lambda df_: closest_to_date(df_.Dates, "2017-1-1", n=1).index]
)
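If only the single nearest row is needed, an idxmin one-liner does the same job. A minimal sketch, assuming the df and Dates column built above:
target = pd.Timestamp("2017-01-01")
subset = df.loc[df.hue == 0.6]
# select the row whose date is nearest to the target
nearest_row = subset.loc[(subset.Dates - target).abs().idxmin()]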
I'm not sure whether you have to use query, but this will give you the results you are looking for:
df['Count'] = (df[df['hue'] == .6].sort_values(['Dates'], ascending=True)).groupby(['hue']).cumcount() + 1
df.loc[df['Count'] == 1]
Related
I have a dataframe close consisting of the close prices (with some calculations done beforehand) of some stocks, and I want to create a dataframe (with empty entries or random numbers) whose row names are the tickers of close and whose column names run from 10 to 300 with a step size of 10, i.e. 10, 20, 30, 40, 50...
I want to create this df in order to use a for loop to fill in all the entries.
The df close I have looks like this:
Close \
ticker AAPL AMD BIDU GOOGL IXIC
Date
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
2011-06-02 12.360714 8.240000 138.490005 264.294281 2773.310059
2011-06-03 12.265714 7.970000 133.210007 261.801788 2732.780029
2011-06-06 12.072857 7.800000 126.970001 260.790802 2702.560059
2011-06-07 11.858571 7.710000 124.820000 259.774780 2701.560059
......
I tried to check whether I was creating this dataframe correctly, as below:
rows = close.iloc[0]
columns = [[i] for i in range(10,300,10)]
print(pd.DataFrame(rows, columns))
But what I got is:
2011-06-01
10 20 30 40 50 60 70 80 90 100 110 120 130 140 150 160 170 180 190 200 210 220 230 240 250 260 270 280 290 NaN
After this, I would use something like
percent = pd.DataFrame(rows, columns)
for i in range(10, 300, 10):
    myerror = myfunction(close, i)  # myfunction is a function defined beforehand
    extreme = myerror > 0.1
    percent.iloc[:, i] = extreme.mean()
To be specific, for i=10, my extreme.mean() is something like:
ticker
Absolute Error (Volatility) AAPL 0.420
AMD 0.724
BIDU 0.552
GOOGL 0.316
IXIC 0.176
MSFT 0.320
NDXT 0.228
NVDA 0.552
NXPI 0.476
QCOM 0.468
SWKS 0.560
TXN 0.332
dtype: float64
But when I tried it this way, I got:
IndexError: iloc cannot enlarge its target object
How shall I create this df first? Or do I even need to create this df first?
Here is how I would approach it:
from io import StringIO
import numpy as np
import pandas as pd
df = pd.read_csv(StringIO("""ticker_Date AAPL AMD BIDU GOOGL IXIC
2011-06-01 12.339643 8.370000 132.470001 263.063049 2769.189941
2011-06-02 12.360714 8.240000 138.490005 264.294281 2773.310059
2011-06-03 12.265714 7.970000 133.210007 261.801788 2732.780029
2011-06-06 12.072857 7.800000 126.970001 260.790802 2702.560059
2011-06-07 11.858571 7.710000 124.820000 259.774780 2701.560059 """), sep="\s+", index_col=0)
col_names = [f"col_{i}" for i in range(10, 300, 10)]
# generate random data
data = np.random.random((df.shape[1], len(col_names)))
# create dataframe
df = pd.DataFrame(data, columns=col_names, index=df.columns.values)
df.head()
This will generate:
col_10 col_20 col_30 col_40 col_50 col_60 col_70 col_80 col_90 col_100 ... col_200 col_210 col_220 col_230 col_240 col_250 col_260 col_270 col_280 col_290
AAPL 0.758983 0.990241 0.804344 0.143388 0.987025 0.402098 0.814308 0.302948 0.551587 0.107503 ... 0.270523 0.813130 0.354939 0.594897 0.711924 0.574312 0.124053 0.586718 0.182854 0.430028
AMD 0.280330 0.540498 0.958757 0.779778 0.988756 0.877748 0.083683 0.935331 0.601838 0.998863 ... 0.426469 0.459916 0.458180 0.047625 0.234591 0.831229 0.975838 0.277486 0.663604 0.773614
BIDU 0.488226 0.792466 0.488340 0.639612 0.829161 0.459805 0.619539 0.614297 0.337481 0.009500 ... 0.049147 0.452581 0.230441 0.943240 0.587269 0.703462 0.528252 0.099104 0.510057 0.151219
GOOGL 0.332762 0.135621 0.653414 0.955116 0.341629 0.213716 0.308320 0.982095 0.762138 0.532052 ... 0.095432 0.908001 0.077070 0.413706 0.036768 0.481697 0.092373 0.016260 0.394339 0.042559
IXIC 0.358842 0.653332 0.994692 0.863552 0.307594 0.269833 0.972357 0.520336 0.124850 0.907647 ... 0.189050 0.664955 0.167708 0.333537 0.295740 0.093228 0.762875 0.779000 0.316752 0.687238
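With a frame pre-allocated like this (tickers as the index, col_10 ... col_290 as the columns), the loop from the question can assign by column label with .loc instead of positional .iloc, which is what triggers the enlargement error. A sketch, using a random ticker-indexed Series as a stand-in for the question's extreme.mean():
percent = pd.DataFrame(index=df.index, columns=col_names, dtype=float)

for i in range(10, 300, 10):
    # stand-in for `extreme.mean()` from the question: a Series indexed by ticker
    col_values = pd.Series(np.random.random(len(df.index)), index=df.index)
    percent.loc[:, f"col_{i}"] = col_values  # label-based assignment, aligns on the ticker index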
I am working with a large dataframe (~10M rows) that contains dates and textual data, and I have a list of values for which I need to make some calculations, one value at a time.
For each value, I need to filter/subset my dataframe based on 4 conditions, then make my calculations and move on to the next value.
Currently, ~80% of the time is spent in the filter block, making the processing time extremely long (a few hours).
What I currently have is this:
for val in unique_list:  # iterate on values in a list
    if val is not None or val != kip:  # as long as its an acceptable value
        for year_num in range(1, 6):  # split by years
            # filter and make intermediate df based on per value & per year calculation
            cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
            cond_2 = df[f'{kip}'].notna()
            cond_3 = df['Date'].dt.year < 2015 + year_num
            cond_4 = df['Date'].dt.year >= 2015 + year_num - 1
            temp_df = df[cond_1 & cond_2 & cond_3 & cond_4].copy()
Condition 1 takes around 45% of the time, while conditions 3 & 4 take 22% each.
Is there a better way to implement this? Is there a way to avoid .dt and .str and use something faster?
The timing on 3 values (out of thousands):
Total time: 16.338 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def get_word_counts(df, kip, unique_list):
2 # to hold predictors
3 1 1929.0 1929.0 0.0 predictors_df = pd.DataFrame(index=[f'{kip}'])
4 1 2.0 2.0 0.0 n = 0
5
6 3 7.0 2.3 0.0 for val in unique_list: # iterate on values in a list
7 3 3.0 1.0 0.0 if val is not None or val != kip: # as long as its an acceptable value
8 18 39.0 2.2 0.0 for year_num in range(1, 6): # split by years
9
10 # filter and make intermediate df based on per value & per year calculation
11 15 7358029.0 490535.3 45.0 cond_1 = df[f'{kip}'].str.contains(re.escape(str(val)), na=False)
12 15 992250.0 66150.0 6.1 cond_2 = df[f'{kip}'].notna()
13 15 3723789.0 248252.6 22.8 cond_3 = df['Date'].dt.year < 2015 + year_num
14 15 3733879.0 248925.3 22.9 cond_4 = df['Date'].dt.year >= 2015 + year_num -1
The data mainly looks like this (I use only relevant columns when doing the calculations):
Date Ingredient
20 2016-07-20 Magnesium
21 2020-02-18 <NA>
22 2016-01-28 Apple;Cherry;Lemon;Olives General;Peanut Butter
23 2015-07-23 <NA>
24 2018-01-11 <NA>
25 2019-05-30 Egg;Soy;Unspecified Egg;Whole Eggs
26 2020-02-20 Chocolate;Peanut;Peanut Butter
27 2016-01-21 Raisin
28 2020-05-11 <NA>
29 2020-05-15 Chocolate
30 2019-08-16 <NA>
31 2020-03-28 Chocolate
32 2015-11-04 <NA>
33 2016-08-21 <NA>
34 2015-08-25 Almond;Coconut
35 2016-12-18 Almond
36 2016-01-18 <NA>
37 2015-11-18 Peanut;Peanut Butter
38 2019-06-04 <NA>
39 2016-04-08 <NA>
So, it looks like you really just want to split the 'Date' column by year and do something with each group. Also, for a large df, it is usually faster to filter what you can once beforehand to get a smaller df (in your example, one year's worth of data), and then do all your looping/extraction on that smaller df.
Without knowing much more about the data itself (C-contiguous? F-contiguous? Date-sorted?), it's hard to be sure, but I would guess that the following may prove to be faster (and it also feels more natural IMHO):
# 1. do everything you can outside the loop
# 1.a prep your patterns
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]
# you meant 'and', not 'or', right?
# 1.b filter and sort the data (why sort? better mem locality)
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')
# 2. do one groupby by year
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year  # optional, if you need it
    # 2.b reuse each group as much as possible
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        # do something with temp_df ...
Example (guessing some data, really):
n = 10_000_000
str_examples = ['hello', 'world', 'hi', 'roger', 'kilo', 'zulu', None]
df = pd.DataFrame({
    'Date': [pd.Timestamp('2010-01-01') + k * pd.Timedelta('1 day')
             for k in np.random.randint(0, 3650, size=n)],
    'x': np.random.randint(0, 1200, size=n),
    'foo': np.random.choice(str_examples, size=n),
    'bar': np.random.choice(str_examples, size=n),
})
unique_list = ['rld', 'oger']
kip = 'foo'
escaped_vals = [re.escape(str(val)) for val in unique_list
                if val is not None and val != kip]
%%time
z = df.loc[(df[kip].notna()) & (df['Date'] >= '2015') & (df['Date'] < '2021')].sort_values('Date')
# CPU times: user 1.67 s, sys: 124 ms, total: 1.79 s
%%time
from collections import defaultdict

out = defaultdict(dict)
for date, dfy in z.groupby(pd.Grouper(key='Date', freq='Y')):
    year = date.year
    for escval in escaped_vals:
        mask = dfy[kip].str.contains(escval, na=False)
        temp_df = dfy[mask].copy()
        out[year].update({escval: temp_df})
# CPU times: user 2.64 s, sys: 0 ns, total: 2.64 s
Quick sniff test:
>>> out.keys()
dict_keys([2015, 2016, 2017, 2018, 2019])
>>> out[2015].keys()
dict_keys(['rld', 'oger'])
>>> out[2015]['oger'].shape
(142572, 4)
>>> out[2015]['oger'].tail()
Date x foo bar
3354886 2015-12-31 409 roger hello
8792739 2015-12-31 474 roger zulu
3944171 2015-12-31 310 roger hi
7578485 2015-12-31 125 roger None
2963220 2015-12-31 809 roger hi
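If you would rather keep the original loop structure, simply hoisting the repeated .dt and .notna() work out of the loop already removes most of the cost of conditions 2-4. A sketch on the same synthetic df, kip and escaped_vals as above:
years = df['Date'].dt.year   # computed once instead of on every iteration
notna = df[kip].notna()      # likewise

for escval in escaped_vals:
    for year_num in range(1, 6):
        year_mask = years.eq(2015 + year_num - 1)  # equivalent to cond_3 & cond_4
        mask = notna & year_mask & df[kip].str.contains(escval, na=False)
        temp_df = df[mask].copy()
        # ... per value / per year calculations on temp_df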
Appreciate any help from the community on this. I've been toying with it for a few days now.
I have 2 dataframes, df1 and df2. The first dataframe will always be 1-min data, about 20-30 thousand rows. The second dataframe will contain random times with associated relevant data and will always be relatively small (1000-4000 rows x 4 or 5 columns). I'm working through df1 with itertuples in order to perform a time-specific (trailing) slice. This process gets repeated thousands of times, and the single slice line below (df3 = df2...) accounts for over 50% of the runtime. Simply adding a couple of slicing criteria to that single line can increase the final runtimes, which already run hours long, by 30+%!
I've considered trying pandas 'query', but have read it really only helps on larger dataframes. My thought is that it may be better to reduce df2 into a numpy array, simple python list, or other since it is always fairly short, though I think I'll need it back into a dataframe for subsequent sorting, summations, and vector multiplications that come afterward in the primary code. I did succeed in utilizing concurrent futures on a 12 core setup, which increased speed about 5X for my overall application, though I'm still talking hours of runtime.
Any help or suggestions would be appreciated.
Example code illustrating the issue:
import pandas as pd
import numpy as np
import random
from datetime import datetime as dt
from datetime import timedelta, timezone
def random_dates(start, end, n=10):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
dfsize = 34000
df1 = pd.DataFrame({'datetime': pd.date_range('2010-01-01', periods=dfsize, freq='1min'), 'val':np.random.uniform(10, 100, size=dfsize)})
sizedf = 3000
start = pd.to_datetime('2010-01-01')
end = pd.to_datetime('2010-01-24')
test_list = [5, 30]
df2 = pd.DataFrame({'datetime': random_dates(start, end, sizedf),
                    'a': np.random.uniform(10, 100, size=sizedf),
                    'b': np.random.choice(test_list, sizedf),
                    'c': np.random.uniform(10, 100, size=sizedf),
                    'd': np.random.uniform(10, 100, size=sizedf),
                    'e': np.random.uniform(10, 100, size=sizedf)})
df2.set_index('datetime', inplace=True)
daysback5 = 3
daysback30 = 8
#%%timeit -r1 #time this section here:
# Slow portion here - Performing ~4000+ slices on a dataframe (df2) which is ~1000 to 3000 rows
# Some slowdown is due to itertuples, which I don't think is avoidable
for line, row in enumerate(df1.itertuples(index=False), 0):
    if row.datetime.minute % 5 == 0:
        # Lion's share of the slowdown:
        df3 = df2[(df2['a'] <= row.val * 1.25) & (df2['a'] >= row.val * .75)
                  & (df2.index <= row.datetime)
                  & (((df2.index >= row.datetime - timedelta(days=daysback30)) & (df2['b'] == 30))
                     | ((df2.index >= row.datetime - timedelta(days=daysback5)) & (df2['b'] == 5)))
                  ].reset_index(drop=True).copy()
Time of slow part:
8.53 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
df1:
datetime val
0 2010-01-01 00:00:00 58.990147
1 2010-01-01 00:01:00 27.457308
2 2010-01-01 00:02:00 20.657251
3 2010-01-01 00:03:00 36.416561
4 2010-01-01 00:04:00 71.398897
... ... ...
33995 2010-01-24 14:35:00 77.763085
33996 2010-01-24 14:36:00 21.151239
33997 2010-01-24 14:37:00 83.741844
33998 2010-01-24 14:38:00 93.370216
33999 2010-01-24 14:39:00 99.720858
34000 rows × 2 columns
df2:
a b c d e
datetime
2010-01-03 23:38:13 22.363251 30 81.158073 21.806457 11.116421
2010-01-09 16:27:32 78.952070 5 27.045279 29.471537 29.559228
2010-01-13 04:49:57 85.985935 30 79.206437 29.711683 74.454446
2010-01-07 22:29:22 36.009752 30 43.072552 77.646257 57.208626
2010-01-15 09:33:02 13.653679 5 87.987849 37.433810 53.768334
... ... ... ... ... ...
2010-01-12 07:36:42 30.328512 5 81.281791 14.046032 38.288534
2010-01-08 20:26:31 80.911904 30 32.524414 80.571806 26.234552
2010-01-14 08:32:01 12.198825 5 94.270709 27.255914 87.054685
2010-01-06 03:25:09 82.591519 5 91.160917 79.042083 17.831732
2010-01-07 14:32:47 38.337405 30 10.619032 32.557640 87.890791
3000 rows × 5 columns
Actually, a cross merge plus query works pretty well for your data size:
(df1[df1.datetime.dt.minute % 5 == 0].assign(dummy=1)
 .merge(df2.reset_index().assign(dummy=1),
        on='dummy', suffixes=['_1', '_2'])
 .query('val*1.25 >= a >= val*0.75 and datetime_2 <= datetime_1')
 .loc[lambda x: ((x.datetime_2 >= x.datetime_1 - pd.Timedelta(days=daysback30)) & x['b'].eq(30))
                | ((x.datetime_2 >= x.datetime_1 - pd.Timedelta(days=daysback5)) & (x['b'] == 5))]
)
which on my system takes about:
2.05 s ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
where your code runs for about 10s.
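On pandas 1.2+ the dummy-key trick can be replaced by a true cross merge (how='cross'); the rest of the chain stays the same. A sketch, not re-timed here:
result = (
    df1[df1.datetime.dt.minute % 5 == 0]
    .merge(df2.reset_index(), how='cross', suffixes=['_1', '_2'])
    .query('val*1.25 >= a >= val*0.75 and datetime_2 <= datetime_1')
    .loc[lambda x: ((x.datetime_2 >= x.datetime_1 - pd.Timedelta(days=daysback30)) & x['b'].eq(30))
                   | ((x.datetime_2 >= x.datetime_1 - pd.Timedelta(days=daysback5)) & x['b'].eq(5))]
)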
I'm trying to use Python to get the time taken, as well as the average speed, of an object traveling between points.
The data looks somewhat like this,
location initialtime id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9km
2 2020-09-18T12:10:14.485952Z car_uno 83 8km
3 2020-09-18T11:59:14.484781Z car_duo 70 9km
7 2020-09-18T12:00:14.484653Z car_trio 85 8km
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5km
The function I'm using currently is essentially like this,
Speeds.index = pd.to_datetime(Speeds.index)
..etc
Now if I were doing this usually, I would just take the unique values of the id's,
for x in speeds.id.unique():
    Speeds[speeds.id=="x"]...
But this method really isn't working.
What is the best approach for simply checking whether there are multiple points for an id over time and, if so, taking the average of the speeds over that time, or otherwise just returning the speed itself?
Is there a simpler pandas filter I could use?
Expected output is simply,
area - id - initial time - journey time - average speed.
The point is to get the average speed and journey time for a vehicle going past two points.
To get the average speed and journey times you can use groupby() and pass in the columns that determine one complete journey, like id or area.
import pandas as pd
from io import StringIO
data = StringIO("""
area initialtime id speed
1 2020-09-18T12:03:14.485952Z car_uno 72
2 2020-09-18T12:10:14.485952Z car_uno 83
3 2020-09-18T11:59:14.484781Z car_duo 70
7 2020-09-18T12:00:14.484653Z car_trio 85
8 2020-09-18T12:12:14.484653Z car_trio 70
""")
df = pd.read_csv(data, delim_whitespace=True)
df["initialtime"] = pd.to_datetime(df["initialtime"])
# change to ["id", "area"] if need more granular aggregation
group_cols = ["id"]
time = df.groupby(group_cols)["initialtime"].agg([max, min]).eval('max-min').reset_index(name="journey_time")
speed = df.groupby(group_cols)["speed"].mean().reset_index(name="average_speed")
pd.merge(time, speed, on=group_cols)
id journey_time average_speed
0 car_duo 00:00:00 70.0
1 car_trio 00:12:00 77.5
2 car_uno 00:07:00 77.5
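The same table can also be built in a single groupby with named aggregation, which avoids the merge. A sketch over the same df and group_cols:
summary = (df.groupby(group_cols)
             .agg(initialtime=('initialtime', 'min'),
                  journey_time=('initialtime', lambda s: s.max() - s.min()),
                  average_speed=('speed', 'mean'))
             .reset_index())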
I tried a very intuitive solution. I'm assuming the data has already been loaded into df.
df['initialtime'] = pd.to_datetime(df['initialtime'])
result = []
for car in df['id'].unique():
    _df = df[df['id'] == car].sort_values('initialtime', ascending=True)

    # Where the car is leaving "from" and where it's heading "to"
    _df['From'] = _df['location']
    _df['To'] = _df['location'].shift(-1, fill_value=_df['location'].iloc[0])

    # Auxiliary columns
    _df['end_time'] = _df['initialtime'].shift(-1, fill_value=_df['initialtime'].iloc[0])
    _df['end_speed'] = _df['speed'].shift(-1, fill_value=_df['speed'].iloc[0])

    # Desired columns
    _df['journey_time'] = _df['end_time'] - _df['initialtime']
    _df['avg_speed'] = (_df['speed'] + _df['end_speed']) / 2

    _df = _df[_df['journey_time'] >= pd.Timedelta(0)]
    _df.drop(['location', 'distance', 'speed', 'end_time', 'end_speed'],
             axis=1, inplace=True)
    result.append(_df)

final_df = pd.concat(result).reset_index(drop=True)
The final DataFrame is as follows:
initialtime id From To journey_time avg_speed
0 2020-09-18 12:03:14.485952+00:00 car_uno 1 2 0 days 00:07:00 77.5
1 2020-09-18 11:59:14.484781+00:00 car_duo 3 3 0 days 00:00:00 70.0
2 2020-09-18 12:00:14.484653+00:00 car_trio 7 8 0 days 00:12:00 77.5
Here is another approach. My results are different from the other posts', so I may have misunderstood the requirements. In brief, I calculated each average speed as total distance divided by total time (for each car).
from io import StringIO
import pandas as pd
# speed in km / hour; distance in km
data = '''location initial-time id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9
2 2020-09-18T12:10:14.485952Z car_uno 83 8
3 2020-09-18T11:59:14.484781Z car_duo 70 9
7 2020-09-18T12:00:14.484653Z car_trio 85 8
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5
'''
Now create the data frame and perform the calculations:
# create data frame
df = pd.read_csv(StringIO(data), delim_whitespace=True)
df['elapsed-time'] = df['distance'] / df['speed'] # in hours
# utility function
def hours_to_hms(elapsed):
    '''Convert `elapsed` (in hours) to hh:mm:ss (round to nearest sec)'''
    h, m = divmod(elapsed, 1)
    m *= 60
    _, s = divmod(m, 1)
    s *= 60
    hms = '{:02d}:{:02d}:{:02d}'.format(int(h), int(m), int(round(s, 0)))
    return hms
# perform calculations
start_time = df.groupby('id')['initial-time'].min()
journey_hrs = df.groupby('id')['elapsed-time'].sum().rename('elapsed-hrs')
hms = journey_hrs.apply(lambda x: hours_to_hms(x)).rename('hh:mm:ss')
ave_speed = ((df.groupby('id')['distance'].sum()
              / df.groupby('id')['elapsed-time'].sum())
             .rename('ave speed (km/hr)')
             .round(2))
# assemble results
result = pd.concat([start_time, journey_hrs, hms, ave_speed], axis=1)
print(result)
initial-time elapsed-hrs hh:mm:ss \
id
car_duo 2020-09-18T11:59:14.484781Z 0.128571 00:07:43
car_trio 2020-09-18T12:00:14.484653Z 0.201261 00:12:05
car_uno 2020-09-18T12:03:14.485952Z 0.221386 00:13:17
ave speed (km/hr)
id
car_duo 70.00
car_trio 77.01
car_uno 76.79
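As a side note, the hand-rolled hours_to_hms conversion could also be done with pandas timedeltas (the result then displays as a Timedelta rather than a plain hh:mm:ss string); a sketch reusing journey_hrs from above:
# round the elapsed hours to the nearest second as a timedelta column
hms_td = pd.to_timedelta(journey_hrs, unit='h').dt.round('1s').rename('hh:mm:ss')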
You should provide a better dataset (i.e. with identical time points) so that we better understand the inputs, and an example of the expected output so that we understand the computation of the average speed.
Thus I'm just guessing that you may be looking for df.groupby('initialtime')['speed'].mean() if df is a dataframe containing your input data.
I have a time series of water levels for which I need to calculate monthly and annual statistics in relation to several arbitrary flood stages. Specifically, I need to determine the duration per month that the water exceeded flood stage, as well as the number of times these excursions occurred. Additionally, because of the noise associated with the dataloggers, I need to exclude floods that lasted less than 1 hour as well as floods with less than 1 hour between events.
Mock up data:
import datetime
import numpy as np
import pandas as pd

start = datetime.datetime(2014, 9, 5, 12, 0)
daterange = pd.date_range(start, periods = 10000, freq = '30min', name = "Datetime")
data = np.random.random_sample((len(daterange), 3)) * 10
columns = ["Pond_A", "Pond_B", "Pond_C"]
df = pd.DataFrame(data = data, index = daterange, columns = columns)
flood_stages = [('Stage_1', 4.0), ('Stage_2', 6.0)]
My desired output is:
Pond_A_Stage_1_duration Pond_A_Stage_1_events \
2014-09-30 12:00:00 35.5 2
2014-10-31 12:00:00 40.5 31
2014-11-30 12:00:00 100 16
2014-12-31 12:00:00 36 12
etc. for the duration and events at each flood stage, at each reservoir.
I've tried grouping by month, iterating through the ponds and then iterating through each row like:
grouper = pd.TimeGrouper(freq = "1MS")
month_groups = df.groupby(grouper)
for name, group in month_groups:
    flood_stage_a = group.sum()[1]
    flood_stage_b = group.sum()[2]
    inundation_a = False
    inundation_30_a = False
    inundation_hour_a = False
    change_inundation_a = 0
    for level in group.values:
        if level[1]:
            inundation_a = True
        else:
            inundation_a = False
        if inundation_hour_a == False and inundation_a == True and inundation_30_a == True:
            change_inundation_a += 1
        inundation_hour_a = inundation_30_a
        inundation_30_a = inundation_a
But this is a caveman solution and the heuristics are getting messy, since I don't want to count a new event if a flood started in one month and continued into the next. This also doesn't merge events with less than one hour between the end of one and the start of the next. Is there a better way to compare a record to its previous and next?
My other thought is to create new columns with the series shifted t+1, t+2, t-1, t-2, so I can evaluate each row once, but this still seems inefficient. Is there a smarter way to do this by mapping a function?
Let me give a quick, partial answer since no one has answered yet, and maybe someone else can do something better later on if this does not suffice for you.
You can do the time spent above flood stage pretty easily. I divided by 48 so the units are in days.
df[ df > 4 ].groupby(pd.TimeGrouper( freq = "1MS" )).count() / 48
Pond_A Pond_B Pond_C
Datetime
2014-09-01 15.375000 15.437500 14.895833
2014-10-01 18.895833 18.187500 18.645833
2014-11-01 17.937500 17.979167 18.666667
2014-12-01 18.104167 18.354167 18.958333
2015-01-01 18.791667 18.645833 18.708333
2015-02-01 16.583333 17.208333 16.895833
2015-03-01 18.458333 18.458333 18.458333
2015-04-01 0.458333 0.520833 0.500000
Counting distinct events is a little harder, but something like this will get you most of the way. (Note that this produces an unrealistically high number of flooding events, but that's just because of how the sample data is set up and not reflective of a typical pond, though I'm not an expert on pond flooding!)
for c in df.columns:
    df[c+'_events'] = ((df[c] > 4) & (df[c].shift() <= 4))

df.iloc[:, -3:].groupby(pd.TimeGrouper(freq="1MS")).sum()
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 306 291 298
2014-10-01 381 343 373
2014-11-01 350 346 357
2014-12-01 359 352 361
2015-01-01 355 335 352
2015-02-01 292 337 316
2015-03-01 344 360 386
2015-04-01 9 10 9
A couple things to note. First, an event can span months and this method will group it with the month where the event began. Second, I'm ignoring the duration of the event here, but you can adjust that however you want. For example, if you want to say the event doesn't start unless there are 2 consecutive periods below flood level followed by 2 consecutive periods above flood level, just change the relevant line above to:
df[c+'_events'] = ((df[c] > 4) & (df[c].shift(1) <= 4) &
                   (df[c].shift(-1) > 4) & (df[c].shift(2) <= 4))
That produces a pretty dramatic reduction in the count of distinct events:
Pond_A_events Pond_B_events Pond_C_events
Datetime
2014-09-01 70 71 72
2014-10-01 91 85 81
2014-11-01 87 75 91
2014-12-01 88 87 77
2015-01-01 91 95 94
2015-02-01 79 90 83
2015-03-01 83 78 85
2015-04-01 0 2 2
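For the remaining requirements (ignore floods shorter than 1 hour and merge events separated by less than 1 hour), one option is run-length labelling of the boolean mask with cumsum. A rough sketch on the same 30-minute mock data; edge effects (short runs at the very start or end of the series) are not treated specially here, and an event spanning a month boundary is counted in every month it touches:
def flood_events(level, stage, min_len=2, min_gap=2):
    """Return (above_mask, event_id) for a 30-minute level series.

    min_len: periods a flood must last to count (2 periods = 1 hour)
    min_gap: below-stage runs shorter than this are merged into one event
    """
    above = level > stage

    # fill short below-stage gaps so two nearby floods become one event
    run_id = (above != above.shift()).cumsum()
    run_len = above.groupby(run_id).transform('size')
    filled = above | (~above & (run_len < min_gap))

    # drop floods that are still shorter than min_len after gap-filling
    run_id = (filled != filled.shift()).cumsum()
    run_len = filled.groupby(run_id).transform('size')
    valid = filled & (run_len >= min_len)

    # give each surviving flood run its own label
    event_id = (valid != valid.shift()).cumsum().where(valid)
    return valid, event_id


valid, event_id = flood_events(df['Pond_A'], 4.0)
monthly = pd.DataFrame({
    'Pond_A_Stage_1_duration': valid.groupby(pd.Grouper(freq='MS')).sum() / 2,   # hours
    'Pond_A_Stage_1_events': event_id.groupby(pd.Grouper(freq='MS')).nunique(),
})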