np.select instead of a for/while loop - python

I am aiming to dramatically speed up my code, which I think can be done using np.select, although I don't know how.
Here is the current output when my code is executed:
date starting_temp average_high average_low limit_temp observation_date Date_Limit_reached
2019-12-03 22:30:00 NaN 13.0 14.8 NaN nan
2019-12-03 23:00:00 NaN 14.7 14.9 NaN nan
2019-12-03 23:30:00 NaN 13.0 13.9 NaN nan
2019-12-04 00:00:00 13.2 13.0 14.7 NaN 2019-12-04 10:00:00
2019-12-04 00:30:00 NaN 14.0 13.8 NaN nan
2019-12-04 01:00:00 NaN 13.9 13.8 NaN nan
2019-12-04 01:30:00 NaN 13.6 14.8 NaN nan
2019-12-04 02:00:00 NaN 13.1 14.5 NaN nan
2019-12-04 02:30:00 NaN 14.9 13.7 NaN nan
2019-12-04 03:00:00 NaN 14.2 14.1 NaN nan
2019-12-04 03:30:00 NaN 13.4 14.1 NaN nan
2019-12-04 04:00:00 NaN 14.3 13.0 NaN nan
2019-12-04 04:30:00 NaN 13.5 14.1 NaN nan
2019-12-04 05:00:00 NaN 13.6 13.4 NaN nan
2019-12-04 05:30:00 NaN 14.5 13.9 NaN nan
2019-12-04 06:00:00 NaN 14.4 14.5 NaN nan
2019-12-04 06:30:00 NaN 13.7 14.2 NaN nan
2019-12-04 07:00:00 NaN 13.7 14.2 NaN nan
2019-12-04 07:30:00 NaN 13.2 14.4 NaN nan
2019-12-04 08:00:00 NaN 13.9 13.1 NaN nan
2019-12-04 08:30:00 NaN 13.9 14.4 NaN nan
2019-12-04 09:00:00 NaN 14.4 13.9 NaN nan
2019-12-04 09:30:00 NaN 14.4 13.8 NaN nan
2019-12-04 10:00:00 NaN 15.0 14.0 NaN nan
2019-12-04 10:30:00 NaN 13.2 13.2 NaN nan
2019-12-04 11:00:00 NaN 14.0 13.3 NaN nan
2019-12-04 11:30:00 NaN 14.2 13.4 NaN nan
2019-12-04 12:00:00 NaN 14.2 13.4 NaN nan
2019-12-04 12:30:00 NaN 13.7 13.6 NaN nan
2019-12-04 13:00:00 NaN 14.1 13.3 NaN nan
2019-12-04 13:30:00 NaN 13.1 14.1 NaN nan
2019-12-04 14:00:00 NaN 13.2 14.3 NaN nan
2019-12-04 14:30:00 NaN 13.7 13.8 NaN nan
The code that produces the final df['Date_Limit_reached'] column, shown below, is far too slow. I would like to restructure it around np.select if possible:
new_col = []
df_size = len(df)
# Loop over the dataframe
for ind in df.index:
    if not math.isnan(df['starting_temp'][ind]):
        entry_price_val = df['starting_temp'][ind]
        count = 0
        hasValue = False
        while count < df_size:
            if df['starting_temp'][ind] > df['limit_temp'][ind] and df['limit_temp'][ind] >= df['asklow'][count] and df['date'][count] >= df['observation_date'][ind]:
                new_col.append(df['date'][count])
                hasValue = True
                break  # Break the loop once a matching row is found
            elif df['starting_temp'][ind] < df['limit_temp'][ind] and df['limit_temp'][ind] <= df['average_high'][count] and df['date'][count] >= df['observation_date'][ind]:
                new_col.append(df['date'][count])
                hasValue = True
                break  # Break the loop once a matching row is found
            count += 1
        # If no matching row is found, append NaN to the column
        if not hasValue:
            new_col.append(float('nan'))
    else:
        new_col.append(float('nan'))
df['Date_Limit_reached'] = new_col

Since I can't run the code (no df is available), here are my suggestions:
Use concrete values instead of flags; it makes the code more readable (hasValue --> val).
You will have a problem if there is an entry with df['starting_temp'][ind] == df['limit_temp'][ind], because none of your cases will fire. Maybe this is the reason the code is slow.
You can pre-calculate the first boolean expression of each while loop; this might also fix the issue from the previous point.
You don't use entry_price_val.
For further improvement, vectorize your data; this is possible in all of the loops (not shown in my code since I can't test it, but see the broadcasting sketch after the note below).
Here is my suggested code:
new_col = []
df_size = len(df)
for ind in df.index:
    val = float('nan')  # use data instead of flags
    if not math.isnan(df['starting_temp'][ind]):
        count = 0
        if df['starting_temp'][ind] > df['limit_temp'][ind]:
            while count < df_size:
                if df['limit_temp'][ind] >= df['asklow'][count] and df['date'][count] >= df['observation_date'][ind]:
                    val = df['date'][count]
                    break  # Break the loop once a matching row is found
                count += 1
        elif df['starting_temp'][ind] < df['limit_temp'][ind]:
            while count < df_size:
                if df['limit_temp'][ind] <= df['average_high'][count] and df['date'][count] >= df['observation_date'][ind]:
                    val = df['date'][count]
                    break  # Break the loop once a matching row is found
                count += 1
    new_col.append(val)
df['Date_Limit_reached'] = new_col
The code snippets were not tested; a check for correctness is required. Further improvements are possible (hints on request).
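As a rough illustration of the vectorization hint above, here is a minimal, untested sketch of how the whole search could be done with NumPy broadcasting instead of the nested loops (np.select alone cannot express a "first matching later row" search, because each row has to be compared against every other row). It assumes the column names from the question, including 'asklow', that 'date' and 'observation_date' are datetime64 columns, and it builds n-by-n boolean matrices, so memory grows quadratically with the frame size:

import numpy as np
import pandas as pd

def first_limit_date(df):
    # Hypothetical helper, not from the thread: for every row that has a starting_temp,
    # find the earliest date at or after its observation_date whose row meets the limit condition.
    dates = df['date'].to_numpy()
    obs = df['observation_date'].to_numpy()
    start = df['starting_temp'].to_numpy()
    limit = df['limit_temp'].to_numpy()
    asklow = df['asklow'].to_numpy()
    high = df['average_high'].to_numpy()

    # Boolean matrices of shape (n, n): row i is the candidate row, column j the scanned row.
    after_obs = dates[None, :] >= obs[:, None]
    down_hit = (start > limit)[:, None] & (limit[:, None] >= asklow[None, :]) & after_obs
    up_hit = (start < limit)[:, None] & (limit[:, None] <= high[None, :]) & after_obs
    hit = down_hit | up_hit

    first = hit.argmax(axis=1)      # position of the first True per row (0 if none)
    has_hit = hit.any(axis=1)
    out = np.where(has_hit, dates[first], np.datetime64('NaT', 'ns'))
    out = np.where(np.isnan(start), np.datetime64('NaT', 'ns'), out)  # rows without starting_temp stay empty
    return pd.Series(out, index=df.index)

# df['Date_Limit_reached'] = first_limit_date(df)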

Related

Moving forward in a pandas dataframe looking for the first occurrence of multi-conditions with reset

I am having trouble with multi-conditions moving forward in a dataframe.
Here's a simplification of my model:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date':pd.date_range(start='2022-05-12', periods=27),
'l': [10.0,9.9,11.1,10.9,12.1,9.6,13.1,17.9,18.0,15.6,13.5,14.2,10.5,9.5,7.6,9.8,10.2,15.3,17.7,21.8,10.9,18.9,16.4,13.3,7.1,6.8,9.4],
'c': [10.5,10.2,12.0,11.7,13.5,10.9,13.9,18.2,18.8,16.2,15.1,14.8,11.8,10.1,8.9,10.5,11.1,16.9,19.8,22.0,15.5,20.1,17.7,14.8,8.9,7.3,10.1],
'h': [10.8,11.5,13.4,13.6,14.2,11.4,15.8,18.5,19.2,16.9,16.0,15.3,12.9,10.5,9.2,11.1,12.3,18.5,20.1,23.5,21.1,20.5,18.2,15.4,9.6,8.4,10.5],
'oc': [False,True,False,False,False,True,True,True,False,False,True,False,True,False,False,False,False,True,False,False,False,False,False,False,False,False,False],
's': [np.nan,9.3,np.nan,np.nan,np.nan,14.5,14.4,np.nan,np.nan,np.nan,8.1,np.nan,10.7,np.nan,np.nan,np.nan,np.nan,6.9,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'i': [np.nan,9.0,np.nan,np.nan,np.nan,13.6,13.4,np.nan,np.nan,np.nan,7.0,np.nan,9.9,np.nan,np.nan,np.nan,np.nan,9.2,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
't': [np.nan,15.5,np.nan,np.nan,np.nan,16.1,15.9,np.nan,np.nan,np.nan,16.5,np.nan,17.2,np.nan,np.nan,np.nan,np.nan,25.0,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
})
df = df.set_index('date')
# df Index is datetime type
print(df)
l c h oc s i t
date
2022-05-12 10.0 10.5 10.8 False NaN NaN NaN
2022-05-13 9.9 10.2 11.5 True 9.3 9.0 15.5
2022-05-14 11.1 12.0 13.4 False NaN NaN NaN
2022-05-15 10.9 11.7 13.6 False NaN NaN NaN
2022-05-16 12.1 13.5 14.2 False NaN NaN NaN
2022-05-17 9.6 10.9 11.4 True 14.5 13.6 16.1
2022-05-18 13.1 13.9 15.8 True 14.4 13.4 15.9
2022-05-19 17.9 18.2 18.5 True NaN NaN NaN
2022-05-20 18.0 18.8 19.2 False NaN NaN NaN
2022-05-21 15.6 16.2 16.9 False NaN NaN NaN
2022-05-22 13.5 15.1 16.0 True 8.1 7.0 16.5
2022-05-23 14.2 14.8 15.3 False NaN NaN NaN
2022-05-24 10.5 11.8 12.9 True 10.7 9.9 17.2
2022-05-25 9.5 10.1 10.5 False NaN NaN NaN
2022-05-26 7.6 8.9 9.2 False NaN NaN NaN
2022-05-27 9.8 10.5 11.1 False NaN NaN NaN
2022-05-28 10.2 11.1 12.3 False NaN NaN NaN
2022-05-29 15.3 16.9 18.5 True 6.9 9.2 25.0
2022-05-30 17.7 19.8 20.1 False NaN NaN NaN
2022-05-31 21.8 22.0 23.5 False NaN NaN NaN
2022-06-01 10.9 15.5 21.1 False NaN NaN NaN
2022-06-02 18.9 20.1 20.5 False NaN NaN NaN
2022-06-03 16.4 17.7 18.2 False NaN NaN NaN
2022-06-04 13.3 14.8 15.4 False NaN NaN NaN
2022-06-05 7.1 8.9 9.6 False NaN NaN NaN
2022-06-06 6.8 7.3 8.4 False NaN NaN NaN
2022-06-07 9.4 10.1 10.5 False NaN NaN NaN
This is the result I am trying to achieve:
date l c h oc s i t cc diff r
0 2022-05-12 10.0 10.5 10.8 False NaN NaN NaN NaN NaN NaN
1 2022-05-13 9.9 10.2 11.5 True 9.3 9.0 15.5 NaN NaN NaN
2 2022-05-14 11.1 12.0 13.4 False NaN NaN NaN NaN NaN NaN
3 2022-05-15 10.9 11.7 13.6 False NaN NaN NaN NaN NaN NaN
4 2022-05-16 12.1 13.5 14.2 False NaN NaN NaN NaN NaN NaN
5 2022-05-17 9.6 10.9 11.4 True 14.5 13.6 16.1 NaN NaN NaN
6 2022-05-18 13.1 13.9 15.8 True 14.4 13.4 15.9 True 5.3 t
7 2022-05-19 17.9 18.2 18.5 True NaN NaN NaN NaN NaN NaN
8 2022-05-20 18.0 18.8 19.2 False NaN NaN NaN NaN NaN NaN
9 2022-05-21 15.6 16.2 16.9 False NaN NaN NaN NaN NaN NaN
10 2022-05-22 13.5 15.1 16.0 True 8.1 7.0 16.5 NaN NaN NaN
11 2022-05-23 14.2 14.8 15.3 False NaN NaN NaN NaN NaN NaN
12 2022-05-24 10.5 11.8 12.9 True 10.7 9.9 17.2 NaN NaN NaN
13 2022-05-25 9.5 10.1 10.5 False NaN NaN NaN NaN NaN NaN
14 2022-05-26 7.6 8.9 9.2 False NaN NaN NaN True -7.0 s
15 2022-05-27 9.8 10.5 11.1 False NaN NaN NaN NaN NaN NaN
16 2022-05-28 10.2 11.1 12.3 False NaN NaN NaN NaN NaN NaN
17 2022-05-29 15.3 16.9 18.5 True 6.9 9.2 25.0 NaN NaN NaN
18 2022-05-30 17.7 19.8 20.1 False NaN NaN NaN NaN NaN NaN
19 2022-05-31 21.8 22.0 23.5 False NaN NaN NaN NaN NaN NaN
20 2022-06-01 10.9 15.5 21.1 False NaN NaN NaN NaN NaN NaN
21 2022-06-02 18.9 20.1 20.5 False NaN NaN NaN NaN NaN NaN
22 2022-06-03 16.4 17.7 18.2 False NaN NaN NaN NaN NaN NaN
23 2022-06-04 13.3 14.8 15.4 False NaN NaN NaN NaN NaN NaN
24 2022-06-05 7.1 8.9 9.6 False NaN NaN NaN True -7.7 i
25 2022-06-06 6.8 7.3 8.4 False NaN NaN NaN NaN NaN NaN
26 2022-06-07 9.4 10.1 10.5 False NaN NaN NaN NaN NaN NaN
Principles:
We always move forward in the dataframe
When oc is True we 'memorize' the c, s, i and t values from that row
Moving forward we look for the first occurrence of one of the following conditions:
h >= t
l <= s
l <= i
When that happens we set cc to True, calculate the difference against the 'memorized' values from the row where oc was True, and write a letter to record which condition fired:
If h >= t: diff = t-c and r = 't'
If l <= s: diff = s-c and r = 's'
If l <= i: diff = i-c and r = 'i'
Once one of the conditions has been met, we look again for the next row where oc is True and then for the conditions to be met, and so on until the end of the dataframe.
If oc is True again before one of the conditions has been met, we omit it.
What happens chronologically:
2022-05-13: oc is True so we memorize c, s, i, t
2022-05-17: oc is True but none of the conditions have been met, yet -> omission
2022-05-18: h > t[2022-05-13] -> diff = t[2022-05-13]-c[2022-05-13] = 15.5-10.2 = 5.3, r = 't'
2022-05-22: oc is True so we memorize c, s, i, t
2022-05-24: oc is True but none of the conditions have been met, yet -> omission
2022-05-26: l < s[2022-05-22] -> diff = s[2022-05-22]-c[2022-05-22] = 8.1-15.1 = -7.0, r = 's'
2022-05-29: oc is True so we memorize c, s, i, t
2022-06-05: l < i[2022-05-29] -> diff = i[2022-05-29]-c[2022-05-29] = 9.2-16.9 = -7.7, r = 'i'
A loop works but takes an enormous amount of time; if possible I'd like to avoid it.
I've tried a really good solution from Baron Legendre described here, which works perfectly when looking for equal values, but I can't seem to adapt it to my model. I'm also having an index problem: I get different results when using a datetime index, even when I reset it.
I've been stuck with that problem for a while now so any help would gladly be appreciated.
IIUC, you can use the commented code below:
mem = False  # Memory flag
data = []    # Store new values

# Create groups to speed the process (remove rows before first valid oc)
grp = df['oc'].cumsum().loc[lambda x: x > 0]

# For each group
for _, subdf in df.groupby(grp):
    # Memorize new oc fields (c, s, i, t)
    if not mem:
        oc = subdf.iloc[0][['c', 's', 'i', 't']]
        mem = True
    # Extract l and h fields
    lh = subdf.iloc[1:][['l', 'h']]
    # Try to extract the first row where one of the conditions is met
    sr = (pd.concat([lh['h'] >= oc['t'], lh['l'] <= oc['s'], lh['l'] <= oc['i']],
                    keys=['t', 's', 'i'], axis=1)
            .rename_axis(columns='r').stack().rename('cc')
            .loc[lambda x: x].head(1).reset_index('r').squeeze())
    # Keep this row if it exists and unlock the memory
    if not sr.empty:
        sr['diff'] = oc[sr['r']] - oc['c']
        data.append(sr)
        mem = False

# Merge new values
out = df.join(pd.concat(data, axis=1).T[['cc', 'r', 'diff']])
Output:
>>> out
l c h oc s i t cc r diff
date
2022-05-12 10.0 10.5 10.8 False NaN NaN NaN NaN NaN NaN
2022-05-13 9.9 10.2 11.5 True 9.3 9.0 15.5 NaN NaN NaN
2022-05-14 11.1 12.0 13.4 False NaN NaN NaN NaN NaN NaN
2022-05-15 10.9 11.7 13.6 False NaN NaN NaN NaN NaN NaN
2022-05-16 12.1 13.5 14.2 False NaN NaN NaN NaN NaN NaN
2022-05-17 9.6 10.9 11.4 True 14.5 13.6 16.1 NaN NaN NaN
2022-05-18 13.1 13.9 15.8 False NaN NaN NaN True t 5.3
2022-05-19 17.9 18.2 18.5 False NaN NaN NaN NaN NaN NaN
2022-05-20 18.0 18.8 19.2 False NaN NaN NaN NaN NaN NaN
2022-05-21 15.6 16.2 16.9 False NaN NaN NaN NaN NaN NaN
2022-05-22 13.5 15.1 16.0 True 8.1 7.0 16.5 NaN NaN NaN
2022-05-23 14.2 14.8 15.3 False NaN NaN NaN NaN NaN NaN
2022-05-24 10.5 11.8 12.9 True 10.7 9.9 17.2 NaN NaN NaN
2022-05-25 9.5 10.1 10.5 False NaN NaN NaN NaN NaN NaN
2022-05-26 7.6 8.9 9.2 False NaN NaN NaN True s -7.0
2022-05-27 9.8 10.5 11.1 False NaN NaN NaN NaN NaN NaN
2022-05-28 10.2 11.1 12.3 False NaN NaN NaN NaN NaN NaN
2022-05-29 15.3 16.9 18.5 True 6.9 9.2 25.0 NaN NaN NaN
2022-05-30 17.7 19.8 20.1 False NaN NaN NaN NaN NaN NaN
2022-05-31 21.8 22.0 23.5 False NaN NaN NaN NaN NaN NaN
2022-06-01 10.9 15.5 21.1 False NaN NaN NaN NaN NaN NaN
2022-06-02 18.9 20.1 20.5 False NaN NaN NaN NaN NaN NaN
2022-06-03 16.4 17.7 18.2 False NaN NaN NaN NaN NaN NaN
2022-06-04 13.3 14.8 15.4 False NaN NaN NaN NaN NaN NaN
2022-06-05 7.1 8.9 9.6 False NaN NaN NaN True i -7.7
2022-06-06 6.8 7.3 8.4 False NaN NaN NaN NaN NaN NaN
2022-06-07 9.4 10.1 10.5 False NaN NaN NaN NaN NaN NaN

How should I combine the rows of similar time in a Dataframe?

I'm processing a MIMIC dataset. Now I want to combine the data in the rows whose time difference (delta time) is below 10min. How can I do that?
The original data:
charttime hadm_id age is_male HR RR SPO2 Systolic_BP Diastolic_BP MAP PEEP PO2
0 2119-07-20 17:54:00 26270240 NaN NaN NaN NaN NaN 103.0 66.0 81.0 NaN NaN
1 2119-07-20 17:55:00 26270240 68.0 1.0 113.0 26.0 NaN NaN NaN NaN NaN NaN
2 2119-07-20 17:57:00 26270240 NaN NaN NaN NaN 92.0 NaN NaN NaN NaN NaN
3 2119-07-20 18:00:00 26270240 68.0 1.0 114.0 28.0 NaN 85.0 45.0 62.0 16.0 NaN
4 2119-07-20 18:01:00 26270240 NaN NaN NaN NaN 91.0 NaN NaN NaN NaN NaN
5 2119-07-30 21:00:00 26270240 68.0 1.0 90.0 16.0 93.0 NaN NaN NaN NaN NaN
6 2119-07-30 21:00:00 26270240 68.0 1.0 89.0 9.0 94.0 NaN NaN NaN NaN NaN
7 2119-07-30 21:01:00 26270240 68.0 1.0 89.0 10.0 93.0 NaN NaN NaN NaN NaN
8 2119-07-30 21:05:00 26270240 NaN NaN NaN NaN NaN 109.0 42.0 56.0 NaN NaN
9 2119-07-30 21:10:00 26270240 68.0 1.0 90.0 10.0 93.0 NaN NaN NaN NaN NaN
After combining the rows whose delta time is less than 10 min, the output I want:
(when there is duplicate data in the same column across the rows being grouped, just take the first one)
charttime hadm_id age is_male HR RR SPO2 Systolic_BP Diastolic_BP MAP PEEP PO2
0 2119-07-20 17:55:00 26270240 68.0 1.0 113.0 26.0 92.0 103.0 66.0 81.0 16.0 NaN
  2119-07-30 20:00:00 26270240 68.0 1.0 90.0 16.0 93.0 NaN NaN NaN NaN NaN
1 2119-07-30 21:00:00 26270240 68.0 1.0 89.0 9.0 94.0 109.0 42.0 56.0 NaN NaN
How can I do this?
First, I would round the timestamp column to 10 minutes:
df['charttime'] = pd.to_datetime(df['charttime']).dt.floor('10T').dt.time
Then, I would drop the duplicates, based on the columns you want to compare (for example, hadm_id and charttime):
df.drop_duplicates(subset=['charttime', 'hadm_id'], keep='first', inplace=True)
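The drop_duplicates approach keeps only the first row of each bucket, so values that appear only in later rows are lost. If the goal is to actually merge the rows (first non-null value per column), a hedged, untested sketch along these lines may be closer to the desired output; it assumes charttime is still a full timestamp and approximates "time difference below 10 min" with fixed 10-minute buckets:

import pandas as pd

df['charttime'] = pd.to_datetime(df['charttime'])
df['bucket'] = df['charttime'].dt.floor('10T')   # fixed 10-minute buckets (an approximation)

# GroupBy.first() takes the first non-null value of each column within a group,
# so rows that fall into the same bucket are merged rather than dropped.
combined = (df.groupby(['bucket', 'hadm_id'], as_index=False)
              .first()
              .drop(columns='bucket'))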

Using the last valid data index in one Dataframe to select data in another Dataframe

I want to find the last valid index of the first Dataframe, and use it to index the second Dataframe.
So, suppose I have the following Dataframe (df1):
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 13.0 28.0 76.0 45 90.0 58.0
2001-01-01 77.0 75.0 57.0 3 41.0 24.0
2002-01-01 50.0 29.0 2.0 65 48.0 21.0
2003-01-01 7.0 48.0 14.0 63 12.0 66.0
2004-01-01 11.0 90.0 11.0 5 47.0 6.0
2005-01-01 50.0 4.0 31.0 1 40.0 79.0
2006-01-01 30.0 98.0 91.0 96 43.0 39.0
2007-01-01 50.0 20.0 54.0 65 NaN 47.0
2008-01-01 24.0 84.0 52.0 84 NaN 81.0
2009-01-01 56.0 61.0 57.0 25 NaN 36.0
2010-01-01 87.0 45.0 68.0 65 NaN 71.0
2011-01-01 22.0 50.0 92.0 91 NaN 48.0
2012-01-01 12.0 44.0 79.0 77 NaN 25.0
2013-01-01 1.0 22.0 34.0 57 NaN 25.0
2014-01-01 94.0 NaN 86.0 97 NaN 91.0
2015-01-01 2.0 NaN 98.0 44 NaN 79.0
2016-01-01 81.0 NaN 35.0 87 NaN 32.0
2017-01-01 59.0 NaN 95.0 32 NaN 58.0
2018-01-01 NaN NaN 3.0 14 NaN NaN
2019-01-01 NaN NaN 48.0 9 NaN NaN
2020-01-01 NaN NaN NaN 49 NaN NaN
Now I can use "last_valid_index()" to find the last valid index of each column:
lvi = df.apply(lambda series: series.last_valid_index())
Which yields:
Site 1 2017-01-01
Site 2 2013-01-01
Site 3 2019-01-01
Site 4 2020-01-01
Site 5 2006-01-01
Site 6 2017-01-01
How do I use this index to slice the time series of another Dataframe? Another example Dataframe could be created with:
import pandas as pd
import numpy as np
from numpy import random
random.seed(30)
df2 = pd.DataFrame({
"Site 1": np.random.rand(21),
"Site 2": np.random.rand(21),
"Site 3": np.random.rand(21),
"Site 4": np.random.rand(21),
"Site 5": np.random.rand(21),
"Site 6": np.random.rand(21)})
idx = pd.date_range(start='2000-01-01', end='2020-01-01',freq ='AS')
df2 = df2.set_index(idx)
How do I use that "lvi" variable to index into df2?
To do this manually I could just use:
df_s1 = df['Site 1'].loc['2000-01-01':'2017-01-01']
To get something like:
2000-01-01 13.0
2001-01-01 77.0
2002-01-01 50.0
2003-01-01 7.0
2004-01-01 11.0
2005-01-01 50.0
2006-01-01 30.0
2007-01-01 50.0
2008-01-01 24.0
2009-01-01 56.0
2010-01-01 87.0
2011-01-01 22.0
2012-01-01 12.0
2013-01-01 1.0
2014-01-01 94.0
2015-01-01 2.0
2016-01-01 81.0
2017-01-01 59.0
Is there a better way to approach this? Also, will each column have to essentially be its own dataframe to work? Any help is greatly appreciated!
This might be a bit more idiomatic:
df2[df.notna()]
or even
df2.where(df.notna())
Note that in these cases (and df1*0 + df2), the operations are done for matching index values of df and df2. For example, df2[df.reset_index(drop=True).notna()] will return all nan because there are no common index values.
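If the lvi Series itself should drive the slicing instead of a mask, a hedged sketch (assuming the two frames share column names and a date index) is:

# One truncated Series per site, each ending at that site's last valid date in df.
sliced = {site: df2.loc[:lvi[site], site] for site in lvi.index}

# For example, sliced['Site 1'] covers 2000-01-01 through 2017-01-01.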
This seems to work just fine:
In [34]: d
Out[34]:
x y
Date
2020-01-01 1.0 2.0
2020-01-02 1.0 2.0
2020-01-03 1.0 2.0
2020-01-04 1.0 2.0
2020-01-05 1.0 2.0
2020-01-06 1.0 NaN
2020-01-07 1.0 NaN
2020-01-08 1.0 NaN
2020-01-09 1.0 NaN
2020-01-10 1.0 NaN
2020-01-11 NaN NaN
2020-01-12 NaN NaN
2020-01-13 NaN NaN
2020-01-14 NaN NaN
2020-01-15 NaN NaN
2020-01-16 NaN NaN
2020-01-17 NaN NaN
2020-01-18 NaN NaN
2020-01-19 NaN NaN
2020-01-20 NaN NaN
In [35]: d.apply(lambda col: col.last_valid_index())
Out[35]:
x 2020-01-10
y 2020-01-05
dtype: datetime64[ns]
And then:
In [15]: d.apply(lambda col: col.last_valid_index()).apply(lambda date: df2.loc[date])
Out[15]:
          z
x  0.940396
y  0.564007
Alright, so after thinking about this for a while and trying to come up with a detailed procedure that involved a for loop etc., I came to the conclusion that this simple math operation will do the trick. Basically, I am taking advantage of how math is done between Dataframes in pandas.
output = df1*0 + df2
This gives the output on df2 that will take on the NaN values from df1 and look like this:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 0.690597 0.443933 0.787931 0.659639 0.363606 0.922373
2001-01-01 0.388669 0.577734 0.450225 0.021592 0.554249 0.305546
2002-01-01 0.578212 0.927848 0.361426 0.840541 0.626881 0.545491
2003-01-01 0.431668 0.128282 0.893351 0.783488 0.122182 0.666194
2004-01-01 0.151491 0.928584 0.834474 0.945401 0.590830 0.802648
2005-01-01 0.113477 0.398326 0.649955 0.202538 0.485927 0.127925
2006-01-01 0.521906 0.458672 0.923632 0.948696 0.638754 0.552753
2007-01-01 0.266599 0.839047 0.099069 0.000928 NaN 0.018146
2008-01-01 0.819810 0.809779 0.706223 0.247780 NaN 0.759691
2009-01-01 0.441574 0.020291 0.702551 0.468862 NaN 0.341191
2010-01-01 0.277030 0.130573 0.906697 0.589474 NaN 0.819986
2011-01-01 0.795344 0.103121 0.846405 0.589916 NaN 0.564411
2012-01-01 0.697255 0.599767 0.206482 0.718980 NaN 0.731366
2013-01-01 0.891771 0.001944 0.703132 0.751986 NaN 0.845933
2014-01-01 0.672579 NaN 0.466981 0.466770 NaN 0.618069
2015-01-01 0.767219 NaN 0.702156 0.370905 NaN 0.481971
2016-01-01 0.315264 NaN 0.793531 0.754920 NaN 0.091432
2017-01-01 0.431651 NaN 0.974520 0.708074 NaN 0.870077
2018-01-01 NaN NaN 0.408743 0.430576 NaN NaN
2019-01-01 NaN NaN 0.751509 0.755521 NaN NaN
2020-01-01 NaN NaN NaN 0.518533 NaN NaN
I was basically wanting to imprint the NaN values from one Dataframe onto another. I cannot believe how difficult I was making this. As long as my Dataframes are the same size this should work fine for my needs.
Now I should be able to take it from here to calculate the percent change from each last valid datapoint. Thank you everyone for the input!
EDIT:
Just to show everyone what I was ultimately trying to accomplish, here is the final code I produced with everyone's help and suggestions!
The original df originally looked like:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
Date
2000-01-01 13.0 28.0 76.0 45 90.0 58.0
2001-01-01 77.0 75.0 57.0 3 41.0 24.0
2002-01-01 50.0 29.0 2.0 65 48.0 21.0
2003-01-01 7.0 48.0 14.0 63 12.0 66.0
2004-01-01 11.0 90.0 11.0 5 47.0 6.0
2005-01-01 50.0 4.0 31.0 1 40.0 79.0
2006-01-01 30.0 98.0 91.0 96 43.0 39.0
2007-01-01 50.0 20.0 54.0 65 NaN 47.0
2008-01-01 24.0 84.0 52.0 84 NaN 81.0
2009-01-01 56.0 61.0 57.0 25 NaN 36.0
2010-01-01 87.0 45.0 68.0 65 NaN 71.0
2011-01-01 22.0 50.0 92.0 91 NaN 48.0
2012-01-01 12.0 44.0 79.0 77 NaN 25.0
2013-01-01 1.0 22.0 34.0 57 NaN 25.0
2014-01-01 94.0 NaN 86.0 97 NaN 91.0
2015-01-01 2.0 NaN 98.0 44 NaN 79.0
2016-01-01 81.0 NaN 35.0 87 NaN 32.0
2017-01-01 59.0 NaN 95.0 32 NaN 58.0
2018-01-01 NaN NaN 3.0 14 NaN NaN
2019-01-01 NaN NaN 48.0 9 NaN NaN
2020-01-01 NaN NaN NaN 49 NaN NaN
Then I came up with a second full dataframe (df2) with:
df2 = pd.DataFrame({
"Site 1": np.random.rand(21),
"Site 2": np.random.rand(21),
"Site 3": np.random.rand(21),
"Site 4": np.random.rand(21),
"Site 5": np.random.rand(21),
"Site 6": np.random.rand(21)})
idx = pd.date_range(start='2000-01-01', end='2020-01-01',freq ='AS')
df2 = df2.set_index(idx)
Now I put NaN into df2 wherever df has NaN:
dfr = df2[df.notna()]
Then I invert the dataframe:
dfr = dfr[::-1]
valid_first = dfr.apply(lambda col: col.first_valid_index())
valid_last = dfr.apply(lambda col: col.last_valid_index())
Now I want to calculate the percent change from my last valid data point, which is fixed for each column. This gives me the % change from the present to the past, with respect to the most recent (or last valid) data point.
new = []
for j in dfr:
    m = dfr[j].loc[valid_first[j]:valid_last[j]]
    pc = m / m.iloc[0] - 1
    new.append(pc)
final = pd.concat(new, axis=1)
print(final)
Which gave me:
Site 1 Site 2 Site 3 Site 4 Site 5 Site 6
2000-01-01 0.270209 -0.728445 -0.636105 0.380330 41.339081 -0.462147
2001-01-01 0.854952 -0.827804 -0.703568 -0.787391 40.588791 -0.884806
2002-01-01 -0.677757 -0.120482 -0.208255 -0.982097 54.348094 -0.483415
2003-01-01 -0.322010 -0.061277 -0.382602 1.025088 5.440808 -0.602661
2004-01-01 1.574451 -0.768251 -0.543260 1.210434 50.494788 -0.859331
2005-01-01 -0.412226 -0.866441 -0.055027 -0.168267 1.346869 -0.385080
2006-01-01 1.280867 -0.640899 0.354513 1.086703 0.000000 0.108504
2007-01-01 1.121585 -0.741675 -0.735990 -0.768578 NaN -0.119436
2008-01-01 -0.210467 -0.376884 -0.575106 -0.779147 NaN 0.055949
2009-01-01 1.864107 -0.966827 0.566590 1.003121 NaN -0.214482
2010-01-01 0.571762 -0.311459 -0.518113 1.036950 NaN -0.513911
2011-01-01 -0.122525 -0.178137 -0.641642 0.197481 NaN 0.033141
2012-01-01 0.403578 -0.829402 0.161753 -0.438578 NaN -0.996595
2013-01-01 0.383481 0.000000 -0.305824 0.602079 NaN -0.057711
2014-01-01 -0.699708 NaN -0.515074 -0.277157 NaN -0.840873
2015-01-01 0.422364 NaN -0.759708 1.230037 NaN -0.663253
2016-01-01 -0.418945 NaN 0.197396 -0.445260 NaN -0.299741
2017-01-01 0.000000 NaN -0.897428 0.669791 NaN 0.000000
2018-01-01 NaN NaN 0.138997 0.486961 NaN NaN
2019-01-01 NaN NaN 0.000000 0.200771 NaN NaN
2020-01-01 NaN NaN NaN 0.000000 NaN NaN
I know these questions often lack context, so here is the final output, achieved thanks to your input. Again, thank you to everyone for the help!
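As a side note, the loop over dfr above can also be written without an explicit loop; this is a hedged, untested equivalent that divides each column by its value at its last valid date (the first valid row of the reversed frame):

# Reference value per column: the entry at the column's first valid index of the reversed frame.
ref = dfr.apply(lambda col: col[col.first_valid_index()])

# Dividing a DataFrame by a Series aligns on column names.
final = dfr / ref - 1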

How to group by level 0 and describe in a multi index and level dataframe (pandas)?

Here is a multi-index, multi-level dataframe (file). Loading the dataframe from the csv:
import pandas as pd
df = pd.read_csv('./enviar/only-bh-extreme-events-satellite.csv'
,index_col=[0,1,2,3,4]
,header=[0,1,2,3]
,skipinitialspace=True
,tupleize_cols=True
)
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df)
                                                    ci              \
                                                     1
                                                     1
                                                   00h  06h  12h  18h
wsid lat        lon        start               prcp_24
329 -43.969397 -19.883945 2007-03-18 10:00:00 72.0 NaN NaN NaN NaN
2007-03-20 10:00:00 104.4 NaN NaN NaN NaN
2007-10-18 23:00:00 92.8 NaN NaN NaN NaN
2007-12-21 00:00:00 60.4 NaN NaN NaN NaN
2008-01-19 18:00:00 53.0 NaN NaN NaN NaN
2008-04-05 01:00:00 80.8 0.0 0.0 0.0 0.0
2008-10-31 17:00:00 101.8 NaN NaN NaN NaN
2008-11-01 04:00:00 82.0 NaN NaN NaN NaN
2008-12-29 00:00:00 57.8 NaN NaN NaN NaN
2009-03-28 10:00:00 72.4 NaN NaN NaN NaN
2009-10-07 02:00:00 57.8 NaN NaN NaN NaN
2009-10-08 00:00:00 83.8 NaN NaN NaN NaN
2009-11-28 16:00:00 84.4 NaN NaN NaN NaN
2009-12-18 04:00:00 51.8 NaN NaN NaN NaN
2009-12-28 00:00:00 96.4 NaN NaN NaN NaN
2010-01-06 05:00:00 74.2 NaN NaN NaN NaN
2011-12-18 00:00:00 113.6 NaN NaN NaN NaN
2011-12-19 00:00:00 90.6 NaN NaN NaN NaN
2012-11-15 07:00:00 85.8 NaN NaN NaN NaN
2013-10-17 00:00:00 52.4 NaN NaN NaN NaN
2014-04-01 22:00:00 72.0 0.0 0.0 0.0 0.0
2014-10-20 06:00:00 56.6 NaN NaN NaN NaN
2014-12-13 09:00:00 104.4 NaN NaN NaN NaN
2015-02-09 00:00:00 62.0 NaN NaN NaN NaN
2015-02-16 19:00:00 56.8 NaN NaN NaN NaN
2015-05-06 17:00:00 50.8 0.0 0.0 0.0 0.0
2016-02-26 00:00:00 52.2 NaN NaN NaN NaN
343 -44.416883 -19.885398 2008-08-30 21:00:00 50.4 0.0 0.0 0.0 0.0
2009-02-01 01:00:00 53.8 NaN NaN NaN NaN
2010-03-22 00:00:00 51.4 NaN NaN NaN NaN
2011-11-12 21:00:00 57.8 NaN NaN NaN NaN
2011-11-25 22:00:00 107.6 NaN NaN NaN NaN
2012-12-28 20:00:00 94.0 NaN NaN NaN NaN
2013-10-16 22:00:00 50.8 NaN NaN NaN NaN
2014-11-06 21:00:00 55.2 NaN NaN NaN NaN
2015-01-24 00:00:00 80.0 NaN NaN NaN NaN
2015-01-27 00:00:00 52.8 NaN NaN NaN NaN
370 -43.958651 -19.980034 2015-01-28 23:00:00 50.4 NaN NaN NaN NaN
2015-01-29 00:00:00 50.6 NaN NaN NaN NaN
I'm trying to describe the data grouped by level 0 (the variables ci, d, r, z...). I'd like to get the count, max, min, std, etc.
When I tried df.describe(), the result was not grouped by level 0. This is what I expected:
ci cc z r -> Level 0
count 39.000000 39.000000 39.000000 39.000000
mean 422577.032051 422025.595353 421672.402244 422449.004808
std 144740.869473 144550.040108 144425.167173 144692.422425
min 0.000000 0.000000 0.000000 0.000000
25% 467962.437500 467512.156250 467915.437500 468552.750000
50% 470644.687500 469924.468750 469772.312500 470947.468750
75% 472557.875000 471953.828125 471156.250000 472279.937500
max 473988.062500 473269.187500 472358.125000 473675.812500
I had created this helper function:
def format_percentiles(percentiles):
    percentiles = np.asarray(percentiles)
    percentiles = 100 * percentiles
    int_idx = (percentiles.astype(int) == percentiles)
    if np.all(int_idx):
        out = percentiles.astype(int).astype(str)
        return [i + '%' for i in out]
And this my own describe function:
import numpy as np
from functools import reduce
def describe_customized(df):
    _df = pd.DataFrame()
    data = []
    variables = list(set(df.columns.get_level_values(0)))
    variables.sort()
    for var in variables:
        idx = pd.IndexSlice
        values = df.loc[:, idx[[var]]].values.tolist()  # get all values of a specific variable
        z = reduce(lambda x, y: x + y, values)  # flatten a list of lists
        data.append(pd.Series(z, name=var))
    #return data
    for series in data:
        percentiles = np.array([0.25, 0.5, 0.75])
        formatted_percentiles = format_percentiles(percentiles)
        stat_index = (['count', 'mean', 'std', 'min'] + formatted_percentiles + ['max'])
        d = ([series.count(), series.mean(), series.std(), series.min()] +
             [series.quantile(x) for x in percentiles] + [series.max()])
        s = pd.Series(d, index=stat_index, name=series.name)
        _df = pd.concat([_df, s], axis=1)
    return _df
dd = describe_customized(df)
Result:
al asn cc chnk ci ciwc \
25% 0.130846 0.849998 0.000000 0.018000 0.0 0.000000e+00
50% 0.131369 0.849999 0.000000 0.018000 0.0 0.000000e+00
75% 0.134000 0.849999 0.000000 0.018000 0.0 0.000000e+00
count 624.000000 624.000000 23088.000000 624.000000 64.0 2.308800e+04
max 0.137495 0.849999 1.000000 0.018006 0.0 5.576574e-04
mean 0.119082 0.762819 0.022013 0.016154 0.0 8.247306e-07
min 0.000000 0.000000 0.000000 0.000000 0.0 0.000000e+00
std 0.040338 0.258087 0.098553 0.005465 0.0 8.969210e-06
I created a function that returns a new dataframe with the statistics of the variables for a level of your choice:
def describe_levels(df, level):
    df_des = pd.DataFrame(
        index=df.columns.levels[0],
        columns=['count', 'mean', 'std', 'min', '25', '50', '75', 'max']
    )
    for index in df_des.index:
        df_des.loc[index, 'count'] = len(df[index]['1'][level])
        df_des.loc[index, 'mean'] = df[index]['1'][level].mean().mean()
        df_des.loc[index, 'std'] = df[index]['1'][level].std().mean()
        df_des.loc[index, 'min'] = df[index]['1'][level].min().mean()
        df_des.loc[index, 'max'] = df[index]['1'][level].max().mean()
        df_des.loc[index, '25'] = df[index]['1'][level].quantile(q=0.25).mean()
        df_des.loc[index, '50'] = df[index]['1'][level].quantile(q=0.5).mean()
        df_des.loc[index, '75'] = df[index]['1'][level].quantile(q=0.75).mean()
    return df_des
For example, I called:
describe_levels(df,'1').T
See here the result for pressure level 1:
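Separately, a more compact route to the per-variable summary shown in the expected output might be to stack the lower column levels into the row index so that only level 0 remains as columns, and then call describe(). A hedged, untested sketch, assuming the 4-level column MultiIndex from the question:

# Move column levels 1-3 into the row index; only the level-0 variable names remain as columns.
flat = df.stack(level=[1, 2, 3])

# One summary column per level-0 variable (ci, cc, z, r, ...), as in the expected output.
summary = flat.describe()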

Using fixed-interval (3-hour) data to generate continuous (1-hour) data

This is part of my data:
Day_Data Hour_Data WIN_D WIN_S TEM RHU PRE_1h
1 0 58 1 22 78 0
1 3 32 1.9 24.6 65 0
1 6 41 3.2 25.6 59 0
1 9 20 0.8 24.8 64 0
1 12 44 1.7 22.7 76 0
1 15 118 0.7 20.2 92 0
1 18 70 2.6 20.2 94 0
1 21 76 3.4 19.9 66 0
2 0 76 3.8 19.4 58 0
2 3 75 5.8 19.4 47 0
2 6 81 5.1 19.5 42 0
2 9 61 3.6 17.4 48 0
2 12 50 0.9 15.8 46 0
2 15 348 1.1 14.5 52 0
2 18 357 1.9 13.5 60 0
2 21 333 1.2 12.4 74 0
And I want to generate extra hourly rows, where each fill value is the mean of the last value and the next value.
How can I do that?
Thank you!
And @jdy, thanks for the reminder; this is what I have done:
data['time'] = ('2017' + '-' + '10' + '-' + data['Day_Data'].map(int).map(str)
                + ' ' + data['Hour_Data'].map(int).map(str) + ':' + '00' + ':' + '00')
from datetime import datetime
data.loc[:,'Date']=pd.to_datetime(data['time'])
data=data.drop(['Day_Data','Hour_Data','time'],axis=1)
index = data.set_index(data['Date'])
data=index.resample('1h').mean()
Output:
2017-10-01 00:00:00 58.0 1.0 22.0 78.0 0.0
2017-10-01 01:00:00 NaN NaN NaN NaN NaN
2017-10-01 02:00:00 NaN NaN NaN NaN NaN
2017-10-01 03:00:00 32.0 1.9 24.6 65.0 0.0
2017-10-01 04:00:00 NaN NaN NaN NaN NaN
2017-10-01 05:00:00 NaN NaN NaN NaN NaN
2017-10-01 06:00:00 41.0 3.2 25.6 59.0 0.0
2017-10-01 07:00:00 NaN NaN NaN NaN NaN
2017-10-01 08:00:00 NaN NaN NaN NaN NaN
2017-10-01 09:00:00 20.0 0.8 24.8 64.0 0.0
2017-10-01 10:00:00 NaN NaN NaN NaN NaN
2017-10-01 11:00:00 NaN NaN NaN NaN NaN
2017-10-01 12:00:00 44.0 1.7 22.7 76.0 0.0
2017-10-01 13:00:00 NaN NaN NaN NaN NaN
2017-10-01 14:00:00 NaN NaN NaN NaN NaN
2017-10-01 15:00:00 118.0 0.7 20.2 92.0 0.0
2017-10-01 16:00:00 NaN NaN NaN NaN NaN
2017-10-01 17:00:00 NaN NaN NaN NaN NaN
2017-10-01 18:00:00 70.0 2.6 20.2 94.0 0.0
2017-10-01 19:00:00 NaN NaN NaN NaN NaN
2017-10-01 20:00:00 NaN NaN NaN NaN NaN
2017-10-01 21:00:00 76.0 3.4 19.9 66.0 0.0
2017-10-01 22:00:00 NaN NaN NaN NaN NaN
2017-10-01 23:00:00 NaN NaN NaN NaN NaN
2017-10-02 00:00:00 76.0 3.8 19.4 58.0 0.0
2017-10-02 01:00:00 NaN NaN NaN NaN NaN
2017-10-02 02:00:00 NaN NaN NaN NaN NaN
2017-10-02 03:00:00 75.0 5.8 19.4 47.0 0.0
2017-10-02 04:00:00 NaN NaN NaN NaN NaN
2017-10-02 05:00:00 NaN NaN NaN NaN NaN
2017-10-02 06:00:00 81.0 5.1 19.5 42.0 0.0
but I have no idea how to fill the NaN values with the mean of the last value and the next value.
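A hedged sketch of one way to do that (not from the thread, and untested against the full data): averaging a forward-fill and a backward-fill gives every gap the mean of the previous and the next observed value.

# Mean of the previous and the next observed value for every NaN in the resampled frame;
# rows that already have data are unchanged, since (x + x) / 2 == x.
filled = (data.ffill() + data.bfill()) / 2

# If a smooth ramp between the two known points is preferred instead, pandas'
# built-in interpolation could be used: data.interpolate(method='time')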
