Sum by groups, preserving NAs

Sum by groups, preserving NAs - python

I am trying to aggregate the dataframe in order to have one date per row (for each group).
Cod1 Cod2 Date E A S
327 100013.0 001 2019-02-01 0.0 0.0 511.0
323 100013.0 001 2019-02-01 0.0 -14.0 NaN
336 100013.0 001 2019-02-02 0.0 -28.0 NaN
341 100013.0 001 2019-02-03 0.0 -6.0 NaN
350 100013.0 001 2019-02-03 0.0 -3.0 NaN
373 100013.0 001 2019-02-07 0.0 -15.0 0
377 100013.0 001 2019-02-07 0.0 -9.0 NaN
Using the following:
df = df.groupby(['Date', 'Cod1', 'Cod2'])['E','A', 'S'].sum()
I got the following output:
2019-02-01 100013.0 001 0.0 -14.0 511.0
2019-02-02 100013.0 001 0.0 -28.0 0.0
2019-02-03 100013.0 001 0.0 -9.0 0.0
2019-02-06 100013.0 001 0.0 -24.0 0.0
My questions is:
Is there some way to aggregate preserving NaN ?
There will be 3 scenarios:
1 -) Two rows on same date, last column having NaN and a not null number:
327 100013.0 001 2019-02-01 0.0 0.0 511.0
323 100013.0 001 2019-02-01 0.0 -14.0 NaN
I would like that in this situation always keep the number.
2-) Two rows on same date, last column having 2 NaNs rows
341 100013.0 001 2019-02-03 0.0 -6.0 NaN
350 100013.0 001 2019-02-03 0.0 -3.0 NaN
I would like that in this situation always keep the NaN.
3-) Two rows on same date, last column having one zero value column and one NaN column
373 100013.0 001 2019-02-07 0.0 -15.0 0
377 100013.0 001 2019-02-07 0.0 -9.0 NaN
I would like that in this situation always keep the 0.
So my expected out should be this one:
2019-02-01 100013.0 001 0.0 -14.0 511.0
2019-02-02 100013.0 001 0.0 -28.0 NaN
2019-02-03 100013.0 001 0.0 -9.0 NaN
2019-02-06 100013.0 001 0.0 -24.0 0.0

Check min_count
df.groupby(['Date', 'Cod1', 'Cod2'])['E','A', 'S'].sum(min_count=1)
Out[260]:
E A S
Date Cod1 Cod2
2019-02-01 100013.0 1 0.0 -14.0 511.0
2019-02-02 100013.0 1 0.0 -28.0 NaN
2019-02-03 100013.0 1 0.0 -9.0 NaN
2019-02-07 100013.0 1 0.0 -24.0 0.0

I guess a custom function can do:
(df.groupby(['Date', 'Cod1', 'Cod2'])
['E','A', 'S']
.agg(lambda x: np.nan if x.isna().all() else x.sum())
)
Output:
E A S
Date Cod1 Cod2
2019-02-01 100013.0 1 0.0 -14.0 511.0
2019-02-02 100013.0 1 0.0 -28.0 NaN
2019-02-03 100013.0 1 0.0 -9.0 NaN
2019-02-07 100013.0 1 0.0 -24.0 0.0

Related

How should I combine the rows of similar time in a Dataframe?

I'm processing a MIMIC dataset. Now I want to combine the data in the rows whose time difference (delta time) is below 10min. How can I do that?
The original data:
charttime hadm_id age is_male HR RR SPO2 Systolic_BP Diastolic_BP MAP PEEP PO2
0 2119-07-20 17:54:00 26270240 NaN NaN NaN NaN NaN 103.0 66.0 81.0 NaN NaN
1 2119-07-20 17:55:00 26270240 68.0 1.0 113.0 26.0 NaN NaN NaN NaN NaN NaN
2 2119-07-20 17:57:00 26270240 NaN NaN NaN NaN 92.0 NaN NaN NaN NaN NaN
3 2119-07-20 18:00:00 26270240 68.0 1.0 114.0 28.0 NaN 85.0 45.0 62.0 16.0 NaN
4 2119-07-20 18:01:00 26270240 NaN NaN NaN NaN 91.0 NaN NaN NaN NaN NaN
5 2119-07-30 21:00:00 26270240 68.0 1.0 90.0 16.0 93.0 NaN NaN NaN NaN NaN
6 2119-07-30 21:00:00 26270240 68.0 1.0 89.0 9.0 94.0 NaN NaN NaN NaN NaN
7 2119-07-30 21:01:00 26270240 68.0 1.0 89.0 10.0 93.0 NaN NaN NaN NaN NaN
8 2119-07-30 21:05:00 26270240 NaN NaN NaN NaN NaN 109.0 42.0 56.0 NaN NaN
9 2119-07-30 21:10:00 26270240 68.0 1.0 90.0 10.0 93.0 NaN NaN NaN NaN NaN
After combining the rows whose delta time is less than 10 min, the output I want:
(when there is duplicate data in same column in some rows to group, just take the first one)
charttime hadm_id age is_male HR RR SPO2 Systolic_BP Diastolic_BP MAP PEEP PO2
0 2119-07-20 17:55:00 26270240 68.0 1.0 113.0 26.0 92.0 103.0 66.0 81.0 16.0 NaN2119-07-30 20:00:00 26270240 68.0 1.0 90.0 16.0 93.0 NaN NaN NaN NaN NaN
1 2119-07-30 21:00:00 26270240 68.0 1.0 89.0 9.0 94.0 109.0 42.0 56.0 NaN NaN
How can I do this?

First, I would round the timestamp column to 10 minutes:
df['charttime'] = pd.to_datetime(df['charttime']).dt.floor('10T').dt.time
Then, I would drop the duplicates, based on the columns you want to compare (for example, hadm_id and charttime:
df.drop_duplicates(subset=['charttime', 'hadm_id'], keep='first', inplace=True)

Add a date to a list of dataframes by extracting it from the filename

I have a list of multiple dataframes dfs.
The dataframes come from files that have dates in its name. Eg. FilenameYYYYMMDD.xlsx
files = [str(file) for file in Path(/dir)]
dfs = [pd.read_excel(file, header=1)] for file in files]
I can extract the date from the file names:
date_extract = re.search('[0-9]{8}',files[0...20])
date = datetime.datetime.strptime(date_extract[0...20], '%Y%m%d').date()
But how can I assign to each df its respective date (by adding a column called 'Date')?

if your using pathlib we can use a dictionary to hold your dataframes and use a quick regex to extract the date, when we concat the dataframes the index will be set to the date.
import re
from pathlib import Path
dfs = {
re.search('(\d{4}.*).xlsx',f.name).group(1): pd.read_excel(f,header=1)
for f in Path(
/dir
).glob("*.xlsx")
}
print(pd.concat(dfs))
Unnamed: 0 e f c d
20200610 0 0 0.0 0.0 NaN NaN
1 1 0.0 0.0 NaN NaN
2 2 0.0 0.0 NaN NaN
3 3 0.0 0.0 NaN NaN
4 4 1.0 0.0 NaN NaN
5 5 0.0 1.0 NaN NaN
6 6 0.0 0.0 NaN NaN
7 7 0.0 0.0 NaN NaN
8 8 0.0 0.0 NaN NaN
9 9 0.0 0.0 NaN NaN
10 10 0.0 0.0 NaN NaN
11 11 0.0 0.0 NaN NaN
12 12 0.0 0.0 NaN NaN
13 13 0.0 0.0 NaN NaN
14 14 0.0 0.0 NaN NaN
15 15 0.0 0.0 NaN NaN
16 16 0.0 0.0 NaN NaN
17 17 0.0 0.0 NaN NaN
18 18 0.0 0.0 NaN NaN
19 19 0.0 0.0 NaN NaN
20 20 0.0 0.0 NaN NaN
21 21 0.0 0.0 NaN NaN
22 22 0.0 0.0 NaN NaN
23 23 0.0 0.0 NaN NaN
24 24 0.0 0.0 NaN NaN
25 25 0.0 0.0 NaN NaN
20201012 0 0 NaN NaN 0.0 0.0
1 1 NaN NaN 0.0 0.0
2 2 NaN NaN 1.0 0.0
3 3 NaN NaN 0.0 1.0

Pandas - duration where parameter is "1"

I am new to python and pandas and I am trying to solve this problem:
I have a dataset that looks something like this:
timestamp par_1 par_2
1486873206867 0 0
1486873207039 NaN 0
1486873207185 0 NaN
1486873207506 1 0
1486873207518 NaN NaN
1486873207831 1 0
1486873208148 0 NaN
1486873208469 0 1
1486873208479 1 NaN
1486873208793 1 NaN
1486873208959 NaN 1
1486873209111 1 NaN
1486873209918 NaN 0
1486873210075 0 NaN
I want to know the total duration of the event "1" for each parameter. (Parameters can only be NaN, 1 or 0)
I have already tried
df['duration_par_1'] = df.groupby(['par_1'])['timestamp'].apply(lambda x: x.max() - x.min())
but for further processing, I only need the duration of the event "1" to be in new columns and then that duration needs to be in every row of the new column so that it looks like this:
timestamp par_1 par_2 duration_par_1 duration_par2
1486873206867 0 0 2238 1449
1486873207039 NaN 0 2238 1449
1486873207185 0 NaN 2238 1449
1486873207506 1 0 2238 1449
1486873207518 NaN NaN 2238 1449
1486873207831 1 0 2238 1449
1486873208148 0 NaN 2238 1449
1486873208469 0 1 2238 1449
1486873208479 1 NaN 2238 1449
1486873208793 1 NaN 2238 1449
1486873208959 NaN 1 2238 1449
1486873209111 1 NaN 2238 1449
1486873209918 NaN 0 2238 1449
1486873210075 0 NaN 2238 1449
Thanks in advance!

I believe you need multiple values of par columns by difference of datetimes, because not exist another values like 0, 1 and NaN in data:
d = df['timestamp'].diff()
df1 = df.filter(like='par')
#if need duration by some value e.g. by `0`
#df1 = df.filter(like='par').eq(0).astype(int)
s = df1.mul(d, axis=0).sum().astype(int).add_prefix('duration_')
df = df.assign(**s)
print (df)
timestamp par_1 par_2 duration_par_1 duration_par_2
0 1486873206867 0.0 0.0 1110 487
1 1486873207039 NaN 0.0 1110 487
2 1486873207185 0.0 NaN 1110 487
3 1486873207506 1.0 0.0 1110 487
4 1486873207518 NaN NaN 1110 487
5 1486873207831 1.0 0.0 1110 487
6 1486873208148 0.0 NaN 1110 487
7 1486873208469 0.0 1.0 1110 487
8 1486873208479 1.0 NaN 1110 487
9 1486873208793 1.0 NaN 1110 487
10 1486873208959 NaN 1.0 1110 487
11 1486873209111 1.0 NaN 1110 487
12 1486873209918 NaN 0.0 1110 487
13 1486873210075 0.0 NaN 1110 487
Explanation:
First get difference of timestamp column:
print (df['timestamp'].diff())
0 NaN
1 172.0
2 146.0
3 321.0
4 12.0
5 313.0
6 317.0
7 321.0
8 10.0
9 314.0
10 166.0
11 152.0
12 807.0
13 157.0
Name: timestamp, dtype: float64
Select all columns with string par by filter:
print (df.filter(like='par'))
par_1 par_2
0 0.0 0.0
1 NaN 0.0
2 0.0 NaN
3 1.0 0.0
4 NaN NaN
5 1.0 0.0
6 0.0 NaN
7 0.0 1.0
8 1.0 NaN
9 1.0 NaN
10 NaN 1.0
11 1.0 NaN
12 NaN 0.0
13 0.0 NaN
Multiple filtered columns by mul by d:
print (df1.mul(d, axis=0))
par_1 par_2
0 NaN NaN
1 0.0 0.0
2 0.0 0.0
3 321.0 0.0
4 0.0 0.0
5 313.0 0.0
6 0.0 0.0
7 0.0 321.0
8 10.0 0.0
9 314.0 0.0
10 0.0 166.0
11 152.0 0.0
12 0.0 0.0
13 0.0 0.0
And sum values:
print (df1.mul(d, axis=0).sum())
par_1 1110.0
par_2 487.0
dtype: float64
Convert to integers and change index by add_prefix:
print (df1.mul(d, axis=0).sum().astype(int).add_prefix('duration_'))
duration_par_1 1110
duration_par_2 487
dtype: int32
Last create new columns by assign.

How to group by level 0 and describe in a multi index and level dataframe (pandas)?

Here is (file) a multi index and level dataframe. Loading the dataframe from a csv:
import pandas as pd
df = pd.read_csv('./enviar/only-bh-extreme-events-satellite.csv'
,index_col=[0,1,2,3,4]
,header=[0,1,2,3]
,skipinitialspace=True
,tupleize_cols=True
)
df.columns = pd.MultiIndex.from_tuples(df.columns)
print(df)
ci \
1
1
00h 06h 12h 18h
wsid lat lon start prcp_24
329 -43.969397 -19.883945 2007-03-18 10:00:00 72.0 NaN NaN NaN NaN
2007-03-20 10:00:00 104.4 NaN NaN NaN NaN
2007-10-18 23:00:00 92.8 NaN NaN NaN NaN
2007-12-21 00:00:00 60.4 NaN NaN NaN NaN
2008-01-19 18:00:00 53.0 NaN NaN NaN NaN
2008-04-05 01:00:00 80.8 0.0 0.0 0.0 0.0
2008-10-31 17:00:00 101.8 NaN NaN NaN NaN
2008-11-01 04:00:00 82.0 NaN NaN NaN NaN
2008-12-29 00:00:00 57.8 NaN NaN NaN NaN
2009-03-28 10:00:00 72.4 NaN NaN NaN NaN
2009-10-07 02:00:00 57.8 NaN NaN NaN NaN
2009-10-08 00:00:00 83.8 NaN NaN NaN NaN
2009-11-28 16:00:00 84.4 NaN NaN NaN NaN
2009-12-18 04:00:00 51.8 NaN NaN NaN NaN
2009-12-28 00:00:00 96.4 NaN NaN NaN NaN
2010-01-06 05:00:00 74.2 NaN NaN NaN NaN
2011-12-18 00:00:00 113.6 NaN NaN NaN NaN
2011-12-19 00:00:00 90.6 NaN NaN NaN NaN
2012-11-15 07:00:00 85.8 NaN NaN NaN NaN
2013-10-17 00:00:00 52.4 NaN NaN NaN NaN
2014-04-01 22:00:00 72.0 0.0 0.0 0.0 0.0
2014-10-20 06:00:00 56.6 NaN NaN NaN NaN
2014-12-13 09:00:00 104.4 NaN NaN NaN NaN
2015-02-09 00:00:00 62.0 NaN NaN NaN NaN
2015-02-16 19:00:00 56.8 NaN NaN NaN NaN
2015-05-06 17:00:00 50.8 0.0 0.0 0.0 0.0
2016-02-26 00:00:00 52.2 NaN NaN NaN NaN
343 -44.416883 -19.885398 2008-08-30 21:00:00 50.4 0.0 0.0 0.0 0.0
2009-02-01 01:00:00 53.8 NaN NaN NaN NaN
2010-03-22 00:00:00 51.4 NaN NaN NaN NaN
2011-11-12 21:00:00 57.8 NaN NaN NaN NaN
2011-11-25 22:00:00 107.6 NaN NaN NaN NaN
2012-12-28 20:00:00 94.0 NaN NaN NaN NaN
2013-10-16 22:00:00 50.8 NaN NaN NaN NaN
2014-11-06 21:00:00 55.2 NaN NaN NaN NaN
2015-01-24 00:00:00 80.0 NaN NaN NaN NaN
2015-01-27 00:00:00 52.8 NaN NaN NaN NaN
370 -43.958651 -19.980034 2015-01-28 23:00:00 50.4 NaN NaN NaN NaN
2015-01-29 00:00:00 50.6 NaN NaN NaN NaN
I'm trying to describe grouping by level (0), variables ci, d, r, z... I like to get the count, max, min, std, etc...
When I tried df.describe() I got not grouping by level 0. So I expected:
ci cc z r -> Level 0
count 39.000000 39.000000 39.000000 39.000000
mean 422577.032051 422025.595353 421672.402244 422449.004808
std 144740.869473 144550.040108 144425.167173 144692.422425
min 0.000000 0.000000 0.000000 0.000000
25% 467962.437500 467512.156250 467915.437500 468552.750000
50% 470644.687500 469924.468750 469772.312500 470947.468750
75% 472557.875000 471953.828125 471156.250000 472279.937500
max 473988.062500 473269.187500 472358.125000 473675.812500

I had created this helper function:
def format_percentiles(percentiles):
percentiles = np.asarray(percentiles)
percentiles = 100 * percentiles
int_idx = (percentiles.astype(int) == percentiles)
if np.all(int_idx):
out = percentiles.astype(int).astype(str)
return [i + '%' for i in out]
And this my own describe function:
import numpy as np
from functools import reduce
def describe_customized(df):
_df = pd.DataFrame()
data = []
variables = list(set(df.columns.get_level_values(0)))
variables.sort()
for var in variables:
idx = pd.IndexSlice
values = df.loc[:, idx[[var]]].values.tolist() #get all values from a specif variable
z = reduce(lambda x,y: x+y,values) #flat a list of list
data.append(pd.Series(z,name=var))
#return data
for series in data:
percentiles = np.array([0.25, 0.5, 0.75])
formatted_percentiles = format_percentiles(percentiles)
stat_index = (['count', 'mean', 'std', 'min'] + formatted_percentiles + ['max'])
d = ([series.count(), series.mean(), series.std(), series.min()] +
[series.quantile(x) for x in percentiles] + [series.max()])
s = pd.Series(d, index=stat_index, name=series.name)
_df = pd.concat([_df,s], axis=1)
return _df
dd = describe_customized(df)
Result:
al asn cc chnk ci ciwc \
25% 0.130846 0.849998 0.000000 0.018000 0.0 0.000000e+00
50% 0.131369 0.849999 0.000000 0.018000 0.0 0.000000e+00
75% 0.134000 0.849999 0.000000 0.018000 0.0 0.000000e+00
count 624.000000 624.000000 23088.000000 624.000000 64.0 2.308800e+04
max 0.137495 0.849999 1.000000 0.018006 0.0 5.576574e-04
mean 0.119082 0.762819 0.022013 0.016154 0.0 8.247306e-07
min 0.000000 0.000000 0.000000 0.000000 0.0 0.000000e+00
std 0.040338 0.258087 0.098553 0.005465 0.0 8.969210e-06

I created a function that returns a new dataframe with the statistics of the variables for a level of your choice:
def describe_levels(df,level):
df_des = pd.DataFrame(
index=df.columns.levels[0],
columns=['count','mean','std','min','25','50','75','max']
)
for index in df_des.index:
df_des.loc[index,'count'] = len(df[index]['1'][level])
df_des.loc[index,'mean'] = df[index]['1'][level].mean().mean()
df_des.loc[index,'std'] = df[index]['1'][level].std().mean()
df_des.loc[index,'min'] = df[index]['1'][level].min().mean()
df_des.loc[index,'max'] = df[index]['1'][level].max().mean()
df_des.loc[index,'25'] = df[index]['1'][level].quantile(q=0.25).mean()
df_des.loc[index,'50'] = df[index]['1'][level].quantile(q=0.5).mean()
df_des.loc[index,'75'] = df[index]['1'][level].quantile(q=0.75).mean()
return df_des
For example, I called:
describe_levels(df,'1').T
See here the result for pressure level 1:

Using Fixed interval(3hour) data to generates continuous time(1hour) data

This is part of my data:
Day_Data Hour_Data WIN_D WIN_S TEM RHU PRE_1h
1 0 58 1 22 78 0
1 3 32 1.9 24.6 65 0
1 6 41 3.2 25.6 59 0
1 9 20 0.8 24.8 64 0
1 12 44 1.7 22.7 76 0
1 15 118 0.7 20.2 92 0
1 18 70 2.6 20.2 94 0
1 21 76 3.4 19.9 66 0
2 0 76 3.8 19.4 58 0
2 3 75 5.8 19.4 47 0
2 6 81 5.1 19.5 42 0
2 9 61 3.6 17.4 48 0
2 12 50 0.9 15.8 46 0
2 15 348 1.1 14.5 52 0
2 18 357 1.9 13.5 60 0
2 21 333 1.2 12.4 74 0
and, I want to generate extra data like this:
the fill values are the mean of the last value and the next value.
How can I do that?
Thank you!
And, #jdy thanks for reminder, this is what I have done:
data['time']='2017'+'-'+'10'+'-'+data['Day_Data'].map(int).map(str)+'
'+data['Hour_Data'].map(int).map(str)+':'+'00'+':'+'00'
from datetime import datetime
data.loc[:,'Date']=pd.to_datetime(data['time'])
data=data.drop(['Day_Data','Hour_Data','time'],axis=1)
index = data.set_index(data['Date'])
data=index.resample('1h').mean()
Output:
2017-10-01 00:00:00 58.0 1.0 22.0 78.0 0.0
2017-10-01 01:00:00 NaN NaN NaN NaN NaN
2017-10-01 02:00:00 NaN NaN NaN NaN NaN
2017-10-01 03:00:00 32.0 1.9 24.6 65.0 0.0
2017-10-01 04:00:00 NaN NaN NaN NaN NaN
2017-10-01 05:00:00 NaN NaN NaN NaN NaN
2017-10-01 06:00:00 41.0 3.2 25.6 59.0 0.0
2017-10-01 07:00:00 NaN NaN NaN NaN NaN
2017-10-01 08:00:00 NaN NaN NaN NaN NaN
2017-10-01 09:00:00 20.0 0.8 24.8 64.0 0.0
2017-10-01 10:00:00 NaN NaN NaN NaN NaN
2017-10-01 11:00:00 NaN NaN NaN NaN NaN
2017-10-01 12:00:00 44.0 1.7 22.7 76.0 0.0
2017-10-01 13:00:00 NaN NaN NaN NaN NaN
2017-10-01 14:00:00 NaN NaN NaN NaN NaN
2017-10-01 15:00:00 118.0 0.7 20.2 92.0 0.0
2017-10-01 16:00:00 NaN NaN NaN NaN NaN
2017-10-01 17:00:00 NaN NaN NaN NaN NaN
2017-10-01 18:00:00 70.0 2.6 20.2 94.0 0.0
2017-10-01 19:00:00 NaN NaN NaN NaN NaN
2017-10-01 20:00:00 NaN NaN NaN NaN NaN
2017-10-01 21:00:00 76.0 3.4 19.9 66.0 0.0
2017-10-01 22:00:00 NaN NaN NaN NaN NaN
2017-10-01 23:00:00 NaN NaN NaN NaN NaN
2017-10-02 00:00:00 76.0 3.8 19.4 58.0 0.0
2017-10-02 01:00:00 NaN NaN NaN NaN NaN
2017-10-02 02:00:00 NaN NaN NaN NaN NaN
2017-10-02 03:00:00 75.0 5.8 19.4 47.0 0.0
2017-10-02 04:00:00 NaN NaN NaN NaN NaN
2017-10-02 05:00:00 NaN NaN NaN NaN NaN
2017-10-02 06:00:00 81.0 5.1 19.5 42.0 0.0
but, I have no idea about how to fill the NaN by the mean of the last value and the next value.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Sum by groups, preserving NAs - python

Check min_count df.groupby(['Date', 'Cod1', 'Cod2'])['E','A', 'S'].sum(min_count=1) Out[260]: E A S Date Cod1 Cod2 2019-02-01 100013.0 1 0.0 -14.0 511.0 2019-02-02 100013.0 1 0.0 -28.0 NaN 2019-02-03 100013.0 1 0.0 -9.0 NaN 2019-02-07 100013.0 1 0.0 -24.0 0.0

Related

How should I combine the rows of similar time in a Dataframe?

Add a date to a list of dataframes by extracting it from the filename

Pandas - duration where parameter is "1"

How to group by level 0 and describe in a multi index and level dataframe (pandas)?

Using Fixed interval(3hour) data to generates continuous time(1hour) data

Categories

Resources