I have a dataframe that contains the number of observations per income group:
INCAGG
1 6.561681e+08
3 9.712955e+08
5 1.658043e+09
7 1.710781e+09
9 2.356979e+09
I would like to compute the median income group. What do I mean?
Let's start with a simpler series:
INCAGG
1 6
3 9
5 16
7 17
9 23
It represents this set of numbers:
1 1 1 1 1 1
3 3 3 3 3 3 3 3 3
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
Which I can reorder to
1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7
7 7 7 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
which visually is what I mean - the median here would be 7.
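As a quick sanity check, the same picture can be built programmatically (a minimal sketch using numpy's repeat, which expands each group label by its count):
import numpy as np

# expand each group label by its count, then take the ordinary median
counts = {1: 6, 3: 9, 5: 16, 7: 17, 9: 23}
expanded = np.repeat(list(counts.keys()), list(counts.values()))
print(np.median(expanded))  # 7.0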
After glancing at a numpy example here, I think cumsum() provides a good approach. Assuming your column of counts is called 'wt', here's a simple solution that will work most of the time (and see below for a more general solution):
df = df.sort_values('incagg')
df['tmp'] = df.wt.cumsum() < ( df.wt.sum() / 2. )
df['med_grp'] = (df.tmp==False) & (df.tmp.shift()==True)
The second code line above splits the rows into those below and above the halfway point; the median observation will be in the first False group.
incagg wt tmp med_grp
0 1 656168100 True False
1 3 971295500 True False
2 5 1658043000 True False
3 7 1710781000 False True
4 9 2356979000 False False
df.loc[df.med_grp, 'incagg']
3 7
Name: incagg, dtype: int64
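If you prefer a one-liner for the simple case, the median group is the first row whose cumulative weight reaches half the total (a sketch, assuming the same 'incagg'/'wt' columns and a frame already sorted by 'incagg'):
half = df.wt.sum() / 2.
median_group = df.loc[df.wt.cumsum().ge(half).idxmax(), 'incagg']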
This will work fine when the median is unique and often when it isn't. The problem can only occur if the median is non-unique AND it falls on the edge of a group. In this case (with 5 groups and weights in the millions/billions), it's really not a concern but nevertheless here's a more general solution:
df['tmp1'] = df.wt.cumsum() == (df.wt.sum() / 2.)
df['tmp2'] = df.wt.cumsum() < (df.wt.sum() / 2.)
df['med_grp'] = (df.tmp2==False) & (df.tmp2.shift()==True)
df['med_grp'] = df.med_grp | df.tmp1.shift(fill_value=False)
incagg wt tmp1 tmp2 med_grp
0 1 1 False True False
1 3 1 False True False
2 5 1 True False True
3 7 2 False False True
4 9 1 False False False
df.loc[df.med_grp, 'incagg']
2 5
3 7
df.loc[df.med_grp, 'incagg'].mean()
6.0
You can use chain from itertools. I used a list comprehension to get a list of each aggregation group repeated the appropriate number of times, then used chain to flatten it into a single list. Finally, I converted it to a Series and calculated the median:
from itertools import chain
import pandas as pd

df = pd.DataFrame([6, 9, 16, 17, 23], index=[1, 3, 5, 7, 9], columns=['counts'])
median = pd.Series(list(chain(*[[k] * v for k, v in zip(df.index, df.counts)]))).median()
>>> median
7.0
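A vectorized alternative that skips the Python-level loop entirely (a sketch against the same df): Index.repeat expands each index label by its count.
median = df.index.repeat(df['counts']).to_series().median()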
Related
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({'date': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                             'open': [4, 5, 3, 4, 5, 6, 7, 8, 9, 10],
                             'close': [4, 5, 6, 7, 8, 1, 2, 9, 10, 11],
                             'stock': ['A'] * 5 + ['B'] * 5})
df['flag'] = np.select([df['close']>df['open'],df['close']<df['open']],['up','down'],default='flat')
df
   date  open  close stock  flag
0     1     4      4     A  flat
1     2     5      5     A  flat
2     3     3      6     A    up
3     4     4      7     A    up
4     5     5      8     A    up
5     6     6      1     B  down
6     7     7      2     B  down
7     8     8      9     B    up
8     9     9     10     B    up
9    10    10     11     B    up
I tried the following. None of them work; they all give me a "No numeric types to aggregate" error:
# flag if the previous 3 days (t-2, t-1, and t) are all 'up' for each stock
df['3days_up'] = df.groupby('stock')['flag'].rolling(3).apply(lambda x: 'Yes' if all(x['flag']=='up') else 'No')
df['3days_up'] = df.groupby('stock')[['flag']].rolling(3).apply(lambda x: 'Yes' if all(x['flag']=='up') else 'No')
df['3days_up'] = df.groupby('stock').rolling(3).apply(lambda x: 'Yes' if all(x['flag']=='up') else 'No')
Expected output:
   date  open  close stock  flag 3days_up
0     1     4      4     A  flat       No
1     2     5      5     A  flat       No
2     3     3      6     A    up       No
3     4     4      7     A    up       No
4     5     5      8     A    up      Yes
5     6     6      1     B  down       No
6     7     7      2     B  down       No
7     8     8      9     B    up       No
8     9     9     10     B    up       No
9    10    10     11     B    up      Yes
Convert the up values to True and everything else to False as a starting point:
df['3days_up'] = np.where(df.assign(is_up=df['flag'] == 'up')
                            .groupby('stock').rolling(3)['is_up']
                            .sum() >= 3, 'Yes', 'No')
print(df)
# Output
date open close stock flag 3days_up
0 1 4 4 A flat No
1 2 5 5 A flat No
2 3 3 6 A up No
3 4 4 7 A up No
4 5 5 8 A up Yes
5 6 6 1 B down No
6 7 7 2 B down No
7 8 8 9 B up No
8 9 9 10 B up No
9 10 10 11 B up Yes
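An equivalent per-group spelling (a sketch): cast the flag to 0/1 and ask whether the rolling minimum over 3 rows is still 1, which means all three rows were 'up'.
is_up = df['flag'].eq('up').astype(int)
all3 = is_up.groupby(df['stock']).transform(lambda s: s.rolling(3).min()).eq(1)
df['3days_up'] = np.where(all3, 'Yes', 'No')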
I have a dataframe like the one below, holding each patient's stay in the ICU (in hours), shown by ICULOS.
df # Main dataframe
dfy = df.copy()
dfy
P_ID  ICULOS  Count
   1       1      5
   1       2      5
   1       3      5
   1       4      5
   1       5      5
   2       1      9
   2       2      9
   2       3      9
   2       4      9
   2       5      9
   2       6      9
   2       7      9
   2       8      9
   2       9      9
   3       1      3
   3       2      3
   3       3      3
   4       1      7
   4       2      7
   4       3      7
   4       4      7
   4       5      7
   4       6      7
   4       7      7
I calculated each patient's ICULOS count and placed it in a new column named Count using this code:
dfy['Count'] = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
Now, I want to remove those patients, based on P_ID, whose Count is less than 8. (Note: I want to remove the whole patient record.) So, after removing the patients with Count < 8, only P_ID = 2 will remain, since its count is 9.
The desired output:
P_ID  ICULOS  Count
   2       1      9
   2       2      9
   2       3      9
   2       4      9
   2       5      9
   2       6      9
   2       7      9
   2       8      9
   2       9      9
I tried the following code, but for some reason it is not working for me. It did work at first, but when I re-ran it after a few days it gave me 0 results. Can someone suggest a better approach? Thanks.
from tqdm import tqdm_notebook  # needed for the progress bars below

dfy = dfy.drop_duplicates(subset=['P_ID'], keep='first')
lis1 = dfy['P_ID'].tolist()
Icu_less_8 = dfy.loc[dfy['Count'] < 8]
lis2 = Icu_less_8.P_ID.to_list()
lis_3 = [k for k in tqdm_notebook(lis1) if k not in lis2]
# removing those patients who have ICULOS of less than 8 hours
df_1 = pd.DataFrame()
for l in tqdm_notebook(lis_3, desc='Progress'):
    df_1 = df_1.append(df.loc[df['P_ID'] == l])
You can filter the rows directly on the transformed count using Series.ge:
In [1521]: dfy[dfy.groupby(['P_ID'])['ICULOS'].transform('count').ge(8)]
Out[1521]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
EDIT after OP's comment: For multiple conditions, do:
In [1533]: x = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
In [1539]: dfy.loc[x[x.ge(8) & x.le(72)].index]
Out[1539]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
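A more self-describing alternative is groupby.filter, which keeps whole groups that satisfy a predicate (a sketch; it is usually slower than the transform approach on large frames):
dfy.groupby('P_ID').filter(lambda g: 8 <= len(g) <= 72)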
[dataframe shown as an image in the original post: columns ACTUAL_GI_YEAR and INPUT_PARTNO, plus one column per year]
year= 2020 (MAX COLUMN)
lastFifthYear = year - 4
input = '2001509-00'
I want to add all the values between year (2020) and lastFifthYear (2016) where INPUT_PARTNO equals input,
so for that input value I should get 4+6+2+3+2 (for 2016 through 2020), i.e. 17.
Please give me some code.
Here is some code that should work, but you definitely need to improve the way you ask questions here :-)
Considering df is the table you pasted as an image above.
>>> year = 2016
>>> df_new=df.query('INPUT_PARTNO == "2001509-00"').melt(['ACTUAL_GI_YEAR', 'INPUT_PARTNO'], var_name='year', value_name='number')
>>> df_new.year=df_new.year.astype(int)
>>> df_new[df_new.year >= year].groupby(['ACTUAL_GI_YEAR','INPUT_PARTNO']).agg({'number' : sum})
number
ACTUAL_GI_YEAR INPUT_PARTNO
0 2001509-00 17
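If the year columns are literal column labels, a more direct spelling avoids the melt entirely (a sketch, assuming columns labelled 2016 through 2020 and string part numbers; both variable names below are the question's own):
last_fifth, year = 2016, 2020
year_cols = [c for c in df.columns if str(c).isdigit() and last_fifth <= int(c) <= year]
total = df.loc[df['INPUT_PARTNO'] == '2001509-00', year_cols].sum(axis=1).sum()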
Example Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (10, 10)),
                  columns=list('ab') + list(range(2, 10)))
Solved
# sum columns between 3 and 6, row-wise, on the rows where a == 9
df['number'] = df.loc[df['a'].eq(9),
                      pd.to_numeric(df.columns, errors='coerce')
                        .to_series()
                        .between(3, 6)
                        .values].sum(axis=1)
print(df)
a b 2 3 4 5 6 7 8 9 number
0 1 9 9 2 6 0 6 1 4 2 NaN
1 2 3 4 8 7 2 4 0 0 6 NaN
2 2 2 7 4 9 6 7 1 0 0 NaN
3 0 3 5 3 0 4 2 7 2 6 NaN
4 7 7 1 4 7 7 9 7 4 2 NaN
5 9 9 9 0 3 3 3 8 7 7 9.0
6 9 0 5 5 7 9 6 6 5 7 27.0
7 2 1 9 1 9 3 3 4 4 9 NaN
8 4 0 5 9 6 7 3 9 1 6 NaN
9 5 5 0 8 6 4 5 4 7 4 NaN
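The same column selection spelled out step by step, in case the pd.to_numeric chain reads as too dense (a sketch; it relies on the non-letter column labels being integers, as in the setup above):
mask = df['a'].eq(9)
mid_cols = [c for c in df.columns if isinstance(c, int) and 3 <= c <= 6]
df.loc[mask, 'number'] = df.loc[mask, mid_cols].sum(axis=1)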
If I have preexisting columns (say 12 columns, all with unique names), and I want to organize them into two "header" columns, such as 8 assigned to Detail and 4 assigned to Summary, what is the most effective approach besides sorting them, manually creating a new index, and then swapping out the indices?
Happy to provide more example detail, but that's the gist of what is pretty generic problem.
You need the multi-index columns capability. It's important to rename() the columns before reindex() so no data is lost.
df = pd.DataFrame({f"col-{i}":[random.randint(1,10) for i in range(10)] for i in range(12)})
header = [f"col-{i}" for i in range(8)]
header
# build a multi-index
mi = pd.MultiIndex.from_tuples([("Header" if c in header else "Detail", c)
                                for c in df.columns], names=('Category', 'Name'))
# rename before reindex to prevent data loss
df = df.rename(columns={c:mi[i] for i,c in enumerate(df.columns)}).reindex(columns=mi)
print(df.to_string())
output
Category Header Detail
Name col-0 col-1 col-2 col-3 col-4 col-5 col-6 col-7 col-8 col-9 col-10 col-11
0 5 5 6 1 8 3 8 6 8 2 8 10
1 2 7 10 5 2 10 5 10 10 7 6 1
2 10 1 1 2 7 9 2 9 4 4 7 6
3 8 10 1 3 3 4 10 10 9 7 6 8
4 6 8 7 2 5 4 3 3 7 9 8 6
5 6 4 4 4 1 5 8 4 4 1 6 8
6 3 7 3 8 8 4 6 1 5 10 5 10
7 5 1 10 9 9 7 8 2 6 7 10 4
8 2 2 1 4 8 8 7 2 5 9 9 9
9 8 6 5 6 2 8 2 8 10 7 9 3
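With the MultiIndex in place, a whole category can be pulled out in a single indexing step, which is the payoff of this structure (a quick usage sketch):
header_block = df['Header']         # the eight col-0..col-7 columns
detail_block = df.loc[:, 'Detail']  # the remaining four columns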
OK, I have this pandas dataframe:
import pandas
dfp = pandas.DataFrame([5, 10, 1, 7, 13, 4, 5, 7, 8, 10, 11, 3])
And I want to create a second dataframe with the rows that have a value greater than 5, like this:
dfp2 = dfp[dfp > 5]
My problem is that I obtain this result:
0
0 NaN
1 10
2 NaN
3 7
4 13
5 NaN
6 NaN
7 7
8 8
9 10
10 11
11 NaN
And what I want is this other result:
0
0 10
1 7
2 13
3 7
4 8
5 10
6 11
What is wrong with my code?
Thanks a lot
You're using the mask generated from the comparison, so where it's False it returns NaN. To get rid of those, call dropna:
In [32]:
dfp[dfp > 5].dropna()
Out[32]:
0
1 10
3 7
4 13
7 7
8 8
9 10
10 11
The mask here:
In [33]:
dfp > 5
Out[33]:
0
0 False
1 True
2 False
3 True
4 True
5 False
6 False
7 True
8 True
9 True
10 True
11 False
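Alternatively, index with a 1-D boolean Series instead of a DataFrame mask, so rows are selected rather than masked; resetting the index then reproduces the 0..6 labels from the desired output (a sketch):
dfp2 = dfp[dfp[0] > 5].reset_index(drop=True)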