Calculate count of a column based on another column in a Python dataframe

I have a dataframe like the one below, holding each patient's stay in the ICU (in hours), shown by ICULOS.
df # Main dataframe
dfy = df.copy()
dfy
P_ID ICULOS Count
1    1      5
1    2      5
1    3      5
1    4      5
1    5      5
2    1      9
2    2      9
2    3      9
2    4      9
2    5      9
2    6      9
2    7      9
2    8      9
2    9      9
3    1      3
3    2      3
3    3      3
4    1      7
4    2      7
4    3      7
4    4      7
4    5      7
4    6      7
4    7      7
I calculated each patient's ICULOS count and placed it in a new column named Count using:
dfy['Count'] = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
Now I want to remove, by P_ID, those patients whose Count is less than 8 (note: I want to remove the whole patient record). After removing the patients with Count < 8, only P_ID = 2 will remain, as its count is 9.
The desired output:
P_ID ICULOS Count
2    1      9
2    2      9
2    3      9
2    4      9
2    5      9
2    6      9
2    7      9
2    8      9
2    9      9
I tried the following code, but for some reason it is not working for me. It did work at first, but when I re-ran it a few days later it gave me 0 results. Can someone suggest better code? Thanks.
import pandas as pd
from tqdm.notebook import tqdm_notebook

dfy = dfy.drop_duplicates(subset=['P_ID'], keep='first')
lis1 = dfy['P_ID'].tolist()
Icu_less_8 = dfy.loc[dfy['Count'] < 8]
lis2 = Icu_less_8.P_ID.to_list()
lis_3 = [k for k in tqdm_notebook(lis1) if k not in lis2]

# removing those patients who have ICULOS of less than 8 hours
df_1 = pd.DataFrame()
for l in tqdm_notebook(lis_3, desc='Progress'):
    df_1 = df_1.append(df.loc[df['P_ID'] == l])
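(As a side note, the final append loop can be collapsed into a single boolean mask via Series.isin; this is just a sketch of that idiom, assuming lis_3 is built as above, not a fix for the 0-results issue:)

# keep every row whose P_ID survived the Count >= 8 cut
df_1 = df[df['P_ID'].isin(lis_3)]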

You can directly filter the rows using the count from transform and Series.ge:
In [1521]: dfy[dfy.groupby(['P_ID'])['ICULOS'].transform('count').ge(8)]
Out[1521]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
EDIT after OP's comment: For multiple conditions, do:
In [1533]: x = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
In [1539]: dfy.loc[x[x.ge(8) & x.le(72)].index]
Out[1539]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
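Equivalently, both bounds can be applied as one boolean mask, without going through .index; a minimal sketch assuming the same dfy:

# per-patient row count, index-aligned with dfy
counts = dfy.groupby('P_ID')['ICULOS'].transform('count')
# keep patients with between 8 and 72 rows (inclusive)
dfy[counts.ge(8) & counts.le(72)]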

Related

How to reduce pandas dataframe to only those individuals with all timepoints

I am trying to conduct a mixed-model analysis but would like to include only individuals who have data at all available timepoints. Here is an example of what my dataframe looks like:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,4,4,4,4]
timepoint = [1,2,3,4,5,6,1,2,3,4,5,6,1,2,4,1,2,3,4,5,6]
outcome = [2,3,4,5,6,7,3,4,1,2,3,4,5,4,5,8,4,5,6,2,3]
df = pd.DataFrame({'id': ids,
                   'timepoint': timepoint,
                   'outcome': outcome})
print(df)
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 3 1 5
13 3 2 4
14 3 4 5
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
I want to only keep individuals in the id column who have all 6 timepoints. I.e. IDs 1, 2, and 4 (and cut out all of ID 3's data).
Here's the ideal output:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
12 4 1 8
13 4 2 4
14 4 3 5
15 4 4 6
16 4 5 2
17 4 6 3
Any help much appreciated.
You can count the number of unique timepoints you have, then filter your dataframe accordingly with transform('nunique') and loc, keeping only the IDs that contain all 6 of them:
t = len(set(timepoint))
res = df.loc[df.groupby('id')['timepoint'].transform('nunique').eq(t)]
Prints:
id timepoint outcome
0 1 1 2
1 1 2 3
2 1 3 4
3 1 4 5
4 1 5 6
5 1 6 7
6 2 1 3
7 2 2 4
8 2 3 1
9 2 4 2
10 2 5 3
11 2 6 4
15 4 1 8
16 4 2 4
17 4 3 5
18 4 4 6
19 4 5 2
20 4 6 3
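An equivalent approach, usually slower on large frames but arguably more readable, is groupby(...).filter, which keeps whole groups that satisfy a predicate; a sketch using the t defined above:

# keep only ids whose group covers all t distinct timepoints
res = df.groupby('id').filter(lambda g: g['timepoint'].nunique() == t)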

Pandas - Cumulative Sum of previous row values

Here is sample dataset:
id a
0 5 1
1 5 0
2 5 4
3 5 6
4 5 2
5 5 3
6 9 0
7 9 1
8 9 6
9 9 2
10 9 4
From the dataset, I want to generate a column sum, grouped by id. For the first 3 rows of each group, sum is the cumulative sum of a. From the 4th row on, each row contains the sum of the previous 3 rows' a values. Loop through each row.
Desired Output:
id a sum
0 5 1 1
1 5 0 1
2 5 4 5
3 5 6 5
4 5 2 10
5 5 3 12
6 9 0 0
7 9 1 1
8 9 6 7
9 9 2 7
10 9 4 9
Code I tried:
df['sum']=df['a'].rolling(min_periods=1, window=3).groupby(df['id']).cumsum()
You can define a function like the one below:

import numpy as np

def cumsum_last3(DF):
    nrow = DF.shape[0]
    sums = np.zeros(nrow)
    sums[0] = DF["a"].iloc[0]
    sums[1] = DF["a"].iloc[0] + DF["a"].iloc[1]
    for i in range(nrow - 2):
        # sum of the 3-row window of a ending at row i+2
        sums[i + 2] = DF["a"].iloc[i:i + 3].sum()
    DF["sum"] = sums
    return DF

DF_cums = cumsum_last3(DF)
DF_cums
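Note the loop above ignores the id groups, so it diverges from the desired output from the 4th row on. A vectorized sketch that reproduces the desired column per group, assuming the semantics described in the question (cumulative sum for the first 3 rows of each group, then the sum of the previous 3 values):

import pandas as pd

def prev3_sum(s):
    # first 3 rows: running total including the current value
    out = s.rolling(3, min_periods=1).sum()
    # from the 4th row on: sum of the previous 3 values (current excluded)
    out.iloc[3:] = s.shift(1).rolling(3).sum().iloc[3:]
    return out

df['sum'] = df.groupby('id')['a'].transform(prev3_sum).astype(int)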

how to dynamically add values of some of the columns in a dataframe?

(The dataframe is shown as an image in the original post; it has ACTUAL_GI_YEAR and INPUT_PARTNO columns plus one column per year.)
year = 2020 (the max year column)
lastFifthYear = year - 4
input = '2001509-00'
I want to add all the values between year (2020) and lastFifthYear (2016) where INPUT_PARTNO equals input, so for that input value I should get 4+6+2+3+2 (2016+2017+2018+2019+2020), i.e. 17.
Please give me some code.
Here is some code that should work but you definitely need to improve on the way you ask questions here :-)
Considering df is the table you pasted as image above.
>>> year = 2016
>>> df_new=df.query('INPUT_PARTNO == "2001509-00"').melt(['ACTUAL_GI_YEAR', 'INPUT_PARTNO'], var_name='year', value_name='number')
>>> df_new.year=df_new.year.astype(int)
>>> df_new[df_new.year >= year].groupby(['ACTUAL_GI_YEAR','INPUT_PARTNO']).agg({'number' : sum})
number
ACTUAL_GI_YEAR INPUT_PARTNO
0 2001509-00 17
Example Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (10, 10)),
                  columns=list('ab') + list(range(2, 10)))
Solved
# sum columns 3 through 6, row-wise, where a == 9
df['number'] = df.loc[df['a'].eq(9),
                      pd.to_numeric(df.columns, errors='coerce')
                        .to_series()
                        .between(3, 6)
                        .values].sum(axis=1)
print(df)
a b 2 3 4 5 6 7 8 9 number
0 1 9 9 2 6 0 6 1 4 2 NaN
1 2 3 4 8 7 2 4 0 0 6 NaN
2 2 2 7 4 9 6 7 1 0 0 NaN
3 0 3 5 3 0 4 2 7 2 6 NaN
4 7 7 1 4 7 7 9 7 4 2 NaN
5 9 9 9 0 3 3 3 8 7 7 9.0
6 9 0 5 5 7 9 6 6 5 7 27.0
7 2 1 9 1 9 3 3 4 4 9 NaN
8 4 0 5 9 6 7 3 9 1 6 NaN
9 5 5 0 8 6 4 5 4 7 4 NaN
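Since the original frame only exists as an image, here is a minimal sketch on made-up data with one integer column per year (column names and values assumed from the question's 4+6+2+3+2 example):

import pandas as pd

df = pd.DataFrame({'ACTUAL_GI_YEAR': [0],
                   'INPUT_PARTNO': ['2001509-00'],
                   2016: [4], 2017: [6], 2018: [2], 2019: [3], 2020: [2]})

year = 2020
lastFifthYear = year - 4

# select the year columns in [lastFifthYear, year] and sum them row-wise
cols = [c for c in df.columns if isinstance(c, int) and lastFifthYear <= c <= year]
total = df.loc[df['INPUT_PARTNO'] == '2001509-00', cols].sum(axis=1).iloc[0]
print(total)  # 17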

Bundle columns of a DataFrame into hierarchical index

If I have preexisting columns (say 12 columns, all with unique names), and I want to organize them into two "header" columns, such as 8 assigned to Detail and 4 assigned to Summary, what is the most effective approach besides sorting them, manually creating a new index, and then swapping out the indices?
Happy to provide more example detail, but that's the gist of what is pretty generic problem.
You need pandas' multi-index columns capability. It's important to rename() the columns before reindex() so that no data is lost.
df = pd.DataFrame({f"col-{i}":[random.randint(1,10) for i in range(10)] for i in range(12)})
header = [f"col-{i}" for i in range(8)]
header
# build a multi-index
mi = pd.MultiIndex.from_tuples([tuple(["Header" if c in header else "Detail", c])
for c in df.columns], names=('Category', 'Name'))
# rename before reindex to prevent data loss
df = df.rename(columns={c:mi[i] for i,c in enumerate(df.columns)}).reindex(columns=mi)
print(df.to_string())
output
Category Header Detail
Name col-0 col-1 col-2 col-3 col-4 col-5 col-6 col-7 col-8 col-9 col-10 col-11
0 5 5 6 1 8 3 8 6 8 2 8 10
1 2 7 10 5 2 10 5 10 10 7 6 1
2 10 1 1 2 7 9 2 9 4 4 7 6
3 8 10 1 3 3 4 10 10 9 7 6 8
4 6 8 7 2 5 4 3 3 7 9 8 6
5 6 4 4 4 1 5 8 4 4 1 6 8
6 3 7 3 8 8 4 6 1 5 10 5 10
7 5 1 10 9 9 7 8 2 6 7 10 4
8 2 2 1 4 8 8 7 2 5 9 9 9
9 8 6 5 6 2 8 2 8 10 7 9 3
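An alternative sketch, starting again from the flat df (before the rename above) and using the question's Detail/Summary naming (the 8/4 assignment is assumed): pd.concat with keys builds the same two-level header in one step:

# split the flat columns into the two categories (assumed assignment)
detail_cols = [f"col-{i}" for i in range(8)]
summary_cols = [f"col-{i}" for i in range(8, 12)]

df2 = pd.concat([df[detail_cols], df[summary_cols]],
                axis=1, keys=['Detail', 'Summary'])
df2.columns.names = ('Category', 'Name')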

Separate DataFrame into N (almost) equal segments

Say I have a data frame that looks like this:
Id ColA
1 2
2 2
3 3
4 5
5 10
6 12
7 18
8 20
9 25
10 26
I would like my code to create a new column at the end of the DataFrame that divides the total number of observations into 5 segments, labeled from 5 down to 1.
Id ColA Segment
1 2 5
2 2 5
3 3 4
4 5 4
5 10 3
6 12 3
7 18 2
8 20 2
9 25 1
10 26 1
I tried the following code but it doesn't work:
df['segment'] = pd.qcut(df['Id'],5)
I also want to know what would happen if the total number of my observations was not divisible by 5.
Actually, you were closer to the answer than you think. This will work regardless of whether len(df) is a multiple of 5 or not.
bins = 5
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 4
4 5 10 3
5 6 12 3
6 7 18 2
7 8 20 2
8 9 25 1
9 10 26 1
Where,
pd.qcut(df['Id'], bins).cat.codes

0    0
1    0
2    1
3    1
4    2
5    2
6    3
7    3
8    4
9    4
dtype: int8

Represents the categorical intervals returned by pd.qcut as integer values.
Another example, for a DataFrame with 7 rows.
df = df.head(7).copy()
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 3
4 5 10 2
5 6 12 1
6 7 18 1
This should work:
import numpy as np
df['segment'] = np.linspace(1, 6, len(df), endpoint=False, dtype=int)
It creates an array of ints between 1 and 5 the size of your frame. If you want it from 5 to 1, just add [::-1] at the end of the line.
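For completeness, a quick check of the reversed variant on the question's 10-row frame (data re-typed from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': range(1, 11),
                   'ColA': [2, 2, 3, 5, 10, 12, 18, 20, 25, 26]})

# ascending 1..5, two of each, then reversed to run 5..1
df['Segment'] = np.linspace(1, 6, len(df), endpoint=False, dtype=int)[::-1]
print(df['Segment'].tolist())  # [5, 5, 4, 4, 3, 3, 2, 2, 1, 1]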
