Creating a Cumulative Frequency Column in a Python DataFrame

I am trying to create a new column named 'Cumulative Frequency' in a data frame, where each value is the sum of all previous frequencies plus the frequency of the current row.
How can I do this?

You want cumsum:
df['Cumulative Frequency'] = df['Frequency'].cumsum()
Example:
In [23]:
df = pd.DataFrame({'Frequency':np.arange(10)})
df
Out[23]:
Frequency
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
In [24]:
df['Cumulative Frequency'] = df['Frequency'].cumsum()
df
Out[24]:
Frequency Cumulative Frequency
0 0 0
1 1 1
2 2 3
3 3 6
4 4 10
5 5 15
6 6 21
7 7 28
8 8 36
9 9 45
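As a self-contained sketch of the same idea, the snippet below builds a small frequency table and adds the running total. The 'Score' column and all of the values are made up for illustration; only cumsum itself comes from the answer above.

```python
import pandas as pd

# Hypothetical frequency table; the labels and counts are illustrative only.
df = pd.DataFrame({
    'Score': ['A', 'B', 'C', 'D'],
    'Frequency': [3, 5, 2, 4],
})

# Each row holds the sum of all frequencies up to and including that row.
df['Cumulative Frequency'] = df['Frequency'].cumsum()

print(df)
```

A quick sanity check is that the last value of the cumulative column equals df['Frequency'].sum().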

Related

Calculate count of a column based on other column in python dataframe

I have a dataframe like the one below, which holds each patient's length of stay in the ICU (in hours) in the ICULOS column.
df # Main dataframe
dfy = df.copy()
dfy
P_ID ICULOS Count
1 1 5
1 2 5
1 3 5
1 4 5
1 5 5
2 1 9
2 2 9
2 3 9
2 4 9
2 5 9
2 6 9
2 7 9
2 8 9
2 9 9
3 1 3
3 2 3
3 3 3
4 1 7
4 2 7
4 3 7
4 4 7
4 5 7
4 6 7
4 7 7
I calculated each patient's ICULOS count and placed it in a new column named Count using the code:
dfy['Count'] = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
Now I want to remove the patients (by P_ID) whose Count is less than 8. (Note: I want to remove the whole patient record.) After removing the patients with Count < 8, only P_ID = 2 will remain, as its count is 9.
The desired output:
P_ID ICULOS Count
2 1 9
2 2 9
2 3 9
2 4 9
2 5 9
2 6 9
2 7 9
2 8 9
2 9 9
I tried the following code, but for some reason it is not working for me. It did work at first, but when I re-ran it a few days later it returned 0 results. Can someone suggest better code? Thanks.
dfy = dfy.drop_duplicates(subset=['P_ID'],keep='first')
lis1 = dfy['P_ID'].tolist()
Icu_less_8 = dfy.loc[dfy['Count'] < 8]
lis2 = Icu_less_8.P_ID.to_list()
lis_3 = [k for k in tqdm_notebook(lis1) if k not in lis2]
# removing those patients who have ICULOS of less than 8 hours
df_1 = pd.DataFrame()
for l in tqdm_notebook(lis_3, desc = 'Progress'):
    df_1 = df_1.append(df.loc[df['P_ID']==l])
You can filter the rows directly on the result of transform, using Series.ge:
In [1521]: dfy[dfy.groupby(['P_ID'])['ICULOS'].transform('count').ge(8)]
Out[1521]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
EDIT after OP's comment: For multiple conditions, do:
In [1533]: x = dfy.groupby(['P_ID'])['ICULOS'].transform('count')
In [1539]: dfy.loc[x[x.ge(8) & x.le(72)].index]
Out[1539]:
P_ID ICULOS Count
5 2 1 9
6 2 2 9
7 2 3 9
8 2 4 9
9 2 5 9
10 2 6 9
11 2 7 9
12 2 8 9
13 2 9 9
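For reference, here is a minimal runnable sketch of the approach above that rebuilds the sample data from the question. Series.between is swapped in as a compact equivalent of combining ge and le; the names stays, counts and kept are illustrative and not from the original answer.

```python
import pandas as pd

# Rebuild the sample data: patient 1 has 5 rows, patient 2 has 9,
# patient 3 has 3 and patient 4 has 7.
stays = {1: 5, 2: 9, 3: 3, 4: 7}
df = pd.DataFrame(
    [(p_id, hour) for p_id, n in stays.items() for hour in range(1, n + 1)],
    columns=['P_ID', 'ICULOS'],
)

# Per-row group size, aligned with the original index.
counts = df.groupby('P_ID')['ICULOS'].transform('count')

# Keep whole patients whose count lies in [8, 72], inclusive on both ends.
kept = df[counts.between(8, 72)]

print(kept)
```

Because the boolean mask has the same index as df, no intermediate lists or loops are needed, and the original row labels (5 to 13 here) are preserved.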

Pandas calculate column based on other row

I need to calculate a column based on other rows. Basically, I want my new_column to be the sum of "base_column" over all rows with the same id.
I currently do the following, but it is not really efficient. What is the most efficient way to achieve this?
def calculate(x):
    # in fact my filter is more complex: basically same id and date in the last 4 weeks
    filtered_df = df[df["id"] == df.at[x.name, "id"]]
    df.at[x.name, "new_column"] = filtered_df["base_column"].sum()

df.apply(calculate, axis=1)
You can do it as below:
df['new_column']= df.groupby('id')['base_column'].transform('sum')
input
id base_column
0 1 2
1 1 4
2 2 5
3 3 6
4 5 7
5 7 4
6 7 5
7 7 3
output
id base_column new_column
0 1 2 6
1 1 4 6
2 2 5 5
3 3 6 6
4 5 7 7
5 7 4 12
6 7 5 12
7 7 3 12
Another way to do this is to use groupby and merge
import pandas as pd
df = pd.DataFrame({'id':[1,1,2],'base_column':[2,4,5]})
# compute sum by id
sum_base =df.groupby("id").agg({"base_column": 'sum'}).reset_index().rename(columns={'base_column':'new_column'})
# join the result to df
df = pd.merge(df,sum_base,how='left',on='id')
# id base_column new_column
#0 1 2 6
#1 1 4 6
#2 2 5 5
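To check that the two approaches agree, here is a small sketch using the sample input shown in the first answer. The names via_transform, sums and via_merge are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 3, 5, 7, 7, 7],
    'base_column': [2, 4, 5, 6, 7, 4, 5, 3],
})

# Approach 1: transform broadcasts each group's sum back onto its rows.
via_transform = df.groupby('id')['base_column'].transform('sum')

# Approach 2: aggregate once per id, then join the sums back onto df.
sums = (
    df.groupby('id', as_index=False)['base_column']
      .sum()
      .rename(columns={'base_column': 'new_column'})
)
via_merge = df.merge(sums, how='left', on='id')['new_column']

print(via_transform.tolist())
print(via_merge.tolist())
```

Both produce the same values here. One practical difference is that transform keeps the index of df, while a merge on columns returns a fresh default index, which matters if df does not use the default index.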

Separate DataFrame into N (almost) equal segments

Say I have a data frame that looks like this:
Id ColA
1 2
2 2
3 3
4 5
5 10
6 12
7 18
8 20
9 25
10 26
I would like my code to create a new column at the end of the DataFrame that splits the observations into 5 (almost) equal segments, labelled from 5 down to 1.
Id ColA Segment
1 2 5
2 2 5
3 3 4
4 5 4
5 10 3
6 12 3
7 18 2
8 20 2
9 25 1
10 26 1
I tried the following code, but it doesn't work:
df['segment'] = pd.qcut(df['Id'],5)
I also want to know what would happen if the total number of observations was not divisible by 5.
Actually, you were closer to the answer than you think. This will work regardless of whether len(df) is a multiple of 5 or not.
bins = 5
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 4
4 5 10 3
5 6 12 3
6 7 18 2
7 8 20 2
8 9 25 1
9 10 26 1
Where
pd.qcut(df['Id'], bins).cat.codes
0 0
1 0
2 1
3 1
4 2
5 2
6 3
7 3
8 4
9 4
dtype: int8
represents the categorical intervals returned by pd.qcut as integer values.
Another example, for a DataFrame with 7 rows.
df = df.head(7).copy()
df['Segment'] = bins - pd.qcut(df['Id'], bins).cat.codes
df
Id ColA Segment
0 1 2 5
1 2 2 5
2 3 3 4
3 4 5 3
4 5 10 2
5 6 12 1
6 7 18 1
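A self-contained sketch of the qcut approach, wrapped in a hypothetical helper add_segments (the function name and its parameters are not part of the original answer) and run on both the 10-row and the 7-row case:

```python
import pandas as pd

def add_segments(df, col='Id', bins=5):
    """Return a copy of df with a 'Segment' column counting down from bins to 1."""
    out = df.copy()
    # qcut assigns each row to a quantile bin; .cat.codes maps the bins to 0..bins-1.
    codes = pd.qcut(out[col], bins).cat.codes
    out['Segment'] = bins - codes
    return out

df10 = pd.DataFrame({
    'Id': range(1, 11),
    'ColA': [2, 2, 3, 5, 10, 12, 18, 20, 25, 26],
})

print(add_segments(df10)['Segment'].tolist())
print(add_segments(df10.head(7))['Segment'].tolist())
```

Returning a copy leaves the caller's DataFrame untouched, which keeps the helper safe to call repeatedly with different bin counts.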
This should work:
df['segment'] = np.linspace(1, 6, len(df), False, dtype=int)
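It creates an array of integers from 1 to 5 with the same length as your DataFrame (the fourth argument, False, is endpoint, so 6 itself is excluded). If you want the labels to run from 5 to 1, just add [::-1] at the end of the line.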
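A minimal sketch of the linspace approach on a 10-row frame, with endpoint spelled out as a keyword. The names ascending and segment_desc are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': range(1, 11)})

# endpoint=False spreads len(df) points over [1, 6); casting to int
# reduces them to the labels 1..5.
ascending = np.linspace(1, 6, len(df), endpoint=False, dtype=int)

df['segment'] = ascending             # 1 ... 5
df['segment_desc'] = ascending[::-1]  # 5 ... 1

print(df)
```

When len(df) is not a multiple of 5, this may not assign rows to exactly the same segments as the qcut approach, since the two methods place the segment boundaries differently.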

Pandas: how to add row values by index value

I'm having trouble working out how to add the index value of a pandas dataframe to each value at that index. For example, if I have a dataframe of zeroes, the row with index 1 should have a value of 1 for all columns. The row at index 2 should have values of 2 for each column, and so on.
Can someone enlighten me please?
You can use pd.DataFrame.add with axis=0. Just remember, as below, to convert your index to a series first.
df = pd.DataFrame(np.random.randint(0, 10, (5, 5)))
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 9 6 1 8 0
2 2 9 0 5 3
3 3 1 1 7 0
4 2 6 3 6 6
df = df.add(df.index.to_series(), axis=0)
print(df)
0 1 2 3 4
0 3 4 2 2 2
1 10 7 2 9 1
2 4 11 2 7 5
3 6 4 4 10 3
4 6 10 7 10 10
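The case described in the question, a frame of zeroes, makes the effect easy to see. A minimal sketch (the 4x3 shape and the name result are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

# A 4x3 frame of zeroes, as in the question's example.
df = pd.DataFrame(np.zeros((4, 3), dtype=int))

# axis=0 aligns the series with the row index, so every value in a row
# has that row's index label added to it.
result = df.add(df.index.to_series(), axis=0)

print(result)
```

Each row of result then contains its own index value in every column.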

Pandas series Max value for column based on index

I am trying to extract the max value for a column based on the index. I have this series:
Hour Values
1 0
1 3
1 1
2 0
2 5
2 4
...
23 3
23 4
23 2
24 1
24 9
24 2
and am looking to add a new column 'Max Value' that holds, for each row, the maximum of the 'Values' column among all rows with the same Hour:
Hour Values Max Value
1 0 3
1 3 3
1 1 3
2 0 5
2 5 5
2 4 5
...
23 3 4
23 4 4
23 2 4
24 1 9
24 9 9
24 2 9
I can do this in Excel, but I am new to pandas. The closest I have come is this rough attempt, which gives a syntax error on the first '=':
df['Max Value'] = 0
df['Max Value'][(df['Hour'] =1)] = df['Value'].max()
Use the transform('max') method:
In [61]: df['Max Value'] = df.groupby('Hour')['Values'].transform('max')
In [62]: df
Out[62]:
Hour Values Max Value
0 1 0 3
1 1 3 3
2 1 1 3
3 2 0 5
4 2 5 5
5 2 4 5
6 23 3 4
7 23 4 4
8 23 2 4
9 24 1 9
10 24 9 9
11 24 2 9
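A self-contained version of the answer, using the rows visible in the question (the elided hours are left out). The final check, which compares one hour against a boolean mask written with == rather than =, is an illustrative addition and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Hour':   [1, 1, 1, 2, 2, 2, 23, 23, 23, 24, 24, 24],
    'Values': [0, 3, 1, 0, 5, 4, 3, 4, 2, 1, 9, 2],
})

# transform('max') returns one value per row: the maximum of that row's Hour group.
df['Max Value'] = df.groupby('Hour')['Values'].transform('max')

# Cross-check a single hour with a boolean mask; comparison uses ==, not =.
hour_1_max = df.loc[df['Hour'] == 1, 'Values'].max()

print(df)
print(hour_1_max)
```

The mask-based line handles one hour at a time, whereas transform fills the column for every hour in a single call.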
