Sum within column based on values from another column - python

I have a pandas DataFrame like the one below, with the columns value_to_sum and indicator. For each row where indicator == True, I'd like to sum all values in value_to_sum from the previous row backwards, up to and including the most recent row where indicator == True. If indicator == False, I do not want to sum.
row  value_to_sum  indicator  desired_outcome
1    1             True       NaN
2    3             True       1
3    1             False      NaN
4    2             False      NaN
5    4             False      NaN
6    6             True       10
7    2             True       6
8    3             False      NaN
How can I achieve the values under desired_outcome?

You can build a group label with .cumsum() over the True values of the indicator column (each True starts a new group) and then use .groupby() together with .transform('sum') to get the sum of value_to_sum within each group.
Then, since for indicator == True the desired outcome is the sum up to and including the previous row, take the group sum from the previous row with .shift(). At the same time, for indicator == False, set desired_outcome to NaN. These last two steps are done together in a single call to np.where().
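For a self-contained run, the example frame can be rebuilt first; this is a minimal sketch based on the table in the question, and the two imports are also assumed by the solution code below.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'row': [1, 2, 3, 4, 5, 6, 7, 8],
    'value_to_sum': [1, 3, 1, 2, 4, 6, 2, 3],
    'indicator': [True, True, False, False, False, True, True, False],
})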
df['desired_outcome'] = df.assign(group=df['indicator'].cumsum()).groupby('group')['value_to_sum'].transform('sum')
df['desired_outcome'] = np.where(df['indicator'], df['desired_outcome'].shift(), np.nan)
Result:
print(df)
row value_to_sum indicator desired_outcome
0 1 1 True NaN
1 2 3 True 1.0
2 3 1 False NaN
3 4 2 False NaN
4 5 4 False NaN
5 6 6 True 10.0
6 7 2 True 6.0
7 8 3 False NaN

Related

Ignore nan elements in a list using loc pandas

I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save into tdf['b'] the value from df2 for each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2['a'].iloc[2]. I expect the final result to look like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this using the .loc indexer like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, the indices get messed up (for example, tdf['b'] at index 0 should be NaN, but it gets a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
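If you want this stored as the column b from the question, you can assign the mapped result back; a small usage sketch, assuming tdf is the frame built in the question:
tdf['b'] = s.map(df2['a'])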

Filter df based on multiple conditions of values of columns in the df?

I have df below as:
id status id_reference ids_related
1 True NaN 4
4 False 1 NaN
2 False NaN NaN
7 False 3 11,12
6 True 2 NaN
10 True 4 NaN
22 True 1 NaN
11 True 7 NaN
12 True 7 NaN
I want to filter df to only the rows where status is False, id_reference exists in the id column of a row whose status is True, and ids_related is NaN,
so expected output would be
id status id_reference ids_related
4 False 1 NaN
I have code like
(df.loc[df["status"]&df["id_reference"].astype(float).isin(df.loc[~df["status"], "id"])])
This gives me rows where status is True and id_reference exists in the id column of rows where status is False, but I want to tweak it so that it also checks that ids_related is NaN for the rows being filtered.
Thanks!
Step by step:
g = df[~df.status]          # or: g = df[~df.status.astype(bool)]
g[g.ids_related.isna() & g.id_reference.eq('1')]
Or as a chained solution:
df[(~df.status) & df.ids_related.isna() & df.id_reference.eq('1')]
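The .eq('1') above hard-codes the single matching id_reference. A more general sketch, closer to the wording of the question and to the isin idea from the original attempt (it assumes id_reference can be cast to float for the comparison, as that attempt already did):
true_ids = df.loc[df["status"], "id"]
df[~df["status"] & df["ids_related"].isna() & df["id_reference"].astype(float).isin(true_ids)]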

Python Pandas creating column on condition with dynamic amount of columns

I create a new dataframe based on a user parameter, say a = 2, so my dataframe df shrinks to 4 (a×2) columns in df_new. For example:
df_new = pd.DataFrame(data = {'col_01_01': [float('nan'),float('nan'),1,2,float('nan')], 'col_02_01': [float('nan'),float('nan'),1,2,float('nan')],'col_01_02': [0,0,0,0,1],'col_02_02': [1,0,0,1,1],'output':[1,0,1,1,1]})
To be more precise about the output column, look at the first row, (nan, nan, 0, 1): apply notna() to the first two entries and the comparison == 1 to the third and fourth. This gives (False, False, False, True); combining these with OR yields the desired result True -> 1.
In the second row we find (nan, nan, 0, 0), so the output is 0, since there is no valid value in the first two columns and no 1 in the last two.
For a parameter a = 3 there would be 6 columns.
The result looks like this:
col_01_01 col_02_01 col_01_02 col_02_02 output
0 NaN NaN 0 1 1
1 NaN NaN 0 0 0
2 1.0 1.0 0 0 1
3 2.0 2.0 0 1 1
4 NaN NaN 1 1 1
You can use vectorised operations with notnull and eq:
null_cols = ['col_01_01', 'col_02_01']
int_cols = ['col_01_02', 'col_02_02']
df['output'] = (df[null_cols].notnull().any(axis=1) | df[int_cols].eq(1).any(axis=1)).astype(int)
print(df)
col_01_01 col_02_01 col_01_02 col_02_02 output
0 NaN NaN 0 1 1
1 NaN NaN 0 0 0
2 1.0 1.0 0 0 1
3 2.0 2.0 0 1 1
4 NaN NaN 1 1 1
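Because the number of columns depends on the user parameter a, the two column lists do not have to be typed out by hand. A sketch that builds them from a, assuming the columns follow the col_XX_01 / col_XX_02 naming pattern from the example:
a = 2
null_cols = [f'col_{i:02d}_01' for i in range(1, a + 1)]   # checked with notnull()
int_cols = [f'col_{i:02d}_02' for i in range(1, a + 1)]    # compared against 1
df_new['output'] = (df_new[null_cols].notnull().any(axis=1) | df_new[int_cols].eq(1).any(axis=1)).astype(int)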

Calculate Mean by Groupby, drop some rows with Boolean conditions and then save the file in original format

I have data like this.
I calculate the mean for each ID:
df.groupby(['ID'], as_index=False)['A'].mean()
Now I want to drop all those IDs whose mean value is more than 3:
df.drop(df[df.A > 3].index)
And this is where I am stuck. I want to save the file in the original format (without grouping and without the mean column), but without those IDs whose means were more than 3.
Any idea how I can achieve this? The output should look something like this. I also want to know how many unique IDs were removed by the drop.
Use transform to get a Series with the same size as the original DataFrame, which makes it possible to filter with boolean indexing; the condition is flipped from > 3 (rows to drop) to <= 3 (rows to keep):
df1 = df[df.groupby('ID')['A'].transform('mean') <= 3]
print (df1)
ID A
0 1 2
1 1 3
2 1 1
6 3 6
7 3 1
8 3 1
9 3 1
Details:
print (df.groupby('ID')['A'].transform('mean'))
0 2.000000
1 2.000000
2 2.000000
3 6.666667
4 6.666667
5 6.666667
6 2.250000
7 2.250000
8 2.250000
9 2.250000
Name: A, dtype: float64
print (df.groupby('ID')['A'].transform('mean') <= 3)
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 True
9 True
Name: A, dtype: bool
Another solution uses groupby and filter. This solution is slower than using transform with boolean indexing, and the condition should be <= 3 to match the requirement of dropping only means greater than 3:
df.groupby('ID').filter(lambda x: x['A'].mean() <= 3)
Output:
ID A
0 1 2
1 1 3
2 1 1
6 3 6
7 3 1
8 3 1
9 3 1
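The question also asks how many unique IDs were removed. A short sketch for that, reusing the same transform-based mask and the threshold of 3 from the question:
removed_mask = df.groupby('ID')['A'].transform('mean') > 3
print(df.loc[removed_mask, 'ID'].nunique())   # count of unique IDs that were dropped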

Replacing values in a 2nd level column on MultiIndex df in Pandas

I was looking into this post, which almost solved my problem. However, in my case I want to work on the 2nd level of the df, while trying not to specify my 1st level column names explicitly.
Borrowing the original dataframe:
df = pd.DataFrame({('A','a'): [-1,-1,0,10,12],
('A','b'): [0,1,2,3,-1],
('B','a'): [-20,-10,0,10,20],
('B','b'): [-200,-100,0,100,200]})
##df
A B
a b a b
0 -1 0 -20 -200
1 -1 1 -10 -100
2 0 2 0 0
3 10 3 10 100
4 12 -1 20 200
I want to assign NA to both columns a and b wherever b < 0. I was selecting those rows with df.xs('b', axis=1, level=1) < 0, but then I cannot actually perform the replacement. Also, my 1st level names vary, so the indexing cannot rely on A and B explicitly; maybe it could go through df.columns.values?
The desired output would be
##df
A B
a b a b
0 -1 0 NA NA
1 -1 1 NA NA
2 0 2 0 0
3 10 3 10 100
4 NA NA 20 200
I appreciate all tips, thank you in advance.
You can use DataFrame.mask with the mask reindexed (reindex with level=0) so it has the same column layout as the original DataFrame:
mask = df.xs('b',axis=1,level=1) < 0
print (mask)
A B
0 False True
1 False True
2 False False
3 False False
4 True False
print (mask.reindex(columns = df.columns, level=0))
A B
a b a b
0 False False True True
1 False False True True
2 False False False False
3 False False False False
4 True True False False
df = df.mask(mask.reindex(columns = df.columns, level=0))
print (df)
A B
a b a b
0 -1.0 0.0 NaN NaN
1 -1.0 1.0 NaN NaN
2 0.0 2.0 0.0 0.0
3 10.0 3.0 10.0 100.0
4 NaN NaN 20.0 200.0
Edit by OP: I had asked in the comments how to combine multiple conditions (e.g. df.xs('b', axis=1, level=1) < 0 OR df.xs('b', axis=1, level=1) being NA). @Jezrael kindly indicated that for this I should use (note the parentheses around the comparison, since | binds more tightly than <):
mask = (df.xs('b', axis=1, level=1) < 0) | df.xs('b', axis=1, level=1).isnull()
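The combined mask is then applied exactly as in the answer above; repeating that step as a usage sketch:
df = df.mask(mask.reindex(columns=df.columns, level=0))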
