Sum within column based on values from another column - python

I have a pandas DataFrame like the one below, with the columns value_to_sum and indicator. For each row where indicator == True, I'd like to sum all values in value_to_sum from the previous row backwards, up to and including the most recent row where indicator == True. If indicator == False, I do not want to sum.
row  value_to_sum  indicator  desired_outcome
1    1             True       NaN
2    3             True       1
3    1             False      NaN
4    2             False      NaN
5    4             False      NaN
6    6             True       10
7    2             True       6
8    3             False      NaN
How can I achieve the values under desired_outcome?

You can build a group label with .cumsum() over the True values of the indicator column (each True starts a new group) and then use .groupby() together with .transform('sum') to get the sum of value_to_sum within each group.
Then, since for indicator == True the desired outcome is the sum up to and including the previous row, take the group sum from the previous row with .shift(). At the same time, for indicator == False, set desired_outcome to NaN. These last two steps are done together in a single call to np.where().
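For a self-contained run, the example frame can be rebuilt first; this is a minimal sketch based on the table in the question, and the two imports are also assumed by the solution code below.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'row': [1, 2, 3, 4, 5, 6, 7, 8],
    'value_to_sum': [1, 3, 1, 2, 4, 6, 2, 3],
    'indicator': [True, True, False, False, False, True, True, False],
})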
df['desired_outcome'] = df.assign(group=df['indicator'].cumsum()).groupby('group')['value_to_sum'].transform('sum')
df['desired_outcome'] = np.where(df['indicator'], df['desired_outcome'].shift(), np.nan)
Result:
print(df)
row value_to_sum indicator desired_outcome
0 1 1 True NaN
1 2 3 True 1.0
2 3 1 False NaN
3 4 2 False NaN
4 5 4 False NaN
5 6 6 True 10.0
6 7 2 True 6.0
7 8 3 False NaN

Related

Ignore nan elements in a list using loc pandas

I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save into tdf['b'] the value from df2 for each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2['a'].iloc[2]. I expect the final result to look like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this using the .loc indexer like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, the indices get messed up (for example, tdf['b'] at index 0 should be NaN, but it gets a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
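If you want this stored as the column b from the question, you can assign the mapped result back; a small usage sketch, assuming tdf is the frame built in the question:
tdf['b'] = s.map(df2['a'])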

Filter df based on multiple conditions of values of columns in the df?

I have df below as:
id status id_reference ids_related
1 True NaN 4
4 False 1 NaN
2 False NaN NaN
7 False 3 11,12
6 True 2 NaN
10 True 4 NaN
22 True 1 NaN
11 True 7 NaN
12 True 7 NaN
I want to filter df to only the rows where status is False, id_reference exists in the id column of a row whose status is True, and ids_related is NaN,
so expected output would be
id status id_reference ids_related
4 False 1 NaN
I have code like
(df.loc[df["status"]&df["id_reference"].astype(float).isin(df.loc[~df["status"], "id"])])
This gives me rows where status is True and id_reference exists in the id column of rows where status is False, but I want to tweak it so that it also checks that ids_related is NaN for the rows being filtered.
Thanks!
Step by step:
g = df[~df.status]          # or: g = df[~df.status.astype(bool)]
g[g.ids_related.isna() & g.id_reference.eq('1')]
Or as a chained solution:
df[(~df.status) & df.ids_related.isna() & df.id_reference.eq('1')]
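The .eq('1') above hard-codes the single matching id_reference. A more general sketch, closer to the wording of the question and to the isin idea from the original attempt (it assumes id_reference can be cast to float for the comparison, as that attempt already did):
true_ids = df.loc[df["status"], "id"]
df[~df["status"] & df["ids_related"].isna() & df["id_reference"].astype(float).isin(true_ids)]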

Python Pandas creating column on condition with dynamic amount of columns

I create a new dataframe based on a user parameter, say a = 2, so my dataframe df shrinks to 4 (a×2) columns in df_new. For example:
df_new = pd.DataFrame(data = {'col_01_01': [float('nan'),float('nan'),1,2,float('nan')], 'col_02_01': [float('nan'),float('nan'),1,2,float('nan')],'col_01_02': [0,0,0,0,1],'col_02_02': [1,0,0,1,1],'output':[1,0,1,1,1]})
To be more precise about the output column, look at the first row, (nan, nan, 0, 1): apply notna() to the first two entries and the comparison == 1 to the third and fourth. This gives (False, False, False, True); combining these with OR yields the desired result True -> 1.
In the second row we find (nan, nan, 0, 0), so the output is 0, since there is no valid value in the first two columns and no 1 in the last two.
For a parameter a = 3 there would be 6 columns.
The result looks like this:
col_01_01 col_02_01 col_01_02 col_02_02 output
0 NaN NaN 0 1 1
1 NaN NaN 0 0 0
2 1.0 1.0 0 0 1
3 2.0 2.0 0 1 1
4 NaN NaN 1 1 1
You can use vectorised operations with notnull and eq:
null_cols = ['col_01_01', 'col_02_01']
int_cols = ['col_01_02', 'col_02_02']
df['output'] = (df[null_cols].notnull().any(axis=1) | df[int_cols].eq(1).any(axis=1)).astype(int)
print(df)
col_01_01 col_02_01 col_01_02 col_02_02 output
0 NaN NaN 0 1 1
1 NaN NaN 0 0 0
2 1.0 1.0 0 0 1
3 2.0 2.0 0 1 1
4 NaN NaN 1 1 1
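Because the number of columns depends on the user parameter a, the two column lists do not have to be typed out by hand. A sketch that builds them from a, assuming the columns follow the col_XX_01 / col_XX_02 naming pattern from the example:
a = 2
null_cols = [f'col_{i:02d}_01' for i in range(1, a + 1)]   # checked with notnull()
int_cols = [f'col_{i:02d}_02' for i in range(1, a + 1)]    # compared against 1
df_new['output'] = (df_new[null_cols].notnull().any(axis=1) | df_new[int_cols].eq(1).any(axis=1)).astype(int)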

Calculate Mean by Groupby, drop some rows with Boolean conditions and then save the file in original format

I have data like this.
I calculate the mean for each ID:
df.groupby(['ID'], as_index=False)['A'].mean()
Now I want to drop all those IDs whose mean value is more than 3:
df.drop(df[df.A > 3].index)
And this is where I am stuck. I want to save the file in the original format (without grouping and without the mean column), but without those IDs whose means were more than 3.
Any idea how I can achieve this? The output should look something like this. I also want to know how many unique IDs were removed by the drop.
Use transform to get a Series with the same size as the original DataFrame, which makes it possible to filter with boolean indexing; the condition is flipped from > 3 (rows to drop) to <= 3 (rows to keep):
df1 = df[df.groupby('ID')['A'].transform('mean') <= 3]
print (df1)
ID A
0 1 2
1 1 3
2 1 1
6 3 6
7 3 1
8 3 1
9 3 1
Details:
print (df.groupby('ID')['A'].transform('mean'))
0 2.000000
1 2.000000
2 2.000000
3 6.666667
4 6.666667
5 6.666667
6 2.250000
7 2.250000
8 2.250000
9 2.250000
Name: A, dtype: float64
print (df.groupby('ID')['A'].transform('mean') <= 3)
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 True
9 True
Name: A, dtype: bool
Another solution uses groupby and filter. This solution is slower than using transform with boolean indexing, and the condition should be <= 3 to match the requirement of dropping only means greater than 3:
df.groupby('ID').filter(lambda x: x['A'].mean() <= 3)
Output:
ID A
0 1 2
1 1 3
2 1 1
6 3 6
7 3 1
8 3 1
9 3 1
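The question also asks how many unique IDs were removed. A short sketch for that, reusing the same transform-based mask and the threshold of 3 from the question:
removed_mask = df.groupby('ID')['A'].transform('mean') > 3
print(df.loc[removed_mask, 'ID'].nunique())   # count of unique IDs that were dropped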

Replacing values in a 2nd level column on MultiIndex df in Pandas

I was looking into this post, which almost solved my problem. However, in my case I want to work on the 2nd level of the df, while trying not to specify my 1st level column names explicitly.
Borrowing the original dataframe:
df = pd.DataFrame({('A','a'): [-1,-1,0,10,12],
('A','b'): [0,1,2,3,-1],
('B','a'): [-20,-10,0,10,20],
('B','b'): [-200,-100,0,100,200]})
##df
A B
a b a b
0 -1 0 -20 -200
1 -1 1 -10 -100
2 0 2 0 0
3 10 3 10 100
4 12 -1 20 200
I want to assign NA to both columns a and b wherever b < 0. I was selecting those rows with df.xs('b', axis=1, level=1) < 0, but then I cannot actually perform the replacement. Also, my 1st level names vary, so the indexing cannot rely on A and B explicitly; maybe it could go through df.columns.values?
The desired output would be
##df
A B
a b a b
0 -1 0 NA NA
1 -1 1 NA NA
2 0 2 0 0
3 10 3 10 100
4 NA NA 20 200
I appreciate all tips, thank you in advance.
You can use DataFrame.mask with the mask reindexed (reindex with level=0) so it has the same column layout as the original DataFrame:
mask = df.xs('b',axis=1,level=1) < 0
print (mask)
A B
0 False True
1 False True
2 False False
3 False False
4 True False
print (mask.reindex(columns = df.columns, level=0))
A B
a b a b
0 False False True True
1 False False True True
2 False False False False
3 False False False False
4 True True False False
df = df.mask(mask.reindex(columns = df.columns, level=0))
print (df)
A B
a b a b
0 -1.0 0.0 NaN NaN
1 -1.0 1.0 NaN NaN
2 0.0 2.0 0.0 0.0
3 10.0 3.0 10.0 100.0
4 NaN NaN 20.0 200.0
Edit by OP: I had asked in the comments how to combine multiple conditions (e.g. df.xs('b', axis=1, level=1) < 0 OR df.xs('b', axis=1, level=1) being NA). @Jezrael kindly indicated that for this I should use (note the parentheses around the comparison, since | binds more tightly than <):
mask = (df.xs('b', axis=1, level=1) < 0) | df.xs('b', axis=1, level=1).isnull()
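The combined mask is then applied exactly as in the answer above; repeating that step as a usage sketch:
df = df.mask(mask.reindex(columns=df.columns, level=0))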
