Groupby in pandas between rows based on a condition - python

Let's say that I have the following dataframe:
name number
0 A 100
1 B 200
2 B 30
3 A 20
4 B 30
5 A 40
6 A 50
7 A 100
8 B 10
9 B 20
10 B 30
11 A 40
What I would like to do is to merge all successive rows where name == 'B' that fall between two rows with name == 'A', and get the corresponding sum. So I would like my final output to look like this:
name number
0 A 100
1 B 230
2 A 20
3 B 30
4 A 40
5 A 50
6 A 100
7 B 60
8 A 40

We can use a little groupby trick here. Build a cumulative count of the A rows, then offset each run of consecutive B's into its own group. This answer assumes that your name Series contains only A's and B's.
c = df['name'].eq('A')          # True on the A rows
m1 = c.cumsum()                 # running count of A's; a run of B's shares one value
m = m1.where(c, m1 + m1.max())  # offset the B rows so each B-run gets its own group id
df.groupby(m, sort=False, as_index=False).agg({'name': 'first', 'number': 'sum'})
name number
0 A 100
1 B 230
2 A 20
3 B 30
4 A 40
5 A 50
6 A 100
7 B 60
8 A 40
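To see why the offset works, here is a quick sketch (reproducing the sample frame) that prints the intermediate grouper; each A row keeps its own id while every run of consecutive B's shares one offset id:
import pandas as pd

df = pd.DataFrame({
    'name': list('ABBABAAABBBA'),
    'number': [100, 200, 30, 20, 30, 40, 50, 100, 10, 20, 30, 40],
})

c = df['name'].eq('A')
m1 = c.cumsum()
m = m1.where(c, m1 + m1.max())

# A rows keep 1..6; the three B-runs get the distinct ids 7, 8 and 11.
print(pd.concat([df['name'], m1, m], axis=1, keys=['name', 'm1', 'm']))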

A clumsier attempt, but since I've written it I might as well post it.
This is just a basic for loop with a while:
for i in df.index:
    # Guard: rows may have been dropped since the loop started.
    if i in df.index and df.loc[i, 'name'] == 'B':
        # Also guard i + 1 so the last row doesn't raise a KeyError.
        while i + 1 in df.index and df.loc[i + 1, 'name'] == 'B':
            df.loc[i, 'number'] += df.loc[i + 1, 'number']
            df = df.drop(i + 1).reset_index(drop=True)
It's very straightforward (and hence inefficient, I imagine): if a row is B and the next row is also B, add the next row's number to this row and delete the next row.

Related

Getting max values based on sliced column

Let's consider this Dataframe:
$> df
a b
0 6 50
1 2 20
2 9 60
3 4 40
4 5 20
I want to compute column d as the max value between:
- the integer 0
- a slice of column b from that row's index to the end
So I have created a column c (all zeroes) in my dataframe in order to use DataFrame.max(axis=1). However, short of using apply or looping over the DataFrame, I don't know how to slice the input values. The expected result would be:
$> df
a b c d
0 6 50 0 60
1 2 20 0 60
2 9 60 0 60
3 4 40 0 40
4 5 20 0 20
So essentially, d's 3rd row is computed (pseudo-code) as max(df[3:,"b"], df[3:,"c"]), and similarly for each row.
Since the input columns (b, c) have already been computed, there has to be a way to slice the input as I calculate each row of d without having to loop, as this is slow.
Seems like this could work: reverse b, take the cummax, then reverse it back and assign it to d. Then use where on d to replace any negative values with 0:
df['d'] = df['b'][::-1].cummax()[::-1]
df['d'] = df['d'].where(df['d']>0, 0)
We can replace the last line with clip (thanks #Either) and drop the 2nd reversal (assignment aligns on the index), making it all a one-liner:
df['d'] = df['b'][::-1].cummax().clip(lower=0)
Output:
a b d
0 6 50 60
1 2 20 60
2 9 60 60
3 4 40 40
4 5 20 20
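The second reversal can be dropped because column assignment aligns on the index rather than on position; a minimal sketch with the same data:
import pandas as pd

df = pd.DataFrame({'a': [6, 2, 9, 4, 5], 'b': [50, 20, 60, 40, 20]})

# The reversed series carries the labels 4..0; assignment realigns it to 0..4.
df['d'] = df['b'][::-1].cummax().clip(lower=0)
print(df)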

How to multiply specific columns from a dataframe with one specific column in the same dataframe?

I have a dataframe where I need to create new columns by multiplying certain columns with one specific column.
Here is how my dataframe looks.
df:
Brand Price S_Value S_Factor
A 10 2 2
B 20 4 1
C 30 2 1
D 40 1 2
E 50 1 1
F 10 1 1
I would like to multiply the S_Value and S_Factor columns by Price to get new columns. I can do it manually, but I have a lot of columns that all start with a specific prefix which I need to multiply; here I used S_, which means I need to multiply all the columns that start with S_.
Here are the desired output columns (the new values are left blank):
Brand Price S_Value S_Factor S_Value_New S_Factor_New
A 10 2 2
B 20 4 1
C 30 2 1
D 40 1 2
E 50 1 1
F 10 1 1
Firstly, to get the columns which you have to multiply, you can use a list comprehension with the string function startswith. Then just loop over those columns and create the new columns by multiplying with Price:
multiply_cols = [col for col in df.columns if col.startswith('S_')]
for col in multiply_cols:
    df[col + '_New'] = df[col] * df['Price']
df
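With many prefixed columns, a vectorized alternative avoids the Python-level loop entirely. This is a sketch built on the question's frame; filter, mul and add_suffix are standard pandas methods:
import pandas as pd

df = pd.DataFrame({
    'Brand': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Price': [10, 20, 30, 40, 50, 10],
    'S_Value': [2, 4, 2, 1, 1, 1],
    'S_Factor': [2, 1, 1, 2, 1, 1],
})

# Select every S_-prefixed column, multiply each by Price row-wise,
# suffix the new names, and join them back onto the frame.
new_cols = df.filter(regex='^S_').mul(df['Price'], axis=0).add_suffix('_New')
df = df.join(new_cols)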
Since you did not add an example of the output, this might be what you are looking for:
dfr = pd.DataFrame({
    'Brand': ['A', 'B', 'C', 'D', 'E', 'F'],
    'price': [10, 20, 30, 40, 50, 10],
    'S_Value': [2, 4, 2, 1, 1, 1],
    'S_Factor': [2, 1, 1, 2, 1, 1]
})

pre_fixes = ['S_']
for prefix in pre_fixes:
    coltocal = [col for col in dfr.columns if col.startswith(prefix)]
    for col in coltocal:
        dfr.loc[:, col + '_new'] = dfr.price * dfr[col]
dfr
Brand price S_Value S_Factor S_Value_new S_Factor_new
0 A 10 2 2 20 20
1 B 20 4 1 80 20
2 C 30 2 1 60 30
3 D 40 1 2 40 80
4 E 50 1 1 50 50
5 F 10 1 1 10 10
Just add as many prefixes as you have to pre_fixes (use commas to separate them).

How to select pandas rows when a cell value is between two values and the next cell or cells are between the same values?

Given the following pandas dataframe:
Name speed
---------------
0 A 100
1 A 50
2 A 40
4 A 30
5 A 10
6 B 100
7 B 50
8 B 40
9 B 120
10 A 10
For each name, I want to select rows where 10 < speed < 100 and the neighboring rows' speeds are also between 10 and 100; when we find a row with speed <= 10 or speed >= 100 we don't select it and go on iterating over the remaining rows. The result should look like this:
Name speed
---------------
1 A 50
2 A 40
4 A 30
You can use between() and shift() for this:
low = 10
high = 100

# Strictly between low and high; on pandas < 1.3 pass inclusive=False
# instead of inclusive='neither'.
mask_next = (df.speed.between(low, high, inclusive='neither')
             & df.speed.shift(-1).between(low, high, inclusive='neither'))
mask_before = (df.speed.between(low, high, inclusive='neither')
               & df.speed.shift(1).between(low, high, inclusive='neither'))
df[mask_next | mask_before]
Output:
Name speed
1 A 50
2 A 40
4 A 30
7 B 50
8 B 40
If you need it for each name, you have to loop:
results = []
for name in df.Name.drop_duplicates().to_list():
    mask_name = df.Name == name
    results.append(df[mask_name & (mask_next | mask_before)])
pd.concat(results)
Output:
Name speed
1 A 50
2 A 40
4 A 30
7 B 50
8 B 40
The results are identical on your test data. However, the second approach checks whether ids 5 and 10 fulfill the condition, while the first approach without looping will always drop id 10.
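If crossing name boundaries is a concern, a group-aware variant (a sketch of an alternative, not taken from the answers above) performs the shifts per name, so a row is never compared against a neighbor belonging to a different name:
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A'],
    'speed': [100, 50, 40, 30, 10, 100, 50, 40, 120, 10],
})

low, high = 10, 100
in_range = df.speed.between(low, high, inclusive='neither')

# Shifting within each group keeps every comparison inside one name.
g = df.groupby('Name')['speed']
mask = in_range & (g.shift(-1).between(low, high, inclusive='neither')
                   | g.shift(1).between(low, high, inclusive='neither'))
df[mask]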

Compute the number of occurrences of each value and sum another column in Pandas

I have a pandas dataframe with some columns in it. The column I am interested in is something like this,
df['col'] = ['A', 'A', 'B', 'C', 'B', 'A']
I want to make another column, say col_count, that shows the count of the value in col from that index to the end of the column.
The first A in the column should have the value 3 because A occurs 3 times in the column from that index onward. The second A will have the value 2, and so on.
Finally, I want to get the following result,
col col_count
0 A 3
1 A 2
2 B 2
3 C 1
4 B 1
5 A 1
How can I do this efficiently in pandas? I was able to do it by looping through the dataframe and taking a unique count of that value for a sliced dataframe.
Is there a more efficient method, preferably without loops?
Another part of the question is, I have another column like this along with col,
df['X'] = [10, 40, 10, 50, 30, 20]
I want to sum up this column in the same fashion I wanted to count the column col.
For instance, at index 0 I will have 10 + 40 + 20 as the sum. At index 1, the sum will be 40 + 20. In short, instead of counting, I want to sum up another column.
The result will be like this,
col col_count X X_sum
0 A 3 10 70
1 A 2 40 60
2 B 2 10 40
3 C 1 50 50
4 B 1 30 30
5 A 1 20 20
Use groupby on the reversed frame with cumcount and cumsum; the results align back to df by index when assigned.
# Reverse the frame so the cumulative operations run from the bottom up.
g = df[::-1].groupby('col')
df['col_count'] = g.cumcount().add(1)  # remaining occurrences of each value
df['X_sum'] = g['X'].cumsum()          # remaining sum of X for each value
print(df)
Output:
col X col_count X_sum
0 A 10 3 70
1 A 40 2 60
2 B 10 2 40
3 C 50 1 50
4 B 30 1 30
5 A 20 1 20
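A variant that avoids reversing the frame for the count (a sketch: cumcount's ascending=False option counts the remaining occurrences directly, while the suffix sum still uses the reversal trick):
import pandas as pd

df = pd.DataFrame({'col': ['A', 'A', 'B', 'C', 'B', 'A'],
                   'X': [10, 40, 10, 50, 30, 20]})

# Remaining occurrences of each value, without reversing the frame.
df['col_count'] = df.groupby('col').cumcount(ascending=False) + 1

# The reversed cumsum aligns back to df by index on assignment.
df['X_sum'] = df[::-1].groupby('col')['X'].cumsum()
print(df)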

GroupBy one column, custom operation on another column of grouped records in pandas

I want to apply a custom operation on one column by grouping the values of another column: group by that column to get the count, then divide the first column's value by this count for all the grouped records.
My Data Frame:
emp opp amount
0 a 1 10
1 b 1 10
2 c 2 30
3 b 2 30
4 d 2 30
My scenario:
For opp=1, two emps worked (a, b), so the amount should be shared:
10/2 = 5
For opp=2, three emps worked (b, c, d), so the amount should be:
30/3 = 10
Final Output DataFrame:
emp opp amount
0 a 1 5
1 b 1 5
2 c 2 10
3 b 2 10
4 d 2 10
What is the best possible way to do so?
# Divide each amount by the size of its opp group.
df['amount'] = df.groupby('opp')['amount'].transform(lambda g: g/g.size)
df
# emp opp amount
# 0 a 1 5
# 1 b 1 5
# 2 c 2 10
# 3 b 2 10
# 4 d 2 10
Or:
df['amount'] = df.groupby('opp')['amount'].apply(lambda g: g/g.size)
does a similar thing.
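A slightly more idiomatic spelling (a sketch, same result) replaces the lambda with the built-in 'size' transform, keeping the division fully vectorized:
import pandas as pd

df = pd.DataFrame({
    'emp': ['a', 'b', 'c', 'b', 'd'],
    'opp': [1, 1, 2, 2, 2],
    'amount': [10, 10, 30, 30, 30],
})

# transform('size') broadcasts each opp group's row count to every row.
df['amount'] = df['amount'] / df.groupby('opp')['amount'].transform('size')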
You could try something like this:
df2 = df.groupby('opp').amount.count()
# .ix is removed in modern pandas; use .loc to look up the per-opp count.
df.loc[:, 'calculated'] = df.apply(lambda row: row.amount / df2.loc[row.opp], axis=1)
df
Yields:
emp opp amount calculated
0 a 1 10 5
1 b 1 10 5
2 c 2 30 10
3 b 2 30 10
4 d 2 30 10
