Getting max values based on sliced column - python

Let's consider this Dataframe:
$> df
   a   b
0  6  50
1  2  20
2  9  60
3  4  40
4  5  20
I want to compute column D based on the max value between:
integer 0
a slice of column B from that row's index to the end
So I have created a column C (all zeroes) in my DataFrame in order to use DataFrame.max(axis=1). However, short of using apply or looping over the DataFrame, I don't know how to slice the input values. The expected result would be:
$> df
   a   b  c   d
0  6  50  0  60
1  2  20  0  60
2  9  60  0  60
3  4  40  0  40
4  5  20  0  20
So essentially, d's 3rd row is computed (pseudo-code) as max(df[3:,"b"], df[3:,"c"]), and similarly for each row.
Since the input columns (b, c) have already been computed, there has to be a way to slice the input as I calculate each row for D without having to loop, as this is slow.

Seems like this could work: reverse "b", take the cumulative max, then reverse it back and assign it to "d". Then use where on "d" to replace any value below 0 with 0:
df['d'] = df['b'][::-1].cummax()[::-1]
df['d'] = df['d'].where(df['d']>0, 0)
We can replace the last line with the one below using clip (thanks @Either), and drop the second reversal (assignment aligns on the index), making it all a one-liner:
df['d'] = df['b'][::-1].cummax().clip(lower=0)
Output:
   a   b   d
0  6  50  60
1  2  20  60
2  9  60  60
3  4  40  40
4  5  20  20
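For reference, a self-contained version of the one-liner (a sketch using the question's sample data; the assignment back to df works because pandas aligns on the index):

import pandas as pd

df = pd.DataFrame({'a': [6, 2, 9, 4, 5], 'b': [50, 20, 60, 40, 20]})

# Reverse "b" and take the running max from the bottom up; index alignment
# restores the original row order, and clip floors any negatives at 0.
df['d'] = df['b'][::-1].cummax().clip(lower=0)
print(df)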

Related

Remove duplicates using column value with some ignore condition

I have two columns in my Excel file and I want to remove duplicates from the 'A' column with an ignore condition. The columns are as follows:
A   B
1  10
1  20
2  30
2  40
3  10
3  20
Now, I want it to turn into this:
A   B
1  10
2  30
2  40
3  10
So, basically I want to remove all duplicates except when column 'A' has the value 2 (I want to ignore 2). My current code is as follows, but it does not work for me, as it removes the duplicates with value '2' too.
df = pd.read_excel(save_filename)
df2 = df.drop_duplicates(subset=["A", "B"], keep='first')
df2.to_excel(save_filename, index=False)
You can use two conditions:
df[~df.duplicated(subset="A") | df["A"].eq(2)]
   A   B
0  1  10
2  2  30
3  2  40
4  3  10
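Applied to the Excel round-trip from the question (a sketch; save_filename is assumed to be defined as in the original code):

import pandas as pd

df = pd.read_excel(save_filename)
# Keep the first occurrence per value of "A", but always keep rows where A == 2
df2 = df[~df.duplicated(subset="A") | df["A"].eq(2)]
df2.to_excel(save_filename, index=False)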

Adding concatenation of unique values as groupby output in pandas

Imagine you have a dataframe df as follows:
Id  Side  Volume
2   a         40
2   b         30
1   a         20
2   b         10
You want the following output
Id   Side  sum
1    a      20
1    all    20
2    a      40
2    b      40
2    all    80
all  a      60
all  b      40
all  all   100
This would be df.groupby(['Id', 'Side']).Volume.sum().reset_index() plus the totals for all Sides and all Ids (df.Volume.sum(), df[df.Side == 'a'].Volume.sum(), df[df.Side == 'b'].Volume.sum(), etc.).
Is there a way to do this without calculating it outside and then merging both results?
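No answer is included in this excerpt, but one possible approach (a sketch, not from the thread) is pivot_table with margins=True, which computes every "all" subtotal in a single call:

import pandas as pd

df = pd.DataFrame({'Id': [2, 2, 1, 2],
                   'Side': ['a', 'b', 'a', 'b'],
                   'Volume': [40, 30, 20, 10]})

# margins=True appends the "all" row and column subtotals
wide = df.pivot_table(index='Id', columns='Side', values='Volume',
                      aggfunc='sum', margins=True, margins_name='all')

# Back to long form; dropna discards (Id, Side) pairs that never occur
out = wide.stack().dropna().reset_index(name='sum')
print(out)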

Check for value in one column, take the adjacent value and apply function

I have the following:
df1=pd.DataFrame([[1,10],[2,15],[3,16]], columns=["a","b"])
which results in:
   a   b
0  1  10
1  2  15
2  3  16
I want to create a third column "c" where the value in each row is the product of the value in column "b" from the same row and a number that depends on the value in column "a". So, for example:
if value in "a" is 1 multiply 10 x 2,
if value in "a" is 2 multiply 15 x 5,
if value in "a" is 3 multiply 16 x 10.
In effect I want to achieve this:
   a   b    c
0  1  10   20
1  2  15   75
2  3  16  160
I have tried something with if and elif, but I can't get to the right solution.
The dataframe is lengthy and the numbers 1, 2, 3 in column "a" appear in random order.
Thanks in advance.
Are you looking for something like this? I have extended your DataFrame; please check if it helps:
df1 = pd.DataFrame([[1,10],[2,15],[3,16],[3,11],[2,12],[1,16]], columns=["a","b"])
# Map each value of "a" to its multiplier, then multiply by "b"
dict_prod = {1: 2, 2: 5, 3: 10}
df1['c'] = df1['a'].map(dict_prod) * df1['b']
   a   b    c
0  1  10   20
1  2  15   75
2  3  16  160
3  3  11  110
4  2  12   60
5  1  16   32
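One caveat not raised in the thread: map returns NaN for values of "a" that are missing from dict_prod. If that can happen, a fallback is one option (the multiplier 1 here is a hypothetical default):

# Hypothetical fallback: unmapped values of "a" leave "b" unchanged
df1['c'] = df1['a'].map(dict_prod).fillna(1) * df1['b']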
You should be able to just do
df1['c'] = df1['a'] * your_number * df1['b']
or
df1['c'] = some_function(df1['a']) * df1['b']

Compute number of occurrences of each value and sum another column in Pandas

I have a pandas dataframe with some columns in it. The column I am interested in is something like this,
df['col'] = ['A', 'A', 'B', 'C', 'B', 'A']
I want to make another column, say col_count, such that it shows the count of the value in col from that index to the end of the column.
The first A in the column should have the value 3 because there are 3 occurrences of A from that index onward. The second A will have the value 2, and so on.
Finally, I want to get the following result,
  col  col_count
0   A          3
1   A          2
2   B          2
3   C          1
4   B          1
5   A          1
How can I do this efficiently in pandas? I was able to do it by looping through the dataframe and taking a count of that value over the sliced dataframe.
Is there a more efficient method, preferably without loops?
Another part of the question is, I have another column like this along with col,
df['X'] = [10, 40, 10, 50, 30, 20]
I want to sum up this column in the same fashion as I counted col.
For instance, at index 0 the sum will be 10 + 40 + 20; at index 1, it will be 40 + 20. In short, instead of counting, I want to sum up another column.
The result will be like this,
  col  col_count   X  X_sum
0   A          3  10     70
1   A          2  40     60
2   B          2  10     40
3   C          1  50     50
4   B          1  30     30
5   A          1  20     20
Use groupby on the reversed frame with cumcount and cumsum.
# Reverse the frame so the cumulative operations run from the end;
# index alignment writes the results back in the original row order.
g = df[::-1].groupby('col')
df['col_count'] = g.cumcount().add(1)
df['X_sum'] = g['X'].cumsum()
print(df)
Output:
  col   X  col_count  X_sum
0   A  10          3     70
1   A  40          2     60
2   B  10          2     40
3   C  50          1     50
4   B  30          1     30
5   A  20          1     20
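As an aside (not part of the original answer): for the count alone the reversal can be avoided, because cumcount accepts ascending=False; cumsum has no such option, so the reversed frame is still needed for X_sum.

# Count remaining occurrences without reversing the frame
df['col_count'] = df.groupby('col').cumcount(ascending=False) + 1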

Pandas: group by two columns, sum up the first value in the first column group

In Python, I have a pandas data frame df.
ID  Ref  Dist
A   0      10
A   0      10
A   1      20
A   1      20
A   2      30
A   2      30
A   3       5
A   3       5
B   0       8
B   0       8
B   1      40
B   1      40
B   2       7
B   2       7
I want to group by ID and Ref, and take the first row of the Dist column in each group.
ID  Ref  Dist
A   0      10
A   1      20
A   2      30
A   3       5
B   0       8
B   1      40
B   2       7
And I want to sum up the Dist column in each ID group.
ID  Sum
A    65
B    55
I tried this for the first step, but it gives me just the row index and Dist, so I cannot move on to the second step.
df.groupby(['ID', 'Ref'])['Dist'].head(1)
It'd be wonderful if somebody could help me with this.
Thank you!
I believe this is what you're looking for.
For the first step, use first(), since you want the first row in each group. Once you've done that, use reset_index() so you can group by ID afterwards and sum it up.
df.groupby(['ID','Ref'])['Dist'].first()\
.reset_index().groupby(['ID'])['Dist'].sum()
ID
A    65
B    55
Just drop_duplicates before the groupby. The default behavior is to keep the first duplicate row, which is what you want.
df.drop_duplicates(['ID', 'Ref']).groupby('ID').Dist.sum()
#A 65
#B 55
#Name: Dist, dtype: int64
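To reproduce the exact "ID / Sum" layout from the question (a minor follow-up, not part of the original answers):

out = (df.drop_duplicates(['ID', 'Ref'])
         .groupby('ID').Dist.sum()
         .reset_index(name='Sum'))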
