I have the dataset below. I want to calculate the aggregated sum of "notes" for each school, except for school "B", where I want the result to be zero or missing.
student school notes nbr_of_student_per_school
1 A 12 45
1 A 13 45
2 A 10 45
3 B 13 -
4 C 16 46
5 A 10 45
6 C 20 46
7 C 10 46
8 B 11 -
df.groupby(['school'])['notes'].sum()
Try this:
df.query('school != "B"').groupby('school')['notes'].sum()
So you are only selecting the subset of the dataframe where the school is not B
EDIT:
Another approach re: comments:
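import numpy as np
# calculate the per-school sum and broadcast it to every row (float so the column can hold NaN)
df['new_col'] = df.groupby('school')['notes'].transform('sum').astype(float)
# now set the sum for school B to np.nan
df.loc[df['school'] == 'B', 'new_col'] = np.nan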
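If you want one row per school rather than a new column on every row, here is a minimal sketch along the same lines. It assumes the data from the question (the frame is rebuilt here so the snippet runs on its own) and keeps school B in the result with a missing value instead of dropping it.

```python
import numpy as np
import pandas as pd

# Data from the question, without the per-school student count column.
df = pd.DataFrame({
    "student": [1, 1, 2, 3, 4, 5, 6, 7, 8],
    "school": ["A", "A", "A", "B", "C", "A", "C", "C", "B"],
    "notes": [12, 13, 10, 13, 16, 10, 20, 10, 11],
})

# Sum per school; cast to float so the result can hold NaN.
totals = df.groupby("school")["notes"].sum().astype(float)

# Blank out school "B" while keeping it in the index.
totals.loc["B"] = np.nan
print(totals)
```

Use `totals.loc["B"] = 0` instead if you would rather have zero than a missing value.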
I am trying to rank a large dataset using python. I do not want duplicates and rather than using the 'first' method, I would instead like it to look at another column and rank it based on that value.
It should only look at the second column if the rank in the first column has duplicates.
Name CountA CountB
Alpha 15 3
Beta 20 52
Delta 20 31
Gamma 45 43
I would like the ranking to end up
Name CountA CountB Rank
Alpha 15 3 4
Beta 20 52 2
Delta 20 31 3
Gamma 45 43 1
Currently, I am using df.rank(ascending=False, method='first')
Maybe use sort and pull out the index:
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C','D'],'CountA':[15,20,20,45],'CountB':[3,52,31,43]})
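# the sorted index lists rows in rank order; argsort inverts that into a rank per row (assumes the default 0..n-1 index)
df['rank'] = df.sort_values(['CountA','CountB'],ascending=False).index.argsort() + 1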
Name CountA CountB rank
0 A 15 3 4
1 B 20 52 2
2 C 20 31 3
3 D 45 43 1
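A variant of the same sort-based idea, as a sketch: the index of the sorted frame lists row labels in rank order, which is not in general the same thing as each row's rank. Building a Series of ranks keyed by those labels and letting pandas align it on assignment avoids depending on what the index looks like. The names are the ones from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alpha", "Beta", "Delta", "Gamma"],
    "CountA": [15, 20, 20, 45],
    "CountB": [3, 52, 31, 43],
})

# Row labels in ranked order: highest CountA first, ties broken by CountB.
order = df.sort_values(["CountA", "CountB"], ascending=False).index

# Give ranks 1..n to those labels; assignment aligns them back on the index.
df["Rank"] = pd.Series(np.arange(1, len(df) + 1), index=order)
print(df)
```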
You can take the counts of the values in CountA and then filter the DataFrame rows based on the count of CountA being greater than 1. Where the count is greater than 1, take CountB, otherwise CountA.
df = pd.DataFrame([[15,3],[20,52],[20,31],[45,43]],columns=['CountA','CountB'])
colAcount = df['CountA'].value_counts()
#then take the indices where colACount > 1 and use them in a where
df['final'] = df['CountA'].where(~df['CountA'].isin(colAcount[colAcount>1].index),df['CountB'])
df = df.sort_values(by='final', ascending=False).reset_index(drop=True)
# the rank is the index
CountA CountB final
0 20 52 52
1 45 43 45
2 20 31 31
3 15 3 15
I have the following:
df1=pd.DataFrame([[1,10],[2,15],[3,16]], columns=["a","b"])
which results in:
a b
0 1 10
1 2 15
2 3 16
I want to create a third column "c" where the value in each row is the value in column "b" from the same row multiplied by a number that depends on the value in column "a". So for example
if value in "a" is 1 multiply 10 x 2,
if value in "a" is 2 multiply 15 x 5,
if value in "a" is 3 multiply 16 x 10.
In effect I want to achieve this:
a b c
0 1 10 20
1 2 15 75
2 3 16 160
I have tried something with if and elif but don't get to the right solution.
The dataframe is lengthy and the numbers 1, 2, 3 in column "a" appear in random order.
Thanks in advance.
Are you looking for something like this? I have extended your DataFrame; please check if it helps.
df1=pd.DataFrame([[1,10],[2,15],[3,16],[3,11],[2,12],[1,16]], columns=["a","b"])
dict_prod = {1:2, 2:5, 3:10}
df1['c'] = df1['a'].map(dict_prod)*df1['b']
a b c
0 1 10 20
1 2 15 75
2 3 16 160
3 3 11 110
4 2 12 60
5 1 16 32
You should be able to just do
df1['c'] = df1['a']*your_number * df1['b']
or
df1['c'] = some_function(df1['a']) * df1['b']
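To make the second form concrete, here is a small sketch. `multiplier_for` is a made-up helper standing in for `some_function`; it just looks up the factor for each value of "a" using the numbers from the question.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 10], [2, 15], [3, 16]], columns=["a", "b"])

def multiplier_for(a):
    """Hypothetical lookup: the factor that goes with each value of "a"."""
    return {1: 2, 2: 5, 3: 10}[a]

# Apply the lookup element-wise to "a", then multiply by "b" row by row.
df1["c"] = df1["a"].map(multiplier_for) * df1["b"]
print(df1)
```

Because the lookup works on values rather than positions, the order in which 1, 2 and 3 appear in column "a" does not matter.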
Let's say that I have the following dataframe:
name number
0 A 100
1 B 200
2 B 30
3 A 20
4 B 30
5 A 40
6 A 50
7 A 100
8 B 10
9 B 20
10 B 30
11 A 40
What I would like to do is to merge all the successive rows where name == 'B', between two rows with name == 'A' and get the corresponding sum. So, I would like my final output to look like that:
name number
0 A 100
1 B 230
2 A 20
3 B 30
4 A 40
5 A 50
6 A 100
7 B 60
8 A 40
We can use a little groupby trick here. Create a mask of the A's, then shift each subsequent run of B's into its own group. This answer assumes that your name Series contains just A's and B's.
c = df['name'].eq('A')
m1 = c.cumsum()
m = m1.where(c, m1 + m1.max())
df.groupby(m, sort=False, as_index=False).agg({'name': 'first', 'number': 'sum'})
name number
0 A 100
1 B 230
2 A 20
3 B 30
4 A 40
5 A 50
6 A 100
7 B 60
8 A 40
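To see why the trick works, it can help to look at the intermediate group keys. The sketch below rebuilds the frame from the question and uses named aggregation with a final reset_index (my choice here, so the output has a clean 0..n-1 index) instead of as_index=False.

```python
import pandas as pd

df = pd.DataFrame({
    "name": list("ABBABAAABBBA"),
    "number": [100, 200, 30, 20, 30, 40, 50, 100, 10, 20, 30, 40],
})

c = df["name"].eq("A")          # True on every A row
m1 = c.cumsum()                 # counts A's seen so far: a new value at each A
# A rows keep their own counter value, so each A is a group of one.
# B rows get the counter plus an offset, so each run of B's shares a key
# that cannot collide with any A key.
m = m1.where(c, m1 + m1.max())
print(m.tolist())

out = (
    df.groupby(m.rename("block"), sort=False)
      .agg(name=("name", "first"), number=("number", "sum"))
      .reset_index(drop=True)
)
print(out)
```

With sort=False the groups come out in order of first appearance, which is what preserves the original row order.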
A clumsier attempt - but since I've done it might as well post.
This is just a basic for loop with a while:
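for i in df.index:
    # rows are dropped as we go, so check that label i still exists
    if i in df.index and df.loc[i, 'name'] == 'B':
        # fold each following 'B' row into row i, then delete it
        while i + 1 in df.index and df.loc[i+1, 'name'] == 'B':
            df.loc[i, 'number'] += df.loc[i+1, 'number']
            df = df.drop(i+1).reset_index(drop=True)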
It's very straightforward (and hence inefficient I imagine): if B, if next row is also B, add next row to this row's number and delete next row.
I have a dataframe with 2 columns: value and product. There will be duplicated products, but with different values. What I want to do is to get all products, but remove any duplication. The condition to remove duplication will be to get the row with the lowest value and drop the rest. For example, I want something like this:
Before:
product value
A 25
B 45
C 15
C 14
C 13
B 22
After
product value
A 25
B 22
C 13
How can I make it so that only the lowest-valued row of each duplicated product gets added to the new dataframe?
df.sort_values('value').groupby('product').first()
# value
#product
#A 25
#B 22
#C 13
You can sort_values and then drop_duplicates:
res = df.sort_values('value').drop_duplicates('product')
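A runnable sketch of this on the data from the question. The last two calls are optional extras I added to put the result back in product order with a fresh index.

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "C", "C", "C", "B"],
    "value": [25, 45, 15, 14, 13, 22],
})

res = (
    df.sort_values("value")        # lowest value of each product comes first
      .drop_duplicates("product")  # keep only that first row per product
      .sort_values("product")      # optional: restore product order
      .reset_index(drop=True)
)
print(res)
```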
Going through the requirement, you don't even need drop_duplicates and sort_values, since we are only looking for the minimum value of each product in the DataFrame. There are a couple of ways of doing it, as follows.
I believe one of the shortest ways is to look up the index of each product's minimum value with idxmin.
>>> df
product value
0 A 25
1 B 45
2 C 15
3 C 14
4 C 13
5 B 22
>>> df.loc[df.groupby('product')['value'].idxmin()]
product value
0 A 25
5 B 22
4 C 13
OR
In this case, another short and elegant way is to compute the minimum of each group with groupby.min():
>>> df
product value
0 A 25
1 B 45
2 C 15
3 C 14
4 C 13
5 B 22
>>> df.groupby('product').min()
value
product
A 25
B 22
C 13
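One difference between the two is worth knowing if the real frame has more than two columns: groupby.min() takes the minimum of every column independently, while the idxmin version returns whole rows. A small sketch, where the "store" column is made up purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "C", "C", "C", "B"],
    "value": [25, 45, 15, 14, 13, 22],
    # Hypothetical extra column, not part of the original question.
    "store": ["north", "east", "south", "west", "north", "west"],
})

# Column-wise minimum: "value" and "store" may come from different rows.
by_min = df.groupby("product").min()

# Whole rows: the row that actually holds each product's lowest value.
by_idx = df.loc[df.groupby("product")["value"].idxmin()]

print(by_min)
print(by_idx)
```

For product B, the column-wise version pairs the value 22 with store "east", although the row holding 22 has store "west".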
In R, cbind(dataframe, new_column) will return the original dataframe with an extra column called "new_column"
What is best practice for achieving this in Python (preferably using base or pandas)?
To make the question more concrete, suppose
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
and
new_column = [2,4,6,8,10,12]
And that the final output should be
Day Visitors Bounce Rate new_column
0 1 43 65 2
1 2 34 67 4
2 3 65 78 6
3 4 56 65 8
4 5 29 45 10
5 6 76 52 12
You can do this:
df['new_column'] = new_column
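If you want behaviour closer to cbind, which returns a new data frame and leaves the original alone, two pandas options are DataFrame.assign and pd.concat with axis=1. A short sketch using the data from the question:

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats, columns=['Day', 'Visitors', 'Bounce Rate'])
new_column = [2, 4, 6, 8, 10, 12]

# assign returns a new DataFrame; df itself is not modified.
out = df.assign(new_column=new_column)

# concat along axis=1 also binds columns; it aligns on the index,
# so the new Series is given df's index explicitly.
extra = pd.Series(new_column, name='new_column', index=df.index)
out2 = pd.concat([df, extra], axis=1)

print(out)
```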