How do I group a dataframe by multiple columns simultaneously [duplicate] - python

This question already has answers here:
Apply multiple functions to multiple groupby columns
(7 answers)
Closed 6 months ago.
I have a dataframe that looks something like this:
Individual  Category  Amount  Extras
A           1         250     30
A           1         300     10
A           1         500     8
A           2         350     12
B           1         200     9
B           2         300     20
B           2         450     15
I want to get a dataframe that looks like this:
Individual  Category  Count  Amount  Extras
A           1         3      1050    48
A           2         1      350     12
B           1         1      200     9
B           2         2      750     35
I know that you can use groupby with Pandas, but is it possible to aggregate with both count and sum simultaneously?

You could try as follows:
output_df = df.groupby(['Individual', 'Category']).agg(
    Count=('Individual', 'count'),
    Amount=('Amount', 'sum'),
    Extras=('Extras', 'sum')
).reset_index(drop=False)
print(output_df)
Individual Category Count Amount Extras
0 A 1 3 1050 48
1 A 2 1 350 12
2 B 1 1 200 9
3 B 2 2 750 35
So, we are using df.groupby and then applying named aggregation, which allows us to "[name] output columns when applying multiple aggregation functions to specific columns".
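For reference, a self-contained sketch of the same named aggregation, with the sample data reconstructed from the question:

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'Individual': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Category':   [1, 1, 1, 2, 1, 2, 2],
    'Amount':     [250, 300, 500, 350, 200, 300, 450],
    'Extras':     [30, 10, 8, 12, 9, 20, 15],
})

# Named aggregation: count rows per group, sum Amount and Extras
output_df = df.groupby(['Individual', 'Category']).agg(
    Count=('Individual', 'count'),
    Amount=('Amount', 'sum'),
    Extras=('Extras', 'sum'),
).reset_index()

print(output_df)
```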

Related

How to create new column in Pandas dataframe where each row is product of previous rows

I have the following DataFrame dt:
a
0 1
1 2
2 3
3 4
4 5
How do I create a new column where each row is a function of previous rows?
For instance, say the formula is:
B_row(t) = A_row(t-1)+A_row(t-2)+3
Such that:
a b
0 1 /
1 2 /
2 3 6
3 4 8
4 5 10
Also, I often hear that we shouldn't loop through rows in Pandas; however, it seems to me that I would have to loop through each row and build a sort of recursive calculation, as I would in regular Python.
You could use cumprod to get each row as the product of all previous rows (as in the title):
dt['b'] = dt['a'].cumprod()
Output:
a b
0 1 1
1 2 2
2 3 6
3 4 24
4 5 120
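Note that cumprod matches the title's "product of previous rows" but not the example formula B_row(t) = A_row(t-1) + A_row(t-2) + 3 given in the question body. For that specific formula no loop is needed either; a sketch using shift (column names taken from the question):

```python
import pandas as pd

dt = pd.DataFrame({'a': [1, 2, 3, 4, 5]})

# b(t) = a(t-1) + a(t-2) + 3; the first two rows have no complete
# look-back window, so they come out as NaN (the "/" in the question)
dt['b'] = dt['a'].shift(1) + dt['a'].shift(2) + 3

print(dt)
```

Genuinely recursive formulas (where b(t) depends on b(t-1)) do need a loop or numba, but anything that only references fixed offsets of existing columns vectorises with shift.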

Fill 0s with Column Value based on Group (Another Column Value) [duplicate]

This question already has answers here:
Python Pandas max value in a group as a new column
(3 answers)
Closed last year.
I have a DF, Sample below:
Group $ Type
1 50 A
1 0 B
1 0 C
2 150 A
2 0 B
2 0 C
What I want to do is populate the $ column with the value from the Type A row, within each Group.
Resulting DF will look like the below:
Group $ Type
1 50 A
1 50 B
1 50 C
2 150 A
2 150 B
2 150 C
I have tried various np.where functions but can't seem to get the desired output.
Thanks in advance!
Try groupby with transform('max'):
df['new$'] = df.groupby('Group')['$'].transform('max')
df
Out[371]:
Group $ Type new$
0 1 50 A 50
1 1 0 B 50
2 1 0 C 50
3 2 150 A 150
4 2 0 B 150
5 2 0 C 150
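transform('max') works here because the Type A row happens to hold the largest value in each group. If that's not guaranteed, a sketch that takes the Type A value explicitly instead (assuming one A row per group):

```python
import pandas as pd

df = pd.DataFrame({
    'Group': [1, 1, 1, 2, 2, 2],
    '$':     [50, 0, 0, 150, 0, 0],
    'Type':  ['A', 'B', 'C', 'A', 'B', 'C'],
})

# Build a Group -> $-of-the-Type-A-row lookup, then map it onto every row
a_values = df.loc[df['Type'] == 'A'].set_index('Group')['$']
df['$'] = df['Group'].map(a_values)

print(df)
```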

Pandas groupby results assign to a new column

Hi, I'm trying to create a new column in my dataframe, and I want the values to be based on a calculation: each Student's share of the total Score within their Class. There are 2 different students with the same name in different classes, which is why the first groupby below is on both Class and Student.
df['share'] = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
With the code above, I get the error "incompatible index of inserted column with frame index".
Can someone please help. Thanks
The problem is that a groupby aggregation is indexed by the unique values of the columns you group on, which doesn't match the original frame's index, so the result can't be inserted as a column. The code below instead sets up a new dataframe with each student's share score.
I understood your problem this way.
ndf = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
ndf = ndf.reset_index()
ndf
If I understood you correctly, given an example df like the following:
Class Student Score
1 1 1 99
2 1 2 60
3 1 3 90
4 1 4 50
5 2 1 93
6 2 2 93
7 2 3 67
8 2 4 58
9 3 1 54
10 3 2 29
11 3 3 34
12 3 4 46
Do you need the following result?
Class Student Score Score_Share
1 1 1 99 0.331104
2 1 2 60 0.200669
3 1 3 90 0.301003
4 1 4 50 0.167224
5 2 1 93 0.299035
6 2 2 93 0.299035
7 2 3 67 0.215434
8 2 4 58 0.186495
9 3 1 54 0.331288
10 3 2 29 0.177914
11 3 3 34 0.208589
12 3 4 46 0.282209
If so, that can be achieved straightforwardly with:
df['Score_Share'] = df.groupby('Class')['Score'].apply(lambda x: x / x.sum())
You can apply operations within each group's scope like that.
PS. I don't know why a student with the same name in a different class would be a problem, so maybe I'm not getting something right. I'll edit this according to your response. Can't make a comment because I'm a newbie here :)
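An equivalent one-liner that assigns cleanly back onto the original frame, since transform returns a result aligned to the original index:

```python
import pandas as pd

df = pd.DataFrame({
    'Class':   [1, 1, 1, 1, 2, 2, 2, 2],
    'Student': [1, 2, 3, 4, 1, 2, 3, 4],
    'Score':   [99, 60, 90, 50, 93, 93, 67, 58],
})

# transform('sum') broadcasts each class total back to that class's rows,
# so the division is row-aligned with the original frame
df['Score_Share'] = df['Score'] / df.groupby('Class')['Score'].transform('sum')

print(df)
```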

Create a new pandas column that uses an existing columns to fill previous rows and group by based on multiple conditions

I have the dataset below:
myid  id_1  Date         group  new_id
100   1     1-Jan-2020   A
100   2     3-Jan-2020   A
100   3     4-Jan-2020   A      101
100   4     15-Jan-2020  A
100   5     20-Feb-2020  A
200   6     3-Jan-2020   B
200   7     8-Feb-2020   B
200   8     9-Feb-2020   B      102
200   9     25-Mar-2020  B
200   9     26-Jan-2020  B
I want to create a column named ns.
The column ns needs to be created in a way that uses myid, Date and new_id:
If the difference from the previous date is greater than 30 days and the row belongs to the same myid, the value should be incremented; otherwise it should retain the same value.
If new_id is not null, the row shares the same value as the previous row, and the next row is incremented.
For every myid the value starts from 1.
Expected output :
myid  id_1  Date         group  new_id  ns
100   1     1-Jan-2020   A              1
100   2     3-Jan-2020   A              1
100   3     4-Jan-2020   A      101     1
100   4     15-Jan-2020  A              2
100   5     20-Jan-2020  A              3
200   6     3-Jan-2020   B              1
200   7     8-Feb-2020   B              2
200   8     9-Feb-2020   B      102     2
200   9     25-Mar-2020  B              3
200   9     26-Mar-2020  B              4
I have used df.groupby('CMID')['Date'].diff(), df.groupby('CMID')['PlanID'].bfill() and np.where to create multiple dummy columns in order to achieve this, and I am still working on it. Please let me know if there's a better way to go about this?
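A possible vectorised sketch, assuming the column names from the example (note the input and expected-output dates disagree in a couple of rows; the version below uses 26-Mar-2020 for the last row). It reproduces the expected ns values except for that last row of group 200, which comes out as 3 rather than 4 under a literal reading of the stated rules (a one-day gap with no preceding new_id):

```python
import pandas as pd

df = pd.DataFrame({
    'myid':   [100, 100, 100, 100, 100, 200, 200, 200, 200, 200],
    'Date':   ['1-Jan-2020', '3-Jan-2020', '4-Jan-2020', '15-Jan-2020',
               '20-Feb-2020', '3-Jan-2020', '8-Feb-2020', '9-Feb-2020',
               '25-Mar-2020', '26-Mar-2020'],
    'new_id': [None, None, 101, None, None, None, None, 102, None, None],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%Y')

# A row starts a new ns value when either:
#  - the gap to the previous date within the same myid exceeds 30 days, or
#  - the previous row within the same myid carried a non-null new_id
gap = df.groupby('myid')['Date'].diff().dt.days.gt(30)
after_new_id = df.groupby('myid')['new_id'].shift().notna()
df['ns'] = (gap | after_new_id).astype(int).groupby(df['myid']).cumsum() + 1

print(df)
```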

Sorting rows in csv file using Python Pandas

I have a quick question regarding sorting rows in a csv file using Pandas. The csv file I have contains data that looks like:
quarter week Value
5 1 200
3 2 100
2 1 50
2 2 125
4 2 175
2 3 195
3 1 10
5 2 190
I need to sort it in the following way: by quarter and, within each quarter, by week. So the output should look like the following:
quarter week Value
2 1 50
2 2 125
2 3 195
3 1 10
3 2 100
4 2 175
5 1 200
5 2 190
My attempt:
df = df.sort('quarter', 'week')
But this does not produce the correct result. Any help/suggestions?
Thanks!
New answer, as of 14 March 2019
df.sort_values(by=["COLUMN"], ascending=False)
This returns a new sorted data frame, doesn't update the original one.
Note: You can change the ascending parameter according to your needs, without passing it, it will default to ascending=True
Note: sort has been deprecated in favour of sort_values, which you should use in Pandas 0.17+.
Typing help(df.sort) gives:
sort(self, columns=None, column=None, axis=0, ascending=True, inplace=False) method of pandas.core.frame.DataFrame instance
Sort DataFrame either by labels (along either axis) or by the values in
column(s)
Parameters
----------
columns : object
Column name(s) in frame. Accepts a column name or a list or tuple
for a nested sort.
[...]
Examples
--------
>>> result = df.sort(['A', 'B'], ascending=[1, 0])
[...]
and so you pass the columns you want to sort as a list:
>>> df
quarter week Value
0 5 1 200
1 3 2 100
2 2 1 50
3 2 2 125
4 4 2 175
5 2 3 195
6 3 1 10
7 5 2 190
>>> df.sort(["quarter", "week"])
quarter week Value
2 2 1 50
3 2 2 125
5 2 3 195
6 3 1 10
1 3 2 100
4 4 2 175
0 5 1 200
7 5 2 190
(In recent versions of pandas, calling df.sort instead raises AttributeError: 'DataFrame' object has no attribute 'sort'.)
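With sort removed, the modern equivalent of that multi-column call is sort_values with a list of columns (ignore_index, available since pandas 1.0, renumbers the result like a fresh frame):

```python
import pandas as pd

df = pd.DataFrame({
    'quarter': [5, 3, 2, 2, 4, 2, 3, 5],
    'week':    [1, 2, 1, 2, 2, 3, 1, 2],
    'Value':   [200, 100, 50, 125, 175, 195, 10, 190],
})

# Sort by quarter first, then by week within each quarter
df = df.sort_values(['quarter', 'week'], ignore_index=True)

print(df)
```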
