Pandas groupby results assign to a new column - python

Hi, I'm trying to create a new column in my dataframe, and I want the values to be based on a calculation: the score share of each Student within the Class. There are two different students with the same name in different classes, which is why the first groupby below is on both Class and Student.
df['share'] = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
With the code above, I get the error "incompatible index of inserted column with frame index".
Can someone please help? Thanks.

The problem is that a groupby aggregate returns a result indexed by the unique values of the columns you group on, not by the original row index, so it cannot be assigned straight back as a column. In your case the share is each student's score divided by the class total, so the division below builds a new dataframe with one share per student per class.
This is how I understood your problem.
ndf = df.groupby(['Class', 'Student'])['Score'].agg('sum')/df.groupby(['Class'])['Score'].agg('sum')
ndf = ndf.reset_index()
ndf
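To see where the index mismatch comes from, it can help to look at the index of the aggregated result. Here is a minimal sketch on made-up data (the frame and the names student_totals, class_totals and share are my assumptions, not from the question). It broadcasts the class total across the Class level explicitly with Series.div(..., level=...):

```python
import pandas as pd

# Hypothetical data: "Ann" appears in both classes and has two rows in class A.
df = pd.DataFrame({
    "Class": ["A", "A", "A", "B", "B"],
    "Student": ["Ann", "Ann", "Bob", "Ann", "Cy"],
    "Score": [10, 20, 30, 5, 15],
})

student_totals = df.groupby(["Class", "Student"])["Score"].sum()
class_totals = df.groupby("Class")["Score"].sum()

# The aggregate is indexed by (Class, Student), not by the row labels 0..4,
# which is why it cannot be assigned to df as a column directly.
print(student_totals.index.names)

# Divide each student's total by the total of the matching class.
share = student_totals.div(class_totals, level="Class").reset_index(name="share")
print(share)
```

The result has one row per (Class, Student) pair, so it is a separate summary table rather than a column of the original frame.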

If I understood you correctly, given an example df like the following:
Class Student Score
1 1 1 99
2 1 2 60
3 1 3 90
4 1 4 50
5 2 1 93
6 2 2 93
7 2 3 67
8 2 4 58
9 3 1 54
10 3 2 29
11 3 3 34
12 3 4 46
Do you need the following result?
Class Student Score Score_Share
1 1 1 99 0.331104
2 1 2 60 0.200669
3 1 3 90 0.301003
4 1 4 50 0.167224
5 2 1 93 0.299035
6 2 2 93 0.299035
7 2 3 67 0.215434
8 2 4 58 0.186495
9 3 1 54 0.331288
10 3 2 29 0.177914
11 3 3 34 0.208589
12 3 4 46 0.282209
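If so, that can be achieved straightforwardly with:
df['Score_Share'] = df['Score'] / df.groupby('Class')['Score'].transform('sum')
transform returns a result aligned with the original rows, so you can apply operations within each group's scope and assign the result straight back as a column.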
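If a student can have more than one row within a class (which seems to be why the question groups on both Class and Student), a transform-based variant keeps everything aligned to the original rows. This is only a sketch on made-up data; the frame and the column name share are assumptions:

```python
import pandas as pd

# Hypothetical data: "Ann" appears in both classes and has two rows in class A.
df = pd.DataFrame({
    "Class": ["A", "A", "A", "B", "B"],
    "Student": ["Ann", "Ann", "Bob", "Ann", "Cy"],
    "Score": [10, 20, 30, 5, 15],
})

# transform("sum") returns one value per original row, so both results
# share df's index and can be divided and assigned without realignment.
student_total = df.groupby(["Class", "Student"])["Score"].transform("sum")
class_total = df.groupby("Class")["Score"].transform("sum")
df["share"] = student_total / class_total

print(df)
```

Grouping on both columns keeps the two students named "Ann" apart, because each one is only summed within their own class.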
PS. I don't know why a student with the same name in a different class would be a problem, so maybe I'm not getting something right. I'll edit this according to your response. Can't make a comment because I'm a newbie here :)

Related

Comparing Columns in a Pandas Dataframe

I have a pandas data frame with racing results.
Place BibNum Time
0 1 2 5:50
1 2 4 8:09
2 3 7 10:27
3 4 3 11:12
4 5 1 12:13
...
34 1 5 2:03
35 2 9 4:35
36 3 7 5:36
What I would like to know is how can I get a count of how many times the BibNum showed up where the Place was 1, 2, 3 etc?
I know that I can do a "value_counts" but that is for how many times it shows up in a single column. I also looked into using numpy "where" but that is using a conditional like greater than or less than.
IIUC, this is what you need:
out = df.groupby(['Place','BibNum']).size()
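For illustration, here is a small runnable sketch using only the rows visible in the question (the rows hidden behind "..." are left out, so the counts are not the real ones). The unstack step and the names counts and table are my additions, in case a Place-by-BibNum table is easier to read:

```python
import pandas as pd

# Only the rows shown in the question; the elided rows are omitted.
df = pd.DataFrame({
    "Place":  [1, 2, 3, 4, 5, 1, 2, 3],
    "BibNum": [2, 4, 7, 3, 1, 5, 9, 7],
})

# One count per (Place, BibNum) pair that actually occurs.
counts = df.groupby(["Place", "BibNum"]).size()
print(counts.loc[(3, 7)])  # BibNum 7 finished 3rd twice in these rows

# Optional: one row per BibNum, one column per Place, 0 for missing pairs.
table = counts.unstack("Place", fill_value=0)
print(table)
```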

Python pandas: Count the number of occurrences of a number

I've searched for a long time and I need your help. I'm new to Python and the pandas library. I have a dataframe like this, loaded from a CSV file:
ball_1,ball_2,ball_3,ball_4,ball_5,ball_6,ball_7,extraball_1,extraball_2
10,32,25,5,8,19,21,3,4
43,12,8,19,4,37,12,1,5
12,16,43,19,4,28,40,2,4
ball_X is an int between 1 and 50, and extraball_X is an int between 1 and 9. I want to count how many times each number appears, and put the results in two other dataframes like this:
First DF ball :
Number,Score
1,128
2,34
3,12
4,200
....
50,145
Second DF extraball :
Number,Score
1,340
2,430
3,123
4,540
....
9,120
I have the algorithm in my head, but I'm too new to pandas to translate it into code.
I hope it's clear enough and that someone will be able to help me. Don't hesitate to ask if you have questions.
groupby on columns with value_counts
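def get_before_underscore(x):
    # 'ball_3' -> 'ball', 'extraball_1' -> 'extraball'
    return x.split('_', 1)[0]

# groupby(..., axis=1) is deprecated in recent pandas, so group the
# transposed frame by its row labels (the original column names) instead
val_counts = {
    k: d.stack().value_counts()
    for k, d in df.T.groupby(get_before_underscore)
}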
print(val_counts['ball'])
12 3
19 3
4 2
8 2
43 2
32 1
5 1
10 1
37 1
40 1
16 1
21 1
25 1
28 1
dtype: int64
print(val_counts['extraball'])
4 2
1 1
2 1
3 1
5 1
dtype: int64
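If you want the two results as dataframes with Number and Score columns, as in the question, something along these lines should work. It is a sketch, not the only way: the helper count_numbers is hypothetical, and it uses melt rather than stack to flatten the columns.

```python
from io import StringIO

import pandas as pd

# The three sample rows from the question.
csv_text = """ball_1,ball_2,ball_3,ball_4,ball_5,ball_6,ball_7,extraball_1,extraball_2
10,32,25,5,8,19,21,3,4
43,12,8,19,4,37,12,1,5
12,16,43,19,4,28,40,2,4
"""
df = pd.read_csv(StringIO(csv_text))


def count_numbers(frame, prefix):
    """Count how often each number appears in the columns named prefix_N."""
    cols = [c for c in frame.columns if c.split("_", 1)[0] == prefix]
    # melt flattens the selected columns into a single column of values
    values = frame[cols].melt(value_name="Number")["Number"]
    counts = values.value_counts().sort_index()
    return counts.rename_axis("Number").reset_index(name="Score")


ball = count_numbers(df, "ball")
extraball = count_numbers(df, "extraball")
print(ball)
print(extraball)
```

Numbers that never occur are simply missing from the result. If you need every number from 1 to 50 listed, adding .reindex(range(1, 51), fill_value=0) right after sort_index() should fill those in with 0.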

Combine two dataframes and pick first entry based on common column [duplicate]

I have two dataframes like
df1
sub_id Weight
1 56
2 67
3 81
5 73
9 59
df2
sub_id Text
1 He is normal.
1 person is healthy.
1 has strong immune power.
3 She is over weight.
3 person is small.
9 Looks good.
5 Not well.
5 Need to be tested.
By combining these two dataframes I need to get the following
(when a sub_id appears multiple times in the second df, I need to pick the first text and combine it with the first df, as below):
merge_df
sub_id Weight Text
1 56 He is normal.
2 67 Nan.
3 81 She is over weight.
5 73 Not well.
9 59 Looks good.
Can anyone help me out?
Thanks in advance.
Here you go:
print(pd.merge(df1, df2.drop_duplicates(subset='sub_id'),
on='sub_id',
how='outer'))
Output
sub_id Weight Text
0 1 56 He is normal.
1 2 67 NaN
2 3 81 She is over weight.
3 5 73 Not well.
4 9 59 Looks good.
To keep the last duplicate, you'd use the parameter keep='last'
print(pd.merge(df1, df2.drop_duplicates(subset='sub_id', keep='last'),
on='sub_id',
how='outer'))
Output
sub_id Weight Text
0 1 56 has strong immune power.
1 2 67 NaN
2 3 81 person is small.
3 5 73 Need to be tested.
4 9 59 Looks good.
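One caveat: how='outer' would also keep any sub_id that exists only in df2. For this data that makes no difference, but if the goal is exactly one row per row of df1, a left merge is the safer choice. A self-contained sketch (the names first_text and merge_df are just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({
    "sub_id": [1, 2, 3, 5, 9],
    "Weight": [56, 67, 81, 73, 59],
})
df2 = pd.DataFrame({
    "sub_id": [1, 1, 1, 3, 3, 9, 5, 5],
    "Text": [
        "He is normal.", "person is healthy.", "has strong immune power.",
        "She is over weight.", "person is small.", "Looks good.",
        "Not well.", "Need to be tested.",
    ],
})

# Keep only the first Text per sub_id, then attach it to every row of df1.
first_text = df2.drop_duplicates(subset="sub_id", keep="first")
merge_df = df1.merge(first_text, on="sub_id", how="left")
print(merge_df)
```

sub_id 2 has no match in df2, so its Text comes out as NaN.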

How to cbind a vector to a dataframe in python and keep the vector name as the name of the new column?

In R, cbind(dataframe, new_column) will return the original dataframe with an extra column called "new_column"
What is best practice for achieving this in Python (preferably using base or pandas)?
To make the question more concrete, suppose
web_stats = {'Day':[1,2,3,4,5,6],
'Visitors':[43,34,65,56,29,76],
'Bounce Rate':[65,67,78,65,45,52]}
df = pd.DataFrame(web_stats, columns = ['Day', 'Visitors', 'Bounce Rate'])
and
new_column = [2,4,6,8,10,12]
And that the final output should be
Day Visitors Bounce Rate new_column
0 1 43 65 2
1 2 34 67 4
2 3 65 78 6
3 4 56 65 8
4 5 29 45 10
5 6 76 52 12
You can assign the list to a new column of the dataframe (not to the web_stats dict, since df has already been built from it):
df['new_column'] = new_column
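If you would rather not modify df in place, which is closer to how cbind behaves in R, here are two options as a sketch. The names combined, extra and combined_concat are mine:

```python
import pandas as pd

web_stats = {'Day': [1, 2, 3, 4, 5, 6],
             'Visitors': [43, 34, 65, 56, 29, 76],
             'Bounce Rate': [65, 67, 78, 65, 45, 52]}
df = pd.DataFrame(web_stats, columns=['Day', 'Visitors', 'Bounce Rate'])
new_column = [2, 4, 6, 8, 10, 12]

# assign returns a new frame; the keyword name becomes the column name.
combined = df.assign(new_column=new_column)

# With a named Series, concat along the columns picks up the Series name,
# which is probably the closest match to cbind keeping the vector's name.
extra = pd.Series(new_column, name='new_column')
combined_concat = pd.concat([df, extra], axis=1)

print(combined)
```

Note that concat aligns on the index, so the Series needs the same index as df (here both use the default 0..5).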

Sorting rows in csv file using Python Pandas

I have a quick question regarding sorting rows in a CSV file using Pandas. The CSV file I have contains data that looks like this:
quarter week Value
5 1 200
3 2 100
2 1 50
2 2 125
4 2 175
2 3 195
3 1 10
5 2 190
I need to sort in following way: sort the quarter and the corresponding weeks. So the output should look like following:
quarter week Value
2 1 50
2 2 125
2 3 195
3 1 10
3 2 100
4 2 175
5 1 200
5 2 190
My attempt:
df = df.sort('quarter', 'week')
But this does not produce the correct result. Any help/suggestions?
Thanks!
New answer, as of 14 March 2019
df.sort_values(by=["COLUMN"], ascending=False)
This returns a new sorted dataframe; it does not update the original one.
Note: You can change the ascending parameter according to your needs. If you don't pass it, it defaults to ascending=True.
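Applied to the data in the question, sorting by quarter and then by week would look roughly like this (the variable name sorted_df is mine, and the frame is typed in by hand instead of read from the CSV):

```python
import pandas as pd

df = pd.DataFrame({
    "quarter": [5, 3, 2, 2, 4, 2, 3, 5],
    "week":    [1, 2, 1, 2, 2, 3, 1, 2],
    "Value":   [200, 100, 50, 125, 175, 195, 10, 190],
})

# Sort by quarter first, then by week within each quarter.
sorted_df = df.sort_values(by=["quarter", "week"])
print(sorted_df)
```

The original row labels are kept, so add .reset_index(drop=True) if you want them renumbered from 0.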
Note: sort has been deprecated in favour of sort_values, which you should use in Pandas 0.17+.
Typing help(df.sort) gives:
sort(self, columns=None, column=None, axis=0, ascending=True, inplace=False) method of pandas.core.frame.DataFrame instance
Sort DataFrame either by labels (along either axis) or by the values in
column(s)
Parameters
----------
columns : object
Column name(s) in frame. Accepts a column name or a list or tuple
for a nested sort.
[...]
Examples
--------
>>> result = df.sort(['A', 'B'], ascending=[1, 0])
[...]
and so you pass the columns you want to sort as a list:
>>> df
quarter week Value
0 5 1 200
1 3 2 100
2 2 1 50
3 2 2 125
4 4 2 175
5 2 3 195
6 3 1 10
7 5 2 190
>>> df.sort(["quarter", "week"])
quarter week Value
2 2 1 50
3 2 2 125
5 2 3 195
6 3 1 10
1 3 2 100
4 4 2 175
0 5 1 200
7 5 2 190
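In current pandas versions df.sort no longer exists, and calling it raises AttributeError: 'DataFrame' object has no attribute 'sort'. Use df.sort_values(["quarter", "week"]) instead.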
