Plot a bar graph of department-wise toppers using pandas - python

Department wise toppers (horizontal bar graph or any visual representation of your choice)
I need to plot a bar graph of department-wise toppers from the table below.
Data set Table
index Name python mysql Previous Geekions CodeKata Score Average Score Department Rising python_en computational_thinking
0 A.Dharani 82.0 20.0 24500 24500 24500.0 Computer Science and Engineering 0 NaN NaN
1 V.JEEVITHA 82.0 20.0 21740 21740 21740.0 Computer Science and Engineering 0 NaN NaN
2 HEMAVATHI.R 100.0 100.0 19680 19680 19680.0 Computer Science and Engineering 0 NaN NaN
3 Mugunthan S 100.0 47.0 10610 10610 10610.0 Computer Science and Engineering 0 NaN NaN
I have tried the code below, but I am not able to get the names of the toppers:
DToppers = df.groupby('Department')['CodeKata Score'].max()
DToppers.plot(kind='bar', title='Department wise toppers')
plt.show()

Here's some example data, I only included the relevant columns and split the 4 records into 2 different departments to show the behaviour:
import pandas as pd

data = {
    'Name': ['A. Dharani', 'V.JEEVITHA', 'HEMAVATHI.R', 'Mugunthan S'],
    'CodeKata Score': [24500, 21740, 19680, 10610],
    'Department': ['Department1', 'Department1', 'Department2', 'Department2']
}
df = pd.DataFrame(data)
Now get the rows that match each department's calculated max:
DToppers = df.loc[df.groupby('Department')['CodeKata Score'].idxmax()]
Now plot, which is very similar to your original. If you want the names and departments on the plot, you will need to adjust the x values and/or set the labels:
DToppers.plot(x='Name', y='CodeKata Score', kind='bar', title='Department wise toppers')
NB: Think about what should happen if more than one person shares the same high score in a department. If you only want department names on the plot that won't matter, but it forces a decision if you want the names there.
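To make the tie question above concrete, here is a self-contained sketch using the example data: idxmax keeps only the first matching row per department, while comparing each score against a transform of the group max keeps every tied topper.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A. Dharani', 'V.JEEVITHA', 'HEMAVATHI.R', 'Mugunthan S'],
    'CodeKata Score': [24500, 21740, 19680, 10610],
    'Department': ['Department1', 'Department1', 'Department2', 'Department2'],
})

# idxmax() returns one row label per department: the FIRST row hitting the max.
first_toppers = df.loc[df.groupby('Department')['CodeKata Score'].idxmax()]

# transform('max') broadcasts each department's max back onto its rows,
# so an equality filter keeps ALL tied toppers, not just the first.
max_scores = df.groupby('Department')['CodeKata Score'].transform('max')
all_toppers = df[df['CodeKata Score'] == max_scores]
```

With no ties in this sample both approaches return the same two rows; the difference only shows up once two people share a department's top score.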

Related

How can I combine two newly created columns and also create a column for the division of both

I have the first column:
df_weight = df2.groupby(['Genre']).agg(total = ('weighted score', 'sum')).reset_index()
                  Genre  total_weight
0  Action and Adventure        1000.0
1   Classic and cult TV         500.0
and the second column:
df_TotalShow = df2.groupby(['Genre']).agg(total = ('No. of shows', 'sum')).reset_index()
                  Genre  total_shows
0  Action and Adventure        200.0
1   Classic and cult TV        150.0
I want to combine the two into something like the table below, but I am unsure of what the code should look like.
                  Genre  total_weight  total_shows
0  Action and Adventure        1000.0        200.0
1   Classic and cult TV         500.0        150.0
Next, I want to create another column with the division of total_weight / total_shows.
So far, I tried
df = df_weight['total'].div(df_TotalShow['total'])
but this gives me a new Series. Is there a way this could be another column by itself, with the final product looking something like:
                  Genre  total_weight  total_shows   Avg
0  Action and Adventure        1000.0        200.0  5.00
1   Classic and cult TV         500.0        150.0  3.33
Use this code:
df = df2.groupby(['Genre']).agg({'No. of shows': 'sum', 'weighted score': 'sum'}).reset_index()
df['Avg'] = df['weighted score'] / df['No. of shows']
Replace the column names with your own. You only need to perform the grouping and aggregation once for this problem, not twice.
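As a runnable sketch of the single-groupby approach (the sample values are illustrative, chosen so the totals match the tables above; named aggregation requires pandas 0.25+):

```python
import pandas as pd

# Illustrative rows matching the question's schema.
df2 = pd.DataFrame({
    'Genre': ['Action and Adventure', 'Action and Adventure',
              'Classic and cult TV', 'Classic and cult TV'],
    'weighted score': [600.0, 400.0, 300.0, 200.0],
    'No. of shows': [120.0, 80.0, 60.0, 90.0],
})

# One groupby produces both totals at once; named aggregation keeps the
# output column names readable, and the ratio is then a plain column op.
out = (df2.groupby('Genre')
          .agg(total_weight=('weighted score', 'sum'),
               total_shows=('No. of shows', 'sum'))
          .reset_index())
out['Avg'] = out['total_weight'] / out['total_shows']
```

The division lines up row by row because both totals live in the same frame, which is exactly what the two-groupby attempt in the question was missing.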

Multiple similar columns with similar values

The dataframe looks like:
name education education_2 education_3
name_1 NaN some college NaN
name_2 NaN NaN graduate degree
name_3 high school NaN NaN
I just want to keep one education column. I tried comparing the columns with conditional statements, but got nothing but errors. I also looked into merge solutions, in vain. Does anyone know how to deal with this using Python or pandas? Thank you in advance. The desired output:
name education
name_1 some college
name_2 graduate degree
name_3 high school
One day I hope they'll have better functions for String type rows, rather than the limited support for columns currently available:
df['education'] = (df.filter(like='education')  # Filters to only the education columns.
                     .T                         # Transposes.
                     .convert_dtypes()          # Converts to pandas nullable dtypes.
                     .max()                     # Gets the max value, which will be the non-null one.
                   )
df = df[['name', 'education']]
print(df)
Output:
name education
0 name 1 some college
1 name 2 graduate degree
2 name 3 high school
Looping this wouldn't be too hard, e.g.:
cols = ['education', 'age', 'income']
for col in cols:
    df[col] = df.filter(like=col).bfill(axis=1)[col]
df = df[['name'] + cols]
You can use df.fillna to do so.
df['combine'] = df[['education','education2','education3']].fillna('').sum(axis=1)
df
name education education2 education3 combine
0 name1 NaN some college NaN some college
1 name2 NaN NaN graduate degree graduate degree
2 name3 high school NaN NaN high school
If you have a lot of columns to combine, you can try this.
df['combine'] = df[df.columns[1:]].fillna('').sum(axis=1)
Use bfill to fill the empty (NaN) values:
df.bfill(axis=1).drop(columns=['education 2','education 3'])
name education
0 name 1 some college
1 name 2 graduate degree
2 name 3 high school
If there are other columns in between, then choose the columns to which you apply the bfill. In essence, if you have multiple education columns to consolidate under a single column, select just those columns, apply bfill, and then drop the columns you back-filled from:
df[['education','education 2','education 3']].bfill(axis=1).drop(columns=['education 2','education 3'])
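Putting the bfill idea above into a self-contained sketch (column names follow the question's `education`, `education_2`, `education_3` layout):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['name_1', 'name_2', 'name_3'],
    'education': [None, None, 'high school'],
    'education_2': ['some college', None, None],
    'education_3': [None, 'graduate degree', None],
})

edu_cols = ['education', 'education_2', 'education_3']
# bfill along axis=1 pulls each row's first non-null value leftwards into
# 'education'; the helper columns can then be dropped.
df['education'] = df[edu_cols].bfill(axis=1)['education']
df = df.drop(columns=['education_2', 'education_3'])
```

This assumes at most one non-null value per row across the education columns; if several are filled, bfill silently keeps the leftmost one.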

pandas grouping and visualization

I have to do some analysis using Python3 and pandas with a dataset which is shown as a toy example-
data
'''
location importance agent count
0 London Low chatbot 2
1 NYC Medium chatbot 1
2 London High human 3
3 London Low human 4
4 NYC High human 1
5 NYC Medium chatbot 2
6 Melbourne Low chatbot 3
7 Melbourne Low human 4
8 Melbourne High human 5
9 NYC High chatbot 5
'''
My aim is to group by location and then count the number of Low, Medium and/or High values in the 'importance' column for each location. So far, the code I have come up with is:
data.groupby(['location', 'importance']).aggregate(np.size)
'''
agent count
location importance
London High 1 1
Low 2 2
Melbourne High 1 1
Low 2 2
NYC High 2 2
Medium 2 2
'''
This grouping and count aggregation contains index as the grouping objects-
data.groupby(['location', 'importance']).aggregate(np.size).index
I don't know how to proceed from here. Also, how can I visualize this?
I think you need DataFrame.pivot_table with aggfunc='sum' to aggregate any duplicates, and then DataFrame.plot:
df = data.pivot_table(index='location', columns='importance', values='count', aggfunc='sum')
df.plot()
If you need counts of location/importance pairs, use crosstab:
df = pd.crosstab(data['location'], data['importance'])
df.plot()
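For reference, a runnable sketch of the crosstab route on the toy data (the plot call is commented out since it needs matplotlib; a bar chart usually reads better than a line plot for counts):

```python
import pandas as pd

data = pd.DataFrame({
    'location': ['London', 'NYC', 'London', 'London', 'NYC',
                 'NYC', 'Melbourne', 'Melbourne', 'Melbourne', 'NYC'],
    'importance': ['Low', 'Medium', 'High', 'Low', 'High',
                   'Medium', 'Low', 'Low', 'High', 'High'],
})

# One row per location, one column per importance level, cells are counts.
counts = pd.crosstab(data['location'], data['importance'])

# counts.plot(kind='bar', title='Importance counts per location')
```

crosstab fills missing combinations (e.g. London/Medium) with 0, which is exactly what the groupby-size output above leaves out.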

How to form a pivot table on two categorical columns and count for each index?

I have a data frame of 62 undergrads from a state university with 13 columns (age, class, major, GPA, etc.).
print(studentSurvey)
ID Gender Age Major ... Text Messages
1 F 20 Other 120
2 M 22 CS 50
.
.
.
62 F 21 Retail 200
I want to make pivot tables on studentSurvey. For example, I want to find out how many women took CS as major, men taking Others, etc. The closest I could code this out was through this:
studentSurvey.pivot_table(index="Gender", columns="Major", aggfunc='count')
Age ... Text Messages
Major Accounting CIS Economics/Finance ... Other Retailing/Marketing Undecided
Gender ...
Female 3.0 3.0 7.0 ... 3.0 9.0 NaN
Male 4.0 1.0 4.0 ... 4.0 5.0 3.0
That is not what I require. I only want Gender as the index (rows), with the unique values of Major as the columns and each cell containing the count for that gender and major. I've also tried slicing out just these two columns and pivoting, but the results are mixed up. Can anyone suggest something better? I'm new to advanced reshaping in pandas.
Check crosstab
pd.crosstab(df['Gender'], df['Major'])
Fix your code
studentSurvey.pivot_table(index="Gender", columns="Major", values="ID", aggfunc="count")
Try:
(studentSurvey.groupby(['Gender', 'Major'])
              .size()
              .unstack('Major', fill_value=0)
)
Or you can do crosstab:
pd.crosstab(studentSurvey['Gender'], studentSurvey['Major'])
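Since the full studentSurvey data isn't shown, here is a sketch on made-up rows confirming that crosstab and the fixed pivot_table agree (pivot_table leaves NaN where a count is zero, while crosstab fills 0):

```python
import pandas as pd

# Hypothetical sample rows in the question's schema.
studentSurvey = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Gender': ['F', 'M', 'F', 'M'],
    'Major': ['CS', 'CS', 'Retail', 'Other'],
})

# crosstab counts pairs directly and fills missing combinations with 0.
ct = pd.crosstab(studentSurvey['Gender'], studentSurvey['Major'])

# pivot_table needs an explicit values= column to count; absent pairs are NaN.
pt = studentSurvey.pivot_table(index='Gender', columns='Major',
                               values='ID', aggfunc='count')
```

For pure counts of two categorical columns, crosstab is the more direct tool; pivot_table is worth the extra arguments once you also need sums or means of other columns.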

Group by when condition is not just by row in Python Pandas

I have two dataframes: one at the lower level and one that summarizes the data at a higher level. I'm trying to add a new column to the summary table that sums the total spending of all people who are fans of a particular sport. That is, in the summary row for soccer I do NOT want to sum the total soccer spending, but the total sports spending of anyone who spends anything on soccer.
df = pd.DataFrame({'Person': [1, 2, 3, 3, 3],
                   'Sport': ['Soccer', 'Tennis', 'Tennis', 'Football', 'Soccer'],
                   'Ticket_Cost': [10, 20, 10, 10, 20]})
df2 = pd.DataFrame({'Sport': ['Soccer', 'Tennis', 'Football']})
I can currently do this in many steps, but I'm sure there is a more efficient/quicker way. Here is how I currently do it.
#Calculate the total spend for each person in an temporary dataframe
df_intermediate = df.groupby(['Person'])['Ticket_Cost'].sum()
df_intermediate= df_intermediate.rename("Total_Sports_Spend")
Person Total_Sports_Spend
1 10
2 20
3 40
#place this total in the detailed table
df = pd.merge(df,df_intermediate,how='left',on='Person')
#Create a second temporary dataframe
df_intermediate2 = df.groupby(['Sport'])['Total_Sports_Spend'].sum()
Sport Total_Sports_Spend
Football 40
Soccer 50
Tennis 60
#Merge this table with the summary table
df2 = pd.merge(df2,df_intermediate2,how='left',on='Sport')
Sport Total_Sports_Spend
0 Soccer 50
1 Tennis 60
2 Football 40
Finally, I clean up the temporary dataframes and remove the extra column from the detailed table. I'm sure there is a better way.
You might want to rotate your DataFrame in 2D:
df2 = df.pivot_table(index = 'Person', columns = 'Sport', values = 'Ticket_Cost')
You get
Sport Football Soccer Tennis
Person
1 NaN 10.0 NaN
2 NaN NaN 20.0
3 10.0 20.0 10.0
Now you can compute the total spending per person:
total = df2.sum(axis=1)
which is
Person
1 10.0
2 20.0
3 40.0
dtype: float64
Next, you place the total spending values from total in the cells of df2 where the cell has a positive value:
df3 = (df2>0).mul(total, axis=0)
which is here:
Sport Football Soccer Tennis
Person
1 0.0 10.0 0.0
2 0.0 0.0 20.0
3 40.0 40.0 40.0
Finally you just have to sum along columns to get what you want:
spending = df3.sum(axis=0)
which gives the expected totals per sport.
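Alternatively, a shorter sketch of the same idea: groupby().transform broadcasts each person's total back onto their rows, so the intermediate merge from the question is not needed.

```python
import pandas as pd

df = pd.DataFrame({'Person': [1, 2, 3, 3, 3],
                   'Sport': ['Soccer', 'Tennis', 'Tennis', 'Football', 'Soccer'],
                   'Ticket_Cost': [10, 20, 10, 10, 20]})

# transform('sum') returns one value PER ROW (each person's total spend),
# unlike .sum(), which would collapse to one value per person.
df['Total_Sports_Spend'] = df.groupby('Person')['Ticket_Cost'].transform('sum')

# Summing that per-row total by sport gives the fan-spending figure:
# each sport row contributes its fan's WHOLE spend, not just that ticket.
result = df.groupby('Sport')['Total_Sports_Spend'].sum()
```

Note this double-counts a person's total across every sport they attend, which is exactly the semantics the question asks for.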
