I am looking for:
the percentage of players who are in weight type thin,
the percentage of players who are in weight type normal,
the percentage of players who are in weight type overweight,
the percentage of players who are in weight type obese.
All of them are listed in the IMC column.
This is my dataset:
I don't know Python well. I can get the percentage of each row, but it is not grouped by category (thin, normal, overweight, obese).
This is my current code:
(df.groupby('IMC').size() / df['IMC'].count()) * 100
That's because your groupby uses the wrong column:
df.groupby('IMC')
will group by the numeric weight value, not by the category. If you change your code to
df.groupby('clasificacion_oms')
you will then get all the players grouped by their weight type. Then you can divide the count in each group by the total number of players to get your percentage.
Does that give you what you need? If not, please post your code so we can provide more detail.
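For example, a minimal sketch (the sample values below are made up, and the 'clasificacion_oms' column name is taken from the suggestion above):
import pandas as pd
# hypothetical sample data: numeric BMI in 'IMC', weight type in 'clasificacion_oms'
df = pd.DataFrame({
    'IMC': [17.5, 22.0, 27.3, 31.8, 23.1],
    'clasificacion_oms': ['thin', 'normal', 'overweight', 'obese', 'normal'],
})
# share of players in each weight type, as a percentage
pct = df.groupby('clasificacion_oms').size() / len(df) * 100
print(pct)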
I need 2 lists (one for highest teacher-student ratio and the other for the lowest ratio) which contain the codes of the 10 districts of the schools that have the highest and lowest teacher / student ratio, respectively.
This is the dataset; the columns I need to work on are underlined in red.
I need to calculate the 10 highest values of dataset.students / dataset.teachers, but I also have to associate the dataset.district value with every ratio.
I tried but I can't get around this.
Please help me.
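A minimal sketch of one way to do it, assuming the relevant columns are named 'district', 'students' and 'teachers', with the ratio computed as in the expression above:
import pandas as pd
# hypothetical stand-in for the real dataset
dataset = pd.DataFrame({
    'district': ['D01', 'D02', 'D03', 'D04'],
    'students': [400, 950, 620, 300],
    'teachers': [25, 40, 35, 10],
})
dataset['ratio'] = dataset['students'] / dataset['teachers']
# district codes with the 10 highest and 10 lowest ratios
highest = dataset.nlargest(10, 'ratio')['district'].tolist()
lowest = dataset.nsmallest(10, 'ratio')['district'].tolist()
print(highest, lowest)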
So far I can get the average age of the data:
np.mean(df.age)
As well as the average weight sorted by gender:
df.groupby(by='gender')['weight'].mean()
But I don't know how to add the condition so that I find the average weight only of people who are above the average age, shown by gender.
You can filter and groupby:
mean_age = df['age'].mean()
out = df[df['age']>mean_age].groupby('gender')['weight'].mean()
On another note, you may want to filter by the average age per gender:
mean_age = df.groupby('gender')['age'].transform('mean')
out = df[df['age']>mean_age].groupby('gender')['weight'].mean()
Once you have that, you can plot with
out.plot.bar()
I need to calculate: what country has the highest percentage of people that earn >50K?
Here is a preview of the dataset used (the 1994 census dataset).
The expected answer is Iran with 41.9%.
My approach:
country = df[df['income'] == '>50K'][['sex', 'native.country']]
top = country.describe()
top.loc['top', 'native.country']
Try this:
p = (df[df['income'] == '>50K']['native.country'].value_counts()
     / df['native.country'].value_counts() * 100).sort_values(ascending=False)
Then get the top country with p.index[0] and its percentage with p.iloc[0].
Suppose you stored your dataset into a variable named new.
# converting the sex column into numerical values so that summing counts each person once
gender = {'male': 1, 'female': 1}
new['sex'] = [gender[item] for item in new['sex']]
# calculating your desired result: rows with income over 50K
data = new.loc[new['income'] == '>50K', ['sex', 'native.country']]
result = data.groupby('native.country')['sex'].sum()
print(result)
This will give you, for each country, the number of people earning over 50K income; result.idxmax() then returns the name of the country with the largest count.
Then, if you also want each country's share of the total >50K population as a percentage, you can easily do it by using:
total = data['sex'].sum()
list1 = []
for i in result:
    list1.append(i / total * 100)
print(list1)
Hope you find some help from my answer.
Happy Coding :)
I have a huge dataset with a lot of different client names, bills etc.
Now I want to show the 4 clients with the highest cumulated total bill.
So far I have used the groupby function:
data.groupby(by = ["CustomerName","Bill"], as_index=False).sum()
I tried to group by the customer name and the bill, but it's not giving me the total sum of all of a customer's orders, only each single order from the customer.
Can someone help and tell me how I can get, in first position, customer x (with the highest accumulated bill) together with the sum of all their orders, in second position the customer with the second-highest accumulated bill, and so on?
Big thanks!
Since I don't know the full structure of your data frame, I recommend subsetting the relevant columns first:
data = data[["CustomerName", "Bill"]]
Then, you just need to group by CustomerName and sum over all remaining columns (Bill in this case), keeping the result:
totals = data.groupby(by=['CustomerName']).sum()
Finally, sort by the Bill column in descending order and take the first four rows:
totals = totals.sort_values(by='Bill', ascending=False)
print(totals.head(4))
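For reference, the same result can also be obtained in a single chained expression (a sketch, assuming the column names above):
# sum Bill per customer and keep the 4 largest totals
top4 = data.groupby('CustomerName')['Bill'].sum().nlargest(4)
print(top4)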
I am currently working on IMDB 5000 movie dataset for a class project. The budget variable has a lot of zero values.
They are missing entries. I cannot drop them because they are 22% of my entire data.
What should I do in Python? Some people suggested binning; could you provide more details?
Well, there are a few options.
Take the average of the non-zero values and fill all the zeros with it. This yields 'tacky' results and is not best practice, since a few outliers can throw off the whole thing.
Use the median of the non-zero values; also not a great option, but less likely to be thrown off by outliers (see the sketch at the end of this answer).
Binning would mean splitting the movies into a certain number of budget groups, say budgets over or under a million: take the average budget, divide it by the number of groups you want, and use the resulting intervals to label each movie, a zero if it falls in group 0, a one if in group 1, and so on.
I think finding the actual budgets for the movies and replacing the bad itemized budgets with the real budget would be a good option, depending on the analysis you are doing. You could also take the median or average of each budget feature column as a percentage of the total budget, and then fill in the zeros with that percentage. If the median of the non-zero actor_pay column works out to actor_pay/budget = 60%, then filling a zeroed actor_pay value with 60 percent of that movie's budget would be an option.
Hard option: create a function that takes the non-zero values of a movie's budget and attempts to interpolate the movie's budget based upon the other movies' data in the table. This option is more like its own project, and the options above should really be tried first.
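As a concrete illustration of the median option above, here is a minimal sketch; the 'budget' column name and the sample values are assumptions, and zeros are treated as missing:
import numpy as np
import pandas as pd
# hypothetical stand-in for the budget column of the IMDB 5000 dataset
df = pd.DataFrame({'budget': [25_000_000, 0, 4_000_000, 0, 60_000_000]})
df['budget'] = df['budget'].replace(0, np.nan)              # treat zero budgets as missing
df['budget'] = df['budget'].fillna(df['budget'].median())   # fill with the median of the non-zero values
print(df)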