I have a dataset that shows the share of every disease from total diseases.
I want to find the names of the countries in which the share of AIDS is bigger than that of every other disease.
Try with
df.index[df['AIDS'].eq(df.drop(columns='Total').max(axis=1))]
Have a look at pandas max function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html)
With it you can get the maximum of each row, like this:
most_frequent_disease = df.drop(columns=['Total', 'Other']).max(axis=1)
Then you can create a condition to check whether AIDS is the most frequent disease, and apply it to your dataframe:
is_aids_most_frequent_disease = df.loc[:, 'AIDS'].eq(most_frequent_disease)
df[is_aids_most_frequent_disease]
You can also get the country names by adding .index at the end of that expression.
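The steps above can be sketched on a small example. The data, the country labels and the disease columns other than AIDS and Total are made up for illustration, and the sketch assumes the countries are in the index, as the answers do.

```python
import pandas as pd

# Hypothetical toy data: rows are countries, columns are disease shares.
df = pd.DataFrame(
    {
        "AIDS": [40.0, 10.0, 35.0],
        "Malaria": [30.0, 50.0, 20.0],
        "Cancer": [30.0, 40.0, 45.0],
        "Total": [100.0, 100.0, 100.0],
    },
    index=["Country A", "Country B", "Country C"],
)

# Largest share in each row, ignoring the 'Total' column.
row_max = df.drop(columns="Total").max(axis=1)

# Countries whose AIDS share equals that row maximum.
aids_countries = df.index[df["AIDS"].eq(row_max)].tolist()
print(aids_countries)
```

Comparing against the row maximum keeps ties, so a country where AIDS shares first place with another disease is still returned.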
I'm doing some Python exercises and I'm stuck with a question.
I'm using the following Titanic dataframe: https://drive.google.com/file/d/1NEHvlUMTNPusHZvHUFTqeUR_9yY1tHVz/view
Now I need to find the minimum value of the column 'Age' for each class of 'Pclass' for the passengers that paid a fare ('Fare') above the average.
Using this I can get the minimum age by group, but how can I add the 'above average Fare' condition to this?
df.groupby('Pclass')['Age'].min()
You can:
1. find the mean fare,
2. filter the rows above it,
3. use pivot_table to get the minimum value of the column 'Age' for each class of 'Pclass'.
avrg_Fare = df['Fare'].mean()
df = df.loc[df['Fare'] > avrg_Fare]
PVT_min_age = df.pivot_table(index='Pclass', values='Age', aggfunc='min').reset_index()
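Here is a minimal sketch of the same mean, filter, aggregate sequence using groupby, as in the question. The six rows are invented stand-ins for the Titanic file; only the column names 'Pclass', 'Age' and 'Fare' come from the question.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic data.
df = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 2, 3, 3],
        "Age": [38.0, 22.0, 30.0, 19.0, 4.0, 25.0],
        "Fare": [80.0, 60.0, 40.0, 10.0, 35.0, 5.0],
    }
)

# 1. Mean fare over all passengers.
average_fare = df["Fare"].mean()

# 2. Keep only the passengers who paid more than the mean
#    (stored under a new name so the original df is not overwritten).
above_average = df[df["Fare"] > average_fare]

# 3. Minimum age per class among those passengers.
min_age = above_average.groupby("Pclass")["Age"].min()
print(min_age)
```

A class with no passenger above the mean fare (class 3 in this toy data) simply does not appear in the result.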
Give this a shot
average_fare = df['Fare'].mean()
df.query("Fare > @average_fare").groupby('Pclass').agg({'Age': ['min']})
Grouping by with Where conditions in Pandas
I may have some syntax errors, since it's been a while since I've used pandas; if anyone sees a problem, please correct it.
This is the code I wrote, but the output is too big (over 6000 rows). How do I get only the first result for each year?
df_year = df.groupby('release_year')['genres'].value_counts()
Let's start with a small correction concerning the variable name:
value_counts returns a Series (not a DataFrame), so you should
not use a name starting with df.
Assume that the variable holding this Series is gen.
Then, one possible solution is:
result = gen.groupby(level=0).apply(lambda grp:
    grp.droplevel(0).sort_values(ascending=False).head(1))
Initially you wrote that you wanted the most popular genre in each year,
so I sorted each group in descending order and returned the first
row from the current group.
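As an alternative sketch (not the answer's method), the same result can be reached by counting with size() and picking each year's largest count with idxmax. The sample rows and the helper column name n are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: one row per (film, genre) pair.
df = pd.DataFrame(
    {
        "release_year": [2000, 2000, 2000, 2001, 2001, 2001],
        "genres": ["Drama", "Drama", "Comedy", "Action", "Comedy", "Comedy"],
    }
)

# Count how often each genre occurs in each year.
counts = df.groupby(["release_year", "genres"]).size().reset_index(name="n")

# For every year, keep the row holding the largest count.
top_rows = counts.loc[counts.groupby("release_year")["n"].idxmax()]
top_genre = top_rows.set_index("release_year")["genres"]
print(top_genre)
```

If two genres tie within a year, idxmax keeps only the first of them.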
I am working with pandas using a small dataset and I am stuck somewhere.
Here is the data after merging:
Using this data, the code below gives the minimum area of each region, together with the corresponding country name on the same line of the resulting DataFrame.
Area_min=Africa.groupby('Region').Area.agg([min])
Area_min['Country']=(Africa.loc[Africa.groupby('Region').Area.idxmin(), 'Names']).values
Area_min
And this one gives the maximum population of each region, together with the corresponding country name on the same line of the resulting DataFrame.
Pop_max=Africa.groupby('Region').Population.agg([max])
Pop_max['Country']=(Africa.loc[Africa.groupby('Region').Population.idxmax(), 'Names']).values
Pop_max
Now I am trying to get the average population of each region, together with the name of the country whose population is closest to that regional average, on the same line of the resulting DataFrame.
The code below gives the average population of each region, but I am stuck on matching it with the country name.
Pop_average=Africa.groupby('Region').Population.agg(['mean'])
I have thought about the .map() and .apply() functions and tried them without success. Any hint would be helpful.
Since you're grouping by only one column, it's more efficient to do it once.
Also, since you're using idxmin anyway, it seems it's redundant to do the first groupby.agg, since you can directly access the column names.
g = Africa.groupby('Region')
Area_min = Africa.loc[g['Area'].idxmin(), ['Names', 'Area']]
Pop_max = Africa.loc[g['Population'].idxmax(), ['Names', 'Population']]
Then, for your question, here's one approach. Use transform to broadcast each region's mean population to its rows, take the absolute difference between that mean and each country's population, and find where the difference is smallest using groupby + idxmin. Then use the loc accessor, as above, to get the desired outcome:
Pop_average = Africa.loc[((g['Population'].transform('mean') - Africa['Population']).abs()
                          .groupby(Africa['Region']).idxmin()),
                         ['Names', 'Population']]
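The approach can be checked step by step on a small frame. The country labels, regions and populations below are invented, and the extra Mean column is an addition that puts the regional average on the same line as the country.

```python
import pandas as pd

# Hypothetical stand-in for the merged Africa DataFrame.
Africa = pd.DataFrame(
    {
        "Names": ["A", "B", "C", "D", "E", "F"],
        "Region": ["North", "North", "North", "South", "South", "South"],
        "Population": [10, 20, 60, 4, 8, 15],
    }
)

# Regional mean repeated on every row of that region.
region_mean = Africa.groupby("Region")["Population"].transform("mean")

# Absolute distance of each country from its regional mean.
distance = (Africa["Population"] - region_mean).abs()

# Row label of the closest country in each region.
closest_idx = distance.groupby(Africa["Region"]).idxmin()

# Rows of those countries, with the regional mean alongside.
closest = Africa.loc[closest_idx, ["Region", "Names", "Population"]].assign(
    Mean=region_mean.loc[closest_idx]
)
print(closest)
```

When two countries are equally close to the mean, idxmin returns the first one it meets.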
I need to find which country has the highest percentage of people that earn >50K.
Here is a preview of the dataset used
Expected Answer is Iran with 41.9%
1994 census dataset
My approach
country = df[df['income'] == ">50K"][['sex', 'native.country']]
top = country.describe()
top.loc['top', 'native.country']
Try this:
p = (df[df['salary'] == '>50K']['native-country'].value_counts()
     / df['native-country'].value_counts() * 100).sort_values(ascending=False)
Then p.index[0] gives the country and p.iloc[0] gives its percentage.
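A compact sketch of the same calculation: the mean of a boolean column is the fraction of True values, so grouping the ">50K" flag by country gives the percentage directly. The rows are made up, and the column names 'income' and 'native.country' are taken from the question's code, so they are assumptions about the actual file.

```python
import pandas as pd

# Hypothetical sample with the question's column names.
df = pd.DataFrame(
    {
        "native.country": ["X", "X", "X", "X", "Y", "Y", "Z"],
        "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
    }
)

# True where the person earns more than 50K.
high_income = df["income"].eq(">50K")

# Percentage of high earners within each country.
pct = high_income.groupby(df["native.country"]).mean() * 100

top_country = pct.idxmax()  # country label
top_pct = pct.max()         # its percentage
print(top_country, round(top_pct, 1))
```

A country with no high earners gets 0 here, whereas dividing two value_counts results leaves it as NaN.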
Suppose you stored your dataset into a variable named new.
# convert the sex column to numeric codes
gender = {'male': 1, 'female': 2}
new.sex = [gender[item] for item in new.sex]
# keep only the rows with income above 50K
data = new.loc[new.income == '>50K', ['sex', 'native.country']]
result = data.groupby('native.country')['sex'].sum()
print(result)
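This prints a total per country for the people earning over 50K; result.idxmax() returns the name of the country with the largest total. Note that with the coding above each female counts as 2, so the totals are not plain head counts.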
Then, if you still want to find the percentage of population, you can easily do it by using:
total = data['sex'].sum()
list1 = []
for i in result:
    list1.append(i / total * 100)
print(list1)
I hope you find my answer helpful.
Happy Coding :)
I have a DataFrame with the following columns:
"Country Name"
"Indicator Code"
"Porcentaje de Registros" (as shown in the image). For each country there are 32 indicator codes, each with its percentage value.
The values are sorted in descending order, and I need to keep the 15 highest values for each country. For example, for Ecuador I need to know which 15 indicators have the highest values. I tried the following:
countries = gender['Country Name'].drop_duplicates().to_list()
for countries in countries:
    test = RelevantFeaturesByID[RelevantFeaturesByID['Country Name']==countries].set_index(["Country Name", "Indicator Code"]).iloc[0:15]
test
But it just returns the first 15 rows for one country.
What am I doing wrong?
There is a mistake in the loop statement for countries in countries: the loop variable has the same name as the list, and you then use countries again inside the loop. That is certainly a problem. You also overwrite test on every iteration, so only the last country's rows remain.
I am not sure I fully understood your aim, but this should be a good starting point:
# sorting with respect to countries and their percentage
df = df.sort_values(by=[df.columns[0],df.columns[-1]],ascending=[True,False])
# choosing unique values of country names
countries = df[df.columns[0]].unique()
test = []
for country in countries:
    test.append(df.loc[df["Country Name"] == country].iloc[0:15])
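A loop-free alternative, sketched under assumptions: sort by country and percentage, then let groupby(...).head(n) keep the first n rows of every country. The rows, the indicator codes and the second country are invented, and n is set to 2 so the toy output stays small (the question needs 15).

```python
import pandas as pd

# Hypothetical sample with the question's column names.
df = pd.DataFrame(
    {
        "Country Name": ["Ecuador", "Ecuador", "Ecuador", "Peru", "Peru", "Peru"],
        "Indicator Code": ["I1", "I2", "I3", "I1", "I2", "I3"],
        "Porcentaje de Registros": [0.2, 0.9, 0.5, 0.7, 0.1, 0.4],
    }
)

n = 2  # the question asks for 15

# Sort so that each country's highest percentages come first,
# then keep the first n rows of every country.
top_n = (
    df.sort_values(["Country Name", "Porcentaje de Registros"], ascending=[True, False])
    .groupby("Country Name")
    .head(n)
)
print(top_n)
```

The result is a single DataFrame holding every country's top rows, so no list of per-country frames has to be built and concatenated.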