Find the row closest to the mean of a DataFrame column - python

I am working with pandas on a small dataset and I am stuck.
Here is the data after merging:
Using this data, the code below gives the minimum area of each region, together with the corresponding country name on the same row of the resulting DataFrame:
Area_min=Africa.groupby('Region').Area.agg([min])
Area_min['Country']=(Africa.loc[Africa.groupby('Region').Area.idxmin(), 'Names']).values
Area_min
And this one gives the maximum population of each region, together with the corresponding country name on the same row of the resulting DataFrame:
Pop_max=Africa.groupby('Region').Population.agg([max])
Pop_max['Country']=(Africa.loc[Africa.groupby('Region').Population.idxmax(), 'Names']).values
Pop_max
Now I am trying to get the average population of each region, together with the name of the country whose population is closest to the average of the corresponding group, on the same row of the resulting DataFrame.
The code below gives the average population of each region, but I am stuck on matching it with the country name.
Pop_average=Africa.groupby('Region').Population.agg(['mean'])
I am thinking about the .map() and .apply() functions, but my attempts have been unsuccessful. Any hint will be helpful.

Since you're grouping by the same column each time, it's more efficient to create the groupby object once and reuse it.
Also, since you're using idxmin anyway, the separate groupby.agg call is redundant: idxmin gives you row labels, from which you can pull both columns directly with loc.
g = Africa.groupby('Region')
Area_min = Africa.loc[g['Area'].idxmin(), ['Names', 'Area']]
Pop_max = Africa.loc[g['Population'].idxmax(), ['Names', 'Population']]
Then for your question, here's one approach: transform the population mean within each group, take the absolute difference between that mean and each row's population, and find the location where the difference is smallest using abs + groupby + idxmin; then use the loc accessor as above to get the desired outcome:
Pop_average = Africa.loc[
    (g['Population'].transform('mean') - Africa['Population'])
    .abs()
    .groupby(Africa['Region'])
    .idxmin(),
    ['Names', 'Population']
]
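As a quick sanity check, here is a minimal, self-contained sketch of the same idea; the column names (Region, Names, Population) follow the question, but the values are invented:
import pandas as pd

# Invented sample data for illustration.
Africa = pd.DataFrame({
    'Region': ['East', 'East', 'East', 'West', 'West'],
    'Names': ['A', 'B', 'C', 'D', 'E'],
    'Population': [10, 20, 60, 5, 15],
})

g = Africa.groupby('Region')
# Absolute distance of each row's population from its group mean.
dist = (g['Population'].transform('mean') - Africa['Population']).abs()
# Row label of the smallest distance per region, then look up those rows.
Pop_average = Africa.loc[dist.groupby(Africa['Region']).idxmin(),
                         ['Names', 'Population']]
print(Pop_average)
# East's mean is 30, so 'B' (20) is closest; West's mean is 10, where
# 'D' (5) and 'E' (15) tie and idxmin keeps the first occurrence ('D').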

Related

Filtering the Min Value of a Mean in a GroupBy Pandas

I have a Star Wars People df with the following columns:
columns = [name, height, mass, birth_year, gender, homeworld]
name is the index
I need to compute the following:
Which is the planet with the lowest average mass index of its characters?
Which character/s are from that planet?
What I tried:
df.groupby(["homeworld"]).filter(lambda row: row['mass'].mean() > 0).min()
However, I need to have the min() inside the filter because there can be more than one character from the homeworld that has this lowest average mass index. Right now the filter function is not doing anything; it is just to show how I want the code to look.
How can I achieve that? Hopefully with the filter function.
Use:
# aggregate the mean into a Series
s = df.groupby("homeworld")['mass'].mean()
# keep positive means and get the homeworld with the minimum value
out = s[s.gt(0)].idxmin()
# filter the original DataFrame
df1 = df[df['homeworld'].eq(out)]
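As a quick check, here is the same three-step recipe run on a tiny invented frame (names and numbers are made up; 'name' is the index as in the question):
import pandas as pd

# Invented sample data for illustration.
df = pd.DataFrame(
    {'homeworld': ['Tatooine', 'Tatooine', 'Naboo', 'Naboo'],
     'mass': [77.0, 84.0, 45.0, 32.0]},
    index=['Luke', 'Biggs', 'Padme', 'Jar Jar'])

s = df.groupby("homeworld")['mass'].mean()   # Tatooine 80.5, Naboo 38.5
out = s[s.gt(0)].idxmin()                    # 'Naboo'
df1 = df[df['homeworld'].eq(out)]            # the Padme and Jar Jar rows
print(out, df1, sep='\n')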
What do you mean by "more than 1 character in the homeworld that have this lowest average mass index"?
It should not matter how many characters are present per homeworld; the groupby aggregation with the mean method will calculate the averages for you.
Looking at the question, you can just do the groupby like so (sorting in ascending order, since you want the lowest average mass; numeric_only avoids errors on the string columns):
df = df.groupby(['homeworld']).mean(numeric_only=True).sort_values(by=["mass"], ascending=True)
df.head(1)
Then note the homeworld that is displayed.

pandas: computing a new column as an average based on two other columns

So I have this dataset of temperatures. Each line describes the temperature in Celsius, measured by hour over a day.
I need to compute a new variable called avg_temp_ar_mensal, which represents the average temperature of a city in a month. In this dataset, the city is represented as estacao and the month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store, in a new column, the average temperature per city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but it is wrong: it calculates the mean across every city in the dataset, which adds noise to my data. I need to separate the temperatures by month and city and then calculate the mean.
The DataFrame produced by a groupby aggregation is smaller than the initial DataFrame; that is why your code runs into an error.
There are two ways to solve this problem. The first is to use transform, which keeps the original shape so the result can be assigned directly:
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new DataFrame dfn from the groupby result and merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
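The transform variant can be sanity-checked on a tiny invented frame (the column names mes, estacao, temp_ar follow the question; the numbers are made up):
import pandas as pd

# Invented sample data for illustration.
df = pd.DataFrame({
    'mes': [1, 1, 1, 2],
    'estacao': ['A', 'A', 'B', 'A'],
    'temp_ar': [20.0, 22.0, 30.0, 18.0],
})

# transform keeps the original shape, so it can be assigned directly.
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
print(df)
# Both (mes=1, estacao='A') rows get 21.0; (1, 'B') gets 30.0; (2, 'A') gets 18.0.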
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...); that selection no longer carries the other columns you want to group by.
Instead, perform the groupby on all the columns you need, and make sure the result aligns with the original rows: a plain mean() returns one row per group, so assigning it directly would not align. transform does the alignment for you:
df['new_column'] = df[['city_column', 'month_column', 'temp_column']].groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df.

Is there a way to get the first row in a grouped dataframe?

This is the code I wrote, but the output is too big (over 6000 rows). How do I get the first result for each year?
df_year = df.groupby('release_year')['genres'].value_counts()
Let's start with a small correction concerning the variable name:
value_counts returns a Series (not a DataFrame), so you should not use a name starting with df.
Assume that the variable holding this Series is gen.
Then one possible solution is:
result = gen.groupby(level=0).apply(
    lambda grp: grp.droplevel(0).sort_values(ascending=False).head(1))
Initially you wrote that you wanted the most popular genre in each year, so I sorted each group in descending order and returned the first row of the current group.
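Here is a minimal sketch of that on invented data (years and genres are made up); note that value_counts already returns counts in descending order, so the explicit sort_values is mostly a safeguard:
import pandas as pd

# Invented sample data for illustration.
df = pd.DataFrame({
    'release_year': [2000, 2000, 2000, 2001, 2001],
    'genres': ['Drama', 'Drama', 'Comedy', 'Action', 'Action'],
})

gen = df.groupby('release_year')['genres'].value_counts()
result = gen.groupby(level=0).apply(
    lambda grp: grp.droplevel(0).sort_values(ascending=False).head(1))
print(result)
# release_year  genres
# 2000          Drama     2
# 2001          Action    2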

How to find the biggest value in the pandas dataset

I have a dataset that shows the share of every disease out of the total.
I want to find the countries where the share of AIDS is bigger than that of any other disease.
Try with:
df.index[df.AIDS.eq(df.drop(columns='Total').max(axis=1))]
Have a look at the pandas max function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html).
Here you can get the max for each row:
most_frequent_disease = df.drop(columns=['Total', 'Other']).max(axis=1)
Then you can create a condition to check whether AIDS is the most frequent disease, and apply it to your DataFrame:
is_aids_most_frequent_disease = df.loc[:, 'AIDS'].eq(most_frequent_disease)
df[is_aids_most_frequent_disease]
You can get the country names by appending .index to the last expression.
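For illustration, a small sketch with invented shares, assuming the disease names are columns and the countries form the index (as both answers imply):
import pandas as pd

# Invented shares for illustration.
df = pd.DataFrame({
    'AIDS': [0.40, 0.10],
    'Malaria': [0.30, 0.50],
    'Other': [0.30, 0.40],
    'Total': [1.00, 1.00],
}, index=['CountryA', 'CountryB'])

# Row-wise maximum over the actual diseases only.
most_frequent_disease = df.drop(columns=['Total', 'Other']).max(axis=1)
is_aids_most_frequent_disease = df.loc[:, 'AIDS'].eq(most_frequent_disease)
print(df.index[is_aids_most_frequent_disease])   # Index(['CountryA'], dtype='object')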

How do I assign 'other' to low frequency categories? (pandas)

I have a 'city' column which has more than 1000 unique entries. (The entries are integers for some reason and are currently assigned float type.)
I tried df['city'].value_counts()/len(df) to get their frequencies. It returned a table; the first few values were 0.12, .4, .4, .3...
I'm a complete beginner, so I'm not sure how to use this information to assign everything in, say, the bottom 10th percentile to 'other'.
I want to reduce the unique city values from 1000 to something like 10, so I can later use get_dummies on this.
Let's go through the logic of the expected actions:
Count the frequency of every city
Calculate the bottom 10th-percentile cutoff
Find the cities with frequencies below that cutoff
Change them to 'other'
You started in the right direction. To get frequencies for every city:
city_freq = (df['city'].value_counts())/df.shape[0]
We want to find the bottom 10%, using pandas' quantile:
bottom_decile = city_freq.quantile(q=0.1)
Now bottom_decile is a float marking the cutoff that separates the bottom 10% from the rest. Cities with frequencies at or below that cutoff:
less_freq_cities = city_freq[city_freq <= bottom_decile]
less_freq_cities holds the entries for those cities. To change their value in df to "other" (note that the assignment targets only the "city" column; without it, every column of the matching rows would be overwritten):
df.loc[df["city"].isin(less_freq_cities.index), "city"] = "other"
Complete code:
city_freq = df['city'].value_counts() / df.shape[0]
bottom_decile = city_freq.quantile(q=0.1)
less_freq_cities = city_freq[city_freq <= bottom_decile]
df.loc[df["city"].isin(less_freq_cities.index), "city"] = "other"
This is how you replace the bottom 10% (or any share you want; just change the q parameter in quantile) with a value of your choice.
EDIT:
As suggested in a comment, it is cleaner to get normalized frequencies with
city_freq = df['city'].value_counts(normalize=True)
instead of dividing by the length. But we don't actually need normalized frequencies: pandas' quantile works even if they are not normalized, so
city_freq = df['city'].value_counts()
will still work.
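To see why normalization doesn't matter here: dividing every count by the same total rescales the quantile cutoff by the same factor, so the same cities fall below it either way. A quick check with invented counts:
import pandas as pd

# Invented value counts for illustration.
counts = pd.Series({'city_a': 500, 'city_b': 300, 'city_c': 5, 'city_d': 3})
freqs = counts / counts.sum()

# The same cities fall below the 10th-percentile cutoff either way.
print(counts[counts <= counts.quantile(0.1)].index.tolist())   # ['city_d']
print(freqs[freqs <= freqs.quantile(0.1)].index.tolist())      # ['city_d']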
