find mean by grouping two columns - python

I had a dataset that looked like this:
Value Type mean
-1.975767 Weather
-0.540979 Fruits
-2.359127 Fruits
-2.815604 Corona
-0.929755 Weather
I wanted to iterate through each row and calculate a mean using that row and all rows above it (only those whose Type matches). Mean is calculated by:
sum of all values / number of observations
where number of observations will be the number of times a Type has occurred so far.
For example, in the first row there's no Weather row above it, so for Weather n = 1 and the mean is -1.975767 / 1 = -1.975767.
In the second row there's no Fruits row above it, so the mean is just -0.540979 / 1 = -0.540979.
However, in the third row, scanning all previous rows shows that Fruits has already occurred once, so n = 2 for Fruits. We should take the previous Fruits value and calculate a new mean: (-0.540979 + (-2.359127)) / 2.
Value Type mean
-1.975767 Weather -1.975767
-0.540979 Fruits -0.540979
-2.359127 Fruits (-0.540979 -2.359127) / 2
-2.815604 Corona -2.815604
-0.929755 Weather (-1.975767 -0.929755) / 2
I used this to achieve this and it worked fine:
df['mean'] = df.groupby('type', as_index=False)['value'].expanding().mean().sort_index(level=1).droplevel(0)
However, now I want to do the same thing grouping by two columns, such that Country and Type both match.
Value Type mean Country
-1.975767 Weather Albania
-0.540979 Fruits Brazil --should be grouped
-2.359127 Fruits Brazil --should be grouped
-2.815604 Corona Albania
-0.929755 Weather China
I tried this:
df['mean'] = df.groupby([df.type,df.country], as_index=False)['value'].expanding().mean().sort_index(level=1).droplevel(0)
However, this gives me an error that:
TypeError: incompatible index of inserted column with frame index
even though it's almost the same thing. What am I doing wrong?

Try:
df["Mean"] = df.groupby(["Type", "Country"])["Value"].expanding().mean().droplevel([0,1]).sort_index()
>>> df
Value Type Country Mean
0 -1.975767 Weather Albania -1.975767
1 -0.540979 Fruits Brazil -0.540979
2 -2.359127 Fruits Brazil -1.450053
3 -2.815604 Corona Albania -2.815604
4 -0.929755 Weather China -0.929755
Input df:
df = pd.DataFrame({"Value": [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755],
                   "Type": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
                   "Country": ["Albania", "Brazil", "Brazil", "Albania", "China"]})

Related

calculate new mean using old mean

I had a dataset df that looked like this:
Value themes country date
-1.975767 Weather Brazil 2022-02-13
-0.540979 Fruits China 2022-02-13
-2.359127 Fruits China 2022-02-13
-2.815604 Corona China 2022-02-13
-0.712323 Weather UK 2022-02-13
-0.929755 Weather Brazil 2022-02-13
I grouped by themes and country to calculate the mean and count of values for each combination of theme and country (e.g. Weather, Brazil or Weather, UK)
df_calculations = df.groupby(["themes", "country"], as_index = False)["value"].mean()
df_calculations['count'] = df.groupby(["themes", "country"])["value"].count().tolist()
Then I added this info to a new table df_avg that looks like this:
country type mean count last_checked_date
Brazil Weather x 2 2022-02-13 #same for all rows
Brazil Corona y 2022-02-13
China Corona z 1 2022-02-13
China Fruits s 2 2022-02-13
However, there are now additional rows in the same original df.
Value themes country date
-1.975560 Weather Brazil 2022-02-15
-0.540123 Fruits China 2022-02-16
-2.359234 Fruits China 2022-02-16
-2.359234 Corona UK 2022-02-16
I want to go through the df rows whose date is after the last_checked_date.
Then I want to calculate a new mean for each combination again, but using the old mean and count from my df_avg table instead of re-calculating over the whole df.
How can I achieve this?
Please see this: Calculate new mean from old mean
Since you are maintaining a count (if not, it is pretty trivial) you can use that along with existing mean to calculate updated mean using the new observation.
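A minimal sketch of that incremental update, assuming df_avg has the columns shown above (country, type, mean, count, last_checked_date) and new_rows holds the df rows whose date is after last_checked_date; the names new_rows, new_stats and merged are illustrative:
import pandas as pd

# Sum and count only the new observations per (themes, country) pair.
new_stats = (new_rows.groupby(["themes", "country"])["Value"]
             .agg(new_sum="sum", new_count="count")
             .reset_index()
             .rename(columns={"themes": "type"}))

# An outer merge keeps combinations that exist only in df_avg or only in the new rows.
merged = df_avg.merge(new_stats, on=["country", "type"], how="outer")
cols = ["mean", "count", "new_sum", "new_count"]
merged[cols] = merged[cols].fillna(0)

# Standard update: new mean = (old_mean * old_count + sum of new values) / (old_count + new_count)
total = merged["count"] + merged["new_count"]
merged["mean"] = (merged["mean"] * merged["count"] + merged["new_sum"]) / total
merged["count"] = total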

merge two datasets to find a mean

I have two similar looking tables:
df1:
country type mean count last_checked_date
Brazil Weather x 2 2022-02-13
Brazil Corona y 3 2022-02-13
China Corona z 1 2022-02-13
China Fruits s 2 2022-02-13
df2
country type mean count last_checked_date
Ghana Weather a 2 2022-02-13
Brazil Corona b 5 2022-02-13
China Corona c 1 2022-02-13
Germany Fruits d 2 2022-02-13
I want to join df2 with df1 such that no combination of country, type is lost. For each combination of country and type, I want to calculate a mean value with this formula:
def find_new_values(old_mean, new_mean, old_count, new_count):
    mean = (old_mean + new_mean) / (old_count + new_count)
    count = old_count + new_count
    return mean, count
For example, in df2, China, Corona is present in df1 as well so the mean would be (c+z)/(1+1)
However, Ghana, Weather is present in df2 but not in df1 so in this case, I want to simply add a row to df1 as it is without the formula calculation.
How can I achieve this? What's the correct join/merge type to use here?
We may consider the problem this way: first combine them into one table,
df = pd.concat([df1, df2])
then use groupby to apply aggregations on each group of the rows that share the same country and type.
df.groupby(['country', 'type']).agg({'mean': 'mean', 'count': 'sum'})
For a country-type combination that occurs in only one of the dataframes, the corresponding group will contain just one row, so the aggregation functions won't change anything.
You may add 'last_checked_date': 'last' to the agg dict if needed.
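Put together for the two tables above (a sketch, assuming the mean column holds numeric values rather than the placeholder letters shown):
import pandas as pd

combined = pd.concat([df1, df2])
result = (combined.groupby(["country", "type"], as_index=False)
          .agg({"mean": "mean", "count": "sum", "last_checked_date": "last"}))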

how to get the index of a row after it satisfies a certain condition

I have a data frame with country names as the row index and the corresponding medals won in the summer and winter Olympics.
I want to get the country name which has the maximum difference between summer gold and winter gold; let's say the summer gold column is named x and the winter gold column is named y.
All the country names form the row index.
It is always good to provide a sample data frame so we can help better. I think you are looking for this:
(df.y-df.x).idxmax()
And if you care only about the absolute value of difference:
(df.x-df.y).abs().idxmax()
Example:
df = pd.DataFrame({'x':[1,2,3],'y':[2,10,5]}, index=['a','b','c'])
x y
a 1 2
b 2 10
c 3 5
print((df.y-df.x).abs().idxmax())
b

How to filter out entries in a data frame with specific and different values?

I have this real estate data:
neighborhood type_property type_negotiation price
Smallville house rent 2000
Oakville apartment for sale 100000
King Bay house for sale 250000
...
I have this groupby that identifies which values in the data set are a house for sale, and then returns the 10th and 90th percentile and quantity of these houses for each neighborhood in a new data frame called df_breakdown. The result looks like this:
neighborhood tenthpercentile ninetiethpercentile Quantity
King Bay 250000.0 250000.0 1
Smallville 99000.0 120000.0 8
Oakville 45000.0 160000.0 6
...
I now want to take this information back to my original real estate data set and filter out every listing that is a house for sale priced above the 90th percentile or below the 10th percentile, with respect to the percentiles calculated for its neighborhood. For example, I would want a house for sale in the Oakville neighborhood priced at 350000 filtered out.
I have used this argument before:
df1 = df[df.price < df.price.quantile(.90)]
But I don't know how to apply it with different cutoff values for each neighborhood, or even whether it is useful here. Thank you in advance for the help.
Probably not the most elegant, but you could join the percentile aggregations back onto the real estate data:
df.join(df.groupby('neighborhood').quantile([0.1, 0.9]), on='neighborhood')
On mobile, so forgive me if the syntax isn't perfect.
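Spelled out with the columns from the question (a sketch, assuming neighborhood is a regular column in df_breakdown and that only house-for-sale listings should be filtered):
# Attach each neighborhood's percentile bounds to the listings.
merged = df.merge(df_breakdown, on="neighborhood", how="left")

# Keep a listing if it is not a house for sale, or if its price lies inside the bounds.
is_house_sale = (merged["type_property"] == "house") & (merged["type_negotiation"] == "for sale")
in_bounds = merged["price"].between(merged["tenthpercentile"], merged["ninetiethpercentile"])
filtered = merged[~is_house_sale | in_bounds]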
You can set them to have the same index, broadcast the percentiles, and just use .between
So first,
df2 = df2.set_index('neighborhood')
df = df.set_index('neighborhood')
Then, broadcast using loc
df.loc[:, 't'], df.loc[:, 'n'] = df2.tenthpercentile, df2.ninetiethpercentile
Finally,
df.price.between(df.t, df.n)
which yields
neighborhood
Smallville False
Oakville True
King Bay True
King Bay False
dtype: bool
So to filter, just slice
df[df.price.between(df.t, df.n)]

Grouping and adding calculated columns to my dataframe

I have a dataframe that looks like this; I have made Continent my index field. I want it to show up a little differently: I would like the dataframe to have just the 3 continents, with all the countries that fall under each continent rolled up into a count
Continent Country
Oceania Australia 53 154.3 203.6 209.9
Europe Austria 28.2 49.3 59.7 59.9
Europe Belgium 33.2 70.3 83.4 82.8
Europe Denmark 18.6 26.0 38.9 36.1
Asia Japan 382.9 835.5 1028.1 1049.0
So my output would look like the following: it would show just the number of countries under each continent. When everything is combined into num_Countries, I would also like it to give the mean of all the values for those countries, so it's all rolled into one row per continent
Continent num_Countries mean
Oceania 1 209.9
Europe 3 328.2
Asia 1 382.9
I have tried to create these columns, but I can't get the new columns to be created, and when I do they come up as NaN values. For the continents, I can't get the groupby() function to work the way I want, because it doesn't roll all of the countries up into just the continents; it displays the full list of continents and countries.
You can use a pivot table for this. (I labeled the unlabeled columns with 1 to 4)
df.pivot_table(index="Continent", values=["Country", "1"],
               aggfunc=('count', 'mean'))
The following groups by 'Continent' and applies a function that counts the number of countries and finds the mean of means (I assumed this is what you wanted since you have 4 columns of numeric data for a number of countries per continent).
def f(group):
    return pd.DataFrame([{'num_Countries': group.Country.count(),
                          'mean': group.mean().mean()}])

grouped = df.groupby('Continent')
result = grouped.apply(f).reset_index(level=1, drop=True)
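An alternative sketch without apply, assuming the four unlabeled numeric columns are named "1" to "4" as in the first answer:
import pandas as pd

numeric_cols = ["1", "2", "3", "4"]

# Number of countries per continent.
counts = df.groupby("Continent")["Country"].count().rename("num_Countries")

# Mean of the per-column means within each continent, matching group.mean().mean() above.
means = df.groupby("Continent")[numeric_cols].mean().mean(axis=1).rename("mean")

result = pd.concat([counts, means], axis=1)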
