I had a dataset df that looked like this:
Value themes country date
-1.975767 Weather Brazil 2022-02-13
-0.540979 Fruits China 2022-02-13
-2.359127 Fruits China 2022-02-13
-2.815604 Corona China 2022-02-13
-0.712323 Weather UK 2022-02-13
-0.929755 Weather Brazil 2022-02-13
I grouped by themes and country to calculate the mean and count of values for each combination of theme and country (e.g. Weather/Brazil or Weather/UK):
df_calculations = df.groupby(["themes", "country"], as_index=False)["value"].mean()
df_calculations['count'] = df.groupby(["themes", "country"])["value"].count().tolist()
Then I added this info to a new table df_avg that looks like this:
country type mean count last_checked_date
Brazil Weather x 2 2022-02-13 #same for all rows
Brazil Corona y 2022-02-13
China Corona z 1 2022-02-13
China Fruits s 2 2022-02-13
However, there are now additional rows in the same original df:
Value themes country date
-1.975560 Weather Brazil 2022-02-15
-0.540123 Fruits China 2022-02-16
-2.359234 Fruits China 2022-02-16
-2.359234 Corona UK 2022-02-16
I want to go through the df rows whose date is after the last_checked_date.
Then I want to calculate a new mean for each combination again, but using the old mean and count from my df_avg table instead of re-calculating over the whole df.
How can I achieve this?
Please see this: Calculate new mean from old mean
Since you are maintaining a count (if not, adding one is trivial), you can use it together with the existing mean to fold in each new observation without touching the old rows.
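For instance, a minimal sketch of that incremental update (the function name is illustrative; it assumes you keep the running mean and count per theme/country pair, as in df_avg):

# First select only the rows not yet folded in, e.g.:
# new_rows = df[df["date"] > last_checked_date]

def update_mean(old_mean, old_count, new_value):
    # Fold one new observation into the running mean without rescanning old rows
    new_count = old_count + 1
    new_mean = old_mean + (new_value - old_mean) / new_count
    return new_mean, new_count

# e.g. old mean -1.45 over 2 observations, new observation -2.0:
# update_mean(-1.45, 2, -2.0) -> (-1.6333..., 3)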
I have two similar-looking tables:
df1:
country type mean count last_checked_date
Brazil Weather x 2 2022-02-13
Brazil Corona y 3 2022-02-13
China Corona z 1 2022-02-13
China Fruits s 2 2022-02-13
df2:
country type mean count last_checked_date
Ghana Weather a 2 2022-02-13
Brazil Corona b 5 2022-02-13
China Corona c 1 2022-02-13
Germany Fruits d 2 2022-02-13
I want to join df2 with df1 such that no combination of country, type is lost. For each combination of country and type, I want to calculate a mean value with this formula:
def find_new_values(old_mean, new_mean, old_count, new_count):
    mean = (old_mean + new_mean) / (old_count + new_count)
    count = old_count + new_count
    return mean, count
For example, China/Corona is present in df2 as well as in df1, so its mean would be (c + z)/(1 + 1).
However, Ghana/Weather is present in df2 but not in df1, so in this case I want to simply add the row to df1 as-is, without the formula calculation.
How can I achieve this? What's the correct join/merge type to use here?
We may consider the problem this way: first combine the two tables into one,
df = pd.concat([df1, df2])
then use groupby to apply aggregations on each group of the rows that share the same country and type.
df.groupby(['country', 'type']).agg({'mean': 'mean', 'count': 'sum'})
For a country-type combination that occurs in only one of the dataframes, the corresponding group will contain a single row and the aggregation functions won't change anything.
You may add 'last_checked_date': 'last' to the agg dict if needed.
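A runnable sketch of the whole approach (the small frames below are illustrative; note that the 'mean' aggregation takes the plain average of the two group means, which matches the asker's formula when both counts are 1):

import pandas as pd

df1 = pd.DataFrame({"country": ["Brazil", "China"],
                    "type": ["Weather", "Corona"],
                    "mean": [-1.45, -2.81],
                    "count": [2, 1]})
df2 = pd.DataFrame({"country": ["Ghana", "China"],
                    "type": ["Weather", "Corona"],
                    "mean": [-0.93, -1.20],
                    "count": [2, 1]})

combined = (pd.concat([df1, df2])
              .groupby(["country", "type"], as_index=False)
              .agg({"mean": "mean", "count": "sum"}))
# Ghana/Weather survives unchanged (its group has a single row);
# China/Corona becomes (-2.81 + -1.20) / 2 with the counts summed.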
I had a dataset that looked like this:
Value Type mean
-1.975767 Weather
-0.540979 Fruits
-2.359127 Fruits
-2.815604 Corona
-0.929755 Weather
I wanted to go through the rows and, for each row, calculate the mean of that row and all rows above it (only where the Type matches). The mean is calculated by:
sum of all values / number of observations
where the number of observations is the number of times that Type has occurred so far.
For example, in the first row there is no Weather row above it, so for Weather n = 1 and the mean is -1.975767 / 1 = -1.975767.
In the second row there is no Fruits row above it, so the mean is just -0.540979 / 1 = -0.540979.
However, in the third row, scanning all previous rows shows that Fruits has already occurred, so n = 2 for Fruits. We take the previous Fruits value and calculate a new mean: (-0.540979 + (-2.359127)) / 2.
Value Type mean
-1.975767 Weather -1.975767
-0.540979 Fruits -0.540979
-2.359127 Fruits (-0.540979 -2.359127) / 2
-2.815604 Corona -2.815604
-0.929755 Weather (-1.975767 -0.929755) / 2
I used this to achieve this and it worked fine:
df['mean'] = df.groupby('type', as_index=False)['value'].expanding().mean().sort_index(level=1).droplevel(0)
However, now I want to do the same thing grouping on two columns, such that Country and Type both match.
Value Type mean Country
-1.975767 Weather Albania
-0.540979 Fruits Brazil --should be grouped
-2.359127 Fruits Brazil --should be grouped
-2.815604 Corona Albania
-0.929755 Weather China
I tried this:
df['mean'] = df.groupby([df.type,df.country], as_index=False)['value'].expanding().mean().sort_index(level=1).droplevel(0)
However, this gives me an error:
TypeError: incompatible index of inserted column with frame index
even though it's almost the same thing. What am I doing wrong?
Try the following. The expanding mean comes back with a MultiIndex of (Type, Country, original row index), so drop the two group levels and sort by the original index so that the result aligns with df:
df["Mean"] = df.groupby(["Type", "Country"])["Value"].expanding().mean().droplevel([0,1]).sort_index()
>>> df
Value Type Country Mean
0 -1.975767 Weather Albania -1.975767
1 -0.540979 Fruits Brazil -0.540979
2 -2.359127 Fruits Brazil -1.450053
3 -2.815604 Corona Albania -2.815604
4 -0.929755 Weather China -0.929755
Input df:
import pandas as pd

df = pd.DataFrame({"Value": [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755],
                   "Type": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
                   "Country": ["Albania", "Brazil", "Brazil", "Albania", "China"]})
I'm pretty new to Python and pandas, and know only the basics. I'm currently conducting research and I need your kind help.
Let’s say I have data on births, containing 2 variables: Date and Country.
Date Country
1.1.20 USA
1.1.20 USA
1.1.20 Italy
1.1.20 England
2.1.20 Italy
2.1.20 Italy
3.1.20 USA
3.1.20 USA
Now I want to create a third variable, let's call it 'Births', which contains the number of births in a country on a given date. In other words, I want to keep just one row for each date + country combination by counting the rows for each such pair, so I end up with something like this:
Date Country Births
1.1.20 USA 2
1.1.20 Italy 1
1.1.20 England 1
2.1.20 Italy 2
3.1.20 USA 2
I’ve tried many things, but nothing seemed to work. Any help will be much appreciated.
Thanks,
Eran
I guess you can use the groupby method of your DataFrame, then use the size method to count the number of individuals in each group:
df.groupby(by=['Date', 'Country']).size().reset_index(name='Births')
Output:
Date Country Births
0 1.1.20 England 1
1 1.1.20 Italy 1
2 1.1.20 USA 2
3 2.1.20 Italy 2
4 3.1.20 USA 2
Also, the pandas documentation has several examples of group-by operations: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
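As a side note, on pandas 1.1+ the same count can be written with DataFrame.value_counts (sort=False avoids ordering the result by count):

df.value_counts(subset=["Date", "Country"], sort=False).reset_index(name="Births")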
I've tried to use df2.nlargest(5, ['1960']), which gives me:
Country Name Country Code ... 2017 2018
0 IDA & IBRD total IBT ... 6335039629.0000 6412522234.0000
1 Low & middle income LMY ... 6306560891.0000 6383958209.0000
2 Middle income MIC ... 5619111361.0000 5678540888.0000
3 IBRD only IBD ... 4731120193.0000 4772284113.0000
6 Upper middle income UMC ... 2637690770.0000 2655635719.0000
This is somewhat right, but it's outputting all the columns. I just want it to include the columns "Country Name" and "1960" only, sorted by the column "1960".
So the output should look like this...
Country Name 1960
China 5000000000
India 499999999
USA 300000
France 100000
Germany 90000
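One way to get there (a sketch, assuming the columns are literally named "Country Name" and "1960"): select the two columns first, then take the five largest values of "1960".

# Keep only the two columns of interest, then sort/trim by "1960"
df2[["Country Name", "1960"]].nlargest(5, "1960")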
I have a dataframe that looks like this (I have made Continent my index field). I want it to show up a little differently: I would like the dataframe to have just the 3 continents, with all the countries that fall under each continent rolled up as a count.
Continent Country
Oceania Australia 53 154.3 203.6 209.9
Europe Austria 28.2 49.3 59.7 59.9
Europe Belgium 33.2 70.3 83.4 82.8
Europe Denmark 18.6 26.0 38.9 36.1
Asia Japan 382.9 835.5 1028.1 1049.0
So my output would look like the following: it shows just the number of countries under each continent, and when everything is combined into num_Countries it also gives the mean of all the values for those countries, so it's all rolled into one row per continent.
Continent num_Countries mean
Oceania 1 209.9
Europe 3 328.2
Asia 1 382.9
I have tried to create these columns, but when I do they come up as NaN values, and I can't get the groupby() function to work the way I want: it doesn't roll all of the countries up into just the continents, it displays the full list of continents and countries.
You can use a pivot table for this. (I labeled the unlabeled columns with 1 to 4)
df.pivot_table(index="Continent", values=["Country", "1"],
               aggfunc=('count', 'mean'))
The following groups by 'Continent' and applies a function that counts the number of countries and finds the mean of means (I assumed this is what you wanted since you have 4 columns of numeric data for a number of countries per continent).
def f(group):
    # Count the countries and take the mean across the group's numeric columns
    # (numeric_only=True skips the non-numeric Country column on recent pandas)
    return pd.DataFrame([{'num_Countries': group.Country.count(),
                          'mean': group.mean(numeric_only=True).mean()}])
grouped = df.groupby('Continent')
result = grouped.apply(f).reset_index(level=1, drop=True)
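For reference, the above runs end-to-end on a reconstruction of the sample frame (the numeric columns are labeled "1" to "4" as in the pivot-table note; the values are taken from the question):

import pandas as pd

df = pd.DataFrame({
    "Continent": ["Oceania", "Europe", "Europe", "Europe", "Asia"],
    "Country": ["Australia", "Austria", "Belgium", "Denmark", "Japan"],
    "1": [53.0, 28.2, 33.2, 18.6, 382.9],
    "2": [154.3, 49.3, 70.3, 26.0, 835.5],
    "3": [203.6, 59.7, 83.4, 38.9, 1028.1],
    "4": [209.9, 59.9, 82.8, 36.1, 1049.0],
}).set_index("Continent")
# Running the grouped/result lines above then yields one row per continent
# with num_Countries and the mean of all its numeric cells.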