Find average of row and column groups pandas - python

I want to find the states with the highest average total revenue and be able to see states with the 40-45th highest average, 35-40th, etc for all states from 1992-2016.
Data is organized in a dataframe as shown below. So ideally I could have another column like the following. I think this is what I am trying to do.
STATE // YEAR // TOTAL_REVENUE // AVG_TOTAL_REVENUE
ALABAMA // 1992 // 5000 // 6059
ALABAMA // 1993 // 4000 // 6059
ALASKA // 1992 // 3000 // 2059
ALABAMA // 1996 // 6019 // 6059
Is this possible to do? I am not sure if I am stating what I want to do correctly, and I am not sure what to search for to figure out a way forward.

Assuming your input looks like:
STATE YEAR TOTAL_REVENUE
Michigan 2001 1000
Michigan 2002 2000
California 2003 3000
California 2004 4000
Michigan 2005 5000
Then just do:
import numpy as np
df['AVG_TOTAL_REVENUE'] = np.nan
states = df['STATE'].tolist()
states = list(set(states))  # unique states
for state in states:
    state_values = df[df['STATE'] == state]
    revenues = state_values['TOTAL_REVENUE'].tolist()
    revenues = [float(x) for x in revenues]
    avg = sum(revenues) / len(revenues)
    # assign through df.loc[rows, column] to avoid chained-indexing warnings
    df.loc[state_values.index, 'AVG_TOTAL_REVENUE'] = avg
which gives you:
STATE YEAR TOTAL_REVENUE AVG_TOTAL_REVENUE
0 Michigan 2001 1000 2666.666667
1 Michigan 2002 2000 2666.666667
2 California 2003 3000 3500.000000
3 California 2004 4000 3500.000000
4 Michigan 2005 5000 2666.666667
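For what it's worth, the same column can be produced in one step with groupby/transform, which avoids the explicit loop (a minimal sketch using the same column names as above):
# broadcast each state's mean revenue back onto every row of that state
df['AVG_TOTAL_REVENUE'] = df.groupby('STATE')['TOTAL_REVENUE'].transform('mean')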

If your data is stored in a pandas dataframe called df, you can set STATE as the index and try:
df.set_index("STATE",inplace=True)
avg_revenue = df.groupby(level=0)["TOTAL_REVENUE"].agg("mean")
df["AVG_TOTAL_REVENUE"] = avg_revenue.loc[df.index]
df = df.sort_values(by="AVG_TOTAL_REVENUE",ascending=False)
Regarding the "40-45th highest average", I'm not sure exactly what you're looking for. But you could do this for instance:
import numpy as np
# positional slice: the rows between the 40% and 45% marks of the sorted dataframe
bounds = (np.array([0.40, 0.45]) * len(df)).astype(int)
df.iloc[bounds[0]:bounds[1], :]
# Or with quantiles
min_q,max_q = (0.40, 0.45)
avg = df.AVG_TOTAL_REVENUE
df.loc[(avg >= avg.quantile(min_q)) & (avg <= avg.quantile(max_q)), :]
Or maybe you want to bin your data every 5 states in order of AVG_TOTAL_REVENUE?
df_grouped = df.groupby("STATE")["AVG_TOTAL_REVENUE"].agg("first")
n_bins = int(df_grouped.shape[0] / 5)
bins = (pd.cut(df_grouped, bins=n_bins)
        .reset_index()
        .groupby("AVG_TOTAL_REVENUE")
        .agg(list))


How to calculate percentage variation between two values?

I have this dataframe with the total population number by year.
import pandas as pd
cases_df = pd.DataFrame(data=cases_list, columns=['Year', 'Population', 'Nation'])
cases_df.head(7)
Year Population Nation
0 2019 328239523 United States
1 2018 327167439 United States
2 2017 325719178 United States
3 2016 323127515 United States
4 2015 321418821 United States
5 2014 318857056 United States
6 2013 316128839 United States
I want to calculate how much the population has increased from the year 2013 to 2019 by calculating the percentage change between two values (2013 and 2019):
((328239523 - 316128839) / 316128839) * 100
How can I do this? Thank you very much!!
P.S. Any advice on removing the index (0 1 2 3 4 5 6)?
I tried this:
df1 = df.groupby(level='Population').pct_change()
print(df1)
but I get an error because 'Population' is not the name of an index level.
I would do it the following way:
import pandas as pd
df = pd.DataFrame({"year":[2015,2014,2013],"population":[321418821,318857056,316128839],"nation":["United States","United States","United States"]})
df = df.set_index("year")
df["percentage"] = df["population"] * 100 / df["population"][2013]
print(df)
output
population nation percentage
year
2015 321418821 United States 101.673363
2014 318857056 United States 100.863008
2013 316128839 United States 100.000000
Note that I used a subset of the data for brevity's sake. Using year as the index allows easy access to the population value for 2013; the percentage is computed as population * 100 / (population in 2013).
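If you want the increase relative to 2013 rather than an index based at 100, a small variation on the same idea (using the df defined above) would be:
# percentage change relative to 2013 instead of a 100-based index
df["pct_change"] = (df["population"] - df["population"][2013]) * 100 / df["population"][2013]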
To remove the mentioned index:
cases_df.set_index('Year', inplace=True)
Now Year will replace your numbered index.
You can then use cases_df.describe() or cases_df.attribute_name.describe().
This is more of a math question rather than a programming question.
Let's call this a percentage difference between two values since population can vary both ways (increase or decrease over time).
Now, let's say that in 2013 we had 316128839 people and in 2019 we had 328239523 people:
a = 316128839
b = 328239523
Before we go about calculating the percentage, we need to find the difference between b and a:
diff = b - a
Now that we have that, we need to see what percentage diff is of a:
perc = (diff / a) * 100
And there is your percentage variation between a and b.
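If you want the same calculation straight from the dataframe in the question, a minimal sketch (assuming cases_df with the Year and Population columns shown above) could look like this:
pop_2013 = cases_df.loc[cases_df['Year'] == 2013, 'Population'].iloc[0]
pop_2019 = cases_df.loc[cases_df['Year'] == 2019, 'Population'].iloc[0]
pct_change = (pop_2019 - pop_2013) / pop_2013 * 100
print(round(pct_change, 2))  # roughly 3.83 for the numbers above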

How to add categorical variable to dataframe?

I'm working on a World Happiness Report project that includes datasets from 2015 to 2019. I concatenated them into a final dataframe to get the average of parameters (economy, health, etc.) for every country across that time span. But what I forgot to add was the respective region that each country is in (e.g. England - Western Europe). How could I add the 'Region' column to my final dataframe and be sure that each region matches its respective country?
Not sure if this is what you are looking for.
You may want to do something like this:
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
Or you can also use a merge statement. The assumption is that for each country, you have a region that it can map to.
df = pd.merge(df,region_df,how='left',on = ['Country'])
Make sure you have indexed both dataframes on Country before you merge to get an optimized response.
Data setup:
import pandas as pd
c = ['Country','Happiness Score','Other_fields']
d = [['Denmark', 7.5460, 1.25],
     ['Norway', 7.5410, 1.50],
     ['Finland', 7.5378, 1.85]]
region_cols = ['Country','Region']
region_data = [['Denmark','R1'],['Norway','R2'],['Finland','R3']]
df = pd.DataFrame(data = d, columns = c)
region_df = pd.DataFrame(data = region_data, columns = region_cols)
Based on the lookup DataFrame, you can do a map to check for Country and assign Region to df.
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
print (df)
Your result will be as follows:
Base DataFrame:
Country Happiness Score Other_fields
0 Denmark 7.5460 1.25
1 Norway 7.5410 1.50
2 Finland 7.5378 1.85
Lookup DataFrame:
Country Region
0 Denmark R1
1 Norway R2
2 Finland R3
Updated DataFrame:
Country Happiness Score Other_fields Region
0 Denmark 7.5460 1.25 R1
1 Norway 7.5410 1.50 R2
2 Finland 7.5378 1.85 R3

Calculating new rows in a Pandas Dataframe on two different columns

So I'm a beginner at Python and I have a dataframe with Country, avgTemp and year.
What I want to do is calculate new rows for each country where the year is increased by 20 and avgTemp is multiplied by a variable called tempChange. I don't want to remove the previous values though, I just want to append the new values.
This is how the dataframe looks:
Preferably, I would also want to create a loop that runs the code a certain number of times.
Super grateful for any help!
If you need to copy the values from the dataframe as an example you can have it here:
Country avgTemp year
0 Afghanistan 14.481583 2012
1 Africa 24.725917 2012
2 Albania 13.768250 2012
3 Algeria 23.954833 2012
4 American Samoa 27.201417 2012
243 rows × 3 columns
If you want to repeat the rows, I'd create a new dataframe, perform any operation in the new dataframe (add 20 years, multiply the temperature by a constant or an array, etc.), and then use concat() to append it to the original dataframe:
import pandas as pd
tempChange=1.15
data = {'Country':['Afghanistan','Africa','Albania','Algeria','American Samoa'],'avgTemp':[14,24,13,23,27],'Year':[2012,2012,2012,2012,2012]}
df = pd.DataFrame(data)
df_2 = df.copy()
df_2['avgTemp'] = df['avgTemp']*tempChange
df_2['Year'] = df['Year']+20
df = pd.concat([df,df_2]) #ignore_index=True if you wish to not repeat the index value
print(df)
Output:
Country avgTemp Year
0 Afghanistan 14.00 2012
1 Africa 24.00 2012
2 Albania 13.00 2012
3 Algeria 23.00 2012
4 American Samoa 27.00 2012
0 Afghanistan 16.10 2032
1 Africa 27.60 2032
2 Albania 14.95 2032
3 Algeria 26.45 2032
4 American Samoa 31.05 2032
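The question also mentions running this a certain number of times; one possible sketch of that loop, reusing the df and tempChange defined above (n_steps is just an illustrative value):
n_steps = 3  # hypothetical number of 20-year steps to append
for _ in range(n_steps):
    last = df[df['Year'] == df['Year'].max()].copy()  # rows for the most recent year
    last['avgTemp'] = last['avgTemp'] * tempChange
    last['Year'] = last['Year'] + 20
    df = pd.concat([df, last], ignore_index=True)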
Where df is your dataframe name:
df['tempChange'] = df['year'] + 20 * df['avgTemp']
This will add a new column to your df with the logic above. I'm not sure if I understood your logic correctly, so the math may need some work.
I believe that what you're looking for is
dfName['newYear'] = dfName.apply(lambda x: x['year'] + 20,axis=1)
dfName['tempDiff'] = dfName.apply(lambda x: x['avgTemp']*tempChange,axis=1)
This is how you apply a function to each row.
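As a side note, simple arithmetic like this does not need apply at all; a vectorized sketch over the same columns (assuming dfName and tempChange are defined as above):
# vectorized equivalents of the two apply calls above
dfName['newYear'] = dfName['year'] + 20
dfName['tempDiff'] = dfName['avgTemp'] * tempChange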

Python How to create an else if statement within a dataframe that is dependent on another column

e.g. I have created a dataframe. There is a heading for years. I want to populate it with random values for each year (within a certain limit). If the year is 2001 then the values should be randomly selected from 15000 to 20000. If the year is 2010 then the values can be from 5000 to 7000.
I have df['mileage'] = np.random.randint(0, 20000, 100) and that returns different values for the df, but some values for 2001 are lower than those for 2010. I would like to change it so that 2001 values are higher than 2010 values.
The years have also been randomly generated. It looks like this:
year fuel mileage status sex licence_type
2006 diesel 19184 fail male full
2007 diesel 9186 fail female full
You could use something like:
import numpy as np
is_2001 = df['year'] == 2001
# boolean masks act as 0/1, so each row gets the bounds for its year
lower_bound = is_2001 * 15000 + ~is_2001 * 5000
upper_bound = is_2001 * 20000 + ~is_2001 * 7000
# np.random.randint accepts array-like bounds, one pair per row
df['mileage'] = np.random.randint(lower_bound, upper_bound)
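If more than two years need their own ranges, one way to generalize this is np.select (a hypothetical sketch; the per-year ranges here are made up for illustration):
# one (low, high) mileage range per year; other years fall back to the original 0-20000 range
ranges = {2001: (15000, 20000), 2010: (5000, 7000)}
conditions = [df['year'] == y for y in ranges]
lows = np.select(conditions, [lo for lo, _ in ranges.values()], default=0)
highs = np.select(conditions, [hi for _, hi in ranges.values()], default=20000)
df['mileage'] = np.random.randint(lows, highs)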

Multiply columns based on two columns conditions from different dataframes?

I have two dataframes as indicated below:
dfA =
Country City Pop
US Washington 1000
US Texas 5000
CH Geneva 500
CH Zurich 500
dfB =
Country City Density (pop/km2)
US Washington 10
US Texas 50
CH Geneva 5
CH Zurich 5
What I want is to compare the columns Country and City from both dataframes, and when these match, such as US Washington & US Washington in both dataframes, take the Pop value and divide it by Density, so as to get a new column area in dfB with the resulting division. Example of the first row's result: dfB['area km2'] = 100.
I have tried with np.where() but it is not working. Any hints on how to achieve this?
Using index matching and div:
match_on = ['Country', 'City']
dfA = dfA.set_index(match_on)
dfA.assign(ratio=dfA.Pop.div(dfB.set_index(match_on)['Density (pop/km2)']))
Country City
US Washington 100.0
Texas 100.0
CH Geneva 100.0
Zurich 100.0
dtype: float64
You can also use merge to combine the two dataframes and divide as usual:
dfMerge = dfA.merge(dfB, on=['Country', 'City'])
dfMerge['area'] = dfMerge['Pop'].div(dfMerge['Density (pop/km2)'])
print(dfMerge)
Output:
Country City Pop Density (pop/km2) area
0 US Washington 1000 10 100.0
1 US Texas 5000 50 100.0
2 CH Geneva 500 5 100.0
3 CH Zurich 500 5 100.0
You can also use merge like below:
dfB["Area"] = dfB.merge(dfA, on=["Country", "City"], how="left")["Pop"] / dfB["Density (pop/km2)"]
dfB
