Define a function for df using a pandas Series

For each year, I would like to take the number of people (the values in the dataframe) in a sector (ROME column) belonging to a workgroup (FAP column) and divide it by the total number of people in that workgroup.
The total number of people per workgroup is stored in a variable Total_FAP:
Total_FAP = df2.Total
Total_FAP.head()
which shows
FAP
Agents administratifs et commerciaux des transports et du tourisme 63160.0
Agents d'entretien 718150.0
Agents d'exploitation des transports 142680.0
Agents de gardiennage et de sécurité 465010.0
Agriculteurs, éleveurs, sylviculteurs, bûcherons 121040.0
For example, for the year 2010, I have to take the number of people for the ROME A1101, which belongs to the FAP "Agriculteurs, éleveurs, sylviculteurs, bûcherons" (2630), and divide it by the total for that FAP in the pandas Series (121040).
That gives: 2630/121040 = 0.02172835426
Is there a way to do this with a function? I thought about iterating over the dataframe, but I read that this is not advised.
Thanks for your help
EDIT: Here is the raw data for DF1
{'FAP': {0: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
1: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
2: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
3: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons',
4: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons'},
'ROME': {0: 'A1101', 1: 'A1201', 2: 'A1202', 3: 'A1203', 4: 'A1204'},
'2010': {0: 2630, 1: 1380, 2: 4450, 3: 20330, 4: 130},
'2011': {0: 2790, 1: 1500, 2: 3670, 3: 20040, 4: 90},
'2012': {0: 2700, 1: 1320, 2: 4020, 3: 19130, 4: 130},
'2013': {0: 2970, 1: 1690, 2: 3520, 3: 20500, 4: 140},
'2014': {0: 2680, 1: 1980, 2: 2790, 3: 16900, 4: 150},
'2015': {0: 2440, 1: 1780, 2: 2640, 3: 16310, 4: 170},
'2016': {0: 3600, 1: 1980, 2: 2540, 3: 17680, 4: 90},
'2017': {0: 2930, 1: 2470, 2: 2510, 3: 18520, 4: 130},
'2018': {0: 2740, 1: 2010, 2: 2130, 3: 19280, 4: 150},
'2019': {0: 1600.0, 1: 1760.0, 2: 1050.0, 3: 14260.0, 4: 80.0},
'2020': {0: 11140, 1: 6490, 2: 14000, 3: 76570, 4: 510},
'1e Trimestre 2021': {0: 600, 1: 560, 2: 300, 3: 6090, 4: 30}}

You could use:
cols = df.filter(regex=r'^\d{4}$').columns  # raw string: keep only the 4-digit year columns
df = df.merge(Total_FAP, left_on='FAP', right_index=True, suffixes=('', '_total'))
df[cols].div(df['FAP_total'], axis=0)
output:
2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 0.021728 0.023050 0.022307 0.024537 0.022141 0.020159 0.029742 0.024207 0.022637 0.013219 0.092036
1 0.011401 0.012393 0.010905 0.013962 0.016358 0.014706 0.016358 0.020406 0.016606 0.014541 0.053619
2 0.036765 0.030321 0.033212 0.029081 0.023050 0.021811 0.020985 0.020737 0.017597 0.008675 0.115664
3 0.167961 0.165565 0.158047 0.169365 0.139623 0.134749 0.146067 0.153007 0.159286 0.117812 0.632601
4 0.001074 0.000744 0.001074 0.001157 0.001239 0.001404 0.000744 0.001074 0.001239 0.000661 0.004213
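Since the question asks for a function, here is a minimal sketch of the same division wrapped in one. It assumes Total_FAP is a Series indexed by FAP name, as the head() output in the question suggests. The function name share_of_fap is made up for illustration, and it uses Series.map instead of merge, so the result does not depend on how the totals column ends up being named.

```python
import pandas as pd

# Reduced copy of the sample data from the question (two year columns only).
df = pd.DataFrame({
    'FAP': ['Agriculteurs, éleveurs, sylviculteurs, bûcherons'] * 5,
    'ROME': ['A1101', 'A1201', 'A1202', 'A1203', 'A1204'],
    '2010': [2630, 1380, 4450, 20330, 130],
    '2011': [2790, 1500, 3670, 20040, 90],
    '1e Trimestre 2021': [600, 560, 300, 6090, 30],
})

# Assumption: totals are a Series indexed by FAP name.
Total_FAP = pd.Series({
    'Agriculteurs, éleveurs, sylviculteurs, bûcherons': 121040.0,
    "Agents d'entretien": 718150.0,
})


def share_of_fap(frame, totals):
    """Divide every 4-digit year column by the total of the row's FAP."""
    year_cols = frame.filter(regex=r'^\d{4}$').columns
    # Look up the workgroup total for each row from the FAP label.
    denom = frame['FAP'].map(totals)
    shares = frame[year_cols].div(denom, axis=0)
    return pd.concat([frame[['FAP', 'ROME']], shares], axis=1)


result = share_of_fap(df, Total_FAP)
print(result.round(6))
```

The regex only keeps columns whose name is exactly four digits, so '1e Trimestre 2021' is left out, as in the output above.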

Related

Plotly - set decimal place in choropleth

How do you display the number 1.425887B as 1.4B in a plotly choropleth?
data2022 = dict(type = 'choropleth',
colorscale = 'agsunset',
reversescale = True,
locations = df['Country/Territory'],
locationmode = 'country names',
z = df['2022 Population'],
text = df['CCA3' ],
marker = dict(line = dict(color = 'rgb(12, 12, 12)', width=1)),
colorbar = {'title': 'Population'})
layout2022 = dict(title = '<b>World Population 2022</b>',
geo = dict(showframe = True,
showland = True, landcolor = 'rgb(198, 197, 198)',
showlakes = True, lakecolor = 'rgb(85, 173, 240)',
showrivers = True, rivercolor = 'rgb(173, 216, 230)',
showocean = True, oceancolor = 'rgb(173, 216, 230)',
projection = {'type': 'natural earth'}))
choromap2022 = go.Figure(data=[data2022], layout=layout2022)
choromap2022.update_geos(lataxis_showgrid = True, lonaxis_showgrid = True)
choromap2022.update_layout(height = 600,
title_x = 0.5,
title_font_color = 'red',
title_font_family = 'Times New Roman',
title_font_size = 30,
margin=dict(t=80, r=50, l=50))
iplot(choromap2022)
In the result I got, the population of China is displayed as 1.425887B; I want it displayed as 1.4B.
I looked through the plotly documentation but could not find anything.
This is the output of df.head().to_dict()
{'CCA3': {0: 'AFG', 1: 'ALB', 2: 'DZA', 3: 'ASM', 4: 'AND'},
'Country/Territory': {0: 'Afghanistan',
1: 'Albania',
2: 'Algeria',
3: 'American Samoa',
4: 'Andorra'},
'Capital': {0: 'Kabul',
1: 'Tirana',
2: 'Algiers',
3: 'Pago Pago',
4: 'Andorra la Vella'},
'Continent': {0: 'Asia', 1: 'Europe', 2: 'Africa', 3: 'Oceania', 4: 'Europe'},
'2022 Population': {0: 41128771, 1: 2842321, 2: 44903225, 3: 44273, 4: 79824},
'2020 Population': {0: 38972230, 1: 2866849, 2: 43451666, 3: 46189, 4: 77700},
'2015 Population': {0: 33753499, 1: 2882481, 2: 39543154, 3: 51368, 4: 71746},
'2010 Population': {0: 28189672, 1: 2913399, 2: 35856344, 3: 54849, 4: 71519},
'2000 Population': {0: 19542982, 1: 3182021, 2: 30774621, 3: 58230, 4: 66097},
'1990 Population': {0: 10694796, 1: 3295066, 2: 25518074, 3: 47818, 4: 53569},
'1980 Population': {0: 12486631, 1: 2941651, 2: 18739378, 3: 32886, 4: 35611},
'1970 Population': {0: 10752971, 1: 2324731, 2: 13795915, 3: 27075, 4: 19860},
'Area (km²)': {0: 652230, 1: 28748, 2: 2381741, 3: 199, 4: 468},
'Density (per km²)': {0: 63.0587,
1: 98.8702,
2: 18.8531,
3: 222.4774,
4: 170.5641},
'Growth Rate': {0: 1.0257, 1: 0.9957, 2: 1.0164, 3: 0.9831, 4: 1.01},
'World Population Percentage': {0: 0.52, 1: 0.04, 2: 0.56, 3: 0.0, 4: 0.0}}

This is trickier than it appears because plotly uses d3-format, but I believe they are using additional metric abbreviations in their formatting to have the default display numbers larger than 1000 in the format 1.425887B.
My original idea was to round to the nearest 2 digits in the hovertemplate with something like:
data2022 = dict(..., hovertemplate = "%{z:.2r}<br>%{text}<extra></extra>")
However, this removes the default metric abbreviation and causes the entire long-form number to display: the population of China would show up as 1400000000 instead of 1.4B.
So one possible workaround is to create a new column in your DataFrame called "2022 Population Text" and format the number with a custom function that rounds and abbreviates it (credit goes to @rtaft for their function, which does exactly that). Then you can pass this column to customdata and display customdata in your hovertemplate (instead of z).
import pandas as pd
import plotly.graph_objects as go
data = {'CCA3': {0: 'AFG', 1: 'ALB', 2: 'DZA', 3: 'ASM', 4: 'AND'},
'Country/Territory': {0: 'Afghanistan',
1: 'Albania',
2: 'Algeria',
3: 'American Samoa',
4: 'Andorra'},
'Capital': {0: 'Kabul',
1: 'Tirana',
2: 'Algiers',
3: 'Pago Pago',
4: 'Andorra la Vella'},
'Continent': {0: 'Asia', 1: 'Europe', 2: 'Africa', 3: 'Oceania', 4: 'Europe'},
'2022 Population': {0: 1412000000, 1: 2842321, 2: 44903225, 3: 44273, 4: 79824},
'2020 Population': {0: 38972230, 1: 2866849, 2: 43451666, 3: 46189, 4: 77700},
'2015 Population': {0: 33753499, 1: 2882481, 2: 39543154, 3: 51368, 4: 71746},
'2010 Population': {0: 28189672, 1: 2913399, 2: 35856344, 3: 54849, 4: 71519},
'2000 Population': {0: 19542982, 1: 3182021, 2: 30774621, 3: 58230, 4: 66097},
'1990 Population': {0: 10694796, 1: 3295066, 2: 25518074, 3: 47818, 4: 53569},
'1980 Population': {0: 12486631, 1: 2941651, 2: 18739378, 3: 32886, 4: 35611},
'1970 Population': {0: 10752971, 1: 2324731, 2: 13795915, 3: 27075, 4: 19860},
'Area (km²)': {0: 652230, 1: 28748, 2: 2381741, 3: 199, 4: 468},
'Density (per km²)': {0: 63.0587,
1: 98.8702,
2: 18.8531,
3: 222.4774,
4: 170.5641},
'Growth Rate': {0: 1.0257, 1: 0.9957, 2: 1.0164, 3: 0.9831, 4: 1.01},
'World Population Percentage': {0: 0.52, 1: 0.04, 2: 0.56, 3: 0.0, 4: 0.0}
}
## rounds a number to the specified precision, and adds metrics abbreviations
## i.e. 14230000000 --> 14B
## reference: https://stackoverflow.com/a/45846841/5327068
def human_format(num):
    num = float('{:.2g}'.format(num))
    magnitude = 0
    while abs(num) >= 1000:
        magnitude += 1
        num /= 1000.0
    return '{}{}'.format('{:f}'.format(num).rstrip('0').rstrip('.'), ['', 'K', 'M', 'B', 'T'][magnitude])
df = pd.DataFrame(data=data)
df['2022 Population Text'] = df['2022 Population'].apply(lambda x: human_format(x))
data2022 = dict(type = 'choropleth',
colorscale = 'agsunset',
reversescale = True,
locations = df['Country/Territory'],
locationmode = 'country names',
z = df['2022 Population'],
text = df['CCA3'],
customdata = df['2022 Population Text'],
marker = dict(line = dict(color = 'rgb(12, 12, 12)', width=1)),
colorbar = {'title': 'Population'},
hovertemplate = "%{customdata}<br>%{text}<extra></extra>"
)
layout2022 = dict(title = '<b>World Population 2022</b>',
geo = dict(showframe = True,
showland = True, landcolor = 'rgb(198, 197, 198)',
showlakes = True, lakecolor = 'rgb(85, 173, 240)',
showrivers = True, rivercolor = 'rgb(173, 216, 230)',
showocean = True, oceancolor = 'rgb(173, 216, 230)',
projection = {'type': 'natural earth'}))
choromap2022 = go.Figure(data=[data2022], layout=layout2022)
choromap2022.update_geos(lataxis_showgrid = True, lonaxis_showgrid = True)
choromap2022.update_layout(height = 600,
title_x = 0.5,
title_font_color = 'red',
title_font_family = 'Times New Roman',
title_font_size = 30,
margin=dict(t=80, r=50, l=50),
)
choromap2022.show()
Note: Since China wasn't included in your sample data, I changed the population of AFG to 1412000000 to test that the hovertemplate would display it as '1.4B'.
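To see what the formatting helper produces without building the figure, here is a small standalone sketch. It is the same human_format logic with one addition that is not in the answer: a hypothetical sig_digits parameter (default 2) that controls how many significant digits are kept.

```python
def human_format(num, sig_digits=2):
    """Round to `sig_digits` significant digits and append K/M/B/T.

    `sig_digits` is an illustrative extra parameter, not part of the
    original helper, which always used 2 significant digits.
    """
    num = float('{:.{p}g}'.format(num, p=sig_digits))
    magnitude = 0
    # Divide by 1000 until the number fits, counting the steps.
    while abs(num) >= 1000:
        magnitude += 1
        num /= 1000.0
    # Drop trailing zeros and a dangling decimal point: '44.000000' -> '44'.
    text = '{:f}'.format(num).rstrip('0').rstrip('.')
    return text + ['', 'K', 'M', 'B', 'T'][magnitude]


samples = [1425887000, 2842321, 44273, 79824]
labels = [human_format(n) for n in samples]
print(labels)                          # ['1.4B', '2.8M', '44K', '80K']
print(human_format(1425887000, 3))     # 1.43B
```

Because the suffix list stops at 'T', values of 10^15 or more would raise an IndexError; that is not a concern for population counts.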

Fixing column names and renaming them after grouping the dataframe by two columns

I have a dataframe:
{'ARTICLE_ID': {0: 111, 1: 111, 2: 222, 3: 222, 4: 222}, 'CITEDIN_ARTICLE_ID': {0: 11, 1: 11, 2: 11, 3: 22, 4: 22}, 'enrollment': {0: 10, 1: 10, 2: 10, 3: 10, 4: 10}, 'Trial_year': {0: 2017, 1: 2017, 2: 2017, 3: 2017, 4: 2017}, 'AUTHOR_ID': {0: 'aaa', 1: 'aaa', 2: 'aaa', 3: 'aaa', 4: 'aaa'}, 'AUTHOR_RANK': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
I am grouping it by two columns
df_grouped = df.groupby(['AUTHOR_ID', 'Trial_year']).agg({'ARTICLE_ID': "count",
'enrollment': ["count", 'sum']}).reset_index()
As a result, I receive this dataframe, where column names have two levels
{('AUTHOR_ID', ''): {0: 'aaa'}, ('Trial_year', ''): {0: 2017}, ('ARTICLE_ID', 'count'): {0: 5}, ('enrollment', 'count'): {0: 5}, ('enrollment', 'sum'): {0: 50}}
My ideal output - the dataframe with one level of column names and renamed column names
`AUTHOR_ID`, `Trial_year`, `ARTICLE_ID_count`, `enrollment_count`, `enrollment_sum`

You can modify the columns:
df_grouped.columns = [f"{i}_{j}" if j!='' else i for i,j in df_grouped.columns]
or use NamedAgg from the beginning:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'])
.agg(ARTICLE_ID_count=('ARTICLE_ID', "count"),
enrollment_count=('enrollment','count'),
enrollment_sum=('enrollment','sum')).reset_index())
You can also unpack a dictionary into groupby.agg for slightly more concise code:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'], as_index=False)
.agg(**{'_'.join(pair): pair for pair in [('ARTICLE_ID', 'count'),
('enrollment','count'),
('enrollment','sum')]}))
Output:
AUTHOR_ID Trial_year ARTICLE_ID_count enrollment_count enrollment_sum
0 aaa 2017 5 5 50
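As a quick check that the flattening and the named-aggregation routes give the same column names, here is a minimal sketch run on the sample data from the question. The variable names flat and named are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ARTICLE_ID': [111, 111, 222, 222, 222],
    'CITEDIN_ARTICLE_ID': [11, 11, 11, 22, 22],
    'enrollment': [10, 10, 10, 10, 10],
    'Trial_year': [2017, 2017, 2017, 2017, 2017],
    'AUTHOR_ID': ['aaa', 'aaa', 'aaa', 'aaa', 'aaa'],
    'AUTHOR_RANK': [1, 2, 3, 4, 5],
})

# Route 1: aggregate with a dict, then flatten the two-level column names.
flat = (df.groupby(['AUTHOR_ID', 'Trial_year'])
          .agg({'ARTICLE_ID': 'count', 'enrollment': ['count', 'sum']})
          .reset_index())
flat.columns = [f"{i}_{j}" if j != '' else i for i, j in flat.columns]

# Route 2: named aggregation, which produces flat names directly.
named = (df.groupby(['AUTHOR_ID', 'Trial_year'], as_index=False)
           .agg(ARTICLE_ID_count=('ARTICLE_ID', 'count'),
                enrollment_count=('enrollment', 'count'),
                enrollment_sum=('enrollment', 'sum')))

print(list(flat.columns))
print(list(named.columns))
```

Named aggregation avoids the MultiIndex entirely, which is usually easier to read when the list of output columns is short.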

Pandas Loc string giving KeyError

I am trying to pass a date in the format 2015-12-20 in a URL, look it up in a pandas dataframe and run model.predict on the result.
The problem is that I am converting working code from JupyterLab into a .py file so that everything runs on a Flask server, and this lookup does not carry over.
The following code only works if the 'Date' column is converted to datetime. If it is in object format, the following code doesn't work either.
data.loc[2015-12-06]
The above works but the following gives an error:
data.loc['2015-12-06']
KeyError: '2015-12-06'
How do I pass in 2015-12-06, not as a string, so that .loc works?
print(data.head(5).to_dict())
{'Date': {0: '2015-12-27', 1: '2015-12-20', 2: '2015-12-13', 3: '2015-12-06', 4: '2015-11-29'}, 'Total Volume': {0: 64236.62, 1: 54876.98, 2: 118220.22, 3: 78992.15, 4: 51039.6}, '4046': {0: 1036.74, 1: 674.28, 2: 794.7, 3: 1132.0, 4: 941.48}, '4225': {0: 54454.85, 1: 44638.81, 2: 109149.67, 3: 71976.41, 4: 43838.39}, '4770': {0: 48.16, 1: 58.33, 2: 130.5, 3: 72.58, 4: 75.78}, 'Total Bags': {0: 8696.87, 1: 9505.56, 2: 8145.35, 3: 5811.16, 4: 6183.95}, 'Small Bags': {0: 8603.62, 1: 9408.07, 2: 8042.21, 3: 5677.4, 4: 5986.26}, 'Large Bags': {0: 93.25, 1: 97.49, 2: 103.14, 3: 133.76, 4: 197.69}, 'XLarge Bags': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}, 'type': {0: 'conventional', 1: 'conventional', 2: 'conventional', 3: 'conventional', 4: 'conventional'}, 'year': {0: 2015, 1: 2015, 2: 2015, 3: 2015, 4: 2015}, 'region': {0: 'Albany', 1: 'Albany', 2: 'Albany', 3: 'Albany', 4: 'Albany'}}
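A possible explanation, based only on the to_dict() output above: .loc looks up row labels, and here the row labels are the default integers 0 to 4, while the dates live in an ordinary 'Date' column. A string such as '2015-12-06' is therefore not a label, hence the KeyError. The sketch below shows three ways to do the lookup from a string, using a reduced copy of the sample data. The variable date_str stands in for the value taken from the URL and is an assumption.

```python
import pandas as pd

data = pd.DataFrame({
    'Date': ['2015-12-27', '2015-12-20', '2015-12-13', '2015-12-06', '2015-11-29'],
    'Total Volume': [64236.62, 54876.98, 118220.22, 78992.15, 51039.6],
})

date_str = '2015-12-06'  # assumed: the date string received from the URL

# Option 1: keep the default index and filter on the column.
matches = data.loc[data['Date'] == date_str]

# Option 2: make the date strings the row labels, then look the label up.
by_date = data.set_index('Date')
row = by_date.loc[date_str]

# Option 3: with real datetimes in the index, convert the string first.
by_ts = data.assign(Date=pd.to_datetime(data['Date'])).set_index('Date')
row_ts = by_ts.loc[pd.Timestamp(date_str)]

print(matches)
print(row['Total Volume'], row_ts['Total Volume'])
```

Whichever option is used, the index has to be set the same way in the .py file as it was in the notebook; a notebook cell that called set_index earlier is easy to miss when copying code over.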

Pandas dataframe doesn't merge on specific column

Good evening,
I have a problem with my df
Here is df1
and df2
Trimestre level_0
0 "A1101" Agriculteurs, éleveurs, sylviculteurs, bûcherons"
1 "A1401" Maraîchers, jardiniers, viticulteurs"
2 "A1405" Maraîchers, jardiniers, viticulteurs"
3 "A1406" Marins, pêcheurs, aquaculteurs"
4 "N3101" Marins, pêcheurs, aquaculteurs"
... ... ...
123 "K1205" Professionnels de l'action sociale et de l'ori...
124 "K2104" Professionnels de l'action culturelle, sportiv...
125 "K2108" Enseignants"
126 "K2110" Formateurs"
127 "K2111" Formateurs"
I tried to merge df1 with df2 on the "Trimestre" column:
df2.Trimestre = df2.Trimestre.astype(str)
df1.Trimestre = df1.Trimestre.astype(str)
df=pd.merge(df1,df2,on="Trimestre")
but the result is empty; only the column headers appear:
Trimestre level_0 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Any help would be appreciated.
EDIT: Here is the output of df.head().to_dict() to reproduce the error
df1
{'Trimestre': {0: 'A1101 ',
1: 'A1201 ',
2: 'A1202 ',
3: 'A1203 ',
4: 'A1204 '},
'2010': {0: 2630, 1: 1380, 2: 4450, 3: 20330, 4: 130},
'2011': {0: 2790, 1: 1500, 2: 3670, 3: 20040, 4: 90},
'2012': {0: 2700, 1: 1320, 2: 4020, 3: 19140, 4: 130},
'2013': {0: 2970, 1: 1690, 2: 3520, 3: 20500, 4: 140},
'2014': {0: 2680, 1: 1980, 2: 2790, 3: 16900, 4: 150},
'2015': {0: 2440, 1: 1780, 2: 2640, 3: 16310, 4: 170},
'2016': {0: 3600, 1: 1980, 2: 2540, 3: 17680, 4: 90},
'2017': {0: 2930, 1: 2470, 2: 2510, 3: 18520, 4: 130},
'2018': {0: 2740, 1: 2010, 2: 2130, 3: 19280, 4: 150},
'2019': {0: 1600.0, 1: 1760.0, 2: 1050.0, 3: 14260.0, 4: 80.0},
'2020': {0: 11140, 1: 6490, 2: 14000, 3: 76580, 4: 510}}
df2
{'Trimestre': {0: 'A1101', 1: 'A1401', 2: 'A1405', 3: 'A1406', 4: 'N3101'},
'level_0': {0: 'Agriculteurs, éleveurs, sylviculteurs, bûcherons"',
1: 'Maraîchers, jardiniers, viticulteurs"',
2: 'Maraîchers, jardiniers, viticulteurs"',
3: 'Marins, pêcheurs, aquaculteurs"',
4: 'Marins, pêcheurs, aquaculteurs"'}}
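Looking at the to_dict() output, the keys in df1 carry a trailing space ('A1101 ') while those in df2 do not ('A1101'), so an exact-match merge finds no common key. The following is a minimal sketch of cleaning the keys before merging, assuming stray whitespace and quote characters are the only differences. It uses a reduced copy of the sample data.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Trimestre': ['A1101 ', 'A1201 ', 'A1202 ', 'A1203 ', 'A1204 '],
    '2010': [2630, 1380, 4450, 20330, 130],
})
df2 = pd.DataFrame({
    'Trimestre': ['A1101', 'A1401', 'A1405', 'A1406', 'N3101'],
    'level_0': ['Agriculteurs, éleveurs, sylviculteurs, bûcherons"',
                'Maraîchers, jardiniers, viticulteurs"',
                'Maraîchers, jardiniers, viticulteurs"',
                'Marins, pêcheurs, aquaculteurs"',
                'Marins, pêcheurs, aquaculteurs"'],
})

# Merging on the raw keys: 'A1101 ' != 'A1101', so nothing matches.
before = pd.merge(df1, df2, on='Trimestre')

# Strip surrounding whitespace from the keys (and the stray quote in level_0).
df1['Trimestre'] = df1['Trimestre'].str.strip()
df2['Trimestre'] = df2['Trimestre'].str.strip()
df2['level_0'] = df2['level_0'].str.strip('"')

after = pd.merge(df1, df2, on='Trimestre')
print(len(before), len(after))
print(after)
```

In this five-row sample only A1101 appears in both frames, so the cleaned merge returns a single row; passing how='left' would keep the unmatched df1 rows as well.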

Calculating total unique values per column

I am trying to use the data below to get the total Facebook likes for each unique actor. The output should have two columns: column 1 containing the unique actor names from all the actor_name columns, and column 2 containing the total likes from all three actor_facebook_likes columns. Any idea on how this can be done would be appreciated.
{'actor_1_name': {0: 'Ryan Gosling',
1: 'Ginnifer Goodwin',
2: 'Dev Patel',
3: 'Amy Adams',
4: 'Casey Affleck'},
'actor_2_name': {0: 'Emma Stone',
1: 'Jason Bateman',
2: 'Nicole Kidman',
3: 'Jeremy Renner',
4: 'Michelle Williams '},
'actor_3_name': {0: 'Amiée Conn',
1: 'Idris Elba',
2: 'Rooney Mara',
3: 'Forest Whitaker',
4: 'Kyle Chandler'},
'actor_1_facebook_likes': {0: 14000, 1: 2800, 2: 33000, 3: 35000, 4: 518},
'actor_2_facebook_likes': {0: 19000.0,
1: 28000.0,
2: 96000.0,
3: 5300.0,
4: 71000.0},
'actor_3_facebook_likes': {0: nan, 1: 27000.0, 2: 9800.0, 3: nan, 4: 3300.0}}

Use pivot to get sum of likes for each actor in each facebook like category
df3=pd.pivot_table(df,columns=['actor_1_name', 'actor_2_name', 'actor_3_name'],values=['actor_1_facebook_likes', 'actor_2_facebook_likes',
'actor_3_facebook_likes'],aggfunc=[np.sum]).reset_index()
Melt the Actors, groupby and sum all categories
res=pd.melt(df3,id_vars=['sum'], value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name']).groupby('value').agg(Totallikes =('sum', 'sum')).reset_index()
Rename the columns
res.columns=['Actor','Totallikes']
print(res)
Actor Totallikes
0 Amiée Conn 33000.0
1 Amy Adams 40300.0
2 Casey Affleck 74818.0
3 Dev Patel 138800.0
4 Emma Stone 33000.0
5 Forest Whitaker 40300.0
6 Ginnifer Goodwin 57800.0
7 Idris Elba 57800.0
8 Jason Bateman 57800.0
9 Jeremy Renner 40300.0
10 Kyle Chandler 74818.0
11 Michelle Williams 74818.0
12 Nicole Kidman 138800.0
13 Rooney Mara 138800.0
14 Ryan Gosling 33000.0

This does the job:
df0 = pd.DataFrame({'actor_1_name': {0: 'Ryan Gosling',
1: 'Ginnifer Goodwin',
2: 'Dev Patel',
3: 'Amy Adams',
4: 'Casey Affleck'},
'actor_2_name': {0: 'Emma Stone',
1: 'Jason Bateman',
2: 'Nicole Kidman',
3: 'Jeremy Renner',
4: 'Michelle Williams '},
'actor_3_name': {0: 'Amiée Conn',
1: 'Idris Elba',
2: 'Rooney Mara',
3: 'Forest Whitaker',
4: 'Kyle Chandler'},
'actor_1_facebook_likes': {0: 14000, 1: 2800, 2: 33000, 3: 35000, 4: 518},
'actor_2_facebook_likes': {0: 19000.0,
1: 28000.0,
2: 96000.0,
3: 5300.0,
4: 71000.0},
'actor_3_facebook_likes': {0: 0, 1: 27000.0, 2: 9800.0, 3: 0, 4: 3300.0}})
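# Stack the three (name, likes) column pairs on top of each other.
parts = []
for i in range(1, 4):
    part = df0[[f'actor_{i}_name', f'actor_{i}_facebook_likes']]
    parts.append(part.set_axis(['name', 'value'], axis=1))
dfa = pd.concat(parts, axis=0, ignore_index=True)
# Sum the likes per actor.
res = dfa.groupby('name', as_index=False)['value'].sum()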
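As a cross-check, here is a compact sketch that pairs each name with the likes in the matching column and sums per actor. Note that the table printed by the pivot-based answer gives every actor the combined likes of their row (for example 33000 = 14000 + 19000 for all three actors of row 0), whereas this sketch sums each actor's own likes, which may be closer to what the question asks. Stripping the trailing space in 'Michelle Williams ' is an extra step not present in either answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'actor_1_name': ['Ryan Gosling', 'Ginnifer Goodwin', 'Dev Patel',
                     'Amy Adams', 'Casey Affleck'],
    'actor_2_name': ['Emma Stone', 'Jason Bateman', 'Nicole Kidman',
                     'Jeremy Renner', 'Michelle Williams '],
    'actor_3_name': ['Amiée Conn', 'Idris Elba', 'Rooney Mara',
                     'Forest Whitaker', 'Kyle Chandler'],
    'actor_1_facebook_likes': [14000, 2800, 33000, 35000, 518],
    'actor_2_facebook_likes': [19000.0, 28000.0, 96000.0, 5300.0, 71000.0],
    'actor_3_facebook_likes': [np.nan, 27000.0, 9800.0, np.nan, 3300.0],
})

# Select the name columns and the likes columns in matching order.
names = df.filter(regex=r'^actor_\d_name$')
likes = df.filter(regex=r'^actor_\d_facebook_likes$')

# Flatten both blocks row by row so position k in one pairs with k in the other.
flat_names = pd.Index(names.to_numpy().ravel()).str.strip()
flat_likes = likes.to_numpy().ravel()

# Sum likes per actor; missing values are skipped, so NaN-only actors get 0.
totals = pd.Series(flat_likes, index=flat_names).groupby(level=0).sum()
res = totals.rename_axis('Actor').reset_index(name='Totallikes')
print(res)
```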
