Pandas isin() output to string and general code optimisation - python

I am just starting to use python to do some analysis of data at work, so I could really use some help here :)
I have a df with African countries and a bunch of indicators, and another df whose columns represent groupings; if a country belongs to a group, its name appears in that column.
Here's a snapshot:
# indicators df
df_indicators = pd.DataFrame({'Country': ['Algeria', 'Angola', 'Benin'],
                              'Commitment to CAADP Process': [np.nan, 0.1429, 0.8571]})
# groupings df
df_groupings = pd.DataFrame({'Fragile': ['Somalia', 'Angola', 'Benin'],
                             'SSA': ['Angola', 'Benin', 'South Africa']})
# what I would like to have without doing it manually
df_indicators['Fragile'] = ['Not Fragile', 'Fragile', 'Fragile']
df_indicators['SSA'] = ['Not SSA', 'SSA', 'Not SSA']
df_indicators
I want to add dimensions to this df that tell me whether the country is a fragile state or not (and similarly for the other groupings), so I have another df with the list of countries belonging to each category.
I use the isin() method to check membership, but instead of True and False in the new "Fragile" column, for example, I would like True values to be replaced by "Fragile" and False values by "Not Fragile".
It goes without saying that if you see any way to improve this code, I am very eager to learn from professionals! Especially if you work in the domain of sustainable development goal statistics.
import pandas as pd
import numpy as np
excel_file = 'Downloads/BR_Data.xlsx'
indicators = pd.read_excel(excel_file, sheet_name="Indicators", header=1)
groupings = pd.read_excel(excel_file, sheet_name="Groupings", header=0)
# Title-case country names in the Sub-Saharan Africa dimension
decap = groupings["Sub-Saharan Africa (World Bank List)"].str.title()
groupings["Sub-Saharan Africa (World Bank List)"] = decap
# Dictionary mapping alternative spellings to the country names used in the indicators sheet
legal_tags = {"Côte d’Ivoire": "Ivory Coast", "Cote D'Ivoire": "Ivory Coast",
              "Democratic Republic of the Congo": "DR Congo", "Congo, Dem. Rep.": "DR Congo",
              "Congo, Repub. of the": "DR Congo", "Congo, Rep.": "DR Congo", "Dr Congo": "DR Congo",
              "Central African Rep.": "Central African Republic",
              "Sao Tome & Principe": "Sao Tome and Principe", "Gambia, The": "Gambia"}
# I am sure there is a way to pass a list of column names instead of copy-pasting every column label 5 times
groupings.replace({"Least Developing Countries in Africa (UN Classification, used by WB)" : legal_tags}, inplace = True)
groupings.replace({"Oil Exporters (McKinsey Global Institute)" : legal_tags}, inplace = True)
groupings.replace({"Sub-Saharan Africa (World Bank List)" : legal_tags}, inplace = True)
groupings.replace({"African Fragile and Conflict Affected Aread (OECD)" : legal_tags}, inplace = True)
groupings
# If the country in the indicators df is found in the groupings df, then assign True to the new column [LDC] => CAN I REPLACE TRUE WITH "LDC" etc...?
indicators["LDC"] = indicators["Country"].isin(groupings["Least Developing Countries in Africa (UN Classification, used by WB)"])
indicators["Fragile"] = indicators["Country"].isin(groupings["African Fragile and Conflict Affected Aread (OECD)"])
indicators["Oil"] = indicators["Country"].isin(groupings["Oil Exporters (McKinsey Global Institute)"])
indicators["SSA"] = indicators["Country"].isin(groupings["Sub-Saharan Africa (World Bank List)"])
indicators["Landlock"] = indicators["Country"].isin(groupings['Landlocked (UNCTAD List)'])
# When I concatenate the data frames of the grouping averages I get an index with a bunch of True, False, True, False etc...
df = indicators.merge(groupings, left_on = "Country", right_on= "Country", how ="right")
labels = ['African Fragile and Conflict Affected Aread (OECD)', 'Sub-Saharan Africa (World Bank List)', 'Landlocked (UNCTAD List)', 'North Africa (excl. Middle east)', 'Oil Exporters (McKinsey Global Institute)', 'Least Developing Countries in Africa (UN Classification, used by WB)']
df.drop(labels, axis = 1, inplace = True)
df.loc['mean'] = df.mean()
df_regions = df.groupby('Regional Group').mean()
df_LDC = df.groupby('LDC').mean()
df_Oil = df.groupby('Oil').mean()
df_SSA = df.groupby('SSA').mean()
df_landlock = df.groupby('Landlock').mean()
df_fragile = df.groupby('Fragile').mean()
frames = [df_regions, df_Oil, df_SSA, df_landlock, df_fragile]
result = pd.concat(frames)
result

You can apply a function to the Series instead of using isin():
def get_value(x, y, choice):
    if x in y:
        return choice[0]
    else:
        return choice[1]
indicators["LDC"] = indicators["Country"].apply(get_value, y=groupings["..."].tolist(), choice=["Fragile", "Not Fragile"])
I'm not 100% sure that you need tolist(), but this will apply the function to every row of your DataFrame and return the first choice if the country is in the list, or the second if it is not.
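Alternatively, if you want to keep isin(), numpy.where can turn the boolean result directly into labels. A minimal sketch, reusing the column names from the question:
import numpy as np
# np.where picks the first label where the condition is True, the second otherwise
indicators["Fragile"] = np.where(
    indicators["Country"].isin(groupings["African Fragile and Conflict Affected Aread (OECD)"]),
    "Fragile", "Not Fragile")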
I hope it helps,

Related

How do I split up a string with multiple parts and duplicate its rows in a dataframe with a prefix in python?

Sorry, I am unable to embed images in my post, so here is sample code instead:
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
car_number = ["FX00809", "FX00731", "FX00588,48,57", "FX0018", "FX00200", "FX0070", "FX0045"]
import pandas as pd
my_dict = {"country": names,
           "drives_right": dr,
           "car_number": car_number}
cars = pd.DataFrame(my_dict)
print(cars)
I've tried to pull that row into a different dataframe and was able to duplicate the row using:
df_multiple_car_number = df[df['car_number'].str.contains(r'^FX\d\d[,]', regex=True)]
m = df_multiple_car_number['car_number'].str.count(r'[,].*[0-9]$')
m = int(m + 2)
print(m)
#duplicate the rows
df_multiple_car_number = pd.concat([df_multiple_car_number]*m, ignore_index=True)
I'm having issues with how to split the text and add the prefix "FX00" to the other car numbers.
Start by splitting the column on commas to obtain a list for each row, then explode this Series of lists. Before that, move 'Country' and 'drives_right' into the index so that they are repeated for each element of 'car_number'.
I also create a small function to concatenate the numeric values with the prefix FX00.
So you can do it like this:
import pandas as pd
import numpy as np
def convert_splited_value(value):
    try:
        if int(value) >= 0:
            return 'FX00' + value
        else:
            return value
    except ValueError:
        return value
df = pd.DataFrame({
    'Country': ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt'],
    'drives_right': [True, False, False, False, True, True, True],
    'car_number': ["FX008093", "FX00731", "FX00588,48,57", "FX0018", "FX00200", "FX0070", "FX0045"],
})
df1 = (df.set_index(['Country','drives_right'])['car_number'].str.split(',').explode().rename('car_number').reset_index())
df1.loc[:, 'car_number'] = df1['car_number'].apply(convert_splited_value)
print(df1)
Output :
Country drives_right car_number
0 United States True FX008093
1 Australia False FX00731
2 Japan False FX00588
3 Japan False FX0048
4 Japan False FX0057
5 India False FX0018
6 Russia True FX00200
7 Morocco True FX0070
8 Egypt True FX0045
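If you prefer to avoid the try/except, the prefix can also be added back with vectorised string methods; a sketch that replaces the apply step, assuming every original number starts with the literal "FX":
# Rows produced by the split that lost the "FX00" prefix get it prepended back
mask = ~df1['car_number'].str.startswith('FX')
df1.loc[mask, 'car_number'] = 'FX00' + df1.loc[mask, 'car_number']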

Adding information from a smaller table to a large one with Pandas

I would like to add the regional information to the main table that contains entity and account columns. In this way, each row in the main table should be duplicated for each region, just like the append tool in Alteryx.
Is there a way to do this operation with Pandas in Python?
Thanks!
Unfortunately there is no built-in method for this in older pandas; you'll need to build the Cartesian product of those DataFrames yourself (see the usual explanations of merging DataFrames in pandas).
But for your specific problem, try this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['Entity', 'Account'])
df1.Entity = ['Entity1', 'Entity1']
df1.Account = ['Sales', 'Cost']
df2 = pd.DataFrame(columns=['Region'])
df2.Region = ['North America', 'Europa', 'Asia']
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
resultdf = cartesian_product_simplified(df1, df2)
print(resultdf)
output:
0 1 2
0 Entity1 Sales North America
1 Entity1 Sales Europa
2 Entity1 Sales Asia
3 Entity1 Cost North America
4 Entity1 Cost Europa
5 Entity1 Cost Asia
as expected.
Btw, please provide the DataFrame as code next time, not as a screenshot or a link. It helps save time (please check how to ask).
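For reference, newer pandas versions (1.2 and later) have a built-in cross join, which produces the same row combinations (with named columns) in one line; a minimal sketch using the df1 and df2 from above:
# Cartesian product via the built-in cross merge (requires pandas >= 1.2)
resultdf = df1.merge(df2, how='cross')
print(resultdf)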

Deleting groups of countries from dataframe indexes

I have a dataframe where I have set the index to countries. However, there are also groups of countries such as Sub-Saharan Africa (IDA & IBRD countries) or Middle East & North Africa (IDA & IBRD countries). I want to delete the groups so that just the countries stay.
Example input dataframe where the index values are:
Antigua and Barbuda
Angola
Arab World
Wanted output dataframe:
Antigua and Barbuda
Angola
My idea was to use pycountry; however, it does nothing:
countr = list(pycountry.countries)
for idx in df.index:
    if idx in countr:
        continue
    else:
        df.drop(index=idx)
Check your list of country names:
countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)
Checking output:
>>> print(countries_list[0:5])
['Aruba', 'Afghanistan', 'Angola', 'Anguilla', 'Åland Islands']
You can do this if you want to get a list of the countries that are in your countries list:
import pycountry
import pandas as pd
country = {'Country Name': ['Antigua and Barbuda', 'USSR', 'United States', 'Germany', 'The Moon']}
df = pd.DataFrame(data=country)
countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)
new_df = []
for i in df['Country Name']:
    if i in countries_list:
        new_df.append(i)
Checking output:
>>> print(new_df)
['Antigua and Barbuda', 'United States', 'Germany']
Otherwise, for your specific code, try this:
Assuming you have your data in a dataframe 'df':
import pandas as pd
import pycountry
countries_list = []
for country in pycountry.countries:
    countries_list.append(country.name)
for idx in df.index:
    if idx in countries_list:
        continue
    else:
        # drop() returns a new DataFrame by default, so keep the result
        df = df.drop(index=idx)
Let me know if this works for you.
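Alternatively, assuming the country names are in the index, the whole loop can be replaced by a single isin() filter; a minimal sketch:
# Keep only rows whose index value matches a pycountry name
df = df[df.index.isin(countries_list)]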

Replacing strings in a list using conditionals

I have a dataframe which has a column called regional_codes. Now I need to add a new column to the dataframe in which the regional codes are replaced by the list of countries attributed to that region.
For example, if regional_codes contains ['asia'], then I need my new column to contain the list of Asian countries, like ['china', 'japan', 'india', 'bangladesh', ...].
Currently, I have created a separate list for each region and I use something like this:
asia_list= ['asia','china','japan','india'...]
output_list = []
output_list+= [asia_list for w in regional_codes if w in asia_list]
output_list+= [africa_list for w in regional_codes if w in africa_list]
and so on until all the regional lists are exhausted.
With the code above, my results are exactly what I need, and it is efficient in terms of running time as well. However, I feel like I am doing it in a very roundabout way, so I am looking for suggestions that can help me shorten my code.
One way to do this is to create a DataFrame with all the needed data for your regional_codes and regional lists:
import pandas as pd
import itertools
import numpy as np
# DF is your dataframe
# df is the dataframe containing the association between the regional_code and regional lists
df = pd.DataFrame({'regional_code': ['asia', 'africa', 'europe'], 'ragional_list': [['China', 'Japan'], ['Morocco', 'Nigeria', 'Ghana'], ['France', 'UK', 'Germany', 'Spain']]})
# regional_code ragional_list
# 0 asia [China, Japan]
# 1 africa [Morocco, Nigeria, Ghana]
# 2 europe [France, UK, Germany, Spain]
df2 = pd.DataFrame({'regional_code': [['asia', 'africa'],['africa', 'europe']], 'ragional_list': [1,2]})
# regional_code ragional_list
# 0 [asia, africa] 1
# 1 [africa, europe] 2
df2['list'] = df2.apply(lambda x: list(itertools.chain.from_iterable((df.loc[df['regional_code']==i, 'ragional_list'] for i in x.loc['regional_code']))), axis=1)
# In [95]: df2
# Out[95]:
# regional_code ragional_list list
# 0 [asia, africa] 1 [[China, Japan], [Morocco, Nigeria, Ghana]]
# 1 [africa, europe] 2 [[Morocco, Nigeria, Ghana], [France, UK, Germa...
Now we flatten df2['list']:
df2['list'] = df2['list'].apply(np.concatenate)
# regional_code ragional_list list
# 0 [asia, africa] 1 [China, Japan, Morocco, Nigeria, Ghana]
# 1 [africa, europe] 2 [Morocco, Nigeria, Ghana, France, UK, Germany,...
I guess this answers your question?
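If the region-to-country mapping fits comfortably in a plain dict, a shorter sketch is also possible; here region_map is a hypothetical dict built from the same data:
# Hypothetical mapping of region code -> list of countries
region_map = {'asia': ['China', 'Japan'],
              'africa': ['Morocco', 'Nigeria', 'Ghana'],
              'europe': ['France', 'UK', 'Germany', 'Spain']}
# For each row, chain the country lists of all its region codes into one flat list
df2['list'] = df2['regional_code'].apply(
    lambda codes: [c for code in codes for c in region_map.get(code, [])])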

Update null values in a country code column in a data frame with reference to a valid country code column from another data frame using python

I have two data frames: Disaster, CountryInfo
Disaster has a column Country_code which has some null values, for example:
Disaster:
Country        Country_code
India          Null
Afghanistan    AFD
India          IND
United States  Null
CountryInfo:
CountryName    ISO
India          IND
Afghanistan    AFD
United States  US
I need to fill in the country code with reference to the country name. Can anyone suggest a solution for this?
You can simply use map with a Series.
With this approach all values are overwritten, not only the NaN ones.
import pandas as pd
import numpy as np
# Test data
disaster = pd.DataFrame({'Country': ['India', 'Afghanistan', 'India', 'United States'],
                         'Country_code': [np.nan, 'AFD', 'IND', np.nan]})
country = pd.DataFrame({'Country': ['India', 'Afghanistan', 'United States'],
                        'Country_code': ['IND', 'AFD', 'US']})
# Transforming country into a Series in order to be able to map it directly
country_se = country.set_index('Country').loc[:, 'Country_code']
# Map
disaster['Country_code'] = disaster['Country'].map(country_se)
print(disaster)
# Country Country_code
# 0 India IND
# 1 Afghanistan AFD
# 2 India IND
# 3 United States US
You can filter on NaN if you do not want to overwrite all values.
disaster.loc[pd.isnull(disaster['Country_code']),
             'Country_code'] = disaster['Country'].map(country_se)
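An equivalent way to fill only the missing codes is to combine map with fillna:
# fillna only touches rows where Country_code is NaN; other rows keep their value
disaster['Country_code'] = disaster['Country_code'].fillna(disaster['Country'].map(country_se))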
