Which regions have the lowest literacy rates? - python

I need to group/merge the rows for each city of the same name and calculate its overall percentage, to see which of those cities has the lowest literacy rate.
Code (Python):
import pandas as pd
df = pd.DataFrame({'Cities': ["Cape Town", "Cape Town", "Cape Town", "Tokyo", "Cape Town",
                              "Tokyo", "Mumbai", "Belgium", "Belgium"],
                   'LiteracyRate': [0.05, 0.35, 0.2, 0.11, 0.15, 0.2, 0.65, 0.35, 0.45]})
print(df)
For example:
Cities LiteracyRate
0 Cape Town 0.05
1 Cape Town 0.35
2 Cape Town 0.2
3 Tokyo 0.11
4 Cape Town 0.15
5 Tokyo 0.2
6 Mumbai 0.65
7 Belgium 0.35
8 Belgium 0.45
I'm expecting this:
Cities LiteracyRate %LiteracyRate
0 Cape Town 0.75 75
1 Tokyo 0.31 31
2 Mumbai 0.65 65
3 Belgium 0.8 80
So I tried the code below, but it's not giving me the desired results: the cities with the same name are still not merged, and the percentages aren't right.
# Calculate the percentage
df["%LiteracyRate"] = (df["LiteracyRate"]/df["LiteracyRate"].sum())*100
# Show the DataFrame
print(df)

You can use groupby() in pandas to join the cities with the same name, and sum() to add up their rates:
df = df.groupby('Cities').sum()
Then you can turn the totals into whole-number percentages with:
df['%LiteracyRate'] = (df['LiteracyRate']*100).round().astype(int)
df = df.reset_index()
To sort them by literacy rate, lowest first:
df = df.sort_values(by='%LiteracyRate')
df = df.reset_index(drop=True)
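Putting those steps together, a minimal end-to-end sketch with the data from the question:
import pandas as pd

df = pd.DataFrame({'Cities': ["Cape Town", "Cape Town", "Cape Town", "Tokyo", "Cape Town",
                              "Tokyo", "Mumbai", "Belgium", "Belgium"],
                   'LiteracyRate': [0.05, 0.35, 0.2, 0.11, 0.15, 0.2, 0.65, 0.35, 0.45]})

# merge rows with the same city and sum their rates
df = df.groupby('Cities').sum()
# whole-number percentages
df['%LiteracyRate'] = (df['LiteracyRate'] * 100).round().astype(int)
df = df.reset_index()
# lowest literacy rate first
df = df.sort_values(by='%LiteracyRate').reset_index(drop=True)
print(df)
This should give Tokyo (31%) as the lowest, followed by Mumbai (65%), Cape Town (75%) and Belgium (80%), matching the expected table in the question.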
Hope this helps!

Related

How to concatenate two dataframes when some values are duplicated?

I have two dataframes of unequal lengths. I want to combine them with a condition: if two rows of df1 are identical, then they must share the same value from df2 (without changing the order).
import pandas as pd
d = {'country': ['France', 'France','Japan','China', 'China','Canada','Canada','India']}
df1 = pd.DataFrame(data=d)
I={'conc': [0.30, 0.25, 0.21, 0.37, 0.15]}
df2 = pd.DataFrame(data=I)
dfc=pd.concat([df1,df2], axis=1)
my output
country conc
0 France 0.30
1 France 0.25
2 Japan 0.21
3 China 0.37
4 China 0.15
5 Canada NaN
6 Canada NaN
7 India NaN
expected output
country conc
0 France 0.30
1 France 0.30
2 Japan 0.25
3 China 0.21
4 China 0.21
5 Canada 0.37
6 Canada 0.37
7 India 0.15
You need to create a link between the values and the countries first.
df2["country"] = df1["country"].unique()
Then you can merge it with your original dataframe.
pd.merge(df1, df2, on="country")
But be aware that this only works as long as the number of values matches the number of distinct countries and their order is as expected.
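A small end-to-end sketch of that approach, reproducing the expected output above:
import pandas as pd

df1 = pd.DataFrame({'country': ['France', 'France', 'Japan', 'China', 'China', 'Canada', 'Canada', 'India']})
df2 = pd.DataFrame({'conc': [0.30, 0.25, 0.21, 0.37, 0.15]})

# one conc value per distinct country, in order of first appearance
df2["country"] = df1["country"].unique()

# every row of df1 picks up the value of its country; the row order of df1 is preserved
dfc = pd.merge(df1, df2, on="country")
print(dfc)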
I'd construct the dataframe directly, without intermediate dfs.
d = {'country': ['France', 'France','Japan','China', 'China','Canada','Canada','India']}
I = {'conc': [0.30, 0.25, 0.21, 0.37, 0.15]}
c = 'country'
dfc = pd.DataFrame(I, index=pd.Index(pd.unique(d[c]), name=c)).reindex(d[c]).reset_index()
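The same one-liner unpacked into steps, just as a readability sketch of what it does:
# one row of conc per distinct country, in order of first appearance
idx = pd.Index(pd.unique(d[c]), name=c)
conc = pd.DataFrame(I, index=idx)
# repeating the full country list duplicates each value for every occurrence
dfc = conc.reindex(d[c]).reset_index()
print(dfc)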

In a pandas pivot table, how do I define a function for a subset of data?

I'm working with a dataframe similar to this:
Name    Metric 1  Metric 2  Country
John        0.10      5.00  Canada
Jane        0.50            Canada
Jack        2.00            Canada
Polly       0.30            Canada
Mike                        Canada
Steve                       Canada
Lily        0.15      1.20  Canada
Kate        3.00            Canada
Edward      0.05            Canada
Pete        0.02      0.03  Canada
I am trying to define a function that calculates, for each metric, the percentage of values greater than 1 among the rows that actually have a value for that metric. I expect to get 25% for Metric 1 and 66% for Metric 2. However, my function is returning results based on the total number of rows. Here's my code:
import pandas as pd
import io
df = pd.read_csv(io.BytesIO(data_to_load['metric data.csv']))
df = df.fillna(0)
def metricgreaterthanone(x):
    return (x > 1).sum() / len(x != 0)

pd.pivot_table(df, index=['Country'], values=["Name", "Metric 1", "Metric 2"],
               aggfunc={'Name': pd.Series.nunique,
                        "Metric 1": metricgreaterthanone,
                        "Metric 2": metricgreaterthanone})
The result I get is:
         Metric 1  Metric 2  Name
Country
Canada        0.2       0.2    10
So the function is returning the percentage of all rows that are greater than 1. Any ideas on how to fix this?
It seems that you have empty strings "" instead of numbers. You can try:
def metricgreaterthanone(x):
    n = pd.to_numeric(x, errors="coerce")
    return (n > 1).sum() / n.notna().sum()
x = pd.pivot_table(
    df,
    index=["Country"],
    values=["Name", "Metric 1", "Metric 2"],
    aggfunc={
        "Name": pd.Series.nunique,
        "Metric 1": metricgreaterthanone,
        "Metric 2": metricgreaterthanone,
    },
)
print(x)
Prints:
         Metric 1  Metric 2  Name
Country
Canada       0.25  0.666667    10
x != 0 returns a boolean array, and len() counts all of its elements, not just the True ones. Try:
def metricgreaterthanone(x):
    return (x > 1).sum() / (x != 0).sum()
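For instance, with the Metric 1 column after fillna(0) (values taken from the table above), the corrected denominator counts only the eight rows that originally had a value:
import pandas as pd

s = pd.Series([0.10, 0.50, 2.00, 0.30, 0, 0, 0.15, 3.00, 0.05, 0.02])  # Metric 1 after fillna(0)
print((s > 1).sum() / (s != 0).sum())   # 2 / 8 = 0.25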

Python Pandas groupby with a date column that has different values, returning a dataframe with the date column filled with the latest date

So I have data like this (text version below):
I want to group the rows and sum the values of Month 0 - Month 3; I can achieve that using pandas groupby.
The problem is that the End date column has different values within each group, and I want to take the latest date in the column. For this example, that means I want the End date column to have the value 2020-09-25.
How do I do this with pandas groupby? For your convenience, the variables for the column names are below:
details_columns = [ "Person Name", "Bill rate", "Project ERP","Status", "Assignment", "Engagement Code", "End date"]
sum_columns = ["Month 0", "Month 1", "Month 2", "Month 3"]
I need the return value to be a DataFrame. Hoping anyone can help. Thanks!
Text data:
Person Name Bill rate Project ERP Status Assignment Engagement Code End date Current Month U% Month 1 U% Month 2 U% Month 3 U%
John Doe 3500000 0.58 Chargeable - Standard Project A 21572323 2020-08-22 0 0.5 0.3 0.2
John Doe 3500000 0.58 Chargeable - Standard Project A 21572323 2020-05-22 0.4 0.25 0 0
John Doe 3500000 0.45 Chargeable - Standard Project B 21579528 2020-09-25 0 0.7 0.7 0.7
John Doe 3500000 0.45 Chargeable - Standard Project B 21579528 2020-05-22 0.2 0.12 0 0
John Doe 3500000 0.45 Chargeable - Standard Project B 21579528 2020-04-03 0.1 0 0 0
Create a dictionary d that maps the sum columns to 'sum' and the End date column to 'max', then aggregate with GroupBy.agg; finally, DataFrame.reindex restores the same column order as the original DataFrame:
cols = ["Person Name", "Bill rate", "Project ERP","Status", "Assignment","Engagement Code"]
sum_columns = ["Current Month U%", "Month 1 U%", "Month 2 U%","Month 3 U%"]
d = dict.fromkeys(sum_columns, 'sum')
d["End date"] = 'max'
df1 = df.groupby(cols, as_index=False).agg(d).reindex(df.columns, axis=1)
print (df1)
Person Name Bill rate Project ERP Status Assignment \
0 John Doe 3500000 0.45 Chargeable Standard Project B
1 John Doe 3500000 0.58 Chargeable Standard Project A
Engagement Code End date Current Month U% Month 1 U% Month 2 U% \
0 21579528 2020-09-25 0.3 0.82 0.7
1 21572323 2020-08-22 0.4 0.75 0.3
Month 3 U%
0 0.7
1 0.2
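One caveat: if End date was read in as plain strings, 'max' compares them lexicographically, which happens to pick the latest date for zero-padded YYYY-MM-DD values like these; converting the column to real datetimes first is safer, e.g.:
# convert before grouping so 'max' compares actual dates, not strings
df["End date"] = pd.to_datetime(df["End date"])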

Taking a cross product of Pandas Dataframes

I am attempting to take string data from one dataframe, substitute it with numerical values, and create a cross product of the results as shown below.
I read the data into dataframes; example input is below:
import pandas as pd
# index shoppers by name and stores by store, matching the frames shown below
shoppers = pd.DataFrame({'name': ['bill', 'bob', 'james', 'jill', 'henry'],
                         'item': ['apple', 'apple', 'orange', 'grapes', 'orange']}).set_index('name')
stores = pd.DataFrame({'apple': [0.25, 0.20, 0.18],
                       'orange': [0.30, 0.4, 0.35],
                       'grapes': [1.0, 0.9, 1.1],
                       'store': ['kroger', 'publix', 'walmart']}).set_index('store')
Here's the resulting shoppers dataframe:
item
name
bill apple
bob apple
james orange
jill grapes
henry orange
And here's the resulting stores dataframe:
apple orange grapes
store
kroger 0.25 0.30 1.0
publix 0.20 0.40 0.9
walmart 0.18 0.35 1.1
And the desired result is the price each person would pay for their item at each store.
I'm really struggling to find the right way to make such a transformation in Pandas efficiently. I could easily loop over shoppers and stores and build each row in a brute-force manner, but there must be a more efficient way to do this with the pandas API. Thanks for any suggestions.
Here's a solution, not cross, but dot:
pd.crosstab(shoppers.index, shoppers['item']).dot(stores.T)
Output:
kroger publix walmart
row_0
bill 0.25 0.2 0.18
bob 0.25 0.2 0.18
henry 0.30 0.4 0.35
james 0.30 0.4 0.35
jill 1.00 0.9 1.10
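The crosstab builds a one-hot matrix with one row per shopper and one column per item, and the dot product with stores.T multiplies each indicator by the corresponding price, so every cell is what that shopper's item costs at that store. A roughly equivalent sketch using pd.get_dummies (assuming shoppers is indexed by name and stores by store, as shown above), which also keeps the shopper names as the index instead of crosstab's row_0 label:
# one-hot encode each shopper's item, then matrix-multiply by the price table
one_hot = pd.get_dummies(shoppers['item'])   # rows: shoppers, columns: items (0/1)
prices = one_hot.dot(stores.T)               # labels are aligned before multiplying
print(prices)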

Pandas add value to inner level of hierarchical index

I have a Pandas DataFrame with a hierarchical index (MultiIndex). I created this DataFrame by grouping values for "cousub" and "year".
annualMed = df.groupby(["cousub", "year"])[["ratio", "sr_val_transfer"]].median().round(2)
print(annualMed.head(8))
ratio sr_val_transfer
cousub year
Allen Park city 2013 0.51 75000.0
2014 0.47 85950.0
2015 0.47 95030.0
2016 0.45 102500.0
Belleville city 2013 0.49 113900.0
2014 0.55 114750.0
2015 0.53 149000.0
2016 0.48 121500.0
I would like to add an "overall" value in the "year" level that I could then populate with values based on a grouping of "cousub" alone, i.e., excluding "year". I would like the result to look like the following
ratio sr_val_transfer
cousub year
Allen Park city 2013 0.51 75000.0
2014 0.47 85950.0
2015 0.47 95030.0
2016 0.45 102500.0
Overall 0.50 90000.0
Belleville city 2013 0.49 113900.0
2014 0.55 114750.0
2015 0.53 149000.0
2016 0.48 121500.0
Overall 0.50 135000.0
How can I add this new item to the "year" level of the MultiIndex?
If you want to just add these two rows explicitly, you could specify all the MultiIndex levels with loc.
df.loc[('Allen Park city', 'Overall'), :] = (0.50, 90000.)
df.loc[('Belleville city', 'Overall'), :] = (0.50, 135000.)
If you had a whole list of cities that you wanted to add this row for however, this would be a bit tedious. Maybe you could append another DataFrame with the overall values with a bit of index manipulation.
(df.reset_index()
   .append(pd.DataFrame([['Allen Park city', 'Overall', 0.5, 90000.],
                         ['Belleville city', 'Overall', 0.5, 135000.]],
                        columns=list(df.index.names) + list(df.columns)))
   .set_index(df.index.names)
   .sort_index())
Demo
Method 1 (smaller case)
>>> df.loc[('Allen Park city', 'Overall'), :] = (0.50, 90000.)
>>> df.loc[('Belleville city', 'Overall'), :] = (0.50, 135000.)
>>> df.sort_index()
ratio sr_val_transfer
cousub year
Allen Park city 2013 0.51 75000.0
2014 0.47 85950.0
2015 0.47 95030.0
2016 0.45 102500.0
Overall 0.50 90000.0
Belleville city 2013 0.49 113900.0
2014 0.55 114750.0
2015 0.53 149000.0
2016 0.48 121500.0
Overall 0.50 135000.0
Method 2 (larger case)
>>> (df.reset_index()
...    .append(pd.DataFrame([['Allen Park city', 'Overall', 0.5, 90000.],
...                          ['Belleville city', 'Overall', 0.5, 135000.]],
...                         columns=list(df.index.names) + list(df.columns)))
...    .set_index(df.index.names)
...    .sort_index())
ratio sr_val_transfer
cousub year
Allen Park city 2013 0.51 75000.0
2014 0.47 85950.0
2015 0.47 95030.0
2016 0.45 102500.0
Overall 0.50 90000.0
Belleville city 2013 0.49 113900.0
2014 0.55 114750.0
2015 0.53 149000.0
2016 0.48 121500.0
Overall 0.50 135000.0
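For the general case, where you don't want to type the Overall values by hand, a sketch (assuming df here is the original ungrouped frame from the question and annualMed is its grouped result): compute the overall medians with a groupby on "cousub" alone and concatenate them in.
# medians over all years per cousub, labelled "Overall" in the year level
overall = df.groupby("cousub")[["ratio", "sr_val_transfer"]].median().round(2)
overall["year"] = "Overall"
overall = overall.set_index("year", append=True)   # index becomes (cousub, "Overall")
result = pd.concat([annualMed, overall]).sort_index()   # sorting assumes the year level holds strings; with integer years, convert first
print(result)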
