How to concatenate two dataframes, duplicating some values? - python

I have two dataframes of unequal lengths and want to combine them with a condition: if two rows of df1 are identical, they must share the same value from df2 (without changing the order).
import pandas as pd
d = {'country': ['France', 'France', 'Japan', 'China', 'China', 'Canada', 'Canada', 'India']}
df1 = pd.DataFrame(data=d)
I = {'conc': [0.30, 0.25, 0.21, 0.37, 0.15]}
df2 = pd.DataFrame(data=I)
dfc = pd.concat([df1, df2], axis=1)
My output:
country conc
0 France 0.30
1 France 0.25
2 Japan 0.21
3 China 0.37
4 China 0.15
5 Canada NaN
6 Canada NaN
7 India NaN
Expected output:
country conc
0 France 0.30
1 France 0.30
2 Japan 0.25
3 China 0.21
4 China 0.21
5 Canada 0.37
6 Canada 0.37
7 India 0.15

You need to create a link between the values and the countries first.
df2["country"] = df1["country"].unique()
Then you can use it to merge with your original dataframe.
pd.merge(df1, df2, on="country")
But be aware that this only works as long as the number of values matches the number of unique countries and they appear in the expected order.
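Putting the two steps together, a runnable sketch (reusing the data from the question):
import pandas as pd

d = {'country': ['France', 'France', 'Japan', 'China', 'China', 'Canada', 'Canada', 'India']}
df1 = pd.DataFrame(data=d)
df2 = pd.DataFrame(data={'conc': [0.30, 0.25, 0.21, 0.37, 0.15]})

# One row per distinct country, in order of first appearance
df2["country"] = df1["country"].unique()

# The inner merge broadcasts each country's value to every matching row
# of df1, preserving df1's row order
dfc = pd.merge(df1, df2, on="country")
print(dfc)
This reproduces the expected output above, with each country's single value repeated across its duplicate rows.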

I'd construct the dataframe directly, without intermediate dfs.
d = {'country': ['France', 'France','Japan','China', 'China','Canada','Canada','India']}
I = {'conc': [0.30, 0.25, 0.21, 0.37, 0.15]}
c = 'country'
dfc = pd.DataFrame(I, index=pd.Index(pd.unique(d[c]), name=c)).reindex(d[c]).reset_index()

Related

Which regions have the lowest literacy rates?

I need to group/merge the rows for each city of the same name and calculate its overall percentage, to see which city has the lowest literacy rate.
Code (Python):
import pandas as pd
df = pd.DataFrame({'Cities': ["Cape Town", "Cape Town", "Cape Town", "Tokyo", "Cape Town", "Tokyo", "Mumbai", "Belgium", "Belgium" ],
'LiteracyRate': [0.05, 0.35, 0.2, 0.11, 0.15, 0.2, 0.65, 0.35, 0.45]})
print(df)
For example:
Cities LiteracyRate
0 Cape Town 0.05
1 Cape Town 0.35
2 Cape Town 0.2
3 Tokyo 0.11
4 Cape Town 0.15
5 Tokyo 0.2
6 Mumbai 0.65
7 Belgium 0.35
8 Belgium 0.45
I'm expecting this:
Cities LiteracyRate %LiteracyRate
0 Cape Town 0.75 75
1 Tokyo 0.31 31
2 Mumbai 0.65 65
3 Belgium 0.8 80
So I tried the code below, but it's not giving me the desired results: the cities with the same name are still not merged, and the percentages aren't right.
# Calculate the percentage
df["%LiteracyRate"] = (df["LiteracyRate"]/df["LiteracyRate"].sum())*100
# Show the DataFrame
print(df)
You can use groupby() in pandas to combine cities with the same name, and sum() to add up their rates:
df = df.groupby('Cities').sum()
Then you can format the results using:
df['%LiteracyRate'] = (df['LiteracyRate']*100).round().astype(int)
df = df.reset_index()
To sort them by literacy rate you can:
df = df.sort_values(by='%LiteracyRate')
df = df.reset_index(drop=True)
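Putting it all together (a sketch; sort=False keeps the cities in order of first appearance, which matches the expected output):
import pandas as pd

df = pd.DataFrame({'Cities': ["Cape Town", "Cape Town", "Cape Town", "Tokyo", "Cape Town",
                              "Tokyo", "Mumbai", "Belgium", "Belgium"],
                   'LiteracyRate': [0.05, 0.35, 0.2, 0.11, 0.15, 0.2, 0.65, 0.35, 0.45]})

# Sum the rates per city, then express them as whole percentages
out = df.groupby('Cities', sort=False, as_index=False)['LiteracyRate'].sum()
out['%LiteracyRate'] = (out['LiteracyRate'] * 100).round().astype(int)
print(out)
#       Cities  LiteracyRate  %LiteracyRate
# 0  Cape Town          0.75             75
# 1      Tokyo          0.31             31
# 2     Mumbai          0.65             65
# 3    Belgium          0.80             80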
Hope this helps!

pandas pivot table: custom aggfunc with margins

I am using pandas to create pivot tables. My data usually contains a lot of numeric values that can easily be aggregated with np.mean (e.g. question1), but there is one exception: Net Promoter Score (notice the Total of 0.00 for both EU and NA below).
responseId country region nps question1
0 1 Germany EU 11 3.2
1 2 Germany EU 10 5.0
2 3 US NA 7 4.3
3 4 US NA 5 4.8
4 5 France EU 5 3.2
5 6 France EU 5 5.0
6 7 France EU 11 5.0
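For reference, the sample frame above can be reconstructed like this (a sketch, with the values transcribed from the table):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "responseId": [1, 2, 3, 4, 5, 6, 7],
    "country": ["Germany", "Germany", "US", "US", "France", "France", "France"],
    "region": ["EU", "EU", "NA", "NA", "EU", "EU", "EU"],
    "nps": [11, 10, 7, 5, 5, 5, 11],
    "question1": [3.2, 5.0, 4.3, 4.8, 3.2, 5.0, 5.0],
})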
The pivot table output looks like this:
region       EU                       NA
country      France  Germany  Total   US       Total
nps          -33.33    100.0   0.00  -100.00    0.00
question1      4.40      4.1   4.25     4.55    4.55
For NPS I use a custom aggfunc:
def calculate_nps(column):
    detractors = [1, 2, 3, 4, 5, 6, 7]
    passives = [8, 9]
    promoters = [10, 11]
    counts = column.value_counts(normalize=True)
    percent_promoters = counts.reindex(promoters).sum()
    percent_detractors = counts.reindex(detractors).sum()
    return (percent_promoters - percent_detractors) * 100

aggfunc = {
    "nps": calculate_nps,
    "question1": np.mean,
}

pd.pivot_table(data=df, columns=["region", "country"], values=["nps", "question1"],
               aggfunc=aggfunc, margins=True, margins_name="Total", sort=True)
This aggfunc works fine for regular columns, but fails for the margins ("Total" columns), because pandas passes already-aggregated data. For regular fields calculate_nps receives columns like this:
4 5
5 5
6 11
Name: nps, dtype: int64
but for the margins the data looks like this:
region  country
EU      France    -33.333333
        Germany   100.000000
Name: nps, dtype: float64
calculate_nps cannot deal with such data and returns 0. In this case column.mean() should be applied instead, which I solved like this (notice the if column.index.names != [None] check):
def calculate_nps(column):
    if column.index.names != [None]:
        return column.mean()
    detractors = [1, 2, 3, 4, 5, 6, 7]
    passives = [8, 9]
    promoters = [10, 11]
    counts = column.value_counts(normalize=True)
    percent_promoters = counts.reindex(promoters).sum()
    percent_detractors = counts.reindex(detractors).sum()
    return (percent_promoters - percent_detractors) * 100
Now the pivot table is correct
region EU NA
country France Germany Total US Total
nps -33.33 100.0 33.33 -100.00 -100.00
question1 4.40 4.1 4.25 4.55 4.55
Question
Is there a proper / better way to determine what kind of data is passed to aggfunc? I'm not sure that my solution will work for all scenarios.

In a pandas pivot table, how do I define a function for a subset of data?

I'm working with a dataframe similar to this:
Name    Metric 1  Metric 2  Country
John    0.10      5.00      Canada
Jane    0.50                Canada
Jack    2.00                Canada
Polly   0.30                Canada
Mike                        Canada
Steve                       Canada
Lily    0.15      1.20      Canada
Kate    3.00                Canada
Edward  0.05                Canada
Pete    0.02      0.03      Canada
I am trying to define a function that calculates the percentage of metric values greater than 1, counting only the rows that actually have a value for that metric. I expect to get 25% for Metric 1 and 66% for Metric 2. However, my function is returning results based on the total number of rows. Here's my code:
import pandas as pd
import io

df = pd.read_csv(io.BytesIO(data_to_load['metric data.csv']))
df = df.fillna(0)

def metricgreaterthanone(x):
    return (x > 1).sum() / len(x != 0)

pd.pivot_table(df, index=['Country'], values=["Name", "Metric 1", "Metric 2"],
               aggfunc={'Name': pd.Series.nunique,
                        "Metric 1": metricgreaterthanone,
                        "Metric 2": metricgreaterthanone})
The result I get is:
         Metric 1  Metric 2  Name
Country
Canada        0.2       0.2    10
So the function is returning the percentage of all rows that are greater than 1, not just the rows with values. Any ideas on how to fix this?
It seems that you have empty strings "" instead of numbers. You can try:
def metricgreaterthanone(x):
    n = pd.to_numeric(x, errors="coerce")
    return (n > 1).sum() / n.notna().sum()

x = pd.pivot_table(
    df,
    index=["Country"],
    values=["Name", "Metric 1", "Metric 2"],
    aggfunc={
        "Name": pd.Series.nunique,
        "Metric 1": metricgreaterthanone,
        "Metric 2": metricgreaterthanone,
    },
)
print(x)
print(x)
Prints:
Metric 1 Metric 2 Name
Country
Canada 0.25 0.666667 10
x != 0 returns a boolean array, so len() counts all of its elements rather than the number of True values. Try:
def metricgreaterthanone(x):
    return (x > 1).sum() / (x != 0).sum()
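Plugged back into the original pivot call, this version should also produce 0.25 and 0.666667 (a sketch; it assumes df was already cleaned with fillna(0) so the metric columns are numeric rather than strings):
pd.pivot_table(df, index=['Country'], values=["Name", "Metric 1", "Metric 2"],
               aggfunc={'Name': pd.Series.nunique,
                        "Metric 1": metricgreaterthanone,
                        "Metric 2": metricgreaterthanone})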

Taking a cross product of Pandas Dataframes

I am attempting to take string data from one dataframe, substitute it with numerical values, and create a cross product of the results, as shown below.
I read the data into dataframes; example input below:
import pandas as pd
shoppers = pd.DataFrame({'name': ['bill', 'bob', 'james', 'jill', 'henry'],
                         'item': ['apple', 'apple', 'orange', 'grapes', 'orange']}).set_index('name')
stores = pd.DataFrame({'apple': [0.25, 0.20, 0.18],
                       'orange': [0.30, 0.4, 0.35],
                       'grapes': [1.0, 0.9, 1.1],
                       'store': ['kroger', 'publix', 'walmart']}).set_index('store')
Here's the resulting shoppers dataframe:
item
name
bill apple
bob apple
james orange
jill grapes
henry orange
And here's the resulting stores dataframe:
apple orange grapes
store
kroger 0.25 0.30 1.0
publix 0.20 0.40 0.9
walmart 0.18 0.35 1.1
And the desired result is the price each person would pay at each store.
I'm really struggling to find the right way to make such a transformation in Pandas efficiently. I could easily loop over shoppers and stores and build each row in a brute-force manner, but there must be a more efficient way to do this with the pandas API. Thanks for any suggestions.
Here's a solution: not a cross product, but a dot product:
pd.crosstab(shoppers.index, shoppers['item']).dot(stores.T)
Output:
kroger publix walmart
row_0
bill 0.25 0.2 0.18
bob 0.25 0.2 0.18
henry 0.30 0.4 0.35
james 0.30 0.4 0.35
jill 1.00 0.9 1.10
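To see why the dot product gives the desired result, it helps to look at the intermediate indicator matrix (a sketch, assuming shoppers and stores are indexed by name and store as constructed above):
# crosstab builds a one-hot matrix: one row per shopper, a 1 in the column
# of the item that shopper buys, 0 elsewhere:
#
#        apple  grapes  orange
# bill       1       0       0
# bob        1       0       0
# henry      0       0       1
# ...
onehot = pd.crosstab(shoppers.index, shoppers['item'])

# stores.T has items as rows and stores as columns, so the matrix product
# sums indicator * price over the items, picking each shopper's item price
# at every store
prices = onehot.dot(stores.T)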

Pandas DataFrame compare columns to a threshold column using where()

I need to null out values in several columns where they are smaller in absolute value than the corresponding values in the threshold column.
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'key2': [2000, 2001, 2002, 2001, 2002],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5),
                   'threshold': [0.5, 0.4, 0.6, 0.1, 0.2]}).set_index(['key1', 'key2'])
data1 data2 threshold
key1 key2
Ohio 2000 0.201240 0.083833 0.5
2001 -1.993489 -1.081208 0.4
2002 0.759038 -1.688769 0.6
Nevada 2001 -0.543916 1.412679 0.1
2002 -1.545781 0.181224 0.2
This gives me the error "cannot join with no level specified and no overlapping names":
df.where(df.abs() > df['threshold'])
This works, but obviously only against a scalar:
df.where(df.abs() > 0.5)
data1 data2 threshold
key1 key2
Ohio 2000 NaN NaN NaN
2001 -1.993489 -1.081208 NaN
2002 0.759038 -1.688769 NaN
Nevada 2001 -0.543916 1.412679 NaN
2002 -1.545781 NaN NaN
BTW, this does appear to give me an OK result, but I still want to find out how to do it with the where() method:
df.apply(lambda x: x.where(x.abs() > x['threshold']), axis=1)
Here's a slightly different option using the DataFrame.gt (greater than) method.
df[df.abs().gt(df['threshold'], axis='rows')]
# Output might not look the same because of different random numbers;
# use np.random.seed() for reproducible random number generation
Out[16]:
data1 data2 threshold
key1 key2
Ohio 2000 NaN NaN NaN
2001 1.954543 1.372174 NaN
2002 NaN NaN NaN
Nevada 2001 0.275814 0.854617 NaN
2002 NaN 0.204993 NaN
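For completeness, the same aligned comparison should also work with the where() method the question originally asked about, since gt(..., axis='rows') broadcasts the threshold Series along the row index:
# Keep values whose magnitude exceeds the row's threshold; everything else,
# including the threshold column itself, becomes NaN
df.where(df.abs().gt(df['threshold'], axis='rows'))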
