Pandas merge fails to extract common Index values - python

I'm trying to merge two DataFrames of different sizes, both indexed by 'Country'. The first DataFrame, GDP_EN, contains every country in the world, and the second DataFrame, ScimEn, contains 15 countries.
When I try to merge these DataFrames, instead of merging the columns based on the index countries of ScimEn, I get back 'Country_x' and 'Country_y'. 'Country_x' comes from GDP_EN and holds the first 15 countries in alphabetical order; 'Country_y' holds the 15 countries from ScimEn. I'm wondering why they didn't merge.
I used:
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')

I think both DataFrames are not actually indexed by Country; Country is an ordinary column, so add the parameter on='Country':
GDP_EN = pd.DataFrame({'Country':['USA','France','Slovakia', 'Russia'],
                       'a':[4,8,6,9]})
print (GDP_EN)
Country a
0 USA 4
1 France 8
2 Slovakia 6
3 Russia 9
ScimEn = pd.DataFrame({'Country':['France','Slovakia'],
                       'b':[80,70]})
print (ScimEn)
Country b
0 France 80
1 Slovakia 70
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
print (DF)
Country_x a Country_y b
0 USA 4 France 80
1 France 8 Slovakia 70
DF=pd.merge(GDP_EN,ScimEn,on='Country',how='right')
print (DF)
Country a b
0 France 8 80
1 Slovakia 6 70
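If you need Country back as the index after the column-based merge, you could chain set_index (a small sketch using the frames above):
DF = pd.merge(GDP_EN, ScimEn, on='Country', how='right').set_index('Country')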
If Country is the index on both sides, it works perfectly:
GDP_EN = pd.DataFrame({'Country':['USA','France','Slovakia', 'Russia'],
                       'a':[4,8,6,9]}).set_index('Country')
print (GDP_EN)
a
Country
USA 4
France 8
Slovakia 6
Russia 9
print (GDP_EN.index)
Index(['USA', 'France', 'Slovakia', 'Russia'], dtype='object', name='Country')
ScimEn = pd.DataFrame({'Country':['France','Slovakia'],
                       'b':[80,70]}).set_index('Country')
print (ScimEn)
b
Country
France 80
Slovakia 70
print (ScimEn.index)
Index(['France', 'Slovakia'], dtype='object', name='Country')
DF=pd.merge(GDP_EN,ScimEn,left_index=True,right_index=True,how='right')
print (DF)
a b
Country
France 8 80
Slovakia 6 70
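If only one side carries 'Country' as its index, mixing left_on with right_index (or left_index with right_on) should also work; a minimal sketch assuming ScimEn is indexed and GDP_EN is not:
GDP_EN = pd.DataFrame({'Country':['USA','France','Slovakia', 'Russia'],
                       'a':[4,8,6,9]})
ScimEn = pd.DataFrame({'b':[80,70]},
                      index=pd.Index(['France','Slovakia'], name='Country'))
DF = pd.merge(GDP_EN, ScimEn, left_on='Country', right_index=True, how='right')
#one row per country in ScimEn, with columns Country, a and b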

Related

How can I count # of occurrences of more than one column (e.g. city & country)?

Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city & country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
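If your pandas version predates DataFrame.value_counts (added in 1.1), a groupby-based equivalent should give the same table:
df.groupby(['city', 'country']).size().reset_index(name='n')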

How to keep the values with most frequent prefix in a groupby pandas dataframe?

Let's say I have this dataframe:
Country Market
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 Spain m2_location
4 USA m1_name
5 USA m2_name
6 USA m3_size
7 USA m3_location
I want to group on the "Country" column and keep only the records whose prefix is the most frequent within each group.
The expected result would be :
Country Market
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
I already tried extracting the prefix, getting the mode of the prefix per group, and merging the rows carrying this mode, but I feel that a more direct and more efficient solution exists.
Here is working sample code for reproducible results:
df = pd.DataFrame({
    "Country": ["Spain","Spain","Spain","Spain","USA","USA","USA","USA"],
    "City": ["m1_name","m1_location","m1_size","m2_location","m1_name","m2_name","m3_size","m3_location"]
})
df['prefix'] = df['City'].str[1]
modes = df.groupby('Country')['prefix'].agg(pd.Series.mode).rename("modes")
df = df.merge(modes, how="right", left_on=['Country','prefix'], right_on=['Country',"modes"])
df = df.drop(['modes','prefix'], axis = 1)
print(df)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
You can try groupby and apply to filter the rows of each group:
out = (df.assign(prefix=df['City'].str.split('_').str[0])
         .groupby('Country')
         .apply(lambda g: g[g['prefix'].isin(g['prefix'].mode())])
         .reset_index(drop=True)
         .drop('prefix', axis=1))
print(out)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
Use:
In [575]: df['Prefix_count'] = df.groupby(['Country', df.City.str.split('_').str[0]])['City'].transform('size')
In [589]: idx = df.groupby('Country')['Prefix_count'].transform('max') == df['Prefix_count']
In [593]: df[idx].drop('Prefix_count', axis=1)
Out[593]:
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
An interesting fact about the solutions proposed above is that Mayank's one is way faster. I ran them on 1,000 rows of my data and got:
Mayank's solution: 0.020 seconds
Ynjxsjmh's solution: 0.402 seconds
My (OP) solution: 0.122 seconds
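For reference, a rough way to reproduce such timings; the 1,000-row frame below is made up for illustration and is not the OP's real data:
import time
import numpy as np
import pandas as pd

#hypothetical data with the same Country/City shape as the question
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Country': rng.choice(['Spain', 'USA', 'France', 'Italy'], size=1000),
    'City': ['m%d_name' % i for i in rng.integers(1, 5, size=1000)],
})

start = time.perf_counter()
df['Prefix_count'] = df.groupby(['Country', df.City.str.split('_').str[0]])['City'].transform('size')
idx = df.groupby('Country')['Prefix_count'].transform('max') == df['Prefix_count']
out = df[idx].drop('Prefix_count', axis=1)
print('elapsed: %.3f s' % (time.perf_counter() - start))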

Replace the values in a column based on frequency

I have a dataframe (3.7 million rows) with a column containing different country names:
id Country
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 INDIA
6 USA
7 USA
8 ITALY
9 USA
10 RUSSIA
I want to replace INDIA and ITALY with "Miscellanous" because each occurs in less than 15% of the rows.
My alternative so far is to replace the names with their frequency using
df.column_name = df.column_name.map(df.column_name.value_counts())
Use:
df.loc[df.groupby('Country')['id']
         .transform('size')
         .div(len(df))
         .lt(0.15),
       'Country'] = 'Miscellanous'
Or
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
                                       .lt(0.15)),
       'Country'] = 'Miscellanous'
If you want to put all countries whose frequency is less than a threshold into the "Misc" category:
threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
df['Country'].map(mappings)
Here is another option:
s = df['Country'].value_counts()
s = s/s.sum()
s = s.loc[s<.15]
df = df.replace(s.index.tolist(),'Miscellanous')
You can use a dictionary and map for this:
d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x: x if d[x] > 0.15 else 'Miscellanous')
Output:
id
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 Miscellanous
6 USA
7 USA
8 Miscellanous
9 USA
10 RUSSIA
Name: Country, dtype: object
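A further variant, in case you prefer to avoid .loc assignment: Series.where keeps the values where the condition holds and substitutes the rest (a sketch reusing the 0.15 threshold and the 'Miscellanous' label from above):
freq = df['Country'].value_counts(normalize=True)
df['Country'] = df['Country'].where(df['Country'].map(freq) >= 0.15, 'Miscellanous')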

how to merge a multiple of rows into one row and name it in Pandas?

I have a dataframe:
age sex country
25 m USA
30 f Canada
65 f china
42 m Indonesia
32 f mexico
I want to convert the country to 2 categories and then generate 2 columns of dummy variables:
North America = (USA, Canada, Mexico)
Asia = (China, Indonesia)
You can make a single column named continent and get your result:
df = pd.DataFrame(data = {'age':[25,23,26], 'sex':['m','f','f'],
                          'country':['mexico','china','usa']})
north_america = ['usa','mexico','canada']
asia = ['china','indonesia']
def change(country):
    if country in north_america:
        return "North America"
    elif country in asia:
        return "Asia"
df['continent'] = df['country'].apply(change)
df
Output
age sex country continent
0 25 m mexico North America
1 23 f china Asia
2 26 f usa North America
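Since the question also asks for dummy variables, pd.get_dummies on the new column should give one indicator column per continent (a sketch building on the df above):
dummies = pd.get_dummies(df['continent'])
df = pd.concat([df, dummies], axis=1)
#adds indicator columns 'Asia' and 'North America' alongside 'continent'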

delete part of a row in pandas / shift up part of a row? Align Column Headings

So I have a data frame where the headings I want do not currently line up:
In [1]: df = pd.read_excel('example.xlsx')
print (df.head(10))
Out [1]: Portfolio Asset Country Quantity
Unique Identifier Number of fund B24 B65 B35 B44
456 2 General Type A UNITED KINGDOM 1
123 3 General Type B US 2
789 2 General Type C UNITED KINGDOM 4
4852 4 General Type C UNITED KINGDOM 4
654 1 General Type A FRANCE 3
987 5 General Type B UNITED KINGDOM 2
321 1 General Type B GERMANY 1
951 3 General Type A UNITED KINGDOM 2
357 4 General Type C UNITED KINGDOM 3
As we can see, above the first 2 column headings there are 2 blank cells, and below the next 4 column headings are "B" numbers which I don't care about.
So, 2 questions: how can I shift up the first 2 columns without having a column heading to identify them by (due to the blank cells above)?
And how can I delete just row 2 of the remaining columns and have the data below move up to take the place of the "B" numbers?
I found some similar questions already asked, e.g. "python: shift column in pandas dataframe up by one", but nothing that solves the particular intricacies above, I don't think.
Also I'm quite new to Python and Pandas, so if this is really basic I apologise!
IIUC you can use:
#create df from multiindex in columns
df1 = pd.DataFrame([x for x in df.columns.values])
print (df1)
0 1
0 Unique Identifier
1 Number of fund
2 Portfolio B24
3 Asset B65
4 Country B35
5 Quantity B44
#if len of string < 4, give value from column 0 to column 1
df1.loc[df1.iloc[:,1].str.len() < 4, 1] = df1.iloc[:,0]
print (df1)
0 1
0 Unique Identifier
1 Number of fund
2 Portfolio Portfolio
3 Asset Asset
4 Country Country
5 Quantity Quantity
#set columns from the second column of df1
df.columns = df1.iloc[:,1]
print (df)
0 Unique Identifier Number of fund Portfolio Asset Country \
0 456 2 General Type A UNITED KINGDOM
1 123 3 General Type B US
2 789 2 General Type C UNITED KINGDOM
3 4852 4 General Type C UNITED KINGDOM
4 654 1 General Type A FRANCE
5 987 5 General Type B UNITED KINGDOM
6 321 1 General Type B GERMANY
7 951 3 General Type A UNITED KINGDOM
8 357 4 General Type C UNITED KINGDOM
0 Quantity
0 1
1 2
2 4
3 4
4 3
5 2
6 1
7 2
8 3
EDIT by comments:
print (df.columns)
Index(['Portfolio', 'Asset', 'Country', 'Quantity'], dtype='object')
#set first row by columns names
df.iloc[0,:] = df.columns
#reset_index
df = df.reset_index()
#set columns from first row
df.columns = df.iloc[0,:]
df.columns.name = None
#remove first row
print (df.iloc[1:,:])
Unique Identifier Number of fund Portfolio Asset Country Quantity
1 456 2 General Type A UNITED KINGDOM 1
2 123 3 General Type B US 2
3 789 2 General Type C UNITED KINGDOM 4
4 4852 4 General Type C UNITED KINGDOM 4
5 654 1 General Type A FRANCE 3
6 987 5 General Type B UNITED KINGDOM 2
7 321 1 General Type B GERMANY 1
8 951 3 General Type A UNITED KINGDOM 2
9 357 4 General Type C UNITED KINGDOM 3
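Another route, if you can re-read the file, is to let read_excel build a two-row MultiIndex header and then collapse it yourself; a sketch assuming the B-codes sit in the second header row exactly as shown in the question:
df = pd.read_excel('example.xlsx', header=[0, 1])
#keep the top level where it is a real name, otherwise fall back to the bottom level
df.columns = [top if not str(top).startswith('Unnamed') else bottom
              for top, bottom in df.columns]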
