Extract country name from text in column to create another column - python

I have tried different combinations to extract the country names from a column and create a new column with solely the countries. I can do it for selected rows i.e. df.address[9998] but not for the whole column.
import pycountry
Cntr = []
for country in pycountry.countries:
    for country.name in df.address:
        Cntr.append(country.name)
Any ideas what is going wrong here?
edit:
address is an object in the df and
df.address[:10] looks like this
Address
0 Turin, Italy
1 NaN
2 Zurich, Switzerland
3 NaN
4 Glyfada, Greece
5 Frosinone, Italy
6 Dublin, Ireland
7 NaN
8 Turin, Italy
8 ...
9 Kristiansand, Norway
Name: address, Length: 10, dtype: object
Based on Petar's response, when I run individual queries I get the country correctly, but when I try to create a column with all the countries (or a range like df.address[:5]) I get an empty Cntr.
import pycountry
Cntr = []
for country in pycountry.countries:
    if country.name in df['address'][1]:
        Cntr.append(country.name)
Cntr
Returns
[Italy]
and df.address[2] returns [ ]
etc.
I have also run
df['address'] = df['address'].astype('str')
to make sure that there are no floats or int in the column.

Sample dataframe
df = pd.DataFrame({'address': ['Turin, Italy', np.nan, 'Zurich, Switzerland', np.nan, 'Glyfada, greece']})
# Split on ', ' (comma plus space) so the country has no leading space; n=1 yields at most two columns
df[['city', 'country']] = df['address'].str.split(', ', expand=True, n=1)
               address     city      country
0         Turin, Italy    Turin        Italy
1                  NaN      NaN          NaN
2  Zurich, Switzerland   Zurich  Switzerland
3                  NaN      NaN          NaN
4      Glyfada, greece  Glyfada       greece

You were really close. We cannot loop like this for country.name in df.address. Instead:
import pycountry
Cntr = []
for country in pycountry.countries:
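    # Note: this test works when df.address is a single string, e.g. df.address[0];
    # on a whole Series, `in` checks the index labels, not the values.
    if country.name in df.address:
        Cntr.append(country.name)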
If this does not work, please supply more information because I am unsure what df.address looks like.
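To fill a whole column rather than test one string, the same substring check can be applied row by row. The following is only a sketch: `country_names` is a short hard-coded stand-in for `[c.name for c in pycountry.countries]`, and `find_country` is a hypothetical helper name, not something from the question.

```python
import numpy as np
import pandas as pd

# Stand-in for [c.name for c in pycountry.countries]; shortened for illustration
country_names = ["Greece", "Ireland", "Italy", "Norway", "Switzerland"]

def find_country(address):
    """Return the first country name found inside address, or NaN."""
    if not isinstance(address, str):  # NaN is a float, so skip it
        return np.nan
    for name in country_names:
        if name in address:
            return name
    return np.nan

df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland",
                               np.nan, "Glyfada, Greece"]})
df["country"] = df["address"].apply(find_country)
print(df)
```

The match is case-sensitive, so an entry such as 'Glyfada, greece' would come back as NaN unless both sides are normalized first.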

You can use the function clean_country() from the library DataPrep. Install it with pip install dataprep.
from dataprep.clean import clean_country
df = pd.DataFrame({"address": ["Turin, Italy", np.nan, "Zurich, Switzerland", np.nan, "Glyfada, Greece"]})
df2 = clean_country(df, "address")
df2
               address address_clean
0         Turin, Italy         Italy
1                  NaN           NaN
2  Zurich, Switzerland   Switzerland
3                  NaN           NaN
4      Glyfada, Greece        Greece

Related

How to keep the values with most frequent prefix in a groupby pandas dataframe?

Let's say I have this dataframe :
Country Market
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 Spain m2_location
4 USA m1_name
5 USA m2_name
6 USA m3_size
7 USA m3_location
I want to group on the "Country" column and keep only the records whose prefix is the most frequent within each group.
The expected result would be :
Country Market
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
I already tried extracting the prefix, then getting the mode of the prefix on the dataframe and merging rows with this mode, but I feel that a more direct and more efficient solution exists.
Here is the working sample code below for reproducible results :
df = pd.DataFrame({
    "Country": ["Spain","Spain","Spain","Spain","USA","USA","USA","USA"],
    "City": ["m1_name","m1_location","m1_size","m2_location","m1_name","m2_name","m3_size","m3_location"]
})
df['prefix'] = df['City'].str[1]
modes = df.groupby('Country')['prefix'].agg(pd.Series.mode).rename("modes")
df = df.merge(modes, how="right", left_on=['Country','prefix'], right_on=['Country',"modes"])
df = df.drop(['modes','prefix'], axis = 1)
print(df)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
You can try groupby and apply to filter group rows
out = (df.assign(prefix=df['City'].str.split('_').str[0])
         .groupby('Country')
         .apply(lambda g: g[g['prefix'].isin(g['prefix'].mode())])
         .reset_index(drop=True)
         .drop('prefix',axis=1))
print(out)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
Use:
df['Prefix_count'] = df.groupby(['Country', df.City.str.split('_').str[0]])['City'].transform('size')
idx = df.groupby('Country')['Prefix_count'].transform('max') == df['Prefix_count']
df[idx].drop('Prefix_count', axis=1)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
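As a variant of the transform-based approach, the same filter can be written without adding a helper column to df. This is a sketch on the sample data only; the names `prefix`, `counts` and `keep` are introduced here for illustration. If two prefixes tie for most frequent within a country, rows for both are kept.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Spain", "Spain", "USA", "USA", "USA", "USA"],
    "City": ["m1_name", "m1_location", "m1_size", "m2_location",
             "m1_name", "m2_name", "m3_size", "m3_location"],
})

# Text before the first underscore, e.g. "m1"
prefix = df["City"].str.split("_").str[0]
# Number of rows sharing each (Country, prefix) pair, aligned to df's index
counts = df.groupby(["Country", prefix])["City"].transform("size")
# Keep rows whose pair count equals the largest count within the country
keep = counts == counts.groupby(df["Country"]).transform("max")
out = df[keep]
print(out)
```

Because the mask is aligned to the original index, the surviving rows keep their original labels.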
An interesting fact about the solutions proposed in the answers is that Mayank's is by far the fastest. On 1000 rows of my data I measured:
Mayank's solution : 0.020 seconds
Ynjxsjmh's solution : 0.402 seconds
My (OP) solution : 0.122 seconds

Replace the values in a column based on frequency

I have a dataframe (3.7 million rows) with a column with different country names
id Country
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 INDIA
6 USA
7 USA
8 ITALY
9 USA
10 RUSSIA
I want to replace INDIA and ITALY with "Miscellanous" because each accounts for less than 15% of the column.
My alternative idea is to replace the names with their frequencies using
df.column_name = df.column_name.map(df.column_name.value_counts())
Use:
df.loc[df.groupby('Country')['id']
         .transform('size')
         .div(len(df))
         .lt(0.15),
       'Country'] = 'Miscellanous'
Or
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
                                      .lt(0.15)),
       'Country'] = 'Miscellanous'
If you want to put every country whose frequency is below a threshold into the "Misc" category:
threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
df['Country'].map(mappings)
Here is another option
s = df['Country'].value_counts(normalize=True)  # share of rows per country
rare = s.loc[s < .15].index.tolist()            # countries below the 15% threshold
df['Country'] = df['Country'].replace(rare, 'Miscellanous')
You can use dictionary and map for this:
d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x : x if d[x] > 0.15 else 'Miscellanous' )
Output:
id
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 Miscellanous
6 USA
7 USA
8 Miscellanous
9 USA
10 RUSSIA
Name: Country, dtype: object
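For a frame with millions of rows, a fully vectorized variant avoids the per-row lambda. This is a sketch on the sample data; the column name `Country_grouped` is introduced here only so the original column stays untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "id": range(1, 11),
    "Country": ["RUSSIA", "USA", "RUSSIA", "RUSSIA", "INDIA",
                "USA", "USA", "ITALY", "USA", "RUSSIA"],
})

threshold = 0.15
# Share of rows for each row's own country, aligned to df's index
share = df["Country"].map(df["Country"].value_counts(normalize=True))
# Keep the name where the share reaches the threshold, otherwise replace it
df["Country_grouped"] = df["Country"].where(share >= threshold, "Miscellanous")
print(df)
```

Here a country at exactly 15% is kept; use `>` instead of `>=` if it should be replaced.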

Forward fill or back fill NaN values in Pandas columns based on grouping of other columns

I have a dataframe as below:
import pandas as pd
df = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
                   'Region':['Americas','NaN','NaN','Asia','Europe','NaN','NaN'],
                   'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                   'Animal':['Bison','NaN','Golden Eagle','Tiger','Lion','Lion','NaN'],
                   'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
I want to group by Country and Flower and forward fill or backward fill the columns Region and Animal where there are missing values. However, the Game column should remain intact.
I have tried this but it didn't work:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
also :
df.groupby(['Country','Flower'])['Animal', 'Region'].isna().bfill()
I want to know how to go about with this.
This works, but it drops the Game column:
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Animal', 'Region'].bfill().ffill()
If I use transform instead, I get a length mismatch. Please also note that this is a sample dataframe in which I typed "NaN" as a string; in the original frame the values are np.nan.
If you change your dataframe code to include actual np.nan values, the code you provided works. Missing values are displayed as NaN, but typing 'NaN' by hand creates an ordinary string, not a missing value.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
                   'Region':['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
                   'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
                   'Animal':['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
                   'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
Then, this:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
actually yields this:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 NaN USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 NaN UK Dandelion cricket Europe
First you need to know 'NaN' is not NaN
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Region'].ffill()
Out[109]:
0 Americas
1 Americas
2 NaN  # this group has only a single row, so it stays NaN
3 Asia
4 Europe
5 Europe
6 Europe
Name: Region, dtype: object
Second, to chain two fill methods within each group, use apply:
df.update(df.groupby(['Country','Flower'], group_keys=False)[['Animal', 'Region']].apply(lambda x : x.bfill().ffill()))
df
Out[119]:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 Bison USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 Lion UK Dandelion cricket Europe
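The group-wise fill can also be written with transform, which returns a result aligned to the original index and so leaves Game and every other column in place. A minimal sketch, assuming the missing values are real np.nan; the name `cols` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "USA", "MEX", "IND", "UK", "UK", "UK"],
    "Region": ["Americas", np.nan, np.nan, "Asia", "Europe", np.nan, np.nan],
    "Flower": ["Rose", "Rose", "Lily", "Orchid", "Dandelion", "Dandelion", "Dandelion"],
    "Animal": ["Bison", np.nan, "Golden Eagle", "Tiger", "Lion", "Lion", np.nan],
    "Game": ["Baseball", "Baseball", "soccer", "hockey", "cricket", "cricket", "cricket"],
})

cols = ["Region", "Animal"]
# Forward fill, then backward fill, separately inside each (Country, Flower) group
df[cols] = (df.groupby(["Country", "Flower"])[cols]
              .transform(lambda s: s.ffill().bfill()))
print(df)
```

The MEX/Lily row keeps NaN in Region because its group contains no value to fill from.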
Because MEX and Lily form a single-row group whose Region is NaN, fillna has no value in that group to fill from.
If we catch the exception raised when taking the group mode, values in groups with nothing to fill from are left as they are. A final ffill and bfill over the whole frame then covers those remaining values.
df_stack = pd.DataFrame({
    'Country': ['USA','USA','MEX','IND','UK','UK','UK'],
    'Region': ['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
    'Flower': ['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
    'Animal': ['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],
    'Game': ['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
print("-------Before imputation------")
print(df_stack)
def fillna_Region(grp):
    try:
        # Fill gaps with the group's most frequent value
        return grp.fillna(grp.mode()[0])
    except Exception as e:
        # An all-NaN group has no mode; return it unchanged
        print('Error as no corresponding group: ' + str(e))
        return grp
df_stack["Region"] = df_stack["Region"].fillna(
    df_stack.groupby(['Country','Flower'])['Region'].transform(fillna_Region))
df_stack["Animal"] = df_stack["Animal"].fillna(
    df_stack.groupby(['Country','Flower'])['Animal'].transform(fillna_Region))
# Fall back to a plain forward/backward fill for anything still missing
df_stack = df_stack.ffill(axis=0)
df_stack = df_stack.bfill(axis=0)
print("-------After imputation------")
print(df_stack)

Problem with New Column in Pandas Dataframe

I have a dataframe and I'm trying to create a new column of values that is one column divided by the other. This should be obvious but I'm only getting 0's and 1's as my output.
I also tried converting the output to float in case the output was somehow being rounded off but that didn't change anything.
def answer_seven():
    df = answer_one()
    columns_to_keep = ['Self-citations', 'Citations']
    df = df[columns_to_keep]
    df['ratio'] = df['Self-citations'] / df['Citations']
    return df
answer_seven()
Output:
Self_cite Citations ratio
Country
Aus. 15606 90765 0
Brazil 14396 60702 0
Canada 40930 215003 0
China 411683 597237 1
France 28601 130632 0
Germany 27426 140566 0
India 37209 128763 0
Iran 19125 57470 0
Italy 26661 111850 0
Japan 61554 223024 0
S Korea 22595 114675 0
Russian 12422 34266 0
Spain 23964 123336 0
Britain 37874 206091 0
America 265436 792274 0
Does anyone know why I'm only getting 1's and 0's when I want float values? I tried the solutions given in the link suggested and none of them worked. I've tried to convert the values to floats using a few different methods including .astype('float'), float(df['A']) and df['ratio'] = df['Self-citations'] * 1.0 / df['Citations']. But none have worked so far.
Without having the exact dataframe it is difficult to say. But it is most likely a casting problem.
Let's build an MCVE:
import io
import pandas as pd
s = io.StringIO("""Country;Self_cite;Citations
Aus.;15606;90765
Brazil;14396;60702
Canada;40930;215003
China;411683;597237
France;28601;130632
Germany;27426;140566
India;37209;128763
Iran;19125;57470
Italy;26661;111850
Japan;61554;223024
S. Korea;22595;114675
Russian;12422;34266
Spain;23964;123336
Britain;37874;206091
America;265436;792274""")
df = pd.read_csv(s, sep=';', header=0).set_index('Country')
Then we can perform the desired operation as you suggested:
df['ratio'] = df['Self_cite']/df['Citations']
Checking dtypes:
df.dtypes
Self_cite int64
Citations int64
ratio float64
dtype: object
The result is:
Self_cite Citations ratio
Country
Aus. 15606 90765 0.171939
Brazil 14396 60702 0.237159
Canada 40930 215003 0.190369
China 411683 597237 0.689313
France 28601 130632 0.218943
Germany 27426 140566 0.195111
India 37209 128763 0.288973
Iran 19125 57470 0.332782
Italy 26661 111850 0.238364
Japan 61554 223024 0.275997
S. Korea 22595 114675 0.197035
Russian 12422 34266 0.362517
Spain 23964 123336 0.194299
Britain 37874 206091 0.183773
America 265436 792274 0.335031
Graphically:
df['ratio'].plot(kind='bar')
If you want to enforce type, you can cast dataframe using astype method:
df.astype(float)
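The question does not show what answer_one() returns, so the cause cannot be confirmed from the post. One possibility is integer division on integer columns, which is how `/` behaved under Python 2 without `from __future__ import division`. The sketch below uses two rows from the question and makes the division explicitly floating point; `num` and `den` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Self-citations": [15606, 411683], "Citations": [90765, 597237]},
    index=pd.Index(["Aus.", "China"], name="Country"),
)

# to_numeric guards against columns that were read in as strings (object dtype);
# astype(float) makes sure the division cannot be an integer division
num = pd.to_numeric(df["Self-citations"]).astype(float)
den = pd.to_numeric(df["Citations"]).astype(float)
df["ratio"] = num / den

print(df.dtypes)
print(df)
```

Checking df.dtypes before and after the division is a quick way to see whether the inputs were integers, floats or strings.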

Find percentile in pandas dataframe based on groups

Season Name value
2001 arkansas 3.497
2002 arkansas 3.0935
2003 arkansas 3.3625
2015 arkansas 3.766
2001 colorado 2.21925
2002 colorado 1.4795
2010 colorado 2.89175
2011 colorado 2.48825
2012 colorado 2.08475
2013 colorado 1.68125
2014 colorado 2.5555
2015 colorado 2.48825
In the dataframe above, I want to identify top and bottom 10 percentile values in column value for each state (arkansas and colorado). How do I do that? I can identify top and bottom percentile for entire value column like so:
np.searchsorted(np.percentile(a, [10, 90]), a)
You can use groupby + quantile:
df.groupby('Name')['value'].quantile([.1, .9])
Name
arkansas 0.1 3.174200
0.9 3.685300
colorado 0.1 1.620725
0.9 2.656375
Name: value, dtype: float64
And then call np.searchsorted.
Alternatively, use qcut.
df.groupby('Name').apply(lambda x:
    pd.qcut(x['value'], [.1, .9]))
Name
arkansas 0 (3.173, 3.685]
1 NaN
2 (3.173, 3.685]
3 NaN
colorado 4 (1.62, 2.656]
5 NaN
6 NaN
7 (1.62, 2.656]
8 (1.62, 2.656]
9 (1.62, 2.656]
10 (1.62, 2.656]
11 (1.62, 2.656]
Name: value, dtype: object
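To mark the rows themselves instead of returning the cut points, each group's quantiles can be broadcast back onto the rows with transform. A sketch on the data from the question; the column names `bottom_10` and `top_10` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Season": [2001, 2002, 2003, 2015, 2001, 2002, 2010, 2011, 2012, 2013, 2014, 2015],
    "Name": ["arkansas"] * 4 + ["colorado"] * 8,
    "value": [3.497, 3.0935, 3.3625, 3.766, 2.21925, 1.4795,
              2.89175, 2.48825, 2.08475, 1.68125, 2.5555, 2.48825],
})

grouped = df.groupby("Name")["value"]
# Per-state 10th and 90th percentiles, repeated for every row of that state
low = grouped.transform(lambda s: s.quantile(0.1))
high = grouped.transform(lambda s: s.quantile(0.9))

df["bottom_10"] = df["value"] < low
df["top_10"] = df["value"] > high
print(df[df["bottom_10"] | df["top_10"]])
```

The flagged rows are the same ones that show NaN in the qcut output, since they fall outside the (10%, 90%] interval.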
If your dataframe is named df, this should work. I'm not sure what you want your output to look like, so I build a dictionary in which each key is a state. Since you have very few values, I used "nearest" for the interpolation argument (the default is "linear"; in NumPy 1.22 and later this argument is called method). To see the possible options, check the documentation for np.percentile.
import pandas as pd
import numpy as np
df = pd.read_csv('stacktest.csv')
#array of unique state names from the dataframe
states = np.unique(df['Name'])
#empty dictionary
state_data = dict()
for state in states:
    state_data[state] = np.percentile(df[df['Name'] == state]['value'],[10,90],interpolation = 'nearest')
print(state_data)
