My code looks like:
import pandas as pd
df = pd.read_excel("Energy Indicators.xls", header=None, footer=None)
c_df = df.copy()
c_df = c_df.iloc[18:245, 2:]
c_df = c_df.rename(columns={2: 'Country', 3: 'Energy Supply', 4:'Energy Supply per Capita', 5:'% Renewable'})
c_df['Energy Supply'] = c_df['Energy Supply'].apply(lambda x: x*1000000)
print(c_df)
c_df = c_df.loc[c_df['Country'] == ('Korea, Rep.')] = 'South Korea'
When I run it, I get the error "'str' has no attribute 'loc'". It seems like it is telling me that I can't use loc on the dataframe. All I want to do is replace the value so if there is an easier way, I am all ears.
Just do
c_df.loc[c_df['Country'] == ('Korea, Rep.')] = 'South Korea'
instead of
c_df = c_df.loc[c_df['Country'] == ('Korea, Rep.')] = 'South Korea'
I would suggest using df.replace:
df = df.replace({'c_df':{'Korea, Rep.':'South Korea'}})
The code above replaces Korea, Rep. with South Korea only in the column c_df. Take a look at the df.replace documentation, which explains the nested dictionary syntax I used above as :
Nested dictionaries, e.g., {‘a’: {‘b’: nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with nan. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.
Example:
# Original dataframe:
>>> df
c_df whatever
0 Korea, Rep. abcd
1 x abcd
2 Korea, Rep. abcd
3 y abcd
# After df.replace:
>>> df
c_df whatever
0 South Korea abcd
1 x abcd
2 South Korea abcd
3 y abcd
Related
I have a pandas dataset like below:
import pandas as pd
data = {'id': ['001', '002', '003'],
'address': ["William J. Clare\n290 Valley Dr.\nCasper, WY 82604\nUSA, United States",
"1180 Shelard Tower\nMinneapolis, MN 55426\nUSA, United States",
"William N. Barnard\n145 S. Durbin\nCasper, WY 82601\nUSA, United States"]
}
df = pd.DataFrame(data)
print(df)
I need to convert address column to text delimited by \n and create new columns like name, address line 1, City, State, Zipcode, Country like below:
id Name addressline1 City State Zipcode Country
1 William J. Clare 290 Valley Dr. Casper WY 82604 United States
2 null 1180 Shelard Tower Minneapolis MN 55426 United States
3 William N. Barnard 145 S. Durbin Casper WY 82601 United States
I am learning python and from morning I am solving this. Any help will be greatly appreciated.
Thanks,
Right now, Pandas is returning you the table with 2 columns. If you look at the value in the second column, the essential information is separated with the comma. Therefore, if you saved your dataframe to df you can do the following:
df['address_and_city'] = df['address'].apply(lambda x: x.split(',')[0])
df['state_and_postal'] = df['address'].apply(lambda x: x.split(',')[1])
df['country'] = df['address'].apply(lambda x: x.split(',')[2])
Now, you have additional three columns in your dataframe, the last one contains the full information about the country already. Now from the first two columns that you have created you can extract the info you need in a similar way.
df['address_first_line'] = df['address_and_city'].apply(lambda x: ' '.join(x.split('\n')[:-1]))
df['city'] = df['address_and_city'].apply(lambda x: x.split('\n')[-1])
df['state'] = df['state_and_postal'].apply(lambda x: x.split(' ')[1])
df['postal'] = df['state_and_postal'].apply(lambda x: x.split(' ')[2].split('\n')[0])
Now you should have all the columns you need. You can remove the excess columns with:
df.drop(columns=['address','address_and_city','state_and_postal'], inplace=True)
Of course, it all can be done faster and with fewer lines of code, but I think it is the clearest way of doing it, which I hope you will find useful. If you don't understand what I did there, check the documentation for split and join methods, and also for apply method, native to pandas.
I am working on the following dataset: https://drive.google.com/file/d/1UVgSfIO-46aLKHeyk2LuKV6nVyFjBdWX/view?usp=sharing
I am trying to replace the countries in the "Nationality" column whose value_counts() are less than 450 with the value of "Others".
def collapse_category(df):
df.loc[df['Nationality'].map(df['Nationality'].value_counts(normalize=True)
.lt(450)), 'Nationality'] = 'Others'
print(df['Nationality'].unique())
This is the code I used but it returns the result as this: ['Others']
Here is the link to my notebook for reference: https://colab.research.google.com/drive/1MfwwBfi9_4E1BaZcPnS7KJjTy8xVsgZO?usp=sharing
Use boolean indexing:
s = df['Nationality'].value_counts()
df.loc[df['Nationality'].isin(s[s<450].index), 'Nationality'] = 'Others'
New value_counts after the change:
FRA 12307
PRT 11382
DEU 10164
GBR 8610
Others 5354
ESP 4864
USA 3398
... ...
FIN 632
RUS 578
ROU 475
Name: Nationality, dtype: int64
value_filter = df.Nationality.value_counts().lt(450)
temp_dict = value_filter[value_filter == False].replace({False: "others"}).to_dict()
df = df.replace(temp_dict)
In general, the third line will look up the entire df rather than a particular column. But the above code will work for you.
I have a DataFrame that has a column that I need to search using a wildcard. I tried this:
df = pd.read_excel('CHQ REG.xlsx',index=False)
df.sort_values(['CheckNumber'], inplace=True)
df[df.CheckNumber.str.match('888')]
df
This returns everything in by df
Here is my goal:
CheckBranch CheckNumber
Lebanon 8880121
Sample:
CheckBranch CheckNumber
Texas 4782436
Georgia 8967462
Lebanon 8880121
China 8947512
Try:
res = df[df['CheckNumber'].astype('string').str.match('888')]
print(res)
Output
CheckBranch CheckNumber
2 Lebanon 8880121
As an alternative:
res = df[df['CheckNumber'].astype('string').str.startswith('888')]
I have a dataframe which has some duplicate tags separated by commas in the "Tags" column, is there a way to remove the duplicate strings from the series. I want the output in 400 to have just Museum, Drinking, Shopping.
I can't split on a comma & remove them because there are some tags in the series that have similar words like for example: [Museum, Art Museum, Shopping] so splitting and dropping multiple museum strings would affect the unique 'Art Museum' string.
Desired Output
You can split by comma and convert to a set(),which removes duplicates, after removing leading/trailing white space with str.strip(). Then, you can df.apply() this to your column.
df['Tags']=df['Tags'].apply(lambda x: ', '.join(set([y.strip() for y in x.split(',')])))
You can create a function that removes duplicates from a given string. Then apply this function to your column Tags.
def remove_dup(strng):
'''
Input a string and split them
'''
return ', '.join(list(dict.fromkeys(strng.split(', '))))
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
DEMO:
import pandas as pd
my_dict = {'Tags':["Museum, Art Museum, Shopping, Museum",'Drink, Drink','Shop','Visit'],'Country':['USA','USA','USA', 'USA']}
df = pd.DataFrame(my_dict)
df['Tags'] = df['Tags'].apply(lambda x: remove_dup(x))
df
Output:
Tags Country
0 Museum, Art Museum, Shopping USA
1 Drink USA
2 Shop USA
3 Visit USA
Without some code example, I've thrown together something that would work.
import pandas as pd
test = [['Museum', 'Art Museum', 'Shopping', "Museum"]]
df = pd.DataFrame()
df[0] = test
df[0]= df.applymap(set)
Out[35]:
0
0 {Museum, Shopping, Art Museum}
One approach that avoids apply
# in your code just s = df['Tags']
s = pd.Series(['','', 'Tour',
'Outdoors, Beach, Sports',
'Museum, Drinking, Drinking, Shopping'])
(s.str.split(',\s+', expand=True)
.stack()
.reset_index()
.drop_duplicates(['level_0',0])
.groupby('level_0')[0]
.agg(','.join)
)
Output:
level_0
0
1
2 Tour
3 Outdoors,Beach,Sports
4 Museum,Drinking,Shopping
Name: 0, dtype: object
there maybe mach fancier way doing these kind of stuffs.
but will do the job.
make it lower-case
data['tags'] = data['tags'].str.lower()
split every row in tags col by comma it will return a list of string
data['tags'] = data['tags'].str.split(',')
map function str.strip to every element of list (remove trailing spaces).
apply set function return set of current words and remove duplicates
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip , x)))
In my dataframe , There are several countries with numbers and/or parenthesis in their name.
I want to remove parentheses and numbers from these countries names.
For example :
'Bolivia (Plurinational State of)' should be 'Bolivia',
'Switzerland17' should be 'Switzerland'.
Here is my code , but it seems not working :
import numpy as np
import pandas as pd
def func():
energy=pd.ExcelFile('Energy Indicators.xls').parse('Energy')
energy=energy.iloc[16:243][['Environmental Indicators: Energy','Unnamed: 3','Unnamed: 4','Unnamed: 5']].copy()
energy.columns=['Country', 'Energy Supply', 'Energy Supply per Capita', '% Renewable']
o="..."
n=np.NaN
energy = energy.replace('...', np.nan)
energy['Energy Supply']=energy['Energy Supply']*1000000
old=["Republic of Korea","United States of America","United Kingdom of "
+"Great Britain and Northern Ireland","China, Hong "
+"Kong Special Administrative Region"]
new=["South Korea","United States","United Kingdom","Hong Kong"]
for i in range(0,4):
energy = energy.replace(old[i], new[i])
#I'm trying to remove it here =====>
p="("
for j in range(16,243):
if p in energy.iloc[j]['Country']:
country=""
for c in energy.iloc[j]['Country'] :
while(c!=p & !c.isnumeric()):
country=c+country
energy = energy.replace(energy.iloc[j]['Country'], country)
return energy
Here is the .xls file i'm working on : https://drive.google.com/file/d/0B80lepon1RrYeDRNQVFWYVVENHM/view?usp=sharing
Use str.extract:
energy['country'] = energy['country'].str.extract('(^[a-zA-Z]+)', expand=False)
df
country
0 Bolivia (Plurinational State of)
1 Switzerland17
df['country'] = df['country'].str.extract('(^[a-zA-Z]+)', expand=False)
df
country
0 Bolivia
1 Switzerland
To handle countries with spaces in their names (very common), a small improvement to the regex would be enough.
df
country
0 Bolivia (Plurinational State of)
1 Switzerland17
2 West Indies (foo bar)
df['country'] = df['country'].str.extract('(^[a-zA-Z\s]+)', expand=False).str.strip()
df
country
0 Bolivia
1 Switzerland
2 West Indies