compare 2 csv files using merge and compare row by row - python

So I have 2 CSV files. In file1 I have a list of research group names. In file2 I have a list of the research groups' full names, with the location as well. I want to join these 2 CSV files where the words in them match.
In file1.csv:
research_groups_names
Chinese Academy of Sciences (CAS)
CAS
U-M
UQ
In file2.csv:
research_groups_names                Location
Chinese Academy of Sciences (CAS)    China
University of Michigan (U-M)         United States of America (USA)
The University of Queensland (UQ)    Australia
The Output.csv:
f1_research_groups_names       f2_research_groups_names             Location
Chinese Academy of Sciences    Chinese Academy of Sciences (CAS)    China
CAS                            Chinese Academy of Sciences (CAS)    China
U-M                            University of Michigan (U-M)         United States of America (USA)
UQ                             The University of Queensland (UQ)    Australia
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

df1 = df1.add_prefix('f1_')
df2 = df2.add_prefix('f2_')

def compare_name(df):
    if df1['f1_research_groups_names'] == df2['f2_research_groups_names']:
        return 1
    else:
        return 0

result = pd.merge(df1, df2, left_on=['f1_research_groups_names'],
                  right_on=['f2_research_groups_names'], how="left")
result.to_csv('output.csv')

You can try:
def fn(row):
    for _, n in df2.iterrows():
        if (
            n["research_groups_names"] == row["research_groups_names"]
            or row["research_groups_names"] in n["research_groups_names"]
        ):
            return n

df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)
Prints:
f1_research_groups_names f2_research_groups_names f2_location
0 Chinese Academy of Sciences (CAS) Chinese Academy of Sciences (CAS) China
1 CAS Chinese Academy of Sciences (CAS) China
2 U-M University of Michigan (U-M) United States of America (USA)
3 UQ The University of Queensland (UQ) Australia
Note: if a name from df1 is not found in df2, the "f2_research_groups_names" and "f2_location" columns will contain None.
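If every file1 entry is either the exact full name from file2 or the abbreviation shown in parentheses there, a vectorized alternative is to build a lookup from both forms and merge on it. This is only a sketch under that assumption, not part of the original answer; the file and column names are taken from the question:
import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

# Map both the full file2 name and its parenthesised abbreviation (e.g. "CAS")
# back to the full file2 name
abbr = df2['research_groups_names'].str.extract(r'\(([^)]+)\)', expand=False)
lookup = dict(zip(df2['research_groups_names'], df2['research_groups_names']))
lookup.update(dict(zip(abbr, df2['research_groups_names'])))

out = (df1.rename(columns={'research_groups_names': 'f1_research_groups_names'})
          .assign(f2_research_groups_names=lambda d: d['f1_research_groups_names'].map(lookup))
          .merge(df2.rename(columns={'research_groups_names': 'f2_research_groups_names'}),
                 on='f2_research_groups_names', how='left'))
out.to_csv('output.csv', index=False)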

Related

Pandas Replace part of string with value from other column

I have a dataframe (see example, because of sensitive data) with a full term (string), a text snippet (one big string) containing the full term, and an abbreviation (string). I have been struggling with how to replace the full term in the text snippet with the corresponding abbreviation. Can anyone help? Example:
term text_snippet abbr
0 aanvullend onderzoek aanvullend onderzoek is vereist om... ao/
So I want to end up with:
term text_snippet abbr
0 aanvullend onderzoek ao/ is vereist om... ao/
You can use apply and replace terms with abbrs:
df['text_snippet'] = df.apply(
    lambda x: x['text_snippet'].replace(x['term'], x['abbr']), axis=1)
df
Output:
term text_snippet abbr
0 aanvullend onderzoek ao/ is vereist om... ao/
Here is a solution. I made up a simple dataframe:
df = pd.DataFrame({'term': ['United States of America', 'Germany', 'Japan'],
                   'text_snippet': ['United States of America is in America',
                                    'Germany is in Europe',
                                    'Japan is in Asia'],
                   'abbr': ['USA', 'DE', 'JP']})
df
Dataframe before solution:
term text_snippet abbr
0 United States of America United States of America is in America USA
1 Germany Germany is in Europe DE
2 Japan Japan is in Asia JP
Use the apply function on every row:
df['text_snippet'] = df.apply(lambda row: row['text_snippet'].replace(row['term'], row['abbr']), axis=1)
Output:
term text_snippet abbr
0 United States of America USA is in America USA
1 Germany DE is in Europe DE
2 Japan JP is in Asia JP
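As a side note (not part of the original answers), the same per-row replacement can also be written without apply, using zip over the three columns; a minimal sketch:
# Replace each row's term with its abbreviation, without DataFrame.apply
df['text_snippet'] = [
    snippet.replace(term, abbr)
    for snippet, term, abbr in zip(df['text_snippet'], df['term'], df['abbr'])
]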

Removing everything after a char in a dataframe

If I have the following dataframe 'countries':
country info
england london-europe
scotland edinburgh-europe
china beijing-asia
unitedstates washington-north_america
I would like to take the info field and remove everything after the '-', so it becomes:
country info
england london
scotland edinburgh
china beijing
unitedstates washington
How do I do this?
Try:
countries['info'] = countries['info'].str.split('-').str[0]
Output:
country info
0 england london
1 scotland edinburgh
2 china beijing
3 unitedstates washington
You just need to keep the first part of the string after a split on the dash character:
countries['info'] = countries['info'].str.split('-').str[0]
Or, equivalently, you can use
countries['info'] = countries['info'].str.split('-').map(lambda x: x[0])
You can also use str.extract with pattern r"(\w+)(?=\-)"
Ex:
print(df['info'].str.extract(r"(\w+)(?=\-)"))
Output:
info
0 london
1 edinburgh
2 beijing
3 washington
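Another option, not from the original answers but standard pandas, is a regex replacement that strips the dash and everything after it:
# Drop the '-' and everything that follows it
countries['info'] = countries['info'].str.replace(r'-.*$', '', regex=True)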

how to iterate with a loop over values in a function using python?

I want to pass values one by one into a function using a loop in Python. The values are stored in a dataframe.
def eam(A, B):
    y = A + " " + B
    return y
Suppose I pass the values of A as country and B as capital.
Dataframe df is
country capital
India New Delhi
Indonesia Jakarta
Islamic Republic of Iran Tehran
Iraq Baghdad
Ireland Dublin
How can I get the values using a loop?
0 India New Delhi
1 Indonesia Jakarta
2 Islamic Republic of Iran Tehran
3 Iraq Baghdad
4 Ireland Dublin
Here you go, just use the following syntax to get a new column in the dataframe. There is no need to write code to loop over the rows. However, if you must loop, df.iterrows() or df.itertuples() provide nice functionality to accomplish similar objectives (a short sketch follows the example below).
>>> df = pd.read_clipboard(sep='\t')
>>> df.head()
country capital
0 India New Delhi
1 Indonesia Jakarta
2 Islamic Republic of Iran Tehran
3 Iraq Baghdad
4 Ireland Dublin
>>> df.columns
Index(['country', 'capital'], dtype='object')
>>> df['both'] = df['country'] + " " + df['capital']
>>> df.head()
country capital both
0 India New Delhi India New Delhi
1 Indonesia Jakarta Indonesia Jakarta
2 Islamic Republic of Iran Tehran Islamic Republic of Iran Tehran
3 Iraq Baghdad Iraq Baghdad
4 Ireland Dublin Ireland Dublin
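If an explicit loop is really needed, for example to call the eam function from the question one row at a time, a minimal sketch with df.itertuples() (assuming df is the dataframe above):
# Loop row by row; itertuples yields one namedtuple per row
for row in df.itertuples(index=False):
    print(eam(row.country, row.capital))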

BeautifulSoup not returning .text in csv and extra unwanted numbers

I have this code
import requests
import csv
from bs4 import BeautifulSoup
from time import sleep
f = csv.writer(open('destinations.csv', 'w'))
f.writerow(['Destinations', 'Country'])
pages = []
for i in range(1, 3):
    url = 'http://www.travelindicator.com/destinations?page=' + str(i)
    pages.append(url.decode('utf-8'))
for item in pages:
    page = requests.get(item, sleep(2))
    soup = BeautifulSoup(page.content.text, 'lxml')
    for destinations_list in soup.select('.news-a header'):
        destination = soup.select('h2 a')
        country = soup.select('p a')
        print(destinations_list.text)
        f.writerow([destinations_list])
which gives me the following console output:
Ellora
1
3/5
India
Volterra
2
2/5
Italy
Hamilton
3
3/5
New Zealand
London
4
5/5
United Kingdom
Sun Moon Lake
5
5/5
Taiwan
Texel
6
etc...
Firstly, I am unsure why the extra numbers are being added, as I have only specified the parts I want for each destination.
Secondly, when I try to format it into a CSV file, it doesn't remove the HTML even though I have told my soup to give me content.text. I have been trying to figure it out for an hour and am at a loss.
import requests
from bs4 import BeautifulSoup

# Rebuild the page list from the question's URL pattern
pages = ['http://www.travelindicator.com/destinations?page=' + str(i) for i in range(1, 3)]
destinations_list = []
country = []

for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'lxml')
    for article in soup.findAll('article'):
        # The destination name is in the <h2>, its country in the first link of the <p>
        name = article.find('h2').text
        place = article.find('p').find('a').text
        destinations_list.append(name)
        country.append(place)
        print(name)
        print(place)
Output:
Ellora
India
Volterra
Italy
Hamilton
New Zealand
London
United Kingdom
Sun Moon Lake
Taiwan
Texel
The Netherlands
Zhengzhou
China
Vladivostok
Russia
Charleston
United States
Banska Bystrica
Slovakia
Lviv
Ukraine
Viareggio
Italy
Wakkanai
Japan
Nordkapp
Norway
Jericoacoara
Brazil
Tainan
Taiwan
Boston
United States
Keelung
Taiwan
Stockholm
Sweden
Shaoxing
China
Bohol
Distance to you
Bohol
Philippines
Saint Petersburg
Russia
Malmo
Sweden
Elba
Italy
Gdansk
Poland
Langkawi
Malaysia
Poznan
Poland
Daegu
South Korea
Abu Simbel
Egypt
Melbourne
Australia
Reunion
Reunion
Annecy
France
Colombo
Sri Lanka
Penghu
Taiwan
Conwy
United Kingdom
Monterrico
Guatemala
Janakpur
Nepal
Bimini
Bahamas
Lake Tahoe
United States
Essaouira
Morocco
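To also write the paired results to a CSV file, which is what the question was ultimately after, a minimal sketch (assuming destinations_list and country are the two lists built above):
import csv

# Write one destination/country pair per row
with open('destinations.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['Destinations', 'Country'])
    writer.writerows(zip(destinations_list, country))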

How do I change values of column in dataframe while iterating over the column?

I have a dataframe like this:
Cause_of_death famous_for name nationality
suicide by hanging African jazz XYZ South
unknown Korean president ABC South
heart attack businessman EFG American
heart failure Prime Minister LMN Indian
heart problems African writer PQR South
The dataframe is very big. What I want to do is make changes in the nationality column. You can see that for nationality = South, we have Korea and Africa as part of the strings in the famous_for column. So what I want to do is change the nationality to South Africa if famous_for contains Africa, and to South Korea if famous_for contains Korea.
What I have tried is:
for i in deaths['nationality']:
    if (deaths['nationality'] == 'South'):
        if deaths['famous_for'].contains('Korea'):
            deaths['nationality'] = 'South Korea'
        elif deaths['famous_for'].contains('Korea'):
            deaths['nationality'] = 'South Africa'
        else:
            pass
You can use str.contains() to check whether the famous_for column includes Korea or Africa and set nationality accordingly.
df.loc[df.famous_for.str.contains('Korean'), 'nationality']='South Korean'
df.loc[df.famous_for.str.contains('Africa'), 'nationality']='South Africa'
df
Out[783]:
Cause_of_death famous_for name nationality
0 suicide by hanging African jazz XYZ South Africa
1 unknown Korean president ABC South Korean
2 heart attack businessman EFG American
3 heart failure Prime Minister LMN Indian
4 heart problems African writer PQR South Africa
Or you can do this in one line using:
df.nationality = (
    df.nationality.str.cat(df.famous_for.str.extract('(Africa|Korea)', expand=False),
                           sep=' ', na_rep=''))
df
Out[801]:
Cause_of_death famous_for name nationality
0 suicide by hanging African jazz XYZ South Africa
1 unknown Korean president ABC South Korea
2 heart attack businessman EFG American
3 heart failure Prime Minister LMN Indian
4 heart problems African writer PQR South Africa
If many conditions are possible, use a custom function with DataFrame.apply and axis=1 to process by rows:
def f(x):
    if x['nationality'] == 'South':
        if 'Korea' in x['famous_for']:
            return 'South Korea'
        elif 'Africa' in x['famous_for']:
            return 'South Africa'
    # otherwise keep the existing value
    return x['nationality']

deaths['nationality'] = deaths.apply(f, axis=1)
print(deaths)
Cause_of_death famous_for name nationality
0 suicide by hanging African jazz XYZ South Africa
1 unknown Korean president ABC South Korea
2 heart attack businessman EFG American
3 heart failure Prime Minister LMN Indian
4 heart problems African writer PQR South Africa
But if there are only a few conditions, use str.contains with DataFrame.loc:
mask1 = deaths['nationality'] == 'South'
mask2 = deaths['famous_for'].str.contains('Korean')
mask3 = deaths['famous_for'].str.contains('Africa')
deaths.loc[mask1 & mask2, 'nationality']='South Korea'
deaths.loc[mask1 & mask3, 'nationality']='South Africa'
print(deaths)
Cause_of_death famous_for name nationality
0 suicide by hanging African jazz XYZ South Africa
1 unknown Korean president ABC South Korea
2 heart attack businessman EFG American
3 heart failure Prime Minister LMN Indian
4 heart problems African writer PQR South Africa
Another solution with mask:
mask1 = deaths['nationality'] == 'South'
mask2 = deaths['famous_for'].str.contains('Korean')
mask3 = deaths['famous_for'].str.contains('Africa')
deaths['nationality'] = deaths['nationality'].mask(mask1 & mask2, 'South Korea')
deaths['nationality'] = deaths['nationality'].mask(mask1 & mask3,'South Africa')
print(deaths)
Cause_of_death famous_for name nationality
0 suicide by hanging African jazz XYZ South Africa
1 unknown Korean president ABC South Korea
2 heart attack businessman EFG American
3 heart failure Prime Minister LMN Indian
4 heart problems African writer PQR South Africa
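For completeness, when there are several such conditions numpy.select keeps them in one place; a sketch using the same masks (not part of the original answers):
import numpy as np

mask1 = deaths['nationality'] == 'South'
mask2 = deaths['famous_for'].str.contains('Korean')
mask3 = deaths['famous_for'].str.contains('Africa')

# Take the first matching condition; otherwise keep the existing value
deaths['nationality'] = np.select(
    [mask1 & mask2, mask1 & mask3],
    ['South Korea', 'South Africa'],
    default=deaths['nationality'])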
