I am working on a web-scraping project. My goal is to scrape the Shanghai university ranking (ARWU) to get each university's name, country, and rank. Right now I am only focusing on the name.
import requests
from bs4 import BeautifulSoup

# 'w' mode truncates the file on open (replaces the 'a' + truncate() combo)
arwu = open('arwu.txt', 'w')
universities = []

# The URL to scrape
url = 'https://www.shanghairanking.com/rankings/arwu/2021.html'
response = requests.get(url)

# Initialize the bs4 HTML parser
soup = BeautifulSoup(response.text, "html.parser")

# Retrieves all the university names that are displayed and formats them
def find_universities():
    for tag in soup.findAll(class_='global-univ'):
        text = tag.text
        # keep only the second half of the tag text (the displayed name)
        text = text[len(text) // 2 + 16:]
        universities.append(text)
    return universities

universities = find_universities()
for name in universities:
    arwu.write(name + "\n")
arwu.close()
As of right now, this only retrieves the first 30 universities displayed on the first page. How can I access the other pages?
The data on the next pages is loaded dynamically by JavaScript, which is why BeautifulSoup alone can't parse it. To grab the data from the next pages you need a browser-automation tool such as Selenium. Here I use Selenium together with BeautifulSoup to extract the data from the next pages, and it works fine.
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome('chromedriver.exe')
driver.maximize_window()
time.sleep(8)

url = 'https://www.shanghairanking.com/rankings/arwu/2021'
driver.get(url)
time.sleep(4)

universities = []
while True:  # loop over the pagination
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for university in soup.select('div.link-container > a'):
        un = university.select_one('span.univ-name')
        versity = un.get_text(strip=True) if un else None
        print(versity)
    print("-" * 85)

    # The third pagination-item-link is the "next page" arrow.
    # find_elements returns an empty list (instead of raising) when
    # nothing matches, so the loop can end cleanly on the last page.
    next_page = driver.find_elements_by_xpath('(//a[@class="ant-pagination-item-link"])[3]')
    if next_page:
        next_page[0].click()
        time.sleep(2)
    else:
        break
Output:
Harvard University
Stanford University
University of Cambridge
Massachusetts Institute of Technology (MIT)
University of California, Berkeley
Princeton University
University of Oxford
Columbia University
California Institute of Technology
University of Chicago
Yale University
Cornell University
Paris-Saclay University
University of California, Los Angeles
University of Pennsylvania
Johns Hopkins University
University College London
University of California, San Diego
University of Washington
University of California, San Francisco
ETH Zurich
University of Toronto
Washington University in St. Louis
The University of Tokyo
Imperial College London
New York University
Tsinghua University
University of North Carolina at Chapel Hill
University of Copenhagen
None
-------------------------------------------------------------------------------------
University of Wisconsin - Madison
Duke University
The University of Melbourne
Northwestern University
Sorbonne University
The University of Manchester
Kyoto University
PSL University
The University of Edinburgh
University of Minnesota, Twin Cities
The University of Texas at Austin
Karolinska Institute
Rockefeller University
University of British Columbia
Peking University
University of Colorado at Boulder
King's College London
The University of Texas Southwestern Medical Center at Dallas
University of Munich
Utrecht University
The University of Queensland
Technical University of Munich
Zhejiang University
University of Zurich
University of Illinois at Urbana-Champaign
University of Maryland, College Park
Heidelberg University
University of California, Santa Barbara
Shanghai Jiao Tong University
University of Geneva
None
-------------------------------------------------------------------------------------
University of Oslo
University of Southern California
University of Science and Technology of China
University of Groningen
The University of New South Wales
Vanderbilt University
McGill University
The University of Texas M. D. Anderson Cancer Center
University of Sydney
University of California, Irvine
Aarhus University
Ghent University
University of Paris
Stockholm University
National University of Singapore
The Australian National University
Fudan University
University of Bristol
Uppsala University
Monash University
Nanyang Technological University
University of Helsinki
Leiden University
Nagoya University
University of Bonn
Purdue University - West Lafayette
KU Leuven
University of Basel
Sun Yat-sen University
The Hebrew University of Jerusalem
None
-------------------------------------------------------------------------------------
Swiss Federal Institute of Technology Lausanne
McMaster University
Weizmann Institute of Science
Technion-Israel Institute of Technology
Boston University
The University of Western Australia
Carnegie Mellon University
Moscow State University
University of Florida
University of California, Davis
Aix Marseille University
Arizona State University
Brown University
Case Western Reserve University
Emory University
Erasmus University Rotterdam
Georgia Institute of Technology
Huazhong University of Science and Technology
Icahn School of Medicine at Mount Sinai
Indiana University Bloomington
King Abdulaziz University
King Saud University
Mayo Clinic Alix School of Medicine
Michigan State University
Nanjing University
Norwegian University of Science and Technology - NTNU
Pennsylvania State University - University Park
Radboud University Nijmegen
Rice University
Rutgers, The State University of New Jersey - New Brunswick
None
-------------------------------------------------------------------------------------
Seoul National University
The Chinese University of Hong Kong
The Ohio State University - Columbus
The University of Adelaide
The University of Hong Kong
The University of Sheffield
Tokyo Institute of Technology
Université Grenoble Alpes
Université libre de Bruxelles (ULB)
University of Alberta
University of Amsterdam
University of Arizona
University of Bern
University of Birmingham
University of Freiburg
University of Goettingen
University of Gothenburg
University of Lausanne
University of Leeds
University of Liverpool
University of Montreal
University of Nottingham
University of Pittsburgh
University of Sao Paulo
University of Strasbourg
University of Utah
University of Warwick
Vrije Universiteit Amsterdam
Wageningen University & Research
Xi'an Jiaotong University
None
-------------------------------------------------------------------------------------
University of Houston
University of Illinois at Chicago
University of Innsbruck
University of Iowa
University of Kansas
University of Kiel
University of Leipzig
University of Lisbon
University of Lorraine
University of Mainz
University of Massachusetts Amherst
University of Massachusetts Medical School - Worcester
University of Miami
University of Missouri - Columbia
University of Nebraska - Lincoln
University of Ottawa
University of Science and Technology Beijing
University of South Florida
University of Technology Sydney
University of Tennessee - Knoxville
University of Tsukuba
University of Turin
University of Wollongong
University of Wuerzburg
Virginia Commonwealth University
Virginia Polytechnic Institute and State University
Vrije Universiteit Brussel (VUB)
Western University
Xiamen University
Yonsei University
None
-------------------------------------------------------------------------------------
Indian Institute of Science
Istanbul University
Jagiellonian University
Jinan University
Kansas State University
King Fahd University of Petroleum & Minerals
Kobe University
Kyung Hee University
Mahidol University
Medical University of Innsbruck
Nanjing Normal University
Nanjing University of Information Science & Technology
National University of Ireland, Galway
National Yang Ming Chiao Tung University
Northern Arizona University
Okayama University
Pohang University of Science and Technology
Pompeu Fabra University
Pusan National University
Qingdao University
Queen's University Belfast
Rensselaer Polytechnic Institute
Saint Louis University
Scuola Normale Superiore - Pisa
Shandong University of Science and Technology
ShanghaiTech University
South China Agricultural University
Southern Medical University
None
-------------------------------------------------------------------------------------
University of Kent
University of Konstanz
University of Ljubljana
University of Navarra
University of Nevada - Reno
University of New Hampshire
University of Oklahoma - Norman
University of Palermo
University of Parma
University of Plymouth
University of Portsmouth
University of Regensburg
University of Rennes 1
University of Roma - Tor Vergata
University of Rostock
University of Salerno
University of Sherbrooke
University of Siena
Tampere University
University of Tromso
University of Ulsan
University of Verona
University of Vigo
University of Zaragoza
University Rovira i Virgili
Waseda University
Wenzhou Medical University
Yunnan University
Zhejiang University of Technology
None
-------------------------------------------------------------------------------------
Dalian Maritime University
Dokuz Eylul University
Federal University of Sao Carlos
Fluminense Federal University
Fujian Agriculture and Forestry University
Fujian Medical University
Fujian Normal University
Graz University of Technology
Guangxi University
Hacettepe University
Henan University
Indian Institute of Technology Delhi
Indian Institute of Technology Kharagpur
Indian Institute of Technology Madras
INHA University
Jawaharlal Nehru University
Kanazawa University
Kaohsiung Medical University
Kindai University
Kunming University of Science and Technology
Lincoln University
Mansoura University
Massey University
Medical University of Warsaw
National Research Nuclear University MEPhI (Moscow Engineering Physics Institute)
New Jersey Institute of Technology
New Mexico State University
Ningbo University
North China Electric Power University
None
-------------------------------------------------------------------------------------
Uniformed Services University of the Health Sciences
Universidad Andrés Bello
Universidad de Las Palmas de Gran Canaria
Universidad Pablo de Olavide
Université Gustave Eiffel
University of Agriculture Faisalabad
University of Alcalá
University of Cagliari
University of Concepcion
University of Cordoba
University of Engineering and Technology (UET)
University of Girona
University of Greenwich
University of Hull
University of L'Aquila
University of North Carolina at Greensboro
University of Savoy
University of St. Gallen
University of Stirling
University of Tabriz
University of the Punjab
University of Thessaly
University of Urbino
University of Veterinary Medicine Vienna
Vellore Institute of Technology
Warsaw University of Life Sciences
Westlake University
Wroclaw Medical University
Yanshan University
Zagazig University
None
-------------------------------------------------------------------------------------
University of Ulster
University of Valladolid
University of Wuppertal
University Paris Est Creteil
Vilnius Gediminas Technical University
Warsaw University of Technology
Williams College
Wroclaw University of Science and Technology
Wuhan University of Science and Technology
Yantai University
None
I am trying to scrape the following website (https://www.english-heritage.org.uk/visit/blue-plaques/#?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0) and ultimately am interested in storing some of the data inside each 'li class="search-result-item"' to perform further analytics.
Example of one "search-result-item"
I want to capture the <h3>,<span class="plaque-role"> and <span class="plaque-location"> in a python dictionary:
<li class="search-result-item"><img class="search-result-image max-width" src="/siteassets/home/visit/blue-plaques/find-a-plaque/blue-plaques-f-j/helen-gwynne-vaughan-plaque.jpg?w=732&h=465&mode=crop&scale=both&cache=always&quality=60&anchor=&WebsiteVersion=20220516171525" alt="" title=""><div class="search-result-info"><h3>GWYNNE-VAUGHAN, Dame Helen (1879-1967)</h3><span class="plaque-role">Botanist and Military Officer</span><span class="plaque-location">Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden</span></div></li>
So far I am trying to isolate all the "search-result-item" elements, but my current code prints absolutely nothing. If someone can help me sort that problem out and point me in the right direction for storing each data element in a Python dictionary, I would be very grateful.
from bs4 import BeautifulSoup
import requests
url = 'https://www.english-heritage.org.uk/visit/blue-plaques/#?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup.prettify())
print(soup.find_all(class_='search-result-item')).get_text()
You're not getting anything because the search results are generated by JavaScript. Use the API endpoint they fetch the data from.
For example:
import requests

api_url = "https://www.english-heritage.org.uk/api/BluePlaqueSearch/GetMatchingBluePlaques?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0&_=1653043005731"
plaques = requests.get(api_url).json()["plaques"]

for plaque in plaques:
    print(plaque["title"])
    print(plaque["address"])
    print(f"https://www.english-heritage.org.uk{plaque['path']}")
    print("-" * 80)
Output:
GWYNNE-VAUGHAN, Dame Helen (1879-1967)
Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden
https://www.english-heritage.org.uk/visit/blue-plaques/helen-gwynne-vaughan/
--------------------------------------------------------------------------------
READING, Lady Stella (1894-1971)
41 Tothill Street, London, City of Westminster, SW1H 9LQ, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/stella-lady-reading/
--------------------------------------------------------------------------------
32 SOHO SQUARE
32 Soho Square, Soho, London, W1D 3AP, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/soho-square/
--------------------------------------------------------------------------------
14 BUCKINGHAM STREET
14 Buckingham Street, Covent Garden, London, WC2N 6DF, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/buckingham-street/
--------------------------------------------------------------------------------
ABRAHAMS, Harold (1899-1978)
Hodford Lodge, 2 Hodford Road, Golders Green, London, NW11 8NP, London Borough of Barnet
https://www.english-heritage.org.uk/visit/blue-plaques/abrahams-harold/
--------------------------------------------------------------------------------
ADAM, ROBERT and HOOD, THOMAS and GALSWORTHY, JOHN and BARRIE, SIR JAMES
1-3 Robert Street, Adelphi, Charing Cross, London, WC2N 6BN, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/adam-hood-galsworthy-barrie/
--------------------------------------------------------------------------------
ADAMS, Henry Brooks (1838-1918)
98 Portland Place, Marylebone, London, W1B 1ET, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/united-states-embassy/
--------------------------------------------------------------------------------
ADELPHI, The
The Adelphi Terrace, Charing Cross, London, WC2N 6BJ, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/adelphi/
--------------------------------------------------------------------------------
ALDRIDGE, Ira (1807-1867)
5 Hamlet Road, Upper Norwood, London, SE19 2AP, London Borough of Bromley
https://www.english-heritage.org.uk/visit/blue-plaques/aldridge-ira/
--------------------------------------------------------------------------------
ALEXANDER, Sir George (1858-1918)
57 Pont Street, Chelsea, London, SW1X 0BD, London Borough of Kensington And Chelsea
https://www.english-heritage.org.uk/visit/blue-plaques/george-alexander/
--------------------------------------------------------------------------------
ALLENBY, Field Marshal Edmund Henry Hynman, Viscount Allenby (1861-1936)
24 Wetherby Gardens, South Kensington, London, SW5 0JR, London Borough of Kensington And Chelsea
https://www.english-heritage.org.uk/visit/blue-plaques/field-marshal-viscount-allenby/
--------------------------------------------------------------------------------
ALMA-TADEMA, Sir Lawrence, O.M. (1836-1912)
44 Grove End Road, St John's Wood, London, NW8 9NE, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/lawrence-alma-tadema/
--------------------------------------------------------------------------------
Content is generated dynamically by JavaScript, so you won't find the elements/info you are looking for with BeautifulSoup; instead, use their API.
Example
import requests

url = 'https://www.english-heritage.org.uk/api/BluePlaqueSearch/GetMatchingBluePlaques?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0'
page = requests.get(url).json()

data = []
for e in page['plaques']:
    data.append({k: v for k, v in e.items() if k in ('title', 'professions', 'address')})
print(data)
Output
[{'title': 'GWYNNE-VAUGHAN, Dame Helen (1879-1967)', 'address': 'Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden', 'professions': 'Botanist and Military Officer'}, {'title': 'READING, Lady Stella (1894-1971)', 'address': '41 Tothill Street, London, City of Westminster, SW1H 9LQ, City Of Westminster', 'professions': "Founder of the Women's Voluntary Service"}, {'title': '32 SOHO SQUARE', 'address': '32 Soho Square, Soho, London, W1D 3AP, City Of Westminster', 'professions': 'Botanists'}, {'title': '14 BUCKINGHAM STREET', 'address': '14 Buckingham Street, Covent Garden, London, WC2N 6DF, City Of Westminster', 'professions': 'Statesman, Diarist, Naval Official, Painter'}, {'title': 'ABRAHAMS, Harold (1899-1978)', 'address': 'Hodford Lodge, 2 Hodford Road, Golders Green, London, NW11 8NP, London Borough of Barnet', 'professions': 'Athlete'}, ...]
I'm attempting to convert the following into integers. I have literally tried everything and keep getting errors.
For instance:
pop2007 = pop2007.astype('int32')
ValueError: invalid literal for int() with base 10: '4,779,736'
Below is the DF I'm trying to convert. I've even attempted the .values method with no success.
pop2007
Alabama 4,779,736
Alaska 710,231
Arizona 6,392,017
Arkansas 2,915,918
California 37,253,956
Colorado 5,029,196
Connecticut 3,574,097
Delaware 897,934
Florida 18,801,310
Georgia 9,687,653
Idaho 1,567,582
Illinois 12,830,632
Indiana 6,483,802
Iowa 3,046,355
Kansas 2,853,118
Kentucky 4,339,367
Louisiana 4,533,372
Maine 1,328,361
Maryland 5,773,552
Massachusetts 6,547,629
Michigan 9,883,640
Minnesota 5,303,925
Mississippi 2,967,297
Missouri 5,988,927
Montana 989,415
Nebraska 1,826,341
Nevada 2,700,551
New Hampshire 1,316,470
New Jersey 8,791,894
New Mexico 2059179
New York 19378102
North Carolina 9535483
North Dakota 672591
Ohio 11536504
Oklahoma 3751351
Oregon 3831074
Pennsylvania 12702379
Rhode Island 1052567
South Carolina 4625364
South Dakota 814180
Tennessee 6346105
Texas 25,145,561
Utah 2,763,885
Vermont 625,741
Virginia 8,001,024
Washington 6,724,540
West Virginia 1,852,994
Wisconsin 5,686,986
Wyoming 563,626
Name: 3, dtype: object
You can't turn a string that contains commas into an integer directly. Strip the commas first:
my_int = int('1,000,000'.replace(',', ''))
print(my_int)  # 1000000
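The same idea works vectorized on a pandas Series. A minimal sketch using a few of the values from the question (the index labels and numbers are taken from the posted data):

```python
import pandas as pd

# A few of the values from the question, as strings (some with commas)
pop2007 = pd.Series({
    "Alabama": "4,779,736",
    "Alaska": "710,231",
    "New Mexico": "2059179",
})

# Strip the thousands separators across the whole Series, then cast.
# regex=False makes this a plain substring replacement.
pop2007 = pop2007.str.replace(",", "", regex=False).astype("int64")
print(pop2007)
```

Values without commas pass through unchanged, so the mixed formatting in the posted Series is not a problem.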
Have you tried pop2007.str.replace(',', '') to remove the commas from your string values before converting to integers? (Note the .str accessor: a plain Series.replace only matches whole values, not substrings, unless you pass regex=True.)
I'm trying to convert a table found on a website (full details and photo below) to a CSV. I've started with the code below, but the table isn't returning anything. I think it must have something to do with me not understanding the right naming convention for the table, but any additional help would be appreciated to achieve my ultimate goal.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
url = 'https://www.privateequityinternational.com/database/#/pei-300'
page = requests.get(url) #gets info from page
soup = BeautifulSoup(page.content,'html.parser') #parses information
table = soup.findAll('table',{'class':'au-target pux--responsive-table'}) #collecting blocks of info inside of table
table
Output: []
In addition to the URL provided in the above code, I'm essentially trying to convert the below table (found on the website) to a CSV file:
The data is loaded from an external URL via Ajax. You can use the requests/json modules to get it:
import json
import requests

url = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s?count=25&start=0'
data = requests.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for item in data['data']:
    print('{:<5} {:<30} {}'.format(item['id'], item['name'], item['headquarters']))
Prints:
5611 Blackstone New York, United States
5579 The Carlyle Group Washington DC, United States
5586 KKR New York, United States
6701 TPG Fort Worth, United States
5591 Warburg Pincus New York, United States
1801 NB Alternatives New York, United States
6457 CVC Capital Partners Luxembourg, Luxembourg
6477 EQT Stockholm, Sweden
6361 Advent International Boston, United States
8411 Vista Equity Partners Austin, United States
6571 Leonard Green & Partners Los Angeles, United States
6782 Cinven London, United Kingdom
6389 Bain Capital Boston, United States
8096 Apollo Global Management New York, United States
8759 Thoma Bravo San Francisco, United States
7597 Insight Partners New York, United States
867 BlackRock New York, United States
5471 General Atlantic New York, United States
6639 Permira Advisers London, United Kingdom
5903 Brookfield Asset Management Toronto, Canada
6473 EnCap Investments Houston, United States
6497 Francisco Partners San Francisco, United States
6960 Platinum Equity Beverly Hills, United States
16331 Hillhouse Capital Group Hong Kong, Hong Kong
5595 Partners Group Baar-Zug, Switzerland
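The request above only returns the first 25 rows; the count/start query parameters suggest the endpoint is paginated. A sketch of walking all pages, assuming the endpoint keeps honoring larger start offsets and eventually returns an empty data list (untested beyond the first page shown above):

```python
import requests

BASE = "https://ra.pei.blaize.io/api/v1/institutions/pei-300s"

def fetch_all(page_size=25):
    """Collect every row by stepping the `start` offset page by page."""
    rows, start = [], 0
    while True:
        resp = requests.get(BASE, params={"count": page_size, "start": start})
        resp.raise_for_status()
        batch = resp.json().get("data", [])
        if not batch:  # an empty page means there are no more rows
            break
        rows.extend(batch)
        start += page_size
    return rows

# Usage (makes live network calls):
# for item in fetch_all():
#     print(item["id"], item["name"], item["headquarters"])
```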
And selenium version:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
url = 'https://www.privateequityinternational.com/database/#/pei-300'
driver.get(url) #gets info from page
time.sleep(5)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser') #parses information
table = soup.select_one('table.au-target.pux--responsive-table') #collecting blocks of info inside of table
dfs = pd.read_html(table.prettify())
df = pd.concat(dfs)
df.to_csv('file.csv')
print(df.head(25))
prints:
Ranking Name City, Country (HQ)
0 1 Blackstone New York, United States
1 2 The Carlyle Group Washington DC, United States
2 3 KKR New York, United States
3 4 TPG Fort Worth, United States
4 5 Warburg Pincus New York, United States
5 6 NB Alternatives New York, United States
6 7 CVC Capital Partners Luxembourg, Luxembourg
7 8 EQT Stockholm, Sweden
8 9 Advent International Boston, United States
9 10 Vista Equity Partners Austin, United States
10 11 Leonard Green & Partners Los Angeles, United States
11 12 Cinven London, United Kingdom
12 13 Bain Capital Boston, United States
13 14 Apollo Global Management New York, United States
14 15 Thoma Bravo San Francisco, United States
15 16 Insight Partners New York, United States
16 17 BlackRock New York, United States
17 18 General Atlantic New York, United States
18 19 Permira Advisers London, United Kingdom
19 20 Brookfield Asset Management Toronto, Canada
20 21 EnCap Investments Houston, United States
21 22 Francisco Partners San Francisco, United States
22 23 Platinum Equity Beverly Hills, United States
23 24 Hillhouse Capital Group Hong Kong, Hong Kong
24 25 Partners Group Baar-Zug, Switzerland
And it also saves the data to file.csv.
Note you need Selenium and geckodriver, and in this code geckodriver is expected at c:/program/geckodriver.exe.
I was looking to get the data in the stock dropdown. I went into the source and found the tag but I can't get the code to access the data. Can someone please help me fix the bug?
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.moneycontrol.com/india/fnoquote/reliance-industries/RI/2020-07-30"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
for i in soup.select("stock_id"):
    print(i.text)
You can use the CSS selector #stock_code > option instead of stock_id to get the data in the stock dropdown. Try it:
from bs4 import BeautifulSoup
import requests

url = "http://www.moneycontrol.com/india/fnoquote/reliance-industries/RI/2020-07-30"
headers = {'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
for i in soup.select("#stock_code > option"):
    print(i.text)
Output will be:
ACC
Adani Enterpris
Adani Ports
Adani Power
Ajanta Pharma
Allahabad Bank
Amara Raja Batt
Ambuja Cements
Apollo Hospital
Apollo Tyres
Arvind
Ashok Leyland
Asian Paints
Aurobindo Pharm
Axis Bank
Bajaj Auto
Bajaj Finance
Bajaj Finserv
Balkrishna Ind
Bank of Baroda
Bank of India
Bata India
BEML
Berger Paints
Bharat Elec
Bharat Fin
Bharat Forge
Bharti Airtel
Bharti Infratel
BHEL
Biocon
Bosch
BPCL
Britannia
Cadila Health
Can Fin Homes
Canara Bank
Capital First
Castrol
Ceat
Century
CESC
CG Power
Chennai Petro
Cholamandalam
Cipla
Coal India
Colgate
Container Corp
Cummins
Dabur India
Dalmia Bharat
DCB Bank
Dewan Housing
Dish TV
Divis Labs
DLF
Dr Reddys Labs
Eicher Motors
EngineersInd
Equitas Holding
Escorts
Exide Ind
Federal Bank
GAIL
Glenmark
GMR Infra
Godfrey Phillip
Godrej Consumer
Godrej Ind
Granules India
Grasim
GSFC
Havells India
HCL Tech
HDFC
HDFC Bank
Hero Motocorp
Hexaware Tech
Hind Constr
Hind Zinc
Hindalco
HPCL
HUL
ICICI Bank
ICICI Prudentia
IDBI Bank
IDFC
IDFC Bank
IFCI
IGL
India Cements
Indiabulls Hsg
Indian Bank
IndusInd Bank
Infibeam Avenue
Infosys
Interglobe Avi
IOC
IRB Infra
ITC
Jain Irrigation
Jaiprakash Asso
Jet Airways
Jindal Steel
JSW Steel
Jubilant Food
Just Dial
Kajaria Ceramic
Karnataka Bank
Kaveri Seed
Kotak Mahindra
KPIT Tech
L&T Finance
Larsen
LIC Housing Fin
Lupin
M&M
M&M Financial
Mahanagar Gas
Manappuram Fin
Marico
Maruti Suzuki
Max Financial
MCX India
Mindtree
Motherson Sumi
MRF
MRPL
Muthoot Finance
NALCO
NBCC (India)
NCC
Nestle
NHPC
NIIT Tech
NMDC
NTPC
Oil India
ONGC
Oracle Fin Serv
Oriental Bank
Page Industries
PC Jeweller
Petronet LNG
Pidilite Ind
Piramal Enter
PNB
Power Finance
Power Grid Corp
PTC India
PVR
Ramco Cements
Raymond
RBL Bank
REC
Rel Capital
Reliance
Reliance Comm
Reliance Infra
Reliance Power
Repco Home
SAIL
SBI
Shree Cements
Shriram Trans
Siemens
South Ind Bk
SREI Infra
SRF
Strides Pharma
Sun Pharma
Sun TV Network
Suzlon Energy
Syndicate Bank
Tata Chemicals
Tata Comm
Tata Elxsi
Tata Global Bev
Tata Motors
Tata Motors (D)
Tata Power
Tata Steel
TCS
Tech Mahindra
Titan Company
Torrent Pharma
Torrent Power
TV18 Broadcast
TVS Motor
Ujjivan Financi
UltraTechCement
Union Bank
United Brewerie
United Spirits
UPL
V-Guard Ind
Vedanta
Vodafone Idea
Voltas
Wipro
Wockhardt
Yes Bank
Zee Entertain
I have a column on my dataframe that contains the following
Wal-Mart Stores, Inc., Clinton, IA 52732
Benton Packing, LLC, Clearfield, UT 84016
North Coast Iron Corp, Seattle, WA 98109
Messer Construction Co. Inc., Amarillo, TX 79109
Ocean Spray Cranberries, Inc., Henderson, NV 89011
W R Derrick & Co. Lexington, SC 29072
I am having a problem capturing it using regex; so far my regex only works for the first 2 lines:
[A-Z][A-za-z-\s]+,\s{1}(Inc.|LLC)
How do I split the column into 4 additional columns? i.e. Column 1 = Company Name, Column 2 = City, Column 3 = State, Column 4 = Zip Code.
Example of the output is shown below:
Company_Name City State ZipCode
Wal-Mart Stores, Inc. Clinton IA 52732
The names are probably the trickiest part, but if you know that the structure of city, state, zip will always be the same (i.e. no extra commas) you can use rsplit to split the strings. pandas has a str.rsplit method as well.
df
Address
0 Wal-Mart Stores, Inc., Clinton, IA 52732
1 Benton Packing, LLC, Clearfield, UT 84016
2 North Coast Iron Corp, Seattle, WA 98109
3 Messer Construction Co. Inc., Amarillo, TX 79109
df['Zip'] = df.Address.map(lambda x: x.rsplit(' ', 1)[-1])
df['Name'], df['City'], df['State']= zip(*df.Address.map(lambda x: x.rsplit(' ', 1)[0].rsplit(',', 2)))
df
Address Zip \
0 Wal-Mart Stores, Inc., Clinton, IA 52732 52732
1 Benton Packing, LLC, Clearfield, UT 84016 84016
2 North Coast Iron Corp, Seattle, WA 98109 98109
3 Messer Construction Co. Inc., Amarillo, TX 79109 79109
Name City State
0 Wal-Mart Stores, Inc. Clinton IA
1 Benton Packing, LLC Clearfield UT
2 North Coast Iron Corp Seattle WA
3 Messer Construction Co. Inc. Amarillo TX
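The str.rsplit route mentioned above can replace the Python-level map calls entirely. A minimal sketch using addresses from the question (the column names Name/City/State/Zip match the answer's, not an official API):

```python
import pandas as pd

df = pd.DataFrame({"Address": [
    "Wal-Mart Stores, Inc., Clinton, IA 52732",
    "Benton Packing, LLC, Clearfield, UT 84016",
    "North Coast Iron Corp, Seattle, WA 98109",
]})

# The last space-separated token is the zip; everything before it
# splits on its last two commas into name / city / state.
df["Zip"] = df["Address"].str.rsplit(" ", n=1).str[-1]
rest = df["Address"].str.rsplit(" ", n=1).str[0]
df[["Name", "City", "State"]] = rest.str.rsplit(",", n=2, expand=True)

# Trim the spaces left over from splitting on ","
df["City"] = df["City"].str.strip()
df["State"] = df["State"].str.strip()
print(df[["Name", "City", "State", "Zip"]])
```

Because rsplit works from the right, extra commas inside the company name (as in "Wal-Mart Stores, Inc.") stay in the Name column.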