I am trying to scrape the following website (https://www.english-heritage.org.uk/visit/blue-plaques/#?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0) and ultimately want to store some of the data inside each <li class="search-result-item"> element to perform further analysis.
Here is an example of one "search-result-item". I want to capture the <h3>, <span class="plaque-role"> and <span class="plaque-location"> in a Python dictionary:
<li class="search-result-item"><img class="search-result-image max-width" src="/siteassets/home/visit/blue-plaques/find-a-plaque/blue-plaques-f-j/helen-gwynne-vaughan-plaque.jpg?w=732&h=465&mode=crop&scale=both&cache=always&quality=60&anchor=&WebsiteVersion=20220516171525" alt="" title=""><div class="search-result-info"><h3>GWYNNE-VAUGHAN, Dame Helen (1879-1967)</h3><span class="plaque-role">Botanist and Military Officer</span><span class="plaque-location">Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden</span></div></li>
So far I am trying to isolate all the "search-result-item" elements, but my current code prints nothing useful. If someone can help me sort that problem out and point me in the right direction for storing each data element in a Python dictionary, I would be very grateful.
from bs4 import BeautifulSoup
import requests
url = 'https://www.english-heritage.org.uk/visit/blue-plaques/#?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup.prettify())
print(soup.find_all(class_='search-result-item'))
You're not getting anything because the search results are generated by JavaScript. Use the API endpoint they fetch the data from.
For example:
import requests

api_url = "https://www.english-heritage.org.uk/api/BluePlaqueSearch/GetMatchingBluePlaques?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0&_=1653043005731"
plaques = requests.get(api_url).json()["plaques"]

for plaque in plaques:
    print(plaque["title"])
    print(plaque["address"])
    print(f"https://www.english-heritage.org.uk{plaque['path']}")
    print("-" * 80)
Output:
GWYNNE-VAUGHAN, Dame Helen (1879-1967)
Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden
https://www.english-heritage.org.uk/visit/blue-plaques/helen-gwynne-vaughan/
--------------------------------------------------------------------------------
READING, Lady Stella (1894-1971)
41 Tothill Street, London, City of Westminster, SW1H 9LQ, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/stella-lady-reading/
--------------------------------------------------------------------------------
32 SOHO SQUARE
32 Soho Square, Soho, London, W1D 3AP, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/soho-square/
--------------------------------------------------------------------------------
14 BUCKINGHAM STREET
14 Buckingham Street, Covent Garden, London, WC2N 6DF, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/buckingham-street/
--------------------------------------------------------------------------------
ABRAHAMS, Harold (1899-1978)
Hodford Lodge, 2 Hodford Road, Golders Green, London, NW11 8NP, London Borough of Barnet
https://www.english-heritage.org.uk/visit/blue-plaques/abrahams-harold/
--------------------------------------------------------------------------------
ADAM, ROBERT and HOOD, THOMAS and GALSWORTHY, JOHN and BARRIE, SIR JAMES
1-3 Robert Street, Adelphi, Charing Cross, London, WC2N 6BN, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/adam-hood-galsworthy-barrie/
--------------------------------------------------------------------------------
ADAMS, Henry Brooks (1838-1918)
98 Portland Place, Marylebone, London, W1B 1ET, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/united-states-embassy/
--------------------------------------------------------------------------------
ADELPHI, The
The Adelphi Terrace, Charing Cross, London, WC2N 6BJ, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/adelphi/
--------------------------------------------------------------------------------
ALDRIDGE, Ira (1807-1867)
5 Hamlet Road, Upper Norwood, London, SE19 2AP, London Borough of Bromley
https://www.english-heritage.org.uk/visit/blue-plaques/aldridge-ira/
--------------------------------------------------------------------------------
ALEXANDER, Sir George (1858-1918)
57 Pont Street, Chelsea, London, SW1X 0BD, London Borough of Kensington And Chelsea
https://www.english-heritage.org.uk/visit/blue-plaques/george-alexander/
--------------------------------------------------------------------------------
ALLENBY, Field Marshal Edmund Henry Hynman, Viscount Allenby (1861-1936)
24 Wetherby Gardens, South Kensington, London, SW5 0JR, London Borough of Kensington And Chelsea
https://www.english-heritage.org.uk/visit/blue-plaques/field-marshal-viscount-allenby/
--------------------------------------------------------------------------------
ALMA-TADEMA, Sir Lawrence, O.M. (1836-1912)
44 Grove End Road, St John's Wood, London, NW8 9NE, City Of Westminster
https://www.english-heritage.org.uk/visit/blue-plaques/lawrence-alma-tadema/
--------------------------------------------------------------------------------
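If you want each result as a Python dictionary (as the question asks), collect the fields into a list of dicts instead of printing them. The pageBP and sizeBP query parameters look like standard paging controls, so looping over pageBP should fetch further pages, though that behaviour is an assumption rather than anything the API documents. A minimal sketch:
import requests

base = ("https://www.english-heritage.org.uk/api/BluePlaqueSearch/"
        "GetMatchingBluePlaques?pageBP={page}&sizeBP=12&borBP=0&keyBP=&catBP=0")

records = []
for page_num in range(1, 4):  # first three pages; paging behaviour assumed, adjust as needed
    plaques = requests.get(base.format(page=page_num)).json()["plaques"]
    if not plaques:  # stop if a page comes back empty
        break
    for plaque in plaques:
        # "title", "professions" and "address" carry the <h3>, plaque-role
        # and plaque-location values the question wants to capture
        records.append({
            "title": plaque["title"],
            "professions": plaque["professions"],
            "address": plaque["address"],
        })

print(records[0])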
Content is generated dynamically by JavaScript, so you won't find the elements/info you are looking for with BeautifulSoup; use their API instead.
Example:
import requests

url = 'https://www.english-heritage.org.uk/api/BluePlaqueSearch/GetMatchingBluePlaques?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0'
page = requests.get(url).json()

data = []
for e in page['plaques']:
    # keep only the fields the question asks about
    data.append({k: v for k, v in e.items() if k in ['title', 'professions', 'address']})
print(data)
Output:
[{'title': 'GWYNNE-VAUGHAN, Dame Helen (1879-1967)', 'address': 'Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden', 'professions': 'Botanist and Military Officer'}, {'title': 'READING, Lady Stella (1894-1971)', 'address': '41 Tothill Street, London, City of Westminster, SW1H 9LQ, City Of Westminster', 'professions': "Founder of the Women's Voluntary Service"}, {'title': '32 SOHO SQUARE', 'address': '32 Soho Square, Soho, London, W1D 3AP, City Of Westminster', 'professions': 'Botanists'}, {'title': '14 BUCKINGHAM STREET', 'address': '14 Buckingham Street, Covent Garden, London, WC2N 6DF, City Of Westminster', 'professions': 'Statesman, Diarist, Naval Official, Painter'}, {'title': 'ABRAHAMS, Harold (1899-1978)', 'address': 'Hodford Lodge, 2 Hodford Road, Golders Green, London, NW11 8NP, London Borough of Barnet', 'professions': 'Athlete'}, ...]
I want to get just the text from a span:
html = <a class="business-name" data-analytics='{"click_id":1600,"target":"name","feature_click":""}' href="/new-york-ny/bpp/upper-eastside-orthodontists-20151" rel=""><span>Upper Eastside Orthodontists</span></a>
name = html.find('a', {'class': 'business-name'})
print(name.find('span').text)
gives me:
print(name.find('span').text)
AttributeError: 'NoneType' object has no attribute 'text'
I want to get just the text: Upper Eastside Orthodontists
What you are actually looking for is not in the static/initial request. The page is rendered dynamically.
Luckily the data does come in under the <script> tags, and you can pull out the json and parse it from there:
import requests
from bs4 import BeautifulSoup
import re
import json
import pandas as pd

url = 'https://www.superpages.com/new-york-ny/dentists?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# the second-to-last ld+json <script> tag holds the search-result list
script = soup.find_all('script', {'type': "application/ld+json"})[-2]

# pull the JSON object out of the script tag and parse it
p = re.compile('({.*})')
result = p.search(str(script))
data = json.loads(result.group(0))

df = pd.DataFrame(data['mainEntity']['itemListElement'])
Output:
print(df.to_string())
@type name url
0 ItemPage Upper Eastside Orthodontists https://www.superpages.com/new-york-ny/bpp/upper-eastside-orthodontists-20151
1 ItemPage Kara https://www.superpages.com/new-york-ny/bpp/kara-5721648
2 ItemPage Central Park West Dentistry https://www.superpages.com/new-york-ny/bpp/central-park-west-dentistry-471054528
3 ItemPage Majid Rajabi Khamesi Advanced Family Dental https://www.superpages.com/new-york-ny/bpp/majid-rajabi-khamesi-advanced-family-dental-542761105
4 ItemPage Robert Veligdan, DMD, PC https://www.superpages.com/new-york-ny/bpp/robert-veligdan-dmd-pc-21238912
5 ItemPage Irina Rossinski, DDS https://www.superpages.com/new-york-ny/bpp/irina-rossinski-dds-462447740
6 ItemPage Dr. Michael J. Wei https://www.superpages.com/new-york-ny/bpp/dr-michael-j-wei-504012551
7 ItemPage Manhattan Dental Spa https://www.superpages.com/new-york-ny/bpp/manhattan-dental-spa-22612348
8 ItemPage Expert Dental PC https://www.superpages.com/new-york-ny/bpp/expert-dental-pc-459327373
9 ItemPage Dr. Jonathan Freed, D.D.S., P.C. https://www.superpages.com/new-york-ny/bpp/dr-jonathan-freed-d-d-s-p-c-503142997
10 ItemPage Clifford S. Melnick, DMD PC https://www.superpages.com/new-york-ny/bpp/clifford-s-melnick-dmd-pc-512698216
11 ItemPage Ronald Birnbaum Dds https://www.superpages.com/new-york-ny/bpp/ronald-birnbaum-dds-2757412
12 ItemPage Concerned Dental Care https://www.superpages.com/new-york-ny/bpp/concerned-dental-care-453434343
13 ItemPage DownTown Dental Cosmetic Center https://www.superpages.com/new-york-ny/bpp/downtown-dental-cosmetic-center-468569119
14 ItemPage Beth Caunitz, D.D.S. https://www.superpages.com/new-york-ny/bpp/beth-caunitz-d-d-s-479935675
15 ItemPage Alice Urbankova DDS, P https://www.superpages.com/new-york-ny/bpp/alice-urbankova-dds-p-474879958
16 ItemPage Wu Darryl DDS PC https://www.superpages.com/new-york-ny/bpp/wu-darryl-dds-pc-8291524
17 ItemPage Gerald Rosen DDS https://www.superpages.com/new-york-ny/bpp/gerald-rosen-dds-470302208
18 ItemPage Group Health Dental https://www.superpages.com/new-york-ny/bpp/group-health-dental-15648711
19 ItemPage Dr. Shaun Massiah, DMD https://www.superpages.com/new-york-ny/bpp/dr-shaun-massiah-dmd-453290181
20 ItemPage Park 56 Dental https://www.superpages.com/new-york-ny/bpp/park-56-dental-479624928?lid=1001970746762
21 ItemPage Rubin Esther S https://www.superpages.com/new-york-ny/bpp/rubin-esther-s-462458952
22 ItemPage David P Pitman DMD https://www.superpages.com/new-york-ny/bpp/david-p-pitman-dmd-9139813
23 ItemPage Daniell Jason Mishaan, DMD https://www.superpages.com/new-york-ny/bpp/daniell-jason-mishaan-dmd-479623764
24 ItemPage Dolman Oral Surgery https://www.superpages.com/new-york-ny/bpp/dolman-oral-surgery-534333982
25 ItemPage Emagen Dental https://www.superpages.com/new-york-ny/bpp/emagen-dental-460512214
26 ItemPage The Exchange Dental Group https://www.superpages.com/new-york-ny/bpp/the-exchange-dental-group-462981940
27 ItemPage Joshua M. Wilges DDS & Associates https://www.superpages.com/new-york-ny/bpp/joshua-m-wilges-dds-associates-497873451
28 ItemPage Oren Rahmanan, DDS https://www.superpages.com/new-york-ny/bpp/oren-rahmanan-dds-472633138
29 ItemPage Victoria Veytsman, DDS https://www.superpages.com/new-york-ny/bpp/victoria-veytsman-dds-456826960
You could then iterate through each link to get the data from each business's own page, as sketched below.
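For instance, something like the following, which reuses df from the code above. It assumes each detail page also embeds an application/ld+json block like the listing page does; verify that against a real page before relying on it:
import json
import re
import time

import requests
from bs4 import BeautifulSoup

details = []
for link in df['url']:
    response = requests.get(link)
    detail_soup = BeautifulSoup(response.text, 'html.parser')
    # assumed: detail pages embed ld+json metadata like the listing page
    for script in detail_soup.find_all('script', {'type': 'application/ld+json'}):
        match = re.search('({.*})', str(script))
        if match:
            details.append(json.loads(match.group(0)))
    time.sleep(1)  # be polite between requests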
The other option, which is a little trickier, is that the data is also in the HTML itself. It's only tricky in that you need to cut out the excess: there's a sponsored ad first, and the entries after the initial 30 results don't follow the same HTML structure/pattern.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.superpages.com/new-york-ny/dentists?page=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

businesses = soup.find_all('a', {'class': 'business-name'})

rows = []
for each in businesses[1:31]:  # skip the sponsored ad, keep the first 30 organic results
    name = each.text
    address = each.find_next('div', {'class': 'street-address'}).text
    phone = each.find_next('a', {'class': 'phones phone primary'}).text.replace('Call Now', '')
    rows.append({'name': name,
                 'address': address,
                 'phone': phone})

df = pd.DataFrame(rows)
Output:
print(df.to_string())
name address phone
0 Upper Eastside Orthodontists 153 E 87th St Apt 1b, New York, NY, 10128 888-378-2976
1 Kara 30 E 60th St Rm 503, New York, NY, 10022 212-355-2195
2 Central Park West Dentistry 25 W 68th St, New York, NY, 10023 212-579-8885
3 Majid Rajabi Khamesi Advanced Family Dental 30 E 40th St Rm 705, New York, NY, 10016 212-481-2535
4 Robert Veligdan, DMD, PC 343 W 58th St, New York, NY, 10019 212-832-2330
5 Irina Rossinski, DDS 30 5th Ave Apt 1g, New York, NY, 10011 212-673-3700
6 Dr. Michael J. Wei 425 Madison Ave.20th Floor, New York, NY, 10017 646-798-6490
7 Manhattan Dental Spa 200 Madison Ave Ste 2201, New York, NY, 10016 212-683-2530
8 Expert Dental PC 110 E 40th St Rm 104, New York, NY, 10016 212-682-2965
9 Dr. Jonathan Freed, D.D.S., P.C. 315 Madison Ave Rm 509, New York, NY, 10017 212-682-5644
10 Clifford S. Melnick, DMD PC 41 W 58th St Apt 2e, New York, NY, 10019 212-355-1266
11 Ronald Birnbaum Dds 425 W 59th St, New York, NY, 10019 212-523-8030
12 Concerned Dental Care 30 E 40th St Rm 207, New York, NY, 10016 212-696-4979
13 DownTown Dental Cosmetic Center 160 Broadway, New York, NY, 10038 212-964-3337
14 Beth Caunitz, D.D.S. 30 East 40th Street, Suite 406, New York, NY, 10016 212-206-9002
15 Alice Urbankova DDS, P 630 5th Ave Ste 1860, New York, NY, 10111 212-765-7340
16 Wu Darryl DDS PC 41 Elizabeth St, New York, NY, 10013 212-925-7757
17 Gerald Rosen DDS 59 E 54th St, New York, NY, 10022 212-753-9860
18 Group Health Dental 230 W 41st St, New York, NY, 10036 212-398-9690
19 Dr. Shaun Massiah, DMD 50 W 97th St Apt 1c, New York, NY, 10025 212-222-5225
20 Park 56 Dental 120 E 56th St Rm 610, New York, NY, 10022 347-770-3915
21 Rubin Esther S 18 E 48th St, New York, NY, 10017 212-593-7272
22 David P Pitman DMD 57 W 57th St Ste 707, New York, NY, 10019 212-888-2833
23 Daniell Jason Mishaan, DMD 241 W 37th St, New York, NY, 10018 212-730-4440
24 Dolman Oral Surgery 16 E 52nd St Ste 402, New York, NY, 10022 212-696-0167
25 Emagen Dental 250 8th Ave Apt 2s, New York, NY, 10011 212-352-9300
26 The Exchange Dental Group 39 Broadway Rm 2115, New York, NY, 10006 212-422-9229
27 Joshua M. Wilges DDS & Associates 2 West 45th Street Suite 1708, New York, NY, 10036 646-590-2100
28 Oren Rahmanan, DDS 1 Rockefeller Plz Rm 2223, New York, NY, 10020 212-581-6736
29 Victoria Veytsman, DDS 509 Madison Ave Rm 1704, New York, NY, 10022 212-759-6700
I'm learning how to scrape using Beautiful Soup with Selenium, and I found a website that has multiple tables (my first time dealing with table tags). I'm trying to scrape the text from each table and append each element to its respective list. To start I'm scraping the first table; the rest I want to do on my own, but I cannot access the tag for some reason.
I also incorporated Selenium to access the site, because when I copy the link into another tab the list of tables disappears, for some reason.
My code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
try:
    page = requests.get(targetSite)
    soup = BeautifulSoup(page.text, 'html.parser')
    items = soup.find_all('table', {"class": "popdetail"})
    for i in items:
        event_title.append(item.find('b', {'class': "text"})).text.strip()
        name.append(item.find('td', {'class': "text"})).text.strip()
        address.append(item.find('td', {'class': "text"})).text.strip()
        city.append(item.find('td', {'class': "text"})).text.strip()
        state.append(item.find('td', {'class': "text"})).text.strip()
        zipCode.append(item.find('td', {'class': "text"})).text.strip()
Can someone let me know if I am doing this correctly? This is my first time dealing with a site whose elements disappear when the URL is copied into a new tab and/or window.
So far, I am unable to append any information to the lists.
One issue is with the for loop: you have for i in items:, but then you reference item instead of i inside the loop. (The .text.strip() calls also sit outside the append(...) parentheses, so they are applied to append's return value, which is None.) A minimal fix is shown below.
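For reference, the minimal fix to the loop itself (the larger rework below is what I would actually use) looks like this:
for item in items:
    # consistent loop variable, and .text.strip() kept inside append(...)
    event_title.append(item.find('b', {'class': 'text'}).text.strip())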
And secondly, if you are using Selenium to render the page, then you should probably use Selenium to get the HTML as well. The site also embeds tables within tables, so it's not as straightforward as iterating through the <table> tags. What I ended up doing was having pandas read in the tables (pd.read_html returns a list of dataframes) and then iterating through those, as there is a pattern to how the dataframes are constructed.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select

PATH = r"C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)

targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)

select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()

event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []

# read every <table> on the rendered page into a list of dataframes
dfs = pd.read_html(driver.page_source)
driver.close()

for idx, table in enumerate(dfs):
    # each event starts with a small table whose first cell is 'Event Title';
    # the related detail tables follow at fixed offsets
    if table.iloc[0, 0] == 'Event Title':
        event_title.append(table.iloc[-1, 0])
        tempA = dfs[idx + 1]
        tempA.index = tempA[0]
        tempB = dfs[idx + 4]
        tempB.index = tempB[0]
        tempC = dfs[idx + 5]
        tempC.index = tempC[0]
        name.append(tempA.loc['Name', 1])
        address.append(tempA.loc['Address', 1])
        city.append(tempA.loc['City', 1])
        state.append(tempA.loc['State', 1])
        zipCode.append(tempA.loc['Zip', 1])
        location.append(tempA.loc['Location', 1])
        webSite.append(tempA.loc['Web Site', 1])
        fee.append(tempB.loc['Fee', 1])
        event_dates.append(tempB.loc['Dates', 1])
        opening_dates.append(tempB.loc['Opening Days', 1])
        description.append(tempC.loc['Event Description', 1])

df = pd.DataFrame({'event_title': event_title,
                   'name': name,
                   'address': address,
                   'city': city,
                   'state': state,
                   'zipCode': zipCode,
                   'location': location,
                   'webSite': webSite,
                   'fee': fee,
                   'event_dates': event_dates,
                   'opening_dates': opening_dates,
                   'description': description})
Output:
print (df.to_string())
event_title name address city state zipCode location webSite fee event_dates opening_dates description
0 The San Diego Museum of Art Welcomes a Special... San Diego Museum of Art 1450 El Prado, Balboa Park San Diego CA 92101 Central San Diego https://www.sdmart.org/ NaN Starts On 6-18-2020 Ends On 1-10-2021 Opens virtually on June 18. The work will beco... The San Diego Museum of Art is launching its f...
1 New Exhibit: Miller Dairy Remembered Lemon Grove Historical Society 3185 Olive Street, Treganza Heritage Park Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Children 12 and under free and must be accompa... Starts On 6-27-2020 Ends On 12-4-2020 Exhibit on view Saturdays 11 am to 2 pm; close... From 1926 there were cows smack in the midst o...
2 Gizmos and Shivelight Distinction Gallery 317 E. Grand Ave Escondido CA 92025 North County Inland http://www.distinctionart.com NaN Starts On 7-14-2020 Ends On 9-5-2020 08/08/20 - 09/05/20 Distinction Gallery is proud to present our so...
3 Virtual Opening - July Exhibitions Vision Art Museum 2825 Dewey Rd. Suite 100 San Diego CA 92106 Central San Diego http://www.visionsartmuseum.org Free Starts On 7-18-2020 Ends On 10-4-2020 NaN Join Visions Art Museum for a virtual exhibiti...
4 Laying it Bare: The Art of Walter Redondo and ... Fresh Paint Gallery 1020-B Prospect Street La Jolla CA 92037 Central San Diego http://freshpaintgallery.com/ NaN Starts On 8-1-2020 Ends On 9-27-2020 Tuesday through Sunday. Mondays closed. A two-person exhibit of new abstract expressio...
5 Online oil painting lessons with Concetta Antico NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 8-10-2020 Ends On 8-31-2020 NaN Anyone can learn to paint like the masters! Ov...
6 MOMENTUM: A Creative Industry Symposium Vanguard Culture Via Zoom San Diego California 92101 Virtual https://www.eventbrite.com/e/momentum-a-creati... $10 suggested donation Starts On 8-17-2020 Ends On 9-7-2020 NaN MOMENTUM: A Creative Industry Symposium Monday...
7 Virtual Locals Invitational Show Art & Frames of Coronado 936 ORANGE AVE Coronado CA 92118 0 https://www.artsteps.com/view/5eed0ad62cd0d65b... free Starts On 8-21-2020 Ends On 8-1-2021 NaN Art and Frames of Coronado invites you to our ...
8 HERE & Now R.B. Stevenson Gallery 7661 Girard Avenue, Suite 101 La Jolla California 92037 Central San Diego http://www.rbstevensongallery.com Free Starts On 8-22-2020 Ends On 9-25-2020 Tuesday through Saturday R.B.Stevenson Gallery is pleased to announce t...
9 Art Unites Learning: Normal 2.0 Art Unites NaN San Diego NaN 92116 Central San Diego https://www.facebook.com/events/956878098104971 Free Starts On 8-25-2020 Ends On 8-25-2020 NaN Please join us on Tuesday, August 25th as we: ...
10 Image Quest Sojourn; Visual Journaling for Per... Pamela Underwood Studios Virtual NaN NaN NaN Virtual http://www.pamelaunderwood.com/event/new-onlin... $595.00 Starts On 8-26-2020 Ends On 11-11-2020 NaN Create a personal Image Quest resource journal...
11 Behind The Exhibition: Southern California Con... Oceanside Museum of Art 704 Pier View Way Oceanside California 92054 Virtual https://oma-online.org/events/behind-the-exhib... No fee required. Donations recommended. Starts On 8-27-2020 Ends On 8-27-2020 NaN Join curator Beth Smith and exhibitions manage...
12 Lay it on Thick, a Virtual Art Exhibition San Diego Watercolor Society 2825 Dewey Rd Bldg #202 San Diego California 92106 0 https://www.sdws.org NaN Starts On 8-30-2020 Ends On 9-26-2020 NaN The San Diego Watercolor Society proudly prese...
13 The Forum: Marketing & Branding for Creatives Vanguard Culture Via Zoom San Diego CA 92101 South San Diego http://vanguardculture.com/ $5 suggested donation Starts On 9-1-2020 Ends On 9-1-2020 NaN Attention creative industry professionals! Joi...
14 Write or Die Solo Exhibition You Belong Here 3619 EL CAJON BLVD San Diego CA 92104 Central San Diego http://www.youbelongsd.com/upcoming-events/wri... $10 donation to benefit You Belong Here Starts On 9-4-2020 Ends On 9-6-2020 NaN Write or Die is an immersive installation and ...
15 SDVAN presents Art San Diego at Bread and Salt San Diego Visual Arts Network 1955 Julian Avenue San Digo CA 92113 Central San Diego http://www.sdvisualarts.net and https://www.br... Free Starts On 9-5-2020 Ends On 10-24-2020 NaN We are pleased to announce the four artist rec...
16 The Coming of Treganza Heritage Park Lemon Grove Historical Society 3185 Olive Street Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Free for all ages Starts On 9-10-2020 Ends On 9-10-2020 The park is open daily, 8 am to 8 pm. Covid 19... Lemon Grove's central city park will be renam...
17 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 9-14-2020 Ends On 10-5-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
18 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 10-12-2020 Ends On 11-2-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
19 36th Annual Mission Fed ArtWalk Mission Fed ArtWalk Ash Street San Diego California 92101 Central San Diego www.missionfedartwalk.org Free Starts On 11-7-2020 Ends On 11-8-2020 Sat and Sun Nov 7 and 8 Mission Fed ArtWalk returns to San Diego’s Lit...
20 Mingei Pop Up Workshop: My Daruma Doll New Childrens Museum 200 West Island Avenue San Diego California 92101 Central San Diego http://thinkplaycreate.org/ Free with admission Starts On 11-13-2020 Ends On 11-13-2020 NaN Join Mingei International Museum at The New Ch...
I'm trying to convert a table found on a website (full details and photo below) to a CSV. I've started with the code below, but the table isn't returning anything. I think it must have something to do with me not understanding the right naming convention for the table; any additional help toward my ultimate goal would be appreciated.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
url = 'https://www.privateequityinternational.com/database/#/pei-300'
page = requests.get(url) #gets info from page
soup = BeautifulSoup(page.content,'html.parser') #parses information
table = soup.findAll('table',{'class':'au-target pux--responsive-table'}) #collecting blocks of info inside of table
table
Output: []
In addition to the URL provided in the above code, I'm essentially trying to convert the following table (found on the website) to a CSV file:
[screenshot of the PEI 300 ranking table]
The data is loaded from an external URL via Ajax. You can use the requests and json modules to get it:
import json
import requests

url = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s?count=25&start=0'
data = requests.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for item in data['data']:
    print('{:<5} {:<30} {}'.format(item['id'], item['name'], item['headquarters']))
Prints:
5611 Blackstone New York, United States
5579 The Carlyle Group Washington DC, United States
5586 KKR New York, United States
6701 TPG Fort Worth, United States
5591 Warburg Pincus New York, United States
1801 NB Alternatives New York, United States
6457 CVC Capital Partners Luxembourg, Luxembourg
6477 EQT Stockholm, Sweden
6361 Advent International Boston, United States
8411 Vista Equity Partners Austin, United States
6571 Leonard Green & Partners Los Angeles, United States
6782 Cinven London, United Kingdom
6389 Bain Capital Boston, United States
8096 Apollo Global Management New York, United States
8759 Thoma Bravo San Francisco, United States
7597 Insight Partners New York, United States
867 BlackRock New York, United States
5471 General Atlantic New York, United States
6639 Permira Advisers London, United Kingdom
5903 Brookfield Asset Management Toronto, Canada
6473 EnCap Investments Houston, United States
6497 Francisco Partners San Francisco, United States
6960 Platinum Equity Beverly Hills, United States
16331 Hillhouse Capital Group Hong Kong, Hong Kong
5595 Partners Group Baar-Zug, Switzerland
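The count and start query parameters look like standard paging controls, so you can likely walk the whole list by stepping start; this is assumed behaviour, not documented. A sketch:
import requests

base = 'https://ra.pei.blaize.io/api/v1/institutions/pei-300s?count=25&start={}'

rows = []
for start in range(0, 300, 25):  # the name "PEI 300" suggests 300 entries in total
    chunk = requests.get(base.format(start)).json()['data']
    if not chunk:  # stop early if the endpoint runs out of records
        break
    rows.extend(chunk)

print(len(rows))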
And a selenium version:
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
driver = webdriver.Firefox(executable_path='c:/program/geckodriver.exe')
url = 'https://www.privateequityinternational.com/database/#/pei-300'
driver.get(url) #gets info from page
time.sleep(5)
page = driver.page_source
driver.close()
soup = BeautifulSoup(page,'html.parser') #parses information
table = soup.select_one('table.au-target.pux--responsive-table') #collecting blocks of info inside of table
dfs = pd.read_html(table.prettify())
df = pd.concat(dfs)
df.to_csv('file.csv')
print(df.head(25))
prints:
Ranking Name City, Country (HQ)
0 1 Blackstone New York, United States
1 2 The Carlyle Group Washington DC, United States
2 3 KKR New York, United States
3 4 TPG Fort Worth, United States
4 5 Warburg Pincus New York, United States
5 6 NB Alternatives New York, United States
6 7 CVC Capital Partners Luxembourg, Luxembourg
7 8 EQT Stockholm, Sweden
8 9 Advent International Boston, United States
9 10 Vista Equity Partners Austin, United States
10 11 Leonard Green & Partners Los Angeles, United States
11 12 Cinven London, United Kingdom
12 13 Bain Capital Boston, United States
13 14 Apollo Global Management New York, United States
14 15 Thoma Bravo San Francisco, United States
15 16 Insight Partners New York, United States
16 17 BlackRock New York, United States
17 18 General Atlantic New York, United States
18 19 Permira Advisers London, United Kingdom
19 20 Brookfield Asset Management Toronto, Canada
20 21 EnCap Investments Houston, United States
21 22 Francisco Partners San Francisco, United States
22 23 Platinum Equity Beverly Hills, United States
23 24 Hillhouse Capital Group Hong Kong, Hong Kong
24 25 Partners Group Baar-Zug, Switzerland
It also saves the data to file.csv. Note you need selenium and geckodriver; in this code geckodriver is expected to be loaded from c:/program/geckodriver.exe.
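As a side note, if you are on a newer Selenium (4.x), executable_path has been removed in favour of a Service object, so the driver setup would look like this instead:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# Selenium 4 style: pass the geckodriver path through a Service object
driver = webdriver.Firefox(service=Service('c:/program/geckodriver.exe'))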
I have a column in my dataframe that contains the following:
Wal-Mart Stores, Inc., Clinton, IA 52732
Benton Packing, LLC, Clearfield, UT 84016
North Coast Iron Corp, Seattle, WA 98109
Messer Construction Co. Inc., Amarillo, TX 79109
Ocean Spray Cranberries, Inc., Henderson, NV 89011
W R Derrick & Co. Lexington, SC 29072
I am having problems capturing these with a regex; so far my regex works for the first two lines:
[A-Z][A-za-z-\s]+,\s{1}(Inc.|LLC)
How do I split the column into 4 additional columns? i.e. Column 1 = Company Name, Column 2 = City, Column 3 = State, Column 4 = Zipcode.
Example of the output is shown below:
Company_Name City State ZipCode
Wal-Mart Stores, Inc. Clinton IA 52732
The names are probably the trickiest part, but if you know that the structure of city, state, zip will always be the same (i.e. no extra commas), you can use rsplit to split the strings. pandas has a str.rsplit method as well, demonstrated after the output below.
df
Address
0 Wal-Mart Stores, Inc., Clinton, IA 52732
1 Benton Packing, LLC, Clearfield, UT 84016
2 North Coast Iron Corp, Seattle, WA 98109
3 Messer Construction Co. Inc., Amarillo, TX 79109
# zip code: everything after the last space
df['Zip'] = df.Address.map(lambda x: x.rsplit(' ', 1)[-1])
# split the remainder on its last two commas into name, city, state
df['Name'], df['City'], df['State'] = zip(*df.Address.map(lambda x: x.rsplit(' ', 1)[0].rsplit(',', 2)))
df
Address Zip \
0 Wal-Mart Stores, Inc., Clinton, IA 52732 52732
1 Benton Packing, LLC, Clearfield, UT 84016 84016
2 North Coast Iron Corp, Seattle, WA 98109 98109
3 Messer Construction Co. Inc., Amarillo, TX 79109 79109
Name City State
0 Wal-Mart Stores, Inc. Clinton IA
1 Benton Packing, LLC Clearfield UT
2 North Coast Iron Corp Seattle WA
3 Messer Construction Co. Inc. Amarillo TX
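For the pandas str.rsplit route mentioned above, the same split can be done column-wise (a sketch; expand=True spreads the pieces across columns):
# split off the zip code, then split the remainder on its last two commas
parts = df['Address'].str.rsplit(' ', n=1, expand=True)
df['Zip'] = parts[1]
rest = parts[0].str.rsplit(',', n=2, expand=True)
df['Name'] = rest[0]
df['City'] = rest[1].str.strip()
df['State'] = rest[2].str.strip()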
I have this code
import requests
import csv
from bs4 import BeautifulSoup
from time import sleep
f = csv.writer(open('destinations.csv', 'w'))
f.writerow(['Destinations', 'Country'])
pages = []
for i in range(1, 3):
    url = 'http://www.travelindicator.com/destinations?page=' + str(i)
    pages.append(url.decode('utf-8'))

for item in pages:
    page = requests.get(item, sleep(2))
    soup = BeautifulSoup(page.content.text, 'lxml')
    for destinations_list in soup.select('.news-a header'):
        destination = soup.select('h2 a')
        country = soup.select('p a')
        print(destinations_list.text)
        f.writerow([destinations_list])
which gives me the following console output:
Ellora
1
3/5
India
Volterra
2
2/5
Italy
Hamilton
3
3/5
New Zealand
London
4
5/5
United Kingdom
Sun Moon Lake
5
5/5
Taiwan
Texel
6
etc...
Firstly, I am unsure why the extra numbers are being added, as I have only specified the parts I want for each country.
Secondly, when I try to format it into a CSV file, it doesn't remove the HTML even though I have told my soup to give me content.text. I've been trying to figure it out for an hour and am at a loss.
import requests
from bs4 import BeautifulSoup

pages = []
for i in range(1, 3):
    pages.append('http://www.travelindicator.com/destinations?page=' + str(i))

destinations_list = []
countries = []
for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'lxml')
    for article in soup.find_all('article'):
        destinations_list.append(article.find('h2').text)
        countries.append(article.find('p').find('a').text)

for destination, country in zip(destinations_list, countries):
    print(destination)
    print(country)
Output:
Ellora
India
Volterra
Italy
Hamilton
New Zealand
London
United Kingdom
Sun Moon Lake
Taiwan
Texel
The Netherlands
Zhengzhou
China
Vladivostok
Russia
Charleston
United States
Banska Bystrica
Slovakia
Lviv
Ukraine
Viareggio
Italy
Wakkanai
Japan
Nordkapp
Norway
Jericoacoara
Brazil
Tainan
Taiwan
Boston
United States
Keelung
Taiwan
Stockholm
Sweden
Shaoxing
China
Bohol
Distance to you
Bohol
Philippines
Saint Petersburg
Russia
Malmo
Sweden
Elba
Italy
Gdansk
Poland
Langkawi
Malaysia
Poznan
Poland
Daegu
South Korea
Abu Simbel
Egypt
Melbourne
Australia
Reunion
Reunion
Annecy
France
Colombo
Sri Lanka
Penghu
Taiwan
Conwy
United Kingdom
Monterrico
Guatemala
Janakpur
Nepal
Bimini
Bahamas
Lake Tahoe
United States
Essaouira
Morocco