How to scrape hidden class data using selenium and beautiful soup - python

I'm trying to scrape the content of a JavaScript-enabled web page. I need to extract the data in the table on that website. However, each row of the table has a button (an arrow) that reveals additional information about that row.
I need to extract that additional description for each row. By inspecting the page, I can see that the content behind the arrow in each row belongs to the same class. However, the class does not appear in the page source; it is only visible while inspecting. The data I'm trying to parse is from the webpage.
I have used Selenium and Beautiful Soup. I'm able to scrape the table data, but not the content behind the arrows in the table. My Python code returns an empty list for the class of that arrow, but it works for the classes of the normal table data.
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
html_source = browser.page_source
soup = BeautifulSoup(html_source,'html.parser')
data = soup.find_all('div',class_="sc-fzoLsD jxXBhc rdt_ExpanderRow")
print(data)  # find_all returns a list, which has no .text attribute

To print hidden data, you can use this example:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://projects.sfchronicle.com/2020/layoff-tracker/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data_url = 'https://projects.sfchronicle.com' + soup.select_one('link[href*="commons-"]')['href']
data = re.findall(r'n\.exports=JSON\.parse\(\'(.*?)\'\)', requests.get(data_url).text)[1]
data = json.loads(data.replace(r"\'", "'"))
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
for d in data[4:]:
    print('{:<50}{:<10}{:<30}{:<30}{:<30}{:<30}{:<30}'.format(*d.values()))
Prints:
Company                                       Layoffs  City           County                Month  Industry       Company description
Tesla (Temporary layoffs. Factory reopened)   11083    Fremont        Alameda County        April  Industrial     Car maker
Bon Appetit Management Co.                    3015     San Francisco  San Francisco County  April  Food           Food supplier
GSW Arena LLC-Chase Center                    1720     San Francisco  San Francisco County  May    Sports         Arena vendors
YMCA of Silicon Valley                        1657     Santa Clara    Santa Clara County    May    Sports         Gym
Nutanix Inc. (Temporary furlough of 2 weeks)  1434     San Jose       Santa Clara County    April  Tech           Cloud computing
TeamSanJose                                   1304     San Jose       Santa Clara County    April  Travel         Tourism bureau
San Francisco Giants                          1200     San Francisco  San Francisco County  April  Sports         Stadium vendors
Lyft                                          982      San Francisco  San Francisco County  April  Tech           Ride hailing
YMCA of San Francisco                         959      San Francisco  San Francisco County  May    Sports         Gym
Hilton San Francisco Union Square             923      San Francisco  San Francisco County  April  Travel         Hotel
Six Flags Discovery Kingdom                   911      Vallejo        Solano County         June   Entertainment  Amusement park
San Francisco Marriott Marquis                808      San Francisco  San Francisco County  April  Travel         Hotel
Aramark                                       777      Oakland        Alameda County        April  Food           Food supplier
The Palace Hotel                              774      San Francisco  San Francisco County  April  Travel         Hotel
Back of the House Inc                         743      San Francisco  San Francisco County  April  Food           Restaurant
DPR Construction                              715      Redwood City   San Mateo County      April  Real estate    Construction
...and so on.
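The regex step can be tried offline on a small string. This is only a minimal sketch: `sample_js` is a made-up stand-in that imitates the `n.exports=JSON.parse('...')` pattern, not real data from the site, and in the real bundle the index of the matching payload may differ and can change whenever the site is rebuilt.

```python
import json
import re

# Made-up stand-in for the downloaded bundle; the real file is far larger.
sample_js = r"""function(e,n){n.exports=JSON.parse('[{"Company":"Joe\'s Diner","Layoffs":12}]')}"""

# Capture everything between JSON.parse(' and the closing ').
payloads = re.findall(r"n\.exports=JSON\.parse\('(.*?)'\)", sample_js)

# The bundle escapes apostrophes as \' (valid in JavaScript, invalid in JSON),
# so undo that escaping before decoding.
rows = json.loads(payloads[0].replace(r"\'", "'"))

# Left-align each value in a fixed-width column, as in the print call above.
lines = ["{:<20}{:<10}".format(*row.values()) for row in rows]
print(lines[0])
```

Because the pattern depends on how the bundle happens to be minified, it is worth checking `len(payloads)` and inspecting each candidate before hard-coding an index.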

The content you are interested in is generated when you click a button, so you need to locate the button first. There are many ways to do this, but I would suggest something like:
from selenium.webdriver.common.by import By
buttons = driver.find_elements(By.XPATH, '//button')
For your specific case you could also use:
buttons = driver.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]')
find_elements returns a list, so once you have the buttons, click each one in turn:
for button in buttons:
    button.click()
Parsing driver.page_source after this should get you the JavaScript-generated content you are looking for.

Related

Python - Scraping and classifying text in "fonts"

I would like to scrape the content of this website https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283 and create a table with the columns NAME, TITLE, LOCATION. I know some individuals have more or less "lines", but I am just trying to understand how I could even classify the first 3 lines for each person given that the text is in between "fonts" for all.
So far I have:
url="https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("font")
But once I am there and I have all the text within "font" in my "column" variable, I don't know how to proceed to differentiate between each person and build a loop where I would retrieve name, title, location etc. for each.
Any help would be highly appreciated!
Note: instead of using selenium, I simply fetched and parsed with soup = BeautifulSoup(requests.get(url).content, "html.parser"); as far as I can tell, the required section is not dynamic, so it shouldn't cause any issues.
Would you have any idea about how to look for pairs of <br>?
Since they represent empty lines, you could try simply splitting the text in that cell by \n\n\n
blockText = soup.select_one('td:has(font)').get_text(' ')
blockText = blockText.replace('-'*10, '\n\n\n') # pad "underlined" lines
blockSections = [sect.strip() for sect in '\n'.join([
    l.strip('-').strip() for l in blockText.splitlines()
]).split('\n\n\n') if sect.strip()]
Although, if you looked at blockSections, you might notice that some headers [ROSTER and MEMBERS] get stuck to the end of the previous section - probably because their formatting means that an extra <br> is not needed to distinguish them from their adjacent sections. [I added the .replace('-'*10, '\n\n\n') line so that at least they're separated from the next section.]
Another risk is that I don't know if all versions and parsers will parse <br><br> to text as 3 line breaks - some omit br space entirely from text, and others might add extra space based on spaces between tags in the source html.
It's easier to split if you loop through the <br>s and pad them with something more distinctive to split by; the .insert... methods are useful here. (This method also has the advantage of being able to target bolded lines as well.)
blockSoup = soup.select_one('td:has(font)')
for br2 in blockSoup.select('br+br, font:has(br)'):
    br2.insert_after(BeautifulSoup(f'<p>{"="*80}</p>').p)
    br2.insert_before(BeautifulSoup(f'<p>{"="*80}</p>').p)
blockSections = [
    sect.strip().strip('-').strip() for sect in
    blockSoup.get_text(' ').split("="*80) if sect.strip()
]
This time, blockSections looks something like
['Membership Roster - ACE\n AIDS CLINICAL STUDIES AND EPIDEMIOLOGY STUDY SECTION\n Center For Scientific Review\n (Terms end 6/30 of the designated year)\n ROSTER',
'CHAIRPERSON',
'SCHACKER, TIMOTHY\n W\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF MINNESOTA\n MINNEAPOLIS,\n MN\n 55455',
'MEMBERS',
'ANDERSON, JEAN\n R\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF GYNECOLOGY AND OBSTETRICS\n JOHNS HOPKINS UNIVERSITY\n BALTIMORE,\n MD 21287',
'BALASUBRAMANYAM, ASHOK\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE AND\n MOLECULAR AND CELLULAR BIOLOGY\n DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM\n BAYLOR COLLEGE OF MEDICINE\n HOUSTON,\n TX 77030',
'BLATTNER, WILLIAM\n ALBERT\n , MD,\n (15)\n PROFESSOR AND ASSOCIATE DIRECTOR\n DEPARTMENT OF MEDICNE\n INSTITUTE OF HUMAN VIROLOGY\n UNIVERSITY OF MARYLAND, BALTIMORE\n BALTIMORE,\n MD 21201',
'CHEN, YING\n QING\n , PHD,\n (15)\n PROFESSOR\n PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS\n FRED HUTCHINSON CANCER RESEARCH CENTER\n SEATTLE,\n WA 981091024',
'COTTON, DEBORAH\n , MD,\n (13)\n PROFESSOR\n SECTION OF INFECTIOUS DISEASES\n DEPARTMENT OF MEDICINE\n BOSTON UNIVERSITY\n BOSTON,\n MA 02118',
'DANIELS, MICHAEL\n J\n , SCD,\n (16)\n PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF TEXAS AT AUSTIN\n AUSTIN,\n TX 78712',
'FOULKES, ANDREA\n SARAH\n , SCD,\n (14)\n ASSOCIATE PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF MASSACHUSETTS\n AMHERST,\n MA 01003',
'HEROLD, BETSY\n C\n , MD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n ALBERT EINSTEIN COLLEGE OF MEDICINE\n BRONX,\n NY 10461',
'JUSTICE, AMY\n CAROLINE\n , MD, PHD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n YALE UNIVERSITY\n NEW HAVEN,\n CT 06520',
'KATZENSTEIN, DAVID\n ALLENBERG\n , MD,\n (13)\n PROFESSOR\n DIVISION OF INFECTIOUS DISEASES\n STANFORD UNIVERSITY SCHOOL OF MEDICINE\n STANFORD,\n CA 94305',
'MARGOLIS, DAVID\n M\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL\n CHAPEL HILL,\n NC 27599',
'MONTANER, LUIS\n J\n , DVM, PHD,\n (13)\n PROFESSOR\n DEPARTMENT OF IMMUNOLOGY\n THE WISTAR INSTITUTE\n PHILADELPHIA,\n PA 19104',
'MONTANO, MONTY\n A\n , PHD,\n (15)\n RESEARCH SCIENTIST\n DEPARTMENT OF IMMUNOLOGY AND\n INFECTIOUS DISEASES\n BOSTON UNIVERSITY\n BOSTON,\n MA 02115',
'PAGE, KIMBERLY\n , PHD, MPH,\n (16)\n PROFESSOR\n DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH\n AND GLOBAL HEALTH SCIENCES\n DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n SAN FRANCISCO,\n CA 94105',
'SHIKUMA, CECILIA\n M\n , MD,\n (15)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n HAWAII AIDS CLINICAL RESEARCH PROGRAM\n UNIVERSITY OF HAWAII\n HONOLULU,\n HI 96816',
'WOOD, CHARLES\n , PHD,\n (13)\n PROFESSOR\n UNIVERSITY OF NEBRASKA\n LINCOLN,\n NE 68588']
create a table with the columns NAME, TITLE, LOCATION
There may be a more elegant solution, but I feel like the simplest way would be to just loop the siblings of the headers and keep count of consecutive brs.
doubleBr = soup.select('br')[:2] # [ so the last person also gets added ]
personsList = []
for f in soup.select('td>font>font:has(b br)'):
    role, lCur,pCur,brCt = f.get_text(' ').strip('-').strip(), [],[],0
    for lf in f.find_next_siblings(['font','br'])+doubleBr:
        brCt = brCt+1 if lf.name == 'br' else 0
        if pCur and (brCt>1 or lf.b):
            pDets = {'role': role, 'name': '?'} # initiate
            if len(pCur)>1: pDets['title'] = pCur[1]
            pDets['name'], pCur = pCur[0], pCur[2:]
            dList = pCur[:-2]
            pDets['departments'] = dList[0] if len(dList)==1 else dList
            if len(pCur)>1: pDets['institute'] = pCur[-2]
            if pCur: pDets['location'] = pCur[-1]
            personsList.append(pDets)
            pCur, lCur, brCt = [], [], 0 # clear
        if lf.b: break # reached next section
        if lf.name == 'font': # [split and join to minimize whitespace]
            lCur.append(' '.join(lf.get_text(' ').split())) # add to line
        if brCt and lCur: pCur, lCur = pCur+[' '.join(lCur)], [] # newline
Since personsList is a list of dictionaries, it can be tabulated as simply as pandas.DataFrame(personsList) to get a DataFrame that looks like:
role | name | title | departments | institute | location
CHAIRPERSON | SCHACKER, TIMOTHY W , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF MINNESOTA | MINNEAPOLIS, MN 55455
MEMBERS | ANDERSON, JEAN R , MD | PROFESSOR | DEPARTMENT OF GYNECOLOGY AND OBSTETRICS | JOHNS HOPKINS UNIVERSITY | BALTIMORE, MD 21287
MEMBERS | BALASUBRAMANYAM, ASHOK , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE AND', 'MOLECULAR AND CELLULAR BIOLOGY', 'DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM'] | BAYLOR COLLEGE OF MEDICINE | HOUSTON, TX 77030
MEMBERS | BLATTNER, WILLIAM ALBERT , MD | PROFESSOR AND ASSOCIATE DIRECTOR | ['DEPARTMENT OF MEDICNE', 'INSTITUTE OF HUMAN VIROLOGY'] | UNIVERSITY OF MARYLAND, BALTIMORE | BALTIMORE, MD 21201
MEMBERS | CHEN, YING QING , PHD | PROFESSOR | PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS | FRED HUTCHINSON CANCER RESEARCH CENTER | SEATTLE, WA 981091024
MEMBERS | COTTON, DEBORAH , MD | PROFESSOR | ['SECTION OF INFECTIOUS DISEASES', 'DEPARTMENT OF MEDICINE'] | BOSTON UNIVERSITY | BOSTON, MA 02118
MEMBERS | DANIELS, MICHAEL J , SCD | PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF TEXAS AT AUSTIN | AUSTIN, TX 78712
MEMBERS | FOULKES, ANDREA SARAH , SCD | ASSOCIATE PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF MASSACHUSETTS | AMHERST, MA 01003
MEMBERS | HEROLD, BETSY C , MD | PROFESSOR | DEPARTMENT OF PEDIATRICS | ALBERT EINSTEIN COLLEGE OF MEDICINE | BRONX, NY 10461
MEMBERS | JUSTICE, AMY CAROLINE , MD, PHD | PROFESSOR | DEPARTMENT OF PEDIATRICS | YALE UNIVERSITY | NEW HAVEN, CT 06520
MEMBERS | KATZENSTEIN, DAVID ALLENBERG , MD | PROFESSOR | DIVISION OF INFECTIOUS DISEASES | STANFORD UNIVERSITY SCHOOL OF MEDICINE | STANFORD, CA 94305
MEMBERS | MARGOLIS, DAVID M , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL | CHAPEL HILL, NC 27599
MEMBERS | MONTANER, LUIS J , DVM, PHD | PROFESSOR | DEPARTMENT OF IMMUNOLOGY | THE WISTAR INSTITUTE | PHILADELPHIA, PA 19104
MEMBERS | MONTANO, MONTY A , PHD | RESEARCH SCIENTIST | ['DEPARTMENT OF IMMUNOLOGY AND', 'INFECTIOUS DISEASES'] | BOSTON UNIVERSITY | BOSTON, MA 02115
MEMBERS | PAGE, KIMBERLY , PHD, MPH | PROFESSOR | ['DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH', 'AND GLOBAL HEALTH SCIENCES', 'DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS', 'UNIVERSITY OF CALIFORNIA, SAN FRANCISCO'] | UNIVERSITY OF CALIFORNIA, SAN FRANCISCO | SAN FRANCISCO, CA 94105
MEMBERS | SHIKUMA, CECILIA M , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE', 'HAWAII AIDS CLINICAL RESEARCH PROGRAM'] | UNIVERSITY OF HAWAII | HONOLULU, HI 96816
MEMBERS | WOOD, CHARLES , PHD | PROFESSOR | [] | UNIVERSITY OF NEBRASKA | LINCOLN, NE 68588
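In the output above, departments holds a plain string for some people and a list (sometimes empty) for others, which can make the column awkward to filter or export. Here is a minimal sketch of one way to normalize it before tabulating, using three rows copied from that output. The helper name `normalize_departments` and the "; " separator are my own choices for illustration, not part of the code above.

```python
# Three rows in the same shape as personsList, copied from the output above.
persons = [
    {"role": "CHAIRPERSON", "name": "SCHACKER, TIMOTHY W , MD",
     "title": "PROFESSOR", "departments": "DEPARTMENT OF MEDICINE",
     "institute": "UNIVERSITY OF MINNESOTA", "location": "MINNEAPOLIS, MN 55455"},
    {"role": "MEMBERS", "name": "COTTON, DEBORAH , MD", "title": "PROFESSOR",
     "departments": ["SECTION OF INFECTIOUS DISEASES", "DEPARTMENT OF MEDICINE"],
     "institute": "BOSTON UNIVERSITY", "location": "BOSTON, MA 02118"},
    {"role": "MEMBERS", "name": "WOOD, CHARLES , PHD",
     "title": "PROFESSOR", "departments": [],
     "institute": "UNIVERSITY OF NEBRASKA", "location": "LINCOLN, NE 68588"},
]


def normalize_departments(value, sep="; "):
    """Return departments as one string, whether given a str or a list."""
    if isinstance(value, str):
        return value
    return sep.join(value)


# Build new dicts so the original rows are left untouched.
cleaned = [
    {**p, "departments": normalize_departments(p.get("departments", []))}
    for p in persons
]
for p in cleaned:
    print(repr(p["departments"]))
```

Passing `cleaned` to pandas.DataFrame instead of the raw list should then give a departments column that contains only strings.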
[ Btw, if the .select('br+br, font:has(br)') and .select('td>font>font:has(b br)') parts are unfamiliar to you, you can look up .select and CSS selectors. Combinators [like >/+/,] and pseudo-classes [like :has] allow us to get very specific with our targets. ]

How can I parse non-tagged data from beautiful soup output?

I am trying to parse the following data extracted with beautiful soup.
{"currentLink":"/torrance-ca-90510/","regionId":96168,"displayRegionName":"90510"}],"universities":[]},"showAttributeLinks":null}},"mapState":{"customRegionPolygonWkt":null,"schoolPolygonWkt":null,"isCurrentLocationSearch":false,"userPosition":{"lat":null,"lon":null}},"regionState":{"regionInfo":[{"regionType":6,"regionId":54722,"regionName":"Torrance","displayName":"Torrance CA","isPointRegion":false}],"regionBounds":{"north":33.887061,"east":-118.308127,"south":33.780217,"west":-118.394107}},"searchPageSeoObject":{"baseUrl":"/torrance-ca/","windowTitle":"Torrance CA Real Estate - Torrance CA Homes For Sale | Zillow","metaDescription":"Zillow has 100 homes for sale in Torrance CA. View listing photos, review sales history, and use our detailed real estate filters to find the perfect place."},"abTrials":{"SXP_HDP_CONTINGENT_V2":"ON","SXP_REGION_AUTOCOMPLETE_SOURCE":"TRULIA","RE_Move_In_Date_Filter":"TEST","SXP_SENTRY":"ON","SEOTEST__SXP_LIST_ONLY_SRP":"CONTROL","SXP_SAVE_SEARCH_COLOR":"CONTROL","SXP_REACT_FOOTER_DESKTOP":"CONTROL","RE_Web_PersonalizedSort":"CONTROL","ACQ_Search_Filters_Upsell":"CONTROL","VARIANTS_BDP_768_PLUS":"CONTROL","SXP_FLOATING_ACTION_BAR":"ON","SET_CTA_COUNTLESS":"CONTROL","SXP_LISTING_SUBTYPE":"CONTROL","SXP_Search_Refinement_Filters":"CONTROL","RE_SearchByBuildingName":"TEST","SXP_PREXIT_CLAIMS_INFO":"ENABLED","SXP_NONMLS_OFF":"CONTROL","SXP_PARTIAL_PAGE_LOAD_REFACTOR":"CONTROL","SXP_NAV_AD_LOADING":"CONTROL","ACT_FILTER_ON_LAND":"CONTROL","SXP_PHOTO_CAROUSEL":"CONTROL","SEO__SXP_REMOVE_ANCHOR_TEXT":"CONTROL","DXP_NEW_MAP_DOTS_WEB":"CONTROL","SXP_ACT_REMOVE_SEARCHBOX_GLEAM":"NO_GLEAM","SXP_Exclude_Referer":"TEST","SXP_PAGE_LOAD":"FASTER","RE_Rentals_Badging_v1":"CONTROL","SXP_REACT_GPT":"REMOVED","SXP_QU_PHASE_2":"ON","RE_HDP_REDIRECT":"CONTROL","RE_Search_Refinement_Filters":"CONTROL","MIGHTY_MONTH_2022_HOLDOUT":"MIGHTY_MONTH_ON","SXP_NEW_LANE_CLICKSTREAM":"ON","SXP_REDUCED_SERVER_SIDE_RENDER":"CONTROL","SXP_DelayJS":"AFTER_LOAD","ACQ_Bann
er_Suppression":"CONTROL","GS_RATING_CLEANUP":"CONTROL","SXP_DEFERRED_RENDERER":"ASYNC_INITIAL_HYDRATE","SXP_STREETVIEW_REQUEST_TYPE":"CONTROL","RE_RentalsHomesForYouSort":"CONTROL","SP_FOR_RENT_PAGE":"CONTROL","Activation_NewLane_Metrics_Enabled":"DISABLED","SEOTEST__SXP_REMOVE_WHY_ZILLOW":"REMOVE_WHY_ZILLOW","Activation_Enabled":"ENABLED","DXP_RTB_LINKING":"ON","DXP_PHOTO_CAROUSEL":"CONTROL","DXP_MAP_ICONS":"CONTROL","WEB_HIDDEN_HOMES_2022":"ON","SXP_Rentals_Apartment_Community_Filter":"TEST","DXP_HOMEPAGE_OMP_CLIENT_REFACTOR":"ON","SXP_WOW_LIST_CARD":"CONTROL","SXP_KF_FILTERS_AC":"ON","SXP_SDS_INTEGRATION":"USE_FOR_ALL","SXP_MOBILE_MAP_PRIORITY":"CONTROL","SXP_MAP_DOT_STYLE":"CONTROL","Activation_GA_Metrics_Enabled":"ENABLED","SXP_Pers_SimilarResults":"CONTROL","RE_RentalHomeDetailsService":"CONTROL","ACQ_SigninSRP_Module":"Variant_Module_A","DXP_CONST_PROPCARD_MAPVIEW":"CONTROL","SXP_FLYBAR_PSL_ZGSEARCH":"ON","ADS_Tagless":"Casale_On","SEOTEST__NC_H1":"ALTERNATE","SXP_FLYBAR_REGION_API":"CONTROL","RMX_3RD_PARTY_P1":"ON","RUM_VIA_PRE_ENDPOINT":"TREATMENT_OFF","DXP_CONST_PROPCARD_LISTVIEW":"ON","SXP_EVENT_MARKUP":"CONTROL","SXP_SEARCH_DISPATCHER_SERVICE":"CONTROL","SXP_DISPLAY_AD_LOADING":"CONTROL","SP_ZO_HDP_PAGE":"CONTROL","DXP_DYNAMIC_ADS":"CONTROL","DXP_MAP_DOT_COLORS":"CONTROL","DESKTOP_COMMUTE_FILTER_MVP":"CONTROL","DXP_AUTH_GATED_COLLECTIONS":"ON","RE_FR_Photo_Carousel":"CONTROL","ADT_TOP_SLOT_SRP":"ONSITE_MESSAGING","DXP_HIDE_HOME":"CONTROL","VL_BDP_SSR_QUERY":"CONTROL_CACHED","VL_BDP_NEW_TAB":"CONTROL","SXP_PREXIT_CLAIMS_CHECK":"ENABLED","SEOTEST__SXP_SEO_TEST":"CONTROL","SXP_OPEN_HOUSE_FLEX":"OPEN_HOUSE_BOOSTED","SP_FOR_SALE_PAGE":"CONTROL","SXP_NO_SRPTOGGLE":"CONTROL","SXP_PAGINATION":"LEGACY_PAGINATION","SXP_KF_FILTERS_V2":"ON","SXP_MLS_NONMLS_FILTER":"CONTROL","DXP_MULTIPLE_COLLECTIONS":"ON","RE_GuidedSearchFiltersPOC":"CONTROL","SXP_3DHOME_FILTER":"ON","SP_BUILDING_PAGE":"CONTROL","SXP_QU_MIGRATION":"ON","ACQ_MOBILE_UPSELL_SXP":"CONTROL","SXP_LIST_ON
LY_SRP":"CONTROL","RE_JanusBrainSort":"TEST_ALL_STATES","SEOTEST__RE_ForRentForSaleSRPBreadcrumbs":"CONTROL","SXP_MAKE_ME_MOVE":"REMOVED","SHO_GA_ResultsTotalEvent":"ON","DXP_MAP_DOTS_WEB":"ON","SXP_MULTIREGION_SEARCH":"CONTROL","DXP_HERO_SHORTENING":"ON","SXP_VISUAL_AUDIT_2021":"ON","SXP_HDP_BLUE_TO_RED":"ON","HDP_DESKTOP_LAYOUT_TOPNAV":"CONTROL","SXP_HEADER_TAG_WRAPPER":"ON","DXP_HOMEPAGE_OMP":"ON","SXP_Multifamily_Filter":"MULTIFAMILY_SEPARATE","SXP_FOOTER":"RESPONSIVE_REACT","SP_PAID_BUILDER_PAGE":"VIA_SHOPPER_PLATFORM","SP_OFF_MARKET_PAGE":"VIA_SHOPPER_PLATFORM","SXP_KINGFISHER_FILTERS":"P1_PHASE_1","RE_SECOND_BOOST":"SLOT_4","DXP_TG_SCHOOLS_DISABLED":"CONTROL","ADT_PROGRESSIVE_MESSAGE":"CONTROL","SXP_Combined_Filter_Apartments_Condos":"TEST","DXP_HOME_RECS":"ON","SEOTEST__SXP_REACT_FOOTER_DESKTOP":"CONTROL","SXP_3DTOUR_MAP_DOT":"ON","DXP_SEE_MORE_RECS":"SCROLL_ON"},"cat1":{"searchResults":{"listResults":[{"zpid":"21328879","id":"21328879","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/3e81a218088316bafa7b199e8dc4923f-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/20412-Wayne-Ave-Torrance-CA-90503/21328879_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$1,275,000","unformattedPrice":1275000,"address":"20412 Wayne Ave, Torrance, CA 90503","addressStreet":"20412 Wayne Ave","addressCity":"Torrance","addressState":"CA","addressZipcode":"90503","isUndisclosedAddress":false,"beds":4,"baths":2.0,"area":1890,"latLong":{"latitude":33.845936,"longitude":-118.37223},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 
12-3pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21328879,"streetAddress":"20412 Wayne Ave","zipcode":"90503","city":"Torrance","state":"CA","latitude":33.845936,"longitude":-118.37223,"price":1275000.0,"bathrooms":2.0,"bedrooms":4.0,"livingArea":1890.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":1264181,"rentZestimate":4699,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 12-3pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1275000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673726400000,"open_house_end":1673737200000},{"open_house_start":1673816400000,"open_house_end":1673827200000},{"open_house_start":1673902800000,"open_house_end":1673913600000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":171633.0,"lotAreaValue":7172.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T12:00:00","openHouseEndDate":"2023-01-14T15:00:00","openHouseDescription":"Open House - 0:00 - 3:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":1264181,"shouldShowZestimateAsPrice":false,"has3DModel":true,"hasVideo":false,"isHomeRec":false,"brokerName":"Re/Max Estate Properties","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"2060330967","id":"2060330967","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/f65a190c100bab31301becdba3cdf7cc-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/23701-S-Western-Ave-SPACE-244-Torrance-CA-90501/2060330967_zpid/","statusType":"FOR_SALE","statusText":"Home for sale","countryCurrency":"$","price":"$88,000","unformattedPrice":88000,"address":"23701 S Western Ave SPACE 244, 
Torrance, CA 90501","addressStreet":"23701 S Western Ave SPACE 244","addressCity":"Torrance","addressState":"CA","addressZipcode":"90501","isUndisclosedAddress":false,"beds":3,"baths":2.0,"area":1000,"latLong":{"latitude":33.809013,"longitude":-118.311035},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 2-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":2060330967,"streetAddress":"23701 S Western Ave SPACE 244","zipcode":"90501","city":"Torrance","state":"CA","latitude":33.809013,"longitude":-118.311035,"price":88000.0,"datePriceChanged":1673078400000,"bathrooms":2.0,"bedrooms":3.0,"livingArea":1000.0,"homeType":"MANUFACTURED","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 2-4pm","priceReduction":"$2,000 (Jan 7)","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":88000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673733600000,"open_house_end":1673740800000},{"open_house_start":1673820000000,"open_house_end":1673827200000}]},"priceChange":-2000,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","unit":"Space 244","lotAreaValue":16.3319,"lotAreaUnit":"acres"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T14:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 2:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":null,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"eXp Realty of California, 
Inc.","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21338409","id":"21338409","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/1f90c7be6ceca4a76d64d904010f0cb7-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/25924-Matfield-Dr-Torrance-CA-90505/21338409_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$1,100,000","unformattedPrice":1100000,"address":"25924 Matfield Dr, Torrance, CA 90505","addressStreet":"25924 Matfield Dr","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":4,"baths":3.0,"area":1531,"latLong":{"latitude":33.78651,"longitude":-118.334206},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21338409,"streetAddress":"25924 Matfield Dr","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.78651,"longitude":-118.334206,"price":1100000.0,"bathrooms":3.0,"bedrooms":4.0,"livingArea":1531.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":1100006,"rentZestimate":3800,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 
1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1100000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000},{"open_house_start":1673816400000,"open_house_end":1673827200000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":436335.0,"lotAreaValue":7987.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":1100006,"shouldShowZestimateAsPrice":false,"has3DModel":true,"hasVideo":false,"isHomeRec":false,"brokerName":"Equity Union","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21337140","id":"21337140","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/cb80b5736be022cbeeafb3bb77ec0e83-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/4068-Newton-St-Torrance-CA-90505/21337140_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$999,900","unformattedPrice":999900,"address":"4068 Newton St, Torrance, CA 90505","addressStreet":"4068 Newton St","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":2,"baths":2.0,"area":1268,"latLong":{"latitude":33.803745,"longitude":-118.35658},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 
1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21337140,"streetAddress":"4068 Newton St","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.803745,"longitude":-118.35658,"price":999900.0,"bathrooms":2.0,"bedrooms":2.0,"livingArea":1268.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":999907,"rentZestimate":3999,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":999900.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000},{"open_house_start":1673816400000,"open_house_end":1673827200000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":521866.0,"lotAreaValue":5050.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":999907,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Compass","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"63093583","id":"63093583","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/8eec0c61013d143353c8893258ad1770-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/3120-Sepulveda-Blvd-UNIT-414-Torrance-CA-90505/63093583_zpid/","statusType":"FOR_SALE","statusText":"Condo for sale","countryCurrency":"$","price":"$389,000","unformattedPrice":389000,"address":"3120 Sepulveda Blvd UNIT 414, Torrance, CA 90505","addressStreet":"3120 Sepulveda Blvd UNIT 
414","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":1,"baths":1.0,"area":526,"latLong":{"latitude":33.823784,"longitude":-118.34189},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":63093583,"streetAddress":"3120 Sepulveda Blvd UNIT 414","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.823784,"longitude":-118.34189,"price":389000.0,"bathrooms":1.0,"bedrooms":1.0,"livingArea":526.0,"homeType":"CONDO","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":389001,"rentZestimate":2084,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":389000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":360000.0,"unit":"Unit 414","lotAreaValue":1.0985,"lotAreaUnit":"acres"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":389001,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Re/Max Estate 
Properties","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21324245","id":"21324245","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/4c47d6593842fdf0036f1805838c1673-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/417-Paseo-De-La-Playa-Redondo-Beach-CA-90277/21324245_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$19,995,000","unformattedPrice":19995000,"address":"417 Paseo De La Playa, Redondo Beach, CA 90277","addressStreet":"417 Paseo De La Playa","addressCity":"Redondo Beach","addressState":"CA","addressZipcode":"90277","isUndisclosedAddress":false,"beds":10,"baths":15.0,"area":15728,"latLong":{"latitude":33.810413,"longitude":-118.39131},"isZillowOwned":false,"variableData":{"type":"PRICE_REDUCTION","text":"$3,000,000 (Nov 10)"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21324245,"streetAddress":"417 Paseo De La Playa","zipcode":"90277","city":"Redondo Beach","state":"CA","latitude":33.810413,"longitude":-118.39131,"price":1.9995E7,"datePriceChanged":1668067200000,"bathrooms":15.0,"bedrooms":10.0,"livingArea":15728.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":17760922,"rentZestimate":83834,"listing_sub_type":{"is_FSBA":true},"priceReduction":"$3,000,000 (Nov 10)","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1.9995E7,"priceChange":-3000000,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":1.2904489E7,"lotAreaValue":1.4407,"lotAreaUnit":"acres"}},"isSaved":false,"isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale 
(Broker)","zestimate":17760922,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Douglas Elliman of California, Inc.","info6String":"Joshua Altman DRE # 01764587","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21272955","id":"21272955","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/0d9ac33c6a2d1c8683ec45e7aef895a5-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/3101-Plaza-Del-Amo-UNIT-5-Torrance-CA-90503/21272955_zpid/","statusType":"FOR_SALE","statusText":"Townhouse for sale","countryCurrency":"$","price":"$835,000","unformattedPrice":835000,"address":"3101 Plaza Del Amo UNIT 5, Torrance, CA 90503","addressStreet":"3101 Plaza Del Amo UNIT 5","addressCity":"Torrance","addressState":"CA","addressZipcode":"90503","isUndisclosedAddress":false,"beds":3,"baths":3.0,"area":1446,"latLong":{"latitude":33.828743,"longitude":-118.34023},"isZillowOwned":false,"variableData":{"type":"DAYS_ON","text":"3 days on Zillow"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21272955,"streetAddress":"3101 Plaza Del Amo UNIT 5","zipcode":"90503","city":"Torrance","state":"CA","latitude":33.828743,"longitude":-118.34023,"price":835000.0,"bathrooms":3.0,"bedrooms":3.0,"livingArea":1446.0,"homeType":"TOWNHOUSE","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":852600,"rentZestimate":3499,"listing_sub_type":{"is_FSBA":true},"isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":835000.0,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":788066.0,"unit":"Unit 5","lotAreaValue":5.4866,"lotAreaUnit":"acres"}},"isSaved":false,"isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale 
(Broker)","zestimate":852600,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"RELO REDAC, Inc.","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},
From what I can tell, this data lives inside a <script> tag in the BeautifulSoup object. I didn't include all of the data (there's a lot), but here's an excerpt of the last tag I can find before the region I'd like to extract.
<script data-zrr-shared-data-key="mobileSearchPageStore" type="application/json"><!--{"queryState":{"mapBounds":{"north":33.887061,"south":33.780217,"east":-118.308127,"west":-118.394107},"regionSelection":[{"regionId":54722,"regionType":6}],"isMapVisible":true,"filterState":{"sortSelection":{"value":"globalrelevanceex"},"isAllHomes":{"value":true}}},"filterDefinitions":{"keywords":{"id":"keywords","shortId":"att","labels":{"default":"Keywords","tracking":"Keyword"},"sortOrder":2,"type":"String","defaultValue":{"value":""},"exposedPillEnabled":true},"isPublicSchool":{"id":"isPublicSchool","shortId":"schp","labels":{"default":"Public"},"type":"Boolean","defaultValue":{"value":true}},"isCityView":{"id":"isCityView","shortId":"cityv","labels":
Is this JSON data? Is there a way to extract just this part of the data?
You can try to parse the data with the json module:
import json
from bs4 import BeautifulSoup
html_doc = '''\
<script data-zrr-shared-data-key="mobileSearchPageStore" type="application/json"><!--{"currentLink":"/torrance-ca-90510/","regionId":96168}--></script>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# get the right tag
data = soup.select_one('script[data-zrr-shared-data-key="mobileSearchPageStore"]')
# get the contents of this tag, strip the html comments
data = data.contents[0].strip('><!-')
# parse the data
data = json.loads(data)
# print the data
print(data)
Prints:
{'currentLink': '/torrance-ca-90510/', 'regionId': 96168}
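
Once the JSON is parsed, the individual listings still have to be located inside a deeply nested structure. Here is a minimal sketch of one way to do that, assuming each listing is a dict that carries a "zpid" key, as in the excerpt from the question. The helper name iter_listings and the sample payload (including its "results" wrapper key) are hypothetical; the real nesting on the page may differ.

```python
import json


def iter_listings(node):
    """Recursively yield every dict that has a "zpid" key."""
    if isinstance(node, dict):
        if "zpid" in node:
            yield node
            return  # do not descend into a listing's own nested data
        for value in node.values():
            yield from iter_listings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_listings(item)


# Hypothetical sample shaped like the excerpt in the question
sample = json.loads('''
{"results": [
  {"zpid": "21324245", "price": "$19,995,000",
   "address": "417 Paseo De La Playa, Redondo Beach, CA 90277",
   "hdpData": {"homeInfo": {"zpid": 21324245}}},
  {"zpid": "21272955", "price": "$835,000",
   "address": "3101 Plaza Del Amo UNIT 5, Torrance, CA 90503"}
]}
''')

listings = list(iter_listings(sample))
for item in listings:
    print(item["zpid"], item["price"], item["address"])
```

Walking the whole structure avoids hard-coding a path to the results list, which is handy when you don't know (or don't trust) the exact key names above the listings.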

Beautiful Soup 4 issue when fetching multiple elements. It is confusing me

When I fetch a single element, it works fine, as shown in the code below. But when I try to find all elements with the same tagging (for example, {'class': 'doctor-name'}), the output is None.
Single tag output
from bs4 import BeautifulSoup
s = """
<a class="doctor-name" itemprop="name" href="/doctors/gastroenterologists/dr-isaac-raijman-md-1689679557">Dr. Isaac Raijman, MD</a>
"""
soup = BeautifulSoup(s, 'html.parser')
print(soup.find('a', {'class': 'doctor-name'}).text)
print(soup.find('a', {'itemprop': 'name'}).text)
Output -
[Dr. Isaac Raijman, MD,
Dr. Isaac Raijman, MD]
Finding all elements with the same tagging, but the output is None:
import requests, bs4
from bs4 import BeautifulSoup
url = "https://soandso.org/doctors/gastroenterologists"
page = requests.get(url)
page
page.status_code
page.content
soup = BeautifulSoup(page.content, 'html.parser')
soup
print(soup.prettify())
lists = soup.find_all('section', attrs={'class': 'search-page find-a-doctor'})
for list in lists:
    doctor = list.find('a', attrs={'class': 'doctor-name'})#.text
    info = [doctor]
    print(info)
Output - none
Please help me solve this issue. An explanation in the form of code with # comments would also be fine.
That information is built up by the browser and is not returned in the HTML. An easier approach is to request it from the JSON API as follows:
import requests
headers = {'Authorization' : 'eyJhbGciOiJodHRwOi8vd3d3LnczLm9yZy8yMDAxLzA0L3htbGRzaWctbW9yZSNobWFjLXNoYTI1NiIsInR5cCI6IkpXVCJ9.eyJodHRwOi8vc2NoZW1hcy54bWxzb2FwLm9yZy93cy8yMDA1LzA1L2lkZW50aXR5L2NsYWltcy9uYW1lIjoiYWRtaW4iLCJleHAiOjIxMjcwNDQ1MTcsImlzcyI6Imh0dHBzOi8vZGV2ZWxvcGVyLmhlYWx0aHBvc3QuY29tIiwiYXVkIjoiaHR0cHM6Ly9kZXZlbG9wZXIuaGVhbHRocG9zdC5jb20ifQ.zNvR3WpI17CCMC7rIrHQCrnJg_6qGM21BvTP_ed_Hj8'}
json_post = {"query":"","start":0,"rows":10,"selectedFilters":{"availability":[],"clinicalInterest":[],"distance":[20],"gender":["Both"],"hasOnlineScheduling":False,"insurance":[],"isMHMG":False,"language":[],"locationType":[],"lonlat":[-95.36,29.76],"onlineScheduling":["Any"],"specialty":["Gastroenterology"]}}
req = requests.post("https://api.memorialhermann.org/api/doctorsearch", json=json_post, headers=headers)
data = req.json()
for doctor in data['docs']:
    print(f"{doctor['Name']:30} {doctor['PrimarySpecialty']:20} {doctor['PrimaryFacility']}")
Giving you:
Dr. Isaac Raijman, MD          Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Gabriel Lee, MD            Gastroenterology     Memorial Hermann Southeast Hospital
Dr. Dang Nguyen, MD            Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Harshinie Amaratunge, MD   Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Tanima Jana, MD            Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Tugrul Purnak, MD          Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Dimpal Bhakta, MD          Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Dharmendra Verma, MD       Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Jennifer Shroff, MD        Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Brooks Cash, MD            Gastroenterology     Memorial Hermann Texas Medical Center
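
The request body above contains "start" and "rows", which look like paging parameters. Assuming the API treats them as an offset and a page size (the source does not confirm this), a paging loop could be sketched as below. fetch_all_docs and fake_fetch_page are hypothetical names; the fake stands in for the real POST so the sketch runs offline.

```python
def fetch_all_docs(fetch_page, rows=10, max_pages=100):
    """Collect results page by page until an empty page comes back."""
    docs = []
    for page in range(max_pages):
        batch = fetch_page(start=page * rows, rows=rows)
        if not batch:
            break
        docs.extend(batch)
    return docs


# Stand-in for the real request so the sketch runs offline.
# A real fetch_page would copy json_post, set "start" and "rows",
# POST it, and return req.json()['docs'].
FAKE_DOCS = [{"Name": f"Doctor {i}"} for i in range(25)]


def fake_fetch_page(start, rows):
    return FAKE_DOCS[start:start + rows]


all_docs = fetch_all_docs(fake_fetch_page, rows=10)
print(len(all_docs))
```

The max_pages cap is only a safety net so that a server which never returns an empty page cannot keep the loop running forever.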

How to scrape information from tables selecting each of the Dropdown options using Selenium and Python?

I'm trying to help someone who works for a nonprofit by pulling info from the STL County Boards/Commissions website (https://boards.stlouisco.com/).
I'm having trouble for a few reasons:
I was going to use BeautifulSoup, but the actual data isn't shown until you choose a Board/Commission from a dropdown above, so I have switched to Selenium, which I am new to.
Is this task possible? When I look at the HTML for the site, I see that the info isn't stored in the page but is pulled from another location and displayed based on the option chosen from the dropdown menu:
function ShowMemberList(selectedBoard) {
    ClearMeetingsAndMembers();
    var htmlString = "";
    var boardsList = [{"id":407,"name":"Aging Ahead","isActive":true,"description":"... ...1.","totalSeats":14}];
    var totalMembers = boardsList[$("select[name='BoardsList'] option:selected").index() - 1].totalSeats;
    $.get("/api/boards/" + selectedBoard + "/members", function (data) {
        if (data.length > 0) {
            htmlString += "<table id=\"MemberTable\" class=\"table table-hover\">";
            htmlString += "<thead><th>Member Name</th><th>Title</th><th>Position</th><th>Expiration Date</th></thead><tbody>";
            for (var i = 0; i < totalMembers; i++) {
                if (i < data.length) {
                    htmlString += "<tr><td>" + FormatString(data[i].firstName) + " " + FormatString(data[i].lastName) + "</td><td>" + FormatString(data[i].title) + "</td><td>" + FormatString(data[i].position) + "</td><td>" + FormatString(data[i].expirationDate) + "</td></tr>";
                } else {
                    htmlString += "<tr><td colspan=\"4\">---Vacant Seat---</td></tr>";
                }
            }
            htmlString += "</tbody></table>";
        } else {
            htmlString = "<span id=\"MemberTable\">There was no data found for this board.</span>";
        }
        $("#Results").append(htmlString);
    });
}
So far, I have this (not a lot), which goes to the page and selects every board from the list:
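from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait as wait
driver = webdriver.Chrome()
driver.get("https://boards.stlouisco.com/")
select = Select(wait(driver, 10).until(EC.presence_of_element_located((By.ID, 'BoardsList'))))
options = select.options
for board in options:
    select.select_by_visible_text(board.text)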
From here I would like to be able to scrape the info from the MemberTable but I don't know how to move forward/if it is something in the scope of my abilities, or even if it is something possible with Selenium.
I've tried using find_by a few different elements to click on the members table but am met with errors. I have also tried calling for the memberstable after my select, but it is not able to find that element. Any tips/pointers/advice is appreciated!
You can use this script to save all members from all boards to a CSV file:
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://boards.stlouisco.com/'
members_url = 'https://boards.stlouisco.com/api/boards/{}/members'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for o in soup.select('#BoardsList option[value]'):
    print(o['value'], o.text)
    data = requests.get(members_url.format(o['value'])).json()
    for d in data:
        all_data.append(dict(board=o.text, **d))
df = pd.DataFrame(all_data)
print(df)
df.to_csv('data.csv')
Prints:
board boardMemberId memberId boardName ... lastName title position expirationDate
0 Aging Ahead 39003 27007 None ... Anderson None ST. LOUIS COUNTY EXECUTIVE APPOINTEE 10/1/2020
1 Aging Ahead 38963 27797 None ... Bauers None St. Charles County Community Action Agency App... None
2 Aging Ahead 39004 27815 None ... Berkowitz None ST. LOUIS COUNTY EXECUTIVE APPOINTEE 10/1/2020
3 Aging Ahead 38964 27798 None ... Biehle None Jefferson County Community Action Corp. Appointee None
4 Aging Ahead 38581 27597 None ... Bowers None Franklin County Commission Appointee None
.. ... ... ... ... ... ... ... ... ...
725 Zoo-Museum District - Zoological Park Subdistr... 38863 26745 None ... Seat (Robert R. Hermann, Jr.) St. Louis County 12/31/2019
726 Zoo-Museum District - Zoological Park Subdistr... 38864 26745 None ... Seat (Winthrop Reed) St. Louis County 12/31/2016
727 Zoo-Museum District - Zoological Park Subdistr... 38669 26745 None ... Seat (Lawrence Thomas) St. Louis County 12/31/2018
728 Zoo-Museum District - Zoological Park Subdistr... 38670 26745 None ... Seat (Peggy Ritter ) Advisory Commissioner Non-Voting St. Louis County 12/31/2019
729 Zoo-Museum District - Zoological Park Subdistr... 38394 27512 None ... Wilson Advisory Commissioner Non-Voting City of St. Louis None
[730 rows x 9 columns]
And saves data.csv with all boards/members.
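
Note that the API only returns seated members, while the page script in the question pads the table with "---Vacant Seat---" rows up to totalSeats. If you want the CSV to mirror what the page shows, a rough Python version of that padding logic could look like this. member_rows and the sample values are hypothetical, and the page's FormatString helper is not reproduced because its behavior isn't shown.

```python
def member_rows(members, total_seats):
    """Return one (name, title, position, expiration) row per seat."""
    if not members:
        # The page shows a "no data found" message instead of a table here
        return []
    rows = []
    for i in range(total_seats):
        if i < len(members):
            m = members[i]
            name = f"{m.get('firstName') or ''} {m.get('lastName') or ''}".strip()
            rows.append((name, m.get("title"), m.get("position"), m.get("expirationDate")))
        else:
            rows.append(("---Vacant Seat---", None, None, None))
    return rows


# Hypothetical member records using the field names from the page script
sample_members = [
    {"firstName": "Jane", "lastName": "Doe", "title": None,
     "position": "Appointee", "expirationDate": "10/1/2020"},
    {"firstName": "John", "lastName": "Roe", "title": None,
     "position": "Appointee", "expirationDate": None},
]

rows = member_rows(sample_members, total_seats=3)
for row in rows:
    print(row)
```

The totalSeats value for each board would have to come from the boardsList array embedded in the page script, which this sketch does not parse.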
To choose each Board / Commission from the html-select dropdown and scrape the page, wait for the dropdown with WebDriverWait and element_to_be_clickable(), then click through its options:
Code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://boards.stlouisco.com/")
select = Select(WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, 'BoardsList'))))
for option in select.options:
    option.click()
    print("Scrapping :"+option.text)
Console Output:
Scrapping :---Choose a Board---
Scrapping :Aging Ahead
Scrapping :Aging Ahead Advisory Council
Scrapping :Air Pollution & Noise Control Appeal Board
Scrapping :Animal Care & Control Advisory Board
Scrapping :Bi-State Development Agency (Metro)
Scrapping :Board Of Examiners For Mechanical Licensing
Scrapping :Board of Freeholders
Scrapping :Boundary Commission
Scrapping :Building Code Review Committee
Scrapping :Building Commission & Board Of Building Appeals
Scrapping :Business Advisory Council
Scrapping :Center for Educational Media
Scrapping :Civil Service Commission
Scrapping :Commission On Disabilities
Scrapping :County Health Advisory Board
Scrapping :Domestic And Family Violence Council
Scrapping :East-West Gateway Council of Governments Board of Directors
Scrapping :Economic Development Collaborative Advisory Board
Scrapping :Economic Rescue Team
Scrapping :Electrical Code Review Committee
Scrapping :Electrical Examiners, Board Of
Scrapping :Emergency Communications System Commission
Scrapping :Equalization, Board Of
Scrapping :Fire Standards Commission
Scrapping :Friends of the Kathy J. Weinman Shelter for Battered Women, Inc.
Scrapping :Fund Investment Advisory Committee
Scrapping :Historic Building Commission
Scrapping :Housing Authority
Scrapping :Housing Resources Commission
Scrapping :Human Relations Commission
Scrapping :Industrial Development Authority Board
Scrapping :Justice Services Advisory Board
Scrapping :Lambert Airport Eastern Perimeter Joint Development Commission
Scrapping :Land Clearance For Redevelopment Authority
Scrapping :Lemay Community Improvement District
Scrapping :Library Board
Scrapping :Local Emergency Planning Committee
Scrapping :Mechanical Code Review Committee
Scrapping :Metropolitan Park And Recreation District Board Of Directors (Great Rivers Greenway)
Scrapping :Metropolitan St. Louis Sewer District
Scrapping :Metropolitan Taxicab Commission
Scrapping :Metropolitan Zoological Park and Museum District Board
Scrapping :Municipal Court Judges
Scrapping :Older Adult Commission
Scrapping :Parks And Recreation Advisory Board
Scrapping :Planning Commission
Scrapping :Plumbing Code Review Committee
Scrapping :Plumbing Examiners, Board Of
Scrapping :Police Commissioners, Board Of
Scrapping :Port Authority Board Of Commissioners
Scrapping :Private Security Advisory Committee
Scrapping :Productive Living Board
Scrapping :Public Transportation Commission of St. Louis County
Scrapping :Regional Arts Commission
Scrapping :Regional Convention & Sports Complex Authority
Scrapping :Regional Convention & Visitors Commission
Scrapping :REJIS Commission
Scrapping :Restaurant Commission
Scrapping :Retirement Board Of Trustees
Scrapping :St. Louis Airport Commission
Scrapping :St. Louis County Children's Service Fund Board
Scrapping :St. Louis County Clean Energy Development Board (PACE)
Scrapping :St. Louis County Workforce Development Board
Scrapping :St. Louis Economic Development Partnership
Scrapping :St. Louis Regional Health Commission
Scrapping :St. Louis-Jefferson Solid Waste Management District
Scrapping :Tax Increment Financing Commission of St. Louis County
Scrapping :Transportation Board
Scrapping :Waste Management Commission
Scrapping :World Trade Center - St. Louis
Scrapping :Zoning Adjustment, Board of
Scrapping :Zoo-Museum District - Art Museum Subdistrict Board of Commissioners
Scrapping :Zoo-Museum District - Botanical Garden Subdistrict Board of Commissioners
Scrapping :Zoo-Museum District - Missouri History Museum Subdistrict Board of Commissioners
Scrapping :Zoo-Museum District - St. Louis Science Center Subdistrict Board of Commissioners
Scrapping :Zoo-Museum District - Zoological Park Subdistrict Board of Commissioners
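
The loop above only selects each option; it does not read the member table yet. One possible next step is to grab the table's HTML after each click (for example with driver.find_element(By.ID, "MemberTable").get_attribute("outerHTML"), once the table has loaded) and turn it into rows. The sketch below does the parsing part with the standard library's HTMLParser rather than BeautifulSoup, just to stay dependency-free. TableParser is a hypothetical helper, and the sample HTML is hand-written to match the markup built by the page script in the question.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            if self._row is None:  # header cells without an enclosing <tr>
                self._row = []
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag in ("tr", "thead") and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None


# Hand-written sample following the markup from the page script
sample_html = (
    '<table id="MemberTable" class="table table-hover">'
    '<thead><th>Member Name</th><th>Title</th><th>Position</th>'
    '<th>Expiration Date</th></thead><tbody>'
    '<tr><td>Jane Doe</td><td></td><td>Appointee</td><td>10/1/2020</td></tr>'
    '<tr><td colspan="4">---Vacant Seat---</td></tr>'
    '</tbody></table>'
)

parser = TableParser()
parser.feed(sample_html)
for row in parser.rows:
    print(row)
```

The parser accepts the header cells with or without a surrounding <tr>, since a browser may normalize the markup before Selenium reads it back.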
References
You can find a couple of relevant discussions in:
Message: Element could not be scrolled into view while trying to click on an option within a dropdown menu through Selenium
How to open the option items of a select tag (dropdown) in different tabs/windows?

Scraping content with python and selenium

I would like to extract all the league names (e.g. England Premier League, Scotland Premiership, etc.) from this website https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1
Taking the inspector tools from Chrome/Firefox I can see that they are located here:
<span>England Premier League</span>
So I tried this
from lxml import html
from selenium import webdriver
session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)
tree = html.fromstring(session.page_source)
leagues = tree.xpath('//span/text()')
print(leagues)
Unfortunately this doesn't return the desired results :-(
To me it looks like the website has different frames and I'm extracting the content from the wrong frame.
Could anyone please help me out here or point me in the right direction? As an alternative if someone knows how to extract the information through their api then this would obviously be the superior solution.
Any help is much appreciated. Thank you!
Hope you are looking for something like this:
from selenium import webdriver
import bs4, time
driver = webdriver.Chrome()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
driver.get(url)
driver.maximize_window()
# sleep so that the JavaScript has time to populate the data
time.sleep(10)
pSource= driver.page_source
soup = bs4.BeautifulSoup(pSource, "html.parser")
for data in soup.find_all('div', {'class': 'eventWrapper'}):
    for res in data.find_all('span'):
        print(res.text)
It will print the below data:
Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League
Wednesday's Matches
International List
Elite Euro List
UK List
Australia List
Club Friendly List
England Premier League
England EFL Cup
England Championship
England League 1
England League 2
England National League
England National League North
England National League South
Scotland Premiership
Scotland League Cup
Scotland Championship
Scotland League One
Scotland League Two
Northern Ireland Reserve League
Scotland Development League East
Wales Premier League
Wales Cymru Alliance
Asia - World Cup Qualifying
UEFA Champions League
UEFA Europa League
The only problem is that it prints the result set twice.
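
One way to work around the repeated output, without knowing why the page yields the list twice, is to collect the names into a list instead of printing them directly and then drop repeats while keeping the original order. A minimal sketch with a hand-written sample list (in the loop above you would append res.text to names):

```python
# Hand-written sample standing in for the collected res.text values
names = [
    "England Premier League",
    "Scotland Premiership",
    "England Premier League",
    "Scotland Premiership",
    "UEFA Champions League",
]

# dict keys keep insertion order, so this removes repeats
# while preserving the order of first appearance
unique_names = list(dict.fromkeys(names))
print(unique_names)
```

A set would also remove the repeats, but it would not keep the leagues in page order.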
The required content is absent from the initial page source. It is loaded dynamically from https://mobile.bet365.com/V6/sport/splash/splash.aspx?zone=0&isocode=RO&tzi=4&key=1&gn=0&cid=1&lng=1&ctg=1&ct=156&clt=8881&ot=2
To get this content, you can use an explicit wait as below:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
session = webdriver.Firefox()
url = 'https://mobile.bet365.com/#type=Splash;key=1;ip=0;lng=1'
session.get(url)
WebDriverWait(session, 10).until(EC.presence_of_element_located((By.ID, 'Splash')))
for collapsed in session.find_elements(By.XPATH, '//h3[contains(@class, "collapsed")]'):
    collapsed.location_once_scrolled_into_view
    collapsed.click()
for event in session.find_elements(By.XPATH, '//div[contains(@class, "eventWrapper")]//span'):
    print(event.text)
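
To illustrate why an explicit wait is usually preferable to a fixed time.sleep(10), here is a plain-Python sketch of the polling idea behind it: check a condition repeatedly and return as soon as it holds, or give up after a timeout. This is not Selenium's implementation; wait_until and content_ready are hypothetical names, and the sketch raises the built-in TimeoutError where Selenium raises its own TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Simulated "page" whose content shows up on the third check
state = {"calls": 0}


def content_ready():
    state["calls"] += 1
    return "Splash loaded" if state["calls"] >= 3 else None


result = wait_until(content_ready, timeout=2.0, poll=0.01)
print(result, state["calls"])
```

With a fixed sleep you always pay the full delay and still fail if the page is slower than expected; with polling you continue as soon as the content is there.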
