Beautiful Soup 4: confusing issue when fetching multiple elements - Python

When I fetch a single element, it works fine, as shown in the code below. But whenever I try to find all elements that share the same attributes (for example, {'class': 'doctor-name'}), the output is None.
Single tag output:
from bs4 import BeautifulSoup
s = """
<a class="doctor-name" itemprop="name" href="/doctors/gastroenterologists/dr-isaac-raijman-md-1689679557">Dr. Isaac Raijman, MD</a>
"""
soup = BeautifulSoup(s, 'html.parser')
print(soup.find('a', {'class': 'doctor-name'}).text)
print(soup.find('a', {'itemprop': 'name'}).text)
Output -
Dr. Isaac Raijman, MD
Dr. Isaac Raijman, MD
Finding all elements with the same attributes, but the output is None:
import requests, bs4
from bs4 import BeautifulSoup
url = "https://soandso.org/doctors/gastroenterologists"
page = requests.get(url)
page
page.status_code
page.content
soup = BeautifulSoup(page.content, 'html.parser')
soup
print(soup.prettify())
lists = soup.find_all('section', attrs={'class': 'search-page find-a-doctor'})
for list in lists:
    doctor = list.find('a', attrs={'class': 'doctor-name'})#.text
    info = [doctor]
    print(info)
Output - none
Please help me solve this issue. An explanation in the form of code with # comments would also be fine.

That information is built up by the browser after the page loads, so it is not in the HTML returned to requests. An easier approach is to request it directly from the JSON API, as follows:
import requests
headers = {'Authorization' : 'eyJhbGciOiJodHRwOi8vd3d3LnczLm9yZy8yMDAxLzA0L3htbGRzaWctbW9yZSNobWFjLXNoYTI1NiIsInR5cCI6IkpXVCJ9.eyJodHRwOi8vc2NoZW1hcy54bWxzb2FwLm9yZy93cy8yMDA1LzA1L2lkZW50aXR5L2NsYWltcy9uYW1lIjoiYWRtaW4iLCJleHAiOjIxMjcwNDQ1MTcsImlzcyI6Imh0dHBzOi8vZGV2ZWxvcGVyLmhlYWx0aHBvc3QuY29tIiwiYXVkIjoiaHR0cHM6Ly9kZXZlbG9wZXIuaGVhbHRocG9zdC5jb20ifQ.zNvR3WpI17CCMC7rIrHQCrnJg_6qGM21BvTP_ed_Hj8'}
json_post = {"query":"","start":0,"rows":10,"selectedFilters":{"availability":[],"clinicalInterest":[],"distance":[20],"gender":["Both"],"hasOnlineScheduling":False,"insurance":[],"isMHMG":False,"language":[],"locationType":[],"lonlat":[-95.36,29.76],"onlineScheduling":["Any"],"specialty":["Gastroenterology"]}}
req = requests.post("https://api.memorialhermann.org/api/doctorsearch", json=json_post, headers=headers)
data = req.json()
for doctor in data['docs']:
    print(f"{doctor['Name']:30} {doctor['PrimarySpecialty']:20} {doctor['PrimaryFacility']}")
Giving you:
Dr. Isaac Raijman, MD          Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Gabriel Lee, MD            Gastroenterology     Memorial Hermann Southeast Hospital
Dr. Dang Nguyen, MD            Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Harshinie Amaratunge, MD   Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Tanima Jana, MD            Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Tugrul Purnak, MD          Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Dimpal Bhakta, MD          Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Dharmendra Verma, MD       Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Jennifer Shroff, MD        Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Brooks Cash, MD            Gastroenterology     Memorial Hermann Texas Medical Center
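To see why the original find_all approach prints nothing, it can help to run the same lookup against two hard-coded HTML strings: one that contains the anchors and one that only has the empty container a JavaScript-driven page might send. This is a minimal sketch; the helper name doctor_names and both sample strings are made up for illustration and are not taken from the real site.

```python
from bs4 import BeautifulSoup

def doctor_names(html: str) -> list[str]:
    """Return the text of every <a class="doctor-name"> found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a", class_="doctor-name")]

# Hypothetical sample: the anchors are present in the markup itself.
static_html = """
<section class="search-page find-a-doctor">
  <a class="doctor-name" href="/doctors/1">Dr. Isaac Raijman, MD</a>
  <a class="doctor-name" href="/doctors/2">Dr. Gabriel Lee, MD</a>
</section>
"""

# Hypothetical sample: only an empty shell that a script would fill in later.
js_shell_html = '<section class="search-page find-a-doctor"><div id="app"></div></section>'

print(doctor_names(static_html))    # both names
print(doctor_names(js_shell_html))  # empty list: nothing to find yet
```

If the second case matches what requests receives, no selector will help, and calling the API (or driving a real browser) is the way forward.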

Related

Python - Scraping and classifying text in "fonts"

I would like to scrape the content of this website https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283 and create a table with the columns NAME, TITLE, LOCATION. I know some individuals have more or less "lines", but I am just trying to understand how I could even classify the first 3 lines for each person given that the text is in between "fonts" for all.
So far I have:
from bs4 import BeautifulSoup
from selenium import webdriver
url="https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283"
driver = webdriver.Chrome() # assumed; the original snippet does not show how driver was created
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("font")
But once I am there and I have all the text within "font" in my "column" variable, I don't know how to proceed to differentiate between each person and build a loop where I would retrieve name, title, location etc. for each.
Any help would be highly appreciated!
Note: instead of using selenium, I simply fetched and parsed with soup = BeautifulSoup(requests.get(url).content, "html.parser"); as far as I can tell, the required section is not dynamic, so it shouldn't cause any issues.
"Would you have any idea how to look for pairs of <br>?"
Since they represent empty lines, you could try simply splitting the text in that cell by \n\n\n:
blockText = soup.select_one('td:has(font)').get_text(' ')
blockText = blockText.replace('-'*10, '\n\n\n') # pad "underlined" lines
blockSections = [sect.strip() for sect in '\n'.join([
    l.strip('-').strip() for l in blockText.splitlines()
]).split('\n\n\n') if sect.strip()]
Although, if you looked at blockSections, you might notice that some headers [ROSTER and MEMBERS] get stuck to the end of the previous section - probably because their formatting means that an extra <br> is not needed to distinguish them from their adjacent sections. [I added the .replace('-'*10, '\n\n\n') line so that at least they're separated from the next section.]
Another risk is that I don't know if all versions and parsers will turn <br><br> into 3 line breaks in the text - some omit the <br> whitespace entirely, and others might add extra space based on spaces between tags in the source HTML.
It's easier to split if you loop through the <br>s and pad them with something more distinctive to split by; the .insert... methods are useful here. (This method also has the advantage of being able to target bolded lines as well.)
blockSoup = soup.select_one('td:has(font)')
for br2 in blockSoup.select('br+br, font:has(br)'):
    # pad each matched tag with a distinctive separator on both sides
    br2.insert_after(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
    br2.insert_before(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
blockSections = [
    sect.strip().strip('-').strip() for sect in
    blockSoup.get_text(' ').split("="*80) if sect.strip()
]
This time, blockSections looks something like:
['Membership Roster - ACE\n AIDS CLINICAL STUDIES AND EPIDEMIOLOGY STUDY SECTION\n Center For Scientific Review\n (Terms end 6/30 of the designated year)\n ROSTER',
'CHAIRPERSON',
'SCHACKER, TIMOTHY\n W\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF MINNESOTA\n MINNEAPOLIS,\n MN\n 55455',
'MEMBERS',
'ANDERSON, JEAN\n R\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF GYNECOLOGY AND OBSTETRICS\n JOHNS HOPKINS UNIVERSITY\n BALTIMORE,\n MD 21287',
'BALASUBRAMANYAM, ASHOK\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE AND\n MOLECULAR AND CELLULAR BIOLOGY\n DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM\n BAYLOR COLLEGE OF MEDICINE\n HOUSTON,\n TX 77030',
'BLATTNER, WILLIAM\n ALBERT\n , MD,\n (15)\n PROFESSOR AND ASSOCIATE DIRECTOR\n DEPARTMENT OF MEDICNE\n INSTITUTE OF HUMAN VIROLOGY\n UNIVERSITY OF MARYLAND, BALTIMORE\n BALTIMORE,\n MD 21201',
'CHEN, YING\n QING\n , PHD,\n (15)\n PROFESSOR\n PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS\n FRED HUTCHINSON CANCER RESEARCH CENTER\n SEATTLE,\n WA 981091024',
'COTTON, DEBORAH\n , MD,\n (13)\n PROFESSOR\n SECTION OF INFECTIOUS DISEASES\n DEPARTMENT OF MEDICINE\n BOSTON UNIVERSITY\n BOSTON,\n MA 02118',
'DANIELS, MICHAEL\n J\n , SCD,\n (16)\n PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF TEXAS AT AUSTIN\n AUSTIN,\n TX 78712',
'FOULKES, ANDREA\n SARAH\n , SCD,\n (14)\n ASSOCIATE PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF MASSACHUSETTS\n AMHERST,\n MA 01003',
'HEROLD, BETSY\n C\n , MD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n ALBERT EINSTEIN COLLEGE OF MEDICINE\n BRONX,\n NY 10461',
'JUSTICE, AMY\n CAROLINE\n , MD, PHD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n YALE UNIVERSITY\n NEW HAVEN,\n CT 06520',
'KATZENSTEIN, DAVID\n ALLENBERG\n , MD,\n (13)\n PROFESSOR\n DIVISION OF INFECTIOUS DISEASES\n STANFORD UNIVERSITY SCHOOL OF MEDICINE\n STANFORD,\n CA 94305',
'MARGOLIS, DAVID\n M\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL\n CHAPEL HILL,\n NC 27599',
'MONTANER, LUIS\n J\n , DVM, PHD,\n (13)\n PROFESSOR\n DEPARTMENT OF IMMUNOLOGY\n THE WISTAR INSTITUTE\n PHILADELPHIA,\n PA 19104',
'MONTANO, MONTY\n A\n , PHD,\n (15)\n RESEARCH SCIENTIST\n DEPARTMENT OF IMMUNOLOGY AND\n INFECTIOUS DISEASES\n BOSTON UNIVERSITY\n BOSTON,\n MA 02115',
'PAGE, KIMBERLY\n , PHD, MPH,\n (16)\n PROFESSOR\n DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH\n AND GLOBAL HEALTH SCIENCES\n DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n SAN FRANCISCO,\n CA 94105',
'SHIKUMA, CECILIA\n M\n , MD,\n (15)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n HAWAII AIDS CLINICAL RESEARCH PROGRAM\n UNIVERSITY OF HAWAII\n HONOLULU,\n HI 96816',
'WOOD, CHARLES\n , PHD,\n (13)\n PROFESSOR\n UNIVERSITY OF NEBRASKA\n LINCOLN,\n NE 68588']
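If only NAME, TITLE and LOCATION are needed, each person string in blockSections could be split further with plain string handling. The sketch below is a rough heuristic, not part of the original answer: it assumes every person entry contains a "(14)"-style term marker, that the title is the piece right after it, and that the location starts at the last piece ending in a comma. The helper name split_person is hypothetical, and the two sample strings are copied from the output above.

```python
import re

def split_person(section: str) -> dict:
    """Split one person string into name, title and location (heuristic)."""
    pieces = [p.strip() for p in section.splitlines() if p.strip()]
    # Assumption: a piece like "(14)" separates the name from the title.
    term_idx = next(
        (i for i, p in enumerate(pieces) if re.fullmatch(r"\(\d+\)", p)), None
    )
    if term_idx is None:
        raise ValueError("no term marker such as '(14)' found")
    name = " ".join(pieces[:term_idx]).rstrip(",")
    title = pieces[term_idx + 1]
    rest = pieces[term_idx + 2:]
    # Assumption: the city is the last piece that ends with a comma.
    comma_idxs = [i for i, p in enumerate(rest) if p.endswith(",")]
    location = " ".join(rest[comma_idxs[-1]:]) if comma_idxs else ""
    return {"name": name, "title": title, "location": location}

wood = ('WOOD, CHARLES\n , PHD,\n (13)\n PROFESSOR\n UNIVERSITY OF NEBRASKA\n'
        ' LINCOLN,\n NE 68588')
schacker = ('SCHACKER, TIMOTHY\n W\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n'
            ' UNIVERSITY OF MINNESOTA\n MINNEAPOLIS,\n MN\n 55455')

print(split_person(wood))
print(split_person(schacker))
```

Header entries such as 'CHAIRPERSON' and 'MEMBERS' have no term marker, so they raise ValueError here and would need to be skipped or handled separately.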
Regarding "create a table with the columns NAME, TITLE, LOCATION":
There may be a more elegant solution, but I think the simplest way is to loop through the siblings of the headers and keep count of consecutive <br>s.
doubleBr = soup.select('br')[:2] # [ so the last person also gets added ]
personsList = []
for f in soup.select('td>font>font:has(b br)'):
    role, lCur,pCur,brCt = f.get_text(' ').strip('-').strip(), [],[],0
    for lf in f.find_next_siblings(['font','br'])+doubleBr:
        brCt = brCt+1 if lf.name == 'br' else 0
        if pCur and (brCt>1 or lf.b):
            pDets = {'role': role, 'name': '?'} # initiate
            if len(pCur)>1: pDets['title'] = pCur[1]
            pDets['name'], pCur = pCur[0], pCur[2:]
            dList = pCur[:-2]
            pDets['departments'] = dList[0] if len(dList)==1 else dList
            if len(pCur)>1: pDets['institute'] = pCur[-2]
            if pCur: pDets['location'] = pCur[-1]
            personsList.append(pDets)
            pCur, lCur, brCt = [], [], 0 # clear
        if lf.b: break # reached next section
        if lf.name == 'font': # [split and join to minimize whitespace]
            lCur.append(' '.join(lf.get_text(' ').split())) # add to line
        if brCt and lCur: pCur, lCur = pCur+[' '.join(lCur)], [] # newline
Since personsList is a list of dictionaries, it can be tabulated as simply as pandas.DataFrame(personsList) to get a DataFrame that looks like:
role | name | title | departments | institute | location
--- | --- | --- | --- | --- | ---
CHAIRPERSON | SCHACKER, TIMOTHY W , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF MINNESOTA | MINNEAPOLIS, MN 55455
MEMBERS | ANDERSON, JEAN R , MD | PROFESSOR | DEPARTMENT OF GYNECOLOGY AND OBSTETRICS | JOHNS HOPKINS UNIVERSITY | BALTIMORE, MD 21287
MEMBERS | BALASUBRAMANYAM, ASHOK , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE AND', 'MOLECULAR AND CELLULAR BIOLOGY', 'DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM'] | BAYLOR COLLEGE OF MEDICINE | HOUSTON, TX 77030
MEMBERS | BLATTNER, WILLIAM ALBERT , MD | PROFESSOR AND ASSOCIATE DIRECTOR | ['DEPARTMENT OF MEDICNE', 'INSTITUTE OF HUMAN VIROLOGY'] | UNIVERSITY OF MARYLAND, BALTIMORE | BALTIMORE, MD 21201
MEMBERS | CHEN, YING QING , PHD | PROFESSOR | PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS | FRED HUTCHINSON CANCER RESEARCH CENTER | SEATTLE, WA 981091024
MEMBERS | COTTON, DEBORAH , MD | PROFESSOR | ['SECTION OF INFECTIOUS DISEASES', 'DEPARTMENT OF MEDICINE'] | BOSTON UNIVERSITY | BOSTON, MA 02118
MEMBERS | DANIELS, MICHAEL J , SCD | PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF TEXAS AT AUSTIN | AUSTIN, TX 78712
MEMBERS | FOULKES, ANDREA SARAH , SCD | ASSOCIATE PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF MASSACHUSETTS | AMHERST, MA 01003
MEMBERS | HEROLD, BETSY C , MD | PROFESSOR | DEPARTMENT OF PEDIATRICS | ALBERT EINSTEIN COLLEGE OF MEDICINE | BRONX, NY 10461
MEMBERS | JUSTICE, AMY CAROLINE , MD, PHD | PROFESSOR | DEPARTMENT OF PEDIATRICS | YALE UNIVERSITY | NEW HAVEN, CT 06520
MEMBERS | KATZENSTEIN, DAVID ALLENBERG , MD | PROFESSOR | DIVISION OF INFECTIOUS DISEASES | STANFORD UNIVERSITY SCHOOL OF MEDICINE | STANFORD, CA 94305
MEMBERS | MARGOLIS, DAVID M , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL | CHAPEL HILL, NC 27599
MEMBERS | MONTANER, LUIS J , DVM, PHD | PROFESSOR | DEPARTMENT OF IMMUNOLOGY | THE WISTAR INSTITUTE | PHILADELPHIA, PA 19104
MEMBERS | MONTANO, MONTY A , PHD | RESEARCH SCIENTIST | ['DEPARTMENT OF IMMUNOLOGY AND', 'INFECTIOUS DISEASES'] | BOSTON UNIVERSITY | BOSTON, MA 02115
MEMBERS | PAGE, KIMBERLY , PHD, MPH | PROFESSOR | ['DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH', 'AND GLOBAL HEALTH SCIENCES', 'DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS', 'UNIVERSITY OF CALIFORNIA, SAN FRANCISCO'] | UNIVERSITY OF CALIFORNIA, SAN FRANCISCO | SAN FRANCISCO, CA 94105
MEMBERS | SHIKUMA, CECILIA M , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE', 'HAWAII AIDS CLINICAL RESEARCH PROGRAM'] | UNIVERSITY OF HAWAII | HONOLULU, HI 96816
MEMBERS | WOOD, CHARLES , PHD | PROFESSOR | [] | UNIVERSITY OF NEBRASKA | LINCOLN, NE 68588
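Note that the departments column mixes plain strings, lists and empty lists, which can be awkward when exporting. One possible way to normalize it before tabulating is sketched below using only the standard library; the helper name normalise_departments and the "; " separator are arbitrary choices for illustration, and the three sample rows are copied from the table above.

```python
import csv
import io

def normalise_departments(person: dict) -> dict:
    """Return a copy of person whose 'departments' value is always a string."""
    depts = person.get("departments", [])
    if isinstance(depts, str):
        depts = [depts]
    return {**person, "departments": "; ".join(depts)}

# Sample rows standing in for personsList.
persons_list = [
    {"role": "CHAIRPERSON", "name": "SCHACKER, TIMOTHY W , MD", "title": "PROFESSOR",
     "departments": "DEPARTMENT OF MEDICINE", "institute": "UNIVERSITY OF MINNESOTA",
     "location": "MINNEAPOLIS, MN 55455"},
    {"role": "MEMBERS", "name": "COTTON, DEBORAH , MD", "title": "PROFESSOR",
     "departments": ["SECTION OF INFECTIOUS DISEASES", "DEPARTMENT OF MEDICINE"],
     "institute": "BOSTON UNIVERSITY", "location": "BOSTON, MA 02118"},
    {"role": "MEMBERS", "name": "WOOD, CHARLES , PHD", "title": "PROFESSOR",
     "departments": [], "institute": "UNIVERSITY OF NEBRASKA",
     "location": "LINCOLN, NE 68588"},
]

rows = [normalise_departments(p) for p in persons_list]

# Write to an in-memory buffer; swap in open("roster.csv", "w", newline="") for a file.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["role", "name", "title", "departments", "institute", "location"]
)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The same normalized rows can also be passed to pandas.DataFrame if a DataFrame is preferred over CSV.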
[ Btw, if the .select('br+br, font:has(br)') and .select('td>font>font:has(b br)') parts are unfamiliar to you, you can look up .select and CSS selectors. Combinators [like >/+/,] and pseudo-classes [like :has] allow us to get very specific with our targets. ]
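As a small illustration of those selectors, the sketch below runs them against a tiny hand-written fragment. The markup is invented for the example and only loosely imitates the roster page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one bold header, two lines, then a blank line (<br/><br/>).
html = """
<table><tr><td>
<font><b>MEMBERS</b></font><br/>
<font>WOOD, CHARLES</font><br/>
<font>PROFESSOR</font><br/><br/>
<font>NEXT PERSON</font>
</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# "+" (adjacent sibling): a <br> whose previous sibling element is also a <br>.
second_brs = soup.select("br + br")

# ">" (child) with ":has": <font> tags directly inside <td> that contain a <b>.
headers = soup.select("td > font:has(b)")

# "," (selector list): anything matching either selector.
either = soup.select("br + br, td > font:has(b)")

print(len(second_brs))                   # number of "second <br> of a pair" tags
print([f.get_text() for f in headers])   # header text
print(len(either))
```

Sibling combinators only look at element nodes, so the whitespace between tags in the fragment does not stop br + br from matching.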

How can I parse non-tagged data from beautiful soup output?

I am trying to parse the following data extracted with Beautiful Soup.
{"currentLink":"/torrance-ca-90510/","regionId":96168,"displayRegionName":"90510"}],"universities":[]},"showAttributeLinks":null}},"mapState":{"customRegionPolygonWkt":null,"schoolPolygonWkt":null,"isCurrentLocationSearch":false,"userPosition":{"lat":null,"lon":null}},"regionState":{"regionInfo":[{"regionType":6,"regionId":54722,"regionName":"Torrance","displayName":"Torrance CA","isPointRegion":false}],"regionBounds":{"north":33.887061,"east":-118.308127,"south":33.780217,"west":-118.394107}},"searchPageSeoObject":{"baseUrl":"/torrance-ca/","windowTitle":"Torrance CA Real Estate - Torrance CA Homes For Sale | Zillow","metaDescription":"Zillow has 100 homes for sale in Torrance CA. View listing photos, review sales history, and use our detailed real estate filters to find the perfect place."},"abTrials":{"SXP_HDP_CONTINGENT_V2":"ON","SXP_REGION_AUTOCOMPLETE_SOURCE":"TRULIA","RE_Move_In_Date_Filter":"TEST","SXP_SENTRY":"ON","SEOTEST__SXP_LIST_ONLY_SRP":"CONTROL","SXP_SAVE_SEARCH_COLOR":"CONTROL","SXP_REACT_FOOTER_DESKTOP":"CONTROL","RE_Web_PersonalizedSort":"CONTROL","ACQ_Search_Filters_Upsell":"CONTROL","VARIANTS_BDP_768_PLUS":"CONTROL","SXP_FLOATING_ACTION_BAR":"ON","SET_CTA_COUNTLESS":"CONTROL","SXP_LISTING_SUBTYPE":"CONTROL","SXP_Search_Refinement_Filters":"CONTROL","RE_SearchByBuildingName":"TEST","SXP_PREXIT_CLAIMS_INFO":"ENABLED","SXP_NONMLS_OFF":"CONTROL","SXP_PARTIAL_PAGE_LOAD_REFACTOR":"CONTROL","SXP_NAV_AD_LOADING":"CONTROL","ACT_FILTER_ON_LAND":"CONTROL","SXP_PHOTO_CAROUSEL":"CONTROL","SEO__SXP_REMOVE_ANCHOR_TEXT":"CONTROL","DXP_NEW_MAP_DOTS_WEB":"CONTROL","SXP_ACT_REMOVE_SEARCHBOX_GLEAM":"NO_GLEAM","SXP_Exclude_Referer":"TEST","SXP_PAGE_LOAD":"FASTER","RE_Rentals_Badging_v1":"CONTROL","SXP_REACT_GPT":"REMOVED","SXP_QU_PHASE_2":"ON","RE_HDP_REDIRECT":"CONTROL","RE_Search_Refinement_Filters":"CONTROL","MIGHTY_MONTH_2022_HOLDOUT":"MIGHTY_MONTH_ON","SXP_NEW_LANE_CLICKSTREAM":"ON","SXP_REDUCED_SERVER_SIDE_RENDER":"CONTROL","SXP_DelayJS":"AFTER_LOAD","ACQ_Bann
er_Suppression":"CONTROL","GS_RATING_CLEANUP":"CONTROL","SXP_DEFERRED_RENDERER":"ASYNC_INITIAL_HYDRATE","SXP_STREETVIEW_REQUEST_TYPE":"CONTROL","RE_RentalsHomesForYouSort":"CONTROL","SP_FOR_RENT_PAGE":"CONTROL","Activation_NewLane_Metrics_Enabled":"DISABLED","SEOTEST__SXP_REMOVE_WHY_ZILLOW":"REMOVE_WHY_ZILLOW","Activation_Enabled":"ENABLED","DXP_RTB_LINKING":"ON","DXP_PHOTO_CAROUSEL":"CONTROL","DXP_MAP_ICONS":"CONTROL","WEB_HIDDEN_HOMES_2022":"ON","SXP_Rentals_Apartment_Community_Filter":"TEST","DXP_HOMEPAGE_OMP_CLIENT_REFACTOR":"ON","SXP_WOW_LIST_CARD":"CONTROL","SXP_KF_FILTERS_AC":"ON","SXP_SDS_INTEGRATION":"USE_FOR_ALL","SXP_MOBILE_MAP_PRIORITY":"CONTROL","SXP_MAP_DOT_STYLE":"CONTROL","Activation_GA_Metrics_Enabled":"ENABLED","SXP_Pers_SimilarResults":"CONTROL","RE_RentalHomeDetailsService":"CONTROL","ACQ_SigninSRP_Module":"Variant_Module_A","DXP_CONST_PROPCARD_MAPVIEW":"CONTROL","SXP_FLYBAR_PSL_ZGSEARCH":"ON","ADS_Tagless":"Casale_On","SEOTEST__NC_H1":"ALTERNATE","SXP_FLYBAR_REGION_API":"CONTROL","RMX_3RD_PARTY_P1":"ON","RUM_VIA_PRE_ENDPOINT":"TREATMENT_OFF","DXP_CONST_PROPCARD_LISTVIEW":"ON","SXP_EVENT_MARKUP":"CONTROL","SXP_SEARCH_DISPATCHER_SERVICE":"CONTROL","SXP_DISPLAY_AD_LOADING":"CONTROL","SP_ZO_HDP_PAGE":"CONTROL","DXP_DYNAMIC_ADS":"CONTROL","DXP_MAP_DOT_COLORS":"CONTROL","DESKTOP_COMMUTE_FILTER_MVP":"CONTROL","DXP_AUTH_GATED_COLLECTIONS":"ON","RE_FR_Photo_Carousel":"CONTROL","ADT_TOP_SLOT_SRP":"ONSITE_MESSAGING","DXP_HIDE_HOME":"CONTROL","VL_BDP_SSR_QUERY":"CONTROL_CACHED","VL_BDP_NEW_TAB":"CONTROL","SXP_PREXIT_CLAIMS_CHECK":"ENABLED","SEOTEST__SXP_SEO_TEST":"CONTROL","SXP_OPEN_HOUSE_FLEX":"OPEN_HOUSE_BOOSTED","SP_FOR_SALE_PAGE":"CONTROL","SXP_NO_SRPTOGGLE":"CONTROL","SXP_PAGINATION":"LEGACY_PAGINATION","SXP_KF_FILTERS_V2":"ON","SXP_MLS_NONMLS_FILTER":"CONTROL","DXP_MULTIPLE_COLLECTIONS":"ON","RE_GuidedSearchFiltersPOC":"CONTROL","SXP_3DHOME_FILTER":"ON","SP_BUILDING_PAGE":"CONTROL","SXP_QU_MIGRATION":"ON","ACQ_MOBILE_UPSELL_SXP":"CONTROL","SXP_LIST_ON
LY_SRP":"CONTROL","RE_JanusBrainSort":"TEST_ALL_STATES","SEOTEST__RE_ForRentForSaleSRPBreadcrumbs":"CONTROL","SXP_MAKE_ME_MOVE":"REMOVED","SHO_GA_ResultsTotalEvent":"ON","DXP_MAP_DOTS_WEB":"ON","SXP_MULTIREGION_SEARCH":"CONTROL","DXP_HERO_SHORTENING":"ON","SXP_VISUAL_AUDIT_2021":"ON","SXP_HDP_BLUE_TO_RED":"ON","HDP_DESKTOP_LAYOUT_TOPNAV":"CONTROL","SXP_HEADER_TAG_WRAPPER":"ON","DXP_HOMEPAGE_OMP":"ON","SXP_Multifamily_Filter":"MULTIFAMILY_SEPARATE","SXP_FOOTER":"RESPONSIVE_REACT","SP_PAID_BUILDER_PAGE":"VIA_SHOPPER_PLATFORM","SP_OFF_MARKET_PAGE":"VIA_SHOPPER_PLATFORM","SXP_KINGFISHER_FILTERS":"P1_PHASE_1","RE_SECOND_BOOST":"SLOT_4","DXP_TG_SCHOOLS_DISABLED":"CONTROL","ADT_PROGRESSIVE_MESSAGE":"CONTROL","SXP_Combined_Filter_Apartments_Condos":"TEST","DXP_HOME_RECS":"ON","SEOTEST__SXP_REACT_FOOTER_DESKTOP":"CONTROL","SXP_3DTOUR_MAP_DOT":"ON","DXP_SEE_MORE_RECS":"SCROLL_ON"},"cat1":{"searchResults":{"listResults":[{"zpid":"21328879","id":"21328879","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/3e81a218088316bafa7b199e8dc4923f-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/20412-Wayne-Ave-Torrance-CA-90503/21328879_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$1,275,000","unformattedPrice":1275000,"address":"20412 Wayne Ave, Torrance, CA 90503","addressStreet":"20412 Wayne Ave","addressCity":"Torrance","addressState":"CA","addressZipcode":"90503","isUndisclosedAddress":false,"beds":4,"baths":2.0,"area":1890,"latLong":{"latitude":33.845936,"longitude":-118.37223},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 
12-3pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21328879,"streetAddress":"20412 Wayne Ave","zipcode":"90503","city":"Torrance","state":"CA","latitude":33.845936,"longitude":-118.37223,"price":1275000.0,"bathrooms":2.0,"bedrooms":4.0,"livingArea":1890.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":1264181,"rentZestimate":4699,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 12-3pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1275000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673726400000,"open_house_end":1673737200000},{"open_house_start":1673816400000,"open_house_end":1673827200000},{"open_house_start":1673902800000,"open_house_end":1673913600000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":171633.0,"lotAreaValue":7172.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T12:00:00","openHouseEndDate":"2023-01-14T15:00:00","openHouseDescription":"Open House - 0:00 - 3:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":1264181,"shouldShowZestimateAsPrice":false,"has3DModel":true,"hasVideo":false,"isHomeRec":false,"brokerName":"Re/Max Estate Properties","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"2060330967","id":"2060330967","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/f65a190c100bab31301becdba3cdf7cc-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/23701-S-Western-Ave-SPACE-244-Torrance-CA-90501/2060330967_zpid/","statusType":"FOR_SALE","statusText":"Home for sale","countryCurrency":"$","price":"$88,000","unformattedPrice":88000,"address":"23701 S Western Ave SPACE 244, 
Torrance, CA 90501","addressStreet":"23701 S Western Ave SPACE 244","addressCity":"Torrance","addressState":"CA","addressZipcode":"90501","isUndisclosedAddress":false,"beds":3,"baths":2.0,"area":1000,"latLong":{"latitude":33.809013,"longitude":-118.311035},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 2-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":2060330967,"streetAddress":"23701 S Western Ave SPACE 244","zipcode":"90501","city":"Torrance","state":"CA","latitude":33.809013,"longitude":-118.311035,"price":88000.0,"datePriceChanged":1673078400000,"bathrooms":2.0,"bedrooms":3.0,"livingArea":1000.0,"homeType":"MANUFACTURED","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 2-4pm","priceReduction":"$2,000 (Jan 7)","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":88000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673733600000,"open_house_end":1673740800000},{"open_house_start":1673820000000,"open_house_end":1673827200000}]},"priceChange":-2000,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","unit":"Space 244","lotAreaValue":16.3319,"lotAreaUnit":"acres"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T14:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 2:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":null,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"eXp Realty of California, 
Inc.","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21338409","id":"21338409","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/1f90c7be6ceca4a76d64d904010f0cb7-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/25924-Matfield-Dr-Torrance-CA-90505/21338409_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$1,100,000","unformattedPrice":1100000,"address":"25924 Matfield Dr, Torrance, CA 90505","addressStreet":"25924 Matfield Dr","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":4,"baths":3.0,"area":1531,"latLong":{"latitude":33.78651,"longitude":-118.334206},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21338409,"streetAddress":"25924 Matfield Dr","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.78651,"longitude":-118.334206,"price":1100000.0,"bathrooms":3.0,"bedrooms":4.0,"livingArea":1531.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":1100006,"rentZestimate":3800,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 
1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1100000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000},{"open_house_start":1673816400000,"open_house_end":1673827200000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":436335.0,"lotAreaValue":7987.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":1100006,"shouldShowZestimateAsPrice":false,"has3DModel":true,"hasVideo":false,"isHomeRec":false,"brokerName":"Equity Union","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21337140","id":"21337140","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/cb80b5736be022cbeeafb3bb77ec0e83-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/4068-Newton-St-Torrance-CA-90505/21337140_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$999,900","unformattedPrice":999900,"address":"4068 Newton St, Torrance, CA 90505","addressStreet":"4068 Newton St","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":2,"baths":2.0,"area":1268,"latLong":{"latitude":33.803745,"longitude":-118.35658},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 
1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21337140,"streetAddress":"4068 Newton St","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.803745,"longitude":-118.35658,"price":999900.0,"bathrooms":2.0,"bedrooms":2.0,"livingArea":1268.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":999907,"rentZestimate":3999,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":999900.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000},{"open_house_start":1673816400000,"open_house_end":1673827200000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":521866.0,"lotAreaValue":5050.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":999907,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Compass","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"63093583","id":"63093583","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/8eec0c61013d143353c8893258ad1770-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/3120-Sepulveda-Blvd-UNIT-414-Torrance-CA-90505/63093583_zpid/","statusType":"FOR_SALE","statusText":"Condo for sale","countryCurrency":"$","price":"$389,000","unformattedPrice":389000,"address":"3120 Sepulveda Blvd UNIT 414, Torrance, CA 90505","addressStreet":"3120 Sepulveda Blvd UNIT 
414","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":1,"baths":1.0,"area":526,"latLong":{"latitude":33.823784,"longitude":-118.34189},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":63093583,"streetAddress":"3120 Sepulveda Blvd UNIT 414","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.823784,"longitude":-118.34189,"price":389000.0,"bathrooms":1.0,"bedrooms":1.0,"livingArea":526.0,"homeType":"CONDO","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":389001,"rentZestimate":2084,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":389000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":360000.0,"unit":"Unit 414","lotAreaValue":1.0985,"lotAreaUnit":"acres"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":389001,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Re/Max Estate 
Properties","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21324245","id":"21324245","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/4c47d6593842fdf0036f1805838c1673-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/417-Paseo-De-La-Playa-Redondo-Beach-CA-90277/21324245_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$19,995,000","unformattedPrice":19995000,"address":"417 Paseo De La Playa, Redondo Beach, CA 90277","addressStreet":"417 Paseo De La Playa","addressCity":"Redondo Beach","addressState":"CA","addressZipcode":"90277","isUndisclosedAddress":false,"beds":10,"baths":15.0,"area":15728,"latLong":{"latitude":33.810413,"longitude":-118.39131},"isZillowOwned":false,"variableData":{"type":"PRICE_REDUCTION","text":"$3,000,000 (Nov 10)"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21324245,"streetAddress":"417 Paseo De La Playa","zipcode":"90277","city":"Redondo Beach","state":"CA","latitude":33.810413,"longitude":-118.39131,"price":1.9995E7,"datePriceChanged":1668067200000,"bathrooms":15.0,"bedrooms":10.0,"livingArea":15728.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":17760922,"rentZestimate":83834,"listing_sub_type":{"is_FSBA":true},"priceReduction":"$3,000,000 (Nov 10)","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1.9995E7,"priceChange":-3000000,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":1.2904489E7,"lotAreaValue":1.4407,"lotAreaUnit":"acres"}},"isSaved":false,"isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale 
(Broker)","zestimate":17760922,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Douglas Elliman of California, Inc.","info6String":"Joshua Altman DRE # 01764587","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21272955","id":"21272955","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/0d9ac33c6a2d1c8683ec45e7aef895a5-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/3101-Plaza-Del-Amo-UNIT-5-Torrance-CA-90503/21272955_zpid/","statusType":"FOR_SALE","statusText":"Townhouse for sale","countryCurrency":"$","price":"$835,000","unformattedPrice":835000,"address":"3101 Plaza Del Amo UNIT 5, Torrance, CA 90503","addressStreet":"3101 Plaza Del Amo UNIT 5","addressCity":"Torrance","addressState":"CA","addressZipcode":"90503","isUndisclosedAddress":false,"beds":3,"baths":3.0,"area":1446,"latLong":{"latitude":33.828743,"longitude":-118.34023},"isZillowOwned":false,"variableData":{"type":"DAYS_ON","text":"3 days on Zillow"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21272955,"streetAddress":"3101 Plaza Del Amo UNIT 5","zipcode":"90503","city":"Torrance","state":"CA","latitude":33.828743,"longitude":-118.34023,"price":835000.0,"bathrooms":3.0,"bedrooms":3.0,"livingArea":1446.0,"homeType":"TOWNHOUSE","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":852600,"rentZestimate":3499,"listing_sub_type":{"is_FSBA":true},"isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":835000.0,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":788066.0,"unit":"Unit 5","lotAreaValue":5.4866,"lotAreaUnit":"acres"}},"isSaved":false,"isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale 
(Broker)","zestimate":852600,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"RELO REDAC, Inc.","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},
From what I can tell, this data lives inside a <script> tag in the BeautifulSoup object. I haven't included all of the data (there's a lot), but here's an excerpt of the last tag I can find before the region I'd like to extract.
<script data-zrr-shared-data-key="mobileSearchPageStore" type="application/json"><!--{"queryState":{"mapBounds":{"north":33.887061,"south":33.780217,"east":-118.308127,"west":-118.394107},"regionSelection":[{"regionId":54722,"regionType":6}],"isMapVisible":true,"filterState":{"sortSelection":{"value":"globalrelevanceex"},"isAllHomes":{"value":true}}},"filterDefinitions":{"keywords":{"id":"keywords","shortId":"att","labels":{"default":"Keywords","tracking":"Keyword"},"sortOrder":2,"type":"String","defaultValue":{"value":""},"exposedPillEnabled":true},"isPublicSchool":{"id":"isPublicSchool","shortId":"schp","labels":{"default":"Public"},"type":"Boolean","defaultValue":{"value":true}},"isCityView":{"id":"isCityView","shortId":"cityv","labels":
Is this JSON data? Is there a way to extract just this part of the data?
You can try to parse the data with the json module:
import json
from bs4 import BeautifulSoup
html_doc = '''\
<script data-zrr-shared-data-key="mobileSearchPageStore" type="application/json"><!--{"currentLink":"/torrance-ca-90510/","regionId":96168}--></script>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# get the right tag
data = soup.select_one('script[data-zrr-shared-data-key="mobileSearchPageStore"]')
# get the contents of this tag, strip the html comments
data = data.contents[0].strip('><!-')
# parse the data
data = json.loads(data)
# print the data
print(data)
Prints:
{'currentLink': '/torrance-ca-90510/', 'regionId': 96168}
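One caveat with the snippet above: str.strip('><!-') removes any of those characters from both ends, which is safe here only because the JSON starts with { and ends with }. Below is a minimal sketch of a stricter variant that uses only the standard library. The helper names unwrap_commented_json and find_listings, the "results" key, and the sample string are made up for illustration; the "zpid" and "price" keys are taken from the excerpt in the question.

```python
import json


def unwrap_commented_json(text):
    """Remove an optional <!-- ... --> wrapper and parse the rest as JSON."""
    text = text.strip()
    if text.startswith("<!--") and text.endswith("-->"):
        text = text[len("<!--"):-len("-->")]
    return json.loads(text)


def find_listings(obj, key="zpid"):
    """Recursively yield every dict that contains `key`."""
    if isinstance(obj, dict):
        if key in obj:
            # Matched: do not descend further, because in the excerpt each
            # listing also nests a homeInfo dict that repeats the same key.
            yield obj
        else:
            for value in obj.values():
                yield from find_listings(value, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_listings(item, key)


# Tiny made-up stand-in for the contents of the <script> tag.
raw = '<!--{"results": [{"zpid": "21324245", "price": "$19,995,000"}, {"zpid": "21272955", "price": "$835,000"}]}-->'

data = unwrap_commented_json(raw)
listings = [(d["zpid"], d["price"]) for d in find_listings(data)]
print(listings)
```

With the real page you would pass the tag's contents (for example data.contents[0] from the answer above) to unwrap_commented_json instead of the sample string.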

How do I return the list of names using webscraping that I'm looking for?

Very new to webscraping and trying to do a project for myself where I scrape the list of names from the MLB Top 100 Prospects site here: https://www.mlb.com/prospects/top100/
Currently my code looks like the following after I load in the HTML code (although I've used a variety of different techniques):
from bs4 import BeautifulSoup
import requests
# Parse the html content
soup = BeautifulSoup(html, "lxml")
# Find all name tags:
prospects = soup.find_all("div.prospect-heashot__name")
# Iterate through all name tags
for prospect in prospects:
    # Get text from each tag
    print(prospect.text)
Final result should look something like:
Francisco Alvarez
Gunnar Henderson
Corbin Carroll
Grayson Rodriguez
Anthony Volpe
etc
Any help would be greatly appreciated!
This was a fun problem :) The data is stored inside the page in JSON form. You can parse it with the json module and then search the nested dict for the relevant data (I used recursion for the task):
import re
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.mlb.com/prospects/top100/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("[data-init-state]")["data-init-state"])
pat1 = re.compile(r"Player:\d+$")
pat2 = re.compile(r"getProspectRankings.*\)\.\d+$")
def get_data(o, pat):
    if isinstance(o, dict):
        for k, v in o.items():
            if pat.search(k):
                yield k, v
            else:
                yield from get_data(v, pat)
    elif isinstance(o, list):
        for v in o:
            yield from get_data(v, pat)
players = {}
for k, v in get_data(data, pat1):
    players[k] = v["useName"], v["boxscoreName"]
rankings = []
for k, v in get_data(data, pat2):
    rankings.append((v["rank"], players[v["player"]["id"]]))
for rank, (name, surname) in sorted(rankings):
    print("{:>03}. {:<15} {:<15}".format(rank, name, surname))
Prints:
001. Francisco Álvarez, F
002. Gunnar Henderson
003. Corbin Carroll
004. Grayson Rodriguez, G
005. Anthony Volpe
006. Jordan Walker
007. Marcelo Mayer
008. Diego Cartaya
009. Eury Pérez
010. Jackson Chourio
011. Druw Jones
012. Jordan Lawlar
013. Jackson Holliday
014. Elly De La Cruz
015. Daniel Espino
016. Marco Luciano
017. Noelvi Marte, N
018. Brett Baty
019. Henry Davis
020. Taj Bradley
021. Kyle Harrison
022. Robert Hassell III
023. Zac Veen
024. Andrew Painter
025. Triston Casas
026. Bobby Miller
027. Ezequiel Tovar
028. Elijah Green
029. Termarr Johnson
030. Pete Crow-Armstrong
031. George Valera
032. Brooks Lee
033. Ricky Tiedemann
034. James Wood
035. Curtis Mead
036. Josh Jung
037. Kevin Parada
038. Jackson Jobe
039. Jasson Domínguez
040. Colton Cowser
041. Miguel Vargas, M
042. Michael Busch
043. Max Meyer
044. Quinn Priester
045. Jack Leiter
046. Sal Frelick
047. Tyler Soderstrom
048. Brennen Davis, B
049. Jacob Berry
050. Oswald Peraza
051. Masyn Winn
052. Edwin Arroyo
053. Gavin Williams
054. Mick Abel
055. Cade Cavalli
056. Evan Carter
057. Colson Montgomery
058. Royce Lewis
059. Owen White
060. Cam Collier
061. Adael Amador
062. Liover Peguero
063. Drew Romo
064. Logan O'Hoppe
065. Harry Ford
066. Andy Pages
067. Ken Waldichuk
068. Hunter Brown, H
069. Brayan Rocchio
070. Orelvis Martinez
071. Jace Jung
072. Gavin Cross
073. Matt McLain
074. Ryan Pepiot
075. Bo Naylor, B
076. Jordan Westburg
077. Gavin Stone
078. Justin Foscue
079. Gordon Graceffo
080. Matthew Liberatore
081. Carson Williams
082. Austin Wells
083. Jackson Merrill
084. Joey Wiemer
085. Alex Ramirez
086. Kevin Alcantara
087. DL Hall, DL
088. Alec Burleson
089. Brock Porter
090. Brandon Pfaadt
091. Tink Hence
092. Emmanuel Rodriguez, Em
093. Nick Gonzales, N
094. Zack Gelof
095. Oscar Colas
096. Ceddanne Rafaela
097. Endy Rodriguez, E
098. Dylan Lesko
099. Tanner Bibee
100. Wilmer Flores
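To see how the recursive generator in this answer behaves, here is a small sketch that runs the same traversal on a hand-made nested structure. The toy keys and values only imitate the shape the two regular expressions look for; they are assumptions for illustration, not the real contents of data-init-state.

```python
import re


def get_data(o, pat):
    """Yield (key, value) for every dict key matching pat, at any depth."""
    if isinstance(o, dict):
        for k, v in o.items():
            if pat.search(k):
                yield k, v  # matched: report it and do not descend further
            else:
                yield from get_data(v, pat)
    elif isinstance(o, list):
        for v in o:
            yield from get_data(v, pat)


# Hand-made toy data (hypothetical shape, not the real page state).
toy = {
    "cache": {
        "Player:1": {"useName": "Ann", "boxscoreName": "Lee"},
        "Player:2": {"useName": "Bob", "boxscoreName": "Ray"},
        "query": [
            {"getProspectRankings(x).0": {"rank": 2, "player": {"id": "Player:1"}}},
            {"getProspectRankings(x).1": {"rank": 1, "player": {"id": "Player:2"}}},
        ],
    }
}

pat_player = re.compile(r"Player:\d+$")
pat_rank = re.compile(r"getProspectRankings.*\)\.\d+$")

# First pass collects the players, second pass joins rankings to them.
players = {k: (v["useName"], v["boxscoreName"]) for k, v in get_data(toy, pat_player)}
rankings = sorted((v["rank"], players[v["player"]["id"]]) for k, v in get_data(toy, pat_rank))
print(rankings)
```

The two-pass design matters because the ranking entries only hold a reference to a player id, so the player lookup table has to exist before the rankings are resolved.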

How to scrape hidden class data using selenium and beautiful soup

I'm trying to scrape content from a JavaScript-enabled web page. I need to extract the data in the table on that website. However, each row of the table has a button (arrow) that reveals additional information about that row.
I need to extract that additional description for each row. On inspection, the contents behind the arrow in each row all belong to the same class. However, that class does not appear in the page source; it can only be seen while inspecting. The data I'm trying to parse is from the webpage.
I have used Selenium and Beautiful Soup. I'm able to scrape the table data but not the content behind the arrows. My Python code returns an empty list for the arrow's class, but works for the classes of the normal table data.
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
html_source = browser.page_source
soup = BeautifulSoup(html_source,'html.parser')
data = soup.find_all('div',class_="sc-fzoLsD jxXBhc rdt_ExpanderRow")
print(data.text)
To print hidden data, you can use this example:
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://projects.sfchronicle.com/2020/layoff-tracker/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data_url = 'https://projects.sfchronicle.com' + soup.select_one('link[href*="commons-"]')['href']
data = re.findall(r'n\.exports=JSON\.parse\(\'(.*?)\'\)', requests.get(data_url).text)[1]
data = json.loads(data.replace(r"\'", "'"))
# uncomment this to see all data:
# print(json.dumps(data, indent=4))
for d in data[4:]:
    print('{:<50}{:<10}{:<30}{:<30}{:<30}{:<30}{:<30}'.format(*d.values()))
Prints:
Company Layoffs City County Month Industry Company description
Tesla (Temporary layoffs. Factory reopened) 11083 Fremont Alameda County April Industrial Car maker
Bon Appetit Management Co. 3015 San Francisco San Francisco County April Food Food supplier
GSW Arena LLC-Chase Center 1720 San Francisco San Francisco County May Sports Arena vendors
YMCA of Silicon Valley 1657 Santa Clara Santa Clara County May Sports Gym
Nutanix Inc. (Temporary furlough of 2 weeks) 1434 San Jose Santa Clara County April Tech Cloud computing
TeamSanJose 1304 San Jose Santa Clara County April Travel Tourism bureau
San Francisco Giants 1200 San Francisco San Francisco County April Sports Stadium vendors
Lyft 982 San Francisco San Francisco County April Tech Ride hailing
YMCA of San Francisco 959 San Francisco San Francisco County May Sports Gym
Hilton San Francisco Union Square 923 San Francisco San Francisco County April Travel Hotel
Six Flags Discovery Kingdom 911 Vallejo Solano County June Entertainment Amusement park
San Francisco Marriott Marquis 808 San Francisco San Francisco County April Travel Hotel
Aramark 777 Oakland Alameda County April Food Food supplier
The Palace Hotel 774 San Francisco San Francisco County April Travel Hotel
Back of the House Inc 743 San Francisco San Francisco County April Food Restaurant
DPR Construction 715 Redwood City San Mateo County April Real estate Construction
...and so on.
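The extraction step in this answer can be tried offline. The sketch below applies the same regular expression and the same \' replacement to a tiny made-up string that stands in for the minified JavaScript bundle; the sample contents and the choice of index 1 are assumptions for illustration (on the real page, which blob holds the table depends on the bundle).

```python
import json
import re

# Made-up stand-in for the minified bundle; the real file is far larger.
sample_js = r"""function(e,n){n.exports=JSON.parse('{"unrelated":true}')},function(e,n){n.exports=JSON.parse('[{"Company":"Bob\'s Diner","Layoffs":12}]')}"""

# Capture the text between JSON.parse(' and ') for every embedded blob.
blobs = re.findall(r"n\.exports=JSON\.parse\('(.*?)'\)", sample_js)

# JavaScript escapes apostrophes inside a single-quoted string as \';
# JSON does not allow that escape, so turn it back into a plain apostrophe.
rows = json.loads(blobs[1].replace(r"\'", "'"))
print(rows)
```

This only undoes the apostrophe escape, as the answer does; other JavaScript string escapes would need their own handling.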
The content you are interested in is generated when you click a button, so you first need to locate the buttons. There are many ways to do this, but I would suggest something like the following (driver is your WebDriver instance, called browser in the question, and By is imported from selenium.webdriver.common.by):
elements = driver.find_elements(By.XPATH, '//button')
For your specific case you could also use:
elements = driver.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]')
Note that find_elements returns a list, so once you have the button elements, click each one in turn:
for element in elements:
    element.click()
Parsing the page source after this should give you the JavaScript-generated content you are looking for.

How to scrape all information on a web page after the id = "firstheading" in python?

I am trying to scrape all text from a web page (using Python) that comes after the first heading. The tag for that heading is: <h1 id="firstHeading" class="firstHeading" lang="en">Albert Einstein</h1>
I don't want any information before this heading. I want to scrape all text written after this heading. Can I use BeautifulSoup in Python for this?
I am running the following code:
import requests
import bs4
from bs4 import BeautifulSoup
urlpage = 'https://en.wikipedia.org/wiki/Albert_Einstein#Publications'
res = requests.get(urlpage)
soup1 = (bs4.BeautifulSoup(res.text, 'lxml')).get_text()
print(soup1)
The web page has the following information:
Albert Einstein - Wikipedia
document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Albert_Einstein","wgTitle":"Albert Einstein","wgCurRevisionId":920687884,"wgRevisionId":920687884,"wgArticleId":736,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with missing ISBNs","Webarchive template wayback links","CS1 German-language sources (de)","CS1: Julian–Gregorian uncertainty","CS1 French-language sources (fr)","CS1 errors: missing periodical","CS1: long volume value","Wikipedia indefinitely semi-protected pages","Use American English from February 2019","All Wikipedia articles written in American English","Articles with short description","Good articles","Articles containing German-language text","Biography with signature","Articles with hCards","Articles with hAudio microformats","All articles with unsourced statements",
"Articles with unsourced statements from July 2019","Commons category link from Wikidata","Articles with Wikilivres links","Articles with Curlie links","Articles with Project Gutenberg links","Articles with Internet Archive links","Articles with LibriVox links","Use dmy dates from August 2019","Wikipedia articles with BIBSYS identifiers","Wikipedia articles with BNE identifiers","Wikipedia articles with BNF identifiers","Wikipedia articles with GND identifiers","Wikipedia articles with HDS identifiers","Wikipedia articles with ISNI identifiers","Wikipedia articles with LCCN identifiers","Wikipedia articles with LNB identifiers","Wikipedia articles with MGP identifiers","Wikipedia articles with NARA identifiers","Wikipedia articles with NCL identifiers","Wikipedia articles with NDL identifiers","Wikipedia articles with NKC identifiers","Wikipedia articles with NLA identifiers","Wikipedia articles with NLA-person identifiers","Wikipedia articles with NLI identifiers",
"Wikipedia articles with NLR identifiers","Wikipedia articles with NSK identifiers","Wikipedia articles with NTA identifiers","Wikipedia articles with SBN identifiers","Wikipedia articles with SELIBR identifiers","Wikipedia articles with SNAC-ID identifiers","Wikipedia articles with SUDOC identifiers","Wikipedia articles with ULAN identifiers","Wikipedia articles with VIAF identifiers","Wikipedia articles with WorldCat-VIAF identifiers","AC with 25 elements","Wikipedia articles with suppressed authority control identifiers","Pages using authority control with parameters","Articles containing timelines","Pantheists","Spinozists","Albert Einstein","1879 births","1955 deaths","20th-century American engineers","20th-century American writers","20th-century German writers","20th-century physicists","American agnostics","American inventors","American letter writers","American pacifists","American people of German-Jewish descent","American physicists","American science writers",
"American socialists","American Zionists","Ashkenazi Jews","Charles University in Prague faculty","Corresponding Members of the Russian Academy of Sciences (1917–25)","Cosmologists","Deaths from abdominal aortic aneurysm","Einstein family","ETH Zurich alumni","ETH Zurich faculty","German agnostics","German Jews","German emigrants to Switzerland","German Nobel laureates","German inventors","German physicists","German socialists","European democratic socialists","Institute for Advanced Study faculty","Jewish agnostics","Jewish American scientists","Jewish emigrants from Nazi Germany to the United States","Jews who emigrated to escape Nazism","Jewish engineers","Jewish inventors","Jewish philosophers","Jewish physicists","Jewish socialists","Leiden University faculty","Foreign Fellows of the Indian National Science Academy","Foreign Members of the Royal Society","Members of the American Philosophical Society","Members of the Bavarian Academy of Sciences","Members of the Lincean Academy"
,"Members of the Royal Netherlands Academy of Arts and Sciences","Members of the United States National Academy of Sciences","Honorary Members of the USSR Academy of Sciences","Naturalised citizens of Austria","Naturalised citizens of Switzerland","New Jersey socialists","Nobel laureates in Physics","Patent examiners","People from Berlin","People from Bern","People from Munich","People from Princeton, New Jersey","People from Ulm","People from Zürich","People who lost German citizenship","People with acquired American citizenship","Philosophers of science","Relativity theorists","Stateless people","Swiss agnostics","Swiss emigrants to the United States","Swiss Jews","Swiss physicists","Theoretical physicists","Winners of the Max Planck Medal","World federalists","Recipients of the Pour le Mérite (civil class)","Determinists","Activists from New Jersey","Mathematicians involved with Mathematische Annalen","Intellectual Cooperation","Disease-related deaths in New Jersey"],
"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Albert_Einstein","wgRelevantArticleId":736,"wgRequestId":"XaChjApAICIAALSsYfgAAABV","wgCSPNonce":!1,"wgIsProbablyEditable":!1,"wgRelevantPageIsProbablyEditable":!1,"wgRestrictionEdit":["autoconfirmed"],"wgRestrictionMove":["sysop"],"wgMediaViewerOnClick":!0,"wgMediaViewerEnabledByDefault":!0,"wgPopupsReferencePreviews":!1,"wgPopupsConflictsWithNavPopupGadget":!1,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallbacks":"en"},"wgMFDisplayWikibaseDescriptions":{"search":!0,"nearby":!0,"watchlist":!0,"tagline":
!1},"wgWMESchemaEditAttemptStepOversample":!1,"wgULSCurrentAutonym":"English","wgNoticeProject":"wikipedia","wgWikibaseItemId":"Q937","wgCentralAuthMobileDomain":!1,"wgEditSubmitButtonLabelPublish":!0};RLSTATE={"ext.globalCssJs.user.styles":"ready","site.styles":"ready","noscript":"ready","user.styles":"ready","ext.globalCssJs.user":"ready","user":"ready","user.options":"ready","user.tokens":"loading","ext.cite.styles":"ready","ext.math.styles":"ready","mediawiki.legacy.shared":"ready","mediawiki.legacy.commonPrint":"ready","jquery.makeCollapsible.styles":"ready","mediawiki.toc.styles":"ready","wikibase.client.init":"ready","ext.visualEditor.desktopArticleTarget.noscript":"ready","ext.uls.interlanguage":"ready","ext.wikimediaBadges":"ready","ext.3d.styles":"ready","mediawiki.skinning.interface":"ready","skins.vector.styles":"ready"};RLPAGEMODULES=["ext.cite.ux-enhancements","ext.cite.tracking","ext.math.scripts","ext.scribunto.logs","site","mediawiki.page.startup",
"mediawiki.page.ready","jquery.makeCollapsible","mediawiki.toc","mediawiki.searchSuggest","ext.gadget.teahouse","ext.gadget.ReferenceTooltips","ext.gadget.watchlist-notice","ext.gadget.DRN-wizard","ext.gadget.charinsert","ext.gadget.refToolbar","ext.gadget.extra-toolbar-buttons","ext.gadget.switcher","ext.centralauth.centralautologin","mmv.head","mmv.bootstrap.autostart","ext.popups","ext.visualEditor.desktopArticleTarget.init","ext.visualEditor.targetLoader","ext.eventLogging","ext.wikimediaEvents","ext.navigationTiming","ext.uls.compactlinks","ext.uls.interface","ext.cx.eventlogging.campaigns","ext.quicksurveys.init","ext.centralNotice.geoIP","ext.centralNotice.startUp","skins.vector.js"];
(RLQ=window.RLQ||[]).push(function(){mw.loader.implement("user.tokens#tffin",function($,jQuery,require,module){/*#nomin*/mw.user.tokens.set({"patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});
});});
Albert Einstein
From Wikipedia, the free encyclopedia
Jump to navigation Jump to search "Einstein" redirects here. For other
people, see Einstein (surname). For other uses, see Albert Einstein
(disambiguation) and Einstein (disambiguation).
German-born physicist and developer of the theory of relativity
Albert EinsteinEinstein in 1921Born(1879-03-14)14 March 1879Ulm,
Kingdom of Württemberg, German EmpireDied18 April 1955(1955-04-18)
(aged 76)Princeton, New Jersey, United StatesResidenceGermany, Italy,
Switzerland, Austria (present-day Czech Republic), Belgium, United
StatesCitizenship Subject of the Kingdom of Württemberg during the
German Empire (1879–1896)[note 1] Stateless (1896–1901) Citizen of
Switzerland (1901–1955) Austrian subject of the Austro-Hungarian
Empire (1911–1912) Subject of the Kingdom of Prussia during the German
Empire (1914–1918)[note 1] German citizen of the Free State of Prussia
(Weimar Republic, 1918–1933) Citizen of the United States (1940–1955)
Education Federal polytechnic school (1896–1900; B.A., 1900)
University of Zurich (Ph.D., 1905) Known for General relativity
Special relativity Photoelectric effect E=mc2 (Mass–energy
equivalence) E=hf (Planck–Einstein relation) Theory of Brownian motion
Einstein field equations Bose–Einstein statistics Bose–Einstein
condensate Gravitational wave Cosmological constant Unified field
theory EPR paradox Ensemble interpretation List of other concepts
Spouse(s)Mileva Marić(m. 1903; div. 1919)Elsa Löwenthal(m. 1919;
died[1][2] 1936)Children"Lieserl" Einstein Hans Albert Einstein Eduard
"Tete" EinsteinAwards Barnard Medal (1920) Nobel Prize in Physics
(1921) Matteucci Medal (1921) ForMemRS (1921)[3] Copley Medal
(1925)[3] Gold Medal of the Royal Astronomical Society (1926) Max
Planck Medal (1929) Member of the National Academy of Sciences (1942)
Time Person of the Century (1999) Scientific careerFieldsPhysics,
philosophyInstitutions Swiss Patent Office (Bern) (1902–1909)
University of Bern (1908–1909) University of Zurich (1909–1911)
Charles University in Prague (1911–1912) ETH Zurich (1912–1914)
Prussian Academy of Sciences (1914–1933) Humboldt University of Berlin
(1914–1933) Kaiser Wilhelm Institute (director, 1917–1933) German
Physical Society (president, 1916–1918) Leiden University (visits,
1920) Institute for Advanced Study (1933–1955) Caltech (visits,
1931–1933) University of Oxford (visits, 1931–1933) ThesisEine neue
Bestimmung der Moleküldimensionen (A New Determination of Molecular
Dimensions) (1905)Doctoral advisorAlfred KleinerOther academic
advisorsHeinrich Friedrich WeberInfluences Arthur Schopenhauer Baruch
Spinoza Bernhard Riemann David Hume Ernst Mach Hendrik Lorentz Hermann
Minkowski Isaac Newton James Clerk Maxwell Michele Besso Moritz
Schlick Thomas Young Influenced Virtually all modern physics
Signature Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt
ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born
theoretical physicist[5] who developed the theory of relativity, one
of the two pillars of modern physics (alongside quantum
mechanics).[3][6]:274 His work is also known for its influence on the
philosophy of science.[7][8] He is best known to the general public
for his mass–energy equivalence formula . . . . .
I only want the text after the first heading, "Albert Einstein".
First find the h1 tag, then use find_next_siblings('div') and print the text of each sibling.
import requests
import bs4
urlpage = 'https://en.wikipedia.org/wiki/Albert_Einstein#Publications'
res = requests.get(urlpage)
soup1 = bs4.BeautifulSoup(res.text, 'lxml')
h1 = soup1.find('h1')
for item in h1.find_next_siblings('div'):
    print(item.text)
If you do want the text exactly as described, I suggest a bit of a "non-parser" approach: cut the string directly from the response text.
Let's do this:
import bs4
import requests
urlpage = "https://en.wikipedia.org/wiki/Albert_Einstein#Publications"
my_string = """<h1 id="firstHeading" class="firstHeading" lang="en">Albert Einstein</h1>""" # define the string you want
response = requests.get(urlpage).text # get the full response html as str
cut_response = response[response.find(my_string):] # cut the str from your string on
soup1 = bs4.BeautifulSoup(cut_response, 'lxml').get_text() # parse only the cut string and extract its text
print(soup1)
Should work.
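One thing to watch with the slicing approach: str.find returns -1 when the marker is absent, and slicing from -1 would then silently keep only the last character of the page. Here is a minimal sketch with an explicit check, using only the standard library. The helper name cut_from and the sample HTML are made up for illustration, and matching on just the start of the tag is an assumption intended to tolerate changes in the other attributes.

```python
def cut_from(html, marker):
    """Return html starting at marker, or raise if marker is missing."""
    start = html.find(marker)
    if start == -1:
        raise ValueError("marker not found in HTML")
    return html[start:]


# Tiny made-up page standing in for the real response text.
sample = (
    "<html><head><script>var x = 1;</script></head><body>"
    '<h1 id="firstHeading">Albert Einstein</h1>'
    "<div>German-born physicist</div></body></html>"
)

tail = cut_from(sample, '<h1 id="firstHeading"')
print(tail)
```

The returned string can then be passed to BeautifulSoup and get_text() exactly as in the answer above.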
