I would like to scrape the content of this website https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283 and create a table with the columns NAME, TITLE, LOCATION. I know some individuals have more or less "lines", but I am just trying to understand how I could even classify the first 3 lines for each person given that the text is in between "fonts" for all.
So far I have:
from bs4 import BeautifulSoup
from selenium import webdriver

url="https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283"
driver = webdriver.Chrome() # assumption: the driver was created earlier; any WebDriver would do
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("font") # every <font> tag on the page
But once I am there and I have all the text within "font" in my "column" variable, I don't know how to proceed to differentiate between each person and build a loop where I would retrieve name, title, location etc. for each.
Any help would be highly appreciated!
Note: instead of using selenium, I simply fetched and parsed with soup = BeautifulSoup(requests.get(url).content, "html.parser"); as far as I can tell, the required section is not dynamic, so it shouldn't cause any issues.
Would you have any idea about how to look for pairs of <br>?
Since they represent empty lines, you could try simply splitting the text in that cell by \n\n\n
blockText = soup.select_one('td:has(font)').get_text(' ')
blockText = blockText.replace('-'*10, '\n\n\n') # pad "underlined" lines
blockSections = [sect.strip() for sect in '\n'.join([
l.strip('-').strip() for l in blockText.splitlines()
]).split('\n\n\n') if sect.strip()]
Although, if you looked at blockSections, you might notice that some headers [ROSTER and MEMBERS] get stuck to the end of the previous section - probably because their formatting means that an extra <br> is not needed to distinguish them from their adjacent sections. [I added the .replace('-'*10, '\n\n\n') line so that at least they're separated from the next section.]
Another risk is that I don't know if all versions and parsers will parse <br><br> to text as 3 line breaks - some omit br space entirely from text, and others might add extra space based on spaces between tags in the source html.
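To make that risk concrete, here is a tiny sketch. The two HTML strings are made up for illustration (they are not taken from the roster page), and newline_count is just a hypothetical helper name. It shows that, at least with html.parser, the line breaks in the text come from whitespace between the tags rather than from the <br> tags themselves:

```python
from bs4 import BeautifulSoup

def newline_count(html):
    """Count newline characters in the text BeautifulSoup extracts from html."""
    text = BeautifulSoup(html, "html.parser").get_text()
    return text.count("\n")

# same tags, once without and once with whitespace between them (made-up samples)
compact = "<td><font>A</font><br><br><font>B</font></td>"
spaced = "<td><font>A</font>\n<br>\n<br>\n<font>B</font></td>"

print(newline_count(compact))  # the <br> tags alone contribute no newlines
print(newline_count(spaced))   # the newlines come from the source whitespace
```

So splitting on '\n\n\n' only works when the source happens to be laid out with a line break around each <br>.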
It's easier to split if you loop through the <br>s and pad them with something more distinctive to split by; the .insert... methods are useful here. (This method also has the advantage of being able to target bolded lines as well.)
blockSoup = soup.select_one('td:has(font)')
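for br2 in blockSoup.select('br+br, font:has(br)'):
    # name the parser explicitly so bs4 doesn't have to guess one (and warn about it)
    br2.insert_after(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
    br2.insert_before(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)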
blockSections = [
sect.strip().strip('-').strip() for sect in
blockSoup.get_text(' ').split("="*80) if sect.strip()
]
This time, blockSections looks something like
['Membership Roster - ACE\n AIDS CLINICAL STUDIES AND EPIDEMIOLOGY STUDY SECTION\n Center For Scientific Review\n (Terms end 6/30 of the designated year)\n ROSTER',
'CHAIRPERSON',
'SCHACKER, TIMOTHY\n W\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF MINNESOTA\n MINNEAPOLIS,\n MN\n 55455',
'MEMBERS',
'ANDERSON, JEAN\n R\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF GYNECOLOGY AND OBSTETRICS\n JOHNS HOPKINS UNIVERSITY\n BALTIMORE,\n MD 21287',
'BALASUBRAMANYAM, ASHOK\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE AND\n MOLECULAR AND CELLULAR BIOLOGY\n DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM\n BAYLOR COLLEGE OF MEDICINE\n HOUSTON,\n TX 77030',
'BLATTNER, WILLIAM\n ALBERT\n , MD,\n (15)\n PROFESSOR AND ASSOCIATE DIRECTOR\n DEPARTMENT OF MEDICNE\n INSTITUTE OF HUMAN VIROLOGY\n UNIVERSITY OF MARYLAND, BALTIMORE\n BALTIMORE,\n MD 21201',
'CHEN, YING\n QING\n , PHD,\n (15)\n PROFESSOR\n PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS\n FRED HUTCHINSON CANCER RESEARCH CENTER\n SEATTLE,\n WA 981091024',
'COTTON, DEBORAH\n , MD,\n (13)\n PROFESSOR\n SECTION OF INFECTIOUS DISEASES\n DEPARTMENT OF MEDICINE\n BOSTON UNIVERSITY\n BOSTON,\n MA 02118',
'DANIELS, MICHAEL\n J\n , SCD,\n (16)\n PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF TEXAS AT AUSTIN\n AUSTIN,\n TX 78712',
'FOULKES, ANDREA\n SARAH\n , SCD,\n (14)\n ASSOCIATE PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF MASSACHUSETTS\n AMHERST,\n MA 01003',
'HEROLD, BETSY\n C\n , MD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n ALBERT EINSTEIN COLLEGE OF MEDICINE\n BRONX,\n NY 10461',
'JUSTICE, AMY\n CAROLINE\n , MD, PHD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n YALE UNIVERSITY\n NEW HAVEN,\n CT 06520',
'KATZENSTEIN, DAVID\n ALLENBERG\n , MD,\n (13)\n PROFESSOR\n DIVISION OF INFECTIOUS DISEASES\n STANFORD UNIVERSITY SCHOOL OF MEDICINE\n STANFORD,\n CA 94305',
'MARGOLIS, DAVID\n M\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL\n CHAPEL HILL,\n NC 27599',
'MONTANER, LUIS\n J\n , DVM, PHD,\n (13)\n PROFESSOR\n DEPARTMENT OF IMMUNOLOGY\n THE WISTAR INSTITUTE\n PHILADELPHIA,\n PA 19104',
'MONTANO, MONTY\n A\n , PHD,\n (15)\n RESEARCH SCIENTIST\n DEPARTMENT OF IMMUNOLOGY AND\n INFECTIOUS DISEASES\n BOSTON UNIVERSITY\n BOSTON,\n MA 02115',
'PAGE, KIMBERLY\n , PHD, MPH,\n (16)\n PROFESSOR\n DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH\n AND GLOBAL HEALTH SCIENCES\n DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n SAN FRANCISCO,\n CA 94105',
'SHIKUMA, CECILIA\n M\n , MD,\n (15)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n HAWAII AIDS CLINICAL RESEARCH PROGRAM\n UNIVERSITY OF HAWAII\n HONOLULU,\n HI 96816',
'WOOD, CHARLES\n , PHD,\n (13)\n PROFESSOR\n UNIVERSITY OF NEBRASKA\n LINCOLN,\n NE 68588']
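If you only need NAME, TITLE and LOCATION, one rough way to go from these strings to fields is sketched below. It is only a heuristic under two assumptions that hold for the output shown above but that I haven't checked against other rosters: every person block contains a term marker like "(14)" right before the title, and the city is the last fragment ending in a comma. parse_section is a hypothetical helper name, not part of any library.

```python
import re

def parse_section(sect):
    """Heuristically split one roster block into name, title and location."""
    parts = [p.strip() for p in sect.splitlines() if p.strip()]
    # assumption: every person block contains a term marker such as "(14)"
    term_idx = next(
        (i for i, p in enumerate(parts) if re.fullmatch(r"\(\d+\)", p)), None
    )
    if term_idx is None or term_idx + 1 >= len(parts):
        return None  # header blocks like 'CHAIRPERSON' have no term marker
    # assumption: the city is the last fragment that ends with a comma
    city_idx = max(
        (i for i, p in enumerate(parts) if i > term_idx and p.endswith(",")),
        default=len(parts) - 1,
    )
    name = " ".join(parts[:term_idx]).replace(" ,", ",").rstrip(",")
    return {
        "NAME": name,
        "TITLE": parts[term_idx + 1],
        "LOCATION": " ".join(parts[city_idx:]),
    }

sample = ('SCHACKER, TIMOTHY\n W\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n'
          ' UNIVERSITY OF MINNESOTA\n MINNEAPOLIS,\n MN\n 55455')
person = parse_section(sample)
print(person)
```

Blocks for which it returns None (the headers) can simply be skipped when looping over blockSections.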
create a table with the columns NAME, TITLE, LOCATION
There may be a more elegant solution, but I feel like the simplest way would be to just loop the siblings of the headers and keep count of consecutive brs.
doubleBr = soup.select('br')[:2] # [ so the last person also gets added ]
personsList = []
for f in soup.select('td>font>font:has(b br)'):
    role, lCur,pCur,brCt = f.get_text(' ').strip('-').strip(), [],[],0
    for lf in f.find_next_siblings(['font','br'])+doubleBr:
        brCt = brCt+1 if lf.name == 'br' else 0
        if pCur and (brCt>1 or lf.b):
            pDets = {'role': role, 'name': '?'} # initiate
            if len(pCur)>1: pDets['title'] = pCur[1]
            pDets['name'], pCur = pCur[0], pCur[2:]
            dList = pCur[:-2]
            pDets['departments'] = dList[0] if len(dList)==1 else dList
            if len(pCur)>1: pDets['institute'] = pCur[-2]
            if pCur: pDets['location'] = pCur[-1]
            personsList.append(pDets)
            pCur, lCur, brCt = [], [], 0 # clear
        if lf.b: break # reached next section
        if lf.name == 'font': # [split and join to minimize whitespace]
            lCur.append(' '.join(lf.get_text(' ').split())) # add to line
        if brCt and lCur: pCur, lCur = pCur+[' '.join(lCur)], [] # newline
Since personsList is a list of dictionaries, it can be tabulated as simply as pandas.DataFrame(personsList) to get a DataFrame that looks like:
| role | name | title | departments | institute | location |
|---|---|---|---|---|---|
| CHAIRPERSON | SCHACKER, TIMOTHY W , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF MINNESOTA | MINNEAPOLIS, MN 55455 |
| MEMBERS | ANDERSON, JEAN R , MD | PROFESSOR | DEPARTMENT OF GYNECOLOGY AND OBSTETRICS | JOHNS HOPKINS UNIVERSITY | BALTIMORE, MD 21287 |
| MEMBERS | BALASUBRAMANYAM, ASHOK , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE AND', 'MOLECULAR AND CELLULAR BIOLOGY', 'DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM'] | BAYLOR COLLEGE OF MEDICINE | HOUSTON, TX 77030 |
| MEMBERS | BLATTNER, WILLIAM ALBERT , MD | PROFESSOR AND ASSOCIATE DIRECTOR | ['DEPARTMENT OF MEDICNE', 'INSTITUTE OF HUMAN VIROLOGY'] | UNIVERSITY OF MARYLAND, BALTIMORE | BALTIMORE, MD 21201 |
| MEMBERS | CHEN, YING QING , PHD | PROFESSOR | PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS | FRED HUTCHINSON CANCER RESEARCH CENTER | SEATTLE, WA 981091024 |
| MEMBERS | COTTON, DEBORAH , MD | PROFESSOR | ['SECTION OF INFECTIOUS DISEASES', 'DEPARTMENT OF MEDICINE'] | BOSTON UNIVERSITY | BOSTON, MA 02118 |
| MEMBERS | DANIELS, MICHAEL J , SCD | PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF TEXAS AT AUSTIN | AUSTIN, TX 78712 |
| MEMBERS | FOULKES, ANDREA SARAH , SCD | ASSOCIATE PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF MASSACHUSETTS | AMHERST, MA 01003 |
| MEMBERS | HEROLD, BETSY C , MD | PROFESSOR | DEPARTMENT OF PEDIATRICS | ALBERT EINSTEIN COLLEGE OF MEDICINE | BRONX, NY 10461 |
| MEMBERS | JUSTICE, AMY CAROLINE , MD, PHD | PROFESSOR | DEPARTMENT OF PEDIATRICS | YALE UNIVERSITY | NEW HAVEN, CT 06520 |
| MEMBERS | KATZENSTEIN, DAVID ALLENBERG , MD | PROFESSOR | DIVISION OF INFECTIOUS DISEASES | STANFORD UNIVERSITY SCHOOL OF MEDICINE | STANFORD, CA 94305 |
| MEMBERS | MARGOLIS, DAVID M , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL | CHAPEL HILL, NC 27599 |
| MEMBERS | MONTANER, LUIS J , DVM, PHD | PROFESSOR | DEPARTMENT OF IMMUNOLOGY | THE WISTAR INSTITUTE | PHILADELPHIA, PA 19104 |
| MEMBERS | MONTANO, MONTY A , PHD | RESEARCH SCIENTIST | ['DEPARTMENT OF IMMUNOLOGY AND', 'INFECTIOUS DISEASES'] | BOSTON UNIVERSITY | BOSTON, MA 02115 |
| MEMBERS | PAGE, KIMBERLY , PHD, MPH | PROFESSOR | ['DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH', 'AND GLOBAL HEALTH SCIENCES', 'DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS', 'UNIVERSITY OF CALIFORNIA, SAN FRANCISCO'] | UNIVERSITY OF CALIFORNIA, SAN FRANCISCO | SAN FRANCISCO, CA 94105 |
| MEMBERS | SHIKUMA, CECILIA M , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE', 'HAWAII AIDS CLINICAL RESEARCH PROGRAM'] | UNIVERSITY OF HAWAII | HONOLULU, HI 96816 |
| MEMBERS | WOOD, CHARLES , PHD | PROFESSOR | [] | UNIVERSITY OF NEBRASKA | LINCOLN, NE 68588 |
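As a follow-up, here is a minimal sketch of trimming that DataFrame down to just the NAME, TITLE and LOCATION columns the question asked for. The two dictionaries are copied from the rows above as stand-ins for personsList, and joining multi-line departments with "; " is my own choice, not something the page dictates.

```python
import pandas as pd

# stand-in for personsList: two rows copied from the output above
persons = [
    {"role": "CHAIRPERSON", "name": "SCHACKER, TIMOTHY W , MD", "title": "PROFESSOR",
     "departments": "DEPARTMENT OF MEDICINE", "institute": "UNIVERSITY OF MINNESOTA",
     "location": "MINNEAPOLIS, MN 55455"},
    {"role": "MEMBERS", "name": "COTTON, DEBORAH , MD", "title": "PROFESSOR",
     "departments": ["SECTION OF INFECTIOUS DISEASES", "DEPARTMENT OF MEDICINE"],
     "institute": "BOSTON UNIVERSITY", "location": "BOSTON, MA 02118"},
]

df = pd.DataFrame(persons)

# flatten list-valued departments into one string so every cell is plain text
df["departments"] = df["departments"].apply(
    lambda d: "; ".join(d) if isinstance(d, list) else d
)

# keep only the requested columns and upper-case their names
table = df[["name", "title", "location"]].rename(columns=str.upper)
print(table)
```

From there, table.to_csv(...) or table.to_excel(...) would write it out if a file is needed.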
[ Btw, if the .select('br+br, font:has(br)') and .select('td>font>font:has(b br)') parts are unfamiliar to you, you can look up .select and CSS selectors. Combinators [like >/+/,] and pseudo-classes [like :has] allow us to get very specific with our targets. ]
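For a quick feel of what those selectors match, here is a small sketch on a made-up fragment that only mimics the roster layout (it is not the real page source, and the names in it are placeholders):

```python
from bs4 import BeautifulSoup

# made-up fragment that mimics the roster layout (not the real page source)
html = (
    "<table><tr><td><font>"
    "<font><b>MEMBERS<br></b></font>"
    "<font>DOE, JANE</font><br><font>PROFESSOR</font><br><br>"
    "<font>ROE, RICHARD</font><br>"
    "</font></td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# "br+br": a <br> whose immediately preceding sibling element is also a <br>
second_brs = soup.select("br+br")

# "td>font>font:has(b br)": a <font> nested directly under td > font
# that contains a <b> with a <br> somewhere inside it (the section headers)
headers = [f.get_text() for f in soup.select("td>font>font:has(b br)")]

print(len(second_brs), headers)
```

Only the second <br> of the blank-line pair matches br+br, and only the bolded header <font> matches the :has(b br) selector.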
I am trying to parse the following data extracted with beautiful soup.
{"currentLink":"/torrance-ca-90510/","regionId":96168,"displayRegionName":"90510"}],"universities":[]},"showAttributeLinks":null}},"mapState":{"customRegionPolygonWkt":null,"schoolPolygonWkt":null,"isCurrentLocationSearch":false,"userPosition":{"lat":null,"lon":null}},"regionState":{"regionInfo":[{"regionType":6,"regionId":54722,"regionName":"Torrance","displayName":"Torrance CA","isPointRegion":false}],"regionBounds":{"north":33.887061,"east":-118.308127,"south":33.780217,"west":-118.394107}},"searchPageSeoObject":{"baseUrl":"/torrance-ca/","windowTitle":"Torrance CA Real Estate - Torrance CA Homes For Sale | Zillow","metaDescription":"Zillow has 100 homes for sale in Torrance CA. View listing photos, review sales history, and use our detailed real estate filters to find the perfect place."},"abTrials":{"SXP_HDP_CONTINGENT_V2":"ON","SXP_REGION_AUTOCOMPLETE_SOURCE":"TRULIA","RE_Move_In_Date_Filter":"TEST","SXP_SENTRY":"ON","SEOTEST__SXP_LIST_ONLY_SRP":"CONTROL","SXP_SAVE_SEARCH_COLOR":"CONTROL","SXP_REACT_FOOTER_DESKTOP":"CONTROL","RE_Web_PersonalizedSort":"CONTROL","ACQ_Search_Filters_Upsell":"CONTROL","VARIANTS_BDP_768_PLUS":"CONTROL","SXP_FLOATING_ACTION_BAR":"ON","SET_CTA_COUNTLESS":"CONTROL","SXP_LISTING_SUBTYPE":"CONTROL","SXP_Search_Refinement_Filters":"CONTROL","RE_SearchByBuildingName":"TEST","SXP_PREXIT_CLAIMS_INFO":"ENABLED","SXP_NONMLS_OFF":"CONTROL","SXP_PARTIAL_PAGE_LOAD_REFACTOR":"CONTROL","SXP_NAV_AD_LOADING":"CONTROL","ACT_FILTER_ON_LAND":"CONTROL","SXP_PHOTO_CAROUSEL":"CONTROL","SEO__SXP_REMOVE_ANCHOR_TEXT":"CONTROL","DXP_NEW_MAP_DOTS_WEB":"CONTROL","SXP_ACT_REMOVE_SEARCHBOX_GLEAM":"NO_GLEAM","SXP_Exclude_Referer":"TEST","SXP_PAGE_LOAD":"FASTER","RE_Rentals_Badging_v1":"CONTROL","SXP_REACT_GPT":"REMOVED","SXP_QU_PHASE_2":"ON","RE_HDP_REDIRECT":"CONTROL","RE_Search_Refinement_Filters":"CONTROL","MIGHTY_MONTH_2022_HOLDOUT":"MIGHTY_MONTH_ON","SXP_NEW_LANE_CLICKSTREAM":"ON","SXP_REDUCED_SERVER_SIDE_RENDER":"CONTROL","SXP_DelayJS":"AFTER_LOAD","ACQ_Bann
er_Suppression":"CONTROL","GS_RATING_CLEANUP":"CONTROL","SXP_DEFERRED_RENDERER":"ASYNC_INITIAL_HYDRATE","SXP_STREETVIEW_REQUEST_TYPE":"CONTROL","RE_RentalsHomesForYouSort":"CONTROL","SP_FOR_RENT_PAGE":"CONTROL","Activation_NewLane_Metrics_Enabled":"DISABLED","SEOTEST__SXP_REMOVE_WHY_ZILLOW":"REMOVE_WHY_ZILLOW","Activation_Enabled":"ENABLED","DXP_RTB_LINKING":"ON","DXP_PHOTO_CAROUSEL":"CONTROL","DXP_MAP_ICONS":"CONTROL","WEB_HIDDEN_HOMES_2022":"ON","SXP_Rentals_Apartment_Community_Filter":"TEST","DXP_HOMEPAGE_OMP_CLIENT_REFACTOR":"ON","SXP_WOW_LIST_CARD":"CONTROL","SXP_KF_FILTERS_AC":"ON","SXP_SDS_INTEGRATION":"USE_FOR_ALL","SXP_MOBILE_MAP_PRIORITY":"CONTROL","SXP_MAP_DOT_STYLE":"CONTROL","Activation_GA_Metrics_Enabled":"ENABLED","SXP_Pers_SimilarResults":"CONTROL","RE_RentalHomeDetailsService":"CONTROL","ACQ_SigninSRP_Module":"Variant_Module_A","DXP_CONST_PROPCARD_MAPVIEW":"CONTROL","SXP_FLYBAR_PSL_ZGSEARCH":"ON","ADS_Tagless":"Casale_On","SEOTEST__NC_H1":"ALTERNATE","SXP_FLYBAR_REGION_API":"CONTROL","RMX_3RD_PARTY_P1":"ON","RUM_VIA_PRE_ENDPOINT":"TREATMENT_OFF","DXP_CONST_PROPCARD_LISTVIEW":"ON","SXP_EVENT_MARKUP":"CONTROL","SXP_SEARCH_DISPATCHER_SERVICE":"CONTROL","SXP_DISPLAY_AD_LOADING":"CONTROL","SP_ZO_HDP_PAGE":"CONTROL","DXP_DYNAMIC_ADS":"CONTROL","DXP_MAP_DOT_COLORS":"CONTROL","DESKTOP_COMMUTE_FILTER_MVP":"CONTROL","DXP_AUTH_GATED_COLLECTIONS":"ON","RE_FR_Photo_Carousel":"CONTROL","ADT_TOP_SLOT_SRP":"ONSITE_MESSAGING","DXP_HIDE_HOME":"CONTROL","VL_BDP_SSR_QUERY":"CONTROL_CACHED","VL_BDP_NEW_TAB":"CONTROL","SXP_PREXIT_CLAIMS_CHECK":"ENABLED","SEOTEST__SXP_SEO_TEST":"CONTROL","SXP_OPEN_HOUSE_FLEX":"OPEN_HOUSE_BOOSTED","SP_FOR_SALE_PAGE":"CONTROL","SXP_NO_SRPTOGGLE":"CONTROL","SXP_PAGINATION":"LEGACY_PAGINATION","SXP_KF_FILTERS_V2":"ON","SXP_MLS_NONMLS_FILTER":"CONTROL","DXP_MULTIPLE_COLLECTIONS":"ON","RE_GuidedSearchFiltersPOC":"CONTROL","SXP_3DHOME_FILTER":"ON","SP_BUILDING_PAGE":"CONTROL","SXP_QU_MIGRATION":"ON","ACQ_MOBILE_UPSELL_SXP":"CONTROL","SXP_LIST_ON
LY_SRP":"CONTROL","RE_JanusBrainSort":"TEST_ALL_STATES","SEOTEST__RE_ForRentForSaleSRPBreadcrumbs":"CONTROL","SXP_MAKE_ME_MOVE":"REMOVED","SHO_GA_ResultsTotalEvent":"ON","DXP_MAP_DOTS_WEB":"ON","SXP_MULTIREGION_SEARCH":"CONTROL","DXP_HERO_SHORTENING":"ON","SXP_VISUAL_AUDIT_2021":"ON","SXP_HDP_BLUE_TO_RED":"ON","HDP_DESKTOP_LAYOUT_TOPNAV":"CONTROL","SXP_HEADER_TAG_WRAPPER":"ON","DXP_HOMEPAGE_OMP":"ON","SXP_Multifamily_Filter":"MULTIFAMILY_SEPARATE","SXP_FOOTER":"RESPONSIVE_REACT","SP_PAID_BUILDER_PAGE":"VIA_SHOPPER_PLATFORM","SP_OFF_MARKET_PAGE":"VIA_SHOPPER_PLATFORM","SXP_KINGFISHER_FILTERS":"P1_PHASE_1","RE_SECOND_BOOST":"SLOT_4","DXP_TG_SCHOOLS_DISABLED":"CONTROL","ADT_PROGRESSIVE_MESSAGE":"CONTROL","SXP_Combined_Filter_Apartments_Condos":"TEST","DXP_HOME_RECS":"ON","SEOTEST__SXP_REACT_FOOTER_DESKTOP":"CONTROL","SXP_3DTOUR_MAP_DOT":"ON","DXP_SEE_MORE_RECS":"SCROLL_ON"},"cat1":{"searchResults":{"listResults":[{"zpid":"21328879","id":"21328879","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/3e81a218088316bafa7b199e8dc4923f-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/20412-Wayne-Ave-Torrance-CA-90503/21328879_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$1,275,000","unformattedPrice":1275000,"address":"20412 Wayne Ave, Torrance, CA 90503","addressStreet":"20412 Wayne Ave","addressCity":"Torrance","addressState":"CA","addressZipcode":"90503","isUndisclosedAddress":false,"beds":4,"baths":2.0,"area":1890,"latLong":{"latitude":33.845936,"longitude":-118.37223},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 
12-3pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21328879,"streetAddress":"20412 Wayne Ave","zipcode":"90503","city":"Torrance","state":"CA","latitude":33.845936,"longitude":-118.37223,"price":1275000.0,"bathrooms":2.0,"bedrooms":4.0,"livingArea":1890.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":1264181,"rentZestimate":4699,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 12-3pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1275000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673726400000,"open_house_end":1673737200000},{"open_house_start":1673816400000,"open_house_end":1673827200000},{"open_house_start":1673902800000,"open_house_end":1673913600000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":171633.0,"lotAreaValue":7172.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T12:00:00","openHouseEndDate":"2023-01-14T15:00:00","openHouseDescription":"Open House - 0:00 - 3:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":1264181,"shouldShowZestimateAsPrice":false,"has3DModel":true,"hasVideo":false,"isHomeRec":false,"brokerName":"Re/Max Estate Properties","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"2060330967","id":"2060330967","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/f65a190c100bab31301becdba3cdf7cc-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/23701-S-Western-Ave-SPACE-244-Torrance-CA-90501/2060330967_zpid/","statusType":"FOR_SALE","statusText":"Home for sale","countryCurrency":"$","price":"$88,000","unformattedPrice":88000,"address":"23701 S Western Ave SPACE 244, 
Torrance, CA 90501","addressStreet":"23701 S Western Ave SPACE 244","addressCity":"Torrance","addressState":"CA","addressZipcode":"90501","isUndisclosedAddress":false,"beds":3,"baths":2.0,"area":1000,"latLong":{"latitude":33.809013,"longitude":-118.311035},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 2-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":2060330967,"streetAddress":"23701 S Western Ave SPACE 244","zipcode":"90501","city":"Torrance","state":"CA","latitude":33.809013,"longitude":-118.311035,"price":88000.0,"datePriceChanged":1673078400000,"bathrooms":2.0,"bedrooms":3.0,"livingArea":1000.0,"homeType":"MANUFACTURED","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 2-4pm","priceReduction":"$2,000 (Jan 7)","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":88000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673733600000,"open_house_end":1673740800000},{"open_house_start":1673820000000,"open_house_end":1673827200000}]},"priceChange":-2000,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","unit":"Space 244","lotAreaValue":16.3319,"lotAreaUnit":"acres"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T14:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 2:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":null,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"eXp Realty of California, 
Inc.","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21338409","id":"21338409","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/1f90c7be6ceca4a76d64d904010f0cb7-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/25924-Matfield-Dr-Torrance-CA-90505/21338409_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$1,100,000","unformattedPrice":1100000,"address":"25924 Matfield Dr, Torrance, CA 90505","addressStreet":"25924 Matfield Dr","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":4,"baths":3.0,"area":1531,"latLong":{"latitude":33.78651,"longitude":-118.334206},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21338409,"streetAddress":"25924 Matfield Dr","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.78651,"longitude":-118.334206,"price":1100000.0,"bathrooms":3.0,"bedrooms":4.0,"livingArea":1531.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":1100006,"rentZestimate":3800,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 
1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1100000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000},{"open_house_start":1673816400000,"open_house_end":1673827200000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":436335.0,"lotAreaValue":7987.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":1100006,"shouldShowZestimateAsPrice":false,"has3DModel":true,"hasVideo":false,"isHomeRec":false,"brokerName":"Equity Union","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21337140","id":"21337140","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/cb80b5736be022cbeeafb3bb77ec0e83-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/4068-Newton-St-Torrance-CA-90505/21337140_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$999,900","unformattedPrice":999900,"address":"4068 Newton St, Torrance, CA 90505","addressStreet":"4068 Newton St","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":2,"baths":2.0,"area":1268,"latLong":{"latitude":33.803745,"longitude":-118.35658},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 
1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21337140,"streetAddress":"4068 Newton St","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.803745,"longitude":-118.35658,"price":999900.0,"bathrooms":2.0,"bedrooms":2.0,"livingArea":1268.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":999907,"rentZestimate":3999,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":999900.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000},{"open_house_start":1673816400000,"open_house_end":1673827200000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":521866.0,"lotAreaValue":5050.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":999907,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Compass","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"63093583","id":"63093583","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/8eec0c61013d143353c8893258ad1770-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/3120-Sepulveda-Blvd-UNIT-414-Torrance-CA-90505/63093583_zpid/","statusType":"FOR_SALE","statusText":"Condo for sale","countryCurrency":"$","price":"$389,000","unformattedPrice":389000,"address":"3120 Sepulveda Blvd UNIT 414, Torrance, CA 90505","addressStreet":"3120 Sepulveda Blvd UNIT 
414","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":1,"baths":1.0,"area":526,"latLong":{"latitude":33.823784,"longitude":-118.34189},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":63093583,"streetAddress":"3120 Sepulveda Blvd UNIT 414","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.823784,"longitude":-118.34189,"price":389000.0,"bathrooms":1.0,"bedrooms":1.0,"livingArea":526.0,"homeType":"CONDO","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":389001,"rentZestimate":2084,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":389000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":360000.0,"unit":"Unit 414","lotAreaValue":1.0985,"lotAreaUnit":"acres"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":389001,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Re/Max Estate 
Properties","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21324245","id":"21324245","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/4c47d6593842fdf0036f1805838c1673-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/417-Paseo-De-La-Playa-Redondo-Beach-CA-90277/21324245_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$19,995,000","unformattedPrice":19995000,"address":"417 Paseo De La Playa, Redondo Beach, CA 90277","addressStreet":"417 Paseo De La Playa","addressCity":"Redondo Beach","addressState":"CA","addressZipcode":"90277","isUndisclosedAddress":false,"beds":10,"baths":15.0,"area":15728,"latLong":{"latitude":33.810413,"longitude":-118.39131},"isZillowOwned":false,"variableData":{"type":"PRICE_REDUCTION","text":"$3,000,000 (Nov 10)"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21324245,"streetAddress":"417 Paseo De La Playa","zipcode":"90277","city":"Redondo Beach","state":"CA","latitude":33.810413,"longitude":-118.39131,"price":1.9995E7,"datePriceChanged":1668067200000,"bathrooms":15.0,"bedrooms":10.0,"livingArea":15728.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":17760922,"rentZestimate":83834,"listing_sub_type":{"is_FSBA":true},"priceReduction":"$3,000,000 (Nov 10)","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1.9995E7,"priceChange":-3000000,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":1.2904489E7,"lotAreaValue":1.4407,"lotAreaUnit":"acres"}},"isSaved":false,"isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale 
(Broker)","zestimate":17760922,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Douglas Elliman of California, Inc.","info6String":"Joshua Altman DRE # 01764587","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21272955","id":"21272955","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/0d9ac33c6a2d1c8683ec45e7aef895a5-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/3101-Plaza-Del-Amo-UNIT-5-Torrance-CA-90503/21272955_zpid/","statusType":"FOR_SALE","statusText":"Townhouse for sale","countryCurrency":"$","price":"$835,000","unformattedPrice":835000,"address":"3101 Plaza Del Amo UNIT 5, Torrance, CA 90503","addressStreet":"3101 Plaza Del Amo UNIT 5","addressCity":"Torrance","addressState":"CA","addressZipcode":"90503","isUndisclosedAddress":false,"beds":3,"baths":3.0,"area":1446,"latLong":{"latitude":33.828743,"longitude":-118.34023},"isZillowOwned":false,"variableData":{"type":"DAYS_ON","text":"3 days on Zillow"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21272955,"streetAddress":"3101 Plaza Del Amo UNIT 5","zipcode":"90503","city":"Torrance","state":"CA","latitude":33.828743,"longitude":-118.34023,"price":835000.0,"bathrooms":3.0,"bedrooms":3.0,"livingArea":1446.0,"homeType":"TOWNHOUSE","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":852600,"rentZestimate":3499,"listing_sub_type":{"is_FSBA":true},"isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":835000.0,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":788066.0,"unit":"Unit 5","lotAreaValue":5.4866,"lotAreaUnit":"acres"}},"isSaved":false,"isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale 
(Broker)","zestimate":852600,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"RELO REDAC, Inc.","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},
From what I can tell, this data exists in a section of the BeautifulSoup object within a <script> tag. I didn't include all the data (there's a lot), but here's an excerpt of the last tag I can find before the region I'd like to extract.
<script data-zrr-shared-data-key="mobileSearchPageStore" type="application/json"><!--{"queryState":{"mapBounds":{"north":33.887061,"south":33.780217,"east":-118.308127,"west":-118.394107},"regionSelection":[{"regionId":54722,"regionType":6}],"isMapVisible":true,"filterState":{"sortSelection":{"value":"globalrelevanceex"},"isAllHomes":{"value":true}}},"filterDefinitions":{"keywords":{"id":"keywords","shortId":"att","labels":{"default":"Keywords","tracking":"Keyword"},"sortOrder":2,"type":"String","defaultValue":{"value":""},"exposedPillEnabled":true},"isPublicSchool":{"id":"isPublicSchool","shortId":"schp","labels":{"default":"Public"},"type":"Boolean","defaultValue":{"value":true}},"isCityView":{"id":"isCityView","shortId":"cityv","labels":
Is this JSON data? Is there a way to extract just this part of the data?
You can try to parse the data with the json module:
import json
from bs4 import BeautifulSoup
html_doc = '''\
<script data-zrr-shared-data-key="mobileSearchPageStore" type="application/json"><!--{"currentLink":"/torrance-ca-90510/","regionId":96168}--></script>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# get the right tag
data = soup.select_one('script[data-zrr-shared-data-key="mobileSearchPageStore"]')
# get the contents of this tag, strip the html comments
data = data.contents[0].strip('><!-')
# parse the data
data = json.loads(data)
# print the data
print(data)
Prints:
{'currentLink': '/torrance-ca-90510/', 'regionId': 96168}
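Once the data is parsed, getting at the listings is plain dictionary navigation. The sketch below uses a shortened stand-in string built from fields visible in the excerpt; it assumes that "cat1" sits at the top level of the parsed object, which the excerpt suggests but does not confirm. strip_html_comment is a hypothetical helper that removes the exact <!-- and --> markers instead of stripping a set of characters.

```python
import json

# shortened stand-in for the <script> contents, built from fields in the excerpt
raw = (
    '<!--{"cat1":{"searchResults":{"listResults":['
    '{"address":"20412 Wayne Ave, Torrance, CA 90503","price":"$1,275,000","beds":4},'
    '{"address":"4068 Newton St, Torrance, CA 90505","price":"$999,900","beds":2}'
    ']}}}-->'
)

def strip_html_comment(text):
    """Remove one leading '<!--' and one trailing '-->' if they are present."""
    text = text.strip()
    if text.startswith("<!--"):
        text = text[4:]
    if text.endswith("-->"):
        text = text[:-3]
    return text

data = json.loads(strip_html_comment(raw))

# assumption: "cat1" is a top-level key; .get() keeps this from raising if it is not
listings = data.get("cat1", {}).get("searchResults", {}).get("listResults", [])
rows = [(h.get("address"), h.get("price"), h.get("beds")) for h in listings]

for row in rows:
    print(row)
```

If the key path turns out to be different on the real page, printing data.keys() at each level is a quick way to find where listResults actually lives.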