I would like to scrape the content of this website https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283 and create a table with the columns NAME, TITLE, LOCATION. I know some individuals have more or fewer "lines", but I am just trying to understand how I could even classify the first 3 lines for each person, given that all of the text sits inside <font> tags.
So far I have:
from selenium import webdriver
from bs4 import BeautifulSoup

url="https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283"
driver = webdriver.Chrome() # assumption: the post does not show how driver was created
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 5 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("font")
But once I am there and I have all the text within "font" in my "column" variable, I don't know how to proceed to differentiate between each person and build a loop where I would retrieve name, title, location etc. for each.
Any help would be highly appreciated!
Note: instead of using selenium, I simply fetched and parsed with soup = BeautifulSoup(requests.get(url).content, "html.parser"); as far as I can tell, the required section is not dynamic, so it shouldn't cause any issues.
Would you have any idea how to look for pairs of <br>?
Since they represent empty lines, you could try simply splitting the text in that cell by \n\n\n
blockText = soup.select_one('td:has(font)').get_text(' ')
blockText = blockText.replace('-'*10, '\n\n\n') # pad "underlined" lines
blockSections = [sect.strip() for sect in '\n'.join([
    l.strip('-').strip() for l in blockText.splitlines()
]).split('\n\n\n') if sect.strip()]
Although, if you looked at blockSections, you might notice that some headers [ROSTER and MEMBERS] get stuck to the end of the previous section - probably because their formatting means that an extra <br> is not needed to distinguish them from their adjacent sections. [I added the .replace('-'*10, '\n\n\n') line so that at least they're separated from the next section.]
Another risk is that I don't know if all versions and parsers will parse <br><br> to text as 3 line breaks - some omit br space entirely from text, and others might add extra space based on spaces between tags in the source html.
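To make that whitespace dependence concrete, here is a small sketch on two made-up fragments (not the real roster page). It suggests that, with html.parser, a <br> contributes no text of its own, so any line breaks in get_text() come from whitespace in the source HTML. The helper names raw_text and br_as_newlines are hypothetical; the second one replaces each <br> with an explicit newline so that splitting no longer depends on how the source happened to be formatted.

```python
from bs4 import BeautifulSoup

# Two made-up fragments with the same tags; only the source whitespace differs.
compact = '<td><font>A</font><br><font>B</font><br><br><font>C</font></td>'
spaced = '<td>\n<font>A</font><br>\n<font>B</font><br>\n<br>\n<font>C</font>\n</td>'


def raw_text(html):
    """Text exactly as get_text() returns it, with no special <br> handling."""
    return BeautifulSoup(html, 'html.parser').get_text()


def br_as_newlines(html):
    """Replace every <br> with an explicit newline before extracting text."""
    soup = BeautifulSoup(html, 'html.parser')
    for br in soup.find_all('br'):
        br.replace_with('\n')
    return soup.get_text()


print(repr(raw_text(compact)))        # no line breaks at all
print(repr(raw_text(spaced)))         # breaks come from the source whitespace
print(repr(br_as_newlines(compact)))  # one newline per <br>
```

With the <br>s turned into newlines, a pair of <br>s always shows up as two consecutive newlines in a compact fragment like the first one, so br_as_newlines(compact).split('\n\n') separates the sections.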
It's easier to split if you loop through the <br>s and pad them with something more distinctive to split by; the .insert... methods are useful here. (This method also has the advantage of being able to target bolded lines as well.)
blockSoup = soup.select_one('td:has(font)')
for br2 in blockSoup.select('br+br, font:has(br)'):
    # name the parser explicitly so bs4 does not have to guess one
    br2.insert_after(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
    br2.insert_before(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
blockSections = [
    sect.strip().strip('-').strip() for sect in
    blockSoup.get_text(' ').split("="*80) if sect.strip()
]
This time, blockSections looks something like
['Membership Roster - ACE\n AIDS CLINICAL STUDIES AND EPIDEMIOLOGY STUDY SECTION\n Center For Scientific Review\n (Terms end 6/30 of the designated year)\n ROSTER',
'CHAIRPERSON',
'SCHACKER, TIMOTHY\n W\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF MINNESOTA\n MINNEAPOLIS,\n MN\n 55455',
'MEMBERS',
'ANDERSON, JEAN\n R\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF GYNECOLOGY AND OBSTETRICS\n JOHNS HOPKINS UNIVERSITY\n BALTIMORE,\n MD 21287',
'BALASUBRAMANYAM, ASHOK\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE AND\n MOLECULAR AND CELLULAR BIOLOGY\n DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM\n BAYLOR COLLEGE OF MEDICINE\n HOUSTON,\n TX 77030',
'BLATTNER, WILLIAM\n ALBERT\n , MD,\n (15)\n PROFESSOR AND ASSOCIATE DIRECTOR\n DEPARTMENT OF MEDICNE\n INSTITUTE OF HUMAN VIROLOGY\n UNIVERSITY OF MARYLAND, BALTIMORE\n BALTIMORE,\n MD 21201',
'CHEN, YING\n QING\n , PHD,\n (15)\n PROFESSOR\n PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS\n FRED HUTCHINSON CANCER RESEARCH CENTER\n SEATTLE,\n WA 981091024',
'COTTON, DEBORAH\n , MD,\n (13)\n PROFESSOR\n SECTION OF INFECTIOUS DISEASES\n DEPARTMENT OF MEDICINE\n BOSTON UNIVERSITY\n BOSTON,\n MA 02118',
'DANIELS, MICHAEL\n J\n , SCD,\n (16)\n PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF TEXAS AT AUSTIN\n AUSTIN,\n TX 78712',
'FOULKES, ANDREA\n SARAH\n , SCD,\n (14)\n ASSOCIATE PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF MASSACHUSETTS\n AMHERST,\n MA 01003',
'HEROLD, BETSY\n C\n , MD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n ALBERT EINSTEIN COLLEGE OF MEDICINE\n BRONX,\n NY 10461',
'JUSTICE, AMY\n CAROLINE\n , MD, PHD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n YALE UNIVERSITY\n NEW HAVEN,\n CT 06520',
'KATZENSTEIN, DAVID\n ALLENBERG\n , MD,\n (13)\n PROFESSOR\n DIVISION OF INFECTIOUS DISEASES\n STANFORD UNIVERSITY SCHOOL OF MEDICINE\n STANFORD,\n CA 94305',
'MARGOLIS, DAVID\n M\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL\n CHAPEL HILL,\n NC 27599',
'MONTANER, LUIS\n J\n , DVM, PHD,\n (13)\n PROFESSOR\n DEPARTMENT OF IMMUNOLOGY\n THE WISTAR INSTITUTE\n PHILADELPHIA,\n PA 19104',
'MONTANO, MONTY\n A\n , PHD,\n (15)\n RESEARCH SCIENTIST\n DEPARTMENT OF IMMUNOLOGY AND\n INFECTIOUS DISEASES\n BOSTON UNIVERSITY\n BOSTON,\n MA 02115',
'PAGE, KIMBERLY\n , PHD, MPH,\n (16)\n PROFESSOR\n DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH\n AND GLOBAL HEALTH SCIENCES\n DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n SAN FRANCISCO,\n CA 94105',
'SHIKUMA, CECILIA\n M\n , MD,\n (15)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n HAWAII AIDS CLINICAL RESEARCH PROGRAM\n UNIVERSITY OF HAWAII\n HONOLULU,\n HI 96816',
'WOOD, CHARLES\n , PHD,\n (13)\n PROFESSOR\n UNIVERSITY OF NEBRASKA\n LINCOLN,\n NE 68588']
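If only NAME, TITLE and LOCATION are needed, section strings like the ones above can probably be split with a few string heuristics. The sketch below is not part of the original answer: parse_section is a hypothetical helper, and it assumes that the "(14)"-style term marker always closes the name, that the title is the fragment right after it, and that the city is the last fragment ending in a comma. Sections without a term marker (headers such as MEMBERS) return None.

```python
import re


def parse_section(sect):
    """Heuristically split one roster section into name, title and location."""
    parts = [p.strip() for p in sect.split('\n') if p.strip()]
    # Assumption: a fragment like "(14)" marks the end of the name line.
    term_idx = next(
        (i for i, p in enumerate(parts) if re.fullmatch(r'\(\d+\)', p)), None
    )
    if term_idx is None or term_idx + 1 >= len(parts):
        return None  # a header such as "MEMBERS", not a person
    name = ' '.join(parts[:term_idx]).replace(' ,', ',').rstrip(', ')
    title = parts[term_idx + 1]
    # Assumption: the city is the last fragment (after the name) ending in a comma.
    comma_idxs = [
        i for i, p in enumerate(parts) if i > term_idx and p.endswith(',')
    ]
    location = ' '.join(parts[comma_idxs[-1]:]) if comma_idxs else parts[-1]
    return {'name': name, 'title': title, 'location': location}


sample = ('SCHACKER, TIMOTHY\n W\n , MD,\n (14)\n PROFESSOR\n'
          ' DEPARTMENT OF MEDICINE\n UNIVERSITY OF MINNESOTA\n'
          ' MINNEAPOLIS,\n MN\n 55455')
print(parse_section(sample))
```

Applied to the whole list, something like [p for p in map(parse_section, blockSections) if p] would keep only the people. Titles that span more than one line would need extra handling.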
create a table with the columns NAME, TITLE, LOCATION
There may be a more elegant solution, but I feel like the simplest way would be to just loop the siblings of the headers and keep count of consecutive brs.
doubleBr = soup.select('br')[:2] # [ so the last person also gets added ]
personsList = []
for f in soup.select('td>font>font:has(b br)'):
    role, lCur,pCur,brCt = f.get_text(' ').strip('-').strip(), [],[],0
    for lf in f.find_next_siblings(['font','br'])+doubleBr:
        brCt = brCt+1 if lf.name == 'br' else 0
        if pCur and (brCt>1 or lf.b):
            pDets = {'role': role, 'name': '?'} # initiate
            if len(pCur)>1: pDets['title'] = pCur[1]
            pDets['name'], pCur = pCur[0], pCur[2:]
            dList = pCur[:-2]
            pDets['departments'] = dList[0] if len(dList)==1 else dList
            if len(pCur)>1: pDets['institute'] = pCur[-2]
            if pCur: pDets['location'] = pCur[-1]
            personsList.append(pDets)
            pCur, lCur, brCt = [], [], 0 # clear
        if lf.b: break # reached next section
        if lf.name == 'font': # [split and join to minimize whitespace]
            lCur.append(' '.join(lf.get_text(' ').split())) # add to line
        if brCt and lCur: pCur, lCur = pCur+[' '.join(lCur)], [] # newline
Since personsList is a list of dictionaries, it can be tabulated as simply as pandas.DataFrame(personsList) to get a DataFrame that looks like:
| role | name | title | departments | institute | location |
|---|---|---|---|---|---|
| CHAIRPERSON | SCHACKER, TIMOTHY W , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF MINNESOTA | MINNEAPOLIS, MN 55455 |
| MEMBERS | ANDERSON, JEAN R , MD | PROFESSOR | DEPARTMENT OF GYNECOLOGY AND OBSTETRICS | JOHNS HOPKINS UNIVERSITY | BALTIMORE, MD 21287 |
| MEMBERS | BALASUBRAMANYAM, ASHOK , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE AND', 'MOLECULAR AND CELLULAR BIOLOGY', 'DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM'] | BAYLOR COLLEGE OF MEDICINE | HOUSTON, TX 77030 |
| MEMBERS | BLATTNER, WILLIAM ALBERT , MD | PROFESSOR AND ASSOCIATE DIRECTOR | ['DEPARTMENT OF MEDICNE', 'INSTITUTE OF HUMAN VIROLOGY'] | UNIVERSITY OF MARYLAND, BALTIMORE | BALTIMORE, MD 21201 |
| MEMBERS | CHEN, YING QING , PHD | PROFESSOR | PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS | FRED HUTCHINSON CANCER RESEARCH CENTER | SEATTLE, WA 981091024 |
| MEMBERS | COTTON, DEBORAH , MD | PROFESSOR | ['SECTION OF INFECTIOUS DISEASES', 'DEPARTMENT OF MEDICINE'] | BOSTON UNIVERSITY | BOSTON, MA 02118 |
| MEMBERS | DANIELS, MICHAEL J , SCD | PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF TEXAS AT AUSTIN | AUSTIN, TX 78712 |
| MEMBERS | FOULKES, ANDREA SARAH , SCD | ASSOCIATE PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF MASSACHUSETTS | AMHERST, MA 01003 |
| MEMBERS | HEROLD, BETSY C , MD | PROFESSOR | DEPARTMENT OF PEDIATRICS | ALBERT EINSTEIN COLLEGE OF MEDICINE | BRONX, NY 10461 |
| MEMBERS | JUSTICE, AMY CAROLINE , MD, PHD | PROFESSOR | DEPARTMENT OF PEDIATRICS | YALE UNIVERSITY | NEW HAVEN, CT 06520 |
| MEMBERS | KATZENSTEIN, DAVID ALLENBERG , MD | PROFESSOR | DIVISION OF INFECTIOUS DISEASES | STANFORD UNIVERSITY SCHOOL OF MEDICINE | STANFORD, CA 94305 |
| MEMBERS | MARGOLIS, DAVID M , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL | CHAPEL HILL, NC 27599 |
| MEMBERS | MONTANER, LUIS J , DVM, PHD | PROFESSOR | DEPARTMENT OF IMMUNOLOGY | THE WISTAR INSTITUTE | PHILADELPHIA, PA 19104 |
| MEMBERS | MONTANO, MONTY A , PHD | RESEARCH SCIENTIST | ['DEPARTMENT OF IMMUNOLOGY AND', 'INFECTIOUS DISEASES'] | BOSTON UNIVERSITY | BOSTON, MA 02115 |
| MEMBERS | PAGE, KIMBERLY , PHD, MPH | PROFESSOR | ['DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH', 'AND GLOBAL HEALTH SCIENCES', 'DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS', 'UNIVERSITY OF CALIFORNIA, SAN FRANCISCO'] | UNIVERSITY OF CALIFORNIA, SAN FRANCISCO | SAN FRANCISCO, CA 94105 |
| MEMBERS | SHIKUMA, CECILIA M , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE', 'HAWAII AIDS CLINICAL RESEARCH PROGRAM'] | UNIVERSITY OF HAWAII | HONOLULU, HI 96816 |
| MEMBERS | WOOD, CHARLES , PHD | PROFESSOR | [] | UNIVERSITY OF NEBRASKA | LINCOLN, NE 68588 |
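Note that the departments column ends up with mixed types: a plain string for one department, a list for several, and an empty list for none. If a uniform column is preferred, one option is to flatten it before building the DataFrame. This is a minimal sketch rather than part of the original answer; sample_persons is a hand-copied subset of the rows above, and flatten_departments and the '; ' separator are arbitrary choices.

```python
import pandas as pd

# Hand-copied subset of the rows above, in the same shape as personsList.
sample_persons = [
    {'role': 'CHAIRPERSON', 'name': 'SCHACKER, TIMOTHY W , MD',
     'title': 'PROFESSOR', 'departments': 'DEPARTMENT OF MEDICINE',
     'institute': 'UNIVERSITY OF MINNESOTA',
     'location': 'MINNEAPOLIS, MN 55455'},
    {'role': 'MEMBERS', 'name': 'COTTON, DEBORAH , MD', 'title': 'PROFESSOR',
     'departments': ['SECTION OF INFECTIOUS DISEASES',
                     'DEPARTMENT OF MEDICINE'],
     'institute': 'BOSTON UNIVERSITY', 'location': 'BOSTON, MA 02118'},
    {'role': 'MEMBERS', 'name': 'WOOD, CHARLES , PHD', 'title': 'PROFESSOR',
     'departments': [], 'institute': 'UNIVERSITY OF NEBRASKA',
     'location': 'LINCOLN, NE 68588'},
]


def flatten_departments(value, sep='; '):
    """Join a list of departments into one string; leave strings untouched."""
    if isinstance(value, list):
        return sep.join(value)
    return value


# Rebuild each dict with a flattened 'departments' value, then tabulate.
rows = [
    {**p, 'departments': flatten_departments(p.get('departments', []))}
    for p in sample_persons
]
df = pd.DataFrame(rows)
print(df[['name', 'departments']])
```

An empty list becomes an empty string here; use another placeholder if that reads better in the final table.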
[ Btw, if the .select('br+br, font:has(br)') and .select('td>font>font:has(b br)') parts are unfamiliar to you, you can look up .select and CSS selectors. Combinators [like >/+/,] and pseudo-classes [like :has] allow us to get very specific with our targets. ]
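As a quick illustration of those selectors, here is a sketch on a made-up fragment that only loosely imitates the roster markup (the real page is more deeply nested). It shows what >, + and :has pick out.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a bold header followed by two "people".
html = (
    '<td><font>'
    '<font><b>MEMBERS<br></b></font>'
    '<font>ANDERSON, JEAN</font><br><br>'
    '<font>WOOD, CHARLES</font><br>'
    '</font></td>'
)
soup = BeautifulSoup(html, 'html.parser')

# '>' is the child combinator: only <font> tags directly inside the <td>.
outer_fonts = soup.select('td>font')

# '+' is the adjacent-sibling combinator: a <br> whose previous sibling
# element is also a <br>, i.e. the second <br> of a pair.
second_brs = soup.select('br+br')

# ':has(b br)' keeps only elements containing a <b> that contains a <br>,
# which singles out the bold header inside the nested <font>.
headers = soup.select('td>font>font:has(b br)')

print(len(outer_fonts), len(second_brs), [h.get_text() for h in headers])
```

One caveat: + only looks at element siblings, so plain text between two <br>s does not stop them from matching. That is not a problem in this fragment, where each line is wrapped in its own <font>.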
I am trying to parse the following data extracted with beautiful soup.
{"currentLink":"/torrance-ca-90510/","regionId":96168,"displayRegionName":"90510"}],"universities":[]},"showAttributeLinks":null}},"mapState":{"customRegionPolygonWkt":null,"schoolPolygonWkt":null,"isCurrentLocationSearch":false,"userPosition":{"lat":null,"lon":null}},"regionState":{"regionInfo":[{"regionType":6,"regionId":54722,"regionName":"Torrance","displayName":"Torrance CA","isPointRegion":false}],"regionBounds":{"north":33.887061,"east":-118.308127,"south":33.780217,"west":-118.394107}},"searchPageSeoObject":{"baseUrl":"/torrance-ca/","windowTitle":"Torrance CA Real Estate - Torrance CA Homes For Sale | Zillow","metaDescription":"Zillow has 100 homes for sale in Torrance CA. View listing photos, review sales history, and use our detailed real estate filters to find the perfect place."},"abTrials":{"SXP_HDP_CONTINGENT_V2":"ON","SXP_REGION_AUTOCOMPLETE_SOURCE":"TRULIA","RE_Move_In_Date_Filter":"TEST","SXP_SENTRY":"ON","SEOTEST__SXP_LIST_ONLY_SRP":"CONTROL","SXP_SAVE_SEARCH_COLOR":"CONTROL","SXP_REACT_FOOTER_DESKTOP":"CONTROL","RE_Web_PersonalizedSort":"CONTROL","ACQ_Search_Filters_Upsell":"CONTROL","VARIANTS_BDP_768_PLUS":"CONTROL","SXP_FLOATING_ACTION_BAR":"ON","SET_CTA_COUNTLESS":"CONTROL","SXP_LISTING_SUBTYPE":"CONTROL","SXP_Search_Refinement_Filters":"CONTROL","RE_SearchByBuildingName":"TEST","SXP_PREXIT_CLAIMS_INFO":"ENABLED","SXP_NONMLS_OFF":"CONTROL","SXP_PARTIAL_PAGE_LOAD_REFACTOR":"CONTROL","SXP_NAV_AD_LOADING":"CONTROL","ACT_FILTER_ON_LAND":"CONTROL","SXP_PHOTO_CAROUSEL":"CONTROL","SEO__SXP_REMOVE_ANCHOR_TEXT":"CONTROL","DXP_NEW_MAP_DOTS_WEB":"CONTROL","SXP_ACT_REMOVE_SEARCHBOX_GLEAM":"NO_GLEAM","SXP_Exclude_Referer":"TEST","SXP_PAGE_LOAD":"FASTER","RE_Rentals_Badging_v1":"CONTROL","SXP_REACT_GPT":"REMOVED","SXP_QU_PHASE_2":"ON","RE_HDP_REDIRECT":"CONTROL","RE_Search_Refinement_Filters":"CONTROL","MIGHTY_MONTH_2022_HOLDOUT":"MIGHTY_MONTH_ON","SXP_NEW_LANE_CLICKSTREAM":"ON","SXP_REDUCED_SERVER_SIDE_RENDER":"CONTROL","SXP_DelayJS":"AFTER_LOAD","ACQ_Bann
er_Suppression":"CONTROL","GS_RATING_CLEANUP":"CONTROL","SXP_DEFERRED_RENDERER":"ASYNC_INITIAL_HYDRATE","SXP_STREETVIEW_REQUEST_TYPE":"CONTROL","RE_RentalsHomesForYouSort":"CONTROL","SP_FOR_RENT_PAGE":"CONTROL","Activation_NewLane_Metrics_Enabled":"DISABLED","SEOTEST__SXP_REMOVE_WHY_ZILLOW":"REMOVE_WHY_ZILLOW","Activation_Enabled":"ENABLED","DXP_RTB_LINKING":"ON","DXP_PHOTO_CAROUSEL":"CONTROL","DXP_MAP_ICONS":"CONTROL","WEB_HIDDEN_HOMES_2022":"ON","SXP_Rentals_Apartment_Community_Filter":"TEST","DXP_HOMEPAGE_OMP_CLIENT_REFACTOR":"ON","SXP_WOW_LIST_CARD":"CONTROL","SXP_KF_FILTERS_AC":"ON","SXP_SDS_INTEGRATION":"USE_FOR_ALL","SXP_MOBILE_MAP_PRIORITY":"CONTROL","SXP_MAP_DOT_STYLE":"CONTROL","Activation_GA_Metrics_Enabled":"ENABLED","SXP_Pers_SimilarResults":"CONTROL","RE_RentalHomeDetailsService":"CONTROL","ACQ_SigninSRP_Module":"Variant_Module_A","DXP_CONST_PROPCARD_MAPVIEW":"CONTROL","SXP_FLYBAR_PSL_ZGSEARCH":"ON","ADS_Tagless":"Casale_On","SEOTEST__NC_H1":"ALTERNATE","SXP_FLYBAR_REGION_API":"CONTROL","RMX_3RD_PARTY_P1":"ON","RUM_VIA_PRE_ENDPOINT":"TREATMENT_OFF","DXP_CONST_PROPCARD_LISTVIEW":"ON","SXP_EVENT_MARKUP":"CONTROL","SXP_SEARCH_DISPATCHER_SERVICE":"CONTROL","SXP_DISPLAY_AD_LOADING":"CONTROL","SP_ZO_HDP_PAGE":"CONTROL","DXP_DYNAMIC_ADS":"CONTROL","DXP_MAP_DOT_COLORS":"CONTROL","DESKTOP_COMMUTE_FILTER_MVP":"CONTROL","DXP_AUTH_GATED_COLLECTIONS":"ON","RE_FR_Photo_Carousel":"CONTROL","ADT_TOP_SLOT_SRP":"ONSITE_MESSAGING","DXP_HIDE_HOME":"CONTROL","VL_BDP_SSR_QUERY":"CONTROL_CACHED","VL_BDP_NEW_TAB":"CONTROL","SXP_PREXIT_CLAIMS_CHECK":"ENABLED","SEOTEST__SXP_SEO_TEST":"CONTROL","SXP_OPEN_HOUSE_FLEX":"OPEN_HOUSE_BOOSTED","SP_FOR_SALE_PAGE":"CONTROL","SXP_NO_SRPTOGGLE":"CONTROL","SXP_PAGINATION":"LEGACY_PAGINATION","SXP_KF_FILTERS_V2":"ON","SXP_MLS_NONMLS_FILTER":"CONTROL","DXP_MULTIPLE_COLLECTIONS":"ON","RE_GuidedSearchFiltersPOC":"CONTROL","SXP_3DHOME_FILTER":"ON","SP_BUILDING_PAGE":"CONTROL","SXP_QU_MIGRATION":"ON","ACQ_MOBILE_UPSELL_SXP":"CONTROL","SXP_LIST_ON
LY_SRP":"CONTROL","RE_JanusBrainSort":"TEST_ALL_STATES","SEOTEST__RE_ForRentForSaleSRPBreadcrumbs":"CONTROL","SXP_MAKE_ME_MOVE":"REMOVED","SHO_GA_ResultsTotalEvent":"ON","DXP_MAP_DOTS_WEB":"ON","SXP_MULTIREGION_SEARCH":"CONTROL","DXP_HERO_SHORTENING":"ON","SXP_VISUAL_AUDIT_2021":"ON","SXP_HDP_BLUE_TO_RED":"ON","HDP_DESKTOP_LAYOUT_TOPNAV":"CONTROL","SXP_HEADER_TAG_WRAPPER":"ON","DXP_HOMEPAGE_OMP":"ON","SXP_Multifamily_Filter":"MULTIFAMILY_SEPARATE","SXP_FOOTER":"RESPONSIVE_REACT","SP_PAID_BUILDER_PAGE":"VIA_SHOPPER_PLATFORM","SP_OFF_MARKET_PAGE":"VIA_SHOPPER_PLATFORM","SXP_KINGFISHER_FILTERS":"P1_PHASE_1","RE_SECOND_BOOST":"SLOT_4","DXP_TG_SCHOOLS_DISABLED":"CONTROL","ADT_PROGRESSIVE_MESSAGE":"CONTROL","SXP_Combined_Filter_Apartments_Condos":"TEST","DXP_HOME_RECS":"ON","SEOTEST__SXP_REACT_FOOTER_DESKTOP":"CONTROL","SXP_3DTOUR_MAP_DOT":"ON","DXP_SEE_MORE_RECS":"SCROLL_ON"},"cat1":{"searchResults":{"listResults":[{"zpid":"21328879","id":"21328879","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/3e81a218088316bafa7b199e8dc4923f-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/20412-Wayne-Ave-Torrance-CA-90503/21328879_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$1,275,000","unformattedPrice":1275000,"address":"20412 Wayne Ave, Torrance, CA 90503","addressStreet":"20412 Wayne Ave","addressCity":"Torrance","addressState":"CA","addressZipcode":"90503","isUndisclosedAddress":false,"beds":4,"baths":2.0,"area":1890,"latLong":{"latitude":33.845936,"longitude":-118.37223},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 
12-3pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21328879,"streetAddress":"20412 Wayne Ave","zipcode":"90503","city":"Torrance","state":"CA","latitude":33.845936,"longitude":-118.37223,"price":1275000.0,"bathrooms":2.0,"bedrooms":4.0,"livingArea":1890.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":1264181,"rentZestimate":4699,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 12-3pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1275000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673726400000,"open_house_end":1673737200000},{"open_house_start":1673816400000,"open_house_end":1673827200000},{"open_house_start":1673902800000,"open_house_end":1673913600000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":171633.0,"lotAreaValue":7172.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T12:00:00","openHouseEndDate":"2023-01-14T15:00:00","openHouseDescription":"Open House - 0:00 - 3:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":1264181,"shouldShowZestimateAsPrice":false,"has3DModel":true,"hasVideo":false,"isHomeRec":false,"brokerName":"Re/Max Estate Properties","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"2060330967","id":"2060330967","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/f65a190c100bab31301becdba3cdf7cc-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/23701-S-Western-Ave-SPACE-244-Torrance-CA-90501/2060330967_zpid/","statusType":"FOR_SALE","statusText":"Home for sale","countryCurrency":"$","price":"$88,000","unformattedPrice":88000,"address":"23701 S Western Ave SPACE 244, 
Torrance, CA 90501","addressStreet":"23701 S Western Ave SPACE 244","addressCity":"Torrance","addressState":"CA","addressZipcode":"90501","isUndisclosedAddress":false,"beds":3,"baths":2.0,"area":1000,"latLong":{"latitude":33.809013,"longitude":-118.311035},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 2-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":2060330967,"streetAddress":"23701 S Western Ave SPACE 244","zipcode":"90501","city":"Torrance","state":"CA","latitude":33.809013,"longitude":-118.311035,"price":88000.0,"datePriceChanged":1673078400000,"bathrooms":2.0,"bedrooms":3.0,"livingArea":1000.0,"homeType":"MANUFACTURED","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 2-4pm","priceReduction":"$2,000 (Jan 7)","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":88000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673733600000,"open_house_end":1673740800000},{"open_house_start":1673820000000,"open_house_end":1673827200000}]},"priceChange":-2000,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","unit":"Space 244","lotAreaValue":16.3319,"lotAreaUnit":"acres"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T14:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 2:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":null,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"eXp Realty of California, 
Inc.","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21338409","id":"21338409","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/1f90c7be6ceca4a76d64d904010f0cb7-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/25924-Matfield-Dr-Torrance-CA-90505/21338409_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$1,100,000","unformattedPrice":1100000,"address":"25924 Matfield Dr, Torrance, CA 90505","addressStreet":"25924 Matfield Dr","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":4,"baths":3.0,"area":1531,"latLong":{"latitude":33.78651,"longitude":-118.334206},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21338409,"streetAddress":"25924 Matfield Dr","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.78651,"longitude":-118.334206,"price":1100000.0,"bathrooms":3.0,"bedrooms":4.0,"livingArea":1531.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":1100006,"rentZestimate":3800,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 
1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1100000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000},{"open_house_start":1673816400000,"open_house_end":1673827200000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":436335.0,"lotAreaValue":7987.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":1100006,"shouldShowZestimateAsPrice":false,"has3DModel":true,"hasVideo":false,"isHomeRec":false,"brokerName":"Equity Union","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21337140","id":"21337140","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/cb80b5736be022cbeeafb3bb77ec0e83-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/4068-Newton-St-Torrance-CA-90505/21337140_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$999,900","unformattedPrice":999900,"address":"4068 Newton St, Torrance, CA 90505","addressStreet":"4068 Newton St","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":2,"baths":2.0,"area":1268,"latLong":{"latitude":33.803745,"longitude":-118.35658},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 
1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21337140,"streetAddress":"4068 Newton St","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.803745,"longitude":-118.35658,"price":999900.0,"bathrooms":2.0,"bedrooms":2.0,"livingArea":1268.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":999907,"rentZestimate":3999,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":999900.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000},{"open_house_start":1673816400000,"open_house_end":1673827200000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":521866.0,"lotAreaValue":5050.0,"lotAreaUnit":"sqft"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":999907,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Compass","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"63093583","id":"63093583","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/8eec0c61013d143353c8893258ad1770-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/3120-Sepulveda-Blvd-UNIT-414-Torrance-CA-90505/63093583_zpid/","statusType":"FOR_SALE","statusText":"Condo for sale","countryCurrency":"$","price":"$389,000","unformattedPrice":389000,"address":"3120 Sepulveda Blvd UNIT 414, Torrance, CA 90505","addressStreet":"3120 Sepulveda Blvd UNIT 
414","addressCity":"Torrance","addressState":"CA","addressZipcode":"90505","isUndisclosedAddress":false,"beds":1,"baths":1.0,"area":526,"latLong":{"latitude":33.823784,"longitude":-118.34189},"isZillowOwned":false,"variableData":{"type":"OPEN_HOUSE","text":"Open: Sat. 1-4pm"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":63093583,"streetAddress":"3120 Sepulveda Blvd UNIT 414","zipcode":"90505","city":"Torrance","state":"CA","latitude":33.823784,"longitude":-118.34189,"price":389000.0,"bathrooms":1.0,"bedrooms":1.0,"livingArea":526.0,"homeType":"CONDO","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":389001,"rentZestimate":2084,"listing_sub_type":{"is_openHouse":true,"is_FSBA":true},"openHouse":"Sat. 1-4pm","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":389000.0,"open_house_info":{"open_house_showing":[{"open_house_start":1673730000000,"open_house_end":1673740800000}]},"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":360000.0,"unit":"Unit 414","lotAreaValue":1.0985,"lotAreaUnit":"acres"}},"isSaved":false,"hasOpenHouse":true,"openHouseStartDate":"2023-01-14T13:00:00","openHouseEndDate":"2023-01-14T16:00:00","openHouseDescription":"Open House - 1:00 - 4:00 PM","isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale (Broker)","zestimate":389001,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Re/Max Estate 
Properties","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21324245","id":"21324245","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/4c47d6593842fdf0036f1805838c1673-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/417-Paseo-De-La-Playa-Redondo-Beach-CA-90277/21324245_zpid/","statusType":"FOR_SALE","statusText":"House for sale","countryCurrency":"$","price":"$19,995,000","unformattedPrice":19995000,"address":"417 Paseo De La Playa, Redondo Beach, CA 90277","addressStreet":"417 Paseo De La Playa","addressCity":"Redondo Beach","addressState":"CA","addressZipcode":"90277","isUndisclosedAddress":false,"beds":10,"baths":15.0,"area":15728,"latLong":{"latitude":33.810413,"longitude":-118.39131},"isZillowOwned":false,"variableData":{"type":"PRICE_REDUCTION","text":"$3,000,000 (Nov 10)"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21324245,"streetAddress":"417 Paseo De La Playa","zipcode":"90277","city":"Redondo Beach","state":"CA","latitude":33.810413,"longitude":-118.39131,"price":1.9995E7,"datePriceChanged":1668067200000,"bathrooms":15.0,"bedrooms":10.0,"livingArea":15728.0,"homeType":"SINGLE_FAMILY","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":17760922,"rentZestimate":83834,"listing_sub_type":{"is_FSBA":true},"priceReduction":"$3,000,000 (Nov 10)","isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":1.9995E7,"priceChange":-3000000,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":1.2904489E7,"lotAreaValue":1.4407,"lotAreaUnit":"acres"}},"isSaved":false,"isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale 
(Broker)","zestimate":17760922,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"Douglas Elliman of California, Inc.","info6String":"Joshua Altman DRE # 01764587","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},{"zpid":"21272955","id":"21272955","providerListingId":null,"imgSrc":"https://photos.zillowstatic.com/fp/0d9ac33c6a2d1c8683ec45e7aef895a5-p_e.jpg","hasImage":true,"detailUrl":"https://www.zillow.com/homedetails/3101-Plaza-Del-Amo-UNIT-5-Torrance-CA-90503/21272955_zpid/","statusType":"FOR_SALE","statusText":"Townhouse for sale","countryCurrency":"$","price":"$835,000","unformattedPrice":835000,"address":"3101 Plaza Del Amo UNIT 5, Torrance, CA 90503","addressStreet":"3101 Plaza Del Amo UNIT 5","addressCity":"Torrance","addressState":"CA","addressZipcode":"90503","isUndisclosedAddress":false,"beds":3,"baths":3.0,"area":1446,"latLong":{"latitude":33.828743,"longitude":-118.34023},"isZillowOwned":false,"variableData":{"type":"DAYS_ON","text":"3 days on Zillow"},"badgeInfo":null,"hdpData":{"homeInfo":{"zpid":21272955,"streetAddress":"3101 Plaza Del Amo UNIT 5","zipcode":"90503","city":"Torrance","state":"CA","latitude":33.828743,"longitude":-118.34023,"price":835000.0,"bathrooms":3.0,"bedrooms":3.0,"livingArea":1446.0,"homeType":"TOWNHOUSE","homeStatus":"FOR_SALE","daysOnZillow":-1,"isFeatured":false,"shouldHighlight":false,"zestimate":852600,"rentZestimate":3499,"listing_sub_type":{"is_FSBA":true},"isUnmappable":false,"isPreforeclosureAuction":false,"homeStatusForHDP":"FOR_SALE","priceForHDP":835000.0,"isNonOwnerOccupied":true,"isPremierBuilder":false,"isZillowOwned":false,"currency":"USD","country":"USA","taxAssessedValue":788066.0,"unit":"Unit 5","lotAreaValue":5.4866,"lotAreaUnit":"acres"}},"isSaved":false,"isUserClaimingOwner":false,"isUserConfirmedClaim":false,"pgapt":"ForSale","sgapt":"For Sale 
(Broker)","zestimate":852600,"shouldShowZestimateAsPrice":false,"has3DModel":false,"hasVideo":false,"isHomeRec":false,"brokerName":"RELO REDAC, Inc.","hasAdditionalAttributions":true,"isFeaturedListing":false,"availabilityDate":null,"list":true,"relaxed":false},
From what I can tell, this data exists in a section of the BeautifulSoup object within a <script> tag. I didn't include all of the data (there's a lot), but here's an excerpt of the last tag I can find before the region I'd like to extract.
<script data-zrr-shared-data-key="mobileSearchPageStore" type="application/json"><!--{"queryState":{"mapBounds":{"north":33.887061,"south":33.780217,"east":-118.308127,"west":-118.394107},"regionSelection":[{"regionId":54722,"regionType":6}],"isMapVisible":true,"filterState":{"sortSelection":{"value":"globalrelevanceex"},"isAllHomes":{"value":true}}},"filterDefinitions":{"keywords":{"id":"keywords","shortId":"att","labels":{"default":"Keywords","tracking":"Keyword"},"sortOrder":2,"type":"String","defaultValue":{"value":""},"exposedPillEnabled":true},"isPublicSchool":{"id":"isPublicSchool","shortId":"schp","labels":{"default":"Public"},"type":"Boolean","defaultValue":{"value":true}},"isCityView":{"id":"isCityView","shortId":"cityv","labels":
Is this json data? Is there a way to extract just this part of the data?
You can try to parse the data with the json module:
import json
from bs4 import BeautifulSoup
html_doc = '''\
<script data-zrr-shared-data-key="mobileSearchPageStore" type="application/json"><!--{"currentLink":"/torrance-ca-90510/","regionId":96168}--></script>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
# get the right tag
data = soup.select_one('script[data-zrr-shared-data-key="mobileSearchPageStore"]')
# get the contents of this tag, strip the html comments
data = data.contents[0].strip('><!-')
# parse the data
data = json.loads(data)
# print the data
print(data)
Prints:
{'currentLink': '/torrance-ca-90510/', 'regionId': 96168}
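Once the JSON is parsed, extracting "just this part" is ordinary dictionary access. The sketch below uses a trimmed, hand-built sample whose values are copied from the excerpt in the question. It assumes that the cat1 key sits at the top level of the parsed object, which the excerpt suggests but does not confirm; list_results and the wanted tuple are hypothetical names chosen for illustration.

```python
import json

# Trimmed sample; the real object has many more keys per listing.
raw = '''{"cat1": {"searchResults": {"listResults": [
  {"zpid": "21328879", "price": "$1,275,000", "unformattedPrice": 1275000,
   "address": "20412 Wayne Ave, Torrance, CA 90503",
   "beds": 4, "baths": 2.0, "area": 1890},
  {"zpid": "2060330967", "price": "$88,000", "unformattedPrice": 88000,
   "address": "23701 S Western Ave SPACE 244, Torrance, CA 90501",
   "beds": 3, "baths": 2.0, "area": 1000}
]}}}'''
data = json.loads(raw)


def list_results(data, category='cat1'):
    """Return the listings list, or an empty list if any key is missing."""
    return data.get(category, {}).get('searchResults', {}).get('listResults', [])


# Keep only a few fields per listing; .get() tolerates missing keys.
wanted = ('zpid', 'address', 'unformattedPrice', 'beds', 'baths', 'area')
listings = [{k: item.get(k) for k in wanted} for item in list_results(data)]

for row in listings:
    print(row)
```

If cat1 turns out to be nested under another key in the full object, only list_results needs to change.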
I am trying to scrape all text from a web page (using Python) that comes after the first heading. The tag for that heading is: <h1 id="firstHeading" class="firstHeading" lang="en">Albert Einstein</h1>
I don't want any information before this heading; I want to scrape all the text written after it. Can I use BeautifulSoup in Python for this?
I am running the following code :
import requests
import bs4
from bs4 import BeautifulSoup
urlpage = 'https://en.wikipedia.org/wiki/Albert_Einstein#Publications'
res = requests.get(urlpage)
soup1 = (bs4.BeautifulSoup(res.text, 'lxml')).get_text()
print(soup1)
The output contains the following information:
Albert Einstein - Wikipedia
document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Albert_Einstein","wgTitle":"Albert Einstein","wgCurRevisionId":920687884,"wgRevisionId":920687884,"wgArticleId":736,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages with missing ISBNs","Webarchive template wayback links","CS1 German-language sources (de)","CS1: Julian–Gregorian uncertainty","CS1 French-language sources (fr)","CS1 errors: missing periodical","CS1: long volume value","Wikipedia indefinitely semi-protected pages","Use American English from February 2019","All Wikipedia articles written in American English","Articles with short description","Good articles","Articles containing German-language text","Biography with signature","Articles with hCards","Articles with hAudio microformats","All articles with unsourced statements",
"Articles with unsourced statements from July 2019","Commons category link from Wikidata","Articles with Wikilivres links","Articles with Curlie links","Articles with Project Gutenberg links","Articles with Internet Archive links","Articles with LibriVox links","Use dmy dates from August 2019","Wikipedia articles with BIBSYS identifiers","Wikipedia articles with BNE identifiers","Wikipedia articles with BNF identifiers","Wikipedia articles with GND identifiers","Wikipedia articles with HDS identifiers","Wikipedia articles with ISNI identifiers","Wikipedia articles with LCCN identifiers","Wikipedia articles with LNB identifiers","Wikipedia articles with MGP identifiers","Wikipedia articles with NARA identifiers","Wikipedia articles with NCL identifiers","Wikipedia articles with NDL identifiers","Wikipedia articles with NKC identifiers","Wikipedia articles with NLA identifiers","Wikipedia articles with NLA-person identifiers","Wikipedia articles with NLI identifiers",
"Wikipedia articles with NLR identifiers","Wikipedia articles with NSK identifiers","Wikipedia articles with NTA identifiers","Wikipedia articles with SBN identifiers","Wikipedia articles with SELIBR identifiers","Wikipedia articles with SNAC-ID identifiers","Wikipedia articles with SUDOC identifiers","Wikipedia articles with ULAN identifiers","Wikipedia articles with VIAF identifiers","Wikipedia articles with WorldCat-VIAF identifiers","AC with 25 elements","Wikipedia articles with suppressed authority control identifiers","Pages using authority control with parameters","Articles containing timelines","Pantheists","Spinozists","Albert Einstein","1879 births","1955 deaths","20th-century American engineers","20th-century American writers","20th-century German writers","20th-century physicists","American agnostics","American inventors","American letter writers","American pacifists","American people of German-Jewish descent","American physicists","American science writers",
"American socialists","American Zionists","Ashkenazi Jews","Charles University in Prague faculty","Corresponding Members of the Russian Academy of Sciences (1917–25)","Cosmologists","Deaths from abdominal aortic aneurysm","Einstein family","ETH Zurich alumni","ETH Zurich faculty","German agnostics","German Jews","German emigrants to Switzerland","German Nobel laureates","German inventors","German physicists","German socialists","European democratic socialists","Institute for Advanced Study faculty","Jewish agnostics","Jewish American scientists","Jewish emigrants from Nazi Germany to the United States","Jews who emigrated to escape Nazism","Jewish engineers","Jewish inventors","Jewish philosophers","Jewish physicists","Jewish socialists","Leiden University faculty","Foreign Fellows of the Indian National Science Academy","Foreign Members of the Royal Society","Members of the American Philosophical Society","Members of the Bavarian Academy of Sciences","Members of the Lincean Academy"
,"Members of the Royal Netherlands Academy of Arts and Sciences","Members of the United States National Academy of Sciences","Honorary Members of the USSR Academy of Sciences","Naturalised citizens of Austria","Naturalised citizens of Switzerland","New Jersey socialists","Nobel laureates in Physics","Patent examiners","People from Berlin","People from Bern","People from Munich","People from Princeton, New Jersey","People from Ulm","People from Zürich","People who lost German citizenship","People with acquired American citizenship","Philosophers of science","Relativity theorists","Stateless people","Swiss agnostics","Swiss emigrants to the United States","Swiss Jews","Swiss physicists","Theoretical physicists","Winners of the Max Planck Medal","World federalists","Recipients of the Pour le Mérite (civil class)","Determinists","Activists from New Jersey","Mathematicians involved with Mathematische Annalen","Intellectual Cooperation","Disease-related deaths in New Jersey"],
"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Albert_Einstein","wgRelevantArticleId":736,"wgRequestId":"XaChjApAICIAALSsYfgAAABV","wgCSPNonce":!1,"wgIsProbablyEditable":!1,"wgRelevantPageIsProbablyEditable":!1,"wgRestrictionEdit":["autoconfirmed"],"wgRestrictionMove":["sysop"],"wgMediaViewerOnClick":!0,"wgMediaViewerEnabledByDefault":!0,"wgPopupsReferencePreviews":!1,"wgPopupsConflictsWithNavPopupGadget":!1,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallbacks":"en"},"wgMFDisplayWikibaseDescriptions":{"search":!0,"nearby":!0,"watchlist":!0,"tagline":
!1},"wgWMESchemaEditAttemptStepOversample":!1,"wgULSCurrentAutonym":"English","wgNoticeProject":"wikipedia","wgWikibaseItemId":"Q937","wgCentralAuthMobileDomain":!1,"wgEditSubmitButtonLabelPublish":!0};RLSTATE={"ext.globalCssJs.user.styles":"ready","site.styles":"ready","noscript":"ready","user.styles":"ready","ext.globalCssJs.user":"ready","user":"ready","user.options":"ready","user.tokens":"loading","ext.cite.styles":"ready","ext.math.styles":"ready","mediawiki.legacy.shared":"ready","mediawiki.legacy.commonPrint":"ready","jquery.makeCollapsible.styles":"ready","mediawiki.toc.styles":"ready","wikibase.client.init":"ready","ext.visualEditor.desktopArticleTarget.noscript":"ready","ext.uls.interlanguage":"ready","ext.wikimediaBadges":"ready","ext.3d.styles":"ready","mediawiki.skinning.interface":"ready","skins.vector.styles":"ready"};RLPAGEMODULES=["ext.cite.ux-enhancements","ext.cite.tracking","ext.math.scripts","ext.scribunto.logs","site","mediawiki.page.startup",
"mediawiki.page.ready","jquery.makeCollapsible","mediawiki.toc","mediawiki.searchSuggest","ext.gadget.teahouse","ext.gadget.ReferenceTooltips","ext.gadget.watchlist-notice","ext.gadget.DRN-wizard","ext.gadget.charinsert","ext.gadget.refToolbar","ext.gadget.extra-toolbar-buttons","ext.gadget.switcher","ext.centralauth.centralautologin","mmv.head","mmv.bootstrap.autostart","ext.popups","ext.visualEditor.desktopArticleTarget.init","ext.visualEditor.targetLoader","ext.eventLogging","ext.wikimediaEvents","ext.navigationTiming","ext.uls.compactlinks","ext.uls.interface","ext.cx.eventlogging.campaigns","ext.quicksurveys.init","ext.centralNotice.geoIP","ext.centralNotice.startUp","skins.vector.js"];
(RLQ=window.RLQ||[]).push(function(){mw.loader.implement("user.tokens#tffin",function($,jQuery,require,module){/*#nomin*/mw.user.tokens.set({"patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});
});});
Albert Einstein
From Wikipedia, the free encyclopedia
Jump to navigation Jump to search "Einstein" redirects here. For other
people, see Einstein (surname). For other uses, see Albert Einstein
(disambiguation) and Einstein (disambiguation).
German-born physicist and developer of the theory of relativity
Albert EinsteinEinstein in 1921Born(1879-03-14)14 March 1879Ulm,
Kingdom of Württemberg, German EmpireDied18 April 1955(1955-04-18)
(aged 76)Princeton, New Jersey, United StatesResidenceGermany, Italy,
Switzerland, Austria (present-day Czech Republic), Belgium, United
StatesCitizenship Subject of the Kingdom of Württemberg during the
German Empire (1879–1896)[note 1] Stateless (1896–1901) Citizen of
Switzerland (1901–1955) Austrian subject of the Austro-Hungarian
Empire (1911–1912) Subject of the Kingdom of Prussia during the German
Empire (1914–1918)[note 1] German citizen of the Free State of Prussia
(Weimar Republic, 1918–1933) Citizen of the United States (1940–1955)
Education Federal polytechnic school (1896–1900; B.A., 1900)
University of Zurich (Ph.D., 1905) Known for General relativity
Special relativity Photoelectric effect E=mc2 (Mass–energy
equivalence) E=hf (Planck–Einstein relation) Theory of Brownian motion
Einstein field equations Bose–Einstein statistics Bose–Einstein
condensate Gravitational wave Cosmological constant Unified field
theory EPR paradox Ensemble interpretation List of other concepts
Spouse(s)Mileva Marić(m. 1903; div. 1919)Elsa Löwenthal(m. 1919;
died[1][2] 1936)Children"Lieserl" Einstein Hans Albert Einstein Eduard
"Tete" EinsteinAwards Barnard Medal (1920) Nobel Prize in Physics
(1921) Matteucci Medal (1921) ForMemRS (1921)[3] Copley Medal
(1925)[3] Gold Medal of the Royal Astronomical Society (1926) Max
Planck Medal (1929) Member of the National Academy of Sciences (1942)
Time Person of the Century (1999) Scientific careerFieldsPhysics,
philosophyInstitutions Swiss Patent Office (Bern) (1902–1909)
University of Bern (1908–1909) University of Zurich (1909–1911)
Charles University in Prague (1911–1912) ETH Zurich (1912–1914)
Prussian Academy of Sciences (1914–1933) Humboldt University of Berlin
(1914–1933) Kaiser Wilhelm Institute (director, 1917–1933) German
Physical Society (president, 1916–1918) Leiden University (visits,
1920) Institute for Advanced Study (1933–1955) Caltech (visits,
1931–1933) University of Oxford (visits, 1931–1933) ThesisEine neue
Bestimmung der Moleküldimensionen (A New Determination of Molecular
Dimensions) (1905)Doctoral advisorAlfred KleinerOther academic
advisorsHeinrich Friedrich WeberInfluences Arthur Schopenhauer Baruch
Spinoza Bernhard Riemann David Hume Ernst Mach Hendrik Lorentz Hermann
Minkowski Isaac Newton James Clerk Maxwell Michele Besso Moritz
Schlick Thomas Young Influenced Virtually all modern physics
Signature Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt
ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born
theoretical physicist[5] who developed the theory of relativity, one
of the two pillars of modern physics (alongside quantum
mechanics).[3][6]:274 His work is also known for its influence on the
philosophy of science.[7][8] He is best known to the general public
for his mass–energy equivalence formula . . . . .
I only want text after the first heading "Albert Einstein"
First find the h1 tag, then use find_next_siblings('div') and print the text of each result.
import requests
import bs4
urlpage = 'https://en.wikipedia.org/wiki/Albert_Einstein#Publications'
res = requests.get(urlpage)
soup1 = bs4.BeautifulSoup(res.text, 'lxml')
h1 = soup1.find('h1')
# print the text of every <div> that follows the heading at the same level
for item in h1.find_next_siblings('div'):
    print(item.text)
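Note that find_next_siblings only returns elements that share a parent with the h1, so the result depends on how the page nests its heading. A small sketch on an inline snippet shows what gets picked up. The HTML here is invented for illustration and is not Wikipedia's actual markup.

```python
from bs4 import BeautifulSoup

# Made-up page: one <div> before the heading, two after it.
html_doc = """
<body>
<div id="header">Navigation (before the heading)</div>
<h1 id="firstHeading">Albert Einstein</h1>
<div id="bodyContent"><p>German-born physicist.</p></div>
<div id="footer">Footer text</div>
</body>
"""

soup = BeautifulSoup(html_doc, "html.parser")
h1 = soup.find("h1", id="firstHeading")

# Only the <div> siblings that come after the <h1> are returned.
texts = [div.get_text(strip=True) for div in h1.find_next_siblings("div")]
print(texts)
```

The header div is skipped because it comes before the heading. If the heading sits inside a wrapper element, you would need to call find_next_siblings on that wrapper instead.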
If you want exactly the text you described, I suggest a somewhat "non-parser" approach:
cut the string directly from the response text.
Let's do this:
import bs4
import requests
urlpage = "https://en.wikipedia.org/wiki/Albert_Einstein#Publications"
my_string = """<h1 id="firstHeading" class="firstHeading" lang="en">Albert Einstein</h1>"""  # the marker to cut from
response = requests.get(urlpage).text  # full response HTML as a str
cut_response = response[response.find(my_string):]  # keep everything from the marker onward
soup1 = bs4.BeautifulSoup(cut_response, 'lxml').get_text()  # text of the cut HTML only
print(soup1)
Should work.
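One caveat: str.find returns -1 when the marker is not in the page (for example, if the heading's attributes differ from the string above), and slicing from -1 silently keeps only the last character. A minimal sketch of a guarded version follows. The function name cut_from_marker and the sample HTML are assumptions for illustration.

```python
def cut_from_marker(html, marker):
    """Return html from the first occurrence of marker onward."""
    start = html.find(marker)
    if start == -1:
        # Fail loudly instead of slicing from -1.
        raise ValueError("marker not found in HTML")
    return html[start:]


# Tiny stand-in for the downloaded page.
sample = (
    "<title>Albert Einstein - Wikipedia</title>"
    '<h1 id="firstHeading">Albert Einstein</h1><p>Body</p>'
)
marker = '<h1 id="firstHeading">'

cut = cut_from_marker(sample, marker)
print(cut)
```

The returned string can then be passed to BeautifulSoup and get_text() as in the answer above.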