Python - Scraping and classifying text in "fonts"
I would like to scrape the content of this website https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283 and create a table with the columns NAME, TITLE, LOCATION. I know some individuals have more or fewer "lines", but I am just trying to understand how I could classify even the first 3 lines for each person, given that all of the text sits between "font" tags.
So far I have:
url="https://web.archive.org/web/20130318062052/http://internet.csr.nih.gov/Roster_proto1/member_roster.asp?srg=ACE&SRGDISPLAY=ACE&CID=102283"
driver.maximize_window()
driver.implicitly_wait(5) # wait up to 3 seconds before calls to find elements time out
driver.get(url)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
column = soup.find_all("font")
But once I am there and I have all the text within "font" in my "column" variable, I don't know how to proceed to differentiate between each person and build a loop where I would retrieve name, title, location etc. for each.
Any help would be highly appreciated!
Note: instead of using selenium, I simply fetched and parsed with soup = BeautifulSoup(requests.get(url).content, "html.parser"); as far as I can tell, the required section is not dynamic, so it shouldn't cause any issues.
Would you have any idea about how to look for pairs of <br>?
Since they represent empty lines, you could try simply splitting the text in that cell by \n\n\n
blockText = soup.select_one('td:has(font)').get_text(' ')
blockText = blockText.replace('-'*10, '\n\n\n')  # pad "underlined" lines
blockSections = [sect.strip() for sect in '\n'.join([
    l.strip('-').strip() for l in blockText.splitlines()
]).split('\n\n\n') if sect.strip()]
Although, if you looked at blockSections, you might notice that some headers [ROSTER and MEMBERS] get stuck to the end of the previous section - probably because their formatting means that an extra <br> is not needed to distinguish them from their adjacent sections. [I added the .replace('-'*10, '\n\n\n') line so that at least they're separated from the next section.]
Another risk is that I don't know whether all versions and parsers will render <br><br> as 3 line breaks in the extracted text - some omit whitespace around <br> entirely, and others might add extra space depending on the whitespace between tags in the source html.
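For instance, here is a small check you could run (exact results may vary with the parser and the source formatting - the line breaks come from the whitespace around the <br> tags, not from the tags themselves):

from bs4 import BeautifulSoup

# Made-up snippet: newlines between the tags end up in the extracted text
html = 'a\n<br>\n<br>\nb'
print(repr(BeautifulSoup(html, 'html.parser').get_text()))  # 'a\n\n\nb' here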
It's easier to split if you loop through the <br>s and pad them with something more distinctive to split by; the .insert... methods are useful here. (This method also has the advantage of being able to target bolded lines as well.)
blockSoup = soup.select_one('td:has(font)')
for br2 in blockSoup.select('br+br, font:has(br)'):
    br2.insert_after(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
    br2.insert_before(BeautifulSoup(f'<p>{"="*80}</p>', 'html.parser').p)
blockSections = [
    sect.strip().strip('-').strip() for sect in
    blockSoup.get_text(' ').split("="*80) if sect.strip()
]
This time, blockSections looks something like
['Membership Roster - ACE\n AIDS CLINICAL STUDIES AND EPIDEMIOLOGY STUDY SECTION\n Center For Scientific Review\n (Terms end 6/30 of the designated year)\n ROSTER',
'CHAIRPERSON',
'SCHACKER, TIMOTHY\n W\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF MINNESOTA\n MINNEAPOLIS,\n MN\n 55455',
'MEMBERS',
'ANDERSON, JEAN\n R\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF GYNECOLOGY AND OBSTETRICS\n JOHNS HOPKINS UNIVERSITY\n BALTIMORE,\n MD 21287',
'BALASUBRAMANYAM, ASHOK\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE AND\n MOLECULAR AND CELLULAR BIOLOGY\n DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM\n BAYLOR COLLEGE OF MEDICINE\n HOUSTON,\n TX 77030',
'BLATTNER, WILLIAM\n ALBERT\n , MD,\n (15)\n PROFESSOR AND ASSOCIATE DIRECTOR\n DEPARTMENT OF MEDICNE\n INSTITUTE OF HUMAN VIROLOGY\n UNIVERSITY OF MARYLAND, BALTIMORE\n BALTIMORE,\n MD 21201',
'CHEN, YING\n QING\n , PHD,\n (15)\n PROFESSOR\n PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS\n FRED HUTCHINSON CANCER RESEARCH CENTER\n SEATTLE,\n WA 981091024',
'COTTON, DEBORAH\n , MD,\n (13)\n PROFESSOR\n SECTION OF INFECTIOUS DISEASES\n DEPARTMENT OF MEDICINE\n BOSTON UNIVERSITY\n BOSTON,\n MA 02118',
'DANIELS, MICHAEL\n J\n , SCD,\n (16)\n PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF TEXAS AT AUSTIN\n AUSTIN,\n TX 78712',
'FOULKES, ANDREA\n SARAH\n , SCD,\n (14)\n ASSOCIATE PROFESSOR\n DEPARTMENT OF BIOSTATISTICS\n UNIVERSITY OF MASSACHUSETTS\n AMHERST,\n MA 01003',
'HEROLD, BETSY\n C\n , MD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n ALBERT EINSTEIN COLLEGE OF MEDICINE\n BRONX,\n NY 10461',
'JUSTICE, AMY\n CAROLINE\n , MD, PHD,\n (16)\n PROFESSOR\n DEPARTMENT OF PEDIATRICS\n YALE UNIVERSITY\n NEW HAVEN,\n CT 06520',
'KATZENSTEIN, DAVID\n ALLENBERG\n , MD,\n (13)\n PROFESSOR\n DIVISION OF INFECTIOUS DISEASES\n STANFORD UNIVERSITY SCHOOL OF MEDICINE\n STANFORD,\n CA 94305',
'MARGOLIS, DAVID\n M\n , MD,\n (14)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL\n CHAPEL HILL,\n NC 27599',
'MONTANER, LUIS\n J\n , DVM, PHD,\n (13)\n PROFESSOR\n DEPARTMENT OF IMMUNOLOGY\n THE WISTAR INSTITUTE\n PHILADELPHIA,\n PA 19104',
'MONTANO, MONTY\n A\n , PHD,\n (15)\n RESEARCH SCIENTIST\n DEPARTMENT OF IMMUNOLOGY AND\n INFECTIOUS DISEASES\n BOSTON UNIVERSITY\n BOSTON,\n MA 02115',
'PAGE, KIMBERLY\n , PHD, MPH,\n (16)\n PROFESSOR\n DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH\n AND GLOBAL HEALTH SCIENCES\n DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n UNIVERSITY OF CALIFORNIA, SAN FRANCISCO\n SAN FRANCISCO,\n CA 94105',
'SHIKUMA, CECILIA\n M\n , MD,\n (15)\n PROFESSOR\n DEPARTMENT OF MEDICINE\n HAWAII AIDS CLINICAL RESEARCH PROGRAM\n UNIVERSITY OF HAWAII\n HONOLULU,\n HI 96816',
'WOOD, CHARLES\n , PHD,\n (13)\n PROFESSOR\n UNIVERSITY OF NEBRASKA\n LINCOLN,\n NE 68588']
As for the main goal - "create a table with the columns NAME, TITLE, LOCATION" - there may be a more elegant solution, but I feel like the simplest way would be to just loop over the siblings of the headers and keep count of consecutive <br>s.
doubleBr = soup.select('br')[:2]  # appended so the last person also gets added
personsList = []
for f in soup.select('td>font>font:has(b br)'):
    role, lCur, pCur, brCt = f.get_text(' ').strip('-').strip(), [], [], 0
    for lf in f.find_next_siblings(['font', 'br']) + doubleBr:
        brCt = brCt + 1 if lf.name == 'br' else 0
        if pCur and (brCt > 1 or lf.b):  # blank line or bold header -> person done
            pDets = {'role': role, 'name': '?'}  # initiate
            if len(pCur) > 1: pDets['title'] = pCur[1]
            pDets['name'], pCur = pCur[0], pCur[2:]
            dList = pCur[:-2]
            pDets['departments'] = dList[0] if len(dList) == 1 else dList
            if len(pCur) > 1: pDets['institute'] = pCur[-2]
            if pCur: pDets['location'] = pCur[-1]
            personsList.append(pDets)
            pCur, lCur, brCt = [], [], 0  # clear
        if lf.b: break  # reached next section
        if lf.name == 'font':  # [split and join to minimize whitespace]
            lCur.append(' '.join(lf.get_text(' ').split()))  # add to line
        if brCt and lCur: pCur, lCur = pCur + [' '.join(lCur)], []  # newline
Since personsList is a list of dictionaries, it can be tabulated as simply as pandas.DataFrame(personsList) to get a DataFrame that looks like:
| role | name | title | departments | institute | location |
|---|---|---|---|---|---|
| CHAIRPERSON | SCHACKER, TIMOTHY W , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF MINNESOTA | MINNEAPOLIS, MN 55455 |
| MEMBERS | ANDERSON, JEAN R , MD | PROFESSOR | DEPARTMENT OF GYNECOLOGY AND OBSTETRICS | JOHNS HOPKINS UNIVERSITY | BALTIMORE, MD 21287 |
| MEMBERS | BALASUBRAMANYAM, ASHOK , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE AND', 'MOLECULAR AND CELLULAR BIOLOGY', 'DIVISION OF DIABETES, ENDOCRINOLOGY AND METABOLISM'] | BAYLOR COLLEGE OF MEDICINE | HOUSTON, TX 77030 |
| MEMBERS | BLATTNER, WILLIAM ALBERT , MD | PROFESSOR AND ASSOCIATE DIRECTOR | ['DEPARTMENT OF MEDICNE', 'INSTITUTE OF HUMAN VIROLOGY'] | UNIVERSITY OF MARYLAND, BALTIMORE | BALTIMORE, MD 21201 |
| MEMBERS | CHEN, YING QING , PHD | PROFESSOR | PROGRAM IN BIOSTATISTICS AND BIOMATHEMATICS | FRED HUTCHINSON CANCER RESEARCH CENTER | SEATTLE, WA 981091024 |
| MEMBERS | COTTON, DEBORAH , MD | PROFESSOR | ['SECTION OF INFECTIOUS DISEASES', 'DEPARTMENT OF MEDICINE'] | BOSTON UNIVERSITY | BOSTON, MA 02118 |
| MEMBERS | DANIELS, MICHAEL J , SCD | PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF TEXAS AT AUSTIN | AUSTIN, TX 78712 |
| MEMBERS | FOULKES, ANDREA SARAH , SCD | ASSOCIATE PROFESSOR | DEPARTMENT OF BIOSTATISTICS | UNIVERSITY OF MASSACHUSETTS | AMHERST, MA 01003 |
| MEMBERS | HEROLD, BETSY C , MD | PROFESSOR | DEPARTMENT OF PEDIATRICS | ALBERT EINSTEIN COLLEGE OF MEDICINE | BRONX, NY 10461 |
| MEMBERS | JUSTICE, AMY CAROLINE , MD, PHD | PROFESSOR | DEPARTMENT OF PEDIATRICS | YALE UNIVERSITY | NEW HAVEN, CT 06520 |
| MEMBERS | KATZENSTEIN, DAVID ALLENBERG , MD | PROFESSOR | DIVISION OF INFECTIOUS DISEASES | STANFORD UNIVERSITY SCHOOL OF MEDICINE | STANFORD, CA 94305 |
| MEMBERS | MARGOLIS, DAVID M , MD | PROFESSOR | DEPARTMENT OF MEDICINE | UNIVERSITY OF NORTH CAROLINA, CHAPEL HILL | CHAPEL HILL, NC 27599 |
| MEMBERS | MONTANER, LUIS J , DVM, PHD | PROFESSOR | DEPARTMENT OF IMMUNOLOGY | THE WISTAR INSTITUTE | PHILADELPHIA, PA 19104 |
| MEMBERS | MONTANO, MONTY A , PHD | RESEARCH SCIENTIST | ['DEPARTMENT OF IMMUNOLOGY AND', 'INFECTIOUS DISEASES'] | BOSTON UNIVERSITY | BOSTON, MA 02115 |
| MEMBERS | PAGE, KIMBERLY , PHD, MPH | PROFESSOR | ['DIVISION OF PREVENTIVE MEDICINE AND PUBLIC HEALTH', 'AND GLOBAL HEALTH SCIENCES', 'DEPARTMENT OF EPIDEMIOLOGY AND BIOSTATICTICS', 'UNIVERSITY OF CALIFORNIA, SAN FRANCISCO'] | UNIVERSITY OF CALIFORNIA, SAN FRANCISCO | SAN FRANCISCO, CA 94105 |
| MEMBERS | SHIKUMA, CECILIA M , MD | PROFESSOR | ['DEPARTMENT OF MEDICINE', 'HAWAII AIDS CLINICAL RESEARCH PROGRAM'] | UNIVERSITY OF HAWAII | HONOLULU, HI 96816 |
| MEMBERS | WOOD, CHARLES , PHD | PROFESSOR | [] | UNIVERSITY OF NEBRASKA | LINCOLN, NE 68588 |
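To get from there to exactly the NAME, TITLE, LOCATION table the question asked for, a minimal sketch (assuming the lower-case keys produced by the loop above):

import pandas as pd

df = pd.DataFrame(personsList)
# keep only the requested columns, renamed to match the question
table = df[['name', 'title', 'location']].rename(columns=str.upper)
print(table.head())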
[ Btw, if the .select('br+br, font:has(br)') and .select('td>font>font:has(b br)') parts are unfamiliar to you, you can look up .select and CSS selectors. Combinators [like >/+/,] and pseudo-classes [like :has] allow us to get very specific with our targets. ]
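[ A tiny self-contained illustration of those two selector patterns, with made-up HTML; :has requires the soupsieve backend bundled with modern BeautifulSoup: ]

from bs4 import BeautifulSoup

html = '<td><font><b>HEADER</b><br></font><font>line 1</font><br><br><font>line 2</font></td>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('br+br'))        # a <br> immediately following another <br>
print(soup.select('font:has(b)'))  # <font> tags that contain a <b>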
Related
Beautiful Soup 4 issue with multiple data fetching; it is confusing me
When I fetch one piece of data it works fine, as in the code below. But whenever I try to find all data with similar tagging (example - {'class': 'doctor-name'}) the output is none.

Single tag output:

from bs4 import BeautifulSoup

s = """
<a class="doctor-name" itemprop="name" href="/doctors/gastroenterologists/dr-isaac-raijman-md-1689679557">Dr. Isaac Raijman, MD</a>
"""
soup = BeautifulSoup(s, 'html.parser')
print(soup.find('a', {'class': 'doctor-name'}).text)
print(soup.find('a', {'itemprop': 'name'}).text)

Output - [Dr. Isaac Raijman, MD, Dr. Isaac Raijman, MD]

Finding all using similar tagging, but the output is none:

import requests
from bs4 import BeautifulSoup

url = "https://soandso.org/doctors/gastroenterologists"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
lists = soup.find_all('section', attrs={'class': 'search-page find-a-doctor'})
for lst in lists:
    doctor = lst.find('a', attrs={'class': 'doctor-name'})  # .text
    info = [doctor]
    print(info)

Output - none

Please help me to solve this issue; sharing your understanding as code with # comments is also fine.
That information is built up by the browser and is not returned in the HTML. An easier approach is to request it from the JSON API as follows:

import requests

headers = {'Authorization': 'eyJhbGciOiJodHRwOi8vd3d3LnczLm9yZy8yMDAxLzA0L3htbGRzaWctbW9yZSNobWFjLXNoYTI1NiIsInR5cCI6IkpXVCJ9.eyJodHRwOi8vc2NoZW1hcy54bWxzb2FwLm9yZy93cy8yMDA1LzA1L2lkZW50aXR5L2NsYWltcy9uYW1lIjoiYWRtaW4iLCJleHAiOjIxMjcwNDQ1MTcsImlzcyI6Imh0dHBzOi8vZGV2ZWxvcGVyLmhlYWx0aHBvc3QuY29tIiwiYXVkIjoiaHR0cHM6Ly9kZXZlbG9wZXIuaGVhbHRocG9zdC5jb20ifQ.zNvR3WpI17CCMC7rIrHQCrnJg_6qGM21BvTP_ed_Hj8'}
json_post = {"query": "", "start": 0, "rows": 10,
             "selectedFilters": {"availability": [], "clinicalInterest": [], "distance": [20],
                                 "gender": ["Both"], "hasOnlineScheduling": False, "insurance": [],
                                 "isMHMG": False, "language": [], "locationType": [],
                                 "lonlat": [-95.36, 29.76], "onlineScheduling": ["Any"],
                                 "specialty": ["Gastroenterology"]}}

req = requests.post("https://api.memorialhermann.org/api/doctorsearch", json=json_post, headers=headers)
data = req.json()

for doctor in data['docs']:
    print(f"{doctor['Name']:30} {doctor['PrimarySpecialty']:20} {doctor['PrimaryFacility']}")

Giving you:

Dr. Isaac Raijman, MD          Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Gabriel Lee, MD            Gastroenterology     Memorial Hermann Southeast Hospital
Dr. Dang Nguyen, MD            Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Harshinie Amaratunge, MD   Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Tanima Jana, MD            Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Tugrul Purnak, MD          Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Dimpal Bhakta, MD          Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Dharmendra Verma, MD       Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Jennifer Shroff, MD        Gastroenterology     Memorial Hermann Texas Medical Center
Dr. Brooks Cash, MD            Gastroenterology     Memorial Hermann Texas Medical Center
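As a side note, the payload's start and rows keys look like standard paging parameters; if the endpoint behaves the way such keys usually do (an assumption - not verified here), you could page through more results like this:

# Assumption: the endpoint pages via "start"/"rows" like a typical search API
for start in range(0, 50, 10):
    json_post['start'] = start
    page = requests.post("https://api.memorialhermann.org/api/doctorsearch",
                         json=json_post, headers=headers).json()
    for doctor in page['docs']:
        print(doctor['Name'])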
How to scrape hidden class data using selenium and beautiful soup
I'm trying to scrape JavaScript-enabled web page content. I need to extract the data in the table on that website; however, each row of the table has a button (arrow) which reveals additional information for that row, and I need to extract that additional description for each row. By inspecting, I observed that the contents behind those arrows all belong to the same class, but that class is hidden in the source code: it can be observed only while inspecting. I have used selenium and beautiful soup. I'm able to scrape the data of the table but not the content behind those arrows; Python returns an empty list for the arrow class, although it works for the class of normal table data.

from bs4 import BeautifulSoup
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://projects.sfchronicle.com/2020/layoff-tracker/')
html_source = browser.page_source
soup = BeautifulSoup(html_source, 'html.parser')
data = soup.find_all('div', class_="sc-fzoLsD jxXBhc rdt_ExpanderRow")
print(data.text)
To print hidden data, you can use this example:

import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://projects.sfchronicle.com/2020/layoff-tracker/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data_url = 'https://projects.sfchronicle.com' + soup.select_one('link[href*="commons-"]')['href']
data = re.findall(r'n\.exports=JSON\.parse\(\'(.*?)\'\)', requests.get(data_url).text)[1]
data = json.loads(data.replace(r"\'", "'"))

# uncomment this to see all data:
# print(json.dumps(data, indent=4))

for d in data[4:]:
    print('{:<50}{:<10}{:<30}{:<30}{:<30}{:<30}{:<30}'.format(*d.values()))

Prints:

Company                                           Layoffs   City            County                  Month   Industry        Company description
Tesla (Temporary layoffs. Factory reopened)       11083     Fremont         Alameda County          April   Industrial      Car maker
Bon Appetit Management Co.                        3015      San Francisco   San Francisco County    April   Food            Food supplier
GSW Arena LLC-Chase Center                        1720      San Francisco   San Francisco County    May     Sports          Arena vendors
YMCA of Silicon Valley                            1657      Santa Clara     Santa Clara County      May     Sports          Gym
Nutanix Inc. (Temporary furlough of 2 weeks)      1434      San Jose        Santa Clara County      April   Tech            Cloud computing
TeamSanJose                                       1304      San Jose        Santa Clara County      April   Travel          Tourism bureau
San Francisco Giants                              1200      San Francisco   San Francisco County    April   Sports          Stadium vendors
Lyft                                              982       San Francisco   San Francisco County    April   Tech            Ride hailing
YMCA of San Francisco                             959       San Francisco   San Francisco County    May     Sports          Gym
Hilton San Francisco Union Square                 923       San Francisco   San Francisco County    April   Travel          Hotel
Six Flags Discovery Kingdom                       911       Vallejo         Solano County           June    Entertainment   Amusement park
San Francisco Marriott Marquis                    808       San Francisco   San Francisco County    April   Travel          Hotel
Aramark                                           777       Oakland         Alameda County          April   Food            Food supplier
The Palace Hotel                                  774       San Francisco   San Francisco County    April   Travel          Hotel
Back of the House Inc                             743       San Francisco   San Francisco County    April   Food            Restaurant
DPR Construction                                  715       Redwood City    San Mateo County        April   Real estate     Construction

...and so on.
The content you are interested in is generated when you click a button, so you would want to locate the button. There are a million ways you could do this, but I would suggest something like:

from selenium.webdriver.common.by import By

elements = driver.find_elements(By.XPATH, '//button')

For your specific case you could also use:

elements = driver.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]')

Once you have a button element, you can then do:

element.click()

Parsing the page after this should get you the JavaScript-generated content you are looking for.
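Putting that together with the question's code, a minimal sketch (untested against the live page; you may need explicit waits or scrolling before clicking):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('https://projects.sfchronicle.com/2020/layoff-tracker/')

# expand every row so its hidden description gets rendered into the DOM
for button in driver.find_elements(By.CSS_SELECTOR, 'button[class|="sc"]'):
    button.click()

soup = BeautifulSoup(driver.page_source, 'html.parser')
for row in soup.find_all('div', class_='sc-fzoLsD jxXBhc rdt_ExpanderRow'):
    print(row.get_text(' ', strip=True))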
How to scrape all information on a web page after the id = "firstheading" in python?
I am trying to scrape all text from a web page (using Python) that comes after the first heading. The tag for that heading is:

<h1 id="firstHeading" class="firstHeading" lang="en">Albert Einstein</h1>

I don't want any information before this heading; I want to scrape all text written after it. Can I use BeautifulSoup in Python for this? I am running the following code:

import requests
import bs4
from bs4 import BeautifulSoup

urlpage = 'https://en.wikipedia.org/wiki/Albert_Einstein#Publications'
res = requests.get(urlpage)
soup1 = (bs4.BeautifulSoup(res.text, 'lxml')).get_text()
print(soup1)

The web page has the following information:

Albert Einstein - Wikipedia document.documentElement.className="client-js";RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Albert_Einstein","wgTitle":"Albert Einstein","wgCurRevisionId":920687884, ... [several thousand more characters of page scripts, category lists, and infobox text] ... Albert Einstein From Wikipedia, the free encyclopedia Jump to navigation Jump to search "Einstein" redirects here. For other people, see Einstein (surname). For other uses, see Albert Einstein (disambiguation) and Einstein (disambiguation). ... Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne; German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for its influence on the philosophy of science. He is best known to the general public for his mass–energy equivalence formula . . . . .

I only want text after the first heading "Albert Einstein".
First find the h1 tag, then use find_next_siblings('div') and print the text value.

import requests
import bs4

urlpage = 'https://en.wikipedia.org/wiki/Albert_Einstein#Publications'
res = requests.get(urlpage)
soup1 = bs4.BeautifulSoup(res.text, 'lxml')
h1 = soup1.find('h1')
for item in h1.find_next_siblings('div'):
    print(item.text)
If you do want to get the text as described, I suggest a bit of a "non-parser" way: cutting the string directly from the response text. Let's do this:

import requests
import bs4

urlpage = "https://en.wikipedia.org/wiki/Albert_Einstein#Publications"
# define the string you want to cut from
my_string = """<h1 id="firstHeading" class="firstHeading" lang="en">Albert Einstein</h1>"""

response = requests.get(urlpage).text                # get the full response html as str
cut_response = response[response.find(my_string):]   # cut the str from your string on
soup1 = (bs4.BeautifulSoup(cut_response, 'lxml')).get_text()  # soup object, but of the cut string
print(soup1)

Should work.
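One caveat worth knowing: str.find returns -1 when the substring is missing, which would silently slice from the last character; a small guard makes that failure explicit:

idx = response.find(my_string)
if idx == -1:
    raise ValueError('heading not found in page')  # str.find returns -1 on no match
cut_response = response[idx:]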
Python and Selenium: how to pull the data from the web text which has no id, class?
I have a website to pull information from, for example http://www.worldhospitaldirectory.com/alaska-native-medical-center/info/8500. I need to pull the information and save it into a CSV file. For example:

Category: General Hospitals
Name: Alaska Native Medical Center
Address: 4315 Diplomacy Drive
Phone: (907) 563-2662
City: Anchorage
State: Alaska

But the problem is that I cannot locate this information. The web code is as below:

<b>Category:</b> General Hospitals
<br>
<b>Address:</b> 4315 Diplomacy Drive
<br>
<b>Subcontinent and Continent:</b> North America, America
<br>

Please give me some suggestions or code to help me get that text.
import requests
import bs4

r = requests.get('http://www.worldhospitaldirectory.com/alaska-native-medical-center/info/8500')
soup = bs4.BeautifulSoup(r.text, 'lxml')
start = soup.find('em')
for b in start.find_next_siblings('b'):
    print(b.text, b.next_sibling.strip())

out:

Category: General Hospitals
Address: 4315 Diplomacy Drive
Subcontinent and Continent: North America , America
Country: United States
Phone (907) 563-2662
Website:
City:
State:
Email:
Latitude: 61.1827
Longitude: -149.80009
Zip Code: 99508
Contact Address: 4315 Diplomacy Dr, Anchorage, AK 99508, United States
Latitude in Degree, Minute, Second [Direction]: 61° 10' 57" N
Want to store variable names in list, not said variable's contents
Sorry if the title is confusing; let me explain. I've written a program that categorizes emails by topic using nltk and tools from sklearn. Here is that code:

# Extract emails
tech = extract_message("C:\\Users\\Cody\\Documents\\Emails\\tech.html")
gary = extract_message("C:\\Users\\Cody\\Documents\\Emails\\gary.html")
gary2 = extract_message("C:\\Users\\Cody\\Documents\\Emails\\gary2.html")
jesus = extract_message("C:\\Users\\Cody\\Documents\\Emails\\Jesus.html")
jesus2 = extract_message("C:\\Users\\Cody\\Documents\\Emails\\jesus2.html")
hockey = extract_message("C:\\Users\\Cody\\Documents\\Emails\\hockey.html")
hockey2 = extract_message("C:\\Users\\Cody\\Documents\\Emails\\hockey2.html")
shop = extract_message("C:\\Users\\Cody\\Documents\\Emails\\shop.html")

# Build dictionary of features
count_vect = CountVectorizer()
x_train_counts = count_vect.fit_transform(news.data)

# Downscaling
tfidf_transformer = TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)
tf_transformer = TfidfTransformer(use_idf=False).fit(x_train_counts)
x_train_tf = tf_transformer.transform(x_train_counts)

# Train classifier
clf = MultinomialNB().fit(x_train_tfidf, news.target)

# List of the extracted emails
docs_new = [gary, gary2, jesus, jesus2, shop, tech, hockey, hockey2]

# Extract features from emails
x_new_counts = count_vect.transform(docs_new)
x_new_tfidf = tfidf_transformer.transform(x_new_counts)

# Predict the categories for each email
predicted = clf.predict(x_new_tfidf)

Now I'm looking to store each variable in an appropriate list, based on the predicted label. I figured I could do that with:

# Store files in a category
hockey_emails = []
computer_emails = []
politics_emails = []
tech_emails = []
religion_emails = []
forsale_emails = []

# Print out results and store each email in the appropriate category list
for doc, category in zip(docs_new, predicted):
    print('%r ---> %s' % (doc, news.target_names[category]))
    if news.target_names[category] == 'comp.sys.ibm.pc.hardware':
        computer_emails.append(doc)
    if news.target_names[category] == 'rec.sport.hockey':
        hockey_emails.append(doc)
    if news.target_names[category] == 'talk.politics.misc':
        politics_emails.append(doc)
    if news.target_names[category] == 'soc.religion.christian':
        religion_emails.append(doc)
    if news.target_names[category] == 'misc.forsale':
        forsale_emails.append(doc)

My output, if I were to print out one of these lists (hockey, for instance), displays the contents stored in the variable rather than the variable itself.
I want this:

print(hockey_emails)
output: ['hockey', 'hockey2']

but instead I'm getting this:

output: ['View View online click here Hi Thanks for signing up as a EA SPORTS NHL insider You ll now receive all of the latest and greatest news and info at this e mail address as you ve requested EA com If you need technical assistance please contact EA Help ... ', 'View News From The Hockey Writers The Editor s Choice stories from The Hockey Writers View this email in your browser edition Recap Stars Steamroll Predators ... ']

I figured this would be simple, but I'm sitting here scratching my head. Is this even possible? Should I use something else instead of a list? This is probably simple; I'm just blanking.
You have to keep track of the names yourself; Python won't do it for you.

names = 'gary gary2 Jesus jesus2 shop tech hockey hockey2'.split()
docs_new = [extract_message("C:\\Users\\Cody\\Documents\\Emails\\%s.html" % name)
            for name in names]

for name, category in zip(names, predicted):
    print('%r ---> %s' % (name, news.target_names[category]))
    if news.target_names[category] == 'comp.sys.ibm.pc.hardware':
        computer_emails.append(name)
Don't do this. Use a dictionary to hold your collection of emails, and you can print the dictionary keys when you want to know what is what.

docs_new = dict()
docs_new["tech"] = extract_message("C:\\Users\\Cody\\Documents\\Emails\\tech.html")
docs_new["gary"] = extract_message("C:\\Users\\Cody\\Documents\\Emails\\gary.html")
# etc.

When you iterate over the dictionary, you'll see the keys:

for doc, category in zip(docs_new, predicted):
    print('%s ---> %s' % (doc, news.target_names[category]))

(More dictionary basics: to iterate over dict values, replace docs_new above with docs_new.values(); or use docs_new.items() for both keys and values.)
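For example, with made-up contents:

docs_new = {'tech': 'email text ...', 'hockey': 'longer email text ...'}

for name in docs_new:                  # iterates over keys
    print(name)
for name, text in docs_new.items():    # keys and values together
    print(name, len(text))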