I have a list of around 20,000 article titles and I want to scrape their citation counts from Google Scholar. I am new to the BeautifulSoup library. I have this code:
import requests
from bs4 import BeautifulSoup
query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
results.append({"title": entry.a.text, "url": entry.a['href']})
but it returns only the title and URL. I don't know how to get the citation information from another tag. Please help me out here.
You need to loop over the list. You can use a Session for efficiency. The code below is for bs4 4.7.1, which supports the :contains pseudo-class for finding the citation count. It looks like you can remove the h3 type selector from the CSS selector and just use the class before the a, i.e. .gs_rt a. If you don't have 4.7.1, you can use [title=Cite] + a to select the citation count instead.
import requests
from bs4 import BeautifulSoup as bs
queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
'Uncoupling conformational states from activity in an allosteric enzyme',
'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
'Primary Prevention of CVD','Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
'We Know Who Likes Us, but Not Who Competes Against Us']
with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml')  # or 'html.parser'
        title = soup.select_one('h3.gs_rt a').text if soup.select_one('h3.gs_rt a') is not None else 'No title'
        link = soup.select_one('h3.gs_rt a')['href'] if title != 'No title' else 'No link'
        citations = soup.select_one('a:contains("Cited by")').text if soup.select_one('a:contains("Cited by")') is not None else 'No citation count'
        print(title, link, citations)
The alternative for < 4.7.1.
with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml')  # or 'html.parser'
        title = soup.select_one('.gs_rt a')
        if title is None:
            title = 'No title'
            link = 'No link'
        else:
            link = title['href']
            title = title.text
        citations = soup.select_one('[title=Cite] + a')
        if citations is None:
            citations = 'No citation count'
        else:
            citations = citations.text
        print(title, link, citations)
Bottom version rewritten thanks to comments from @facelessuser. Top version left for comparison:
It would probably be more efficient to not call select_one twice in a single-line if statement. While the pattern building is cached, the returned tag is not cached. I personally would set the variable to whatever is returned by select_one and then, only if the variable is None, change it to No link or No title etc. It isn't as compact, but it will be more efficient.
[...] always check if tag is None: and not just if tag:. With selectors, it isn't a big deal as they will only return tags, but if you ever do something like for x in tag.descendants: you get text nodes (strings) and tags, and an empty string will evaluate false even though it is a valid node. In that case, it is safest to check for None.
Instead of finding all <h3> tags, I suggest you search for the tags enclosing both the <h3> and the citation (inside <div class="gs_rs">), i.e. find all <div class="gs_ri"> tags.
Then from these tags, you should be able to get all you need:
import requests
from bs4 import BeautifulSoup

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8','Uncoupling conformational states from activity in an allosteric enzyme','Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK','Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity','Primary Prevention of CVD','Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity','Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior','We Know Who Likes Us, but Not Who Competes Against Us']

results = []
for query in queries:
    url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
    content = requests.get(url).text
    page = BeautifulSoup(content, 'lxml')
    for entry in page.find_all("div", attrs={"class": "gs_ri"}):  # tag containing both the h3 and the citation
        results.append({"title": entry.h3.a.text, "url": entry.a['href'], "citation": entry.find("div", attrs={"class": "gs_rs"}).text})
Make sure you're using a user-agent, because the default requests user-agent is python-requests and Google might block your request; in that case you'll receive different HTML, with some sort of error, that doesn't contain the selectors you're trying to select. Check what your user-agent is.
It also might be a good idea to rotate user-agents while making requests.
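For example, a minimal sketch of rotating user-agents with random.choice (the pool of header strings below is purely illustrative, not an exhaustive or recommended list):
import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.3 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0",
]

def get_with_random_ua(url, **kwargs):
    # pick a different user-agent for each request
    headers = {"User-Agent": random.choice(user_agents)}
    return requests.get(url, headers=headers, **kwargs)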
Code and full example that scrapes much more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
'Uncoupling conformational states from activity in an allosteric enzyme',
'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
'Primary Prevention of CVD','Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
'We Know Who Likes Us, but Not Who Competes Against Us']
for query in queries:
    params = {
        "q": query,
        "hl": "en",
    }

    html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params).text
    soup = BeautifulSoup(html, 'lxml')

    # Container where all needed data is located
    for result in soup.select('.gs_ri'):
        title = result.select_one('.gs_rt').text
        title_link = result.select_one('.gs_rt a')['href']
        cited_by = result.select_one('#gs_res_ccl_mid .gs_nph+ a')['href']
        cited_by_count = result.select_one('#gs_res_ccl_mid .gs_nph+ a').text.split(' ')[2]

        print(f"{title}\n{title_link}\n{cited_by}\n{cited_by_count}\n")
Alternatively, you can achieve the same thing by using Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and get the data you want, rather than figuring out why certain things don't work as they should.
Code to integrate:
from serpapi import GoogleSearch
import os
import json
queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
'Uncoupling conformational states from activity in an allosteric enzyme',
'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
'Primary Prevention of CVD','Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
'We Know Who Likes Us, but Not Who Competes Against Us']
for query in queries:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": query,
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    data = []
    for result in results['organic_results']:
        data.append({
            'title': result['title'],
            'link': result['link'],
            'publication_info': result['publication_info']['summary'],
            'snippet': result['snippet'],
            'cited_by': result['inline_links']['cited_by']['link'],
            'related_versions': result['inline_links']['related_pages_link'],
        })

    print(json.dumps(data, indent=2, ensure_ascii=False))
P.S - I wrote a blog post about how to scrape pretty much everything on Google Scholar with visual representation.
Disclaimer, I work for SerpApi.
Related
Trying to scrape data from the listings, but for some reason it returns empty. Similar code has worked on other websites; I am baffled as to why it won't work on this site. Please help!
import requests
from lxml import html
start_url ="https://www.anybusiness.com.au/search?page=1&sort=date-new-old"
res = requests.get(start_url)
tree = html.fromstring(res.content)
# Retrieve listing title
title_xpath = "/html/body/div[2]/section/div[2]/div/div[2]/div[2]/div/a/div/div/text()"
title_value= tree.xpath(title_xpath)
print(title_value)
>> []
import requests
from bs4 import BeautifulSoup
from pprint import pp
def main(url):
    params = {
        "page": 1,
        "sort": "date-new-old"
    }
    r = requests.get(url, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    pp([x.get_text(strip=True)
        for x in soup.select('div[style^="position: relative; padding: 10px; "]')])

main("https://www.anybusiness.com.au/search")
Output:
['COMING SOON Bar Under Management - $250k/month ave takings',
'Mobile Windscreen and Auto Glass',
'21168 Profitable Engineering Business – Coal Mining Industry',
'Exciting Food Franchise Opportunities. Sydney.',
'Exciting Food Franchise Opportunities. Brisbane.',
'Pizza Hut Franchise Sale',
'Luxury Horse Float Business – National Opportunity! (6543)',
'Mobile Windscreen and Auto Glass Business. Offers over $220,000 WIWO',
'SIMPLE ESPRESSO BAR - HILLS DISTRICT - 00805.',
'The Ox and Hound Bistro - 1P5306',
'A new lifestyle awaits - own and operate your own business in a growing '
'market',
'Join the Fernwood Fitness Franchise Family',
'Rare Business Opportunity - Appliance Finance Business -Ipswich',
'Make a Lifestyle Change to Beautiful Bendigo/Huntly Newsagency and Store',
'Immaculately Presented Motel in a Thriving Regional City - 1P5317M']
Updated Answer:
import requests
from bs4 import BeautifulSoup
from pprint import pp
def main(url):
    params = {
        "page": 1,
        "sort": "date-new-old"
    }
    r = requests.get(url, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    pp([(x.a['href'], x.select_one('.ellipsis').get_text(strip=True))
        for x in soup.select('.basic')])

main("https://www.anybusiness.com.au/search")
Output:
[('/listings/oakleigh-east-vic-3166-retail-food-beverage-3330830',
"Portman's Continental Butcher & Deli"),
('/listings/melbourne-vic-3000-leisure-entertainment-bars-nightclubs-3330829',
'COMING SOON Bar Under Management - $250k/month ave takings'),
('/listings/north-mackay-qld-4740-industrial-manufacturing-mining-earth-moving-manufacturing-engineering-3330827',
'21168 Profitable Engineering Business – Coal Mining Industry'),
('/listings/food-hospitality-cafe-coffee-shop-3330826',
'Exciting Food Franchise Opportunities. Sydney.'),
('/listings/food-hospitality-cafe-coffee-shop-3330825',
'Exciting Food Franchise Opportunities. Brisbane.'),
('/listings/alderley-qld-4051-food-hospitality-takeaway-food-3330824',
'Pizza Hut Franchise Sale'),
('/listings/castle-hill-nsw-2154-food-hospitality-cafe-coffee-shop-takeaway-food-retail-food-beverage-3330821',
'SIMPLE ESPRESSO BAR - HILLS DISTRICT - 00805.'),
('/listings/beechworth-vic-3747-food-hospitality-restaurant-cafe-coffee-shop-retail-food-beverage-3330820',
'The Ox and Hound Bistro - 1P5306'),
('/listings/west-launceston-tas-7250-services-cleaning-home-garden-home-based-franchise-3330819',
'A new lifestyle awaits - own and operate your own business in a growing '
'market'),
('/listings/leisure-entertainment-sports-complex-gym-recreation-sport-franchise-3330818',
'Join the Fernwood Fitness Franchise Family'),
('/listings/melbourne-vic-3000-professional-finance-services-hire-rent-retail-homeware-hardware-3330817',
'Rare Business Opportunity - Appliance Finance Business -Ipswich'),
('/listings/huntly-north-vic-3551-retail-newsagency-tatts-food-hospitality-convenience-store-office-supplies-3330816',
'Make a Lifestyle Change to Beautiful Bendigo/Huntly Newsagency and Store')]
I am using BeautifulSoup to scrape
https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019
There are a total of two pages of information, and to navigate between the pages there are several links (1, 2) at the top as well as the bottom. These links use __doPostBack:
href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page$2')"
The problem is that when navigating from one page to another, the URL doesn't change; only the event argument changes, i.e. for page 1 it is Page$1 and for page 2 it is Page$2. How do I use BeautifulSoup to iterate over the pages and extract the information? The form data is as follows:
ctl00$ScriptManager1: ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2
ctl00$ContentPlaceHolder1$ddl_District: 019
ctl00$ContentPlaceHolder1$rdo_Govt_Flag: G
__EVENTTARGET: ctl00$ContentPlaceHolder1$GridView2
__EVENTARGUMENT: Page$2
There is also a variable called __VIEWSTATE in the form data, but its contents are huge.
I looked at multiple solutions and posts suggesting to inspect the parameters of the POST call and use them, but I am unable to make sense of the parameters provided in the POST.
You can use this example of how to load the next page on this site using requests:
import requests
from bs4 import BeautifulSoup
url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
def load_page(soup, page_num):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    }
    payload = {
        "ctl00$ScriptManager1": "ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2",
        "__EVENTTARGET": "ctl00$ContentPlaceHolder1$GridView2",
        "__EVENTARGUMENT": "Page${}".format(page_num),
        "__LASTFOCUS": "",
        "__ASYNCPOST": "true",
    }
    for inp in soup.select("input"):
        payload[inp["name"]] = inp.get("value")
    payload["ctl00$ContentPlaceHolder1$ddl_District"] = "019"
    payload["ctl00$ContentPlaceHolder1$rdo_Govt_Flag"] = "G"
    del payload["ctl00$ContentPlaceHolder1$chk_Available"]

    api_url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
    soup = BeautifulSoup(
        requests.post(api_url, data=payload, headers=headers).content,
        "html.parser",
    )
    return soup
# print hospitals from first page:
for h5 in soup.select("h5"):
print(h5.text)
# load second page
soup = load_page(soup, 2)
# print hospitals from second page
for h5 in soup.select("h5"):
print(h5.text)
Prints:
AMRI, Salt Lake - Vivekananda Yuba Bharati Krirangan Salt Lake Stadium (Satellite Govt. Building)
Calcutta National Medical College and Hospital (Government Hospital)
CHITTARANJAN NATIONAL CANCER INSTITUTE-CNCI (Government Hospital)
College of Medicine Sagore Dutta Hospital (Government Hospital)
ESI Hospital Maniktala (Government Hospital)
ESI Hospital Sealdah (Government Hospital)
I.D. And B.G. Hospital (Government Hospital)
M R Bangur Hospital (Government Hospital)
Medical College and Hospital, Kolkata, (Government Hospital)
Nil Ratan Sarkar Medical College and Hospital (Government Hospital)
R. G. Kar Medical College and Hospital (Government Hospital)
Sambhunath Pandit Hospital (Government Hospital)
Here I am trying to get the news from an RSS feed, and I am not getting the exact information I need.
I am using requests and BeautifulSoup to achieve this.
I have the following object.
<item>
<title>
US making very good headway in respect to Covid-19 vaccines: Donald Trump
</title>
<description>
<img border="0" hspace="10" align="left" style="margin-top:3px;margin-right:5px;" src="https://timesofindia.indiatimes.com/photo/76399892.cms" />Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.
</description>
<link>
https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms
</link>
<guid>
https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms
</guid>
<pubDate>
Mon, 15 Jun 2020 22:11:06 PT
</pubDate>
</item>
The code for this problem is here:
def timesofindiaNews():
    URL = 'https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms'
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, features='xml')
    # print(soup.prettify())
    news_elems = soup.find_all('item')
    news = []
    print(news_elems[0].prettify())
    for news_elem in news_elems:
        title = news_elem.title.text
        news_description = news_elem.description.text
        image = news_elem.description.img
        # news_date = news_elem.pubDate.text
        news_link = news_elem.link.text
I want the description from the <description> tag, but it contains more details, like the embedded <img> tag, which are not required in the description.
The above code gives the following output.
{
"image": null,
"news_description": "<img border=\"0\" hspace=\"10\" align=\"left\" style=\"margin-top:3px;margin-right:5px;\" src=\"https://timesofindia.indiatimes.com/photo/76399892.cms\" />Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.",
"news_link": "https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms",
"source": "trucknews",
"title": "US making very good headway in respect to Covid-19 vaccines: Donald Trump"
}
Expected output ===>
{
"image": "image/link/from/the/description",
"news_description": "Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.",
"news_link": "https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms",
"source": "trucknews",
"title": "US making very good headway in respect to Covid-19 vaccines: Donald Trump"
}
In the feed, the < and > characters are escaped (to &lt; and &gt;). That's why I use prettify(formatter=None) and re-parse the description to control it. Please see news_description; I think this gets you your result. You can try it:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
def timesofindiaNews():
    URL = 'https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms'
    page = requests.get(URL, headers=headers)
    soup = BeautifulSoup(page.text, 'xml')
    # print(soup.prettify())
    news_elems = soup.find_all('item')
    news = []
    # print(news_elems[0].prettify())
    for news_elem in news_elems:
        title = news_elem.title.text
        n_description = news_elem.description
        store = n_description.prettify(formatter=None)
        sp = BeautifulSoup(store, 'xml')
        news_description = sp.find("a").nextSibling
        print(news_description)
        # print(news_description)
        image = news_elem.description.img
        # news_date = news_elem.pubDate.text
        news_link = news_elem.link.text

timesofindiaNews()
output will be:
Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.
The proposed suspension could extend into the government's new fiscal year beginning October 1, when many new visas are issued, The Wall Street Journal reported on Thursday, quoting unnamed administration officials.
The team of researchers at the University of Georgia (UGA) in the US noted that the SARS-CoV-2 protein PLpro is essential for the replication and the ability of the virus to suppress host immune function.
After two weeks of protests over the death of George Floyd, hundreds of New Yorkers took to the streets again calling for reform in law enforcement and the withdrawal of police department funding.
Indian-origin California Senator Kamala Harris has joined former vice president and 2020 Democratic presidential nominee Joe Biden to raise USD 3.5 million for the upcoming November elections.
and so on....
I've written a script in Python to scrape some disorganized content located within b tags and their next_sibling from a webpage. The thing is that my script fails when line breaks come in between. I'm trying to extract the titles and their corresponding descriptions from that page, starting from CHIEF COMPLAINT: Bright red blood per rectum to just before Keywords:.
Website address
I've tried so far with:
import requests
from bs4 import BeautifulSoup
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
for item in soup.select_one("hr").find_next_siblings('b'):
print(item.text,item.next_sibling)
The portion of the output giving me unwanted results looks like:
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
How can I get the titles and their corresponding descriptions?
Here's a scraper that's more robust compared to yesterday's solutions.
How to loop through scraping multiple documents on multiple web pages using BeautifulSoup?
How can I grab the entire body text from a web page using BeautifulSoup?
It extracts the title, description and all sections properly:
import re
import copy
import requests
from bs4 import BeautifulSoup, Tag, Comment, NavigableString
from urllib.parse import urljoin
from pprint import pprint
import itertools
import concurrent
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'https://www.mtsamples.com'

def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    soup = copy.copy(soup)
    h1 = soup.select_one('h1')
    kw_re = re.compile('.*Keywords.*', flags=re.IGNORECASE)
    kw = soup.find('b', text=kw_re)
    for el in (*h1.previous_siblings, *kw.next_siblings):
        el.extract()
    kw.extract()
    for ad in soup.select('[id*="ad"]'):
        ad.extract()
    for script in soup.select('script'):
        script.extract()
    for c in h1.parent.children:
        if isinstance(c, Comment):
            c.extract()
    return h1.parent

def extract_meta(soup: BeautifulSoup) -> dict:
    h1 = soup.select_one('h1')
    title = h1.text.strip()
    desc_parts = []
    desc_re = re.compile('.*Description.*', flags=re.IGNORECASE)
    desc = soup.find('b', text=desc_re)
    hr = soup.select_one('hr')
    for s in desc.next_siblings:
        if s is hr:
            break
        if isinstance(s, NavigableString):
            desc_parts.append(str(s).strip())
        elif isinstance(s, Tag):
            desc_parts.append(s.text.strip())
    description = '\n'.join(p.strip() for p in desc_parts if p.strip())
    return {
        'title': title,
        'description': description
    }

def extract_sections(soup: BeautifulSoup) -> list:
    titles = [b for b in soup.select('b') if b.text.isupper()]
    parts = []
    for t in titles:
        title = t.text.strip(': ').title()
        text_parts = []
        for s in t.next_siblings:
            # walk forward until we see another title
            if s in titles:
                break
            if isinstance(s, Comment):
                continue
            if isinstance(s, NavigableString):
                text_parts.append(str(s).strip())
            if isinstance(s, Tag):
                text_parts.append(s.text.strip())
        text = '\n'.join(p for p in text_parts if p.strip())
        p = {
            'title': title,
            'text': text
        }
        parts.append(p)
    return parts

def extract_page(url: str) -> dict:
    soup = make_soup(url)
    clean = clean_soup(soup)
    meta = extract_meta(clean)
    sections = extract_sections(clean)
    return {
        **meta,
        'sections': sections
    }

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
page = extract_page(url)
pprint(page, width=2000)
output:
{'description': 'Status post colonoscopy. After discharge, experienced bloody bowel movements and returned to the emergency department for evaluation.\n(Medical Transcription Sample Report)',
'sections': [{'text': 'Bright red blood per rectum', 'title': 'Chief Complaint'},
# some elements removed for brevity
{'text': '', 'title': 'Labs'},
{'text': 'WBC count: 6,500 per mL\nHemoglobin: 10.3 g/dL\nHematocrit:31.8%\nPlatelet count: 248 per mL\nMean corpuscular volume: 86.5 fL\nRDW: 18%', 'title': 'Cbc'},
{'text': 'Sodium: 131 mmol/L\nPotassium: 3.5 mmol/L\nChloride: 98 mmol/L\nBicarbonate: 23 mmol/L\nBUN: 11 mg/dL\nCreatinine: 1.1 mg/dL\nGlucose: 105 mg/dL', 'title': 'Chem 7'},
{'text': 'PT 15.7 sec\nINR 1.6\nPTT 29.5 sec', 'title': 'Coagulation Studies'},
{'text': 'The patient receive ... ula.', 'title': 'Hospital Course'}],
'title': 'Sample Type / Medical Specialty: Gastroenterology\nSample Name: Blood per Rectum'}
Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = urlopen(url)
html = res.read()
soup = BeautifulSoup(html,'html.parser')
# Cut out the division containing the required text; used right-click and Inspect Element in the browser to find the respective div/tag
sampletext_div = soup.find('div', {'id': "sampletext"})

print(sampletext_div.find('h1').text)  # To print the header
Output:
Sample Type / Medical Specialty: Gastroenterology
Sample Name: Blood per Rectum
Code:
# Find all the <b> tags
b_all = sampletext_div.findAll('b')

for b in b_all[4:]:
    print(b.text, b.next_sibling)
Output:
CHIEF COMPLAINT: Bright red blood per rectum
HISTORY OF PRESENT ILLNESS: This 73-year-old woman had a recent medical history significant for renal and bladder cancer, deep venous thrombosis of the right lower extremity, and anticoagulation therapy complicated by lower gastrointestinal bleeding. Colonoscopy during that admission showed internal hemorrhoids and diverticulosis, but a bleeding site was not identified. Five days after discharge to a nursing home, she again experienced bloody bowel movements and returned to the emergency department for evaluation.
REVIEW OF SYMPTOMS: No chest pain, palpitations, abdominal pain or cramping, nausea, vomiting, or lightheadedness. Positive for generalized weakness and diarrhea the day of admission.
PRIOR MEDICAL HISTORY: Long-standing hypertension, intermittent atrial fibrillation, and hypercholesterolemia. Renal cell carcinoma and transitional cell bladder cancer status post left nephrectomy, radical cystectomy, and ileal loop diversion 6 weeks prior to presentation, postoperative course complicated by pneumonia, urinary tract infection, and retroperitoneal bleed. Deep venous thrombosis 2 weeks prior to presentation, management complicated by lower gastrointestinal bleeding, status post inferior vena cava filter placement.
MEDICATIONS: Diltiazem 30 mg tid, pantoprazole 40 mg qd, epoetin alfa 40,000 units weekly, iron 325 mg bid, cholestyramine. Warfarin discontinued approximately 10 days earlier.
ALLERGIES: Celecoxib (rash).
SOCIAL HISTORY: Resided at nursing home. Denied alcohol, tobacco, and drug use.
FAMILY HISTORY: Non-contributory.
PHYSICAL EXAM: <br/>
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
COAGULATION STUDIES: <br/>
HOSPITAL COURSE: The patient received 1 liter normal saline and diltiazem (a total of 20 mg intravenously and 30 mg orally) in the emergency department. Emergency department personnel made several attempts to place a nasogastric tube for gastric lavage, but were unsuccessful. During her evaluation, the patient was noted to desaturate to 80% on room air, with an increase in her respiratory rate to 34 breaths per minute. She was administered 50% oxygen by nonrebreadier mask, with improvement in her oxygen saturation to 89%. Computed tomographic angiography was negative for pulmonary embolism.
Keywords:
gastroenterology, blood per rectum, bright red, bladder cancer, deep venous thrombosis, colonoscopy, gastrointestinal bleeding, diverticulosis, hospital course, lower gastrointestinal bleeding, nasogastric tube, oxygen saturation, emergency department, rectum, thrombosis, emergency, department, gastrointestinal, blood, bleeding, oxygen,
NOTE : These transcribed medical transcription sample reports and examples are provided by various users and
are for reference purpose only. MTHelpLine does not certify accuracy and quality of sample reports.
These transcribed medical transcription sample reports may include some uncommon or unusual formats;
this would be due to the preference of the dictating physician. All names and dates have been
changed (or removed) to keep confidentiality. Any resemblance of any type of name or date or
place or anything else to real world is purely incidental.
I need help with a solution to search through an HTML file with Python 3 and retrieve all of the <a> links on the page, then append the grabbed link text to a dictionary with the adjacent href (URL).
This is what I've already tried.
import urllib3
import re
http = urllib3.PoolManager()
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL"
a = http.request("GET",my_url)
html = a.data
links = re.finditer(' href="?([^\s^"]+)', html)
for link in links:
print(link)
I'm getting this error...
TypeError: can't use a string pattern on a bytes-like object
Thanks for your help.
I've also tried lxml...
links = lxml.html.parse("http://www.google.co.uk/?gws_rd=ssl#q=apple+stock&tbm=nws").xpath("//a/@href")
for link in links:
print(link)
The result does not show all the links and I'm not sure why.
UPDATE:
New code =>
def news_feed(self, stock):
    http = urllib3.PoolManager()
    my_url = "https://in.finance.yahoo.com/q/h?s=" + stock
    a = http.request("GET", my_url)
    html = a.data.decode('utf-8')
    xml = fromstring(html, HTMLParser())
    a_tags = xml.xpath("//a/@href")
    xml = fromstring(html, HTMLParser())
    a_tags = xml.xpath("//table[@id='yfncsumtab']//a")
    self.paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
    pp(self.paired)
Use an HTML parser and decode the bytes as suggested. BeautifulSoup will make the job very easy, and it's a lot more reliable than a regex when parsing HTML:
http = urllib3.PoolManager()
my_url = "https://in.finance.yahoo.com/q/h?s=AAPL"
a = http.request("GET", my_url)
html = a.data.decode("utf-8")
from bs4 import BeautifulSoup
print([a["href"] for a in BeautifulSoup(html).find_all("a",href=True)])
If you only want the links starting with http, you can use a CSS select:
soup = BeautifulSoup(html)
print([a["href"] for a in soup.select("a[href^=http]")])
Which will give you:
['https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL', 'https://help.yahoo.com/l/in/yahoo/finance/', 'http://in.yahoo.com/bin/set?cmp=uheader&src=others', 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN', 'http://in.my.yahoo.com', 'https://in.yahoo.com/', 'https://in.finance.yahoo.com', 'https://in.finance.yahoo.com/investing/', 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy', 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html', 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html', 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html', 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html', 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html', 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html', 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html', 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html', 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html', 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html', 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html', 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html', 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html', 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html', 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html', 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html', 'https://in.finance.yahoo.com/news/apple-sees-first-sales-dip-011402926.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-031840725.html', 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html', 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html', 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html', 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote', 'http://www.capitaliq.com', 'http://www.csidata.com', 'http://www.morningstar.com/']
To get the text and href:
soup = BeautifulSoup(html)
a_tags = soup.select("a[href^=http]")
from pprint import pprint as pp
paired = dict((a.text, a["href"]) for a in a_tags)
pp(paired)
Output:
{u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
u'Capital IQ': 'http://www.capitaliq.com',
u'Commodity Systems, Inc. (CSI)': 'http://www.csidata.com',
u'Download the new Yahoo Mail app': 'https://in.mobile.yahoo.com/mail/',
u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
u'Help': 'https://help.yahoo.com/l/in/yahoo/finance/',
u'Mail': 'https://in.mail.yahoo.com/?.intl=in&.lang=en-IN',
u'Markets': 'https://in.finance.yahoo.com/investing/',
u'Morningstar, Inc.': 'http://www.morningstar.com/',
u'My Yahoo': 'http://in.my.yahoo.com',
u'New User? Register': 'https://edit.yahoo.com/config/eval_register?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL',
u'Report an Issue': 'https://yahoo.uservoice.com/forums/170320-india-finance/category/84926-data-accuracy',
u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
u'Sign In': 'https://login.yahoo.com/config/login?.src=quote&.intl=in&.lang=en-IN&.done=https://in.finance.yahoo.com/q/h%3fs=AAPL',
u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html',
u'Yahoo': 'https://in.yahoo.com/',
u'Yahoo India Finance': 'https://in.finance.yahoo.com',
u'other exchanges': 'http://help.yahoo.com/l/in/yahoo/finance/basics/fitadelay2.html',
u'premium service.': 'http://billing.finance.yahoo.com/realtime_quotes/signup?.src=quote&.refer=quote'}
The a[href^=http] selector means: give me all the a tags that have an href and whose href value starts with http.
Using lxml and the table id to get just the story links, which you are probably most interested in:
from lxml.etree import fromstring, HTMLParser
xml = fromstring(_html, HTMLParser())
a_tags = xml.xpath("//table[@id='yfncsumtab']//a")
paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
from pprint import pprint as pp
pp(paired)
Gives you:
{'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'}
We can do the same with our select:
soup = BeautifulSoup(_html)
a_tags = soup.select("#yfncsumtab a")
from pprint import pprint as pp
paired = dict((a.text, a["href"]) for a in a_tags)
pp(paired)
Which will match our lxml output:
{u'Alphabet overtakes Apple in market value - for now': 'https://in.finance.yahoo.com/news/alphabet-overtakes-apple-market-value-140918508.html',
u'Alphabet passes Apple to become most valuable traded U.S. company': 'https://in.finance.yahoo.com/news/alphabet-passes-apple-become-most-012844730.html',
u'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
u'Apple iPhone sales weaker than expected': 'https://in.finance.yahoo.com/news/apple-iphone-sales-weaker-expected-221908381.html',
u'Apple likely to invoke free-speech rights in encryption fight': 'https://in.finance.yahoo.com/news/apple-likely-invoke-free-speech-030713050.html',
u'Apple sees first sales dip in more than a decade as super-growth era falters': 'https://in.finance.yahoo.com/news/apple-sells-fewer-iphones-expected-213516373.html',
u'Apple shares fall most in two years in wake of earnings report': 'https://in.finance.yahoo.com/news/apple-shares-seen-staying-muted-132628994.html',
u"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
u"Bad run continues for 'Freedom 251', website down again on second day": 'https://in.finance.yahoo.com/news/bad-run-continues-freedom-251-092804374.html',
u"EXCLUSIVE - Common mobile software could have opened San Bernardino shooter's iPhone": 'https://in.finance.yahoo.com/news/common-mobile-software-could-opened-030713243.html',
u'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
u'Older Headlines': '/q/h?s=AAPL&t=2016-01-27T03:49:08+05:30',
u'Samsung Elec warns of difficult 2016 as smartphone troubles spread': 'https://in.finance.yahoo.com/news/samsung-electronics-warns-difficult-2016-003517395.html',
u'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
u"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
u'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html',
u'U.S. appeals court upholds Apple e-book settlement': 'https://in.finance.yahoo.com/news/u-appeals-court-upholds-apple-164738354.html',
u'U.S., Apple ratchet up rhetoric in fight over encryption': 'https://in.finance.yahoo.com/news/u-apple-ratchet-rhetoric-fight-030713673.html',
u'With China weakening, Apple turns to India': 'https://in.finance.yahoo.com/news/china-weakening-apple-turns-india-032020575.html'}
You could just use //*[@id='yfncsumtab']//a, as ids should be unique.
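As a minimal sketch of that simpler selector (reusing the same _html variable as in the lxml example above):
xml = fromstring(_html, HTMLParser())
# id values should be unique, so anchoring on the id alone is enough
a_tags = xml.xpath("//*[@id='yfncsumtab']//a")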
To get the first six links from the table using an XPath, we can use the ul elements and extract the first six using ul[position() < 7]:
a_tags = xml.xpath("//*[@id='yfncsumtab']//ul[position() < 7]//a")
paired = dict((a.xpath("./text()")[0].strip(), a.xpath("./@href")[0]) for a in a_tags)
from pprint import pprint as pp
pp(paired)
Which will give you:
{'Apple bets new 4-inch iPhone to draw big-screen converts in China, India': 'https://in.finance.yahoo.com/news/apple-bets-4-inch-iphone-112024627.html',
"Apple's new iPhone faces challenge measuring up in China, India": 'https://in.finance.yahoo.com/news/apples-iphone-faces-challenge-measuring-002248597.html',
'Foxconn says in talks with Sharp over deal hitch; hopes for agreement': 'https://in.finance.yahoo.com/news/sharp-shares-plunge-foxconn-puts-031032213.html',
'Samsung wins appeal in patent dispute with Apple': 'https://in.finance.yahoo.com/news/samsung-wins-appeal-patent-dispute-022120916.html',
"Signs of life for Apple's stock as Wall St eyes new iPhone": 'https://in.finance.yahoo.com/news/signs-life-apples-stock-wall-165820189.html',
'Solid support for Apple in iPhone encryption fight - Reuters/Ipsos': 'https://in.finance.yahoo.com/news/solid-support-apple-iphone-encryption-121255578.html'}
For small tables, you could also simply slice.
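As a minimal sketch of that, assuming a_tags already holds all the anchors from the table (as in the lxml example above):
first_six = a_tags[:6]  # slice instead of restricting the XPath with position()
paired = dict((a.xpath(".//text()")[0].strip(), a.xpath("./@href")[0]) for a in first_six)
pp(paired)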