Python XPATH doesn't return any data - python

Trying to scrape data from the listings, but for some reason it returns empty. Similar code has worked on other websites, and I am baffled as to why it won't work on this site. Please help!
import requests
from lxml import html
start_url ="https://www.anybusiness.com.au/search?page=1&sort=date-new-old"
res = requests.get(start_url)
tree = html.fromstring(res.content)
# Retrieve listing title
title_xpath = "/html/body/div[2]/section/div[2]/div/div[2]/div[2]/div/a/div/div/text()"
title_value= tree.xpath(title_xpath)
print(title_value)
>> []
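Absolute XPaths copied from the browser inspector break as soon as the served markup differs even slightly from what the inspector shows. If the page is served statically (as the BeautifulSoup answer below suggests), a relative XPath keyed on class names is more resilient. A minimal sketch, assuming the listing cards use the basic and ellipsis classes referenced in the updated answer:

import requests
from lxml import html

start_url = "https://www.anybusiness.com.au/search?page=1&sort=date-new-old"
res = requests.get(start_url)
tree = html.fromstring(res.content)

# Relative XPath keyed on class names (hypothetical - adjust to the real markup)
title_xpath = '//div[contains(@class, "basic")]//div[contains(@class, "ellipsis")]/text()'
titles = [t.strip() for t in tree.xpath(title_xpath) if t.strip()]
print(titles)

If the class names are unknown, switching to BeautifulSoup with CSS selectors (below) avoids the brittle absolute path entirely: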

import requests
from bs4 import BeautifulSoup
from pprint import pp
def main(url):
    params = {
        "page": 1,
        "sort": "date-new-old"
    }
    r = requests.get(url, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    pp([x.get_text(strip=True)
        for x in soup.select('div[style^="position: relative; padding: 10px; "]')])

main("https://www.anybusiness.com.au/search")
Output:
['COMING SOON Bar Under Management - $250k/month ave takings',
'Mobile Windscreen and Auto Glass',
'21168 Profitable Engineering Business – Coal Mining Industry',
'Exciting Food Franchise Opportunities. Sydney.',
'Exciting Food Franchise Opportunities. Brisbane.',
'Pizza Hut Franchise Sale',
'Luxury Horse Float Business – National Opportunity! (6543)',
'Mobile Windscreen and Auto Glass Business. Offers over $220,000 WIWO',
'SIMPLE ESPRESSO BAR - HILLS DISTRICT - 00805.',
'The Ox and Hound Bistro - 1P5306',
'A new lifestyle awaits - own and operate your own business in a growing '
'market',
'Join the Fernwood Fitness Franchise Family',
'Rare Business Opportunity - Appliance Finance Business -Ipswich',
'Make a Lifestyle Change to Beautiful Bendigo/Huntly Newsagency and Store',
'Immaculately Presented Motel in a Thriving Regional City - 1P5317M']
Updated Answer:
import requests
from bs4 import BeautifulSoup
from pprint import pp
def main(url):
    params = {
        "page": 1,
        "sort": "date-new-old"
    }
    r = requests.get(url, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    pp([(x.a['href'], x.select_one('.ellipsis').get_text(strip=True))
        for x in soup.select('.basic')])

main("https://www.anybusiness.com.au/search")
Output:
[('/listings/oakleigh-east-vic-3166-retail-food-beverage-3330830',
"Portman's Continental Butcher & Deli"),
('/listings/melbourne-vic-3000-leisure-entertainment-bars-nightclubs-3330829',
'COMING SOON Bar Under Management - $250k/month ave takings'),
('/listings/north-mackay-qld-4740-industrial-manufacturing-mining-earth-moving-manufacturing-engineering-3330827',
'21168 Profitable Engineering Business – Coal Mining Industry'),
('/listings/food-hospitality-cafe-coffee-shop-3330826',
'Exciting Food Franchise Opportunities. Sydney.'),
('/listings/food-hospitality-cafe-coffee-shop-3330825',
'Exciting Food Franchise Opportunities. Brisbane.'),
('/listings/alderley-qld-4051-food-hospitality-takeaway-food-3330824',
'Pizza Hut Franchise Sale'),
('/listings/castle-hill-nsw-2154-food-hospitality-cafe-coffee-shop-takeaway-food-retail-food-beverage-3330821',
'SIMPLE ESPRESSO BAR - HILLS DISTRICT - 00805.'),
('/listings/beechworth-vic-3747-food-hospitality-restaurant-cafe-coffee-shop-retail-food-beverage-3330820',
'The Ox and Hound Bistro - 1P5306'),
('/listings/west-launceston-tas-7250-services-cleaning-home-garden-home-based-franchise-3330819',
'A new lifestyle awaits - own and operate your own business in a growing '
'market'),
('/listings/leisure-entertainment-sports-complex-gym-recreation-sport-franchise-3330818',
'Join the Fernwood Fitness Franchise Family'),
('/listings/melbourne-vic-3000-professional-finance-services-hire-rent-retail-homeware-hardware-3330817',
'Rare Business Opportunity - Appliance Finance Business -Ipswich'),
('/listings/huntly-north-vic-3551-retail-newsagency-tatts-food-hospitality-convenience-store-office-supplies-3330816',
'Make a Lifestyle Change to Beautiful Bendigo/Huntly Newsagency and Store')]
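The hrefs in the output above are relative to the site root; if absolute listing URLs are needed, they can be joined against the base URL. A small sketch of that step, using one tuple from the output above:

from urllib.parse import urljoin

BASE = "https://www.anybusiness.com.au"
rows = [('/listings/alderley-qld-4051-food-hospitality-takeaway-food-3330824', 'Pizza Hut Franchise Sale')]
absolute = [(urljoin(BASE, href), title) for href, title in rows]
print(absolute)
# [('https://www.anybusiness.com.au/listings/alderley-qld-4051-food-hospitality-takeaway-food-3330824', 'Pizza Hut Franchise Sale')]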

Related

Scraping a website across multiple pages that uses the __doPostBack method, where the URL doesn't change between pages

I am using BeautifulSoup to scrape from
https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019
There are a total of two pages of information, and to navigate between the pages there are several links at the top as well as the bottom, like 1, 2. These links use __doPostBack:
href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView2','Page$2')"
The problem is that when navigating from one page to another, the URL doesn't change; only the page argument changes, i.e. for page 1 it is Page$1, for page 2 it is Page$2. How do I use BeautifulSoup to iterate over the pages and extract the information? The form data is as follows:
ctl00$ScriptManager1: ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2
ctl00$ContentPlaceHolder1$ddl_District: 019
ctl00$ContentPlaceHolder1$rdo_Govt_Flag: G
__EVENTTARGET: ctl00$ContentPlaceHolder1$GridView2
__EVENTARGUMENT: Page$2
There is also a variable called __VIEWSTATE in the form data, but its contents are huge.
I have looked at multiple posts suggesting to inspect the parameters of the POST call and reuse them, but I am unable to make sense of the parameters provided in that POST.
You can use this example of how to load the next page on this site using requests:
import requests
from bs4 import BeautifulSoup
url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
def load_page(soup, page_num):
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    }
    payload = {
        "ctl00$ScriptManager1": "ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$GridView2",
        "__EVENTTARGET": "ctl00$ContentPlaceHolder1$GridView2",
        "__EVENTARGUMENT": "Page${}".format(page_num),
        "__LASTFOCUS": "",
        "__ASYNCPOST": "true",
    }
    for inp in soup.select("input"):
        payload[inp["name"]] = inp.get("value")
    payload["ctl00$ContentPlaceHolder1$ddl_District"] = "019"
    payload["ctl00$ContentPlaceHolder1$rdo_Govt_Flag"] = "G"
    del payload["ctl00$ContentPlaceHolder1$chk_Available"]
    api_url = "https://excise.wb.gov.in/chms/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx?Public_District_Code=019"
    soup = BeautifulSoup(
        requests.post(api_url, data=payload, headers=headers).content,
        "html.parser",
    )
    return soup

# print hospitals from first page:
for h5 in soup.select("h5"):
    print(h5.text)

# load second page
soup = load_page(soup, 2)

# print hospitals from second page
for h5 in soup.select("h5"):
    print(h5.text)
Prints:
 AMRI, Salt Lake - Vivekananda Yuba Bharati Krirangan Salt Lake Stadium (Satellite Govt. Building)
 Calcutta National Medical College and Hospital (Government Hospital)
 CHITTARANJAN NATIONAL CANCER INSTITUTE-CNCI (Government Hospital)
 College of Medicine Sagore Dutta Hospital (Government Hospital)
 ESI Hospital Maniktala (Government Hospital)
 ESI Hospital Sealdah (Government Hospital)
 I.D. And B.G. Hospital (Government Hospital)
 M R Bangur Hospital (Government Hospital)
 Medical College and Hospital, Kolkata, (Government Hospital)
 Nil Ratan Sarkar Medical College and Hospital (Government Hospital)
 R. G. Kar Medical College and Hospital (Government Hospital)
 Sambhunath Pandit Hospital (Government Hospital)
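To collect every page rather than just the second, the same helper can be reused in a loop; a minimal sketch, assuming the two pages reported for this district (page counts for other districts may differ):

# Hypothetical wrapper around the code above: gather hospital names from both pages
def hospital_names(s):
    return [h5.get_text(strip=True) for h5 in s.select("h5")]

all_hospitals = hospital_names(soup)      # page 1 (initial GET)
soup = load_page(soup, 2)                 # page 2 via the __doPostBack payload
all_hospitals += hospital_names(soup)
print(len(all_hospitals), "hospitals in total")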

I wanted to scrape article titles from a website but the result shows none

I wanted to scrape the titles of news articles from the New York Times website and add them to a list, but the result shows an empty list.
When I put just 'a' in the soup.findAll line, it works fine (it prints all the links), but when I change it to the class, it doesn't work.
import requests
from bs4 import BeautifulSoup
def get_titles():
    tlist = []
    url = 'https://www.nytimes.com/'
    get_link = requests.get(url)
    get_link_text = get_link.text
    soup = BeautifulSoup(get_link_text, 'html.parser')
    for row in soup.findAll('h2', {'class': 'balancedHeadline'}):
        tlist.append(row)
    print(tlist)

get_titles()
The webpage is rendered dynamically by JavaScript, so you have to use Selenium to scrape it.
Also, the h2 titles have no class named balancedHeadline, so you have to select the span inside the h2.
Try this:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
def get_titles():
    tlist = []
    url = 'https://www.nytimes.com/'
    browser = webdriver.Firefox()
    browser.get(url)
    soup = BeautifulSoup(browser.page_source)
    for row in soup.find_all('h2', {'class': 'esl82me0'}):
        spantext = row.find('span', {'class': 'balancedHeadline'})
        if spantext:
            tlist.append(spantext.text)
    print(tlist)

get_titles()
RESULT:
[
'U.S. Delays Some China Tariffs Until Stores Stock Up for Holidays',
'After a Chaotic Night of Protests, Calm at Hong Kong Airport, for Now',
'Guards at Jail Where Epstein Died Were Sleeping, Officials Say',
'How a Trump Ally Tested the Boundaries of Washington’s Influence Game',
'‘Juul-alikes’ Are Filling Shelves With Sweet, Teen-Friendly Nicotine Flavors',
'A Boom Time for the Bunker Business and Doomsday Capitalists',
'Introducing The 1619 Project'
]
EDIT:
I didn't see that some titles have no span, so I had another test; with this you'll find all the titles:
CODE:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
def get_titles():
    tlist = []
    url = 'https://www.nytimes.com/'
    browser = webdriver.Firefox()
    browser.get(url)
    soup = BeautifulSoup(browser.page_source)
    for row in soup.find_all('h2', {'class': 'esl82me0'}):
        span = row.find('span', {'class': 'balancedHeadline'})
        if span:
            tlist.append(span.text)
        else:
            tlist.append(row.text)
    print(tlist)

get_titles()
RESULTS:
['Your Wednesday Briefing',
'Listen to ‘The Daily’',
'The Book Review Podcast',
'U.S. Delays Some China Tariffs Until Stores Stock Up for Holidays',
'While visiting a chemical plant, Mr. Trump railed against China, former '
'President Barack Obama and the news media.',
'Two counties in California filed a lawsuit to block the administration’s new '
'green card “wealth” test.',
'After a Chaotic Night of Protests, Calm at Hong Kong Airport, for Now',
'Protesters apologized after scenes of violence and disorder at the airport.',
'Guards at Jail Where Epstein Died Were Sleeping, Officials Say',
'How a Trump Ally Tested the Boundaries of Washington’s Influence Game',
'Here are four takeaways from our report on Mr. Broidy.',
'‘Juul-alikes’ Are Filling Shelves With Sweet, Teen-Friendly Nicotine Flavors',
'A Boom Time for the Bunker Business and Doomsday Capitalists',
'The Cold Truth About the Jeffrey Epstein Case',
'‘My Name Is Darlin. I Just Came Out of Detention.’',
'Trump and Xi Sittin’ in a Tree',
'This Drug Will Save Children’s Lives. It Costs $2 Million.',
'The Battle for Hong Kong Is Being Fought in Sydney and Vancouver',
'No Need to Deport Me. This Dreamer’s Dream Is Dead.',
'Threats to Animals: Pesticides. Pollution. President Trump.',
'Jeffrey Epstein and When to Take Conspiracies Seriously',
'Why Trump Fears Women of Color',
'The Religious Hunger of the Radical Right',
'No, I Won’t Sign Your Petition',
'Introducing The 1619 Project',
'A Surfing Adventure in … Ireland?',
'When the Creepy Carnival Comes to Town']
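One practical note on the Selenium snippets above: they never close the browser they launch. A small sketch of the same setup with cleanup, everything else unchanged:

browser = webdriver.Firefox()
try:
    browser.get(url)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
finally:
    browser.quit()  # always release the browser, even if parsing fails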

scrape specific data from td tags

I have to scrape this data:
The name of the company that is hiring
The location of the company
The position that the ad is for
This is the website that I want to scrape from: link. I was able to get td data, but I need to start from a specific td tag (i.e. start from this tr tag):
<tr style="height:14px"></tr>
<tr class='athing' id='20463814'>
<td align="right" valign="top" class="title"><span class="rank"></span></td> <td></td><td class="title">Mino Games (YC W11) Is Hiring Game Developers in Montreal<span class="sitebit comhead"> (<span class="sitestr">workable.com</span>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">
<span class="age">11 hours ago</span> </td></tr>
and then keep moving on to the other tags, while collecting the company name, location, and position in separate variables. I know it's a lot to ask, but I would appreciate any help you can provide.
This is what I tried:
import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/jobs'
plain_html_text = requests.get(url)
soup = BeautifulSoup(plain_html_text.text, "html.parser")
table_body = soup.find('tbody')
rows = soup.find('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
What you want is not an easy problem, but this script could get you started:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://news.ycombinator.com/jobs'
plain_html_text = requests.get(url)
soup = BeautifulSoup(plain_html_text.text, "html.parser")

rows = []
for title in soup.select('.title:not(:has(.morelink)) .storylink'):
    t = title.get_text(strip=True)
    company = re.findall(r'^(.*?)(?:is hiring|is looking|seeking|hiring)', t, flags=re.I)
    if company:
        company = company[0].strip()
    else:
        company = '-'
    position = re.findall(r'(?:is hiring|is looking|seeking|hiring)(.*?)(?=\bin\b|$)', t, flags=re.I)
    if position:
        position = position[0].strip()
    else:
        position = '-'
    location = re.findall(r'(?:\bin\b)(.*)', t, flags=re.I)
    if location:
        location = location[0].strip()
    else:
        location = '-'
    rows.append([company, position, location])

print('{: ^50}{: ^80}{: ^20}'.format('Company', 'Position', 'Location'))
for row in rows:
    c, p, l = row
    print('{: <50}{: <80}{: <20}'.format(c, p, l))
Prints:
Company Position Location
Scale AI engineers to accelerate the development of AI -
Mino Games (YC W11) Game Developers Montreal
BuildZoom (YC W13) – Help us un-break construction -
Bitmovin (YC S15) a Video Solutions Architect/Software Engineer Brazil
Streak – CRM for Gmail (YC S11) Vancouver
ZeroCater (YC W11) a Director of Engineer SF
UpCodes (YC S17) engineers to automate compliance for architects -
Tech Nonprofit Upsolve (YC W19) a Software Engineer -
Gitlab (YC W15) an Engineering Manager, Ecosystem -
Saleswhale (YC S16) Our First U.S. Strategic Account Executive -
Jerry (YC S17) for a Director of Ops and Growth -
Sourceress (YC S17) Product and ML Engineers (Remote OK, No Prior ML OK) -
GiveCampus (YC S15) a Product Designer who cares about education -
Iris Automation an Account Executive for B2B Flying Vehicle Software -
LogDNA (YC W15) Software Engineers – DevOps Monitoring at Scale -
Flexport software engineers to work on our trucking apps Chicago
Mux an ML engineer to help train our machines to deliver better video -
The Muse (YC W12) a Product Director for Growth -
OneSignal an SRE to scale our bare-metal infrastructure -
Atomwise (YC W15) a Senior Systems/Cloud Engineer -
Demodesk (YC W19) Software Engineers Munich
Gusto for Android and iOS developers to build our native mobile app -
Fond (YC W12) an Engineering Manager Portland
ReadMe (YC W15) – Help us make APIs easy to use -
Keeper (YC W19) a lead engineer – help save gig workers money on taxes -
Asseta (YC S13) a technical lead -
Tesorio (YC S15) Engineering Managers, Senior Engineers -
Standard Cognition (YC S17) – Work on vision systems Rust
Curebase (YC S18) first sales hire – distributed clinical research -
Mashgin (YC W15) a Fullstack SWE Interested Computer Vision/AI
This is a basic scraper that splits titles into company and position.
import requests
from bs4 import BeautifulSoup
import re
from pprint import pprint

def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url, headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'})
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def extract_jobs(soup: BeautifulSoup) -> list:
    titles = soup.select('.storylink')
    hiring_re = re.compile(r'\s+(is)?\s+(hiring|seeking|looking)\s+(for)?', flags=re.IGNORECASE)
    jobs = []
    for el in titles:
        title = el.text.strip()
        m = hiring_re.search(title)
        if not m:
            continue
        company = title[:m.start()].strip()
        offer = title[m.end():].strip().title()
        jobs.append({
            'company': company,
            'wants': offer,
        })
    return jobs

url = 'https://news.ycombinator.com/jobs'
soup = make_soup(url)
jobs = extract_jobs(soup)
pprint(jobs)
output:
[{'company': 'Mino Games (YC W11)', 'wants': 'Game Developers In Montreal'},
{'company': 'BuildZoom (YC W13)', 'wants': '– Help Us Un-Break Construction'},
{'company': 'Streak – CRM for Gmail (YC S11)', 'wants': 'In Vancouver'},
{'company': 'ZeroCater (YC W11)', 'wants': 'A Director Of Engineer In Sf'},
{'company': 'UpCodes (YC S17)', 'wants': 'Engineers To Automate Compliance For Architects'},
{'company': 'Tech Nonprofit Upsolve (YC W19)', 'wants': 'A Software Engineer'},
...

Script produces wrong results when linebreak comes into play

I've written a script in Python to scrape some disorganized content located within b tags and their next_sibling from a webpage. The thing is, my script fails when linebreaks come in between. I'm trying to extract the titles and their corresponding descriptions from that page, starting from CHIEF COMPLAINT: Bright red blood per rectum to just before Keywords:.
Website address
I've tried so far with:
import requests
from bs4 import BeautifulSoup
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
for item in soup.select_one("hr").find_next_siblings('b'):
    print(item.text, item.next_sibling)
The portion of the output giving me unwanted results looks like this:
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
How can I get the titles and their corresponding descriptions accordingly?
Here's a scraper that's more robust compared to yesterday's solutions:
How to loop through scraping multiple documents on multiple web pages using BeautifulSoup?
How can I grab the entire body text from a web page using BeautifulSoup?
It extracts the title, description, and all sections properly:
import re
import copy
import requests
from bs4 import BeautifulSoup, Tag, Comment, NavigableString
from urllib.parse import urljoin
from pprint import pprint
import itertools
import concurrent
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'https://www.mtsamples.com'

def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    soup = copy.copy(soup)
    h1 = soup.select_one('h1')
    kw_re = re.compile('.*Keywords.*', flags=re.IGNORECASE)
    kw = soup.find('b', text=kw_re)
    for el in (*h1.previous_siblings, *kw.next_siblings):
        el.extract()
    kw.extract()
    for ad in soup.select('[id*="ad"]'):
        ad.extract()
    for script in soup.select('script'):
        script.extract()
    for c in h1.parent.children:
        if isinstance(c, Comment):
            c.extract()
    return h1.parent

def extract_meta(soup: BeautifulSoup) -> dict:
    h1 = soup.select_one('h1')
    title = h1.text.strip()
    desc_parts = []
    desc_re = re.compile('.*Description.*', flags=re.IGNORECASE)
    desc = soup.find('b', text=desc_re)
    hr = soup.select_one('hr')
    for s in desc.next_siblings:
        if s is hr:
            break
        if isinstance(s, NavigableString):
            desc_parts.append(str(s).strip())
        elif isinstance(s, Tag):
            desc_parts.append(s.text.strip())
    description = '\n'.join(p.strip() for p in desc_parts if p.strip())
    return {
        'title': title,
        'description': description
    }

def extract_sections(soup: BeautifulSoup) -> list:
    titles = [b for b in soup.select('b') if b.text.isupper()]
    parts = []
    for t in titles:
        title = t.text.strip(': ').title()
        text_parts = []
        for s in t.next_siblings:
            # walk forward until we see another title
            if s in titles:
                break
            if isinstance(s, Comment):
                continue
            if isinstance(s, NavigableString):
                text_parts.append(str(s).strip())
            if isinstance(s, Tag):
                text_parts.append(s.text.strip())
        text = '\n'.join(p for p in text_parts if p.strip())
        p = {
            'title': title,
            'text': text
        }
        parts.append(p)
    return parts

def extract_page(url: str) -> dict:
    soup = make_soup(url)
    clean = clean_soup(soup)
    meta = extract_meta(clean)
    sections = extract_sections(clean)
    return {
        **meta,
        'sections': sections
    }

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
page = extract_page(url)
pprint(page, width=2000)
output:
{'description': 'Status post colonoscopy. After discharge, experienced bloody bowel movements and returned to the emergency department for evaluation.\n(Medical Transcription Sample Report)',
'sections': [{'text': 'Bright red blood per rectum', 'title': 'Chief Complaint'},
# some elements removed for brevity
{'text': '', 'title': 'Labs'},
{'text': 'WBC count: 6,500 per mL\nHemoglobin: 10.3 g/dL\nHematocrit:31.8%\nPlatelet count: 248 per mL\nMean corpuscular volume: 86.5 fL\nRDW: 18%', 'title': 'Cbc'},
{'text': 'Sodium: 131 mmol/L\nPotassium: 3.5 mmol/L\nChloride: 98 mmol/L\nBicarbonate: 23 mmol/L\nBUN: 11 mg/dL\nCreatinine: 1.1 mg/dL\nGlucose: 105 mg/dL', 'title': 'Chem 7'},
{'text': 'PT 15.7 sec\nINR 1.6\nPTT 29.5 sec', 'title': 'Coagulation Studies'},
{'text': 'The patient receive ... ula.', 'title': 'Hospital Course'}],
'title': 'Sample Type / Medical Specialty: Gastroenterology\nSample Name: Blood per Rectum'}
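The otherwise unused ThreadPoolExecutor import above suggests the same extract_page helper was meant to be fanned out over many sample pages; a minimal sketch of doing that (the extra URLs in the list are placeholders):

urls = [
    'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum',
    # ... more sample URLs would go here
]
with ThreadPoolExecutor(max_workers=4) as executor:
    pages = list(executor.map(extract_page, urls))
pprint(pages, width=2000)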
Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = urlopen(url)
html = res.read()
soup = BeautifulSoup(html, 'html.parser')
# Cut out the division containing the required text (found via Right Click -> Inspect Element in the browser)
sampletext_div = soup.find('div', {'id': "sampletext"})
print(sampletext_div.find('h1').text)  # To print the header
Output:
Sample Type / Medical Specialty: Gastroenterology
Sample Name: Blood per Rectum
Code:
# Find all the <b> tags
b_all = sampletext_div.findAll('b')
for b in b_all[4:]:
    print(b.text, b.next_sibling)
Output:
CHIEF COMPLAINT: Bright red blood per rectum
HISTORY OF PRESENT ILLNESS: This 73-year-old woman had a recent medical history significant for renal and bladder cancer, deep venous thrombosis of the right lower extremity, and anticoagulation therapy complicated by lower gastrointestinal bleeding. Colonoscopy during that admission showed internal hemorrhoids and diverticulosis, but a bleeding site was not identified. Five days after discharge to a nursing home, she again experienced bloody bowel movements and returned to the emergency department for evaluation.
REVIEW OF SYMPTOMS: No chest pain, palpitations, abdominal pain or cramping, nausea, vomiting, or lightheadedness. Positive for generalized weakness and diarrhea the day of admission.
PRIOR MEDICAL HISTORY: Long-standing hypertension, intermittent atrial fibrillation, and hypercholesterolemia. Renal cell carcinoma and transitional cell bladder cancer status post left nephrectomy, radical cystectomy, and ileal loop diversion 6 weeks prior to presentation, postoperative course complicated by pneumonia, urinary tract infection, and retroperitoneal bleed. Deep venous thrombosis 2 weeks prior to presentation, management complicated by lower gastrointestinal bleeding, status post inferior vena cava filter placement.
MEDICATIONS: Diltiazem 30 mg tid, pantoprazole 40 mg qd, epoetin alfa 40,000 units weekly, iron 325 mg bid, cholestyramine. Warfarin discontinued approximately 10 days earlier.
ALLERGIES: Celecoxib (rash).
SOCIAL HISTORY: Resided at nursing home. Denied alcohol, tobacco, and drug use.
FAMILY HISTORY: Non-contributory.
PHYSICAL EXAM: <br/>
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
COAGULATION STUDIES: <br/>
HOSPITAL COURSE: The patient received 1 liter normal saline and diltiazem (a total of 20 mg intravenously and 30 mg orally) in the emergency department. Emergency department personnel made several attempts to place a nasogastric tube for gastric lavage, but were unsuccessful. During her evaluation, the patient was noted to desaturate to 80% on room air, with an increase in her respiratory rate to 34 breaths per minute. She was administered 50% oxygen by nonrebreadier mask, with improvement in her oxygen saturation to 89%. Computed tomographic angiography was negative for pulmonary embolism.
Keywords:
gastroenterology, blood per rectum, bright red, bladder cancer, deep venous thrombosis, colonoscopy, gastrointestinal bleeding, diverticulosis, hospital course, lower gastrointestinal bleeding, nasogastric tube, oxygen saturation, emergency department, rectum, thrombosis, emergency, department, gastrointestinal, blood, bleeding, oxygen,
NOTE : These transcribed medical transcription sample reports and examples are provided by various users and
are for reference purpose only. MTHelpLine does not certify accuracy and quality of sample reports.
These transcribed medical transcription sample reports may include some uncommon or unusual formats;
this would be due to the preference of the dictating physician. All names and dates have been
changed (or removed) to keep confidentiality. Any resemblance of any type of name or date or
place or anything else to real world is purely incidental.
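The <br/> lines above are exactly the problem from the question: when a heading's content is broken up by <br/> tags, next_sibling returns the <br/> itself. A minimal sketch that instead walks next_siblings until the next <b> heading (reusing b_all from the code above) would capture the full section text:

from bs4 import Tag

def section_text(b_tag):
    # Gather everything after this <b> heading up to the next <b>, skipping bare <br/> tags
    parts = []
    for sib in b_tag.next_siblings:
        if isinstance(sib, Tag) and sib.name == 'b':
            break
        if isinstance(sib, Tag) and sib.name == 'br':
            continue
        text = sib.get_text() if isinstance(sib, Tag) else str(sib)
        if text.strip():
            parts.append(text.strip())
    return ' '.join(parts)

for b in b_all[4:]:
    print(b.text, section_text(b))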

Scraping and parsing citation info from Google Scholar search results

I have a list of around 20,000 article titles and I want to scrape their citation counts from Google Scholar. I am new to the BeautifulSoup library. I have this code:
import requests
from bs4 import BeautifulSoup
query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})
but it returns only the title and url. I don't know how to get the citation information from another tag. Please help me out here.
You need to loop over the list. You can use Session for efficiency. The code below is for bs4 4.7.1, which supports the :contains pseudo-class for finding the citation count. It looks like you can remove the h3 type selector from the CSS selector and just use the class before the a, i.e. .gs_rt a. If you don't have 4.7.1, you can use [title=Cite] + a to select the citation count instead.
import requests
from bs4 import BeautifulSoup as bs

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
           'Uncoupling conformational states from activity in an allosteric enzyme',
           'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
           'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
           'Primary Prevention of CVD',
           'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
           'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
           'We Know Who Likes Us, but Not Who Competes Against Us']

with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml')  # or 'html.parser'
        title = soup.select_one('h3.gs_rt a').text if soup.select_one('h3.gs_rt a') is not None else 'No title'
        link = soup.select_one('h3.gs_rt a')['href'] if title != 'No title' else 'No link'
        citations = soup.select_one('a:contains("Cited by")').text if soup.select_one('a:contains("Cited by")') is not None else 'No citation count'
        print(title, link, citations)
The alternative for < 4.7.1.
with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml')  # or 'html.parser'
        title = soup.select_one('.gs_rt a')
        if title is None:
            title = 'No title'
            link = 'No link'
        else:
            link = title['href']
            title = title.text
        citations = soup.select_one('[title=Cite] + a')
        if citations is None:
            citations = 'No citation count'
        else:
            citations = citations.text
        print(title, link, citations)
Bottom version re-written thanks to comments from @facelessuser. Top version left for comparison:
It would probably be more efficient to not call select_one twice in a single-line if statement. While the pattern building is cached, the returned tag is not cached. I personally would set the variable to whatever is returned by select_one and then, only if the variable is None, change it to No link or No title etc. It isn't as compact, but it will be more efficient.
[...] always check if tag is None: and not just if tag:. With selectors, it isn't a big deal as they will only return tags, but if you ever do something like for x in tag.descendants: you get text nodes (strings) and tags, and an empty string will evaluate false even though it is a valid node. In that case, it is safest to check for None.
Instead of finding all <h3> tags, I suggest you search for the tags enclosing both the <h3> and the citation (inside <div class="gs_rs">), i.e. find all <div class="gs_ri"> tags.
Then, from these tags, you should be able to get all you need:
query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("div", attrs={"class": "gs_ri"}):  # tag containing both h3 and citation
    results.append({"title": entry.h3.a.text, "url": entry.a['href'], "citation": entry.find("div", attrs={"class": "gs_rs"}).text})
Make sure you're using a user-agent, because the default requests user-agent is python-requests; Google might block your requests and return different HTML, with some sort of error, that doesn't contain the selectors you're trying to select. Check what your user-agent is.
It might also be a good idea to rotate user-agents while making requests, as sketched below.
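A minimal sketch of rotating user-agents per request (the UA strings in the pool are just examples):

import random
import requests

# A small pool of example desktop user-agent strings (illustrative only)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15",
]

def get_with_random_ua(url, **kwargs):
    # Pick a different user-agent for each request
    headers = {"User-agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, **kwargs)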
Code and full example that scrapes much more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
           'Uncoupling conformational states from activity in an allosteric enzyme',
           'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
           'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
           'Primary Prevention of CVD',
           'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
           'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
           'We Know Who Likes Us, but Not Who Competes Against Us']

for query in queries:
    params = {
        "q": query,
        "hl": "en",
    }
    html = requests.get('https://scholar.google.com/scholar', headers=headers, params=params).text
    soup = BeautifulSoup(html, 'lxml')

    # Container where all needed data is located
    for result in soup.select('.gs_ri'):
        title = result.select_one('.gs_rt').text
        title_link = result.select_one('.gs_rt a')['href']
        cited_by = result.select_one('#gs_res_ccl_mid .gs_nph+ a')['href']
        cited_by_count = result.select_one('#gs_res_ccl_mid .gs_nph+ a').text.split(' ')[2]
        print(f"{title}\n{title_link}\n{cited_by}\n{cited_by_count}\n")
Alternatively, you can achieve the same thing by using Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and get the data you want, rather than figuring out why certain things don't work as they should.
Code to integrate:
from serpapi import GoogleSearch
import os
import json

queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
           'Uncoupling conformational states from activity in an allosteric enzyme',
           'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
           'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
           'Primary Prevention of CVD',
           'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
           'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
           'We Know Who Likes Us, but Not Who Competes Against Us']

for query in queries:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": query,
    }
    search = GoogleSearch(params)
    results = search.get_dict()

    data = []
    for result in results['organic_results']:
        data.append({
            'title': result['title'],
            'link': result['link'],
            'publication_info': result['publication_info']['summary'],
            'snippet': result['snippet'],
            'cited_by': result['inline_links']['cited_by']['link'],
            'related_versions': result['inline_links']['related_pages_link'],
        })
    print(json.dumps(data, indent=2, ensure_ascii=False))
P.S. - I wrote a blog post about how to scrape pretty much everything on Google Scholar, with visual representations.
Disclaimer: I work for SerpApi.
