I'm new to python and have been taking on various projects to get up to speed. At the moment, I'm working on a routine that will read through the Code of Federal Regulations and for each paragraph, print the organizational hierarchy for that paragraph. For example, a simplified version of the CFR's XML scheme would look like:
<CHAPTER>
    <HD SOURCE="HED">PART 229—NONDISCRIMINATION ON THE BASIS OF SEX IN EDUCATION PROGRAMS OR ACTIVITIES RECEIVING FEDERAL FINANCIAL ASSISTANCE</HD>
    <SECTION>
        <SECTNO>### 229.120</SECTNO>
        <SUBJECT>Transfers of property.</SUBJECT>
        <P>If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).</P>
    </SECTION>
</CHAPTER>
I'd like to be able to print this to a CSV so that I can run text analysis:
Title 22, Volume 2, Part 229, Section 229.120, If a recipient sells or otherwise transfers property (…) subject to the provisions of ### 229.205 through 229.235(a).
Note that I'm not taking the Title and Volume numbers from the XML, because they are actually included in the file name in a much more standardized format.
Because I'm such a Python newbie, the code is mostly based on the search-engine code from Udacity's computer science course. Here's the Python I've written/adapted so far:
import os
import urllib2
from xml.dom.minidom import parseString

file_path = '/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml'
file_name = os.path.basename(file_path) #Gets the filename from the path.
doc = open(file_path)
page = doc.read()

def clean_title(file_name): #Gets the title number from the filename.
    start_title = file_name.find('title')
    end_title = file_name.find("-", start_title+1)
    title = file_name[start_title+5:end_title]
    return title

def clean_volume(file_name): #Gets the volume number from the filename.
    start_volume = file_name.find('vol')
    end_volume = file_name.find('.xml', start_volume)
    volume = file_name[start_volume+3:end_volume]
    return volume

def get_next_section(page): #Gets all of the text between <SECTION> tags.
    start_section = page.find('<SECTION')
    if start_section == -1:
        return None, 0
    start_text = page.find('>', start_section)
    end_quote = page.find('</SECTION>', start_text + 1)
    section = page[start_text + 1:end_quote]
    return section, end_quote

def get_section_number(section): #Within the <SECTION> tag, find the section number based on the <SECTNO> tag.
    start_section_number = section.find('<SECTNO>###')
    if start_section_number == -1:
        return None, 0
    end_section_number = section.find('</SECTNO>', start_section_number)
    section_number = section[start_section_number+11:end_section_number]
    return section_number, end_section_number

def get_paragraph(section): #Within the <SECTION> tag, finds <P> paragraphs.
    start_paragraph = section.find('<P>')
    if start_paragraph == -1:
        return None, 0
    end_paragraph = section.find('</P>', start_paragraph)
    paragraph = section[start_paragraph+3:end_paragraph]
    return start_paragraph, paragraph, end_paragraph

def print_all_paragraphs(page): #This is the section that I would *like* to have print each paragraph and the citation hierarchy.
    section, endpos = get_next_section(page)
    for pragraph in section:
        title = clean_title(file_name)
        volume = clean_volume(file_name)
        section, endpos = get_next_section(page)
        section_number, end_section_number = get_section_number(section)
        start_paragraph, paragraph, end_paragraph = get_paragraph(section)
        if paragraph:
            print "Title: "+ title + " Volume: "+ volume +" Section Number: "+ section_number + " Text: "+ paragraph
            page = page[end_paragraph:]
        else:
            break

print print_all_paragraphs(page)
doc.close()
At the moment, this code has the following issues (example output to follow):
It prints the first paragraph multiple times. How can I print each tag with its own title number, volume number, etc.?
The CFR has empty sections that are "Reserved". These sections don't have <P> tags, so the if check breaks out of the loop. I've tried implementing for/while loops, but for some reason when I do this the code just prints the first paragraph it finds repeatedly.
Here's an example of the output:
Title: 22 Volume: 1 Section Number: 9.10 Text: All requests to the Department by a member
of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number: 9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number: 9.10 Text: All requests to the Department by a member of the public, a government employee, or an agency to declassify and release information shall result in a prompt declassification review of the information in accordance with procedures set forth in 22 CFR 171.20-25. Mandatory declassification review requests should be directed to the Information and Privacy Coordinator, U.S. Department of State, SA-2, 515 22nd St., NW., Washington, DC 20522-6001.
Title: 22 Volume: 1 Section Number: 9.11 Text: The Information and Privacy Coordinator shall be responsible for conducting a program for systematic declassification review of historically valuable records that were exempted from the automatic declassification provisions of section 3.3 of the Executive Order. The Information and Privacy Coordinator shall prioritize such review on the basis of researcher interest and the likelihood of declassification upon review.
Title: 22 Volume: 1 Section Number: 9.12 Text: For Department procedures regarding the access to classified information by historical researchers and certain former government personnel, see Sec. 171.24 of this Title.
Title: 22 Volume: 1 Section Number: 9.13 Text: Specific controls on the use, processing, storage, reproduction, and transmittal of classified information within the Department to provide protection for such information and to prevent access by unauthorized persons are contained in Volume 12 of the Department's Foreign Affairs Manual.
Title: 22 Volume: 1 Section Number: 9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
Title: 22 Volume: 1 Section Number: 9a.1 Text: These regulations implement Executive Order 11932 dated August 4, 1976 (41 FR 32691, August 5, 1976) entitled “Classification of Certain Information and Material Obtained from Advisory Bodies Created to Implement the International Energy Program.”
None
Ideally, each of the entries after the citation information would be different.
What kind of loop should I run to print this properly? Is there a more "pythonic" way of doing this kind of text extraction?
I understand that I am a complete novice, and one of the major problems I'm facing is that I simply don't have the vocabulary or topic knowledge to really find detailed answers about parsing XML with this level of detail. Any recommended reading would also be welcome.
I like to solve problems like this with XPath or XSLT. You can find a great implementation in lxml (not in the standard distribution; it needs to be installed). For instance, the XPath //CHAPTER/SECTION[SECTNO] selects all sections with data. You then use relative XPath expressions to grab the values you want from there. Multiple nested for loops disappear. XPath has a bit of a learning curve, but there are many examples out there.
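By way of illustration, here is a minimal lxml sketch of that approach, assuming the simplified schema above (the real CFR XML is nested more deeply, so the paths will likely need adjusting, and the output file name is just a placeholder):
import csv
from lxml import etree

file_path = '/Users/owner1/Downloads/CFR-2012/title-22/CFR-2012-title22-vol1.xml'
tree = etree.parse(file_path)

with open('cfr_paragraphs.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    # Absolute XPath: every SECTION under a CHAPTER that actually has a SECTNO.
    for section in tree.xpath('//CHAPTER/SECTION[SECTNO]'):
        # Relative XPath from the current section grabs its own values.
        sectno = section.xpath('string(SECTNO)').strip()
        for p in section.xpath('P'):
            # itertext() flattens any inline markup inside the paragraph.
            writer.writerow([sectno, ''.join(p.itertext()).strip()])
The title and volume numbers you parse out of the file name with your clean_title/clean_volume functions can simply be prepended to each writerow call.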
Related
I am trying to scrape movie data from https://www.imdb.com/search/title/?title_type=feature&genres=comedy&explore=genres but when I try to scrape the movie runtime text I get an error saying get_text is not callable, and that is because some of the movies I am scraping have no runtime. How can I make my code skip the movies with no runtime?
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.imdb.com/search/title/?title_type=feature&genres=comedy&explore=genres')
source.raise_for_status()

soup = BeautifulSoup(source.text, 'html.parser')
comedy_movies = soup.find_all('div', class_ = "lister-item mode-advanced")

for movies in comedy_movies:
    #movie title
    movie_title = movies.find('div', class_ = 'lister-item-content').a.text

    #Parental Advisory
    advisory = movies.find('span', class_ = 'certificate') #figure out how to single out advisory-

    #Movie runtime
    runtime = movies.find('span', class_ = 'runtime') #figure out how to single out runtime

    #Movie Genre
    genre = movies.find('span', class_ = 'genre').get_text()

    #Movie Rating
    rating = movies.find('span', class_ = 'global-sprite rating-star imdb-rating') #Figure out how to single out ratings

    #MetaScore
    metascore = movies.find('div', class_ = 'inline-block ratings-metascore') #.span.text same here missing values

    #Movie Description
    description = movies.find('div', class_ = 'lister-item-content').p.text

    print(runtime)
Also, when I try to scrape the descriptions, I am not getting the descriptions; I am getting other text with the same tag and class. How can I fix this? I will appreciate it a lot if someone can help. (When my code executes, runtime shows None values.)
To avoid the error you can simply first check whether find returned anything that is not None, like
runtime = movies.find('span', class_ = 'runtime')
if runtime is not None:
    runtime = runtime.text
As for ratings, you want the contents of the <strong> tag next to the span you were finding:
rating = movies.find(
    'span', class_ = 'global-sprite rating-star imdb-rating'
).find_next('strong').text
and for description, you would need to look for the p tag with class="text-muted" after the div with class="ratings-bar":
description = movies.find(
    'div', class_ = 'ratings-bar'
).find_next('p', class_ = 'text-muted').text
although this will find None [and then raise an error] when the rating is missing...
You might have noticed by now that some data (description, rating, metascore and title) would need more than one if...is not None check to avoid getting errors if anything returns None, so it might be preferable [especially with nested elements] to use select_one instead. (If you are unfamiliar with css selectors, check this for reference.)
Then, you would be able to get metascore as simply as:
metascore = movies.select_one('div.inline-block.ratings-metascore span')
if metascore is not None:
    metascore = metascore.get_text()
In fact, you could define a dictionary with a selector for each piece of information you need and restructure your for-loop to something like
selectorDict = {
    'movie_title': 'div.lister-item-content a',
    'advisory': 'span.certificate',
    'runtime': 'span.runtime',
    'genre': 'span.genre',
    'rating': 'span.global-sprite.rating-star.imdb-rating~strong',
    'metascore': 'div.inline-block.ratings-metascore span',
    'description': 'div.lister-item-content p~p'
    #'description': 'div.ratings-bar~p.text-muted'
    # ^--misses description when rating is missing
}
movieData = []
for movie in comedy_movies:
    mData = {}
    for k in selectorDict:
        dTag = movie.select_one(selectorDict[k])
        if dTag is not None:
            mData[k] = dTag.get_text(strip=True)
        else:
            mData[k] = None # OPTIONAL
    movieData.append(mData)
with this, you could easily explore the collected data at once; for example, as a pandas dataframe with
# import pandas
pandas.DataFrame(movieData)
[As you might notice in the output below, some cells are blank (because value=None), but no errors would have been raised while the for-loop is running because of it.]
| index | movie_title | advisory | runtime | genre | rating | metascore | description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Amsterdam | R | 134 min | Comedy, Drama, History | 6.2 | 48 | In the 1930s, three friends witness a murder, are framed for it, and uncover one of the most outrageous plots in American history. |
| 1 | Hocus Pocus 2 | PG | 103 min | Comedy, Family, Fantasy | 6.1 | 55 | Two young women accidentally bring back the Sanderson Sisters to modern day Salem and must figure out how to stop the child-hungry witches from wreaking havoc on the world. |
| 2 | Hocus Pocus | PG | 96 min | Comedy, Family, Fantasy | 6.9 | 43 | A teenage boy named Max and his little sister move to Salem, where he struggles to fit in before awakening a trio of diabolical witches that were executed in the 17th century. |
| 3 | The Super Mario Bros. Movie | | | Animation, Adventure, Comedy | | | A plumber named Mario travels through an underground labyrinth with his brother, Luigi, trying to save a captured princess. Feature film adaptation of the popular video game. |
| 4 | Bullet Train | R | 127 min | Action, Comedy, Thriller | 7.4 | 49 | Five assassins aboard a swiftly-moving bullet train to find out that their missions have something in common. |
| 5 | Spirited | PG-13 | 127 min | Comedy, Family, Musical | | | A musical version of Charles Dickens's story of a miserly misanthrope who is taken on a magical journey. |
| … | … | … | … | … | … | … | … |
| 47 | Scooby-Doo | PG | 86 min | Adventure, Comedy, Family | 5.2 | 35 | After an acrimonious break up, the Mystery Inc. gang are individually brought to an island resort to investigate strange goings on. |
| 48 | Casper | PG | 100 min | Comedy, Family, Fantasy | 6.1 | 49 | An afterlife therapist and his daughter meet a friendly young ghost when they move into a crumbling mansion in order to rid the premises of wicked spirits. |
| 49 | Ghostbusters | PG | 105 min | Action, Comedy, Fantasy | 7.8 | 71 | Three parapsychologists forced out of their university funding set up shop as a unique ghost removal service in New York City, attracting frightened yet skeptical customers. |
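If you want to keep the collected data around, a small follow-on sketch (the CSV file name is just an illustration): the same movieData list can be written straight to disk with pandas.
import pandas

# Each dict in movieData becomes one row; None values become blank cells.
df = pandas.DataFrame(movieData)
df.to_csv('comedy_movies.csv', index=False)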
I am trying to scrape a keyword in an xml document with BeautifulSoup but am unsure how to do so.
The xml document contains "Central Index Key," which changes each time for each document scraped. How would I be able to log the central index key for every unique xml document I scrape?
A sample is below. I want to log the string "0001773427" in this example:
<SEC-DOCUMENT>0001104659-22-079974.txt : 20220715
<SEC-HEADER>0001104659-22-079974.hdr.sgml : 20220715
<ACCEPTANCE-DATETIME>20220715060341
ACCESSION NUMBER: 0001104659-22-079974
CONFORMED SUBMISSION TYPE: 8-K
PUBLIC DOCUMENT COUNT: 14
CONFORMED PERIOD OF REPORT: 20220714
ITEM INFORMATION: Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers: Compensatory Arrangements of Certain Officers
ITEM INFORMATION: Financial Statements and Exhibits
FILED AS OF DATE: 20220715
DATE AS OF CHANGE: 20220715
FILER:
    COMPANY DATA:
        COMPANY CONFORMED NAME: SpringWorks Therapeutics, Inc.
        CENTRAL INDEX KEY: 0001773427
        STANDARD INDUSTRIAL CLASSIFICATION: BIOLOGICAL PRODUCTS (NO DIAGNOSTIC SUBSTANCES) [2836]
        IRS NUMBER: 000000000
        STATE OF INCORPORATION: DE
        FISCAL YEAR END: 1231
    FILING VALUES:
        FORM TYPE: 8-K
        SEC ACT: 1934 Act
        SEC FILE NUMBER: 001-39044
        FILM NUMBER: 221084206
    BUSINESS ADDRESS:
        STREET 1: 100 WASHINGTON BOULEVARD
        CITY: STAMFORD
        STATE: CT
        ZIP: 06902
        BUSINESS PHONE: 203-883-9490
    MAIL ADDRESS:
        STREET 1: 100 WASHINGTON BOULEVARD
        CITY: STAMFORD
        STATE: CT
        ZIP: 06902
</SEC-HEADER>
The problem is that your document is SGML, not XML. XPath requires that your document be XML.
Use a different tool or convert your document to XML if you wish to use XPath. An example of a SGML to XML converter is sx by James Clark.
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}
r = requests.get('https://www.sec.gov/Archives/edgar/data/1773427/000110465922079974/0001104659-22-079974.txt', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
print([x.strip().replace('\t', ' ') for x in soup.text.splitlines() if 'CENTRAL INDEX KEY:' in x ][0])
This will return:
CENTRAL INDEX KEY: 0001773427
If you only want the key:
print([x.replace('\t', ' ') for x in soup.text.splitlines() if 'CENTRAL INDEX KEY:' in x ][0].split(':')[1].strip())
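Since the header is line-oriented plain text, a regular expression is another option. A minimal sketch, reusing r from the request above and assuming the "CENTRAL INDEX KEY:" label is always followed by a run of digits, as in the sample:
import re

# Pull the digits after the label; returns None if the label is absent.
match = re.search(r'CENTRAL INDEX KEY:\s*(\d+)', r.text)
if match:
    print(match.group(1))  # 0001773427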
I am trying to extract data from a website using the BeautifulSoup and requests packages, where I want to extract the links and their contents.
Until now I have been able to extract the list of the links that exist on a defined url, but I do not know how to enter each link and extract the text.
The image below describes my problem:
The text and the image are the link for the whole article.
code:
import requests
from bs4 import BeautifulSoup
url = "https://www.annahar.com/english/section/186-mena"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, features = "lxml")
print(soup.prettify())
#scraping html tags such as Title, Links, Publication date
for index,new in enumerate(news):
    published_date = new.find('span',class_="article__time-stamp").text
    title = new.find('h3',class_="article__title").text
    link = new.find('a',class_="article__link").attrs['href']
    print(f" publish_date: {published_date}")
    print(f" title: {title}")
    print(f" link: {link}")
result :
publish_date:
06-10-2020 | 20:53
title:
18 killed in bombing in Turkish-controlled Syrian town
link: https://www.annahar.com/english/section/186-mena/06102020061027020
My question is how to continue from here in order to enter each link and extract its content?
the expected result :
publish_date:
06-10-2020 | 20:53
title:
18 killed in bombing in Turkish-controlled Syrian town
link: https://www.annahar.com/english/section/186-mena/06102020061027020
description:
ANKARA: An explosives-laden truck ignited Tuesday on a busy street in a northern #Syrian town controlled by #Turkey-backed opposition fighters, killing at least 18 people and wounding dozens, Syrian opposition activists reported.
The blast in the town of al-Bab took place near a bus station where people often gather to travel from one region to another, according to the opposition’s Civil Defense, also known as White Helmets.
where the description exists inside the linked article
Add an additional request to your loop that gets the article page and grabs the description there:
page = requests.get(link)
soup = BeautifulSoup(page.content, features = "lxml")
description = soup.select_one('div.articleMainText').get_text()
print(f" description: {description}")
Example
import requests
from bs4 import BeautifulSoup

url = "https://www.annahar.com/english/section/186-mena"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, features = "lxml")
# print(soup.prettify())

#scraping html tags such as Title, Links, Publication date
for index,new in enumerate(soup.select('div#listingDiv44083 div.article')):
    published_date = new.find('span',class_="article__time-stamp").get_text(strip=True)
    title = new.find('h3',class_="article__title").get_text(strip=True)
    link = new.find('a',class_="article__link").attrs['href']

    page = requests.get(link)
    soup = BeautifulSoup(page.content, features = "lxml")
    description = soup.select_one('div.articleMainText').get_text()

    print(f" publish_date: {published_date}")
    print(f" title: {title}")
    print(f" link: {link}")
    print(f" description: {description}", '\n')
You have to grab all the follow links to the articles and then loop over those, grabbing the parts you're interested in.
Here's how:
import time
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.annahar.com/english/section/186-mena").content,
    "lxml"
)

follow_links = [
    a["href"] for a in
    soup.find_all("a", class_="article__link")
    if "#" not in a["href"]
]

for link in follow_links:
    s = BeautifulSoup(requests.get(link).content, "lxml")
    date_published = s.find("span", class_="date").getText(strip=True)
    title = s.find("h1", class_="article-main__title").getText(strip=True)
    article_body = s.find("div", {"id": "bodyToAddTags"}).getText()
    print(f"{date_published} {title}\n\n{article_body}\n", "-" * 80)
    time.sleep(2)
Output (shortened for brevity):
08-10-2020 | 12:35 Iran frees rights activist after more than 8 years in prison
TEHRAN: Iran has released a prominent human rights activist who campaigned against the death penalty, Iranian media reported Thursday.The semiofficial ISNA news agency quoted judiciary official Sadegh Niaraki as saying that Narges Mohammadi was freed late Wednesday after serving 8 1/2 years in prison. She was sentenced to 10 years in 2016 while already incarcerated.Niaraki said Mohammadi was released based on a law that allows a prison sentence to be commutated if the related court agrees.In July, rights group Amnesty International demanded Mohammadi’s immediate release because of serious pre-existing health conditions and showing suspected COVID-19 symptoms. The Thursday report did not refer to her possible illness.Mohammadi was sentenced in Tehran’s Revolutionary Court on charges including planning crimes to harm the security of Iran, spreading propaganda against the government and forming and managing an illegal group.She was in a prison in the northwestern city of Zanjan, some 280 kilometers (174 miles) northwest of the capital Tehran.Mohammadi was close to Iranian Nobel Peace Prize laureate Shirin Ebadi, who founded the banned Defenders of Human Rights Center. Ebadi left Iran after the disputed re-election of then-President Mahmoud Ahmadinejad in 2009, which touched off unprecedented protests and harsh crackdowns by authorities.In 2018, Mohammadi, an engineer and physicist, was awarded the 2018 Andrei Sakharov Prize, which recognizes outstanding leadership or achievements of scientists in upholding human rights.
--------------------------------------------------------------------------------
...
I've written a script in python to scrape some disorganized content located within b tags and their next_sibling from a webpage. The thing is, my script fails when line breaks come between. I'm trying to extract the titles and their corresponding descriptions from that page, starting from CHIEF COMPLAINT: Bright red blood per rectum to just before Keywords:.
Website address
I've tried so far with:
import requests
from bs4 import BeautifulSoup

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')

for item in soup.select_one("hr").find_next_siblings('b'):
    print(item.text,item.next_sibling)
The portion of the output giving me unwanted results looks like:
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
How can I get the titles and their corresponding descriptions?
Here's a scraper that's more robust compared to yesterday's solutions.
How to loop through scraping multiple documents on multiple web pages using BeautifulSoup?
How can I grab the entire body text from a web page using BeautifulSoup?
It extracts title, description, and all sections properly:
import re
import copy
import requests
from bs4 import BeautifulSoup, Tag, Comment, NavigableString
from urllib.parse import urljoin
from pprint import pprint
import itertools
import concurrent
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'https://www.mtsamples.com'

def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    soup = copy.copy(soup)
    h1 = soup.select_one('h1')
    kw_re = re.compile('.*Keywords.*', flags=re.IGNORECASE)
    kw = soup.find('b', text=kw_re)
    for el in (*h1.previous_siblings, *kw.next_siblings):
        el.extract()
    kw.extract()
    for ad in soup.select('[id*="ad"]'):
        ad.extract()
    for script in soup.script:
        script.extract()
    for c in h1.parent.children:
        if isinstance(c, Comment):
            c.extract()
    return h1.parent

def extract_meta(soup: BeautifulSoup) -> dict:
    h1 = soup.select_one('h1')
    title = h1.text.strip()
    desc_parts = []
    desc_re = re.compile('.*Description.*', flags=re.IGNORECASE)
    desc = soup.find('b', text=desc_re)
    hr = soup.select_one('hr')
    for s in desc.next_siblings:
        if s is hr:
            break
        if isinstance(s, NavigableString):
            desc_parts.append(str(s).strip())
        elif isinstance(s, Tag):
            desc_parts.append(s.text.strip())
    description = '\n'.join(p.strip() for p in desc_parts if p.strip())
    return {
        'title': title,
        'description': description
    }

def extract_sections(soup: BeautifulSoup) -> list:
    titles = [b for b in soup.select('b') if b.text.isupper()]
    parts = []
    for t in titles:
        title = t.text.strip(': ').title()
        text_parts = []
        for s in t.next_siblings:
            # walk forward until we see another title
            if s in titles:
                break
            if isinstance(s, Comment):
                continue
            if isinstance(s, NavigableString):
                text_parts.append(str(s).strip())
            if isinstance(s, Tag):
                text_parts.append(s.text.strip())
        text = '\n'.join(p for p in text_parts if p.strip())
        p = {
            'title': title,
            'text': text
        }
        parts.append(p)
    return parts

def extract_page(url: str) -> dict:
    soup = make_soup(url)
    clean = clean_soup(soup)
    meta = extract_meta(clean)
    sections = extract_sections(clean)
    return {
        **meta,
        'sections': sections
    }

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
page = extract_page(url)
pprint(page, width=2000)
output:
{'description': 'Status post colonoscopy. After discharge, experienced bloody bowel movements and returned to the emergency department for evaluation.\n(Medical Transcription Sample Report)',
'sections': [{'text': 'Bright red blood per rectum', 'title': 'Chief Complaint'},
# some elements removed for brevity
{'text': '', 'title': 'Labs'},
{'text': 'WBC count: 6,500 per mL\nHemoglobin: 10.3 g/dL\nHematocrit:31.8%\nPlatelet count: 248 per mL\nMean corpuscular volume: 86.5 fL\nRDW: 18%', 'title': 'Cbc'},
{'text': 'Sodium: 131 mmol/L\nPotassium: 3.5 mmol/L\nChloride: 98 mmol/L\nBicarbonate: 23 mmol/L\nBUN: 11 mg/dL\nCreatinine: 1.1 mg/dL\nGlucose: 105 mg/dL', 'title': 'Chem 7'},
{'text': 'PT 15.7 sec\nINR 1.6\nPTT 29.5 sec', 'title': 'Coagulation Studies'},
{'text': 'The patient receive ... ula.', 'title': 'Hospital Course'}],
'title': 'Sample Type / Medical Specialty: Gastroenterology\nSample Name: Blood per Rectum'}
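A possible follow-up, since ThreadPoolExecutor is imported above but never used: presumably it was meant for scraping several sample pages at once. A hedged sketch, where sample_urls is a hypothetical list of page URLs:
# sample_urls is a hypothetical list of mtsamples page URLs to extract.
sample_urls = [
    'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum',
    # ...more sample page URLs
]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() preserves input order; each worker runs extract_page on one URL.
    pages = list(executor.map(extract_page, sample_urls))

for extracted in pages:
    print(extracted['title'])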
Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = urlopen(url)
html = res.read()
soup = BeautifulSoup(html,'html.parser')
# Cut out the division containing the required text; used Right Click > Inspect Element in the browser to find the respective div/tag
sampletext_div = soup.find('div', {'id': "sampletext"})
print(sampletext_div.find('h1').text) # TO print header
Output:
Sample Type / Medical Specialty: Gastroenterology
Sample Name: Blood per Rectum
Code:
# Find all the <b> tags
b_all = sampletext_div.findAll('b')
for b in b_all[4:]:
    print(b.text, b.next_sibling)
Output:
CHIEF COMPLAINT: Bright red blood per rectum
HISTORY OF PRESENT ILLNESS: This 73-year-old woman had a recent medical history significant for renal and bladder cancer, deep venous thrombosis of the right lower extremity, and anticoagulation therapy complicated by lower gastrointestinal bleeding. Colonoscopy during that admission showed internal hemorrhoids and diverticulosis, but a bleeding site was not identified. Five days after discharge to a nursing home, she again experienced bloody bowel movements and returned to the emergency department for evaluation.
REVIEW OF SYMPTOMS: No chest pain, palpitations, abdominal pain or cramping, nausea, vomiting, or lightheadedness. Positive for generalized weakness and diarrhea the day of admission.
PRIOR MEDICAL HISTORY: Long-standing hypertension, intermittent atrial fibrillation, and hypercholesterolemia. Renal cell carcinoma and transitional cell bladder cancer status post left nephrectomy, radical cystectomy, and ileal loop diversion 6 weeks prior to presentation, postoperative course complicated by pneumonia, urinary tract infection, and retroperitoneal bleed. Deep venous thrombosis 2 weeks prior to presentation, management complicated by lower gastrointestinal bleeding, status post inferior vena cava filter placement.
MEDICATIONS: Diltiazem 30 mg tid, pantoprazole 40 mg qd, epoetin alfa 40,000 units weekly, iron 325 mg bid, cholestyramine. Warfarin discontinued approximately 10 days earlier.
ALLERGIES: Celecoxib (rash).
SOCIAL HISTORY: Resided at nursing home. Denied alcohol, tobacco, and drug use.
FAMILY HISTORY: Non-contributory.
PHYSICAL EXAM: <br/>
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
COAGULATION STUDIES: <br/>
HOSPITAL COURSE: The patient received 1 liter normal saline and diltiazem (a total of 20 mg intravenously and 30 mg orally) in the emergency department. Emergency department personnel made several attempts to place a nasogastric tube for gastric lavage, but were unsuccessful. During her evaluation, the patient was noted to desaturate to 80% on room air, with an increase in her respiratory rate to 34 breaths per minute. She was administered 50% oxygen by nonrebreadier mask, with improvement in her oxygen saturation to 89%. Computed tomographic angiography was negative for pulmonary embolism.
Keywords:
gastroenterology, blood per rectum, bright red, bladder cancer, deep venous thrombosis, colonoscopy, gastrointestinal bleeding, diverticulosis, hospital course, lower gastrointestinal bleeding, nasogastric tube, oxygen saturation, emergency department, rectum, thrombosis, emergency, department, gastrointestinal, blood, bleeding, oxygen,
NOTE : These transcribed medical transcription sample reports and examples are provided by various users and
are for reference purpose only. MTHelpLine does not certify accuracy and quality of sample reports.
These transcribed medical transcription sample reports may include some uncommon or unusual formats;
this would be due to the preference of the dictating physician. All names and dates have been
changed (or removed) to keep confidentiality. Any resemblance of any type of name or date or
place or anything else to real world is purely incidental.
I am having a hard time scraping this link via Python 3, BeautifulSoup 4
http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining
I only want to get this section.
When you are in ...
Capitol City Grille
This downtown Lansing restaurant offers ...
Capitol City Grille Lounge
For a glass of wine or a ...
Room Service
If you prefer ...
I have this code
for rest in dining_page_soup.select("div.copy_left p strong"):
    if rest.next_sibling is not None:
        if rest.next_sibling.next_sibling is not None:
            title = rest.text
            desc = rest.next_sibling.next_sibling
            print ("Title: "+title)
            print (desc)
But it gives me TypeError: 'NoneType' object is not callable
on desc = rest.next_sibling.next_sibling, even though I have an if statement to check whether it is None or not.
Here is a very simple solution:
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining")
data = r.text
soup = BeautifulSoup(data)

for found_text in soup.select('div.copy_left'):
    print(found_text.text)
UPDATE
Following an improvement of the question, here is a solution using re.
A specific workaround has to be made for the 1st paragraph "When you..." since it does not respect the structure of the other paragraphs.
import re

for tag in soup.find_all(re.compile("^strong")):
    title = tag.text
    desc = tag.next_sibling.next_sibling
    print("Title: " + title)
    print(desc)
Output
Title: Capitol City Grille
This downtown Lansing restaurant offers delicious, contemporary
American cuisine in an upscale yet relaxed environment. You can enjoy
dishes that range from fluffy pancakes to juicy filet mignon steaks.
Breakfast and lunch buffets are available, as well as an à la carte
menu.
Title: Capitol City Grille Lounge
For a glass of wine or a hand-crafted cocktail and great conversation,
spend an afternoon or evening at Capitol City Grille Lounge with
friends or colleagues.
Title: Room Service
If you prefer to dine in the comfort of your own room, order from the
room service menu.
Title: Menus
Breakfast Menu
Title: Capitol City Grille Hours
Breakfast, 6:30-11 a.m.
Title: Capitol City Grille Lounge Hours
Mon-Thu, 11 a.m.-11 p.m.
Title: Room Service Hours
Daily, 6:30 a.m.-2 p.m. and 5-10 p.m.
If you don't mind using XPath, this should work:
import requests
from lxml import html
url = "http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining"
page = requests.get(url).text
tree = html.fromstring(page)
xp_t = "//*[#class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/text()"
xp_d = "//*[#class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()[not(following-sibling::strong)]"
titles = tree.xpath(xp_t)
descriptions = tree.xpath(xp_d) # still contains garbage like '\r\n'
descriptions = [d.strip() for d in descriptions if d.strip()]
for t, d in zip(titles, descriptions):
    print("{title}: {description}".format(title=t, description=d))
Here descriptions contains 3 elements: "This downtown...", "For a glass...", "If you prefer...".
If you also need "When you are in the mood...", replace xp_d with this:
xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()"