Python BeautifulSoup Not Getting Correct Value

I am trying to scrape movie data from https://www.imdb.com/search/title/?title_type=feature&genres=comedy&explore=genres, but when I try to scrape the movie runtime text I get an error saying get_text is not callable. That happens because some of the movies I am scraping have no runtime. How can I make my code skip the movies with no runtime?
source = requests.get('https://www.imdb.com/search/title/?title_type=feature&genres=comedy&explore=genres')
source.raise_for_status()
soup = BeautifulSoup(source.text, 'html.parser')
comedy_movies = soup.find_all('div', class_="lister-item mode-advanced")
for movies in comedy_movies:
    # movie title
    movie_title = movies.find('div', class_='lister-item-content').a.text
    # Parental Advisory
    advisory = movies.find('span', class_='certificate')  # figure out how to single out advisory
    # Movie runtime
    runtime = movies.find('span', class_='runtime')  # figure out how to single out runtime
    # Movie Genre
    genre = movies.find('span', class_='genre').get_text()
    # Movie Rating
    rating = movies.find('span', class_='global-sprite rating-star imdb-rating')  # figure out how to single out ratings
    # MetaScore
    metascore = movies.find('div', class_='inline-block ratings-metascore')  # .span.text same here, missing values
    # Movie Description
    description = movies.find('div', class_='lister-item-content').p.text
    print(runtime)
Also, when I try to scrape the descriptions I am not getting the descriptions; I am getting other text with the same tag and class. How can I fix these? I will appreciate it a lot if someone can help. My code executes with runtime showing None values.

To avoid the error you can simply first check whether find returned anything that is not None, like
runtime = movies.find('span', class_='runtime')
if runtime is not None:
    runtime = runtime.text
As for ratings, you want the contents of the <strong> tag next to the span you were finding:
rating = movies.find(
    'span', class_='global-sprite rating-star imdb-rating'
).find_next('strong').text
and for description, you would need to look for the p tag with class="text-muted" after the div with class="ratings-bar":
description = movies.find(
    'div', class_='ratings-bar'
).find_next('p', class_='text-muted').text
although this will return None [and then raise an error] when the rating is missing...
You might have noticed by now that some data (description, rating, metascore and title) would need more than one if...is not None check to avoid errors if anything returns None, so it might be preferable [especially with nested elements] to use select_one instead. (If you are unfamiliar with css selectors, check this for reference.)
Then, you would be able to get metascore as simply as:
metascore = movies.select_one('div.inline-block.ratings-metascore span')
if metascore is not None:
    metascore = metascore.get_text()
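As a self-contained illustration of the select_one-plus-None-check pattern, here is a sketch against a made-up snippet (the HTML below is an invented stand-in for one lister item, not IMDb's live markup):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for one lister item; NOT IMDb's real markup.
html = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content"><a>Some Movie</a></div>
  <div class="inline-block ratings-metascore"><span>48</span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
movie = soup.find("div", class_="lister-item mode-advanced")

# One selector plus one None check replaces a chain of nested finds.
metascore = movie.select_one("div.inline-block.ratings-metascore span")
metascore = metascore.get_text(strip=True) if metascore is not None else None

# A selector that matches nothing yields None instead of raising.
advisory = movie.select_one("span.certificate")

print(metascore, advisory)  # 48 None
```

Because select_one returns None on a miss rather than raising partway through a chain, one check per field is enough.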
In fact, you could define a dictionary with a selector for each piece of information you need and restructure your for-loop to something like
selectorDict = {
    'movie_title': 'div.lister-item-content a',
    'advisory': 'span.certificate',
    'runtime': 'span.runtime',
    'genre': 'span.genre',
    'rating': 'span.global-sprite.rating-star.imdb-rating~strong',
    'metascore': 'div.inline-block.ratings-metascore span',
    'description': 'div.lister-item-content p~p'
    # 'description': 'div.ratings-bar~p.text-muted'
    # ^--misses description when rating is missing
}
movieData = []
for movie in comedy_movies:
    mData = {}
    for k in selectorDict:
        dTag = movie.select_one(selectorDict[k])
        if dTag is not None:
            mData[k] = dTag.get_text(strip=True)
        else:
            mData[k] = None  # OPTIONAL
    movieData.append(mData)
with this, you could easily explore the collected data at once; for example, as a pandas dataframe with
# import pandas
pandas.DataFrame(movieData)
[As you might notice in the output below, some cells are blank (because value=None), but no errors would have been raised while the for-loop is running because of it.]
| index | movie_title | advisory | runtime | genre | rating | metascore | description |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Amsterdam | R | 134 min | Comedy, Drama, History | 6.2 | 48 | In the 1930s, three friends witness a murder, are framed for it, and uncover one of the most outrageous plots in American history. |
| 1 | Hocus Pocus 2 | PG | 103 min | Comedy, Family, Fantasy | 6.1 | 55 | Two young women accidentally bring back the Sanderson Sisters to modern day Salem and must figure out how to stop the child-hungry witches from wreaking havoc on the world. |
| 2 | Hocus Pocus | PG | 96 min | Comedy, Family, Fantasy | 6.9 | 43 | A teenage boy named Max and his little sister move to Salem, where he struggles to fit in before awakening a trio of diabolical witches that were executed in the 17th century. |
| 3 | The Super Mario Bros. Movie | | | Animation, Adventure, Comedy | | | A plumber named Mario travels through an underground labyrinth with his brother, Luigi, trying to save a captured princess. Feature film adaptation of the popular video game. |
| 4 | Bullet Train | R | 127 min | Action, Comedy, Thriller | 7.4 | 49 | Five assassins aboard a swiftly-moving bullet train to find out that their missions have something in common. |
| 5 | Spirited | PG-13 | 127 min | Comedy, Family, Musical | | | A musical version of Charles Dickens's story of a miserly misanthrope who is taken on a magical journey. |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 47 | Scooby-Doo | PG | 86 min | Adventure, Comedy, Family | 5.2 | 35 | After an acrimonious break up, the Mystery Inc. gang are individually brought to an island resort to investigate strange goings on. |
| 48 | Casper | PG | 100 min | Comedy, Family, Fantasy | 6.1 | 49 | An afterlife therapist and his daughter meet a friendly young ghost when they move into a crumbling mansion in order to rid the premises of wicked spirits. |
| 49 | Ghostbusters | PG | 105 min | Action, Comedy, Fantasy | 7.8 | 71 | Three parapsychologists forced out of their university funding set up shop as a unique ghost removal service in New York City, attracting frightened yet skeptical customers. |


IMDb webscraping for the top 250 movies using Beautifulsoup

I know that there are many similar questions here already, but none of them gives me a satisfying answer for my problem. So here it is:
We need to create a dataframe from the top 250 movies from IMDb for an assignment. So we need to scrape the data first using BeautifulSoup.
These are the attributes that we need to scrape:
IMDb id (0111161)
Movie name (The Shawshank Redemption)
Year (1994)
Director (Frank Darabont)
Stars (Tim Robbins, Morgan Freeman, Bob Gunton)
Rating (9.3)
Number of reviews (2.6M)
Genres (Drama)
Country (USA)
Language (English)
Budget ($25,000,000)
Gross box Office Revenue ($28,884,504)
So far, I have managed to get only a few of them. I received all the separate URLs for all the movies, and now I loop over them. This is how the loop looks so far:
for x in np.arange(0, len(top_250_links)):
    url = top_250_links[x]
    req = requests.get(url)
    page = req.text
    soup = bs(page, 'html.parser')
    # ID
    # Movie Name
    Movie_name = (soup.find("div", {'class': "sc-dae4a1bc-0 gwBsXc"}).get_text(strip=True).split(': ')[1])
    # Year
    year = (soup.find("a", {'class': "ipc-link ipc-link--baseAlt ipc-link--inherit-color sc-8c396aa2-1 WIUyh"}).get_text())
    # Length
    # Director
    director = (soup.find("a", {'class': "ipc-metadata-list-item__list-content-item"}).get_text())
    # Stars
    stars = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
    # Rating
    rating = (soup.find("span", {'class': "sc-7ab21ed2-1 jGRxWM"}).get_text())
    rating = float(rating)
    # Number of Reviews
    reviews = (soup.find("span", {'class': "score"}).get_text())
    reviews = reviews.split('K')[0]
    reviews = float(reviews) * 1000
    reviews = int(reviews)
    # Genres
    genres = (soup.find("span", {'class': "ipc-chip__text"}).get_text())
    # Language
    # Country
    # Budget
    meta = (soup.find("div", {'class': "ipc-metadata-list-item__label ipc-metadata-list-item__label--link"}))
    # Gross box Office Revenue
    gross = (soup.find("span", {'class': "ipc-metadata-list-item__list-content-item"}).get_text())
    # Combine
    movie_dict = {
        'Rank': x + 1,
        'ID': 0,
        'Movie Name': Movie_name,
        'Year': year,
        'Length': 0,
        'Director': director,
        'Stars': stars,
        'Rating': rating,
        'Number of Reviewes': reviews,
        'Genres': genres,
        'Language': 0,
        'Country': 0,
        'Budget': 0,
        'Gross box Office Revenue': 0}
    df = df.append(pd.DataFrame.from_records([movie_dict], columns=movie_dict.keys()))
I can't find a way to obtain the missing information. If anybody here has experience with this kind of topic and might be able to share their thoughts, it would help a lot of people. I think the task is not new and has been solved hundreds of times, but IMDb changed the classes and the structure in their HTML.
Thanks in advance.
BeautifulSoup has many functions for searching elements; it is good to read all of the documentation.
You can create more complex code by combining many .find() calls with .parent, etc.
soup.find(text='Language').parent.parent.find('a').text
For some elements you can also use data-testid="...":
soup.find('li', {'data-testid': 'title-details-languages'}).find('a').text
Minimal working code (for The Shawshank Redemption):
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=1a264172-ae11-42e4-8ef7-7fed1973bb8f&pf_rd_r=A453PT2BTBPG41Y0HKM8&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1'
response = requests.get(url)
soup = BS(response.text, 'html.parser')
print('Language:', soup.find(text='Language').parent.parent.find('a').get_text(strip=True))
print('Country of origin:', soup.find(text='Country of origin').parent.parent.find('a').get_text(strip=True))
for name in ('Language', 'Country of origin'):
    value = soup.find(text=name).parent.parent.find('a').get_text(strip=True)
    print(name, ':', value)
print('Language:', soup.find('li', {'data-testid':'title-details-languages'}).find('a').get_text(strip=True))
print('Country of origin:', soup.find('li', {'data-testid':'title-details-origin'}).find('a').get_text(strip=True))
for name, testid in (('Language', 'title-details-languages'), ('Country of origin', 'title-details-origin')):
    value = soup.find('li', {'data-testid': testid}).find('a').get_text(strip=True)
    print(name, ':', value)
Result:
Language: English
Country of origin: United States
Language : English
Country of origin : United States
Language: English
Country of origin: United States
Language : English
Country of origin : United States
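The same data-testid lookup pattern can be exercised offline. A minimal sketch on an invented snippet (the data-testid values on the live IMDb page may differ or change over time):

```python
from bs4 import BeautifulSoup

# Made-up snippet using the same data-testid convention; the attribute
# values on the live IMDb page may differ or change over time.
html = """
<ul>
  <li data-testid="title-details-languages"><a>English</a></li>
  <li data-testid="title-details-origin"><a>United States</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

details = {}
for name, testid in (('Language', 'title-details-languages'),
                     ('Country of origin', 'title-details-origin')):
    li = soup.find('li', {'data-testid': testid})
    # Record None on a miss instead of raising AttributeError.
    details[name] = li.find('a').get_text(strip=True) if li is not None else None

print(details)  # {'Language': 'English', 'Country of origin': 'United States'}
```

data-testid attributes tend to survive restyling better than the generated class names, which is why they make more stable hooks.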

How to make dataset from web scraped variables?

I was trying to scrape a real estate website. The problem is that I can't insert my scraped variables into one dataset. Can anyone help me, please? Thank you!
Here is my code:
html_text1=requests.get('https://www.propertyfinder.ae/en/search?c=1&ob=mr&page=1').content
soup1=BeautifulSoup(html_text1,'lxml')
listings=soup1.find_all('a',class_='card card--clickable')
for listing in listings:
    price = listing.find('p', class_='card__price').text.split()[0]
    price = price.split()[0]
    title = listing.find('h2', class_='card__title card__title-link').text
    property_type = listing.find('p', class_='card__property-amenity card__property-amenity--property-type').text
    bedrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bedrooms').text
    bathrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bathrooms').text
    location = listing.find('p', class_='card__location').text
    dataset = pd.DataFrame({property_type, price, title, bedrooms, bathrooms, location})
    print(dataset)
My output does not come out as one proper table.
However, I want it to look like a DataFrame:
Apartment | 162500 | ...
Townhouse | 162500 | ...
Villa | 7500000 | ...
Villa | 15000000 | ...
The problem with your code is that you are trying to create a dataframe from within the for loop. What you should be doing is creating lists to store these values separately, and then creating a df from these lists.
Here's what the code will look like:
price_lst = []
title_lst = []
propertyType_lst = []
bedrooms_lst = []
bathrooms_lst = []
location_lst = []
listings = soup1.find_all('a',class_='card card--clickable')
for listing in listings:
    price = listing.find('p', class_='card__price').text.split()[0]
    price = price.split()[0]
    price_lst.append(price)
    title = listing.find('h2', class_='card__title card__title-link').text
    title_lst.append(title)
    property_type = listing.find('p', class_='card__property-amenity card__property-amenity--property-type').text
    propertyType_lst.append(property_type)
    bedrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bedrooms').text
    bedrooms_lst.append(bedrooms)
    bathrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bathrooms').text
    bathrooms_lst.append(bathrooms)
    location = listing.find('p', class_='card__location').text
    location_lst.append(location)

dataset = pd.DataFrame(list(zip(propertyType_lst, price_lst, title_lst, bedrooms_lst, bathrooms_lst, location_lst)),
                       columns=['Property Type', 'Price', 'Title', 'Bedrooms', 'Bathrooms', 'Location'])
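Here is a runnable sketch of the same zip-of-lists construction, with dummy values standing in for the scraped fields (all listing values below are invented):

```python
import pandas as pd

# Dummy values standing in for the scraped lists; same shape as above.
propertyType_lst = ['Apartment', 'Villa']
price_lst = ['162,500', '7,500,000']
title_lst = ['Two bedroom flat', 'Luxury villa']
bedrooms_lst = ['2', '5']
bathrooms_lst = ['2', '6']
location_lst = ['Dubai Marina, Dubai', 'Jumeirah, Dubai']

# zip pairs up the i-th element of every list into one row tuple.
dataset = pd.DataFrame(
    list(zip(propertyType_lst, price_lst, title_lst,
             bedrooms_lst, bathrooms_lst, location_lst)),
    columns=['Property Type', 'Price', 'Title', 'Bedrooms', 'Bathrooms', 'Location'])
print(dataset)
```

Each tuple produced by zip becomes one DataFrame row, so the lists must stay in lockstep; that is exactly why every field is appended once per loop iteration above.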
I would recommend working with a bit more structure: use dicts or a list of dicts to store the data from your iteration, and create a data frame at the end:
data = []
for listing in listings:
    price = listing.find('p', class_='card__price').text.split()[0].split()[0]
    title = listing.find('h2').text
    property_type = listing.find('p', class_='card__property-amenity card__property-amenity--property-type').text
    bedrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bedrooms').text
    bathrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bathrooms').text
    location = listing.find('p', class_='card__location').text
    data.append({
        'price': price,
        'title': title,
        'property_type': property_type,
        'bedrooms': bedrooms,
        'bathrooms': bathrooms,
        'location': location
    })
Note: Also check your selections to avoid AttributeErrors:
title=t.text if (t:=listing.find('h2')) else None
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text1=requests.get('https://www.propertyfinder.ae/en/search?c=1&ob=mr&page=1').content
soup1=BeautifulSoup(html_text1,'lxml')
listings=soup1.find_all('a',class_='card card--clickable')
data = []
for listing in listings:
    price = listing.find('p', class_='card__price').text.split()[0]
    price = price.split()[0]
    title = listing.find('h2').text
    property_type = listing.find('p', class_='card__property-amenity card__property-amenity--property-type').text
    bedrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bedrooms').text
    bathrooms = listing.find('p', class_='card__property-amenity card__property-amenity--bathrooms').text
    location = listing.find('p', class_='card__location').text
    data.append({
        'price': price,
        'title': title,
        'property_type': property_type,
        'bedrooms': bedrooms,
        'bathrooms': bathrooms,
        'location': location
    })
dataset=pd.DataFrame(data)
Output
| index | price | title | property_type | bedrooms | bathrooms | location |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 35,000,000 | Fully Upgraded / Private Pool / Prime Location | Villa | 6 | | District One Villas, District One, Mohammed Bin Rashid City, Dubai |
| 1 | 2,600,000 | Vacant / Brand New and Ready / Community View | Villa | 3 | | La Quinta, Villanova, Dubai Land, Dubai |
| 2 | 8,950,000 | Exclusive / Newly Renovated / Prime Location | Villa | 4 | | Jumeirah 3 Villas, Jumeirah 3, Jumeirah, Dubai |
| 3 | 3,500,000 | Brand New / Single Row / Vastu Compliant | Villa | 3 | | Azalea, Arabian Ranches 2, Dubai |
| 4 | 1,455,000 | Limited Units / 3 Yrs Payment Plan / La Violeta TH | Townhouse | 3 | | La Violeta 1, Villanova, Dubai Land, Dubai |
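The None-guard from the note above (title = t.text if (t := listing.find('h2')) else None) can be checked offline. A minimal sketch with one listing deliberately missing its h2 (the markup is invented, not the live site's):

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second listing is missing its <h2> on purpose.
html = """
<a class="card card--clickable"><h2>Two bedroom flat</h2></a>
<a class="card card--clickable"></a>
"""
soup = BeautifulSoup(html, 'html.parser')

titles = []
for listing in soup.find_all('a', class_='card card--clickable'):
    # The walrus keeps it to one lookup and avoids AttributeError on None.
    titles.append(t.text if (t := listing.find('h2')) else None)

print(titles)  # ['Two bedroom flat', None]
```

Without the guard, .text on the second listing's missing h2 would raise AttributeError and abort the whole loop; with it, the gap is recorded as None and the scrape continues.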

How to find correct tags in nested HTML using Beautiful Soup, receiving list index out of range error or empty list

Printing out the tags under ('a') works perfectly to bring out the description for each of the houses on the website. Trying to replicate this for the price using any tag ('price', for example) doesn't work. Printing out everything under 'master-content' reveals all the details, including the price.
from urllib.request import urlopen
from bs4 import BeautifulSoup
convert_page = 'https://www.property24.co.mu/property-for-sale'
page = urlopen(convert_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('div', attrs={'class': 'master-content'})
textContent = []
try_again = name_box.find_all("price")
print (try_again)
house_name_list = soup.find(class_="resultsControl")
house_descriptions = house_name_list.find_all('a')
house_prices = house_name_list.find_all('price')
Output expected:
Rs 10 023 304
Rs 46 697 000
Rs 5 323 977
Output received:
[]
When trying to iterate over the list:
list index out of range (due to the list being empty)
The problem is that house_name_list.find_all('price') will try to find all <price> tags, not tags with class="price". You can change it to house_name_list.find_all(class_="price") to get all tags with prices.
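The difference is easy to verify offline. A minimal sketch on an invented snippet (the markup below is a stand-in, not the live property24 page):

```python
from bs4 import BeautifulSoup

# Made-up snippet: prices are ordinary tags carrying class="price",
# not a custom <price> element.
html = """
<div class="resultsControl">
  <span class="price">Rs 10 023 304</span>
  <span class="price">Rs 46 697 000</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
results = soup.find(class_='resultsControl')

by_tag = results.find_all('price')           # looks for <price> tags
by_class = results.find_all(class_='price')  # looks for class="price"

print(by_tag)                            # []
print([p.get_text() for p in by_class])  # ['Rs 10 023 304', 'Rs 46 697 000']
```

A positional argument to find_all is always interpreted as a tag name; class membership has to go through class_ (or a CSS selector like select('.price')).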
To tie descriptions, prices and titles together, you can use the zip() function:
from urllib.request import urlopen
from bs4 import BeautifulSoup
convert_page = 'https://www.property24.co.mu/property-for-sale'
page = urlopen(convert_page)
soup = BeautifulSoup(page, 'html.parser')
for a, desc, price in zip(soup.select('.propertyTileWrapper > a:nth-of-type(1)'),
                          soup.select('.description'),
                          soup.select('.price')):
    print(a['title'])
    print(price.get_text(strip=True))
    print(desc.get_text(strip=True))
    print('-' * 160)
Prints:
3 Bedroom Apartment / Flat for sale in Roches Noires
Rs 10 023 304
Nice apartment located on the second floor within a secured residence in Azuri. The apartment offers three bedrooms, 2 bathrooms and a...
----------------------------------------------------------------------------------------------------------------------------------------------------------------
3 Bedroom Apartment / Flat for sale in Mon Choisy
Rs 46 697 000
This apartment with island decor exudes elegance and refinement. With 3 beautiful bedrooms in suite, the apartment offers a stunning view of...
----------------------------------------------------------------------------------------------------------------------------------------------------------------
3 Bedroom Apartment / Flat for sale in Flic En Flac
Rs 5 323 977
Project of 5 Duplex 120 m2 which comprises of: - Ground floor: 1 bedroom and 1 bathroom , kitchen, lounge /dining -...
----------------------------------------------------------------------------------------------------------------------------------------------------------------
4 Bedroom House for sale in Grande Rivière Noire
Rs 32 339 553
This splendid furnished house situated in Black River, its really spacious. It comprises of entrance hall, lounge (separate), dining (separate), guest cloakroom,...
----------------------------------------------------------------------------------------------------------------------------------------------------------------
4 Bedroom House for sale in Belle Vue Harel
Rs 40 343 794
This Luxury villa located right at the top of the Hillside Estate. It consists of 4 Bedrooms (3 en-suite), Lounge /Dining, TV...
----------------------------------------------------------------------------------------------------------------------------------------------------------------
...and so on.

Script produces wrong results when linebreak comes into play

I've written a script in Python to scrape some disorganized content located within b tags and their next_sibling from a webpage. The thing is, my script fails when line breaks come between them. I'm trying to extract the titles and their corresponding descriptions from that page, starting from CHIEF COMPLAINT: Bright red blood per rectum to just before Keywords:.
Website address
I've tried so far with:
import requests
from bs4 import BeautifulSoup
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
for item in soup.select_one("hr").find_next_siblings('b'):
    print(item.text, item.next_sibling)
The portion of output giving me unwanted results are like:
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
How can I get the titles and their concerning description accordingly?
Here's a scraper that's more robust compared to yesterday's solutions.
How to loop through scraping multiple documents on multiple web pages using BeautifulSoup?
How can I grab the entire body text from a web page using BeautifulSoup?
It extracts, title, description and all sections properly
import re
import copy
import requests
from bs4 import BeautifulSoup, Tag, Comment, NavigableString
from urllib.parse import urljoin
from pprint import pprint
import itertools
import concurrent
from concurrent.futures import ThreadPoolExecutor
BASE_URL = 'https://www.mtsamples.com'
def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    soup = copy.copy(soup)
    h1 = soup.select_one('h1')
    kw_re = re.compile('.*Keywords.*', flags=re.IGNORECASE)
    kw = soup.find('b', text=kw_re)
    for el in (*h1.previous_siblings, *kw.next_siblings):
        el.extract()
    kw.extract()
    for ad in soup.select('[id*="ad"]'):
        ad.extract()
    for script in soup.select('script'):
        script.extract()
    for c in h1.parent.children:
        if isinstance(c, Comment):
            c.extract()
    return h1.parent
def extract_meta(soup: BeautifulSoup) -> dict:
    h1 = soup.select_one('h1')
    title = h1.text.strip()
    desc_parts = []
    desc_re = re.compile('.*Description.*', flags=re.IGNORECASE)
    desc = soup.find('b', text=desc_re)
    hr = soup.select_one('hr')
    for s in desc.next_siblings:
        if s is hr:
            break
        if isinstance(s, NavigableString):
            desc_parts.append(str(s).strip())
        elif isinstance(s, Tag):
            desc_parts.append(s.text.strip())
    description = '\n'.join(p.strip() for p in desc_parts if p.strip())
    return {
        'title': title,
        'description': description
    }
def extract_sections(soup: BeautifulSoup) -> list:
    titles = [b for b in soup.select('b') if b.text.isupper()]
    parts = []
    for t in titles:
        title = t.text.strip(': ').title()
        text_parts = []
        for s in t.next_siblings:
            # walk forward until we see another title
            if s in titles:
                break
            if isinstance(s, Comment):
                continue
            if isinstance(s, NavigableString):
                text_parts.append(str(s).strip())
            if isinstance(s, Tag):
                text_parts.append(s.text.strip())
        text = '\n'.join(p for p in text_parts if p.strip())
        p = {
            'title': title,
            'text': text
        }
        parts.append(p)
    return parts
def extract_page(url: str) -> dict:
    soup = make_soup(url)
    clean = clean_soup(soup)
    meta = extract_meta(clean)
    sections = extract_sections(clean)
    return {
        **meta,
        'sections': sections
    }
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
page = extract_page(url)
pprint(page, width=2000)
output:
{'description': 'Status post colonoscopy. After discharge, experienced bloody bowel movements and returned to the emergency department for evaluation.\n(Medical Transcription Sample Report)',
'sections': [{'text': 'Bright red blood per rectum', 'title': 'Chief Complaint'},
# some elements removed for brevity
{'text': '', 'title': 'Labs'},
{'text': 'WBC count: 6,500 per mL\nHemoglobin: 10.3 g/dL\nHematocrit:31.8%\nPlatelet count: 248 per mL\nMean corpuscular volume: 86.5 fL\nRDW: 18%', 'title': 'Cbc'},
{'text': 'Sodium: 131 mmol/L\nPotassium: 3.5 mmol/L\nChloride: 98 mmol/L\nBicarbonate: 23 mmol/L\nBUN: 11 mg/dL\nCreatinine: 1.1 mg/dL\nGlucose: 105 mg/dL', 'title': 'Chem 7'},
{'text': 'PT 15.7 sec\nINR 1.6\nPTT 29.5 sec', 'title': 'Coagulation Studies'},
{'text': 'The patient receive ... ula.', 'title': 'Hospital Course'}],
'title': 'Sample Type / Medical Specialty: Gastroenterology\nSample Name: Blood per Rectum'}
Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = urlopen(url)
html = res.read()
soup = BeautifulSoup(html,'html.parser')
# Cut the division containing required text,used Right Click and Inspect element in broweser to find the respective div/tag
sampletext_div = soup.find('div', {'id': "sampletext"})
print(sampletext_div.find('h1').text) # TO print header
Output:
Sample Type / Medical Specialty: Gastroenterology
Sample Name: Blood per Rectum
Code:
# Find all the <b> tags
b_all = sampletext_div.findAll('b')
for b in b_all[4:]:
    print(b.text, b.next_sibling)
Output:
CHIEF COMPLAINT: Bright red blood per rectum
HISTORY OF PRESENT ILLNESS: This 73-year-old woman had a recent medical history significant for renal and bladder cancer, deep venous thrombosis of the right lower extremity, and anticoagulation therapy complicated by lower gastrointestinal bleeding. Colonoscopy during that admission showed internal hemorrhoids and diverticulosis, but a bleeding site was not identified. Five days after discharge to a nursing home, she again experienced bloody bowel movements and returned to the emergency department for evaluation.
REVIEW OF SYMPTOMS: No chest pain, palpitations, abdominal pain or cramping, nausea, vomiting, or lightheadedness. Positive for generalized weakness and diarrhea the day of admission.
PRIOR MEDICAL HISTORY: Long-standing hypertension, intermittent atrial fibrillation, and hypercholesterolemia. Renal cell carcinoma and transitional cell bladder cancer status post left nephrectomy, radical cystectomy, and ileal loop diversion 6 weeks prior to presentation, postoperative course complicated by pneumonia, urinary tract infection, and retroperitoneal bleed. Deep venous thrombosis 2 weeks prior to presentation, management complicated by lower gastrointestinal bleeding, status post inferior vena cava filter placement.
MEDICATIONS: Diltiazem 30 mg tid, pantoprazole 40 mg qd, epoetin alfa 40,000 units weekly, iron 325 mg bid, cholestyramine. Warfarin discontinued approximately 10 days earlier.
ALLERGIES: Celecoxib (rash).
SOCIAL HISTORY: Resided at nursing home. Denied alcohol, tobacco, and drug use.
FAMILY HISTORY: Non-contributory.
PHYSICAL EXAM: <br/>
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
COAGULATION STUDIES: <br/>
HOSPITAL COURSE: The patient received 1 liter normal saline and diltiazem (a total of 20 mg intravenously and 30 mg orally) in the emergency department. Emergency department personnel made several attempts to place a nasogastric tube for gastric lavage, but were unsuccessful. During her evaluation, the patient was noted to desaturate to 80% on room air, with an increase in her respiratory rate to 34 breaths per minute. She was administered 50% oxygen by nonrebreadier mask, with improvement in her oxygen saturation to 89%. Computed tomographic angiography was negative for pulmonary embolism.
Keywords:
gastroenterology, blood per rectum, bright red, bladder cancer, deep venous thrombosis, colonoscopy, gastrointestinal bleeding, diverticulosis, hospital course, lower gastrointestinal bleeding, nasogastric tube, oxygen saturation, emergency department, rectum, thrombosis, emergency, department, gastrointestinal, blood, bleeding, oxygen,
NOTE : These transcribed medical transcription sample reports and examples are provided by various users and
are for reference purpose only. MTHelpLine does not certify accuracy and quality of sample reports.
These transcribed medical transcription sample reports may include some uncommon or unusual formats;
this would be due to the preference of the dictating physician. All names and dates have been
changed (or removed) to keep confidentiality. Any resemblance of any type of name or date or
place or anything else to real world is purely incidental.
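The `<br/>` problem in the simpler next_sibling approach above can be worked around by walking forward past line breaks and blank strings until real text turns up. A minimal sketch on an invented snippet (not the live mtsamples page):

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up snippet reproducing the <br/> layout; not the live mtsamples page.
html = '<b>MEDICATIONS:</b> Diltiazem 30 mg tid.<b>LABS:</b> <br/>CBC: WBC 6,500'
soup = BeautifulSoup(html, 'html.parser')

def text_after(tag):
    # Walk forward past <br/> and blank strings; stop at the next <b> title.
    for sib in tag.next_siblings:
        if getattr(sib, 'name', None) == 'b':
            break
        if isinstance(sib, NavigableString) and sib.strip():
            return sib.strip()
    return None

for b in soup.find_all('b'):
    print(b.get_text(), text_after(b))
```

Where a plain item.next_sibling returns the `<br/>` tag itself, this walk skips it and returns the first non-empty text, or None when a title genuinely has no body.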

Cannot scrape specific content from site - BeautifulSoup 4

I am having a hard time scraping this link via Python 3, BeautifulSoup 4:
http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining
I only want to get this section.
When you are in ...
Capitol City Grille
This downtown Lansing restaurant offers ...
Capitol City Grille Lounge
For a glass of wine or a ...
Room Service
If you prefer ...
I have this code
for rest in dining_page_soup.select("div.copy_left p strong"):
    if rest.next_sibling is not None:
        if rest.next_sibling.next_sibling is not None:
            title = rest.text
            desc = rest.next_sibling.next_sibling
            print("Title: " + title)
            print(desc)
But it gives me TypeError: 'NoneType' object is not callable
on desc = rest.next_sibling.next_sibling even though I have an if statement to check whether it is None or not.
Here is a very simple solution:
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining")
data = r.text
soup = BeautifulSoup(data, 'html.parser')
for found_text in soup.select('div.copy_left'):
    print(found_text.text)
UPDATE
According to an improvement of the question, here is a solution using re.
A specific workaround has to be made for the 1st paragraph "When you..." since it does not respect the structure of the other paragraphs.
for tag in soup.find_all(re.compile("^strong")):
    title = tag.text
    desc = tag.next_sibling.next_sibling
    print("Title: " + title)
    print(desc)
Output
Title: Capitol City Grille
This downtown Lansing restaurant offers delicious, contemporary
American cuisine in an upscale yet relaxed environment. You can enjoy
dishes that range from fluffy pancakes to juicy filet mignon steaks.
Breakfast and lunch buffets are available, as well as an à la carte
menu.
Title: Capitol City Grille Lounge
For a glass of wine or a hand-crafted cocktail and great conversation,
spend an afternoon or evening at Capitol City Grille Lounge with
friends or colleagues.
Title: Room Service
If you prefer to dine in the comfort of your own room, order from the
room service menu.
Title: Menus
Breakfast Menu
Title: Capitol City Grille Hours
Breakfast, 6:30-11 a.m.
Title: Capitol City Grille Lounge Hours
Mon-Thu, 11 a.m.-11 p.m.
Title: Room Service Hours
Daily, 6:30 a.m.-2 p.m. and 5-10 p.m.
If you don't mind using xpath, this should work
import requests
from lxml import html
url = "http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining"
page = requests.get(url).text
tree = html.fromstring(page)
xp_t = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/text()"
xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()[not(following-sibling::strong)]"
titles = tree.xpath(xp_t)
descriptions = tree.xpath(xp_d) # still contains garbage like '\r\n'
descriptions = [d.strip() for d in descriptions if d.strip()]
for t, d in zip(titles, descriptions):
    print("{title}: {description}".format(title=t, description=d))
Here descriptions contains 3 elements: "This downtown...", "For a glass...", "If you prefer...".
If you need also "When you are in the mood...", replace with this:
xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()"
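The attribute test in those expressions (@class, not #class) can be verified offline with lxml alone. A minimal sketch on an invented snippet mirroring the copy_left layout (not the live Radisson page):

```python
from lxml import html as lxml_html

# Made-up snippet with the same class layout; not the live Radisson page.
page = """
<div class="copy_left">
  <p><strong>Capitol City Grille</strong><br/>
  This downtown Lansing restaurant offers contemporary American cuisine.</p>
</div>
"""
tree = lxml_html.fromstring(page)

# XPath selects by attribute with @class; '#' has no meaning in XPath.
titles = tree.xpath("//*[@class='copy_left']//strong/text()")
print(titles)  # ['Capitol City Grille']
```

Note that @class matches the attribute string exactly, so an element with class="copy_left extra" would need contains(@class, 'copy_left') instead.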
