Right now I am scraping product reviews from this website:
https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&search=1
So far I only manage to get the reviews on the first page.
import pandas as pd
from urllib.request import Request, urlopen as uReq  # packages for web scraping
from bs4 import BeautifulSoup as soup

def make_soup(website):
    req = Request(website, headers={'User-Agent': 'Mozilla/5.0'})
    uClient = uReq(req)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, 'html.parser')
    return page_soup

lazada_url = 'https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&search=1'
website = make_soup(lazada_url)

news_headlines = pd.DataFrame(columns=['reviews', 'sentiment', 'score'])
headlines = website.findAll('div', attrs={"class": "item-content"})

n = 0
for item in headlines:
    top = item.div
    #print(top)
    #print()
    text_headlines = top.text
    print(text_headlines)
    print()
    n += 1
    news_headlines.loc[n - 1, 'reviews'] = text_headlines
This only returns the first page of reviews. How do I do it for all pages? There is no page number in the URL for me to loop over; you can check the URL yourselves. Thank you :)
I like this phone very much and it's global version. I recommend this phone for who like gaming. Delivery just took 3 days only. Thanks Lazada
Item was received in just two days and was wonderfully wrapped. Thanks for the excellent services Lazada!
Very happy with the phone. It's original, it arrived in good condition. Built quality is superb for a budget phone.
The delivery is very fast just take one day to reach at my home. However, the tax invoice is not attached. How do I get the tax invoice?
great deal from lazada. anyway, i do not find any tax invoice. please do email me the tax invoice. thank you.
You can scrape the pagination buttons at the bottom of the reviews to find the minimum and maximum page numbers:
import requests
from bs4 import BeautifulSoup as soup

def get_page_reviews(content: soup) -> dict:
    rs = content.find('div', {'class': 'mod-reviews'}).find_all('div', {'class': 'item'})
    reviews = [i.find('div', {'class': 'item-content'}).find('div', {'class': 'content'}).text for i in rs]
    stars = [len(c.find('div', {'class': 'top'}).find_all('img')) for c in rs]
    _by = [i.find('div', {'class': 'middle'}).find('span').text for i in rs]
    return {'stars': stars, 'reviews': reviews, 'authors': _by}

d = soup(requests.get('https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&search=1').text, 'html.parser')

# the numbered pagination buttons give the minimum and maximum page numbers
results = list(map(int, filter(None, [i.text for i in d.find_all('button', {'class': 'next-pagination-item'})])))

final_result = []
for i in range(min(results), max(results) + 1):
    new_url = f'https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&search={i}'
    # now, new_url can be used to request the next page of reviews
    r = get_page_reviews(soup(requests.get(new_url).text, 'html.parser'))
    final_result.extend({'stars': a, 'author': b, 'review': c} for a, b, c in zip(r['stars'], r['authors'], r['reviews']))
Output (for first page):
[{'stars': 5, 'author': 'by Ridwan R.', 'review': "I like this phone very much and it's global version. I recommend this phone for who like gaming. Delivery just took 3 days only. Thanks Lazada"}, {'stars': 5, 'author': 'by Razli A.', 'review': 'Item was received in just two days and was wonderfully wrapped. Thanks for the excellent services Lazada!'}, {'stars': 5, 'author': 'by Nur F.', 'review': "Very happy with the phone. It's original, it arrived in good condition. Built quality is superb for a budget phone."}, {'stars': 5, 'author': 'by Muhammad S.', 'review': 'The delivery is very fast just take one day to reach at my home. However, the tax invoice is not attached. How do I get the tax invoice?'}, {'stars': 5, 'author': 'by Xavier Y.', 'review': 'great deal from lazada. anyway, i do not find any tax invoice. please do email me the tax invoice. thank you.'}]
What you need to do is use the click() method in Selenium.
Selenium is a portable software-testing framework for web applications that lets you drive a real browser and retrieve the page source you want.
On the given URL there are page buttons for the reviews, so just locate a button by XPath, class, or id with find_element_by_*(...) and call .click() on it. This takes you to the next page.
Here is my sample code :D
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import time
from selenium.webdriver.chrome.options import Options

url = 'https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&search=1'

chrome_options = Options()
#chrome_options.add_argument("--headless")
browser = webdriver.Chrome('/Users/baejihwan/Documents/chromedriver',
                           chrome_options=chrome_options)
browser.get(url)
time.sleep(0.1)

# first page of reviews
page_soup = soup(browser.page_source, 'html.parser')
headlines = page_soup.findAll('div', attrs={"class": "item-content"})
for item in headlines:
    top = item.div
    text_headlines = top.text
    print(text_headlines)

# click the "2" button in the review pagination to load the second page
browser.find_element_by_xpath('//*[@id="module_product_review"]/div/div[3]/div[2]/div/div/button[2]').click()
time.sleep(0.1)  # give the second page of reviews a moment to load

page_soups = soup(browser.page_source, 'html.parser')
headline = page_soups.findAll('div', attrs={"class": "item-content"})
for item in headline:
    top = item.div
    text_headlines = top.text
    print(text_headlines)
Output:
I like this phone very much and it's global version. I recommend this phone for who like gaming. Delivery just took 3 days only. Thanks Lazada
Item was received in just two days and was wonderfully wrapped. Thanks for the excellent services Lazada!
Very happy with the phone. It's original, it arrived in good condition. Built quality is superb for a budget phone.
The delivery is very fast just take one day to reach at my home. However, the tax invoice is not attached. How do I get the tax invoice?
great deal from lazada. anyway, i do not find any tax invoice. please do email me the tax invoice. thank you.
Fast delivery. Ordered Sunday night, arrived Wednesday morning. That was only because Tuesday was a public holiday.
The phone was sealed with bubble wrap and a box.
Already checked, it is an original Malaysian set.
Got a free tempered glass. It's fine.
The phone is OK to use so far.
I have been using an iPhone all this while, so using Android feels a bit awkward.
The invoice was not included.
The battery also runs out quite fast..
I bought the Xiaomi Mi A1 smartphone and received it today. But the phone is broken. It cannot be turned on.
Bought it on 1/6 for RM599 and got a free gift, a SanDisk 64GB USB OTG Type-C drive.
Fast delivery, ordered 1/6 and it arrived 4/6, and that was only because of the weekend; otherwise if you order today it arrives tomorrow.
Excellent packaging, there were air bags inside the box, so it was really secure.
The phone box was sealed; inside it came with a free ordinary screen protector, a Type-C cable and a 3-pin charger.
Straight out of the box it updated to Android Oreo; very satisfied overall. Definitely worth buying. Kudos to Lazada.
i submitted the order on on sunday and i get it tuesday morning, even the despatch guy called me at 830am just to make sure if im already at the office. super reliable. for the phone, well i got it for RM599. what could you possibly asked for more? hehehe
Purchased Xiaomi Mi A1 from Official store with an offer of "Free gift SanDisk Ultra 64GB Dual USB Drive 3.0 OTG Type C Flash Drive". But they delivered only USB drive 2.0
I've done this in an extremely naive way! It would be better to define a function that reads in the HTML and parses out the data you want. This code only grabs the reviews up to page 2, but you can modify it to get all the reviews to the end, as sketched below! :D If you have questions about this code, please leave a comment!
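For reference, a minimal sketch of that modification: instead of clicking once, keep clicking the numbered pagination buttons until the last page, collecting the reviews each time. This assumes the buttons carry the next-pagination-item class used in the other answer and reuses the chromedriver path above; the fixed sleeps are crude and may need tuning.
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import time

url = 'https://www.lazada.com.my/products/xiaomi-mi-a1-4gb-ram-32gb-rom-i253761547-s336359472.html?spm=a2o4k.searchlistcategory.list.64.71546883QBZiNT&search=1'
browser = webdriver.Chrome('/Users/baejihwan/Documents/chromedriver')
browser.get(url)
time.sleep(2)

def scrape_current_page(all_reviews):
    # parse whatever reviews are currently rendered and append their text
    page_soup = soup(browser.page_source, 'html.parser')
    for item in page_soup.findAll('div', attrs={"class": "item-content"}):
        all_reviews.append(item.div.text)

all_reviews = []
scrape_current_page(all_reviews)

# the numbered pagination buttons share the class used in the other answer;
# the arrow buttons have no digit text, so they are filtered out here
page_numbers = [b.text for b in browser.find_elements_by_class_name('next-pagination-item') if b.text.isdigit()]

for page in page_numbers[1:]:  # pages 2..N (page 1 is already scraped)
    # re-locate the button each time, since the widget re-renders after a click
    browser.find_element_by_xpath(f'//button[contains(@class, "next-pagination-item") and text()="{page}"]').click()
    time.sleep(1)  # crude wait; an explicit WebDriverWait would be more reliable
    scrape_current_page(all_reviews)

print(len(all_reviews), 'reviews collected')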
Hope this helps!
I am trying to extract data from a website using the BeautifulSoup and requests packages, where I want to extract the links and their contents.
So far I am able to extract the list of links that exist on a defined URL, but I do not know how to enter each link and extract its text.
The image below describes my problem: the text and the image are the link to the whole article.
code:
import requests
from bs4 import BeautifulSoup

url = "https://www.annahar.com/english/section/186-mena"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, features="lxml")
print(soup.prettify())

# scraping html tags such as Title, Links, Publication date
news = soup.select('div#listingDiv44083 div.article')  # the article blocks on the listing page
for index, new in enumerate(news):
    published_date = new.find('span', class_="article__time-stamp").text
    title = new.find('h3', class_="article__title").text
    link = new.find('a', class_="article__link").attrs['href']
    print(f" publish_date: {published_date}")
    print(f" title: {title}")
    print(f" link: {link}")
result :
publish_date:
06-10-2020 | 20:53
title:
18 killed in bombing in Turkish-controlled Syrian town
link: https://www.annahar.com/english/section/186-mena/06102020061027020
My question is how to continue from here in order to enter each link and extract its content?
the expected result :
publish_date:
06-10-2020 | 20:53
title:
18 killed in bombing in Turkish-controlled Syrian town
link: https://www.annahar.com/english/section/186-mena/06102020061027020
description:
ANKARA: An explosives-laden truck ignited Tuesday on a busy street in a northern #Syrian town controlled by #Turkey-backed opposition fighters, killing at least 18 people and wounding dozens, Syrian opposition activists reported.
The blast in the town of al-Bab took place near a bus station where people often gather to travel from one region to another, according to the opposition’s Civil Defense, also known as White Helmets.
where the description exists inside the linked article page.
Add an additional request inside your loop that fetches the article page, and grab the description there:
page = requests.get(link)
soup = BeautifulSoup(page.content, features = "lxml")
description = soup.select_one('div.articleMainText').get_text()
print(f" description: {description}")
Example
import requests
from bs4 import BeautifulSoup

url = "https://www.annahar.com/english/section/186-mena"
html_text = requests.get(url)
soup = BeautifulSoup(html_text.content, features="lxml")
# print(soup.prettify())

# scraping html tags such as Title, Links, Publication date
for index, new in enumerate(soup.select('div#listingDiv44083 div.article')):
    published_date = new.find('span', class_="article__time-stamp").get_text(strip=True)
    title = new.find('h3', class_="article__title").get_text(strip=True)
    link = new.find('a', class_="article__link").attrs['href']

    # follow the link and grab the description from the article page
    page = requests.get(link)
    soup = BeautifulSoup(page.content, features="lxml")
    description = soup.select_one('div.articleMainText').get_text()

    print(f" publish_date: {published_date}")
    print(f" title: {title}")
    print(f" link: {link}")
    print(f" description: {description}", '\n')
You have to grab all of the links that lead to the articles, then loop over them and grab the parts you're interested in.
Here's how:
import time
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    requests.get("https://www.annahar.com/english/section/186-mena").content,
    "lxml"
)

follow_links = [
    a["href"] for a in
    soup.find_all("a", class_="article__link")
    if "#" not in a["href"]
]

for link in follow_links:
    s = BeautifulSoup(requests.get(link).content, "lxml")
    date_published = s.find("span", class_="date").getText(strip=True)
    title = s.find("h1", class_="article-main__title").getText(strip=True)
    article_body = s.find("div", {"id": "bodyToAddTags"}).getText()
    print(f"{date_published} {title}\n\n{article_body}\n", "-" * 80)
    time.sleep(2)
Output (shortened for brevity):
08-10-2020 | 12:35 Iran frees rights activist after more than 8 years in prison
TEHRAN: Iran has released a prominent human rights activist who campaigned against the death penalty, Iranian media reported Thursday.The semiofficial ISNA news agency quoted judiciary official Sadegh Niaraki as saying that Narges Mohammadi was freed late Wednesday after serving 8 1/2 years in prison. She was sentenced to 10 years in 2016 while already incarcerated.Niaraki said Mohammadi was released based on a law that allows a prison sentence to be commutated if the related court agrees.In July, rights group Amnesty International demanded Mohammadi’s immediate release because of serious pre-existing health conditions and showing suspected COVID-19 symptoms. The Thursday report did not refer to her possible illness.Mohammadi was sentenced in Tehran’s Revolutionary Court on charges including planning crimes to harm the security of Iran, spreading propaganda against the government and forming and managing an illegal group.She was in a prison in the northwestern city of Zanjan, some 280 kilometers (174 miles) northwest of the capital Tehran.Mohammadi was close to Iranian Nobel Peace Prize laureate Shirin Ebadi, who founded the banned Defenders of Human Rights Center. Ebadi left Iran after the disputed re-election of then-President Mahmoud Ahmadinejad in 2009, which touched off unprecedented protests and harsh crackdowns by authorities.In 2018, Mohammadi, an engineer and physicist, was awarded the 2018 Andrei Sakharov Prize, which recognizes outstanding leadership or achievements of scientists in upholding human rights.
--------------------------------------------------------------------------------
...
I am looking to use Beautiful Soup to scrape the Fujitsu news update page: https://www.fujitsu.com/uk/news/pr/2020/
I only want to extract the information under the headings of the current month and previous month.
For a particular month (e.g. November), I am trying to extract into a list:
- the Title
- the URL
- the text
for each news briefing (so a list of lists).
My attempt so far is as follow (showing only previous month for simplicity):
import datetime
import calendar
import requests
from bs4 import BeautifulSoup

today = datetime.datetime.today()
year_str = str(today.year)
current_m = today.month
previous_m = current_m - 1
current_m_str = calendar.month_name[current_m]
previous_m_str = calendar.month_name[previous_m]

URL = 'https://www.fujitsu.com/uk/news/pr/' + year_str + '/'
resp = requests.get(URL)
soup = BeautifulSoup(resp.text, 'lxml')

previous_m_body = soup.find('h3', text=previous_m_str)
if previous_m_body is not None:
    for sib in previous_m_body.find_next_siblings():
        if sib.name == "h3":
            break
        else:
            previous_m_text = str(sib.text)
            print(previous_m_text)
However, this generates one long string with newlines and no separation between title, text, and URL:
Fujitsu signs major contract with Scottish Government to deliver election e-Counting solution London, United Kingdom, November 30, 2020 - Fujitsu, a leading digital transformation company, has today announced a major contract with the Scottish Government and Scottish Local...
Fujitsu Introduces Ultra-Compact, 50A PCB Relay for Medium-to-Heavy Automotive Loads Hoofddorp, EMEA, November 11, 2020 - Fujitsu Components Europe has expanded its automotive relay offering with a new 12VDC PCB relay featuring.......
I have attached an image of the page DOM.
Try this:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.fujitsu.com/uk/news/pr/2020/").text
all_lists = BeautifulSoup(html, "html.parser").find_all("ul", class_="filterlist")

news = []
for unordered_list in all_lists:
    for list_item in unordered_list.find_all("li"):
        news.append(
            [
                list_item.find("a").getText(),
                f"https://www.fujitsu.com{list_item.find('a')['href']}",
                list_item.getText(strip=True)[len(list_item.find("a").getText()):],
            ]
        )

for news_item in news:
    print("\n".join(news_item))
    print("-" * 80)
Output (shortened for brevity):
Fujitsu signs major contract with Scottish Government to deliver election e-Counting solution
https://www.fujitsu.com/uk/news/pr/2020/fs-20201130.html
London, United Kingdom, November 30, 2020- Fujitsu, a leading digital transformation company, has today announced a major contract with the Scottish Government and Scottish Local Authorities to support the electronic counting (e-Counting) of ballot papers at the Scottish Local Government elections in May 2022.Fujitsu Introduces Ultra-Compact, 50A PCB Relay for Medium-to-Heavy Automotive LoadsHoofddorp, EMEA, November 11, 2020- Fujitsu Components Europe has expanded its automotive relay offering with a new 12VDC PCB relay featuring a switching capacity of 50A at 14VDC. The FBR53-HC offers a higher contact rating than its 40A FBR53-HW counterpart, yet occupies the same 12.1 x 15.5 x 13.7mm footprint and weighs the same 6g.
--------------------------------------------------------------------------------
and more ...
EDIT:
To get just the last two months, all you need is the first two ul items from the soup. So, add [:2] to the first for loop, like this:
for unordered_list in all_lists[:2]:
    # the rest of the loop body goes here
Here I modified your code and combined your bs4 code with Selenium. Selenium is very powerful for scraping dynamic or JavaScript-based websites, and you can use it together with BeautifulSoup to make your life easier. Now it will give you output for all months.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.maximize_window()
url = "https://www.fujitsu.com/uk/news/pr/2020/"  # change the url if you want results for a different year
driver.get(url)

# now your bs4 code starts; it will give you output from the current month back through all previous months
soup = BeautifulSoup(driver.page_source, "html.parser")

# here I am getting all the month names, from January to November
months = soup.find_all('h3')
for month in months:
    month = month.text
    print(f"month_name : {month}\n")

# here we are getting all description text from the current month back through all previous months
description_texts = soup.find_all('ul', class_='filterlist')
for description_text in description_texts:
    description_texts = description_text.text.replace('\n', '')
    print(f"description_text: {description_texts}")
output:
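If, as in the original question, only the current and previous months are wanted, one possible refinement (a sketch, assuming each h3 month heading is followed by its ul.filterlist sibling, as the other answers suggest) is to pair each heading with its list and keep just the two target months:
import datetime
import calendar
import requests
from bs4 import BeautifulSoup

today = datetime.datetime.today()
# previous month computed the same way as in the question (note: breaks in January)
wanted = {calendar.month_name[today.month], calendar.month_name[today.month - 1]}

soup = BeautifulSoup(requests.get("https://www.fujitsu.com/uk/news/pr/2020/").text, "html.parser")

for heading in soup.find_all('h3'):
    if heading.get_text(strip=True) not in wanted:
        continue
    # assumption: the ul.filterlist for a month directly follows its h3 heading
    month_list = heading.find_next('ul', class_='filterlist')
    print(heading.get_text(strip=True))
    for list_item in month_list.find_all('li'):
        title = list_item.find('a').getText()
        link = f"https://www.fujitsu.com{list_item.find('a')['href']}"
        text = list_item.getText(strip=True)[len(title):]
        print(title, link, text, sep='\n')
        print('-' * 80)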
I've written a script in Python to scrape some disorganized content located within b tags and their next_sibling from a webpage. The thing is, my script fails when line breaks come in between. I'm trying to extract the titles and their corresponding descriptions from that page, starting from CHIEF COMPLAINT: Bright red blood per rectum to just before Keywords:.
Website address
I've tried so far with:
import requests
from bs4 import BeautifulSoup

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
for item in soup.select_one("hr").find_next_siblings('b'):
    print(item.text, item.next_sibling)
The portion of output giving me unwanted results are like:
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
How can I get the titles and their corresponding descriptions?
Here's a scraper that's more robust compared to yesterday's solutions.
How to loop through scraping multiple documents on multiple web pages using BeautifulSoup?
How can I grab the entire body text from a web page using BeautifulSoup?
It extracts the title, description and all sections properly.
import re
import copy
import requests
from bs4 import BeautifulSoup, Tag, Comment, NavigableString
from urllib.parse import urljoin
from pprint import pprint
import itertools
import concurrent
from concurrent.futures import ThreadPoolExecutor

BASE_URL = 'https://www.mtsamples.com'

def make_soup(url: str) -> BeautifulSoup:
    res = requests.get(url)
    res.raise_for_status()
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    return soup

def clean_soup(soup: BeautifulSoup) -> BeautifulSoup:
    soup = copy.copy(soup)
    h1 = soup.select_one('h1')
    kw_re = re.compile('.*Keywords.*', flags=re.IGNORECASE)
    kw = soup.find('b', text=kw_re)
    # drop everything before the <h1> and after the "Keywords" marker
    for el in (*h1.previous_siblings, *kw.next_siblings):
        el.extract()
    kw.extract()
    # drop ad containers, scripts and HTML comments
    for ad in soup.select('[id*="ad"]'):
        ad.extract()
    for script in soup.find_all('script'):
        script.extract()
    for c in h1.parent.children:
        if isinstance(c, Comment):
            c.extract()
    return h1.parent

def extract_meta(soup: BeautifulSoup) -> dict:
    h1 = soup.select_one('h1')
    title = h1.text.strip()
    desc_parts = []
    desc_re = re.compile('.*Description.*', flags=re.IGNORECASE)
    desc = soup.find('b', text=desc_re)
    hr = soup.select_one('hr')
    # collect the description text between the "Description" label and the <hr>
    for s in desc.next_siblings:
        if s is hr:
            break
        if isinstance(s, NavigableString):
            desc_parts.append(str(s).strip())
        elif isinstance(s, Tag):
            desc_parts.append(s.text.strip())
    description = '\n'.join(p.strip() for p in desc_parts if p.strip())
    return {
        'title': title,
        'description': description
    }

def extract_sections(soup: BeautifulSoup) -> list:
    titles = [b for b in soup.select('b') if b.text.isupper()]
    parts = []
    for t in titles:
        title = t.text.strip(': ').title()
        text_parts = []
        for s in t.next_siblings:
            # walk forward until we see another title
            if s in titles:
                break
            if isinstance(s, Comment):
                continue
            if isinstance(s, NavigableString):
                text_parts.append(str(s).strip())
            if isinstance(s, Tag):
                text_parts.append(s.text.strip())
        text = '\n'.join(p for p in text_parts if p.strip())
        p = {
            'title': title,
            'text': text
        }
        parts.append(p)
    return parts

def extract_page(url: str) -> dict:
    soup = make_soup(url)
    clean = clean_soup(soup)
    meta = extract_meta(clean)
    sections = extract_sections(clean)
    return {
        **meta,
        'sections': sections
    }

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
page = extract_page(url)
pprint(page, width=2000)
output:
{'description': 'Status post colonoscopy. After discharge, experienced bloody bowel movements and returned to the emergency department for evaluation.\n(Medical Transcription Sample Report)',
'sections': [{'text': 'Bright red blood per rectum', 'title': 'Chief Complaint'},
# some elements removed for brevity
{'text': '', 'title': 'Labs'},
{'text': 'WBC count: 6,500 per mL\nHemoglobin: 10.3 g/dL\nHematocrit:31.8%\nPlatelet count: 248 per mL\nMean corpuscular volume: 86.5 fL\nRDW: 18%', 'title': 'Cbc'},
{'text': 'Sodium: 131 mmol/L\nPotassium: 3.5 mmol/L\nChloride: 98 mmol/L\nBicarbonate: 23 mmol/L\nBUN: 11 mg/dL\nCreatinine: 1.1 mg/dL\nGlucose: 105 mg/dL', 'title': 'Chem 7'},
{'text': 'PT 15.7 sec\nINR 1.6\nPTT 29.5 sec', 'title': 'Coagulation Studies'},
{'text': 'The patient receive ... ula.', 'title': 'Hospital Course'}],
'title': 'Sample Type / Medical Specialty: Gastroenterology\nSample Name: Blood per Rectum'}
Code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.mtsamples.com/site/pages/sample.asp?Type=24-Gastroenterology&Sample=941-BloodperRectum'
res = urlopen(url)
html = res.read()
soup = BeautifulSoup(html, 'html.parser')

# Cut out the division containing the required text; used Right Click > Inspect Element in the browser to find the respective div/tag
sampletext_div = soup.find('div', {'id': "sampletext"})
print(sampletext_div.find('h1').text)  # to print the header
Output:
Sample Type / Medical Specialty: Gastroenterology
Sample Name: Blood per Rectum
Code:
# Find all the <b> tags
b_all = sampletext_div.findAll('b')
for b in b_all[4:]:
    print(b.text, b.next_sibling)
Output:
CHIEF COMPLAINT: Bright red blood per rectum
HISTORY OF PRESENT ILLNESS: This 73-year-old woman had a recent medical history significant for renal and bladder cancer, deep venous thrombosis of the right lower extremity, and anticoagulation therapy complicated by lower gastrointestinal bleeding. Colonoscopy during that admission showed internal hemorrhoids and diverticulosis, but a bleeding site was not identified. Five days after discharge to a nursing home, she again experienced bloody bowel movements and returned to the emergency department for evaluation.
REVIEW OF SYMPTOMS: No chest pain, palpitations, abdominal pain or cramping, nausea, vomiting, or lightheadedness. Positive for generalized weakness and diarrhea the day of admission.
PRIOR MEDICAL HISTORY: Long-standing hypertension, intermittent atrial fibrillation, and hypercholesterolemia. Renal cell carcinoma and transitional cell bladder cancer status post left nephrectomy, radical cystectomy, and ileal loop diversion 6 weeks prior to presentation, postoperative course complicated by pneumonia, urinary tract infection, and retroperitoneal bleed. Deep venous thrombosis 2 weeks prior to presentation, management complicated by lower gastrointestinal bleeding, status post inferior vena cava filter placement.
MEDICATIONS: Diltiazem 30 mg tid, pantoprazole 40 mg qd, epoetin alfa 40,000 units weekly, iron 325 mg bid, cholestyramine. Warfarin discontinued approximately 10 days earlier.
ALLERGIES: Celecoxib (rash).
SOCIAL HISTORY: Resided at nursing home. Denied alcohol, tobacco, and drug use.
FAMILY HISTORY: Non-contributory.
PHYSICAL EXAM: <br/>
LABS: <br/>
CBC: <br/>
CHEM 7: <br/>
COAGULATION STUDIES: <br/>
HOSPITAL COURSE: The patient received 1 liter normal saline and diltiazem (a total of 20 mg intravenously and 30 mg orally) in the emergency department. Emergency department personnel made several attempts to place a nasogastric tube for gastric lavage, but were unsuccessful. During her evaluation, the patient was noted to desaturate to 80% on room air, with an increase in her respiratory rate to 34 breaths per minute. She was administered 50% oxygen by nonrebreadier mask, with improvement in her oxygen saturation to 89%. Computed tomographic angiography was negative for pulmonary embolism.
Keywords:
gastroenterology, blood per rectum, bright red, bladder cancer, deep venous thrombosis, colonoscopy, gastrointestinal bleeding, diverticulosis, hospital course, lower gastrointestinal bleeding, nasogastric tube, oxygen saturation, emergency department, rectum, thrombosis, emergency, department, gastrointestinal, blood, bleeding, oxygen,
NOTE : These transcribed medical transcription sample reports and examples are provided by various users and
are for reference purpose only. MTHelpLine does not certify accuracy and quality of sample reports.
These transcribed medical transcription sample reports may include some uncommon or unusual formats;
this would be due to the preference of the dictating physician. All names and dates have been
changed (or removed) to keep confidentiality. Any resemblance of any type of name or date or
place or anything else to real world is purely incidental.
I'm scraping a news article using BeautifulSoup trying to only return the text body of the article itself, not all the additional "noise". Is there any easy way to do this?
import bs4
import requests
url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text,'html.parser')
element = soup.select_one('div.pg-rail-tall__body #body-text').text
print(element)
Trying to exclude some of the information returned such as
{CNN.VideoPlayer.handleUnmutePlayer = function
handleUnmutePlayer(containerId, dataObj) {'use strict';var
playerInstance,playerPropertyObj,rememberTime,unmuteCTA,unmuteIdSelector
= 'unmute_' +
The noise, as you call it, is the text in the <script>...</script> tags (JavaScript code). You can remove it using .extract() like:
for s in soup.find_all('script'):
    s.extract()
You can use this:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://edition.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html')
soup = BeautifulSoup(r.text, 'html.parser')

[x.extract() for x in soup.find_all('script')]  # does the same thing as the for-loop above

element = soup.find('div', class_='pg-rail-tall__body')
print(element.text)
Partial Output:
(CNN)Puerto Rico Gov. Ricardo Rosselló announced Monday that the
commonwealth will begin privatizing the Puerto Rico Electric Power
Authority, or PREPA. In comments published on Twitter, the governor
said the assets sale would transform the island's power generation
system into a "modern" and "efficient" one that would be less
expensive for citizens.He said the system operates "deficiently" and
that the improved infrastructure would respond more "agilely" to
natural disasters. The privatization process will begin "in the next
few days" and occur in three phases over the next 18 months, the
governor said.JUST WATCHEDSchool cheers as power returns after 112
daysReplayMore Videos ...MUST WATCHSchool cheers as power returns
after 112 days 00:48San Juan Mayor Carmen Yulin Cruz, known for her
criticisms of the Trump administration's response to Puerto Rico after
Hurricane Maria, spoke out against the move.Cruz, writing on her
official Twitter account, said PREPA's privatization would put the
commonwealth's economic development into "private hands" and that the
power authority will begin to "serve other interests.
Try this:
import bs4
import requests

url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')

elementd = soup.findAll('div', {'class': 'zn-body__paragraph'})
elementp = soup.findAll('p', {'class': 'zn-body__paragraph'})

for i in elementp:
    print(i.text)
for i in elementd:
    print(i.text)
I am trying to scrape the links from an inputted URL, but it's only working for one URL (http://www.businessinsider.com). How can it be adapted to scrape any inputted URL? I am using BeautifulSoup, but is Scrapy better suited for this?
def WebScrape():
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        link_title = soup.findAll('a', {'class': 'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href'] + "," + eachtitle.string)
    else:
        print('Please enter another Website...')
You could make a more generic scraper, searching for all tags and all links within those tags. Once you have the list of all links, you can use a regular expression or similar to find the links that match your desired structure.
import requests
from bs4 import BeautifulSoup
import re

response = requests.get('http://www.businessinsider.com')
soup = BeautifulSoup(response.content, 'html.parser')

# find all tags
tags = soup.find_all()

links = []
# iterate over all tags and extract links
for tag in tags:
    # find all href links
    tmp = tag.find_all(href=True)
    # append the master links list with each link
    links.extend(x['href'] for x in tmp if x['href'])

# example: keep only careerbuilder links
careerbuilder_links = [x for x in links if re.search(r'www\.careerbuilder\.com', x)]
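To adapt this to any inputted URL, which is the point of the original question, here is a small sketch that wraps the same idea in a function; the function name scrape_links and the optional pattern argument are just illustrative:
import re
import requests
from bs4 import BeautifulSoup

def scrape_links(url, pattern=None):
    """Return all href links found at `url`, optionally filtered by a regex pattern."""
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # any tag with an href attribute counts as a link here
    links = [tag['href'] for tag in soup.find_all(href=True)]
    if pattern:
        links = [link for link in links if re.search(pattern, link)]
    return links

# usage: works for any site, not just businessinsider.com
print(scrape_links('http://www.businessinsider.com'))
print(scrape_links('http://www.businessinsider.com', pattern=r'careerbuilder\.com'))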
code:
def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")
    title_tags = soup.findAll('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]
    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)
out:
Where do you want to scrape from today?: http://www.businessinsider.com
Retrieving your links...
http://www.businessinsider.com/trump-china-drone-navy-2016-12 Trump slams China's capture of a US Navy drone as 'unprecedented' act
http://www.businessinsider.com/trump-thank-you-rally-alabama-2016-12 'This is truly an exciting time to be alive'
http://www.businessinsider.com/how-smartwatch-pioneer-pebble-lost-everything-2016-12 How the hot startup that stole Apple's thunder wound up in Silicon Valley's graveyard
http://www.businessinsider.com/china-will-return-us-navy-underwater-drone-2016-12 Pentagon: China will return US Navy underwater drone seized in South China Sea
http://www.businessinsider.com/what-google-gets-wrong-about-driverless-cars-2016-12 Here's the biggest thing Google got wrong about self-driving cars
http://www.businessinsider.com/sheriff-joe-arpaio-still-wants-to-investigate-obamas-birth-certificate-2016-12 Sheriff Joe Arpaio still wants to investigate Obama's birth certificate
http://www.businessinsider.com/rents-dropping-in-new-york-bubble-pop-2016-12 Rents are finally dropping in New York City, and a bubble might be about to pop
http://www.businessinsider.com/trump-david-friedman-ambassador-israel-2016-12 Trump's ambassador pick could drastically alter 2 of the thorniest issues in the US-Israel relationship
http://www.businessinsider.com/can-hackers-be-caught-trump-election-russia-2016-12 Why Trump's assertion that hackers can't be caught after an attack is wrong
http://www.businessinsider.com/theres-a-striking-commonality-between-trump-and-nixon-2016-12 There's a striking commonality between Trump and Nixon
http://www.businessinsider.com/tesla-year-in-review-2016-12 Tesla's biggest moments of 2016
http://www.businessinsider.com/heres-why-using-uber-to-fill-public-transportation-gaps-is-a-bad-idea-2016-12 Here's why using Uber to fill public transportation gaps is a bad idea
http://www.businessinsider.com/useful-hard-adopt-early-morning-rituals-productive-exercise-2016-12 4 morning rituals that are hard to adopt but could really pay off
http://www.businessinsider.com/most-expensive-champagne-bottles-money-can-buy-2016-12 The 11 most expensive Champagne bottles money can buy
http://www.businessinsider.com/innovations-in-radiology-2016-11 5 innovations in radiology that could impact everything from the Zika virus to dermatology
http://www.businessinsider.com/ge-healthcare-mr-freelium-technology-2016-11 A new technology is being developed using just 1% of the finite resource needed for traditional MRIs