I am trying to scrape articles from the Wall Street Journal using BeautifulSoup in Python. The code runs without any error (exit code 0) but produces no results, and I don't understand why it isn't giving the expected output.
I even have a paid subscription.
I know that something is not right, but I can't locate the problem.
import time
import requests
from bs4 import BeautifulSoup
url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
'&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'
pages = 32
for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".items.hedSumm li > a"):
        resp = requests.get(item.get("href"))
        _href = item.get("href")
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.wsj.com" + _href)
            except Exception as e:
                continue
        sauce = BeautifulSoup(resp.text, "lxml")
        date = sauce.select("time.timestamp.article__timestamp.flexbox__flex--1")
        date = date[0].text
        tag = sauce.select("li.article-breadCrumb span").text
        title = sauce.select_one("h1.wsj-article-headline").text
        content = [elem.text for elem in sauce.select("p.article-content")]
        print(f'{date}\n {tag}\n {title}\n {content}\n')
        time.sleep(3)
As the code shows, I am trying to scrape the date, title, tag, and content of all the articles. It would be helpful to get suggestions about my mistakes and what I should do to get the desired results.
Replace this line in your code:
resp = requests.get(item.get("href"))
with this:
_href = item.get("href")
try:
    resp = requests.get(_href)
except Exception as e:
    try:
        resp = requests.get("https://www.wsj.com" + _href)
    except Exception as e:
        continue
This is because most of the item.get("href") values are not full website URLs. For example, you get hrefs like these:
/news/types/national-security
/public/page/news-financial-markets-stock.html
https://www.wsj.com/news/world
Only https://www.wsj.com/news/world is a complete URL, so you need to concatenate the base URL with _href.
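A tidier alternative (just a minimal sketch, not part of the original code) is urllib.parse.urljoin, which leaves absolute URLs untouched and resolves relative paths against the base:
from urllib.parse import urljoin

base = "https://www.wsj.com"

# urljoin keeps absolute URLs as-is and resolves relative ones against the base
for href in ["/news/types/national-security",
             "/public/page/news-financial-markets-stock.html",
             "https://www.wsj.com/news/world"]:
    print(urljoin(base, href))
# https://www.wsj.com/news/types/national-security
# https://www.wsj.com/public/page/news-financial-markets-stock.html
# https://www.wsj.com/news/world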
Update:
import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag
url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
'&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'
pages = 32
for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.find_all("a", {"class": "headline-image"}, href=True):
        _href = item.get("href")
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.wsj.com" + _href)
            except Exception as e:
                continue
        sauce = BeautifulSoup(resp.text, "lxml")
        dateTag = sauce.find("time", {"class": "timestamp article__timestamp flexbox__flex--1"})
        tag = sauce.find("li", {"class": "article-breadCrumb"})
        titleTag = sauce.find("h1", {"class": "wsj-article-headline"})
        contentTag = sauce.find("div", {"class": "wsj-snippet-body"})
        date = None
        tagName = None
        title = None
        content = None
        if isinstance(dateTag, Tag):
            date = dateTag.get_text().strip()
        if isinstance(tag, Tag):
            tagName = tag.get_text().strip()
        if isinstance(titleTag, Tag):
            title = titleTag.get_text().strip()
        if isinstance(contentTag, Tag):
            content = contentTag.get_text().strip()
        print(f'{date}\n {tagName}\n {title}\n {content}\n')
        time.sleep(3)
O/P:
March 31, 2019 10:00 a.m. ET
Tech
Care.com Removes Tens of Thousands of Unverified Listings
The online child-care marketplace Care.com scrubbed its site of tens of thousands of unverified day-care center listings just before a Wall Street Journal investigation published March 8, an analysis shows. Care.com, the largest site in the U.S. for finding caregivers, removed about 72% of day-care centers, or about 46,594 businesses, listed on its site, a Journal review of the website shows. Those businesses were listed on the site as recently as March 1....
Updated March 29, 2019 6:08 p.m. ET
Politics
FBI, Retooling Once Again, Sets Sights on Expanding Cyber Threats
The FBI has launched its biggest transformation since the 2001 terror attacks to retrain and refocus special agents to combat cyber criminals, whose threats to lives, property and critical infrastructure have outstripped U.S. efforts to thwart them. The push comes as federal investigators grapple with an expanding range of cyber attacks sponsored by foreign adversaries against businesses or national interests, including Russian election interference and Chinese cyber thefts from American companies, senior bureau executives...
I want to scrape some details from a main URL where each company has its own page, and then get each company's details (company name, phone, fax, website, etc.) from that page. I wrote the code using BeautifulSoup and requests and did get the details, but only for 52 companies. After that it fails because of JavaScript: the page has a "load more" button. I want to get every company's details by getting past that "load more" button, using only requests and BeautifulSoup; I don't want to use Selenium for this. I'll be glad and thankful for any help. Here is my code, which works up to the point where the "load more" button becomes a problem.
from time import time
from bs4 import BeautifulSoup
import urllib.request
import requests
import json
import time
parser = 'html.parser' # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("https://www.arabiantalks.com/category/1/advertising-gift-articles")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
cnt=0
for links in soup.find_all('a', href=True):
    url = links['href']
    # print(url)
    page = requests.get(url)
    # print(page)
    soup = BeautifulSoup(page.content, 'html.parser')
    # print(soup)
    results = soup.find('div', class_='rdetails')
    # print(results)
    if results is not None:
        # print(f"company_results: {results.text}")
        Name = results.find('h1')
        if Name is not None:
            print(f"Company_name: {Name.text}")
        else:
            print(f"Company_name: Notfound")
        address = results.find(attrs={'itemprop': 'address'})
        if address is not None:
            # address.text.replace(" Address : ", "")
            # print(type(f"Company_address: {address.text}"), "this type")
            print(f"Company_address: {address.text[12:]}")
        else:
            print(f"Company_address: Notfound")
        phone = results.find(attrs={'itemprop': 'telephone'})
        if phone is not None:
            print(f"Company_phone: {phone.text[16:]}")
        else:
            print(f"Company_phone: Notfound")
        fax = results.find(attrs={'itemprop': 'faxNumber'})
        if fax is not None:
            print(f"Company_fax: {fax.text[7:]}")
        else:
            print(f"Company_fax: Notfound")
        email = results.find(attrs={'itemprop': 'email'})
        if email is not None:
            print(f"Company_email: {email.text[9:]}")
        else:
            print(f"Company_email: Notfound")
        website = results.find(attrs={'itemprop': 'url'})
        if website is not None:
            print(f"Company_website: {website.text}")
        else:
            print(f"Company_website: Notfound")
        cnt += 1
        print(cnt)
        print("=" * 100)
If anyone knows the approach, could you please copy this code, make the required modifications, and post it as a reply, so I can see where to change things? I have tried following many articles but am still struggling. Please help me with this issue; I'm pretty new to web scraping. Thanks in advance.
This question is (almost) a duplicate of this one: Scraping a website that has a "Load more" button doesn't return info of newly loaded items with Beautiful Soup and Selenium
The difference is that in this instance the AJAX response is not JSON but HTML. You need to inspect the Network tab in the browser's dev tools to see what network calls are being made.
The following code accesses the AJAX endpoint, pulls all the data available, collects every company's profile URL, scrapes the name, address, phone, fax, email and website, and saves everything into a CSV file:
from bs4 import BeautifulSoup
import requests
import pandas as pd
item_list = []
counter = 20
s = requests.Session()
r = s.get('https://www.arabiantalks.com/category/1/advertising-gift-articles')
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all('a', {'itemprop': 'item'})
for item in items:
    item_list.append(item.get('href'))

while True:
    payload = {
        'start': counter,
        'cat': 1
    }
    r = s.post('https://www.arabiantalks.com/ajax/load_content', data=payload)
    if len(r.text) < 25:
        break
    soup = BeautifulSoup(r.text, 'html.parser')
    items = soup.find_all('a')
    for item in items:
        item_list.append(item.get('href'))
    counter = counter + 12
print('Total items:', len(set(item_list)))
full_comp_list = []
for x in item_list:
    r = s.get(x)
    soup = BeautifulSoup(r.text, 'html.parser')
    c_details_card = soup.select_one('div.rdetails')
    try:
        c_name = c_details_card.select_one('#hcname').text.strip()
    except Exception as e:
        c_name = 'Name unknown'
    try:
        c_address = c_details_card.find('h3', {'itemprop': 'address'}).text.strip()
    except Exception as e:
        c_address = 'Address unknown'
    try:
        c_phone = c_details_card.find('h3', {'itemprop': 'telephone'}).text.strip()
    except Exception as e:
        c_phone = 'Phone unknown'
    try:
        c_fax = c_details_card.find('h3', {'itemprop': 'faxNumber'}).text.strip()
    except Exception as e:
        c_fax = 'Fax unknown'
    try:
        c_email = c_details_card.find('h3', {'itemprop': 'email'}).text.strip()
    except Exception as e:
        c_email = 'Email unknown'
    try:
        c_website = c_details_card.find('a').get('href')
    except Exception as e:
        c_website = 'Website unknown'
    full_comp_list.append((c_name, c_address, c_phone, c_fax, c_email, c_website))
    print('Done', c_name)
full_df = pd.DataFrame(list(set(full_comp_list)), columns = ['Name', 'Address', 'Phone', 'Fax', 'Email', 'Website'])
full_df.to_csv('full_arabian_advertising_companies.csv')
full_df
It also prints to the terminal, to give you a sense of what it's doing:
Total items: 122
Done Ash & Sims Advertising LLC
Done Strings International Advertising LLC
Done Zaabeel Advertising LLC
Done Crystal Arc Factory LLC
Done Zone Group
Done Business Link General Trading
[....]
Name Address Phone Fax Email Website
0 Ash & Sims Advertising LLC Address : P.O.Box 50391,\nDubai - United Arab Emirates Phone Number : +971-4-8851366 , +9714 8851366 Fax : +971-4-8852499 E-mail : sales#ashandsims.com http://www.ashandsims.com
1 Strings International Advertising LLC Address : P O BOX 117617\n57, Al Kawakeb Property, Al Quoz\nDubai, U.A.E Phone Number : +971-4-3386567 , +971502503591 Fax : +971-4-3386569 E-mail : vinod#stringsinternational.org http://www.stringsinternational.org
2 Zaabeel Advertising LLC Address : Al Khabaisi, Phone Number : +971-4-2598444 Fax : +971-4-2598448 E-mail : info#zaabeeladv.com http://www.zaabeeladv.com
3 Crystal Arc Factory LLC Address : Dubai - P.O. Box 72282\nAl Quoz, Interchange 3, Al Manara, Phone Number : +971-4-3479191 , +971 4 3479191, Fax : +971-4-3475535 E-mail : info#crystalarc.net http://www.crystalarc.net
4 Zone Group Address : Al Khalidiya opp to Rak Bank, Kamala Tower,\nOffice no.1401, PO Box 129297, Abu Dhabi, UAE Phone Number : +97126339004 Fax : +97126339005 E-mail : info#zonegroupuae.ae http://www.zonegroupuae.ae
[....]
I wrote some code to get all the title URLs, but it has issues: it displays None values. Could you please help me out?
Here is my code:
import requests
from bs4 import BeautifulSoup
import csv
def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'html.parser')  # 1. html, 2. parser
        return soup

def get_index_data(soup):
    try:
        titles_link = soup.find_all('div', class_="marginTopTextAdjuster")
    except:
        titles_link = []
    urls = [item.get('href') for item in titles_link]
    print(urls)

def main():
    #url = "http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1"
    mainurl = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
    #get_page(url)
    get_index_data(get_page(mainurl))
    #write_csv(data,url)

if __name__ == '__main__':
    main()
You are trying to get the href attribute of the div tags. Instead, select all the a tags; they seem to have a common class attribute, body_link_11.
Use titles_link = soup.find_all('a', class_="body_link_11") instead of titles_link = soup.find_all('div', class_="marginTopTextAdjuster").
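As a minimal sketch of that change applied to your get_index_data (assuming the anchor tags on the page do carry the body_link_11 class):
def get_index_data(soup):
    # select the anchor tags directly; they are the elements that carry href
    titles_link = soup.find_all('a', class_="body_link_11")
    urls = [item.get('href') for item in titles_link]
    print(urls)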
url = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
titles_link = []
titles_div = soup.find_all('div', attrs={'class': 'marginTopTextAdjuster'})
for link in titles_div:
    tag = link.find_all('a', href=True)
    try:
        if tag[0].attrs.get('item_id', None):
            titles_link.append({tag[0].text: tag[0].attrs.get('href', None)})
    except IndexError:
        continue
print(titles_link)
output:
[{'Civil Affairs Handbook, Japan, section 1a: population statistics.': '/cdm/singleitem/collection/p4013coll8/id/2653/rec/1'}, {'Army Air Forces Program 1943.': '/cdm/singleitem/collection/p4013coll8/id/2385/rec/2'}, {'Casualty report number II.': '/cdm/singleitem/collection/p4013coll8/id/3309/rec/3'}, {'Light armored division, proposed March 1943.': '/cdm/singleitem/collection/p4013coll8/id/2425/rec/4'}, {'Tentative troop list by type units for Blacklist operations.': '/cdm/singleitem/collection/p4013coll8/id/150/rec/5'}, {'Chemical Warfare Service: history of training, part 2, schooling of commissioned officers.': '/cdm/compoundobject/collection/p4013coll8/id/2501/rec/6'}, {'Horses in the German Army (1941-1945).': '/cdm/compoundobject/collection/p4013coll8/id/2495/rec/7'}, {'Unit history: 38 (MECZ) cavalry rcn. sq.': '/cdm/singleitem/collection/p4013coll8/id/3672/rec/8'}, {'Operations in France: December 1944, 714th Tank Battalion.': '/cdm/singleitem/collection/p4013coll8/id/3407/rec/9'}, {'G-3 Reports : Third Infantry Division. (22 Jan- 30 Mar 44)': '/cdm/singleitem/collection/p4013coll8/id/4393/rec/10'}, {'Summary of operations, 1 July thru 31 July 1944.': '/cdm/singleitem/collection/p4013coll8/id/3445/rec/11'}, {'After action report 36th Armored Infantry Regiment, 3rd Armored Division, Nov 1944 thru April 1945.': '/cdm/singleitem/collection/p4013coll8/id/3668/rec/12'}, {'Unit history, 38th Mechanized Cavalry Reconnaissance Squadron, 9604 thru 9665.': '/cdm/singleitem/collection/p4013coll8/id/3703/rec/13'}, {'Redeployment: occupation forces in Europe series, 1945-1946.': '/cdm/singleitem/collection/p4013coll8/id/2952/rec/14'}, {'Twelfth US Army group directives. Annex no. 1.': '/cdm/singleitem/collection/p4013coll8/id/2898/rec/15'}, {'After action report, 749th Tank Battalion: Jan, Feb, Apr - 8 May 45.': '/cdm/singleitem/collection/p4013coll8/id/3502/rec/16'}, {'743rd Tank Battalion, S3 journal history.': '/cdm/singleitem/collection/p4013coll8/id/3553/rec/17'}, {'History of military training, WAAC / WAC training.': '/cdm/singleitem/collection/p4013coll8/id/4052/rec/18'}, {'After action report, 756th Tank Battalion.': '/cdm/singleitem/collection/p4013coll8/id/3440/rec/19'}, {'After action report 92nd Cavalry Recon Squadron Mechanized 12th Armored Division, Jan thru May 45.': '/cdm/singleitem/collection/p4013coll8/id/3583/rec/20'}]
An easy way to do it with requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
req = requests.get(url) # url stands for the page's url you want to find
soup = BeautifulSoup(req.text, "html.parser") # req.text is the complete html of the page
print(soup.title.string) # soup.title will give you the title of the page but with the <title> tags so .string removes them
Try this.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1'
html = req.get(url)
doc = SimplifiedDoc(html)
lst = doc.selects('div.marginTopTextAdjuster').select('a')
titles_link = [(utils.absoluteUrl(url,a.href),a.text) for a in lst if a]
print (titles_link)
Result:
[('http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1', 'Civil Affairs Handbook, Japan, section 1a: population statistics.'), ('http://cgsc.cdmhost.com/cdm/landingpage/collection/p4013coll8', 'World War II Operational Documents'), ('http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2385/rec/2', 'Army Air Forces Program 1943.'),...
I want to scrape data from Yahoo News and Bing News pages. The data I want are the headlines and/or the text below the headlines (whatever can be scraped) and the dates (times) when they were posted.
I have written some code, but it does not return anything. The problem is with my URLs, since I'm getting a 404 response.
Can you please help me with it?
This is the code for Bing:
from bs4 import BeautifulSoup
import requests
term = 'usa'
url = 'http://www.bing.com/news/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
And this is for Yahoo:
term = 'usa'
url = 'http://news.search.yahoo.com/q?s={}'.format(term)
response = requests.get(url)
print(response)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Please help me generate these URLs and explain the logic behind them; I'm still a noob :)
Basically, your URLs are just wrong. The URLs you have to use are the same ones you see in the address bar while using a regular browser. Most search engines and aggregators use the q parameter for the search term. Most of the other parameters are usually not required (sometimes they are, e.g. for specifying the result page number).
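As a small illustration (not part of the original answer), requests can also build the query string for you via its params argument, which handles encoding of the search term:
import requests

# requests appends the q parameter as ?q=... and URL-encodes it
response = requests.get('https://www.bing.com/news/search', params={'q': 'usa'})
print(response.url)          # https://www.bing.com/news/search?q=usa
print(response.status_code)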
Bing
from bs4 import BeautifulSoup
import requests
import re
term = 'usa'
url = 'https://www.bing.com/news/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for news_card in soup.find_all('div', class_="news-card-body"):
    title = news_card.find('a', class_="title").text
    time = news_card.find(
        'span',
        attrs={'aria-label': re.compile(".*ago$")}
    ).text
    print("{} ({})".format(title, time))
Output
Jason Mohammed blitzkrieg sinks USA (17h)
USA Swimming held not liable by California jury in sexual abuse case (1d)
United States 4-1 Canada: USA secure payback in Nations League (1d)
USA always plays the Dalai Lama card in dealing with China, says Chinese Professor (1d)
...
Yahoo
from bs4 import BeautifulSoup
import requests
term = 'usa'
url = 'https://news.search.yahoo.com/search?q={}'.format(term)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for news_item in soup.find_all('div', class_='NewsArticle'):
    title = news_item.find('h4').text
    time = news_item.find('span', class_='fc-2nd').text
    # Clean time text
    time = time.replace('·', '').strip()
    print("{} ({})".format(title, time))
Output
USA Baseball will return to Arizona for second Olympic qualifying chance (52 minutes ago)
Prized White Sox prospect Andrew Vaughn wraps up stint with USA Baseball (28 minutes ago)
Mexico defeats USA in extras for Olympic berth (13 hours ago)
...
I am trying to scrape all the articles on this web page: https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/
I can scrape the first article, but need help understanding how to jump to the next article and scrape the information there. Thank you in advance for your support.
import requests
from bs4 import BeautifulSoup
class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')

# Scraping news articles from Coindesk
def scrapeCoindesk(url):
    bs = getPage(url)
    title = bs.find("h3").text
    body = bs.find("p", {'class': 'desc'}).text
    return Content(url, title, body)

# Pulling the article from Coindesk
url = 'https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/'
content = scrapeCoindesk(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)
You can use the fact that every article is contained within a div.article to iterate over them:
def scrapeCoindesk(url):
    bs = getPage(url)
    articles = []
    for article in bs.find_all("div", {"class": "article"}):
        title = article.find("h3").text
        body = article.find("p", {"class": "desc"}).text
        article_url = article.find("a", {"class": "fade"})["href"]
        articles.append(Content(article_url, title, body))
    return articles

# Pulling the articles from coindesk
url = 'https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/'
content = scrapeCoindesk(url)
for article in content:
    print(article.url)
    print(article.title)
    print(article.body)
    print("-------------")
You can use find_all with BeautifulSoup:
from bs4 import BeautifulSoup as soup
from collections import namedtuple
import requests, re
article = namedtuple('article', 'title, link, timestamp, author, description')
r = requests.get('https://www.coindesk.com/category/markets-news/markets-markets-news/markets-bitcoin/').text
full_data = soup(r, 'lxml')
results = [[i.text, i['href']] for i in full_data.find_all('a', {'class':'fade'})]
timestamp = [re.findall('(?<=\n)[a-zA-Z\s]+[\d\s,]+at[\s\d:]+', i.text)[0] for i in full_data.find_all('p', {'class':'timeauthor'})]
authors = [i.text for i in full_data.find_all('a', {'rel':'author'})]
descriptions = [i.text for i in full_data.find_all('p', {'class':'desc'})]
full_articles = [article(*(list(i[0])+list(i[1:]))) for i in zip(results, timestamp, authors, descriptions) if i[0][0] != '\n ']
Output:
[article(title='Topping Out? Bitcoin Bulls Need to Defend $9K', link='https://www.coindesk.com/topping-out-bitcoin-bulls-need-to-defend-9k/', timestamp='May 8, 2018 at 09:10 ', author='Omkar Godbole', description='Bitcoin risks falling to levels below $9,000, courtesy of the bearish setup on the technical charts. '), article(title='Bitcoin Risks Drop Below $9K After 4-Day Low', link='https://www.coindesk.com/bitcoin-risks-drop-below-9k-after-4-day-low/', timestamp='May 7, 2018 at 11:00 ', author='Omkar Godbole', description='Bitcoin is reporting losses today but only a break below $8,650 would signal a bull-to-bear trend change. '), article(title="Futures Launch Weighed on Bitcoin's Price, Say Fed Researchers", link='https://www.coindesk.com/federal-reserve-scholars-blame-bitcoins-price-slump-to-the-futures/', timestamp='May 4, 2018 at 09:00 ', author='Wolfie Zhao', description='Cai Wensheng, a Chinese angel investor, says he bought 10,000 BTC after the price dropped earlier this year.\n'), article(title='Bitcoin Looks for Price Support After Failed $10K Crossover', link='https://www.coindesk.com/bitcoin-looks-for-price-support-after-failed-10k-crossover/', timestamp='May 3, 2018 at 10:00 ', author='Omkar Godbole', description='While equity bulls fear drops in May, it should not be a cause of worry for the bitcoin market, according to historical data.'), article(title='Bitcoin Sets Sights Above $10K After Bull Breakout', link='https://www.coindesk.com/bitcoin-sets-sights-10k-bull-breakout/', timestamp='May 3, 2018 at 03:18 ', author='Wolfie Zhao', description="Goldman Sachs is launching a new operation that will use the firm's own money to trade bitcoin-related contracts on behalf of its clients.")]
I'm scraping a news article using BeautifulSoup, trying to return only the text body of the article itself, not all the additional "noise". Is there an easy way to do this?
import bs4
import requests
url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text,'html.parser')
element = soup.select_one('div.pg-rail-tall__body #body-text').text
print(element)
I am trying to exclude some of the information returned, such as:
{CNN.VideoPlayer.handleUnmutePlayer = function
handleUnmutePlayer(containerId, dataObj) {'use strict';var
playerInstance,playerPropertyObj,rememberTime,unmuteCTA,unmuteIdSelector
= 'unmute_' +
The noise, as you call it, is the text inside the <script>...</script> tags (JavaScript code). You can remove it with .extract(), like this:
for s in soup.find_all('script'):
    s.extract()
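As a minimal sketch applied to your own snippet (reusing the URL and selector from the question), strip the script tags before pulling the text:
import bs4
import requests

url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# remove the JavaScript before extracting the article text
for s in soup.find_all('script'):
    s.extract()

element = soup.select_one('div.pg-rail-tall__body #body-text')
if element is not None:
    print(element.text)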
You can use this:
r = requests.get('https://edition.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html')
soup = BeautifulSoup(r.text, 'html.parser')
[x.extract() for x in soup.find_all('script')] # Does the same thing as the 'for-loop' above
element = soup.find('div', class_='pg-rail-tall__body')
print(element.text)
Partial Output:
(CNN)Puerto Rico Gov. Ricardo Rosselló announced Monday that the
commonwealth will begin privatizing the Puerto Rico Electric Power
Authority, or PREPA. In comments published on Twitter, the governor
said the assets sale would transform the island's power generation
system into a "modern" and "efficient" one that would be less
expensive for citizens.He said the system operates "deficiently" and
that the improved infrastructure would respond more "agilely" to
natural disasters. The privatization process will begin "in the next
few days" and occur in three phases over the next 18 months, the
governor said.JUST WATCHEDSchool cheers as power returns after 112
daysReplayMore Videos ...MUST WATCHSchool cheers as power returns
after 112 days 00:48San Juan Mayor Carmen Yulin Cruz, known for her
criticisms of the Trump administration's response to Puerto Rico after
Hurricane Maria, spoke out against the move.Cruz, writing on her
official Twitter account, said PREPA's privatization would put the
commonwealth's economic development into "private hands" and that the
power authority will begin to "serve other interests.
Try this:
import bs4
import requests
url = 'https://www.cnn.com/2018/01/22/us/puerto-rico-privatizing-state-power-authority/index.html'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
elementd = soup.findAll('div', {'class': 'zn-body__paragraph'})
elementp = soup.findAll('p', {'class': 'zn-body__paragraph'})
for i in elementp:
    print(i.text)
for i in elementd:
    print(i.text)