Web scraping with requests not working correctly - python

I am trying to get the HTML from CNN for a personal project. I am new to the requests library and have followed basic tutorials, but I keep getting responses that differ from the HTML I see when I inspect the page in my browser. Here is my code:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
I am trying to get article titles from CNN, but this is my first issue. Thanks for the help!
Update
It seems that I know even less than I had initially assumed. My real question is: how do I extract titles from the CNN homepage? I've tried both answers, but the HTML from requests does not contain the title information. How can I get the title information shown in this screenshot from my browser (a CNN article title with its accompanying HTML side by side)?

You can use Selenium ChromeDriver to scrape https://cnn.com.
import bs4 as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
driver = webdriver.Chrome("---CHROMEDRIVER-PATH---", options=chrome_options)
driver.get('https://cnn.com/')
soup = bs.BeautifulSoup(driver.page_source, 'lxml')

# Get titles from the HTML.
titles = soup.find_all('span', {'class': 'cd__headline-text'})
print(titles)

# Shut down ChromeDriver; quit() closes every window and ends the session.
driver.quit()
Output:
[<span class="cd__headline-text"><strong>The West turned Aung San Suu Kyi into a saint. She was always going to disappoint </strong></span>, <span class="cd__headline-text"><strong>In Hindu-nationalist India, Muslims risk being branded infiltrators</strong></span>, <span class="cd__headline-text">Johnson may have stormed to victory, but he's got a problem</span>, <span class="cd__headline-text">Impeachment heads to full House after historic vote</span>, <span class="cd__headline-text">Supreme Court to decide on Trump's financial records</span>, <span class="cd__headline-text">Michelle Obama's message for Thunberg after Trump mocks her</span>, <span class="cd__headline-text">Actor Danny Aiello dies at 86</span>, <span class="cd__headline-text">The biggest risk at the North Pole isn't what you think</span>, <span class="cd__headline-text">US city declares state of emergency after cyberattack </span>, <span class="cd__headline-text">Reality TV show host arrested</span>, <span class="cd__headline-text">Big names in 2019 you may have mispronounced</span>, <span class="cd__headline-text"><strong>Morocco has Africa's 'first fully solar village'</strong></span>]
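To get just the headline text rather than the raw tags, call get_text() on each match (a small follow-up using the titles list from above):
# Extract plain headline strings from the matched spans.
headline_texts = [title.get_text(strip=True) for title in titles]
print(headline_texts)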
You can download ChromeDriver from here.

I tried the following code and it worked for me.
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'
}
r = requests.get(base_url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
Note that I have specified a headers parameter in requests.get(). It mimics a real browser's request so that anti-scraping checks are less likely to flag it.
Hope this helps; if not, feel free to ask in the comments. Cheers :)

I just checked. CNN seems to recognize that you are scraping the site programmatically and serves a 404 / missing page (with no content on it) instead of the homepage.
Try browser automation with Selenium, e.g. like so:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://cnn.com')
html = driver.page_source  # fully rendered HTML, including JS-inserted content
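From there you can hand the rendered HTML to BeautifulSoup just as in your original code. A minimal sketch (the cd__headline-text class comes from the answer above and may change whenever CNN updates its markup):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# Headline spans, assuming CNN still uses this class name.
for span in soup.find_all('span', class_='cd__headline-text'):
    print(span.get_text(strip=True))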

Related

Is there a way to make the HTML elements of a website more visible?

While scraping the following website (https://www.middletownk12.org/Page/4113), this code could not locate the table rows (to get the staff name, email & department) even though they are visible in the Chrome developer tools. The soup object does not contain the tr tags that hold the needed info.
import requests
from bs4 import BeautifulSoup

url = "https://www.middletownk12.org/Page/4113"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
print(response.text)
I tried different libraries such as bs4, requests & selenium without success. I also tried CSS selectors & XPath with Selenium, to no avail. The tr elements could not be located.
That table of contact information is filled in by Javascript after the page has loaded. The content doesn't exist in the page's HTML and you won't see it using requests.
By using the developer tools available in the browser, we can examine the requests made after the page has loaded. There are a lot of them, but at least in my browser it's obvious the contact information is loaded near the end.
Looking at the request log, I see a request for a spreadsheet from docs.google.com:
If we examine that entry, we find that it's a request for:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0
And if we fetch the above link, we get a spreadsheet with the source data for that table.
Actually I used Selenium & then bs4 without any results. The code does not find the 'tr' elements...
Why are you using Selenium? The whole point of this answer is that you don't need Selenium if you can figure out the link to retrieve the data -- which we have.
All we need is requests to fetch the data and BeautifulSoup to parse it:
import requests
import bs4

url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    print(f"{link.text}: {link.get('href')}")

BeautifulSoup soup.find_all() returns an empty list

I'm trying to get data from a website using BeautifulSoup, but I get an empty list. I also tried "html.parser", but it did not help. Please help me find a solution. Thank you very much.
My code:
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.empireonline.com/movies/features/best-movies-2/")
movies_webpage = response.text
soup = BeautifulSoup(movies_webpage, "html.parser")
all_movies = soup.find_all(name="h3", class_="jsx-2692754980")
movie_titles = [movie.getText() for movie in all_movies]
print(movie_titles)
Output:
[]
What happens?
The response does not contain the h3 elements because the content of the website is served dynamically.
How to fix?
You can use the JSON information embedded in the response, or use Selenium to request the site and get the rendered content as expected.
Example
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = 'https://www.empireonline.com/movies/features/best-movies-2/'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
all_movies = soup.find_all("h3", class_="jsx-2692754980")
movie_titles = [movie.getText() for movie in all_movies]
print(movie_titles)
You don't need Selenium for this one; it slows down the scraping process. BeautifulSoup plus regular expressions is enough.
The movie list data is located in the page source, in inline JSON.
In order to extract data from the inline JSON you need to:
open the page source with CTRL + U;
find the data (title, name, etc.) with CTRL + F;
use a regular expression to extract the relevant part of the inline JSON:
# https://regex101.com/r/CqzweN/1
portion_of_script = re.findall("\"Author:42821\"\:{(.*)\"Article:54591\":", str(all_script))
retrieve the list of movies directly:
# https://regex101.com/r/jRgmKA/1
movie_list = re.findall("\"titleText\"\:\"(.*?)\"", str(portion_of_script))
We can also get the snippet and image using CSS selectors, because they are not rendered with JavaScript. You can use the SelectorGadget Chrome extension to define CSS selectors.
Check the code in an online IDE.
from bs4 import BeautifulSoup
import requests, re, json, lxml

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}
html = requests.get("https://www.empireonline.com/movies/features/best-movies-2/", headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

movie_data = []
movie_snippets = []
movie_images = []

all_script = soup.select("script")

# https://regex101.com/r/CqzweN/1
portion_of_script = re.findall("\"Author:42821\"\:{(.*)\"Article:54591\":", str(all_script))

# https://regex101.com/r/jRgmKA/1
movie_list = re.findall("\"titleText\"\:\"(.*?)\"", str(portion_of_script))

for snippets in soup.select(".listicle-item"):
    movie_snippets.append(snippets.select_one(".listicle-item-content p:nth-child(1)").text)

for image in soup.select('.image-container img'):
    movie_images.append(f"https:{image['data-src']}")

# [1:] excludes the first, unnecessary result
for movie, snippet, image in zip(movie_list, movie_snippets, movie_images[1:]):
    movie_data.append({
        "movie_list": movie,
        "movie_snippet": snippet,
        "movie_image": image
    })

print(json.dumps(movie_data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "movie_list": "11) Star Wars",
    "movie_snippet": "George Lucas' cocktail of fantasy, sci-fi, Western and World War II movie remains as culturally pervasive as ever. It's so mythically potent, you sense in time it could become a bona-fide religion...",
    "movie_image": "https://images.bauerhosting.com/legacy/media/619d/b9f5/3ebe/477b/3f9c/e48a/11%20Star%20Wars.jpg?q=80&w=500"
  },
  {
    "movie_list": "10) Goodfellas",
    "movie_snippet": "Where Coppola embroiled us in the politics of the Mafia elite, Martin Scorsese drew us into the treacherous but seductive world of the Mob's foot soldiers. And its honesty was as impactful as its sudden outbursts of (usually Joe Pesci-instigated) violence. Not merely via Henry Hill's (Ray Liotta) narrative, but also Karen's (Lorraine Bracco) perspective: when Henry gives her a gun to hide, she admits, \"It turned me on.\"",
    "movie_image": "https://images.bauerhosting.com/legacy/media/619d/ba59/5165/43e0/333b/7c6f/10%20Goodfellas.jpg?q=80&w=500"
  },
  {
    "movie_list": "9) Raiders Of The Lost Ark",
    "movie_snippet": "In '81, it must have sounded like the ultimate pitch: the creator of Star Wars teams up with the director of Jaws to make a rip-roaring, Bond-style adventure starring the guy who played Han Solo, in which the bad guys are the evillest ever (the Nazis) and the MacGuffin is a big, gold box which unleashes the power of God. It still sounds like the ultimate pitch.",
    "movie_image": "https://images.bauerhosting.com/legacy/media/619d/bb13/f590/5e77/c706/49ac/9%20Raiders.jpg?q=80&w=500"
  },
  # ...
]
There's a "13 ways to scrape any public data from any website" blog post if you want to know more about website scraping.

How to scrape text from this webpage?

I'm trying to scrape this HTML title
<h2 id="p89" data-pid="89"><span id="page77" class="pageNum" data-no="77" data-before-text="77"></span>Tuesday, July 30</h2>
from this website: https://wol.jw.org/en/wol/h/r1/lp-e
My code:
from bs4 import BeautifulSoup
import requests
url = requests.get('https://wol.jw.org/en/wol/h/r1/lp-e').text
soup = BeautifulSoup(url, 'lxml')
textodiario = soup.find('header')
dia = textodiario.h2.text
print(dia)
It should return today's date, but it returns a past day: Wednesday, July 24.
At the moment I don't have a PC to test on, so please double-check for possible errors.
You also need the ChromeDriver for your platform; put it in the same folder as the script.
My idea would be to use Selenium to get the HTML and then parse it:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://wol.jw.org/en/wol/h/r1/lp-e"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(chrome_options=options)
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
textodiario = soup.find('header')
dia = textodiario.h2.text
print(dia)
The data is getting loaded asynchronously and the contents of the div are being changed. What you need is a Selenium WebDriver to work alongside bs4.
I actually tried your code, and there's definitely something wrong with how the website/the code is serving the data, because when I pipe the entire response text to grep for "July", it gives:
Wednesday, July 24
<h2 id="p71" data-pid="71"><span id="page75" class="pageNum" data-no="75" data-before-text="75"></span>Wednesday, July 24</h2>
<h2 id="p74" data-pid="74">Thursday, July 25</h2>
<h2 id="p77" data-pid="77">Friday, July 26</h2>
If I had to take a guess, the fact that they're keeping multiple dates under h2 probably doesn't help, but I have almost zero experience in web scraping. And if you notice, July 30th isn't even in there, meaning that somewhere along the line your data is getting weird (as LazyCoder points out).
Hope that Selenium fixes your issue.
Go to the Network tab and you will find the link:
https://wol.jw.org/wol/dt/r1/lp-e/2019/7/30
Here is the code.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
session = requests.Session()
response = session.get('https://wol.jw.org/wol/dt/r1/lp-e/2019/7/30', headers=headers)
result = response.json()
data = result['items'][0]['content']
soup = BeautifulSoup(data, 'html.parser')
print(soup.select_one('h2').text)
Output:
Tuesday, July 30
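Since the last three path segments of that endpoint are just year/month/day, you can build the URL for the current date instead of hard-coding it. A small sketch, reusing session and headers from the code above and assuming the endpoint keeps this pattern:
from datetime import date

# Build today's endpoint from the /{year}/{month}/{day} pattern seen above.
today = date.today()
url = f'https://wol.jw.org/wol/dt/r1/lp-e/{today.year}/{today.month}/{today.day}'
response = session.get(url, headers=headers)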

BeautifulSoup does not return all elements on page

I'm new to web scraping and just started using BeautifulSoup. Here's my question.
When you look up a word on Google using a search query like "define:lucid", in most cases a panel showing the meaning and the pronunciation appears on the front page. (Shown on the left side of the embedded image.)
[Google default dictionary example]
The things I want to scrape and collect automatically are the text of the meaning and the URL where the mp3 data of the pronunciation is stored. Using the Chrome Inspector manually, these are easily found in its "Elements" section; e.g., the Inspector (shown on the right side of the image) shows the URL that stores the mp3 pronunciation data for "lucid" (here).
However, using requests to get the content of the HTML of the search result and parsing it with BeautifulSoup, like the code below, the soup gets only a few of the contents in the panel, such as the IPA "/ˈluːsɪd/" and the attribute "adjective" (see the result below), and none of the contents I need can be found, such as those in the audio elements.
How can I get this information with BeautifulSoup, if possible? Otherwise, what alternative tools are suitable for this task?
P.S. I think the pronunciation quality from the Google dictionary is better than that of any other dictionary site, so I want to stick with it.
Code:
import requests
from bs4 import BeautifulSoup
query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query
r = requests.get(goog_search)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
Part of soup content:
</span>
<span style="font:smaller 'Doulos SIL','Gentum','TITUS Cyberbit Basic','Junicode','Aborigonal Serif','Arial Unicode MS','Lucida Sans Unicode','Chrysanthi Unicode';padding-left:15px">
/ˈluːsɪd/
</span>
</div>
</h3>
<table style="font-size:14px;width:100%">
<tr>
<td>
<div style="color:#666;padding:5px 0">
adjective
</div>
The basic request you run does not return the parts of the page rendered via JavaScript. If you right-click in Chrome and select View Page Source, the audio link is not there. Solution: render the page via Selenium. With the code below I get the <audio> tag including the link.
You'll have to pip install selenium, download ChromeDriver, and add the folder containing it to PATH, e.g. export PATH=$PATH:~/downloads/
import time
from bs4 import BeautifulSoup
from selenium import webdriver

def render_page(url):
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(3)  # give the JavaScript a moment to render
    r = driver.page_source
    # driver.quit()
    return r

query = "define:lucid"
goog_search = "https://www.google.co.uk/search?q=" + query
r = render_page(goog_search)
soup = BeautifulSoup(r, "html.parser")
print(soup.prettify())
I checked it. You're right: in the BeautifulSoup output there are no audio elements for some reason. However, having inspected the code, I found the source Google uses for the audio file, which is http://ssl.gstatic.com/dictionary/static/sounds/oxford/lucid--_gb_1.mp3 and which works perfectly if you substitute "lucid" with any other word.
So, if you need to scrape the audio file, you could just do the following:
import requests

url = 'http://ssl.gstatic.com/dictionary/static/sounds/oxford/'
audio = requests.get(url + 'lucid' + '--_gb_1.mp3', stream=True).content
with open('lucid' + '.mp3', 'wb') as f:
    f.write(audio)
As for the other elements, I'm afraid you'll just need to find the word "definition" in the soup and scrape the content of the tag that contains it.
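Generalizing the snippet above into a small helper (a sketch: the oxford/..._gb_1.mp3 URL pattern is taken from this answer, and not every word is guaranteed to have a recording at that path):
import requests

def download_pronunciation(word, accent='gb'):
    # Hypothetical helper built around the gstatic URL pattern shown above.
    url = f'http://ssl.gstatic.com/dictionary/static/sounds/oxford/{word}--_{accent}_1.mp3'
    res = requests.get(url, stream=True)
    res.raise_for_status()  # fails if no recording exists at this path
    with open(f'{word}.mp3', 'wb') as f:
        f.write(res.content)

download_pronunciation('lucid')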
There's no need for Selenium here, which would only slow down scraping as in M3RS's answer, since the data is in the HTML, not rendered via JavaScript. Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
You're looking for this (CSS selectors reference):
soup.select_one('audio source')['src']
# //ssl.gstatic.com/dictionary/static/sounds/20200429/swagger--_gb_1.mp3
Make sure you're passing a user-agent, because the default requests user-agent is python-requests; Google blocks such requests because it knows they come from a bot and not a "real" user visit, and you'll receive different HTML with some sort of error. The user-agent fakes a real user visit by adding this information to the HTTP request headers.
Code:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
    'q': 'lucid definition',
    'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

phonetic = soup.select_one('.S23sjd .LTKOO span').text
audio_link = soup.select_one('audio source')['src']

print(phonetic)
print(audio_link)

# ˈluːsɪd
# //ssl.gstatic.com/dictionary/static/sounds/20200429/swagger--_us_1.mp3
Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to grab the data you want quickly, instead of coding everything from scratch, figuring out why certain things don't work as they should, and then maintaining it over time when something in the HTML layout changes.
At the moment, SerpApi doesn't extract the audio link. This will change in the future. Please check the playground to see whether the audio link is present.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "lucid definition",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()

phonetic = results['answer_box']['syllables']
print(phonetic)
# lu·cid
Disclaimer, I work for SerpApi.

Extracting text from chart in Beautiful soup

Relatively new to beautifulsoup and I'm trying to extract data from this webpage: http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx?name=GTL8gAmmdulY5GSlycy7WQ==&dataType=hIp9ibmBIwbKor1WvT5Bkg==&dataTypeText=hIp9ibmBIwbKor1WvT5Bkg==#
I would like to grab the numbers under the headings "Program Completers", "Employed Second Quarter", etc. A relevant part of the html code is:
<ul class="listbox">
<li class="li1">
<p style="cursor:help" class="listtop" title="WIA Adult
completers are those individuals who have exited a WIA Adult program from
which the individual received a core staff-assisted service (such as job
search or placement assistance) or an intensive service (such as
counseling, career planning, or job training). Those individuals who
participated in WIA through self-service, like OhioMeansJobs.com, or other
less intensive programs are not included in the dashboard.">Program
Completers</p>
<p id="programcompleters1">18</p></li>
I want the strings "Program Completers" and "18". I have tried implementing the solutions here, here, and here, but without much luck. One version of my code is:
from bs4 import BeautifulSoup
import urllib2

url = "http://reports.workforce.test.ohio.gov/program-county-wia-reports.aspx?name=GTL8gAmmdulY5GSlycy7WQ==&dataType=hIp9ibmBIwbKor1WvT5Bkg==&dataTypeText=hIp9ibmBIwbKor1WvT5Bkg=="
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'}
req = urllib2.Request(url, headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

for tag in soup.find_all('ul'):
    print tag.text, tag.next_sibling
This returns text but from other parts of the webpage also tagged 'ul'. I have been unsuccessful in grabbing any text from inside the chart area. How can I retrieve the text I want?
Thank you for any help!
As mentioned before, the data you're looking for is in an iframe; access it, as #chosen_codex says, here:
http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8=
You can then access the fields you are interested by:
data = {}
for tag in soup.find_all('p'):
    if tag.get('id'):
        data[tag.get('id')] = tag.text
print(data)
>> print(data.get('programcompleters1'))
18
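If you also want the human-readable labels ("Program Completers") paired with the values, here is a sketch based on the HTML shown in the question (assuming each li holds a p.listtop label followed by the value p):
# Pair each chart label with the value in its sibling <p> tag.
for li in soup.find_all('li'):
    label = li.find('p', class_='listtop')
    value = label.find_next_sibling('p') if label else None
    if label and value:
        print(label.get_text(strip=True) + ': ' + value.get_text(strip=True))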
The elements you want are inside an iframe. Try extracting from the iframe's page itself at http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8=
So, this should work:
url="http://reports.workforce.test.ohio.gov/WIAReports/WIA_COUNTY.ASPX?level=county&DataType=hIp9ibmBIwbKor1WvT5Bkg==&name=GTL8gAmmdulY5GSlycy7WQ==&programDate=Kf/2jvCFFRgQJjODWV7l08ATxxM/adw9p1FWfZ9J7O8="
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
chartcontainers = soup.findAll('div', {"class": "chartcontain"})
for container in chartcontainers:
print(container)
#then do whatever
