I'm trying to scrape this HTML title
<h2 id="p89" data-pid="89"><span id="page77" class="pageNum" data-no="77" data-before-text="77"></span>Tuesday, July 30</h2>
from this website: https://wol.jw.org/en/wol/h/r1/lp-e
My code:
from bs4 import BeautifulSoup
import requests
url = requests.get('https://wol.jw.org/en/wol/h/r1/lp-e').text
soup = BeautifulSoup(url, 'lxml')
textodiario = soup.find('header')
dia = textodiario.h2.text
print(dia)
It should return today's date, but instead it returns a past day: Wednesday, July 24.
My idea would be to use Selenium to get the HTML and then parse it. At the moment I don't have a PC to test on, so please double-check for possible errors. You will also need the chromedriver for your platform; put it in the same folder as the script:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "https://wol.jw.org/en/wol/h/r1/lp-e"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)  # chrome_options= is deprecated in newer Selenium releases
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
textodiario = soup.find('header')
dia = textodiario.h2.text
print(dia)
The data is getting loaded asynchronously and the contents of the div are being changed. What you need is a selenium web driver to act alongside bs4.
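A minimal sketch of that combination (Selenium to render the page, BS4 to parse it), using an explicit wait instead of a fixed sleep; it targets the same header/h2 pair as the question:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
driver.get('https://wol.jw.org/en/wol/h/r1/lp-e')

# Wait (up to 10 seconds) until the daily-text header has been rendered by the page's JavaScript.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'header h2')))

soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup.find('header').h2.text)
driver.quit()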
I actually tried your code, and there's definitely something wrong with how the website/the code is grabbing data, because when I pipe the entire response text through grep for "July", it gives:
Wednesday, July 24
<h2 id="p71" data-pid="71"><span id="page75" class="pageNum" data-no="75" data-before-text="75"></span>Wednesday, July 24</h2>
<h2 id="p74" data-pid="74">Thursday, July 25</h2>
<h2 id="p77" data-pid="77">Friday, July 26</h2>
If I had to take a guess, the fact that they're keeping multiple dates under h2 probably doesn't help, but I have almost zero experience in web scraping. And if you notice, July 30th isn't even in there, meaning that somewhere along the line your data is getting weird (as LazyCoder points out).
Hope that Selenium fixes your issue.
Go to the Network tab and you will find the link the page loads its data from:
https://wol.jw.org/wol/dt/r1/lp-e/2019/7/30
Here is the code.
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
session = requests.Session()
response = session.get('https://wol.jw.org/wol/dt/r1/lp-e/2019/7/30', headers=headers)
result = response.json()
data = result['items'][0]['content']
soup = BeautifulSoup(data, 'html.parser')
print(soup.select_one('h2').text)
Output:
Tuesday, July 30
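Note that the endpoint above has the date hard-coded. Assuming the same /wol/dt/r1/lp-e/YYYY/M/D pattern holds for other days (an assumption based only on the example URL, not verified), you could build today's URL with datetime:
from datetime import date

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

today = date.today()
# Assumed URL pattern, generalised from the 2019/7/30 example above.
url = f'https://wol.jw.org/wol/dt/r1/lp-e/{today.year}/{today.month}/{today.day}'

response = requests.get(url, headers=headers)
data = response.json()['items'][0]['content']
soup = BeautifulSoup(data, 'html.parser')
print(soup.select_one('h2').text)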
Related
When we search for a question on Google, it often produces an answer in a snippet like the following:
My objective is to scrape this text ("August 4, 1961", circled in red in the screenshot) in my Python code.
Before trying to scrape the text, I stored the web response in a text file using the following code:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.google.com/search?q=when+barak+obama+born")
soup = BeautifulSoup(page.content, 'html.parser')
out_file = open("web_response.txt", "w", encoding='utf-8')
out_file.write(soup.prettify())
out_file.close()
In the inspect element panel, I noticed that the snippet is inside a div with class Z0LcW XcVN5d (circled in green in the screenshot). However, the response in my txt file contains no such text, let alone the class name.
I've also tried this solution where the author scraped items with id rhs_block. But my response contains no such id.
I've searched for occurrences of "August 4, 1961" in my response txt file and tried to work out whether any of them could be the snippet, but none of the occurrences seemed to be the one I was looking for.
My plan was to get the div id or class name of the snippet and find its content like this:
# IT'S A PSEUDO CODE
containers = soup.find_all(class or id = 'something')
for tag in containers:
    print(f"tag text : {tag.text}")
Is there any way to do this?
NOTE: I'm also okay with using libraries other than beautifulsoup and requests, as long as they can produce the result.
There's no need to use Selenium, you can achieve this using requests and BS4 since everything you need is located in HTML and there's no dynamic JavaScript.
Code and example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=Barack Obama born date', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
born = soup.select_one('.XcVN5d').text
age = soup.select_one('.kZ91ed').text
print(born)
print(age)
Output:
August 4, 1961
age 59 years
Selenium will produce the result you need.
It's convenient because you can add any waits and see what is actually going on on your screen.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.get('https://google.com/')
assert "Google" in driver.title
wait = WebDriverWait(driver, 20)
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".gLFyf.gsfi")))
input_field = driver.find_element_by_css_selector(".gLFyf.gsfi")
input_field.send_keys("how many people in the world")
input_field.send_keys(Keys.RETURN)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".Z0LcW.XcVN5d")))
result = driver.find_element_by_css_selector(".Z0LcW.XcVN5d").text
print(result)
driver.close()
driver.quit()
The result will probably surprise you :)
You'll need to install Selenium and ChromeDriver. On Windows, put the ChromeDriver executable on your PATH; on Linux, pass the path to it explicitly. My example is for Linux.
I've created a script to get different names from this website filtering State Province to Alabama and Country to United States in the search box. The script can parse the names from the first page. However, I can't figure out how I can get the results from next pages as well using requests.
There are two options there to get all the names. Option one: using the show all 410 link, and option two: making use of the next button.
I've tried the following (capable of grabbing the names from the first page):
import re
import requests
from bs4 import BeautifulSoup
URL = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"
params = {
    'errorpath': '/CCI/Verify/CCI/Credential_Verification.aspx'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    r = s.get(URL)
    params['WebsiteKey'] = re.search(r"gWebsiteKey[^\']+\'(.*?)\'", r.text).group(1)
    params['hkey'] = re.search(r"gHKey[^\']+\'(.*?)\'", r.text).group(1)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input4$DropDown1'] = 'AL'
    payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input5$DropDown1'] = 'United States'
    r = s.post(URL, params=params, data=payload)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)
In case someone comes up with a solution based on Selenium: I've already had success with that approach, but I'm not willing to go that route:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://cci-online.org/CCI/Verify/CCI/Credential_Verification.aspx"
with webdriver.Chrome() as driver:
    driver.get(link)
    wait = WebDriverWait(driver, 15)
    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input4_DropDown1']")))).select_by_value("AL")
    Select(wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "select[id$='Input5_DropDown1']")))).select_by_value("United States")
    wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='SubmitButton']"))).click()
    wait.until(EC.visibility_of_element_located((By.XPATH, "//a[contains(.,'show all')]"))).click()
    wait.until(EC.invisibility_of_element_located((By.XPATH, "//span[@id='ctl01_LoadingLabel' and .='Loading']")))
    soup = BeautifulSoup(driver.page_source, "lxml")
    for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
        print(item.text)
How can I get the rest of the names from the subsequent pages of that site using the requests module?
First, click that link in Chrome with the Network panel open. Then look at the Form Data for the request:
Pay extra attention to __EVENTTARGET and __EVENTARGUMENT.
Next, inspect one of those next links, they will look like this:
<a onclick="return false;" title="Go to page 2" class="rgCurrentPage" href="javascript:__doPostBack('ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Grid1$ctl00$ctl02$ctl00$ctl07','')"><span>2</span></a>
The __doPostBack arguments go in __EVENTTARGET and __EVENTARGUMENT, and everything else should match what you see in the Network panel (headers as well as form data).
It will be helpful to proxy requests through Charles or Fiddler so you can compare the requests side by side.
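A hedged sketch of that follow-up POST, continuing directly from the question's requests session (so s, URL, params, and the response r from the first POST are as defined there); it assumes the hidden ASP.NET fields are re-read from each response and that the page-2 target string matches the link shown above:
# Re-collect the hidden form fields (__VIEWSTATE etc.) from the page we just received.
soup = BeautifulSoup(r.text, "lxml")
payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}

# Keep the search filter; <select> values are not picked up by the input[name] comprehension.
payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input4$DropDown1'] = 'AL'
payload['ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Sheet0$Input5$DropDown1'] = 'United States'

# The __doPostBack arguments from the "Go to page 2" link go into these two fields.
payload['__EVENTTARGET'] = 'ctl01$TemplateBody$WebPartManager1$gwpciPeopleSearch$ciPeopleSearch$ResultsGrid$Grid1$ctl00$ctl02$ctl00$ctl07'
payload['__EVENTARGUMENT'] = ''
# If the server still treats this as a button click, you may also need to drop the
# submit button's field from the payload.

r = s.post(URL, params=params, data=payload)
soup = BeautifulSoup(r.text, "lxml")
for item in soup.select("table.rgMasterTable > tbody > tr a[title]"):
    print(item.text)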
I am trying to get the HTML from CNN for a personal project. I am using the requests library and am new to it. I have followed basic tutorials to fetch the HTML from CNN with requests, but I keep getting responses that differ from the HTML I see when I inspect the page in my browser. Here is my code:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
I am trying to get article titles from CNN, but this is my first issue. Thanks for the help!
Update
It seems that I know even less than I had initially assumed. My real question is: how do I extract titles from the CNN homepage? I've tried both answers, but the HTML from requests does not contain the title information. How can I get the title information like what is shown in this picture (a screenshot of a CNN article title with the accompanying HTML side by side)?
You can use Selenium ChromeDriver to scrape https://cnn.com.
import bs4 as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
driver = webdriver.Chrome("---CHROMEDRIVER-PATH---", options=chrome_options)
driver.get('https://cnn.com/')
soup = bs.BeautifulSoup(driver.page_source, 'lxml')
# Get Titles from HTML.
titles = soup.find_all('span', {'class': 'cd__headline-text'})
print(titles)
# Close ChromeDriver.
driver.close()
driver.quit()
Output:
[<span class="cd__headline-text"><strong>The West turned Aung San Suu Kyi into a saint. She was always going to disappoint </strong></span>, <span class="cd__headline-text"><strong>In Hindu-nationalist India, Muslims risk being branded infiltrators</strong></span>, <span class="cd__headline-text">Johnson may have stormed to victory, but he's got a problem</span>, <span class="cd__headline-text">Impeachment heads to full House after historic vote</span>, <span class="cd__headline-text">Supreme Court to decide on Trump's financial records</span>, <span class="cd__headline-text">Michelle Obama's message for Thunberg after Trump mocks her</span>, <span class="cd__headline-text">Actor Danny Aiello dies at 86</span>, <span class="cd__headline-text">The biggest risk at the North Pole isn't what you think</span>, <span class="cd__headline-text">US city declares state of emergency after cyberattack </span>, <span class="cd__headline-text">Reality TV show host arrested</span>, <span class="cd__headline-text">Big names in 2019 you may have mispronounced</span>, <span class="cd__headline-text"><strong>Morocco has Africa's 'first fully solar village'</strong></span>]
You can download ChromeDriver from here.
I tried the following code and it worked for me.
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'
}
r = requests.get(base_url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())
Note that I have specified a headers parameter in requests.get(). All it does is mimic a real browser so that anti-scraping checks are less likely to block the request.
Hope this helps and if not then feel free to ask me in the comments. Cheers :)
I just checked. CNN seems to recognize that you are trying to scrape the site programmatically and serves a 404 / missing page (with no content on it) instead of the homepage.
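You can confirm this quickly by checking the status code of the plain request:
import requests

r = requests.get('https://www.cnn.com/')
print(r.status_code)  # anything other than 200 suggests the request is being blocked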
Try driving a browser with Selenium instead, e.g. like so:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://cnn.com')
html = driver.page_source
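If you do not want a visible browser window, Firefox can also be run headlessly; a minimal sketch:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')  # run Firefox without opening a window

driver = webdriver.Firefox(options=options)
driver.get('https://cnn.com')
html = driver.page_source
driver.quit()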
So I am trying to scrape the following webpage: https://www.scoreboard.com/uk/football/england/premier-league/, specifically the scheduled and finished results. Thus I am looking for the elements with class="stage-finished" or class="stage-scheduled". However, when I scrape the webpage and print out what page_soup contains, it doesn't include these elements.
I found another SO question with an answer saying this is because the content is loaded via AJAX, and that I need to look at the XHR requests under the Network tab in Chrome dev tools to find the file that loads the necessary data. However, it doesn't seem to be there.
import bs4
import requests
from bs4 import BeautifulSoup as soup
import csv
import datetime
myurl = "https://www.scoreboard.com/uk/football/england/premier-league/"
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = requests.get(myurl, headers=headers)
page_soup = soup(page.content, "html.parser")
scheduled = page_soup.select(".stage-scheduled")
finished = page_soup.select(".stage-finished")
live = page_soup.select(".stage-live")
print(page_soup)
print(scheduled[0])
The above code throws an error, of course, as there is nothing in the scheduled list.
My question is, how do I go about getting the data I'm looking for?
I copied the contents of the XHR files to a notepad and searched for stage-finished and other tags and found nothing. Am I missing something easy here?
The page is JavaScript rendered. You need Selenium. Here is some code to start on:
from selenium import webdriver
url = 'https://www.scoreboard.com/uk/football/england/premier-league/'
driver = webdriver.Chrome()
driver.get(url)
stages = driver.find_elements_by_class_name('stage-scheduled')
driver.close()
Or you could pass driver.page_source into the BeautifulSoup constructor, like this:
soup = BeautifulSoup(driver.page_source, 'html.parser')
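If the stage elements still come back empty, the page's JavaScript may not have finished rendering; a hedged refinement (using the same driver, before closing it) is to wait for them explicitly:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait (up to 15 seconds) until at least one stage element has been rendered, then parse.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.stage-scheduled, .stage-finished, .stage-live')))
soup = BeautifulSoup(driver.page_source, 'html.parser')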
Note:
You need to install a webdriver first. I installed chromedriver.
Good luck!
I'm very much a noob at Python and scraping. I understand the basics but just cannot get past this problem.
I'm trying to scrape content from www.tweakers.net using Python with the requests and BeautifulSoup libraries. However, when I scrape, I keep getting the cookie statement instead of the actual site content. I hope someone can help me with the code. I ran into similar issues on other websites, so I would really like to understand how to tackle this kind of problem. This is what I have now:
import time
from bs4 import BeautifulSoup
import requests
from requests.cookies import cookiejar_from_dict
last_agreed_time = str(int(time.time() * 1000))
url = 'https://www.tweakers.net'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'}
    session.cookies = cookiejar_from_dict({
        'wt3_sid': '%3B318816705845986',
        'wt_cdbeid': '68907f896d9f37509a2f4b0a9495f272',
        'wt_feid': '2f59b5d845403ada14b462a2c1d0b967',
        'wt_fweid': '473bb8c305b0b42f5202e14a',
    })
    response = session.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    print(soup.prettify())
Do not mind the content of the header, I ripped it from somewhere else.
Two of the best libraries for this kind of scraping are Selenium and cookielib. Here is a link to Selenium, http://selenium-python.readthedocs.io/api.html, and to cookielib, https://docs.python.org/2/library/cookielib.html.
## added selenium code
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://www.tweakers.net'
driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.set_window_size(1120, 550)
driver.get(url)

# Add the needed cookies; add_cookie() expects one dict per cookie with 'name' and 'value' keys.
for name, value in [('wt3_sid', '%3B318816705845986'),
                    ('wt_cdbeid', '68907f896d9f37509a2f4b0a9495f272'),
                    ('wt_feid', '2f59b5d845403ada14b462a2c1d0b967'),
                    ('wt_fweid', '473bb8c305b0b42f5202e14a')]:
    driver.add_cookie({'name': name, 'value': value})

# This is how you would retrieve a cookie by name.
print(driver.get_cookie('wt3_sid'))

# Reload the page now that the cookies are set, then parse it.
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.prettify())
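Another route, rather than injecting cookies by hand, is to let the consent dialog render and click its accept button. The selector below is hypothetical; you would need to inspect the actual consent dialog and replace it:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.tweakers.net')

# Hypothetical selector: replace 'button.accept-cookies' with the real accept button.
accept = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.accept-cookies')))
accept.click()

soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.prettify())
driver.quit()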