Beautiful Soup and the findAll() process - python

I am attempting to scrape data from a site using the following code. The site required the decode method and I followed a #royatirek solution. My problem is that container_a ends up containing nothing. I use a similar method on a few other sites and it works, but on this and a couple of other sites my container_a variable remains an empty list. Cheers
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
my_url = 'http://www.news.com.au/sport/afl-round-3-teams-full-lineups-and-the-best-supercoach-advice/news-story/dfbe9e0e68d445e07c9522a138a2b824'
req = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
page_soup = soup(webpage, "html.parser")
container_a = page_soup.findAll("div",{"class":"fyre-comment-wrapper"})

The content you want to parse is loaded dynamically by JavaScript, so a plain HTTP request won't do the job. You could use selenium with ChromeDriver (or any other driver) instead:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("http://www.news.com.au/sport/afl-round-3-teams-full-lineups-and-the-best-supercoach-advice/news-story/dfbe9e0e68d445e07c9522a138a2b824")
You can then proceed with the use of bs4 as you wish by accessing the page source using .page_source:
page_soup = BeautifulSoup(driver.page_source, "html.parser")
container_a = page_soup.findAll("div",{"class":"fyre-comment-wrapper"})
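To see that the same findAll call succeeds once the divs are actually present, here is a quick offline check; the HTML snippet is invented and stands in for the JavaScript-rendered page source, with only the class name matching the site:

```python
from bs4 import BeautifulSoup

# Static HTML standing in for the rendered page source; the comment
# text is made up, only the class name matches the real site.
rendered_html = """
<div class="fyre-comment-wrapper"><p>First comment</p></div>
<div class="fyre-comment-wrapper"><p>Second comment</p></div>
"""

page_soup = BeautifulSoup(rendered_html, "html.parser")
container_a = page_soup.findAll("div", {"class": "fyre-comment-wrapper"})
print(len(container_a))  # 2 -- non-empty once the divs exist in the HTML
```

The empty list in the question is therefore not a findAll problem: the markup simply isn't in the HTML that urllib downloads.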

Related

Python requests.get() not showing all HTML

I'm looking to scrape some information from Easy Allies reviews for a personal project using:
Python3
requests
BS4 (BeautifulSoup)
I would like to scrape the names of the last games they have reviewed, which are easy to find with the browser's inspect tool but don't exist in the page source returned by this Python code:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://www.easyallies.com/#!/reviews")
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.prettify())
How do I access this data?
Notice that when you open that url, it calls an endpoint https://www.easyallies.com/api/review/get that will fetch the reviews.
Take this code as an example, and parse the JSON result as you wish.
import requests
from bs4 import BeautifulSoup
data = { 'method': 'review', 'action': 'get', 'data[start]': 0, 'data[limit]': 10 }
reviews = requests.post("https://www.easyallies.com/api/review/get", data=data)
print (reviews.text)
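To go one step further and pull the titles out, decode the JSON body. The field names below are an assumption about a typical shape for such a response; print reviews.text first to confirm the real structure:

```python
import json

# Hypothetical response body, trimmed for illustration -- the real API's
# field names may differ, so inspect reviews.text before relying on them.
sample_body = '{"data": [{"name": "Game One"}, {"name": "Game Two"}]}'

payload = json.loads(sample_body)
names = [review["name"] for review in payload["data"]]
print(names)  # ['Game One', 'Game Two']
```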
from selenium import webdriver
import time
from bs4 import BeautifulSoup
browser = webdriver.Firefox()
url = 'https://www.easyallies.com/#!/reviews'
browser.get(url)
time.sleep(3)
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
for item in soup.findAll('div', attrs={'class': 'name'}):
    print(item.text)

Parsing HTML using beautifulsoup gives "None"

I can clearly see the tag I need in order to get the data I want to scrape.
According to multiple tutorials I am doing exactly the same way.
So why does it give me "None" when I simply want to display the code between the li tags?
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.governmentjobs.com/careers/sdcounty")
soup = BeautifulSoup(response.text,'html.parser')
job = soup.find('li', attrs = {'class':'list-item'})
print(job)
Whilst the page does update dynamically (the browser makes additional requests to load content, which your single request doesn't capture), you can find the source URI for the content of interest in the network tab. You also need to add the expected header.
import requests
from bs4 import BeautifulSoup as bs
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = bs(r.content, 'lxml')
print(len(soup.select('.list-item')))
There is no such content in the original page. The search results you're referring to are loaded dynamically/asynchronously using JavaScript.
Print the variable response.text to verify that. I got the result using ReqBin. You'll find that there's no text list-item inside.
Unfortunately, you can't run JavaScript with BeautifulSoup.
Another way to handle dynamically loaded data is to use selenium instead of requests to get the page source. This waits for the JavaScript to load the data and then gives you the corresponding HTML. This can be done like so:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
url = "<URL>"
chrome_options = Options()
chrome_options.add_argument("--headless") # Opens the browser up in background
with Chrome(options=chrome_options) as browser:
    browser.get(url)
    html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
job = soup.find('li', attrs = {'class':'list-item'})
print(job)

Python BeautifulSoup soup.find

I want to scrape some specific data from a website using urllib and BeautifulSoup.
I'm trying to fetch the text "190.0 kg". As you can see in my code, I have tried using attrs={'class': 'col-md-7'}, but this returns the wrong result. Is there any way to specify that I want it to return the text between <h3> tags?
from urllib.request import urlopen
from bs4 import BeautifulSoup
# specify the url
quote_page = 'https://styrkeloft.no/live.styrkeloft.no/v2/?test-stevne'
# query the website and return the html to the variable 'page'
page = urlopen(quote_page)
# parse the html using beautiful soup
soup = BeautifulSoup(page, 'html.parser')
# take out the <div> of name and get its value
Weight_box = soup.find('div', attrs={'class': 'col-md-7'})
name = Weight_box.text.strip()
print (name)
Since this content is dynamically generated, there is no way to access that data with a plain urllib request.
You can use selenium webdriver to accomplish this:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_driver = "path_to_chromedriver"
driver = webdriver.Chrome(chrome_options=chrome_options,executable_path=chrome_driver)
driver.get('https://styrkeloft.no/live.styrkeloft.no/v2/?test-stevne')
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
current_lifter = soup.find("div", {"id":"current_lifter"})
value = current_lifter.find_all("div", {'class':'row'})[2].find_all("h3")[0].text
driver.quit()
print(value)
Just be sure to have the chromedriver executable in your machine.

BeautifulSoup4 not finding div

I've been experimenting with BeautifulSoup4 lately and have found pretty good success across a range of different sites. Though I have stumbled into an issue when trying to scrape amazon.com.
Using the code below, when printing 'soup' I can see the div, but when I search for the div itself, BS4 brings back an empty result. I think the issue is in how the HTML is being processed. I've tried with lxml and html5lib. Any ideas?
import bs4 as bs
import urllib3
urllib3.disable_warnings()
http = urllib3.PoolManager()
url = 'https://www.amazon.com/gp/goldbox/ref=gbps_fcr_s-4_a870_dls_UPCM?gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:GIVEAWAY_DEAL,sortOrder:BY_SCORE,enforcedCategories:2619533011,dealTypes:LIGHTNING_DEAL&pf_rd_p=56200e05-4eb2-42ca-9723-af0811ada870&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=PQNZWZRKVD93HXXVG5A7&ie=UTF8'
original = http.request('GET', url)
soup = bs.BeautifulSoup(original.data, 'lxml')
div = soup.find_all('div', {'class':'a-row padCenterContainer'})
You could use selenium to allow the JavaScript to load before grabbing the HTML.
from bs4 import BeautifulSoup
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(chrome_options=options)
url = 'https://www.amazon.com/gp/goldbox/ref=gbps_fcr_s-4_a870_dls_UPCM?gb_f_deals1=dealStates:AVAILABLE%252CWAITLIST%252CWAITLISTFULL,includedAccessTypes:GIVEAWAY_DEAL,sortOrder:BY_SCORE,enforcedCategories:2619533011,dealTypes:LIGHTNING_DEAL&pf_rd_p=56200e05-4eb2-42ca-9723-af0811ada870&pf_rd_s=slot-4&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=ATVPDKIKX0DER&pf_rd_r=PQNZWZRKVD93HXXVG5A7&ie=UTF8'
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
div = soup.find('div', {'class': 'a-row padCenterContainer'})
print(div.prettify())
The output of this script was too long to put in this question but here is a link to it

Python Beautiful Soup - Span class text not extracted

I'm using Beautiful Soup for the first time and the text from the span class is not being extracted. I'm not familiar with HTML, so I'm unsure why this happens; it'd be great to understand.
I've used the code below:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, "html.parser")
content = page_soup.findAll("span",attrs={"data-item":"rate"})
With this code for index 0 it returns the following:
<span class="productdata" data-baserate-code="VRI" data-cc="AU" data-item="rate" data-section="PHL" data-subsection="VR"></span>
However I'd expect something like this when I inspect via Chrome, which has the text such as the interest rate:
<span class="productdata" data-cc="AU" data-section="PHL" data-subsection="VR" data-baserate-code="VRI" data-item="rate">5.20% p.a.</span>
The data you are trying to extract does not exist in the page you downloaded. It is loaded using JavaScript after the page loads; the website uses a JSON API to populate the page, so Beautiful Soup cannot find the data. The data can be viewed at the following link, which hits the site's JSON API and returns the JSON data:
https://www.anz.com/productdata/productdata.asp?output=json&country=AU&section=PHL
You can parse the JSON and get the data. Also, for HTTP requests I would recommend the requests package.
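As a sketch of that parsing step, here is how the rate could be pulled out, assuming the JSON nests entries by the section/subsection/baserate codes visible in the span attributes (PHL, VR, VRI); the real layout may differ, so inspect the raw response first:

```python
import json

# Hypothetical, trimmed response from the productdata endpoint; the keys
# mirror data-section, data-subsection and data-baserate-code from the HTML,
# but the real JSON layout should be confirmed against the actual response.
sample_body = '{"PHL": {"VR": {"VRI": {"rate": "5.20% p.a."}}}}'

data = json.loads(sample_body)
rate = data["PHL"]["VR"]["VRI"]["rate"]
print(rate)  # 5.20% p.a.
```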
As others said, the content is JavaScript-generated. You can use selenium together with ChromeDriver to find the data you want with something like:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.anz.com.au/personal/home-loans/your-loan/interest-rates/#varhome")
items = driver.find_elements_by_css_selector("span[data-item='rate']")
itemsText = [item.get_attribute("textContent") for item in items]
>>> itemsText
['5.20% p.a.', '5.30% p.a.', '5.75% p.a.', '5.52% p.a.', ....]
As seen above, BeautifulSoup wasn't necessary at all, but you can use it instead to parse the page source and get the same results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
items = soup.findAll("span",{"data-item":"rate"})
itemsText = [item.text for item in items]
