Beautiful Soup Link Scraping [duplicate] - python

I am using the requests module to get the content of a webpage, but when I look at what it returns I see that it does not contain the full content of the page.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Also, in the Chrome browser, if I look at the page source I do not see the full content.
Is there a way to get the full content of the example page that I have provided?

The page is rendered with JavaScript, which makes additional requests to fetch more data. You can fetch the fully rendered page with selenium.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
url = "https://shop.nordstrom.com/c/womens-dresses-shop?origin=topnav&cm_sp=Top%20Navigation-_-Women-_-Dresses&offset=11&page=3&top=72"
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.prettify())
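If you don't want a browser window popping up, Chrome can also be run headless. A minimal sketch, assuming a reasonably recent Selenium and Chrome (on older Selenium 3 versions the keyword argument was chrome_options instead of options):
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)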
For other solutions see my answer to Scraping Google Finance (BeautifulSoup)

A raw request is different from the rendered page or visual elements of a web page. Also, viewing the source of a web page doesn't give you access to everything behind it, such as database requests and other back-end processing. Either your question is not clear enough, or you've misinterpreted how web browsing works.

Related

Not getting all the information when scraping bet365.com

I am having a problem trying to scrape https://www.bet365.com/ using urllib.request and BeautifulSoup.
The problem is that the code below doesn't get all the information on the page; for example, players' names don't appear. Is there another framework or configuration I should use to extract the information?
My code is:
from bs4 import BeautifulSoup
import urllib.request
url = "https://www.bet365.com/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
Looking at the source code for the page in question, it looks like essentially all of the data is populated by JavaScript. BeautifulSoup isn't a headless browser; it's just a parser for HTML you have already downloaded, so it can't see anything that's populated by JavaScript. You'd need a headless browser like selenium to scrape something like that.
You need to use selenium instead of urllib.request, along with BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.bet365.com"
driver = webdriver.Chrome(executable_path=r"the_path_of_driver")
driver.get(url)
driver.maximize_window()    # optional: maximize the browser window
driver.implicitly_wait(60)  # optional: wait for elements to load
soup = BeautifulSoup(driver.page_source, 'html.parser')  # get the soup
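Note that implicitly_wait only sets a polling timeout for element lookups; if you need to be sure a particular element has actually loaded before reading page_source, an explicit wait is usually more reliable. A sketch, assuming you know a class name that appears once the data has loaded ('some-class' below is a placeholder, not a real bet365 class):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# wait up to 60 seconds for the placeholder element to appear
wait = WebDriverWait(driver, 60)
wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'some-class')))
soup = BeautifulSoup(driver.page_source, 'html.parser')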

How to scrape link titles over many pages and through a specified tab

I am having trouble figuring out how to use BeautifulSoup to scrape all 100 link titles on the page, since each one sits inside an <a href="..."> tag. I have tried the code below but it returns nothing.
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = 'https://www150.statcan.gc.ca/n1/en/type/data?count=100'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
title = soup.find_all('a')
Additionally, is there a way to ensure I am scraping everything under the "Tables (8898)" tab? Thanks in advance!
Link:
https://www150.statcan.gc.ca/n1/en/type/data?count=100
The link you provided loads its contents with asynchronous JavaScript requests, so when you execute page = urlopen(url) you only fetch the empty HTML shell and the JavaScript blocks.
You need a browser to execute that JavaScript and load the page contents. You can check out this link to learn how to do it: https://towardsdatascience.com/web-scraping-using-selenium-python-8a60f4cf40ab
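For example, a minimal Selenium version of your script might look like this (the 10-second sleep is a crude assumption; adjust it to however long the page takes to load):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://www150.statcan.gc.ca/n1/en/type/data?count=100'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(10)  # crude, but gives the async requests time to finish
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
# now the <a> tags rendered by JavaScript are in the soup
titles = [a.get_text(strip=True) for a in soup.find_all('a')]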

How can I view web page content that is generated using AngularJS?

I have an app that uses requests to analyze and act on web page text, but it does not seem to work on this page, which is likely built with Angular: https://bio.tools/bowtie. The source HTML is different from the rendered content. I am trying to collect the DOI that is referenced on the page (10.1186/gb-2009-10-3-r25), but when requests picks up the HTML source, the DOI is not there.
I've heard that Google is able to parse pages that are generated using javascript. How do they do it? Any tips on viewing the DOI information with python?
You probably need an engine that runs the JavaScript in the HTTP response for you (like a web browser does). You can use selenium for this and then parse the HTML it returns with BeautifulSoup.
Example:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://bio.tools/bowtie"
path = "path/to/chrome/webdriver"
browser = webdriver.Chrome(path) # Can also be Firefox, etc.
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
...
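From there you could, for instance, pull the DOI out of the rendered text with a regular expression (this assumes the DOI appears as plain text once Angular has rendered the page; you may need a short wait before reading page_source):
import re
# DOIs follow the pattern 10.<registrant>/<suffix>
match = re.search(r'10\.\d{4,9}/[^\s"<>]+', soup.get_text())
if match:
    print(match.group())  # e.g. 10.1186/gb-2009-10-3-r25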

What is the difference between the soup from Selenium and from requests?

I was crawling some information from the web, but I got different results when using Selenium and requests.
Selenium
driver.get('https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80')
soup = BeautifulSoup(driver.page_source, 'html.parser')
sample = soup.find_all('div', class_='accord_hd')
requests
response = requests.get('https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80')
soup = BeautifulSoup(response.content, 'html.parser')
sample = soup.find_all('div', class_='accord_hd')
With Selenium it returned an empty list, but with requests there was a list with some strings in it.
I experienced something similar to this before, so I wonder what's going on here.
requests obtains and returns the initial HTML source code.
selenium simulates/automates a browser to open the web page, from which you can then pull the HTML source that was actually used to render the page.
The difference between the two is that requests does not execute JavaScript, so if the site is dynamically created, requests cannot see that content. Selenium actually opens the browser to display the page, allowing the page to render its contents before you grab the HTML source.
That's why you may get two different responses when using requests versus selenium.
However, with the particular code you have given above, I got exactly the same output using Selenium and using requests.
Code:
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
driver = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
driver.get('https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80')
soup = BeautifulSoup(driver.page_source, 'html.parser')
sample_selenium = soup.find_all('div', class_='accord_hd')
driver.close()
response = requests.get('https://www.jobplanet.co.kr/companies/322493/benefits/%EC%A7%80%EC%97%90%EC%9D%B4%EC%B9%98%EC%94%A8%EC%A7%80')
soup = BeautifulSoup(response.content, 'html.parser')
sample_requests = soup.find_all('div', class_='accord_hd')
print('Selenium: %s items\nRequests: %s items' % (len(sample_selenium), len(sample_requests)))
Output:
Selenium: 11 items
Requests: 11 items

webscraping: extracting url from xpath in html using python: airbnb listings

I am trying to extract URLs for listings from a city page on Airbnb, using Python 3 libraries. I am familiar with how to scrape simpler websites with the BeautifulSoup and requests libraries.
url: 'https://www.airbnb.com/s/Denver--CO--United-States/homes'
If I inspect the element of a link on the page (in Chrome), I get:
xpath: "//*[@id="listing-9770909"]/div[2]/a"
selector: "#listing-9770909 > div._v72lrv > a"
My attempts:
import requests
from bs4 import BeautifulSoup
url = 'https://www.airbnb.com/s/Denver--CO--United-States/homes'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
divs = soup.find_all('div', attrs={'id': 'listing'})
attempt 2:
import requests
from lxml import html
page = requests.get(url)
root = html.fromstring(page.content)
tree = root.getroottree()
result = root.xpath('//div[@id="listing-9770909"]/div[2]/a')
for r in result:
    print(r)
Neither of these returns anything. What I need to be able to extract is the url for the page link. Any ideas?
To extract the links, you first have to make sure that the URLs actually exist in the page source. To check, search the page source (Ctrl+U in Google Chrome or Mozilla Firefox) for any of the listing ids. If the URLs are in the page source, you can scrape them directly with XPath against the response text of the listing page. Here, the Airbnb listing page above does not have the links in its page source, so the page is probably fetching them with requests to other endpoints (usually JSON requests). You can find those requests in the browser's developer tools and send requests to those endpoints yourself to get the required data.
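The general pattern looks like this (the endpoint URL below is a placeholder; copy the real one from the Network tab of your browser's developer tools, filtered to XHR requests):
import requests
# placeholder URL: replace with the actual XHR request you find in the Network tab
api_url = 'https://www.airbnb.com/api/...'
headers = {'User-Agent': 'Mozilla/5.0'}  # some endpoints reject requests without one
response = requests.get(api_url, headers=headers)
data = response.json()
# drill into the JSON for the listing urls; the exact keys depend on the response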
Please comment if you have any doubt regarding this.
