Scraping HTML code using Selenium with Python

I was trying to scrape some image data from some stores. For example, I was looking at some images from Nordstrom (tested with 'https://www.nordstrom.com/browse/men/clothing/sweaters').
I had initially used requests.get() to get the code, but I noticed that I was getting some JavaScript back -- and upon further research I found that this occurred because the content is loaded into the HTML dynamically with JavaScript.
To remedy this issue, following this post (Python requests.get(url) returning javascript code instead of the page html), I tried to use Selenium to get the HTML code. However, I still ran into issues trying to access all the HTML: it was still returning a lot of JavaScript. Finally, I added in some time delay, as I thought maybe it needed time to load all of the HTML, but this still failed. Is there a way to get all the HTML using Selenium? I have attached the current code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
def create_browser(webdriver_path):
    # create a selenium object that mimics the browser
    browser_options = Options()
    # the headless flag creates an invisible browser
    browser_options.add_argument("--headless")
    browser_options.add_argument('--no-sandbox')
    browser = webdriver.Chrome(webdriver_path, chrome_options=browser_options)
    print("Done Creating Browser")
    return browser
url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('path/to/chromedriver_win32/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)
time.sleep(10)
html_source = browser.page_source
print(html_source)
Is there something that I am not doing properly to load in all of the html code?

browser.page_source always returns the initial HTML source, not the current DOM state. Try:
time.sleep(10)
html_source = browser.find_element_by_tag_name('html').get_attribute('outerHTML')
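If you are on Selenium 4, find_element_by_tag_name has been removed; a minimal equivalent using the By locator (same idea, just the newer API) would look like this:
from selenium.webdriver.common.by import By

html_source = browser.find_element(By.TAG_NAME, 'html').get_attribute('outerHTML')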

I would recommend reading "Test-Driven Development with Python"; you'll get an answer to your question and to many more. You can read it for free here: https://www.obeythetestinggoat.com/ (and then you can also buy it ;-) )
Regarding your question, you have to wait until the element you're looking for has actually loaded. You can use time.sleep, but you'll get inconsistent behavior depending on the speed of your internet connection and browser.
A better solution is explained here in depth: https://www.obeythetestinggoat.com/book/chapter_organising_test_files.html#self.wait-for
You can use the proposed solution:
import time
from selenium.common.exceptions import WebDriverException

MAX_WAIT = 10  # seconds; choose a value suited to your pages

def wait_for(self, fn):
    start_time = time.time()
    while True:
        try:
            return fn()
        except (AssertionError, WebDriverException) as e:
            if time.time() - start_time > MAX_WAIT:
                raise e
            time.sleep(0.5)
fn is just a function that finds the element on the page.
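As a rough usage sketch: wait_for is written as a method above, so as a standalone helper you would drop the self parameter; the selector below is only a placeholder, not something from the question.
from selenium.webdriver.common.by import By

# placeholder locator -- point it at an element that only exists once the page has rendered
wait_for(lambda: browser.find_element(By.CSS_SELECTOR, 'article a'))
html_source = browser.page_source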

Just add a user agent. Headless Chrome's default user-agent string says it is headless, and that is the problem.
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
def create_browser(webdriver_path):
    # create a selenium object that mimics the browser
    browser_options = Options()
    browser_options.add_argument('--headless')
    browser_options.add_argument('--user-agent="Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"')
    browser = webdriver.Chrome(webdriver_path, options=browser_options)
    print("Done Creating Browser")
    return browser
url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('C:/bin/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)
divs = browser.find_elements_by_tag_name('a')
for div in divs:
    print(div.text)
Output:
Displays all the links on the page:
Patagonia Better Sweater® Quarter Zip Pullover
(56)
Nordstrom Men's Shop Regular Fit Cashmere Quarter Zip Pullover (Regular & Tall)
(73)
Nordstrom Cashmere Crewneck Sweater
(51)
Cutter & Buck Lakemont Half Zip Sweater
(22)
Nordstrom Washable Merino Quarter Zip Sweater
(2)
ALLSAINTS Mode Slim Fit Merino Wool Sweater
Process finished with exit code -1
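Since the original goal was image data, here is a small follow-up sketch built on the browser and url above, using Selenium 4 locator syntax; waiting on <img> tags is my own assumption about what "loaded" means for this page, not something taken from the question.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser.get(url)
# wait until at least one <img> is present instead of relying on a fixed sleep
WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.TAG_NAME, 'img')))
image_urls = [img.get_attribute('src') for img in browser.find_elements(By.TAG_NAME, 'img')]
print(len(image_urls), image_urls[:5])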

Related

Selenium returns different page sources at different moments

I am trying to get all the events of a given city from eventim.de (more precisely https://www.eventim.de/en/city/berlin-1/?shownonbookable=true). Unfortunately, when I try to get the code with Selenium and process it with BeautifulSoup I run into a problem: the tag <product-group-item class="hydrated"> is not always present in the scraped code. I tried making my script run in a loop for several iterations (the example_length variable): about 50% of the time the tag is in the code, and the other 50% of the time it isn't.
I already tried the requests library, but this website generates the items dynamically, so Selenium is required.
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
from time import sleep
example_length = 10
for i in range(example_length):
    webDriver = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=my_options)
    full_url = 'https://www.eventim.de/en/city/berlin-1/?shownonbookable=true&page=1'
    webDriver.get(full_url)
    sleep(10)
    rawPageCode = webDriver.page_source
    webDriver.close()
    soupPageCode = BS(rawPageCode, 'html5lib')  # tried also with 'html.parser'
    attr = {'class': 'hydrated'}
    eventContainerCode = soupPageCode.find_all('product-group-item', attr)
    print(len(eventContainerCode))
This code returns either 20 or 0, unpredictably.
The last line of code is the one that sometimes finds the tag and sometimes doesn't.
I also tried other webdrivers (Chrome, for example), but the result is the same.
Increasing the sleep time doesn't change the output, as the page - whether it has the searched tag or not - is always fully loaded in less than 5 seconds. At the moment I have to believe that the tag appears randomly. If the tag I'm searching for is not there, the items appear in the <listing-item class="hydrated"> tag.
The webDriver options I tried to use to fix the problem are:
my_options = Options()
headers = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36"
my_options.binary_location = "C:\Program Files\Mozilla Firefox/firefox.exe"
my_options.add_argument("--disable-extensions")
my_options.add_argument('--blink-settings=imagesEnabled=false')
my_options.add_argument("--disable-gpu")
my_options.add_argument("--no-sandbox")
my_options.add_argument("--headless")
my_options.add_argument('--user-agent=' + headers)
my_options.add_argument("--window-size=1920, 1080")
my_options.add_argument('--start-fullscreen')
my_options.add_argument("--start-maximized")
my_options.add_argument('--disable-blink-features=AutomationControlled')
my_options.add_argument('--ignore-certificate-errors')
my_options.add_argument("--test-type")
my_options.add_argument("--incognito")
my_options.add_argument("--verbose")
None of these options makes any difference; I get the same 50:50 result either way. I also ran it without the headless argument to see what the webDriver loads, and sometimes I see the main events, other times only the subevents.
I tried also to use the Selenium functions to wait for the page to load until an element is shown:
wait = WebDriverWait(browser, 10)
wait.until(EC.presence_of_element_located((By.XPATH, "Full XPath goes here")))
But again, either the tag is there or not. It's not a matter of webpage loading time.
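For completeness, the wait snippet above needs these imports (and the driver variable in the question's main loop is webDriver). Since the items land either in <product-group-item> or in <listing-item>, one possibility -- an assumption on my part, not something from the question -- is to wait until either of the two appears:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(webDriver, 10)
# succeed as soon as either container type has hydrated, whichever variant the page rendered
wait.until(lambda d: d.find_elements(By.CSS_SELECTOR, 'product-group-item.hydrated')
           or d.find_elements(By.CSS_SELECTOR, 'listing-item.hydrated'))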
From a normal browser (e.g. Chrome) the website always loads correctly, displaying the <product-group-item class="hydrated"> tag every time.
At the moment I have fixed the code by simply repeating the routine if the tag is not there, but as a matter of principle I want to understand why this happens. The code works, but it could definitely be faster and more robust.
How can I always get back the same page source from the same request? And why is there this difference in output?

Is there a way to make a dynamic web page automatically run its JavaScript when webscraping with Python?

I have been running into a lot of issues when trying to do some Python web scraping using BeautifulSoup. Since this particular web page is dynamic, I have been trying to use Selenium first in order to "open" the web page before working with the dynamic content in BeautifulSoup.
The issue is that the dynamic content only shows up in my HTML output when I manually scroll through the website while the program runs; otherwise those parts of the HTML remain empty, as if I were just using BeautifulSoup by itself without Selenium.
Here is my code:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
if __name__ == "__main__":
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    # options.add_argument('--headless')
    driver = webdriver.Chrome("C:\Program Files (x86)\chromedriver.exe", chrome_options=options)
    driver.get('https://coinmarketcap.com/')
    time.sleep(5)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.tbody
    trs = tbody.contents
    for tr in trs:
        print(tr)
    driver.close()
Now if I have Selenium open Chrome with the headless option turned on, I get the same output I would normally get without having pre-loaded the page. The same thing happens if I'm not in headless mode and I simply let the page load by itself, without scrolling through the content manually.
Does anyone know why this is? Is there a way to get the dynamic content to load without manually scrolling through each time I run the code?
Actually, the data is loaded dynamically by JavaScript, so you can grab it easily from the API call's JSON response.
Here is the working example:
Code:
import requests
import json
url= 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
r = requests.get(url)
for item in r.json()['data']['cryptoCurrencyList']:
    name = item['name']
    print('crypto_name:' + str(name))
Output:
crypto_name:Bitcoin
crypto_name:Ethereum
crypto_name:Binance Coin
crypto_name:Cardano
crypto_name:Tether
crypto_name:Solana
crypto_name:XRP
crypto_name:Polkadot
crypto_name:USD Coin
crypto_name:Dogecoin
crypto_name:Terra
crypto_name:Uniswap
crypto_name:Wrapped Bitcoin
crypto_name:Litecoin
crypto_name:Avalanche
crypto_name:Binance USD
crypto_name:Chainlink
crypto_name:Bitcoin Cash
crypto_name:Algorand
crypto_name:SHIBA INU
crypto_name:Polygon
crypto_name:Stellar
crypto_name:VeChain
crypto_name:Internet Computer
crypto_name:Cosmos
crypto_name:FTX Token
crypto_name:Filecoin
crypto_name:Axie Infinity
crypto_name:Ethereum Classic
crypto_name:TRON
crypto_name:Bitcoin BEP2
crypto_name:Dai
crypto_name:THETA
crypto_name:Tezos
crypto_name:Fantom
crypto_name:Hedera
crypto_name:NEAR Protocol
crypto_name:Elrond
crypto_name:Monero
crypto_name:Crypto.com Coin
crypto_name:PancakeSwap
crypto_name:EOS
crypto_name:The Graph
crypto_name:Flow
crypto_name:Aave
crypto_name:Klaytn
crypto_name:IOTA
crypto_name:eCash
crypto_name:Quant
crypto_name:Bitcoin SV
crypto_name:Neo
crypto_name:Kusama
crypto_name:UNUS SED LEO
crypto_name:Waves
crypto_name:Stacks
crypto_name:TerraUSD
crypto_name:Harmony
crypto_name:Maker
crypto_name:BitTorrent
crypto_name:Celo
crypto_name:Helium
crypto_name:OMG Network
crypto_name:THORChain
crypto_name:Dash
crypto_name:Amp
crypto_name:Zcash
crypto_name:Compound
crypto_name:Chiliz
crypto_name:Arweave
crypto_name:Holo
crypto_name:Decred
crypto_name:NEM
crypto_name:Theta Fuel
crypto_name:Enjin Coin
crypto_name:Revain
crypto_name:Huobi Token
crypto_name:OKB
crypto_name:Decentraland
crypto_name:SushiSwap
crypto_name:ICON
crypto_name:XDC Network
crypto_name:Qtum
crypto_name:TrueUSD
crypto_name:yearn.finance
crypto_name:Nexo
crypto_name:Celsius
crypto_name:Bitcoin Gold
crypto_name:Curve DAO Token
crypto_name:Mina
crypto_name:KuCoin Token
crypto_name:Zilliqa
crypto_name:Perpetual Protocol
crypto_name:Ren
crypto_name:dYdX
crypto_name:Ravencoin
crypto_name:Synthetix
crypto_name:renBTC
crypto_name:Telcoin
crypto_name:Basic Attention Token
crypto_name:Horizen
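If you do want the table rows through Selenium rather than the API, the rows on this page appear to be rendered lazily as you scroll, so one option -- an assumption on my part, not something verified in this thread -- is to scroll the page programmatically (using the driver from the question) before reading page_source:
import time

# scroll down in steps so lazily rendered rows get a chance to mount
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
html = driver.page_source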

Web scraping in Python - extract a value from website

I'm trying to extract two values from this website:
bizportal.co.il
One value is the dollar rate on the right, and the other is the rise/drop percentage on the left.
The problem is that after I get the dollar rate value, the number is rounded for some reason (you can see it in the terminal). I want to get the exact number as shown on the website.
Is there some friendly documentation for web scraping in Python?
P.S.: How can I get rid of the pop-up Python terminal window when running code in VS? I just want the output to appear in VS, in the interactive window.
from urllib.request import urlopen
from bs4 import BeautifulSoup

my_url = "https://www.bizportal.co.il/forex/quote/generalview/22212222"
uClient = urlopen(my_url)
page_html = uClient.read()
uClient.close()
page_soup = BeautifulSoup(page_html, "html.parser")
div_class = page_soup.findAll("div",{"class":"data-row"})
print (div_class)
#print(div_class[0].text)
#print(div_class[1].text)
The data is loaded dynamically via Ajax, but you can simulate this request with the requests module:
import json
import requests
url = 'https://www.bizportal.co.il/forex/quote/generalview/22212222'
ajax_url = "https://www.bizportal.co.il/forex/quote/AjaxRequests/DailyDeals_Ajax?paperId={paperId}&take=20&skip=0&page=1&pageSize=20"
paper_id = url.rsplit('/')[-1]
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
data = requests.get(ajax_url.format(paperId=paper_id), headers=headers).json()
# uncomment this to print all data:
#print(json.dumps(data, indent=4))
# print first one
print(data['Data'][0]['rate'], data['Data'][0]['PrecentageRateChange'])
Prints:
3.4823 -0.76%
The problem is that this element is being dynamically updated with JavaScript. You will not be able to scrape the up-to-date value with urllib or requests. When the page is loaded, it is populated with a recent value (likely from a database), which is then replaced with the real-time number via JavaScript.
In this case it is better to use something like Selenium to load the webpage - this allows the JavaScript to execute on the page - and then scrape the numbers.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.add_argument("--headless") # allows you to scrape page without opening the browser window
driver = webdriver.Chrome('./chromedriver', options=options)
driver.get("https://www.bizportal.co.il/forex/quote/generalview/22212222")
time.sleep(1) # put in to allow JS time to load, sometimes works without.
values = driver.find_elements_by_class_name('num')
price = values[0].get_attribute("innerHTML")
change = values[1].find_element_by_css_selector("span").get_attribute("innerHTML")
print(price, "\n", change)
Output:
╰─$ python selenium_scrape.py
3.483
-0.74%
You should familiarize yourself with Selenium: how to set it up and how to run it. This includes installing the browser (in this case I am using Chrome, but you can use others), knowing where to get the browser driver (ChromeDriver in this case), and understanding how to parse the page. You can learn all about it here: https://www.selenium.dev/documentation/en/
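One note: in Selenium 4 the find_element(s)_by_* helpers were removed, so the snippet above would need the By API there. A small sketch of the same lookup, assuming the page still uses the num class and the nested span:
from selenium.webdriver.common.by import By

values = driver.find_elements(By.CLASS_NAME, 'num')
price = values[0].get_attribute("innerHTML")
change = values[1].find_element(By.CSS_SELECTOR, "span").get_attribute("innerHTML")
print(price, "\n", change)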

How to get python requests or urllib2 to wait for final page

I'm using Python requests and BeautifulSoup to verify an HTML document. However, the server for the landing page has some backend code that delays several seconds before presenting the final HTML document. I've tried the redirect=true approach, but I end up with the original document. When loading the URL in a browser, there is a 2-3 second delay while the page is created by the server. I've tried various samples like url2.geturl() after page load, but all of these return the original URL (and do so well before the 2-3 seconds elapse). I need something that emulates a browser and grabs the final document.
By the way, I am able to view the correct DOM elements in Chrome, just not programmatically in Python.
Figured this out after a few cycles. It requires combining two pieces: the Python selenium package and time.sleep. Set the background Chrome process to run headless, get the URL, wait for the server-side code to complete, then read the document. Here, I'm using BeautifulSoup to parse the DOM.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
def run():
    url = "http://192.168.1.55"
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    browser = webdriver.Chrome(chrome_options=options)
    browser.get(url)
    time.sleep(5)
    bs = BeautifulSoup(browser.page_source, 'html.parser')
    data = bs.find_all('h3')

if __name__ == "__main__":
    run()
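A fixed time.sleep(5) works, but it wastes time when the page is ready sooner and can still be too short on a slow server. A hedged alternative inside run() is to wait explicitly for the element being parsed (the h3 tag here, since that is what the code above looks for):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser.get(url)
# block until at least one <h3> exists (up to 15 seconds) instead of sleeping blindly
WebDriverWait(browser, 15).until(EC.presence_of_element_located((By.TAG_NAME, 'h3')))
bs = BeautifulSoup(browser.page_source, 'html.parser')
data = bs.find_all('h3')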

How to retrieve the values of dynamic html content using Python

I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:
url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url);
response = request.urlopen(url)
data = str(response.read(10000))
data = data.replace("\\n", "\n")
print(data)
Where I'm trying to find a particular value, I find a template instead, e.g. "{{formatPrice median}}" instead of "4.48".
How can I make it so that I can retrieve the value instead of the placeholder text?
Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}
Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.
The code I have now is:
from bs4 import BeautifulSoup
from selenium import webdriver
#...
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html)
print "Finding..."
for tag in soup.find_all('formatPrice median'):
print tag.text
Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.
Assuming you are trying to get values from a page that is rendered using JavaScript templates (for instance something like Handlebars), then this is what you will get with any of the standard solutions (i.e. BeautifulSoup or requests).
This is because the browser uses JavaScript to alter what it received and to create new DOM elements. urllib will do the requesting part like a browser, but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:
parse the Ajax JSON directly
use an offline JavaScript interpreter to process the request (SpiderMonkey, Crowbar)
use a browser automation tool (Splinter)
This answer provides a few more suggestions for option 3, such as Selenium or Watir. I've used Selenium for automated web testing and it's pretty handy.
EDIT
From your comments it looks like it is a handlebars driven site. I'd recommend selenium and beautiful soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')
html = driver.page_source
soup = BeautifulSoup(html)
# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
print tag.text
Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)
I used selenium + chrome
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "www.sitetotarget.com"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')`
Building off another answer: I had a similar issue. wget and curl no longer work well for getting the content of a web page; they are particularly broken with dynamic and lazy content. Using Chrome (or Firefox, or the Chromium version of Edge) allows you to deal with redirects and scripting.
The code below launches an instance of Chrome, increases the page-load timeout to 5 seconds, and navigates that browser instance to a URL. I ran this from Jupyter.
import time
from tqdm.notebook import trange, tqdm
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()
An example of grabbing dynamic content and taking screenshots of anchor HTML objects (hence the "a" tag), another name for hyperlinks:
url = 'http://www.example.org' ## Any website
driver.get(url)
pageSource = driver.page_source
print(driver.get_window_size())
locations = []
for element in driver.find_elements_by_tag_name("a"):
    location = element.location
    size = element.size
    # Collect coordinates of object: left/right, top/bottom
    x1 = location['x']
    y1 = location['y']
    x2 = location['x'] + size['width']
    y2 = location['y'] + size['height']
    locations.append([element, x1, y1, x2, y2, x2-x1, y2-y1])
locations.sort(key=lambda x: -x[-2] - x[-1])
locations = [
    (el, x1, y1, x2, y2, width, height)
    for el, x1, y1, x2, y2, width, height in locations
    if not (
        ## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
        x2 <= x1 or y2 <= y1 or x2 < 0 or y2 < 0
        ## Further restrict if you expect the objects to be around a specific size
        ## or width < 200 or height < 100
    )
]
for el, x1, y1, x2, y2, width, height in tqdm(locations[:10]):
    try:
        print('-'*100, f'({width},{height})')
        print(el.text[:100])
        element_png = el.screenshot_as_png
        with open('/tmp/_pageImage.png', 'wb') as f:
            f.write(element_png)
        img = Image.open('/tmp/_pageImage.png')
        display(img)
    except Exception as err:
        print(err)
Installation for mac+chrome:
pip install selenium
brew cask install chromedriver
brew cask install google-chrome
I was using a Mac for the original answer, and Ubuntu + Windows 11 preview via WSL2 after updating. Chrome ran from the Linux side, with an X service on Windows rendering the UI.
Regarding responsibility, please respect robots.txt on each site.
I know this is an old question, but sometimes there is a better solution than using heavyweight Selenium.
This requests-style module for Python comes with JS support (in the background it is still Chromium), and you can still use BeautifulSoup as normal.
Though if you have to click elements or do something similar, I guess Selenium is the only option.
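The answer doesn't name the module, but the description (a requests-style API with a Chromium renderer behind it, still usable with BeautifulSoup) matches requests-html; assuming that is what is meant, a minimal sketch:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
r = session.get('http://www.example.org')
r.html.render()  # downloads Chromium on first use and executes the page's JavaScript
soup = BeautifulSoup(r.html.html, 'html.parser')  # the rendered HTML parses like any static page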
