Python/Dynamic Parsing: <div id="root"> can't parse anything inside - python

I've been wanting to parse information from a particular website, and I have been having problems with its dynamic aspect. When I request the site in Python and parse it with BeautifulSoup, etc., everything inside <div id="root"> is missing.
According to the answer to this similar question -- Why isn't the html code inside div is being parsed? -- I tried to use a headless browser. I ended up trying selenium and splinter with the '--headless' option enabled for Chrome.
I don't know whether the headless browser I chose is just the wrong one for this particular website's setup, or if it's my code, so please give me suggestions if you have any.
Notes: Running on Ubuntu 20.04.1 LTS and Python 3.8.3. If you want to suggest different headless browser programs, go ahead, but they need to be compatible with Linux, Mac, etc., and Python.
Below is my most recent code. I've tried various ways to ".find" the button I want to click. Here I used the XPath of the element I want, which I copied from the browser's inspector:
from bs4 import BeautifulSoup
from splinter import Browser
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--ignore-certificate-errors')

with Browser('chrome', options=options) as browser:
    browser.visit("http://gnomad.broadinstitute.org/region/16-2087388-2087428?dataset=gnomad_r2_1")
    print(browser.title)
    browser.find_by_xpath('//*[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button').first.click()
The error message this gave me was:
File "etc/anaconda3/lib/python3.8/site-packages/splinter/element_list.py", line 42, in __getitem__
return self._container[index]
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "practice3.py", line 20, in
browser.find_by_xpath('//[#id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button').first.click()
File "etc/anaconda3/lib/python3.8/site-packages/splinter/element_list.py", line 57, in first
return self[0]
File "etc/anaconda3/lib/python3.8/site-packages/splinter/element_list.py", line 44, in getitem
raise ElementDoesNotExist(
splinter.exceptions.ElementDoesNotExist: no elements could be found with xpath "// [#id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button"
Thanks!

Your problem seems to be that you don't wait for the elements to fully load. I set up your code in my environment, printed the source of the website, and ran the response through an HTML beautifier:
https://www.freeformatter.com/html-formatter.html#ad-output
There I found that a div you want to access is in the state
<div class="StatusMessage-xgxrme-0 daewTb">Loading region...</div>
which implies that the site has not fully loaded yet. To fix this, you can simply wait for the website to load, which Selenium can do:
from selenium.webdriver.support.ui import WebDriverWait
WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button')))
This will wait for the element to be loaded and clickable.
Here's the code snippet I tested:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--ignore-certificate-errors')
with webdriver.Chrome("<path-to-driver>", options=options) as browser:
    browser.get("http://gnomad.broadinstitute.org/region/16-2087388-2087428?dataset=gnomad_r2_1")
    WebDriverWait(browser, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button')))
    print(browser.title)
    print(browser.page_source)
    b = browser.find_element_by_xpath('//*[@id="root"]/div/div/div[2]/div/div[3]/section/div[2]/button')
    browser.execute_script("arguments[0].click()", b)
Simply replace the <path-to-driver> with the path to your chrome webdriver.
The last bit is because clicking the button directly raised an error, which the question "selenium.common.exceptions.ElementClickInterceptedException: Message: element click intercepted: Element is not clickable with Selenium and Python" solved.
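Since the original goal was to parse the page with BeautifulSoup, here is a minimal sketch of handing the rendered source over to it inside the with block, once the wait has succeeded (the get_text slice is only an illustration):

from bs4 import BeautifulSoup

# Inside the with-block above, after the WebDriverWait succeeds, page_source
# contains the rendered DOM, so BeautifulSoup can now see inside <div id="root">.
soup = BeautifulSoup(browser.page_source, "html.parser")
root = soup.find("div", id="root")
if root is not None:
    print(root.get_text()[:200])  # illustration only; extract whatever you actually need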

Related

Handling "Accept all cookie" popup with selenium when selector is unknown

I have a Python script. It looks like this.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.select import Select
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from os import path
import time
# Tried this code
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
browser = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
links = ["https://www.henleyglobal.com/", "https://markets.ft.com/data"]
for link in links:
    browser.get(link)
    # WebDriverWait(browser, 20).until(EC.url_changes(link))
    # How do I disable/ignore/remove/escape this "Accept all cookies" popup and then access the website to scrape data?

browser.quit()
So each website in the links array displays an "Accept all cookies" popup after you navigate to the site; see the image below.
I have tried many ways and nothing works; see the attempt after the imports.
How do I dismiss/bypass/escape this popup and then access the website to scrape the data?
If you open your page in a new browser, you'll notice that the page fully loads and then, a moment later, the popup appears. The default wait strategy in Selenium is only that the page is loaded.
One way to handle this is to simply inspect the page and find the XPath of the popup window. The code below should work for that.
browser.implicitly_wait(30)
if link == 'https://www.henleyglobal.com/':
    browser.find_element(By.XPATH, "/html/body/div[7]/div/div/div/div[2]/div/div[2]/button[2]").click()
else:
    browser.find_element(By.XPATH, "/html/body/div[4]/div/div/div[2]/div[2]/a").click()
The implicit wait gives the pop-up element up to 30 seconds to appear before it is looked up and clicked.
For unknown sites you could try:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-notifications")
webdriver.Chrome(os.path.join(path, 'chromedriver'), chrome_options=chrome_options)
Generally, you cannot use a universal locator that will match the "Accept cookies" button on each and every web site in the world.
Even here, you have 2 different sites, and the elements you need to click are totally different on each of them.
For the https://www.henleyglobal.com/ site the correct locator may be something like the CSS selector .confirmation button.primary-btn, while for the https://markets.ft.com/data site I'd advise using the CSS selector .o-cookie-message__actions a.o-cookie-message__button.
These 2 elements are totally different: the first one is a button while the second is an a; they have totally different class names and all other attributes.
You might think about the Accept text. It seems to be common, so you could use the XPath //*[contains(text(),'Accept')], but even this will not work, since on the first page it matches 2 elements and the accept-cookies element is the second of them...
So there are no general locators; you will have to define separate locators for each page.
Again, for https://www.henleyglobal.com/ I would prefer
driver.find_element(By.CSS_SELECTOR, ".confirmation button.primary-btn").click()
While for the second page https://markets.ft.com/data I would prefer this
driver.find_element(By.CSS_SELECTOR, ".o-cookie-message__actions a.o-cookie-message__button").click()
Also, generally we always use WebDriverWait with expected_conditions explicit waits, so the code would be as follows:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
# for the first page
wait.until(EC.element_to_be_clickable((By.XPATH, ".confirmation button.primary-btn"))).click()
# for the second page
wait.until(EC.element_to_be_clickable((By.XPATH, ".o-cookie-message__actions a.o-cookie-message__button"))).click()

Unable to extract page title and page_source using IEDriverServer and Selenium through Python

I have just started Selenium coding.
I have Python 3.6.6 and am executing the following code in a Jupyter notebook (with the Chrome browser):
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Ie("C:\\Python 36\\IEDriverServer.exe")
driver.get('https://google.com')
print(driver.title)
print(driver.page_source)
driver.close()
This gives the following output:
WebDriver
WebDriverThis is the initial start page for the WebDriver server.
In this process an IE browser opens and goes to google.com (or any desired site) but never gets closed.
To extract the Page Title and the Page Source you need to:
Invoke the FQDN, i.e. https://www.google.com/, through get(), i.e. including the www.
Induce WebDriverWait for a clickable WebElement to be interactive.
While ending your program invoke quit() instead of close().
You can use the following solution:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Ie("C:\\Python 36\\IEDriverServer.exe")
driver.get('https://www.google.com/')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.NAME, "q")))
print(driver.title)
print(driver.page_source)
driver.quit()

Python Selenium find element XPath doesn't work

I'm trying to find refreshing elements (a time in minutes) on the webpage. My code worked only for simple text earlier. Now I use Ctrl+Shift+I, point at my element, and select "Copy XPath".
Also, I have the Chrome extension "XPath Helper" and tried it with that as well. It gives a longer XPath than the one in my code below, and it doesn't work either.
Error: NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id....
I also tried finding by class, by tag, and by CSS selector. It only worked by tag, and not reliably, on a different page.
And I'm not even talking about printing it: sometimes find_element(By.XPATH,'//*[...).text works, sometimes not.
I don't understand why it works on one page and not on a second one. I want to work with finding elements by XPath in Flash later.
UPDATE: Now I re-ran the code and it works! But it still doesn't work on the next webpage. Why is it so changeable? Does the XPath change when the page reloads, or what? What is the simplest way to get (refreshing) text info from Flash content opened in a Chrome browser?
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(r"C:\Users\vishniakov\Desktop\python bj\driver\chromedriver.exe",chrome_options=options)
driver.get("https://www.betfair.com/sport/football/event?eventId=28935432")
print(driver.title)
elem = driver.find_element(By.XPATH, '//*[@id="yui_3_5_0_1_1538670363144_2571"]').text
print(elem)
This will work on the assumption that you want the data of that page rather than of any specific element:
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=options)
driver.get("https://www.betfair.com/sport/football/event?eventId=28935730")
print(driver.title)
elem =driver.find_element(By.CSS_SELECTOR,'.scroller.context-event').text
print(elem)
Assuming you do want specific data, you can use the contains() XPath function... You can read about it here.
For your case:
import selenium
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(r"C:\Users\vishniakov\Desktop\python bj\driver\chromedriver.exe",chrome_options=options)
driver.get("https://www.betfair.com/sport/football/event?eventId=28935432")
print(driver.title)
elements = driver.find_elements(By.XPATH, '//*[contains(@id, "yui_3_5_0_1_")]')
print([i.text for i in elements])
You can play around with contains() if my example didn't work... You need to find which part of the id changes and exclude that part from the locator.
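For example, if the stable part of the id really is the yui_3_5_0_1_ prefix (an assumption; verify it against the live page), the starts-with() XPath function expresses the same idea a bit more strictly:

elements = driver.find_elements(By.XPATH, '//*[starts-with(@id, "yui_3_5_0_1_")]')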
Hope this helps you!

How to find the href attribute of the videos on twitch through selenium and python?

I'm trying to find the twitch video IDs of all videos for a specific user. So for example on this page
https://www.twitch.tv/dyrus/videos/all
So here we have all videos linked, but it's not quite as simple as just scraping the HTML and finding the links, since they seem to be generated dynamically.
So I heard about selenium and did something like this:
from selenium import webdriver
# Change path here obviously
driver = webdriver.Chrome('C:/Users/Jason/Downloads/chromedriver')
driver.get('https://www.twitch.tv/dyrus/videos/all')
link_element = driver.find_elements_by_xpath("//*[#href]")
for link in link_element:
    print(link.get_attribute('href'))

driver.close()
This returns a bunch of links on the page, but not the videos; they lie "deeper", I think. Any input?
Thanks in advance
I would still suggest a couple of changes as follows:
Always open the Web Browser in maximized mode so that all/majority of the desired elements are within the Viewport.
If you are on Windows OS you need to append the extension .exe at the end of the WebDriver variant name, e.g. chromedriver.exe
When you identify elements, always try to include the class attribute in your Locator Strategy.
Always invoke driver.quit() at the end of your #Test to close & destroy the WebDriver and Web Client instances gracefully.
Here is your own code block with the above mentioned tweaks:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\path\to\chromedriver.exe')
driver.get('https://www.twitch.tv/dyrus/videos/all')
link_elements = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.tw-interactive.tw-link[data-a-target='preview-card-image-link']")))
for link in link_elements:
    print(link.get_attribute('href'))

driver.quit()
Console Output:
https://www.twitch.tv/videos/295314690
https://www.twitch.tv/videos/294901947
https://www.twitch.tv/videos/294472813
https://www.twitch.tv/videos/294075254
https://www.twitch.tv/videos/293617036
https://www.twitch.tv/videos/293236560
https://www.twitch.tv/videos/292800601
https://www.twitch.tv/videos/292409437
https://www.twitch.tv/videos/292328170
https://www.twitch.tv/videos/292032996
https://www.twitch.tv/videos/291625563
https://www.twitch.tv/videos/291192151
https://www.twitch.tv/videos/290824842
https://www.twitch.tv/videos/290434348
https://www.twitch.tv/videos/290021370
https://www.twitch.tv/videos/289561690
https://www.twitch.tv/videos/289495488
https://www.twitch.tv/videos/289138003
https://www.twitch.tv/videos/289110429
https://www.twitch.tv/videos/288804893
https://www.twitch.tv/videos/288784992
https://www.twitch.tv/videos/288687479
https://www.twitch.tv/videos/288432438
https://www.twitch.tv/videos/288117849
https://www.twitch.tv/videos/288004968
https://www.twitch.tv/videos/287689102
https://www.twitch.tv/videos/287451192
https://www.twitch.tv/videos/287267032
https://www.twitch.tv/videos/287017431
https://www.twitch.tv/videos/286819343
With your locator, you are returning every element on the page that contains an href attribute. You can be a little more specific than that and get what you are looking for. Switch to a CSS selector...
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Change path here obviously
driver = webdriver.Chrome('C:/Users/Jason/Downloads/chromedriver')
driver.get('https://www.twitch.tv/dyrus/videos/all')
links = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-a-target='preview-card-image-link']")))
for link in links:
    print(link.get_attribute('href'))

driver.close()
That prints 40 links from the page.
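One caveat: the Twitch videos page loads older videos lazily as you scroll, so those 40 links are likely just the first batch. A rough sketch of scrolling until the page stops growing (the 2-second pause is an arbitrary guess, and depending on the page structure the scroll may need to target an inner container rather than document.body):

import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the next batch of video cards time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # nothing new was added, assume we reached the end
    last_height = new_height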

Issues clicking an element using selenium

I'm using this code to explore TripAdvisor (Portuguese comments):
from selenium import webdriver
from bs4 import BeautifulSoup
driver=webdriver.Firefox()
driver.get("https://www.tripadvisor.com/Airline_Review-d8729164-Reviews-Cheap-Flights-TAP-Portugal#review_425811350")
driver.set_window_size(1920, 1080)
Then I'm trying to click the Google Translate link:
driver.find_element_by_class_name("googleTranslation").click()
But I get this error:
WebDriverException: Message: Element is not clickable at point (854.5, 10.100006103515625). Other element would receive the click: <div class="inner easyClear"></div>
So the div class="inner easyClear" is getting the click instead. I tried exploring it:
from bs4 import BeautifulSoup
page = BeautifulSoup(driver.page_source, "html.parser")
for i in page.findAll("div", "easyClear"):
    print(i)
    print("=================")
But I was unable to get any intuition from this about what changes to make so that the "Google Translate" link becomes clickable. Please help.
===============================EDIT===============================
I've also tried these:
driver.execute_script("window.scrollTo(0, 1200);")
driver.find_element_by_class_name("googleTranslation").click()
Resizing the browser to full screen, etc.
What worked for me was to use an Explicit Wait with the element_to_be_clickable Expected Condition and target the inner span element:
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://www.tripadvisor.com.br/ShowUserReviews-g1-d8729164-r425802060-TAP_Portugal-World.html")
wait = WebDriverWait(driver, 10)
google_translate = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".googleTranslation .link")))
actions = ActionChains(driver)
actions.move_to_element(google_translate).click().perform()
You may also be getting into a "survey" or "promotion" popup - make sure to account for those.
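One way to account for such an overlay is to attempt to dismiss it and ignore the failure when it isn't present; the selector below is purely a placeholder, since the real popup's close button would have to be inspected first:

from selenium.common.exceptions import NoSuchElementException

try:
    # Placeholder selector: replace with the actual close button of the survey/promotion popup.
    driver.find_element(By.CSS_SELECTOR, ".overlay-dismiss").click()
except NoSuchElementException:
    pass  # no popup was shown, carry on with the Google Translate click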
