Scraping result is different from inspected DOM element - python

I want to parse a list of prices on a web page using the Selenium WebDriver in Python, so I try to fetch all the DOM elements with this code:
from selenium import webdriver

url = 'https://www.google.com/flights/explore/#explore;f=BDO;t=r-Asia-0x88d9b427c383bc81%253A0xb947211a2643e5ac;li=0;lx=2;d=2018-01-09'
driver = webdriver.Chrome()
driver.get(url)
print(driver.page_source)
The problem is that what I get from page_source differs from what I see in the inspected element:
<div class="CTPFVNB-f-a">
    <div class="CTPFVNB-f-c"></div>
    <div class="CTPFVNB-f-d" elt="toolbelt"></div>
    <div class="CTPFVNB-f-e" elt="result">Here is the difference</div>
</div>
The difference is inside the CTPFVNB-f-e div. In the inspected DOM, this tag holds all the prices I want to fetch, but in the page_source output this part is missing.
Could anyone tell me what is wrong with my code? Or do I need further steps to parse the list of prices?

JavaScript modifies the page after it loads. Since you print the page source immediately after opening the page, you get the initial markup before the JavaScript has executed.
You can do any one of the following things:
Add a delay: time.sleep(x), where x is the number of seconds to wait; tune it to your needs (NOT recommended)
Implicit wait: driver.implicitly_wait(x) (x in seconds, as above)
Explicit wait: wait for the HTML element to appear and only then get the page source; see the sketch below (HIGHLY recommended)
An explicit wait is the better option here because it waits only as long as the element takes to become visible, so it adds no excess delay. An implicit wait, on the other hand, may not give you the desired output if the page loads more slowly than expected.
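A minimal sketch of the explicit-wait approach for this page (the CTPFVNB-f-e class name is taken from your snippet; generated class names like this can change, so verify it before relying on it):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(url)
# Block for up to 20 seconds until the price container has rendered;
# only then does page_source include the JavaScript-generated content.
WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.CLASS_NAME, 'CTPFVNB-f-e'))
)
print(driver.page_source)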

Related

How to prevent "Stale Element Reference" errors in Selenium

import random
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=s)
url = "https://fourminutebooks.com/book-summaries/"
driver.get(url)
page_tabs = driver.find_elements(By.CSS_SELECTOR, "a[class='post_title w4pl_post_title']")
#html = driver.find_elements(By.CSS_SELECTOR,"header[class='entry-header page-header']")
length_page_tabs = len(page_tabs)
in_length = len(page_tabs)
for i in range(length_page_tabs):
    ran = random.randint(0, in_length)
    page_tabs[ran].click()
    driver.execute_script("window.history.go(-1)")
    time.sleep(10)
    #need to get page source of html and then open it to a new file, extract what I want and add it to the email
I am trying to click one of the links, get the HTML code, email it to myself, and then go back a page and repeat. However, after clicking the first random link, the code stops working and instead I get a StaleElementReferenceException.
You have to be very careful when you assign a collection of elements to a variable and then iterate over it to perform actions.
page_tabs = driver.find_elements...
All the elements in this case are cached, and each browser action that navigates to another page, refreshes the page, etc. will make all of these cached elements stale. They become out-of-date, and it is no longer possible to interact with them.
So, to avoid stale element reference errors, you have to either prevent any page reloads or re-find the elements every time the page state has changed.
StaleElementReferenceException
StaleElementReferenceException is a type of WebDriverException which is thrown when a reference to an element has gone stale, i.e. the element no longer appears in the HTML DOM of the page.
Some of the possible causes of StaleElementReferenceException include:
You are no longer on the same page, or the page may have refreshed since the element was last located.
The element may have been removed and re-added to the DOM tree since it was located, such as when an element is relocated. This typically happens with a JavaScript framework when values are updated and the node is rebuilt.
The element may have been inside an iframe or another context which was refreshed.
This use case
In your use case, you created a list of WebElements, i.e. page_tabs, using the locator strategy:
page_tabs = driver.find_elements(By.CSS_SELECTOR, "a[class='post_title w4pl_post_title']")
Next, within the loop, whenever you invoke click() on page_tabs[ran] you are redirected to a new page, where the elements within the list page_tabs become stale and new elements are loaded.
Moving forward, when you invoke driver.execute_script("window.history.go(-1)") you move back to the main page where the elements of page_tabs were present, and they are loaded afresh. At this point the list page_tabs still holds the WebElements of the previous search, which have now become stale. Hence, during the second iteration you face a StaleElementReferenceException.
Solution
In your use case, to avoid StaleElementReferenceException, since the desired elements are <a> tags you can store their href attributes in a list instead of saving the elements themselves, and then invoke get(href) as follows:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://fourminutebooks.com/book-summaries/")
hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[class='post_title w4pl_post_title']")))]
for href in hrefs:
    driver.get(href)
    print("Placeholder to perform the desired operations on the respective page")
driver.quit()
References
You can find a couple of relevant detailed discussions in:
StaleElementException when iterating with Python
Message: stale element reference: element is not attached to the page document in Python
StaleElementReferenceException: Message: stale element reference: element is not attached to the page document with Selenium and Python
Use driver.execute_script and JavaScript. JavaScript never goes stale because it is evaluated right away. In other words, if you select an element with Python and interact with it later, there's a decent chance it won't be there anymore. The only way to be sure it is still there is to evaluate it as you interact with it, and the only way to do that is to stay in the browser context.
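For example, a minimal sketch of that idea against the page above (the selector comes from the question; adjust it if the markup differs):

# Locate and click in one JavaScript evaluation; no Python-side element
# reference is kept, so nothing can go stale between find and click.
driver.execute_script(
    "document.querySelectorAll('a.post_title.w4pl_post_title')[arguments[0]].click();",
    0,  # index of the link to click -- vary it per iteration
)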

click link automatically and scraping

I am trying to extract all the product data from this page:
https://www.shufersal.co.il/online/he/קטגוריות/סופרמרקט/חטיפים%2C-מתוקים-ודגני-בוקר/c/A25
I tried
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

shufersal = "https://www.shufersal.co.il/online/he/%D7%A7%D7%98%D7%92%D7%95%D7%A8%D7%99%D7%95%D7%AA/%D7%A1%D7%95%D7%A4%D7%A8%D7%9E%D7%A8%D7%A7%D7%98/%D7%97%D7%98%D7%99%D7%A4%D7%99%D7%9D%2C-%D7%9E%D7%AA%D7%95%D7%A7%D7%99%D7%9D-%D7%95%D7%93%D7%92%D7%A0%D7%99-%D7%91%D7%95%D7%A7%D7%A8/c/A25"

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(shufersal)
products = driver.find_elements_by_css_selector("li.miglog-prod.miglog-sellingmethod-by_unit")
The problem is that the product details are shown only when I click the product:
Is there any option to click all the links automatically and scrape the windows that open?
What you want can be achieved, but it is a significantly more time-consuming process.
You'll have to:
identify the elements that you need to click on the page (they'll probably share the same class) and then select them all:
buttons_to_click = driver.find_elements_by_css_selector({RELEVANT SELECTOR HERE})
Then you should loop through all the clickable elements, click on each, wait for the popup to load, scrape the data, and close the popup:
scraped_list = []
for button_instance in buttons_to_click:
    button_instance.click()
    #scrape the information you need here and append it to scraped_list
    #find and click the close-popup button
    driver.find_element_by_xpath({XPATH TO ELEMENT}).click()
For this to work properly it is important to set up Selenium's implicit wait. If the driver doesn't find the required element, it keeps retrying for up to X seconds until it is loaded; once X seconds pass, it throws an error (which you can handle in your code if you expect it).
In your case you need the wait because after you click a product to display the popup, the information might take a few seconds to load; if you don't set an implicit wait, your script will exit with an element-not-found error. More information on Selenium's waits can be found here: https://selenium-python.readthedocs.io/waits.html
#put this line immediately after creating the driver object
driver.implicitly_wait(10) # seconds
Suggestion:
I suggest always using XPath when looking up elements: its syntax can emulate all the other Selenium selectors, it is fast, and it makes for an easier transition to C-compiled HTML parsers, which you will need if you scale your scraper (I recommend lxml, a Python package that uses a compiled parser).
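Putting the pieces together, a minimal sketch under those assumptions (the popup and close-button XPaths below are hypothetical placeholders; inspect the actual Shufersal markup to fill them in):

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.implicitly_wait(10)  # retry element lookups for up to 10 seconds
driver.get(shufersal)

scraped_list = []
buttons_to_click = driver.find_elements_by_css_selector("li.miglog-prod.miglog-sellingmethod-by_unit")
for button_instance in buttons_to_click:
    button_instance.click()
    # Hypothetical popup selector -- replace with the real one.
    details = driver.find_element_by_xpath("//div[@class='productPopup']").text
    scraped_list.append(details)
    # Hypothetical close-button selector -- replace with the real one.
    driver.find_element_by_xpath("//button[@class='closePopup']").click()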

Python/Selenium:Different ways to click this specific button

I am trying to understand Python in general as I just switched over from using VBA. I am interested in the possible ways you could approach this single issue. I already went around it by just going to the link directly, but I need to understand and apply it here.
from selenium import webdriver
chromedriver = r'C:\Users\dd\Desktop\chromedriver.exe'
browser = webdriver.Chrome(chromedriver)
url = 'https://www.fake.com/'
browser.get(url)
browser.find_element_by_id('txtLoginUserName').send_keys("Hello")
browser.find_element_by_id('txtLoginPassword').send_keys("There")
browser.find_element_by_id('btnLogin').click()
At this point, I am trying to navigate to a particular button/link.
Here is the info from the page/element
T-Mobile
Here are some of the things I tried:
for elem in browser.find_elements_by_xpath("//*[contains(text(), 'T-Mobile')]"):
    elem.click
browser.execute_script("InitiateCallBack(187, True, T-Mobile, https://www.fake.com/, TMobile)")
I also attempted to look up tags and use CSS selectors, all of which I deleted out of frustration!
Specific questions
How do I utilize the inner text, "T-Mobile", to click the button?
How would I execute the onclick event?
I've tried to read the following links, but still have not succeeded in coming up with a different way. Part of it is probably because I don't understand the specific syntax yet. This is just some of what I looked at. I spent about 3 hours trying various things before I came here!
selenium python onclick() gives StaleElementReferenceException
http://selenium-python.readthedocs.io/locating-elements.html
Python: Selenium to simulate onclick
https://stackoverflow.com/questions/43531654/simulate-a-onclick-with-selenium
https://stackoverflow.com/questions/45360707/python-selenium-using-onclick
Running javascript in Selenium using Python
How do I utilize the inner text, "T-Mobile", to click the button?
find_elements_by_link_text would be appropriate for this case.
elements = driver.find_elements_by_link_text('T-Mobile')
for elem in elements:
    elem.click()
There's also a by_partial_link_text locator as well if you don't have the full exact text.
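For example:

# Matches links whose visible text merely contains the given fragment.
elements = driver.find_elements_by_partial_link_text('T-Mob')
for elem in elements:
    elem.click()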
How would I execute the onclick event?
The simplest way would be to simply call .click() on the element as shown above and the event should, naturally, execute at that time.
Alternatively, you can retrieve the onclick attribute and use driver.execute_script to run the js.
for elem in elements:
    script = elem.get_attribute('onclick')
    driver.execute_script(script)
Edit:
Note that in your code you wrote elem.click -- this does nothing. elem.click() (note the parens) calls the click method.
is there a way to utilize browser.execute_script() for the onclick event
execute_script can fire the equivalent event, but you may miss other listeners by doing this. Using the element's click method is the most sound approach. There may well be implementation details of the site that hinder your automation efforts, but those possibilities are endless; without seeing the actual context, it's hard to say.
You can use JS methods to click an element or otherwise interact with the page, but you may miss certain event listeners that fire when the site is used 'normally'; you want to emulate normal use as closely as possible.
As per the HTML you have shared, it's pretty clear the website uses JavaScript. So to click() on the link with text T-Mobile you have to induce WebDriverWait with the expected_conditions clause element_to_be_clickable, and you can use the following code block:
WebDriverWait(driver, 20).until(expected_conditions.element_to_be_clickable((By.XPATH, "//a[contains(.,'T-Mobile')]"))).click()
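For reference, that one-liner assumes imports along these lines:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions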
You can also work from the element's markup directly, for example:
<div class="button c_button s_button" onclick="submitForm('rMTF')" style="margin-bottom: 30px;">
    <input class="v_small" type="button"></input>
    <span>
        Reset
    </span>
</div>
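Given markup like that, one hedged sketch is to target the inline onclick attribute shown above and click the element:

# Click the Reset button by matching its inline onclick handler.
driver.find_element_by_css_selector("div[onclick=\"submitForm('rMTF')\"]").click()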

Selenium Stale Element driver.get(url) inside Loop

I want to iterate through a set of URLs using Selenium. From time to time I get 'element is not attached to the page document'. Reading a couple of other questions indicated that this happens because I am changing the page being looked at. But I am not satisfied with that argument, since:
for url in urlList:
    driver.get(url)
    WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '//div/div')))
    # ^ WebDriverWait should have taken care of it
    myString = driver.find_element_by_xpath('//div/div').get_attribute("innerHTML")
    # ^ Error occurs here
    # Then I call this function to go through other elements given other conditions not shown
    if myString:
        getMoreElements(driver)
But if I add a delay like this:
for url in urlList:
    driver.get(url)
    time.sleep(5)  # <<< IT WORKS, BUT WHY?
    element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '//div/div')))
    myString = driver.find_element_by_xpath('//div/div').get_attribute("innerHTML")  # Error occurred here
I feel I am hiding the problem by adding the delay right there. I have implicitly_wait set to 30 s and set_page_load_timeout to 90 s, which should have been sufficient. So why am I still forced to add what looks like a useless time.sleep?
Did you try the XPath //div/div manually in the dev tools to see how many divs are found on the page? I think there will be many. So your explicit wait below is very easy to satisfy, perhaps in under a second: Selenium can find such a div right after browser.get(), and your wait ends.
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '//div/div')))
Consider the following possibility:
Due to the explicit-wait issue above, page loading is not complete and more and more //div/div nodes are still being rendered to the page at the moment you ask Selenium to find such a div and interact with it.
Think about the possibility that the first div Selenium found is then deleted or moved to another DOM node.
How high do you think that possibility is? I think it's very high, because div is a very common tag in today's web pages and you used such a relaxed XPath that many matching divs are found, any one of which can cause the 'element is stale' issue.
To resolve your issue, please use a stricter locator to wait for some specific element, rather than such a hasty XPath that matches a very common, frequently occurring element.
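For instance, a minimal sketch of a stricter wait (the id main-content is hypothetical; substitute a locator unique to the pages you crawl):

# Wait on a page-specific container instead of the first of many generic divs.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, 'main-content'))
)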
What you observe as element is not attached to the page document is entirely possible.
Analysis:
In your code, while iterating over urlList, you open a URL and then wait for a WebElement with the XPath //div/div using the ExpectedConditions clause presence_of_element_located, which does not necessarily mean that the element is visible or clickable.
Hence, when you next try driver.find_element_by_xpath('//div/div').get_attribute("innerHTML"), the reference from the previous search/find_element is not found.
Solution:
The solution to your question would be to change the ExpectedConditions clause from presence_of_element_located to element_to_be_clickable which checks that element is visible and enabled such that you can even click it.
Code Block:
Your optimized code block may look like:
for url in urlList:
    driver.get(url)
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, '//div/div')))
    myString = driver.find_element_by_xpath('//div/div').get_attribute("innerHTML")
Your other solution:
Your other solution works because you are covering up Selenium's work with time.sleep(5), which is not part of best practices.

Can't find element by id using selenium

I am making a script in Python that goes to the webpage https://www.realtor.ca/ and searches for a specific location. My problem is at the very beginning. When you open the page, there is a large search element in the middle. The HTML for that element is:
<input name="ctl00$elContent$ctl00$home_search_input"
maxlength="255" id="home_search_input" class="m_hme_srch_ipt_txtbox"
type="text" style="display: none;">
I am trying to access the element with the find_element_by_id method, but I always get the error: Message: Unable to locate element: [id="home_search_input"]
This is my code:
from selenium import webdriver as web
Url = "https://www.realtor.ca/"
browser = web.Firefox()
browser.get(Url)
TextField = browser.find_element_by_id("home_search_input")
Has anyone encountered a similar problem or does anyone know how to fix it?
When navigating to the page, the element with the id home_search_input isn't visible at first. It seems it only becomes visible once you click the "Where are you looking" placeholder (which then disappears). You'll need to do this explicitly in your test.
Additionally make sure to use either implicit or explicit wait statements to ensure that the elements you interact with are properly loaded and rendered.
Here's an example for your page using the Java client bindings - Python should be quite similar:
driver.get("https://www.realtor.ca/");
new WebDriverWait(driver, 5).until(ExpectedConditions.elementToBeClickable(By.id("m_hme_wherelooking_lnk"))).click();
driver.findElement(By.id("home_search_input")).sendKeys("demo");
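A rough Python equivalent of the Java snippet above, assuming the usual Selenium imports and the browser object from the question:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser.get("https://www.realtor.ca/")
# Reveal the hidden search input by clicking the placeholder link first.
WebDriverWait(browser, 5).until(
    EC.element_to_be_clickable((By.ID, "m_hme_wherelooking_lnk"))
).click()
browser.find_element_by_id("home_search_input").send_keys("demo")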
