I'm having a bit of an issue using Selenium with Python. There is a page I'm scraping where I access the children of a parent element, but each time I run the script it's not guaranteed that I'll actually get the children.
So for example, I have:
filters = driver.find_element_by_class_name("classname")
filters_children = filters.find_elements_by_class_name("anotherclassname")
And I print out filters_children[1] just to make sure.
Around 60% of the time it works fine and filters_children holds the list of child elements. The other 40% of the time it's a NoneType, so it can't grab the elements.
I tried using a sleep of up to 10 seconds after the page rendered, but that hasn't helped a whole lot.
Your parent class might be too broad, and sometimes you might get a different element; your second query will then fail to find the proper child.
When searching via CSS selector, you can combine multiple nested classes by putting spaces between them, which lets you collapse your two queries into one.
I also suggest that you use a wait-until here to ensure the element is present. Compared to sleep, this polls the page periodically until it finds what you asked for (or times out).
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 30)  # timeout in seconds, as a number rather than a string
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".classname .anotherclassname")))
If the element also needs to be visible, change presence_of_all_elements_located to visibility_of_any_elements_located.
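For example, a minimal sketch reusing the placeholder class names from your question with the visibility variant; note that the wait call itself returns the matching elements, so you can assign them directly:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 30)
# the wait returns the list of matching elements once the condition passes
filters_children = wait.until(
    EC.visibility_of_any_elements_located((By.CSS_SELECTOR, ".classname .anotherclassname")))
print(filters_children[1].text)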
I am trying to extract all the product data from this page:
https://www.shufersal.co.il/online/he/קטגוריות/סופרמרקט/חטיפים%2C-מתוקים-ודגני-בוקר/c/A25
I tried
shufersal = "https://www.shufersal.co.il/online/he/%D7%A7%D7%98%D7%92%D7%95%D7%A8%D7%99%D7%95%D7%AA/%D7%A1%D7%95%D7%A4%D7%A8%D7%9E%D7%A8%D7%A7%D7%98/%D7%97%D7%98%D7%99%D7%A4%D7%99%D7%9D%2C-%D7%9E%D7%AA%D7%95%D7%A7%D7%99%D7%9D-%D7%95%D7%93%D7%92%D7%A0%D7%99-%D7%91%D7%95%D7%A7%D7%A8/c/A25"
import time

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(shufersal)
products = driver.find_elements_by_css_selector(
    "li.miglog-prod.miglog-sellingmethod-by_unit")
The problem is that the product details are shown only when I click the product:
Is there any option to click all the links automatically and scrape the windows that open?
What you want can be achieved, but it is a significantly more time-consuming process.
You'll have to:
identify the elements that you need to click on the page (they'll probably have the same class) and then select them all:
buttons_to_click = driver.find_elements_by_css_selector({RELEVANT SELECTOR HERE})
Then you should loop through all the clickable elements: click each one, wait for the popup to load, scrape the data, and close the popup:
scraped_list = []
for button_instance in buttons_to_click:
    button_instance.click()
    # scrape the information you need here and append it to scraped_list
    # find the close-popup button
    driver.find_element_by_xpath({XPATH TO ELEMENT}).click()
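As a rough sketch of that whole loop, using the product-tile selector from the question; the popup class and close-button XPath below are hypothetical stand-ins for the placeholders:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)
scraped_list = []
buttons_to_click = driver.find_elements_by_css_selector("li.miglog-prod.miglog-sellingmethod-by_unit")
for button_instance in buttons_to_click:
    button_instance.click()
    # wait for the popup to render before scraping it (class name is hypothetical)
    popup = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".product-popup")))
    scraped_list.append(popup.text)
    # close the popup before moving on (xpath is hypothetical)
    driver.find_element_by_xpath("//button[contains(@class, 'close')]").click()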
For this to work properly it is important to set up Selenium's implicit wait parameter. If Selenium doesn't find the required element, it will wait up to X seconds for it to load; once X passes, it throws an error (which you can handle in your code if it is expected).
In your case you need the wait because, after you click on the product to display the popup, the information might take a few seconds to load; if you don't set an implicit wait, your script will exit with an element-not-found error. More information on Selenium's wait parameters can be found here: https://selenium-python.readthedocs.io/waits.html
# put this line immediately after creating the driver object
driver.implicitly_wait(10)  # seconds
Suggestion:
I suggest you always use XPath when looking up elements: its syntax can emulate all the other Selenium selectors, it is fast, and it will make for an easier transition to C-compiled HTML parsers, which you'll need if you scale your scraper (I recommend lxml, a Python package that uses a compiled parser).
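A minimal sketch of that hand-off: let Selenium render the page, then parse the resulting HTML with lxml and keep using the same XPath syntax (the selector below is a hypothetical example):

from lxml import html

# parse the rendered page once with lxml's compiled parser
tree = html.fromstring(driver.page_source)
# the same xpath syntax carries over unchanged
product_names = tree.xpath("//li[contains(@class, 'miglog-prod')]//text()")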
I have a website which contains lots of images. I am trying to get the "src" of one specific image on the page, but I can't seem to find a way to point precisely to that image alone: every method that I know of returns more than one result.
If you go to:
https://weheartit.com/entry/349292873
you will see one big image staring right at you.
When the poster doesn't provide a title (which I found out later), the image gets an alt attribute of "Image by {user}", so I tried to use the [contains(text(), 'Image by')] identifier. That obviously doesn't work if the user actually provides a title; on the example page listed above, the image has an alt of 'aesthetic, couple, and cute image', for instance.
And so I tried to point to the tree where the element is listed by doing:
//div/div/a/img
which returns 5 different results from the same tree. So I tried indexing into the matches instead:
"(//div/div/a/img)[4]"
which works on most pages; however, on some it points to the website's logo instead of the image I am trying to download, because the structure differs and the 4th match suddenly becomes something other than my target.
Which leads me to my initial question: how can I point ONLY to the actual image that I am trying to download? I couldn't find a way to do so and I would be grateful if anyone could help me out with this!
You correctly described the issue.
So:
1. Wait for this element to become visible.
2. Get its attribute with get_attribute("src").
I used a CSS locator, but it can also be done with XPath.
Solution
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
driver = webdriver.Firefox()
driver.get('https://weheartit.com/entry/349292873')
wait = WebDriverWait(driver, 15)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".panel-large.list.js-entry-panel>a>img")))
link = driver.find_element_by_css_selector(".panel-large.list.js-entry-panel>a>img").get_attribute("src")
print(link)
driver.close()
driver.quit()
Output:
https://data.whicdn.com/images/349292873/original.gif
I am trying to use Selenium to log into a printer and conduct some tests. I really do not have much experience with this process and have found it somewhat confusing.
First, to get the values I wanted, I opened the printer's web page in Chrome and right-clicked "view page source". This turned out not to be helpful: from there I can only see a bunch of <script> tags that pull in some .js scripts. I am assuming this is a big part of my problem.
Next I selected the "inspect" option after right-clicking. From here I can see the actual HTML that is loaded. I logged into the site and recorded the process in Chrome, and with this I was able to identify the fields which contain the username and password. I went to that part of the HTML, right-clicked, and copied the XPath. I then tried to use Selenium's find_element_by_xpath, but still no luck. I have tried all the other methods too (find by ID and by name), however it returns an error that the element is not found.
I feel like there is something fundamental here that I am not understanding. Does anyone have any experience with this?
Note: I am using Python 3.7 and Selenium; however, I am not opposed to trying something other than Selenium if there is a more graceful way to accomplish this.
My code looks something like this:
EDIT
Here is my updated code. I can confirm this is not just a timing/wait issue: I have managed to successfully grab the first two outer elements, but the moment I go deeper it errors out.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def sel_test():
    chromeOptions = Options()
    chromeOptions.add_experimental_option("useAutomationExtension", False)
    browser = webdriver.Chrome(chrome_options=chromeOptions)
    url = 'http://<ip address>/'
    browser.get(url)
    try:
        element = WebDriverWait(browser, 10).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="ccrx-root"]')))
    finally:
        browser.quit()
The element that I want is buried in this tag; maybe this has something to do with it? It may be related to this post:
<frame name="wlmframe" src="../startwlm/Start_Wlm.htm?arg11=">
As mentioned in this post, you can only work with the frame that is currently in focus; you need to tell Selenium to switch frames in order to access child frames.
For example:
browser.switch_to.frame('wlmframe')
This will then load the nested content so you can access the children.
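A minimal sketch, assuming the frame name from the question; the field id and the value typed into it are hypothetical:

# switch into the child frame by its name attribute
browser.switch_to.frame('wlmframe')
# elements inside the frame are now reachable (id is hypothetical)
browser.find_element_by_id('username').send_keys('admin')
# switch back to the top-level document when done
browser.switch_to.default_content()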
Your issue is most likely due to either the element not loading on the page until after your bot searches for it, or a pop-up changing the XPath of the element.
Try this:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
delay = 3 # seconds
try:
    elementUsername = WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.XPATH, 'element-xpath')))
    elementUsername.send_keys('your username')
except TimeoutException:
    print("Loading took too much time!")
You can find out more about this here.
I have been trying to use Selenium to scrape entire web pages. I expect at least a handful of them are SPAs built with Angular, React, or Vue, which is why I am using Selenium.
I need to download the entire page (it is fine if some content isn't loaded because lazy loading requires scrolling down). I have tried setting a time.sleep() delay, but that has not worked. After I get the page I am looking to hash it and store it in a DB to compare later and check whether the content has changed. Currently the hash is different every time, and that is because Selenium is not downloading the entire page; each time a different partial amount is missing. I have confirmed this on several web pages, not just a single one.
I also have probably 1000+ web pages to go through, gathering all the links by hand, so I do not have time to find an element on each of them to make sure it is loaded.
How long this process takes is not important. If it takes 1+ hours so be it, speed is not important only accuracy.
If you have an alternative idea please also share.
My driver declaration
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
driverPath = '/usr/lib/chromium-browser/chromedriver'
def create_web_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    # set the window size
    options.add_argument('window-size=1200x600')
    # try to initialize the driver
    try:
        driver = webdriver.Chrome(executable_path=driverPath, chrome_options=options)
    except WebDriverException:
        print("failed to start driver at path: " + driverPath)
    return driver
My URL call (my timeout = 20):
driver.get(url)
time.sleep(timeout)
content = driver.page_source
content = content.encode('utf-8')
hashed_content = hashlib.sha512(content).hexdigest()
^ getting different hash here every time since same url not producing same web page
As the Application Under Test (AUT) is based on Angular, React, or Vue, Selenium seems to be the perfect choice.
Now, as you are fine with the fact that some content isn't loaded (since lazy loading needs scrolling), the use case is feasible. But your constraint that you ...do not have time to find an element on them to make sure it is loaded... can't really be compensated for by inducing time.sleep(), as time.sleep() has certain drawbacks. You can find a detailed discussion in How to sleep webdriver in python for milliseconds. It is also worth mentioning that the state of the HTML DOM will differ across all 1000-odd web pages.
Solution
A couple of viable solutions:
A potential solution is to induce WebDriverWait and ensure that some HTML elements are loaded, as per the discussion How can I make sure if some HTML elements are loaded for Selenium + Python?, validating at least either of the following (see the sketch after this list):
Page Title
Page Heading
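A minimal sketch of either check; the title substring and heading tag below are assumptions, not taken from your pages:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.get(url)
wait = WebDriverWait(driver, 20)
# wait for the <title> to contain a known substring (substring is hypothetical)
wait.until(EC.title_contains("Products"))
# or wait for the page heading to become visible
wait.until(EC.visibility_of_element_located((By.TAG_NAME, "h1")))
content = driver.page_source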
Another solution is to tweak the pageLoadStrategy capability. You can set pageLoadStrategy for all the 1000-odd web pages to a common tripping point, assigning one of these values:
normal (full page load)
eager (interactive)
none
You can find a detailed discussion in How to make Selenium not wait till full page load, which has a slow script?
If you implement pageLoadStrategy, the page_source method will be triggered at the same tripping point, and you would possibly see identical hashed_content values.
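A sketch of setting pageLoadStrategy through DesiredCapabilities, matching the Selenium 3 style used elsewhere in this discussion (note that support for eager has varied across ChromeDriver versions):

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

caps = DesiredCapabilities.CHROME.copy()
caps["pageLoadStrategy"] = "none"  # or "normal" / "eager"
driver = webdriver.Chrome(desired_capabilities=caps)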
In my experience time.sleep() does not work well with dynamic loading times.
If the page is javascript-heavy you have to use the WebDriverWait clause.
Something like this:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get(url)
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "[my-attribute='my-value']")))
Change 10 to whatever timeout you want, and change By.CSS_SELECTOR and its value to whatever type you want to use as a reference for a locator.
You can also wrap the WebDriverWait in a try/except statement with the TimeoutException exception, which you can import from the submodule selenium.common.exceptions, in case you want to set a hard limit.
You can probably set it inside a while loop if you truly want it to check forever until the page's loaded, because I couldn't find any reference in the docs about waiting "forever", but you'll have to experiment with it.
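For instance, a sketch of the hard-limit variant, reusing the placeholder selector from the snippet above:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "[my-attribute='my-value']")))
except TimeoutException:
    # the element never appeared within the hard limit; handle or re-raise here
    print("Page took too long to load")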
I'm using Python / Selenium to submit a form then I have the web driver waiting for the next page to load by using an expected condition using class id.
My problem is that there are two pages that can be displayed, but they do not share a unique element (that I can find) which is absent from the original page. One possible page has a unique class of mobile_txt_holder and the other has a class of notfoundcopy. I would like a wait that looks for mobile_txt_holder OR notfoundcopy to appear.
Is it possible to combine two expected conditions into one wait?
Basic idea of what I am looking for but obviously won't work:
WebDriverWait(driver, 30).until(EC.presence_of_element_located(
(By.CLASS_NAME, "mobile_txt_holder")))
or .until(EC.presence_of_element_located((By.CLASS_NAME, "notfoundcopy")))
I really just need to program to wait until the next page loads so that I can parse the source.
Sample HTML:
<p class="notfoundcopy">Unfortunately, the number you entered is not in our tracking system.</p>
Apart from clubbing up 2 expected_conditions through an or clause, we can easily construct a CSS selector to take care of the requirement. The following CSS selector will look for the element either in the mobile_txt_holder class or in the notfoundcopy class:
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".mobile_txt_holder, .notfoundcopy")))
You can find a detailed discussion in selenium two xpath tests in one
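If you do prefer clubbing up the two conditions rather than one combined selector, a lambda can serve as a sketch (Selenium 4 also ships EC.any_of for exactly this):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# wait until either element turns up; find_elements returns [] when absent
element_list = WebDriverWait(driver, 30).until(
    lambda d: d.find_elements(By.CLASS_NAME, "mobile_txt_holder")
    or d.find_elements(By.CLASS_NAME, "notfoundcopy"))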