I am trying to extract all the product data from this page:
https://www.shufersal.co.il/online/he/קטגוריות/סופרמרקט/חטיפים%2C-מתוקים-ודגני-בוקר/c/A25
I tried
shufersal = "https://www.shufersal.co.il/online/he/%D7%A7%D7%98%D7%92%D7%95%D7%A8%D7%99%D7%95%D7%AA/%D7%A1%D7%95%D7%A4%D7%A8%D7%9E%D7%A8%D7%A7%D7%98/%D7%97%D7%98%D7%99%D7%A4%D7%99%D7%9D%2C-%D7%9E%D7%AA%D7%95%D7%A7%D7%99%D7%9D-%D7%95%D7%93%D7%92%D7%A0%D7%99-%D7%91%D7%95%D7%A7%D7%A8/c/A25"
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import time

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(shufersal)
products = driver.find_elements_by_css_selector("li.miglog-prod.miglog-sellingmethod-by_unit")
the problem is that the product details are shown only when I click the product:
Is there any option to click all the links automatically and scrape the windows that open?
What you want can be achieved, but it is a significantly more time-consuming process.
You'll have to:
identify the elements that you need to click on the page (they'll probably have the same class) and then select them all:
buttons_to_click = driver.find_elements_by_css_selector({RELEVANT SELECTOR HERE})
Then you should loop through all the clickable elements, click on each one, wait for the popup to load, scrape the data, and close the popup:
scraped_list = []
for button_instance in buttons_to_click:
    button_instance.click()
    # scrape the information you need here and append it to scraped_list
    # find and click the close-popup button
    driver.find_element_by_xpath({XPATH TO ELEMENT}).click()
For this to work properly it is important to set up Selenium's implicit wait parameter. What that does is: if Selenium doesn't find the required element, it waits up to X seconds for it to load; if X seconds pass, it throws an error (you can handle the error in your code if it is expected).
In your case you'll need the wait time because after you click on a product to display the popup, the information might take a few seconds to load; if you don't set an implicit wait, your script will exit with an element-not-found error. More information on Selenium's waits can be found here: https://selenium-python.readthedocs.io/waits.html
# put this line immediately after creating the driver object
driver.implicitly_wait(10) # seconds
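Putting the pieces together, here is a minimal sketch of the whole flow, reusing the product selector from your question; the popup selectors (.miglog-prod-details and .popup-close) are hypothetical placeholders that you will need to replace with the real ones from the page:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.implicitly_wait(10)  # wait up to 10 s for elements to appear
driver.get(shufersal)

scraped_list = []
buttons_to_click = driver.find_elements_by_css_selector(
    "li.miglog-prod.miglog-sellingmethod-by_unit")
for button_instance in buttons_to_click:
    button_instance.click()
    # hypothetical selector for the popup body; replace with the real one
    details = driver.find_element_by_css_selector(".miglog-prod-details")
    scraped_list.append(details.text)
    # hypothetical selector for the popup close button; replace with the real one
    driver.find_element_by_css_selector(".popup-close").click()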
Suggestion:
I suggest you always use XPath when looking up elements: its syntax can emulate all the other Selenium selectors, it is faster, and it will make for an easier transition to C-compiled HTML parsers, which you'll need in case you scale your scraper (I recommend lxml, a Python package that uses a compiled parser).
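To illustrate, here is a rough sketch of reusing one XPath both in Selenium and in lxml; the expression is an assumed translation of the CSS selector from the question:

from lxml import html

# same elements located via XPath in Selenium
products = driver.find_elements_by_xpath(
    "//li[contains(@class, 'miglog-prod') and "
    "contains(@class, 'miglog-sellingmethod-by_unit')]")

# ...and via lxml's compiled parser on the raw page source
tree = html.fromstring(driver.page_source)
product_nodes = tree.xpath(
    "//li[contains(@class, 'miglog-prod') and "
    "contains(@class, 'miglog-sellingmethod-by_unit')]")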
I am trying to get data from a webpage (https://www.educacion.gob.es/teseo/irGestionarConsulta.do). It has a captcha on the entry page, which I solve manually before moving on to the results page.
It happens that going back from the results page to the entry page does not change the captcha if I use the "go back" button of the browser; but if I use the driver.back() instruction of Selenium's WebDriver, the captcha is sometimes changed, which I'd rather avoid.
Put plainly: I want Selenium to access the DOM window (the browser) rather than the document (or any element within the HTML), and send the ALT+ARROW_LEFT keys to the browser (the window).
This, apparently, cannot be done with:
from selenium.webdriver import Firefox
from selenium.webdriver.common.keys import Keys
driver = Firefox()
driver.get(url)
xpath = ???
driver.find_element_by_xpath(xpath).send_keys(Keys.ALT, Keys.ARROW_LEFT)
because send_keys targets the focused element, and my target is the DOM window, not any particular element of the document.
I have also tried with ActionChains:
from selenium.webdriver.common.action_chains import ActionChains
action = ActionChains(driver)
action.key_down(Keys.LEFT_ALT).key_down(Keys.ARROW_LEFT).send_keys().key_up(Keys.LEFT_ALT).key_up(Keys.ARROW_LEFT)
action.perform()
This also does not work (I have tried several combinations). The documentation states that key_down/key_up require an element to send the keys to; if it is None (the default), the currently focused element is used. So again there is the issue of how to focus on the window (the browser).
I have thought about using the mouse controls, but I assume it will be the same: I know how to make the mouse reach any element in the document, but I want to reach the window/browser.
I have thought of identifying the target element through the driver.current_window_handle, but also fruitlessly.
Can this be done from Selenium? If so, how? Can other libraries do it, perhaps pyppeteer or playwright?
Try with the JavaScript executor: driver.execute_script("window.history.go(-1)")
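For context, a minimal sketch of how that call fits the scenario described in the question; nothing here is specific to Firefox:

from selenium.webdriver import Firefox

driver = Firefox()
driver.get("https://www.educacion.gob.es/teseo/irGestionarConsulta.do")
# ... solve the captcha manually and reach the results page ...

# step one page back in the browser history via JavaScript,
# without sending keystrokes to any document element
driver.execute_script("window.history.go(-1)")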
Simple question: is it possible to resume Selenium code the moment the browser lands on a certain URL?
import sys
from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(driver, sys.maxsize - 1).until(lambda s: s.current_url == "https://www.myurl.com/")
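Equivalently, Selenium ships an expected condition, url_to_be, that performs the same comparison; a small sketch with an arbitrary one-hour timeout instead of sys.maxsize:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# resume execution as soon as the browser lands on the target URL
WebDriverWait(driver, 3600).until(EC.url_to_be("https://www.myurl.com/"))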
When you invoke the get() method, Selenium executes the next line only once the browser attains a document.readyState equal to complete.
So you don't have to take any additional steps for Selenium to resume code execution the moment the browser lands on a certain URL.
However, in the rarest of cases you may have to explicitly wait for document.readyState to equal complete, using:
driver.execute_script("return document.readyState") == "complete"
References
You can find a couple of relevant detailed discussions in:
Is there a way with python-selenium to wait until all elements of a page has loaded?
What is the correct syntax checking the .readyState of a website in Selenium Python?
I have been trying to find a button and click on it, but no matter what I try, Selenium has been unable to locate it. I have tried all the driver.find_element_by... methods but nothing seems to be working.
from selenium import webdriver
import time
driver = webdriver.Chrome(executable_path="/Users/shreygupta/Documents/ComputerScience/PythonLanguage/Automation/corona/chromedriver")
driver.get("https://ourworldindata.org/coronavirus")
driver.maximize_window()
time.sleep(5)
driver.find_element_by_css_selector("a[data-track-note='chart-click-data']").click()
I am trying to click the DATA tab shown in the screenshot below.
You can modify your script to open this graph directly:
driver.get("https://ourworldindata.org/grapher/total-cases-covid-19")
driver.maximize_window()
Then you can add implicitly_wait instead of sleep. An implicit wait tells WebDriver to poll the DOM for a certain amount of time when trying to find any element (or elements) not immediately available (from the Python documentation). It will work much faster because it interacts with the element as soon as it finds it.
driver.implicitly_wait(5)
driver.find_element_by_css_selector("a[data-track-note='chart-click-data']").click()
Hope this helps, good luck.
Here is logic you can use, where the script will wait a maximum of 30 seconds for the Data menu item; if the element is present within 30 seconds, it will click on it.
url = "https://ourworldindata.org/grapher/covid-confirmed-cases-since-100th-case"
driver.get(url)
driver.maximize_window()
wait = WebDriverWait(driver,30)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR,"a[data-track-note='chart-click-data']"))).click()
I have been trying to use Selenium to scrape entire web pages. I expect at least a handful of them are SPAs built with Angular, React, or Vue, which is why I am using Selenium.
I need to download the entire page (if some content isn't loaded because lazy loading requires scrolling down, that is fine). I have tried setting a time.sleep() delay, but that has not worked. After I get the page I hash it and store it in a db, to compare later and check whether the content has changed. Currently the hash is different every time, because Selenium is not downloading the entire page; each time a different partial amount is missing. I have confirmed this on several web pages, not just a single one.
I also have probably 1000+ web pages to go through, and just getting all the links by hand is work enough, so I do not have time to find an element on each of them to make sure it is loaded.
How long this process takes is not important. If it takes 1+ hours, so be it; speed is not important, only accuracy.
If you have an alternative idea please also share.
My driver declaration
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
driverPath = '/usr/lib/chromium-browser/chromedriver'
def create_web_driver():
    options = webdriver.ChromeOptions()
    options.add_argument('headless')
    # set the window size
    options.add_argument('window-size=1200x600')
    # try to initialize the driver
    driver = None
    try:
        driver = webdriver.Chrome(executable_path=driverPath, chrome_options=options)
    except WebDriverException:
        print("failed to start driver at path: " + driverPath)
    return driver
My URL call, with my timeout = 20:
driver.get(url)
time.sleep(timeout)
content = driver.page_source
content = content.encode('utf-8')
hashed_content = hashlib.sha512(content).hexdigest()
^ I am getting a different hash here every time, since the same URL is not producing the same web page
As the Application Under Test (AUT) is based on Angular, React, or Vue, Selenium seems to be the perfect choice.
Now, the fact that you are fine with some content not being loaded (lazy loading without scrolling) makes the use case feasible. But the constraint ...do not have time to find an element on them to make sure it is loaded... can't really be compensated for by inducing time.sleep(), as time.sleep() has certain drawbacks. You can find a detailed discussion in How to sleep webdriver in python for milliseconds. It is also worth mentioning that the state of the HTML DOM will be different for each of the 1000-odd web pages.
Solution
A couple of viable solutions:
A potential solution is to induce WebDriverWait and ensure that some HTML elements are loaded, as per the discussion How can I make sure if some HTML elements are loaded for Selenium + Python?, validating at least one of the following:
Page Title
Page Heading
Another solution would be to tweak the capability pageLoadStrategy. You can set the pageLoadStrategy for all the 1000-odd web pages to a common point, assigning a value of either:
normal (full page load)
eager (interactive)
none
You can find a detailed discussion in How to make Selenium not wait till full page load, which has a slow script?
If you implement pageLoadStrategy, the page_source method will be triggered at the same tripping point for every page, and you would possibly see identical hashed_content.
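As a rough sketch, assuming the Selenium 3 era API used elsewhere in this thread, the capability can be set through DesiredCapabilities; "eager" here is just one of the three values listed above:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

caps = DesiredCapabilities().CHROME.copy()
# return from get() as soon as the DOM is interactive,
# instead of waiting for the full page load
caps["pageLoadStrategy"] = "eager"

driver = webdriver.Chrome(desired_capabilities=caps)
driver.get("https://example.com")
content = driver.page_source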
In my experience time.sleep() does not work well with dynamic loading times.
If the page is JavaScript-heavy, you have to use a WebDriverWait clause.
Something like this:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.get(url)
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "[my-attribute='my-value']")))
Change 10 to whatever timeout you want, and By.CSS_SELECTOR and its value to whatever type you want to use as a reference for the locator.
You can also wrap the WebDriverWait in a try/except statement with the TimeoutException exception, which you can import from the submodule selenium.common.exceptions, in case you want to set a hard limit.
You could probably put it inside a while loop if you truly want it to keep checking until the page is loaded; I couldn't find any reference in the docs about waiting "forever", so you'll have to experiment with it.
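A minimal sketch of that retry idea, reusing the placeholder selector from above with an arbitrary 10-second timeout per attempt:

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

while True:
    try:
        # wait in 10-second slices until the element finally shows up
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "[my-attribute='my-value']")))
        break
    except TimeoutException:
        # not loaded yet; loop around and keep waiting
        pass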
I'm trying to automate our system with Python 2.7, Selenium WebDriver, and Sikuli. I have a problem at login. Every time I open our system, the first page is an empty page that jumps to another page automatically; the new page is the main login page, so Python always tries to find the element on the first page. The first page sometimes shows:
your session has timeout
I set a really large number for the session timeout, but it doesn't work.
Here is my code:
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Chrome()
driver.get('http://172.16.1.186:8080/C******E/servlet/LOGON')
# time.sleep(15)
bankid = driver.find_element_by_id("idBANK")
bankid.send_keys("01")
empid = driver.find_element_by_id("idEMPLOYEE")
empid.send_keys("200010")
pwdid = driver.find_element_by_id("idPASSWORD")
pwdid.send_keys("C******e1")
elem = driver.find_element_by_id("maint")
elem.send_keys(Keys.RETURN)
First of all, I can't see any Sikuli usage in your example. If you were using Sikuli, it wouldn't matter how the other page was launched as you'd be interacting with whatever is visible on your screen at that time.
In Selenium, if you have multiple windows, you have to switch your driver to the correct one. A quick way to loop through the available windows is something like this:
for handle in driver.window_handles:
    driver.switch_to_window(handle)
    print "Switched to handle:", handle
    element = driver.find_element_by_tag_name("title")
    print element.get_attribute("value")