I'm making a price scraping program and have ran into the issue of antiscraping systems. I managed to get around these with the undetected_chromedriver but now I'm running into 2 issues
the first is that the UC is significantly slower than the standard chrome driver, through I need it for some sites, so I have some sites scraped with a normal driver and others with the UC
the second problem is that I have the standard Chrome driver install at the beginning of the program, but once I do that, the UC feels the need to install every time I open it?? this causes some sites to be scraped really slowly. can you help with why that is? and any other tips for running scraper faster would be appreciated.
I have this run at the beginning of the program as global variables:
chrome_path = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])
and this runs as a function every time I need a UC:
def start_uc():
options = webdriver.ChromeOptions()
# just some options passing in to skip annoying popups
options.add_argument('--no-first-run --no-service-autorun --password-store=basic')
driver = uc.Chrome(options=options)
driver.minimize_window()
return driver
My scraping functions just loop looking up the url and scrape the info, and restart the driver to clear the cookies if I run into a captcha .The scraping functions look like this (this is psuedo code to give you an idea):
driver = start_uc()
for url in url_list:
while true:
try:
driver.get(url)
#scrape info
break
except:
driver.close()
driver = start_uc()
I dont see why chrome_path would affect the UC? and are there any suggestions to make the scraping functions run more efficiently? Im not an expert on drivers and their intricacies so I could be doing something terribly wrong that I dont recognize.
thankyou in advance!
You can use https://github.com/seleniumbase/SeleniumBase to speed things up.
(It has a special undetected-chromedriver mode that works with headless mode.)
pip install -U seleniumbase
And then run the following with python:
from seleniumbase import Driver
from seleniumbase import page_actions
driver = Driver(headless=True, uc=True)
driver.get("https://nowsecure.nl")
page_actions.wait_for_text(driver, "OH YEAH, you passed!", "h1")
print(driver.find_element("css selector", "body").text)
screenshot_name = "now_secure_image.png"
driver.save_screenshot(screenshot_name)
print("\nScreenshot saved to: %s" % screenshot_name)
driver.quit()
Related
Since I have upgraded to Chrome 106 (and now to 107) the driver manages to open the URL at the first time but after some timeout (?) or other criteria, the driver.refresh() does not return (i.e. no resposne).
The way it looks to me, if the Refresh occurs within seconds since the initial open, then the refresh succeeds.
Yet, if longer interval has elapsed since the initial open, then the refresh() does not complete.
Code is in Python.
Before version 106 this seemed to work well
Does anyone experience similar behaviour? Any idea how to fix that?
#Code to create the driver:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-blink-features")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(r'... \chromedriver', options=chrome_options)`
#Then, the attempt to refresh:
driver.refresh()
Expecting the webpage to refresh
Can you do it on google colab? try this
import sys sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',options=chrome_options)
driver.get(url)
if your work does not depend much on it, just use earlier versions where it used to work.
if not, try the following:
driver.find_element_by_tag_name('body').send_keys(Keys.COMMAND + 'r')
I am using selenium and python in order to scrape data on a website.
The problem is I need to manually log in because there is a CAPTCHA after the login.
My question is the following : is there a way to start the program on a page that is already loaded ? (for example, here I would log to the website, solve the CAPTCHA manually, and then launch the program that would scrape the data)
Note: I have already been looking for an answer on SO but did not find it, might have missed it as it seems to be an obvious question.
don't open in headless mode. open in head mode.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.headless = False # Set false here
driver = webdriver.Chrome(options=options, executable_path=r'C:\path\to\chromedriver.exe')
driver.get("http://google.com/")
print ("Headless Chrome Initialized")
time.sleep(30) # wait 30 seconds, this should give enough time to manually do the capture
# do other code here
driver.quit()
Using python and selenium, I have a function for IE to run headless, but for some reason it's not working. It works perfect for Chrome, but not IE. I could've sworn it worked previously. Any ideas?
from selenium import webdriver
from selenium.webdriver.ie.options import Options as IEOptions
def openie():
setglobalvariables()
window_size = '1920,1080'
ie_options = IEOptions()
ie_options.add_argument('--headless')
ie_options.add_argument('--window-size=%s' % window_size)
ie_options.add_argument('--no-sandbox')
driver = webdriver.Ie(input_path + 'IEDriverServer.exe', options=ie_options)
url = settingsfile('url').strip()
statusmessage(url)
driver.get(url)
driver.maximize_window()
driver.implicitly_wait(3)
return driver
As far as I know, IE does not support Headless Browsing.
You can refer to this thread for verification and a workaround on how it can work:
The IE driver does not support execution without an active, logged-in
desktop session running. You'll need to take this up with the author
of the solution you're using to achieve "headless" (scare quotes
intentional) execution of IE.
https://github.com/SeleniumHQ/selenium/issues/4551#issuecomment-324319508
https://community.lambdatest.com/t/how-can-i-run-my-selenium-tests-in-headless-ie/5447
EDIT:
The second thread is of the LambdaTest community and answered by me.
I have been trying to open multiple browser windows in internet explorer using webdriver in selenium. Once it reaches the get(url) line, it just halts there and eventually times out. I've added a print line, which does not execute. I've tried various methods and the one below is the Ie version of code I used to open multiple tabs in Chrome. Even if I remove the first 3 lines, it still only goes up to opening google.com. I've looked googled this issue and looked through other posts but nothing has helped. Would really appreciate any advice, thanks!
options = webdriver.IeOptions()
options.add_additional_option("detach", True)
driver = webdriver.Ie(options = options, executable_path=r'blahblah\IEDriverServer.exe')
driver.get("http://google.com")
print("syrfgf")
driver.execute_script("window.open('about:blank', 'tab2');")
driver.switch_to.window("tab2")
driver.get("http://yahoo.com")
You need to replace the url you have provided:
http://google.com
with a proper url as follows:
https://www.google.com/
Which should be represented as per the syntax diagram as follows:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome()
driver.get("https://github.com")
signin_link = driver.find_element(By.LINK_TEXT, "Sign in")
signin_link.click()
time.sleep(1)
user = driver.find_element(By.ID, "login_field")
user.send_keys("X")
passw = driver.find_element(By.ID, "password")
passw.send_keys("X")
passw.submit()
time.sleep(5)
driver.close()
I had this issue and writing this code seems to have made it work flawlessly. Adjust the sleep time as you want it. Putting my chromedriver.exe into my project folder also helped with some errors
Before I get blasted in my comments on this question, I am well aware that this is a possible duplication of this link, but the answer provided is not helping me out with my instance of code, even after applying an adjusted version of the answer in my code. I have looked into many answers, including installing Chromedriver into my device, but to no avail.
My code is as follows:
from selenium import webdriver
import time
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
options.binary_location = "/usr/bin/chromium"
driver = webdriver.Chrome(executable_path = r'C:\Users\user\Downloads\chromedriver_win32')
driver.get('http://codepad.org')
text_area = driver.find_element_by_id('textarea')
text_area.send_keys("This text is send using Python code.")
Every time I run the code, including the executable_path = r'C:\Users\user\Downloads\chromedriver_win32'
I keep getting a permission error message when I run the code with the executable path.
My code without the path is the same minus the executable_path which I replace with driver = webdriver.Chrome(options), but I get the error message argument of type 'Options' is not iterable.
Any help with this problem is greatly appreciated. Admittedly I am a bit new to Python, and coding, in general, and I am trying new ideas to learn the program better, but everything that I try to find an answer to just breaks my code in general.
Try adding the executable file name at the end of executable_path argument:
executable_path = r'C:\Users\user\Downloads\chromedriver_win32\chromedriver.exe'
Code I used for testing:
from selenium import webdriver
import time
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument("--test-type")
options.binary_location = "/usr/bin/chromium"
driver = webdriver.Chrome(executable_path = r'C:\Users\user\Downloads\chromedriver_win32\chromedriver.exe')
driver.get('http://codepad.org')
text_area = driver.find_element_by_id('textarea')
text_area.send_keys("This text is send using Python code.")