I am using Selenium and Python to scrape data from a website.
The problem is that I need to log in manually because there is a CAPTCHA after the login.
My question is the following: is there a way to start the program on a page that is already loaded? (For example, here I would log in to the website, solve the CAPTCHA manually, and then launch the program that scrapes the data.)
Note: I have already looked for an answer on SO but did not find one; I might have missed it, as this seems like an obvious question.
Don't open the browser in headless mode; open it in headed (visible) mode so you can interact with the page.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.headless = False # Set false here
driver = webdriver.Chrome(options=options, executable_path=r'C:\path\to\chromedriver.exe')
driver.get("http://google.com/")
print("Chrome initialized in headed mode")
time.sleep(30)  # wait 30 seconds; this should give enough time to solve the CAPTCHA manually
# do other code here
driver.quit()
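A fixed 30-second sleep either wastes time or runs out too early. As a rough sketch (not part of the answer above), the script can instead poll until a condition of your choosing says the manual login is finished. The helper name `wait_for_manual_login` and the `is_logged_in` callback are hypothetical names introduced here for illustration.

```python
import time


def wait_for_manual_login(driver, is_logged_in, timeout=300.0, poll=1.0):
    """Block until is_logged_in(driver) returns True or the timeout expires.

    Returns True if the condition was met, False if we gave up.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_logged_in(driver):
            return True
        if time.monotonic() >= deadline:
            return False
        # Give the user time to type and solve the CAPTCHA before checking again.
        time.sleep(poll)
```

For example, if the site's login page has "login" in its URL (an assumption about your site), you could call `wait_for_manual_login(driver, lambda d: "login" not in d.current_url)` right after `driver.get(...)` and start scraping once it returns True. Selenium's own `WebDriverWait(driver, 300).until(...)` does similar polling, but raises `TimeoutException` instead of returning False.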
I'm making a price-scraping program and have run into anti-scraping systems. I managed to get around these with undetected_chromedriver (UC), but now I'm running into two issues.
The first is that UC is significantly slower than the standard Chrome driver. I still need it for some sites, so I scrape some sites with a normal driver and others with UC.
The second problem is that I install the standard Chrome driver at the beginning of the program, but once I do that, UC seems to reinstall itself every time I open it. This causes some sites to be scraped really slowly. Can you help me understand why that is? Any other tips for making the scraper run faster would be appreciated.
I have this run at the beginning of the program as global variables:
chrome_path = Service(ChromeDriverManager().install())
options = webdriver.ChromeOptions()
options.headless = True
options.add_experimental_option('excludeSwitches', ['enable-logging'])
and this runs as a function every time I need a UC:
def start_uc():
    options = webdriver.ChromeOptions()
    # just some options passing in to skip annoying popups
    options.add_argument('--no-first-run --no-service-autorun --password-store=basic')
    driver = uc.Chrome(options=options)
    driver.minimize_window()
    return driver
My scraping functions just loop over the URLs, scrape the info, and restart the driver to clear the cookies if I run into a CAPTCHA. They look like this (pseudocode to give you an idea):
driver = start_uc()
for url in url_list:
    while True:
        try:
            driver.get(url)
            # scrape info
            break
        except Exception:
            # restart the driver to clear cookies after a CAPTCHA
            driver.quit()
            driver = start_uc()
I don't see why chrome_path would affect UC. Are there any suggestions to make the scraping functions run more efficiently? I'm not an expert on drivers and their intricacies, so I could be doing something terribly wrong without recognizing it.
Thank you in advance!
You can use https://github.com/seleniumbase/SeleniumBase to speed things up.
(It has a special undetected-chromedriver mode that works with headless mode.)
pip install -U seleniumbase
Then run the following with Python:
from seleniumbase import Driver
from seleniumbase import page_actions
driver = Driver(headless=True, uc=True)
driver.get("https://nowsecure.nl")
page_actions.wait_for_text(driver, "OH YEAH, you passed!", "h1")
print(driver.find_element("css selector", "body").text)
screenshot_name = "now_secure_image.png"
driver.save_screenshot(screenshot_name)
print("\nScreenshot saved to: %s" % screenshot_name)
driver.quit()
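Independently of which driver you pick, the restart loop in the question retries forever if a site keeps failing, and every restart costs a full browser launch. A minimal sketch of a bounded version follows; `scrape_all`, `start_driver`, `scrape_page`, and `max_attempts` are hypothetical names, and the driver is only assumed to have a `quit()` method.

```python
def scrape_all(urls, start_driver, scrape_page, max_attempts=3):
    """Scrape every URL with one shared driver, restarting it only on failure.

    start_driver: callable returning a new driver (e.g. the question's start_uc)
    scrape_page:  callable (driver, url) -> result; raises on a CAPTCHA or error
    """
    results = {}
    driver = start_driver()
    try:
        for url in urls:
            for _attempt in range(max_attempts):
                try:
                    results[url] = scrape_page(driver, url)
                    break
                except Exception:
                    # Restart with a fresh session (and fresh cookies), then retry.
                    driver.quit()
                    driver = start_driver()
            else:
                # Every attempt failed; record the miss instead of looping forever.
                results[url] = None
    finally:
        driver.quit()
    return results
```

With the question's code this would be called as `scrape_all(url_list, start_uc, my_scrape_function)`, where `my_scrape_function(driver, url)` does the `driver.get(url)` and the parsing.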
I'm trying to use Selenium (3.141.0) with ChromeDriver (87.0.4280) to access a page. When accessed manually, it brings me to a policy page (different URL) where you have to hit 'Ok' before continuing to the site. Edit: this is on Windows 10, and the folder containing chromedriver is on PATH.
When using the following code, I'm able to get to the policy page with the ("--headless") option but without it I get a blank page with 'data:,' in the URL and nothing else loads. I've tried accessing straight from the policy page and the site URL but they both get stuck when the webdriver is created. Am I missing something? I'm open to any suggestions, thanks!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver_path = r'D:\....\chromedriver.exe'  # raw string so the backslashes are not treated as escapes
driver = webdriver.Chrome(executable_path=driver_path, options=chrome_options)
driver.get(...) # left out the url
This is the output page I get without using ("--headless")
Funny enough, I realized it was because my Chrome Developer tools had become disabled. Not sure how but when I re-enabled them, it worked perfectly again. Weird.
I have been trying to open multiple browser windows in Internet Explorer using WebDriver in Selenium. Once it reaches the get(url) line, it just halts there and eventually times out. I've added a print line, which does not execute. I've tried various methods, and the one below is the IE version of the code I used to open multiple tabs in Chrome. Even if I remove the first 3 lines, it still only gets as far as opening google.com. I've googled this issue and looked through other posts, but nothing has helped. I would really appreciate any advice, thanks!
options = webdriver.IeOptions()
options.add_additional_option("detach", True)
driver = webdriver.Ie(options = options, executable_path=r'blahblah\IEDriverServer.exe')
driver.get("http://google.com")
print("syrfgf")
driver.execute_script("window.open('about:blank', 'tab2');")
driver.switch_to.window("tab2")
driver.get("http://yahoo.com")
You need to replace the URL you have provided:
http://google.com
with a proper URL as follows:
https://www.google.com/
This form follows the standard URL syntax (scheme, host, path).
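To see how the two forms differ, the parts of a URL can be inspected with the standard library. This is only an illustrative sketch; `describe_url` and `has_scheme_and_host` are hypothetical helper names, not part of Selenium.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into scheme, host, and path."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.netloc, "path": parts.path}


def has_scheme_and_host(url):
    """Return True only if the URL carries both a scheme and a host."""
    parts = urlsplit(url)
    return bool(parts.scheme) and bool(parts.netloc)
```

`describe_url("https://www.google.com/")` reports the scheme `https`, the host `www.google.com`, and the path `/`, while a bare `google.com` fails `has_scheme_and_host` because it has no scheme at all.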
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome()
driver.get("https://github.com")
signin_link = driver.find_element(By.LINK_TEXT, "Sign in")
signin_link.click()
time.sleep(1)
user = driver.find_element(By.ID, "login_field")
user.send_keys("X")
passw = driver.find_element(By.ID, "password")
passw.send_keys("X")
passw.submit()
time.sleep(5)
driver.close()
I had this issue, and writing the code this way made it work flawlessly. Adjust the sleep times as needed. Putting chromedriver.exe into my project folder also helped with some errors.
I'm trying to automate a game I want to get ahead in, called Pokemon Vortex. Logging in with Selenium works just fine. However, when I then load a page that requires the user to be logged in, I am sent right back to the login page. (I have tried it outside of Selenium with the same browser, Chrome.)
This is what I have
import time
from selenium import webdriver
from random import randint
driver = webdriver.Chrome(r'C:\Program Files (x86)\SeleniumDrivers\chromedriver.exe')
driver.get('https://zeta.pokemon-vortex.com/dashboard/')
time.sleep(5) # Let the user actually see something!
usernameLoc = driver.find_element_by_id('myusername')
passwordLoc = driver.find_element_by_id('mypassword')
usernameLoc.send_keys('mypassword')
passwordLoc.send_keys('12345')
submitButton = driver.find_element_by_id('submit')
submitButton.submit()
time.sleep(3)
driver.get('https://zeta.pokemon-vortex.com/map/10')
time.sleep(10)
I'm using Python 3.6+ and I installed Selenium just today, so it's up to date. How do I force Selenium to hold on to cookies?
Using a pre-defined user profile might solve your problem. This way your cookies and cache are kept in the profile and are not deleted between runs.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("--user-data-dir=C:/Users/user_name/AppData/Local/Google/Chrome/User Data")
driver = webdriver.Chrome(options=options)
driver.get("https://xyz.com")  # the URL needs a scheme such as https://
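If you would rather not reuse a whole browser profile, another option is to save the cookies after logging in and add them back in a later run. The sketch below relies only on Selenium's `get_cookies()` and `add_cookie()`; the helper names `save_cookies` and `load_cookies` and the JSON file are assumptions made for illustration.

```python
import json


def save_cookies(driver, path):
    """Write the current session's cookies to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(driver.get_cookies(), fh)


def load_cookies(driver, path):
    """Add previously saved cookies to the driver and return how many were added.

    The driver must already be on a page of the cookies' domain,
    otherwise the browser rejects them.
    """
    with open(path, encoding="utf-8") as fh:
        cookies = json.load(fh)
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

A typical flow would be: log in once and call `save_cookies(driver, "cookies.json")`; in later runs, open any page on the same site, call `load_cookies(driver, "cookies.json")`, and then navigate to the page that requires the login.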
The browser opens, but the driver loses control of it. Selenium starts the browser but cannot initialize the driver, so I cannot call send_keys or do anything else.
The code runs against Ghost Browser, which is a Chromium-based browser.
What should I do so that Selenium gets control of the browser?
I've also tried to get the session_id in order to attach Selenium to the existing browser, but that didn't work either: the session_id is never available because Selenium exits.
Code:
exe_path = r'C:\Users\Anonymous\AppData\Local\GhostBrowser\Application\ghost.exe'
driver = webdriver.Chrome(executable_path=exe_path)
Are you sure you are using Chrome? Maybe change webdriver.Chrome to the driver class that matches your browser.
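Since Ghost Browser is Chromium-based, webdriver.Chrome may still work, but note that `executable_path` is meant to point at ChromeDriver, not at the browser executable. The browser itself is normally selected through the options' `binary_location`. Here is a hedged sketch in Selenium 4 style; the helper names are hypothetical, and it assumes you have a chromedriver build that matches the Chromium version inside Ghost Browser.

```python
def point_options_at_browser(options, browser_binary):
    """Tell ChromeDriver which Chromium-based executable to launch."""
    options.binary_location = browser_binary
    return options


def start_chromium_based_browser(browser_binary, chromedriver_path):
    """Start a Chromium-based browser through ChromeDriver (Selenium 4 style)."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = point_options_at_browser(Options(), browser_binary)
    # The Service path must be chromedriver.exe, not the browser executable.
    service = Service(executable_path=chromedriver_path)
    return webdriver.Chrome(service=service, options=options)
```

In the question's setup, `browser_binary` would be the path to ghost.exe and `chromedriver_path` the path to a separately downloaded chromedriver.exe.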