I am using the latest ChromeDriver 2.45. I am currently building a program that scrapes stock data from a website. I have a list of around 3,000 stocks to scrape, so I used multithreading to speed up my work. My program seems to work fine when the browser is not headless, but when I turn headless mode on (with the aim of speeding up the script), sometimes a thread gets stuck when running the following line:
browser.get(url)
For each stock, before running the line above, the following setup code runs:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
chrome_prefs = {}
options.experimental_options["prefs"] = chrome_prefs
chrome_prefs["profile.default_content_settings"] = {"images": 2}
chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--disable-extensions')
options.add_argument('--disable-infobars')
# useAutomationExtension is an experimental option, not a command-line flag
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--blink-settings=imagesEnabled=false')
browser = webdriver.Chrome(options=options)
browser.minimize_window()
The frustrating thing is that when it gets stuck on that line, it does not raise any exception. I believe the thread is trying to access the URL but the site never finishes loading, so it just keeps waiting and waiting. Could that be the case? How do I stop this? Perhaps a way out is to set a timer on browser.get(url), say 10 seconds; if it gets no data by then, it refreshes the link and the script continues.
Is there any way or setting to speed up the script further? And is it possible to run the program in the background? When I execute the script the window keeps popping up (it minimizes itself a second later, but the ChromeDriver window is still in front).
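The timer idea at the end of the question maps directly onto Selenium's built-in set_page_load_timeout(), which makes browser.get() raise TimeoutException instead of hanging forever. A minimal sketch follows; the exception type is injectable only so the helper can be shown without a live browser, and in real use you would pass selenium.common.exceptions.TimeoutException:

```python
def get_with_retry(driver, url, timeout=10, retries=2,
                   timeout_exceptions=(Exception,)):
    # Ask the driver to abort page loads after `timeout` seconds, then
    # retry the navigation up to `retries` extra times before giving up.
    # Real use: timeout_exceptions=(selenium.common.exceptions.TimeoutException,)
    driver.set_page_load_timeout(timeout)
    for _ in range(retries + 1):
        try:
            driver.get(url)
            return True
        except timeout_exceptions:
            continue
    return False
```

With this in place, a stuck page surfaces as a False return value after a bounded wait instead of a frozen thread, and the calling thread can log the symbol and move on.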
Thank you for your time!
Following many articles, one can log XHR calls in an automated browser (using Selenium) as below:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

capabilities = DesiredCapabilities.CHROME
capabilities["loggingPrefs"] = {"performance": "ALL"}  # on newer Chrome: "goog:loggingPrefs"
driver = webdriver.Chrome(
    desired_capabilities=capabilities,
    executable_path="./chromedriver",
)
...
logs_raw = driver.get_log("performance")
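For reference, each entry returned by get_log("performance") carries the DevTools event as a JSON string, so filtering for network traffic looks roughly like this (a sketch; it still only sees main-thread events, which is exactly the limitation this question is about):

```python
import json

def network_events(logs_raw):
    # Each performance-log entry is {"message": "<json>", ...}; the JSON
    # wraps the actual DevTools event under another "message" key.
    events = []
    for entry in logs_raw:
        event = json.loads(entry["message"])["message"]
        if event.get("method", "").startswith("Network."):
            events.append(event)
    return events
```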
My problem is that the target request is performed by a WebWorker, so it is not listed in the performance log of the browser's main thread.
Opening Chrome manually, selecting the WebWorker's scope in the dev console, and running performance.getEntries() gets me the request I want.
My question is: how can someone perform such an action in Selenium (Python preferably)?
Nowhere in the Python Selenium docs or the DevTools API have I found anything similar.
I'm grateful in advance.
Edit: after some digging I found that it has something to do with JavaScript execution contexts; I have no clue how to switch contexts in Selenium.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

lambda_options = [
    '--autoplay-policy=user-gesture-required',
    '--disable-background-networking',
    '--disable-background-timer-throttling',
    '--disable-backgrounding-occluded-windows',
    '--disable-breakpad',
    '--disable-client-side-phishing-detection',
    '--disable-default-apps',
    '--disable-dev-shm-usage',
    '--disable-extensions',
    '--disable-features=AudioServiceOutOfProcess',
    '--disable-hang-monitor',
    '--disable-notifications',
    '--disable-offer-store-unmasked-wallet-cards',
    '--disable-print-preview',
    '--disable-prompt-on-repost',
    '--disable-speech-api',
    '--disable-sync',
    '--ignore-gpu-blacklist',
    '--ignore-certificate-errors',
    '--mute-audio',
    '--no-default-browser-check',
    '--no-first-run',
    '--no-pings',
    '--no-sandbox',
    '--no-zygote',
    '--password-store=basic',
    '--use-gl=swiftshader',
    '--use-mock-keychain',
    '--single-process',
    '--headless',
]

options = Options()
for argument in lambda_options:
    options.add_argument(argument)
process_ids = list(range(10))
drivers = {i: webdriver.Chrome(executable_path='path', options=options) for i in process_ids}
So these are my Chrome options and how I set them up to have 10 instances running in a single Lambda invocation. When I run it non-headless on my PC, the crawler misses very few sites due to errors with page loading or Selenium not responding, but in Lambda I am missing a ton of data. What can I do to rectify this?
I am scraping mostly with Python Selenium (and some pages with BeautifulSoup), and the sites I visit require some actions to be performed before I can grab the data I want.
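With ten drivers per invocation, one common source of silent data loss is sharing a driver across threads. A sketch of partitioning the URL list so each driver is only ever touched by the single thread that owns it; fetch(driver, url) is a hypothetical stand-in for the per-site scraping routine:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(drivers, urls, fetch):
    # drivers: dict {i: webdriver} as built above.
    # Round-robin URLs into one bucket per driver, then run one worker
    # thread per driver; no driver is ever used from two threads.
    ids = list(drivers)
    buckets = {i: [] for i in ids}
    for n, url in enumerate(urls):
        buckets[ids[n % len(ids)]].append(url)

    def worker(i):
        return [fetch(drivers[i], u) for u in buckets[i]]

    with ThreadPoolExecutor(max_workers=len(ids)) as pool:
        return [r for batch in pool.map(worker, ids) for r in batch]
```

This does not fix Lambda-specific flakiness by itself, but it rules out cross-thread driver access as the cause.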
Currently, I am using Selenium to execute the script window.performance.timing to get the full load time of a page. It can run without opening a visible browser. I want this to keep running 24x7 and report the loading time.
Here is my code:
import time
import getpass

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException

u = getpass.getuser()
print(u)

# initialize Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('user-data-dir=C:\\Users\\%s\\AppData\\Local\\Google\\Chrome\\User Data' % u)

source = "https://na66.salesforce.com/5000y00001SgXm0?srPos=0&srKp=500"
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(source)
driver.find_element_by_id("Login").click()

while True:
    try:
        navigationStart = driver.execute_script("return window.performance.timing.navigationStart")
        domComplete = driver.execute_script("return window.performance.timing.domComplete")
        loadEvent = driver.execute_script("return window.performance.timing.loadEventEnd")
        onloadPerformance = loadEvent - navigationStart
        print("%s--It took about: %s" % (time.ctime(), onloadPerformance))
        driver.refresh()
    except TimeoutException:
        print("It took too long")
        driver.quit()
        break
I have two questions:
Is it a good idea to keep refreshing the page and printing the page load time? Does it have any risks?
Does anything in my code need improvement?
Someone suggested using Docker and Jenkins when I searched for suggestions on Google, but that would require downloading more things. This code will be packaged into an exe in the end for others to use, so it would be good if it did not require many software packages.
Thank you very much, as I am new to the web side of things. Any suggestions will be appreciated.
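One concrete pitfall in the loop above: window.performance.timing.loadEventEnd stays 0 until the load event has actually fired, so reading it immediately after refresh() can produce a huge negative result. A small guard over the timing values (shown here as a plain dict, as returned by a driver.execute_script call that collects the fields) avoids that:

```python
def page_load_ms(timing):
    # timing: window.performance.timing fields as a dict, e.g. from
    # driver.execute_script("var t = window.performance.timing; "
    #     "return {navigationStart: t.navigationStart, loadEventEnd: t.loadEventEnd};")
    load_end = timing.get("loadEventEnd", 0)
    if load_end == 0:
        return None  # load event has not fired yet; poll again later
    return load_end - timing["navigationStart"]
```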
from selenium import webdriver
import random
url = "https://www.youtube.com/"
list_of_drivers = [webdriver.Firefox(), webdriver.Chrome(), webdriver.Edge()]
Driver = random.choice(list_of_drivers)
Driver.get(url)
I'm trying to cycle through a list of random webdrivers using Selenium.
It does a good job of picking a random webdriver and opening the URL; however, it also opens the other webdrivers with a blank page.
How do I stop this from happening?
I am running Python 2.7 in a virtualenv.
list_of_drivers = [webdriver.Firefox(), webdriver.Chrome(), webdriver.Edge()]
You created three instances already on this line; that's why all three browsers show up with a blank page at the very beginning.
Driver = random.choice(list_of_drivers)
Driver.get(url)
And then you randomly choose one to open a webpage, leaving the rest doing nothing.
Instead of creating three instances, just create one:
list_of_drivers = ['Firefox', 'Chrome', 'Edge']
Driver = getattr(webdriver, random.choice(list_of_drivers))()
Driver.get(url)
I need to open multiple links in separate tabs or sessions.
I already know how to do that; what I would like to know is whether it's possible to connect to an already-open webpage instead of opening every link each time I run the script.
What I use now in Python is:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(link)  # repeated for each link
The idea is that once I run the first script (to load multiple links), the second should connect to those webpages, refresh them, and continue with the code.
Is it possible? Does anyone know how to do it?
Thanks a lot for the help!
Connecting to the previously opened window is easy:
driver = webdriver.Firefox()
url = driver.command_executor._url
session_id = driver.session_id
driver2 = webdriver.Remote(command_executor=url, desired_capabilities={})
driver2.session_id = session_id
#You're all set to do whatever with the previously opened browser
driver2.get("http://www.stackoverflow.com")