How to run Selenium 24x7 until an error occurs - python

I am using Selenium to execute the script "window.performance.timing" to get the full load time of a page. It can run without opening a visible browser. I want this to keep running 24x7 and report the loading time.
Here is my code:
import time
import getpass
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException

u = getpass.getuser()
print(u)
# initialize Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('user-data-dir=C:\\Users\\%s\\AppData\\Local\\Google\\Chrome\\User Data' % u)
source = "https://na66.salesforce.com/5000y00001SgXm0?srPos=0&srKp=500"
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get(source)
driver.find_element_by_id("Login").click()
while True:
    try:
        navigationStart = driver.execute_script("return window.performance.timing.navigationStart")
        domComplete = driver.execute_script("return window.performance.timing.domComplete")
        loadEvent = driver.execute_script("return window.performance.timing.loadEventEnd")
        onloadPerformance = loadEvent - navigationStart
        print("%s--It took about: %s" % (time.ctime(), onloadPerformance))
        driver.refresh()
    except TimeoutException:
        print("It took too long")
        driver.quit()
I have two questions:
Is it a good idea to keep refreshing the page and printing the page loading time? Does it carry any risk?
Is there anything in my code that needs improvement?
Someone suggested using Docker and Jenkins when I searched for suggestions on Google, but that would require installing more things. This code will be packaged into an exe in the end for others to use, so it would be good if it did not require many software packages.
Thank you very much, as I am new to web development. Any suggestions will be appreciated.
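As an aside, a minimal sketch of one way to harden the loop (the 30-second values are assumptions): set a page-load timeout so a hung refresh raises TimeoutException instead of blocking forever, and wait for loadEventEnd to be non-zero before reading it, since it stays 0 until the load event has finished.
import time
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait

driver.set_page_load_timeout(30)  # assumption: 30 s is enough; a hung get()/refresh() then raises
while True:
    try:
        # loadEventEnd is 0 until the load event finishes, so wait for it first
        WebDriverWait(driver, 30).until(lambda d: d.execute_script(
            "return window.performance.timing.loadEventEnd") > 0)
        navigationStart = driver.execute_script("return window.performance.timing.navigationStart")
        loadEvent = driver.execute_script("return window.performance.timing.loadEventEnd")
        print("%s--It took about: %s ms" % (time.ctime(), loadEvent - navigationStart))
        driver.refresh()
    except TimeoutException:
        print("It took too long")
        driver.quit()
        break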


Selenium Python - Multiprocessing Only Controlling One Browser

I'm attempting to run a Selenium script locally and open three non-headless browsers. I'm using multiprocessing Pools (and have tried plain multiprocessing as well) and have come across an interesting issue: 3 browser sessions open, but only the first one actually navigates to the target_url and attempts control. The other two just sit there, waiting and doing nothing.
Here is the relevant execution code:
run_id = str(uuid.uuid4())
options = Options()
#options.binary_location = '/opt/headless-chromium' #works for lambda
start_time = time.time()
options.binary_location = '/usr/bin/google-chrome' #testing
#options.add_argument('--headless') don't need headless
options.add_argument('--no-sandbox')
options.add_argument('--verbose')
#options.add_argument('--single-process')
options.add_argument('--user-data-dir=/tmp/user-data')# test add
options.add_argument('--data-path=/tmp/data-path')
options.add_argument('--disk-cache-dir=/tmp/cache-dir')
options.add_argument('--homedir=/tmp')
#options.add_argument('--disable-gpu')# test add
#options.add_argument("--remote-debugging-port=9222") test remove
#options.add_argument('--disable-dev-shm-usage')
#'/opt/chromedriver' not found
logger.info("Before driver initiated")
# job_id = event['job_id']
# run_id = event['run_id']
send_log(job_id, run_id, "JOB START", True, "", time.time() - start_time)
retries = 0
drivers = []
while True:  # retry loop reconstructed: the break below implies one
    try:
        #driver = webdriver.Chrome('/opt/chromedriver_89', chrome_options=options)
        driver = webdriver.Chrome('/opt/chromedriver90', chrome_options=options)
        #driver2 = webdriver.Chrome('/opt/chromedriver90', chrome_options=options)
        break
    except Exception as e:
        print(str(e))
        logger.info('exception with driver instantiation, retrying... ' + str(e))
        #time.sleep(5)
        driver = None
        driver = webdriver.Chrome('/opt/chromedriver', chrome_options=options)
....
And here is how I'm invoking each process:
from multiprocessing import Pool

pool = Pool(processes=3)
for i in range(3):
    pool.apply_async(invoke, args=("https://macau-flash-sale.myshopify.com/",))
pool.close()
pool.join()
Is it possible that, despite the multiple processes, Selenium is not communicating with the other two browser instances properly?
Any ideas are greatly appreciated!
I don't think Selenium can handle multiple browsers at once. I recommend writing 3 different Python scripts and running them all at once through the terminal. If they need to communicate with each other, probably the easiest way is to have them write to a text file.
For anyone else struggling with this, Selenium CAN be used with async/multiprocessing.
But you cannot specify the same user/data directories for each session. I had to remove these parameters so that each Chrome session creates unique directories for itself.
Just remove the two lines below and it'll work:
options.add_argument('--user-data-dir=/tmp/user-data')# test add
options.add_argument('--data-path=/tmp/data-path')
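If the sessions do need a profile (cookies, login state), a minimal sketch of an alternative, assuming a throwaway directory per process is acceptable: let tempfile generate a unique path for each session instead of sharing one.
import tempfile
from selenium.webdriver.chrome.options import Options

def make_options():
    options = Options()
    # tempfile.mkdtemp() returns a fresh, unique directory on every call,
    # so concurrent Chrome sessions never fight over the same profile
    options.add_argument('--user-data-dir=' + tempfile.mkdtemp(prefix='chrome-profile-'))
    return options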

API not working properly after deploying on Heroku

Hello everyone. I recently made a Python script which downloads videos from YouTube using Selenium, so I tried to convert that script into an API and deployed it on Heroku. At first it worked perfectly (but only once or twice); now I get different errors every time I enter the URL I get from Heroku. When I check the logs, sometimes it says "memory quota exceeded", sometimes it says "no such element is found" (i.e. a Selenium error), and other times it runs, so I am not sure what is really causing this. Here is the code:
import flask
from flask import request
from selenium import webdriver
import os
import time

app = flask.Flask(__name__)

@app.route('/', methods=['GET'])
def home():
    url = request.args['url']
    mobile_emulation = {"deviceName": "iPhone 6/7/8"}
    op = webdriver.ChromeOptions()
    op.add_experimental_option("mobileEmulation", mobile_emulation)
    op.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
    op.add_argument("--headless")
    op.add_argument("--no-sandbox")
    op.add_argument("--disable-dev-shm-usage")
    driver = webdriver.Chrome(executable_path=os.environ.get("CHROMEDRIVER_PATH"), chrome_options=op)
    driver.get(url)
    time.sleep(3)
    driver.refresh()
    element = driver.find_element_by_xpath('/html/body/div[1]/div[1]/div/div[1]/video')
    video_source = element.get_attribute('src')
    return video_source

if __name__ == "__main__":
    app.run(debug=True)
You've got a few things that could be improved.
1/
This is a bad xpath:
'/html/body/div[1]/div[1]/div/div[1]/video'
You don't want to use a full path, because if anything in the DOM structure changes between html and the video tag, it will fail.
If you can share the DOM for the video element I can help you create a new one.
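As a sketch of the idea, assuming the page has only one video tag (which I can't verify without seeing the DOM), even a plain tag-name locator is far more robust than the full path:
# assumption: there is exactly one <video> element on the page
element = driver.find_element_by_tag_name('video')
video_source = element.get_attribute('src')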
2/
You do this
driver.get(url)
time.sleep(3)
driver.refresh()
element = driver.find_element_by_xpath('/html/body/div[1]/div[1]/div/div[1]/video')
You wait 3 seconds - but what if the page takes 4 seconds? (The answer is: it fails.) There is no dynamic synchronisation to confirm the object is ready and available.
Additionally, do you need to refresh? That's a browser refresh that reloads the page; it's essentially going to the page twice.
Have a look at the Selenium wait strategies here.
If you're getting NoSuchElement, you can try an implicit wait:
driver.implicitly_wait(10)
# add this only once - it polls for up to 10 seconds and moves on when ready
If you're getting ElementNotInteractable (or similar), you can try an explicit wait:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/div[1]/div/div[1]/video")))
# (but also update the xpath)
You will want to review that selenium docs link to find the appropriate expected condition.
3/
By default --headless creates a tiny browser window. If your page has dynamic scaling, it will affect your xpaths.
Set the window size to your expected resolution:
op.add_argument("window-size=1400,600")

Issues with selenium chromedriver headless browser when scraping data from a website

I am using the latest chromedriver, 2.45. I am currently building a program which scrapes stock data from a website. I have a list of around 3000 stocks to scrape, so I used multithreading to speed up my work. My program seems to work fine if I turn off the headless browser, but when I turn the headless browser on (with the aim of speeding up the script), sometimes a thread will get stuck when running the following line:
browser.get(url)
For each stock, prior to running the line above, the following script runs:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
chrome_prefs = {}
options.experimental_options["prefs"] = chrome_prefs
chrome_prefs["profile.default_content_settings"] = {"images": 2}
chrome_prefs["profile.managed_default_content_settings"] = {"images": 2}
options.add_argument('--headless')
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--disable-extensions")
options.add_argument("disable-infobars")
options.add_argument('--disable-useAutomationExtension')
options.Proxy = None
options.add_argument("--disable-dev-shm-usage")
options.add_argument('blink-settings=imagesEnabled=false')
browser = webdriver.Chrome(options=options)
browser.minimize_window()
The sad thing is that when it gets stuck on that line, it does not raise any exception. I believe the thread is trying to access the URL, but the site does not load, so it just keeps waiting and waiting. Could that be the case? How do I stop the problem? Or maybe a way out is to set a timer for browser.get(url), say 10 seconds: if it does not get any data, it refreshes the link and continues with the script?
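Selenium does expose exactly that kind of timer; a minimal sketch, assuming 10 seconds is an acceptable limit for your pages:
from selenium.common.exceptions import TimeoutException

browser.set_page_load_timeout(10)  # get()/refresh() now raise TimeoutException after 10 s
try:
    browser.get(url)
except TimeoutException:
    browser.get(url)  # one retry; loop or log as needed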
Are there also any ways or settings that can speed up the script? And is it possible to make the program run in the background when I execute the script? It keeps popping up (it minimizes itself a second later, but the chromedriver window is still in front).
Thank you for your time!

Python: how can I print all the source code using Selenium?

driver.page_source doesn't return all the source code. It prints only some parts of the code in detail, and a big part of the code is missing. How can I fix this?
This is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def htmlToLuna():
    url = 'https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A'
    driver = webdriver.Chrome('C:\\Python27\\chromedriver\\chromedriver.exe')
    driver.get(url)
    web = open('web.txt', 'w')
    web.write(driver.page_source)
    print driver.page_source
    web.close()

print htmlToLuna()
Here is a simple piece of code: all it does is open the URL, get the length of the page source, wait five seconds, then get the length of the page source again.
import time
from selenium import webdriver

if __name__ == "__main__":
    browser = webdriver.Chrome()
    browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
    initial = len(browser.page_source)
    print(initial)
    time.sleep(5)
    new_source = browser.page_source
    print(len(new_source))
See the output:
15722
48800
You see that the length of the page source increases after the wait? You must make sure that the page is fully loaded before getting the source. But this is not a proper implementation, since it blindly waits.
Here is a nicer way to do this: the browser will wait until an element of your choice is found. The timeout is set to 10 seconds.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

if __name__ == "__main__":
    browser = webdriver.Chrome()
    browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
    try:
        WebDriverWait(browser, 10).until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '.CodeMirror > div:nth-child(1) > textarea:nth-child(1)')))  # 10 second timeout
        print("Result:")
        print(len(browser.page_source))
    except TimeoutException:
        print("Your exception message here!")
The output: Result: 52195
Reference:
https://stackoverflow.com/a/26567563/7642415
http://selenium-python.readthedocs.io/locating-elements.html
Hold on! Even that won't guarantee you get the full page source, since individual elements are loaded dynamically. If the browser finds the element, it moves on. So make sure you pick the proper element to confirm that the page has been loaded fully.
P.S. Mine is Python 3, and the webdriver is in my environment PATH, so my code needs to be modified a bit to work on Python 2.x; I guess only the print statements need to change.

How can I get a proxy list to work with selenium properly? (python)

The funny thing is I'm not getting any errors running this code, but I do believe that the script isn't using a proxy when it reloads the page. Here's the script:
import time
import urllib.request
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chop = webdriver.ChromeOptions()
proxy_list = input("Name of proxy list file?: ")
proxy_file = open(proxy_list, 'r')
print('Enter url')
url = input()
driver = webdriver.Chrome(chrome_options=chop)
driver.get(url)

proxies = []
for line in proxy_file:
    proxies.append(line)
proxies = [w.replace('\n', '') for w in proxies]

while True:
    for i in range(len(proxies)):
        proxy = proxies[i]
        proxy2 = {"http": "http://%s" % proxy}
        proxy_support = urllib.request.ProxyHandler(proxy2)
        opener = urllib.request.build_opener(proxy_support)
        urllib.request.install_opener(opener)
        urllib.request.urlopen(url).read()
        time.sleep(5)
        driver.get(url)
        time.sleep(5)
Just wondering how I can use a proxy list with this script and have it work properly.
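For reference, a minimal sketch of pointing Chrome itself at each proxy (an assumption: the file holds one host:port entry per line); note that urllib's ProxyHandler only affects urllib requests, not the browser, and since ChromeOptions are fixed at launch, each proxy needs its own driver:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

for proxy in proxies:
    options = Options()
    options.add_argument('--proxy-server=http://%s' % proxy)  # route this session through the proxy
    driver = webdriver.Chrome(chrome_options=options)
    driver.get(url)
    # ... do work through this proxy ...
    driver.quit()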
As far as I know, it's almost impossible to make Selenium work with a proxy on Firefox and Chrome. I tried for a day and got nothing. I didn't try Opera, but it will probably be the same.
I also saw a freelancer request like: "I have a problem with proxies on Selenium + Chrome. I'm a developer and lost 2 days trying to make it work, with nothing as a result. If you don't know for sure how to make it work well, please don't disturb me -- it's harder than just copy-pasting from the internet."
But it's easy to do with PhantomJS -- this feature works well there.
So try to do the needed actions via PhantomJS.
