Run selenium webscraper in parallel across multiple instances of the same browser?


I built a scraper using Selenium that only works when it starts a new Chrome window and makes a referred request. It does not work in headless mode: I have to actually see a new Chrome window open, navigate to the site, and close every time. It works fine but is a bit slow. Is there a way to run the scraper multiple times in parallel, perhaps with multiple remote OS instances each opening Chrome? Is there software that helps with that?
import random
import time
# request_interceptor is a selenium-wire feature, so selenium-wire is assumed here
from seleniumwire import webdriver
CHROMEDRIVER = r"D:\chromedriver\94\chromedriver.exe"  # raw string keeps the backslashes literal
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--proxy-server={"........"}')
options.add_argument('--window-size=500,250')  # Chrome expects width,height
options.add_experimental_option('useAutomationExtension', False)
def interceptor(request):
    # Set the Referer header on every outgoing request
    request.headers['Referer'] = 'https://www.*****'
    # request.headers... (further headers elided in the original post)
listurl = ["www...", "www..."]
for url in listurl:
    # A new Chrome window is started for every URL
    driver = webdriver.Chrome(CHROMEDRIVER, options=options)
    driver.request_interceptor = interceptor
    try:
        driver.get(url)
        # save json info into a csv, ...
        time.sleep(2 * random.random())
    finally:
        driver.quit()  # closes the window and ends the chromedriver process
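The post contains no code for the parallel part, so the following is only a sketch of one possible approach: give each URL its own visible Chrome window and run several of them at once from a thread pool. The names `scrape_one` and `run_in_parallel` and the value `max_workers=4` are hypothetical, not taken from the post, and the driver setup assumes selenium-wire is installed and chromedriver can be located automatically.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_one(url):
    """Open a fresh Chrome window for one URL and return the page source.

    Hypothetical helper: assumes selenium-wire is installed and that
    chromedriver can be found; adapt the setup to match your own script.
    """
    from seleniumwire import webdriver  # imported lazily, only when scraping

    options = webdriver.ChromeOptions()
    options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the window, even on errors


def run_in_parallel(urls, worker=scrape_one, max_workers=4):
    """Run worker(url) for every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    raised while processing it, so one failure does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(worker, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

Calling `run_in_parallel(listurl)` would then open up to four windows at a time. Because every worker creates and quits its own driver, no driver object is shared between threads. For spreading the work over several machines, Selenium Grid is one commonly used option, though it is not shown here.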

Related

How To Block Chrome Window Popup in Selenium Python

I have a Selenium-Python script for running some automation tests on a website. The script repeatedly opens a new tab, does some work in it, and closes it.
The issue I'm facing is that whenever a new tab is opened, my Chrome window pops up from the minimized state to maximized. I want it to do all of its work in the background without interrupting me.
PS: The headless version is not applicable to my scenario.
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--start-maximized")
driver = webdriver.Chrome(options=chrome_options, executable_path="chromedriver.exe")
driver.get("https://xyx.org/#/login")  # Log in manually to a website.
while 1:
    # some stuff here
    main_window = driver.current_window_handle
    driver.execute_script("window.open();")
    driver.switch_to.window(driver.window_handles[1])
    driver.get("some link here ")
    # doing some work here
    driver.close()
    driver.switch_to.window(main_window)
If I minimize the Chrome window manually, it is automatically maximized again whenever driver.execute_script("window.open();") runs. I want it to stay minimized and do the work.

My solution was to use two drivers instead of one: driver one logs in, and driver two does the work in headless mode.
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
options = Options()
options.headless = True
driver_one = webdriver.Chrome(executable_path=r'/chromedriver')
driver_one.maximize_window()
driver_one.get("https://xyx.org/#/login")  # Log in manually to a website.
while 1:
    # some stuff here
    driver_two = webdriver.Chrome(executable_path=r'/chromedriver',
                                  options=options)
    link = "https://www.google.com"
    driver_two.get("some link here ")
    # doing some work here
    driver_two.quit()  # quit() also ends the chromedriver process started for this iteration

Interacting with browser that has been garbage collected in try block

I have a Selenium browser to which I've added options so that it uses my Google Chrome profile when it opens.
I know there will be an error when creating the Selenium browser if Chrome is already open elsewhere with the same profile.
But despite the error, the browser still opens.
What I want is to still be able to interact with this browser, since it opens with the profile I wanted (and for various reasons I don't want to close my other Chrome instances).
I had to throw in a try and except so the program doesn't stop, but I think the browser gets garbage collected in the try block.
So is there a way to stop it from being garbage collected, or can I find all browsers opened by webdriver and then assign one of them to a new browser variable?
Here's the code:
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
try:
    chrome_options.add_argument("user-data-dir=C:\\Users\\coderoftheday\\AppData\\Local\\Google\\Chrome\\User Data\\")
    browser = webdriver.Chrome("chromedriver.exe", options=chrome_options)
except:
    pass
browser.get('https://www.google.co.uk/')
Error:
NameError: name 'browser' is not defined

Isn't this just a python variable scope issue?
See https://stackoverflow.com/a/25666911/1387701
Simplest solution:
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
browser = None
try:
    chrome_options.add_argument("user-data-dir=C:\\Users\\coderoftheday\\AppData\\Local\\Google\\Chrome\\User Data\\")
    browser = webdriver.Chrome("chromedriver.exe", options=chrome_options)
except:
    # code here to kill existing chrome instance
    # retry opening chrome:
    browser = webdriver.Chrome("chromedriver.exe", options=chrome_options)
browser.get('https://www.google.co.uk/')
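To see why the question's version fails, here is a small sketch of the same pattern without Selenium. `open_browser`, `start_or_none`, and `start_unbound` are hypothetical stand-ins written for this illustration. The only point is that a name assigned inside `try` is never bound if the assignment raises, so nothing was garbage collected.

```python
def open_browser(fail):
    """Hypothetical stand-in for webdriver.Chrome(...)."""
    if fail:
        raise RuntimeError("profile already in use")
    return "browser-object"


def start_or_none(fail):
    """Return the browser, or None when creation fails."""
    browser = None  # bound before the try, so the name always exists
    try:
        browser = open_browser(fail)
    except RuntimeError:
        pass  # browser keeps its initial value, None
    return browser


def start_unbound(fail):
    """Same logic as the question: no binding before the try."""
    try:
        browser = open_browser(fail)
    except RuntimeError:
        pass
    return browser  # fails if the assignment above never ran
```

Inside a function the failure surfaces as `UnboundLocalError`, which is a subclass of the `NameError` shown in the question. With the `None` default, the caller can test `if browser is None` and decide whether to retry.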

Python Bot Twitch Viewer (Selenium)

So basically I am working on a Python script that logs into a Twitch account and stays there to generate a viewer.
But my main issue is how to make this work for multiple accounts.
How do I hide all the windows, and how can I handle multiple Selenium windows?
Is Selenium even good for that, or is there another way?
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--mute-audio")
driver = webdriver.Chrome(r"D:\Downloads\chromedriver_win32\chromedriver.exe", chrome_options=chrome_options)
driver.minimize_window()
driver.get('https://www.twitch.tv/login')
search_form = driver.find_element_by_id('login-username')
search_form.send_keys('user')
search_form = driver.find_element_by_id('password-input')
search_form.send_keys('password')
search_form.submit()
driver.implicitly_wait(10)
driver.get('https://www.twitch.tv/channel')

You are definitely able to use Selenium and Python to do this. To run multiple accounts, you will have to utilize multi-threading or create multiple driver objects to manage.
Multithreading example from this thread:
from selenium import webdriver
import threading
import time
def test_logic():
    driver = webdriver.Firefox()
    url = 'https://www.google.co.in'
    driver.get(url)
    # Implement your test logic
    time.sleep(2)
    driver.quit()
N = 5  # Number of browsers to spawn
thread_list = list()
# Start test
for i in range(N):
    t = threading.Thread(name='Test {}'.format(i), target=test_logic)
    t.start()
    time.sleep(1)
    print(t.name + ' started!')
    thread_list.append(t)
# Wait for all threads to complete
for thread in thread_list:
    thread.join()
print('Test completed!')
Each driver will have to use a proxy connection to connect to Twitch from a separate IP address. I suggest using Opera, as its built-in VPN makes this a lot easier.
Example of Opera and Selenium from this thread:
from selenium import webdriver
from time import sleep
# The profile directory which Opera VPN was enabled manually using the GUI
opera_profile = '/home/user-directory/.config/opera'
options = webdriver.ChromeOptions()
options.add_argument('user-data-dir=' + opera_profile)
driver = webdriver.Opera(options=options)
driver.get('https://whatismyipaddress.com')
sleep(10)
driver.quit()
To hide the browser windows, run the drivers with the "headless" option.
Headless for Chrome driver:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
Unfortunately, headless is not supported by the Opera driver, so you must use Chrome or Firefox for this.
Good luck!

Hi, you will not be able to build a working bot with Selenium: even if you manage to log in with several accounts, Twitch (like YouTube) looks at your IP address and does not increase the view count when the multiple connections come from the same computer.

In Python, how do I make Selenium work headless with a Saved Browser Session?

I'm trying to bypass the web.whatsapp.com QR scan page. This is the code I used so far:
options = webdriver.ChromeOptions()
options.add_argument('--user-data-dir=./User_Data')
driver = webdriver.Chrome(options=options)
driver.get('https://web.whatsapp.com/')
On the first attempt I have to scan the QR code manually; on later attempts it doesn't ask for the QR code.
However, if I try the same after adding the line chrome_options.add_argument("--headless"), I get "Error writing DevTools active port to file". I tried at least a dozen different solutions found through Google, but none of them work. Any help would be highly appreciated. Thank you.
I have tried a bunch of different arguments in different combinations so far, but nothing worked:
options = Options() #decomment for local debugging
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-setuid-sandbox')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu') # Last I checked this was necessary.
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('--user-data-dir=./User_Data')
driver = webdriver.Chrome('chromedriver.exe', options=options)
driver.get('https://web.whatsapp.com/')

Recently I made a WhatsApp bot and had the same problems. After searching for a long time I came up with this solution:
The first problem was the browser cache: if the QR code login is not cached in the browser's AppData folder, the page keeps waiting for you to scan it.
So in my program I used the following function to get the path:
import os
import pathlib
import re
def apdata_path():
    path = str(pathlib.Path().absolute())
    driver_path = path + r"\chromedriver.exe"  # not used below
    apdata = os.getenv('APPDATA')
    # Keep everything up to and including "...AppData\", then append the profile folder
    apdata_path = "user-data-dir=" + \
        re.findall(r".+.\Dta\D", apdata)[0] + \
        r'Local\Chromium\User Data\Default\Default'
    apdata_path = apdata_path.replace("\\", "\\" * 2)
    return apdata_path
Here it first finds the AppData path => C:\Users\AppData\ and then I concatenate the rest of the path to the cache folder; in this case I used Chromium. In your case it will be:
C:\Users\AppData\Local\Google\Chrome\User Data\Default
There's probably a better way to find the profile data path. After finding it I set the driver:
def chrome_driver(user_agent=0):
    usr_path = apdata_path()
    # file_path() is a helper that is not defined in the post
    chrome_path = file_path() + r'\Chromium 85\bin\chrome.exe'
    options = webdriver.ChromeOptions()
    options.binary_location = chrome_path
    if user_agent != 0:
        # Headless run: reuse the user agent captured from a visible session
        options.add_argument('--headless')
        options.add_argument('--hide-scrollbars')
        options.add_argument('--disable-gpu')
        options.add_argument("--log-level=3")
        options.add_argument('--user-agent={}'.format(user_agent))
    options.add_argument(usr_path)
    driver = webdriver.Chrome('chromedriver.exe', options=options)
    return driver
Here I had another problem: sometimes Selenium wouldn't work because WhatsApp validates the user agent to check whether the browser version is compatible. I don't know much about it and reached this conclusion by trial and error, so this may not be the real explanation. But it worked for me.
So, in my bot I made a start function that gets the user agent, performs the first QR scan, and keeps it in the browser cache:
def whatsapp_QR():
    driver = chrome_driver()
    user_agent = driver.execute_script("return navigator.userAgent;")
    driver.get("https://web.whatsapp.com/")
    print("Scan QR Code, And then Enter")
    input()
    print("Logged In")
    driver.close()
    return user_agent
After all that, my bot worked; not perfectly, but it ran smoothly. I was able to send messages to my test group in headless mode.
To summarize: use the profile's cached user data in AppData to bypass the QR code (you will need to run once without headless to create the cache first).
Then pass the user agent in order to get past WhatsApp's validation. The whole option set would look like this:
usr_path = r"user-data-dir=C:\Users\AppData\Local\Google\Chrome\User Data\Default"
options.add_argument('--headless')
options.add_argument('--hide-scrollbars')
options.add_argument('--disable-gpu')
options.add_argument("--log-level=3")
options.add_argument('--user-agent={}'.format(user_agent))  # User agent for validation
options.add_argument(usr_path)  # AppData user profile, to bypass the QR code
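The answer says there is probably a better way to find the profile data path. One possibility, sketched here as an assumption and not as the answerer's method, is to read the `LOCALAPPDATA` environment variable that Windows sets and join the folder names with `pathlib`, avoiding the regular expression. `chrome_profile_argument` and its `parts` parameter are hypothetical names introduced for this example.

```python
import os
from pathlib import PureWindowsPath


def chrome_profile_argument(local_appdata=None,
                            parts=("Google", "Chrome", "User Data")):
    """Build a user-data-dir argument for ChromeOptions.add_argument().

    local_appdata defaults to the LOCALAPPDATA environment variable,
    typically the AppData/Local folder of the current Windows user.
    """
    if local_appdata is None:
        local_appdata = os.environ["LOCALAPPDATA"]
    # PureWindowsPath joins with backslashes regardless of the host OS
    profile_dir = PureWindowsPath(local_appdata).joinpath(*parts)
    return "user-data-dir=" + str(profile_dir)
```

The result can be passed directly to `options.add_argument(...)`. Chrome's `--user-data-dir` switch normally expects the `User Data` directory itself, so the default stops there. Pass `parts=("Google", "Chrome", "User Data", "Default")` to reproduce the answer's path exactly, or `parts=("Chromium", "User Data")` for a Chromium install.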

Connecting to CefPython instance with Selenium

After an exhaustive amount of Googling, I'm at a loss as to how to connect to a CefPython (Chromium Embedded Framework) browser instance using Selenium.
I see two possible ways of going about this:
1. Use Selenium to launch a CefPython instance directly, or
2. Launch a CefPython instance independently, then connect to it with Selenium.
I've looked for similar questions but they either have non-working code (older versions?) or seem to be attempting to do other things, and I can't find any with actual working code snippets as answers. So as a starting point, here is working code for launching Chrome with Selenium, but using a standard non-CEF Chrome instance:
Option 1 (working; launch standard Chrome.exe with Selenium)
import time
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
chromedriver_path = r"C:\Users\..\webdrivers\chromedriver_2_40\chromedriver_win32\chromedriver.exe"
chrome_path = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
options.binary_location = chrome_path
driver = webdriver.Chrome(executable_path=chromedriver_path, options=options)
time.sleep(1)
driver.get("https://www.google.com") # SUCCESS!
time.sleep(4)
driver.quit()
Option 2 (working; launch Chrome.exe then connect to it with Selenium)
In this example, "driver2" is the one connecting remotely to the already-running instance created by "driver."
import time
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
chromedriver_path = r"C:\Users\..\webdrivers\chromedriver_2_40\chromedriver_win32\chromedriver.exe"
chrome_path = r"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
options.binary_location = chrome_path
driver = webdriver.Chrome(executable_path=chromedriver_path, options=options)
executor_url = driver.command_executor._url
session_id = driver.session_id
print(executor_url, session_id)
driver2 = webdriver.Remote(command_executor=executor_url, desired_capabilities={})
driver2.close() # close the second session created by driver 2 (cannot pass a session_id to webdriver.Remote())
driver2.session_id = session_id # use the driver1 session instead
time.sleep(1)
driver2.get("https://www.google.com") # SUCCESS!
time.sleep(4)
driver2.quit()
But when I try to make this work with CefPython, I am at a loss for how to do it.
Option 1 (non-working; CefPython instance)
Attempting Option 1 with CefPython just hangs for a while before raising an exception. The only executable in the CefPython package I see that Selenium could possibly use to launch is the subprocess.exe file, but clearly this is NOT just a drop-in replacement for chrome.exe.
This code is identical to the "Option 1" code above, except it swaps the chrome_path for the subprocess.exe binary.
import time
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
chromedriver_path = r"C:\Users\..\webdrivers\chromedriver_2_40\chromedriver_win32\chromedriver.exe"
chrome_path = r"C:\Users\..\project-folder\pybin\Lib\site-packages\cefpython3\subprocess.exe"
options.binary_location = chrome_path
driver = webdriver.Chrome(executable_path=chromedriver_path, options=options)
print('driver created...') # is never reached :( apparently hangs over socket waiting...
# after a while...
# selenium.common.exceptions.WebDriverException: Message: unknown error: DevToolsActivePort file doesn't exist
time.sleep(1)
driver.get("https://www.google.com")
time.sleep(4)
driver.quit()
Option 2 (non-working; CefPython instance)
Here I try to launch CefPython independently and then connect to it with Selenium. This leaves me needing an executor_url and a session id, but I cannot for the life of me figure out how to get these from a running CefPython instance:
import time
from cefpython3 import cefpython as cef
from selenium import webdriver
settings = {"windowless_rendering_enabled": False}
switches = {"remote-debugging-port": "22222",
            "user-data-dir": r"C:\Users\..\..\mydatadir"}
cef.Initialize(settings, switches)
executor_url = None # how to get this?
session_id = None # how to get this?
driver2 = webdriver.Remote(command_executor=executor_url, desired_capabilities={})
driver2.close() # close the driver 1 session (cannot pass a session_id to webdriver.Remote())
driver2.session_id = session_id
time.sleep(30)
driver2.get("https://www.google.com")
time.sleep(4)
driver2.quit()
I'm using the 2.40 version of ChromeDriver, because the latest version of CefPython uses Chrome version 66, which in turn requires version 2.40 of the chromedriver.
Any assistance is appreciated.
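A possible direction for Option 2, offered as an untested assumption and not as a confirmed answer: ChromeDriver has a `debuggerAddress` option for attaching to a browser that is already listening on a remote debugging port, which would sidestep the need for an `executor_url` or session id. Whether ChromeDriver 2.40 accepts a CefPython instance this way is not established by the post. `debugger_address` and `attach_to_debug_port` are hypothetical helper names, and port 22222 is the one from the switches above.

```python
def debugger_address(port, host="127.0.0.1"):
    """Format the host:port string expected by the debuggerAddress option."""
    return "{}:{}".format(host, port)


def attach_to_debug_port(chromedriver_path, port):
    """Ask ChromeDriver to attach to an already-running browser.

    Assumption: the CefPython instance was started with the
    remote-debugging-port switch set to the same port and is still running.
    """
    from selenium import webdriver  # lazy import; needs selenium installed

    options = webdriver.ChromeOptions()
    options.add_experimental_option("debuggerAddress", debugger_address(port))
    # executable_path matches the Selenium API used in the question;
    # recent Selenium releases take a Service object instead.
    return webdriver.Chrome(executable_path=chromedriver_path, options=options)
```

With the question's setup, the call would be `attach_to_debug_port(chromedriver_path, 22222)` after `cef.Initialize(settings, switches)` has created a browser window. If ChromeDriver refuses the connection, that suggests the approach does not carry over to CefPython as it does to a regular Chrome started with a debugging port.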
