I am new to Selenium with Python and I am trying to scrape data from a website. Below is my code, where I have taken all the precautions I know of to avoid getting blocked.
from random import randrange
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
#Function to generate random useragent.
def generate_user_agent():
    user_agents_file = open("user_agents.txt", "r")
    user_agents = user_agents_file.read().split("\n")
    i = randrange(len(user_agents))
    userAgent = user_agents[i]
    user_agents_file.close()
    return userAgent
#Function to generate random IP address.
def generate_ip_address():
    proxies_file = open("proxyscrape_premium_http_proxies.txt", "r")
    proxies = proxies_file.read().split("\n")
    i = randrange(len(proxies))
    proxy = proxies[i]
    proxies_file.close()
    return proxy
#Function to create and set chrome options.
def set_chrome_options():
    proxy = generate_ip_address()
    options = webdriver.ChromeOptions()
    options.add_argument("start-maximized")
    options.add_argument("--incognito")
    options.add_argument(f'--proxy-server={proxy}')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    return options, proxy
#Function to create a webdriver object and set its properties.
def create_webdriver():
    options, proxy = set_chrome_options()
    userAgent = generate_user_agent()
    webdriver.DesiredCapabilities.CHROME['proxy'] = {
        "httpProxy": proxy,
        "ftpProxy": proxy,
        "sslProxy": proxy,
        "proxyType": "MANUAL",
    }
    webdriver.DesiredCapabilities.CHROME['acceptSslCerts'] = True
    driver = webdriver.Chrome(options=options, executable_path=r'chromedriver.exe')
    driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
    driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": userAgent})
    return driver
url = 'http://www.doctolib.de/impfung-covid-19-corona/berlin'
driver = create_webdriver()
driver.get(url)
The webpage is not opened via the Selenium web driver (but it opens normally in a regular browser). Below is a screenshot of how the browser looks when I run the code.
Please let me know if I am missing something. Any help would be highly appreciated.
PS: I am using premium proxies for IP rotation.
I've had a similar experience in the past where the website detects that Selenium is being used, even after using several methods like IP rotation, User-Agent rotation, or proxies.
I would suggest using the undetected_chromedriver library.
pip install undetected-chromedriver
It's able to load the website without any problem.
The code snippet is given below:-
import undetected_chromedriver.v2 as uc
driver = uc.Chrome()
with driver:
    driver.get('http://www.doctolib.de/impfung-covid-19-corona/berlin')
I was having a similar issue with Firefox on Linux. I just deleted the log file created by geckodriver, which had grown quite big for a text file (4.8 MB), and everything started to work fine again.
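A minimal sketch of that workaround, assuming geckodriver writes its default geckodriver.log into the current working directory (adjust the path to wherever your log actually lives):

```python
import os

LOG_PATH = "geckodriver.log"  # default location: the working directory

def reset_geckodriver_log(path=LOG_PATH):
    """Delete the old geckodriver log if present; it is recreated on the next run."""
    if os.path.exists(path):
        os.remove(path)

reset_geckodriver_log()
```

Alternatively, Selenium 3's Firefox driver accepts a service_log_path argument, so pointing it at os.devnull stops the log from growing in the first place.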
I was trying to get around the request limit of the GitHub contribution graph (e.g., https://github.com/crobby/webhook/graphs/contributors) during web scraping, so I decided to use the webdriver over Tor.
I can open my web driver with the Tor browser, but it gets stuck at the connecting stage, as shown in the screenshot.
I can open links with the web driver, but I still hit the request limit after scraping several links. Does anyone have a hint about the potential issue?
Here is my code:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
from selenium.webdriver.firefox.options import Options
import os
torexe = os.popen(r'C:/Users/fredr/Desktop/Tor Browser/Browser/TorBrowser/Tor/tor.exe')
profile = FirefoxProfile(r'C:/Users/fredr/Desktop/Tor Browser/Browser/TorBrowser/Data/Browser/profile.default')
profile.set_preference('network.proxy.type', 1)
profile.set_preference('network.proxy.socks', '127.0.0.1')
profile.set_preference('network.proxy.socks_port', 9050)
profile.set_preference("network.proxy.socks_remote_dns", False)
profile.update_preferences()
options = Options()
options.binary_location = r'C:/Users/fredr/Desktop/Tor Browser/Browser/firefox.exe'
driver = webdriver.Firefox(firefox_profile=profile, executable_path=r'C:/Users/fredr/Downloads/geckodriver.exe', options=options)
driver.get("http://check.torproject.org")
I'm using Selenium webdriver to open a webpage and I set up a proxy for the driver to use. The code is listed below:
PATH = "C:\Program Files (x86)\chromedriver.exe"
PROXY = "212.237.16.60:3128" # IP:PORT or HOST:PORT
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f'--proxy-server={PROXY}')
proxy = Proxy()
proxy.auto_detect = False
proxy.http_proxy = PROXY
proxy.sslProxy = PROXY
proxy.socks_proxy = PROXY
capabilities = webdriver.DesiredCapabilities.CHROME
proxy.add_to_capabilities(capabilities)
driver = webdriver.Chrome(PATH, chrome_options=chrome_options,desired_capabilities=capabilities)
driver.get("https://whatismyipaddress.com")
The problem is that the web driver is not using the given proxy: it accesses the page with my normal IP. I have already tried every variant of the code I could find on the internet, without success. I also tried setting the proxy directly in my PC settings: a normal Chrome window then uses it fine (so it is not a problem with the proxy server), but a page opened by the driver still uses my normal IP and somehow bypasses the proxy. I also tried changing the proxy settings of the IDE (PyCharm), and it still doesn't work. I'm out of ideas; could someone help me?
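Before blaming Selenium, it can help to confirm the proxy itself answers. A standard-library diagnostic sketch (the proxy address is the one from the question and may well be dead, which would itself explain the behavior; api.ipify.org simply echoes the caller's public IP):

```python
import urllib.request

PROXY = "212.237.16.60:3128"

def opener_via(proxy):
    """Build a urllib opener that routes http/https traffic through the proxy."""
    handler = urllib.request.ProxyHandler({
        "http": f"http://{proxy}",
        "https": f"http://{proxy}",
    })
    return urllib.request.build_opener(handler)

def external_ip(proxy, timeout=10):
    """Return the IP an echo service sees through the proxy, or None if it is unusable."""
    try:
        with opener_via(proxy).open("https://api.ipify.org", timeout=timeout) as resp:
            return resp.read().decode()
    except OSError:  # URLError, timeouts, and connection refusals all land here
        return None
```

If external_ip(PROXY) returns None here too, the problem is upstream of Selenium: the proxy is simply not accepting connections.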
This should work. Code snippet:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
PROXY = "212.237.16.60:3128"
# add proxy in chrome_options
chrome_options.add_argument(f'--proxy-server={PROXY}')
driver = webdriver.Chrome(PATH, options=chrome_options)
# to check the new IP
driver.get("https://api.ipify.org/?format=json")
Note: chrome_options is deprecated now; use options instead.
I've been attempting to connect to an HTTPS proxy for hours and can't seem to figure out why it's not working. I'm using Python 3.9.0 with Selenium 3.141.0 and Chrome 92.0.4515.159. I'm grabbing free HTTPS- and Google-accessible proxies from here, and using the following preferences:
from fake_useragent import UserAgent  # assuming the fake_useragent library
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

options = webdriver.ChromeOptions()
useragent = UserAgent().random
options.add_argument(f"--user-agent={useragent}")
options.add_argument("--window-size=1920,1080")
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument(f"--proxy-server={proxystring}")
capabilities = DesiredCapabilities.CHROME
capabilities["AcceptSslCerts"] = True
capabilities["marionette"] = True
driver = webdriver.Chrome(options=options, desired_capabilities=capabilities)
where proxystring is a <host>:<port> string. Whenever I attempt to use the proxy, I get the following Chrome error:
Whenever I don't use the proxy, the page loads fine – I'm just confused as to why my attempt to use proxies isn't working.
Sadly, free proxies from https://free-proxy-list.net/ and other similar websites have a very low success rate and rarely work.
I'd suggest using premium proxies, or other detection-avoidance methods such as User-Agent rotation, adding more waits, etc.
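A minimal sketch of those two suggestions, assuming you maintain your own list of user-agent strings (the two below are just illustrative examples, not a maintained list):

```python
import random
import time

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0",
]

def random_user_agent():
    """Pick a fresh User-Agent for each new driver session."""
    return random.choice(USER_AGENTS)

def polite_wait(lo=2.0, hi=6.0):
    """Sleep a random interval between page loads so timing looks less mechanical."""
    time.sleep(random.uniform(lo, hi))
```

The chosen string can then be passed to Chrome via options.add_argument(f'--user-agent={random_user_agent()}'), exactly as the question above already does.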
I'm scraping a seemingly simple site that doesn't require a login, or any interaction with elements. However, when I use Selenium/requests/etc., the code just hangs. I've tried matching the headers to what I find using developer tools to no avail. I'm wondering if someone can point me in the right direction.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.proxy import Proxy
from fake_useragent import UserAgent
URL = 'https://www.cmegroup.com/CmeWS/mvc/xsltTransformer.do?xlstDoc=/XSLT/md/blocks-records.xsl&url=/da/BlockTradeQuotes/V1/Block/BlockTrades?exchange=XCBT,XCME,XCEC,DUMX,XNYM&foi=FUT,OPT,SPD&assetClassId=8&tradeDate=06212021&sortCol=time&sortBy=desc&_=1624332329760'
agent = UserAgent()
userAgent = agent.random
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f'--user-agent={userAgent}')
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
driver.get(URL)
The option arguments are recommended for getting chromedriver to run on Google Colab. I've tried it locally without them with the same result.
The challenge I see is that, through Selenium, I am trying to click on a website element (a div with some JS attached); the "button" navigates you to another page.
How can I configure the browser to automatically route the requests through a proxy?
My proxy is set up as follows:
http://api.myproxy.com?key=AAA111BBB6&url=http://awebsitetobrowse.com
I am trying to put webdriver (chrome) behind the proxy
from selenium import webdriver
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(chrome_options=options)
where options, so far, is some basic configuration of the browser window size.
I have seen quite some examples (ex1, ex2, ex3) but I somehow fail to find an example that suits my needs.
import os
dir_path = os.path.dirname(os.path.realpath(__file__)) + "\\chromedriver.exe"
PROXY = "http://api.scraperapi.com?api_key=1234&render=true"
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
driver = webdriver.Chrome(executable_path = dir_path, chrome_options=chrome_options)
driver.get("https://stackoverflow.com/questions/11450158/how-do-i-set-proxy-for-chrome-in-python-webdriver")
It seems like the proxy address you are using is not an actual proxy: it is an API that returns the HTML content of the page itself after handling proxies, captchas, and IP blocking. Still, different scenarios call for different solutions. Some of them are as follows.
Scenario 1
In my opinion, you are using this API in the wrong manner. If your API provides the facility to return the response of the visited page through a proxy, it should be used directly in driver.get(), with
address = "http://api.scraperapi.com/?api_key=YOURAPIKEY&url=" + url_to_be_visited_via_api
Example code for this would look like:
import os
dir_path = os.path.dirname(os.path.realpath(__file__)) + "\\chromedriver.exe"
APIKEY = "1234"  # replace with your API key
apiURL = "http://api.scraperapi.com/?api_key=" + APIKEY + "&render=true&url="
visit_url = "https://stackoverflow.com/questions/11450158/how-do-i-set-proxy-for-chrome-in-python-webdriver"
from selenium import webdriver
driver = webdriver.Chrome(executable_path=dir_path)
driver.get(apiURL + visit_url)
Scenario 2
But if you have some API that provides a proxy address and login credentials in its response, then it can be set in Chrome options so that Chrome itself uses it.
This is the case if the response of the API is something like
"PROTOCOL://user:password@proxyserver:proxyport" (in case of authentication)
"PROTOCOL://proxyserver:proxyport" (in case of null authentication)
In both cases PROTOCOL can be HTTP, HTTPS, SOCKS4, SOCKS5, etc.
And that code should look like:
import os
dir_path = os.path.dirname(os.path.realpath(__file__)) + "\\chromedriver.exe"
import requests
proxyapi = "http://api.scraperapi.com?api_key=1234&render=true"
proxy = requests.get(proxyapi).text
visit_url = "https://stackoverflow.com/questions/11450158/how-do-i-set-proxy-for-chrome-in-python-webdriver"
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server='+proxy)
driver = webdriver.Chrome(executable_path = dir_path, chrome_options=chrome_options)
driver.get(visit_url)
Scenario 3
But if the API itself is a proxy with null authentication, then it can be set in Chrome options and used with Chrome itself.
And that code should look like:
import os
dir_path = os.path.dirname(os.path.realpath(__file__)) + "\\chromedriver.exe"
proxyapi = "http://api.scraperapi.com?api_key=1234&render=true"
visit_url = "https://stackoverflow.com/questions/11450158/how-do-i-set-proxy-for-chrome-in-python-webdriver"
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server='+proxyapi)
driver = webdriver.Chrome(executable_path = dir_path, chrome_options=chrome_options)
driver.get(visit_url)
So the appropriate solution depends on which scenario applies to you.
Well, after countless experiments, I have figured out that the thing works with:
apiURL = "http://api.scraperapi.com/?api_key="+APIKEY+"&render=true&url="
while it fails miserably with:
apiURL = "http://api.scraperapi.com?api_key="+APIKEY+"&render=true&url="
I have to admit my ignorance here: I thought the two should be equivalent.
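They are not quite equivalent: without the slash, the URL has an empty path rather than "/", and whichever component parses the URL (the API's own routing, for instance) may treat the two differently. The standard library shows the only difference:

```python
from urllib.parse import urlparse

# The two apiURL variants from above, parsed into their components.
with_slash = urlparse("http://api.scraperapi.com/?api_key=1234&render=true&url=")
without_slash = urlparse("http://api.scraperapi.com?api_key=1234&render=true&url=")

print(with_slash.path)     # '/'
print(without_slash.path)  # ''
# Everything else (scheme, netloc, query) is identical.
```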