How to avoid bot detection in Selenium? - python

I am trying to scrape this website with the following code:
from selenium import webdriver
options = webdriver.ChromeOptions()
driver_path = '/Users/francopiccolo/Utils/chromedriver97'
driver = webdriver.Chrome(executable_path=driver_path, chrome_options=options)
url = 'https://www.zonaprop.com.ar/inmuebles-venta-rosario.html'
driver.get(url)
The problem is that the site somehow detects a bot and returns an error.
Any ideas?

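from selenium.webdriver.chrome.options import Options
options = Options()
# Remove the "enable-automation" switch that ChromeDriver adds by default
options.add_experimental_option("excludeSwitches", ["enable-automation"])
# Do not load the automation extension
options.add_experimental_option('useAutomationExtension', False)
# Stop Blink from exposing navigator.webdriver as true
options.add_argument('--disable-blink-features=AutomationControlled')
Try these options to reduce the chance of being detected as a bot, and pass them to the driver with webdriver.Chrome(options=options).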
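As a rough sketch, the three settings can be wrapped in a helper and the result checked from inside the page. The names apply_stealth_options and webdriver_flag are hypothetical (they are not part of Selenium); the helper only assumes an object with the add_argument and add_experimental_option methods that ChromeOptions provides, and a driver with execute_script.

```python
def apply_stealth_options(options):
    """Apply the three settings to a ChromeOptions-like object and return it."""
    # Remove the switch that marks the session as automated
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    # Do not load the automation extension
    options.add_experimental_option("useAutomationExtension", False)
    # Ask Blink not to flag the browser as automation-controlled
    options.add_argument("--disable-blink-features=AutomationControlled")
    return options


def webdriver_flag(driver):
    """Return the value page scripts see in navigator.webdriver."""
    return driver.execute_script("return navigator.webdriver")
```

With a real browser this would be used as driver = webdriver.Chrome(options=apply_stealth_options(webdriver.ChromeOptions())), followed by webdriver_flag(driver) after loading a page. A falsy result only means this one indicator is hidden; a site may still detect automation in other ways.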

Related

How to use Firefox with enable-automation disabled?

The code below starts Chrome with options that disable the "automation" indicators. How do I do the same for Firefox?
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_experimental_option("useAutomationExtension", False)
options.add_experimental_option("excludeSwitches",["enable-automation"])
#driver_path = r'C:\Users\mkdob\AppData\Local\ms-playwright\firefox-1323\firefox\firefox.exe'
driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe',
                          chrome_options=options)
driver.get('https://google.com')
driver.close()

Website not opening with chrome driver but opening with geckodriver

I am trying to scrape this website. It does not work with BeautifulSoup, and the page does not open with the Selenium Chrome driver either. It does open with Firefox, but it is very slow and gives an error on element click. Here is my code:
from selenium import webdriver
opt = webdriver.ChromeOptions()
opt.add_argument("--disable-xss-auditor")
opt.add_argument("--disable-web-security")
opt.add_argument("--allow-running-insecure-content")
opt.add_argument("--no-sandbox")
opt.add_argument("--disable-setuid-sandbox")
opt.add_argument("--disable-webgl")
opt.add_argument("--disable-popup-blocking")
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(f"http://app1.nmpa.gov.cn/data_nmpa/face3/base.jsp?tableId=25&tableName=TABLE25&title=%B9%FA%B2%FA%D2%A9%C6%B7&bcId=152904713761213296322795806604&CbSlDlH0=qGrYrAktn7.tn7.tnznJalIvVetjcXpaapSdKuqmmoVqqWL")
Possibly the Chrome browsing context started by Selenium's ChromeDriver is being detected as a bot, and the arguments you have added cannot bypass the bot detection mechanism. Note also that opt is never passed to webdriver.Chrome() in your code, so those arguments are not applied at all.
Solution
You can try to evade the detection by adding a few arguments and experimental options as follows:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()
options.add_argument("start-maximized")
# Both switches go in one list: a second call with the same option name would replace the first
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get("http://app1.nmpa.gov.cn/data_nmpa/face3/base.jsp?tableId=25&tableName=TABLE25&title=%B9%FA%B2%FA%D2%A9%C6%B7&bcId=152904713761213296322795806604&CbSlDlH0=qGrYrAktn7.tn7.tnznJalIvVetjcXpaapSdKuqmmoVqqWL")
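One detail worth spelling out: add_experimental_option stores each value under its option name, so setting "excludeSwitches" twice keeps only the last list. A minimal sketch of collecting the switches first and setting the option once is shown below; merge_switches and exclude_switches are hypothetical helper names, not Selenium functions.

```python
def merge_switches(*groups):
    """Combine several lists of switch names into one list without duplicates."""
    merged = []
    for group in groups:
        for switch in group:
            if switch not in merged:
                merged.append(switch)
    return merged


def exclude_switches(options, *groups):
    """Set excludeSwitches once, with every requested switch in a single list."""
    options.add_experimental_option("excludeSwitches", merge_switches(*groups))
    return options
```

For example, exclude_switches(options, ["enable-automation"], ["enable-logging"]) sets a single list containing both switches.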

Deploying a Facebook post scraper on Heroku using Selenium and Python

Here is my config for using ChromeDriver on Heroku:
chrome_options = webdriver.ChromeOptions()
chrome_options.binary_location = os.environ.get("GOOGLE_CHROME_BIN")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument('--disable-gpu')
chrome = webdriver.Chrome(executable_path=os.environ.get("CHROMEDRIVER_PATH"), chrome_options=chrome_options)
chrome.get("https://facebook.com/groups/744721719556973/")
My task is to crawl posts from a public group, so I use Selenium and BeautifulSoup.
It worked very well locally and I crawled the data successfully, but when I deploy it to Heroku it returns an empty array.
My local config:
options = Options()
options.add_argument("--disable-notifications")
options.add_argument("--headless")  # options.add_argument("--start-maximized")
options.add_argument("--disable-dev-shm-usage")  # options.add_argument("--start-maximized")
options.add_argument("--no-sandbox")
# chrome = webdriver.Chrome('./chromedriver', chrome_options=options)
# chrome.get("https://facebook.com/groups/744721719556973/")
# scroll to crawl posts
chrome.execute_script("window.scrollTo(0,document.body.scrollHeight)")
tree = html.fromstring(chrome.page_source)
soup = BeautifulSoup(chrome.page_source, 'html.parser')
Here is my approach:
# this still works locally, but on Heroku it doesn't find any div with these classes
match = soup.find_all('div', class_='du4w35lb k4urcfbm l9j0dhe7 sjgh65i0')
options.add_argument("--disable-notifications")
options.add_argument("--headless")
options.add_argument("--start-maximized")
options.add_argument("--disable-dev-shm-usage")
options.add_argument("--start-maximized")
options.add_argument("--no-sandbox")
I tried many approaches but none of them work. If you have overcome this problem, please advise. Thank you so much.
Here is an in-depth article from Medium, that may help you overcome your issues. https://medium.com/#mikelcbrowne/running-chromedriver-with-python-selenium-on-heroku-acc1566d161c
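One way to narrow this down is to check what HTML the headless browser on Heroku actually received, for example whether it is a login or consent page rather than the group feed. The sketch below uses only the standard library; count_divs_with_classes is a hypothetical helper (not part of Selenium or BeautifulSoup) that counts div elements whose class attribute contains all of the given class names.

```python
from html.parser import HTMLParser


class _DivClassCounter(HTMLParser):
    """Count <div> start tags that carry every wanted class name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # The class attribute may be missing or empty
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.count += 1


def count_divs_with_classes(page_source, class_names):
    """Return how many divs in page_source have all space-separated class_names."""
    counter = _DivClassCounter(class_names.split())
    counter.feed(page_source)
    return counter.count
```

Logging count_divs_with_classes(chrome.page_source, 'du4w35lb k4urcfbm l9j0dhe7 sjgh65i0') together with chrome.title in both environments would show whether Heroku is being served a different page than the local machine.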

Accessing Corsair.com with Python Selenium?

Is it possible to access https://www.corsair.com/ with Selenium in Python without getting blocked by Corsair?
When I try to load the page in Selenium, it keeps giving me this error message:
To bypass it, I tried changing the user agent to a random one, which didn't fix the issue.
This is my code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from fake_useragent import UserAgent
options = webdriver.ChromeOptions()
options.add_argument("window-size=1400,600")
ua = UserAgent()
user_agent = ua.random
print(user_agent)
options.add_argument(f'user-agent={user_agent}')
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=options)
print('Loading Corsair Website ...')
driver.get("https://www.corsair.com/")
There are multiple ways to evade detection of Selenium automation. One of them is to use the following argument:
--disable-blink-features=AutomationControlled
Code Block:
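from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
options = Options()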
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
# options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get("https://www.corsair.com/")
driver.save_screenshot("image.png")
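Another commonly used approach, which is an assumption here and not taken from the answer, is to use the Chrome DevTools Protocol to run a small script before any page script on every new document. The sketch relies on execute_cdp_cmd, which Selenium's Chrome driver provides, and on the CDP method Page.addScriptToEvaluateOnNewDocument; HIDE_WEBDRIVER_JS and hide_webdriver_property are hypothetical names.

```python
# JavaScript that makes navigator.webdriver read as undefined
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)


def hide_webdriver_property(driver):
    """Register the script so Chrome evaluates it at the start of each new document."""
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )
```

It would be called once, before driver.get(...). Whether this is enough for a particular site is not something the sketch can promise.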

How to correctly specify webdriver path with add_argument()?

This is my code:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("--no-sandbox")
options.add_argument("C:/webdrivers/chromedriver.exe")
driver = webdriver.Chrome(options=options)
driver.get("https://www.google.com")
But it did not use the webdriver that I am trying to specify; it uses a different one. How do I correctly specify the path to the webdriver in the code above?
The main point is that I want to specify the path to the webdriver and also run without the sandbox. How can I do that?
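This worked. options.add_argument() only passes command-line switches to the Chrome browser, so the driver path has to be given to webdriver.Chrome() instead: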
from selenium import webdriver
# start the browser
options = webdriver.ChromeOptions()
# options.add_argument("--headless")
options.add_argument("--no-sandbox")
# options.add_argument("--disable-dev-shm-usage")
# options.add_argument("--disable-gpu")
# options.add_argument("--window-size=1920,1080")
driver = webdriver.Chrome(executable_path=r"C:/webdrivers/chromedriver.exe", options=options)
driver.get("https://www.google.com")
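Selenium 4 deprecated executable_path in favour of a Service object, so on newer versions the same idea could be sketched as follows. make_chrome is a hypothetical helper name; the imports are done inside the function so the definition itself has no side effects.

```python
def make_chrome(driver_path, arguments=("--no-sandbox",)):
    """Start Chrome with an explicit chromedriver path and browser arguments."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for arg in arguments:
        # Browser switches go on the options object
        options.add_argument(arg)
    # The driver path goes on the Service, not on the options
    return webdriver.Chrome(service=Service(driver_path), options=options)
```

Usage would look like driver = make_chrome(r"C:/webdrivers/chromedriver.exe") followed by driver.get("https://www.google.com").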
