I wanted to build a semi-automatic solution for scraping a website protected by Cloudflare's hCaptcha. The idea was that I could solve the captcha manually whenever it appears and then let my scraper work on the website for some time until another captcha has to be solved.
To try out my solution, I open the URL with Selenium while trying to mask it as a regular user:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium_stealth import stealth

# Hide the most obvious automation flags
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)

service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

# Patch common fingerprinting surfaces (navigator, WebGL, etc.)
stealth(driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)
driver.get(url_to_scrape)  # solve the captcha manually
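To keep the script paused while I solve the captcha by hand, a simple blocking prompt works (a minimal sketch; adjust to your own flow):

input("Solve the captcha in the browser, then press Enter to continue...")
# scraping continues here once the page is unlocked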
I want to get to the actual website after solving the captcha so I can scrape some info from it. The problem is that even when I solve the captcha, Cloudflare doesn't let me see the site: it just reloads the captcha page (with a 403 response) and makes me solve another one, then another, and another, and so on.
What am I doing wrong? There shouldn't be any problem with me solving the captcha, so it must somehow be detecting Selenium as a bot. I thought that with the snippet above the website wouldn't see Selenium any differently from a normal user on Chrome, but clearly I'm missing something.
Without the site URL it is impossible to tell exactly what is happening, although from previous experience I believe the hCaptcha prompt is probably appearing as part of the site's protection layer and may not come from the site itself.
If it is appearing as part of the site protection, then start your browser using your own profile:
$browser = Start-SeDriver -Browser Chrome -Arguments "--user-data-dir=C:\Users\$($env:username)\AppData\Local\Google\Chrome\User Data"
$browser.Navigate().GoToUrl("https://google.com")
....then run the remaining part of your code to scrape the site.
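If you are driving Chrome from Python, the equivalent would be (a sketch, assuming the default Windows profile path; adjust the path for your machine):

import os
from selenium import webdriver

options = webdriver.ChromeOptions()
# Reuse your everyday Chrome profile so its cookies and clearance tokens apply
profile = os.path.expandvars(r"C:\Users\%USERNAME%\AppData\Local\Google\Chrome\User Data")
options.add_argument(f"--user-data-dir={profile}")
driver = webdriver.Chrome(options=options)
driver.get("https://google.com")

Note that Chrome must not already be running with that profile, or the driver will fail to start.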
Related
I have a Python script which logs in to a page (sso.acesso.gov.br) using some credentials and then usually answers a captcha using the 2Captcha API.
The problem is that recently it gets an error after the captcha answer, even when I answer the captcha manually.
By the way, the error message received is different from the one I get when I deliberately answer wrong, which makes me believe that my script is now somehow being detected by the website.
If I open a Chrome browser as a user and just do the same steps, I can log in, sometimes even without a captcha, and always without an error.
Here is my code:
from selenium import webdriver
from fake_useragent import UserAgent
import undetected_chromedriver as uc
from fp.fp import FreeProxy
user_path = 'C:\\PythonProjects\\User Data'
driver_path = 'C:\\PythonProjects\\107\\chromedriver.exe'
options = webdriver.ChromeOptions()
## Tactics to avoid being detected as automation
options.add_argument("--start-maximized")
options.add_argument('--disable-blink-features=AutomationControlled')
## User profile
options.add_argument(f"--user-data-dir={user_path}")
## User agent
ua = UserAgent()
options.add_argument(f'--user-agent={ua.random}')
## Proxy
proxy = FreeProxy().get()
options.add_argument(f'--proxy-server={proxy}')
## Set browser
driver = uc.Chrome(
    driver_executable_path=driver_path,
    options=options,
)
## Spoof device memory and hide navigator.webdriver.
## Note: execute_script only patches the current page, so the overrides
## are installed via CDP to make them persist across navigations.
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': (
        "Object.defineProperty(navigator, 'deviceMemory', {get: () => 8});"
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
    )
})
## Open page
driver.get('https://sso.acesso.gov.br/login')
## From this point I insert the CPF (user) and password, then answer the captcha using 2Captcha
## I have also tried just setting up the browser and navigating manually, entering the data and answering the captcha, but with no success
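For context, the 2Captcha step (not shown above) looks roughly like this; a sketch using the 2captcha-python client, where the API key and sitekey are hypothetical placeholders (use solver.hcaptcha(...) instead if the page serves hCaptcha):

from twocaptcha import TwoCaptcha

solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')  # hypothetical placeholder
result = solver.recaptcha(
    sitekey='SITE_KEY_FROM_PAGE',  # hypothetical: read the real one from the captcha widget
    url='https://sso.acesso.gov.br/login',
)
# result['code'] holds the token, which is then injected into the page's response field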
Do you have any suggestions for getting past this block?
I have no idea what is detecting and blocking my browser.
If I use my script on bot.sannysoft.com, I get the following results:
[Screenshots: Intoli test results and Fingerprint Scanner results, parts 1/2 and 2/2]
Add a couple of seconds' wait before you enter the correct captcha the first time; that might work, unless the protection is designed to catch that too.
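A minimal sketch of that suggestion; the locators are hypothetical placeholders for whatever the captcha form on the page actually uses:

import random
import time

time.sleep(random.uniform(3, 6))  # human-like pause before answering
# driver.find_element(By.ID, 'captcha-input').send_keys(answer)  # hypothetical locator
# driver.find_element(By.ID, 'captcha-submit').click()           # hypothetical locator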
I'm trying to use Selenium (3.141.0) with ChromeDriver (87.0.4280) to access a page. When accessed manually, it brings me to a policy page (a different URL) where you have to hit 'Ok' before continuing to the site. Edit: This is on Windows 10, and the folder with the chromedriver is on PATH.
With the following code I'm able to get to the policy page with the ("--headless") option, but without it I get a blank page with 'data:,' in the URL and nothing else loads. I've tried accessing both the policy page and the site URL directly, but they both get stuck when the webdriver is created. Am I missing something? I'm open to any suggestions, thanks!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver_path = r'D:\....\chromedriver.exe'
driver = webdriver.Chrome(executable_path=driver_path, options=chrome_options)
driver.get(...)  # left out the url
This is the output page I get without using ("--headless"): [screenshot of a blank window with 'data:,' in the address bar]
Funnily enough, I realized it was because my Chrome Developer Tools had become disabled. Not sure how, but when I re-enabled them, it worked perfectly again. Weird.
I have written the following code to log in to a website. So far it simply gets the webpage and accepts cookies, but when I try to log in by clicking the login button, the page hangs and the login page never loads.
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, ElementNotInteractableException

# Accept consent cookies
def accept_cookies(browser):
    try:
        browser.find_element_by_xpath('//*[@id="gdpr-banner-accept"]').click()
    except NoSuchElementException:
        print('Cookies already accepted')

# Webpage parameters
base_site = "https://www.ebay-kleinanzeigen.de/"

# Set up the remote-controlled browser
fireFoxOptions = webdriver.FirefoxOptions()
# fireFoxOptions.add_argument("--headless")
browser = webdriver.Firefox(executable_path='/home/Webdriver/bin/geckodriver', firefox_options=fireFoxOptions)
browser.get(base_site)
accept_cookies(browser)

# Click login pop-up
browser.find_elements_by_xpath("//*[contains(text(), 'Einloggen')]")[1].click()
Note: There are two login buttons (one in a pop-up and one in the page); I've tried both with the same result.
I have done similar things with other websites with no problem, so I am curious why it doesn't work here.
Any thoughts on why this might be, or how to get around it?
I modified your code a bit, adding a couple of optional arguments, and on execution I got the following result:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# A couple of optional arguments to reduce the obvious automation flags
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)

driver.get("https://www.ebay-kleinanzeigen.de/")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@id='gdpr-banner-accept']"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(text(), 'Einloggen')]"))).click()
Observation: My observation was similar to yours; the page hangs and the login page never loads.
Deep Dive
While inspecting the DOM tree of the webpage you will find that some of the <script> and <link> tags refer to JavaScript files whose paths contain the keyword dist. As an example:
<script type="text/javascript" async="" src="/static/js/lib/node_modules/@ebayk/prebid/dist/prebid.10o55zon5xxyi.js"></script>
window.BelenConf.prebidFileSrc = '/static/js/lib/node_modules/@ebayk/prebid/dist/prebid.10o55zon5xxyi.js';
This is a clear indication that the website is protected by the bot-management service provider Distil Networks, and that the navigation by ChromeDriver gets detected and subsequently blocked.
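As a rough programmatic check for the same fingerprint (a heuristic sketch, not a definitive Distil detector), you can scan the page source rendered by the driver in the snippet above for those dist script paths:

# Heuristic: look for bot-management script paths in the loaded page
if "/dist/" in driver.page_source:
    print("Page references /dist/ scripts: possibly bot-management JavaScript")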
Distil
As per the article There Really Is Something About Distil.it...:
Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.
Further,
"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium the a tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".
Reference
You can find a couple of detailed discussions in:
Unable to use Selenium to automate Chase site login
Webpage Is Detecting Selenium Webdriver with Chromedriver as a bot
Is there a version of selenium webdriver that is not detectable
from selenium import webdriver
from selenium_stealth import stealth
import time

# Reduce the obvious automation flags before launching Chrome
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# options.add_argument("--headless")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r"C:\Users\DIPRAJ\Programming\adclick_bot\chromedriver.exe")

# Patch fingerprinting surfaces with selenium-stealth
stealth(driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

# Verify against the bot-detection test page
url = "https://bot.sannysoft.com/"
driver.get(url)
time.sleep(5)
driver.quit()
I'm using Selenium and ChromeDriver to scrape data from a website.
I need to keep my account logged in after closing the driver; for this purpose I use the default Chrome profile every time.
Here you can see my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
urlpage = 'https://example.com/'
options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=C:\\Users\\MyName\\AppData\\Local\\Google\\Chrome\\User Data")
driver = webdriver.Chrome(options=options)
driver.get(urlpage)
The problem is that for some websites (e.g. https://projecteuler.net/) it works, so I'm still logged in in the following session, but for others (like https://www.fundraiso.ch, the one I need) it doesn't, although in the "normal" browser I'm still logged in after I close the window.
Does anyone know how to fix this problem?
EDIT:
I didn't mention that I can't automate the login because the website has a maximum number of logins, and if I exceed it the website will block my account.
I'm implementing a TikTok crawler using Selenium and Scrapy:
start_urls = ['https://www.tiktok.com/trending']
....
def parse(self, response):
    from fake_useragent import UserAgent
    options = webdriver.ChromeOptions()
    # Randomize the user agent for each request
    ua = UserAgent()
    options.add_argument(f'user-agent={ua.random}')
    options.add_argument('window-size=800,841')  # Chrome expects width,height separated by a comma
    driver = webdriver.Chrome(options=options)  # chrome_options= is deprecated in favour of options=
    driver.get(response.url)
The crawler opens Chrome, but it does not load the videos. [Screenshot: the page stuck loading in Chrome]
The same problem also happens with Firefox. [Screenshot: the page not loading in Firefox]
The same problem occurs with a simple Selenium-only script:
from selenium import webdriver
import time

# Firefox
driver = webdriver.Firefox()
driver.get("https://www.tiktok.com/trending")
time.sleep(10)
driver.close()

# Chrome
driver = webdriver.Chrome()
driver.get("https://www.tiktok.com/trending")
time.sleep(10)
driver.close()
Did you try to navigate further within the Selenium browser window? If a 404 error appears on subsequent pages, I have a solution that worked for me:
I simply changed my user agent to "Naverbot", which is allowed by TikTok's robots.txt file.
After changing that, all pages and videos loaded properly.
Other user agents listed under the "Allow" section of the robots.txt should work too, if you want to add a rotation; see the sketch below.
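A minimal sketch of that approach; the exact agent strings below are assumptions, so check TikTok's robots.txt for the currently allowed crawlers:

import random
from selenium import webdriver

# Hypothetical examples of agents allowed by TikTok's robots.txt
ALLOWED_AGENTS = [
    "Mozilla/5.0 (compatible; Naverbot/1.0; +http://help.naver.com/robots/)",
    "Mozilla/5.0 (compatible; Yeti/1.1; +http://naver.me/spd)",
]

options = webdriver.ChromeOptions()
options.add_argument(f"user-agent={random.choice(ALLOWED_AGENTS)}")
driver = webdriver.Chrome(options=options)
driver.get("https://www.tiktok.com/trending")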
You can use Windows IE instead of Chrome or Firefox.
Videos will load in IE, but IE's layout for showing the feed is somewhat different from Chrome's and Firefox's.
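For reference, a minimal sketch, assuming IEDriverServer is installed and on PATH:

from selenium import webdriver

driver = webdriver.Ie()  # requires IEDriverServer on PATH (Windows only)
driver.get("https://www.tiktok.com/trending")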
Reasons why your page is not loading:
A few advanced web apps check your browser history, profile data, and cache to verify the authenticity of the user.
One other thing you can do is run your default profile within Selenium; it could be helpful.
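A minimal sketch of that idea, assuming the default Windows Chrome profile location (the path and profile name are placeholders; adjust them for your machine):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument(r"user-data-dir=C:\Users\YourName\AppData\Local\Google\Chrome\User Data")  # hypothetical path
options.add_argument("profile-directory=Default")  # check chrome://version for the actual profile name
driver = webdriver.Chrome(options=options)
driver.get("https://www.tiktok.com/trending")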