I wanted to build a semi-automatic solution for scraping a website protected by Cloudflare's hCaptcha. My idea was to solve the captcha manually whenever it appears and then let my scraper run for some time until another captcha has to be solved.
To try out this approach, I open the URL with Selenium while trying to mask it as a regular user:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium_stealth import stealth
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
        )
driver.get(url_to_scrape) # Fill the captcha manually
I want to get to the actual website after solving the captcha so I can scrape some info from it. The problem is that even when I solve the captcha, Cloudflare doesn't let me see the site: it just reloads the captcha page (with response 403) and makes me solve another one, then another, and another, etc.
What am I doing wrong? There shouldn't be any problem with me solving the captcha, so it must somehow be detecting Selenium as a bot. I thought that with the snippet above the website wouldn't see Selenium any differently from a normal user on Chrome, but clearly I'm missing something.
Without the site URL it is impossible to tell exactly what is happening, but from previous experience I believe the hCaptcha prompt is probably appearing as part of the site's protection layer and may not come from the site itself.
If it's appearing as a result of the site protection, then start your browser using your own profile:
$browser = Start-SeDriver -Browser Chrome -Arguments "--user-data-dir=C:\Users\$($env:username)\AppData\Local\Google\Chrome\User Data"
$browser.Navigate().GoToURL("https://google.com")
...then run the remaining part of your code to scrape the site.
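Since the question is in Python, the same idea there would look roughly like this (a sketch: the path assumes a default Windows Chrome installation, and Chrome must not already be running with that profile, since Chrome locks the profile directory):

import getpass
from selenium import webdriver

# Point Chrome at your real user profile so the session you solved
# manually is reused; adjust the path for your system
options = webdriver.ChromeOptions()
options.add_argument(
    rf"--user-data-dir=C:\Users\{getpass.getuser()}\AppData\Local\Google\Chrome\User Data"
)
driver = webdriver.Chrome(options=options)
driver.get("https://google.com")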
I am using selenium and python in order to scrape data on a website.
The problem is I need to manually log in because there is a CAPTCHA after the login.
My question is the following: is there a way to start the program on a page that is already loaded? (For example, here I would log in to the website, solve the CAPTCHA manually, and then launch the program that would scrape the data.)
Note: I have already been looking for an answer on SO but did not find it, might have missed it as it seems to be an obvious question.
Don't open the browser in headless mode; open it in headed (visible) mode so you can solve the CAPTCHA yourself.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
options = Options()
options.headless = False # Set false here
driver = webdriver.Chrome(options=options, executable_path=r'C:\path\to\chromedriver.exe')
driver.get("http://google.com/")
print ("Headless Chrome Initialized")
time.sleep(30) # wait 30 seconds, this should give enough time to manually do the capture
# do other code here
driver.quit()
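If 30 seconds isn't always enough, a variant of the same idea (a sketch) is to block on input() so the script resumes only once you have solved the captcha and pressed Enter:

from selenium import webdriver

driver = webdriver.Chrome()  # headed mode is the default
driver.get("https://example.com/")  # hypothetical target page
# Block until the captcha is solved by hand, however long that takes
input("Solve the captcha in the browser, then press Enter here...")
# ...continue scraping here...
driver.quit()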
I'm using Selenium and ChromeDriver to scrape data from a website.
I need to keep my account logged in after closing the driver; for this purpose I use the default Chrome profile every time.
Here you can see my code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
urlpage = 'https://example.com/'
options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=C:\\Users\\MyName\\AppData\\Local\\Google\\Chrome\\User Data")
driver = webdriver.Chrome(options=options)
driver.get(urlpage)
The problem is that for some websites (e.g. https://projecteuler.net/) it works, so I'm still logged in during the following session, but for others (like https://www.fundraiso.ch, the one I need) it doesn't, although in the "normal" browser I'm still logged in after closing the window.
Does anyone know how to fix this problem?
EDIT:
I didn't mention that I can't automate the login because the website has a maximum number of logins, and if I exceed it the website will block my account.
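One thing worth trying (an untested suggestion, not verified against fundraiso.ch): select the specific profile folder inside the user data directory as well, since user-data-dir alone picks the directory while profile-directory picks the profile:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=C:\\Users\\MyName\\AppData\\Local\\Google\\Chrome\\User Data")
# "Default" is the usual profile folder; a named profile may be e.g.
# "Profile 1" (check chrome://version in the normal browser)
options.add_argument("profile-directory=Default")
driver = webdriver.Chrome(options=options)
driver.get("https://www.fundraiso.ch")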
I'm trying to bypass the web.whatsapp.com QR scan page. This is the code I used so far:
options = webdriver.ChromeOptions()
options.add_argument('--user-data-dir=./User_Data')
driver = webdriver.Chrome(options=options)
driver.get('https://web.whatsapp.com/')
On the first attempt I have to manually scan the QR code, and on later attempts it doesn't ask for the QR code.
HOWEVER, if I try to do the same after adding the line options.add_argument("--headless"), I get Error writing DevTools active port to file. I tried at least a dozen different Google-search solutions, but none of them are working. Any help on this would be highly appreciated! Thank you.
I have tried a bunch of different arguments in different combinations so far, but nothing worked:
options = Options()  # uncomment for local debugging
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-setuid-sandbox')
options.add_argument('--remote-debugging-port=9222')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu') # Last I checked this was necessary.
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('--user-data-dir=./User_Data')
driver = webdriver.Chrome('chromedriver.exe', options=options)
driver.get('https://web.whatsapp.com/')
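One combination worth trying before anything else (a sketch, not verified for web.whatsapp.com specifically): use an absolute path for user-data-dir, and on Chrome 109+ the rewritten headless mode, both of which are commonly reported fixes for the DevTools active port error:

import os
from selenium import webdriver

options = webdriver.ChromeOptions()
# An absolute profile path avoids ambiguity about the working directory
options.add_argument(f"--user-data-dir={os.path.abspath('User_Data')}")
# Chrome 109+ ships a rewritten headless mode that behaves more like a
# regular browser; whether your Chrome supports it is an assumption here
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)
driver.get("https://web.whatsapp.com/")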
Recently I made a WhatsApp bot and had the same problems. After searching for a long time, I came up with this solution.
The first problem was the browser's cached profile data: if the QR-code session isn't cached in the browser's AppData, it will keep waiting for you to scan it.
So in my program I used the following function to build the profile path:
import os
import re

def apdata_path():
    # %APPDATA% is e.g. C:\Users\<name>\AppData\Roaming; keep everything
    # up to and including "AppData\" and append the profile folder
    apdata = os.getenv('APPDATA')
    base = re.findall(r".+?AppData\\", apdata)[0]
    return "user-data-dir=" + base + r'Local\Chromium\User Data\Default'
Here it first finds the AppData path (C:\Users\<name>\AppData\) and then concatenates the rest of the path down to the profile folder; in my case I used Chromium. In your case it will be:
C:\Users\<name>\AppData\Local\Google\Chrome\User Data\Default
There's probably a better way to find the profile data path; for example, something like this should work using only the standard library (an untested sketch; the Chromium folder mirrors the path above):
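import os

def apdata_path():
    # %LOCALAPPDATA% already points at C:\Users\<name>\AppData\Local,
    # so no regex parsing of %APPDATA% is needed
    return "user-data-dir=" + os.path.join(
        os.getenv("LOCALAPPDATA"), "Chromium", "User Data", "Default")

After finding the path, I set up the driver: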
def chrome_driver(user_agent=0):
    usr_path = apdata_path()
    # file_path() is a helper of mine (defined elsewhere) that returns
    # the folder holding the portable Chromium build
    chrome_path = file_path() + '\\Chromium 85\\bin\\chrome.exe'
    options = webdriver.ChromeOptions()
    options.binary_location = r"{}".format(chrome_path)
    if user_agent != 0:
        options.add_argument('--headless')
        options.add_argument('--hide-scrollbars')
        options.add_argument('--disable-gpu')
        options.add_argument("--log-level=3")
        options.add_argument('--user-agent={}'.format(user_agent))
    options.add_argument(usr_path)
    driver = webdriver.Chrome('chromedriver.exe', options=options)
    return driver
Here I had another problem: sometimes Selenium wouldn't work because WhatsApp validates the user agent to verify that the browser version is compatible. I don't know much about it and reached this conclusion by trial and error, so maybe this is not the real explanation, but it worked for me.
So, in my bot I made a start function to get the user agent, do the first QR scan, and keep the session in the browser cache:
def whatsapp_QR():
    driver = chrome_driver()
    user_agent = driver.execute_script("return navigator.userAgent;")
    driver.get("https://web.whatsapp.com/")
    print("Scan QR Code, And then Enter")
    input()
    print("Logged In")
    driver.close()
    return user_agent
After all that, my bot worked; not perfectly, but it ran smoothly, and I was able to send messages to my test group in headless mode.
To summarize: reuse the profile cache in AppData to bypass the QR code (but you will need to run once without headless to create the first cache), then pass the captured user agent to get past WhatsApp's validation. The whole option set would look like this:
usr_path = r"user-data-dir=C:\Users\<name>\AppData\Local\Google\Chrome\User Data\Default"
options.add_argument('--headless')
options.add_argument('--hide-scrollbars')
options.add_argument('--disable-gpu')
options.add_argument("--log-level=3")
options.add_argument('--user-agent={}'.format(user_agent))  # user agent for validation
options.add_argument(usr_path)  # AppData user profile, to bypass the QR code
How can I save all cookies in Python's Selenium WebDriver to a .txt file, and then load them later?
The documentation doesn't say much of anything about the getCookies function.
You can save the current cookies as a Python object using pickle. For example:
import pickle
import selenium.webdriver

driver = selenium.webdriver.Firefox()
driver.get("http://www.google.com")
with open("cookies.pkl", "wb") as f:
    pickle.dump(driver.get_cookies(), f)
And later to add them back:
import pickle
import selenium.webdriver

driver = selenium.webdriver.Firefox()
driver.get("http://www.google.com")
with open("cookies.pkl", "rb") as f:
    cookies = pickle.load(f)
for cookie in cookies:
    driver.add_cookie(cookie)
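Since the question asks for a .txt file specifically: the cookies returned by get_cookies() are plain dicts, so they serialize to JSON just as well, which gives you a human-readable text file (same idea as above, just a different format):

import json
import selenium.webdriver

driver = selenium.webdriver.Firefox()
driver.get("http://www.google.com")

# Save the cookies as readable text
with open("cookies.txt", "w") as f:
    json.dump(driver.get_cookies(), f)

# Load them back later (navigate to the same domain first)
with open("cookies.txt") as f:
    for cookie in json.load(f):
        driver.add_cookie(cookie)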
When you need cookies from session to session, there is another way to do it: use the Chrome option user-data-dir to use folders as profiles. I run:
# You need to: from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("user-data-dir=selenium")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.google.com")
Here you can do the logins that check for human interaction. I do this once, and then every time I start the webdriver with that folder, everything is already in there. You can also manually install extensions and have them in every session.
The second time I run, all the cookies are there:
# You need to: from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("user-data-dir=selenium")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.google.com")  # Now you can see the cookies, the settings, extensions, etc., and the logins done in the previous session are present here.
The advantage is that you can use multiple folders with different settings, cookies, and extensions, without needing to load and unload cookies, install and uninstall extensions, change settings, or redo logins in code, and with no way for that logic to break the program.
It is also faster than doing it all in code.
Remember, you can only add a cookie for the current domain.
If you want to add a cookie for your Google account, do
browser.get('http://google.com')
for cookie in cookies:
browser.add_cookie(cookie)
Just a slight modification of the code written by Roel Van de Paar; all credit goes to him. I am using this on Windows and it is working perfectly, both for setting and adding cookies:
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--user-data-dir=chrome-data")
driver = webdriver.Chrome('chromedriver.exe', options=chrome_options)
driver.get('https://web.whatsapp.com')  # Already authenticated
time.sleep(30)
Based on the answer by Eduard Florinescu, but with newer code and the missing imports added:
$ cat work-auth.py
#!/usr/bin/python3
# Setup:
# sudo apt-get install chromium-chromedriver
# sudo -H python3 -m pip install selenium
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--user-data-dir=chrome-data")
driver = webdriver.Chrome('/usr/bin/chromedriver',options=chrome_options)
chrome_options.add_argument("user-data-dir=chrome-data")
driver.get('https://www.somedomainthatrequireslogin.com')
time.sleep(30) # Time to enter credentials
driver.quit()
$ cat work.py
#!/usr/bin/python3
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--user-data-dir=chrome-data")
driver = webdriver.Chrome('/usr/bin/chromedriver',options=chrome_options)
driver.get('https://www.somedomainthatrequireslogin.com') # Already authenticated
time.sleep(10)
driver.quit()
Ideally it would be better not to copy the directory in the first place, but this is very hard; see:
How to Prevent Selenium 3.0 (Geckodriver) from Creating Temporary Firefox Profiles?
how do I use an existing profile in-place with Selenium Webdriver?
Also
Can't use existing Firefox profile in Selenium WebDriver using C# (similar solution to the solution below)
This is a solution that saves the profile directory for Firefox (the analogue of user-data-dir in Chrome). It involves manually copying the directory around; I haven't been able to find another way.
It was tested on Linux.
Short version:
To save the profile
driver.execute_script("window.close()")
time.sleep(0.5)
currentProfilePath = driver.capabilities["moz:profile"]
profileStoragePath = "/tmp/abc"
shutil.copytree(currentProfilePath, profileStoragePath,
                ignore_dangling_symlinks=True
                )
To load the profile
driver = Firefox(executable_path="geckodriver-v0.28.0-linux64",
                 firefox_profile=FirefoxProfile(profileStoragePath)
                 )
Long version (with demonstration that it works and a lot of explanation -- see comments in the code)
The code uses localStorage for demonstration, but it works with cookies as well.
#initial imports
from selenium.webdriver import Firefox, FirefoxProfile
import shutil
import os.path
import time
# Create a new profile
driver = Firefox(executable_path="geckodriver-v0.28.0-linux64",
                 # I'm using this particular version. If yours is named
                 # "geckodriver" and placed in the system PATH,
                 # then this argument is not necessary.
                 )
# Navigate to an arbitrary page and set some local storage
driver.get("https://DuckDuckGo.com")
assert driver.execute_script(r"""{
const tmp = localStorage.a; localStorage.a="1";
return [tmp, localStorage.a]
}""") == [None, "1"]
# Make sure that the browser writes the data to profile directory.
# Choose one of the below methods
if 0:
    # Wait for some time for Firefox to flush the local storage to disk.
    # It's a long time. I tried 3 seconds and it doesn't work.
    time.sleep(10)
elif 1:
    # Alternatively:
    driver.execute_script("window.close()")
    # NOTE: It might not work if there are multiple windows!
    # Wait for a bit for the browser to clean up (shutil.copytree might
    # throw some weird error if the source directory changes while copying)
    time.sleep(0.5)
else:
    # I haven't been able to find any other, more elegant way:
    # close() and quit() both delete the profile directory
    pass
# Copy the profile directory (must be done BEFORE driver.quit()!)
currentProfilePath = driver.capabilities["moz:profile"]
assert os.path.isdir(currentProfilePath)
profileStoragePath = "/tmp/abc"
try:
    shutil.rmtree(profileStoragePath)
except FileNotFoundError:
    pass
shutil.copytree(currentProfilePath, profileStoragePath,
                ignore_dangling_symlinks=True  # there's a lock file in the
                # profile directory that symlinks
                # to some IP address + port
                )
driver.quit()
assert not os.path.isdir(currentProfilePath)
# Selenium cleans up properly if driver.quit() is called,
# but not necessarily if the object is destructed
# Now reopen it with the old profile
driver = Firefox(executable_path="geckodriver-v0.28.0-linux64",
                 firefox_profile=FirefoxProfile(profileStoragePath)
                 )
# Note that the profile directory is **copied** -- see FirefoxProfile documentation
assert driver.profile.path!=profileStoragePath
assert driver.capabilities["moz:profile"]!=profileStoragePath
# Confusingly...
assert driver.profile.path!=driver.capabilities["moz:profile"]
# And only the latter is updated.
# To save it again, use the same method as previously mentioned
# Check the data is still there
driver.get("https://DuckDuckGo.com")
data = driver.execute_script(r"""return localStorage.a""")
assert data=="1", data
driver.quit()
assert not os.path.isdir(driver.capabilities["moz:profile"])
assert not os.path.isdir(driver.profile.path)
What doesn't work:
Initialize Firefox(capabilities={"moz:profile": "/path/to/directory"}) -- the driver will not be able to connect.
options=Options(); options.add_argument("profile"); options.add_argument("/path/to/directory"); Firefox(options=options) -- same as above.
Try this method:
import os
import pickle
import time

from selenium import webdriver

driver = webdriver.Chrome(executable_path="chromedriver.exe")
URL = "SITE URL"
driver.get(URL)
time.sleep(10)

if os.path.exists('cookies.pkl'):
    cookies = pickle.load(open("cookies.pkl", "rb"))
    for cookie in cookies:
        driver.add_cookie(cookie)
    driver.refresh()
    time.sleep(5)

# check if you still need to log in
# if yes:
#     write the login code here
# when the login succeeds, save the cookies using:
pickle.dump(driver.get_cookies(), open("cookies.pkl", "wb"))
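The "check if you still need to log in" step depends on the site; one common approach (a sketch with a hypothetical element id, not part of the original answer) is to test for the presence of the login form:

from selenium.webdriver.common.by import By

# "login-form" is a hypothetical id; use whatever element reliably
# appears only on the target site's login page
if driver.find_elements(By.ID, "login-form"):
    # ...perform the login here...
    pickle.dump(driver.get_cookies(), open("cookies.pkl", "wb"))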
This is code I used on Windows. It works.
for item in COOKIES.split(';'):
    name, value = item.split('=', 1)
    name = name.replace(' ', '').replace('\r', '').replace('\n', '')
    value = value.replace(' ', '').replace('\r', '').replace('\n', '')
    cookie_dict = {
        'name': name,
        'value': value,
        'domain': '',  # Google Chrome
        'expires': '',
        'path': '/',
        'httpOnly': False,
        'HostOnly': False,
        'Secure': False,
    }
    self.driver_.add_cookie(cookie_dict)
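For context, COOKIES here is the raw cookie header string copied out of the browser, something like this (hypothetical values):

COOKIES = "sessionid=abc123; csrftoken=xyz789; theme=dark"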
Use this code to store the login session of any website, like Google, Facebook, etc.:
import undetected_chromedriver as uc
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=C:/Users/salee/AppData/Local/Google/Chrome/User Data/Profile 1")
browser = uc.Chrome(use_subprocess=True, options=options)
For my case, the accepted answer is almost there. For people who have had no luck with the answers above, you are welcome to try my way.
Before you start coding, make sure that the website actually uses cookies for authentication.
Steps:
Open your browser (I am using Chrome here) and log in to your website.
Check the values of the cookies in your browser's developer tools.
Open another browser in incognito mode and go to your website (at this stage, the website should still prompt you with the login page).
Modify the incognito cookies' values to match the first browser's cookie values accordingly (the first browser must be authenticated to your website).
Refresh the incognito browser; it should bypass the login page.
The steps above are how I made sure that adding cookies can authenticate to my website.
Now for the coding part; it is almost the same as the accepted answer. The only problem for me with the accepted answer is that I ended up with double the number of cookies.
The pickle.dump part had no issue for me, so I will go straight to the add-cookie part.
import pickle
import selenium.webdriver
driver = selenium.webdriver.Chrome()
driver.get("http://your.website.com")
cookies = pickle.load(open("cookies.pkl", "rb"))
# I delete the cookies first because I found duplicates when inspecting them in the browser, as in step 2
driver.delete_all_cookies()
for cookie in cookies:
driver.add_cookie(cookie)
driver.refresh()
You can use step 2 to check whether the cookies you add with the code are working correctly.
Hope it helps.