I am using anti-captcha to bypass reCAPTCHA on a webpage I'm crawling.
I have managed to work out the API part of this solution; it's quite straightforward.
The part I am struggling with is injecting the token received from anti-captcha into the webpage.
I haven't found many resources on this. I am using Selenium and Python alongside the anticaptchaofficial module.
The script I am executing does change the innerHTML of the textarea with id g-recaptcha-response, but the webpage does nothing: the checkbox doesn't show the spinner or get verified.
Here's my code:
from anticaptchaofficial.recaptchav2proxyless import recaptchaV2Proxyless
from selenium import webdriver
import os
import time
driver = webdriver.Chrome(os.path.normpath(os.getcwd()+"\\chromedriver.exe"))
driver.get("https://www.google.com/recaptcha/api2/demo")
time.sleep(1)
data_sitekey = driver.find_element_by_class_name('g-recaptcha').get_attribute('data-sitekey')
solver = recaptchaV2Proxyless()
solver.set_verbose(1)
solver.set_key("<--my-key-->")
solver.set_website_url("https://www.google.com/recaptcha/api2/demo")
solver.set_website_key(data_sitekey)
g_response = solver.solve_and_return_solution()
driver.execute_script('document.getElementById("g-recaptcha-response").innerHTML = "{}";'.format(g_response))  # textarea that is supposed to receive the token, which I found upon some research
driver.execute_script("onSuccess('{}')".format(g_response))  # attempt to trigger the page's success callback as well
time.sleep(1)
It turns out I was under the assumption that the reCAPTCHA frame would show visible feedback when the token is injected (or on some other equivalent action), but in fact the single line:
driver.execute_script('document.getElementById("g-recaptcha-response").innerHTML = "{}";'.format(g_response))
which updates the textarea's innerHTML, is enough. So you just need to continue with your task, i.e. click submit if the reCAPTCHA is on a form, or reload the page if it was triggered at random.
from anticaptchaofficial.recaptchav2proxyless import recaptchaV2Proxyless
from selenium import webdriver
import os
import time
driver = webdriver.Chrome(os.path.normpath(os.getcwd()+"\\chromedriver.exe"))
driver.get("https://www.google.com/recaptcha/api2/demo")
time.sleep(1)
data_sitekey = driver.find_element_by_class_name('g-recaptcha').get_attribute('data-sitekey')
solver = recaptchaV2Proxyless()
solver.set_verbose(1)
solver.set_key("<--my-key-->")
solver.set_website_url("https://www.google.com/recaptcha/api2/demo")
solver.set_website_key(data_sitekey)
g_response = solver.solve_and_return_solution()
driver.execute_script('document.getElementById("g-recaptcha-response").innerHTML = "{}";'.format(g_response))
time.sleep(1)
# whatever the next step is. Could be clicking on a submit button
driver.refresh()
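If instead the reCAPTCHA sits on a form, as on the demo page above, the "next step" is simply submitting that form rather than refreshing. A minimal sketch, assuming the demo page's submit button keeps an id of recaptcha-demo-submit (check the markup of your own target page before relying on it):
# Submit the form once the token has been injected.
# "recaptcha-demo-submit" is an assumption about the demo page's markup.
driver.find_element_by_id("recaptcha-demo-submit").click()
time.sleep(2)  # give the site a moment to verify the token server-side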
I'm trying to automate a Duolingo login with Selenium using the code posted below.
While everything seems to work as expected at first, I always get a "Wrong password" message on the website after the login button is clicked.
I have checked the password time and time again and even changed it to one without special characters, but the login still fails.
I have seen in other examples that there is sometimes an additional password input field; however, I cannot find one while inspecting the HTML.
What could I be missing?
(Side note: I'm also open to a completely different solution without a webdriver, since I really only want to get to the duolingo.com/learn page to scrape some data, but so far I haven't found an alternative way to log in.)
The code used:
from selenium import webdriver
from time import sleep
url = "https://www.duolingo.com/"
def login():
driver = webdriver.Chrome()
driver.get(url)
sleep(2)
hve_acnt_btn = driver.find_element_by_xpath("/html/body/div/div/div/span[1]/div/div[1]/div[2]/div/div[2]/a")
hve_acnt_btn.click()
sleep(2)
email_input = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div[2]/form/div[1]/div/label[1]/div/input")
email_input.send_keys("email@email.com")
sleep(2)
pwd_input = driver.find_element_by_css_selector("input[type=password]")
pwd_input.clear()
pwd_input.send_keys("password")
sleep(2)
login_btn = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div[2]/form/div[1]/button")
login_btn.click()
sleep(5)
login()
I couldn't post the website's HTML because of the character limit, so here is the link to the Duolingo page: https://www.duolingo.com/
Switch to Firefox or another browser which does not tell the page that you are visiting it in an automated way. See my earlier answer for a very similar issue here: https://stackoverflow.com/a/57778034/8375783
Long story short: when you start Chrome through Selenium, it runs with navigator.webdriver=true. You can check it in the console. Pages can detect that flag and block login or other actions, hence the invalid login. It is a read-only flag set by the browser during startup.
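You can read the flag straight from the driver to confirm this; a minimal sketch, using the same old-style Selenium API as the question:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.duolingo.com/")
# Prints True in automated Chrome -- this is what the page can detect
print(driver.execute_script("return navigator.webdriver"))
driver.quit()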
With Chrome I couldn't log in to Duolingo either. After I switched the driver to Firefox, the very same code just worked.
Also, if I may recommend, try to use XPath with attributes.
Instead of this:
hve_acnt_btn = driver.find_element_by_xpath("/html/body/div/div/div/span[1]/div/div[1]/div[2]/div/div[2]/a")
You can use:
hve_acnt_btn = driver.find_element_by_xpath('//*[@data-test="have-account"]')
Same goes for:
email_input = driver.find_element_by_xpath("/html/body/div[1]/div[3]/div[2]/form/div[1]/div/label[1]/div/input")
vs:
email_input = driver.find_element_by_xpath('//input[@data-test="email-input"]')
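Putting the two suggestions together, a minimal sketch of the login with Firefox and attribute-based XPaths could look like the following. Note that the "password-input" and "login-button" data-test values are assumptions based on the same naming pattern; verify them against the page's actual HTML before relying on them.
from selenium import webdriver
from time import sleep

driver = webdriver.Firefox()
driver.get("https://www.duolingo.com/")
sleep(2)
driver.find_element_by_xpath('//*[@data-test="have-account"]').click()
sleep(2)
driver.find_element_by_xpath('//input[@data-test="email-input"]').send_keys("email@email.com")
# "password-input" and "login-button" are assumed data-test values
driver.find_element_by_xpath('//input[@data-test="password-input"]').send_keys("password")
driver.find_element_by_xpath('//button[@data-test="login-button"]').click()
sleep(5)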
driver.page_source doesn't return all of the source code. It prints some parts of the code in detail, but a big part of the code is missing. How can I fix this?
This is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
def htmlToLuna():
url ='https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A'
driver = webdriver.Chrome('C:\\Python27\\chromedriver\\chromedriver.exe')
driver.get(url)
web=open('web.txt','w')
web.write(driver.page_source)
print driver.page_source
web.close()
print htmlToLuna()
Here is some simple code: all it does is open the URL, get the length of the page source, wait five seconds, and then get the length of the page source again.
from selenium import webdriver
import time

if __name__ == "__main__":
    browser = webdriver.Chrome()
    browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
    initial = len(browser.page_source)
    print(initial)
    time.sleep(5)
    new_source = browser.page_source
    print(len(new_source))
see the output:
15722
48800
You see that the length of the page source increases after a wait? You must make sure that the page is fully loaded before getting the source. But this is not a proper implementation, since it blindly waits.
Here is a nicer way to do this: the browser will wait until the element of your choice is found. The timeout is set to 10 seconds.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

if __name__ == "__main__":
    browser = webdriver.Chrome()
    browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
    try:
        # wait up to 10 seconds for the editor's textarea to appear
        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.CodeMirror > div:nth-child(1) > textarea:nth-child(1)')))
        print("Result:")
        print(len(browser.page_source))
    except TimeoutException:
        print("Your exception message here!")
The output: Result: 52195
Reference:
https://stackoverflow.com/a/26567563/7642415
http://selenium-python.readthedocs.io/locating-elements.html
Hold on! Even that won't guarantee you get the full page source, since individual elements are loaded dynamically. If the browser finds the element, it moves on. So make sure you wait for the right element, one that only appears once the page has loaded fully.
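As an additional check you can also wait for the document itself to report that it has finished loading before grabbing the source; this is only a sketch and complements waiting for a specific element, but it still does not cover content that is fetched later by JavaScript:
# Wait until the browser reports the document as fully loaded
WebDriverWait(browser, 10).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)
print(len(browser.page_source))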
P.S. Mine is Python 3 and the webdriver is on my environment PATH, so my code needs to be modified a bit to work on Python 2.x; I guess only the print statements need changing.
I am trying to automate a booking process on a travel site using splinter and am having trouble clicking on a CSS element on the page.
This is my code
import splinter
import time
secret_deals_email = {
'user[email]': 'adf#sad.com'
}
browser = splinter.Browser()
url = 'http://roomer-qa-1.herokuapp.com'
browser.visit(url)
click_FIND_ROOMS = browser.find_by_css('.blue-btn').first.click()
time.sleep(10)
# click_Book_button = browser.find_by_css('.book-button-row.blue-btn').first.click()
browser.fill_form(secret_deals_email)
click_get_secret_deals = browser.find_by_name('button').first.click()
time.sleep(10)
click_book_first_room_list = browser.find_by_css('.book-button-row-link').first.click()
time.sleep(5)
click_book_button_entry = browser.find_by_css('.entry-white-box.entry_box_no_refund').first.click()
The problem is that whenever I run it and the code gets to the page where I need to choose the sort of purchase I would like, I can't click any of the options on the page.
I keep getting an error that the element does not exist, no matter what. What should I do?
http://roomer-qa-1.herokuapp.com/hotels/atlanta-hotels/ramada-plaza-atlanta-downtown-capitol-park.h30129/44389932?rate_plan_id=1&rate_plan_token=6b5aad6e9b357a3d9ff4b31acb73c620&
This is the link to the page that is causing me trouble, please help :).
You need to wait until the element is present on the page. You can use splinter's is_element_present_by_css method with a while loop to do that:
while not browser.is_element_present_by_css('.entry-white-box.entry_box_no_refund'):
    time.sleep(1)
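Alternatively, if your splinter version supports it, is_element_present_by_css takes a wait_time argument, which avoids the manual loop; a sketch using the same selector:
# Poll for up to 30 seconds before giving up (wait_time is in seconds)
if browser.is_element_present_by_css('.entry-white-box.entry_box_no_refund', wait_time=30):
    browser.find_by_css('.entry-white-box.entry_box_no_refund').first.click()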
I am scraping a website with a lot of JavaScript that is generated when the page is called. As a result, traditional web scraping methods (BeautifulSoup, etc.) are not working for my purposes (at least I have been unsuccessful in getting them to work; all of the important data is in the JavaScript parts), so I have started using Selenium webdriver. I need to scrape a few hundred pages, each of which has between 10 and 80 data points (each with about 12 fields), so it is important that this script (is that the right terminology?) can run for quite a while without me having to babysit it.
The problem is that sometimes the JavaScript portions of the page load and sometimes they don't (~1 in 7). When they don't, a refresh fixes things, but occasionally the refresh will freeze webdriver, and thus the Python runtime environment as well. Annoyingly, when it freezes like this, the code fails to time out. What is going on?
Here is a stripped down version of my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time, re, random, csv
from collections import namedtuple
def main(url_full):
driver = webdriver.Firefox()
driver.implicitly_wait(15)
driver.set_page_load_timeout(30)
#create HealthPlan namedtuple
HealthPlan = namedtuple( "HealthPlan", ("State, County, FamType, Provider, PlanType, Tier,") +
(" Premium, Deductible, OoPM, PrimaryCareVisitCoPay, ER, HospitalStay,") +
(" GenericRx, PreferredPrescription, RxOoPM, MedicalDeduct, BrandDrugDeduct"))
#check whether the page has loaded and handle page load and time out errors
pageNotLoaded= bool(True)
while pageNotLoaded:
try:
driver.get(url_full)
time.sleep(6+ abs(random.normalvariate(1.8,3)))
except TimeoutException:
driver.quit()
time.sleep(3+ abs(random.normalvariate(1.8,3)))
driver.get(url_full)
time.sleep(6+ abs(random.normalvariate(1.8,3)))
# Handle page load error by testing presence of showAll,
# an important feature of the page, which only appears if everything else loads
try:
driver.find_element_by_xpath('//*[@id="showAll"]').text
# catch NoSuchElementException=>refresh page
except NoSuchElementException:
try:
driver.refresh()
# catch TimeoutException => quit and load the page
# in a new instance of firefox,
# I don't think the code ever gets here, because it freezes in the refresh
# and will not throw the timeout exception like I would like
except TimeoutException:
driver.quit()
time.sleep(3+ abs(random.normalvariate(1.8,3)))
driver.get(url_full)
time.sleep(6+ abs(random.normalvariate(1.8,3)))
pageNotLoaded= False
scrapePage() # this is a dummy function, everything from here down works fine,
I have looked extensively for similar problems, and I do not think anyone else has posted about this on SO, or anywhere else that I have looked. I am using Python 2.7 and Selenium 2.39.0, and I am trying to scrape Healthcare.gov's "get premium estimate" pages.
EDIT: (as an example, this page) It may also be worth mentioning that the page fails to load completely more often when the computer has been on / doing this for a while (I'm guessing that the free RAM is getting full and it glitches while loading). This is kind of beside the point, though, because this should be handled by the try/except.
EDIT 2: I should also mention that this is being run on Windows 7 64-bit, with Firefox 17 (which I believe is the newest supported version).
Dude, time.sleep is a fail!
What's this?
time.sleep(3+ abs(random.normalvariate(1.8,3)))
Try this:
class TestPy(unittest.TestCase):
    def setUp(self):
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(30)
Or this:
(self.)driver.implicitly_wait(10)
Or this:
WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath('some_xpath'))
Or, instead of driver.refresh(), you can trick it:
driver.get(your_url)
Also, you can clear the cookies:
driver.delete_all_cookies()
And for scrapePage() (the dummy function where "everything from here down works fine"), have a look at Scrapy:
http://scrapy.org
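Putting these suggestions together, a rough sketch of the page-load handling might look like this. url_full and the showAll element come from the question; waiting on showAll as the signal that the page is fully loaded, and three retries, are assumptions you should adapt:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def load_page(url_full):
    driver = webdriver.Firefox()
    driver.set_page_load_timeout(30)
    driver.delete_all_cookies()
    for attempt in range(3):
        try:
            driver.get(url_full)  # re-issue get() instead of refresh()
            # wait for the showAll element instead of sleeping blindly
            WebDriverWait(driver, 20).until(
                EC.presence_of_element_located((By.ID, "showAll")))
            return driver
        except TimeoutException:
            # quit the frozen instance and start a fresh one
            driver.quit()
            driver = webdriver.Firefox()
    raise RuntimeError("page failed to load after 3 attempts")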