GoogleCaptcha roadblock in website scraper

I am currently working on a scraper for aniworld.to.
My goal is to enter the anime name and have all of its episodes downloaded.
I have everything working except one thing.
The website has a Watch button. That button redirects you to https://aniworld.to/redirect/SOMETHING, and that page has a captcha, which means the link is not in the HTML.
Is there a way to bypass this and get the link in Python? Or a way to display the captcha so I can solve it myself?
The captcha only appears very rarely.
The only thing I need from that page is the redirect link. It looks like this:
https://vidoza.net/embed-something.html
My very much work-in-progress code is here if it helps: https://github.com/wolfswolke/aniworld_scraper

Mitchdu showed me how to do it.
If anyone else needs help here is my code: https://github.com/wolfswolke/aniworld_scraper/blob/main/src/logic/captcha.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from threading import Thread
import os
def open_captcha_window(full_url):
    working_dir = os.getcwd()
    path_to_ublock = r'{}\extensions\ublock'.format(working_dir)
    options = webdriver.ChromeOptions()
    # Open the redirect page in a small app-style window so the captcha can be solved by hand.
    options.add_argument("app=" + full_url)
    options.add_argument("window-size=423,705")
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    # Load uBlock only if the unpacked extension is present.
    if os.path.exists(path_to_ublock):
        options.add_argument('load-extension=' + path_to_ublock)
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get(full_url)
    # Poll every 0.3 s, for up to 100 s, until the browser has left the redirect page.
    wait = WebDriverWait(driver, 100, 0.3)
    wait.until(lambda redirect: redirect.current_url != full_url)
    new_page = driver.current_url
    # Close the window in a background thread so the link is returned right away.
    Thread(target=threaded_driver_close, args=(driver,)).start()
    return new_page
def threaded_driver_close(driver):
    driver.close()
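The `WebDriverWait(driver, 100, 0.3)` call above polls its condition every 0.3 seconds for up to 100 seconds. As a rough illustration of that polling idea only (this is not Selenium's actual implementation), here is a minimal pure-Python sketch. `poll_until` and `FakeDriver` are hypothetical names made up for this demonstration; the fake driver simply pretends that the captcha gets solved after a few polls.

```python
import time


def poll_until(condition, timeout=100.0, interval=0.3):
    """Call condition() repeatedly until it is truthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met within {} s".format(timeout))
        time.sleep(interval)


class FakeDriver:
    """Hypothetical stand-in for a browser whose URL changes after a few polls."""

    def __init__(self, start_url, final_url, polls_before_redirect=3):
        self._start, self._final = start_url, final_url
        self._polls_left = polls_before_redirect

    @property
    def current_url(self):
        # Report the redirect page a few times, then the final target.
        if self._polls_left > 0:
            self._polls_left -= 1
            return self._start
        return self._final


start = "https://aniworld.to/redirect/SOMETHING"
driver = FakeDriver(start, "https://vidoza.net/embed-something.html")
poll_until(lambda: driver.current_url != start, timeout=5, interval=0.01)
redirect_link = driver.current_url
```

With a real browser the condition would be the same lambda as in the answer; only the object being polled differs.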

Related

Get full data from HTML page python

I am trying to download thousands of HTML pages in order to parse them. I tried it with selenium but the downloaded file does not contain all the text seen in the browser.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
chrome_options = Options()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
for url in URL_list:
    browser.get(url)
    content = browser.page_source
    with open(DOWNLOAD_PATH + file_name + ".html", "w", encoding='utf-8') as file:
        file.write(str(content))
browser.close()
But the HTML file I get doesn't contain all the content I see in the browser for the same page. For example, text I see on the screen is not found in the HTML file. Only when I right-click the page in the browser and choose "Save As" do I get the full page.
URL example - https://www.camoni.co.il/411788/1Jacob
thank you

Be aware that running the webdriver in headless mode may not give the same results as a regular browser window. For a quick fix, I suggest scraping the page source without the --headless option.
The other approach is to wait for certain elements to be located before reading the source.
I suggest reading up on Expected Conditions and explicit waits for that.
Here's a function I prepared to illustrate the idea:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def awaitCertainElements_andGetSource(driver):
    # Wait up to 5 seconds for each crucial element to become visible.
    wait = WebDriverWait(driver, 5)
    wait.until(EC.visibility_of_element_located((By.XPATH, "//*[text() = 'some text that is crucial for you']")))
    wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@id='some-id']")))
    # page_source returns the DOM as it is currently rendered.
    return driver.page_source
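The download loop in the question writes each page to `DOWNLOAD_PATH + file_name + ".html"`, so `file_name` has to differ for every URL or the pages will overwrite each other. One possible way to derive it from the URL is sketched below. `url_to_file_name` is a hypothetical helper, not something from the original post, and it ignores query strings, so two URLs that differ only in their query would still collide.

```python
import re
from urllib.parse import urlparse


def url_to_file_name(url):
    """Turn a URL into a file-system-friendly name (without extension)."""
    parsed = urlparse(url)
    raw = parsed.netloc + parsed.path
    # Collapse every run of characters other than letters, digits, dot or dash.
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return safe or "index"


file_name = url_to_file_name("https://www.camoni.co.il/411788/1Jacob")
```

Inside the loop this would be called once per `url`, just before opening the output file.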

How to get a full-page screenshot in Python using Selenium and Screenshot

I'm trying to get a full-length screenshot and haven't been able to make it work. Here's the code I'm using:
from Screenshot import Screenshot
from selenium import webdriver
import time
ob = Screenshot.Screenshot()
driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(10)
url = "https://stackoverflow.com/questions/73298355/how-to-remove-duplicate-values-in-one-column-but-keep-the-rows-pandas"
driver.get(url)
img_url = ob.full_Screenshot(driver, save_path=r'.', image_name='example.png')
print(img_url)
driver.quit()
But this gives us a clipped screenshot: it contains only what the driver window is showing, not the full length of the page. How can I tweak this code to get what I'm looking for?

Here is an example of how you can take full <body> screenshot of a page:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url = 'https://stackoverflow.com/questions/7263824/get-html-source-of-webelement-in-selenium-webdriver-using-python?rq=1'
browser.get(url)
required_width = browser.execute_script('return document.body.parentNode.scrollWidth')
required_height = browser.execute_script('return document.body.parentNode.scrollHeight')
browser.set_window_size(required_width, required_height)
t.sleep(5)
browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")
required_width = browser.execute_script('return document.body.parentNode.scrollWidth')
required_height = browser.execute_script('return document.body.parentNode.scrollHeight')
browser.set_window_size(required_width, required_height)
t.sleep(1)
body_el = WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.TAG_NAME, "body")))
body_el.screenshot('full_page_screenshot.png')
print('took full screenshot!')
t.sleep(1)
browser.quit()
The Selenium setup is for Linux, but just note the imports and the part after defining the browser. The code above starts from a small window, then resizes it to fit the full page body, then waits a bit and computes the body size again, to account for scripts that kick in on user input. Then it takes the screenshot. Tested and working on a really long page.
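The measure-and-resize step appears twice in the code above, so it could be factored into a small helper. The sketch below is one way to do that. `resize_to_full_page` and `FakeBrowser` are hypothetical names; the fake object only records calls so the helper can be tried without launching Chrome, and a real `webdriver.Chrome` instance is assumed to be passed in practice.

```python
def resize_to_full_page(browser):
    """Resize the window to the document's full scroll size and return (width, height)."""
    width = browser.execute_script("return document.body.parentNode.scrollWidth")
    height = browser.execute_script("return document.body.parentNode.scrollHeight")
    browser.set_window_size(width, height)
    return width, height


class FakeBrowser:
    """Hypothetical stand-in that mimics the two driver methods the helper uses."""

    def __init__(self, width, height):
        self.page = {"scrollWidth": width, "scrollHeight": height}
        self.window_size = None

    def execute_script(self, script):
        # Return the page dimension named at the end of the script.
        return self.page[script.rsplit(".", 1)[-1]]

    def set_window_size(self, width, height):
        self.window_size = (width, height)


fake = FakeBrowser(1280, 9000)
size = resize_to_full_page(fake)
```

In the answer's flow the helper would be called once after `browser.get(url)` and once more after scrolling to the bottom.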

To get a full-page screenshot with the Selenium Python client, you can use GeckoDriver and the Firefox-only save_full_page_screenshot() method as follows:
Code:
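from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service
s, options = Service(), Options()  # assumption: defaults, with geckodriver discoverable on PATH
driver = webdriver.Firefox(service=s, options=options)
driver.get('https://stackoverflow.com/questions/73298355/how-to-remove-duplicate-values-in-one-column-but-keep-the-rows-pandas')
driver.save_full_page_screenshot('fullpage_gecko_firefox.png')
driver.quit()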
tl; dr
[py] Adding full page screenshot feature for Firefox

ChromeWebdriver sees website differently than I do (Python)

I'm trying to make a script that logs into my online grade book to look for any changes (new grades, etc). This is my code so far.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
def main():
    options = Options()
    # options.add_argument("--headless")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.maximize_window()
    # goes to the desired website
    driver.get('https://portal.librus.pl/rodzina')
    # searches for and clicks a button that drops down a menu in which link for login form is visible
    button = driver.find_element(By.CLASS_NAME, 'btn.btn-third.btn-synergia-top.btn-navbar.dropdown-toggle')
    button.click()
    # searches and clicks login link
    agree = driver.find_element(By.CLASS_NAME, 'zmdi.zmdi-account.dropdown-item__icon')
    agree.click()
    time.sleep(10)
    driver.quit()
if __name__ == '__main__':
    main()
And there is a problem: I cannot find a way to make webdriver see what I see. The webpage I see in my own browser looks different from the one webdriver gets, and the source code is different too. I've tried using undetected-chromedriver with no success. This is my code using UC.
import undetected_chromedriver as uc
import time
from selenium.webdriver.common.by import By
def main():
    driver = uc.Chrome()
    driver.maximize_window()
    # goes to the desired website
    driver.get('https://portal.librus.pl/rodzina/home')
    # searches for and clicks a button that drops down a menu in which link for login form is visible
    button = driver.find_element(By.CLASS_NAME, 'btn.btn-third.btn-synergia-top.btn-navbar.dropdown-toggle')
    button.click()
    # searches and clicks login link
    agree = driver.find_element(By.CLASS_NAME, 'zmdi.zmdi-account.dropdown-item__icon')
    agree.click()
    time.sleep(5)
    driver.execute_script("window.print();")
if __name__ == '__main__':
    main()
Has anyone had a similar problem and managed to solve it?

Selenium - Google Travel Scraping Price History missing

I am retrieving HTML with this Python script, but it doesn't include the price history (see screenshot). A browser that is not driven by Selenium does return HTML with the prices (a simple regex finds them even without expanding this section); Chrome, Safari, and Firefox all do, in incognito mode as well.
from selenium import webdriver
import time
url = 'https://www.google.com/flights?hl=en#flt=SFO.JFK.2021-06-01*JFK.SFO.2021-06-07'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(10)
html = driver.page_source
print(html)
driver.quit()
I can't quite pinpoint whether it's some setting in chromedriver. It must be possible, because there is a third-party scraper that currently returns this data.
I tried the suggestions in "Can a website detect when you are using Selenium with chromedriver?" to no avail.
Any thoughts appreciated.

After I added chrome_options.add_argument("--disable-blink-features=AutomationControlled") I started to see this block. Not sure why it is not always loaded.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options
url = 'https://www.google.com/flights?hl=en#flt=SFO.JFK.2021-06-01*JFK.SFO.2021-06-07'
chrome_options = Options()
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
# Selenium 4 style: the driver path goes through Service, the options through options=
driver = webdriver.Chrome(service=Service('/snap/bin/chromium.chromedriver'), options=chrome_options)
driver.get(url)
# wait = WebDriverWait(driver, 20)
# wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".EA71Tc.q7Eewe")))
time.sleep(10)
history = driver.find_element(By.CSS_SELECTOR, ".EA71Tc.q7Eewe").get_attribute("innerHTML")
print(history)
Here the full block is returned, including all tag names. As you see, I tried explicit waits, but this block was not visible. Experiment with adding another explicit wait.

Python script goes to website but doesn't click button intended to

As a test, I am trying to create a script that goes to my website and clicks the "learn more" button, but I am having trouble getting it to click the button automatically.
I've tried everything I've found on Stack Overflow, but nothing has worked.
from selenium import webdriver
import webbrowser
import time
url = 'https://www.mwstan.com'
driver = webbrowser.open_new_tab(url)
element = driver.find_element_by_id('learnmore')
element.click()

You are going to need to install a driver binary for whichever browser you are going to use:
import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1080")
chrome_driver = os.getcwd() + "/chromedriver"
def get_url_example(url):
    driver = webdriver.Chrome(service=Service(chrome_driver), options=chrome_options)
    driver.get(url)
    button = driver.find_element(By.ID, "learnmore")
    button.click()
    # you can access the page source here using driver.page_source
    driver.quit()
if __name__ == '__main__':
    get_url_example("https://www.mwstan.com")
This code works for me and clicks your button.
This uses the Chrome webdriver, but you can use another one. Just make sure you put the driver binary where the script expects it and set the path correctly, as in this line:
chrome_driver = os.getcwd() + "/chromedriver"
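Since `os.getcwd() + "/chromedriver"` only works when the binary sits in the current working directory, it may help to check for the file before launching the browser. The following is a minimal sketch under that assumption. `locate_driver` is a hypothetical helper name, and on Windows the file would normally be called `chromedriver.exe`, which would have to be passed as `name`.

```python
import os
import shutil


def locate_driver(name="chromedriver", directory=None):
    """Return the driver path from `directory` (default: cwd), else from PATH, else None."""
    directory = directory or os.getcwd()
    candidate = os.path.join(directory, name)
    if os.path.isfile(candidate):
        return candidate
    # Fall back to whatever is installed on the system PATH, if anything.
    return shutil.which(name)


chrome_driver = locate_driver()
if chrome_driver is None:
    print("chromedriver not found; download it or add it to PATH")
```

The resulting path can then be used wherever the answer uses `chrome_driver`.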
