I am having a weird issue with Python and Selenium. I am accessing the URL https://www.biggerpockets.com/users/JarridJ1. When you click "more", it shows further content. As far as I can tell, it is a React-based website. When I open the page in a browser and do a View Source, I can see the data I need in a React element: <div data-react-class="Profile/Header/Header" data-react-props="{". I tried to automate Firefox via Selenium, but I could not get the data that way either.
Check the screenshot:
Below is the code I tried:
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
def parse(u):
    print('Processing... {}'.format(u))
    driver.get(u)
    sleep(2)
    html = driver.page_source
    driver.save_screenshot('bp.png')
    print(html)
if __name__ == '__main__':
    options = Options()
    options.add_argument("--headless")  # Runs Chrome in headless mode.
    options.add_argument('--no-sandbox')  # Bypass OS security model
    options.add_argument('--disable-gpu')  # applicable to windows os only
    options.add_argument('start-maximized')
    options.add_argument('disable-infobars')
    options.add_argument("--disable-extensions")
    # Note: the Chrome options built above are never passed to the driver below
    driver = webdriver.Firefox()
    parse('https://www.biggerpockets.com/users/JarridJ1')
This is a tricky one, but I found a way to get to the element you have highlighted. I am still not sure why driver.page_source is not returning what you are looking for.
from selenium.webdriver.common.by import By
def parse(u):
    print('Processing... {}'.format(u))
    driver.get(u)
    sleep(2)
    # Grab every element on the page and print its inner HTML
    get_everything = driver.find_elements(By.XPATH, "//*")
    for element in get_everything:
        print(element.get_attribute('innerHTML'))
    #html = driver.page_source
    #driver.save_screenshot('bp.png')
    #print(html)
Below is my standalone example:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
# Raw string so the backslashes in the Windows path stay literal
driver = webdriver.Chrome(r"C:\Path\To\chromedriver.exe")
driver.get("https://www.biggerpockets.com/users/JarridJ1")
time.sleep(5)
# XPath attribute tests use @, not #
a = driver.find_element(By.XPATH, "//div[@data-react-class='Profile/Header/Header']")
b = a.get_attribute("data-react-props")
print(b)
c = driver.find_elements(By.XPATH, "//*")
for i in c:
    print(i.get_attribute('innerHTML'))
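The data-react-props value is a JSON string, HTML-escaped when it appears in raw markup. Here is a minimal sketch, using only the standard library, of pulling it out of markup you already have as a string (for example the innerHTML printed above, or a saved copy of View Source) and decoding it. The sample markup and the "name" key are made up for illustration and are not taken from the real page; ReactPropsParser and extract_react_props are hypothetical helper names.

```python
import json
from html.parser import HTMLParser


class ReactPropsParser(HTMLParser):
    """Collect data-react-props for every element that has a data-react-class."""

    def __init__(self):
        super().__init__()
        self.props = {}  # maps data-react-class -> decoded props dict

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        react_class = attrs.get("data-react-class")
        raw_props = attrs.get("data-react-props")
        if react_class and raw_props:
            # HTMLParser has already unescaped &quot; and similar entities
            self.props[react_class] = json.loads(raw_props)


def extract_react_props(html):
    """Return {react class name: props dict} for the given markup."""
    parser = ReactPropsParser()
    parser.feed(html)
    return parser.props


# Hypothetical sample markup, not the real page content
sample = (
    '<div data-react-class="Profile/Header/Header" '
    'data-react-props="{&quot;name&quot;: &quot;Example&quot;}"></div>'
)
props = extract_react_props(sample)
print(props["Profile/Header/Header"]["name"])
```

If you already have the attribute value from get_attribute("data-react-props"), as in b above, json.loads(b) alone should be enough, since Selenium returns the unescaped value.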
I'm trying to get a full-length screenshot and haven't been able to make it work. Here's the code I'm using:
from Screenshot import Screenshot
from selenium import webdriver
import time
ob = Screenshot.Screenshot()
driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(10)
url = "https://stackoverflow.com/questions/73298355/how-to-remove-duplicate-values-in-one-column-but-keep-the-rows-pandas"
driver.get(url)
img_url = ob.full_Screenshot(driver, save_path=r'.', image_name='example.png')
print(img_url)
driver.quit()
But this gives us a clipped screenshot:
So as you can see that's just what the driver window is showing, not a full-length screenshot. How can I tweak this code to get what I'm looking for?
Here is an example of how you can take a full <body> screenshot of a page:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url = 'https://stackoverflow.com/questions/7263824/get-html-source-of-webelement-in-selenium-webdriver-using-python?rq=1'
browser.get(url)
required_width = browser.execute_script('return document.body.parentNode.scrollWidth')
required_height = browser.execute_script('return document.body.parentNode.scrollHeight')
browser.set_window_size(required_width, required_height)
t.sleep(5)
browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")
required_width = browser.execute_script('return document.body.parentNode.scrollWidth')
required_height = browser.execute_script('return document.body.parentNode.scrollHeight')
browser.set_window_size(required_width, required_height)
t.sleep(1)
body_el = WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.TAG_NAME, "body")))
body_el.screenshot('full_page_screenshot.png')
print('took full screenshot!')
t.sleep(1)
browser.quit()
The Selenium setup is for Linux, but just note the imports and the part after the browser is defined. The code above starts from a small window, then enlarges it to fit the full page body, then waits a bit, scrolls to the bottom, and computes the body size again to account for scripts that only kick in on user input. Then it takes the screenshot. Tested and working on a really long page.
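The measure-resize-measure-again step can be wrapped in a small loop that stops once the page size no longer changes. This is only a sketch of that idea, not part of the answer's tested code: fit_window_to_page and max_rounds are hypothetical names, and FakeBrowser is a stand-in so the snippet runs without a real browser.

```python
def fit_window_to_page(browser, max_rounds=5):
    """Resize the window until the page's scroll size stops changing.

    `browser` only needs execute_script() and set_window_size(), which
    Selenium WebDriver objects provide. Returns the final (width, height).
    """
    size = None
    for _ in range(max_rounds):
        width = browser.execute_script("return document.body.parentNode.scrollWidth")
        height = browser.execute_script("return document.body.parentNode.scrollHeight")
        if (width, height) == size:
            break  # layout is stable, no further resize needed
        size = (width, height)
        browser.set_window_size(width, height)
    return size


class FakeBrowser:
    """Stand-in for a WebDriver so the sketch runs without a browser."""

    def __init__(self, heights):
        self.heights = list(heights)  # scrollHeight reported on successive reads
        self.resizes = []             # every set_window_size() call, in order

    def execute_script(self, script):
        if "scrollWidth" in script:
            return 1280
        # Report the next height; keep repeating the last one once exhausted
        return self.heights.pop(0) if len(self.heights) > 1 else self.heights[0]

    def set_window_size(self, width, height):
        self.resizes.append((width, height))


fake = FakeBrowser([1000, 3000])  # page grows after the first resize
final_size = fit_window_to_page(fake)
print(final_size, fake.resizes)
```

With a real driver you would call fit_window_to_page(browser) after browser.get(url) and before taking the body screenshot.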
To get a full-page screenshot using the Selenium Python client, you can use GeckoDriver and the Firefox-only save_full_page_screenshot() method as follows:
Code:
# s and options are not shown here; they are assumed to be a geckodriver Service and Firefox options defined earlier
driver = webdriver.Firefox(service=s, options=options)
driver.get('https://stackoverflow.com/questions/73298355/how-to-remove-duplicate-values-in-one-column-but-keep-the-rows-pandas')
driver.save_full_page_screenshot('fullpage_gecko_firefox.png')
driver.quit()
tl; dr
[py] Adding full page screenshot feature for Firefox
I am currently working on a scraper for aniworld.to.
My goal is to enter the anime name and get all of the episodes downloaded.
I have everything working except one thing...
The website has a Watch button. That button redirects you to https://aniworld.to/redirect/SOMETHING, and that page has a captcha, which means the link is not in the HTML...
Is there a way to bypass this or get the link in Python? Or a way to display the captcha so I can solve it?
The captcha only appears very rarely.
The only thing I need from that page is the redirect link. It looks like this:
https://vidoza.net/embed-something.html
My very very wip code is here if it helps: https://github.com/wolfswolke/aniworld_scraper
Mitchdu showed me how to do it.
If anyone else needs help here is my code: https://github.com/wolfswolke/aniworld_scraper/blob/main/src/logic/captcha.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from threading import Thread
import os
def open_captcha_window(full_url):
    working_dir = os.getcwd()
    path_to_ublock = r'{}\extensions\ublock'.format(working_dir)
    options = webdriver.ChromeOptions()
    options.add_argument("app=" + full_url)
    options.add_argument("window-size=423,705")
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    if os.path.exists(path_to_ublock):
        options.add_argument('load-extension=' + path_to_ublock)
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get(full_url)
    # Poll every 0.3 s, for up to 100 s, until the captcha is solved and the page redirects
    wait = WebDriverWait(driver, 100, 0.3)
    wait.until(lambda redirect: redirect.current_url != full_url)
    new_page = driver.current_url
    # Close the window in a background thread so the caller gets the URL right away
    Thread(target=threaded_driver_close, args=(driver,)).start()
    return new_page
def threaded_driver_close(driver):
    driver.close()
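The key step in the code above is waiting until the URL changes after the captcha is solved. The polling idea behind that wait can be sketched in plain Python as below. This is an illustration of the technique, not Selenium's actual implementation: wait_for_url_change is a hypothetical helper, and the iterator stands in for driver.current_url so the snippet runs without a browser.

```python
import time


def wait_for_url_change(get_url, original_url, timeout=100.0, poll=0.3):
    """Poll get_url() until it returns something other than original_url.

    Raises TimeoutError if the URL has not changed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = get_url()
        if current != original_url:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("URL did not change within %.1f s" % timeout)
        time.sleep(poll)


# Stand-in for driver.current_url: the URL changes on the third read
urls = iter([
    "https://aniworld.to/redirect/SOMETHING",
    "https://aniworld.to/redirect/SOMETHING",
    "https://vidoza.net/embed-something.html",
])
result = wait_for_url_change(
    lambda: next(urls),
    "https://aniworld.to/redirect/SOMETHING",
    timeout=5,
    poll=0.01,
)
print(result)
```

With a real driver the call would look like wait_for_url_change(lambda: driver.current_url, full_url).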
I have code for creating a temp mail address automatically, but I have a problem.
This is the code:
import time
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("https://temp-mail.org/")
time.sleep(10)
browser.close()
The link opens correctly but I can't pass cloudflare.
Also, I see some errors on my console:
Thanks...
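Try adding a user-agent argument to the Chrome options; the user agent can be set to any value:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
ops = Options()
ua = 'me'  # any user-agent string
ops.add_argument('--user-agent=%s' % ua)
# executable_path is the Selenium 3 style of pointing at the driver binary
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe", options=ops)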
Alternatively, try using undetected-chromedriver:
import undetected_chromedriver as uc
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
driver = uc.Chrome(options=options)
driver.get("https://temp-mail.org/")
As a test, I am trying to create a script that goes to my website and clicks the "learn more" button, but I am having trouble getting it to click the button automatically.
I've tried everything that I've found on Stack Overflow, but nothing has worked.
from selenium import webdriver
import webbrowser
import time
url = 'https://www.mwstan.com'
driver = webbrowser.open_new_tab(url)
element = driver.find_element_by_id('learnmore')
element.click()
You are going to need to install a driver binary for whichever browser you are going to use.
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1080")  # Chrome expects width,height
chrome_driver = os.getcwd() + "/chromedriver"
def get_url_example(url):
    driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=chrome_driver)
    driver.get(url)
    button = driver.find_element(By.ID, "learnmore")
    button.click()
    # you can access the page source here using driver.page_source
if __name__ == '__main__':
    # call the function defined above (the original called an undefined name)
    get_url_example("https://www.mwstan.com")
This code works for me and clicks your button.
This uses the Chrome webdriver, but you can use another one. Just make sure you put the driver binary somewhere accessible and set the path to it correctly, as in this line:
chrome_driver = os.getcwd() + "/chromedriver"
I know how to call a method to maximize window from driver object.
driver.maximize_window()
But what method should I use when I need to minimize browser window (hide it)?
Actually, the driver object doesn't have a minimize_window attribute.
My goal is to work silently with the browser window. I don't want to see it on my PC.
Option 1: Use --headless
Option 2: Use driver.minimize_window()
Example 1:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get("enter your url here")
Example 2:
from selenium import webdriver
driver = webdriver.Chrome()
driver.minimize_window()
driver.get("enter your url here")
One correction: there is a method to minimize the window, which is what you need.
driver.minimize_window()
This works for me in Python 3, assuming you are using the Python bindings.
Just use driver.minimize_window(), or use a headless browser such as PhantomJS or Chromium.
Example:
from bs4 import BeautifulSoup
from selenium import webdriver
PROXY_SERVER = "127.0.0.1:5566" # IP:PORT or HOST:PORT
url = "enter your url here"  # placeholder; the original snippet did not define url
options = webdriver.ChromeOptions()
options.binary_location = r"/usr/bin/opera" # path to opera executable
options.add_argument('--proxy-server=%s' % PROXY_SERVER)
driver = webdriver.Opera(executable_path=r"/home/prestes/Tools/operadriver_linux64/operadriver", options=options)
driver.minimize_window()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, "lxml")
print(html)
driver.quit()
Maybe try a headless browser? Chrome headless or PhantomJS.
http://phantomjs.org/
Keep in mind that development of PhantomJS is suspended. You may want to use one of the alternatives if it doesn't work or gives errors:
https://github.com/dhamaniasad/HeadlessBrowsers
A headless browser is a web browser without a GUI.
Try this,
driver.set_window_position(0, 0)
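The suggestions in this thread can be combined into one fallback: minimize the window if the driver object has minimize_window(), otherwise move the window with set_window_position(). A minimal sketch follows. hide_window is a hypothetical helper, the off-screen coordinates (-2000, 0) are an assumption rather than something from the answers, and the two small classes are stand-ins so the snippet runs without a browser.

```python
def hide_window(driver, offscreen=(-2000, 0)):
    """Try to get the browser window out of sight.

    Prefers minimize_window(); if the driver object lacks it, moves the
    window to `offscreen` instead. Returns the name of the method used.
    """
    minimize = getattr(driver, "minimize_window", None)
    if callable(minimize):
        minimize()
        return "minimize_window"
    driver.set_window_position(*offscreen)
    return "set_window_position"


class OldDriver:
    """Stand-in for a driver object without minimize_window()."""

    def __init__(self):
        self.position = None

    def set_window_position(self, x, y):
        self.position = (x, y)


class NewDriver(OldDriver):
    """Stand-in for a driver object that supports minimize_window()."""

    def __init__(self):
        super().__init__()
        self.minimized = False

    def minimize_window(self):
        self.minimized = True


old, new = OldDriver(), NewDriver()
used_old = hide_window(old)
used_new = hide_window(new)
print(used_old, old.position)
print(used_new, new.minimized)
```

If the window must never appear at all, the --headless option shown earlier in the thread is the more direct route.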