Get full data from HTML page - python

I am trying to download thousands of HTML pages in order to parse them. I tried it with Selenium, but the downloaded file does not contain all the text seen in the browser.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

chrome_options = Options()
chrome_options.add_argument("--headless")
browser = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)

for url in URL_list:
    browser.get(url)
    content = browser.page_source
    with open(DOWNLOAD_PATH + file_name + ".html", "w", encoding='utf-8') as file:
        file.write(str(content))

browser.close()
But the HTML file I got doesn't contain all the content I see in the browser on the same page. For example, text I see on the screen is not found in the HTML file. Only when I right-click the page in the browser and choose "Save As" do I get the full page.
URL example - https://www.camoni.co.il/411788/1Jacob
Thank you

Be aware that using the webdriver in headless mode may not produce the same results. For a quick resolution I suggest scraping the page source without the --headless option.
The other approach is to wait for certain elements to be located before grabbing the source.
I suggest reading up on expected conditions and waits for that.
Here's a function that I prepared for your better understanding:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def await_certain_elements_and_get_source():
    # elements that should carry the text missing from your download
    element_one = driver.find_element(By.XPATH, "//*[text() = 'some text that is crucial for you']")
    element_two = driver.find_element(By.XPATH, "//*[@id='some-id']")
    wait = WebDriverWait(driver, 5)
    wait.until(EC.visibility_of(element_one))
    wait.until(EC.visibility_of(element_two))
    return driver.page_source
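A sketch of how such a wait could slot into your download loop; the locator below is a placeholder you would swap for an element that actually carries the missing text on your pages:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for url in URL_list:
    browser.get(url)
    # placeholder locator - replace with something that loads last on your pages
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.main-content"))
    )
    content = browser.page_source
    with open(DOWNLOAD_PATH + file_name + ".html", "w", encoding="utf-8") as file:
        file.write(content)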

Related

How to get a full-page screenshot in Python using Selenium and Screenshot

I'm trying to get a full-length screenshot and haven't been able to make it work. Here's the code I'm using:
from Screenshot import Screenshot
from selenium import webdriver
import time
ob = Screenshot.Screenshot()
driver = webdriver.Chrome()
driver.maximize_window()
driver.implicitly_wait(10)
url = "https://stackoverflow.com/questions/73298355/how-to-remove-duplicate-values-in-one-column-but-keep-the-rows-pandas"
driver.get(url)
img_url = ob.full_Screenshot(driver, save_path=r'.', image_name='example.png')
print(img_url)
driver.quit()
But this gives us a clipped screenshot:
So as you can see that's just what the driver window is showing, not a full-length screenshot. How can I tweak this code to get what I'm looking for?
Here is an example of how you can take a full <body> screenshot of a page:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url = 'https://stackoverflow.com/questions/7263824/get-html-source-of-webelement-in-selenium-webdriver-using-python?rq=1'
browser.get(url)
required_width = browser.execute_script('return document.body.parentNode.scrollWidth')
required_height = browser.execute_script('return document.body.parentNode.scrollHeight')
browser.set_window_size(required_width, required_height)
t.sleep(5)
browser.execute_script("window.scrollTo(0,document.body.scrollHeight);")
required_width = browser.execute_script('return document.body.parentNode.scrollWidth')
required_height = browser.execute_script('return document.body.parentNode.scrollHeight')
browser.set_window_size(required_width, required_height)
t.sleep(1)
body_el = WebDriverWait(browser,10).until(EC.element_to_be_clickable((By.TAG_NAME, "body")))
body_el.screenshot('full_page_screenshot.png')
print('took full screenshot!')
t.sleep(1)
browser.quit()
The Selenium setup is for Linux, but the imports and everything after defining the browser are the important part. The code above starts from a small window, then resizes it to fit the full page body, waits a bit, and computes the body size again to account for scripts that kick in on user input. Then it takes the screenshot - tested and working on a really long page.
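As an alternative sketch for Chromium-based browsers, assuming Selenium 4 (which exposes execute_cdp_cmd), the DevTools command Page.captureScreenshot can capture beyond the viewport without resizing the window:
import base64
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://stackoverflow.com/questions/7263824/get-html-source-of-webelement-in-selenium-webdriver-using-python?rq=1')
# captureBeyondViewport renders the whole page, not just the visible area
result = driver.execute_cdp_cmd('Page.captureScreenshot', {'captureBeyondViewport': True})
with open('full_page_cdp.png', 'wb') as f:
    f.write(base64.b64decode(result['data']))
driver.quit()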
To get a full-page screenshot using the Selenium-Python client you can use GeckoDriver and the Firefox-specific save_full_page_screenshot() method as follows:
Code:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options

s = Service('/path/to/geckodriver')  # adjust to your geckodriver location
options = Options()
driver = webdriver.Firefox(service=s, options=options)
driver.get('https://stackoverflow.com/questions/73298355/how-to-remove-duplicate-values-in-one-column-but-keep-the-rows-pandas')
driver.save_full_page_screenshot('fullpage_gecko_firefox.png')
driver.quit()
tl;dr: [py] Adding full page screenshot feature for Firefox

GoogleCaptcha roadblock in website scraper

I am currently working on a scraper for aniworld.to.
My goal is to enter the anime name and get all of the episodes downloaded.
I have everything working except one thing...
The website has a Watch button. That button redirects you to https://aniworld.to/redirect/SOMETHING, and that site has a captcha, which means the link is not in the HTML...
Is there a way to bypass this/get the link in Python? Or a way to display the captcha so I can solve it?
Because the captcha only appears once in a blue moon.
The only thing I need from that page is the redirect link. It looks like this:
https://vidoza.net/embed-something.html
My very very WIP code is here if it helps: https://github.com/wolfswolke/aniworld_scraper
Mitchdu showed me how to do it.
If anyone else needs help, here is my code: https://github.com/wolfswolke/aniworld_scraper/blob/main/src/logic/captcha.py
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from threading import Thread
import os

def open_captcha_window(full_url):
    working_dir = os.getcwd()
    path_to_ublock = r'{}\extensions\ublock'.format(working_dir)
    options = webdriver.ChromeOptions()
    options.add_argument("app=" + full_url)
    options.add_argument("window-size=423,705")
    options.add_experimental_option('excludeSwitches', ['enable-logging'])
    if os.path.exists(path_to_ublock):
        options.add_argument('load-extension=' + path_to_ublock)
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    driver.get(full_url)
    wait = WebDriverWait(driver, 100, 0.3)
    wait.until(lambda redirect: redirect.current_url != full_url)
    new_page = driver.current_url
    Thread(target=threaded_driver_close, args=(driver,)).start()
    return new_page

def threaded_driver_close(driver):
    driver.close()
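A hypothetical usage sketch, based on the redirect URL pattern from the question; the wait resolves once the captcha is solved and the browser navigates away:
# hypothetical URL taken from the question
redirect_link = open_captcha_window("https://aniworld.to/redirect/SOMETHING")
print(redirect_link)  # e.g. https://vidoza.net/embed-something.html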

Selenium cannot find elements

I am trying to automate retrieving data from "SAP Business Client" using Python and Selenium.
Since I cannot find the element I want, even though I am sure the locator is correct, I printed out the HTML content with the following code:
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options
from bs4 import BeautifulSoup as soup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from time import sleep

EDGE_PATH = r"C:\Users\XXXXXX\Desktop\WPy64-3940\edgedriver_win64\msedgedriver"
service = Service(executable_path=EDGE_PATH)
options = Options()
options.use_chromium = True
options.add_argument("headless")
options.add_argument("disable-gpu")
cc_driver = webdriver.Edge(service=service, options=options)
cc_driver.get('https://saps4.sap.XXXX.de/sap/bc/ui5_ui5/ui2/ushell/shells/abap/FioriLaunchpad.html#Z_APSuche-display')
sleep(5)
cc_html = cc_driver.page_source
cc_content = soup(cc_html, 'html.parser')
print(cc_content.prettify())
cc_driver.close()
Now I am just surprised, because the printed content is different from what Firefox's "Inspect" function shows. For example, I can find the word "Nachname" in the Firefox HTML content, but no such word exists in the HTML printed by the code above.
Does someone have an idea why the printed content is different?
Thank you for any help... Gunardi
The code you get from Selenium is the code before JavaScript has processed it. You should get the rendered code through Selenium's JavaScript interaction, e.g. (in Java):
String javascript = "return arguments[0].innerHTML";
String pageSource = (String) ((JavascriptExecutor) driver).executeScript(javascript, driver.findElement(By.tagName("html")));
pageSource = "<html>" + pageSource + "</html>";
System.out.println(pageSource);
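Since the question is in Python, here is a rough equivalent sketch using execute_script, reusing the cc_driver from the question's code:
from selenium.webdriver.common.by import By

# pull the rendered innerHTML of the <html> element via JavaScript
html_el = cc_driver.find_element(By.TAG_NAME, "html")
page_source = cc_driver.execute_script("return arguments[0].innerHTML;", html_el)
print("<html>" + page_source + "</html>")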

Selenium - Google Travel Scraping Price History missing

I am returning HTML with this Python script, but it doesn't return the price history (see screenshot). Using a non-Selenium browser does return HTML with the prices (even without expanding this section, via a simple regex); Chrome/Safari/Firefox all do, incognito as well.
from selenium import webdriver
import time
url = 'https://www.google.com/flights?hl=en#flt=SFO.JFK.2021-06-01*JFK.SFO.2021-06-07'
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
driver.get(url)
time.sleep(10)
html = driver.page_source
print(html)
driver.quit()
I can't quite pinpoint whether it's some setting in chromedriver. It should be possible, because there is a 3rd-party scraper that currently returns this data.
Tried this to no avail: Can a website detect when you are using Selenium with chromedriver?
Any thoughts appreciated.
After I added chrome_options.add_argument("--disable-blink-features=AutomationControlled") I started to see this block. Not sure why it is not always loaded.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options
import time

url = 'https://www.google.com/flights?hl=en#flt=SFO.JFK.2021-06-01*JFK.SFO.2021-06-07'
chrome_options = Options()
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', chrome_options=chrome_options)
driver.get(url)
# wait = WebDriverWait(driver, 20)
# wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".EA71Tc.q7Eewe")))
time.sleep(10)
history = driver.find_element_by_css_selector(".EA71Tc.q7Eewe").get_attribute("innerHTML")
print(history)
Here the full block is returned, including all tag names. As you can see, I tried explicit waits, but this block was never reported visible. Experiment with adding another explicit wait.
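A sketch of that experiment, replacing the hard sleep with a presence-based wait (reusing the driver and imports from the block above); presence does not require visibility, which may be why the visibility wait timed out:
wait = WebDriverWait(driver, 20)
# presence_of_element_located succeeds once the node exists in the DOM,
# even if it is never rendered visible
history_el = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, ".EA71Tc.q7Eewe")))
print(history_el.get_attribute("innerHTML"))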

How to find the href attribute of the videos on twitch through selenium and python?

I'm trying to find the twitch video IDs of all videos for a specific user. So for example on this page
https://www.twitch.tv/dyrus/videos/all
So here we have all videos linked, but it's not quite so simple as to just scrape the HTML and find the links, since they seem to be generated dynamically.
So I heard about selenium and did something like this:
from selenium import webdriver

# Change path here obviously
driver = webdriver.Chrome('C:/Users/Jason/Downloads/chromedriver')
driver.get('https://www.twitch.tv/dyrus/videos/all')
link_elements = driver.find_elements_by_xpath("//*[@href]")
for link in link_elements:
    print(link.get_attribute('href'))
driver.close()
This returns me a bunch of links from the page, but not the videos; they lie "deeper", I think. Any input?
Thanks in advance
I would still suggest a couple of changes as follows:
Always open the web browser in maximized mode so that all/most of the desired elements are within the viewport.
If you are on Windows you need to append the extension .exe to the WebDriver binary name, e.g. chromedriver.exe.
When identifying elements, always try to include the class attribute in your locator strategy.
Always invoke driver.quit() at the end of your tests to close and destroy the WebDriver and web client instances gracefully.
Here is your own code block with the above mentioned tweaks:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\path\to\chromedriver.exe')
driver.get('https://www.twitch.tv/dyrus/videos/all')
link_elements = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.tw-interactive.tw-link[data-a-target='preview-card-image-link']")))
for link in link_elements:
    print(link.get_attribute('href'))
driver.quit()
Console Output:
https://www.twitch.tv/videos/295314690
https://www.twitch.tv/videos/294901947
https://www.twitch.tv/videos/294472813
https://www.twitch.tv/videos/294075254
https://www.twitch.tv/videos/293617036
https://www.twitch.tv/videos/293236560
https://www.twitch.tv/videos/292800601
https://www.twitch.tv/videos/292409437
https://www.twitch.tv/videos/292328170
https://www.twitch.tv/videos/292032996
https://www.twitch.tv/videos/291625563
https://www.twitch.tv/videos/291192151
https://www.twitch.tv/videos/290824842
https://www.twitch.tv/videos/290434348
https://www.twitch.tv/videos/290021370
https://www.twitch.tv/videos/289561690
https://www.twitch.tv/videos/289495488
https://www.twitch.tv/videos/289138003
https://www.twitch.tv/videos/289110429
https://www.twitch.tv/videos/288804893
https://www.twitch.tv/videos/288784992
https://www.twitch.tv/videos/288687479
https://www.twitch.tv/videos/288432438
https://www.twitch.tv/videos/288117849
https://www.twitch.tv/videos/288004968
https://www.twitch.tv/videos/287689102
https://www.twitch.tv/videos/287451192
https://www.twitch.tv/videos/287267032
https://www.twitch.tv/videos/287017431
https://www.twitch.tv/videos/286819343
With your locator, you are returning every element on the page that contains an href attribute. You can be a little more specific than that and get what you are looking for. Switch to a CSS selector...
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Change path here obviously
driver = webdriver.Chrome('C:/Users/Jason/Downloads/chromedriver')
driver.get('https://www.twitch.tv/dyrus/videos/all')
links = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-a-target='preview-card-image-link']")))
for link in links:
    print(link.get_attribute('href'))
driver.close()
That prints 40 links from the page.
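Since the original goal was the video IDs rather than the full links, a small follow-up sketch that strips the IDs from the hrefs returned above:
# 'https://www.twitch.tv/videos/295314690' -> '295314690'
video_ids = [link.get_attribute('href').rstrip('/').rsplit('/', 1)[-1] for link in links]
print(video_ids)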
