I'm trying to get the rendered HTML of a webpage, the Ctrl+U equivalent (in Firefox or Chrome).
Currently I have to .click() to load the page, get the URL, and then load it again with view-source: prepended to the URL:
search = browser.find_elements_by_xpath('//*[@id="edit-keys"]')
button = browser.find_elements_by_xpath('//*[@id="edit-submit"]')
browser.execute_script("arguments[0].value = 'bla';", search[0])
browser.execute_script('arguments[0].target="_blank";', button[0].find_element_by_xpath('./ancestor::form'))
browser.execute_script('arguments[0].click();', button[0])
url = browser.current_url
browser.get("view-source:" + url)
Is it possible to do this without loading the URL twice?
browser.execute_script('return document.documentElement.outerHTML') does not give the view-source: equivalent.
driver.page_source also does not match view-source:.
Maybe there is a way to add view-source: to browser.execute_script('arguments[0].click();', button[0])?
To get the rendered HTML, dynamically loaded JS elements and all, you can fetch it from JS with a simple one-liner:
rendered_source = driver.execute_script('return document.documentElement.outerHTML;')
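If you need to keep the result around for later inspection, a minimal usage sketch (the file name is illustrative):
with open('rendered.html', 'w', encoding='utf-8') as f:
    f.write(rendered_source)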
I am new to Selenium and trying to automate a WordPress social media scheduling plugin. The problem is there is a link text box, as shown below.
If I type a URL in this box and click continue, I get the next page like this:
But when I try to automate this step using this code:
mydriver = webdriver.Chrome(Path)
cookies = get_cookies_values("cookies.csv")
mydriver.get(url)
for i in cookies:
    mydriver.add_cookie(i)
mydriver.get(url)
link_element = WebDriverWait(mydriver, 10).until(
    EC.presence_of_element_located((By.ID, "b2s-curation-input-url"))
)
link_element.send_keys(link)
mydriver.find_element_by_xpath('//*[@id="b2s-curation-post-form"]/div[2]/div[1]/div/div/div[2]/button').click()
Now, if I run the above code, it gets the URL and loads my cookies, but when it clicks the continue button after sending the keys, I get a page with an extra title field. I don't want this extra title box, shown in the picture below.
I would like to know what is causing this issue. I am using Python 3.8 with Chrome WebDriver version 92.0.4515.131.
Thank you
One possibility is to enter the URL character by character, as opposed to sending the entire string. The former is closer to manually typing the link, and assuming the site has a JavaScript listener watching the input, character-by-character entry will be intercepted differently than a single paste of the whole URL:
mydriver = webdriver.Chrome(Path)
cookies = get_cookies_values("cookies.csv")
mydriver.get(url)
for i in cookies:
    mydriver.add_cookie(i)
mydriver.get(url)
link_element = WebDriverWait(mydriver, 10).until(
    EC.presence_of_element_located((By.ID, "b2s-curation-input-url"))
)
for c in link:  # send each character to the field
    link_element.send_keys(c)
    # import time; time.sleep(0.3)  # you could also add a short delay between entries
mydriver.find_element_by_xpath('//*[@id="b2s-curation-post-form"]/div[2]/div[1]/div/div/div[2]/button').click()
I have created the following code in hopes of opening a new tab with a few parameters and then scraping the data table on the new tab.
#Open Webpage
url = "https://www.website.com"
driver=webdriver.Chrome(executable_path=r"C:\mypathto\chromedriver.exe")
driver.get(url)
#Click Necessary Parameters
driver.find_element_by_partial_link_text('Output').click()
driver.find_element_by_xpath('//*[#id="flexOpt"]/table/tbody/tr/td[2]/input[3]').click()
driver.find_element_by_xpath('//*[#id="flexOpt"]/table/tbody/tr/td[2]/input[4]').click()
driver.find_element_by_xpath('//*[#id="repOpt"]/table[2]/tbody/tr/td[2]/input[4]').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Dates').click()
driver.find_element_by_xpath('//*[#id="RangeOption"]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[2]/select/option[2]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[3]/select/option[1]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[4]/select/option[1]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[2]/select/option[2]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[3]/select/option[31]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[4]/select/option[1]').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Groupings').click()
driver.find_element_by_xpath('//*[#id="availFld_DATE"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_LOCID"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_STATE"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_DDSO_SA"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_CLASS_ID"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_REGION"]/a/img').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Run').click()
time.sleep(2)
df_url = driver.switch_to_window(driver.window_handles[0])
page = requests.get(df_url).text
soup = BeautifulSoup(page, features = 'html5lib')
soup.prettify()
However, the following error pops up when I run it.
requests.exceptions.MissingSchema: Invalid URL 'None': No schema supplied. Perhaps you meant http://None?
I will say that regardless of the parameters, the new tab always generates the same URL. In other words, if the new tab creates www.website.com/b once, it also creates www.website.com/b the third, fourth, etc. time, regardless of changing the parameters. Any thoughts?
The problem lies here:
df_url = driver.switch_to_window(driver.window_handles[0])
page = requests.get(df_url).text
df_url does not hold the URL of the page: switch_to_window returns None, which is why requests complains about the URL 'None'. To get the URL of the active window, call driver.current_url after switching windows.
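A minimal sketch of the fix (note that the new tab is usually the last handle, not the first):
# switch to the new tab first, then read its URL from the driver
driver.switch_to_window(driver.window_handles[-1])
df_url = driver.current_url
page = requests.get(df_url).text
Bear in mind that requests fetches the URL in its own session, so any cookies the browser holds would need to be copied over for an authenticated page.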
Some other pointers:
finding elements by XPath is relatively inefficient; prefer locating by id or name where possible
instead of time.sleep, you can look into using explicit waits (a sketch follows below)
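For example, the time.sleep(2) calls could become explicit waits; a sketch reusing an id that already appears in your XPaths:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
# wait until the element is actually clickable instead of sleeping a fixed time
wait.until(EC.element_to_be_clickable((By.ID, "RangeOption"))).click()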
Also, move the url assignment below the driver variable, so the webdriver is created first and then the URL is provided:
driver=webdriver.Chrome(executable_path=r"C:\mypathto\chromedriver.exe")
url = "https://www.website.com"
I need to get the Ctrl+U equivalent of browser.page_source for comparative purposes.
Is this possible with browser.execute_script or another method?
I've tried various things like browser.get(view-source:https://www.example.com) but haven't seen a solution.
It works fine for me; I guess the problem is with the quotes:
browser.get('https://www.example.com')
browser.page_source
You can also achieve the same using browser.execute_script()
browser.execute_script('return document.documentElement.outerHTML')
If I'm not wrong, you want to compare the original HTML (Ctrl+U) with the rendered HTML (browser.page_source). For that you can use requests:
import requests
originalHTML = requests.get('http://...').text
print(originalHTML)
Or you can open another tab for the view-source: version:
url = 'https://..../'
browser.get(url)
renderedHTML = browser.page_source
# open blank page because JS cannot open special URL like `view-source:`
browser.execute_script("window.open('about:blank', '_blank')")
# switch to tab 2
browser.switch_to_window(browser.window_handles[1])
browser.get("view-source:" + url)
originalHTML = browser.find_element_by_css_selector('body').text
# switch to tab 1
#browser.switch_to_window(browser.window_handles[0])
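Since the goal is comparison, you could then diff the two versions with the standard library (a minimal sketch):
import difflib

# line-by-line unified diff of raw vs. rendered HTML
diff = difflib.unified_diff(
    originalHTML.splitlines(),
    renderedHTML.splitlines(),
    fromfile='view-source', tofile='rendered', lineterm=''
)
print('\n'.join(diff))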
I'm writing a Python crawler using the Selenium library and the PhantomJS browser. I trigger a click event on a page to open a new page, then call browser.page_source, but I get the original page's source instead of the newly opened page's source. How do I get the new page's source?
Here's my code:
import requests
from selenium import webdriver
url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[#class='pai-item pai-status-done']").click()
html = browser.page_source
print(html)
browser.quit()
You need to switch to the new window first
browser.find_element_by_xpath("//*[#class='pai-item pai-status-done']").click()
browser.switch_to_window(browser.window_handles[-1])
html = browser.page_source
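If the new window takes a moment to appear, window_handles may still hold a single entry right after the click. One option (assuming Selenium 3.x, where number_of_windows_to_be is available) is to wait for the second handle first:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block until the click has actually opened a second window
WebDriverWait(browser, 10).until(EC.number_of_windows_to_be(2))
browser.switch_to_window(browser.window_handles[-1])
html = browser.page_source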
I believe you need to add a wait before getting the page source.
I've used an implicit wait in the code below.
from selenium import webdriver
url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[#class='pai-item pai-status-done']").click()
browser.implicitly_wait(5)
html = browser.page_source
browser.quit()
Better to use an explicit wait, but it requires a condition such as EC.element_to_be_clickable((By.ID, 'someid')), as sketched below:
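A sketch of that explicit wait ('someid' is a placeholder; use an element that only exists on the new page):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 5 seconds for the marker element before reading the source
WebDriverWait(browser, 5).until(
    EC.element_to_be_clickable((By.ID, 'someid'))
)
html = browser.page_source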
I am scraping a very large document, and when I call:
page_source = driver.page_source
It freezes and isn't able to capture the full page source. Is there something I can do to mitigate this issue? The page loads content via autoscroll, and I can't otherwise get at the source.
You can work around it with execute_script():
driver.execute_script("return document.documentElement.outerHTML;")
You can also try scrolling the footer into view and only then getting the page source:
footer = driver.find_element_by_tag_name("footer")
driver.execute_script("arguments[0].scrollIntoView();", footer)
print(driver.page_source)
Assuming there is a footer element, of course.
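For the autoscroll case specifically, a common pattern (a generic sketch, not specific to this page) is to keep scrolling until the document height stops growing, then grab the source:
import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give newly loaded content time to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no more content is being added
    last_height = new_height
page_source = driver.page_source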