I need to get the ctrl-u equivalent of browser.page_source for comparative purposes.
Is this possible with browser.execute_script or another method?
I've tried various methods like browser.get(view-source:https://www.example.com) but haven't seen a solution.
It works fine for me; I guess the problem is with the quotes:
browser.get('https://www.example.com')
browser.page_source
You can also achieve the same using browser.execute_script()
browser.execute_script('return document.documentElement.outerHTML')
If I'm not wrong, you want to compare the original html (ctrl+U) with the rendered html (browser.page_source); for that you can use requests:
import requests
originalHTML = requests.get('http://...').text
print(originalHTML)
Or you can open another tab for view-source:
url = 'https://..../'
browser.get(url)
renderedHTML = browser.page_source
# open blank page because JS cannot open special URL like `view-source:`
browser.execute_script("window.open('about:blank', '_blank')")
# switch to tab 2
browser.switch_to_window(browser.window_handles[1])
browser.get("view-source:" + url)
originalHTML = browser.find_element_by_css_selector('body').text
# switch to tab 1
#browser.switch_to_window(browser.window_handles[0])
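If the goal is to compare the two, here is a quick sketch using the standard library's difflib, assuming originalHTML and renderedHTML were captured as above:

import difflib

# line-by-line diff of the raw server HTML against the rendered DOM
diff = difflib.unified_diff(
    originalHTML.splitlines(),
    renderedHTML.splitlines(),
    fromfile='original',
    tofile='rendered',
    lineterm='',
)
for line in diff:
    print(line)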
I have created the following code in hopes to open up a new tab with a few parameters and then scrape the data table that is on the new tab.
#Open Webpage
url = "https://www.website.com"
driver = webdriver.Chrome(executable_path=r"C:\mypathto\chromedriver.exe")
driver.get(url)
#Click Necessary Parameters
driver.find_element_by_partial_link_text('Output').click()
driver.find_element_by_xpath('//*[#id="flexOpt"]/table/tbody/tr/td[2]/input[3]').click()
driver.find_element_by_xpath('//*[#id="flexOpt"]/table/tbody/tr/td[2]/input[4]').click()
driver.find_element_by_xpath('//*[#id="repOpt"]/table[2]/tbody/tr/td[2]/input[4]').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Dates').click()
driver.find_element_by_xpath('//*[#id="RangeOption"]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[2]/select/option[2]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[3]/select/option[1]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[1]/td[4]/select/option[1]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[2]/select/option[2]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[3]/select/option[31]').click()
driver.find_element_by_xpath('//*[#id="Range"]/table/tbody/tr[2]/td[4]/select/option[1]').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Groupings').click()
driver.find_element_by_xpath('//*[#id="availFld_DATE"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_LOCID"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_STATE"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_DDSO_SA"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_CLASS_ID"]/a/img').click()
driver.find_element_by_xpath('//*[#id="availFld_REGION"]/a/img').click()
time.sleep(2)
driver.find_element_by_partial_link_text('Run').click()
time.sleep(2)
df_url = driver.switch_to_window(driver.window_handles[0])
page = requests.get(df_url).text
soup = BeautifulSoup(page, features = 'html5lib')
soup.prettify()
However, the following error pops up when I run it.
requests.exceptions.MissingSchema: Invalid URL 'None': No schema supplied. Perhaps you meant http://None?
I will say that regardless of the parameters, the new tab always generates the same url. In other words, if the new tab creates www.website.com/b, it also creates www.website.com/b the third, fourth, etc. time, regardless of changing the parameters. Any thoughts?
The problem lies here:
df_url = driver.switch_to_window(driver.window_handles[0])
page = requests.get(df_url).text
df_url is not the url of the page; switch_to_window returns None, which is where the 'None' in the error message comes from. To get the url of the active window, call driver.current_url after switching windows.
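A sketch of the fix, reusing the names from the question:

driver.switch_to_window(driver.window_handles[0])
df_url = driver.current_url  # the url of the now-active window
page = requests.get(df_url).text
soup = BeautifulSoup(page, features='html5lib')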
Some other pointers:
finding elements by xpath is relatively inefficient (source)
instead of time.sleep, you can look into using explicit waits (see the sketch below)
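For example, a sketch of an explicit wait for one of the clicks above (the locator is taken from the question):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the link to become clickable, instead of sleeping blindly
wait = WebDriverWait(driver, 10)
wait.until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'Dates'))).click()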
Also, assign the url below the driver variable, because the webdriver executes first and then the url is provided:
driver = webdriver.Chrome(executable_path=r"C:\mypathto\chromedriver.exe")
url = "https://www.website.com"
I'm trying to scrape matches and their respective odds from a local bookie site, but for every site I try, my web scraper doesn't return anything; it just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get out the contents?
I have tried all of the sites below for almost a month but with no success. The problem seems to be with the exact div, class, or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code as shown:
import requests
from bs4 import BeautifulSoup
url = "https://www.betpawa.ug/"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds, and all the other components of the container. However, the program runs and just prints "Process finished with exit code 0", nothing else.
It looks like the base site gets loaded in two phases:
Load some HTML structure for the page,
Use JavaScript to fill in the contents
You can prove this to yourself by right-clicking on the page, choosing "view page source", and then searching for "events-container" (it is not there).
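Or check it programmatically with a short sketch (using the betpawa url from the question):

import requests

# the raw HTML the server sends, before any JavaScript runs
html = requests.get("https://www.betpawa.ug/", timeout=5).text
print("events-container" in html)  # expected to print False if the markup is added by JS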
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
You could also use urllib from the standard library instead of requests:
from urllib.request import Request, urlopen
- build your request:
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
- retrieve the document:
res = urlopen(req)
- parse it using bs4:
html = BeautifulSoup(res, 'html.parser')
Like Chris Curvey described, the problem is that requests can't execute the JavaScript of the page. If you print your content variable, you can see that the page displays a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser in the form of a WebDriver (for example the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=chrome_options)
url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')
for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
In line 13, the command print(match.text.strip()) simply extracts the text elements of each match div which has the class attribute "events-container".
If you want to extract more specific content, you can access each match through the match variable.
You need to know:
- which of the available information you want,
- how to identify this information inside the structure of the match divs,
- and in which data type you need this information.
To make this easy, run the program, open the developer tools of Chrome with the F12 key, and in the top left corner you will see the icon for "select an element ...".
If you click on the icon and then click on the desired element in the browser, the equivalent source appears in the area under the icon.
Analyse it carefully to get the information you need, for example:
- The title of the football match is the first h3 tag in the match div, and it is a string.
- The odds shown are span tags with the class event-odds, and each is a number (float/double).
Search for the function you need on Google or in the reference of the package you use (BeautifulSoup4).
Let's try it quick and dirty by using the BeautifulSoup functions on the match variable, so that we don't search the elements of the full site:
# (1) let's try to find the h3 tag
title_tags = match.findAll("h3")  # use on the match variable
if len(title_tags) > 0:  # at least one found?
    title = title_tags[0].getText()  # get the text of the first one
    print("Title: ", title)  # show it
else:
    print("no h3 tags found")
    exit()

# (2) let's try to get some odds as numbers, in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:  # at least three found?
    odds = []  # create a list
    for tag in odds_tags:  # loop over the odds tags we found
        odd = tag.getText()  # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number in
        # Python and expect a sensible result.
        # You have to clean it and convert it:
        clean_odd = odd.strip()  # remove surrounding whitespace
        odd = float(clean_odd)  # convert it to float
        print("Odd as number:", odd)
        odds.append(odd)  # collect the converted odds
else:
    print("something went wrong with the odds")
    exit()

input("Press enter to try it on the next match!")
I'm trying to get the rendered html of a webpage. The ctrl+u equivalent (in firefox or chrome).
Currently I must .click() to load the page, get the url, and then load it again with view-source: prepended to the url:
search = browser.find_elements_by_xpath('//*[@id="edit-keys"]')
button = browser.find_elements_by_xpath('//*[@id="edit-submit"]')
browser.execute_script("arguments[0].value = 'bla';", search[0])
browser.execute_script('arguments[0].target="_blank";', button[0].find_element_by_xpath('./ancestor::form'))
browser.execute_script('arguments[0].click();', button[0])
url = browser.current_url
browser.get("view-source:" + url)
Is it possible to do this without loading the url twice?
browser.execute_script('return document.documentElement.outerHTML') does not offer the view-source: equivalent
driver.page_source also does not match view-source:
Maybe there is a way to add view-source: to browser.execute_script('arguments[0].click();', button[0])?
To get the rendered HTML, with dynamically JS-loaded elements and all, you can fetch it with JS in a simple one-liner:
rendered_source = driver.execute_script('return document.documentElement.outerHTML;')
I'm wondering if there is any way to open a url in the browser and read the source of that opened url?
I'm trying to check if my XPath selector gets the right value of the captcha img src. I can't do this by making two connections to the url, because the captcha reloads every single time I connect to the url.
For reading the source I'm using:
url = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx"
sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()
To open the url in a browser I'm using:
if sys.platform == 'win32':
    os.startfile(url)
elif sys.platform == 'darwin':
    subprocess.Popen(['open', url])
else:
    try:
        subprocess.Popen(['xdg-open', url])
    except OSError:
        print 'Please open a browser on: ' + url
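As an aside, the standard library's webbrowser module already wraps this platform-specific logic; a one-line sketch:

import webbrowser

# picks the user's default browser on Windows, macOS, and Linux
webbrowser.open(url)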
Do any of you guys know how to solve it? Thanks.
I found a solution. To see the url in the browser and at the same time see the source code of the page, just use this code:
from selenium import webdriver
from lxml import etree, html
url = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx"
adres_prefix = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/"
adres_sufix = etree.XPath('string(//img[@class="captcha"]/@src)')
browser = webdriver.Firefox()
browser.get(url)
html_source = browser.page_source  # get the source code of the opened url
root = etree.HTML(html_source)
result = etree.tostring(root, pretty_print=True, method="html")
result2 = adres_sufix(root)
www = adres_prefix + result2
print www  # now I can see if the XPath gives me the right value
Hope it will help others. Thanks anyway for any help.
Most of the cross-platform Python GUI toolkits, such as wxPython, PySide, etc., have an HTML display window that you can use to display HTML source from within your Python code. I would recommend using one of those to display your content.
You probably are going to need to make more than one request to get the CAPTCHA. Get yourself a copy of Fiddler 2 (free) http://fiddler2.com/get-fiddler. It will allow you to see the "conversation" between the server and your browser. Once you see that, you will probably know what you need.
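If you do end up replaying the requests yourself, the key is to reuse one session so the captcha is not regenerated between calls. A sketch with requests and lxml (the captcha XPath is taken from the answer above; that the site keys the captcha to the session cookie is an assumption):

import requests
from lxml import html

url = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx"
prefix = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/"

session = requests.Session()  # one session keeps the cookies between calls
page = session.get(url)
tree = html.fromstring(page.content)

captcha_src = tree.xpath('string(//img[@class="captcha"]/@src)')
captcha = session.get(prefix + captcha_src)  # same session, so the captcha should match
with open("captcha.png", "wb") as f:
    f.write(captcha.content)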
I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:
from urllib import request  # the snippet below assumes this import

url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url)
response = request.urlopen(url)
data = str(response.read(10000))
data = data.replace("\\n", "\n")
print(data)
Where I'm trying to find a particular value, I'm finding a template instead e.g."{{formatPrice median}}" instead of "4.48".
How can I make it so that I can retrieve the value instead of the placeholder text?
Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}
Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.
The code I have now is:
from bs4 import BeautifulSoup
from selenium import webdriver
#...
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html)
print "Finding..."
for tag in soup.find_all('formatPrice median'):
print tag.text
Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.
Assuming you are trying to get values from a page that is rendered using javascript templates (for instance something like handlebars), then this is what you will get with any of the standard solutions (i.e. beautifulsoup or requests).
This is because the browser uses javascript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:
1. parse the AJAX JSON directly (see the sketch after this list)
2. use an offline JavaScript interpreter to process the request (SpiderMonkey, crowbar)
3. use a browser automation tool (splinter)
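A sketch of option 1, assuming you have located the JSON endpoint in the browser's network tab (the URL below is a made-up placeholder, not a real API):

import json
from urllib import request

# hypothetical endpoint found in the browser's developer tools
resp = request.urlopen("https://example.com/api/quicklook?typeid=34")
data = json.loads(resp.read().decode("utf-8"))
print(data)  # the raw values the page's templates would be filled with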
This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and it's pretty handy.
EDIT
From your comments it looks like it is a handlebars-driven site. I'd recommend Selenium and Beautiful Soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')
html = driver.page_source
soup = BeautifulSoup(html)
# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
print tag.text
Basically, Selenium gets the rendered HTML from your browser, and then you can parse it with BeautifulSoup from the page_source property. Good luck :)
I used Selenium + Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "http://www.sitetotarget.com"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# create the driver with these options and load the page
driver = webdriver.Chrome(options=options)
driver.get(url)
html = driver.page_source
Building off another answer: I had a similar issue. wget and curl no longer work well for getting the content of a web page; they are particularly broken with dynamic and lazy content. Using Chrome (or Firefox, or the Chromium version of Edge) allows you to deal with redirects and scripting.
Below will launch an instance of Chrome, increase the timeout to 5 sec, and navigate this browser instance to a url. I ran this from Jupyter.
import time
from tqdm.notebook import trange, tqdm
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()
Example of grabbing dynamic content and screenshots of anchor ("a" tag) HTML objects, another name for hyperlinks:
url = 'http://www.example.org' ## Any website
driver.get(url)
pageSource = driver.page_source
print(driver.get_window_size())
locations = []
for element in driver.find_elements_by_tag_name("a"):
    location = element.location
    size = element.size
    # Collect coordinates of object: left/right, top/bottom
    x1 = location['x']
    y1 = location['y']
    x2 = location['x'] + size['width']
    y2 = location['y'] + size['height']
    locations.append([element, x1, y1, x2, y2, x2-x1, y2-y1])
locations.sort(key = lambda x: -x[-2] - x[-1])
locations = [
    (el, x1, y1, x2, y2, width, height)
    for el, x1, y1, x2, y2, width, height in locations
    if not (
        ## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
        x2 <= x1 or y2 <= y1 or x2 < 0 or y2 < 0
        ## Further restrict if you expect the objects to be around a specific size
        ## or width < 200 or height < 100
    )
]
for el, x1, y1, x2, y2, width, height in tqdm(locations[:10]):
    try:
        print('-'*100, f'({width},{height})')
        print(el.text[:100])
        element_png = el.screenshot_as_png
        with open('/tmp/_pageImage.png', 'wb') as f:
            f.write(element_png)
        img = Image.open('/tmp/_pageImage.png')
        display(img)
    except Exception as err:
        print(err)
Installation for Mac + Chrome:
pip install selenium
brew install --cask chromedriver
brew install --cask google-chrome
I was using a Mac for the original answer, and Ubuntu + Windows 11 preview via WSL2 after updating. Chrome ran from the Linux side, with an X service on Windows rendering the UI.
Regarding responsibility, please respect robots.txt on each site.
I know this is an old question, but sometimes there is a better solution than using heavyweight Selenium.
This requests module for Python comes with JS support (in the background it is still Chromium), and you can still use BeautifulSoup like normal.
Though sometimes, if you have to click elements or something, I guess Selenium is the only option.
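Assuming the module meant is requests-html (which matches that description), a minimal sketch:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.example.com')
r.html.render()  # downloads Chromium on first use, then executes the page's JS
print(r.html.html)  # the rendered source, ready for BeautifulSoup if you like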