I want to retrieve all the visible content of a web page, let's say for example this webpage. I am using a headless Firefox browser remotely with Selenium.
The script I am using looks like this:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
    fe.write(dom.encode('utf-8'))
This is supposed to load the page, parse the DOM, and then replace the iframe with id dsq-app1 with its visible content. If I execute those commands one by one via my Python command line, it works as expected, and I can then see the paragraphs with all the visible content. When instead I execute all those commands at once, either by running the script or by pasting the whole snippet into my interpreter, it behaves differently: the paragraphs are missing. The content still exists in JSON format, but it's not what I want.
Any idea why this may be happening? Something to do with replace_with, maybe?
Sounds like the DOM elements are not yet loaded when your code tries to reach them.
Try to wait for the elements to be fully loaded and only then replace.
This works for you when you run it command by command because then you let the driver load all the elements before you execute more commands.
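A minimal sketch of that idea with explicit waits, assuming driver and url are set up as in the question and that waiting for the iframe's body element is a good enough definition of "loaded":

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.get(url)
wait = WebDriverWait(driver, 10)

# Wait until the iframe exists, and switch into it in one step.
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'dsq-app1')))
# Wait until the iframe's own document has a body to read.
wait.until(EC.presence_of_element_located((By.TAG_NAME, 'body')))
iframe_source = driver.page_source

# Go back to the top-level document and do the replacement there.
driver.switch_to.default_content()
dom = BeautifulSoup(driver.page_source, 'html.parser')
frame_tag = dom.find('iframe', id='dsq-app1')
frame_tag.replace_with(BeautifulSoup(iframe_source, 'html.parser'))

with open('out.html', 'w') as fe:
    fe.write(str(dom))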
To add to Or Duan's answer, I provide what I ended up doing. The problem of finding out whether a page or parts of a page have loaded completely is an intricate one. I tried to use implicit and explicit waits, but I still ended up receiving half-loaded frames. My workaround is to check the readyState of the original document and the readyState of the iframes.
Here is a sample function:
from time import sleep

def _check_if_load_complete(driver, timeout=10):
    elapsed_time = 0
    while True:
        if (driver.execute_script('return document.readyState') == 'complete'
                or elapsed_time >= timeout):
            break
        else:
            sleep(0.5)
            elapsed_time += 0.5
Then I used that function right after changing the driver's focus to the iframe:
driver.switch_to_frame('dsq-app1')
_check_if_load_complete(driver, timeout=10)
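For completeness, a sketch of how the helper can slot into the original snippet (again assuming driver, url, and BeautifulSoup are set up as in the question):

driver.get(url)
_check_if_load_complete(driver)    # wait for the top-level document

dom = BeautifulSoup(driver.page_source, 'html.parser')
f = dom.find('iframe', id='dsq-app1')

driver.switch_to_frame('dsq-app1')
_check_if_load_complete(driver)    # wait for the iframe's own document

f.replace_with(BeautifulSoup(driver.page_source, 'html.parser'))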
Try to get the page source after detecting the required ID, CSS selector, class name, or link text.
You can always use the explicit waits of Selenium WebDriver:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
f = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, idName)))
# here 10 is the time (in seconds) for which the script will try to find the given id
# provide the id name in idName
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
    fe.write(dom.encode('utf-8'))
Correct me if this does not work.
Here are the two tags I am trying to scrape: https://i.stack.imgur.com/a1sVN.png. In case you are wondering, this is the link to that page (the tags I am trying to scrape are not behind the paywall): https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635
Below is the Python code I am using. Does anyone know why the tags are not being stored in paragraphs properly?
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'
driver = webdriver.Chrome()
driver.get(url)
paragraphs = driver.find_elements(By.CLASS_NAME, 'css-xbvutc-Paragraph e3t0jlg0')
print(len(paragraphs)) # => prints 0
So you have two problems impacting you.

1. You should wait for the page to load after you get() the webpage. You can do this with something like import time and time.sleep(10).
2. The class tags that you are searching for change on every page load. However, the fact that it is a data-type='paragraph' stays constant, so you can do:

paragraphs = driver.find_elements(By.XPATH, '//*[@data-type="paragraph"]')  # search by XPath to find the elements with that data attribute
print(len(paragraphs))

This prints 2 after the page is loaded.
Just to add on to Andrew Ryan's answer, you can use an explicit wait for a shorter and more dynamic waiting time.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))
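Once the wait returns, the located elements can be read like any others; for example, to pull the article text out of them (same driver session as above):

# Join the text of every located paragraph into one article body.
article_text = "\n\n".join(p.text for p in paragraphs)
print(article_text[:500])  # preview the first few hundred characters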
I have started using Selenium with Python. I am able to change the message text using find_element_by_id. I want to do the same with find_element_by_xpath, which is not successful because the XPath matches two instances. I want to try this out to learn about XPath.
I also want to do web scraping of a page using Python, for which I need clarity on using XPath, mainly for going to the next page.
#This code works:
import time
import requests
from selenium import webdriver
driver = webdriver.Chrome()
url = "http://www.seleniumeasy.com/test/basic-first-form-demo.html"
driver.get(url)
eleUserMessage = driver.find_element_by_id("user-message")
eleUserMessage.clear()
eleUserMessage.send_keys("Testing Python")
time.sleep(2)
driver.close()
#This works fine. I wish to do the same with xpath.
#I inspect the Input box in Chrome and copy the XPath '//*[@id="user-message"]', which seems to refer to the other box as well.
#I wish to use the xpath method to write text in this box as follows, which does not work.
driver = webdriver.Chrome()
url = "http://www.seleniumeasy.com/test/basic-first-form-demo.html"
driver.get(url)
eleUserMessage = driver.find_elements_by_xpath('//*[@id="user-message"]')
eleUserMessage.clear()
eleUserMessage.send_keys("Test Python")
time.sleep(2)
driver.close()
To elaborate on my comment, you would use a list like this:
eleUserMessage_list = driver.find_elements_by_xpath('//*[@id="user-message"]')
my_desired_element = eleUserMessage_list[0] # or maybe [1]
my_desired_element.clear()
my_desired_element.send_keys("Test Python")
time.sleep(2)
The only real difference between find_elements_by_xpath and find_element_by_xpath is that the first option returns a list that needs to be indexed. Once it's indexed, it works the same as if you had run the second option!
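If you would rather not index a Python list at all, the indexing can also be done inside the XPath itself; a small sketch, assuming the first match really is the input box you want:

# "(...)[1]" indexes in XPath (it is 1-based), so exactly one element matches
# and find_element_by_xpath can be used directly.
eleUserMessage = driver.find_element_by_xpath('(//*[@id="user-message"])[1]')
eleUserMessage.clear()
eleUserMessage.send_keys("Test Python")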
I am trying to get data from an Evernote 'shared notebook'.
For example, from this one: https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c
I tried to use Beautiful Soup:
import requests
from bs4 import BeautifulSoup

url = 'https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c'
r = requests.get(url)
bs = BeautifulSoup(r.text, 'html.parser')
bs
The result doesn't contain any text information from the notebook, only some code.
I have also seen advice to use Selenium and find elements by XPath.
For example, I want to find the heading of this note, 'Term 3 Week2'. In Google Chrome I found that its XPath is '/html/body/div[1]/div[1]/b/span/u/b'.
So I tried this:
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get(url)
t = driver.find_element_by_xpath('/html/body/div[1]/div[1]/b/span/u/b')
But it also didn't work; the result was 'NoSuchElementException: ...'.
I am a newbie in Python and especially in parsing, so I would be glad to receive any help.
I am using Python 3.6.2 and Jupyter Notebook.
Thanks in advance.
The easiest way to interface with Evernote is to use their official Python API.
After you've configured your API key and can generally connect, you can then download and reference Notes and Notebooks.
Evernote Notes use their own template language called ENML (EverNote Markup Language) which is a subset of HTML. You'll be able to use BeautifulSoup4 to parse the ENML and extract the elements you're looking for.
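As a rough illustration of the BeautifulSoup side, here is a sketch that parses a made-up ENML snippet (real note content would come from the Evernote API):

from bs4 import BeautifulSoup

# Hypothetical ENML content; a real note body is returned by the API.
enml = '<en-note><div><b>Term 3 Week2</b></div><div>Some body text.</div></en-note>'

soup = BeautifulSoup(enml, 'html.parser')
print(soup.get_text(separator='\n'))  # prints the note's visible text, line by line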
If you're trying to extract information from a local installation (instead of their web app), you may also be able to get what you need from the executable; see how to pass arguments to the local install to extract data. For this you're going to need the Python 3 subprocess module.
HOWEVER
If you want to use selenium, this will get you started:
import selenium.webdriver.support.ui as ui
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# your example URL
URL = 'https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c'
# create the browser interface, and a generic "wait" that we can use
# to intelligently block while the driver looks for elements we expect.
# 10: maximum wait in seconds
# 0.5: polling interval in seconds
driver = Chrome()
wait = ui.WebDriverWait(driver, 10, 0.5)
driver.get(URL)
# Note contents are loaded in an iFrame element
find_iframe = By.CSS_SELECTOR, 'iframe.gwt-Frame'
find_html = By.TAG_NAME, 'html'
# .. so we have to wait for the iframe to exist, switch our driver context
# and then wait for that internal page to load.
wait.until(EC.frame_to_be_available_and_switch_to_it(find_iframe))
wait.until(EC.visibility_of_element_located(find_html))
# since ENML is "just" HTML we can select the top tag and get all the
# contents inside it.
doc = driver.find_element_by_tag_name('html')
print(doc.get_attribute('innerHTML')) # <-- this is what you want
# cleanup our browser instance
driver.quit()
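From there you can hand the note's HTML to BeautifulSoup instead of relying on a brittle absolute XPath; a sketch, assuming the innerHTML from the snippet above was saved to a variable (e.g. note_html = doc.get_attribute('innerHTML')) before driver.quit():

from bs4 import BeautifulSoup

soup = BeautifulSoup(note_html, 'html.parser')
# All visible text of the note; the heading 'Term 3 Week2' should be in here.
print(soup.get_text(separator='\n', strip=True))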
driver.page_source doesn't return all of the source code. It prints only some parts of the code in detail, but it's missing a big part of it. How can I fix this?
This is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
def htmlToLuna():
    url = 'https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A'
    driver = webdriver.Chrome('C:\\Python27\\chromedriver\\chromedriver.exe')
    driver.get(url)
    web = open('web.txt', 'w')
    web.write(driver.page_source)
    print driver.page_source
    web.close()

print htmlToLuna()
Here is a simple piece of code: all it does is open the URL, get the length of the page source, wait for five seconds, and then get the length of the page source again.
from selenium import webdriver
import time

if __name__ == "__main__":
    browser = webdriver.Chrome()
    browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
    initial = len(browser.page_source)
    print(initial)
    time.sleep(5)
    new_source = browser.page_source
    print(len(new_source))
See the output:
15722
48800
You see that the length of the page source increases after the wait? You must make sure that the page is fully loaded before getting the source. But this is not a proper implementation, since it waits blindly.
Here is a nicer way to do this: the browser will wait until the element of your choice is found. The timeout is set to 10 seconds.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

if __name__ == "__main__":
    browser = webdriver.Chrome()
    browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
    try:
        WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.CodeMirror > div:nth-child(1) > textarea:nth-child(1)')))  # 10 seconds delay
        print("Result:")
        print(len(browser.page_source))
    except TimeoutException:
        print("Your exception message here!")
The output:
Result: 52195
Reference:
https://stackoverflow.com/a/26567563/7642415
http://selenium-python.readthedocs.io/locating-elements.html
Hold on! Even that won't make any guarantees about getting the full page source, since individual elements are loaded dynamically. If the browser finds the element it moves on, so make sure you wait for an element that ensures the page has loaded fully.
P.S. Mine is Python 3 and the webdriver is in my environment PATH, so my code needs to be modified a bit to work with Python 2.x; I guess only the print statements need to be changed.
I was trying to save a PDF from a link via PhantomJS (Selenium). I referred to this code that turns web pages into PDFs, and it worked just fine when I ran exactly the same code.
So, I have this PDF I wanted to save from a direct URL and I tried that script... it didn't work. It just saves a PDF with one white page. That's all...
My code:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
def execute(script, args):
    driver.execute('executePhantomScript', {'script': script, 'args': args})
driver = webdriver.PhantomJS('phantomjs')
# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
driver.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')
try:
    WebDriverWait(driver, 40).until(EC.presence_of_element_located((By.ID, 'plugin')))
except Exception as TimeoutException:
    print("I waited for far too long and still couldn't find the view.")
    pass
# set page format
# inside the execution script, webpage is "this"
pageFormat = '''this.paperSize = {format: "A4", orientation: "portrait" };'''
execute(pageFormat, [])
# render current page
render = '''this.render("test2.pdf")'''
execute(render, [])
I'm not sure what's happening or why it is happening. I need some assistance.
EDIT: This is just the test PDF that I was trying to get via Selenium. There are some other PDFs which I need to get, and that website is checking god-knows-what to decide whether it's a human or a bot, so Selenium is the only way.
EDIT 2: So, here's the website I was practicing on: http://services.ecourts.gov.in/ecourtindia/cases/case_no.php?state_cd=26&dist_cd=8&appFlag=web
Select "Cr Rev - Criminal Revision" from the "Case Type" drop-down and input any number in case number and year. Click on "Go".
This will show a little table; click on "view" and it should show a table on the full page.
Scroll down to the "orders" table and you should see "Copy of order". That's the PDF I'm trying to get. I have tried requests as well and it did not work.
Currently, PhantomJS and headless Chrome don't support downloading a file. If you are OK with the Chrome browser, please see my example below. It finds the link (a) elements, adds a download attribute to each, and finally clicks the link to download the file to the default Downloads folder.
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.planetpublish.com/free-ebooks/93/heart-of-darkness/')
pdfLinks = driver.find_elements_by_css_selector(".entry-content ul > li > a")
for pdfLink in pdfLinks:
    script = "arguments[0].setAttribute('download', arguments[1]);"
    driver.execute_script(script, pdfLink, pdfLink.text)
    time.sleep(1)
    pdfLink.click()
time.sleep(3)
driver.quit()
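If the files should land in a specific folder rather than the default Downloads directory, Chrome can be pointed there through its preferences; a sketch, with an assumed target path:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    'download.default_directory': '/tmp/pdfs',    # assumed target folder
    'plugins.always_open_pdf_externally': True,   # save PDFs instead of opening the built-in viewer
})
driver = webdriver.Chrome(chrome_options=options)  # 'options=' in newer Selenium versions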
If you're just looking to download PDFs which aren't protected behind some JavaScript or similar (essentially straightforward cases), I suggest using the requests library instead.
import requests

url = 'http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf'
r = requests.get(url)
with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
    f.write(r.content)

# If large file
with requests.get(url, stream=True) as r:
    with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
I recommend you look at the pdfkit library.
import pdfkit
pdfkit.from_url('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf', 'out.pdf')
It makes downloading PDFs very simple with Python. You will also need to download wkhtmltopdf for the library to work.
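If wkhtmltopdf is not on your PATH, pdfkit can be told where to find it; a sketch with an assumed install location:

import pdfkit

# Assumed path; point this at wherever wkhtmltopdf is actually installed.
config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf')
pdfkit.from_url(
    'http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf',
    'out.pdf',
    configuration=config,
)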
You could also try the code from this link, shown below,
#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.support.ui import WebDriverWait
# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')
    button = browser.find_element_by_name('button')
    button.click()
    # wait for the page to load
    WebDriverWait(browser, timeout=10).until(
        lambda x: x.find_element_by_id('someId_that_must_be_on_new_page'))
    # store it to string variable
    page_source = browser.page_source
    print(page_source)
which you will need to edit to make it work for your PDF.