driver.page_source don't returns all the source code.It is detaily printing only some parts of code, but it's missing a big part of code. How can i fix this?
This is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
def htmlToLuna():
url ='https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A'
driver = webdriver.Chrome('C:\\Python27\\chromedriver\\chromedriver.exe')
driver.get(url)
web=open('web.txt','w')
web.write(driver.page_source)
print driver.page_source
web.close()
print htmlToLuna()
Here is a simple code all it does is it opens the url and gets the length page source and waits for five seconds and will get the length of page source again.
if __name__=="__main__":
browser = webdriver.Chrome()
browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
initial = len(browser.page_source)
print(initial)
time.sleep(5)
new_source = browser.page_source
print(len(new_source)
see the output:
15722
48800
you see that the length of the page source increases after a wait? you must make sure that the page is fully loaded before getting the source. But this is not a proper implementation since it blindly waits.
Here is a nice way to do this, The browser will wait until the element of your choice is found. Timeout is set for 10 sec.
if __name__=="__main__":
browser = webdriver.Chrome()
browser.get("https://codefights.com/tournaments/Xph7eTJQssbXjDLzP/A")
try:
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.CodeMirror > div:nth-child(1) > textarea:nth-child(1)'))) # 10 seconds delay
print("Result:")
print(len(browser.page_source))
except TimeoutException:
print("Your exception message here!")
The output: Result: 52195
Reference:
https://stackoverflow.com/a/26567563/7642415
http://selenium-python.readthedocs.io/locating-elements.html
Hold on! even that wont make any guarantees for getting full page source, since individual elements are loaded dynamically. If the browser finds the element it moves on. So make sure you find the proper element to make sure the page has been loaded fully.
P.S Mine is Python3 & webdriver is in my environment PATH. So my code needs to be modified a bit to make it work for Python 2.x versions. I guess only print statements are to be modified.
Related
I am trying to webscrape indeed.com to search for jobs using python, with selenium and beautifulsoup. I want to click next page but cant seem to figure out how to do this. Looked at many threads but it is unclear to me which element I am supposed to perform on. Here is the web page html and the code marked with grey comes up when I inspect the next button.
Also just to mention I tried first to follow what happens to the url when mousedown is executed. After reading the addppurlparam function and adding the strings in the function and using that url I just get thrown back to page one.
Here is my code for the class with selenium meant to click on the button:
from selenium import webdriver
from selenium.webdriver import ActionChains
driver = webdriver.Chrome("C:/Users/alleballe/Downloads/chromedriver.exe")
driver.get("https://se.indeed.com/Internship-jobb")
print(driver.title)
#assert "Python" in driver.title
elem = driver.find_element_by_class_name("pagination-list")
elem = elem.find_element_by_xpath("//li/a[#aria-label='Nästa']")
print(elem)
assert "No results found." not in driver.page_source
assert elem
action = ActionChains(driver).click(elem)
action.perform()
print(elem)
driver.close()
The indeed site is formatted so that it shows 10 per page.
Your photo shows the wrong section of HTML instead you can see the links contain start=0 for the first page, start=10 for the second, start=20 for the third,...
You could use this knowledge to do a code like this:
while True:
i = 0
driver.get(f'https://se.indeed.com/jobs?q=Internship&start={i}')
# code here
i = i + 10
But, to directly answer to your question you should do:
next_page_link = driver.find_element_by_xpath('/html/head/link[6]')
driver.get(next_page_link)
This will find the link and then get it.
its work. paginated to next page.
driver.find_element_by_class_name("pagination-list").find_element_by_tag_name('a').click()
I want to retrieve all visible content of a web page. Let say for example this webpage. I am using a headless firefox browser remotely with selenium.
The script I am using looks like this
driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
fe.write(dom.encode('utf-8'))
This is supposed to load the page, parse the dom, and then replace the iframe with id dsq-app1 with it's visible content. If I execute those commands one by one via my python command line it works as expected. I can then see the paragraphs with all the visible content. When instead I execute all those commands at once, either by executing the script or by pasting all this snippet in my interpreter, it behaves differently. The paragraphs are missing, the content still exists in json format, but it's not what I want.
Any idea why this may happening? Something to do with replace_with maybe?
Sounds like the dom elements are not yet loaded when your code try to reach them.
Try to wait for the elements to be fully loaded and just then replace.
This works for your when you run it command by command because then you let the driver load all the elements before you execute more commands.
To add to Or Duan's answer I provide what I ended up doing. The problem of finding whether a page or parts of a page have loaded completely is an intricate one. I tried to use implicit and explicit waits but again I ended up receiving half-loaded frames. My workaround is to check the readyState of the original document and the readyState of iframes.
Here is a sample function
def _check_if_load_complete(driver, timeout=10):
elapsed_time = 1
while True:
if (driver.execute_script('return document.readyState') == 'complete' or
elapsed_time == timeout):
break
else:
sleep(0.0001)
elapsed_time += 1
then I used that function right after I changed the focus of the driver to the iframe
driver.switch_to_frame('dsq-app1')
_check_if_load_complete(driver, timeout=10)
Try to get the Page Source after detecting the required ID/CSS_SELECTOR/CLASS or LINK.
You can always use explicit wait of Selenium WebDriver.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
f = WebDriverWait(driver,10).until(EC.presence_of_element_located((By.ID,idName)
# here 10 is time for which script will try to find given id
# provide the id name
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
fe.write(dom.encode('utf-8'))
Correct me if this not work
I have problem with following task:
Open Google start page
Type request in search form
Choose result where url matches some given url(for example http://www.theguardian.com)
Currently i have this script:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://google.com/")
search_form = driver.find_element_by_xpath("/html/body/div/div[3]/form/div[2]/div[2]/div[1]/div[1]/div[3]/div/div[3]/div/input[1]")
search_form.send_keys("guardian")
search_form.send_keys(Keys.ENTER)
driver.find_element_by_xpath('//a[starts-with(#href,"http://www.theguardian.com")]').click()
It succesfully executes first 2 subtasks but when on last line throws exception:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: {"method":"xpath","selector":"//a[starts-with(#href,\"http://www.theguardian.com\")]"}
Also i have this script which satisfies only last subtask:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
q = "guardian"
browser = webdriver.Firefox()
body = browser.find_element_by_tag_name("body")
body.send_keys(Keys.CONTROL + 't')
browser.get("https://www.google.com/search?q=" + q + "&start=" + str(counter))
browser.find_element_by_xpath('//a[starts-with(#href,"http://www.theguardian.com")]').click()
I works OK. My question is why first script throws exception on how can i modify it so it opens search result as second script does?
UPDATE:
As Bart and Shubham mentioned in comments, problem was in that i was trying to find element on page that wasn't yet loaded. So solution is to use 'wait'.
Selenium-webdriver provides 2 types of 'wait' -- explicit and implicit more on that in documentation.
For my solution i used implicit wait. Basically, it's telling WebDriver to wait for certain amount of time to find an element if it's not immediately available.
For that i just added 1 line to script:
driver.implicitly_wait(5)
Probably this is because the elements the first version is not yet on the page. If you create a "wait until element is present" kind of loop (do not know if it exists by heart) then it should work.
The second example does work because browser.get only returns if the page is loaded.
You can do something like below ...
Try to put some wait from first place
The code is in java but it is very similar/near to python, take a reference from it
You can check everytime that your element is present or not in your HTML DOM to prevent from error/failer of script. like below:-
if (driver.findElements(By.xpath('//a[starts-with(#href,"http://www.theguardian.com')).size() != 0) {
YOUR FIRST Working code
System.out.println("element exists");
}
else{
Your second working code
}
Hope it will help you :)
In this video, you can see how it can be done:
https://www.youtube.com/watch?v=IUBwtLG9hbs
Move to 12:30 to see it in action: Go to google.com, type something in the search box and click on a search result.
I am scraping a website with a lot of javascript that is generated when the page is called. As a result, traditional web scraping methods (beautifulsoup, ect.) are not working for my purposes (at least I have been unsuccessful in getting them to work, all of the important data is in the javascript parts). As a result I have started using selenium webdriver. I need to scrape a few hundred pages, each of which has between 10 and 80 data points (each with about 12 fields), so it is important that this script (is that the right terminology?) can run for quite awhile without me having to babysit it.
I have the code working for a single page, and I have a controlling section that tells the scraping section what page to scrape. The problem is that sometimes the javascript portions of the page load, and sometimes they don't when they don't(~1/7), a refresh fixes things, but occasionally the refresh will freeze webdriver and thus the python runtime environment as well. Annoyingly, when it freezes like this, the code fails to time out. What is going on?
Here is a stripped down version of my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time, re, random, csv
from collections import namedtuple
def main(url_full):
driver = webdriver.Firefox()
driver.implicitly_wait(15)
driver.set_page_load_timeout(30)
#create HealthPlan namedtuple
HealthPlan = namedtuple( "HealthPlan", ("State, County, FamType, Provider, PlanType, Tier,") +
(" Premium, Deductible, OoPM, PrimaryCareVisitCoPay, ER, HospitalStay,") +
(" GenericRx, PreferredPrescription, RxOoPM, MedicalDeduct, BrandDrugDeduct"))
#check whether the page has loaded and handle page load and time out errors
pageNotLoaded= bool(True)
while pageNotLoaded:
try:
driver.get(url_full)
time.sleep(6+ abs(random.normalvariate(1.8,3)))
except TimeoutException:
driver.quit()
time.sleep(3+ abs(random.normalvariate(1.8,3)))
driver.get(url_full)
time.sleep(6+ abs(random.normalvariate(1.8,3)))
# Handle page load error by testing presence of showAll,
# an important feature of the page, which only appears if everything else loads
try:
driver.find_element_by_xpath('//*[#id="showAll"]').text
# catch NoSuchElementException=>refresh page
except NoSuchElementException:
try:
driver.refresh()
# catch TimeoutException => quit and load the page
# in a new instance of firefox,
# I don't think the code ever gets here, because it freezes in the refresh
# and will not throw the timeout exception like I would like
except TimeoutException:
driver.quit()
time.sleep(3+ abs(random.normalvariate(1.8,3)))
driver.get(url_full)
time.sleep(6+ abs(random.normalvariate(1.8,3)))
pageNotLoaded= False
scrapePage() # this is a dummy function, everything from here down works fine,
I have looked extensively for similar problems, and I do not think anyone else has posted about this on so, or anywhere else that I have looked. I am using python 2.7, selenium 2.39.0 and I am trying to scrape Healthcare.gov 's get premium estimate's pages
EDIT: (as an example,this page) It may also be worth mentioning that the page fails to load completely more often when the computer has been on/ doing this for awhile (i'm guessing that the free ram is getting full, and it glitches while loading) this is kind of beside the point though, because this should be handled by the try/except.
EDIT2: I should also mention that this is being run on windows7 64bit, with firefox 17 (which I believe is the newest supported version)
Dude, time.sleep it's a fail!
What's this?
time.sleep(3+ abs(random.normalvariate(1.8,3)))
Try this:
class TestPy(unittest.TestCase):
def waits(self):
self.implicit_wait = 30
Or this:
(self.)driver.implicitly_wait(10)
Or this:
WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath('some_xpath'))
Or, instead of driver.refresh() you can trick it :
driver.get(your url)
Also you can cick the cookie :
driver.delete_all_cookies()
scrapePage() # this is a dummy function, everything from here down works fine, :
http://scrapy.org
I have a HTML code like this:
<div class="links nopreview"><span><a class="csiAction"
href="/WebAccess/home.html#URL=centric://REFLECTION/INSTANCE/_CS_Data/null">Home</a></span> • <span><span><a class="csiAction"
href="/WebAccess/home.html#URL=centric://SITEADMIN/_CS_Site">Setup</a></span> • </span><span><a
title="Sign Out" class="csiAction csiActionLink">Sign Out</a></span></div>
I would like to click on the link that has the text Home. As this Home link appears after login, I have a code like this:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re
browser = webdriver.Firefox() # Get local session of firefox
browser.get("http://myServer/WebAccess/login.html") # Load App page
elem = browser.find_element_by_name("LoginID") # Find the Login box
elem.send_keys("Administrator")
elem = browser.find_element_by_name("Password") # Find the Password box
elem.send_keys("Administrator" + Keys.RETURN)
#try:
elem = browser.find_element_by_link_text("Home")
elem.click()
The part till login works great. However the last but one line is problematic
elem = browser.find_element_by_link_text("Home")
It raises this NoSuchElementException where the Home link is there as you can see from the HTML code.
raise exception_class(message, screen, stacktrace)
NoSuchElementException: Message: u'Unable to locate element: {"method":"link text","selector":"Home"}'
Any guidance as to what I am doing wrong, please?
Have you tried adding an implicit wait to this so that it waits instead of running to quickly.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import re
browser = webdriver.Firefox() # Get local session of firefox
browser.implicitly_wait(10) #wait 10 seconds when doing a find_element before carrying on
browser.get("http://myServer/WebAccess/login.html") # Load App page
elem = browser.find_element_by_name("LoginID") # Find the Login box
elem.send_keys("Administrator")
elem = browser.find_element_by_name("Password") # Find the Password box
elem.send_keys("Administrator" + Keys.RETURN)
#try:
elem = browser.find_element_by_link_text("Home")
elem.click()
The implicitly_wait call makes the browser poll until the item is on the page and visible to be interacted with.
The most common issues with NoSuchElementException while the element is there are:
the element is in different window/frame, so you've to switch to it first,
your page is not loaded or your method of page load is not reliable.
Solution could include:
check if you're using the right frame/window by: driver.window_handles,
write a wait wrapper to wait for an element to appear,
try XPath instead, like: driver.find_element_by_xpath(u'//a[text()="Foo"]').click(),
use pdb to diagnose your problem more efficiently.
See also: How to find_element_by_link_text while having: NoSuchElement Exception?
Maybe the element you are looking for doesn't exactly match that text string?
I know it can be tricky if it looks like it does on-screen, but sometimes there are oddities embedded like this simple markup "Home" or "Home" which makes the first char italic:
"<i>H</i>ome" is visually identical to "<em>H</em>ome" but does not match text.
Edit: after writing the above answer, I studied the question closer and discovered the HTML sample does show "Home" in plain text, but was not visible due to long lines not wrapping. So I edited the OP to wrap the line for readability.
New observation: I noticed that the Logout element has a "title" attribute, but the Home link element lacks such--try giving it one and using that.
Try adding an implicit wait to this in order to wait, instead of running too quickly.
Or
else you can import time and use time.sleep(25)