I have been using Python's Selenium WebDriver to locate elements in a page with the HTML below. However, I could not access any of the elements inside the #document node.
I tried both
driver.find_element_by_xpath("html/body/div[@id='frame']/iframe/*")
and
elem = driver.find_element_by_tag_name("iframe")
followed by elem.find_element_by_xpath to find the inner elements, but both failed.
I also tried driver.switch_to_frame(driver.find_element_by_tag_name("iframe")), followed by XPath expressions to find the inner elements, but that did not work either.
Frame:
<div>
<iframe>
#document
<html>
<body>
<div>
....
</div>
</body>
</html>
</iframe>
</div>
Switching to the iframe and then using the normal query methods is the correct approach to use. I use it successfully throughout a large test suite.
Remember to switch back to the default content (driver.switch_to.default_content()) when you've finished working inside the iframe, though.
Now, to solve your problem: how are you serving the contents of the iframe? Have you literally just written the HTML and saved it to a file, or are you looking at an example site? You might find that the iframe doesn't actually contain the content you expect. Try this:
from selenium.webdriver import Firefox
b = Firefox()
b.get('http://localhost:8000')  # or wherever you are serving this HTML from
iframe = b.find_element_by_css_selector('iframe')
b.switch_to_frame(iframe)
print(b.page_source)
That will print the HTML inside the iframe. Is it what you expect, or is it mostly empty? If it's empty, I suspect that's because you need to serve the contents of the iframe separately.
Web application developers are generally not very fond of iframes. In my case, I added an explicit wait using expected conditions; after that you can fetch the values of your tags, shown here as val1.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
.... # some code
.... # some code
wait(driver, 60).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@id="iframeid"]')))
.... # some code
.... # some code
val1 = wait(browser, 20).until(
    EC.presence_of_element_located((By.XPATH, '//tr[@cid="1"]/td[@ret="2" and @c="21"]')))
Hope this helps!
Here are the two tags I am trying to scrape: https://i.stack.imgur.com/a1sVN.png. In case you are wondering, this is the link to that page (the tags I am trying to scrape are not behind the paywall): https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635
Below is the Python code I am using. Does anyone know why the tags are not being stored in paragraphs?
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'
driver = webdriver.Chrome()
driver.get(url)
paragraphs = driver.find_elements(By.CLASS_NAME, 'css-xbvutc-Paragraph e3t0jlg0')
print(len(paragraphs)) # => prints 0
You have two problems here.
First, you should wait for the page to load after you get() the webpage. You can do this with something like import time and time.sleep(10).
Second, the class names you are searching for change on every page load. However, the data-type='paragraph' attribute stays constant, so you can do:
paragraphs = driver.find_elements(By.XPATH, '//*[@data-type="paragraph"]')  # search by XPath for elements with that data attribute
print(len(paragraphs))
This prints 2 once the page has loaded.
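To see why selecting by the stable attribute works while the randomized class name doesn't, here is a minimal stdlib sketch; the markup is invented to mimic the article (per-load class names, constant data-type), and the same predicate works in Selenium's By.XPATH against the live DOM:

```python
import xml.etree.ElementTree as ET

# Invented snippet mimicking the article markup: class names are
# randomized per page load, but data-type="paragraph" stays constant.
snippet = """
<article>
  <p class="css-xbvutc-Paragraph e3t0jlg0" data-type="paragraph">First paragraph.</p>
  <p class="css-qq12ab-Paragraph e9z8kkq1" data-type="paragraph">Second paragraph.</p>
  <p class="css-abc123-Caption" data-type="caption">A caption, not a paragraph.</p>
</article>
"""

root = ET.fromstring(snippet)
# ElementTree supports this limited XPath subset: attribute predicates.
paragraphs = root.findall(".//*[@data-type='paragraph']")
print(len(paragraphs))  # 2
```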
Just to add on to @Andrew Ryan's answer, you can use an explicit wait for a shorter and more adaptive waiting time.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))
I am trying to identify a JavaScript button on a website and press it to extend the page.
The website in question is the Tencent appstore after performing a basic search. At the bottom of the page is a button titled "div.load-more-new" which, when pressed, extends the page with more apps.
The HTML is as follows:
<div data-v-33600cb4="" class="load-more-btn-new" style="">
<a data-v-33600cb4="" href="javascript:void(0);">加载更多
<i data-v-33600cb4="" class="load-more-icon">
</i>
</a>
</div>
At first I thought I could identify the button using BeautifulSoup, but all calls to find came back empty:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
url = 'https://webcdn.m.qq.com/webapp/homepage/index.html#/appSearch?kw=%25E7%2594%25B5%25E5%25BD%25B1'
WebDriver = webdriver.Chrome('/chromedriver')
WebDriver.get(url)
time.sleep(5)
# Find using BeautifulSoup
soup = BeautifulSoup(WebDriver.page_source, 'lxml')
button = soup.find('div', {'class': 'load-more-btn-new'})
[0] []
After looking around here, it became apparent that even if I could find it with BeautifulSoup, that would not help in pressing the button. Next I tried to find the element through the driver and use .click():
driver.find_element_by_class_name('div.load-more-btn-new').click()
[1] NoSuchElementException
driver.find_element_by_css_selector('.load-more-btn-new').click()
[2] NoSuchElementException
driver.find_element_by_class_name('a.load-more-new.load-more-btn-new[data-v-33600cb4]').click()
[3] NoSuchElementException
but all of them fail with the same NoSuchElementException.
Your selectors won't work because they do not point at the <a>.
This one selects by class name, and you would be clicking the <div> that holds your <a>:
driver.find_element_by_class_name('div.load-more-btn-new').click()
This one is very close but is missing the a in the selection:
driver.find_element_by_css_selector('.load-more-btn-new').click()
This one uses find_element_by_class_name but passes a wild mix of tag, attribute, and class:
driver.find_element_by_class_name('a.load-more-new.load-more-btn-new[data-v-33600cb4]').click()
How to fix?
Select your element more specifically, much like in your second approach:
driver.find_element_by_css_selector('.load-more-btn-new a').click()
or
driver.find_element_by_css_selector('a[data-v-33600cb4]').click()
Note:
While working with newer Selenium versions you will get: DeprecationWarning: find_element_by_* commands are deprecated. Please use find_element() instead:
from selenium.webdriver.common.by import By
driver.find_element(By.CSS_SELECTOR, '.load-more-btn-new a').click()
What is wrong with the code below?
import os
import time
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://x.x.x.x/html/load.jsp")
elm1 = driver.find_element_by_link_text("load")
time.sleep(10)
elm1.click()
time.sleep(30)
driver.close()
The page source is
<body>
<div class="formcenterdiv">
<form class="form" action="../load" method="post">
<header class="formheader">Loader</header>
<div align="center"><button class="formbutton">load</button></div>
</form>
</div>
</body></html>
I want to click the load button, but when I run the above code I get this error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element: load
As the documentation says, find_element_by_link_text only works on <a> tags:
Use this when you know link text used within an anchor tag. With this
strategy, the first element with the link text value matching the
location will be returned. If no element has a matching link text
attribute, a NoSuchElementException will be raised.
The solution is to use a different selector like find_element_by_class_name:
elm1 = driver.find_element_by_class_name('formbutton')
Did you try using XPath?
As noted above, find_element_by_link_text works on <a> tags only.
The code below might help you out:
driver.find_element_by_xpath("/html/body/div/form/div/button")
(New to Python and 1st post)
See code below, but here's the issue:
I'm trying to scrape the webpage in the code for all job titles on the page, but when I print the list I'm not getting any values. I've tried different XPaths to see if I could get something to print, but the list is always empty.
Does anyone know if it is an issue with my code, or if there is something about the site structure that I didn't consider?
Thanks in advance!
from lxml import html
import requests
page = requests.get("https://careers.homedepot.com/job-search-results/?location=Atlanta%2C%20GA%2C%20United%20States&latitude=33.7489954&longitude=-84.3879824&radius=15&parent_category=Corporate%2FOther")
tree = html.fromstring(page.content)
Job_Title = tree.xpath('//*[@id="widget-jobsearch-results-list"]/div/div/div/div[@class="jobTitle"]/a/text()')
print(Job_Title)
The information you're looking for is generated dynamically with JavaScript, while requests only gets you the initial HTML page source.
You might need to use selenium (+ chromedriver) to get the required data:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://careers.homedepot.com/job-search-results/?location=Atlanta%2C%20GA%2C%20United%20States&latitude=33.7489954&longitude=-84.3879824&radius=15&parent_category=Corporate%2FOther")
xpath = "//a[starts-with(#id, 'job-results')]"
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
jobs = [job.text for job in driver.find_elements_by_xpath(xpath)]
Try a library that can execute JavaScript (dryscrape is a lightweight option). Here's a code sample:
from lxml import html
import dryscrape

session = dryscrape.Session()
session.visit("https://careers.homedepot.com/job-search-results/?location=Atlanta%2C%20GA%2C%20United%20States&latitude=33.7489954&longitude=-84.3879824&radius=15&parent_category=Corporate%2FOther")
page = session.body()  # the rendered HTML as a string
tree = html.fromstring(page)
Job_Title = tree.xpath('//*[@id="widget-jobsearch-results-list"]/div/div/div/div[@class="jobTitle"]/a/text()')
print(Job_Title)
That page builds its HTML (the table) with JavaScript. In other words, the target block does not exist in the static HTML of that page. Open the page source and check it:
<div class="entry-content-wrapper clearfix">
<div id="widget-jobsearch-results-list"></div> <!-- Target block is empty! -->
<div id="widget-jobsearch-results-pages"></div>
</div>
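This is easy to verify mechanically: parsing the static markup above with the stdlib shows the target div exists but has no children, which is why the XPath in the question matches nothing until JavaScript has run:

```python
import xml.etree.ElementTree as ET

# The static HTML served to requests contains only placeholder divs;
# the job list is filled in later by JavaScript running in a browser.
raw = """
<div class="entry-content-wrapper clearfix">
  <div id="widget-jobsearch-results-list"></div>
  <div id="widget-jobsearch-results-pages"></div>
</div>
"""

root = ET.fromstring(raw)
target = root.find(".//*[@id='widget-jobsearch-results-list']")
print(target is not None, len(list(target)))  # True 0 -> the div exists but is empty
```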
I am trying to run Selenium on a local HTML string but can't find any documentation on how to do so. I retrieve the HTML source from an e-mail API, so Selenium can't fetch it from a URL directly. Is there any way to alter the following so that it reads the HTML string below?
Python Code for remote access:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_class_name("q")
Local HTML Code:
s = """<body>
<p>This is a test</p>
<p class="q">This is a second test</p>
</body>"""
If you don't want to create a file or load a URL before being able to replace the content of the page, you can always leverage the Data URLs feature, which supports HTML, CSS and JavaScript:
from selenium import webdriver
driver = webdriver.Chrome()
html_content = """
<html>
<head></head>
<body>
<div>
Hello World =)
</div>
</body>
</html>
"""
driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
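One caveat: characters such as # or % in the markup can be misread as URL syntax when pasted raw into a data URL. Percent-encoding the string first with the stdlib avoids that; this is a sketch of the safer construction:

```python
from urllib.parse import quote

html_content = "<html><body><div>Hello World =)</div></body></html>"

# Percent-encode the markup so characters like '#' or '%' cannot be
# interpreted as URL syntax by the browser.
data_url = "data:text/html;charset=utf-8," + quote(html_content)
# driver.get(data_url)  # then navigate to it as usual
print(data_url[:40])
```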
If I understand the question correctly, I can imagine two ways to do this:
Save the HTML code as a file and load it as a file:///file/location URL. The problem is that the file's location, and how the file is loaded by a browser, may differ across OSs and browsers; on the other hand, the implementation is very simple.
Another option is to inject your code into some page and then work with it as regular dynamic HTML. I think this is more reliable, but also more work. This question has a good example.
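For the first option, the stdlib can smooth over the per-OS path-to-URL differences mentioned above; here is a minimal sketch (the temp-file name is whatever tempfile picks):

```python
import tempfile
from pathlib import Path

html = "<html><body><p class='q'>This is a second test</p></body></html>"

# Write the string to a temporary file and turn its path into a
# file:// URL; Path.as_uri() handles the OS-specific path format.
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as f:
    f.write(html)
    path = Path(f.name)

file_url = path.as_uri()
# driver.get(file_url)  # Selenium can now load it like any other page
print(file_url.startswith("file://"))  # True
```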
Here was my solution for doing basic generated tests without having to make lots of temporary local files.
import json
from selenium import webdriver
driver = webdriver.PhantomJS() # or your browser of choice
html = '''<div>Some HTML</div>'''
driver.execute_script("document.write({})".format(json.dumps(html)))
# your tests
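The reason json.dumps is useful here: it emits a complete JavaScript string literal, quotes included, with embedded quotes and newlines escaped, so nothing in the HTML can break out of the script. A quick check without a browser:

```python
import json

html = '<div>Some "quoted" HTML</div>'

# json.dumps wraps the string in double quotes and escapes the inner
# ones, yielding a ready-made JS string literal.
script = "document.write({})".format(json.dumps(html))
print(script)  # document.write("<div>Some \"quoted\" HTML</div>")
```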
If I am reading this correctly, you are simply trying to get text from an element. If that is the case, then the following should fit your needs:
elem = driver.find_element_by_class_name("q").text
print(elem)
Assuming "q" is the element you need.