I'm trying to web scrape a table from an iframe. In order to switch the driver to that frame I'm using driver.find_element_by_xpath, but the problem is that the path in the html code includes some namespaces that I cannot get Python to figure out using the local-name() function.
Here is the chunk of the HTML I'm using:
<xbrl:campo-captura xbrl:solo-lectura="true" xbrl:id-hecho-plantilla="ar_pros_CorporateStructure_11933a35-3932-44c0-b394-f0ebd4f722d2"
id="8a97271e-df5c-4fbe-bedf-513ea1508bf2"><div>
<div>
<i style="cursor:pointer; float:right;margin-right:-20px;" id="d9fa20ae-c55f-4344-baf5-0112a13827b6" class="i i-arrow-down-2 botonDetalleOperacionXbrl">
</i>
<div id="abrir_nota_F2a26d5a7-2934-4ff0-86df-7a8983c05e47" style="cursor:pointer;float:right;margin-right:-20px;margin-top:20px;" data-toggle="tooltip" data-placement="right" title="Abrir nota">
<i class="fa fa-external-link"></i>
</div>
</div>
<div class="campoTextBlock">
<div id="F2a26d5a7-2934-4ff0-86df-7a8983c05e47">
<div class="celdaAnchoFijo textBlockLimit div-default divTextBlockMaximo" id="divAreaTextod9fa20ae-c55f-4344-baf5-0112a13827b6" style="overflow-y:hidden">
<iframe scrolling="no" id="frame_8a97271e-df5c-4fbe-bedf-513ea1508bf2" style="width:100%;height:100%" frameborder="0"></iframe>
</div>
</div>
</div>
<div>
</div>
</div></xbrl:campo-captura>
I want to get to the "iframe" using something like:
framLogin = driver.find_element_by_xpath('//[local-name()="campo-captura"][@*[local-name()="id-hecho-plantilla" and .="ar_pros_CorporateStructure_11933a35-3932-44c0-b394-f0ebd4f722d2"]]/div[2]/div/div/iframe')
The message I get is
Given xpath expression ... is invalid: SyntaxError: Document.evaluate: The expression is not a legal expression.
I've already looked for more information but all I have found is not for Python.
I'm aware I could get to the iframe by using its id, but later on I want to write a loop to scrape the same tables from other URLs with the exact same format, and the iframe's id is not constant.
Your immediate syntax error can be fixed by changing
//[local-name()="campo-captura"]
to
//*[local-name()="campo-captura"]
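The only change is the * added after //: an XPath step needs a node test (here *, meaning any element) before its predicates.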
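Since the goal is to repeat the lookup on other pages where only the id-hecho-plantilla value is known, the corrected expression could be wrapped in a small helper. This is only a sketch under assumptions: build_iframe_xpath and switch_to_fact_iframe are hypothetical names, the trailing //iframe (any descendant iframe) stands in for the positional /div[2]/div/div/iframe path, and it keeps the find_element_by_xpath call from the question (newer Selenium releases dropped that method in favour of driver.find_element(By.XPATH, ...)).

```python
def build_iframe_xpath(fact_id):
    """Build an XPath for the iframe nested inside the campo-captura
    element whose id-hecho-plantilla attribute equals fact_id.

    Assumption: fact_id contains no double quotes.
    """
    return (
        '//*[local-name()="campo-captura"]'
        f'[@*[local-name()="id-hecho-plantilla" and .="{fact_id}"]]'
        '//iframe'
    )


def switch_to_fact_iframe(driver, fact_id):
    """Locate the iframe for fact_id and switch the driver into it."""
    frame = driver.find_element_by_xpath(build_iframe_xpath(fact_id))
    driver.switch_to.frame(frame)
    return frame
```

Inside a loop over URLs this would be called as switch_to_fact_iframe(driver, "ar_pros_CorporateStructure_11933a35-3932-44c0-b394-f0ebd4f722d2"), followed by driver.switch_to.default_content() once the table has been read.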
I want to get all text inside a div with xpath
Here is the HTML code:
<div class="JobDescriptionsc__DescriptionContainer-sc-1jylha1-2 dGyoDf">
<div class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf">
<div class="DraftEditor-root">
<div class="DraftEditor-editorContainer">
<div class="public-DraftEditor-content" contenteditable="false" spellcheck="false" style="outline:none;user-select:text;-webkit-user-select:text;white-space:pre-wrap;word-wrap:break-word">
<div data-contents="true">
#Here the all text
<div class="" data-block="true" data-editor="d54la" data-offset-key="bhkoa-0-0">
<div data-offset-key="bhkoa-0-0" class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr">
<span data-offset-key="bhkoa-0-0" style="font-weight:bold">
<span data-text="true">Job Description:</span>
</span>
</div>
</div>
<div class="" data-block="true" data-editor="d54la" data-offset-key="51e5u-0-0">
<div data-offset-key="51e5u-0-0" class="public-DraftStyleDefault-block public-DraftStyleDefault-ltr">
<span data-offset-key="51e5u-0-0">
<span data-text="true">· Identify & developed application base on predefined business requirements.</span>
</span>
</div>
</div>
...
#there's more, I'm just showing you a few
</div>
</div>
</div>
</div>
</div>
</div>
This is my XPath code:
dom_job.xpath('//*[@class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf"]//text()')
I need all the text inside the parent div using XPath. Is that possible?
I'm assuming the Python module which provides your XPath interpreter supports XPath version 1. Your XPath expression below returns the set of all text nodes which are descendants of the div element:
//*[@class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf"]//text()
You should be able to iterate over that collection of text nodes in Python and concatenate them into a single string.
But it's simpler, if you want the concatenated value of the text nodes within a particular div, to just apply the XPath string() function to the div; e.g.:
string(//*[@class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf"])
See https://www.w3.org/TR/1999/REC-xpath-19991116/#function-string
Note that, in XPath 1, if you apply the string() function to a larger set of nodes (such as the set of text nodes returned by your first query), the function will return the string value of just the first node.
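Both approaches can be tried without a browser. The sketch below uses only the standard library's xml.etree.ElementTree on a trimmed, made-up copy of the markup (SAMPLE is an assumption, not the real page); ElementTree has no string() function, so itertext() plays that role here. With lxml, which the dom_job.xpath call suggests (also an assumption), passing the string(...) expression to .xpath() should give back a plain Python string directly.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for the real page.
SAMPLE = """<div class="outer">
  <div class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf">
    <div><span><span data-text="true">Job Description:</span></span></div>
    <div><span><span data-text="true">Identify requirements.</span></span></div>
  </div>
</div>"""

root = ET.fromstring(SAMPLE)
container = root.find(
    './/div[@class="DraftEditorContainersc__DraftEditorContainer-sc-1x4uima-0 cGUaQf"]'
)

# Equivalent of //text(): every descendant text node, in document order,
# with whitespace-only nodes dropped.
pieces = [text.strip() for text in container.itertext() if text.strip()]

# Equivalent of concatenating the nodes yourself, one block per line.
job_text = "\n".join(pieces)
print(job_text)
```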
I am trying to use selenium to loop through a list of properties on a web page and return the property address and auction time. I have the following python code so far and html for the web page below.
I'm able to return the links to every property in the list, but can't seem to return the values I need from the "H4" tags. I think I'm doing something wrong with getting the elements by XPath but I can't seem to figure it out.
Any help would be greatly appreciated!
HTML:
<div data-elem-id="asset_list_content">
<a href="/details/123-memory-lane">
<div data-elm-id="asset_2352111_address" class="styles__address-container--2l39p styles__u-mr-1--3qZyj">
<h4 data-elm-id="asset_2352111_address_content_1" class="styles__asset-font-big--vQU7K">123 memory-lane</h4>
<label data-elm-id="asset_2352111_address_content_2" class="styles__asset-font-small--2JgrX">POWDER SPRINGS, GA 30127, Cobb County</label>
</div>
<div class="styles__auction-container--45DZU styles__u-ml-1--34mF_">
<h4 data-elm-id="asset_2352111_auction_date" class="styles__asset-font-big--vQU7K">Apr 04, 10:00am</h4>
</div>
</a>
<a href="/details/456-memory-lane">
<div data-elm-id="asset_8463157_address" class="styles__address-container--2l39p styles__u-mr-1--3qZyj">
<h4 data-elm-id="asset_8463157_address_content_1" class="styles__asset-font-big--vQU7K">456 memory-lane</h4>
<label data-elm-id="asset_8463157_address_content_2" class="styles__asset-font-small--2JgrX">POWDER SPRINGS, GA 30127, Cobb County</label>
</div>
<div class="styles__auction-container--45DZU styles__u-ml-1--34mF_">
<h4 data-elm-id="asset_8463157_auction_date" class="styles__asset-font-big--vQU7K">March 10, 10:00am</h4>
</div>
</a>
</div>
Python (Selenium):
propertyList = browser.find_elements_by_xpath('//div[@data-elm-id="asset_list_content"]')
for element in propertyList:
    propertyLinks = element.find_elements_by_tag_name('a')
    for propertyLink in propertyLinks:
        propertyAddress = propertyLink.get_element_by_xpath('//h4[1]')
        propertyAuctionTime = propertyLink.get_element_by_xpath('//h4[2]')
        print(propertyAddress).text
        print(propertyAuctionTime).text
Output:
propertyAddress = propertyLink.get_element_by_xpath('//h4[1]')
AttributeError: 'WebElement' object has no attribute 'get_element_by_xpath'
The error occurs because get_element_by_xpath() is not a WebElement method. Earlier in your code you used find_elements_by_xpath(); to find the elements you are looking for, use its singular counterpart, which returns a single element: find_element_by_xpath().
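Beyond the method name, two other details in the loop are worth a look: an XPath starting with // searches from the document root even when called on a link, and print(x).text reads .text from print's return value (None) instead of from the element. A possible reworking is sketched below; collect_auctions, ADDRESS_XPATH and AUCTION_XPATH are hypothetical names, and the locators assume that every property link points at a /details/ URL and that the data-elm-id suffixes are stable.

```python
# Relative XPaths: the leading "." keeps the search inside the current link.
ADDRESS_XPATH = './/h4[contains(@data-elm-id, "_address_content_1")]'
AUCTION_XPATH = './/h4[contains(@data-elm-id, "_auction_date")]'


def collect_auctions(browser):
    """Return a list of (address, auction_time) tuples, one per property link."""
    results = []
    # Assumption: each property is wrapped in an <a href="/details/...">.
    links = browser.find_elements_by_xpath('//a[starts-with(@href, "/details/")]')
    for link in links:
        address = link.find_element_by_xpath(ADDRESS_XPATH).text
        auction_time = link.find_element_by_xpath(AUCTION_XPATH).text
        results.append((address, auction_time))
    return results
```

It would be used as: for address, auction_time in collect_auctions(browser): print(address, auction_time).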
Let's say I have the following HTML code:
<div class="12">
<div class="something"></div>
</div>
<div class="12">
<div class="34">
<span>TODAY</span>
</div>
</div>
<div class="12">
<div class="something"></div>
</div>
<div class="12">
<div class="something"></div>
</div>
Now if I use driver.find_elements_by_class_name("something"), I get every element with that class in the HTML. But I only want the elements that appear after a specific word ("TODAY"). How can I exclude the ones that appear before that word? The divs that follow could be at any nesting level.
You can use search by XPath as below:
driver.find_elements_by_xpath('//*/text()[.="some specific word"]/following-sibling::div[@class="something"]')
Note that you may need to adjust this if your real HTML differs from the simplified HTML you provided.
Update
Replace following-sibling with following if the required div nodes are not siblings of the text node:
driver.find_elements_by_xpath('//*/text()[.="some specific word"]/following::div[@class="something"]')
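To see which nodes the following axis would pick for the sample HTML without starting a browser, the same document-order rule can be imitated with the standard library. This is a rough sketch: xml.etree.ElementTree does not support the following axis, so elements_after_marker (a hypothetical helper) walks the tree in document order instead, and the body wrapper in SAMPLE is added only to make the snippet well-formed.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
  <div class="12"><div class="something"></div></div>
  <div class="12"><div class="34"><span>TODAY</span></div></div>
  <div class="12"><div class="something"></div></div>
  <div class="12"><div class="something"></div></div>
</body>"""


def elements_after_marker(root, marker_text, tag, class_name):
    """Collect tag.class_name elements that start after the marker text."""
    seen_marker = False
    matches = []
    for node in root.iter():  # iter() walks the tree in document order
        if not seen_marker:
            if (node.text or "").strip() == marker_text:
                seen_marker = True
            continue
        if node.tag == tag and node.get("class") == class_name:
            matches.append(node)
    return matches


root = ET.fromstring(SAMPLE)
after_today = elements_after_marker(root, "TODAY", "div", "something")
print(len(after_today))  # prints 2: the first "something" div is skipped
```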
I am trying to use selenium to grab text data from a page.
Printing the html attributes:
element = driver.find_element_by_id("divresults")
Results:
print(element.get_attribute('innerHTML'))
<div id="divDesktopResults"> </div>
Results:
print(element.get_attribute('outerHTML'))
<div id="divresults" data-bind="html:resultsContent"><div id="divDesktopResults"> </div></div>
Tried grabbing this element
Results:
driver.find_element_by_css_selector("span[class='glyphicon glyphicon-tasks']")
Message: no such element: Unable to locate element: {"method":"css selector","selector":"span[class='glyphicon glyphicon-tasks']"}
This is the code as copied from the browser. There is much more below 'divresults' that did not show up in the innerHTML printout:
<div id="divresults" data-bind="html:resultsContent">
<div>
<div class="row" style="font-size:8pt;">
<a data-toggle="tooltip" style="text-decoration:underline" href="#pdfviewer?ID=D218101736">
<strong>D218101736 </strong>
<span class="glyphicon glyphicon-new-window"></span>
</a>
<div class="btn-group" style="font-size:8pt;margin-left:10px;" id="btnD218101736">
<span style="display:none;font-size:8pt;" id="lblD218101736"> Added To Cart</span>
<button type="button" style="font-size:8pt;" class="btn btn-primary dropdown-toggle" data-toggle="dropdown"> Add To Cart
<span class="caret"></span>
</button>
<ul class="dropdown-menu" role="menu">
<li> <strong>Regular ($7.00)</strong> </li>
<li> <strong>Certified ($12.00)</strong> </li>
</ul>
</div>
</div> <br>
<ul class="nav nav-tabs compact">
<li class="active">
<a data-toggle="tab" href="#D218101736_Doc">
<span class="glyphicon glyphicon-file"></span>
<span>Doc Info</span>
</a>
</li>
<li class="hidden-xs">
<a data-toggle="tab" href="#D218101736_Thumbnail">
<span class="glyphicon glyphicon-th-large"></span>
<span>Thumbnail</span>
</a>
</li>
....
How do I get the data beneath divresults in this instance?
My guess is that it's one of two things:
1. There is more than one element that matches that locator. To investigate this, try using $$("#divresults") in the dev console and make sure that it returns 1. If it returns more than one, run $$("#divresults")[0] and make sure the element returned is the one you want. If it is, go on to step 2. If it isn't, you will need to find a locator that is more specific. If you want our help, you will need to provide a link to the page or more of the surrounding HTML to the desired element.
2. You need to add a wait so that the contents of the element can finish loading. You could wait for a locator like #divresults strong or any number of locators to find some of the elements that were missing. You would wait for them to be visible (or at least present). See the docs for more info and options.
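The waiting idea from step 2 can be sketched as a plain polling loop. Selenium ships its own WebDriverWait and expected_conditions helpers for this, which are normally the better choice; wait_for_elements below is a hypothetical stand-in that only illustrates the mechanism, using the same find_elements_by_css_selector style as the question. The timeout and polling interval are arbitrary assumptions.

```python
import time


def wait_for_elements(driver, css_selector, timeout=10.0, poll_interval=0.5):
    """Poll until css_selector matches at least one element, then return
    the matches. Raises TimeoutError if nothing shows up in time."""
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements_by_css_selector(css_selector)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"No element matched {css_selector!r} within {timeout} s"
            )
        time.sleep(poll_interval)
```

For the page in the question this would be called as wait_for_elements(driver, "#divresults strong") before reading innerHTML again.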
Below represents the element tag from where I got the element by its CSS selector "#xwt_widget_navigation__slidemenu__SlideOutNavigationButton_1"
<div class="xwtSlideMenuLevel0 hasIcon xwtSlideOutNavigationButton xwtSlideOutNavigationButtonFocused dijitFocused" data-dojo-attach-event="ondijitclick: _onItemClick, onmouseenter: _mouseEnter, onmouseleave: _mouseLeave" id="xwt_widget_navigation__slidemenu__SlideOutNavigationButton_1" widgetid="xwt_widget_navigation__slidemenu__SlideOutNavigationButton_1">
<div class="xwtSlideOutNavigationButtonInner" tabindex="0" role="button" data-dojo-attach-point="buttonNode,focusNode" title="Topology">
<div class="xwtSlideOutNavigationButtonIcon topologyIcon" data-dojo-attach-point="iconNode">
</div>
<div class="xwtSlideOutNavigationButtonTitle" data-dojo-attach-point="titleNode">
Topology
</div>
<div class="xwtSlideOutNavigationArrow" data-dojo-attach-point="arrowNode">
<span class="xwtSlideOutNavigationArrowIcon"></span>
</div>
</div>
</div>
The problem is that when I run:
topo = driver.find_element_by_css_selector("xwt_widget_navigation__slidemenu__SlideOutNavigationButton_1")
topo.click()
This returns None, with no error of any sort. I am running this from a Linux command line and so am using pyvirtualdisplay. A screenshot doesn't show that anything happened, but when I step through the code with pdb.set_trace() it works. I have looked at the examples on StackOverflow and elsewhere but couldn't find anything similar or helpful. Can someone tell me what I am doing wrong?
I think you are missing a # in your css_selector: you are selecting by element id, and without the # the string is read as a tag name rather than an id. Try the following:
topo = driver.find_element_by_css_selector("#xwt_widget_navigation__slidemenu__SlideOutNavigationButton_1")
Or
topo = driver.find_element_by_id("xwt_widget_navigation__slidemenu__SlideOutNavigationButton_1")