Python Selenium pulling child property from parent?

I'm trying to scrape a web form for text in specific fields, but I can't do it with a fixed XPath because some forms are missing fields that then aren't included in the page when it loads (i.e. if /html/blah/blah/p[3] is the initials field on one form, it might be the first-name field on another form yet have the same XPath). The structure of the fields is like this:
<p><strong>Initials:</strong> WT</p>
So using Python Selenium I'm doing
driver.find_element_by_xpath("//*[contains(text(), 'Initials:')]"), which does successfully pull the "Initials:" text between the strong tags, but I specifically need the text that follows it, in this case WT. The node has a nextSibling.data property that contains the WT value, but from my googling I don't think it's possible to pull that with Python Selenium. Does anyone know a way to get the WT text following that XPath query?

The 'WT' text is in a weird spot. I don't think it is actually a sibling per se. The only way I know to grab that text would be to use p_element.get_attribute('outerHTML'), which in this instance should return the string '<p><strong>Initials:</strong> WT</p>'. I doubt this is the cleanest solution, but here's a way to parse that text out:
strong_close_tag = '</strong>'
p_close_tag = '</p>'
p_element = driver.find_element_by_xpath("//*[contains(text(), 'Initials:')]/parent::*")
outer_html = p_element.get_attribute('outerHTML')
print(outer_html[outer_html.index(strong_close_tag) + len(strong_close_tag):outer_html.index(p_close_tag)])
OR -- use p_element.get_attribute('innerHTML'), which should return just <strong>Initials:</strong> WT. Then, similarly, grab the text after the </strong> closing tag, maybe like this:
p_element = driver.find_element_by_xpath("//*[contains(text(), 'Initials:')]/parent::*")
print(p_element.get_attribute('innerHTML').split("</strong>", 1)[1])
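Another option, if you'd rather not slice HTML strings: find the strong element itself and let the browser hand back the adjacent text node via execute_script. This is just a sketch in the spirit of the nextSibling idea from the question, not something verified against your exact page:
# Locate the <strong> that holds "Initials:"
strong_el = driver.find_element_by_xpath("//*[contains(text(), 'Initials:')]")
# Read the text node that immediately follows it (" WT") with JavaScript
initials = driver.execute_script("return arguments[0].nextSibling.textContent;", strong_el)
print(initials.strip())  # WT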

Related

Need Selenium to return the class title content of given HTML

I'm using Selenium to perform some web scraping. It logs in to a site, where an HTML table of data is returned with five values at a time. I want Selenium to scrape a particular bit of data off the table, write it to a file, click next, and repeat with the next five.
This is a new automation script. I've tried a myriad of variations of get_attribute, find_elements_by_class_name, etc. Example:
pnum = prtnames.get_attribute("title")
for x in prtnames:
    print(pnum)
Here's the HTML from one of the returned values:
<div class="text-container prtname"><span class="PrtName" title="P011">P011</span></div>
I need to get that "P011" value. Obviously Selenium doesn't have "find_elements_by_title", and there is no HTML id for the value. The Xpath for that line of HTML is:
//*[#id="printerConnectTable"]/tbody/tr[5]/td/table/tbody/tr[1]/td[2]/div/span
But I don't see a reference to "title" or "P011" in that Xpath.
pnum = prtnames.get_attribute("title")
AttributeError: 'list' object has no attribute 'get_attribute'
It's like get_attribute doesn't exist, but there is some (albeit not much) documentation on it.
Fundamentally I'd like to grab that "P011" value and print to console, then I know Selenium is working with the right data.
P.S. I'm self-taught with all of this, I'm automating a sysadmin task.
I think the problem is that prtnames is a list of elements, not a single element. You can use a list comprehension if you want the title attributes for all of the elements in prtnames.
pnums = [x.get_attribute('title') for x in prtnames]
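Putting it together, a rough sketch based on the HTML snippet above (I'm assuming the spans all carry the class PrtName, so adjust the selector if your table differs):
prtnames = driver.find_elements_by_class_name("PrtName")  # note: find_elements, plural
pnums = [x.get_attribute("title") for x in prtnames]
for pnum in pnums:
    print(pnum)  # e.g. P011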

Extracting HTML tag content with xpath from a specific website

I am trying to extract the contents of a specific tag on a webpage by using lxml, namely on Indeed.com.
Example page: link
I am trying to extract the company name and position name. Chrome shows that the company name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/span[1]"
and the position name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/b/font"
This bit of code tries to extract those values from a locally saved and parsed copy of the page:
import lxml.html as h
xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)
However, both print commands output empty lists, meaning nothing was extracted!
What is going on? Am I using the right tags? I don't think these are dynamically generated since the page loads normally with javascript disabled.
I would really appreciate any help with getting those two values extracted.
Try it like this:
company = xslt_root.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")
['The Habitat Company']
['Janitor-A (Scattered Sites)']
Once we have the //div[@data-tn-component='jobHeader'] path, things become pretty straightforward:
select the text of the child span, /span[@class='company']/text(), to get the company name;
/b[@class='jobtitle']//text() is a bit more convoluted, since the job title is embedded in a font tag, but we can just select any descendant text with //text() to get the position.
An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
xslt_root.xpath("//div[#data-tn-component='jobHeader']/b[#class='jobtitle']")[0].text_content()
Despite your assumption, it seems that the content on the page is loaded dynamically and is therefore not present in the HTML at load time.
This means you can't access the elements from your downloaded HTML file (if you do not believe me, try looking for job-content in the actual file on your computer; it will only contain placeholders and descriptors).
It seems you would have to use technologies like Selenium to perform this task.
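Roughly, that would look like the sketch below (untested; job_url stands in for the job page you were saving locally), after which the jobHeader-based XPaths from the other answer should work against the rendered DOM:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(job_url)  # the Indeed job page you were parsing from disk
company = driver.find_element_by_xpath("//div[@data-tn-component='jobHeader']/span[@class='company']").text
position = driver.find_element_by_xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']").text
print(company, position)
driver.quit()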
Again, I want to stress that whatever you are doing (automatically) is a violation of indeed.com's Terms and Conditions, so I would suggest not going too far with this anyway.

Using Xpath to get the anchor text of a link in Python when the link has no class

(disclaimer: I only vaguely know python & am pretty new to coding)
I'm trying to get the text part of a link, but it doesn't have a specific class, and depending on how I word my code I get either way too many things (the xpath wasn't specific enough) or a blank [ ].
A screenshot of what I'm trying to access is:
tree is all the HTML from the page.
The code that returns a blank is:
cardInfo = tree.xpath('div[@class="cardDetails"]/table/tbody/tr/td[2]/a/text()')
The code that returns way too much:
cardInfo = tree.xpath("a[contains(@href, 'domain_name')]/text()")
I tried going into Inspect in chrome and copying the xpath, which also gave me nothing. I've successfully gotten other things out of the page that are just plain text, not links. Super sorry if I didn't explain this well but does anyone have an idea of what I can write?
If you meant to find the text next to "Set Name:":
>>> import lxml.html
>>> tree = lxml.html.parse('http://shop.tcgplayer.com/pokemon/jungle/nidoqueen-7')
>>> tree.xpath(".//b[text()='Set Name:']/parent::td/following-sibling::td/a/text()")
['Jungle']
.//b[text()='Set Name:'] finds the b tag with the Set Name: text,
parent::td selects its parent td element,
following-sibling::td selects the following td element.

Selenium - cant get text from span element

I'm very confused by getting text using Selenium.
There are span tags with some text inside them. When I search for them using driver.find_element_by_..., everything works fine.
But the problem is that the text can't be retrieved from them.
I know the span tag is found because I can use the .get_attribute('outerHTML') command and see this:
<span class="branding">ThrivingHealthy</span>
But if I change .get_attribute('outerHTML') to .text, it returns empty text, which is not correct as you can see above.
Here is the example (outputs are pieces of dictionary):
display_site = element.find_element_by_css_selector('span.branding').get_attribute('outerHTML')
'display_site': u'<span class="branding">ThrivingHealthy</span>'
display_site = element.find_element_by_css_selector('span.branding').text
'display_site': u''
As you can clearly see, there is text, but .text does not find it. What could be wrong?
EDIT: I've found a kind of workaround. I've just changed .text to .get_attribute('innerText').
But I'm still curious why it works this way.
The problem is that there are a LOT of tags that are fetched using span.branding. When I just queried that page using find_elements (plural), it returned 20 tags. Each tag seems to be doubled... I'm not sure why but my guess is that one set is hidden while the other is visible. From what I can tell, the first of the pair is hidden. That's probably why you aren't able to pull text from it. Selenium's design is to not interact with elements that a user can't interact with. That's likely why you can get the element but when you try to pull text, it doesn't work. Your best bet is to pull the entire set with find_elements and then just loop through the set getting the text. You will loop through like 20 and only get text from 10 but it looks like you'll still get the entire set anyway. It's weird but it should work.
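Something like this minimal sketch of that loop, reusing the span.branding selector from the question:
spans = element.find_elements_by_css_selector('span.branding')  # plural: grab every match
visible_texts = [s.text for s in spans if s.text]  # the hidden duplicates come back empty
print(visible_texts)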

trouble getting text from xpath entry in python

I am on the website
http://www.baseball-reference.com/players/event_hr.cgi?id=bondsba01&t=b
and trying to scrape the data from the tables. When I pull the xpath from one entry, say the pitcher
"Terry Mulholland," I retrieve this:
pitchers = site.xpath("/html/body/div[2]/div[2]/div[6]/table/tbody/tr/td[3]/table/tbody/tr[2]/td/a")
When I try to print pitcher.text for each pitcher in pitchers, I get [] rather than the text. Any idea why?
The problem is that the last tbody doesn't exist in the original source. If you got that XPath from a browser, keep in mind that browsers can guess and add missing elements to make the HTML valid.
Removing the last tbody resolves the problem.
In : import lxml.html as html
In : site = html.parse("http://www.baseball-reference.com/players/event_hr.cgi?id=bondsba01&t=b")
In : pitchers = site.xpath("/html/body/div[2]/div[2]/div[6]/table/tbody/tr/td[3]/table/tr[2]/td/a")
In : pitchers[0].text
Out: 'Terry Mulholland'
But I need to add that the XPath expression you are using is pretty fragile. One div added in some convenient place and you have a broken script. If possible, try to find better anchors, like an id or class, that point to your expected location.
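Purely as a hypothetical illustration (I haven't checked which ids that page actually exposes), an id-anchored expression is far less likely to break when a div gets added somewhere:
# Hypothetical: if the table carried id="hr_table", extra wrapper divs wouldn't matter
pitchers = site.xpath("//table[@id='hr_table']//tr[2]/td/a")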
