I'm very confused about getting text using Selenium.
There are span tags with some text inside them. When I search for them using driver.find_element_by_..., everything works fine.
But the problem is that I can't get the text out of them.
The span tag is definitely found, because I can call .get_attribute('outerHTML') on it and see this:
<span class="branding">ThrivingHealthy</span>
But if I change .get_attribute('outerHTML') to .text, it returns an empty string, which is not correct as you can see above.
Here is the example (outputs are pieces of dictionary):
display_site = element.find_element_by_css_selector('span.branding').get_attribute('outerHTML')
'display_site': u'<span class="branding">ThrivingHealthy</span>'
display_site = element.find_element_by_css_selector('span.branding').text
'display_site': u''
As you can clearly see, there is text in the element, but .text does not find it. What could be wrong?
EDIT: I've found a kind of workaround: I just changed .text to .get_attribute('innerText').
But I'm still curious why it works this way.
The problem is that there are a LOT of tags matched by span.branding. When I queried that page using find_elements (plural), it returned 20 tags. Each tag seems to be duplicated: my guess is that one of each pair is hidden while the other is visible, and from what I can tell the hidden one comes first.

That's probably why you aren't able to pull text from it. Selenium is designed not to interact with elements that a user can't interact with, which is likely why you can locate the element but get nothing back when you ask for its text.

Your best bet is to pull the entire set with find_elements and then loop through it, getting the text from each element. You will loop through about 20 elements and only get text from 10, but you should still end up with the full set of visible values. It's weird, but it works.
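A minimal sketch of that loop (assuming driver is an existing WebDriver instance and the same span.branding selector as in the question; skipping empty strings is just one way to drop the hidden duplicates):

texts = []
for el in driver.find_elements_by_css_selector('span.branding'):
    value = el.text.strip()
    if value:  # hidden duplicates come back as empty strings from .text
        texts.append(value)
print(texts)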
Related
I'm currently trying to scrape data from a website, and for that I'm using Selenium.
Everything was working as it should, until I realised I have to scrape a tooltip text.
I have already found different threads on Stack Overflow that provide an answer, but so far I have not managed to solve this issue.
After a few hours of frustration I realised the following:
This span has nothing to do with the tooltip, I guess, because the tooltip looks like this:
There is actually a span that I can't read. I try to read it like this:
bewertung = driver.find_elements_by_xpath('//span[@class="a-icon-alt"]')
for item in bewertung:
    print(item.text)
So Selenium finds this element, but unfortunately .text returns nothing. Why is it always empty?
And what is the span from the first screenshot for? By the way, it is not displayed on the website either.
Since you've mentioned that Selenium finds this element, I would assume you have printed the length of the bewertung list,
something like
print(len(bewertung))
If this list has some elements in it, you could probably use innerText:
bewertung = driver.find_elements_by_xpath('//span[@class="a-icon-alt"]')
for item in bewertung:
    print(item.get_attribute("innerText"))
Note that you are using find_elements, which won't throw any error; if it does not find anything, it will simply return an empty list.
If you used find_element instead, it would throw the exact error.
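For illustration, a minimal sketch of that difference (assuming driver is an existing WebDriver instance; the exact exception message depends on your Selenium version):

from selenium.common.exceptions import NoSuchElementException

# find_elements: returns an empty list when nothing matches, no exception
matches = driver.find_elements_by_xpath('//span[@class="a-icon-alt"]')
print(len(matches))  # 0 if the locator matches nothing

# find_element: raises NoSuchElementException instead
try:
    driver.find_element_by_xpath('//span[@class="a-icon-alt"]')
except NoSuchElementException as e:
    print(e)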
Also, I think your xpath points to a span which does not appear in the UI (sometimes such elements don't appear until some action is triggered).
You can try to use this xpath instead:
//i[@data-hook='average-stars-rating-anywhere']//span[@data-hook='acr-average-stars-rating-text']
Something like this in code:
bewertung = driver.find_elements_by_xpath("//i[#data-hook='average-stars-rating-anywhere']//span[#data-hook='acr-average-stars-rating-text']")
for item in bewertung:
print(item.text)
I am trying to extract the contents of a specific tag on a webpage by using lxml, namely on Indeed.com.
Example page: link
I am trying to extract the company name and position name. Chrome shows that the company name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/span[1]"
and the position name is located at
"//*[#id='job-content']/tbody/tr/td[1]/div/b/font"
This bit of code tries to extract those values from a locally saved and parsed copy of the page:
import lxml.html as h
xslt_root = h.parse("Temp/IndeedPosition.html")
company = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/span[1]/text()")
position = xslt_root.xpath("//*[@id='job-content']/tbody/tr/td[1]/div/b/font/text()")
print(company)
print(position)
However, the print commands return empty lists, meaning nothing was extracted!
What is going on? Am I using the right tags? I don't think these are dynamically generated since the page loads normally with javascript disabled.
I would really appreciate any help with getting those two values extracted.
Try it like this:
company = xslt_root.xpath("//div[@data-tn-component='jobHeader']/span[@class='company']/text()")
position = xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']//text()")
['The Habitat Company']
['Janitor-A (Scattered Sites)']
Once we have the //div[@data-tn-component='jobHeader'] path, things become pretty straightforward:
select the text of the child span, /span[@class='company']/text(), to get the company name;
/b[@class='jobtitle']//text() is a bit more convoluted, since the job title is embedded in a font tag, but we can just select any descendant text using //text() to get the position.
An alternative is to select the b or font node and use text_content() to get the text (recursively, if needed), e.g.
xslt_root.xpath("//div[@data-tn-component='jobHeader']/b[@class='jobtitle']")[0].text_content()
Despite your assumption, it seems that the content on the page is loaded dynamically and is thus not present at load time.
This means you can't access the elements from your downloaded HTML file (if you do not believe me, try to look for job-content in the actual file on your computer, which will only contain placeholders and descriptors).
It seems you would have to use technologies like Selenium to perform this task.
Again, I want to stress that whatever you are doing (automatically) is a violation of indeed.com's Terms and Conditions, so I would suggest not going too far with this anyway.
(disclaimer: I only vaguely know python & am pretty new to coding)
I'm trying to get the text part of a link, but it doesn't have a specific class, and depending on how I word my code I get either way too many things (the xpath wasn't specific enough) or a blank [ ].
A screenshot of what I'm trying to access is:
tree is all the HTML from the page.
The code that returns a blank is:
cardInfo = tree.xpath('div[@class="cardDetails"]/table/tbody/tr/td[2]/a/text()')
The code that returns way too much:
cardInfo = tree.xpath("a[contains(@href, 'domain_name')]/text()")
I tried going into Inspect in Chrome and copying the xpath, which also gave me nothing. I've successfully gotten other things out of the page that are just plain text, not links. Super sorry if I didn't explain this well, but does anyone have an idea of what I can write?
If you meant to find the text next to Set Name:, try it like this:
>>> import lxml.html
>>> tree = lxml.html.parse('http://shop.tcgplayer.com/pokemon/jungle/nidoqueen-7')
>>> tree.xpath(".//b[text()='Set Name:']/parent::td/following-sibling::td/a/text()")
['Jungle']
.//b[text()='Set Name:'] finds the b tag with the Set Name: text,
parent::td - its parent td element,
following-sibling::td - the following td element.
I asked my previous question here:
Xpath pulling number in table but nothing after next span
That worked, and I managed to see the number I wanted in a Firefox plugin called XPath Checker; the results are shown below.
So I know I can find this number with this xpath, but when I try to run a Python script to find and save the number, it says it cannot find it.
try:
    views = browser.find_element_by_xpath("//div[@class='video-details-inside']/table//span[@class='added-time']/preceding-sibling::text()")
except NoSuchElementException:
    print "NO views"
    views = 'n/a'
    pass
I know that pass is not best practice, but I am just testing this at the moment, trying to find the number. I'm wondering if I need to change something at the end of the xpath, like .text, as XPath Checker normally shows results a little differently, like below:
I needed to use the xpath I gave rather than the one used in the above picture, because I only want the number and not the date. You can see part of the source in my previous question.
Thanks in advance! Scratching my head here.
The xpath used in find_element_by_xpath() has to point to an element, not a text node and not an attribute. This is a critical thing here.
The easiest approach here would be to:
get the td's text (parent)
get the span's text (child)
remove child's text from parent's
Code:
span = browser.find_element_by_xpath("//div[@class='video-details-inside']/table//span[@class='added-time']")
td = span.find_element_by_xpath('..')
views = td.text.replace(span.text, '').strip()
I am on the website
http://www.baseball-reference.com/players/event_hr.cgi?id=bondsba01&t=b
and trying to scrape the data from the tables. When I pull the xpath from one entry, say the pitcher
"Terry Mulholland," I retrieve this:
pitchers = site.xpath("/html/body/div[2]/div[2]/div[6]/table/tbody/tr/td[3]/table/tbody/tr[2]/td/a")
When I try to print pitcher.text for each pitcher in pitchers, I get [] rather than the text. Any idea why?
The problem is that the last tbody doesn't exist in the original source. If you got that xpath via some browser, keep in mind that browsers can guess and add missing elements to make the HTML valid.
Removing the last tbody resolves the problem.
In : import lxml.html as html
In : site = html.parse("http://www.baseball-reference.com/players/event_hr.cgi?id=bondsba01&t=b")
In : pitchers = site.xpath("/html/body/div[2]/div[2]/div[6]/table/tbody/tr/td[3]/table/tr[2]/td/a")
In : pitchers[0].text
Out: 'Terry Mulholland'
But I need to add that the xpath expression you are using is pretty fragile. One div added somewhere along that path and your script breaks. If possible, try to find better references, like an id or a class, that point to your expected location.
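As a purely hypothetical sketch of that advice (the pitcher_table class name is made up for illustration; check the real page source for a stable id or class), a relative xpath anchored on an attribute is far less fragile than an absolute /html/body/... chain:

import lxml.html as html

site = html.parse("http://www.baseball-reference.com/players/event_hr.cgi?id=bondsba01&t=b")

# Hypothetical: anchor on a class or id instead of counting nested divs.
# 'pitcher_table' is an invented class name used only to illustrate the idea.
pitchers = site.xpath("//table[contains(@class, 'pitcher_table')]//td/a")
for pitcher in pitchers:
    print(pitcher.text)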