Extracting the value of an attribute with a hyphen using Beautiful Soup 4 - python

I am traversing through my tags using Beautiful Soup 4. I have the following tag contents and am unable to extract the attribute value of the 'data-event-name' attribute. I want '15:02' from this.
This is the HTML I need to extract 15:02 from.
I have tried many, many things but am unable to get this value. I tried using the re package, Python's getattr, find, find_all, etc. This is one example of something I tried:
for racemeetnum, r1_a in enumerate(r1, start=1):
    event1 = getattr(r1_a, 'data-event-name')  # doesn't work

Thank you @Jack Fleming. I managed to sort this out last night. In the end my issue wasn't that I couldn't find the attribute; it was that I wasn't trapping the errors when the attribute wasn't found. I surrounded the code with a try/except and it worked fine.
Thanks for responding!
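For reference, here is a minimal, self-contained sketch of the dictionary-style attribute access Beautiful Soup uses for hyphenated attributes, wrapped in the try/except described above. The a tag and the sample markup are assumptions; the original question does not show the surrounding HTML.

from bs4 import BeautifulSoup

# Stand-in for the markup in the question (the real tag name and structure
# are not shown, so a simple <a> element is assumed here)
html = '<a class="event" data-event-name="15:02">Race 1</a>'
soup = BeautifulSoup(html, 'html.parser')

for racemeetnum, r1_a in enumerate(soup.find_all('a'), start=1):
    try:
        # Hyphenated attribute names are read like dictionary keys,
        # not with getattr()
        event1 = r1_a['data-event-name']
    except KeyError:
        # Attribute missing on this tag; skip it, as the accepted fix does
        continue
    print(racemeetnum, event1)  # -> 1 15:02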

Related

How to get data-timestamp using python/selenium

Below is the html of the table I want to extract the data-timestamp from.
The webpage is at https://nl.soccerway.com/national/argentina/primera-division/20182019/regular-season/r47779/matches/?ICID=PL_3N_02
So far I have tried various variants I found on here, but nothing seemed to work. Can someone help me to extract (for example) 1536962400? In other words, I want to extract every data-timestamp value of the table. Any suggestions are more than welcome! I have used Selenium/Python to extract table data from the website, but data-timestamp always gives errors.
data-timestamp is an attribute of the tr element; you can try this:
element_list = driver.find_elements_by_xpath("//table[contains(@class, 'matches')]/tbody/tr")
for items in element_list:
    print(items.get_attribute('data-timestamp'))
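If the errors come from reading the attribute before the rows have rendered, an explicit wait in front of the same XPath may help. This is only a sketch under that assumption; the thread itself does not say what the errors were.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://nl.soccerway.com/national/argentina/primera-division/20182019/"
           "regular-season/r47779/matches/?ICID=PL_3N_02")

# Wait (up to 10 s) until at least one match row is present before reading it
rows = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, "//table[contains(@class, 'matches')]/tbody/tr")
    )
)
timestamps = [row.get_attribute('data-timestamp') for row in rows]
print(timestamps)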

Using Xpath to get the anchor text of a link in Python when the link has no class

(disclaimer: I only vaguely know python & am pretty new to coding)
I'm trying to get the text part of a link, but it doesn't have a specific class, and depending on how I word my code I get either way too many things (the XPath wasn't specific enough) or an empty list.
A screenshot of what I'm trying to access is:
tree is all the HTML from the page.
The code that returns a blank is:
cardInfo = tree.xpath('div[@class="cardDetails"]/table/tbody/tr/td[2]/a/text()')
The code that returns way too much:
cardInfo = tree.xpath("a[contains(@href, 'domain_name')]/text()")
I tried going into Inspect in Chrome and copying the XPath, which also gave me nothing. I've successfully gotten other things out of the page that are just plain text, not links. Super sorry if I didn't explain this well, but does anyone have an idea of what I can write?
If you meant to find the text next to Set Name:, then:
>>> import lxml.html
>>> tree = lxml.html.parse('http://shop.tcgplayer.com/pokemon/jungle/nidoqueen-7')
>>> tree.xpath(".//b[text()='Set Name:']/parent::td/following-sibling::td/a/text()")
['Jungle']
.//b[text()='Set Name:'] finds the b tag with the text Set Name:,
parent::td moves to its parent td element,
following-sibling::td selects the td element that follows it.

Selenium - can't get text from span element

I'm very confused by getting text using Selenium.
There are span tags with some text inside them. When I search for them using driver.find_element_by_..., everything works fine.
But the problem is that I can't get the text from it.
The span tag is definitely found, because I can use the .get_attribute('outerHTML') command and see this:
<span class="branding">ThrivingHealthy</span>
But if I change .get_attribute('outerHTML') to .text it returns empty text which is not correct as you can see above.
Here is the example (the outputs are pieces of a dictionary):
display_site = element.find_element_by_css_selector('span.branding').get_attribute('outerHTML')
'display_site': u'<span class="branding">ThrivingHealthy</span>'
display_site = element.find_element_by_css_selector('span.branding').text
'display_site': u''
As you can clearly see, there is text, but .text does not find it. What could be wrong?
EDIT: I've found a kind of workaround. I've just changed .text to .get_attribute('innerText').
But I'm still curious why it works this way?
The problem is that there are a LOT of tags that are fetched using span.branding. When I just queried that page using find_elements (plural), it returned 20 tags. Each tag seems to be doubled... I'm not sure why, but my guess is that one set is hidden while the other is visible. From what I can tell, the first of each pair is hidden.

That's probably why you aren't able to pull text from it. Selenium's design is to not interact with elements that a user can't interact with. That's likely why you can get the element, but when you try to pull text, it doesn't work.

Your best bet is to pull the entire set with find_elements and then just loop through the set getting the text. You will loop through like 20 and only get text from 10, but it looks like you'll still get the entire set anyway. It's weird, but it should work.
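A short sketch of the loop described above. The span.branding selector comes from the thread; the driver setup and URL are placeholders, since the actual site is not named.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://example.com')  # placeholder URL; the real page is not given in the thread

# Pull every span.branding and keep only the ones that expose visible text
branding_texts = []
for span in driver.find_elements_by_css_selector('span.branding'):
    text = span.text
    if text:  # hidden duplicates return an empty string
        branding_texts.append(text)

print(branding_texts)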

Getting text from an enclosure tag with XPath

I just started learning XPath, lxml, and python this week. I would like to know if there is an XPath expression that will get the url text out of an enclosure tag. Is this possible?
The XML I'm dealing with:
<enclosure url="http://nakeddiscovery.com/libsyn/Naked_Scientists_Show_17.08.14.mp3" length="51671039" type="audio/mpeg" ></enclosure>
I can get a list with this statement:
urls = parsedFeed.xpath('//enclosure/@url')
then loop through the list, but I was hoping for a more direct way. I've tried //enclosure/@url/text(), //enclosure/@url.text, etc., and many variations of that. I have Googled it and searched Stack Overflow, but nada.
Any ideas?
Thanks!
You can try using an index:
parsedFeed.xpath('//enclosure/@url')[0]
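For completeness, a self-contained sketch: the surrounding rss/channel/item wrapper is assumed (only the enclosure element comes from the question). It shows that the @url step already returns plain strings, so indexing the list is all that is needed.

from lxml import etree

xml = ('<rss><channel><item>'
       '<enclosure url="http://nakeddiscovery.com/libsyn/Naked_Scientists_Show_17.08.14.mp3"'
       ' length="51671039" type="audio/mpeg"/>'
       '</item></channel></rss>')
parsedFeed = etree.fromstring(xml)

# //enclosure/@url already yields strings, so no text() step is needed
urls = parsedFeed.xpath('//enclosure/@url')
print(urls[0])  # http://nakeddiscovery.com/libsyn/Naked_Scientists_Show_17.08.14.mp3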

Searching on class names with a space (\s) Python lxml

I wonder if anybody can help :)
I am using python lxml and cssselector to scrape data from HTML pages.
I can select most classes with ease using this method and find it very convenient, but I am having a problem selecting class names that contain a space.
For example, I want to extract Blah blah from the following element:
<li class="feature height">Blah blah</li>
I have tried using the following CSS selectors without success (the whole path is not included, as this is not the problem):
li.feature.height
li.feature height
li.feature:height
Anybody know how to do this? I can't find the answer and am sure it must be a fairly common thing that people need to do...
I cannot just select the parent element
li.feature
as the data is not in the same order on different pages; the same applies to nth-element selections...
I've been scratching my head on this for a while now and have searched a lot; hope somebody knows!
I can work around it by getting the data using regexes, and that works, but I wonder if there is a simpler solution...
Thanks for your help in advance!
Matt
Extra information as requested: it doesn't work because it returns an empty list (or a negative result for a boolean check).
So if I use:
css_9_seed_height = 'html body div.seedicons ul li.feature.height'
# 9. Get seed_height
seed_height_obj = root.cssselect(css_9_seed_height)
print seed_height_obj
This returns an empty list, i.e. the class is not found, but it's there.
You can assume that root.cssselect() works correctly, as I am retrieving lots of other info in the same way.
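Not from the original thread, but a minimal, self-contained sketch suggesting that the compound selector li.feature.height does match a class attribute containing a space; the sample markup here is assumed, so this only narrows the problem down to the surrounding path rather than the selector syntax.

import lxml.html

html = '''
<div class="seedicons">
  <ul>
    <li class="feature height">Blah blah</li>
    <li class="feature weight">Other</li>
  </ul>
</div>
'''
root = lxml.html.fromstring(html)

# class="feature height" means the element carries BOTH classes,
# so chaining them in the selector matches it
matches = root.cssselect('li.feature.height')
print([el.text for el in matches])  # ['Blah blah']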
