Searching on class names with a space (\s) with Python lxml

I wonder if anybody can help :)
I am using python lxml and cssselector to scrape data from HTML pages.
I can select most classes with ease using this method and find it very convenient, but I am having a problem selecting class names that contain a space.
For example I want to extract Blah from the following class:
<li class="feature height">Blah blah</li>
I have tried using the following CSS selectors without success (the whole path is not included, as that is not the problem):
li.feature.height
li.feature height
li.feature:height
Anybody know how to do this? I can't find the answer and am sure it must be a fairly common thing that people need to do...
I cannot just select the parent element
li.feature
as the data is not in the same order on different pages, same applies for nth element selections...
Been scratching my head on this a while now and have searched a lot; hope somebody knows!
I can work around it by getting the data using re's and that works but I wonder if there is a simple solution...
Thanks for your help in advance!
Matt
Extra information as requested: it doesn't work because it returns an empty list (or a negative result for a boolean test).
So if I use the following:
css_9_seed_height = 'html body div.seedicons ul li.feature.height'
# 9. Get seed_height
seed_height_obj = root.cssselect(css_9_seed_height)
print seed_height_obj
This returns an empty list, i.e. the class is not found, but it's there.
You can assume that root.cssselect() works correctly, as I am retrieving lots of other info in the same way.
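(For reference: li.feature.height is the correct compound selector here, since class="feature height" means the element carries two classes, feature and height. A stdlib-only sketch of the matching rule that selector relies on; the snippet and helper name are illustrative, not from the question:)

```python
import xml.etree.ElementTree as ET

# Illustrative markup modelled on the question's <li class="feature height">
snippet = '<ul><li class="feature height">Blah blah</li><li class="feature">Other</li></ul>'
root = ET.fromstring(snippet)

def has_classes(el, *wanted):
    # class="" holds a space-separated token list; an element "has" a class
    # when the name appears as a whole token -- this is what li.feature.height tests
    tokens = el.get('class', '').split()
    return all(w in tokens for w in wanted)

matches = [li for li in root.iter('li') if has_classes(li, 'feature', 'height')]
print(matches[0].text)  # -> Blah blah
```

If this rule matches in a minimal case but root.cssselect() still returns [], the problem usually lies elsewhere in the path, not in the chained-class syntax.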

Related

I'm having trouble selecting an element using Selenium with Python

I want to read out the text in this HTML element using Selenium with Python. I just can't find a way to find or select it without using the text (I don't want that, because its content changes).
<div font-size="14px" color="text" class="sc-gtsrHT jFEWVt">0.101 ONE</div>
Do you have an idea how I could select it? The conventional ways listed in the documentation don't seem to work for me. To be honest, I'm not very good with HTML, which doesn't make things any easier.
Thank you in advance
Try this :
element = browser.find_element_by_class_name('sc-gtsrHT jFEWVt').text
Or use a loop if you have several elements :
elements = browser.find_elements_by_class_name('sc-gtsrHT jFEWVt')
for e in elements:
print(e.text)
print(browser.find_element_by_xpath("//*[@class='sc-gtsrHT jFEWVt']").text)
You could simply grab it by XPath on the class attribute. There are two class names here, while find_element_by_class_name only accepts one.
That works if the class name isn't dynamic; otherwise you'd have to right-click and copy the XPath, or find a unique identifier.
Find by XPath, as long as the font-size and color attributes are consistent. It would be like:
//div[@font-size='14px' and @color='text' and starts-with(@class,'sc-')]
I guess the class name is random?
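In general, find_element_by_class_name takes a single class token; a compound class is usually targeted with a CSS selector instead, joining the tokens with dots. A tiny helper showing the conversion (the function name is mine, not Selenium's):

```python
def css_for_classes(*names):
    # Build a CSS selector matching an element that carries all of the given classes
    return '.' + '.'.join(names)

selector = css_for_classes('sc-gtsrHT', 'jFEWVt')
print(selector)  # -> .sc-gtsrHT.jFEWVt
# In Selenium this would be used as:
#   browser.find_element_by_css_selector('.sc-gtsrHT.jFEWVt').text
```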

Using Xpath to get the anchor text of a link in Python when the link has no class

(disclaimer: I only vaguely know python & am pretty new to coding)
I'm trying to get the text part of a link, but it doesn't have a specific class, and depending on how I word my code I get either way too many things (the xpath wasn't specific enough) or a blank [ ].
A screenshot of what I'm trying to access was attached here (not included).
Tree is all the html from the page.
The code that returns a blank is:
cardInfo=tree.xpath('div[@class="cardDetails"]/table/tbody/tr/td[2]/a/text()')
The code that returns way too much:
cardInfo=tree.xpath("//a[contains(@href, 'domain_name')]/text()")
I tried going into Inspect in chrome and copying the xpath, which also gave me nothing. I've successfully gotten other things out of the page that are just plain text, not links. Super sorry if I didn't explain this well but does anyone have an idea of what I can write?
If you meant to find the text next to "Set Name:":
>>> import lxml.html
>>> tree = lxml.html.parse('http://shop.tcgplayer.com/pokemon/jungle/nidoqueen-7')
>>> tree.xpath(".//b[text()='Set Name:']/parent::td/following-sibling::td/a/text()")
['Jungle']
.//b[text()='Set Name:'] finds the b tag with the Set Name: text,
parent::td takes its parent td element,
following-sibling::td takes the following td element.
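The stdlib's ElementTree does not support the parent:: and following-sibling:: axes, but the same "label cell, then the cell after it" walk can be sketched by index. The markup is modelled on the answer's example:

```python
import xml.etree.ElementTree as ET

row = ET.fromstring(
    '<tr><td><b>Set Name:</b></td><td><a href="/jungle">Jungle</a></td></tr>'
)
tds = list(row)
# Find the td whose <b> reads "Set Name:", then take the td right after it
label_idx = next(
    i for i, td in enumerate(tds)
    if td.find('b') is not None and td.find('b').text == 'Set Name:'
)
print(tds[label_idx + 1].find('a').text)  # -> Jungle
```

With lxml available, the XPath axes in the answer above are the more direct route; this is just the same idea without the dependency.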

Python crawler not finding specific Xpath

I asked my previous question here:
Xpath pulling number in table but nothing after next span
This worked and I managed to see the number I wanted in a Firefox plugin called XPath Checker; the results are shown below (screenshot not included).
So I know I can find this number with this XPath, but when I try to run a Python script to find and save the number, it says it cannot find it.
try:
    views = browser.find_element_by_xpath("//div[@class='video-details-inside']/table//span[@class='added-time']/preceding-sibling::text()")
except NoSuchElementException:
    print "NO views"
    views = 'n/a'
    pass
I know that pass is not best practice, but I am just testing this at the moment, trying to find the number. I'm wondering if I need to change something at the end of the XPath, like .text, as XPath Checker normally shows results a little differently, like below (screenshot not included):
I needed to use the XPath I gave rather than the one used in the picture above, because I only want the number and not the date. You can see part of the source in my previous question.
Thanks in advance! Scratching my head here.
The xpath used in find_element_by_xpath() has to point to an element, not a text node and not an attribute. This is a critical thing here.
The easiest approach here would be to:
get the td's text (parent)
get the span's text (child)
remove child's text from parent's
Code:
span = browser.find_element_by_xpath("//div[@class='video-details-inside']/table//span[@class='added-time']")
td = span.find_element_by_xpath('..')
views = td.text.replace(span.text, '').strip()
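The parent-minus-child trick above generalizes beyond Selenium. A stdlib sketch with hypothetical markup shaped like the question's (a count followed by an added-time span; the numbers are invented):

```python
import xml.etree.ElementTree as ET

td = ET.fromstring('<td>1234 <span class="added-time">2 days ago</span></td>')
span = td.find("span[@class='added-time']")

# Take all text under the td, subtract the span's text, keep the remainder
td_text = ''.join(td.itertext())
views = td_text.replace(''.join(span.itertext()), '').strip()
print(views)  # -> 1234
```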

Choosing next relative in Python BeautifulSoup with automation

First of all - I'm creating xml document with python BeautifulSoup.
Currently, what I'm trying to create, is very similar to this example.
<options>
<opt name='string'>ContentString</opt>
<opt name='string'>ContentString</opt>
<opt name='string'>ContentString</opt>
</options>
Notice that each opt tag should carry only one attribute, called name.
As the options can be far more numerous, and differently named as well, I decided to create a little Python function to help me create such a result.
array = ['FirstName','SecondName','ThirdName']
# This list is the guideline for the function: it determines how many options will be in the result and what the option tags will be called.
def create_options(array):
    soup.append(soup.new_tag('options'))
    if len(array) > 0:  # A small sanity check, so you can see if the given array is empty for some reason. Optional.
        for i in range(len(array)):
            soup.options.append(soup.new_tag('opt'))
            # With BeautifulSoup methods, we create opt tags inside the options tag -- exactly as many as there are items in the parsed array.
        counter = 0
        # There's the option to use Python's range() method here, but for testing purposes the current approach is sufficient.
        for tag in soup.options.find_all():
            soup.options.find('opt')['name'] = str(array[counter])
            # Notice that in this part the name is assigned only to the first opt element. We'll discuss this next.
            counter += 1
        print len(array), ' options were created.'
    else:
        print 'No options were created.'
You'll notice that in the function, name assignment is handled by a for loop which, unfortunately, assigns all the different names to the first opt in the options element.
BeautifulSoup has .next_sibling and .previous_sibling, which can help me in this task.
As their names suggest, with them I can access the next or previous sibling of an element. So, by this example:
soup.options.find('opt').next_sibling['name'] = str(array[counter])
We can access the second child of the options element. So, if we add .next_sibling to each soup.options.find('opt'), we can move from the first element to the next.
The problem is that, by finding the option element in options with:
soup.options.find('opt')
each time we access the first option. But my function needs to access the next option for each successive item in the list. So the more items there are in the list, the more .next_sibling calls must be chained onto the first option.
As a result, with the logic I constructed, accessing the relevant option for the 4th or a later item in the list, to assign its appropriate name, would look like this:
soup.options.find('opt').next_sibling.next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
And now we come to my questions:
1st: As I didn't find any other way to do this with Python BeautifulSoup methods, I'm not sure whether my approach is the only one. Is there any other method?
2nd: How could I achieve the result with this approach, given that my experiments show I can't put a variable inside the method chain (so that I could multiply the methods)?
#Like this
thirdoption = .next_sibling.next_sibling.next_sibling
#As well, it's not quite possible, but it's just example.
soup.options.find('opt').next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
3rd: Maybe I read the BeautifulSoup documentation badly and just didn't find a method that could help me with this task?
I managed to achieve the result while ignoring the BeautifulSoup methods.
Python's standard ElementTree methods were sufficient to work with.
So let me show the example code and explain what it does. The comments provide more precise explanations.
"""
Before this code, there goes soup xml document generation. Except part, I mentioned in topic, we just create empty options tags in document, thus, creating almost done document.
Right after that, with this little script, we will use basic python provided element tree methods.
"""
import xml.etree.ElementTree as ET
ET_tree = ET.parse("exported_file.xml")
# Here we import exactly the same file, we opened with soup. Exporting can be done in different file, if you wish.
ET_root = ET_tree.getroot()
for position, opt in enumerate(item.find('options')):
# Position is pretty important, as this will remove 'counter' thing in for loop, I was using in code in first example. Position will be used for getting out exact items from array, which works like template for our option tag names.
opt.set('name', str(array[position]))
opt.text = 'text'
# Same way, with position, we can get data from relevant array, provided, that they are inherited or connected in same way.
tree = ET.ElementTree(ET_root).write('exported_file.xml',encoding="UTF-8",xml_declaration=True)
# This part was something, I researched before quite lot. This code will help save xml document with utf-8 encoding, which is very important.
This approach is pretty inefficient, as I could have used ElementTree for everything to achieve the same result.
Though BeautifulSoup produces the document in nice output, which is very neat, ElementTree on its own writes files with a machine-friendly look only.
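As an aside, ElementTree output need not stay machine-only: since Python 3.9, ET.indent() pretty-prints a tree in place before writing. A sketch that builds the target document from scratch (assumes Python 3.9 or newer):

```python
import xml.etree.ElementTree as ET

options = ET.Element('options')
for name in ['FirstName', 'SecondName', 'ThirdName']:
    opt = ET.SubElement(options, 'opt', name=name)
    opt.text = 'ContentString'

ET.indent(options)  # Python 3.9+: adds newlines and indentation in place
xml_text = ET.tostring(options, encoding='unicode')
print(xml_text)
```

This yields the same shape as the target XML above, one indented opt per line, without needing BeautifulSoup for the output step.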

Find text and elements with python and selenium?

When I go to a certain webpage I am trying to find a certain element and piece of text:
<span class="Bold Orange Large">0</span>
This didn't work (it gave an error about compound class names or something):
elem = browser.find_elements_by_class_name("Bold Orange Large")
So I tried this (but I'm not sure it worked, because I don't really understand the right way to do CSS selectors in Selenium):
elem = browser.find_elements_by_css_selector("span[class='Bold Orange Large']")
Once I find the span element, I want to find the number that is inside...
num = elem.(what to put here??)
Any help with css selectors, class names, and finding element text would be great!!
Thanks.
EDIT:
My other problem is that there are multiple of those exact span elements, but with different numbers inside... how can I deal with that?
You're correct in your usage of CSS selectors! Your first attempt failed because there are spaces in the class name, and Selenium cannot locate compound class names with find_element_by_class_name at all. I think that is bad development practice to begin with, so it's not your problem. Selenium itself does not include an HTML editor, because that has already been done elsewhere.
Try looking here: How to find/replace text in html while preserving html tags/structure.
This one is also relevant and popular: RegEx match open tags except XHTML self-contained tags
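To round out the question itself: the missing piece in num = elem.(...) is the .text attribute, and multiple identical spans are handled with find_elements (plural) plus a loop. A stdlib sketch of the same idea, with invented markup and numbers:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<div><span class="Bold Orange Large">0</span>'
    '<span class="Bold Orange Large">7</span></div>'
)
# Match on the full class attribute value, like the CSS span[class='...'] form
spans = doc.findall("span[@class='Bold Orange Large']")
nums = [int(s.text) for s in spans]
print(nums)  # -> [0, 7]
```

In Selenium the equivalent would be browser.find_elements_by_css_selector("span.Bold.Orange.Large") followed by [e.text for e in elements].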