How to access comments using lxml - python

I am trying to remove comments from a list of elements obtained with lxml.
The best I have been able to do is:
no_comments=[element for element in element_list if 'HtmlComment' not in str(type(element))]
I am wondering if there is a more direct way?
I am going to add something based on Matthew's answer, which got me almost there. The problem is that once the elements have been taken out of the tree, the comments lose some identity (I don't know how to describe it), so isinstance() can no longer determine whether they are HtmlComment objects.
However, isinstance() does work while the elements are being iterated over on the tree:
from lxml.html import HtmlComment
no_comments=[element for element in root.iter() if not isinstance(element,HtmlComment)]
For novices like me: root is the base html element that holds all of the other elements in the tree, and there are a number of ways to get it. One is to open the file and parse it, so instead of root.iter() in the above:
html.fromstring(open(r'c:\temp\testlxml.htm').read()).iter()

You can cut out the strings:
from lxml.html import HtmlComment # or similar
no_comments=[element for element in element_list if not isinstance(element, HtmlComment)]
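For comparison, here is a minimal sketch of the same filtering idea using only the standard library's xml.etree.ElementTree, where a comment node carries the Comment factory as its tag. lxml follows the same convention (a comment's tag is lxml.etree.Comment), so the check should carry over, but that is an assumption worth verifying against your own tree. The sample markup and the is_comment helper are made up for illustration, and the insert_comments option needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Made-up sample document containing two comments.
SAMPLE = "<root><!-- note --><p>text</p><!-- another --><div><span/></div></root>"

# By default ElementTree drops comments while parsing; ask the builder to keep them.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
root = ET.fromstring(SAMPLE, parser=parser)


def is_comment(node):
    """Return True when node is a comment rather than a regular element."""
    return node.tag is ET.Comment


# Walk the whole tree and keep everything that is not a comment.
no_comments = [node for node in root.iter() if not is_comment(node)]
tags = [node.tag for node in no_comments]
print(tags)
```

Checking the tag does not depend on the class of the Python object, which may help when isinstance() gives unexpected results.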

Related

Python Beautiful Soup 4 - finding element by class and aria-label

I am trying to find an element with a particular class name and aria-label using Beautiful Soup 4. More specifically, I am scraping HTML in which each item on the list has the same class (nd-list__item in-feat-item) but a different aria-label (e.g. aria-label="rooms"). Source code below:
I have to search for a specific combination of class and aria-label because, if I am unable to find it, I must return None (e.g. if there is no <li .... aria-label="rooms"></li>, I must return None). Calling bs_object.find_all on the whole list and then iterating over each of the list elements is rather inefficient, and the listings may have different orderings (e.g. if the number of rooms is not provided, the first element will be aria-label="surface"), so I must be able to query directly whether the particular element is contained in the bs object.
Do you have any recommendations on how to do that without calling bs_object.find_all('li', class_='nd-list_item in-feat__item') and then iterating over the whole list? I also thought about searching for the parent <ul></ul> tag and then using a regex, but that is also overly complicated. Thanks in advance for all the answers!
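One possible approach, sketched here under assumptions, is a CSS selector that combines the classes with the attribute and returns None when nothing matches. The HTML snippet and the feature_text helper are invented for illustration, and the class names are the ones quoted in the first paragraph of the question; adjust them to the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the structure described in the question.
HTML = """
<ul>
  <li class="nd-list__item in-feat-item" aria-label="surface">80 m2</li>
  <li class="nd-list__item in-feat-item" aria-label="rooms">3</li>
</ul>
"""


def feature_text(soup, label):
    """Return the text of the <li> with the given aria-label, or None."""
    # Both classes and the attribute must match on the same element.
    selector = 'li.nd-list__item.in-feat-item[aria-label="{}"]'.format(label)
    item = soup.select_one(selector)
    return item.get_text(strip=True) if item is not None else None


soup = BeautifulSoup(HTML, "html.parser")
rooms = feature_text(soup, "rooms")
bathrooms = feature_text(soup, "bathrooms")
print(rooms, bathrooms)
```

select_one returns the first match or None, so the order of the list items does not matter and no manual iteration is needed.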

Choosing next relative in Python BeautifulSoup with automation

First of all, I'm creating an XML document with Python's BeautifulSoup.
What I'm currently trying to create is very similar to this example:
<options>
<opt name='string'>ContentString</opt>
<opt name='string'>ContentString</opt>
<opt name='string'>ContentString</opt>
</options>
Notice that each opt should carry only one attribute, called name.
As there can be many more options, and different ones as well, I decided to write a small Python function to help me create such a result.
array = ['FirstName','SecondName','ThirdName']
# This list guides the function: it determines how many options are created and what each option's name will be.
def create_options(array):
    soup.append(soup.new_tag('options'))
    if len(array) > 0: # Small sanity check, so you can see whether the given array is empty for any reason. Optional.
        for i in range(len(array)):
            soup.options.append(soup.new_tag('opt'))
            # With BeautifulSoup methods, we create opt tags inside the options tag, exactly as many as there are items in the given array.
        counter = 0
        # range() could be used here as well, but for testing purposes the current approach is sufficient.
        for tag in soup.options.find_all():
            soup.options.find('opt')['name'] = str(array[counter])
            # Notice that in this part the name is assigned only to the first opt element. We'll discuss this next.
            counter += 1
        print('%d options were created.' % len(array))
    else:
        print('No options were created.')
As you can see, the name assignment is handled by a for loop which, unfortunately, assigns every name to the first opt in the options element.
BeautifulSoup has .next_sibling and .previous_sibling, which can help me with this task.
As their names suggest, they give access to the next or previous sibling of an element. So, for example:
soup.options.find('opt').next_sibling['name'] = str(array[counter])
This accesses the second child of the options element. So, if we add .next_sibling to each soup.options.find('opt'), we can move from the first element to the next.
The problem is that finding the opt element in options with:
soup.options.find('opt')
returns the first opt every time. But my function needs to reach the next opt for each item in the list. That means the more items there are in the list, the more .next_sibling calls must be chained onto the first opt.
As a result, with the logic I constructed, reaching the right opt for the 4th or a later item in the list, in order to assign its name, would look like this:
soup.options.find('opt').next_sibling.next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
And now we are ready for my questions:
1st. As I didn't find any other way to do this with Python BeautifulSoup methods, I'm not sure whether my approach is the only way. Is there any other method?
2nd. How could I achieve the result with this approach if, as my experiments show, I can't put a variable inside a method chain? (So that I could multiply the methods.)
#Like this
thirdoption = .next_sibling.next_sibling.next_sibling
#This isn't valid Python, of course; it's just an example of what I mean.
soup.options.find('opt').next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
3rd. Maybe I read the BeautifulSoup documentation badly and just didn't find the method that could help me with this task?
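One way to avoid chaining .next_sibling, sketched here rather than taken from the original thread, is to pair each opt tag with its name by iterating over find_all('opt') and the list together with zip. In this sketch create_options takes the soup as a parameter instead of relying on a global, and the 'ContentString' text is copied from the example above; both choices are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def create_options(soup, names):
    """Append an <options> tag holding one named <opt> per entry in names."""
    options = soup.new_tag("options")
    soup.append(options)
    # First pass: create one empty <opt> per name, as in the question.
    for _ in names:
        options.append(soup.new_tag("opt"))
    # Second pass: walk the tags and the names side by side.
    for opt, name in zip(options.find_all("opt"), names):
        opt["name"] = name
        opt.string = "ContentString"
    return options


soup = BeautifulSoup("", "html.parser")
options = create_options(soup, ["FirstName", "SecondName", "ThirdName"])
names = [opt["name"] for opt in options.find_all("opt")]
print(soup)
```

Because zip advances through both sequences at once, no counter and no sibling navigation are needed; the two passes could also be merged into one loop that sets the attribute right after creating each tag.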
I managed to achieve the result without BeautifulSoup methods.
Python has element tree methods, which were sufficient to work with.
So, let me show the example code and explain what it does. The comments provide a more precise explanation.
"""
Before this code comes the generation of the soup XML document. Except for the part I mentioned in the question, we just create empty options tags in the document, thus creating an almost finished document.
Right after that, with this little script, we use the basic element tree methods that Python provides.
"""
import xml.etree.ElementTree as ET
ET_tree = ET.parse("exported_file.xml")
# Here we load exactly the same file we wrote with soup. The export can go to a different file, if you wish.
ET_root = ET_tree.getroot()
# item is the element that contains <options>; it is not defined in this excerpt (use ET_root if <options> sits directly under the root).
for position, opt in enumerate(item.find('options')):
    # position matters: it replaces the 'counter' used in the for loop of the first example. It picks the matching item from array, which works as a template for our option names.
    opt.set('name', str(array[position]))
    opt.text = 'text'
    # In the same way, position can be used to pull data from other related arrays, provided they are ordered the same way.
ET_tree.write('exported_file.xml',encoding="UTF-8",xml_declaration=True)
# I researched this part quite a lot. It saves the XML document with UTF-8 encoding, which is very important.
This approach is pretty inefficient, since to achieve the same result I could use ET for everything.
Though BeautifulSoup does produce nicely formatted output, which is very neat, whereas ElementTree writes files in a machine-friendly layout only.
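As a rough illustration of using ET for everything, the sketch below builds the options block directly with xml.etree.ElementTree. The build_options name and the 'ContentString' default are assumptions made for this example, not part of the original code.

```python
import xml.etree.ElementTree as ET


def build_options(names, text="ContentString"):
    """Create an <options> element with one <opt name=...> child per name."""
    options = ET.Element("options")
    for name in names:
        # SubElement creates the child and attaches it in one step.
        opt = ET.SubElement(options, "opt", {"name": name})
        opt.text = text
    return options


options = build_options(["FirstName", "SecondName", "ThirdName"])
xml_text = ET.tostring(options, encoding="unicode")
print(xml_text)
```

On Python 3.9 or later, calling ET.indent(options) before serialising should give indented output, which covers part of the formatting that BeautifulSoup provides.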

Search for an XML node in a parent by string with python

I'm working with Python's xml.dom. I'm looking for a method that takes a node and a string and returns the XML node whose name is that string. I can't find it in the documentation.
I'm thinking it would work something like this:
nodeObject =parent.FUNCTION('childtoFind')
where the nodeObject is under the parent
Or barring the existence of such a method, is there a way I can make the string a node object?
You are looking for the .getElementsByTagName() function:
nodeObjects = parent.getElementsByTagName('childtoFind')
It returns a list; if you need only one node, use indexing:
nodeObject = parent.getElementsByTagName('childtoFind')[0]
You really want to use the ElementTree API instead; it's easier to use. Even the minidom documentation makes this recommendation:
Users who are not already proficient with the DOM should consider using the xml.etree.ElementTree module for their XML processing instead.
The ElementTree API has a .find() function that lets you find the first matching child (use a path such as './/childtoFind' to search all descendants):
element = parent.find('childtoFind')
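A small side-by-side sketch of the two lookups, using a made-up document, may help when comparing the APIs:

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
XML = "<parent><childtoFind>hello</childtoFind><other/></parent>"

# minidom: getElementsByTagName returns a list of every matching descendant.
dom_parent = xml.dom.minidom.parseString(XML).documentElement
dom_matches = dom_parent.getElementsByTagName("childtoFind")
dom_node = dom_matches[0] if dom_matches else None

# ElementTree: find returns the first matching child, or None if there is none.
et_parent = ET.fromstring(XML)
et_node = et_parent.find("childtoFind")
missing = et_parent.find("noSuchChild")

print(dom_node.firstChild.data, et_node.text, missing)
```

Guarding the minidom index avoids an IndexError when nothing matches, whereas find() signals a miss by returning None.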

Selenium / lxml : Get xpath

Is there a get_xpath method or a way to accomplish something similar in selenium or lxml.html? I have a feeling that I have seen one somewhere, but I can't find anything like that in the docs.
Pseudocode to illustrate:
browser.find_element_by_name('search[1]').get_xpath()
>>> '//*[@id="langsAndSearch"]/div[1]/form/input[1]'
This trick works in lxml:
In [1]: el
Out[1]: <Element span at 0x109187f50>
In [2]: el.getroottree().getpath(el)
Out[2]: '/html/body/div/table[2]/tbody/tr[1]/td[3]/table[2]/tbody/tr/td[1]/p[4]/span'
See documentation of getpath.
As there is no unique mapping between an element and an xpath expression, a general solution is not possible. But if you know something about your xml/html, it might be easy to write your own. Just start with your element, walk up the tree through its parents, and generate your expression.
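The walk-up idea can be sketched with the standard library as follows. xml.etree.ElementTree elements do not store a parent pointer, so this sketch first builds a child-to-parent map; with lxml, getparent() could be used instead. The build_path name and the sample document are made up for illustration, and the function always emits a positional index, which is only one of many valid expressions for the same element.

```python
import xml.etree.ElementTree as ET


def build_path(root, target):
    """Return an absolute, index-based path from root down to target."""
    # ElementTree has no parent links, so map every child to its parent first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of[node]
        # XPath positions are 1-based and count only siblings with the same tag.
        same_tag = [sibling for sibling in parent if sibling.tag == node.tag]
        steps.append("{}[{}]".format(node.tag, same_tag.index(node) + 1))
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


root = ET.fromstring("<html><body><div><p>a</p><p>b</p></div></body></html>")
target = root.findall(".//p")[1]
path = build_path(root, target)
print(path)
```

The result can be checked by feeding the part after the root tag back into find(), which should return the same element.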
Whatever search function you use, you can reformat your search using xpath to return your element. For instance,
driver.find_element_by_id('foo')
driver.find_element_by_xpath('//*[@id="foo"]')
will return exactly the same element.
That being said, I would argue that extending selenium with this method would be possible but nearly pointless. You're already providing the module with all the information it needs to find the element, so why use xpath (which will almost certainly be harder to read) to do this at all?
In your example, browser.find_element_by_name('search[1]').get_xpath() would simply return '//*[@name="search[1]"]', since the assumption is that your original element search returned what you were looking for.

How can I get the only element of a certain type out of a list more cleanly than what I am doing?

I am working with some XML files. The schema for the files specifies that there can be only one element of a certain type (in this case, the footnotes element).
The footnotes element can contain several footnote elements. I am trying to grab the footnotes element so that I can iterate through it and collect the footnote elements.
Here is my current approach:
from collections import OrderedDict as od

def get_footnotes(element_list):
    footnoteDict=od()
    footnotes_element=[item for item in element_list if item.tag=='footnotes'][0]
    for eachFootnote in footnotes_element.iter():
        if eachFootnote.tag=='footnote':
            footnoteDict[eachFootnote.values()[0]]=eachFootnote.text
    return footnoteDict
element_list is a list of the elements that are relevant to me, collected by iterating through the entire tree.
So I am wondering whether there is a more Pythonic way to get the footnotes element than iterating through the list of elements. This line in particular seems clumsy to me:
footnotes_element=[item for item in element_list if item.tag=='footnotes'][0]
Something like this should do the job:
from lxml import etree
xmltree = etree.fromstring(your_xml)
for footnote in xmltree.iterfind(".//footnotes/footnote"):
    # do something
    pass
It's easier to help if you provide some sample XML.
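In the absence of the real file, here is a minimal sketch on an invented document that rebuilds the dictionary from the question with a direct lookup instead of a list comprehension. It uses the standard library's xml.etree.ElementTree; the id attribute is an assumption standing in for whichever attribute the original values()[0] picked up.

```python
from collections import OrderedDict
import xml.etree.ElementTree as ET

# Made-up sample: a single <footnotes> element with two <footnote> children.
XML = (
    "<doc><body><p>Text</p></body>"
    "<footnotes>"
    '<footnote id="F1">First note</footnote>'
    '<footnote id="F2">Second note</footnote>'
    "</footnotes></doc>"
)


def get_footnotes(root):
    """Map each footnote's id attribute to its text, in document order."""
    result = OrderedDict()
    # find returns the first match or None, which suits a one-per-file element.
    footnotes = root.find(".//footnotes")
    if footnotes is None:
        return result
    for footnote in footnotes.iter("footnote"):
        result[footnote.get("id")] = footnote.text
    return result


footnote_dict = get_footnotes(ET.fromstring(XML))
print(footnote_dict)
```

Returning an empty mapping when the element is missing avoids the IndexError that indexing an empty list with [0] would raise.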
Edit:
If you are working with really big files, you might want to look into iterparse.
This question seems to have quite a nice example: python's lxml and iterparse method
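A rough sketch of the iterparse idea, shown with the standard library's xml.etree.ElementTree.iterparse on an in-memory document (lxml offers a similar function). The stream_footnotes name, the id attribute and the sample data are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample; a real use case would pass a file name or an open file.
XML = (
    b"<doc><footnotes>"
    b"<footnote id='F1'>First</footnote>"
    b"<footnote id='F2'>Second</footnote>"
    b"</footnotes></doc>"
)


def stream_footnotes(source):
    """Yield (id, text) pairs one footnote at a time while parsing."""
    # An "end" event fires once an element has been read completely.
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "footnote":
            yield element.get("id"), element.text
            # Drop the element's content once it has been consumed.
            element.clear()


pairs = list(stream_footnotes(io.BytesIO(XML)))
print(pairs)
```

Because each footnote is handled as soon as it has been read, the text and attributes of processed footnotes do not have to stay in memory.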
