First of all, I'm creating an XML document with Python BeautifulSoup.
What I'm currently trying to create is very similar to this example:
<options>
<opt name='string'>ContentString</opt>
<opt name='string'>ContentString</opt>
<opt name='string'>ContentString</opt>
</options>
Notice that each opt tag should carry only one attribute, called name.
As the options can be far more numerous, and differently named as well, I decided to write a little Python function to help me produce such a result.
array = ['FirstName', 'SecondName', 'ThirdName']
# This list is the guideline for the function: it tells it how many options
# there will be in the result and what the option tags will be named.

def create_options(array):
    soup.append(soup.new_tag('options'))
    if len(array) > 0:  # Small optional error check, so you can see whether the given array is empty for any reason.
        for i in range(len(array)):
            soup.options.append(soup.new_tag('opt'))
            # With BeautifulSoup methods, we create opt tags inside the options tag - exactly as many as there are items in the parsed array.
        counter = 0
        # I could use Python's range() here instead, but for testing purposes the current approach is sufficient.
        for tag in soup.options.find_all():
            soup.options.find('opt')['name'] = str(array[counter])
            # Notice that in this part the name is assigned only to the first opt element. We'll discuss this next.
            counter += 1
        print len(array), ' options were created.'
    else:
        print 'No options were created.'
You'll notice that in the function, the name assignment is handled by a for loop which, unfortunately, assigns all the different names to the first opt element in the options tag.
BeautifulSoup has .next_sibling and .previous_sibling, which can help me with this task.
As their names suggest, they let me access the next or previous sibling of an element. So, with this example:
soup.options.find('opt').next_sibling['name'] = str(array[counter])
we can access the second child of the options element. So, if we add .next_sibling to each soup.options.find('opt'), we can move from the first element to the next.
The problem is that by finding the option element in options with:
soup.options.find('opt')
we access the first option every time. But my function needs to reach the next option for each further item in the list. That means the more items there are in the list, the more .next_sibling calls must be chained onto the first option.
As a result, with the logic I constructed, accessing the relevant option for the 4th or a later item in the list, in order to assign its appropriate name, would look like this:
soup.options.find('opt').next_sibling.next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
And now we are ready for my questions:
1st: As I haven't found any other way to do this with Python BeautifulSoup methods, I'm not sure whether my approach is the only one. Is there any other method?
2nd: How could I achieve the result with this approach, given that my experiments show I can't store part of the method chain in a variable (so that I could multiply the methods)? See the sketch below, after the 3rd question.
# Like this:
thirdoption = .next_sibling.next_sibling.next_sibling
# This isn't valid syntax, of course - it's just to illustrate the idea.
soup.options.find('opt').next_sibling.next_sibling.next_sibling['name'] = str(array[counter])
3rd: Maybe I read the BeautifulSoup documentation badly and simply missed a method that could help me with this task?
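To illustrate the 2nd question, here is a minimal sketch of the kind of loop I have in mind: instead of chaining the calls, the current element is kept in a plain variable and advanced once per iteration (this assumes soup.options already holds the empty opt tags, appended back to back with no whitespace nodes between them):
node = soup.options.find('opt')   # start from the first opt element
for name in array:
    node['name'] = str(name)
    node = node.next_sibling      # advance the reference instead of chaining calls
    # If whitespace could appear between the tags, node.find_next_sibling('opt')
    # would be the safer way to step forward.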
I managed to achieve the result by ignoring BeautifulSoup methods.
Python has ElementTree methods, which were sufficient to work with.
So let me show the example code and explain what it does. The comments provide a more precise explanation.
"""
Before this code, there goes soup xml document generation. Except part, I mentioned in topic, we just create empty options tags in document, thus, creating almost done document.
Right after that, with this little script, we will use basic python provided element tree methods.
"""
import xml.etree.ElementTree as ET

ET_tree = ET.parse("exported_file.xml")
# Here we import exactly the same file we created with soup. The result can be
# written to a different file if you wish.
ET_root = ET_tree.getroot()
for position, opt in enumerate(ET_root.find('options')):
    # position is pretty important, as it removes the 'counter' variable I was
    # using in the for loop in the first example. position is used to pull the
    # exact item out of the array that works as the template for our option tag
    # names.
    opt.set('name', str(array[position]))
    opt.text = 'text'
    # In the same way, position lets us take data from the relevant array,
    # provided the arrays are ordered or connected in the same way.
ET.ElementTree(ET_root).write('exported_file.xml', encoding="UTF-8", xml_declaration=True)
# This part is something I researched quite a lot beforehand. This code saves the
# XML document with UTF-8 encoding, which is very important.
This approach is pretty inefficient, as I could have used ElementTree for everything and achieved the same result.
However, BeautifulSoup produces nicely formatted output, which is very neat, whereas ElementTree writes files with a software-friendly look only.
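If the ElementTree output also needs to be human-readable, one possible workaround (a sketch, not part of my original code) is to reformat the file with xml.dom.minidom after writing it:
import xml.dom.minidom

# Re-parse the file ElementTree just wrote and pretty-print it with indentation.
pretty = xml.dom.minidom.parse('exported_file.xml').toprettyxml(indent='  ')
with open('exported_file.xml', 'w') as f:
    f.write(pretty.encode('utf-8'))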
Related
I'm looking for a way to search for objects in a GUI according to an XPath-like path, as part of identifying elements in an application.
I have an xpath as follows:
./pane[0]/pane[0]/pane[1]/pane[0]/pane[1]/pane[0]/pane[0]/pane[0]/text[0]/edit[0]
This should (if I have made no mistake there) point to the selected EDIT in the application's element tree.
I tried to use this xpath to identify items like this:
# app is the application under test; this part works correctly
top = app.top_window()
first = top.child_window(control_type="Pane")
print first
# Here is the first problem: this finds all matching child windows, not just the
# direct children - there is no depth limit. (Is it possible to search only the
# direct children, without the deeper search?)
first = top.child_window(control_type="Pane", ctrl_index=0)
# this is much better
second = first.child_window(control_type="Pane", ctrl_index=0)
print second
# this works; I'm looking for the [0]-indexed Pane under the first found element
third = second.child_window(control_type="Pane", ctrl_index=1)
print third
# We have another problem: the traversal is depth-first, so ctrl_index=1 does not
# reference the 2nd child of the element named 'second', but instead the first
# element of the first pane found under the 2nd element. (I'm looking for a
# breadth-first algorithm.)
I haven't written a recursive function for this yet, but maybe I'm going about it the wrong way.
So the question is: is there any way to recover the path to an element in an XPath-like style?
Thanks
For the first case it should look like this:
first = top.child_window(control_type="Pane", depth=2)
# a bit confusing; it will be changed to bind to depth=1 in a future major release.
For the third case: yes, ctrl_index is applied before filtering by the other criteria. If you need to apply the search criteria first and then choose from the small filtered list, there is another parameter suitable for your case: found_index.
It should be changed like so:
third = second.child_window(control_type="Pane", found_index=1)
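Putting both fixes together, the walk from your example might look like this (just a sketch assembled from the snippets above; child_window() returns a lazy specification, so the lookups are only resolved when the elements are actually used):
top = app.top_window()
first = top.child_window(control_type="Pane", depth=2)           # direct children only
second = first.child_window(control_type="Pane", ctrl_index=0)
third = second.child_window(control_type="Pane", found_index=1)  # filter first, then index
print third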
I want to get an XPath value from a Steam store page, e.g. http://store.steampowered.com/app/234160/. On the right side there are two boxes. The first one contains Title, Genre, Developer, and so on; I just need the Genre here. The count differs for every game: some have 4 genres, some just one. Then there is another block where the game features are listed (like Singleplayer, Multiplayer, Coop, Gamepad, ...).
I need all those values.
Also, sometimes there is a rating image (PEGI/USK) in between, e.g.:
http://store.steampowered.com/app/233290.
import requests
from lxml import html
page = requests.get('http://store.steampowered.com/app/234160/')
tree = html.fromstring(page.text)
blockone = tree.xpath(".//*[@id='main_content']/div[4]/div[3]/div[2]/div/div[1]")
blocktwo = tree.xpath(".//*[@id='main_content']/div[4]/div[3]/div[2]/div/div[2]")
print "Detailblock:" , blockone
print "Featureblock:" , blocktwo
This is the code I have so far. When I try it, it just prints:
Detailblock: [<Element div at 0x2ce5868>]
Featureblock: [<Element div at 0x2ce58b8>]
How do I make this work?
xpath returns a list of matching elements. You're just printing out that list.
If you want the first element, you need blockone[0]. If you want all elements, you have to loop over them (e.g., with a comprehension).
And meanwhile, what do you want to print for each element? The direct inner text? The HTML for the whole subtree rooted at that element? Something else? Whatever you want, you need to use the appropriate method on the Element type to get it; lxml can't read your mind and figure out what you want, and neither can we.
It sounds like what you really want is just some elements deeper in the tree. You could xpath your way there. (Instead of going through all of the elements one by one and relying on index as you did, I'm just going to write the simplest way to get to what I think you're asking for.)
genres = [a.text for a in blockone[0].xpath('.//a')]
Or, really, why even get that blockone at all? Why not just xpath directly to the elements you wanted in the first place?
gtags = tree.xpath(".//*[@id='main_content']/div[4]/div[3]/div[2]/div/div[1]//a")
genres = [a.text for a in gtags]
Also, you could make this a lot simpler, and a lot more robust, if you used the information in the tags instead of finding them by explicitly walking the structure:
gtags = tree.xpath(".//div[@class='glance_tags popular_tags']//a")
Or, since there don't seem to be any other app_tag items anywhere, just:
gtags = tree.xpath(".//a[@class='app_tag']")
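Put together, a minimal end-to-end sketch (the class name app_tag is taken from the snippets above; the page markup may of course have changed since):
import requests
from lxml import html

page = requests.get('http://store.steampowered.com/app/234160/')
tree = html.fromstring(page.text)

# Grab every tag link by its class and pull out the stripped text.
gtags = tree.xpath(".//a[@class='app_tag']")
genres = [a.text.strip() for a in gtags if a.text]
print genres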
I wonder if anybody can help :)
I am using Python lxml and cssselect to scrape data from HTML pages.
I can select most classes with ease using this method and find it very convenient, but I am having a problem selecting class names that contain a space.
For example, I want to extract 'Blah blah' from the following element:
<li class="feature height">Blah blah</li>
I have tried the following CSS selectors without success (the whole path is not included, as that is not the problem):
li.feature.height
li.feature height
li.feature:height
Does anybody know how to do this? I can't find the answer and am sure it must be a fairly common thing people need to do...
I cannot just select the parent element
li.feature
as the data is not in the same order on different pages; the same applies to nth-element selections...
I've been scratching my head on this a while now and searched a lot; I hope somebody knows!
I can work around it by getting the data using regexes, and that works, but I wonder if there is a simpler solution...
Thanks for your help in advance!
Matt
Extra information as requested: it doesn't work because it returns an empty list, or a negative result for a boolean check.
So if I use this:
css_9_seed_height = 'html body div.seedicons ul li.feature.height'
# 9. Get seed_height
seed_height_obj = root.cssselect(css_9_seed_height)
print seed_height_obj
This returns an empty list, i.e. the class is not found, even though it's there.
You can assume that root.cssselect() works correctly, as I am retrieving lots of other info in the same way.
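For what it's worth, a standalone check with made-up markup (independent of my real pages) suggests the two-class selector itself is valid for lxml's cssselect, which would mean the problem lies in the surrounding path or in the document that actually gets parsed:
from lxml import html

# Minimal reproduction with the element from the question.
root = html.fromstring('<div class="seedicons"><ul>'
                       '<li class="feature height">Blah blah</li>'
                       '</ul></div>')
print root.cssselect('li.feature.height')[0].text  # -> 'Blah blah'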
I'm pretty new to XPath and couldn't figure this out by looking at other solutions.
What I'm trying to do is select all the a elements inside a given td (td[2] in the example) and run a for loop to output the text contained within the a elements.
Source code:
multiple = HTML.ElementFromURL(url).xpath('//table[contains(@class, "mg-b20")]/tr[3]/td[2]/*[self::a]')
for item in multiple:
Log("text = %s" %item.text)
Any pointers on how I can make this work?
Thanks!
The XPath you need is pretty close:
//table[contains(@class, "mg-b20")]/tr[3]/td[2]//a
I don't know what library you're using, but I suspect it is the Plex Parsekit API. If so, parsekit uses lxml.etree as its underlying library, so you can simplify your code even further:
element = HTML.ElementFromURL(url)
anchors = element.xpath('//table[contains(@class, "mg-b20")]/tr[3]/td[2]//a')
for item in anchors:
    # string() returns the full string value of each anchor, including text
    # nested inside child elements.
    Log("text = %s" % item.xpath('string()'))
This will even take care of corner cases like mixed content, e.g. this:
I am anchor text <span>But I am too and am not in Element.text</span> and I am in Element.tail
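In plain lxml (outside of Parsekit) the same extraction can be written with text_content(), which should behave like string() here; a small sketch with made-up markup:
from lxml import html

a = html.fragment_fromstring('<a>I am anchor text <span>But I am too</span>'
                             ' and I am in a tail</a>')
print a.text            # -> 'I am anchor text ' (the mixed content is lost)
print a.text_content()  # -> the full string, span text and tail included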
I am trying to remove comments from a list of elements that were obtained using lxml.
The best I have been able to do is:
no_comments = [element for element in element_list if 'HtmlComment' not in str(type(element))]
I am wondering if there is a more direct way?
I am going to add something based on Matthew's answer; he got me almost there. The problem is that when the elements are taken out of the tree, the comments lose some of their identity (I don't know how to describe it), so it cannot be determined whether they are HtmlComment class objects using the isinstance() method.
However, that method can be used while the elements are being iterated through on the tree:
from lxml.html import HtmlComment
no_comments = [element for element in root.iter() if not isinstance(element, HtmlComment)]
For those novices like me: root is the base html element that holds all of the other elements in the tree. There are a number of ways to get it. One is to open the file and parse it, so instead of root.iter() in the above you can use:
html.fromstring(open(r'c:\temp\testlxml.htm').read()).iter()
You can cut out the strings:
from lxml.html import HtmlComment # or similar
no_comments=[element for element in element_list if not isinstance(element, HtmlComment)]
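If the goal is simply an element tree with no comments in it at all, lxml can also drop them at parse time; a sketch, assuming the HTML is parsed from a string:
from lxml import html

# Parser configured to discard comment nodes entirely.
parser = html.HTMLParser(remove_comments=True)
root = html.fromstring('<div><!-- gone -->kept</div>', parser=parser)
print [el.tag for el in root.iter()]  # only real elements remain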