How can I replace an element during iteration in an ElementTree? I'm writing a tree processor for Markdown and would like to wrap an element.
<pre class='inner'>...</pre>
Should become
<div class='wrapper'><pre class='inner'>...</pre></div>
I use getiterator('pre') to find the elements, but I don't know how to wrap them. The tricky part is replacing the found element with the new wrapper while preserving the existing element as its child.
This is a bit of a tricky one. First, you'll need to get the parent element as described in this previous question.
parent_map = dict((c, p) for p in tree.getiterator() for c in p)
If you can get markdown to use lxml, this is a little easier -- I believe that lxml elements know their parents already.
Now, when you get your element from iterating, you can also get the parent:
for elem in list(tree.getiterator('pre')):
    parent = parent_map[elem]
    wrap_elem(parent, elem)
Note that I've turned the iterator from the tree into a list -- we don't want to modify the tree while iterating over it. That could cause trouble.
Finally, you're in position to move the element around:
def wrap_elem(parent, elem):
    parent_index = list(parent).index(elem)
    parent.remove(elem)
    new_elem = ET.Element('div', attrib={'class': 'wrapper'})
    parent.insert(parent_index, new_elem)
    new_elem.append(elem)
Note that I haven't tested this code exactly as written... let me know if you find any bugs.
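Putting the pieces together, here's a minimal self-contained sketch of the whole approach (the sample tree is invented for illustration, and iter() is the modern spelling of getiterator()):

import xml.etree.ElementTree as ET

def wrap_elem(parent, elem):
    # Remember where the element sat, detach it, and put the wrapper there.
    parent_index = list(parent).index(elem)
    parent.remove(elem)
    new_elem = ET.Element('div', attrib={'class': 'wrapper'})
    parent.insert(parent_index, new_elem)
    new_elem.append(elem)

root = ET.fromstring("<root><pre class='inner'>code</pre><p>text</p></root>")
# ElementTree elements don't know their parents, so build the map up front.
parent_map = {c: p for p in root.iter() for c in p}
# Snapshot the matches into a list so we don't mutate while iterating.
for elem in list(root.iter('pre')):
    wrap_elem(parent_map[elem], elem)

print(ET.tostring(root).decode())
# <root><div class="wrapper"><pre class="inner">code</pre></div><p>text</p></root>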
In my experience, you can use the method below to get what you want:
xml.etree.ElementTree.SubElement (I'll just call it ET.SubElement): http://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.SubElement
Here are the steps:
Before iterating, first get the parent element of the elements you iterate over, and store it in a variable parent.
Then:
1. Store the element <pre class='inner'>...</pre> in a variable temp.
2. Add a new subelement div to parent:
div = ET.SubElement(parent, 'div')
and set the attrib of div:
div.set('class', 'wrapper')
3. Add the element from step 1 as a child of div. (Note that ET.SubElement only creates a new element from a tag string; to move an existing element, use append:)
div.append(temp)
4. Delete the element from step 1 from its old position:
parent.remove(temp)
Something like this works for one:
for i, element in enumerate(parent):
    if is_the_one_you_want_to_replace(element):
        parent.remove(element)
        parent.insert(i, new_element)
        break
Something like this works for many:
replacement_map = {}
for i, element in enumerate(parent):
    if is_an_element_you_want_to_replace(element):
        # make_replacement is a placeholder for however you build the new element
        replacement_map[i] = (element, make_replacement(element))
for index, (el_to_remove, el_to_add) in replacement_map.items():
    parent.remove(el_to_remove)
    parent.insert(index, el_to_add)
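For concreteness, here's the one-shot variant run against a throwaway tree (the tag test stands in for whatever predicate you actually need):

import xml.etree.ElementTree as ET

parent = ET.fromstring("<root><old/><keep/></root>")
new_element = ET.Element('new')

for i, element in enumerate(parent):
    if element.tag == 'old':  # stand-in predicate
        parent.remove(element)
        parent.insert(i, new_element)
        break

print(ET.tostring(parent).decode())  # <root><new /><keep /></root>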
Another solution that works for me, similar to lyfing's.
Copy the element into a temp; clear the original, retag it with the wanted outer tag, then append the copy back into the original.
import copy

temp = copy.deepcopy(elem)
elem.clear()                  # clear() wipes attributes, text, and children, so do it first
elem.tag = "div"
elem.set("class", "wrapper")
elem.append(temp)
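Seen end to end on a throwaway tree (the sample XML is invented for illustration):

import copy
import xml.etree.ElementTree as ET

root = ET.fromstring("<root><pre class='inner'>code</pre></root>")
elem = root.find('pre')

temp = copy.deepcopy(elem)
elem.clear()                  # clear() wipes attributes, text, and children
elem.tag = "div"
elem.set("class", "wrapper")
elem.append(temp)

print(ET.tostring(root).decode())
# <root><div class="wrapper"><pre class="inner">code</pre></div></root>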
I wanted to ask how to deal with lists of extracted data within one variable. Since the (XPath) selector extracts either only the first element (.extract_first()) or all of the content (.extract()), I wondered how I can iterate and extract just one element at a time... like .extract()[i] with i = i + 1... How should that be written?
It seems so obvious, but at this point I don't understand how to make use of item loaders, pipelines, or whatever else the scrapy documentation provides to solve this problem.
item['author'] = sel.xpath('.//a[contains(@data-hook, "review-author")]/text()').extract_first()
item['author'] = sel.xpath('.//a[contains(@data-hook, "review-author")]/text()').extract()[0]
item['author'] = sel.xpath('.//a[contains(@data-hook, "review-author")]/text()').extract()[i] ... i = i + 1???
Also, if you could just point me in the right direction, I would be so grateful!
You can iterate over the list with a for loop:
for author in sel.xpath('.//a[contains(@data-hook, "review-author")]/text()').extract():
    item['author'] = author
If you have a list you can iterate over it with a for-loop.
item['author'] = sel.xpath('.//a[contains(@data-hook, "review-author")]/text()').extract()

# Using this for-loop construct instead of indices avoids off-by-one errors,
# and the loop body simply won't run if the list is empty.
for element in item['author']:
    print element
    # Do whatever you want with the element.
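If what you actually want is one scraped item per author rather than one field holding the whole list, the usual scrapy pattern is to yield an item inside the loop. A rough sketch (parse_reviews and the //div[@data-hook="review"] selector are invented for illustration, not taken from the question):

def parse_reviews(self, response):
    # Loop over one selector per review, so each item gets its own author.
    for sel in response.xpath('//div[@data-hook="review"]'):
        item = {}
        item['author'] = sel.xpath(
            './/a[contains(@data-hook, "review-author")]/text()').extract_first()
        yield item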
I'm scraping from here: https://www.usatoday.com/sports/ncaaf/sagarin/ and the page is just a mess of font tags. I've been able to successfully scrape the data that I need, but I'm curious whether I could have written this 'cleaner', for lack of a better word. It just seems silly that I have to use three different temporary lists as I stage the cleanup of the scraped data.
For example, here is my snippet of code that gets the overall rating for each team in the "table" on that page:
import re
import urllib.request
import bs4 as bs

source = urllib.request.urlopen('https://www.usatoday.com/sports/ncaaf/sagarin/').read()
soup = bs.BeautifulSoup(source, "lxml")
page_source = soup.find("font", {"color": "#000000"})
sagarin_raw_rating_list = page_source.find_all("font", {"color": "#9900ff"})
raw_ratings = sagarin_raw_rating_list[:-1]
temp_list = [element.text for element in raw_ratings]
temp_list_cleanup1 = [element for element in temp_list if element != 'RATING']
temp_list_cleanup2 = re.findall(" \s*(-?\d+\.\d+)", str(temp_list_cleanup1))
final_ratings_list = [element for element in temp_list_cleanup2 if element != home_team_advantage]  # home_team_advantage is scraped in another piece of code
print(final_ratings_list)
This is for a private program for me and some friends, so I'm the only one ever maintaining it, but it just seems a bit convoluted. Part of the problem is the site, because I have to do so much work to extract the relevant data.
The main thing I see is that you turn temp_list_cleanup1 into a string somewhat unnecessarily. I don't think there's going to be much of a difference between re.findall on one giant string and re.search on a bunch of smaller strings. After that, you can swap out most of the list comprehensions [...] for generator expressions (...). It doesn't eliminate any lines of code, but you don't store extra lists that you'll never need again.
temp_iter = (element.text for element in raw_ratings)
temp_iter_cleanup1 = (element for element in temp_iter if element != 'RATING')
# search each element individually, rather than one large string
# (this assumes every remaining element matches; re.search returns None otherwise)
temp_iter_cleanup2 = (re.search(" \s*(-?\d+\.\d+)", element).group(1)
                      for element in temp_iter_cleanup1)
# here do a list comprehension so that you have the scrubbed data stored
final_ratings_list = [element for element in temp_iter_cleanup2 if element != home_team_advantage]
I need to create a dictionary where the key is a string and the value is a list. The trick is I need to do it in a loop.
My minimalised code looks like this at the moment:
for elem in xmlTree.iter():
    # skipping root element
    if elem.tag == xmlTree.getroot().tag:
        continue

    # this is supposed to be my temporary list
    tmpList = []
    for child in elem:
        tableWColumns[elem.tag] = tmpList.append(child.tag)

print(tableWColumns)
This prints only the list created in the last iteration.
The problem apparently lies in the fact that whenever I change the list, all of its references change as well. I Googled that. What I haven't found, though, is how to deal with it when using a loop.
The solution I'm supposed to use when I want to keep the list is to copy it to some other list, so I can change the original one without losing data. What I don't know is how to do that when I basically need to do it dynamically.
Also, I am limited to using standard libraries only.
The problem is that you create a new empty list tmpList = [] on each iteration, so Python replaces the old list with a new one every time, and you only see the last iteration's result in your dictionary.
Instead you can use collections.defaultdict:
from collections import defaultdict

d = defaultdict(list)
for elem in xmlTree.iter():
    # skipping root element
    if elem.tag == xmlTree.getroot().tag:
        continue
    for child in elem:
        d[elem.tag].append(child.tag)

print(d)
Or you can use the dict.setdefault method:
d = {}
for elem in xmlTree.iter():
    # skipping root element
    if elem.tag == xmlTree.getroot().tag:
        continue
    for child in elem:
        d.setdefault(elem.tag, []).append(child.tag)

print(d)
Also note, as @abarnert says, that tmpList.append(child.tag) returns None, so the assignment actually stores None in tableWColumns[elem.tag].
The big problem here is that tmpList.append(child.tag) returns None. In fact, almost all mutating methods in Python return None.
To fix that, you can either do the mutation, then insert the value in a separate statement:
for child in elem:
    tmpList.append(child.tag)
tableWColumns[elem.tag] = tmpList
… or not try to mutate the list in the first place. For example:
tableWColumns[elem.tag] = tmpList + [child.tag for child in elem]
That will get rid of your all-values-are-None problem, but then you've got a new problem. If any tag appears more than once, you're only going to get the children from the last copy of that tag, not from all copies. That's because you build a new list each time, and reassign tableWColumns[elem.tag] to that new list, instead of modifying whatever was there.
To solve that problem, you need to fetch the existing value into tmpList instead of creating a new one:
tmpList = tableWColumns.get(elem.tag, [])
tableWColumns[elem.tag] = tmpList + [child.tag for child in elem]
Or, as Kasra's answer says, you can simplify this by using a defaultdict or the setdefault method.
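For reference, here is a tiny self-contained demo of the defaultdict version (the sample XML is made up for illustration):

import xml.etree.ElementTree as ET
from collections import defaultdict

xmlTree = ET.ElementTree(ET.fromstring(
    "<db><users><id/><name/></users><orders><id/><total/></orders></db>"))

tableWColumns = defaultdict(list)
for elem in xmlTree.iter():
    if elem.tag == xmlTree.getroot().tag:  # skip the root element
        continue
    for child in elem:
        tableWColumns[elem.tag].append(child.tag)

print(dict(tableWColumns))
# {'users': ['id', 'name'], 'orders': ['id', 'total']}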
I am using QGraphicsWebView and trying to iterate over QWebElements. At first I tried:
frame = self.page().mainFrame()
doc = frame.documentElement()
h = frame.findFirstElement("head")
b = frame.findFirstElement("body")
elements = h.findAll("link")
for d in elements:
    print d.tagName()
So you see what I tried, but I later found that the elements are in a QWebElementCollection, not in a list. Please help me with iterating over the DOM tree.
A QWebElement's findAll() method returns a QWebElementCollection, which can be converted to a list with its toList() method. To iterate over the matched elements, you could use:
body_element = frame.findFirstElement("body")
for el in body_element.findAll("div").toList():
    print el.tagName()
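If you'd rather avoid the toList() conversion, QWebElementCollection also provides count() and at(), so an index-based loop works as well:

collection = body_element.findAll("div")
for i in range(collection.count()):
    print collection.at(i).tagName()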
[N.B. I'm quite new to Python & XPath]
I want to perform an XPath query and use a variable in the expression to cycle through all the elements by index.
For example, rather than hard-coding the positions [1], [2], etc., I'd like to be able to place a variable i in this expression:
for x in root.xpath('/movies/a_movie[i]/studio'):
    print "<studio>" + x.text + "</studio>"
I'm unsure whether it's even possible, but I guess it can't hurt to ask!
To clarify, this is why I would like to do this: I'm trying to process all the elements of a_movie[1], then at the end of the function, I want to process all the elements of a_movie[2], and so on.
You want to loop over the /movies/a_movie tags instead, but only those that have a studio child:
for a_movie in root.xpath('/movies/a_movie[studio]'):
    name = a_movie.find('name').text
    studio = a_movie.find('studio')
    print "<studio>" + studio.text + "</studio>"
If you wanted to just print out the child elements of the a_movie element as XML, I'd just loop over the element (looping over the child elements) and use ElementTree.tostring() on each:
for a_movie in root.xpath('/movies/a_movie'):
    for elem in a_movie:
        print ElementTree.tostring(elem),
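As an aside on the original question: if these .xpath() calls come from lxml (which the syntax suggests), lxml also supports XPath variables, so you can feed a Python variable into the expression via $-placeholders instead of hard-coded positions. A sketch (untested; $idx is just an illustrative variable name):

movie_count = len(root.xpath('/movies/a_movie'))
for i in range(1, movie_count + 1):  # XPath positions are 1-based
    for x in root.xpath('/movies/a_movie[$idx]/studio', idx=i):
        print "<studio>" + x.text + "</studio>"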