[N.B. I'm quite new to Python & Xpath]
I was wanting to perform an Xpath query and use a variable in the expression to cycle through all the elements using an index.
For example, rather than having the specific position of [1], [2], etc, I'd like to be able to place a variable i in this expression:
for x in root.xpath('/movies/a_movie[i]/studio'):
    print "<studio>" + x.text + "</studio>"
I'm unsure whether it's even possible, but I guess it can't hurt to ask!
To clarify, this is why I would like to do this: I'm trying to process all the elements of a_movie[1], then at the end of the function, I want to process all the elements of a_movie[2], and so on.
You want to loop over the /movies/a_movie tags instead, but only those that have a studio child:
for a_movie in root.xpath('/movies/a_movie[studio]'):
    name = a_movie.find('name').text
    studio = a_movie.find('studio')
    print "<studio>" + studio.text + "</studio>"
If you wanted to just print out the child elements of the a_movie element as XML, I'd just loop over the element (looping over the child elements) and use ElementTree.tostring() on each:
for a_movie in root.xpath('/movies/a_movie'):
    for elem in a_movie:
        print ElementTree.tostring(elem),
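To come back to the original question: inside the string, a bare i is read by XPath as an element name, so a_movie[i] means "an a_movie that has a child called i", not "the i-th a_movie". A minimal sketch of both options follows, assuming the standard-library xml.etree.ElementTree rather than the lxml used above; the sample XML and the helper names studios_at and studios_by_movie are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
XML = """
<movies>
  <a_movie><name>First</name><studio>Alpha</studio></a_movie>
  <a_movie><name>Second</name></a_movie>
  <a_movie><name>Third</name><studio>Gamma</studio></a_movie>
</movies>
"""

root = ET.fromstring(XML)  # root is the <movies> element


def studios_at(root, i):
    """Return the studio texts of the i-th a_movie (1-based, as in XPath)."""
    # Format the index into the path; a literal "i" would be an element name.
    path = "a_movie[{}]/studio".format(i)
    return [studio.text for studio in root.findall(path)]


def studios_by_movie(root):
    """Visit every a_movie in document order, no index needed."""
    pairs = []
    for movie in root.findall("a_movie"):
        # findtext returns None when the child element is missing
        pairs.append((movie.findtext("name"), movie.findtext("studio")))
    return pairs


for i in (1, 2, 3):
    print(i, studios_at(root, i))
print(studios_by_movie(root))
```

Looping over the elements directly is usually simpler than counting positions, because each a_movie is processed completely before the next one starts.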
I'm scraping from here: https://www.usatoday.com/sports/ncaaf/sagarin/ and the page is just a mess of font tags. I've been able to successfully scrape the data that I need, but I'm curious whether I could have written this 'cleaner', for lack of a better word. It just seems silly that I have to use three different temporary lists as I stage the cleanup of the scraped data.
For example, here is my snippet of code that gets the overall rating for each team in the "table" on that page:
import re
import urllib.request

import bs4 as bs

source = urllib.request.urlopen('https://www.usatoday.com/sports/ncaaf/sagarin/').read()
soup = bs.BeautifulSoup(source, "lxml")
page_source = soup.find("font", {"color": "#000000"})
sagarin_raw_rating_list = page_source.find_all("font", {"color": "#9900ff"})
raw_ratings = sagarin_raw_rating_list[:-1]
temp_list = [element.text for element in raw_ratings]
temp_list_cleanup1 = [element for element in temp_list if element != 'RATING']
temp_list_cleanup2 = re.findall(r" \s*(-?\d+\.\d+)", str(temp_list_cleanup1))
final_ratings_list = [element for element in temp_list_cleanup2 if element != home_team_advantage]  # home_team_advantage is scraped by another piece of code
print(final_ratings_list)
This is for a private program for me and some friends so I'm the only one ever maintaining it, but it just seems a bit convoluted. Part of the problem is the site because I have to do so much work to extract the relevant data.
The main thing I see is that you turn temp_list_cleanup1 into a string unnecessarily. I don't think there will be much difference between running re.findall on one giant string and re.search on many smaller strings. After that, you can swap most of the list comprehensions [...] for generator expressions (...). This doesn't eliminate any lines of code, but you no longer store extra lists that you will never need again:
temp_iter = (element.text for element in raw_ratings)
temp_iter_cleanup1 = (element for element in temp_iter if element != 'RATING')
# search each element individually, rather than one large string
# (note: re.search returns None for an element with no match, so .group(1) would raise)
temp_iter_cleanup2 = (re.search(r" \s*(-?\d+\.\d+)", element).group(1)
                      for element in temp_iter_cleanup1)
# here do a list comprehension so that you have the scrubbed data stored
final_ratings_list = [element for element in temp_iter_cleanup2 if element != home_team_advantage]
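The same pipeline can be tried without touching the network. The sketch below is only an illustration: the sample strings, the function name clean_ratings and the home advantage value are made up, and it skips elements that contain no number instead of raising.

```python
import re

# Same pattern as in the question, written as a raw string.
RATING_RE = re.compile(r" \s*(-?\d+\.\d+)")


def clean_ratings(raw_texts, home_team_advantage):
    """Chain generator expressions; only the final list is materialised."""
    texts = (text for text in raw_texts if text != "RATING")
    matches = (RATING_RE.search(text) for text in texts)
    # Drop elements where the pattern did not match at all.
    numbers = (match.group(1) for match in matches if match is not None)
    return [number for number in numbers if number != home_team_advantage]


# Hypothetical stand-ins for element.text values scraped from the page.
sample = ["RATING", "  98.76", " 87.10", "no number here", "  2.51", " -3.40"]
cleaned = clean_ratings(sample, "2.51")
print(cleaned)
```

Each generator pulls one item at a time from the one before it, so no intermediate list is ever built.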
I'm making a program that allows the user to log loot they receive from monsters in an MMO. I have the drop tables for each monster stored in text files. I've tried a few different formats but I still can't pin down exactly how to take that information into python and store it into a list of lists of lists.
The text file is formatted like this
item 1*4,5,8*ns
item 2*2,3*s
item 3*90,34*ns
The item # is the name of the item, the numbers are different quantities that can be dropped, and the s/ns is whether the item is stackable or not stackable in game.
I want the entire drop table of the monster to be stored in a list called currentDropTable so that I can reference the names and quantities of the items to pull photos and log the quantities dropped and stuff.
The list for the above example should look like this
[["item 1", ["4","5","8"], "ns"], ["item 2", ["2","3"], "s"], ["item 3", ["90","34"], "ns"]]
That way, I can reference currentDropTable[0][0] to get the name of an item, or if I want to log a drop of 4 of item 1, I can use currentDropTable[0][1][0].
I hope this makes sense, I've tried the following and it almost works, but I don't know what to add or change to get the result I want.
def convert_drop_table(list):
    global currentDropTable
    currentDropTable = []
    for i in list:
        item = i.split('*')
        currentDropTable.append(item)

dropTableFile = open("droptable.txt", "r").read().split('\n')
convert_drop_table(dropTableFile)
print(currentDropTable)
This prints everything properly except the quantities are still an entity without being a list, so it would look like
[['item 1', '4,5,8', 'ns'], ['item 2', '2,3', 's']...etc]
I've tried nesting another for j in i, split(',') but then that breaks up everything, not just the list of quantities.
I hope I was clear, if I need to clarify anything let me know. This is the first time I've posted on here, usually I can just find another solution from the past but I haven't been able to find anyone who is trying to do or doing what I want to do.
Thank you.
You want to split only the second entity by ',', so you don't need another loop. Since you know that item = i.split('*') returns a list of 3 items, you can simply change your for loop as follows:
for i in list:
    item = i.split('*')
    item[1] = item[1].split(',')
    currentDropTable.append(item)
Here you replace the second element of item with a list of the quantities.
You only need to split the second element of that list.
def convert_drop_table(list):
    global currentDropTable
    currentDropTable = []
    for i in list:
        item = i.split('*')
        item[1] = item[1].split(',')
        currentDropTable.append(item)
The first thing I feel bound to say is that it's usually a good idea to avoid using global variables in any language. Errors involving them can be hard to track down. In fact you could simply omit that function convert_drop_table from your code and do what you need in-line. Then readers aren't obliged to look elsewhere to find out what it does.
And here's yet another way to parse those lines! :) Look for the asterisks then use their positions to select what you want.
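currentDropTable = []
with open('droptable.txt') as droptable:
    for line in droptable:
        line = line.strip()
        p = line.find('*')    # first asterisk: end of the item name
        q = line.rfind('*')   # last asterisk: start of the s/ns flag
        # split the middle field on commas so the quantities become a list
        currentDropTable.append([line[0:p], line[1+p:q].split(','), line[1+q:]])
print(currentDropTable)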
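Following the advice about globals, one possible shape is a function that returns the table instead of assigning to a global. This is a sketch under assumptions: parse_drop_table is a made-up name, the sample lines stand in for the contents of droptable.txt, and blank lines (such as a trailing newline) are skipped.

```python
def parse_drop_table(lines):
    """Turn 'name*q1,q2*flag' lines into [name, [q1, q2], flag] entries."""
    table = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines, e.g. the one after a trailing newline
        name, quantities, stackable = line.split("*")
        table.append([name, quantities.split(","), stackable])
    return table


# Hypothetical stand-in for the lines of droptable.txt.
sample_lines = ["item 1*4,5,8*ns", "item 2*2,3*s", "", "item 3*90,34*ns"]
current_drop_table = parse_drop_table(sample_lines)
print(current_drop_table[0][0])     # name of the first item
print(current_drop_table[0][1][0])  # first possible quantity of that item
```

With a real file, the lines could come from `with open("droptable.txt") as f: table = parse_drop_table(f)`, which also closes the file when done.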
Language: Python 3.4
OS: Windows 8.1
I have some lists like the following:
a = ['text1', 'text2', 'text3','text4','text5']
b = ['text1', 'text2', 'text3','text4','New_element', 'text5']
What is the simplest way to find the elements between two tags in a list?
I want to be able to get it even if the lists and tags have variable number of elements or variable length.
For example: get the elements between text1 and text4, or between text1 and text5, and so on. Or get the elements between text1 and text5 from the longer list.
I tried using regular expressions like:
re.findall(r'text1(.*?)text5', a)
This gives me an error, I guess because re.findall only works on strings, not on lists.
To get the location of an element in a list use index(). Then use the discovered index to create a slice of the list like:
Code:
print(b[b.index('text3')+1:b.index('text5')])
Results:
['text4', 'New_element']
You can use the list.index method to find the first occurrence of each of your tags, then slice the list to get the value between the indexes.
def find_between_tags(lst, start_tag, end_tag):
    start_index = lst.index(start_tag)
    end_index = lst.index(end_tag, start_index)
    return lst[start_index + 1: end_index]
If either of the tags is not in the list (or if the end tag only occurs before the start tag), one of the index calls will raise a ValueError. You could suppress the exception if you want to do something else, but just letting the caller deal with it seems like a reasonable option to me, so I've left the exception uncaught.
If the tags might occur in this list multiple times, you could extend the logic of the function above to find all of them. For this you'll want to use the start argument to list.index, which will tell it not to look at values before the previous end tag.
def find_all_between_tags(lst, start_tag, end_tag):
    search_from = 0
    try:
        while True:
            start_index = lst.index(start_tag, search_from)
            end_index = lst.index(end_tag, start_index + 1)
            yield lst[start_index + 1:end_index]
            search_from = end_index + 1
    except ValueError:
        pass
This generator does suppress the ValueError, since it keeps on searching until it can't find another pair of tags. If the tags don't exist anywhere in the list, the generator will be empty, but it won't raise any exception (other than StopIteration).
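To see that behaviour on concrete data, here is a small usage sketch. The generator is repeated so the snippet runs on its own; the tokens list is invented and deliberately ends with an unmatched start tag.

```python
def find_all_between_tags(lst, start_tag, end_tag):
    """Yield every run of items found between a start tag and the next end tag."""
    search_from = 0
    try:
        while True:
            start_index = lst.index(start_tag, search_from)
            end_index = lst.index(end_tag, start_index + 1)
            yield lst[start_index + 1:end_index]
            search_from = end_index + 1
    except ValueError:
        return  # no further complete pair of tags


# Hypothetical data: two complete pairs, then a start tag with no end tag.
tokens = ["text1", "a", "b", "text5", "x", "text1", "c", "text5", "text1", "d"]
groups = list(find_all_between_tags(tokens, "text1", "text5"))
missing = list(find_all_between_tags(tokens, "nope", "text5"))
print(groups)
print(missing)
```

The trailing "text1", "d" is dropped silently because no closing tag follows it; whether that is the right behaviour depends on the data.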
You can get the items between the values by utilizing the index function to search for the index of both objects in the list. Be sure to add one to the index of the first object so it won't be included in the result. See my code below:
def get_sublist_between(e1, e2, li):
    return li[li.index(e1) + 1:li.index(e2)]
How can I replace an element during iteration in an elementtree? I'm writing a treeprocessor for markdown and would like to wrap an element.
<pre class='inner'>...</pre>
Should become
<div class='wrapper'><pre class='inner'>...</pre></div>
I use getiterator('pre') to find the elements, but I don't know how to wrap it. The trouble point is replacing the found element with the new wrapper, but preserving the existing one as the child.
This is a bit of a tricky one. First, you'll need to get the parent element as described in this previous question.
parent_map = {c: p for p in tree.iter() for c in p}  # iter() replaces getiterator(), which was removed in Python 3.9
If you can get markdown to use lxml, this is a little easier -- I believe that lxml elements know their parents already.
Now, when you get your element from iterating, you can also get the parent:
for elem in list(tree.iter('pre')):  # iter() replaces the removed getiterator()
    parent = parent_map[elem]
    wrap_elem(parent, elem)
Note that I've turned the iterator from the tree into a list -- We don't want to modify the tree while iterating over it. That could be trouble.
Finally, you're in position to move the element around:
def wrap_elem(parent, elem):
    parent_index = list(parent).index(elem)
    parent.remove(elem)
    new_elem = ET.Element('div', attrib={'class': 'wrapper'})
    parent.insert(parent_index, new_elem)
    new_elem.append(elem)
*Note that I haven't tested this code exactly... let me know if you find any bugs.
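For anyone who wants to try the idea end to end, here is a self-contained sketch using only the standard library. The function name wrap_all and the sample document are assumptions made for the example; it also moves the element's tail text onto the wrapper, a detail the steps above do not cover.

```python
import xml.etree.ElementTree as ET


def wrap_all(root, tag, wrapper_tag, wrapper_attrib):
    """Wrap every descendant <tag> of root in a new <wrapper_tag> element."""
    # ElementTree elements do not know their parent, so build a lookup first.
    parent_map = {child: parent for parent in root.iter() for child in parent}
    # Take a snapshot of the matches so the tree is not modified mid-iteration.
    for elem in list(root.iter(tag)):
        parent = parent_map.get(elem)
        if parent is None:
            continue  # the root itself has no parent and cannot be wrapped here
        index = list(parent).index(elem)
        wrapper = ET.Element(wrapper_tag, dict(wrapper_attrib))
        # Text following the element must now follow the wrapper instead.
        wrapper.tail, elem.tail = elem.tail, None
        parent.remove(elem)
        parent.insert(index, wrapper)
        wrapper.append(elem)
    return root


# Hypothetical input document.
doc = ET.fromstring(
    "<body><p>intro</p><pre class='inner'>code</pre>after<p>end</p></body>"
)
wrap_all(doc, "pre", "div", {"class": "wrapper"})
print(ET.tostring(doc, encoding="unicode"))
```

Building the parent map once is enough here because the wrapped elements keep their position among their siblings.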
In my experience, you can use the method below to get what you want:
xml.etree.ElementTree.SubElement (I will just call it ET.SubElement): http://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.SubElement
Here are the steps:
Before you iterate, get the parent element of the elements you are iterating over and store it in a variable parent.
Then,
1, store the element <pre class='inner'>...</pre> into a variable temp
2, add a new subelement div into parent:
div = ET.SubElement(parent, 'div')
and set the attrib of div:
div.set('class','wrapper')
3, add the element from step 1 as a child of div (SubElement expects a tag name, so use append for an existing element):
div.append(temp)
4, delete the element in step 1:
parent.remove(temp)
Something like this works for one:
for i, element in enumerate(parent):
    if is_the_one_you_want_to_replace(element):
        parent.remove(element)
        parent.insert(i, new_element)
        break
Something like this works for many:
replacement_map = {}
for i, element in enumerate(parent):
    if is_an_element_you_want_to_replace(element):
        # build_replacement is a placeholder for however you create the new element
        replacement_map[i] = element, build_replacement(element)
for index, (el_to_remove, el_to_add) in replacement_map.items():
    parent.remove(el_to_remove)
    parent.insert(index, el_to_add)
Another solution that works for me, similar to lyfing's.
Copy the element into a temp, clear the original, retag it with the wanted outer tag and attributes, then append the copy to it. Note that clear() also removes attributes, so it has to run before set().
import copy
temp = copy.deepcopy(elem)
elem.clear()  # removes children, text, tail and attributes
elem.tag = "div"
elem.set("class", "wrapper")
elem.append(temp)
I need to change the name and extension of a series of files. The names are currently 'tmax.##.txt', but I need it to be 'tmax_##.txt'. Then, I want to change the .txt extension to .asc. I've tried the below code and the first loop works as expected to produce 'tmax_01'. The second loop runs, but produces unexpected results, 't'.
list_raw = 'tmax.01.txt', 'tmax.02.txt', 'tmax.03.txt'
for i in list_raw:
    list_conv = i.replace('.','_')
for i in list_conv:
    list_final = i.replace('_txt','.asc')
Any suggestions?
You are just assigning new values to a variable in each iteration of the loop. What you want to do is create a new list from the modified elements of an existing list, which is best done with a list comprehension:
list_raw = ['tmax.01.txt', 'tmax.02.txt', 'tmax.03.txt']
list_final = [i.replace(".", "_").replace("_txt", ".asc") for i in list_raw]
Note that you can do this, as in my example, in one step - there is no reason to iterate over the list twice, and produce an intermediate list, which is inefficient.
You could also do i.replace(".", "_", 1) to only replace the first ., and avoid having to do the awkward hack with the file extension. However, I would personally use i[:-4].replace(".", "_") + ".asc" - that is, cut off the existing extension with a slice, replace the .s, and then add the new extension.
If the extensions are likely to vary in length, you may want to look into the os.path module, as suggested by sotapme.
Because you're talking of files it may be worth using os.path as it's likely that the next part of your code will be to manipulate these or other files. (just guessing)
os.path.splitext('afile.txt')[0] + '.asc'
Gives
'afile.asc'
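Combining this with the dot replacement from the question could look like the sketch below. The helper name convert_name is made up, and the approach assumes every name ends in a single extension such as .txt.

```python
import os.path


def convert_name(filename, new_ext=".asc"):
    """Replace dots in the stem with underscores and swap the extension."""
    # splitext splits at the last dot only: 'tmax.01.txt' -> ('tmax.01', '.txt')
    stem, _old_ext = os.path.splitext(filename)
    return stem.replace(".", "_") + new_ext


list_raw = ["tmax.01.txt", "tmax.02.txt", "tmax.03.txt"]
list_final = [convert_name(name) for name in list_raw]
print(list_final)
```

This only builds the new names; actually renaming files on disk would be a separate step, for example with os.rename.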
In the first loop:
for i in list_raw:
    list_conv = i.replace('.','_')
list_conv is a str object, not a list. Each iteration overwrites it, so after the loop it holds only the last element of the list with the replacement applied.
Then in your second loop:
for i in list_conv:
    list_final = i.replace('_txt','.asc')
you are iterating over a string, which gives you one character at a time, so list_final ends up holding the last character with the replacement applied.
Since the last character in tmax_03_txt is t, that is why you got t.
If you want to do the replacement on each element of the list, you can use a list comprehension and chain the method calls:
>>> list_raw = ['tmax.01.txt', 'tmax.02.txt', 'tmax.03.txt']
>>> [elem.replace('.', '_').replace('_txt', '.asc') for elem in list_raw]
['tmax_01.asc', 'tmax_02.asc', 'tmax_03.asc']
Alternately you could use the string method rsplit.
list_raw = ['tmax.01.txt', 'tmax.02.txt', 'tmax.03.txt']
list_final = [filename.rsplit('.',1)[0] + '.ext' for filename in list_raw]
Where ext is the new extension. The 1 in rsplit() indicates that only the rightmost '.' will act as split point.