My html looks like:
<td>
  <table ..>
    <tr>
      <th ..>price</th>
      <th>$99.99</th>
    </tr>
  </table>
</td>
So I am in the current table cell, how would I get the 99.99 value?
I have so far:
td[3].findChild('th')
But I need to do:
Find th with text 'price', then get next th tag's string value.
Think about it in "steps"... given that some x is the root of the subtree you're considering,
x.findAll(text='price')
is the list of all items in that subtree containing text 'price'. The parents of those items then of course will be:
[t.parent for t in x.findAll(text='price')]
and if you only want to keep those whose "name" (tag) is 'th', then of course
[t.parent for t in x.findAll(text='price') if t.parent.name=='th']
and you want the "next siblings" of those (but only if they're also 'th's), so
[t.parent.nextSibling for t in x.findAll(text='price')
 if t.parent.name=='th' and t.parent.nextSibling and t.parent.nextSibling.name=='th']
Here you see the problem with using a list comprehension: too much repetition, since we can't assign intermediate results to simple names. Let's therefore switch to a good old loop...:
Edit: added tolerance for a string of text between the parent th and the "next sibling" as well as tolerance for the latter being a td instead, per OP's comment.
for t in x.findAll(text='price'):
    p = t.parent
    if p.name != 'th': continue
    ns = p.nextSibling
    if ns and not ns.name: ns = ns.nextSibling
    if not ns or ns.name not in ('td', 'th'): continue
    print ns.string
I've added ns.string, which will give the next sibling's contents if and only if they're just text (no further nested tags) -- of course you can instead analyze further at this point, depending on your application's needs!-). Similarly, I imagine you won't be doing just print but something smarter, but I'm giving you the structure.
Talking about the structure, notice that twice I use if...: continue: this reduces nesting compared to the alternative of inverting the if's condition and indenting all the following statements in the loop -- and "flat is better than nested" is one of the koans in the Zen of Python (import this at an interactive prompt to see them all and meditate;-).
With pyparsing, it's easy to reach into the middle of some HTML for a tag pattern like this:
from pyparsing import makeHTMLTags, Combine, Word, nums
th,thEnd = makeHTMLTags("TH")
floatnum = Combine(Word(nums) + "." + Word(nums))
priceEntry = (th + "price" + thEnd +
              th + "$" + floatnum("price") + thEnd)
tokens,startloc,endloc = priceEntry.scanString(html).next()
print tokens.price
Pyparsing's makeHTMLTags helper returns a pair of pyparsing expressions, one for the start tag and one for the end tag. The start tag pattern does much more than just wrap the given string in "<>"s: it also allows for extra whitespace, variable case, and the presence or absence of tag attributes. For instance, note that even though I specified "TH" as the table head tag, it will also match "th", "Th", "tH" and "TH". Pyparsing's default whitespace-skipping behavior will also handle extra spaces, between tag and "$", between "$" and numeric price, etc., without having to sprinkle "zero or more whitespace chars could go here" indicators. Lastly, by assigning the results name "price" (following floatnum in the definition of priceEntry), it makes it very simple to access that specific value from the full list of tokens matching the overall priceEntry expression.
(Combine is used for 2 purposes: it disallows whitespace between the components of the number; and returns a single combined token "99.99" instead of the list ["99", ".", "99"].)
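If pyparsing isn't available, a rough stdlib-only sketch with re can do the same narrow job, shown here only for comparison. This is not the pyparsing approach, it's a hand-rolled regex that crudely mirrors the same tolerances (optional attributes, any tag case, flexible whitespace) and is fragile on real-world HTML:

```python
import re

# Sample input with mixed tag case and stray whitespace, like the tolerances
# makeHTMLTags handles automatically.
html = "<td><table><tr><th>price</th> <TH>$ 99.99</TH></tr></table></td>"

# <th[^>]*> tolerates attributes inside the tag; \s* tolerates whitespace
# between the tag, the "$", and the number; IGNORECASE tolerates TH/th/Th/tH.
pat = re.compile(
    r"<th[^>]*>\s*price\s*</th>\s*"
    r"<th[^>]*>\s*\$\s*(\d+\.\d+)\s*</th>",
    re.IGNORECASE,
)

m = pat.search(html)
print(m.group(1))  # -> 99.99
```

Note how much of the pattern is spent re-inventing tolerances that makeHTMLTags gives you for free; that is the trade-off of the regex route.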
I wrote some code to extract multiple pieces of information about movies. I have a problem with manipulating one of the strings:
'\n'
'\t\t\t\t85\xa0mins \xa0\n'
'\t\t\t\t\n'
'\t\t\t\t\tMore details at\n'
'\t\t\t\t\t'
I want to extract the duration of the movie, which, in this case, is the number 85.
I don't really know how to extract it since the format is pretty weird. My web scraping program yields items as dictionaries, for example:
{'film_director': ['Alfred Hitchcock'],
 'film_rating': ['4.0'],
 'film_time': '\n'
              '\t\t\t\t81\xa0mins \xa0\n'
              '\t\t\t\t\n'
              '\t\t\t\t\tMore details at\n'
              '\t\t\t\t\t',
 'film_title': ['Rope'],
 'film_year': ['1948']}
I have tried splitting it, but it doesn't seem to work. Any other ideas?
film_dict = {'film_director': ['Alfred Hitchcock'],
             'film_rating': ['4.0'],
             'film_time': '\n'
                          '\t\t\t\t81\xa0mins \xa0\n'
                          '\t\t\t\t\n'
                          '\t\t\t\t\tMore details at\n'
                          '\t\t\t\t\t',
             'film_title': ['Rope'],
             'film_year': ['1948']}
film_time = film_dict['film_time'].replace('\t', '').replace('\n', '')[:2]
print(film_time)
The last two lines take the film-time value, remove the tab and newline characters, and then truncate it to just the part you need. The replace() calls replace each '\t' and each '\n' with nothing, which simply removes them. [:2] then slices from the beginning of the cleaned string up to, but not including, index 2, leaving just the two digits '81'. Note that this assumes the duration is always exactly two digits.
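Since slicing by position breaks as soon as the surrounding text changes (say, a three-digit runtime), a more robust sketch pulls the first run of digits out with the stdlib re module. Here film_dict stands in for the example dictionary above:

```python
import re

film_dict = {'film_time': '\n\t\t\t\t81\xa0mins \xa0\n'
                          '\t\t\t\t\n\t\t\t\t\tMore details at\n\t\t\t\t\t'}

# Grab the first run of digits, regardless of surrounding tabs, newlines,
# or non-breaking spaces -- works for 2- or 3-digit durations alike.
m = re.search(r'\d+', film_dict['film_time'])
film_time = m.group() if m else None
print(film_time)  # -> 81
```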
I have been googling and looking at other questions here about searching for a string in a BeautifulSoup object.
Per my searching, the following should detect the string, but it doesn't:
strings = soup.find_all(string='Results of Operations and Financial Condition')
However, the following detects the string:
tags = soup.find_all('div', {'class': 'info'})
for tag in tags:
    if re.search('Results of Operations and Financial Condition', tag.text):
        ''' Do Something'''
Why does one work and the other not?
You might want to use:
strings = soup.find_all(string=lambda x: 'Results of Operations and Financial Condition' in x)
This happens because find_all looks for an exact match of the string you pass. I suppose you have some other text next to 'Results of Operations and Financial Condition' in that element's text.
If you check the docs here you can see that you can give a function to that string param and it seems that the following lines are equivalent:
soup.find_all(string='Results of Operations and Financial Condition')
soup.find_all(string=lambda x: x == 'Results of Operations and Financial Condition')
For this code
page = urllib.request.urlopen('https://en.wikipedia.org/wiki/Alloxylon_pinnatum')
sp = bs4.BeautifulSoup(page)
print(sp.find_all(string=re.compile('The pinkish-red compound flowerheads'))) # You need to use like this to search within text nodes.
print(sp.find_all(string='The pinkish-red compound flowerheads, known as'))
print(sp.find_all(string='The pinkish-red compound flowerheads, known as ')) #notice space at the end of string
Results are -
['The pinkish-red compound flowerheads, known as ']
[]
['The pinkish-red compound flowerheads, known as ']
It looks like the string argument searches for an exact, full-string match: not whether some HTML text node contains that string, but whether the exact value of the HTML text node equals it. You can, however, use regular expressions to search whether a text node contains some string, as shown in the code above.
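The same exact-match-versus-substring distinction can be demonstrated with plain re, independent of BeautifulSoup; the analogy (my framing, not the bs4 docs') is that a bare string behaves like fullmatch on the node's text, while the regex and lambda forms behave like search:

```python
import re

# The actual Wikipedia text node ends with trailing text ("known as ...").
node_text = 'The pinkish-red compound flowerheads, known as '
needle = 'The pinkish-red compound flowerheads'

# fullmatch: the whole node must equal the needle -- fails, like string='...'
print(bool(re.fullmatch(re.escape(needle), node_text)))  # False

# search: the node only has to contain the needle -- succeeds, like
# string=re.compile('...') or string=lambda x: needle in x
print(bool(re.search(re.escape(needle), node_text)))     # True
```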
Currently, I have a table that looks like this:
<tr class="tdc"><td class="myip_tdc">Account<br/><small>client</small></td>
<td class="tdc">Nov, 19 2015 05:18 pm </td>
<td class="tdc"><small><span style="color:green"> Check </span></small></td>
<tr class="tr"><td class="tde" colspan="6">
<div class="divl" id="wtt1266" style="display: block"><table><tr><td style="padding: 5px"><table><tr><td colspan="3"></td></tr><tr><td>
</td><td>
The cell containing the string "Check" is the one I want to look for. I assume it's looking for the exact string, so maybe I need regex to handle the fact that I do not want "checked" to also count. I haven't even gotten there yet, but if someone has insight to offer, I'll take it!
So, I have the following code:
soup = BeautifulSoup(nextpage, "lxml")  # page is now converted to a BeautifulSoup object
table = soup.find("table", {'class': 'tbled'})  # here is our table
tablerow = soup.find("tr", {'class': "tr"})  # here is a single row of that table
tablecell = soup.find("td", {'class': 'tdc'})
for line in tablerow:
    if line.find("Check"):
        print "Yay"
        print line
So, the problem with this is that it's printing all the cells (good), but printing "Yay" after every line. I just want it to print "Yay" after the single cell with "Check" in it. I thought the if statement would take care of that, but I've messed up that logic somehow. Any ideas?
If you wanted to go the regex route instead, this would be the regex (note the raw-string r prefix, without which Python reads \b as a backspace character instead of a word boundary):
for line in tablerow:
    match = re.search(r"\bCheck\b", str(line))
    if match:
        print "Yay"
This would match Check but not Checked
Or if you don't want it to be case-specific:
for line in tablerow:
    match = re.search(r"\b[Cc]heck\b", str(line))
    if match:
        print "Yay"
Would also work
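A quick stdlib check of the word-boundary behaviour described above, runnable on its own in Python 3 (the sample strings are made up for illustration):

```python
import re

# Raw string is important: without the r prefix, \b would be a backspace char.
pattern = re.compile(r'\b[Cc]heck\b')

print(bool(pattern.search('<span> Check </span>')))  # True
print(bool(pattern.search('Checked')))               # False: 'ed' removes the boundary
print(bool(pattern.search('please check this')))     # True: [Cc] covers lowercase
```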
There are multiple ways to approach the problem.
One idea would be to pass a function as a text argument value to the find() method. That function would strip the text of an element and compare it to Check. Then, once the element is found, we can go up the tree and find the parent td element:
elm = soup.find(text=lambda x: x and x.strip() == "Check")
td = elm.find_parent("td", class_="tdc")
To extend #Nefarii's answer, here is how you can apply the word-bounded regular expression:
elm = soup.find(text=re.compile(r"\b[Cc]heck\b"))
td = elm.find_parent("td", class_="tdc")
For example, given this HTML block:
<p><b>text1</b> (<span>asdftext2</span>)</p>
I need to select all "a" tags, and everything else should be plain text, just as we see it in a browser:
result = ["text1", " (", <tag_a>, "text2", ")"]
or something like that.
Tried:
hxs.select('.//a|text()')
in this case it finds all "a" tags, but text is returned only from direct children.
At the same time:
hxs.select('.//text()|a')
gets all text, but "a" tags only from direct children.
UPDATE
elements = []
for i in hxs.select('.//node()'):
    try:
        tag_name = i.select('name()').extract()[0]
    except TypeError:
        tag_name = '_text'
    if tag_name == 'a':
        elements.append(i)
    elif tag_name == '_text':
        elements.append(i.extract())
is there a better way?
Is this the kind of thing you're looking for?
You can remove the descendant tags from the block using etree.strip_tags
from lxml import etree

d = etree.HTML('<html><body><p><b>text1</b> (<span>asdftext2</span>)</p></body></html>')
block = d.xpath('/html/body/p')[0]
# etree.strip_tags apparently takes a list of tags to strip, but it wasn't working for me
for tag in set(x.tag for x in block.iterdescendants() if x.tag != 'a'):
    etree.strip_tags(block, tag)
block.xpath('./text()|a')
Yields:
['text1', ' (', <Element a at fa4a48>, 'text2', ')']
It looks to me as if you are stepping beyond XPath territory. XPath is good at selecting things from the input but not at constructing output. It was designed, of course, for use with XSLT where XSLT instructions handle the output side. I'm not sure what the Python equivalent would be.
These relative XPath expressions:
.//text()|.//a
Or
.//node()[self::text()|self::a]
Meaning: all descendant text nodes or a elements from the context node.
Note: It's up to the host language or the XPath engine whether this node-set result is returned in document order or not. By definition, node sets are unordered.
I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
print repr(item.title[0:-1])
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output currently is repr()s? Is there a better way? I was thinking I'd use a loop like this (and this is my pseudo-Python, just the kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print info.groups()
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call them on the info object:
print info.group(1) # or info.groups()[0]
print '"%s","%s","%s"' % (info.group(1), info.group(2), info.group(3))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you know all the known possible characters in the title. If there are non-alpha chars in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above, parsed left to right, breaks down as follows:
([\w\s]+) : Match any word or space characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. If there can be numbers and dashes here, you'll want to modify the pieces between the square brackets, which are the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so they will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so they will be captured as the third group.
\) : Closing parenthesis for the above.
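Putting the breakdown above together (after restoring the trailing parenthesis that the question's titles are missing), a self-contained run in modern Python 3 might look like this; the .strip() calls are my addition, to drop the trailing spaces the first two groups capture:

```python
import re

title = 'Michael Schenker Group (House of Blues Dallas 3/26)'

# Same pattern as above: two word/space groups around a literal "(",
# then digits/digits for the date, then the closing ")".
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
band, venue, date = (g.strip() for g in pat.match(title).groups())
print(band, venue, date, sep=' | ')
# -> Michael Schenker Group | House of Blues Dallas | 3/26
```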
The python intro to regex is quite good, and you might want to spend an evening going over it http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from, but I'm pretty sure you can simply use item.title. All you're doing is removing the last character from the string; the repr() call just produces a printable representation and isn't needed here.
Your code should look something like this:
import re
import feedparser            # from www.feedparser.org
from geopy import geocoders  # from GeoPy

us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)

lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
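One caveat with building the file via ",".join: a band or venue name containing a comma would corrupt the output. The stdlib csv module handles quoting automatically; here is a sketch with made-up rows (field order mirrors the code above, and the coordinates are the question's placeholder values, not real geocodes):

```python
import csv
import io

rows = [
    ('randy travis', 'Billy Bobs', '3/21', '1234.5678', '1234.5678'),
    ('Michael Schenker Group', 'House of Blues Dallas', '3/26', '4321.8765', '4321.8765'),
]

# Writing to an in-memory buffer; in real code you'd open a file instead.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['band', 'venue', 'date', 'lat', 'long'])
writer.writerows(rows)
print(buf.getvalue())
```

csv.writer only quotes fields when they need it, so for these rows the output matches the plain comma-joined format, but a venue like "Billy Bobs, Fort Worth" would come out correctly quoted too.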