find_all does not find text in mixed content - python

I have a little bit of screen scraping code in Python, using BeautifulSoup, that is giving me a headache. A small change to the HTML made my code break, but I can't see why it fails. This is basically a demo of how the HTML looked when parsed:
import re
from bs4 import BeautifulSoup
soup=BeautifulSoup("""
<td>
<a href="https://alink.com">
Foo Some text Bar
</a>
</td>
""")
links = soup.find_all('a',text=re.compile('Some text'))
links[0]['href'] # => "https://alink.com"
After an upgrade, the a tag body now includes an img tag, which makes the code break.
<td>
<a href="https://alink.com">
<img src="dummy.gif" >
Foo Some text Bar
</a>
</td>
'links' is now an empty list, so the regex is not finding anything.
I hacked around it by matching on the text alone and then taking its parent, but that seems even more fragile:
links = soup.find_all(text=re.compile('Some text'))
links[0].parent['href'] # => "https://alink.com"
Why does adding an img tag as a sibling of the text content break the search done by BeautifulSoup, and is there a way of modifying the first snippet so that it works?

The difference is that the 2nd example has an incomplete img tag:
it should be either
<img src="dummy.gif" />
Foo Some text Bar
or
<img src="dummy.gif" > </img>
Foo Some text Bar
Instead, it is parsed as
<img src="dummy.gif" >
Foo Some text Bar
</img>
So the element that directly contains the text is no longer a but img, whose parent is a.

The first example works only if a.string is not None i.e., iff the text is the only child.
As a workaround, you could use a function predicate:
a = soup.find(lambda tag: tag.name == 'a' and tag.has_attr('href') and 'Some text' in tag.text)
print(a['href'])
# -> https://alink.com
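To make the a.string point concrete, here is a minimal sketch. It assumes the bs4 package with Python's built-in html.parser (other parsers may build a different tree), and it uses the newer string= keyword in place of text=. The variable names are only illustrative.

```python
import re
from bs4 import BeautifulSoup

# Same link twice: once with text only, once with an img next to the text.
plain = BeautifulSoup('<a href="https://alink.com">Foo Some text Bar</a>', "html.parser")
mixed = BeautifulSoup('<a href="https://alink.com"><img src="dummy.gif" >Foo Some text Bar</a>', "html.parser")

pattern = re.compile("Some text")

# The text is the only child, so a.string is that text and the filter matches.
plain_hits = plain.find_all("a", string=pattern)

# Two children (img and text), so a.string is None and the filter finds nothing.
mixed_string = mixed.a.string
mixed_hits = mixed.find_all("a", string=pattern)

# get_text() joins all descendant strings, so a predicate on it still matches.
link = mixed.find(
    lambda tag: tag.name == "a" and tag.has_attr("href") and "Some text" in tag.get_text()
)
print(len(plain_hits), mixed_string, len(mixed_hits), link["href"])
```

The predicate form does not care how many children the a tag has, which is why it survives the added img.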

Related

Adding a variable string to an expression in Airium (Python)

I'm working on a little jig that generates a static gallery page based on a folder full of images. My current hangup is generating the HTML itself.
I used Airium to reverse-translate my existing HTML to Airium's Python code, and added the variables I want to modify for each anchor tag in a loop. But I can't for the life of me figure out how to get it to let me add 'thumblink'. I'm not sure why it's treating it so differently from the others; my guess is that Airium expects foo:bar but not foo:bar(xyz), with xyz being the only part I want to pull out and modify.
from airium import Airium
imagelink = "image name here" # after pulling image filename from list
thumblink = "thumb link here" # after resizing image to thumb size
artistname = "artist name here" # after extracting artist name from filename
a = Airium()
with a.a(href='javascript:void(0);', **{'data-image': imagelink}):
    with a.div(klass='imagebox', style='background-image:url(images/2015-12-29kippy.png)'):
        a.div(klass='artistname', _t=artistname)
html = str(a) # cast to string
print(html) # print to console
where "images/2015-12-29kippy.png" is what I'd replace with string variable "thumblink".
image and artist do translate correctly in the output after testing -
<a href="javascript:void(0);" data-image="image name here">
<div class="imagebox" style="background-image:url(images/2015-12-29kippy.png)">
<div class="artistname">artist name here</div>
</div>
</a>
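One possible way to get thumblink into the style attribute, assuming Airium treats the style value as an ordinary Python string (as it appears to for the hard-coded one above), is to build that string with an f-string before passing it in. The thumblink value below is a stand-in for whatever the loop produces.

```python
# Hypothetical value; in the real loop this would be the generated thumbnail path.
thumblink = "images/2015-12-29kippy.png"

# Build the whole CSS declaration as one string, with the variable spliced in.
style = f"background-image:url({thumblink})"
print(style)
```

The resulting string would then replace the literal in the existing call, i.e. a.div(klass='imagebox', style=style).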

BeautifulSoup4: Fail to find 'a' tag with specific href value by find()

I am trying to scrape the real-time Bitcoin-HKD price from https://www.coinbase.com/pt-PT/price/ with python3.
The only way I found to locate it specifically in the HTML is through this a tag with href="/pt-PT/price/bitcoin":
<a href="/pt-PT/price/bitcoin" title="Visite a moeda Bitcoin" data-element-handle="asset-highlight-top-daily-volume" class="Link__A-eh4rrz-0 hfBqui AssetHighlight__StyledLink-sc-1srucyv-1 cbFcph" color="slate">
<h2 class="AssetHighlight__Title-sc-1srucyv-2 jmJxYl">Volume mais alto (24 h)</h2>
<div class="Flex-l69ttv-0 gaVUrq">
<img src="https://dynamic-assets.coinbase.com/e785e0181f1a23a30d9476038d9be91e9f6c63959b538eabbc51a1abc8898940383291eede695c3b8dfaa1829a9b57f5a2d0a16b0523580346c6b8fab67af14b/asset_icons/b57ac673f06a4b0338a596817eb0a50ce16e2059f327dc117744449a47915cb2.png" alt="Visite a moeda Bitcoin" aria-label="Visite a moeda Bitcoin" loading="lazy" class="AssetHighlight__AssetImage-sc-1srucyv-5 lcjcxh"/>
<div class="Flex-l69ttv-0 kvilOX">
<div class="Flex-l69ttv-0 gTbYCC">
<h3 class="AssetHighlight__SubTitle-sc-1srucyv-3 gdcBEE">Bitcoin</h3>
<p class="AssetHighlight__Price-sc-1srucyv-4 bUAWAG">460 728,81 HK$</p>
Here 460 728,81 HK$ is the data I want.
So I used the following code:
import bs4
import urllib.request as req
url = "https://www.coinbase.com/price/bitcoin/hkd"
request = req.Request(url, headers={
    "user-agent": "..."
})
with req.urlopen(request) as response:
    data = response.read().decode("utf-8")
root = bs4.BeautifulSoup(data, "html.parser")
secBitcoin = root.find('a', href="/pt-PT/price/bitcoin")
realtimeCurrency = secBitcoin.find('p')
print(realtimeCurrency.string)
However, secBitcoin is always None; nothing matches.
The find function works just fine when I search for 'div' tags with the class parameter.
I have also tried a CSS-selector-style argument like
.find('a[href="/pt-PT/price/bitcoin"]')
But nothing works.
It's possible the page is loading the currency values after the initial page load. You could try hitting ctrl+s to save the full webpage and open that file instead of using requests. If that also doesn't work, then I'm not sure where the problem is.
And if that does work, then you'll probably need to use something like selenium to get what you need
href is an attribute of an element and hence I think you cannot find it that way.
def is_a_and_href_matching(element):
    # Match only <a> tags whose href is exactly the Bitcoin price path.
    is_a = element.name == 'a'
    if is_a and element.has_attr('href'):
        if element['href'] == "/pt-PT/price/bitcoin":
            return True
    return False
secBitcoins = root.find_all(is_a_and_href_matching)
for secBitcoin in secBitcoins:
    p = secBitcoin.find('p')
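As a quick check of the lookups discussed in this thread, here is a sketch that runs them against a shortened, static copy of the markup from the question (trimmed by hand for illustration). It assumes the bs4 package. On static HTML, find() with an href keyword does match, and a CSS selector goes through select_one() rather than find(), since find() treats its first argument as a tag name. If the same lookup returns None on the live page, that would be consistent with the content being filled in after the initial page load.

```python
from bs4 import BeautifulSoup

# Hand-trimmed version of the markup shown in the question.
snippet = """
<a href="/pt-PT/price/bitcoin" title="Visite a moeda Bitcoin">
  <h3>Bitcoin</h3>
  <p>460 728,81 HK$</p>
</a>
"""
root = BeautifulSoup(snippet, "html.parser")

# Attribute filters are passed as keyword arguments to find().
link = root.find("a", href="/pt-PT/price/bitcoin")

# CSS selectors are handled by select_one() / select().
css_link = root.select_one('a[href="/pt-PT/price/bitcoin"]')

# Guard against None so a missing element does not raise AttributeError.
price = link.find("p").get_text(strip=True) if link is not None else None
print(price)
```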

Keep \n in string content and write to one line

I have the following code for parsing some HTML. I need to save the output (html result) as a single line of code with the escaped character sequences there such as \n but I'm either getting a representation I can't use from repr() because of the single quotes or the output is being written to multiple lines like so (interpreting the escape sequences):
<section class="prog__container">
<span class="prog__sub">Title</span>
<p>PEP 336 - Make None Callable</p>
<span class="prog__sub">Description</span>
<p>
<p>
<code>
None
</code>
should be a callable object that when called with any
arguments has no side effect and returns
<code>
None
</code>
.
</p>
</p>
</section>
What I require (including the escape sequences):
<section class="prog__container">\n <span class="prog__sub">Title</span>\n <p>PEP 336 - Make None Callable</p>\n <span class="prog__sub">Description</span>\n <p>\n <p>\n <code>\n None\n </code>\n should be a callable object that when called with any\n arguments has no side effect and returns\n <code>\n None\n </code>\n .\n </p>\n </p>\n </section>
My Code
soup = BeautifulSoup(html, "html.parser")
for match in soup.findAll(['div']):
    match.unwrap()
for match in soup.findAll(['a']):
    match.unwrap()
html = soup.contents[0]
html = str(html)
html = html.splitlines(True)
html = " ".join(html)
html = re.sub(re.compile("\n"), "\\n", html)
html = repr(html) # my current solution works, but unusable
The above is my solution, but an object representation is no good, I need the string representation. How can I achieve this?
Why not just use repr?
a = """this is the first line
this is the second line"""
print(repr(a))
Or even, if I understand correctly that you want the exact output without the literal quotes:
print(repr(a).strip("'"))
Output:
'this is the first line\nthis is the second line'
this is the first line\nthis is the second line
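One caveat with repr(a).strip("'"): repr switches to double quotes when the text contains a single quote, so the strip then leaves quotes behind. A small sketch of an alternative that only touches newlines is below; escape_newlines is a hypothetical helper name, not something from the thread.

```python
def escape_newlines(text):
    """Return text on one line, with each real newline written as backslash + n."""
    # Escape existing backslashes first so the result stays unambiguous.
    return text.replace("\\", "\\\\").replace("\n", "\\n")

a = """this is the first line
this is the second line"""
one_line = escape_newlines(a)
print(one_line)

# A string with an apostrophe: repr() wraps it in double quotes instead.
b = "it's the first line\nit's the second line"
print(repr(b).strip("'"))
print(escape_newlines(b))
```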
import bs4
html = '''<section class="prog__container">
<span class="prog__sub">Title</span>
<p>PEP 336 - Make None Callable</p>
<span class="prog__sub">Description</span>
<p>
<p>
<code>
None
</code>
should be a callable object that when called with any
arguments has no side effect and returns
<code>
None
</code>
.
</p>
</p>
</section>'''
soup = bs4.BeautifulSoup(html, 'lxml')
str(soup)
out:
'<html><body><section class="prog__container">\n<span class="prog__sub">Title</span>\n<p>PEP 336 - Make None Callable</p>\n<span class="prog__sub">Description</span>\n<p>\n</p><p>\n<code>\n None\n </code>\n should be a callable object that when called with any\n arguments has no side effect and returns\n <code>\n None\n </code>\n .\n </p>\n</section></body></html>'
There are more complex ways to output the HTML code, described in the documentation.
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen('https://www.example.com')
soup = BeautifulSoup(r.read(), 'html.parser')
html = str(soup)
This gives you the HTML as one string, with the lines separated by \n.

Replacing string contents with regular expressions

I am trying to remove all the HTML surrounding the data that I seek from a webpage, so that all that is left is the raw data that I can then put into a database. So if I have something like:
<p class="location"> Atlanta, GA </p>
The following code would return
Atlanta, GA </p>
But what I expect is not what is returned. This is a more specific solution to the basic problem I found here. Any help would be appreciated, thanks! Code is found below.
def delHTML(self, html):
    """
    html is a list made up of items with data surrounded by html
    this function should get rid of the html and return the data as a list
    """
    for n,i in enumerate(html):
        if i==re.match('<p class="location">',str(html[n])):
            html[n]=re.sub('<p class="location">', '', str(html[n]))
    return html
As rightfully pointed out in the comments, you should be using a specific library to parse HTML and extract text, here are some examples:
html2text: Limited functionality, but exactly what you need.
BeautifulSoup: More complex, more powerful.
Assuming all you want is to extract the data contained in <p class="location"> tags, you could use a quick & dirty (but correct) approach with the Python HTMLParser module (a simple HTML SAX parser), like this:
from html.parser import HTMLParser  # Python 3; on Python 2 the module was HTMLParser
class MyHTMLParser(HTMLParser):
    PLocationID=0  # nesting depth of the <p class="location"> being captured (0 = none)
    PCount=0       # current nesting depth of <p> tags
    buf=""
    out=[]
    def handle_starttag(self, tag, attrs):
        if tag=="p":
            self.PCount+=1
            if ("class", "location") in attrs and self.PLocationID==0:
                self.PLocationID=self.PCount
    def handle_endtag(self, tag):
        if tag=="p":
            if self.PLocationID==self.PCount:
                self.out.append(self.buf)
                self.buf=""
                self.PLocationID=0
            self.PCount-=1
    def handle_data(self, data):
        if self.PLocationID:
            self.buf+=data
# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed("""
<html>
<body>
<p>This won't appear!</p>
<p class="location">This <b>will</b></p>
<div>
<p class="location">This <span class="someclass">too</span></p>
<p>Even if <p class="location">nested Ps <p class="location"><b>shouldn't</b> <p>be allowed</p></p> <p>this will work</p></p> (this last text is out!)</p>
</div>
</body>
</html>
""")
print(parser.out)
Output:
['This will', 'This too', "nested Ps shouldn't be allowed this will work"]
This will extract all the text contained inside any <p class="location"> tag, stripping all the tags inside it. Separate tags (if not nested - which shouldn't be allowed anyhow for paragraphs) will have a separate entry in the out list.
Notice that for more complex requirements this can easily get out of hand; in those cases a DOM parser is way more appropriate.
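As a sketch of the tree-parser route mentioned above, the same extraction could look like this with BeautifulSoup. It assumes the bs4 package with the built-in html.parser, and the sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample in the shape described in the question.
html = (
    '<p class="location"> Atlanta, GA </p>'
    '<p>not a location</p>'
    '<p class="location">Boston, <b>MA</b></p>'
)
soup = BeautifulSoup(html, "html.parser")

# get_text() drops nested tags; strip() trims the surrounding whitespace.
locations = [p.get_text().strip() for p in soup.find_all("p", class_="location")]
print(locations)
```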

Beautiful Soup: Get the Contents of Sub-Nodes

I have the following Python code:
def scrapeSite(urlToCheck):
    html = urllib2.urlopen(urlToCheck).read()
    from BeautifulSoup import BeautifulSoup
    soup = BeautifulSoup(html)
    tdtags = soup.findAll('td', { "class" : "c" })
    for t in tdtags:
        print t.encode('latin1')
This returns the following HTML:
<td class="c">
<a href="more.asp">FOO</a>
</td>
<td class="c">
<a href="alotmore.asp">BAR</a>
</td>
I'd like to get the text between the a-Node (e.g. FOO or BAR), which would be t.contents.contents. Unfortunately it doesn't work that easy :)
Does anyone have an idea how to solve that?
Thanks a lot, any help is appreciated!
Cheers,
Joseph
In this case, you can use t.contents[1].contents[0] to get FOO and BAR.
The thing is that contents returns a list of all child elements (Tags and NavigableStrings). If you print contents, you will see something like
[u'\n', <a href="more.asp">FOO</a>, u'\n']
So, to get to the actual tag you need to access contents[1] (assuming exactly this markup; the index can vary depending on the source HTML). Once you have found the proper index, you can use contents[0] on it to get the string inside the a tag.
Since this depends on the exact contents of the HTML source, it is very fragile. A more generic and robust solution is to use find() again to locate the 'a' tag, via t.find('a'), and then use its contents list: t.find('a').contents[0] gives the first value, and t.find('a').contents gives the whole list.
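The find('a') approach could be sketched as follows with the current bs4 package on Python 3 (the question itself uses the older BeautifulSoup 3 on Python 2). The anchor markup and href values here are an assumption, taken from the output shown further down in this thread.

```python
from bs4 import BeautifulSoup

# Assumed markup: the hrefs come from the output quoted later in the thread.
html = """
<td class="c">
<a href="more.asp">FOO</a>
</td>
<td class="c">
<a href="alotmore.asp">BAR</a>
</td>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for td in soup.find_all("td", class_="c"):
    link = td.find("a")  # no dependence on the position of whitespace nodes
    if link is not None:
        results.append((link.get_text(strip=True), link["href"]))
print(results)
```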
For your specific example, pyparsing's makeHTMLTags can be useful: the expressions it builds tolerate many of the variations found in HTML tags, yet still give the results a handy structure:
html = """
<td class="c">
<a href="more.asp">FOO</a>
</td>
<td class="c">
<a href="alotmore.asp">BAR</a>
</td>
<td class="d">
BAZZ
</td>
"""
from pyparsing import *
td,tdEnd = makeHTMLTags("td")
a,aEnd = makeHTMLTags("a")
td.setParseAction(withAttribute(**{"class":"c"}))
pattern = td + a("anchor") + SkipTo(aEnd)("aBody") + aEnd + tdEnd
for t,_,_ in pattern.scanString(html):
    print t.aBody, '->', t.anchor.href
prints:
FOO -> more.asp
BAR -> alotmore.asp
