I have the following code for parsing some HTML. I need to save the output (html result) as a single line of code with the escaped character sequences there such as \n but I'm either getting a representation I can't use from repr() because of the single quotes or the output is being written to multiple lines like so (interpreting the escape sequences):
<section class="prog__container">
<span class="prog__sub">Title</span>
<p>PEP 336 - Make None Callable</p>
<span class="prog__sub">Description</span>
<p>
<p>
<code>
None
</code>
should be a callable object that when called with any
arguments has no side effect and returns
<code>
None
</code>
.
</p>
</p>
</section>
What I require (including the escape sequences):
<section class="prog__container">\n <span class="prog__sub">Title</span>\n <p>PEP 336 - Make None Callable</p>\n <span class="prog__sub">Description</span>\n <p>\n <p>\n <code>\n None\n </code>\n should be a callable object that when called with any\n arguments has no side effect and returns\n <code>\n None\n </code>\n .\n </p>\n </p>\n </section>
My Code
soup = BeautifulSoup(html, "html.parser")
for match in soup.findAll(['div']):
match.unwrap()
for match in soup.findAll(['a']):
match.unwrap()
html = soup.contents[0]
html = str(html)
html = html.splitlines(True)
html = " ".join(html)
html = re.sub(re.compile("\n"), "\\n", html)
html = repl(html) # my current solution works, but unusable
The above is my solution, but an object representation is no good, I need the string representation. How can I achieve this?
Why don't use just repr?
a = """this is the first line
this is the second line"""
print repr(a)
Or even (if I clear with your issue of exact output without literal quotes)
print repr(a).strip("'")
Output:
'this is the first line\nthis is the second line'
this is the first line\nthis is the second line
import bs4
html = '''<section class="prog__container">
<span class="prog__sub">Title</span>
<p>PEP 336 - Make None Callable</p>
<span class="prog__sub">Description</span>
<p>
<p>
<code>
None
</code>
should be a callable object that when called with any
arguments has no side effect and returns
<code>
None
</code>
.
</p>
</p>
</section>'''
soup = bs4.BeautifulSoup(html, 'lxml')
str(soup)
out:
'<html><body><section class="prog__container">\n<span class="prog__sub">Title</span>\n<p>PEP 336 - Make None Callable</p>\n<span class="prog__sub">Description</span>\n<p>\n</p><p>\n<code>\n None\n </code>\n should be a callable object that when called with any\n arguments has no side effect and returns\n <code>\n None\n </code>\n .\n </p>\n</section></body></html>'
There are more complex way to output the html code in the Document
from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen('https://www.example.com')
soup = BeautifulSoup(r.read(), 'html.parser')
html = str(soup)
This will give your html as one string and lines separated by \n
Related
Trying to grab the magnet link from the following code
rawdata = ''' <div class="iaconbox center floatright">
<a rel="12624681,0" class="icommentjs kaButton smallButton rightButton" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html#comment">209 <i class="ka ka-comment"></i></a> <a class="icon16" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html" title="Verified Torrent"><i class="ka ka16 ka-verify ka-green"></i></a> <div data-sc-replace="" data-sc-slot="_ae58c272c09a10c792c6b17d55c20208" class="none" data-sc-params="{ 'name': 'Zootopia%202016%201080p%20HDRip%20x264%20AC3-JYK', 'extension': 'mkv', 'magnet': 'magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce' }"></div>
<a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce" class="icon16 askFeedbackjs" data-id="CE8357DED670F06329F6028D2F2CEA6F514646E0"><i class="ka ka16 ka-magnet"></i></a>
<a data-download="" title="Download torrent file" href="https://kat.cr/torrents/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681/" class="icon16 askFeedbackjs"><i class="ka ka16 ka-arrow-down"></i></a>
</div> '''
Using this command
rawdata[rawdata.find("<")+1:rawdata.find(">")]
Gives me
div class="iaconbox center floatright"
But when I try to find Magnet link
rawdata[rawdata.find("href="magnet:?")+1:rawdata.find(""")]
It gives me
' '
What I actually want it to give me
magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
It's so easy with Shell, but it has to be done with Python itself.
try rawdata[rawdata.find('href="magnet:?')+1:rawdata.find('"')]
It's better to use regular expression.
import re
rawdata = '''your rawdata......'''
regex = re.compile('href="(.+)" class="icon16')
magnet_href = regex.search(rawdata).group(1)
First of all, as pointed out by HenryM, you need to use single quotes or escape the " to make the strings valid.
Second, find() always returns the first index of the character found. So you will find the first " and not the one ending the link. To fix this use the beg parameter to define the beginning of your search.
Additionally, you need to add the length of your query to the start index, as find gives you the starting index of the match, not the end you are looking for. The code would look something like this (completely untested):
start = rawdata.find('href="magnet:?') + 14
end = rawdata.find('"', beg=start)
link = rawdata[start:end]
The input data is an HTML fragment. You should not be using regular expressions to parse it.
Use a parser instead. Here is a working sample using BeautifulSoup HTML parser:
from bs4 import BeautifulSoup
rawdata = ''' <div class="iaconbox center floatright">
<a rel="12624681,0" class="icommentjs kaButton smallButton rightButton" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html#comment">209 <i class="ka ka-comment"></i></a> <a class="icon16" href="https://kat.cr/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681.html" title="Verified Torrent"><i class="ka ka16 ka-verify ka-green"></i></a> <div data-sc-replace="" data-sc-slot="_ae58c272c09a10c792c6b17d55c20208" class="none" data-sc-params="{ 'name': 'Zootopia%202016%201080p%20HDRip%20x264%20AC3-JYK', 'extension': 'mkv', 'magnet': 'magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce' }"></div>
<a data-nop="" title="Torrent magnet link" href="magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce" class="icon16 askFeedbackjs" data-id="CE8357DED670F06329F6028D2F2CEA6F514646E0"><i class="ka ka16 ka-magnet"></i></a>
<a data-download="" title="Download torrent file" href="https://kat.cr/torrents/zootopia-2016-1080p-hdrip-x264-ac3-jyk-t12624681/" class="icon16 askFeedbackjs"><i class="ka ka16 ka-arrow-down"></i></a>
</div> '''
soup = BeautifulSoup(rawdata, "html.parser")
print(soup.find("a", title="Torrent magnet link")["href"])
Prints:
magnet:?xt=urn:btih:CE8357DED670F06329F6028D2F2CEA6F514646E0&dn=zootopia+2016+1080p+hdrip+x264+ac3+jyk&tr=udp%3A%2F%2Ftracker.publicbt.com%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
OK, so as a beginner webscraper I feel as though I've seen both used, seemingly interchangeably when converting the default unicode of text in HTML. I know contents() is a list object but other than that, what the heck is the difference?
I've noticed that .encode("utf-8") seems to work more universally.
thanks,
-confused souper.
The documentation of encode_contents:
encode_contents(self, indent_level=None, encoding='utf-8', formatter='minimal') method of bs4.BeautifulSoup instance
Renders the contents of this tag as a bytestring.
The documentation ofencode method:
encode(self, encoding='utf-8', indent_level=None, formatter='minimal', errors='xmlcharrefreplace')
encode method will work on a bs4.BeautifulSoup object instance. encode_contents will work on the contents of a bs4.BeautifulSoup instance.
>>> html = "<div>div content <p> a paragraph </p></div>"
>>> soup = BeautifulSoup(html)
>>> soup.div.encode()
>>> '<div>div content <p> a paragraph </p></div>'
>>> soup.div.contents
>>> [u'div content ', <p> a paragraph </p>]
>>> soup.div.encode_contents()
>>> 'div content <p> a paragraph </p>'
The method signature for encode_contents() shows that in addition to encoding content, it can also format the output:
encode_contents(self, indent_level=None, encoding='utf-8', formatter='minimal') method of bs4.BeautifulSoup instance
Renders the contents of this tag as a bytestring.
For example:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><p>Caf\xe9</p></body></html>')
>>> soup.encode_contents()
'<html><body><p>Caf\xc3\xa9</p></body></html>'
>>> soup.encode_contents(indent_level=1)
'<html>\n <body>\n <p>\n Caf\xc3\xa9\n </p>\n </body>\n</html>'
>>> soup.encode_contents(indent_level=1, encoding='iso-8859-1')
'<html>\n <body>\n <p>\n Caf\xe9\n </p>\n </body>\n</html>'
str.encode('utf-8') can only perform the encoding part, no formatting.
I have a little bit of screen scraping code in Python, using BeautifulSoup, that is giving me headache. A small change to the html made my code break, but I can't see why it fails to work. This is basically a demo of how the html looked when parsed:
soup=BeautifulSoup("""
<td>
<a href="https://alink.com">
Foo Some text Bar
</a>
</td>
""")
links = soup.find_all('a',text=re.compile('Some text'))
links[0]['href'] # => "https://alink.com"
After an upgrade, the a tag body now includes an img tag, which makes the code break.
<td>
<a href="https://alink.com">
<img src="dummy.gif" >
Foo Some text Bar
</a>
</td>
'links' is now an empty list, so the regex is not finding anything.
I hacked around it by matching on the text alone, then finding
its parent, but that seems even more fragile:
links = soup.find_all(text=re.compile('Some text'))
links[0].parent['href'] # => "https://alink.com"
What is the addition of an img tag as a sibling to the text
content breaking the search done by BeautifulSoup, and is there
a way of modifying the first code to work?
The difference is that the 2nd example has an incomplete img tag:
it should be either
<img src="dummy.gif" />
Foo Some text Bar
or
<img src="dummy.gif" > </img>
Foo Some text Bar
Instead, it is parsed as
<img src="dummy.gif" >
Foo Some text Bar
</img>
So the element found isn't a any longer, but img, whose parent is a.
The first example works only if a.string is not None i.e., iff the text is the only child.
As a workaround, you could use a function predicate:
a = soup.find(lambda tag: tag.name == 'a' and tag.has_attr('href') and 'Some text' in tag.text)
print(a['href'])
# -> 'https://alink.com'
I am trying to remove all the html surrounding the data that I seek from a webpage so that all that is left is the raw data that I will then be able to input into a database. so if I have something like:
<p class="location"> Atlanta, GA </p>
The following code would return
Atlanta, GA </p>
But what I expect is not what is returned. This is a more specific solution to the basic problem I found here. Any help would be appreciated, thanks! Code is found below.
def delHTML(self, html):
"""
html is a list made up of items with data surrounded by html
this function should get rid of the html and return the data as a list
"""
for n,i in enumerate(html):
if i==re.match('<p class="location">',str(html[n])):
html[n]=re.sub('<p class="location">', '', str(html[n]))
return html
As rightfully pointed out in the comments, you should be using a specific library to parse HTML and extract text, here are some examples:
html2text: Limited functionnality, but exactly what you need.
BeautifulSoup: More complex, more powerful.
Assuming all you want is to extract the data contained in <p class="location"> tags, you could use a quick & dirty (but correct) approach with the Python HTMLParser module (a simple HTML SAX parser), like this:
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
PLocationID=0
PCount=0
buf=""
out=[]
def handle_starttag(self, tag, attrs):
if tag=="p":
self.PCount+=1
if ("class", "location") in attrs and self.PLocationID==0:
self.PLocationID=self.PCount
def handle_endtag(self, tag):
if tag=="p":
if self.PLocationID==self.PCount:
self.out.append(self.buf)
self.buf=""
self.PLocationID=0
self.PCount-=1
def handle_data(self, data):
if self.PLocationID:
self.buf+=data
# instantiate the parser and fed it some HTML
parser = MyHTMLParser()
parser.feed("""
<html>
<body>
<p>This won't appear!</p>
<p class="location">This <b>will</b></p>
<div>
<p class="location">This <span class="someclass">too</span></p>
<p>Even if <p class="location">nested Ps <p class="location"><b>shouldn't</b> <p>be allowed</p></p> <p>this will work</p></p> (this last text is out!)</p>
</div>
</body>
</html>
""")
print parser.out
Output:
['This will', 'This too', "nested Ps shouldn't be allowed this will work"]
This will extract all the text contained inside any <p class="location"> tag, stripping all the tags inside it. Separate tags (if not nested - which shouldn't be allowed anyhow for paragraphs) will have a separate entry in the out list.
Notice that for more complex requirements this can easily get out of hand; in those cases a DOM parser is way more appropriate.
I have an html file and I want to replace the empty paragraphs with a space.
mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , " ")
This is not working.
Please, don't try to parse HTML with regular expressions. Use a proper parsing module, like htmlparser or BeautifulSoup to achieve this. "Suffer" a short learning curve now and benefit:
Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.
You won't be sorry! Profit guaranteed!
I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.
Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.
from lxml import etree
from StringIO import StringIO
input = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''
parser = etree.HTMLParser()
tree = etree.parse(StringIO(input), parser)
for p in tree.xpath("//p"):
if len(p):
continue
t = p.text
if not (t and t.strip()):
p.getparent().remove(p)
print etree.tostring(tree.getroot(), pretty_print=True)
... which produces the output:
<html>
<body>
<p>This </p>
<p>is a test</p>
<p>
<b>Bye.</b>
</p>
</body>
</html>
Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with  . With lxml, I'm not sure of a simple way to do this, so I've created another question to ask:
How can one replace an element with text in lxml?
I think for this particular problem a parsing module would be overkill
simply that function:
>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"
>>> mystring.replace("<p></p>"," ")
'This <p>is a test</p> '
What if <p> is entered as <P>, or < p >, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:
from pyparsing import makeHTMLTags, replaceWith, withAttribute
mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'
p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))
null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith(" "))
print null_paragraph.transformString(mystring)
Prints:
This <p>is a test</p>
using regexp ?
import re
result = re.sub("<p>\s*</p>"," ", mystring, flags=re.MULTILINE)
compile the regexp if you use it often.
I wrote that code:
from lxml import etree
from StringIO import StringIO
html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li> </li> <p> </p></ul> <div> </div></div>"""
document = etree.iterparse(StringIO(html_tags), html=True)
for a, e in document:
if not (e.text and e.text.strip()) and len(e) == 0:
e.getparent().remove(e)
print etree.tostring(document.root)