BeautifulSoup changes > to &gt; - python

I need to edit some existing HTML files using BeautifulSoup. A problem appears when the DOCTYPE contains an internal subset with an ATTLIST declaration.
Here's a minimal example:
from bs4 import BeautifulSoup
doc = """
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
"""
soup = BeautifulSoup(doc, features='html.parser')
print(soup.prettify())
The output is
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]&gt;
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
As seen, the last '>' of the DOCTYPE is turned into the entity '&gt;'.
With
print(soup.prettify(formatter=None))
I get
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type">
<meta content="CA43667" name="dc:identifier">
</head>
</html>
Now the DOCTYPE is fine, but the ending slashes in the "meta" elements disappear, and the document won't validate on our system. Other formatter options don't seem to work either.
Any solution for this?

Are you running the latest version of BeautifulSoup? You may simply need to update it, or your installation may be broken. Try reinstalling from the command line:
pip uninstall beautifulsoup4
pip install beautifulsoup4
I ask because when I run this:
from bs4 import BeautifulSoup
doc = """
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
"""
soup = BeautifulSoup(doc, features='html.parser')
print(soup.prettify(formatter=None))
the output is:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>
]>
<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="application/xhtml+xml; charset=utf-8" http-equiv="Content-type"/>
<meta content="CA43667" name="dc:identifier"/>
</head>
</html>
This is what I believe you're looking for. I've also tested it in an online IDE, and the result matches what I get on my computer. Here is a link: https://onlinegdb.com/HyzXahzAE
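A note on the likely cause, as a minimal sketch rather than a definitive diagnosis: 'html.parser' is built on the standard library's HTMLParser, which follows HTML rules rather than XML rules and ends a DOCTYPE declaration at the first '>'. The closing ']>' of the internal subset is then reported as ordinary text, and ordinary text is what the default formatter escapes. The DeclProbe class below is a hypothetical helper written for illustration, not part of BeautifulSoup; it only records what the standard-library parser reports.

```python
from html.parser import HTMLParser

DOC = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
[<!ATTLIST span bodyref CDATA #IMPLIED>]>
<html xmlns="http://www.w3.org/1999/xhtml"><head></head></html>
"""


class DeclProbe(HTMLParser):
    """Hypothetical helper: records declarations and non-blank text nodes."""

    def __init__(self):
        super().__init__()
        self.decls = []
        self.texts = []

    def handle_decl(self, decl):
        # Called with everything between '<!' and the first '>'.
        self.decls.append(decl)

    def handle_data(self, data):
        # Called for character data between tags; skip pure whitespace.
        if data.strip():
            self.texts.append(data.strip())


probe = DeclProbe()
probe.feed(DOC)
probe.close()

print(probe.decls)  # the declaration stops before the closing ']'
print(probe.texts)  # the leftover ']>' arrives as a text node
```

Because the leftover ']>' is a text node, any formatter that escapes text will rewrite its '>', while formatter=None writes it back unchanged.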

Related

This XML file does not appear to have any style information associated with it. The document tree is shown below.

When using the following code in a Django template:
<!DOCTYPE html>
<html lang="en">
<head>
<link href="http://52.11.183.14/static/wiki/bootstrap/css/wiki-bootstrap.css" type="text/css" rel="stylesheet"/>
<link href="http://52.11.183.14/static/wiki/bootstrap/css/simple-sidebar.css" type="text/css" rel="stylesheet"/>
<title> Profile - Technology βιβλιοθήκη </title>
</head>
<body>
<div class="container">
{% for p in profiles %}
{{p}}
{% endfor %}
</div>
</body>
</html>
I receive the following error:
This XML file does not appear to have any style information associated with it. The document tree is shown below.
Why? And what can I do to fix it?
Solved by replacing HttpResponse with render_to_response, so the page is no longer served as "application/xhtml+xml":
my_context = {'profiles': profiles}
return render_to_response('wiki/profile.html',
                          my_context,
                          context_instance=RequestContext(request))
# Previous version, which served the page as XML:
# c = RequestContext(request, {'profiles': profiles})
# return HttpResponse(t.render(c), content_type="application/xhtml+xml")
Alternatively, replace the opening html tag of your document with the following, which declares the XHTML namespace:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="rtl" xmlns="http://www.w3.org/1999/xhtml">

BeautifulSoup cannot parse HTML tags that don't have a closing element

Here is the HTML code I am working on:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>sdasdsadsad</title>
<link rel="alternate" media="only screen and (max-width: 640px)" href="local:80" />
<meta name="description" content="sdddsdsdsdsdsd">
<meta name="keywords" content="3333333333333333">
<meta property="og:title" content="444444444444444444444444">
<meta property="og:type" content="article">
<meta property="og:description" content="dsdsdsdsddsds">
</head>
<body></body>
</html>
I want to get the line containing the tag <meta name="description" ...>, which has no closing </meta> element. Here is my code:
import glob, os, re, urllib2, codecs
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
html_doc = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>sdasdsadsad</title>
<link rel="alternate" media="only screen and (max-width: 640px)" href="local:80" />
<meta name="description" content="sdddsdsdsdsdsd">
<meta name="keywords" content="3333333333333333">
<meta property="og:title" content="444444444444444444444444">
<meta property="og:type" content="article">
<meta property="og:description" content="dsdsdsdsddsds">
</head>
<body></body>
</html>
"""
soup = BeautifulSoup(html_doc)
aa = soup.find("meta", {"name":"description"})
print aa.encode("utf-8")
When I run the Python code, the console shows:
<meta content="sdddsdsdsdsdsd" name="description">
<meta content="3333333333333333" name="keywords">
<meta content="444444444444444444444444" property="og:title">
<meta content="article" property="og:type">
<meta content="dsdsdsdsddsds" property="og:description">
</meta></meta></meta></meta></meta>
But if <meta content="sdddsdsdsdsdsd" name="description"> has a closing </meta> element, I get exactly that line:
<meta content="sdddsdsdsdsdsd" name="description"> </meta>
Could you tell me why BeautifulSoup returns all the HTML tags after <meta name="description" ...>, and how to get only the line containing it?
Thanks.
Use lxml as the parser and it will work; I've tested it.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
aa = soup.find("meta", {"name":"description"})
print aa.encode('utf-8')
# console output
<meta content="sdddsdsdsdsdsd" name="description"/>
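The likely reason is that meta is a void element: it never takes an end tag, and a parser that does not know this keeps it open and nests the following tags inside it. As a cross-check that does not depend on which BeautifulSoup parser is installed, here is a minimal sketch using the standard library's HTMLParser instead of BeautifulSoup. MetaFinder is a hypothetical helper name; the handler is called once per start tag whether or not the tag is ever closed.

```python
from html.parser import HTMLParser

HTML_DOC = """
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>sdasdsadsad</title>
<meta name="description" content="sdddsdsdsdsdsd">
<meta name="keywords" content="3333333333333333">
</head>
<body></body>
</html>
"""


class MetaFinder(HTMLParser):
    """Hypothetical helper: remembers the content of one named meta tag."""

    def __init__(self, wanted_name):
        super().__init__()
        self.wanted_name = wanted_name
        self.content = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict is easier to query.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == self.wanted_name:
            self.content = attrs.get("content")


finder = MetaFinder("description")
finder.feed(HTML_DOC)
finder.close()
print(finder.content)
```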

Unicode characters from SQLite to HTML

I'm building a web page using CGI and Python (yes, I know, horrible combination!). I have some Unicode data stored in my SQLite 3 database, which I load in my Python script. When I combine those Unicode characters with HTML, I see things like this in my web browser:
\xc3\xb3
Instead of:
ó
I think the problem isn't in the HTML code, because I have defined the encoding like this:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
This is how I'm rendering the HTML code (using web.py's templating system):
render = web.template.render("templates")
return render.index(name)
Where name is:
name = cursor.execute("SELECT name FROM names").fetchone()
And the template (index.html):
$def with (name)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="es" lang="es">
<head>
<title>untitled</title>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
</head>
<body>
Hello, $name!
</body>
</html>
How can I achieve this?
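One likely cause, offered as an assumption since the rest of the script is not shown: fetchone() returns the whole row as a tuple, not the column value, so the template renders the tuple's repr (with escaped bytes under Python 2) instead of the text itself. A minimal sketch with an in-memory database; the table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT)")
conn.execute("INSERT INTO names (name) VALUES (?)", ("Ram\u00f3n",))

# fetchone() yields a one-element tuple (or None when there are no rows).
row = conn.execute("SELECT name FROM names").fetchone()

# Take the first column to get the actual text value.
name = row[0] if row is not None else ""

print(type(row).__name__)
print(name.encode("utf-8"))  # encode only at the output boundary
conn.close()
```

Passing name (the first column) to the template, rather than the row, should make the accented character render as expected.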

Error with Beautiful Soup

I have to remove the text in the title tag from this source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia’s devices & services business, license Nokia’s patents and mapping services</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
</title>
I am using this to remove the text:
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
ourUrl = opener.open("http://www.thehindubusinessline.com/industry-and-economy/info-tech/nokia-cannot-license-brand-nokia-post-microsoft-deal/article5156470.ece").read()
soup = BeautifulSoup(ourUrl)
print soup
dem = soup.findAll('p')
hea = soup.findAll('title')
This code correctly extracts the p tags but fails when trying to extract the title. I have only included part of the code; the rest of it works fine. Thanks.
There is an error in your HTML: it has two </title> end tags:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia’s devices & services business, license Nokia’s patents and mapping services</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />
</title> # second end tag; <title> was already closed above
So the fixed code should look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html dir="ltr" lang="en">
<head>
<title>Microsoft to acquire Nokia’s devices & services business, license Nokia’s patents and mapping services</title>
<meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9; IE=10" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta id="ctl00_WtCampaignId" name="DCSext.wt_linkid" />

Alternative to u'' for unicode strings

I have the following ASP script, which uses Python 2.5:
<%# Language = Python CODEPAGE="65001"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="sv" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
Jellö Wörld<br>
<%
Response.Write(u'Hellö Wörld<br>')
%>
</body>
</html>
It works correctly, hurrah! However, it will become annoying if I have to use u'' all over the place. What alternatives are there? Is there a __future__ feature I can import to get Python 3 style strings?
Thanks for your help,
Barry.
from __future__ import unicode_literals
However, you will need Python 2.6 or later for this to work.
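A minimal sketch of what the import changes, assuming it is run as an ordinary module; whether it behaves the same inside an ASP script block is not verified here. The import must be the first statement of the file, it affects only the literals in that file, and on Python 3 it is accepted but has no effect.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

# With the import, a bare literal is a text (unicode) string.
greeting = 'Hellö Wörld<br>'

# Byte strings now need an explicit b prefix.
raw = b'abc'

print(isinstance(greeting, type(u'')))
print(isinstance(raw, bytes))
print(len(greeting))  # counts characters, not UTF-8 bytes
```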
