I have this:
import urllib2

# url is defined elsewhere, e.g. 'http://www.google.com'
response = urllib2.urlopen(url)
html = response.read()
begin = html.find('<title>')
end = html.find('</title>', begin)
title = html[begin+len('<title>'):end].strip()
If the url is http://www.google.com then the title comes out fine as "Google", but if the url is "http://www.britishcouncil.org/learning-english-gateway" then the title becomes
"<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<base href="http://www.britishcouncil.org/" />
<META http-equiv="Content-Type" Content="text/html;charset=utf-8">
<meta name="WT.sp" content="Learning;Home Page Smart View" />
<meta name="WT.cg_n" content="Learn English Gateway" />
<META NAME="DCS.dcsuri" CONTENT="/learning-english-gateway.htm">..."
What is actually happening? Why can't I get the title?
That URL returns a document with <TITLE>...</TITLE> and find is case-sensitive. I strongly suggest you use an HTML parser like Beautiful Soup.
Let's analyze why we got that output. If you open the website and view the source, you'll note that it doesn't contain <title>...</title>; instead it has <TITLE>...</TITLE>. So what happens to the two find calls? Both return -1!
begin = html.find('<title>') # Result: -1
end = html.find('</title>') # Result: -1
Then begin + len('<title>') will be -1 + 7 = 6, so your final line extracts html[6:-1]. Negative indices mean something legitimate in Python (for good reasons): they count from the back, so -1 here refers to the last character in html. What you are getting is therefore the substring from index 6 (inclusive) to the last character (exclusive).
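A quick illustration of that slicing behaviour (an illustrative snippet, not from the original post):

s = 'abcdef'
print(s[2:-1])  # prints 'cde': starts at index 2, stops before the last character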
What can we do then? For one, you could use a regular expression match that ignores case, or use a proper HTML parser. If this is a one-off and space/performance isn't much of a concern, the quickest fix might be to lower-case a copy of the entire string:
def get_title(html):
    html_lowered = html.lower()
    begin = html_lowered.find('<title>')
    end = html_lowered.find('</title>', begin)
    if begin == -1 or end == -1:
        return None
    else:
        # Slice the original html so the title keeps its original casing
        return html[begin+len('<title>'):end].strip()
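For completeness, here is a minimal sketch of the case-insensitive regex approach mentioned above (get_title_re is an illustrative name, not from the original post):

import re

def get_title_re(html):
    # IGNORECASE matches <TITLE> as well; DOTALL lets the title span multiple lines
    match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None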
Working solution with lxml and urllib.request, using Python 3:
import lxml.etree, urllib.request

def documenttitle(url):
    conn = urllib.request.urlopen(url)
    parser = lxml.etree.HTMLParser(encoding="utf-8")
    tree = lxml.etree.parse(conn, parser=parser)
    return tree.find('.//title')
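Note that tree.find('.//title') returns an Element rather than a string, so a caller would read its .text attribute:

title_element = documenttitle('http://www.google.com')
print(title_element.text)  # e.g. 'Google'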
Source code: I have the following program.
import genshi
from genshi.template import MarkupTemplate
html = '''
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:py="http://genshi.edgewall.org/">
<head>
</head>
<body>
<py:for each="i in range(3)">
<py:choose>
<del py:when="i == 1">
${i}
</del>
<py:otherwise>
${i}
</py:otherwise>
</py:choose>
</py:for>
</body>
</html>
'''
template = MarkupTemplate(html)
stream = template.generate()
html = stream.render('html')
print(html)
Expected output: the numbers are printed consecutively with no whitespace (and most critically no line-break) between them.
<html>
<head>
</head>
<body>
0<del>1</del>2
</body>
</html>
Actual output:
<html>
<head>
</head>
<body>
0
<del>1</del>
2
</body>
</html>
Question: How do I eliminate the line-breaks? I can deal with the leading whitespace by stripping it from the final HTML, but I don't know how to get rid of the line-breaks. I need the contents of the for loop to be displayed as a single continuous "word" (e.g. 012 instead of 0 \n 1 \n 2).
What I've tried:
Reading the Genshi documentation.
Searching StackOverflow
Searching Google
Using a <?python ...code... ?> code block. This doesn't work, since the angle brackets in the <del> tags are escaped and displayed.
<?python
def numbers():
    n = ''
    for i in range(3):
        if i == 1:
            n += '<del>{i}</del>'.format(i=i)
        else:
            n += str(i)
    return n
?>
${numbers()}
Produces the literal text 0<del>1</del>2 (the brackets are escaped in the output rather than rendered as a tag).
I also tried this, but using genshi.builder.Element('del') instead. The results are the same, and I was able to conclusively determine that the string returned by numbers() is being escaped after the return occurs.
A bunch of other things that I can't recall at the moment.
Not ideal, but I did finally find an acceptable solution. The trick is to put a tag's closing angle bracket on the next line, immediately before the next tag's opening bracket.
<body>
  <py:for each="i in range(3)"
    ><py:choose
      ><del py:when="i == 1">${i}</del
      ><py:otherwise>${i}</py:otherwise
    ></py:choose
  ></py:for>
</body>
Source: https://css-tricks.com/fighting-the-space-between-inline-block-elements/
If anyone has a better approach I'd love to hear it.
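One post-processing idea, sketched below: collapse_ws is a hypothetical helper (not a Genshi feature) that strips line-breaks and indentation inside <body> after rendering:

import re

def collapse_ws(rendered, start='<body>', end='</body>'):
    # remove each newline together with its surrounding indentation,
    # but only inside the body so the rest of the document keeps its layout
    head, _, rest = rendered.partition(start)
    body, _, tail = rest.partition(end)
    body = re.sub(r'\s*\n\s*', '', body)
    return head + start + body + end + tail

print(collapse_ws(html))  # body renders as 0<del>1</del>2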
I have a large number of HTML documents that must be converted to XML, and not all of them look exactly the same. For example, the sample below ends with an HTML comment, not with the closing </html> tag.
Note this question is related to this one.
Here is a sample document:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
<comment>here is a comment inside the head tag</comment>
</head>
<body>
...
<comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
I wish to wrap the entire document with a custom tag called <root>. So far, the best I can do is wrap <root> around <html>.
root_tag = bs4.Tag(name="root")
soup.html.wrap(root_tag)
How can I position the <root> element such that it wraps the entire document?
A little crude, as this just wraps any given file in <root>...</root>, but see if it works for your use case:
def root_wrap(file):
    # read everything first, then rewrite the file wrapped in <root>...</root>
    with open(file, 'r') as fin:
        content = fin.read()
    with open(file, 'w') as fout:
        fout.write('<root>')
        fout.write(content)
        fout.write('</root>')
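For a parser-based alternative, here is a sketch with BeautifulSoup that moves every top-level node, comments and doctype included, into a new <root> element (html_text is assumed to hold the document shown above):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, 'html.parser')
root = soup.new_tag('root')
# detach each top-level node (doctype, comments, <html>) and re-attach it under <root>
for node in list(soup.contents):
    root.append(node.extract())
soup.append(root)
print(soup)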
<meta name="keywords" content="Ruby On Rails (Software), Authentication (Software Genre), Tutorial (Industry), howto, tips, tricks">
How can I get the content value from this meta tag using PyQuery?
from pyquery import PyQuery

def get_data(myurl):
    query = PyQuery(url=myurl)
    title = query("title").text()
    keyword = query("meta[name=keywords]").text()
    ...
You've almost made it; the attribute value in the selector is missing its apostrophes. Also, since a <meta> tag has no text content, read its content attribute instead of calling .text():
keyword = query("meta[name='keywords']").attr('content')
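For example, with an inline fragment (a quick illustrative check):

>>> doc = PyQuery('<head><meta name="keywords" content="howto, tips, tricks"></head>')
>>> doc("meta[name='keywords']").attr('content')
'howto, tips, tricks'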
So I need to process some HTML in Python, and my requirement is to find a certain tag and replace it with different text based on that tag's contents...
<html>
  <Head>
  </HEAD>
  <body>
    <blah>
      <_translate attr="french"> I am no one,
      and no where </_translate>
    <Blah/>
  </body>
</html>
Should become
<html>
  <Head>
  </HEAD>
  <body>
    <blah>
      Je suis personne et je suis nulle part
    <Blah/>
  </body>
</html>
I would like to leave the original HTML untouched and only replace the tags labeled _translate. The attributes and contents of that tag will be important for generating the tag's output.
I had thought about extending the HTMLParser class, but I am having trouble getting the original HTML back out when I want it. I think what I most want is to parse the HTML into tokens, with the original text preserved in each token, so I can produce my desired output, i.e. get something like
(tag, "<html>")
(data, "\n ")
(tag, "<head>")
(data, "\n ")
(end-tag,"</HEAD>")
etc...
Does anyone know of a good pythonic way to accomplish this? Python 2.7 standard libs are preferred; third party would also be worth considering...
Thanks!
You can use lxml to perform such a task (http://lxml.de/tutorial.html) and use XPath to navigate easily through your HTML:
from lxml.html import fromstring
my_html = "HTML CONTENT"
root = fromstring(my_html)
nodes_to_process = root.xpath("//_translate")
for node in nodes_to_process:
    lang = node.attrib["attr"]
    translated = AWESOME_TRANSLATE(node.text, lang)
    # splice the translated text in where the element sat, keeping its tail text
    parent = node.getparent()
    prev = node.getprevious()
    text = translated + (node.tail or '')
    if prev is not None:
        prev.tail = (prev.tail or '') + text
    else:
        parent.text = (parent.text or '') + text
    parent.remove(node)
I'll leave the implementation of the AWESOME_TRANSLATE function up to you ;)
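For instance, a hypothetical stub (hard-coded to the example above) that makes the snippet runnable end to end:

def AWESOME_TRANSLATE(text, lang):
    # stand-in: a real implementation would call an actual translation service
    translations = {'french': 'Je suis personne et je suis nulle part'}
    return translations.get(lang, text)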
Having switched from Fedora 17 to 18, I get different parsing behaviour for the same lxml code, apparently due to different versions of the underlying libraries (libxml2 and libxslt versions changed).
Here's an example of lxml code with different results for the two versions:
from io import BytesIO
from lxml import etree
myHtmlString = \
'<!doctype html public "-//w3c//dtd html 4.0 transitional//en">\r\n'+\
'<html>\r\n'+\
'<head>\r\n'+\
' <title>Title</title>\r\n'+\
'</head>\r\n'+\
'<body/>\r\n'+\
'</html>\r\n'
myFile = BytesIO(myHtmlString)
myTree = etree.parse(myFile, etree.HTMLParser())
myTextElements = myTree.xpath("//text()")
myFullText = ''.join([myEl for myEl in myTextElements])
assert myFullText == 'Title', repr(myFullText)
The f17 version passes the assert, i.e. xpath("//text()") returns only the text 'Title', whereas the f18 version fails with this output:
Traceback (most recent call last):
File "TestLxml.py", line 17, in <module>
assert myFullText == 'Title', repr(myFullText)
AssertionError: '\r\n\r\n Title\r\n\r\n\r\n'
Apparently, the f18 version handles newlines and whitespace differently from the f17 version.
Is there a way to have control over this behaviour? (An optional argument somewhere?)
Or even better, is there a way in which I can get the old behaviour back using the new libraries?
In XML, text() returns the text inside the tags as-is (unstripped), so any whitespace characters, tabs, or new lines will be included.
It might be that the way you construct the multiline string with + and \r\n means you are accidentally testing two different strings. Try changing your string to a triple-quoted string like the example below and testing it:
from io import BytesIO
from lxml import etree
html = '''
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<title>Title</title>
</head>
<body/>
</html>
'''
tree = etree.parse(BytesIO(html), etree.HTMLParser())
text_elements = tree.xpath("//text()")
full_text = ''.join(text_elements)
assert full_text == 'Title', repr(full_text)
You can also see that surrounding the text with spaces or new lines makes them part of what text() returns. See the title below.
html = '''
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<title> Title </title>
</head>
<body/>
</html>
'''
tree = etree.parse(BytesIO(html), etree.HTMLParser())
text_elements = tree.xpath("//text()")
full_text = ''.join(text_elements)
assert full_text == ' Title ', repr(full_text)
If you don't need the spaces you can always call strip() on the string yourself. If you're sure you're getting spaces even though your tags do not contain them, then you should report that as a bug on the lxml mailing list.
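For example, reusing the variables from the snippets above, a whitespace-tolerant version of the check would be:

full_text = ''.join(tree.xpath('//text()')).strip()
assert full_text == 'Title', repr(full_text)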