I need to get a page source (HTML) and convert it to UTF-8, because I want to find some text in this page (like: if 'my_same_text' in page_source: then...). This page contains Russian text (Cyrillic symbols), and this tag:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
I use Flask and the requests Python library.
I send the request:
source = requests.get('url/')
if 'сyrillic symbols' in source.text: ...
and I can't find my text; this is due to the encoding.
How can I convert the text to UTF-8? I tried .encode() and .decode(), but it didn't help.
Let's create a page with a windows-1251 charset declared in the meta tag and some Russian nonsense text. I saved it in Sublime Text as a windows-1251 file, to be sure.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1251">
</head>
<body>
<p>Привет, мир!</p>
</body>
</html>
You can use a little trick in the requests library:
If you change the encoding, Requests will use the new value of r.encoding whenever you call r.text.
So it goes like that:
In [1]: import requests
In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')
In [3]: result.encoding = 'windows-1251'
In [4]: u'Привет' in result.text
Out[4]: True
Voila!
If it doesn't work for you, there's a slightly uglier approach.
You should take a look at what encoding the web server is actually sending you.
It may be that the reported encoding of the response is ISO-8859-1 (which requests assumes by default, and which is often conflated with cp1252), or something else entirely, but neither utf8 nor cp1251. It may differ and depends on the web server!
In [1]: import requests
In [2]: result = requests.get('http://127.0.0.1:1234/1251.html')
In [3]: result.encoding
Out[3]: 'ISO-8859-1'
So we should recode it accordingly.
In [4]: u'Привет'.encode('cp1251').decode('cp1252') in result.text
Out[4]: True
But that just looks ugly to me (also, I'm not great with encodings, and it isn't really the best solution anyway). I'd go with re-setting the encoding using requests itself.
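As an alternative, here's a minimal sketch (not from the original answer) using Response.apparent_encoding, requests' own guess at the charset based on the response body, which you can feed back into the encoding before reading result.text:

import requests

result = requests.get('http://127.0.0.1:1234/1251.html')  # same local test page as above
# apparent_encoding is requests' body-based charset guess, as opposed to the
# header-derived default; re-assigning it changes how result.text is decoded.
result.encoding = result.apparent_encoding
print(result.encoding)            # likely 'windows-1251' for this page
print(u'Привет' in result.text)   # True, assuming the guess was right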
As documented, requests automatically decodes response.text to unicode, so you must either look for a unicode string:
if u'cyrillic symbols' in source.text:
# ...
or encode response.text in the appropriate encoding:
# -*- coding: utf-8 -*-
# (....)
if 'cyrillic symbols' in source.text.encode("utf-8"):
# ...
The first solution is much simpler and lighter.
Here is the MWE, test.py. The test webpage, written inline below as mypage, is also served from http://sdaaubckp.sourceforge.net/test/test-utf8.html, so you should be able to run this script as-is:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys
import re
import lxml.html as LH
import requests
if sys.version_info[0]<3: # python 2
from StringIO import StringIO
else: #python 3
from io import StringIO
# this page uploaded on: http://sdaaubckp.sourceforge.net/test/test-utf8.html
mypage = """
<!doctype html>
<html lang="en">
<head>
<!-- Basic Page Needs
–––––––––––––––––––––––––––––––––––––––––––––––––– -->
<meta charset="utf-8">
<title>My Page</title>
<meta name="description" content="">
<meta name="author" content="">
</head>
<body>
<div>Testing: tøst</div>
</body>
</html>
"""
url_page = "http://sdaaubckp.sourceforge.net/test/test-utf8.html"
confpage = requests.get(url_page)
print(confpage.encoding) # it detects ISO-8859-1, even if the page declares <meta charset="utf-8">?
confpage.encoding = "UTF-8"
print(confpage.encoding) # now it says UTF-8, but...
#print(confpage.content)
if sys.version_info[0]<3: # python 2
mystr = confpage.content
else: #python 3
mystr = confpage.content.decode("utf-8")
for line in iter(mystr.splitlines()):
if 'Testing' in line:
print(line)
confpagetree = LH.fromstring(confpage.content)
print(confpagetree) # <Element html at 0x7f4b7074eec0>
#print(confpagetree.text_content())
for line in iter(confpagetree.text_content().splitlines()):
if 'Testing' in line:
print(line)
I'm running this on Ubuntu 14.04.5 LTS; both Python 2 and 3 give the same results with this script:
$ python2 test.py
ISO-8859-1
UTF-8
<div>Testing: tøst</div>
<Element html at 0x7fb5b9d12ec0>
Testing: tÃ¸st
$ python3 test.py
ISO-8859-1
UTF-8
<div>Testing: tøst</div>
<Element html at 0x7f272fc53318>
Testing: tÃ¸st
Note how:
In both cases, confpage.encoding detects ISO-8859-1, even if the webpage declares <meta charset="utf-8">
In both cases, correct UTF-8 character ø is printed from confpage.content
In both cases, the corrupt representation Ã¸ (the UTF-8 bytes for ø read as Latin-1) is output from lxml.html.fromstring(confpage.content).text_content()
My suspicion is that, since the webpage uses the – UTF-8 character (Char: '–' u: 8211 [0x2013] b: 226,128,147 [0xE2,0x80,0x93] n: EN DASH [General Punctuation]) before it declares <meta charset="utf-8"> in the <head>, this somehow borks requests and/or lxml.html.fromstring().text_content(), which results in the corrupt representation.
My question is - what can I do, so I get a correct UTF-8 character at the output of lxml.html.fromstring().text_content() - hopefully for both Python 2 and 3?
The root problem is that you're using confpage.content instead of confpage.text.
requests.Response.content gives you the raw bytes (bytes in 3.x, str in 2.x), as pulled off the wire. It doesn't matter what the encoding is, because you aren't using it.
requests.Response.text gives you the decoded Unicode (str in 3.x, unicode in 2.x), based on the encoding.
So, setting the encoding but then using content doesn't do anything. If you just change the rest of your code to use text instead of content (and get rid of the now-spurious decode for Python 3), it will work:
mystr = confpage.text
for line in iter(mystr.splitlines()):
if 'Testing' in line:
print(line)
confpagetree = LH.fromstring(confpage.text)
print(confpagetree) # <Element html at 0x7f4b7074eec0>
#print(confpagetree.text_content())
for line in iter(confpagetree.text_content().splitlines()):
if 'Testing' in line:
print(line)
If you want to go through the exact problem with each of your examples:
Your first example is right in Python 3, but not the best way to do it. By calling decode("utf-8") on the content, since the bytes do happen to be UTF-8, you're decoding them properly. So they will print out properly.
Your first example is wrong in Python 2. You're just printing the content, which is a bunch of UTF-8 bytes. If your console is UTF-8 (as it is on macOS, and might be on Linux), this will happen to work. If your console is something else, like cp1252 or Latin-1 (as it is on Windows, and might be on Linux), this will give you mojibake.
Your second example is also wrong. By passing bytes to LH.fromstring, you're forcing lxml to guess what encoding to use, and it guesses Latin-1, so you get mojibake.
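To make the guess concrete, here's a minimal illustrative sketch (not part of the original answer) of the difference between handing lxml raw bytes, already-decoded text, or bytes plus an explicit parser encoding:

import lxml.html as LH

raw = u'<div>Testing: tøst</div>'.encode('utf-8')  # UTF-8 bytes, no charset declaration

# With bare bytes lxml has to guess the encoding; as noted above it can fall
# back to Latin-1, turning the ø bytes (0xC3 0xB8) into mojibake.
guessed = LH.fromstring(raw).text_content()

# Passing already-decoded text sidesteps the guessing entirely...
decoded = LH.fromstring(raw.decode('utf-8')).text_content()

# ...or keep the bytes and declare the encoding explicitly via a parser.
parser = LH.HTMLParser(encoding='utf-8')
explicit = LH.fromstring(raw, parser=parser).text_content()

print(repr(guessed), repr(decoded), repr(explicit))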
Having switched from Fedora 17 to 18, I get different parsing behaviour for the same lxml code, apparently due to different versions of the underlying libraries (libxml2 and libxslt versions changed).
Here's an example of lxml code with different results for the two versions:
from io import BytesIO
from lxml import etree
myHtmlString = \
'<!doctype html public "-//w3c//dtd html 4.0 transitional//en">\r\n'+\
'<html>\r\n'+\
'<head>\r\n'+\
' <title>Title</title>\r\n'+\
'</head>\r\n'+\
'<body/>\r\n'+\
'</html>\r\n'
myFile = BytesIO(myHtmlString)
myTree = etree.parse(myFile, etree.HTMLParser())
myTextElements = myTree.xpath("//text()")
myFullText = ''.join([myEl for myEl in myTextElements])
assert myFullText == 'Title', repr(myFullText)
The f17 version passes the assert, i.e. xpath("//text()") only returns text 'Title', whereas the f18 version fails with output
Traceback (most recent call last):
File "TestLxml.py", line 17, in <module>
assert myFullText == 'Title', repr(myFullText)
AssertionError: '\r\n\r\n Title\r\n\r\n\r\n'
Apparently, the f18 version handles newlines and whitespace differently from the f17 version.
Is there a way to have control over this behaviour? (An optional argument somewhere?)
Or even better, is there a way in which I can get the old behaviour back using the new libraries?
In XML, text() returns the text inside the tags as-is (unstripped), so any whitespace characters, tabs, and newlines will be included.
It might be that the way you construct the multiline string with + and \r\n means you are accidentally testing two different strings.
Try changing your string to a triple-quoted string like the example below and testing it.
from io import BytesIO
from lxml import etree
html = '''
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<title>Title</title>
</head>
<body/>
</html>
'''
tree = etree.parse(BytesIO(html), etree.HTMLParser())
text_elements = tree.xpath("//text()")
full_text = ''.join(text_elements)
assert full_text == 'Title', repr(full_text)
You can also see that surrounding the text with spaces or newlines makes them part of the text() return value. See the title below.
html = '''
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<title> Title </title>
</head>
<body/>
</html>
'''
tree = etree.parse(BytesIO(html), etree.HTMLParser())
text_elements = tree.xpath("//text()")
full_text = ''.join(text_elements)
assert full_text == ' Title ', repr(full_text)
If you don't need the spaces you can always call strip() on the string yourself. If you're sure you're getting spaces even though your tags do not contain them, then you should report that as a bug on the lxml mailing list.
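For example, a minimal sketch (assuming the same tree as above) that ignores whitespace-only text nodes no matter how the source markup is formatted:

text_elements = tree.xpath("//text()")
# Drop text nodes that are pure whitespace (the newlines between tags) and strip the rest.
full_text = ''.join(t.strip() for t in text_elements if t.strip())
assert full_text == 'Title', repr(full_text)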
I'm trying to fetch a segment of some website. The script works, however it's a website that has accents such as á, é, í, ó, ú.
When I fetch the site using urllib or urllib2, the site source code is not encoded in utf-8, which I would like it to be, as utf-8 supports these accents.
I believe that the target site is encoded in utf-8 as it contains the following meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
My python script:
opener = urllib2.build_opener()
opener.addheaders = [('Accept-Charset', 'utf-8')]
url_response = opener.open(url)
deal_html = url_response.read().decode('utf-8')
However, I keep getting results that look like they are not encoded in utf-8.
E.g.: "Milán" on the website = "Mil\xe1n" after urllib2 fetches it.
Any suggestions?
Your script is working correctly. The "\xe1" you see is just the escaped representation of the unicode object resulting from decoding. For example:
>>> "Mil\xc3\xa1n".decode('utf-8')
u'Mil\xe1n'
The "\xc3\xa1" sequence is the UTF-8 sequence for leter a with diacritic mark: á.
I have this:
response = urllib2.urlopen(url)
html = response.read()
begin = html.find('<title>')
end = html.find('</title>',begin)
title = html[begin+len('<title>'):end].strip()
If the url is http://www.google.com, the title is extracted with no problem as "Google", but if the url is "http://www.britishcouncil.org/learning-english-gateway" then the title becomes
"<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<base href="http://www.britishcouncil.org/" />
<META http-equiv="Content-Type" Content="text/html;charset=utf-8">
<meta name="WT.sp" content="Learning;Home Page Smart View" />
<meta name="WT.cg_n" content="Learn English Gateway" />
<META NAME="DCS.dcsuri" CONTENT="/learning-english-gateway.htm">..."
What is actually happening? Why can't I extract the title?
That URL returns a document with <TITLE>...</TITLE> and find is case-sensitive. I strongly suggest you use an HTML parser like Beautiful Soup.
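For instance, a minimal sketch with the bs4 package (assuming it's installed); Beautiful Soup matches HTML tag names case-insensitively, so <TITLE> is found just fine:

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.britishcouncil.org/learning-english-gateway').read()
soup = BeautifulSoup(html, 'html.parser')
# soup.title finds the <TITLE> element regardless of case; .string is its text
print(soup.title.string.strip() if soup.title else None)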
Let's analyze why we got that result. If you open the website and view the source, you'll note that it doesn't have <title>...</title>; instead it has <TITLE>...</TITLE>. So what happens to the two find calls? Both return -1!
begin = html.find('<title>') # Result: -1
end = html.find('</title>') # Result: -1
Then begin+len('<title>') will be -1 + 7 = 6, so your final line ends up extracting html[6:-1]. It turns out that negative indices mean something legitimate in Python (for good reason): they count from the back. Hence -1 here refers to the last character in html. So what you are getting is a substring from the 6th character (inclusive) to the last character (exclusive).
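A quick illustration of that slicing, on a made-up shortened html string:

>>> html = '<TITLE>Google</TITLE>'
>>> html.find('<title>')            # case-sensitive, so not found
-1
>>> html[-1 + len('<title>'):-1]    # i.e. html[6:-1]
'>Google</TITLE'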
What can we do then? Well, for one, you can use a regular-expression matcher that ignores case, or use a proper HTML parser. If this is a one-off thing and space/performance isn't much of a concern, the quickest approach might be to create a copy of html and lower-case the entire string:
def get_title(html):
html_lowered = html.lower();
begin = html_lowered.find('<title>')
end = html_lowered.find('</title>')
if begin == -1 or end == -1:
return None
else:
# Find in the original html
return html[begin+len('<title>'):end].strip()
Working solution with lxml and urllib using Python 3
import lxml.etree, urllib.request
def documenttitle(url):
conn = urllib.request.urlopen(url)
parser = lxml.etree.HTMLParser(encoding = "utf-8")
tree = lxml.etree.parse(conn, parser = parser)
return tree.find('.//title')
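A hypothetical usage example (the URL is the one from the question; tree.find returns the <title> element itself, so read its .text):

title_element = documenttitle('http://www.britishcouncil.org/learning-english-gateway')
if title_element is not None:
    print(title_element.text.strip())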
On a Python-driven web app using a sqlite datastore I had this error:
Could not decode to UTF-8 column
'name' with text '300µL-10-10'
Reading here, it looks like I need to switch my text_factory to str and get bytestrings, but when I do this my HTML output looks like this:
300�L-10-10
I do have my content-type set as:
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
Unfortunately, the data in your datastore is not encoded as UTF-8; instead, it's probably either latin-1 or cp1252. To decode it automatically, try setting Connection.text_factory to your own function:
def convert_string(s):
try:
u = s.decode("utf-8")
except UnicodeDecodeError:
u = s.decode("cp1252")
return u
conn.text_factory = convert_string
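A hypothetical usage sketch (the database file, table, and column are made up, just to show where the factory plugs in):

import sqlite3

conn = sqlite3.connect('app.db')      # hypothetical database file
conn.text_factory = convert_string    # applied to every TEXT value read on this connection
row = conn.execute("SELECT name FROM samples LIMIT 1").fetchone()  # hypothetical table
print(row[0])                         # e.g. u'300µL-10-10', decoded via the cp1252 fallback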