lxml unicode output issue

I'm new to Python and lxml, so please bear with me. I'm stuck on what appears to be a Unicode issue. I tried .encode and Beautiful Soup's UnicodeDammit with no luck. I searched the forum and the web, but with my limited Python skills I couldn't apply the suggested solutions to my particular code. I'd appreciate any help, thanks.
Code:
import requests
import lxml.html
sourceUrl = "http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty.htm"
sourceHtml = requests.get(sourceUrl)
htmlTree = lxml.html.fromstring(sourceHtml.text)
for stockCodes in htmlTree.xpath('''/html/body/printfriendly/table/tr/td/table/tr/td/table/tr/table/tr/td'''):
    string = stockCodes.text
    print string
Error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)

When I run your code as python lx.py, I don't get the error. But when I redirect stdout to a file with python lx.py > output.txt, it occurs. So try this:
# -*- coding: utf-8 -*-
import requests
import lxml.html
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
This switches the default encoding from ASCII to UTF-8, which the Python 2 runtime uses whenever it has to convert implicitly between byte strings and unicode. Note that reload(sys) and sys.setdefaultencoding are a Python 2-only workaround.

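In requests, the text attribute returns the decoded unicode string, while the content attribute returns the raw bytes. The error comes from printing unicode to an ASCII stream, so you could try encoding explicitly before printing, e.g. string.encode('utf-8') on the element text. string.encode('ascii') would also be an explicit encode, but I'm fairly certain it will raise that same exception.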
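As a minimal sketch of the explicit-encoding approach (not the asker's exact code), the helper below encodes each text value as UTF-8 before writing it to a byte stream. The function name write_lines_utf8 and the sample values are hypothetical. The None check assumes that some matched elements have no text, in which case lxml returns None for .text.

```python
import io


def write_lines_utf8(lines, stream):
    """Encode each text line as UTF-8 and write it to a binary stream."""
    for line in lines:
        if line is None:
            # Elements without text yield None instead of a string; skip them.
            continue
        stream.write(line.encode("utf-8") + b"\n")


# Hypothetical sample values; u"\xa0" is the non-breaking space from the error.
buffer = io.BytesIO()
write_lines_utf8([u"\xa00001", None, u"ABC"], buffer)
encoded = buffer.getvalue()
```

Applied to the Python 2 loop in the question, the same idea would amount to encoding the element text with .encode('utf-8') before printing it.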

Related

Trouble reading MARC data using MARCReader() and pymarc

So I am trying to teach myself python and pymarc for a school project I am working on. I have a sample marc file and I am trying to read it using this simple code:
from pymarc import *
reader = MARCReader(open('dump.mrc', 'rb'), to_unicode=True)
for record in reader:
    print(record)
The for loop is to just print out each record to make sure I am getting the correct data. The only thing is I am getting this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
I've looked online but could not find an answer to my problem. What does this error mean and how can I go about fixing it? Thanks in advance.

You can set the python environment to support UTF-8 and get record as a dictionary.
Try:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
from pymarc import *
reader = MARCReader(open('dump.mrc', 'rb'), to_unicode=True, force_utf8=True)
for record in reader:
    print record.as_dict()
Note:
If you still get the unicode exception, you can set to_unicode=False and skip force_utf8=True.
Also, please check whether your dump.mrc file is actually encoded in UTF-8. Try:
$ chardet dump.mrc
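If chardet is not available, a rough check (a sketch, not a replacement for chardet) is to attempt a strict UTF-8 decode of the raw bytes. The helper name is_valid_utf8 is hypothetical and the sample bytes are only illustrative. Note that MARC records may also use the MARC-8 encoding, which this check cannot identify.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_ok = is_valid_utf8(b"\xc2\xa9 2015")  # valid two-byte sequence
latin1_only = is_valid_utf8(b"caf\xe9")    # a lone 0xe9 is not valid UTF-8
```

For the file in the question, the bytes would come from open('dump.mrc', 'rb').read().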

How to decode ascii in python

I send Cyrillic letters from Postman to Django as a URL parameter and get something like %D0%B7%D0%B2 in the variable search_text.
If I print search_text, I get something like текст printed.
In the console I tried the following and didn't get an error:
>>> a = "текст"
>>> a
'\xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82'
>>> print a
текст
>>> b = a.decode("utf-8")
>>> b
u'\u0442\u0435\u043a\u0441\u0442'
>>> print b
текст
>>>
But outside the console I do get an error:
"""WHERE title LIKE '%%{}%%' limit '{}';""".format(search_text, limit))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
How to prevent it?

To decode a URL-encoded string (with '%' signs), use urllib:
import urllib
byte_string=urllib.unquote('%D0%B7%D0%B2')
and then you'll need to decode byte_string from its original encoding, e.g.:
import urllib
import codecs
byte_string=urllib.unquote('%D0%B7%D0%B2')
unicode_string=codecs.decode(byte_string, 'utf-8')
and print(unicode_string) will print зв.
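On Python 3 the same steps look slightly different, since unquote moved to urllib.parse and decodes the result as UTF-8 by default. A minimal sketch, assuming the parameter really is UTF-8:

```python
from urllib.parse import unquote, unquote_to_bytes

# unquote() percent-decodes and then decodes the bytes as UTF-8 by default.
text = unquote('%D0%B7%D0%B2')

# unquote_to_bytes() stops after the first step, so a different
# encoding can be applied by hand if the data is not UTF-8.
raw = unquote_to_bytes('%D0%B7%D0%B2')
same_text = raw.decode('utf-8')
```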
The problem is the unknown encoding: you have to know which encoding is used for the data you receive. To declare the encoding of the .py source file itself, place the following line at the top:
# -*- coding: utf-8 -*-
For Cyrillic, 'cp866', 'cp1251', 'koi8_r' and 'utf-8' are the most common encodings, so try those when decoding.
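Trying the candidates in turn can be sketched as below. The helper name decode_first_match is hypothetical. One limitation of this sketch: the single-byte code pages accept almost any byte sequence, so only the strict UTF-8 attempt is a reliable signal, and a "successful" cp1251 or koi8_r decode still needs a visual check.

```python
def decode_first_match(raw, encodings=("utf-8", "cp1251", "koi8_r", "cp866")):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# UTF-8 bytes match on the first attempt.
utf8_result = decode_first_match(b"\xd0\xb7\xd0\xb2")

# cp1251 bytes fail the strict UTF-8 attempt and fall through to cp1251.
cp1251_bytes = u"\u0442\u0435\u043a\u0441\u0442".encode("cp1251")
cp1251_result = decode_first_match(cp1251_bytes)
```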
Python 2 doesn't treat string literals as unicode by default, so it's best to enable that or switch to Python 3. To make string literals unicode in a .py file, put the following line before all other imports:
from __future__ import unicode_literals
For example, in Python 2.7.9 the following works fine:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
a="текст"
c="""WHERE title LIKE '%%{}%%' limit '{}';""".format(a, '10')
print(c)
Also see:
https://docs.python.org/2/library/codecs.html
https://docs.python.org/2/howto/unicode.html.

It depends on what encoding the Django program expects and what encoding the strings search_text and limit are in. Usually it's sufficient to do this:
"""WHERE title LIKE '%%{}%%' limit '{}';""".decode("utf-8").format(search_text.decode("utf-8"), limit)
EDIT: after reading your edits, it seems you are having problems turning your URL-parsed text back into strings. Here's an example of how to do this:
import urlparse
print urlparse.urlunparse(urlparse.urlparse("ресторан"))
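Independent of the encoding question, building the SQL with format() leaves it open to injection. A common alternative is to pass the values as query parameters and let the database driver handle quoting and encoding. The sketch below uses the standard-library sqlite3 module with a hypothetical articles table purely for illustration. The asker's schema and database are not shown in the question, and drivers differ in placeholder style (sqlite3 uses ?, many others use %s).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO articles (title) VALUES (?)",
    [(u"\u043c\u043e\u0439 \u0442\u0435\u043a\u0441\u0442",), (u"other",)],
)

search_text = u"\u0442\u0435\u043a\u0441\u0442"  # the word from the question, as unicode
limit = 10

# The wildcard characters belong to the parameter value, not the SQL text.
rows = conn.execute(
    "SELECT title FROM articles WHERE title LIKE ? LIMIT ?",
    (u"%" + search_text + u"%", limit),
).fetchall()
conn.close()
```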

You can use '{}'.format(search_text.encode('utf-8')) to encode the string as UTF-8 bytes, but it will probably show your Cyrillic letters as escapes such as \xd0.
And read The Absolute Minimum Every Software Developer Must Know About Unicode and Character Sets.

python: os.path.exists() unicode exception

In my python program, I use untangle for parsing XML file:
from untangle import parse
parse(xml)
The XML is encoded in utf-8 and contains non-ASCII characters. In my program, this is causing trouble. When the xml string is passed to untangle, it tries to be smart and automatically check if it's a file name first. So it calls
os.path.exists(xml)
It looks like the os module then tries to encode it as ASCII, which causes the following exception:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 169-172: ordinal not in range(128)
At the top of this file, I'm doing this as a trick that supposedly would work around this:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
Unfortunately, it didn't work.
I don't know what else can go wrong. Please help.

It’s a bit odd that untangle doesn’t offer a direct function for this.
The simplest solution would be to copy the relevant part of untangle.parse into your own function that only parses strings:
import untangle
from StringIO import StringIO  # Python 2; use io.StringIO on Python 3

def parse_text(text):
    # Same steps as untangle.parse, minus the os.path.exists() check
    parser = untangle.make_parser()
    sax_handler = untangle.Handler()
    parser.setContentHandler(sax_handler)
    parser.parse(StringIO(text))
    return sax_handler.root

Does decoding help in your case, as below? Reloading sys and setting UTF-8 as the default encoding is not a good habit.
from untangle import parse
xml = xml.decode("utf-8") if isinstance(xml, str) else xml
parse(xml)

Python webpage source read with special characters

I am reading a page source from a webpage, then parsing a value from that source.
There I am facing a problem with special characters.
In my Python controller file I am using # -*- coding: utf-8 -*-.
But the webpage source I am reading uses charset=iso-8859-1.
So when I read the page content without specifying any encoding, it throws UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 133: invalid start byte
When I use string.decode("iso-8859-1").encode("utf-8"), the data is parsed without any error, but the value is displayed as 'F\u00fcnke' instead of 'Fünke'.
Please let me know how I can solve this issue.
I would greatly appreciate any suggestions.

Encoding is a PITA in Python3 for sure (and 2 in some cases as well).
Try checking these links out, they might help you:
Python - Encoding string - Swedish Letters
Python3 - ascii/utf-8/iso-8859-1 can't decode byte 0xe5 (Swedish characters)
http://docs.python.org/2/library/codecs.html
Also, it would be nice to see the code for "So when I read the page content without specifying any encoding". My best guess is that your console doesn't use UTF-8 (for instance, on Windows). Your # -*- coding: utf-8 -*- only tells Python which characters to expect within the source code, not in the actual data the code is going to parse or analyze.
For instance, I write:
# -*- coding: iso-8859-1 -*-
import time
# Här skriver jag ut tiden (Translation: Here, I print out the time)
print(time.strftime('%H:%M:%S'))
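One possible explanation for seeing 'F\u00fcnke' (an assumption, since the question does not show how the value is displayed) is that the text itself is correct but is being shown in an escaped form, for example by a JSON serializer. A minimal sketch with hypothetical sample bytes:

```python
import json

raw = b"F\xfcnke"                # bytes as served with charset=iso-8859-1
text = raw.decode("iso-8859-1")  # now a proper unicode string

# json.dumps escapes non-ASCII characters by default...
escaped = json.dumps(text)
# ...and keeps the character itself when ensure_ascii is turned off.
readable = json.dumps(text, ensure_ascii=False)
```

If the value is displayed through repr() or a similar debugging output instead, the escape is likewise only a display form and the underlying string is already correct.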

Parsing unicode input using python json.loads

What is the best way to load JSON Strings in Python?
I want to use json.loads to process unicode like this:
import json
json.loads(unicode_string_to_load)
I also tried supplying 'encoding' parameter with value 'utf-16', but the error did not go away.
Full SSCCE with error:
# -*- coding: utf-8 -*-
import json
value = '{"foo" : "bar"}'
print(json.loads(value)['foo']) #This is correct, prints 'bar'
some_unicode = unicode("degradé")
#last character is latin e with acute "\xc3\xa9"
value = '{"foo" : "' + some_unicode + '"}'
print(json.loads(value)['foo']) #incorrect, throws error
Error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

Typecasting the string into a unicode string using 'latin-1' fixed the error:
UnicodeDecodeError: 'utf16' codec can't decode byte 0x38 in position 6: truncated data
Fixed code:
import json
ustr_to_load = unicode(str_to_load, 'latin-1')
json.loads(ustr_to_load)
And then the error is not thrown.

The OP clarifies (in a comment!)...:
Source data is huge unicode encoded string
Then you have to know which of the many unicode encodings it uses -- clearly not 'utf-16', since that failed, but there are so many others -- 'utf-8', 'iso-8859-15', and so forth. You either try them all until one works, or print repr(str_to_load[:80]) and paste what it shows as an edit of your question, so we can guess on your behalf!-).
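Once the encoding is known, decoding the bytes explicitly before calling json.loads avoids the implicit ASCII step. A minimal sketch, assuming the data turns out to be UTF-8; the sample bytes are hypothetical:

```python
import json

raw = b'{"foo": "degrad\xc3\xa9"}'  # UTF-8 bytes; \xc3\xa9 is e with acute

data = json.loads(raw.decode("utf-8"))
value = data["foo"]

# Decoding with the wrong single-byte encoding raises no error but
# silently produces mojibake, so a clean decode alone proves little.
wrong = json.loads(raw.decode("latin-1"))["foo"]
```

This is why guessing 'latin-1' can make the exception disappear while still leaving the text wrong.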

The simplest way I have found is
import simplejson as json
that way your code remains the same
json.loads(str_to_load)
reference: https://simplejson.readthedocs.org/en/latest/

With older Django versions you can use the bundled simplejson (django.utils.simplejson, which was removed in later releases) and call loads instead of just load.
from django.utils import simplejson
simplejson.loads(str_to_load, "utf-8")
