Unicode characters with BlobStore in App Engine

Is there a way to store unicode data with App Engine's BlobStore (in Python)?
I'm saving the data like this
file_name = files.blobstore.create(mime_type='application/octet-stream')
with files.open(file_name, 'a') as f:
    f.write('<as><a>' + '</a><a>'.join(stringInUnicode) + '</a></as>')
But on the production (not development) server I'm getting this error. It seems to be converting my Unicode into ASCII and I don't know why.
Why is it trying to convert back to ASCII? Can I avoid this?
Traceback (most recent call last):
File "/base/data/home/apps/myapp/1.349473606437967000/myfile.py", line 137, in get
f.write('<as><a>' + '</a><a>'.join(stringInUnicode) + '</a></as>')
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 364, in write
self._make_rpc_call_with_retry('Append', request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 472, in _make_rpc_call_with_retry
_make_call(method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 226, in _make_call
rpc.make_call(method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 509, in make_call
self.__rpc.MakeCall(self.__service, method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 115, in MakeCall
self._MakeCallImpl()
File "/base/python_runtime/python_lib/versions/1/google/appengine/runtime/apiproxy.py", line 161, in _MakeCallImpl
self.request.Output(e)
File "/base/python_runtime/python_lib/versions/1/google/net/proto/ProtocolBuffer.py", line 204, in Output
self.OutputUnchecked(e)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file_service_pb.py", line 2390, in OutputUnchecked
out.putPrefixedString(self.data_)
File "/base/python_runtime/python_lib/versions/1/google/net/proto/ProtocolBuffer.py", line 432, in putPrefixedString
v = str(v)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 313: ordinal not in range(128)

A BLOB store contains binary data: bytes, not characters. So you're going to have to do an encode step of some sort. utf-8 seems as good an encoding as any.
f.write('<as><a>' + '</a><a>'.join(stringInUnicode) + '</a></as>')
This will go wrong if an item in stringInUnicode contains <, & or ]]> sequences. You'll want to do some escaping (either using a proper XML library to serialise the data, or manually):
with files.open(file_name, 'a') as f:
    f.write('<as>')
    for line in stringInUnicode:
        # Escape XML special characters; & must be replaced first.
        line = line.replace(u'&', u'&amp;').replace(u'<', u'&lt;').replace(u'>', u'&gt;')
        f.write('<a>%s</a>' % line.encode('utf-8'))
    f.write('</as>')
(This will still be ill-formed XML if the strings ever include control characters, but there's not so much you can do about that. If you need to store arbitrary binary in XML you'd need some ad-hoc encoding such as base-64 on top.)
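As a rough illustration of the same idea in Python 3, the sketch below builds the payload with the standard library's xml.sax.saxutils.escape instead of chained replace calls, then encodes it once as UTF-8. The function name build_payload and the sample strings are made up for the example; the resulting bytes are what you would hand to f.write().

```python
from xml.sax.saxutils import escape


def build_payload(items):
    """Return the <as><a>...</a></as> document as UTF-8 encoded bytes."""
    parts = ['<as>']
    for item in items:
        # escape() replaces &, < and > with the matching XML entities.
        parts.append('<a>%s</a>' % escape(item))
    parts.append('</as>')
    # Encode once at the end: a blob holds bytes, not characters.
    return ''.join(parts).encode('utf-8')


# Illustrative input only; stands in for stringInUnicode.
payload = build_payload(['caf\u00e9', 'a < b & c'])
```

Encoding in one place, right before the write, keeps everything before that point as text and makes the byte boundary easy to find.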

Related

pywinauto save (permanent) control identifiers to variable

I am trying to save control identifiers to a variable.
The reason is that when I call main_dlg.print_control_identifiers(filename="control_ids.text"), I get this error:
Traceback (most recent call last):
File "run_tests.py", line 7, in <module>
main_dlg.print_control_identifiers(filename="control_ids.text")
File "C:\python\lib\site-packages\pywinauto\application.py", line 696, in print_control_identifiers
print_identifiers([this_ctrl, ], log_func=log_func)
File "C:\python\lib\site-packages\pywinauto\application.py", line 685, in print_identifiers
print_identifiers(ctrl.children(), current_depth + 1, log_func)
File "C:\python\lib\site-packages\pywinauto\application.py", line 681, in print_identifiers
log_func(output)
File "C:\python\lib\site-packages\pywinauto\application.py", line 694, in log_func
log_file.write(str(msg) + os.linesep)
File "C:\python\lib\codecs.py", line 721, in write
return self.writer.write(data)
File "C:\python\lib\codecs.py", line 377, in write
data, consumed = self.encode(object, self.errors)
File "C:\python\lib\encodings\cp1252.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 71-76: character maps to <undefined>
because the program I am trying to control contains Greek characters.
So I decided to save the control identifiers to a variable and then write it to a text file with the right encoding.
A simple solution I thought of is:
from pywinauto.application import Application
import os
app = Application(backend="uia").start("C:/python/Lib/site-packages/QtDesigner/designer.exe")
main_dlg = app.QtDesigner
main_dlg.wait('visible')
main_dlg.print_control_identifiers()
and then:
python script_name.py > output.txt
After that, output.txt contains the control identifiers of the program (in this example, QtDesigner).
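One way to get the output into a variable without shell redirection is to capture standard output inside the script. The Python 3 sketch below assumes the printing call writes through sys.stdout; print_identifiers_stand_in is a hypothetical placeholder for main_dlg.print_control_identifiers(), and the output path is arbitrary.

```python
import contextlib
import io
import os
import tempfile


def print_identifiers_stand_in():
    # Hypothetical stand-in for main_dlg.print_control_identifiers(),
    # assumed to print to standard output when no filename is given.
    print('Button - "\u0391\u03c0\u03bf\u03b8\u03ae\u03ba\u03b5\u03c5\u03c3\u03b7"')


# Capture everything printed inside the block into an in-memory buffer.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print_identifiers_stand_in()
control_ids = buffer.getvalue()

# Write the captured text with an explicit encoding that can represent Greek.
out_path = os.path.join(tempfile.mkdtemp(), 'control_ids.txt')
with open(out_path, 'w', encoding='utf-8') as out:
    out.write(control_ids)
```

Because the text never passes through the console's cp1252 codec, the Greek characters reach the file intact.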

UnicodeDecodeError when extracting comments from a web page using lxml and xpath

Hi I am trying to extract comments on a web page using lxml and xpath. Here is my code:
pg = requests.get('https://www.makeupalley.com/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream', timeout=30)
tr_pg = html.fromstring(pg.content)
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
for cm in cm_pg:
    print cm
I got this error
Traceback (most recent call last):
File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 22, in <module>
process_page('/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream')
File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 10, in process_page
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:57884)
File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:166905)
File "src/lxml/xpath.pxi", line 230, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:165893)
File "src/lxml/extensions.pxi", line 623, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:160088)
File "src/lxml/extensions.pxi", line 657, in lxml.etree._createNodeSetResult (src/lxml/lxml.etree.c:160529)
File "src/lxml/extensions.pxi", line 678, in lxml.etree._unpackNodeSetEntry (src/lxml/lxml.etree.c:160740)
File "src/lxml/extensions.pxi", line 804, in lxml.etree._buildElementStringResult (src/lxml/lxml.etree.c:162214)
File "src/lxml/apihelpers.pxi", line 1417, in lxml.etree.funicode (src/lxml/lxml.etree.c:29944)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 615: invalid continuation byte
I know that there is an invalid character in the comments. How do I solve this?

Can you ask Requests to attempt to decode it for you? Use response.text (a string) rather than response.content (bytes).
The encoding of the source is probably something other than UTF-8, which your XPath library might be assuming. response.encoding is Requests' best guess at what it is. Sometimes web servers or pages aren't configured to say explicitly what encoding they're using, and then all you can do is guess.
It doesn't help that the encoding can be specified in an HTTP header and/or in a <meta> tag. Or websites can lie. Or they might mix encodings. Note that your target website can't even validate because the encoding is wrong, and even setting that aside it's rife with errors.

The page has badly encoded characters.
Example:
Voil�! You will now have an airbrushed look.[...](� la Cover Girl!)
You can avoid them by manually decoding:
>>> pg.content.decode('utf8', errors='ignore')
u'Voil! You will now have an airbrushed look.[...]( la Cover Girl!)'
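The difference between the error handlers can be seen on a small sample. This Python 3 sketch uses a made-up byte string containing the 0xe0 byte from the traceback. Treating the data as cp1252 is an assumption: 0xe0 is "à" in that encoding, which fits "Voilà", but the page's real encoding is not confirmed.

```python
# Illustrative bytes only; 0xe0 followed by '!' is not valid UTF-8.
raw = b'Voil\xe0! You will now have an airbrushed look.'

# Dropping undecodable bytes silently loses the accented letter.
ignored = raw.decode('utf-8', errors='ignore')

# Replacing them keeps a visible U+FFFD marker where data was lost.
replaced = raw.decode('utf-8', errors='replace')

# Decoding with the encoding the bytes were (assumed to be) written in
# keeps the original text.
recovered = raw.decode('cp1252')
```

If losing characters matters, errors='replace' at least shows where the damage is, while decoding with the correct encoding avoids the loss altogether.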

python: csv to json conversion when csv contains unicode

I'm trying to use the following code (within web2py) to read a csv file and convert it into a json object:
import csv
import json
originalfilename, file_stream = db.tablename.file.retrieve(info.file)
file_contents = file_stream.read()
csv_reader = csv.DictReader(StringIO(file_contents))
json = json.dumps([x for x in csv_reader])
This produces the following error:
'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Apparently, there is a problem handling the spaces in the .csv file. The problem appears to stem from the json.dumps() line. The traceback from that point on:
Traceback (most recent call last):
File ".../web2py/gluon/restricted.py", line 212, in restricted
exec ccode in environment
File ".../controllers/default.py", line 2345, in <module>
File ".../web2py/gluon/globals.py", line 194, in <lambda>
self._caller = lambda f: f()
File ".../web2py/gluon/tools.py", line 3021, in f
return action(*a, **b)
File ".../controllers/default.py", line 697, in generate_vis
request.vars.json = json.dumps(list(csv_reader))
File "/usr/local/lib/python2.7/json/__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "/usr/local/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Any suggestions regarding how to resolve this, or another way to get a csv file (which contains a header; using StringIO) into a json object that won't produce similar complications? Thank you.

The csv module (under Python 2) is purely byte-based; all strings you get out of it are bytes. However JSON is Unicode character-based, so there is an implicit conversion when you try to write out the bytes you got from CSV into JSON. Python guessed UTF-8 for this, but your CSV file wasn't UTF-8 - it was probably Windows code page 1252 (Western European - like ISO-8859-1 only not quite).
A quick fix would be to transcode your input (file_contents= file_contents.decode('windows-1252').encode('utf-8')), but probably you don't really want to rely on json guessing a particular encoding.
Best would be to explicitly decode your strings at the point of reading them from CSV. Then JSON will be able to cope with them OK. Unfortunately csv doesn't have built-in decoding (at least in this Python version), but you can do it manually:
class UnicodeDictReader(csv.DictReader):
    def __init__(self, f, encoding, *args, **kwargs):
        csv.DictReader.__init__(self, f, *args, **kwargs)
        self.encoding = encoding
    def next(self):
        return {
            k.decode(self.encoding): v.decode(self.encoding)
            for (k, v) in csv.DictReader.next(self).items()
        }
csv_reader = UnicodeDictReader(StringIO(file_contents), 'windows-1252')
json_output = json.dumps(list(csv_reader))
it's not known in advance what sort of encoding will come up
Well, that's more of a problem, since it's impossible to guess accurately what encoding a file is in. You would either have to specify a particular encoding, or give the user a way to signal what the encoding is, if you want to support non-ASCII characters properly.
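For comparison, in Python 3 the csv module works on text, so the decode step moves to the stream wrapper. The sketch below is a minimal example under assumptions: the sample bytes are invented, and windows-1252 is taken as the known (or user-supplied) encoding rather than detected.

```python
import csv
import io
import json

# Sample bytes as they might come from file_stream.read(); 0xa0 is a
# non-breaking space in windows-1252 and is not valid UTF-8 on its own.
file_contents = b'name,city\r\nAna,S\xe3o\xa0Paulo\r\n'

# Decode explicitly while reading, so csv only ever sees text.
text_stream = io.TextIOWrapper(io.BytesIO(file_contents),
                               encoding='windows-1252', newline='')
rows = list(csv.DictReader(text_stream))

# ensure_ascii=False keeps non-ASCII characters readable in the output.
json_output = json.dumps(rows, ensure_ascii=False)
```

Passing the encoding in as a parameter is what makes it possible to let the user choose it when it is not known in advance.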

Try replacing your final line with
json = json.dumps([x.encode('utf-8') for x in csv_reader])

Running unidecode over the file contents seems to do the trick:
from isounidecode import unidecode
...
file_contents = unidecode(file_stream.read())
...
Thanks, everyone!

Python SUDS unicode decode error returned from Webservice

I am attempting to use a Webservice created by one of our developers that allows us to upload files into the system, within certain restrictions.
Using SUDS, I get the following information:
Suds ( https://fedorahosted.org/suds/ ) version: 0.4 GA build: R699-20100913
Service ( ConnectToEFS ) tns="http://tempuri.org/"
Prefixes (3)
ns0 = "http://schemas.microsoft.com/2003/10/Serialization/"
ns1 = "http://schemas.microsoft.com/Message"
ns2 = "http://tempuri.org/"
Ports (1):
(BasicHttpBinding_IConnectToEFS)
Methods (2):
CreateContentFolder(xs:string FileCode, xs:string FolderName, xs:string ContentType, xs:string MetaDataXML, )
UploadFile(ns1:StreamBody FileByteStream, )
Types (4):
ns1:StreamBody
ns0:char
ns0:duration
ns0:guid
My method for using UploadFile is as follows:
def webserviceUploadFile(self, targetLocation, fileName, fileSource):
    fileSource = './test_files/' + fileSource
    ntlm = WindowsHttpAuthenticated(username=uname, password=upass)
    client = Client(webservice_url, transport=ntlm)
    client.set_options(soapheaders={'TargetLocation':targetLocation, 'FileName': fileName})
    body = client.factory.create('AIRDocument')
    body_file = open(fileSource, 'rb')
    body_data = body_file.read()
    body.FileByteStream = body_data
    return client.service.UploadFile(body)
Running this gets me the following result:
Traceback (most recent call last):
File "test_cases.py", line 639, in test_upload_file_invalid_extension
result_string = self.HM.webserviceUploadFile('9999', 'AD-1234-5424__44.exe', 'test_data.pdf')
File "test_cases.py", line 81, in webserviceUploadFile
return client.service.UploadFile(body)
File "build\bdist.win32\egg\suds\client.py", line 542, in __call__
return client.invoke(args, kwargs)
File "build\bdist.win32\egg\suds\client.py", line 595, in invoke
soapenv = binding.get_message(self.method, args, kwargs)
File "build\bdist.win32\egg\suds\bindings\binding.py", line 120, in get_message
content = self.bodycontent(method, args, kwargs)
File "build\bdist.win32\egg\suds\bindings\document.py", line 63, in bodycontent
p = self.mkparam(method, pd, value)
File "build\bdist.win32\egg\suds\bindings\document.py", line 105, in mkparam
return Binding.mkparam(self, method, pdef, object)
File "build\bdist.win32\egg\suds\bindings\binding.py", line 287, in mkparam
return marshaller.process(content)
File "build\bdist.win32\egg\suds\mx\core.py", line 62, in process
self.append(document, content)
File "build\bdist.win32\egg\suds\mx\core.py", line 75, in append
self.appender.append(parent, content)
File "build\bdist.win32\egg\suds\mx\appender.py", line 102, in append
appender.append(parent, content)
File "build\bdist.win32\egg\suds\mx\appender.py", line 243, in append
Appender.append(self, child, cont)
File "build\bdist.win32\egg\suds\mx\appender.py", line 182, in append
self.marshaller.append(parent, content)
File "build\bdist.win32\egg\suds\mx\core.py", line 75, in append
self.appender.append(parent, content)
File "build\bdist.win32\egg\suds\mx\appender.py", line 102, in append
appender.append(parent, content)
File "build\bdist.win32\egg\suds\mx\appender.py", line 198, in append
child.setText(tostr(content.value))
File "build\bdist.win32\egg\suds\sax\element.py", line 251, in setText
self.text = Text(value)
File "build\bdist.win32\egg\suds\sax\text.py", line 43, in __new__
result = super(Text, cls).__new__(cls, *args, **kwargs)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 10: ordinal not in range(128)
After much research and talking with the developer of the webservice, I changed body_data = body_file.read() to body_data = body_file.read().decode("UTF-8"), which gets me this error:
Traceback (most recent call last):
File "test_cases.py", line 639, in test_upload_file_invalid_extension
result_string = self.HM.webserviceUploadFile('9999', 'AD-1234-5424__44.exe', 'test_data.pdf')
File "test_cases.py", line 79, in webserviceUploadFile
body_data = body_file.read().decode("utf-8")
File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Which is less than helpful.
After more research into the problem, I tried adding errors='ignore' to the UTF-8 decode, and this was the result:
<TransactionDescription>Error in INTL-CONF_France_PROJ_MA_126807.docx: An exception has been thrown when reading the stream.. Inner Exception: System.Xml.XmlException: The byte 0x03 is not valid at this location. Line 1, position 318.
at System.Xml.XmlExceptionHelper.ThrowXmlException(XmlDictionaryReader reader, String res, String arg1, String arg2, String arg3)
at System.Xml.XmlUTF8TextReader.Read()
at System.ServiceModel.Dispatcher.StreamFormatter.MessageBodyStream.Exhaust(XmlDictionaryReader reader)
at System.ServiceModel.Dispatcher.StreamFormatter.MessageBodyStream.Read(Byte[] buffer, Int32 offset, Int32 count). Source: System.ServiceModel</TransactionDescription>
Which pretty much stumps me on what to do. Based on the stack trace returned by the webservice, it looks like it wants UTF-8, but I can't seem to get the data to the webservice without Python or SUDS throwing a fit, or without ignoring problems in the encoding. The system I'm working on only takes Microsoft Office files (doc, xls, and the like), PDFs, and TXT files, so switching to a format whose encoding I have more control over is not an option. I also tried detecting the encoding used by the sample PDF and the sample DOCX, but the suggested encodings (Latin-1, ISO8859-x, and several Windows-XXXX code pages) were all accepted by Python and SUDS, but not by the webservice.
Also note that the example shown mostly references a test of an invalid extension. The error also occurs in what should be a test of a successful upload, which is really the only time the final stack trace ever shows up.

You can use base64.b64encode(body_file.read()); this returns the file's contents as a base64 string, so the value you put in the request is a plain string.
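A small Python 3 sketch of that suggestion follows. The sample bytes are invented binary data standing in for body_file.read(), and whether the service actually expects base64 text in FileByteStream depends on its WSDL, which is an assumption here.

```python
import base64

# Stand-in for body_file.read(): arbitrary binary data that is neither
# valid ASCII nor valid UTF-8 (0xe2 followed by a non-continuation byte).
body_data = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\n'

# b64encode returns bytes containing only ASCII characters; decode them to
# get a text string that is safe to place in an XML/SOAP body.
encoded = base64.b64encode(body_data).decode('ascii')

# The receiving side can recover the original bytes exactly.
decoded = base64.b64decode(encoded)
```

Unlike decode('utf-8', errors='ignore'), this does not drop or alter any byte of the file, which matters for binary formats such as PDF and DOCX.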

AllegroGraph - UTF-8 characters in N-Triples

When I use the AllegroGraph 4.6 Python API, I can use the connection.addTriple() method to try to add a triple that ends in a literal containing a unicode character (×):
conn.addTriple( ..., ..., '5 × 10**5' )
This doesn't work. I get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position...
Here's the full traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 357, in addTriple
self._convert_term_to_mini_term(obj), cxt)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 235, in _convert_term_to_mini_term
return self._to_ntriples(term)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 367, in _to_ntriples
else: return term.toNTriples();
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/model/literal.py", line 182, in toNTriples
sb.append(strings.encode_ntriple_string(self.getLabel()))
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/util/strings.py", line 52, in encode_ntriple_string
string = unicode(string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 18: ordinal not in range(128)
Instead I can add the triple like this:
conn.addTriple( ..., ..., u'5 × 10**5' )
That way I don't get an error.
But if I load a file of ntriples that contains some UTF-8 encoded characters using connection.addFile(filename, format=RDFFormat.NTRIPLES), I get this error message if the ntriples file is saved as ANSI encoding from Notepad++:
400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream # #x10046f9ea2> at line 12764 (last character was
#\×): nil
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
commitEvery=self.add_commit_size)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing
I get this error message if the file is saved as UTF-8 encoding:
400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream # #x100486e8b2> at line 1 (last character was
#\): Subjects must be resources (i.e., URIs or blank nodes)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
commitEvery=self.add_commit_size)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing
However, if the file is set to ANSI encoding in Notepad++, I can go in and paste the × character, save, and then the file loads fine. Or, if I change the file encoding to UTF-8 after I paste the character, then the character changes to some strange xD7 character. If the file is set to UTF-8 encoding and I paste the × in there, then if I change the encoding to ANSI the × changes to a Ã—.
When the file was given to me, it had Ã— where the × should have been, and when I tried to load it in AllegroGraph I got the first 400 MALFORMED DATA error, which fails at the line where the character actually appears in the file (12764), instead of just at the first line. I assume that the reason I get the second 400 MALFORMED DATA error on line 1 has something to do with the header written by Notepad++ for UTF-8 encoded files. So apparently, I have to save a file as ANSI if I want AllegroGraph not to hiccup immediately, but there has to be some way to tell AllegroGraph to read things like Ã— as UTF-8 characters.
In the file, the triple looks like:
<...some subject URI...> <...some predicate URI...> "5 × 10**5" .

\xd7 is the Latin-1 encoding of ×.
Ã— is what you get if the UTF-8 encoding of × is mistakenly decoded as cp1252 (often Windows' default codec).
When you're given files that show Ã—, try changing the codec that's used to display them to UTF-8.
For an overview of Unicode in Python see here. ~ Thanks to Daenyth.
As you found out from AllegroGraph support:
AllegroGraph can take unicode characters in nTriples using \uXXXX
notation. Alternatively one can use RDFXML, which allows you to leave the
unicode characters as they are.
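These relationships can be checked directly. The Python 3 sketch below reproduces the mojibake and shows one possible way to write the \uXXXX form; ntriples_escape is a hypothetical helper written for this example, not part of the AllegroGraph client, and it only handles characters in the Basic Multilingual Plane.

```python
times = '\u00d7'  # the multiplication sign

latin1_bytes = times.encode('latin-1')  # a single byte, 0xd7
utf8_bytes = times.encode('utf-8')      # two bytes, 0xc3 0x97

# Reading the UTF-8 bytes as cp1252 produces the two-character mojibake.
mojibake = utf8_bytes.decode('cp1252')

# Reversing the mistake recovers the original character.
repaired = mojibake.encode('cp1252').decode('utf-8')


def ntriples_escape(text):
    """Replace non-ASCII characters with \\uXXXX escapes (BMP only)."""
    return ''.join(c if ord(c) < 128 else '\\u%04X' % ord(c) for c in text)


literal = ntriples_escape('5 \u00d7 10**5')
```

With the literal escaped this way the file is pure ASCII, so the encoding chosen when saving it no longer affects the parser.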

Use the codecs module:
import codecs
f = codecs.open('file.txt','r','utf8')
This opens the file and decodes its contents as UTF-8.
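If the line-1 error really is caused by a byte order mark at the start of the file, as the question assumes, the same module can strip it. This is a sketch under that assumption, written for Python 3; the file name and contents are invented for the example.

```python
import codecs
import os
import tempfile

# Create a small UTF-8 file that starts with a byte order mark.
path = os.path.join(tempfile.mkdtemp(), 'sample.nt')
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + '"5 \u00d7 10**5" .\n'.encode('utf-8'))

# 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present.
with codecs.open(path, 'r', 'utf-8-sig') as f:
    text = f.read()

# Writing with plain 'utf-8' produces a file without a BOM.
with codecs.open(path, 'w', 'utf-8') as f:
    f.write(text)
```

The rewritten file keeps the × character as UTF-8 but no longer begins with the three BOM bytes.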
