AllegroGraph - UTF-8 characters in N-Triples - python

When I use the AllegroGraph 4.6 Python API, I can use the connection.addTriple() method to try to add a triple whose object is a literal containing a unicode character (×):
conn.addTriple( ..., ..., '5 × 10**5' )
This doesn't work. I get the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position...
Here's the full traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 357, in addTriple
self._convert_term_to_mini_term(obj), cxt)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 235, in _convert_term_to_mini_term
return self._to_ntriples(term)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 367, in _to_ntriples
else: return term.toNTriples();
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/model/literal.py", line 182, in toNTriples
sb.append(strings.encode_ntriple_string(self.getLabel()))
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/util/strings.py", line 52, in encode_ntriple_string
string = unicode(string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 18: ordinal not in range(128)
Instead, I can add the triple like this:
conn.addTriple( ..., ..., u'5 × 10**5' )
That way I don't get an error.
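The failure is visible in the traceback: strings.py calls unicode(string), and in Python 2 that attempts an implicit ASCII decode of a byte string. A minimal sketch of the difference:
>>> s = '5 \xc3\x97 10**5'    # the UTF-8 bytes for '5 × 10**5'
>>> unicode(s)                # implicit ASCII decode, as in strings.py
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)
>>> s.decode('utf-8')         # an explicit decode works
u'5 \xd7 10**5'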
But when I load a file of N-Triples containing UTF-8 encoded characters using connection.addFile(filename, format=RDFFormat.NTRIPLES), I get errors. If the N-Triples file is saved with ANSI encoding from Notepad++, the error is:
400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream @ #x10046f9ea2> at line 12764 (last character was
#\×): nil
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
commitEvery=self.add_commit_size)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing
If the file is saved with UTF-8 encoding, the error is:
400 MALFORMED DATA: N-Triples parser error while parsing
#<http request stream @ #x100486e8b2> at line 1 (last character was
#\): Subjects must be resources (i.e., URIs or blank nodes)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/openrdf/repository/repositoryconnection.py", line 341, in addFile
commitEvery=self.add_commit_size)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/repository.py", line 342, in loadFile
nullRequest(self, "POST", "/statements?" + params, body, contentType=mime)
File "/cygdrive/c/agraph-4.6-client-python/src2/franz/miniclient/request.py", line 198, in nullRequest
if (status < 200 or status > 204): raise RequestError(status, body)
franz.miniclient.request.RequestError: Server returned 400: N-Triples parser error while parsing
However, if the file is set to ANSI encoding in Notepad++, I can go in and paste the × character, save, and then the file loads fine. Or, if I change the file encoding to UTF-8 after I paste the character, the character changes to some strange xD7 character. If the file is set to UTF-8 encoding and I paste the × in there, then if I change the encoding to ANSI the × changes to a Ã—.
When the file was given to me, it had Ã— where the × should have been, and when I tried to load it into AllegroGraph I got the first 400 MALFORMED DATA error, which fails at the line where the character actually appears in the file (12764), rather than at the first line. I assume that the second 400 MALFORMED DATA error, on line 1, has something to do with the header (presumably the byte-order mark) that Notepad++ writes for UTF-8 encoded files. So apparently I have to save the file as ANSI if I want AllegroGraph not to hiccup immediately, but there must be some way to tell AllegroGraph to read things like Ã— as UTF-8 characters.
In the file, the triple looks like:
<...some subject URI...> <...some predicate URI...> "5 × 10**5" .

\xd7 is the Latin-1 encoding of ×.
Ã— is what you get if you mistakenly decode UTF-8 encoded × using cp1252 (often Windows' default codec).
When you're given files that show Ã—, try changing the codec used to display them to UTF-8.
For an overview of Unicode in Python see here. ~ Thanks to Daenyth.
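The round trip described above is easy to demonstrate in Python 2 (× is U+00D7):
>>> u'\xd7'.encode('utf-8')             # × encoded as UTF-8
'\xc3\x97'
>>> u'\xd7'.encode('utf-8').decode('cp1252')
u'\xc3\u2014'                           # displays as Ã—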
As you found out from AllegroGraph support:
AllegroGraph can take unicode characters in N-Triples using \uXXXX
notation. Alternatively one can use RDF/XML, which allows you to leave the
unicode characters as they are.
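A sketch of that \uXXXX escaping in Python 2 (the helper name is mine, not part of the AllegroGraph API):
def escape_unicode_for_ntriples(u):
    # hypothetical helper: escape non-ASCII characters as \uXXXX
    # (or \UXXXXXXXX beyond the Basic Multilingual Plane)
    out = []
    for ch in u:
        cp = ord(ch)
        if cp < 128:
            out.append(str(ch))
        elif cp <= 0xFFFF:
            out.append('\\u%04X' % cp)
        else:
            out.append('\\U%08X' % cp)
    return ''.join(out)

print escape_unicode_for_ntriples(u'5 \xd7 10**5')   # prints: 5 \u00D7 10**5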

Use the codecs module:
import codecs
f = codecs.open('file.txt','r','utf8')
This opens your file with UTF-8 decoding forced, so every read returns unicode objects.
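For example, a sketch that iterates over the decoded lines (file.txt as in the snippet above):
import codecs

f = codecs.open('file.txt', 'r', 'utf8')
for line in f:            # each line is now a unicode object
    print repr(line)      # no implicit ASCII decode can blow up here
f.close()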

Related

A strange python unicode decode error trying to do redis dump (prefer python3, but seeing in both 2 & 3)

I am trying to do a redis dump using the python package redis-dump-load.
It is in UTF-8, except for apparently one key, which I'm told is in ASCII. No idea why, but I thought: OK, if there is a UnicodeDecodeError for this one key (I receive lots of data from the dump stream before reaching it), I will detect the encoding of the bytes string I have and decode with that instead.
However, I am still getting a UnicodeDecodeError for ASCII as well! It is bytes, it is ASCII; I think it might just be corrupt and I plan to just skip it, but I'm curious whether anyone has any other ideas.
This is my code snippet:
value = {}
for k in response:
    try:
        value[k.decode(encoding)] = response[k].decode(encoding)
    except UnicodeDecodeError:
        print("Error for", k)
        print(type(k))
        orig_encoding = chardet.detect(k)['encoding']
        print(orig_encoding)
        value[k.decode(orig_encoding)] = response[k].decode(orig_encoding)
return value
And here is the output I see:
Error for b'meta'
<class 'bytes'>
ascii
Traceback (most recent call last):
File "redisdl.py", line 243, in handle_response
value[k.decode(encoding)] = response[k].decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "redisdl.py", line 635, in <module>
main()
File "redisdl.py", line 626, in main
do_dump(options)
File "redisdl.py", line 547, in do_dump
dump(output, **kwargs)
File "redisdl.py", line 174, in dump
for key, type, ttl, value in _reader(r, pretty, encoding, keys):
File "redisdl.py", line 293, in _reader
type, ttl, value = _read_key(encoded_key, r, pretty, encoding)
File "redisdl.py", line 284, in _read_key
value = reader.handle_response(results[2], pretty, encoding)
File "redisdl.py", line 250, in handle_response
response[k].decode(orig_encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
Am I missing something? I have looked at multiple SO answers, believe me, and I thought up until now I understood str, bytes, decoding, and encoding. But maybe not. Even py2 & py3 differences I get. I think. Starting to doubt everything...
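One detail worth checking in the snippet above: chardet.detect is called on the key k, but the decode that actually fails is on the value response[k], whose bytes (starting with 0x80) chardet never sees. A sketch of detecting on the failing bytes themselves; this is a guess at the cause, not a guaranteed fix for corrupt data:
import chardet

def safe_decode(data, preferred='utf-8'):
    # try the preferred codec first, then whatever chardet guesses
    # from the bytes that actually failed
    try:
        return data.decode(preferred)
    except UnicodeDecodeError:
        guess = chardet.detect(data)['encoding']
        if guess is None:
            # chardet gave up: probably binary or corrupt data
            return data.decode(preferred, errors='replace')
        return data.decode(guess, errors='replace')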

UnicodeDecodeError when extracting comments from a web page using lxml and xpath

Hi, I am trying to extract comments from a web page using lxml and XPath. Here is my code:
pg = requests.get('https://www.makeupalley.com/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream', timeout=30)
tr_pg = html.fromstring(pg.content)
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
for cm in cm_pg:
    print cm
I got this error:
Traceback (most recent call last):
File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 22, in <module>
process_page('/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream')
File "/Users/ghozan/PycharmProjects/MakeupAlley/main.py", line 10, in process_page
cm_pg = tr_pg.xpath('//p[@class="break-word"]/text()')
File "src/lxml/lxml.etree.pyx", line 1587, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:57884)
File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:166905)
File "src/lxml/xpath.pxi", line 230, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:165893)
File "src/lxml/extensions.pxi", line 623, in lxml.etree._unwrapXPathObject (src/lxml/lxml.etree.c:160088)
File "src/lxml/extensions.pxi", line 657, in lxml.etree._createNodeSetResult (src/lxml/lxml.etree.c:160529)
File "src/lxml/extensions.pxi", line 678, in lxml.etree._unpackNodeSetEntry (src/lxml/lxml.etree.c:160740)
File "src/lxml/extensions.pxi", line 804, in lxml.etree._buildElementStringResult (src/lxml/lxml.etree.c:162214)
File "src/lxml/apihelpers.pxi", line 1417, in lxml.etree.funicode (src/lxml/lxml.etree.c:29944)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 615: invalid continuation byte
I know that there is an invalid character in the comments. How do I solve this?
Can you ask Requests to attempt to decode it for you? Use response.text (a string) rather than response.content (bytes).
The encoding of the source is probably something other than UTF-8, which your XPath library might be assuming. response.encoding is Requests' best guess at what it is. Sometimes web servers/pages aren't configured to say explicitly what encoding they're using, and then all you can do is guess.
It doesn't help that the encoding can be specified in an HTTP header and/or in a <meta> tag. Or websites can lie. Or they might be mixing encodings. Note that your target page doesn't even validate because the encoding is wrong, and even apart from that it's rife with errors.
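A sketch of that approach (apparent_encoding re-guesses the codec from the body itself):
import requests
from lxml import html

url = 'https://www.makeupalley.com/product/showreview.asp/ItemId=164662/Sublime-Skin-BB-Cream-6-in-1/Yves-Rocher/BB-Cream'
pg = requests.get(url, timeout=30)
pg.encoding = pg.apparent_encoding   # re-guess the codec from the content
tr_pg = html.fromstring(pg.text)     # hand lxml a decoded string, not raw bytes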
The page has badly encoded characters.
Ex:
Voil�! You will now have an airbrushed look.[...](� la Cover Girl!)
You can avoid them by manually decoding:
>>> pg.content.decode('utf8', errors='ignore')
u'Voil! You will now have an airbrushed look.[...]( la Cover Girl!)'

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 434852: invalid continuation byte

I am using hfcca to calculate cyclomatic complexity for C++ code. hfcca is a simple Python script (https://code.google.com/p/headerfile-free-cyclomatic-complexity-analyzer/). When I try to run the script to generate the output in the form of an XML file, I get the following errors:
Traceback (most recent call last):
"./hfcca.py", line 802, in <module>
main(sys.argv[1:])
File "./hfcca.py", line 798, in main
print(xml_output([f for f in r], options))
File "./hfcca.py", line 798, in <listcomp>
print(xml_output([f for f in r], options))
File "/x/home06/smanchukonda/PREFIX/lib/python3.3/multiprocessing/pool.py", line 652, in next
raise value
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 434852: invalid continuation byte
Please help me with this.
The problem looks like the file has characters represented in Latin-1 that aren't valid UTF-8. The file utility can be useful for figuring out what encoding a file should be treated as, e.g.:
monk@monk-VirtualBox:~$ file foo.txt
foo.txt: UTF-8 Unicode text
Here's what the bytes mean in latin1:
>>> b'\xe2'.decode('latin1')
'â'
Probably easiest is to convert the files to utf8.
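A sketch of that conversion (the filename is hypothetical):
import io

# read as Latin-1, which maps every possible byte, then write back as UTF-8
with io.open('foo.cpp', 'r', encoding='latin1') as src:
    text = src.read()
with io.open('foo.cpp', 'w', encoding='utf-8') as dst:
    dst.write(text)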
I also had the same problem rendering Markup("""yyyyyy""") but I solved it using an online tool which removed the 'bad' characters: https://pteo.paranoiaworks.mobi/diacriticsremover/
It is a nice tool and works even offline.

UnicodeDecodeError in Python 2.7

I am trying to read a UTF-8 encoded XML file in Python and I am doing some processing on the lines read from the file, something like below:
next_sent_separator_index = doc_content.find(word_value, int(characterOffsetEnd_value) + 1)
Where doc_content is the line read from the file and word_value is one of the strings from the same line. I am getting an encoding-related error for the above line whenever doc_content or word_value contains some Unicode characters. So, I tried to decode them first with utf-8 decoding (instead of the default ascii encoding) as below:
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
But I am still getting UnicodeDecodeError as below :
Traceback (most recent call last):
File "snippetRetriver.py", line 402, in <module>
sentences_list,lemmatised_sentences_list = getSentenceList(form_doc)
File "snippetRetriver.py", line 201, in getSentenceList
next_sent_separator_index = doc_content.decode('utf-8').find(word_value.decode('utf-8'), int(characterOffsetEnd_value) + 1)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8: ordinal not in range(128)
Can anyone suggest me a suitable approach / way to avoid these kind of encoding errors in python 2.7 ?
codecs.utf_8_decode(input.encode('utf8'))
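In Python 2, calling .decode('utf-8') on a value that is already unicode first encodes it back to bytes with the ASCII codec, which is exactly the UnicodeEncodeError in the traceback above. A sketch of a guard that only decodes byte strings (the helper name is mine):
def ensure_unicode(s):
    # byte strings get decoded; unicode passes through untouched
    if isinstance(s, str):
        return s.decode('utf-8')
    return s

next_sent_separator_index = ensure_unicode(doc_content).find(
    ensure_unicode(word_value), int(characterOffsetEnd_value) + 1)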

Unicode characters with BlobStore in App Engine

Is there a way to store unicode data with App Engine's BlobStore (in Python)?
I'm saving the data like this
file_name = files.blobstore.create(mime_type='application/octet-stream')
with files.open(file_name, 'a') as f:
    f.write('<as><a>' + '</a><a>'.join(stringInUnicode) + '</a></as>')
But on the production (not development) server I'm getting this error. It seems to be converting my Unicode into ASCII and I don't know why.
Why is it trying to convert back to ASCII? Can I avoid this?
Traceback (most recent call last):
File "/base/data/home/apps/myapp/1.349473606437967000/myfile.py", line 137, in get
f.write('<as><a>' + '</a><a>'.join(stringInUnicode) + '</a></as>')
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 364, in write
self._make_rpc_call_with_retry('Append', request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 472, in _make_rpc_call_with_retry
_make_call(method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file.py", line 226, in _make_call
rpc.make_call(method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 509, in make_call
self.__rpc.MakeCall(self.__service, method, request, response)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 115, in MakeCall
self._MakeCallImpl()
File "/base/python_runtime/python_lib/versions/1/google/appengine/runtime/apiproxy.py", line 161, in _MakeCallImpl
self.request.Output(e)
File "/base/python_runtime/python_lib/versions/1/google/net/proto/ProtocolBuffer.py", line 204, in Output
self.OutputUnchecked(e)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/files/file_service_pb.py", line 2390, in OutputUnchecked
out.putPrefixedString(self.data_)
File "/base/python_runtime/python_lib/versions/1/google/net/proto/ProtocolBuffer.py", line 432, in putPrefixedString
v = str(v)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 313: ordinal not in range(128)
A BLOB store contains binary data: bytes, not characters. So you're going to have to do an encode step of some sort. utf-8 seems as good an encoding as any.
f.write('<as><a>' + '</a><a>'.join(stringInUnicode) + '</a></as>')
This will go wrong if an item in stringInUnicode contains <, & or ]]> sequences. You'll want to do some escaping, either using a proper XML library to serialise the data (see the ElementTree sketch below) or manually:
with files.open(file_name, 'a') as f:
    f.write('<as>')
    for line in stringInUnicode:
        line = line.replace(u'&', u'&amp;').replace(u'<', u'&lt;').replace(u'>', u'&gt;')
        f.write('<a>%s</a>' % line.encode('utf-8'))
    f.write('</as>')
(This will still be ill-formed XML if the strings ever include control characters, but there's not so much you can do about that. If you need to store arbitrary binary in XML you'd need some ad-hoc encoding such as base-64 on top.)
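A minimal sketch of the library route, using the standard library's ElementTree so the escaping is handled for you (names reuse the snippet above; tostring with an encoding argument returns UTF-8 bytes, including an XML declaration):
import xml.etree.cElementTree as ET

root = ET.Element('as')
for line in stringInUnicode:
    ET.SubElement(root, 'a').text = line       # &, < and > are escaped for us
f.write(ET.tostring(root, encoding='utf-8'))   # UTF-8 bytes, safe for the blob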
