Fellows,
I am unable to parse a Unicode text file submitted using Django forms. Here are the quick steps I performed:
Uploaded a text file (encoding: UTF-16; file contents: Hello World 13)
On server side, received the file using filename = request.FILES['file_field']
Going line by line: for line in filename: yield line
type(filename) gives me <class 'django.core.files.uploadedfile.InMemoryUploadedFile'>
type(line) is <type 'str'>
print line : '\xff\xfeH\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00 \x001\x003\x00'
codecs.BOM_UTF16_LE == line[:2] returns True
Now I want to reconstruct the Unicode or ASCII string "Hello World 13" so that I can parse the integer from the line.
One of the ugliest ways of doing this is to take line[-5:] (= '\x001\x003\x00') and reconstruct the number from line[-5:][1] and line[-5:][3].
I am sure there must be a better way of doing this. Please help.
Thanks in advance!
Use codecs.iterdecode() to decode the object on the fly:
from codecs import iterdecode
for line in iterdecode(filename, 'utf16'): yield line
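For the concrete case in the question, the generator might look roughly like this (the field name file_field and the trailing-integer parsing come from the question; treat this as an illustration, not a tested view):
from codecs import iterdecode

def read_numbers(request):
    uploaded = request.FILES['file_field']       # the InMemoryUploadedFile from the question
    for line in iterdecode(uploaded, 'utf16'):   # the utf16 codec consumes the BOM for you
        # line is now a unicode string such as u'Hello World 13'
        yield int(line.split()[-1])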
Hello, I'm trying to convert a Google service account JSON key (contained in a base64-encoded field named privateKeyData in the file foo.json - more context here) into the actual JSON file (I need that format, as Ansible only accepts that).
The foo.json file is obtained using this Google Python API method.
What I'm trying to do (though I am using Python) is also described in this thread, which by the way does not work for me (tried on OS X and Linux).
#!/usr/bin/env python3
import json
import base64
with open('/tmp/foo.json', 'r') as f:
    ymldict = json.load(f)

b64encodedCreds = ymldict['privateKeyData']
b64decodedBytes = base64.b64decode(b64encodedCreds, validate=True)
outputStr = b64decodedBytes
print(outputStr)

# issue
outputStr = b64decodedBytes.decode('UTF-8')
print(outputStr)
yields
./test.py
b'0\x82\t\xab\x02\x01\x030\x82\td\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\tU\x04\x82\tQ0\x82\tM0\x82\x05q\x06\t*\x86H\x86\xf7\r\x01\x07\x01\xa0\x82\x05b\x04\x82\x05^0\x82\x05Z0\x82\x05V\x06\x0b*\x86H\x86\xf7\r\x01\x0c\n\x01\x02\xa0\x82\x#TRUNCATING HERE
Traceback (most recent call last):
File "./test.py", line 17, in <module>
outputStr = b64decodedBytes.decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 1: invalid start byte
I think I have run out of ideas and have now spent more than a day on this :(
What am I doing wrong?
Your base64 decoding logic looks fine to me. The problem you are facing is probably due to a character encoding mismatch. The response body you received after calling create (your foo.json file) is probably not encoded with UTF-8. Check out the response header's Content-Type field. It should look something like this:
Content-Type: text/javascript; charset=Shift_JIS
Try to decode your base64-decoded string with the encoding given in the content type:
b64decodedBytes.decode('Shift_JIS')
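To automate that, a small sketch of pulling the charset out of a Content-Type header and decoding with it (the header value below is hypothetical; substitute whatever your response actually reports):
content_type = 'text/javascript; charset=Shift_JIS'   # hypothetical header value
charset = 'utf-8'                                     # fallback when no charset is given
for param in content_type.split(';')[1:]:
    key, _, value = param.strip().partition('=')
    if key.lower() == 'charset':
        charset = value.strip('"')
outputStr = b64decodedBytes.decode(charset)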
In Python, I usually do simple JSON with this sort of template:
url = "url"
file = urllib2.urlopen(url)
json = file.read()
parsed = json.loads(json)
and then get at the variables with calls like:
parsed["object name"]["value name"]
But, this works with JSON that's formatted roughly like:
{'object':{'index':'value', 'index':'value'}}
The JSON I just encountered is formatted like:
{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}
so there are no names for me to reference the different blocks. Of course the blocks give different info, but have the same "keys" -- much like XML is usually formatted. Using my method above, how would I parse through this JSON?
The following is not valid JSON:
{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}
whereas
[{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}] is valid JSON,
and the Python traceback shows that:
import json
json_string = '{"a":"value", "b":"value"},{"a":"value", "b":"value"}'
parsed_json = json.loads(json_string)
print parsed_json
Traceback (most recent call last):
File "/Users/tron/Desktop/test3.py", line 3, in <module>
parsed_json = json.loads(json_string)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 27 - line 1 column 54 (char 26 - 53)
[Finished in 0.0s with exit code 1]
whereas if you do
json_string = '[{"a":"value", "b":"value"},{"a":"value", "b":"value"}]'
everything works fine.
If that is the case, you can treat the parsed result as an array of JSON objects, where parsed_json[0] is the first object, parsed_json[1] is the second, and so on.
Otherwise, if you think this is going to be an issue that you "just have to deal with", here is one option:
Think of the ways the JSON can be malformed and write a simple function to account for them. In the case above, here is a hacky way to deal with it:
import json

json_string = '{"a":"value", "b":"value"},{"a":"value", "b":"value"}'

def parseJson(string):
    parsed_json = None
    try:
        parsed_json = json.loads(string)
        print parsed_json
    except ValueError, e:
        print string, "didn't parse"
        if "Extra data" in str(e.args):
            newString = "[" + string + "]"
            print newString
            return parseJson(newString)
    return parsed_json
You could add more if/else branches to deal with the various cases you run into. I have to admit this is very hacky, and I don't think you can ever account for every possible mutation.
Good luck
The result must be a list of dicts:
[{'index1':'value1', 'index2':'value2'},{'index1':'value1', 'index2':'value2'}]
so you can reference the items by index: items[1]['index1']
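For example, a small sketch (reusing the keys above, with the values varied so the indexing is visible; Python 2 print syntax to match the rest of the thread):
import json

items = json.loads('[{"index1":"value1", "index2":"value2"},'
                   ' {"index1":"value3", "index2":"value4"}]')

print items[1]['index1']          # -> value3
for item in items:
    print item['index1'], item['index2']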
I'm trying to allow users to sign up to my service and I'm noticing errors whenever somebody signs up with Latin American characters in their name. I tried reading several SO posts/websites as per below:
Python regex against Latin-1 character encoding?
http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
http://docs.python.org/2/library/json.html
https://pypi.python.org/pypi/anyjson
but was still unable to solve it. My code example is as per below:
>>> val = json.dumps({"name":"Déjà"}, encoding="ISO-8859-1")
>>> val
'{"name": "D\\u00c3\\u00a9j\\u00c3\\u00a0"}'
Is there any way to force the encoding to work in this case, for both that and deserializing? Any help is appreciated!
EDIT
The client is Android and iPhone applications. I'm using the following libraries to encode the json on the clients:
http://loopj.com/android-async-http/ (android)
https://github.com/AFNetworking/AFNetworking (ios)
EDIT 2
The same text was received by the server from the Android client as per below:
{"NAME":"D\ufffdj\ufffd"}
I was using anyjson to deserialize that and it said:
File "/usr/local/lib/python2.7/dist-packages/anyjson/__init__.py", line 135, in loads
return implementation.loads(value)
File "/usr/local/lib/python2.7/dist-packages/anyjson/__init__.py", line 99, in loads
return self._decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 454, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 374, in decode
obj, end = self.raw_decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 393, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
ValueError: ('utf8', "D\xe9j\xe0", 1, 2, 'invalid continuation byte')
JSON should almost always be in Unicode (when encoded), and if you're writing a webserver, UTF-8. The following, in Python 3, is basically correct:
In [1]: import json
In [2]: val = json.dumps({"name":"Déjà"})
In [3]: val
Out[3]: '{"name": "D\\u00e9j\\u00e0"}'
A closer look:
'{"name": "D\\u00e9j\\u00e0"}'
            ^^^^^^^
The text \u00e9, which in JSON means "é".
The backslash is doubled because we're looking at the repr of a str.
You can then send val to the client, and in JavaScript, JSON.parse should give you the right result.
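In Python 2, which the question appears to use, the equivalent is to hand json.dumps a unicode object, or to tell it the real encoding of your byte strings. A sketch, assuming the incoming bytes are actually UTF-8 (which is what the D\u00c3\u00a9 mojibake in the question suggests; raw_name is a hypothetical example value):
import json

raw_name = 'D\xc3\xa9j\xc3\xa0'                           # UTF-8 bytes for the accented name

print json.dumps({'name': raw_name.decode('utf-8')})      # {"name": "D\u00e9j\u00e0"}
print json.dumps({'name': raw_name}, encoding='utf-8')    # same result; json.dumps decodes the bytes itself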
Because you mentioned "when somebody signs up": that implies data coming from the client (web browser) to you. How is that data being sent? Which library/libraries are you using to write your webserver?
Turns out this was mainly an issue in how I was doing the encoding from the Android side.
I am now setting the StringEntity this way in Android and it's working now:
StringEntity se = new StringEntity(obj.toString(), "UTF-8");
se.setContentType("application/json;charset=UTF-8");
se.setContentEncoding( new BasicHeader(HTTP.CONTENT_TYPE, "application/json"));
Also, I was using anyjson on the server, which was delegating to simplejson; this was creating errors at times as well. I switched to Python's built-in json library.
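On the server side, the decoding step with the stdlib json module might then look roughly like this (raw_body is a hypothetical stand-in for the UTF-8 request body the Android client now sends):
import json

raw_body = '{"name": "D\xc3\xa9j\xc3\xa0"}'   # UTF-8 bytes on the wire
data = json.loads(raw_body)                   # the stdlib json module decodes UTF-8 byte strings by default
name = data[u'name']                          # -> u'D\xe9j\xe0', i.e. the accented name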
I try to read an email from a file, like this:
import email
with open("xxx.eml") as f:
msg = email.message_from_file(f)
and I get this error:
Traceback (most recent call last):
File "I:\fakt\real\maildecode.py", line 53, in <module>
main()
File "I:\fakt\real\maildecode.py", line 50, in main
decode_file(infile, outfile)
File "I:\fakt\real\maildecode.py", line 30, in decode_file
msg = email.message_from_file(f) #, policy=mypol
File "C:\Python33\lib\email\__init__.py", line 56, in message_from_file
return Parser(*args, **kws).parse(fp)
File "C:\Python33\lib\email\parser.py", line 55, in parse
data = fp.read(8192)
File "C:\Python33\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1920: character maps to <undefined>
The file contains a multipart email, where the part is encoded in UTF-8. The file's content or encoding might be broken, but I have to handle it anyway.
How can I read the file, even if it has Unicode errors? I cannot find the policy object compat32, and there seems to be no way to handle an exception and let Python continue right where the exception occurred.
What can I do?
To parse an email message in Python 3 without unicode errors, read the file in binary mode and use the email.message_from_binary_file(f) (or email.message_from_bytes(f.read())) method to parse the content (see the documentation of the email.parser module).
Here is code that parses a message in a way that is compatible with Python 2 and 3:
import email
with open("xxx.eml", "rb") as f:
try:
msg = email.message_from_binary_file(f) # Python 3
except AttributeError:
msg = email.message_from_file(f) # Python 2
(tested with Python 2.7.13 and Python 3.6.0)
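Once the message is parsed, pulling the decoded text out of the multipart message might look something like this (a sketch; the text/plain structure and the 'replace' error handling are my assumptions, not part of the answer above):
for part in msg.walk():
    if part.get_content_type() == 'text/plain':
        charset = part.get_content_charset() or 'utf-8'   # fall back to UTF-8 if the part declares none
        body = part.get_payload(decode=True).decode(charset, 'replace')
        print(body)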
I can't test on your message, so I don't know if this will actually work, but you can do the string decoding yourself:
with open("xxx.eml", encoding='utf-8', errors='replace') as f:
text = f.read()
msg = email.message_from_string(f)
That's going to get you a lot of replacement characters if the message isn't actually in UTF-8. But if it's got \x81 in it, UTF-8 is my guess.
with open('email.txt', 'rb') as f:
    ascii_text = f.read().decode('ascii', 'backslashreplace')  # escape non-ASCII bytes (Python 3.5+)
with open('email.txt', 'w') as f:
    f.write(ascii_text)
# now do your processing stuff
I doubt it is the best way to handle this ... but it's at least a way ...
A method which works on Python 3: it finds the encoding and reloads the file with the correct one.
import email

msg = email.message_from_file(open('file.eml', errors='replace'))
codes = [x for x in msg.get_charsets() if x is not None]
if len(codes) >= 1:
    msg = email.message_from_file(open('file.eml', encoding=codes[0]))
I have tried msg.get_charset(), but it sometimes returns None while another encoding is available, hence the slightly involved encoding detection.
For example, I have a file a.js whose content is:
Hello, 你好, bye.
This contains two Chinese characters whose Unicode escape form is \u4f60\u597d.
I want to write a Python program which converts the Chinese characters in a.js to their Unicode escape form, producing b.js, whose content should be: Hello, \u4f60\u597d, bye.
My code:
fp = open("a.js")
content = fp.read()
fp.close()
fp2 = open("b.js", "w")
result = content.decode("utf-8")
fp2.write(result)
fp2.close()
but it seems that the Chinese characters are still written out as the characters themselves, not as the ASCII escape sequences I want.
>>> print u'Hello, 你好, bye.'.encode('unicode-escape')
Hello, \u4f60\u597d, bye.
But you should consider using JSON, via the json module.
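Applied to the a.js to b.js conversion in the question, a minimal sketch (Python 2, matching the answer above, and assuming a.js holds a single UTF-8 line like the example):
with open('a.js') as fp:
    content = fp.read().rstrip('\n')      # strip the trailing newline; unicode-escape would escape it too

escaped = content.decode('utf-8').encode('unicode-escape')

with open('b.js', 'w') as fp2:
    fp2.write(escaped + '\n')             # b.js now contains: Hello, \u4f60\u597d, bye.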
You can try the codecs module:
codecs.open(filename, mode[, encoding[, errors[, buffering]]])
a = codecs.open("a.js", "r", "cp936").read() # a is a unicode object
codecs.open("b.js", "w", "utf16").write(a)
There are two ways you can use.
First, use the 'encode' method:
str1 = "Hello, 你好, bye. "
print(str1.encode("raw_unicode_escape"))
print(str1.encode("unicode_escape"))
You can also use the 'codecs' module:
import codecs
print(codecs.raw_unicode_escape_encode(str1))
I found that repr(content.decode("utf-8")) will return "u'Hello, \u4f60\u597d, bye'"
so repr(content.decode("utf-8"))[2:-1] will do the job
You can use repr:
a = u"Hello, 你好, bye. "
print repr(a)[2:-1]
or you can use the encode method:
print a.encode("raw_unicode_escape")
print a.encode("unicode_escape")