Parse JSON containing "\u as UTF-8 bytes" in Python

Parse JSON containing "\u as UTF-8 bytes" in Python - python

I have a JSON file from the Facebook's "Download your data" feature and instead of escaping Unicode characters as their codepoint number, it's escaped just as a sequence of UTF-8 bytes.
For example, the letter á (U+00E1) is escaped in the JSON file as \u00c3\u00a1 instead of \u00e1. 0xC3 0xA1 is UTF-8 encoding for U+00E1.
The json library in Python 3 decodes it as Ã¡ which corresponds to U+00C3 and U+00A1.
Is there a way to parse such a file correctly (so that I get the letter á) in Python?

It seems they encoded their Unicode string into bytes using utf-8 then transformed the bytes into JSON. This is very bad behaviour from them.
Python 3 example:
>>> '\u00c3\u00a1'.encode('latin1').decode('utf-8')
'á'
You need to parse the JSON and walk the entire data to fix it:
def visit_list(l):
return [visit(item) for item in l]
def visit_dict(d):
return {visit(k): visit(v) for k, v in d.items()}
def visit_str(s):
return s.encode('latin1').decode('utf-8')
def visit(node):
funcs = {
list: visit_list,
dict: visit_dict,
str: visit_str,
}
func = funcs.get(type(node))
if func:
return func(node)
else:
return node
incorrect = '{"foo": ["\u00c3\u00a1", 123, true]}'
correct_obj = visit(json.loads(incorrect))

Related

How to get UTF-16 (decimal) in Python?

I have Unicode Code Point of an emoticon represented as U+1F498:
emoticon = u'\U0001f498'
I would like to get utf-16 decimal groups of this character, which according to this website are 55357 and 56472.
I tried to do print emoticon.encode("utf16") but did not help me at all because it gives some other characters.
Also, trying to decode from UTF-8 before encode it to UTF-16 as follow print str(int("0001F498", 16)).decode("utf-8").encode("utf16") does not help either.
How do I correctly get the utf-16 decimal groups of a unicode character?

You can encode the character with the utf-16 encoding, and then convert every 2 bytes of the encoded data to integers with int.from_bytes (or struct.unpack in python 2).
Python 3
def utf16_decimals(char, chunk_size=2):
# encode the character as big-endian utf-16
encoded_char = char.encode('utf-16-be')
# convert every `chunk_size` bytes to an integer
decimals = []
for i in range(0, len(encoded_char), chunk_size):
chunk = encoded_char[i:i+chunk_size]
decimals.append(int.from_bytes(chunk, 'big'))
return decimals
Python 2 + Python 3
import struct
def utf16_decimals(char):
# encode the character as big-endian utf-16
encoded_char = char.encode('utf-16-be')
# convert every 2 bytes to an integer
decimals = []
for i in range(0, len(encoded_char), 2):
chunk = encoded_char[i:i+2]
decimals.append(struct.unpack('>H', chunk)[0])
return decimals
Result:
>>> utf16_decimals(u'\U0001f498')
[55357, 56472]

In a Python 2 "narrow" build, it is as simple as:
>>> emoticon = u'\U0001f498'
>>> map(ord,emoticon)
[55357, 56472]
This works in Python 2 (narrow and wide builds) and Python 3:
from __future__ import print_function
import struct
emoticon = u'\U0001f498'
print(struct.unpack('<2H',emoticon.encode('utf-16le')))
Output:
(55357, 56472)
This is a more general solution that prints the UTF-16 code points for any length of string:
from __future__ import print_function,division
import struct
def utf16words(s):
encoded = s.encode('utf-16le')
num_words = len(encoded) // 2
return struct.unpack('<{}H'.format(num_words),encoded)
emoticon = u'ABC\U0001f498'
print(utf16words(emoticon))
Output:
(65, 66, 67, 55357, 56472)

How to convert a string to a base64 format standard RFC-2045 MIME?

I'm trying to convert a string to base64 format standard RFC-2045.
My code is
import base64
auth_header = base64.b64encode('user:abcd'.decode('ascii'))
Don't know base64 standard whether it using rfc-2045

Base64 encoding in python: It provides encoding and decoding functions for the encodings specified in RFC 3548, which defines the Base16, Base32, and Base64 algorithms, and for the de-facto standard Ascii85 and Base85 encodings.
RFC-2045 MIME is used by email.
Refer :
encoding-of-headers-in-mimetext
Docs :
email.mime: Creating email and MIME objects from scratch
I think the solution you were looking for is a code for Base64 encoding with RFC 2045, check this:
# Copyright (C) 2002-2006 Python Software Foundation
# Author: Ben Gertzfield
# Contact: email-sig#python.org
"""Base64 content transfer encoding per RFCs 2045-2047.
This module handles the content transfer encoding method defined in RFC 2045
to encode arbitrary 8-bit data using the three 8-bit bytes in four 7-bit
characters encoding known as Base64.
It is used in the MIME standards for email to attach images, audio, and text
using some 8-bit character sets to messages.
This module provides an interface to encode and decode both headers and bodies
with Base64 encoding.
RFC 2045 defines a method for including character set information in an
`encoded-word' in a header. This method is commonly used for 8-bit real names
in To:, From:, Cc:, etc. fields, as well as Subject: lines.
This module does not do the line wrapping or end-of-line character conversion
necessary for proper internationalized headers; it only does dumb encoding and
decoding. To deal with the various line wrapping issues, use the email.Header
module.
"""
__all__ = [
'base64_len',
'body_decode',
'body_encode',
'decode',
'decodestring',
'encode',
'encodestring',
'header_encode',
]
from binascii import b2a_base64, a2b_base64
from email.utils import fix_eols
CRLF = '\r\n'
NL = '\n'
EMPTYSTRING = ''
# See also Charset.py
MISC_LEN = 7
# Helpers
def base64_len(s):
"""Return the length of s when it is encoded with base64."""
groups_of_3, leftover = divmod(len(s), 3)
# 4 bytes out for each 3 bytes (or nonzero fraction thereof) in.
# Thanks, Tim!
n = groups_of_3 * 4
if leftover:
n += 4
return n
def header_encode(header, charset='iso-8859-1', keep_eols=False,
maxlinelen=76, eol=NL):
"""Encode a single header line with Base64 encoding in a given charset.
Defined in RFC 2045, this Base64 encoding is identical to normal Base64
encoding, except that each line must be intelligently wrapped (respecting
the Base64 encoding), and subsequent lines must start with a space.
charset names the character set to use to encode the header. It defaults
to iso-8859-1.
End-of-line characters (\\r, \\n, \\r\\n) will be automatically converted
to the canonical email line separator \\r\\n unless the keep_eols
parameter is True (the default is False).
Each line of the header will be terminated in the value of eol, which
defaults to "\\n". Set this to "\\r\\n" if you are using the result of
this function directly in email.
The resulting string will be in the form:
"=?charset?b?WW/5ciBtYXp66XLrIHf8eiBhIGhhbXBzdGHuciBBIFlv+XIgbWF6euly?=\\n
=?charset?b?6yB3/HogYSBoYW1wc3Rh7nIgQkMgWW/5ciBtYXp66XLrIHf8eiBhIGhh?="
with each line wrapped at, at most, maxlinelen characters (defaults to 76
characters).
"""
# Return empty headers unchanged
if not header:
return header
if not keep_eols:
header = fix_eols(header)
# Base64 encode each line, in encoded chunks no greater than maxlinelen in
# length, after the RFC chrome is added in.
base64ed = []
max_encoded = maxlinelen - len(charset) - MISC_LEN
max_unencoded = max_encoded * 3 // 4
for i in range(0, len(header), max_unencoded):
base64ed.append(b2a_base64(header[i:i+max_unencoded]))
# Now add the RFC chrome to each encoded chunk
lines = []
for line in base64ed:
# Ignore the last character of each line if it is a newline
if line.endswith(NL):
line = line[:-1]
# Add the chrome
lines.append('=?%s?b?%s?=' % (charset, line))
# Glue the lines together and return it. BAW: should we be able to
# specify the leading whitespace in the joiner?
joiner = eol + ' '
return joiner.join(lines)
def encode(s, binary=True, maxlinelen=76, eol=NL):
"""Encode a string with base64.
Each line will be wrapped at, at most, maxlinelen characters (defaults to
76 characters).
If binary is False, end-of-line characters will be converted to the
canonical email end-of-line sequence \\r\\n. Otherwise they will be left
verbatim (this is the default).
Each line of encoded text will end with eol, which defaults to "\\n". Set
this to "\r\n" if you will be using the result of this function directly
in an email.
"""
if not s:
return s
if not binary:
s = fix_eols(s)
encvec = []
max_unencoded = maxlinelen * 3 // 4
for i in range(0, len(s), max_unencoded):
# BAW: should encode() inherit b2a_base64()'s dubious behavior in
# adding a newline to the encoded string?
enc = b2a_base64(s[i:i + max_unencoded])
if enc.endswith(NL) and eol != NL:
enc = enc[:-1] + eol
encvec.append(enc)
return EMPTYSTRING.join(encvec)
# For convenience and backwards compatibility w/ standard base64 module
body_encode = encode
encodestring = encode
def decode(s, convert_eols=None):
"""Decode a raw base64 string.
If convert_eols is set to a string value, all canonical email linefeeds,
e.g. "\\r\\n", in the decoded text will be converted to the value of
convert_eols. os.linesep is a good choice for convert_eols if you are
decoding a text attachment.
This function does not parse a full MIME header value encoded with
base64 (like =?iso-8895-1?b?bmloISBuaWgh?=) -- please use the high
level email.Header class for that functionality.
"""
if not s:
return s
dec = a2b_base64(s)
if convert_eols:
return dec.replace(CRLF, convert_eols)
return dec
# For convenience and backwards compatibility w/ standard base64 module
body_decode = decode
decodestring = decode
Solution:
print(encode('user:abcd'))
Output:
dXNlcjphYmNk
Reference : base64mime.py

Reading UTF-8 strings from a binary file

I have some files which contains a bunch of different kinds of binary data and I'm writing a module to deal with these files.
Amongst other, it contains UTF-8 encoded strings in the following format: 2 bytes big endian stringLength (which I parse using struct.unpack()) and then the string. Since it's UTF-8, the length in bytes of the string may be greater than stringLength and doing read(stringLength) will come up short if the string contains multi-byte characters (not to mention messing up all the other data in the file).
How do I read n UTF-8 characters (distinct from n bytes) from a file, being aware of the multi-byte properties of UTF-8? I've been googling for half an hour and all the results I've found are either not relevant or makes assumptions that I cannot make.

Given a file object, and a number of characters, you can use:
# build a table mapping lead byte to expected follow-byte count
# bytes 00-BF have 0 follow bytes, F5-FF is not legal UTF8
# C0-DF: 1, E0-EF: 2 and F0-F4: 3 follow bytes.
# leave F5-FF set to 0 to minimize reading broken data.
_lead_byte_to_count = []
for i in range(256):
_lead_byte_to_count.append(
1 + (i >= 0xe0) + (i >= 0xf0) if 0xbf < i < 0xf5 else 0)
def readUTF8(f, count):
"""Read `count` UTF-8 bytes from file `f`, return as unicode"""
# Assumes UTF-8 data is valid; leaves it up to the `.decode()` call to validate
res = []
while count:
count -= 1
lead = f.read(1)
res.append(lead)
readcount = _lead_byte_to_count[ord(lead)]
if readcount:
res.append(f.read(readcount))
return (''.join(res)).decode('utf8')
Result of a test:
>>> test = StringIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
>>> readUTF8(test, 41)
u'This is a test containing Unicode data: \ua000'
In Python 3, it is of course much, much easier to just wrap the file object in a io.TextIOWrapper() object and leave decoding to the native and efficient Python UTF-8 implementation.

One character in UTF-8 can be 1byte,2bytes,3byte3.
If you have to read your file byte by byte, you have to follow the UTF-8 encoding rules. http://en.wikipedia.org/wiki/UTF-8
Most the time, you can just set the encoding to utf-8, and read the input stream.
You do not need to care how much bytes you have read.

Decoding JSON array with unicode character

I have the following JSON array:
[u'steve#gmail.com']
"u" is apparently the unicode character, and it was automatically created by Python. Now, I want to bring this back into Objective-C and decode it into an array using this:
+(NSMutableArray*)arrayFromJSON:(NSString*)json
{
if(!json) return nil;
NSData *jsonData = [json dataUsingEncoding:NSUTF8StringEncoding];
//I've also tried NSUnicodeStringEncoding here, same thing
NSError *e;
NSMutableArray *result= [NSJSONSerialization JSONObjectWithData:jsonData options:NSJSONReadingMutableContainers error:&e];
if (e != nil) {
NSLog(#"Error:%#", e.description);
return nil;
}
return result;
}
However, I get an error: (Cocoa error 3840.)" (Invalid value around character 1.)
How do I remedy this?
Edit: Here's how I bring the entity from Python back into objective-c:
First I convert the entity to a dictionary:
def to_dict(self):
return dict((p, unicode(getattr(self, p))) for p in self.properties()
if getattr(self, p) is not None)
I add this dictionary to a list, set the value of my responseDict['entityList'] to this list, then self.response.out.write(json.dumps(responseDict))
However the result I get back still has that 'u' character.

[u'steve#gmail.com'] is the decoded python value of the array it is not valid JSON.
The valid JSON string data would be just ["steve#gmail.com"].
Dump the data from python back into a JSON string by doing:
import json
python_data = [u'steve#gmail.com']
json_string = json.dumps(data)
The u prefix on python string literals indicates that those strings are unicode rather than the default encoding in python2.X (ASCII).

Python: How do I compare unicode to ascii text?

I'm trying to convert characters in one list into characters in another list at the same index in Japanese (zenkaku to hangaku moji, for those interested), and I can't get the comparison to work. I am decoding into utf-8 before I compare (decoding into ascii broke the program), but the comparison doesn't ever return true. Does anyone know what I'm doing wrong? Here's the code (indents are a little wacky due to SO's editor):
#!C:\Python27\python.exe
# coding=utf-8
import os
import shutil
import sys
zk = [
'。',
'、',
'「',
'」',
'（',
'）',
'！',
'？',
'・',
'／',
'ア','イ','ウ','エ','オ',
'カ','キ','ク','ケ','コ',
'サ','シ','ス','セ','ソ',
'ザ','ジ','ズ','ゼ','ゾ',
'タ','チ','ツ','テ','ト',
'ダ','ヂ','ヅ','デ','ド',
'ラ','リ','ル','レ','ロ',
'マ','ミ','ム','メ','モ',
'ナ','ニ','ヌ','ネ','ノ',
'ハ','ヒ','フ','ヘ','ホ',
'バ','ビ','ブ','ベ','ボ',
'パ','ピ','プ','ペ','ポ',
'ヤ','ユ','ヨ','ヲ','ン','ッ'
]
hk = [
'｡',
'､',
'｢',
'｣',
'(',
')',
'!',
'?',
'･',
'/',
'ｱ','ｲ','ｳ','ｴ','ｵ',
'ｶ','ｷ','ｸ','ｹ','ｺ',
'ｻ','ｼ','ｽ','ｾ','ｿ',
'ｻﾞ','ｼﾞ','ｽﾞ','ｾﾞ','ｿﾞ',
'ﾀ','ﾁ','ﾂ','ﾃ','ﾄ',
'ﾀﾞ','ﾁﾞ','ﾂﾞ','ﾃﾞ','ﾄﾞ',
'ﾗ','ﾘ','ﾙ','ﾚ','ﾛ',
'ﾏ','ﾐ','ﾑ','ﾒ','ﾓ',
'ﾅ','ﾆ','ﾇ','ﾈ','ﾉ',
'ﾊ','ﾋ','ﾌ','ﾍ','ﾎ',
'ﾊﾞ','ﾋﾞ','ﾌﾞ','ﾍﾞ','ﾎﾞ',
'ﾊﾟ','ﾋﾟ','ﾌﾟ','ﾍﾟ','ﾎﾟ',
'ﾔ','ﾕ','ﾖ','ｦ','ﾝ','ｯ'
]
def main():
if len(sys.argv) > 1:
filename = sys.argv[1]
else:
print("Please specify a file to check.")
return
try:
f = open(filename, 'r')
except IOError as e:
print("Sorry! The file doesn't exist.")
return
filecontent = f.read()
f.close()
#y = zk[29]
#print y.decode('utf-8')
for f in filecontent:
for z in zk:
if f == z.decode('utf-8'):
print f
print filename
if __name__ == "__main__":
main()
Am I missing a step?

Several.
zk = [
u'。',
u'、',
u'「',
...
...
f = codecs.open(filename, 'r', encoding='utf-8')
...
I'll let you work out the rest now that the hard work's been done.

Make sure that zk and hk lists contain Unicode strings. Either use Unicode literals e.g., u'a' or decode them at runtime:
fromutf8 = lambda s: s.decode('utf-8') if not isinstance(s, unicode) else s
zk = map(fromutf8, zk)
hk = map(fromutf8, hk)
You could use unicode.translate() to convert characters in one list into characters in another list at the same index:
import codecs
translation_table = dict(zip(map(ord,zk), hk))
with codecs.open(sys.argv[1], encoding='utf-8') as f:
for line in f:
print line.translate(translation_table),

You need to convert everything to the same form, and the form is Unicode strings. Unicode strings have no encoding in the sense .encode() or .decode(). When having a non-unicode string, it is actually a stream of bytes that expresses the value in some encoding. When converting to Unicode, you have to .decode(). When storing Unicode string to a sequence of bytes, you have to .encode() the abstraction to concrete bytes.
This way, when loading Unicode strings from an UTF-8 encoded file, or you have to read it into the old strings (non Unicode, sequences of bytes) and then .decode('utf-8'), or you can use `codecs.open(..., encoding='utf-8') -- then you get Unicode strings automatically.
The form # coding=utf-8 is not the usual, but it is OK... if the editor (I mean the tool that you use to write the text) also thinks this way. Then the old strings are displayed by the editor correctly. In the case they should be .decode('utf-8')d to get Unicode. Old strings with ASCII characters only in the same source can also be converted to Unicode using the .decode('utf-8').
To summarize: you are de coding from bytes to Unicode, and you are en coding the Unicode strings into sequence of bytes. It seems from the question that you are doing the opposite.
The following is completely wrong:
for f in filecontent:
for z in zk:
if f == z.decode('utf-8'):
print f
because the filecontent is the result of f.read(). This way it is a sequence of bytes. The f in the loop is one byte. The z.decode('utf-8') returns one Unicode character. They cannot be compared. (By the way, the f is a kind of misleading name for a byte value.)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parse JSON containing "\u as UTF-8 bytes" in Python - python

Related

How to get UTF-16 (decimal) in Python?

How to convert a string to a base64 format standard RFC-2045 MIME?

Reading UTF-8 strings from a binary file

Decoding JSON array with unicode character

Python: How do I compare unicode to ascii text?

Categories

Resources