I have some json files created by powershell using the ConvertTo-Json command. The content of the json file looks like
{
"Key1": "Value1",
"Key2": "Value2"
}
I ran the python interpreter to see if I could read the file but I get this weird output
>>> f=open('test.json', 'r')
>>> f.read()
'ÿ\xfe{\x00\n\x00\n\x00 \x00 \x00 \x00 \x00"\x00K\x00e\x00y\x001\x00"\x00:\x00 \x00 \x00"\x00V\x00a\x00l\x00u\x00e\x001\x00"\x00,\x00\n\x00\n\x00 \x00 \x00 \x00 \x00"\x00K\x00e\x00y\x002\x00"\x00:\x00 \x00 \x00"\x00V\x00a\x00l\x00u\x00e\x002\x00"\x00\n\x00\n\x00}\x00\n\x00\n\x00'
For some reason all the characters are escaped byte characters, and there's a weird ÿ at the beginning (a PowerShell error?).
The weird thing is this:
>>> f=open('test.json', 'r')
>>> str=f.read()
>>> type(str)
<class 'str'>
>>> json.loads(str)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Rutvik_Choudhary\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Users\Rutvik_Choudhary\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Users\Rutvik_Choudhary\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So the input is a string, but the json module can't parse it (json.load(f) returns the same error). What is causing this error? Is it a Python thing, a PowerShell thing, a JSON thing?
As pointed out by jwodder, PowerShell has encoded your JSON using UTF-16LE. To get this data into the json module correctly, you need to open the file with the right encoding, e.g.:
import json

with open("test.json", "r", encoding="utf16") as f:
    json_string = f.read()

my_dict = json.loads(json_string)
You don't need to tell Python which variant of UTF-16 is being used; that is the purpose of the first two bytes of the text file, the Byte Order Mark (BOM). It lets a program know whether UTF-16LE or UTF-16BE was used to encode the file.
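If you're curious, you can peek at those first two bytes yourself. A minimal sketch (assuming the file really does start with a BOM; codecs defines the standard BOM constants):

import codecs

with open("test.json", "rb") as f:
    start = f.read(2)

if start == codecs.BOM_UTF16_LE:
    print("UTF-16 little-endian")  # 0xFF 0xFE, the 'ÿþ' in your output
elif start == codecs.BOM_UTF16_BE:
    print("UTF-16 big-endian")     # 0xFE 0xFF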
It seems that you have a BOM at the start of your file. You can verify it in a hex editor or with a good text editor (Notepad++ shows whether a BOM is present).
If you want to load text files with a Unicode BOM header, like yours, you can use the codecs.open function instead of plain open, since open with the default encoding will not interpret the BOM.
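For example (a sketch along the same lines as the answer above, reusing the test.json name):

import codecs
import json

# codecs.open decodes the stream for you, consuming the BOM in the process.
with codecs.open("test.json", "r", encoding="utf-16") as f:
    my_dict = json.load(f)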
Or you can have a look at tendo.unicode, a small library I wrote that can make life easier for people who are not used to working with Unicode text.
Related
TL;DR: How can JSON containing a regex with escaped backslashes be loaded using Python's JSON decoder?
Detail; The regular expression \\[0-9]\\ will match (for example):
\2\
The same regular expression could be encoded as a JSON value:
{
"pattern": "\\[0-9]\\"
}
And in turn, the JSON value could be encoded as a string in Python (note the single quotes):
'{"pattern": "\\[0-9]\\"}'
When loading the JSON in Python, a JSONDecodeError is raised:
import json
json.loads('{"pattern": "\\[0-9]\\"}')
The problem is caused by the escaped backslashes in the regular expression:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 14 (char 13)
This surprised me since each step seems reasonable (i.e. valid regex, valid JSON, and valid Python).
How can JSON containing a regex with escaped backslashes be loaded using Python's JSON decoder?
What's happening is that Python first processes the argument to loads as a string literal, turning each double backslash into a single backslash and producing '{"pattern": "\[0-9]\"}'. loads then tries to interpret \[ as a JSON escape sequence, which is invalid. To fix this, escape the backslashes again; however, it's easier and more practical to pass a raw string:
>>> import json
>>> json.loads('{"pattern": "\\[0-9]\\"}')
json.decoder.JSONDecodeError: Invalid \escape: line 1 column 14 (char 13)
>>> json.loads(r'{"pattern": "\\[0-9]\\"}')
{'pattern': '\\[0-9]\\'} # No error
Note that this problem won't apply if loading from a file.
test.json:
{"pattern": "\\[0-9]\\"}
Python:
import json
with open('test.json', 'r') as infile:
    json.load(infile)  # no problem
Basically, the problem arises from the fact that you're passing in a string literal, but ironically, your string literal isn't being taken literally.
The r prefix means that the string is treated as a raw string, in which backslashes are kept as typed rather than starting escape sequences:
json.loads(r'{"pattern": "\\[0-9]\\"}')
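To see the difference concretely, compare what each literal actually contains:

s = '{"pattern": "\\[0-9]\\"}'     # escape processing: each \\ becomes one \
rs = r'{"pattern": "\\[0-9]\\"}'   # raw string: backslashes kept as typed
print(s)                 # {"pattern": "\[0-9]\"} -- an invalid JSON escape
print(rs)                # {"pattern": "\\[0-9]\\"} -- valid JSON
print(len(rs) - len(s))  # 2: the raw literal keeps two extra backslashes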
I have a huge CSV file, mostly in UTF-8 encoding, but some columns use an encoding that differs from the main file's encoding. It looks like:
input.txt in UTF-8 encoding:
a,b,c
d,"e?",f
g,h,"kü"
the same input.txt in Windows-1252:
a,b,c
d,"eü",f
g,h,"kü
Code:
import csv
file = open("input.txt",encoding="...")
c = csv.reader(file, delimiter=';', quotechar='"')
for itm in c:
    print(itm)
The standard Python 3 csv reader raises an encoding error on such lines. I can't simply skip those lines, because I need the always correctly encoded "someOther" column from them.
Is it possible, using the standard csv reader, to split the CSV data in some "bytes mode" and then convert each array element to a normal Python unicode string, or should I implement my own csv reader?
Traceback:
Traceback (most recent call last):
File "C:\Development\t.py", line 7, in <module>
for itm in c:
File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 11: invalid start byte
How sure are you that your file is UTF8 encoded?
For the small sample that you've posted UTF8 decoding fails on the ü which is "LATIN SMALL LETTER U WITH DIAERESIS". When encoded as ISO-8859-1, ü is '\xfc'. Two other possibilities are that the CSV file is UTF-16 encoded (UTF-16 little endian is common on Windows), or even Windows-1252.
Note that ü is encoded as 0xfc across much of the ISO-8859-X family: ISO-8859-1/3/4/9/10/14/15/16 all encode ü as 0xfc.
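You can verify this in an interpreter:

# 0xfc is ü under Latin-1, but it is not a valid UTF-8 start byte.
print(b'\xfc'.decode('iso-8859-1'))  # ü
b'\xfc'.decode('utf-8')              # raises UnicodeDecodeError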
To solve, use the correct encoding and open the file like this:
file = open("input.txt", encoding="iso-8859-1")
or, for Windows 1252:
file = open("input.txt", encoding="windows-1252")
or, for UTF-16:
file = open("input.txt", encoding="utf-16") # or utf-16-le or utf-16-be as required
I'm being passed some JSON and am having trouble parsing it.
The object is currently simple with a single key/value pair. The key works fine but the value \d causes issues.
This is coming from an HTML form, via JavaScript. All of the below are literals.
HTML: \d
JavaScript: {'Key': '\d'}
JSON: {"Key": "\\d"}
json.loads() doesn't seem to like JSON in this format. A quick sanity check that I'm not doing anything silly works fine:
>>> import json
>>> json.loads('{"key":"value"}')
{'key': 'value'}
Since I'm declaring this string in Python, it should be escaped down to a literal va\\lue, which, when parsed as JSON, should become va\lue.
>>> json.loads('{"key":"va\\\\lue"}')
{'key': 'va\\lue'}
In case python wasn't escaping the string on the way in, I thought I'd check without the doubling...
>>> json.loads('{"key":"va\\lue"}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python33\lib\json\__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "C:\Python33\lib\json\decoder.py", line 352, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Python33\lib\json\decoder.py", line 368, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 11 (char 10)
but it fails, as expected.
I can't see any way to parse a JSON field that should contain a single backslash after all the unescaping has taken place.
How can I get Python to deserialize the string literal {"a":"val\\ue"} (which is valid JSON) into the appropriate Python representation, {'a': 'val\ue'}?
As an aside, it doesn't help that PyDev is inconsistent with what representation of a string it uses. The watch window shows double backslashes, the tooltip of the variable shows quadruple backslashes. I assume that's the "If you were to type the string, this is what you'd have to use for it to escape to the original" representation, but it's by no means clear.
Edit to follow on from #twalberg's answer...
>>> input={'a':'val\ue'}
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 3-5: truncated \uXXXX escape
>>> input={'a':'val\\ue'}
>>> input
{'a': 'val\\ue'}
>>> json.dumps(input)
'{"a": "val\\\\ue"}'
>>> json.loads(json.dumps(input))
{'a': 'val\\ue'}
>>> json.loads(json.dumps(input))['a']
'val\\ue'
Using json.dumps() to see how json would represent your target string:
>>> orig = { 'a' : 'val\ue' }
>>> jstring = json.dumps(orig)
>>> print jstring
{"a": "val\\ue"}
>>> extracted = json.loads(jstring)
>>> print extracted
{u'a': u'val\\ue'}
>>> print extracted['a']
val\ue
>>>
This was in Python 2.7.3, though, so it may be only partially relevant to your Python 3.x environment. Still, I don't think JSON has changed that much...
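For reference, here is a minimal sketch of the same round trip in Python 3 (the behaviour is unchanged, only the print syntax differs):

import json

orig = {'a': 'val\\ue'}          # one real backslash; \u must be escaped in a literal
jstring = json.dumps(orig)       # serialized text: {"a": "val\\ue"}
extracted = json.loads(jstring)
print(extracted['a'])            # val\ue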
I'm trying to use the following code (within web2py) to read a csv file and convert it into a json object:
import csv
import json
from StringIO import StringIO  # Python 2, per the traceback below

originalfilename, file_stream = db.tablename.file.retrieve(info.file)
file_contents = file_stream.read()
csv_reader = csv.DictReader(StringIO(file_contents))
json = json.dumps([x for x in csv_reader])
This produces the following error:
'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Apparently, there is a problem handling the spaces in the .csv file. The problem appears to stem from the json.dumps() line. The traceback from that point on:
Traceback (most recent call last):
File ".../web2py/gluon/restricted.py", line 212, in restricted
exec ccode in environment
File ".../controllers/default.py", line 2345, in <module>
File ".../web2py/gluon/globals.py", line 194, in <lambda>
self._caller = lambda f: f()
File ".../web2py/gluon/tools.py", line 3021, in f
return action(*a, **b)
File ".../controllers/default.py", line 697, in generate_vis
request.vars.json = json.dumps(list(csv_reader))
File "/usr/local/lib/python2.7/json/__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "/usr/local/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Any suggestions regarding how to resolve this, or another way to get a csv file (which contains a header; using StringIO) into a json object that won't produce similar complications? Thank you.
The csv module (under Python 2) is purely byte-based; all strings you get out of it are bytes. However JSON is Unicode character-based, so there is an implicit conversion when you try to write out the bytes you got from CSV into JSON. Python guessed UTF-8 for this, but your CSV file wasn't UTF-8 - it was probably Windows code page 1252 (Western European - like ISO-8859-1 only not quite).
A quick fix would be to transcode your input (file_contents = file_contents.decode('windows-1252').encode('utf-8')), but probably you don't really want to rely on json guessing a particular encoding.
Best would be to explicitly decode your strings at the point of reading them from CSV; then JSON will be able to cope with them. Unfortunately csv doesn't have built-in decoding (at least in this Python version), but you can do it manually:
class UnicodeDictReader(csv.DictReader):
    def __init__(self, f, encoding, *args, **kwargs):
        csv.DictReader.__init__(self, f, *args, **kwargs)
        self.encoding = encoding

    def next(self):
        # Decode every key and value from bytes to unicode as each row is read
        return {
            k.decode(self.encoding): v.decode(self.encoding)
            for (k, v) in csv.DictReader.next(self).items()
        }

csv_reader = UnicodeDictReader(StringIO(file_contents), 'windows-1252')
json_output = json.dumps(list(csv_reader))
"it's not known in advance what sort of encoding will come up"
Well that's more of a problem, since it's impossible to guess accurately what encoding a file is in. You would either have to specific a particular encoding, or give the user a way to signal what the encoding is, if you want to support non-ASCII characters properly.
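If you do have to guess, one possibility (an assumption on my part, not part of the answer above: the third-party chardet package) is to let a detector estimate the encoding from the raw bytes:

import chardet  # third-party: pip install chardet

guess = chardet.detect(file_contents)  # file_contents is a byte string
print guess                            # e.g. {'encoding': 'windows-1252', 'confidence': 0.73}
decoded = file_contents.decode(guess['encoding'])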
Try replacing your final line with one that decodes each field before dumping (each x produced by DictReader is a dict, so it can't be encoded directly):
json = json.dumps([{k: v.decode('windows-1252') for k, v in x.items()} for x in csv_reader])
Running unidecode over the file contents seems to do the trick:
from isounidecode import unidecode
...
file_contents = unidecode(file_stream.read())
...
Thanks, everyone!
I am copying strings containing the word cafe (but with an accented e) from a JavaScript source file into a Python script, where I need to do some processing over the data and then output some JSON. I am having some trouble getting my head around the encoding/decoding details though. This is perhaps best illustrated with an example:
$ python
>>> import urllib2, json
>>> the_name = "Tasty Caf%C3%E9"
>>> the_name
'Tasty Caf%C3%E9'
>>> the_name_unquoted = urllib2.unquote(the_name)
>>> the_name_unquoted
'Tasty Caf\xc3\xe9'
>>> json.dumps({'bla': the_name_unquoted})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/lib/python2.7/json/encoder.py", line 201, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/lib/python2.7/json/encoder.py", line 264, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 9: invalid continuation byte
I've spent some time trying to understand how encodings work, though clearly I'm not getting it. Exactly what encoding/format (is there more appropriate terminology here?) is the_name_unquoted in above, and what is it about it that UTF-8 cannot decode correctly?
That's because that character is only representable in Unicode. You can fix this by converting the string to unicode.
the_name = u'Tasty Caf%C3%E9'
Alternatively, if the string already exists, you can convert it:
the_name = 'Tasty Caf%C3%E9'
the_name = unicode(the_name)
# or..
the_name = the_name.decode('utf8')  # the optional second argument must be an error handler name such as 'replace'
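As a footnote, %C3%E9 is not a valid UTF-8 sequence for é in the first place: the second byte of a two-byte UTF-8 sequence must fall in the 0x80-0xBF range, and 0xE9 does not, which is exactly the "invalid continuation byte" the traceback complains about. With the correct percent-encoding, %C3%A9, the round trip works (a sketch):

import urllib2
import json

good = urllib2.unquote("Tasty Caf%C3%A9")        # '\xc3\xa9' is valid UTF-8 for é
print json.dumps({'bla': good.decode('utf-8')})  # {"bla": "Tasty Caf\u00e9"}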