In Python, is there a way to extract an embedded JSON string?

So I'm parsing a really big log file with some embedded json.
So I'll see lines like this
foo="{my_object:foo, bar:baz}" a=b c=d
The problem is that the internal JSON can have spaces, but outside of the JSON, spaces act as tuple delimiters (except where they have unquoted strings; huzzah for whatever idiot thought that was a good idea), so I'm not sure how to figure out where the end of the JSON string is without reimplementing large portions of a JSON parser.
Is there a json parser for Python where I can give it '{"my_object":"foo", "bar":"baz"} asdfasdf', and it can return ({'my_object' : 'foo', 'bar':'baz'}, 'asdfasdf') or am I going to have to reimplement the json parser by hand?

Found a really cool answer. Use json.JSONDecoder's scan_once function
In [30]: import json
In [31]: d = json.JSONDecoder()
In [32]: my_string = 'key="{"foo":"bar"}"more_gibberish'
In [33]: d.scan_once(my_string, 5)
Out[33]: ({u'foo': u'bar'}, 18)
In [37]: my_string[18:]
Out[37]: '"more_gibberish'
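Just be careful with the start index. Starting one character late parses the inner string instead of the whole object: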
In [38]: d.scan_once(my_string, 6)
Out[38]: (u'foo', 11)
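scan_once is an undocumented attribute of the decoder. The documented entry point for the same job is JSONDecoder.raw_decode, which also returns the decoded value and the index where it ended. Below is a minimal sketch for Python 3. The helper name extract_json is made up for illustration, and passing a start index as the second argument relies on the idx parameter that CPython's raw_decode accepts.

```python
import json

def extract_json(text, start):
    """Decode one JSON value beginning at text[start]; return (value, rest)."""
    decoder = json.JSONDecoder()
    # raw_decode tolerates trailing data and reports where the value ended
    value, end = decoder.raw_decode(text, start)
    return value, text[end:]

line = 'key="{"foo":"bar"}"more_gibberish'
value, rest = extract_json(line, line.index("{"))
print(value)  # {'foo': 'bar'}
print(rest)   # "more_gibberish
```

Note that raw_decode does not skip leading whitespace, so start has to point at the first character of the JSON value.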

Match everything around it.
>>> import re
>>> re.search('^foo="(.*)" a=.+ c=.+$', 'foo="{my_object:foo, bar:baz}" a=b c=d').group(1)
'{my_object:foo, bar:baz}'

Use shlex and json.
Something like:
import shlex
import json
def decode_line(line):
    decoded = {}
    fields = shlex.split(line)
    for f in fields:
        k, v = f.split('=', 1)
        if k == "foo":
            v = json.loads(v)
        decoded[k] = v
    return decoded
This does assume that the JSON inside the quotes is quoted properly.
Here's a short example program that uses the above:
import pipes
testdict = {"hello": "world", "foo": "bar"}
line = 'foo=' + pipes.quote(json.dumps(testdict)) + ' a=b c=d'
print line
print decode_line(line)
With output:
foo='{"foo": "bar", "hello": "world"}' a=b c=d
{'a': 'b', 'c': 'd', 'foo': {u'foo': u'bar', u'hello': u'world'}}
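The example above is Python 2 (print statements, pipes.quote). On Python 3 the same idea can be sketched with shlex.quote, since the pipes module is deprecated and was removed in Python 3.13. The json_keys parameter is a hypothetical addition that names the fields expected to hold JSON.

```python
import json
import shlex

def decode_line(line, json_keys=("foo",)):
    """Split a key=value log line and JSON-decode the fields named in json_keys."""
    decoded = {}
    for field in shlex.split(line):
        key, value = field.split("=", 1)
        decoded[key] = json.loads(value) if key in json_keys else value
    return decoded

testdict = {"hello": "world", "foo": "bar"}
line = "foo=" + shlex.quote(json.dumps(testdict)) + " a=b c=d"
print(line)
print(decode_line(line))
```

As before, this only works when the JSON in the log line is shell-quoted properly.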

Related

How to correctly escape double quote (") inside a json string in Python

In the JSON file the double quotes are escaped; I am not sure what I am missing here.
import json
s = '{"title": "Fetching all Jobs from \"host_name\"."}'
j = json.loads(s)
print(j)
ValueError: Expecting , delimiter: line 1 column 36 (char 35)
Do you really need a string in the first place?
s = {"title": 'Fetching all Jobs from "host_name".'}
# If you want a string, then here
import json
j = json.dumps(s)
print(j)
The resulting value looks like this:
{"title": "Fetching all Jobs from \"host_name\"."}
>>> s2 = r'{"title": "Fetching all Jobs from \"host_name\"."}'
>>> json.loads(s2)
{'title': 'Fetching all Jobs from "host_name".'}
Using r strings will help you escape the inner quotes in the json string.
import json
s = r'{"title": "Fetching all Jobs from \"host_name\"."}'
j = json.loads(s)
print(j)
But I am not sure if this is best practice.
There are two ways I know of to handle it. The first is to escape the '\':
s = '{"title": "Fetching all Jobs from \\"host_name\\"."}'
The second is to use a raw string literal:
s = r'{"title": "Fetching all Jobs from \"host_name\"."}'
note the 'r' in front of the string.
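To see why the plain literal fails, here is a minimal sketch (Python 3 assumed; the variable names are illustrative). In a regular string literal Python consumes \" and stores a bare ", so the backslashes never reach the JSON parser. The raw literal keeps them.

```python
import json

plain = '{"title": "Fetching all Jobs from \"host_name\"."}'
raw = r'{"title": "Fetching all Jobs from \"host_name\"."}'

print(plain)  # the backslashes are already gone here
print(raw)    # the backslashes are preserved

try:
    json.loads(plain)
except ValueError as exc:
    print("plain literal fails:", exc)

print(json.loads(raw)["title"])
```

The problem only exists for JSON typed into Python source code. Text read from a file or a socket already contains the backslashes and needs no extra escaping.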
This will help you:
>>> import json
>>> s= json.dumps('{"title": "Fetching all Jobs from \"host_name\"."}')
>>> j=json.loads(s)
>>> print(j)
{"title": "Fetching all Jobs from "host_name"."}
If you use json this way, it might work for you:
import json
s = 'my string with "double quotes" and more'
json.dumps(s)
'"my string with \\"double quotes\\" and more"'

how to correctly save "\u**" to json in python

I have a dictionary:
data = {"data": "\u512b"}
When I dump that to JSON:
import json
print json.dumps(data)
I get: {"data": "\\u512b"}
What should I do to get exactly {"data": "\u512b"}?
NOTE: If I add u before the string so that it becomes u'\u512b', the extra \ no longer shows up. Please also tell me why.
You can do some hacking.
import json
data = {"data": "\u512b"}
s = json.dumps(data)
print(s.replace(r'\u', 'u'))
print(type(s.replace(r'\u', 'u')))
Output:
{"data": "\u512b"}
<type 'str'>
My guess is that you are just confused by the output of the Python interpreter, which displays the string generated by json.dumps with its own \ escape character prepended to the \ in the string. The JSON string itself contains exactly one \, as you want (IIUC):
>>> data = {"data": "\u512b"}
>>> data
{'data': '\u512b'}
>>> import json
>>> json.dumps(data)
'{"data": "\\u512b"}'
>>> print(json.dumps(data))
{"data": "\u512b"}
>>> json.dump(data, open('data.json', 'w'))
>>> ^Z
C:\opt\Console2>type data.json
{"data": "\u512b"}
This is entirely independent of JSON in fact, as the following example shows:
>>> s = "s\\u"
>>> s
's\\u'
>>> print(s, len(s)) # length of s is 3, not 4
s\u 3
HTH!
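The repr-versus-print point can also be checked programmatically. This is a small sketch assuming Python 3, where "\u512b" in a plain literal is a single character. The variable names are illustrative.

```python
import json

data = {"data": "\u512b"}
encoded = json.dumps(data)  # ensure_ascii=True by default

print(repr(encoded))        # repr doubles the backslash for display
print(encoded)              # the actual text has a single backslash
print(encoded.count("\\"))  # 1

# Round trip: the escape decodes back to the original character
assert json.loads(encoded) == data

# With ensure_ascii=False the character is written out unescaped
literal = json.dumps(data, ensure_ascii=False)
assert "\\" not in literal
```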

python how to print list of strings with double quotes

I have a list i.e.
my_list = ['a', 'b', 'c', 'd','e','f']
I want to print this out in a pipe or comma delimited format, but because some of my list objects can have commas or pipes, I would like to wrap double quotes around each object when I print
i.e. the output should be
"a"|"b"|"c"|"d"|"e"|"f" rather than a|b|c|d|e|f
I can't use format on my version of Python.
Create a generator that formats each element, then unpack it and use a custom separator. If you are using Python 2, import the print() function first (this can be safely done in Python 3 as well):
>>> from __future__ import print_function
>>> print(*('"{}"'.format(item) for item in my_list), sep='|')
"a"|"b"|"c"|"d"|"e"|"f"
Don't do this yourself. You'll trip yourself trying to handle all the corner cases. (What if your fields can have double quotes in them?) Use the csv module instead:
import csv
from io import StringIO  # on Python 2: from StringIO import StringIO
s = StringIO()
writer = csv.writer(s, delimiter="|")
writer.writerow(["a", "b", "c", "d,", "e|", "foo\"bar"])
print(s.getvalue())
You get:
a|b|c|d,|"e|"|"foo""bar"
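By default the csv writer only quotes fields that need it. If every field should be wrapped in double quotes, as the question asks, the writer accepts quoting=csv.QUOTE_ALL. A minimal sketch for Python 3 follows. The lineterminator argument is set only to keep the output to a single "\n".

```python
import csv
from io import StringIO

my_list = ['a', 'b', 'c', 'd,', 'e|', 'foo"bar']

buffer = StringIO()
writer = csv.writer(buffer, delimiter="|", quoting=csv.QUOTE_ALL,
                    lineterminator="\n")
writer.writerow(my_list)

line = buffer.getvalue()
print(line, end="")  # "a"|"b"|"c"|"d,"|"e|"|"foo""bar"
```

Embedded double quotes are doubled, which is the convention csv readers expect.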
>>> "|".join(['"{0}"'.format(x) for x in my_list])
"a"|"b"|"c"|"d"|"e"|"f"
>>> print '"' + '"|"'.join(my_list) + '"'
"a"|"b"|"c"|"d"|"e"|"f"

pythonic way of iterating over a collection of json objects stored in a text file

I have a text file that has several thousand JSON objects (meaning the textual representation of JSON) one after the other. They're not separated and I would prefer not to modify the source file. How can I load/parse each JSON object in Python? (I have seen this question, but if I'm not mistaken, it only works for a list of JSON objects, already separated by commas.) My file looks like this:
{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}...
I don't see a clean way to do this without using the real JSON parser. The other options, modifying the text or using a non-JSON parser, are risky. So the best way to go is to find a way to iterate using the real JSON parser, so that you're sure to comply with the JSON spec.
The core idea is to let the real JSON parser do all the work in identifying the groups:
import json, re
combined = '{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}'
start = 0
while start != len(combined):
    try:
        # Succeeds only when a single object remains
        result = json.loads(combined[start:])
        end = len(combined)
    except ValueError as e:
        # Find the location where the parsing failed
        end = start + int(re.search(r'column (\d+)', e.args[0]).group(1)) - 1
        result = json.loads(combined[start:end])
    start = end
    print(result)
This outputs:
{u'json': 1}
{u'json': 2}
{u'json': 3}
{u'json': 4}
{u'json': 5}
I think the following would work as long as there are no non-comma-delimited json arrays of json sub-objects inside any of the outermost json objects. It's somewhat brute-force in that it reads the whole file into memory and attempts to fix it.
import json
def get_json_array(filename):
    with open(filename, 'rt') as jsonfile:
        json_array = '[{}]'.format(jsonfile.read().replace('}{', '},{'))
    return json.loads(json_array)
for obj in get_json_array('multiobj.json'):
    print(obj)
Output:
{u'json': 1}
{u'json': 2}
{u'json': 3}
{u'json': 4}
{u'json': 5}
Instead of modifying the source file, just make a copy. Use a regex to replace }{ with },{ and then hopefully a pre-built json reader will take care of it nicely.
EDIT: quick solution:
from re import sub
with open(inputfile, 'r') as fin:
    # Braces are escaped so they are matched literally
    text = sub(r'\}\{', r'},{', fin.read())
with open(outfile, 'w') as fout:
    fout.write('[')
    fout.write(text)
    fout.write(']')
>>> import ast
>>> s = '{"json":1}{"json":2}{"json":3}{"json":4}{"json":5}'
>>> [ast.literal_eval(ele + '}') for ele in s.split('}')[:-1]]
[{'json': 1}, {'json': 2}, {'json': 3}, {'json': 4}, {'json': 5}]
Provided you have no nested objects and splitting on '}' is feasible this can be accomplished pretty simply.
Here is one pythonic way to do it:
from json.scanner import make_scanner
from json import JSONDecoder
def load_jsons(multi_json_str):
    s = multi_json_str.strip()
    scanner = make_scanner(JSONDecoder())
    idx = 0
    objects = []
    while idx < len(s):
        obj, idx = scanner(s, idx)
        objects.append(obj)
    return objects
I think json was never supposed to be used this way, but it solves your problem.
I agree with @Raymond Hettinger: you need to use json itself to do the work, because text manipulation doesn't work for complex JSON objects. His answer parses the exception message to find the split position. It works, but it looks like a hack and hence is not pythonic :)
EDIT:
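Just found out this is actually supported by the json module; just use raw_decode like this:
decoder = JSONDecoder()
# raw_decode returns the decoded object and the index where it ended
first_obj, end = decoder.raw_decode(multi_json_str)
remaining = multi_json_str[end:]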
Read http://pymotw.com/2/json/index.html#mixed-data-streams
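Building on raw_decode, the whole file can be walked object by object. The generator below is a sketch for Python 3. The name iter_json_objects is made up for illustration, and the second argument to raw_decode is the idx parameter accepted by CPython's implementation. Whitespace between objects is skipped by hand because raw_decode does not skip it.

```python
import json

def iter_json_objects(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    idx = 0
    length = len(text)
    while idx < length:
        # raw_decode does not skip leading whitespace, so do it here
        while idx < length and text[idx].isspace():
            idx += 1
        if idx >= length:
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

combined = '{"json":1}{"json":2} {"json":3}\n{"json":{"nested":[4, 5]}}'
for obj in iter_json_objects(combined):
    print(obj)
```

Because the real parser finds each boundary, nested objects and braces inside strings are handled correctly, unlike the '}{' replacement approaches.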

Python: json.loads returns items prefixing with 'u'

I'll be receiving a JSON encoded string from Objective-C, and I am decoding a dummy string (for now) like the code below. My output comes out with character 'u' prefixing each item:
[{u'i': u'imap.gmail.com', u'p': u'aaaa'}, {u'i': u'333imap.com', u'p': u'bbbb'}...
How is JSON adding this Unicode character? What's the best way to remove it?
import json
import sys
mail_accounts = []
da = {}
try:
    s = '[{"i":"imap.gmail.com","p":"aaaa"},{"i":"imap.aol.com","p":"bbbb"},{"i":"333imap.com","p":"ccccc"},{"i":"444ap.gmail.com","p":"ddddd"},{"i":"555imap.gmail.com","p":"eee"}]'
    jdata = json.loads(s)
    for d in jdata:
        for key, value in d.iteritems():
            if key not in da:
                da[key] = value
            else:
                da = {}
                da[key] = value
        mail_accounts.append(da)
except Exception, err:
    sys.stderr.write('Exception Error: %s' % str(err))
print mail_accounts
The u- prefix just means that you have a Unicode string. When you really use the string, it won't appear in your data. Don't be thrown by the printed output.
For example, try this:
print mail_accounts[0]["i"]
You won't see a u.
Everything is cool, man. The 'u' is a good thing, it indicates that the string is of type Unicode in python 2.x.
http://docs.python.org/2/howto/unicode.html#the-unicode-type
The d3 print below is the one you are looking for (which is the combination of dumps and loads) :)
Having:
import json
d = """{"Aa": 1, "BB": "blabla", "cc": "False"}"""
d1 = json.loads(d) # Produces a dictionary out of the given string
d2 = json.dumps(d) # Produces a string out of a given dict or string
d3 = json.dumps(json.loads(d)) # 'dumps' gets the dict from 'loads' this time
print "d1: " + str(d1)
print "d2: " + d2
print "d3: " + d3
Prints:
d1: {u'Aa': 1, u'cc': u'False', u'BB': u'blabla'}
d2: "{\"Aa\": 1, \"BB\": \"blabla\", \"cc\": \"False\"}"
d3: {"Aa": 1, "cc": "False", "BB": "blabla"}
Those 'u' characters prefixed to the strings signify that they are Unicode strings.
If you want to remove those 'u' characters from your object, you can do this:
import json, ast
jdata = ast.literal_eval(json.dumps(jdata)) # Removing uni-code chars
Let's check it in the Python shell:
>>> import json, ast
>>> jdata = [{u'i': u'imap.gmail.com', u'p': u'aaaa'}, {u'i': u'333imap.com', u'p': u'bbbb'}]
>>> jdata = ast.literal_eval(json.dumps(jdata))
>>> jdata
[{'i': 'imap.gmail.com', 'p': 'aaaa'}, {'i': '333imap.com', 'p': 'bbbb'}]
Unicode is an appropriate type here. The JSONDecoder documentation describes the conversion table and states that JSON string objects are decoded into Unicode objects.
From 18.2.2. Encoders and Decoders:
JSON            Python
==================================
object          dict
array           list
string          unicode
number (int)    int, long
number (real)   float
true            True
false           False
null            None
"encoding determines the encoding used to interpret any str objects decoded by this instance (UTF-8 by default)."
The u prefix means that those strings are unicode rather than 8-bit strings. The best way to not show the u prefix is to switch to Python 3, where strings are unicode by default. If that's not an option, the str constructor will convert from unicode to 8-bit, so simply loop recursively over the result and convert unicode to str. However, it is probably best just to leave the strings as unicode.
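The recursive conversion mentioned above could look roughly like this. It is only a sketch: the helper name to_native_str is made up, UTF-8 is an assumed target encoding, and on Python 3 the function returns the data unchanged because strings are already Unicode there.

```python
import json
import sys

def to_native_str(value):
    """Recursively convert Python 2 unicode objects to 8-bit str."""
    # The unicode name is only evaluated on Python 2 (short-circuit)
    if sys.version_info[0] == 2 and isinstance(value, unicode):  # noqa: F821
        return value.encode("utf-8")  # assumed encoding
    if isinstance(value, dict):
        return {to_native_str(k): to_native_str(v) for k, v in value.items()}
    if isinstance(value, list):
        return [to_native_str(item) for item in value]
    return value

jdata = json.loads('[{"i": "imap.gmail.com", "p": "aaaa"}]')
print(to_native_str(jdata))
```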
I kept running into this problem when trying to capture JSON data in the log with the Python logging library, for debugging and troubleshooting purposes. Getting the u character is a real nuisance when you want to copy the text and paste it into your code somewhere.
As everyone will tell you, this is because it is a Unicode representation, and it could come from the fact that you’ve used json.loads() to load in the data from a string in the first place.
If you want the JSON representation in the log, without the u prefix, the trick is to use json.dumps() before logging it out. For example:
import json
import logging
# Prepare the data
json_data = json.loads('{"key": "value"}')
# Log normally and get the Unicode indicator
logging.warning('data: {}'.format(json_data))
# Output: WARNING:root:data: {u'key': u'value'}
# Dump to a string before logging and get clean output!
logging.warning('data: {}'.format(json.dumps(json_data)))
# Output: WARNING:root:data: {"key": "value"}
Try this:
mail_accounts[0]["i"].encode("ascii")
Just replace the u' with a single quote...
print(str(mail_accounts).replace("u'", "'"))
