Formatting JSON in Python

What is the simplest way to pretty-print a string of JSON as a string with indentation when the initial JSON string is formatted without extra spaces or line breaks?
Currently I'm running json.loads() and then running json.dumps() with indent=2 on the result. This works, but it feels like I'm throwing a lot of compute down the drain.
Is there a simpler or more efficient (built-in) way to pretty-print a JSON string (while keeping it valid JSON)?
Example
import requests
import json
response = requests.get('http://spam.eggs/breakfast')
one_line_json = response.content.decode('utf-8')
pretty_json = json.dumps(json.loads(response.content), indent=2)
print(f'Original: {one_line_json}')
print(f'Pretty: {pretty_json}')
Output:
Original: {"breakfast": ["spam", "spam", "eggs"]}
Pretty: {
  "breakfast": [
    "spam",
    "spam",
    "eggs"
  ]
}

json.dumps(obj, indent=2) is better than pprint because:
It is faster with the same load methodology.
It has the same or similar simplicity.
The output will produce valid JSON, whereas pprint will not.
pprint_vs_dumps.py
import cProfile
import json
import pprint
from urllib.request import urlopen

def custom_pretty_print():
    url_to_read = "https://www.cbcmusic.ca/Component/Playlog/GetPlaylog?stationId=96&date=2018-11-05"
    with urlopen(url_to_read) as resp:
        pretty_json = json.dumps(json.load(resp), indent=2)
    print(f'Pretty: {pretty_json}')

def pprint_json():
    url_to_read = "https://www.cbcmusic.ca/Component/Playlog/GetPlaylog?stationId=96&date=2018-11-05"
    with urlopen(url_to_read) as resp:
        info = json.load(resp)
    pprint.pprint(info)
cProfile.run('custom_pretty_print()')
>>> 71027 function calls (42309 primitive calls) in 0.084 seconds
cProfile.run('pprint_json()')
>>> 164241 function calls (140121 primitive calls) in 0.208 seconds
Thanks @tobias_k for pointing out my errors along the way.

I think for printing a true JSON object, it's probably as good as it gets. timeit(number=10000) for the following took about 5.659214497s:
import json

d = {
    'breakfast': [
        'spam', 'spam', 'eggs',
        {
            'another': 'level',
            'nested': [
                {'a': 'b'},
                {'c': 'd'}
            ]
        }
    ],
    'foo': True,
    'bar': None
}
s = json.dumps(d)
q = json.dumps(json.loads(s), indent=2)
print(q)
I tried with pprint, but it won't print the pure JSON string unless it's first converted to a Python dict, which loses your true, null, false and other valid JSON literals, as mentioned in the other answer. It also doesn't retain the order in which the items appeared, so it's not great if order is important for readability.
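For illustration, a minimal side-by-side of the two (results shown as comments):
import json
import pprint

s = '{"foo": true, "bar": null}'
d = json.loads(s)

pprint.pprint(d)                  # {'bar': None, 'foo': True} (Python literals, keys sorted)
print(json.dumps(d, indent=2))    # valid JSON again: keeps true/null and insertion order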
Just for fun I whipped up the following function:
def pretty_json_for_savages(j, indentor=' '):
    ind_lvl = 0
    temp = ''
    for i, c in enumerate(j):
        if c in '{[':
            print(indentor*ind_lvl + temp.strip() + c)
            ind_lvl += 1
            temp = ''
        elif c in '}]':
            print(indentor*ind_lvl + temp.strip() + '\n' + indentor*(ind_lvl-1) + c, end='')
            ind_lvl -= 1
            temp = ''
        elif c in ',':
            print(indentor*(0 if j[i-1] in '{}[]' else ind_lvl) + temp.strip() + c)
            temp = ''
        else:
            temp += c
    print('')
# {
#  "breakfast":[
#   "spam",
#   "spam",
#   "eggs",
#   {
#    "another": "level",
#    "nested":[
#     {
#      "a": "b"
#     },
#     {
#      "c": "d"
#     }
#    ]
#   }
#  ],
#  "foo": true,
#  "bar": null
# }
It prints pretty alright, and unsurprisingly it took a whopping 16.701202023s to run in timeit(number=10000), which is 3 times as much as json.dumps(json.loads()) would get you. It's probably not worthwhile to build your own function to achieve this unless you spend some time optimizing it, and with the lack of a builtin for the same, it's probably best you stick to your guns, since your efforts will most likely give diminishing returns.

Related

Splitting 250GB JSON file containing multiple tables into parquet

I have a JSON file with the following example format:
{
  "Table1": {
    "Records": [
      {
        "Key1Tab1": "SomeVal",
        "Key2Tab1": "AnotherVal"
      },
      {
        "Key1Tab1": "SomeVal2",
        "Key2Tab1": "AnotherVal2"
      }
    ]
  },
  "Table2": {
    "Records": [
      {
        "Key1Tab1": "SomeVal",
        "Key2Tab1": "AnotherVal"
      },
      {
        "Key1Tab1": "SomeVal2",
        "Key2Tab1": "AnotherVal2"
      }
    ]
  }
}
The root keys are table names from an SQL database, and their corresponding values are the rows.
I want to split the JSON file into separate parquet files, each representing a table.
I.e. Table1.parquet and Table2.parquet.
The big issue is that the size of the file prevents me from loading it into memory.
Hence, I tried to use dask.bag to accommodate the nested structure of the file.
import dask.bag as db
from dask.distributed import Client
client = Client(n_workers=4)
lines = db.read_text("filename.json")
But assessing the output with lines.take(4) shows that dask can't read the newlines correctly:
('{\n', ' "Table1": {\n', ' "Records": [\n', ' {\n')
I've tried to search for solutions to this specific problem, but without luck.
Is there any chance that the splitting can be solved with dask, or are there other tools that could do the job?
As suggested here, try the dask.dataframe.read_json() method.
This may be sufficient, though I am unsure how it will behave if you don't have enough memory to hold the entire resulting dataframe in memory.
import dask.dataframe as dd
from dask.distributed import Client
client = Client()
df = dd.read_json("filename.json")
df.to_parquet("filename.parquet", engine='pyarrow')
docs
https://distributed.dask.org/en/latest/manage-computation.html#dask-collections-to-futures
https://examples.dask.org/dataframes/01-data-access.html#Write-to-Parquet
If Dask doesn't process the file in chunks on a single system (it may not happily do so, as JSON is distinctly unfriendly to parse that way, though I unfortunately don't have access to my test system to verify this) and the system memory is unable to handle the giant file, you may be able to extend the system memory with disk space by creating a big swapfile.
Note that this will create a ~300G file (increase the count field for more) and may be incredibly slow compared to memory (but perhaps still fast enough for your needs, especially if it's a one-off).
# create and configure swapfile
dd if=/dev/zero of=swapfile.img bs=10M count=30000 status=progress
chmod 600 swapfile.img
mkswap swapfile.img
swapon swapfile.img
#
# run memory-greedy task
# ...
# ensure processes have exited
#
# disable and remove swapfile to reclaim disk space
swapoff swapfile.img # may hang for a long time
rm swapfile.img
The problem is that dask will split the file on newline characters by default, and you can't guarantee that these won't fall in the middle of one of your tables. Indeed, even if you get it right, you still need to manipulate the resultant text to make complete JSON objects for each partition.
For example:
import json
import dask.bag as db

def myfunc(x):
    x = "".join(x)
    if not x.endswith("}"):
        x = x[:-2] + "}"
    if not x.startswith("{"):
        x = "{" + x
    return [json.loads(x)]

db.read_text('temp.json',
             linedelimiter="\n },\n",
             blocksize=100).map_partitions(myfunc)
In this case, I have purposefully made the blocksize smaller than each part to demonstrate: you will get a JSON object or nothing for each partition.
_.compute()
[{'Table1': {'Records': [{'Key1Tab1': 'SomeVal', 'Key2Tab1': 'AnotherVal'},
    {'Key1Tab1': 'SomeVal2', 'Key2Tab1': 'AnotherVal2'}]}},
 {},
 {'Table2': {'Records': [{'Key1Tab1': 'SomeVal', 'Key2Tab1': 'AnotherVal'},
    {'Key1Tab1': 'SomeVal2', 'Key2Tab1': 'AnotherVal2'}]}},
 {},
 {},
 {}]
Of course, in your case you can immediately do something with the JSON rather than return it, or you can map to your writing function next in the chain.
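For example, here is a sketch of mapping a writing step onto the same bag, reusing myfunc and the read_text call from above. It assumes pandas with a parquet engine (e.g. pyarrow) is installed, and the per-table output paths are made up for illustration:
import pandas as pd

def write_tables(partition):
    # partition is the list produced by myfunc: either [<one table dict>] or [{}]
    for obj in partition:
        for table_name, table in obj.items():
            # hypothetical output path: one parquet file per top-level table
            pd.DataFrame(table["Records"]).to_parquet(f"{table_name}.parquet")
    return partition

bag = db.read_text('temp.json',
                   linedelimiter="\n },\n",
                   blocksize=100).map_partitions(myfunc)
bag.map_partitions(write_tables).compute()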
When working with large files, the key to success is processing the data as a stream, i.e. in filter-like programs.
The JSON format is easy to parse. The following program reads the input char by char (I/O should be buffered) and cuts the top-level JSON object into separate objects. It properly follows the data structure, not the formatting.
The demo program just prints "--NEXT OUTPUT FILE--" where the real output-file switch should be implemented (a sketch of one way to do that follows the sample output below). Whitespace stripping is implemented as a bonus.
import collections

OBJ = 'object'
LST = 'list'

def out(ch):
    print(ch, end='')

with open('json.data') as f:
    stack = collections.deque(); push = stack.append; pop = stack.pop
    esc = string = False
    while (ch := f.read(1)):
        if esc:
            esc = False
        elif ch == '\\':
            esc = True
        elif ch == '"':
            string = not string
        if not string:
            if ch in {' ', '\t', '\r', '\n'}:
                continue
            if ch == ',':
                if len(stack) == 1 and stack[0] == OBJ:
                    out('}\n')
                    print("--- NEXT OUTPUT FILE ---")
                    out('{')
                    continue
            elif ch == '{':
                push(OBJ)
            elif ch == '}':
                if pop() is not OBJ:
                    raise ValueError("unmatched { }")
            elif ch == '[':
                push(LST)
            elif ch == ']':
                if pop() is not LST:
                    raise ValueError("unmatched [ ]")
        out(ch)
Here is a sample output for my testfile:
{"key1":{"name":"John","surname":"Doe"}}
--- NEXT OUTPUT FILE ---
{"key2":"string \" ] }"}
--- NEXT OUTPUT FILE ---
{"key3":13}
--- NEXT OUTPUT FILE ---
{"key4":{"sub1":[null,{"l3":true},null]}}
The original file was:
{
  "key1": {
    "name": "John",
    "surname": "Doe"
  },
  "key2": "string \" ] }", "key3": 13,
  "key4": {
    "sub1": [null, {"l3": true}, null]
  }
}
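As mentioned above, the demo only prints a marker where the file switch belongs. A minimal sketch of that switch could look like the following (the file-naming scheme here is an assumption):
class OutputSwitcher:
    """Replaces the demo's out()/print pair: write() goes where the demo
    calls out(ch), switch() goes where it prints the NEXT OUTPUT FILE marker."""
    def __init__(self, prefix='table'):
        self.prefix = prefix
        self.index = 0
        self.fh = open(f'{self.prefix}{self.index}.json', 'w')

    def write(self, ch):
        self.fh.write(ch)

    def switch(self):
        # close the current output file and start the next one
        self.fh.close()
        self.index += 1
        self.fh = open(f'{self.prefix}{self.index}.json', 'w')

    def close(self):
        self.fh.close()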

Efficient API requests during iteration

So I'm looking for a way to speed up the output of the following code, which calls Google's Natural Language API:
tweets = json.load(input)
client = language.LanguageServiceClient()
sentiment_tweets = []
iterations = 1000
start = timeit.default_timer()
for i, text in enumerate(d['text'] for d in tweets):
    document = types.Document(
        content=text,
        type=enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    results = {'text': text, 'sentiment': sentiment.score, 'magnitude': sentiment.magnitude}
    sentiment_tweets.append(results)
    if (i % iterations) == 0:
        print(i, " tweets processed")
sentiment_tweets_json = [json.dumps(sentiments) for sentiments in sentiment_tweets]
stop = timeit.default_timer()
The issue is that the tweets list has around 100k entries, and iterating and making calls one by one does not produce output on a feasible timescale. I'm exploring asyncio for parallel calls, although as I'm still a beginner with Python and unfamiliar with the package, I'm not sure whether you can make a function a coroutine of itself such that each instance of the function iterates through the list as expected, progressing sequentially. There is also the question of keeping the total number of calls made by the app within the defined quota limits of the API. Just wanted to know if I'm going in the right direction.
I use this method for concurrent calls:
from concurrent import futures as cf

def execute_all(mfs: list, max_workers: int = None):
    """Execute an mfs list concurrently.
    Parameters
    ----------
    mfs : list
        [mfs1, mfs2, ...]
        mfsN = {
            tag: str,
            fn: function,
            kwargs: dict
        }
    max_workers : int
        Maximum number of worker threads (defaults to one per job).
    Returns
    -------
    dict
        {status, result, error}
        status = {tag1, tag2, ...}
        result = {tag1, tag2, ...}
        error = {tag1, tag2, ...}
    """
    result = {
        'status': {},
        'result': {},
        'error': {}
    }
    if max_workers is None:
        # honour a caller-supplied limit; otherwise use one thread per job
        max_workers = len(mfs)
    with cf.ThreadPoolExecutor(max_workers=max_workers) as exec:
        my_futures = {
            exec.submit(x['fn'], **x['kwargs']): x['tag'] for x in mfs
        }
        for future in cf.as_completed(my_futures):
            tag = my_futures[future]
            try:
                result['result'][tag] = future.result()
                result['status'][tag] = 0
            except Exception as err:
                result['error'][tag] = err
                result['result'][tag] = None
                result['status'][tag] = 1
    return result
Each result is returned indexed by its tag (useful if you need to identify which call returned which result), when called like:
mfs = [
    {
        'tag': 'tweet1',
        'fn': process_tweet,
        'kwargs': {
            'tweet': tweet1
        }
    },
    {
        'tag': 'tweet2',
        'fn': process_tweet,
        'kwargs': {
            'tweet': tweet2
        }
    },
]
results = execute_all(mfs, 2)
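Adapted to the question's tweet list, building the job list might look something like the sketch below. process_tweet and the batch size are assumptions; with ~100k tweets you would submit in batches with a capped max_workers rather than one thread per tweet, which also helps stay within the API quota.
# Hypothetical adaptation: one job per tweet in a batch, assuming
# process_tweet(tweet) wraps the analyze_sentiment call from the question.
batch = tweets[:500]   # submit in manageable batches
mfs = [
    {'tag': f'tweet{i}', 'fn': process_tweet, 'kwargs': {'tweet': d['text']}}
    for i, d in enumerate(batch)
]
results = execute_all(mfs, max_workers=32)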
While async is one way you could go, another that might be easier is using Python's multiprocessing functionality.
from multiprocessing import Pool

def process_tweet(tweet):
    pass  # Fill in the blanks here

# Use five processes at once
with Pool(5) as p:
    processed_tweets = p.map(process_tweet, tweets, 1)
In this case "tweets" is an iterator of some sort, and each element of that iterator will get passed to your function. The map function will make sure the results come back in the same order the arguments were supplied.
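For context, here is one way process_tweet could be filled in, reusing the client calls from the question. This is only a sketch: the google-cloud names (language, types, enums) are assumed to be importable in each worker process, and the input file name is made up.
import json
from multiprocessing import Pool

def process_tweet(tweet):
    # One client per call keeps the sketch simple; caching one per worker would be cheaper.
    client = language.LanguageServiceClient()
    document = types.Document(content=tweet['text'],
                              type=enums.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(document=document).document_sentiment
    return {'text': tweet['text'],
            'sentiment': sentiment.score,
            'magnitude': sentiment.magnitude}

if __name__ == '__main__':
    with open('tweets.json') as f:   # hypothetical input file
        tweets = json.load(f)
    with Pool(5) as p:
        sentiment_tweets = p.map(process_tweet, tweets, 1)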

Compare Two Different JSON In Python Using Difflib, Showing Only The Differences

I am trying to compare 2 different pieces of (JavaScript/JSON) code using the difflib module in Python 3.8:
{"message": "Hello world", "name": "Jack"}
and
{"message": "Hello world", "name": "Ryan"}
Problem: When these 2 strings are prettified and compared using difflib, we get the inline differences, as well as all the common lines.
Is there a way to only show the lines that differ, in a manner that's clearer to look at? This will help significantly when both files are much larger, making it more challenging to identify the changed lines.
Thanks!
Actual Output
{
    "message": "Hello world",
    "name": "{J -> Ry}a{ck -> n}"
}
Desired Output
"name": "{J -> Ry}a{ck -> n}"
Even better will be something like:
{"name": "Jack"} -> {"name": "Ryan"}
Python Code Used
We use jsbeautifier here instead of json because the files we are comparing may sometimes be malformed JSON. json will throw an error while jsbeautifier still formats it the way we expect it to.
import jsbeautifier

def inline_diff(a, b):
    """
    https://stackoverflow.com/questions/774316/python-difflib-highlighting-differences-inline/47617607#47617607
    """
    import difflib
    matcher = difflib.SequenceMatcher(None, a, b)
    def process_tag(tag, i1, i2, j1, j2):
        if tag == 'replace':
            return '{' + matcher.a[i1:i2] + ' -> ' + matcher.b[j1:j2] + '}'
        if tag == 'delete':
            return '{- ' + matcher.a[i1:i2] + '}'
        if tag == 'equal':
            return matcher.a[i1:i2]
        if tag == 'insert':
            return '{+ ' + matcher.b[j1:j2] + '}'
        assert False, "Unknown tag %r" % tag
    return ''.join(process_tag(*t) for t in matcher.get_opcodes())
# File content to compare
file1 = '{"message": "Hello world", "name": "Jack"}'
file2 = '{"message": "Hello world", "name": "Ryan"}'
# Prettify JSON
f1 = jsbeautifier.beautify(file1)
f2 = jsbeautifier.beautify(file2)
# Print the differences to stdout
print(inline_diff(f1, f2))
For your desired output you can do it even without difflib, for example:
import json

def find_diff(a, b):
    result = []
    a = json.loads(a)
    b = json.loads(b)
    for key in a:
        if key not in b:
            result.append(f'{dict({key: a[key]})} -> {"key deleted"}')
        elif key in b and a[key] != b[key]:
            result.append(f'{dict({key: a[key]})} -> {dict({key: b[key]})}')
    return '\n'.join(t for t in result)

# File content to compare
file1 = '{"new_message": "Hello world", "name": "Jack"}'
file2 = '{"message": "Hello world", "name": "Ryan"}'

print(find_diff(file1, file2))
# {'new_message': 'Hello world'} -> key deleted
# {'name': 'Jack'} -> {'name': 'Ryan'}
There are plenty of ways to handle it; try to adapt this for your needs.
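If line-level granularity is enough, another option that stays within difflib is unified_diff with zero lines of context, which prints only the changed lines plus the hunk headers:
import difflib
import jsbeautifier

file1 = '{"message": "Hello world", "name": "Jack"}'
file2 = '{"message": "Hello world", "name": "Ryan"}'
f1 = jsbeautifier.beautify(file1)
f2 = jsbeautifier.beautify(file2)

# n=0 drops the unchanged context lines
for line in difflib.unified_diff(f1.splitlines(), f2.splitlines(),
                                 fromfile='file1', tofile='file2',
                                 lineterm='', n=0):
    print(line)
# prints something like:
# --- file1
# +++ file2
# @@ -3 +3 @@
# -    "name": "Jack"
# +    "name": "Ryan"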

Convert list of paths to dictionary in python

I'm making a program in Python where I need to interact with "hypothetical" paths (i.e. paths that don't and won't exist on the actual filesystem), and I need to be able to listdir them like normal (path['directory'] would return every item inside the directory, like os.listdir()).
The solution I came up with was to convert a list of string paths to a dictionary of dictionaries. I came up with this recursive function (it's inside a class):
def DoMagic(self, paths):
    structure = {}
    if not type(paths) == list:
        raise ValueError('Expected list Value, not ' + str(type(paths)))
    for i in paths:
        print(i)
        if i[0] == '/':  # Sanity check
            print('trailing?', i)  # Inform user that there *might* be an issue with the input.
            i[0] = ''
        i = i.split('/')  # Split it, so that we can test against different parts.
        if len(i[1:]) > 1:  # Hang-a-bout, there's more content!
            structure = {**structure, **self.DoMagic(['/'.join(i[1:])])}
        else:
            structure[i[1]] = i[1]
But when I go to run it with ['foo/e.txt','foo/bar/a.txt','foo/bar/b.cfg','foo/bar/c/d.txt'] as input, I get:
{'e.txt': 'e.txt', 'a.txt': 'a.txt', 'b.cfg': 'b.cfg', 'd.txt': 'd.txt'}
I want to be able to just path['foo']['bar'] to get everything in the foo/bar/ directory.
Edit:
A more desirable output would be:
{'foo':{'e.txt':'e.txt','bar':{'a.txt':'a.txt','c':{'d.txt':'d.txt'}}}}
Edit 10-14-22: My first answer matches what the OP asks, but isn't really the ideal approach, nor is it the cleanest output. Since this question appears to get referenced often, here is a cleaner approach that is more resilient to Unix/Windows paths and produces an output dictionary that makes more sense.
from pathlib import Path
import json

def get_path_dict(paths: list[str | Path]) -> dict:
    """Builds a tree like structure out of a list of paths"""
    def _recurse(dic: dict, chain: tuple[str, ...] | list[str]):
        if len(chain) == 0:
            return
        if len(chain) == 1:
            dic[chain[0]] = None
            return
        key, *new_chain = chain
        if key not in dic:
            dic[key] = {}
        _recurse(dic[key], new_chain)
        return

    new_path_dict = {}
    for path in paths:
        _recurse(new_path_dict, Path(path).parts)
    return new_path_dict

l1 = ['foo/e.txt', 'foo/bar/a.txt', 'foo/bar/b.cfg', Path('foo/bar/c/d.txt'), 'test.txt']
result = get_path_dict(l1)
print(json.dumps(result, indent=2))
Output:
{
  "foo": {
    "e.txt": null,
    "bar": {
      "a.txt": null,
      "b.cfg": null,
      "c": {
        "d.txt": null
      }
    }
  },
  "test.txt": null
}
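The nested dict can then be queried the way the question describes; listing a "directory" is just reading the keys at that level:
print(list(result['foo']['bar']))   # ['a.txt', 'b.cfg', 'c']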
Older approach
How about this? It gets your desired output; however, a tree structure may be cleaner.
from collections import defaultdict
import json

def nested_dict():
    """
    Creates a default dictionary where each value is another default dictionary.
    """
    return defaultdict(nested_dict)

def default_to_regular(d):
    """
    Converts defaultdicts of defaultdicts to dicts of dicts.
    """
    if isinstance(d, defaultdict):
        d = {k: default_to_regular(v) for k, v in d.items()}
    return d

def get_path_dict(paths):
    new_path_dict = nested_dict()
    for path in paths:
        parts = path.split('/')
        if parts:
            marcher = new_path_dict
            for key in parts[:-1]:
                marcher = marcher[key]
            marcher[parts[-1]] = parts[-1]
    return default_to_regular(new_path_dict)

l1 = ['foo/e.txt', 'foo/bar/a.txt', 'foo/bar/b.cfg', 'foo/bar/c/d.txt', 'test.txt']
result = get_path_dict(l1)
print(json.dumps(result, indent=2))
Output:
{
  "foo": {
    "e.txt": "e.txt",
    "bar": {
      "a.txt": "a.txt",
      "b.cfg": "b.cfg",
      "c": {
        "d.txt": "d.txt"
      }
    }
  },
  "test.txt": "test.txt"
}
Wouldn't a simple tree, implemented via dictionaries, suffice?
Your implementation seems a bit redundant: it's hard to tell easily which folder a file belongs to.
https://en.wikipedia.org/wiki/Tree_(data_structure)
There are plenty of libraries on PyPI if you need something extra, e.g.:
treelib
There are also pure paths in pathlib.
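For example, pure paths do path manipulation without ever touching the filesystem, which fits the "hypothetical paths" requirement:
from pathlib import PurePosixPath

p = PurePosixPath('foo/bar/a.txt')
print(p.parts)    # ('foo', 'bar', 'a.txt')
print(p.parent)   # foo/bar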

Python json dumps syntax error when appending list of dict

I've got two functions that each return a list of dictionaries, and I'm trying to get json to encode them. It works when I try it with my first function, but now that I'm appending the second function I get a syntax error of ": expected". I will eventually be appending a total of 7 functions that each output a list of dicts. Is there a better way of accomplishing this?
import dmidecode
import simplejson as json

def get_bios_specs():
    BIOSdict = {}
    BIOSlist = []
    for v in dmidecode.bios().values():
        if type(v) == dict and v['dmi_type'] == 0:
            BIOSdict["Name"] = str((v['data']['Vendor']))
            BIOSdict["Description"] = str((v['data']['Vendor']))
            BIOSdict["BuildNumber"] = str((v['data']['Version']))
            BIOSdict["SoftwareElementID"] = str((v['data']['BIOS Revision']))
            BIOSdict["primaryBIOS"] = "True"
            BIOSlist.append(BIOSdict)
    return BIOSlist

def get_board_specs():
    MOBOdict = {}
    MOBOlist = []
    for v in dmidecode.baseboard().values():
        if type(v) == dict and v['dmi_type'] == 2:
            MOBOdict["Manufacturer"] = str(v['data']['Manufacturer'])
            MOBOdict["Model"] = str(v['data']['Product Name'])
            MOBOlist.append(MOBOdict)
    return MOBOlist

def get_json_dumps():
    jsonOBJ = json
    # Syntax error is here, I can't use a comma to continue adding more, nor + to append.
    return jsonOBJ.dumps({'HardwareSpec':{'BIOS': get_bios_specs()},{'Motherboard': get_board_specs()}})
Use multiple items within your nested dictionary.
jsonOBJ.dumps({
    'HardwareSpec': {
        'BIOS': get_bios_specs(),
        'Motherboard': get_board_specs()
    }
})
And if you want multiple BIOS items or Motherboard items, just use a list.
...
'HardwareSpec': {
    'BIOS': [
        get_bios_specs(),
        get_uefi_specs()
    ]
    ...
}
If you want a more convenient lookup of specs, you can just embed a dict:
jsonOBJ.dumps({'HardwareSpec': {'BIOS': get_bios_specs(),
                                'Motherboard': get_board_specs()
                                }
               })
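Since the question mentions eventually appending around seven such functions, one way to scale this (a sketch; the remaining function names are placeholders) is to drive the payload from a name-to-function mapping:
spec_functions = {
    'BIOS': get_bios_specs,
    'Motherboard': get_board_specs,
    # ... add the remaining five spec functions here
}

def get_json_dumps():
    # build each section by calling its function, then encode the whole payload once
    return json.dumps({'HardwareSpec': {name: fn() for name, fn in spec_functions.items()}})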
