Compare Two Different JSON In Python Using Difflib, Showing Only The Differences - python

I am trying to compare 2 different pieces of (Javascript/JSON) code using difflib module in Python 3.8,
{"message": "Hello world", "name": "Jack"}
and
{"message": "Hello world", "name": "Ryan"}
Problem: When these 2 strings are prettified and compared using difflib, we get the inline differences, as well as all the common lines.
Is there a way to show only the lines that differ, in a form that's easier to read? This would help significantly when both files are much larger, which makes the changed lines harder to spot.
Thanks!
Actual Output
{
"message": "Hello world",
"name": "{J -> Ry}a{ck -> n}"
}
Desired Output
"name": "{J -> Ry}a{ck -> n}"
Even better will be something like:
{"name": "Jack"} -> {"name": "Ryan"}
Python Code Used
We use jsbeautifier here instead of json because the files we are comparing may sometimes be malformed JSON. json will throw an error while jsbeautifier still formats it the way we expect it to.
import jsbeautifier

def inline_diff(a, b):
    """
    https://stackoverflow.com/questions/774316/python-difflib-highlighting-differences-inline/47617607#47617607
    """
    import difflib
    matcher = difflib.SequenceMatcher(None, a, b)

    def process_tag(tag, i1, i2, j1, j2):
        if tag == 'replace':
            return '{' + matcher.a[i1:i2] + ' -> ' + matcher.b[j1:j2] + '}'
        if tag == 'delete':
            return '{- ' + matcher.a[i1:i2] + '}'
        if tag == 'equal':
            return matcher.a[i1:i2]
        if tag == 'insert':
            return '{+ ' + matcher.b[j1:j2] + '}'
        assert False, "Unknown tag %r" % tag

    return ''.join(process_tag(*t) for t in matcher.get_opcodes())

# File content to compare
file1 = '{"message": "Hello world", "name": "Jack"}'
file2 = '{"message": "Hello world", "name": "Ryan"}'
# Prettify JSON
f1 = jsbeautifier.beautify(file1)
f2 = jsbeautifier.beautify(file2)
# Print the differences to stdout
print(inline_diff(f1, f2))

For your desired output you don't even need difflib. For example:
import json

def find_diff(a, b):
    result = []
    a = json.loads(a)
    b = json.loads(b)
    # Only keys of a are inspected; keys that exist only in b are not reported
    for key in a:
        if key not in b:
            result.append(f'{dict({key: a[key]})} -> {"key deleted"}')
        elif a[key] != b[key]:
            result.append(f'{dict({key: a[key]})} -> {dict({key: b[key]})}')
    return '\n'.join(result)

# File content to compare
file1 = '{"new_message": "Hello world", "name": "Jack"}'
file2 = '{"message": "Hello world", "name": "Ryan"}'
print(find_diff(file1, file2))
#{'new_message': 'Hello world'} -> key deleted
#{'name': 'Jack'} -> {'name': 'Ryan'}
There are plenty of ways to handle this; adapt it to your needs.
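If you would rather keep difflib, one option is to diff the prettified text line by line and keep only the lines that difflib marks as removed or added. The sketch below is an illustration under assumptions, not the code from the question: changed_lines is a hypothetical helper name, and json.dumps stands in for jsbeautifier only so that the example runs without third-party packages (any two prettified strings would do).

```python
import difflib
import json


def changed_lines(pretty_a, pretty_b):
    """Return (marker, text) pairs for lines that differ between two texts.

    changed_lines is a hypothetical helper name. marker is '-' for a line
    found only in pretty_a and '+' for a line found only in pretty_b.
    """
    diff = difflib.ndiff(pretty_a.splitlines(), pretty_b.splitlines())
    # ndiff prefixes every line with a two-character code: "- ", "+ ",
    # "  " (common line) or "? " (intraline hints). Keep only "- " and "+ ".
    return [(line[0], line[2:].strip())
            for line in diff
            if line.startswith(("- ", "+ "))]


file1 = '{"message": "Hello world", "name": "Jack"}'
file2 = '{"message": "Hello world", "name": "Ryan"}'

# Assumption: both inputs are valid JSON, so the json module can prettify them.
f1 = json.dumps(json.loads(file1), indent=2)
f2 = json.dumps(json.loads(file2), indent=2)

for marker, text in changed_lines(f1, f2):
    print(marker, text)
```

For the two sample strings this prints the removed "name": "Jack" line followed by the added "name": "Ryan" line, and nothing for the unchanged "message" line.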

Related

Converting JSON to CSV using Python. How to remove certain text/characters if found, and how to better format the cell?

I apologise in advance if I have not provided enough information, used the wrong terminology, or formatted my question incorrectly. This is my first time asking a question here.
This is the Python script: https://pastebin.com/WWViemwf
This is the JSON file (it contains the first 4 elements: hydrogen, helium, lithium, beryllium): https://pastebin.com/fyiijpBG
As seen, I'm converting the file from ".json" to ".csv".
The JSON file sometimes contains fields that say "NotApplicable" or "Unknown". Or it will show me weird text that I'm not familiar with.
For example here:
"LiquidDensity": {
    "data": "NotAvailable",
    "tex_description": "\\text{liquid density}"
},
And here:
"MagneticMoment": {
    "data": "Unknown",
    "tex_description": "\\text{magnetic dipole moment}"
},
Here is the code I've written to convert from ".json" to ".csv":
#liquid density
liquid_density = element_data["LiquidDensity"]["data"]
if isinstance(liquid_density, dict):
    liquid_density_value = liquid_density["value"]
    liquid_density_unit = liquid_density["tex_unit"]
else:
    liquid_density_value = liquid_density
    liquid_density_unit = ""
However in the csv file it shows up like this.
I'm also trying to remove these characters that I'm seeing in the ".csv" file.
In the JSON file, this is how the data looks:
"AtomicMass": {
    "data": {
        "value": "4.002602",
        "tex_unit": "\\text{u}"
    },
    "tex_description": "\\text{atomic mass}"
},
And this is how I coded the conversion in Python:
#atomic mass
atomic_mass = element_data["AtomicMass"]["data"]
if isinstance(atomic_mass, dict):
    atomic_mass_value = atomic_mass["value"]
    atomic_mass_unit = atomic_mass["tex_unit"]
else:
    atomic_mass_value = atomic_mass
    atomic_mass_unit = ""
What have I done wrong?
I've tried replacing:
#melting point
melting_point = element_data["MeltingPoint"]["data"]
if isinstance(melting_point, dict):
    melting_point_value = melting_point["value"]
    melting_point_unit = melting_point["tex_unit"]
else:
    melting_point_value = melting_point
    melting_point_value = ""
With:
#melting point
melting_point = element_data["MeltingPoint"]["data"]
if isinstance(melting_point, dict):
    melting_point_value = melting_point["value"]
    melting_point_unit = melting_point["tex_unit"]
elif melting_point == "NotApplicable" or melting_point == "Unknown":
    melting_point_value = ""
    melting_point_unit = ""
else:
    melting_point_value = melting_point
    melting_point_unit = ""
However that doesn't seem to work.
Your parsing code is fine; the problem is in the line that writes the row. Here is the relevant part:
# I will only use Liquid Density as the example, so the other fields are omitted
headers = [..., "Liquid Density", ...]
# liquid_density data reading part
liquid_density = element_data["LiquidDensity"]["data"]
if isinstance(liquid_density, dict):
    liquid_density_value = liquid_density["value"]
    liquid_density_unit = liquid_density["tex_unit"]
else:
    liquid_density_value = liquid_density
    liquid_density_unit = ""
# your writing of the data into the csv
writer.writerow([..., liquid_density, ...])
You write liquid_density itself into the CSV, which is why the whole dictionary shows up. To write only the value, change the write line to:
writer.writerow([..., liquid_density_value, ...])
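The same value/unit split is repeated for every field in the question, so it can be folded into one helper that also blanks the placeholder strings and strips the \text{...} wrapper from units. This is only a sketch: split_value_unit and PLACEHOLDERS are hypothetical names, the sample data is cut down from the question, and an in-memory buffer stands in for the real output file.

```python
import csv
import io
import re

# Placeholder strings seen in the question's JSON; extend as needed.
PLACEHOLDERS = {"NotApplicable", "NotAvailable", "Unknown"}


def split_value_unit(data):
    """Return (value, unit) for one field's "data" entry (hypothetical helper)."""
    if isinstance(data, dict):
        value = data.get("value", "")
        unit = data.get("tex_unit", "")
    else:
        value, unit = data, ""
    if isinstance(value, str) and value in PLACEHOLDERS:
        value = ""
    # Turn "\text{u}" into "u"; units without that wrapper are left untouched.
    unit = re.sub(r"\\text\{([^{}]*)\}", r"\1", unit)
    return value, unit


# Cut-down sample in the shape shown in the question.
element_data = {
    "AtomicMass": {"data": {"value": "4.002602", "tex_unit": "\\text{u}"}},
    "LiquidDensity": {"data": "NotAvailable"},
}

mass_value, mass_unit = split_value_unit(element_data["AtomicMass"]["data"])
density_value, density_unit = split_value_unit(element_data["LiquidDensity"]["data"])

buffer = io.StringIO()  # stands in for the real CSV file
writer = csv.writer(buffer)
writer.writerow(["Atomic Mass", "Atomic Mass Unit", "Liquid Density"])
writer.writerow([mass_value, mass_unit, density_value])
print(buffer.getvalue())
```

Writing the unpacked values, rather than the raw "data" entry, is what keeps dictionaries out of the cells.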

How to escape all HTML entities in show_popup() method and fix Parse Error in Sublime Text 3 plugin?

I am making a plugin for Sublime Text 3. It contacts my server in Java and receives a response in the form of a list of strings, that contains C code.
To display this code in a popup window you need to pass a string in HTML format to the method show_popup. Accordingly, all C-code characters that can be recognized by the parser as HTML entities should be replaced with their names (&name;) or numbers (&#number;). At first, I just replaced the most common characters with replace(), but it didn't always work out - Parse Error was displayed in the console:
Parse Error: <br> printf ("Decimals: %d %ld\n", 1977, 650000L);
<br> printf ("Preceding with blanks:&nbs
...
y</a></li><p><b>____________________________________________________</b></p>
</ul>
</body>
code: Unexpected character
I've tried to escape html entities with Python's html library:
import html
...
html.escape(string)
But Sublime doesn't seem to see the import: the console reports that I am using a function without defining it, as if the library had never been imported (why?). cgi.escape is deprecated, so I can't use that either. I decided to write the function myself.
Then I saw a very interesting way to replace all the characters whose code is >127 and some other characters (&, <,>) with their numbers:
def escape_html (s):
    out = ""
    i = 0
    while i < len(s):
        c = s[i]
        number = ord(c)
        if number > 127 or c == '"' or c == '\'' or c == '<' or c == '>' or c == '&':
            out += "&#"
            out += str(number)
            out += ";"
        else:
            out += c
        i += 1
    out = out.replace(" ", "&nbsp;")
    out = out.replace("\n", "<br>")
    return out
This code works perfectly for displaying characters in a browser, but unfortunately it is not supported by Sublime Text 3.
As a result, I came to the conclusion that these characters should be replaced with their equivalent names:
def dumb_escape_html(s):
    entities = [["&", "&amp;"], ["<", "&lt;"], [">", "&gt;"], ["\n", "<br>"],
                [" ", "&nbsp;"]]
    for entity in entities:
        s = s.replace(entity[0], entity[1])
    return s
But again I hit an obstacle: not all entity names are supported in Sublime, and I got a Parse Error again.
I have also attached a link to a JSON file with the response from my server, whose content should be displayed in the pop-up window: Example of data from server (codeshare.io)
I really don't understand where my mistake is. I hope someone here knows how to solve this.
Edit. Minimal, Reproducible Example:
import sublime
import sublime_plugin
import string
import sys
import json

def get_func_name(line, column):
    return "printf"

def get_const_data(func_name):
    input_file = open ("PATH_TO_JSON/data_printf.json")
    results = json.load(input_file)
    return results

def dumb_escape_html(s):
    entities = [["&", "&amp;"], ["<", "&lt;"], [">", "&gt;"], ["\n", "<br>"],
                [" ", "&nbsp;"]]
    for entity in entities:
        s = s.replace(entity[0], entity[1])
    return s

def dumb_unescape_html(s):
    entities = [["&lt;", "<"], ["&gt;", ">"], ["<br>", "\n"],
                ["&nbsp;", " "], ["&amp;", "&"]]
    for entity in entities:
        s = s.replace(entity[0], entity[1])
    return s

class CoderecsysCommand(sublime_plugin.TextCommand):
    def run(self, edit):
        v = self.view
        cur_line = v.substr(v.line(v.sel()[0]))
        for sel in v.sel():
            line_begin = v.rowcol(sel.begin())[0]
            line_end = v.rowcol(sel.end())[0]
            pos = v.rowcol(v.sel()[0].begin()) # (row, column)
            try:
                func_name = get_func_name(cur_line, pos[1]-1)
                li_tree = ""
                final_data = get_const_data(func_name)
                for i in range(len(final_data)):
                    source = "source: " + final_data[i]["source"]
                    escaped = dumb_escape_html(final_data[i]["code"])
                    divider = "<b>____________________________________________________</b>"
                    li_tree += "<li><p>%s</p>%s <a href='%s'>Copy</a></li><p>%s</p>" %(source, escaped, escaped, divider)
                # The html to be shown.
                html = """
                <body id=copy-multiline>
                Examples of using <b>%s</b> function.
                <ul>
                %s
                </ul>
                </body>
                """ %(func_name, li_tree)
                self.view.show_popup(html, max_width=700, on_navigate=lambda example: self.copy_example(example, func_name, source))
            except Exception as ex:
                self.view.show_popup("<b style=\"color:#1c87c9\">CodeRec Error:</b> " + str(ex), max_width=700)

    def copy_example(self, example, func_name, source):
        # Copies the code to the clipboard.
        unescaped = dumb_unescape_html(example)
        unescaped = "// " + source + unescaped
        sublime.set_clipboard(unescaped)
        self.view.hide_popup()
        sublime.status_message('Example of using ' + func_name + ' copied to clipboard !')
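One detail of the named-entity approach that is easy to get wrong is the order of the replacements. The sketch below shows an escape/unescape pair that round-trips arbitrary text; escape_for_popup, unescape_from_popup and ESCAPES are hypothetical names, and whether Sublime Text's popup parser accepts each of these entity names is an assumption that has not been verified here.

```python
# Order matters: "&" must be escaped first, otherwise the "&" introduced
# by "&lt;" and friends would be escaped a second time.
ESCAPES = [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"),
           ("\n", "<br>"), (" ", "&nbsp;")]


def escape_for_popup(text):
    """Replace markup-significant characters with named entities."""
    for raw, entity in ESCAPES:
        text = text.replace(raw, entity)
    return text


def unescape_from_popup(markup):
    """Inverse of escape_for_popup."""
    # Undo the replacements in reverse order, so "<br>" is turned back into
    # a newline while literal "<" characters are still encoded as "&lt;".
    for raw, entity in reversed(ESCAPES):
        markup = markup.replace(entity, raw)
    return markup


sample = 'printf("<br> %d && %s\\n", a < b);\n'
escaped = escape_for_popup(sample)
print(escaped)
print(unescape_from_popup(escaped) == sample)
```

Unescaping in the reverse order is what lets C code that itself contains the text "<br>" or "&amp;" survive the round trip.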

Convert list of paths to dictionary in python

I'm making a program in Python where I need to interact with "hypothetical" paths, (aka paths that don't and won't exist on the actual filesystem) and I need to be able to listdir them like normal (path['directory'] would return every item inside the directory like os.listdir()).
The solution I came up with was to convert a list of string paths to a dictionary of dictionaries. I came up with this recursive function (it's inside a class):
def DoMagic(self,paths):
    structure = {}
    if not type(paths) == list:
        raise ValueError('Expected list Value, not '+str(type(paths)))
    for i in paths:
        print(i)
        if i[0] == '/': #Sanity check
            print('trailing?',i) #Inform user that there *might* be an issue with the input.
            i[0] = ''
        i = i.split('/') #Split it, so that we can test against different parts.
        if len(i[1:]) > 1: #Hang-a-bout, there's more content!
            structure = {**structure, **self.DoMagic(['/'.join(i[1:])])}
        else:
            structure[i[1]] = i[1]
But when I go to run it with ['foo/e.txt','foo/bar/a.txt','foo/bar/b.cfg','foo/bar/c/d.txt'] as input, I get:
{'e.txt': 'e.txt', 'a.txt': 'a.txt', 'b.cfg': 'b.cfg', 'd.txt': 'd.txt'}
I want to be able to just path['foo']['bar'] to get everything in the foo/bar/ directory.
Edit:
A more desirable output would be:
{'foo':{'e.txt':'e.txt','bar':{'a.txt':'a.txt','c':{'d.txt':'d.txt'}}}}
Edit (10-14-22): My first answer matches what the OP asks for, but it isn't the ideal approach, nor is its output the cleanest. Since this question seems to come up often, here is a cleaner approach that handles both Unix and Windows paths and produces a more sensible output dictionary.
from pathlib import Path
import json

def get_path_dict(paths: list[str | Path]) -> dict:
    """Builds a tree like structure out of a list of paths"""
    def _recurse(dic: dict, chain: tuple[str, ...] | list[str]):
        if len(chain) == 0:
            return
        if len(chain) == 1:
            dic[chain[0]] = None
            return
        key, *new_chain = chain
        if key not in dic:
            dic[key] = {}
        _recurse(dic[key], new_chain)
        return

    new_path_dict = {}
    for path in paths:
        _recurse(new_path_dict, Path(path).parts)
    return new_path_dict

l1 = ['foo/e.txt', 'foo/bar/a.txt', 'foo/bar/b.cfg', Path('foo/bar/c/d.txt'), 'test.txt']
result = get_path_dict(l1)
print(json.dumps(result, indent=2))
Output:
{
  "foo": {
    "e.txt": null,
    "bar": {
      "a.txt": null,
      "b.cfg": null,
      "c": {
        "d.txt": null
      }
    }
  },
  "test.txt": null
}
Older approach
How about this? It produces your desired output, although a tree structure may be cleaner.
from collections import defaultdict
import json

def nested_dict():
    """
    Creates a default dictionary where each value is another default dictionary.
    """
    return defaultdict(nested_dict)

def default_to_regular(d):
    """
    Converts defaultdicts of defaultdicts to dict of dicts.
    """
    if isinstance(d, defaultdict):
        d = {k: default_to_regular(v) for k, v in d.items()}
    return d

def get_path_dict(paths):
    new_path_dict = nested_dict()
    for path in paths:
        parts = path.split('/')
        if parts:
            marcher = new_path_dict
            for key in parts[:-1]:
                marcher = marcher[key]
            marcher[parts[-1]] = parts[-1]
    return default_to_regular(new_path_dict)

l1 = ['foo/e.txt','foo/bar/a.txt','foo/bar/b.cfg','foo/bar/c/d.txt', 'test.txt']
result = get_path_dict(l1)
print(json.dumps(result, indent=2))
Output:
{
  "foo": {
    "e.txt": "e.txt",
    "bar": {
      "a.txt": "a.txt",
      "b.cfg": "b.cfg",
      "c": {
        "d.txt": "d.txt"
      }
    }
  },
  "test.txt": "test.txt"
}
Wouldn't a simple tree, implemented via dictionaries, suffice?
Your implementation seems a bit redundant, and it is hard to tell at a glance which folder a file belongs to.
https://en.wikipedia.org/wiki/Tree_(data_structure)
There are a lot of libraries on PyPI if you need something extra, for example:
treelib
There are also pure paths in pathlib.
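To get the os.listdir()-like lookup the question asks for, a nested dictionary can be paired with a small walker. The following is a sketch under assumptions rather than a drop-in replacement for the answers above: build_tree and listdir are hypothetical names, paths are assumed to be non-empty, relative and "/"-separated, and files are stored as None.

```python
from pathlib import PurePosixPath


def build_tree(paths):
    """Build a nested dict from "/"-separated paths; files map to None."""
    tree = {}
    for path in paths:
        *dirs, leaf = PurePosixPath(path).parts
        node = tree
        for name in dirs:
            node = node.setdefault(name, {})
        node.setdefault(leaf, None)
    return tree


def listdir(tree, directory=""):
    """Return the sorted entry names inside a hypothetical directory."""
    node = tree
    for part in PurePosixPath(directory).parts:
        node = node[part]  # KeyError if the directory does not exist
    if node is None:
        raise NotADirectoryError(directory)
    return sorted(node)


paths = ['foo/e.txt', 'foo/bar/a.txt', 'foo/bar/b.cfg', 'foo/bar/c/d.txt', 'test.txt']
tree = build_tree(paths)
print(listdir(tree, 'foo/bar'))
print(listdir(tree))
```

Pure paths are used here because they only manipulate strings and never touch the real filesystem, which suits paths that do not exist.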

verifying formatted messages

I'm writing software that analyses its input and returns a result. One requirement is that it generates zero or more warnings or errors and includes them with the result. I'm also writing unit tests which use contrived data to verify that the right warnings are emitted.
I need to be able to parse the warnings/errors and verify that the expected messages are correctly emitted. I figured I'd store the messages in a container and reference them with a message ID which is pretty similar to how I've done localization in the past.
errormessages.py right now looks pretty similar to:
from enum import IntEnum

NO_MESSAGE = ('')
HELLO = ('Hello, World')
GOODBYE = ('Goodbye')

class MsgId(IntEnum):
    NO_MESSAGE = 0
    HELLO = 1
    GOODBYE = 2

Msg = {
    MsgId.NO_MESSAGE: NO_MESSAGE,
    MsgId.HELLO: HELLO,
    MsgId.GOODBYE: GOODBYE,
}
So then the analysis can look similar to this:
from errormessages import Msg, MsgId

def analyse(_):
    errors = []
    errors.append(Msg[MsgId.HELLO])
    return _, errors
And in the unit tests I can do something similar to
from errormessages import Msg, MsgId
from my import analyse

def test_hello():
    _, errors = analyse('toy')
    assert Msg[MsgId.HELLO] in errors
But some of the messages get formatted and I think that's going to play hell with parsing the messages for unit tests. I was thinking I'd add flavors of the messages; one for formatting and the other for parsing:
updated errormessages.py:
from enum import IntEnum
import re

FORMAT_NO_MESSAGE = ('')
FORMAT_HELLO = ('Hello, {}')
FORMAT_GOODBYE = ('Goodbye')
PARSE_NO_MESSAGE = re.compile(r'^$')
PARSE_HELLO = re.compile(r'^Hello, (.*)$')
PARSE_GOODBYE = re.compile('^Goodbye$')

class MsgId(IntEnum):
    NO_MESSAGE = 0
    HELLO = 1
    GOODBYE = 2

Msg = {
    MsgId.NO_MESSAGE: (FORMAT_NO_MESSAGE, PARSE_NO_MESSAGE),
    MsgId.HELLO: (FORMAT_HELLO, PARSE_HELLO),
    MsgId.GOODBYE: (FORMAT_GOODBYE, PARSE_GOODBYE),
}
So then the analysis can look like:
from errormessages import Msg, MsgId

def analyse(_):
    errors = []
    errors.append(Msg[MsgId.HELLO][0].format('World'))
    return _, errors
And in the unit tests I can do:
from errormessages import Msg, MsgId
from my import analyse
import re

def test_hello():
    _, errors = analyse('toy')
    expected = {v: [] for v in MsgId}
    expected[MsgId.HELLO] = [
        Msg[MsgId.HELLO][1].match(msg)
        for msg in errors
    ]
    for _,v in expected.items():
        if _ == MsgId.HELLO:
            assert v
        else:
            assert not v
I was wondering if there's perhaps a better / simpler way? In particular, the messages are effectively repeated twice; once for the formatter and once for the regular expression. Is there a way to use a single string for both formatting and regular expression capturing?
Assuming the messages are all stored as format string templates (e.g. "Hello", or "Hello, {}" or "Hello, {firstname} {surname}"), you could generate the regexes directly from the templates:
import re
import random
import string
from errormessages import Msg, MsgId

def format_string_to_regex(format_string: str) -> re.Pattern:
    """Convert a format string template to a regex."""
    # Temporarily swap every replacement field for a random marker so that
    # re.escape() does not touch it
    unique_string = ''.join(random.choices(string.ascii_letters, k=24))
    stripped_fields = re.sub(r"\{[^\{\}]*\}(?!\})", unique_string, format_string)
    pattern = re.escape(stripped_fields).replace(unique_string, "(.*)")
    # Escaped double braces in the template stand for one literal brace
    pattern = pattern.replace(r"\{\{", r"\{").replace(r"\}\}", r"\}")
    return re.compile(f"^{pattern}$")

def is_error_message(error: str, expected_message: MsgId) -> bool:
    """Returns whether the error plausibly matches the MsgId."""
    expected_format = format_string_to_regex(Msg[expected_message])
    return bool(expected_format.match(error))
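A variation on the same idea, swapped in here as an alternative and not taken from the answer above, is to let the standard library split the template: string.Formatter().parse() yields the literal text and the field name of each segment, so no random marker is needed. The sketch below is a minimal illustration; template_to_regex is a hypothetical name, and it assumes that a named field appears at most once per template (a repeated name would make the regex fail to compile).

```python
import re
from string import Formatter


def template_to_regex(template: str) -> re.Pattern:
    """Build a matching regex from a str.format() template."""
    parts = []
    for literal, field, _spec, _conversion in Formatter().parse(template):
        # Literal text is matched verbatim.
        parts.append(re.escape(literal))
        # field is None when the segment has no replacement field.
        if field is not None:
            if field.isidentifier():
                parts.append(f"(?P<{field}>.*)")  # named field -> named group
            else:
                parts.append("(.*)")  # "{}" or "{0}" -> plain group
    return re.compile("^" + "".join(parts) + "$")


hello = "Hello, {}"
full_name = "Hello, {first} {last}"

print(template_to_regex(hello).match(hello.format("World")).group(1))
print(template_to_regex(full_name).match("Hello, Ada Lovelace").groupdict())
```

With this, each message needs to be stored only once: the template formats the message in the analysis code and produces the pattern in the tests.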

Formatting JSON in Python

What is the simplest way to pretty-print a string of JSON as a string with indentation when the initial JSON string is formatted without extra spaces or line breaks?
Currently I'm running json.loads() and then running json.dumps() with indent=2 on the result. This works, but it feels like I'm throwing a lot of compute down the drain.
Is there a simpler or more efficient (built-in) way to pretty-print a JSON string while keeping it valid JSON?
Example
import requests
import json
response = requests.get('http://spam.eggs/breakfast')
one_line_json = response.content.decode('utf-8')
pretty_json = json.dumps(json.loads(response.content), indent=2)
print(f'Original: {one_line_json}')
print(f'Pretty: {pretty_json}')
Output:
Original: {"breakfast": ["spam", "spam", "eggs"]}
Pretty: {
  "breakfast": [
    "spam",
    "spam",
    "eggs"
  ]
}
json.dumps(obj, indent=2) is better than pprint because:
It is faster with the same load methodology.
It has the same or similar simplicity.
The output will produce valid JSON, whereas pprint will not.
pprint_vs_dumps.py
import cProfile
import json
import pprint
from urllib.request import urlopen

def custom_pretty_print():
    url_to_read = "https://www.cbcmusic.ca/Component/Playlog/GetPlaylog?stationId=96&date=2018-11-05"
    with urlopen(url_to_read) as resp:
        pretty_json = json.dumps(json.load(resp), indent=2)
    print(f'Pretty: {pretty_json}')

def pprint_json():
    url_to_read = "https://www.cbcmusic.ca/Component/Playlog/GetPlaylog?stationId=96&date=2018-11-05"
    with urlopen(url_to_read) as resp:
        info = json.load(resp)
    pprint.pprint(info)

cProfile.run('custom_pretty_print()')
>>> 71027 function calls (42309 primitive calls) in 0.084 seconds
cProfile.run('pprint_json()')
>>>164241 function calls (140121 primitive calls) in 0.208 seconds
Thanks @tobias_k for pointing out my errors along the way.
I think for a true JSON object print, it's probably as good as it gets. timeit(number=10000) for the following took about 5.659214497s:
import json
d = {
    'breakfast': [
        'spam', 'spam', 'eggs',
        {
            'another': 'level',
            'nested': [
                {'a':'b'},
                {'c':'d'}
            ]
        }
    ],
    'foo': True,
    'bar': None
}
s = json.dumps(d)
q = json.dumps(json.loads(s), indent=2)
print(q)
I tried pprint, but it won't print a pure JSON string unless it is first converted to a Python dict, and that loses the valid JSON literals true, null and false, as mentioned in the other answer. It also doesn't retain the order in which the items appeared, so it's not great if order matters for readability.
Just for fun I whipped up the following function:
def pretty_json_for_savages(j, indentor=' '):
    ind_lvl = 0
    temp = ''
    for i, c in enumerate(j):
        if c in '{[':
            print(indentor*ind_lvl + temp.strip() + c)
            ind_lvl += 1
            temp = ''
        elif c in '}]':
            print(indentor*ind_lvl + temp.strip() + '\n' + indentor*(ind_lvl-1) + c, end='')
            ind_lvl -= 1
            temp = ''
        elif c in ',':
            print(indentor*(0 if j[i-1] in '{}[]' else ind_lvl) + temp.strip() + c)
            temp = ''
        else:
            temp += c
    print('')
# {
# "breakfast":[
# "spam",
# "spam",
# "eggs",
# {
# "another": "level",
# "nested":[
# {
# "a": "b"
# },
# {
# "c": "d"
# }
# ]
# }
# ],
# "foo": true,
# "bar": null
# }
It prints reasonably well, and unsurprisingly it took a whopping 16.701202023s to run in timeit(number=10000), which is about 3 times as long as json.dumps(json.loads()). It's probably not worthwhile to build your own function for this unless you spend some time optimizing it, and given the lack of a builtin for the same job, it's probably best to stick with what you have, since further effort will most likely give diminishing returns.
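For reference, the loads/dumps round trip that both answers end up recommending fits in a one-line helper. This is a minimal sketch, with pretty_json as a hypothetical name; the last lines check that the output is still valid JSON and that compacting it again gives back the original string.

```python
import json


def pretty_json(text, indent=2):
    """Re-serialise a JSON string with indentation (hypothetical helper)."""
    return json.dumps(json.loads(text), indent=indent)


one_line_json = '{"breakfast":["spam","spam","eggs"],"foo":true,"bar":null}'
pretty = pretty_json(one_line_json)
print(pretty)

# The result is still valid JSON: true and null survive, and key order is kept.
compact_again = json.dumps(json.loads(pretty), separators=(",", ":"))
print(compact_again == one_line_json)
```

Because the text goes through the json module both ways, the literals true, false and null are preserved, which is the property pprint does not give you.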
