In Python, how to parse a complex JSON file

I want to get "path" from the JSON file below. I used json.load to read the JSON file and then parsed it level by level with for key, value in data.items(), which leads to a lot of nested for loops (six of them) just to reach the value of "path". Is there a simpler method to retrieve the value of path?
The complete JSON file can be found here; below is a snippet of it.
{
    "products": {
        "com.ubuntu.juju:12.04:amd64": {
            "version": "2.0.1",
            "arch": "amd64",
            "versions": {
                "20161129": {
                    "items": {
                        "2.0.1-precise-amd64": {
                            "release": "precise",
                            "version": "2.0.1",
                            "arch": "amd64",
                            "size": 23525972,
                            "path": "released/juju-2.0.1-precise-amd64.tgz",
                            "ftype": "tar.gz",
                            "sha256": "f548ac7b2a81d15f066674365657d3681e3d46bf797263c02e883335d24b5cda"
                        }
                    }
                }
            }
        },
        "com.ubuntu.juju:14.04:amd64": {
            "version": "2.0.1",
            "arch": "amd64",
            "versions": {
                "20161129": {
                    "items": {
                        "2.0.1-trusty-amd64": {
                            "release": "trusty",
                            "version": "2.0.1",
                            "arch": "amd64",
                            "size": 23526508,
                            "path": "released/juju-2.0.1-trusty-amd64.tgz",
                            "ftype": "tar.gz",
                            "sha256": "7b86875234477e7a59813bc2076a7c1b5f1d693b8e1f2691cca6643a2b0dc0a2"
                        }
                    }
                }
            }
        },

You can use a recursive generator:

def get_paths(data):
    # Yield the path at this level, then recurse into nested dicts.
    if 'path' in data:
        yield data['path']
    for value in data.values():
        if isinstance(value, dict):
            yield from get_paths(value)

for path in get_paths(json_data):  # json_data is the loaded JSON
    print(path)

Is the path key always at the same depth in the loaded JSON (which is a dict)? If so, what about doing:

products = loaded_json['products']
for product in products.values():
    for version in product['versions'].values():
        for item in version['items'].values():
            print(item['path'])

If not, the answer of Yevhen Kuzmovych is clearly better, cleaner and more general than mine.

If you only care about the path, I think using any JSON parser is overkill; you can just use the built-in re module with a pattern like (\"path\":\s*\")(.*\s*)(?=\",). I didn't test it against the whole file, but you should be able to figure out the best pattern fairly easily.
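A minimal sketch of that idea, using a slightly simpler pattern than the one quoted above (the single capture group keeps just the value between the quotes):

import re

with open('file.json') as f:
    text = f.read()

# Grab every "path": "..." value without parsing the JSON structure.
paths = re.findall(r'"path"\s*:\s*"([^"]*)"', text)
print(paths)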

If you only need the file names present in the path field, you can easily get them by simply scanning the file:

import re

files = []
pathre = re.compile(r'\s*"path"\s*:\s*"(.*?)"')
with open('file.json') as fd:
    for line in fd:
        if "path" in line:
            m = pathre.match(line)
            if m is not None:
                files.append(m.group(1))
If you need to process the path and sha256 fields together:

files = []
pathre = re.compile(r'\s*"path"\s*:\s*"(.*?)"')
share = re.compile(r'\s*"sha256"\s*:\s*"(.*?)"')
path = None
with open('file.json') as fd:
    for line in fd:
        if "path" in line:
            m = pathre.match(line)
            path = m.group(1)
        elif "sha256" in line:
            m = share.match(line)
            if path is not None:
                files.append((path, m.group(1)))
            path = None

You can use a query language like JSONPath. A Python implementation is available here: https://pypi.python.org/pypi/jsonpath-rw
Assuming you have your JSON content already loaded, you can do something like the following:
from jsonpath_rw import jsonpath, parse
# Load your JSON content first from a file or from a string
# json_data = ...
jsonpath_expr = parse('products..path')
for match in jsonpath_expr.find(json_data):
    print(match.value)
For a further discussion you can read this: Is there a query language for JSON?

Related

How to search inside json file with redis?

Here I set a JSON object inside a key in Redis. Later I want to perform a search on the JSON stored in Redis. My search key will always be a JSON string like in the example below, and I want to match it inside the stored JSON.
Currently I am doing this by iterating and comparing; instead I want to do it with Redis. How can I do it?
import json
import os

import redis

rd = redis.StrictRedis(host="localhost", port=6379, db=0)
if not rd.get("mykey"):
    with open(os.path.join(BASE_DIR, "my_file.json")) as fl:
        data = json.load(fl)
    rd.set("mykey", json.dumps(data))
else:
    key_values = json.loads(rd.get("mykey"))

search_json_key = {
    "key": "value",
    "key2": {
        "key": "val"
    }
}

# Here I am searching by iterating and comparing; instead I want to do it with Redis.
for i in key_values['all_data']:
    if json.dumps(i) == json.dumps(search_json_key):
        # return
# mykey format looks like this:
{
    "all_data": [
        {
            "key": "value",
            "key2": {
                "key": "val"
            }
        },
        {
            "key": "value",
            "key2": {
                "key": "val"
            }
        },
        {
            "key": "value",
            "key2": {
                "key": "val"
            }
        }
    ]
}
To search JSON with Redis you have two options: you can use the FT.CREATE command to create an index and then query it with FT.SEARCH (while the documentation for both shows the CLI syntax, you can call rd.ft().create_index() / rd.ft().search() from your Python script), or you can check out the Redis OM client for Python, which will take care of that to some extent for you.
Either way you'll have to do a bit of rework to fully take advantage of Redis' search capabilities.
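A minimal sketch of the first option, assuming Redis Stack (the RedisJSON and RediSearch modules) and redis-py 4.x; the index name myidx and the doc: key prefix are made up for the example:

import redis
from redis.commands.search.field import TextField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

rd = redis.Redis(host="localhost", port=6379, db=0)

# Index the "key" field of every JSON document stored under the doc: prefix.
rd.ft("myidx").create_index(
    (TextField("$.key", as_name="key"),),
    definition=IndexDefinition(prefix=["doc:"], index_type=IndexType.JSON),
)

# Store each element of all_data as its own JSON document instead of one blob.
rd.json().set("doc:1", "$", {"key": "value", "key2": {"key": "val"}})

# Let Redis do the matching instead of iterating in Python.
results = rd.ft("myidx").search(Query("@key:value"))
print(results.docs)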

Reading json in python separated by newlines

I am trying to read some json with the following format. A simple pd.read_json() returns ValueError: Trailing data. Adding lines=True returns ValueError: Expected object or value. I've tried various combinations of readlines() and load()/loads() so far without success.
Any ideas how I could get this into a dataframe?
{
    "content": "kdjfsfkjlffsdkj",
    "source": {
        "name": "jfkldsjf"
    },
    "title": "dsldkjfslj",
    "url": "vkljfklgjkdlgj"
}
{
    "content": "djlskgfdklgjkfgj",
    "source": {
        "name": "ldfjkdfjs"
    },
    "title": "lfsjdfklfldsjf",
    "url": "lkjlfggdflkjgdlf"
}
The sample you have above isn't valid JSON. To be valid JSON these objects need to be wrapped in an array ([]) and comma separated, as follows:
[{
    "content": "kdjfsfkjlffsdkj",
    "source": {
        "name": "jfkldsjf"
    },
    "title": "dsldkjfslj",
    "url": "vkljfklgjkdlgj"
},
{
    "content": "djlskgfdklgjkfgj",
    "source": {
        "name": "ldfjkdfjs"
    },
    "title": "lfsjdfklfldsjf",
    "url": "lkjlfggdflkjgdlf"
}]
I just tried it on my machine. When formatted correctly, it works:
>>> pd.read_json('data.json')
content source title url
0 kdjfsfkjlffsdkj {'name': 'jfkldsjf'} dsldkjfslj vkljfklgjkdlgj
1 djlskgfdklgjkfgj {'name': 'ldfjkdfjs'} lfsjdfklfldsjf lkjlfggdflkjgdlf
Another solution, if you do not want to reformat your files: assuming your JSON is in a string called my_json, with consecutive objects separated by a blank line, you could do:
import json
import pandas as pd
splitted = my_json.split('\n\n')
my_list = [json.loads(e) for e in splitted]
df = pd.DataFrame(my_list)
Thanks for the ideas, internet. None quite solved the problem in the way I needed (I had lots of newline characters inside the strings themselves, which meant I couldn't split on them), but they helped point the way. In case anyone has a similar problem, this is what worked for me:
import json
import pandas as pd

with open('path/to/original.json', 'r') as f:
    data = f.read()

# Split on the closing brace that ends each record, then repair the braces.
data = data.split("}\n")
data = [d.strip() + "}" for d in data]
data = list(filter(("}").__ne__, data))
data = [json.loads(d) for d in data]

with open('path/to/reformatted.json', 'w') as f:
    json.dump(data, f)

df = pd.read_json('path/to/reformatted.json')
If you can use jq, the solution is simpler:
jq -s '.' path/to/original.json > path/to/reformatted.json
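Another option that copes with newlines inside the string values themselves (where splitting on "}\n" can still go wrong) is json.JSONDecoder.raw_decode, which consumes one JSON object at a time; a minimal sketch:

import json
import pandas as pd

def iter_json_objects(text):
    decoder = json.JSONDecoder()
    pos, n = 0, len(text)
    while pos < n:
        # raw_decode does not tolerate leading whitespace, so skip it first.
        while pos < n and text[pos].isspace():
            pos += 1
        if pos >= n:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

with open('path/to/original.json') as f:
    df = pd.DataFrame(list(iter_json_objects(f.read())))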

Python script to surgically replace one JSON field by another

At the moment I'm working with a large set of JSON files of the following form:
File00, at time T1:
{
    "AAA": {
        "BBB": {
            "000": "value0"
        },
        "CCC": {
            "111": "value1",
            "222": "value2",
            "333": "value3"
        },
        "DDD": {
            "444": "value4"
        }
    }
}
Now I have new input for the sub-field "DDD", and I'd like to replace it wholesale with the following:

"DDD": {
    "666": "value6",
    "007": "value13"
}
Accordingly the file would be changed to:
File00, at time T2:
{
    "AAA": {
        "BBB": {
            "000": "value0"
        },
        "CCC": {
            "111": "value1",
            "222": "value2",
            "333": "value3"
        },
        "DDD": {
            "666": "value6",
            "007": "value13"
        }
    }
}
In the situation I'm confronted with, there are many files similar to File00, so I'm trying to create a script that can process all the files in a particular directory, identify the JSON field DDD, and replace its contents with something new.
How can I do this in Python?
Here are the steps I took for each file:
Read the json
Convert it to a Python dict
Edit the Python dict
Convert it back to json
Write it to the file
Repeat
Here is my code:
import json

# List of files.
fileList = ["file1.json", "file2.json", "file3.json"]

for jsonFile in fileList:
    # Open and read the file, then convert the JSON to a Python dictionary.
    with open(jsonFile, "r") as f:
        jsonData = json.loads(f.read())

    # Edit the Python dictionary.
    jsonData["AAA"]["DDD"] = {"666": "value6", "007": "value13"}

    # Convert the Python dictionary back to JSON and write it to the file.
    with open(jsonFile, "w") as f:
        json.dump(jsonData, f)
Also, I got code for iterating through a directory from here. You probably want something like this:

import os

directory = os.fsencode(directory_in_str)

fileList = []
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith(".json"):
        fileList.append(filename)
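A shorter equivalent, if you prefer pathlib (directory_in_str is assumed to be defined as in the snippet above):

from pathlib import Path

# Collect every .json file in the directory.
fileList = [str(p) for p in Path(directory_in_str).glob("*.json")]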

Parse a puppet manifest with Python

I have a puppet manifest file, init.pp, for my puppet module.
In this file there are parameters for the class, and in most cases they're written in the same way:
Example Input:
class test_module(
  $first_param = 'test',
  $second_param = 'new' )
What is the best way that I can parse this file with Python and get a dict object like this, which includes all the class parameters?
Example output:
param_dict = {'first_param':'test', 'second_param':'new'}
Thanks in Advance :)
Puppet Strings is a Ruby gem that can be installed on top of Puppet and can output a JSON document containing lists of the class parameters, documentation, etc.
After installing it (see the link above), run this command either in a shell or from your Python program to generate the JSON:
puppet strings generate --emit-json-stdout init.pp
This will generate:
{
    "puppet_classes": [
        {
            "name": "test_module",
            "file": "init.pp",
            "line": 1,
            "docstring": {
                "text": "",
                "tags": [
                    {
                        "tag_name": "param",
                        "text": "",
                        "types": [
                            "Any"
                        ],
                        "name": "first_param"
                    },
                    {
                        "tag_name": "param",
                        "text": "",
                        "types": [
                            "Any"
                        ],
                        "name": "second_param"
                    }
                ]
            },
            "defaults": {
                "first_param": "'test'",
                "second_param": "'new'"
            },
            "source": "class test_module(\n $first_param = 'test',\n $second_param = 'new' ) {\n}"
        }
    ]
}
(JSON trimmed slightly for brevity)
You can load the JSON in Python with json.loads, and extract the parameter names from root["puppet_classes"][0]["docstring"]["tags"] (the entries where tag_name is param) and any default values from root["puppet_classes"][0]["defaults"].
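A minimal sketch of that extraction, assuming puppet strings is on your PATH; the .strip("'") drops the single quotes that Puppet Strings keeps around the default values:

import json
import subprocess

# Run puppet strings and capture the JSON it prints to stdout.
out = subprocess.run(
    ["puppet", "strings", "generate", "--emit-json-stdout", "init.pp"],
    capture_output=True, text=True, check=True,
).stdout
root = json.loads(out)

klass = root["puppet_classes"][0]
params = [t["name"] for t in klass["docstring"]["tags"] if t["tag_name"] == "param"]
defaults = klass.get("defaults", {})
param_dict = {name: defaults.get(name, "").strip("'") for name in params}
print(param_dict)  # {'first_param': 'test', 'second_param': 'new'}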
You can use a regular expression (straightforward but fragile):

import re

def parse(data):
    # re.DOTALL lets .*? span the newlines between the parameters.
    mm = re.search(r'\((.*?)\)', data, re.DOTALL)
    dd = {}
    if not mm:
        return dd
    matches = re.finditer(r"\s*\$(.*?)\s*=\s*'(.*?)'", mm.group(1), re.MULTILINE)
    for mm in matches:
        dd[mm.group(1)] = mm.group(2)
    return dd
You can use it as follows:

import codecs

with codecs.open(filename, 'r') as ff:
    dd = parse(ff.read())
I don't know about the "best" way, but one way would be:
1) Set up Rspec-puppet (see google or my blog post for how to do that).
2) Compile your code and generate a Puppet catalog. See my other blog post for that.
Now, the Puppet catalog you compiled is a JSON document.
3) Visually inspect the JSON document to find the data you are looking for. Its precise location in the JSON document depends on the version of Puppet you are using.
4) You can now use Python to extract the data as a dictionary from the JSON document.
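A minimal sketch of step 4, assuming the compiled catalog was saved to catalog.json (a hypothetical name; where the class parameters sit inside it depends on your Puppet version):

import json

with open('catalog.json') as f:
    catalog = json.load(f)

# Inspect the top-level structure first to locate the class parameters.
print(list(catalog.keys()))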

How do I read a JSON file into Python?

I'm new to JSON and Python; any help on this would be greatly appreciated.
I read about json.loads but am confused.
How do I read a file into Python using json.loads?
Below is my JSON file format:
{
    "header": {
        "platform":"atm"
        "version":"2.0"
    }
    "details":[
        {
            "abc":"3"
            "def":"4"
        },
        {
            "abc":"5"
            "def":"6"
        },
        {
            "abc":"7"
            "def":"8"
        }
    ]
}
My requirement is to read the values of all "abc" and "def" entries in details and add them to a new list like this: [(1,2),(3,4),(5,6),(7,8)]. The new list will be used to create a Spark data frame.
Open the file, and get a filehandle:
fh = open('thefile.json')
https://docs.python.org/2/library/functions.html#open
Then pass the file handle into json.load() (don't use loads - that's for strings):
import json
data = json.load(fh)
https://docs.python.org/2/library/json.html#json.load
From there, you can easily deal with a python dictionary that represents your json-encoded data.
new_list = [(detail['abc'], detail['def']) for detail in data['details']]
Note that your JSON format is also wrong. You will need comma delimiters in many places, but that's not the question.
I'm trying to understand your question as best I can, but it looks like it was formatted poorly.
First off, your JSON blob is not valid JSON; it is missing quite a few commas. This is probably what you are looking for:
{
    "header": {
        "platform": "atm",
        "version": "2.0"
    },
    "details": [
        {
            "abc": "3",
            "def": "4"
        },
        {
            "abc": "5",
            "def": "6"
        },
        {
            "abc": "7",
            "def": "8"
        }
    ]
}
Now, assuming you are trying to parse this in Python, you will have to do the following:

import json

json_blob = '{"header": {"platform": "atm","version": "2.0"},"details": [{"abc": "3","def": "4"},{"abc": "5","def": "6"},{"abc": "7","def": "8"}]}'
json_obj = json.loads(json_blob)

final_list = []
for single in json_obj['details']:
    final_list.append((int(single['abc']), int(single['def'])))

print(final_list)

This will print the following: [(3, 4), (5, 6), (7, 8)]
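The question mentions that the list will be used to create a Spark data frame; a minimal sketch of that last step, assuming a running SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The column names are made up for the example.
df = spark.createDataFrame(final_list, ["abc", "def"])
df.show()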
