remove duplicates data from complex json file in python


I have a complex JSON file that contains nested dicts.
It looks like this:
{
"objectivelist": [{
"measureid": "1122",
"gradeID": "4222332",
"graduationdate": "May",
"system": {
"platform": "MAC",
"TeacherName": "Mike",
"manager": "Jim",
"studentinfomation": {
"ZIP": "94122",
"city": "SF"
}
},
"measureid": "1122",
"gradeID": "4222332",
"graduationdate": "May",
"system": {
"platform": "MAC",
"TeacherName": "joshe",
"manager": "steven"
},
"studentinfomation": {
"ZIP": "94122",
"city": "SF"
}
}]
}
Here the gradeID and measureid are the same, so the result should show that record only once. My result should look like this:
{"measureid":"1122","gradeID":"4222332","graduationdate":"May"}
I do not need the manager name, teacher name, etc.
I am not sure how to do this. I tried to use a comprehension but do not know how to use it with a nested dictionary.
Thank you.

Depending on how large the JSON file is, you may need a more efficient solution. Here we hash the fields that are of interest and build the list of unique records iteratively.
import hashlib

check_set = set()
output = []
interesting_fields = ['measureid', 'gradeID', 'graduationdate']
# X is the dict obtained by loading the JSON file
for dat in X['objectivelist']:
    m = hashlib.md5()
    for key in interesting_fields:
        m.update(dat[key].encode('utf-8'))
    digest = m.hexdigest()
    # keep only the first record seen for each digest
    if digest not in check_set:
        output.append({key: dat[key] for key in interesting_fields})
        check_set.add(digest)
The deduplicated records end up in output.
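As a possible variation on the hashing idea, the sketch below uses a tuple of the field values as the set key instead of an MD5 digest. This is not the answer's method, only an alternative under the assumption that the field values are hashable (plain strings here); it also avoids the case where concatenated values such as "ab" + "c" and "a" + "bc" would produce the same digest. The function name dedupe_records and the sample data are made up for illustration.

```python
def dedupe_records(records, fields):
    """Return one trimmed-down dict per distinct combination of `fields`."""
    seen = set()
    unique = []
    for record in records:
        # A tuple of the field values is hashable, so it can serve as the set key.
        key = tuple(record.get(field) for field in fields)
        if key not in seen:
            seen.add(key)
            unique.append(dict(zip(fields, key)))
    return unique


# Hypothetical sample shaped like the question's data, written as two records.
sample = {
    "objectivelist": [
        {"measureid": "1122", "gradeID": "4222332", "graduationdate": "May",
         "system": {"platform": "MAC", "manager": "Jim"}},
        {"measureid": "1122", "gradeID": "4222332", "graduationdate": "May",
         "system": {"platform": "MAC", "manager": "steven"}},
    ]
}
fields = ["measureid", "gradeID", "graduationdate"]
result = dedupe_records(sample["objectivelist"], fields)
print(result)
```

With this sample, result holds a single dict containing only the three fields of interest.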

Related

Automatically entering next JSON level using Python in a similar way to JQ in bash

I am trying to use Python to extract pricePerUnit from JSON. There are many entries, and this is just 2 of them -
{
"terms": {
"OnDemand": {
"7Y9ZZ3FXWPC86CZY": {
"7Y9ZZ3FXWPC86CZY.JRTCKXETXF": {
"offerTermCode": "JRTCKXETXF",
"sku": "7Y9ZZ3FXWPC86CZY",
"effectiveDate": "2020-11-01T00:00:00Z",
"priceDimensions": {
"7Y9ZZ3FXWPC86CZY.JRTCKXETXF.6YS6EN2CT7": {
"rateCode": "7Y9ZZ3FXWPC86CZY.JRTCKXETXF.6YS6EN2CT7",
"description": "Processed translation request in AWS GovCloud (US)",
"beginRange": "0",
"endRange": "Inf",
"unit": "Character",
"pricePerUnit": {
"USD": "0.0000150000"
},
"appliesTo": []
}
},
"termAttributes": {}
}
},
"CQNY8UFVUNQQYYV4": {
"CQNY8UFVUNQQYYV4.JRTCKXETXF": {
"offerTermCode": "JRTCKXETXF",
"sku": "CQNY8UFVUNQQYYV4",
"effectiveDate": "2020-11-01T00:00:00Z",
"priceDimensions": {
"CQNY8UFVUNQQYYV4.JRTCKXETXF.6YS6EN2CT7": {
"rateCode": "CQNY8UFVUNQQYYV4.JRTCKXETXF.6YS6EN2CT7",
"description": "$0.000015 per Character for TextTranslationJob:TextTranslationJob in EU (London)",
"beginRange": "0",
"endRange": "Inf",
"unit": "Character",
"pricePerUnit": {
"USD": "0.0000150000"
},
"appliesTo": []
}
},
"termAttributes": {}
}
}
}
}
}
The issue I run into is that the keys, which in this sample are 7Y9ZZ3FXWPC86CZY, CQNY8UFVUNQQYYV4.JRTCKXETXF, and CQNY8UFVUNQQYYV4.JRTCKXETXF.6YS6EN2CT7, are changing strings that I cannot just type out while parsing the dictionary.
I have python code that works for the first level of these random keys -
import json

with open('index.json') as json_file:
    data = json.load(json_file)
json_keys = list(data['terms']['OnDemand'].keys())
# Get the region
for i in json_keys:
    print(data['terms']['OnDemand'][i])
However, this is tedious, as I would need to run the same code three times to get the other keys like 7Y9ZZ3FXWPC86CZY.JRTCKXETXF and 7Y9ZZ3FXWPC86CZY.JRTCKXETXF.6YS6EN2CT7, since the string changes with each JSON entry.
Is there a way that I can just tell python to automatically enter the next level of the JSON object, without having to parse all keys, save them, and then iterate through them? Using JQ in bash I can do this quite easily with jq -r '.terms[][][]'.

If you are really sure that there is exactly one key-value pair on each level, you can try the following:
def descend(x, depth):
    # follow the first (only) value at each level, depth times
    for i in range(depth):
        x = next(iter(x.values()))
    return x

You can use dict.values() to iterate over the values of a dict. You can also use next(iter(dict.values())) to get the first (or only) value of a dict.
for demand in data['terms']['OnDemand'].values():
    next_level = next(iter(demand.values()))
    print(next_level)
If you expect the second level to have a number of children other than 1, you can just nest the for loops:
for demand in data['terms']['OnDemand'].values():
    for sub_demand in demand.values():
        print(sub_demand)
If you are interested in the keys too, you can use the dict.items() method to iterate over keys and values at the same time:
for demand_key, demand in data['terms']['OnDemand'].items():
    for sub_demand_key, sub_demand in demand.items():
        print(demand_key, sub_demand_key, sub_demand)
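The nested loops above can be generalized into a small recursive generator that visits every value a fixed number of levels down, roughly what jq's .terms[][][] does. This is only a sketch: iter_level is a made-up helper name, the sample data is a shortened stand-in for the pricing file, and it assumes every node above the target depth is a dict.

```python
def iter_level(node, depth):
    """Yield every value found `depth` dict levels below `node`."""
    if depth == 0:
        yield node
        return
    for child in node.values():
        yield from iter_level(child, depth - 1)


# Shortened, hypothetical stand-in for the pricing JSON in the question.
data = {"terms": {"OnDemand": {
    "SKU1": {"SKU1.T1": {"priceDimensions": {
        "SKU1.T1.R1": {"pricePerUnit": {"USD": "0.0000150000"}}}}},
    "SKU2": {"SKU2.T1": {"priceDimensions": {
        "SKU2.T1.R1": {"pricePerUnit": {"USD": "0.0000200000"}}}}},
}}}

prices = []
# Three levels below "terms": OnDemand -> sku -> offer term.
for offer in iter_level(data["terms"], 3):
    for dimension in offer["priceDimensions"].values():
        prices.append(dimension["pricePerUnit"]["USD"])
print(prices)
```

Because it is a generator, nothing is collected until the caller iterates, so no intermediate key lists need to be saved.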

Pull key from json file when values is known (groovy or python)

Is there any way to pull the key from JSON if the only thing I know is the value? (In groovy or python)
An example:
I know the "_number" value and I need a key.
So let's say, known _number is 2 and as an output, I should get dsf34f43f34f34f
{
"id": "8e37ecadf4908f79d58080e6ddbc",
"project": "some_project",
"branch": "master",
"current_revision": "3rtgfgdfg2fdsf",
"revisions": {
"43g5g534534rf34f43f": {
"_number": 3,
"created": "2019-04-16 09:03:07.459000000",
"uploader": {
"_account_id": 4
},
"description": "Rebase"
},
"dsf34f43f34f34f": {
"_number": 2,
"created": "2019-04-02 10:54:14.682000000",
"uploader": {
"_account_id": 2
},
"description": "Rebase"
}
}
}

With Groovy:
def json = new groovy.json.JsonSlurper().parse("x.json" as File)
println(json.revisions.findResult{ it.value._number==2 ? it.key : null })
// => dsf34f43f34f34f

Python 3: (assuming that data is saved in data.json):
import json
with open('data.json') as f:
    json_data = json.load(f)
for rev, revdata in json_data['revisions'].items():
    if revdata['_number'] == 2:
        print(rev)
Prints all revs where _number equals 2.

Using a set comprehension (d is the dict loaded from the JSON):
print({k for k,v in d['revisions'].items() if v.get('_number') == 2})
OUTPUT:
{'dsf34f43f34f34f'}
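If only one matching key is expected, a generator passed to next() with a default returns that key directly instead of a set. This is a sketch with a made-up helper name (find_revision_key) and a trimmed copy of the question's data; it assumes "first match wins" is acceptable when several revisions share a number.

```python
def find_revision_key(doc, number):
    """Return the first revision key whose _number matches, or None."""
    return next(
        (key for key, rev in doc["revisions"].items()
         if rev.get("_number") == number),
        None,  # default when no revision matches
    )


# Trimmed copy of the JSON from the question.
doc = {
    "current_revision": "3rtgfgdfg2fdsf",
    "revisions": {
        "43g5g534534rf34f43f": {"_number": 3, "description": "Rebase"},
        "dsf34f43f34f34f": {"_number": 2, "description": "Rebase"},
    },
}
found = find_revision_key(doc, 2)
missing = find_revision_key(doc, 99)
print(found, missing)
```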

Using .values() with list of dictionaries?

I'm comparing json files between two different API endpoints to see which json records need an update, which need a create and what needs a delete. So, by comparing the two json files, I want to end up with three json files, one for each operation.
The json at both endpoints is structured like this (but they use different keys for same sets of values; different problem):
{
"records": [{
"id": "id-value-here",
"c": {
"d": "eee"
},
"f": {
"l": "last",
"f": "first"
},
"g": ["100", "89", "9831", "09112", "800"]
}, {
…
}]
}
So the json is represented as a list of dictionaries (with further nested lists and dictionaries).
If a given json endpoint (j1) id value ("id":) exists in the other endpoint json (j2), then that record should be added to j_update.
So far I have something like this, but I can see that .values() doesn't work because it's trying to operate on the list instead of on all the listed dictionaries(?):
j_update = {r for r in j1['records'] if r['id'] in
j2.values()}
This doesn't return an error, but it creates an empty set using test json files.
This seems like it should be simple, but I think I am tripping over the nesting of dictionaries inside a list. Do I need to flatten j2, or is there a simpler dictionary method in Python to achieve this?
====edit j1 and j2====
have same structure, use different keys; toy data
j1
{
"records": [{
"field_5": 2329309841,
"field_12": {
"email": "cmix#etest.com"
},
"field_20": {
"last": "Mixalona",
"first": "Clara"
},
"field_28": ["9002329309999", "9002329309112"],
"field_44": ["1002329309832"]
}, {
"field_5": 2329309831,
"field_12": {
"email": "mherbitz345#test.com"
},
"field_20": {
"last": "Herbitz",
"first": "Michael"
},
"field_28": ["9002329309831", "9002329309112", "8002329309999"],
"field_44": ["1002329309832"]
}, {
"field_5": 2329309855,
"field_12": {
"email": "nkatamaran#test.com"
},
"field_20": {
"first": "Noriss",
"last": "Katamaran"
},
"field_28": ["9002329309111", "8002329309112"],
"field_44": ["1002329309877"]
}]
}
j2
{
"records": [{
"id": 2329309831,
"email": {
"email": "mherbitz345#test.com"
},
"name_primary": {
"last": "Herbitz",
"first": "Michael"
},
"assign": ["8003329309831", "8007329309789"],
"hr_id": ["1002329309877"]
}, {
"id": 2329309884,
"email": {
"email": "yinleeshu#test.com"
},
"name_primary": {
"last": "Lee Shu",
"first": "Yin"
},
"assign": ["8002329309111", "9003329309831", "9002329309111", "8002329309999", "8002329309112"],
"hr_id": ["1002329309832"]
}, {
"id": 23293098338,
"email": {
"email": "amlouis#test.com"
},
"name_primary": {
"last": "Maxwell Louis",
"first": "Albert"
},
"assign": ["8002329309111", "8007329309789", "9003329309831", "8002329309999", "8002329309112"],
"hr_id": ["1002329309877"]
}]
}

If you read the json it will output a dict. You are looking for a particular key in the list of the values.
if 'records' in j2:
    r = j2['records'][0].get('id', [])  # defaults to [] if id does not exist
A recursive search would be prettier, but I don't know how your data is organized, so I can't quickly come up with an exact solution.
To give an idea of a recursive search, consider this example:
def recursiveSearch(dictionary, target):
    if target in dictionary:
        return dictionary[target]
    for key, value in dictionary.items():
        if isinstance(value, dict):
            # store the result separately so that target is not overwritten
            found = recursiveSearch(value, target)
            if found is not None:
                return found
    return None

a = {'test' : 'b', 'test1' : dict(x = dict(z = 3), y = 2)}
print(recursiveSearch(a, 'z'))

You tried:
j_update = {r for r in j1['records'] if r['id'] in j2.values()}
Aside from the r['id'] versus r['field_5'] key mismatch, you have:
>>> list(j2.values())
[[{'id': 2329309831, ...}, ...]]
The ids are buried inside a list and a dict, so the test r['id'] in j2.values() always returns False.
The basic solution is fairly simple.
First, create a set of j2 ids:
>>> present_in_j2 = set(record["id"] for record in j2["records"])
Then, rebuild the JSON structure of j1, keeping only the records whose field_5 is present in j2:
>>> {"records":[record for record in j1["records"] if record["field_5"] in present_in_j2]}
{'records': [{'field_5': 2329309831, 'field_12': {'email': 'mherbitz345#test.com'}, 'field_20': {'last': 'Herbitz', 'first': 'Michael'}, 'field_28': ['9002329309831', '9002329309112', '8002329309999'], 'field_44': ['1002329309832']}]}
It works, but it's not totally satisfying because of the weird keys of j1. Let's try to convert j1 to a more friendly format:
def map_keys(json_value, conversion_table):
    """Map the keys of a json value
    This is a recursive DFS"""
    def map_keys_aux(json_value):
        """Capture the conversion table"""
        if isinstance(json_value, list):
            return [map_keys_aux(v) for v in json_value]
        elif isinstance(json_value, dict):
            return {conversion_table.get(k, k):map_keys_aux(v) for k,v in json_value.items()}
        else:
            return json_value
    return map_keys_aux(json_value)
The function focuses on dictionary keys: conversion_table.get(k, k) is conversion_table[k] if the key is present in the conversion table, or the key itself otherwise.
>>> j1toj2 = {"field_5":"id", "field_12":"email", "field_20":"name_primary", "field_28":"assign", "field_44":"hr_id"}
>>> mapped_j1 = map_keys(j1, j1toj2)
Now, the code is cleaner and the output may be more useful for a PUT:
>>> d1 = {record["id"]:record for record in mapped_j1["records"]}
>>> present_in_j2 = set(record["id"] for record in j2["records"])
>>> {"records":[record for record in mapped_j1["records"] if record["id"] in present_in_j2]}
{'records': [{'id': 2329309831, 'email': {'email': 'mherbitz345#test.com'}, 'name_primary': {'last': 'Herbitz', 'first': 'Michael'}, 'assign': ['9002329309831', '9002329309112', '8002329309999'], 'hr_id': ['1002329309832']}]}
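The same id-set idea can be extended to the three buckets the question asks for (create, update, delete). The sketch below is one possible reading, not something stated in the question: it assumes the first endpoint is the source of truth, that both record lists already use the same "id" key (for example after the key mapping above), and that ids are unique. split_by_id is a made-up name and the records are reduced to ids from the toy data.

```python
def split_by_id(source_records, target_records, key="id"):
    """Split records into (create, update, delete) lists by comparing ids."""
    source = {record[key]: record for record in source_records}
    target = {record[key]: record for record in target_records}
    create = [rec for rid, rec in source.items() if rid not in target]
    update = [rec for rid, rec in source.items() if rid in target]
    delete = [rec for rid, rec in target.items() if rid not in source]
    return create, update, delete


# Records reduced to their ids; values taken from the question's toy data.
mapped_j1_records = [{"id": 2329309841}, {"id": 2329309831}, {"id": 2329309855}]
j2_records = [{"id": 2329309831}, {"id": 2329309884}, {"id": 23293098338}]

create, update, delete = split_by_id(mapped_j1_records, j2_records)
print([r["id"] for r in create])
print([r["id"] for r in update])
print([r["id"] for r in delete])
```

Each of the three lists can then be wrapped as {"records": [...]} and written out with json.dump if separate files are needed.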

sort items in json_util_dumps in Tornado handler

I would like to output as the following (ordered like in my script).
{"data": [
{ "cid": "CG138712",
"mac": "24-A4-3C-F6-51-21",
"category": "CPE- E",
"last_seen": "2017-12-11 12:42:10",
"cpe-o": {"cid": "CS247314",
"mac":"80-2A-A8-7E-1D-8E",
"category": "CPE-O",
"last_seen": "2018-05-14 15:28:42",
}
}]
}
But my code keeps producing output like this:
{"data": [
{ "cid": "CG138712",
"category": "CPE- E",
"cpe-o": {"cid": "CS247314",
"last_seen": "2018-05-14 15:28:42",
"category": "CPE-O",
"mac":"80-2A-A8-7E-1D-8E"
}
"mac": "24-A4-3C-F6-51-21",
"last_seen": "2017-12-11 12:42:10",
}]
}
This is how I implement it in my script:
cpeo_dict = dict(......)
doc = {"cid": document['cid'],"mac": document['mac'],"category": document['category'],"last_seen": document['last_seen'].strftime("%Y-%m-%d %H:%M:%S"),"cpe-o": cpeo_dict}
docs_uplink.append(doc)
dumped = json_util.dumps(dict(data=docs_uplink))
I can't find how to add parameters in json_util.dumps function, I only found sort and OrderedDict of json.dumps.

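Python dictionaries are guaranteed to preserve insertion order only from Python 3.7 onward (in CPython 3.6 this was an implementation detail); in older versions the order is arbitrary. For those versions there is a special class that preserves order in dicts, collections.OrderedDict. Instead of using dict, you need to use OrderedDict.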
Example:
from collections import OrderedDict
doc = OrderedDict([
    ('cid', document['cid']),
    ('mac', document['mac']),
    ('category', document['category']),
    # ... other keys ...
])
docs_uplink.append(doc)
dumped = json_util.dumps(dict(data=docs_uplink))
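To see the effect in isolation, here is a small self-contained sketch. Note the assumption: it uses the standard library's json.dumps rather than json_util.dumps, so it only illustrates that an OrderedDict keeps its key order when serialized; the values are copied from the question.

```python
import json
from collections import OrderedDict

# Inner document, keys in the order the question wants them.
cpeo = OrderedDict([
    ("cid", "CS247314"),
    ("mac", "80-2A-A8-7E-1D-8E"),
    ("category", "CPE-O"),
    ("last_seen", "2018-05-14 15:28:42"),
])
doc = OrderedDict([
    ("cid", "CG138712"),
    ("mac", "24-A4-3C-F6-51-21"),
    ("category", "CPE- E"),
    ("last_seen", "2017-12-11 12:42:10"),
    ("cpe-o", cpeo),
])

dumped = json.dumps({"data": [doc]})

# Parse it back with an ordered hook to inspect the serialized key order.
parsed = json.loads(dumped, object_pairs_hook=OrderedDict)
outer_keys = list(parsed["data"][0].keys())
inner_keys = list(parsed["data"][0]["cpe-o"].keys())
print(outer_keys)
print(inner_keys)
```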

How do I read a json file into python?

I'm new to JSON and Python, any help on this would be greatly appreciated.
I read about json.loads but am confused
How do I read a file into Python using json.loads?
Below is my JSON file format:
{
"header": {
"platform":"atm"
"version":"2.0"
}
"details":[
{
"abc":"3"
"def":"4"
},
{
"abc":"5"
"def":"6"
},
{
"abc":"7"
"def":"8"
}
]
}
My requirement is to read the values of all "abc" and "def" entries in details and add them to a new list like this [(1,2),(3,4),(5,6),(7,8)]. The new list will be used to create a Spark data frame.

Open the file, and get a filehandle:
fh = open('thefile.json')
https://docs.python.org/2/library/functions.html#open
Then pass the file handle to json.load() (don't use loads; that's for strings):
import json
data = json.load(fh)
https://docs.python.org/2/library/json.html#json.load
From there, you can easily deal with a python dictionary that represents your json-encoded data.
new_list = [(detail['abc'], detail['def']) for detail in data['details']]
Note that your JSON format is also wrong. You will need comma delimiters in many places, but that's not the question.

I'm trying to understand your question as best as I can, but it looks like it was formatted poorly.
First off your json blob is not valid json, it is missing quite a few commas. This is probably what you are looking for:
{
"header": {
"platform": "atm",
"version": "2.0"
},
"details": [
{
"abc": "3",
"def": "4"
},
{
"abc": "5",
"def": "6"
},
{
"abc": "7",
"def": "8"
}
]
}
Now assuming you are trying to parse this in python you will have to do the following.
import json
json_blob = '{"header": {"platform": "atm","version": "2.0"},"details": [{"abc": "3","def": "4"},{"abc": "5","def": "6"},{"abc": "7","def": "8"}]}'
json_obj = json.loads(json_blob)
final_list = []
for single in json_obj['details']:
    final_list.append((int(single['abc']), int(single['def'])))
print(final_list)
This will print the following: [(3, 4), (5, 6), (7, 8)]
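Since the question is about reading from a file, the two answers can be combined into one sketch that uses json.load on a file handle inside a with block. To keep it runnable on its own, the example first writes the corrected JSON to a temporary file; the file name thefile.json and the use of tempfile are assumptions made only for the demonstration.

```python
import json
import os
import tempfile

# Corrected JSON from the answer above, as a Python dict.
payload = {
    "header": {"platform": "atm", "version": "2.0"},
    "details": [
        {"abc": "3", "def": "4"},
        {"abc": "5", "def": "6"},
        {"abc": "7", "def": "8"},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "thefile.json")
    # Write the sample file so the read below has something to load.
    with open(path, "w") as fh:
        json.dump(payload, fh)
    # json.load takes a file object; the with block closes it afterwards.
    with open(path) as fh:
        data = json.load(fh)

pairs = [(int(detail["abc"]), int(detail["def"])) for detail in data["details"]]
print(pairs)
```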
