Remove duplicate JSON objects from list in python

Remove duplicate JSON objects from list in python - python

I have a list of dict where a particular value is repeated multiple times, and I would like to remove the duplicate values.
My list:
te = [
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
}
]
function to remove duplicate values:
def removeduplicate(it):
seen = set()
for x in it:
if x not in seen:
yield x
seen.add(x)
When I call this function I get generator object.
<generator object removeduplicate at 0x0170B6E8>
When I try to iterate over the generator I get TypeError: unhashable type: 'dict'
Is there a way to remove the duplicate values or to iterate over the generator

You can easily remove duplicate keys by dictionary comprehension, since dictionary does not allow duplicate keys, as below-
te = [
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala",
"phone": "None"
},
{
"Name": "Bala1",
"phone": "None"
}
]
unique = { each['Name'] : each for each in te }.values()
print unique
Output-
[{'phone': 'None', 'Name': 'Bala1'}, {'phone': 'None', 'Name': 'Bala'}]

Because you can't add a dict to set. From this question:
You're trying to use a dict as a key to another dict or in a set. That does not work because the keys have to be hashable.
As a general rule, only immutable objects (strings, integers, floats, frozensets, tuples of immutables) are hashable (though exceptions are possible).
>>> foo = dict()
>>> bar = set()
>>> bar.add(foo)
Traceback (most recent call last):
File "<input>", line 1, in <module>
TypeError: unhashable type: 'dict'
>>>
Instead, you're already using if x not in seen, so just use a list:
>>> te = [
... {
... "Name": "Bala",
... "phone": "None"
... },
... {
... "Name": "Bala",
... "phone": "None"
... },
... {
... "Name": "Bala",
... "phone": "None"
... },
... {
... "Name": "Bala",
... "phone": "None"
... }
... ]
>>> def removeduplicate(it):
... seen = []
... for x in it:
... if x not in seen:
... yield x
... seen.append(x)
>>> removeduplicate(te)
<generator object removeduplicate at 0x7f3578c71ca8>
>>> list(removeduplicate(te))
[{'phone': 'None', 'Name': 'Bala'}]
>>>

You can still use a set for duplicate detection, you just need to convert the dictionary into something hashable such as a tuple. Your dictionaries can be converted to tuples by tuple(d.items()) where d is a dictionary. Applying that to your generator function:
def removeduplicate(it):
seen = set()
for x in it:
t = tuple(x.items())
if t not in seen:
yield x
seen.add(t)
>>> for d in removeduplicate(te):
... print(d)
{'phone': 'None', 'Name': 'Bala'}
>>> te.append({'Name': 'Bala', 'phone': '1234567890'})
>>> te.append({'Name': 'Someone', 'phone': '1234567890'})
>>> for d in removeduplicate(te):
... print(d)
{'phone': 'None', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Bala'}
{'phone': '1234567890', 'Name': 'Someone'}
This provides faster lookup (avg. O(1)) than a "seen" list (O(n)). Whether it is worth the extra computation of converting every dict into a tuple depends on the number of dictionaries that you have and how many duplicates there are. If there are a lot of duplicates, a "seen" list will grow quite large, and testing whether a dict has already been seen could become an expensive operation. This might justify the tuple conversion - you would have to test/profile it.

I just use md5 to compare everything.
filtered_json = []
md5_list = []
for item in json_fin:
md5_result = hashlib.md5(json.dumps(item, separators=(',', ':')).encode("utf-8")).hexdigest()
if md5_result not in md5_list:
md5_list.append(md5_result)
filtered_json.append(item)

Related

Changing value of a value in a dictionary within a list within a dictionary

I have a json like:
pd = {
"RP": [
{
"Name": "PD",
"Value": "qwe"
},
{
"Name": "qwe",
"Value": "change"
}
],
"RFN": [
"All"
],
"RIT": [
{
"ID": "All",
"IDT": "All"
}
]
}
I am trying to change the value change to changed. This is a dictionary within a list which is within another dictionary. Is there a better/ more efficient/pythonic way to do this than what I did below:
for key, value in pd.items():
ls = pd[key]
for d in ls:
if type(d) == dict:
for k,v in d.items():
if v == 'change':
pd[key][ls.index(d)][k] = "changed"
This seems pretty inefficient due to the amount of times I am parsing through the data.

String replacement could work if you don't want to write depth/breadth-first search.
>>> import json
>>> json.loads(json.dumps(pd).replace('"Value": "change"', '"Value": "changed"'))
{'RP': [{'Name': 'PD', 'Value': 'qwe'}, {'Name': 'qwe', 'Value': 'changed'}],
'RFN': ['All'],
'RIT': [{'ID': 'All', 'IDT': 'All'}]}

Create partial dict from recursively nested field list

After parsing a URL parameter for partial responses, e.g. ?fields=name,id,another(name,id),date, I'm getting back an arbitrarily nested list of strings, representing the individual keys of a nested JSON object:
['name', 'id', ['another', ['name', 'id']], 'date']
The goal is to map that parsed 'graph' of keys onto an original, larger dict and just retrieve a partial copy of it, e.g.:
input_dict = {
"name": "foobar",
"id": "1",
"another": {
"name": "spam",
"id": "42",
"but_wait": "there is more!"
},
"even_more": {
"nesting": {
"why": "not?"
}
},
"date": 1584567297
}
should simplyfy to:
output_dict = {
"name": "foobar",
"id": "1",
"another": {
"name": "spam",
"id": "42"
},
"date": 1584567297,
}
Sofar, I've glanced over nested defaultdicts, addict and glom, but the mappings they take as inputs are not compatible with my list (might have missed something, of course), and I end up with garbage.
How can I do this programmatically, and accounting for any nesting that might occur?

you can use:
def rec(d, f):
result = {}
for i in f:
if isinstance(i, list):
result[i[0]] = rec(d[i[0]], i[1])
else:
result[i] = d[i]
return result
f = ['name', 'id', ['another', ['name', 'id']], 'date']
rec(input_dict, f)
output:
{'name': 'foobar',
'id': '1',
'another': {'name': 'spam', 'id': '42'},
'date': 1584567297}
here the assumption is that on a nested list the first element is a valid key from the upper level and the second element contains valid keys from a nested dict which is the value for the first element

Using .values() with list of dictionaries?

I'm comparing json files between two different API endpoints to see which json records need an update, which need a create and what needs a delete. So, by comparing the two json files, I want to end up with three json files, one for each operation.
The json at both endpoints is structured like this (but they use different keys for same sets of values; different problem):
{
"records": [{
"id": "id-value-here",
"c": {
"d": "eee"
},
"f": {
"l": "last",
"f": "first"
},
"g": ["100", "89", "9831", "09112", "800"]
}, {
…
}]
}
So the json is represented as a list of dictionaries (with further nested lists and dictionaries).
If a given json endpoint (j1) id value ("id":) exists in the other endpoint json (j2), then that record should be added to j_update.
So far I have something like this, but I can see that .values() doesn't work because it's trying to operate on the list instead of on all the listed dictionaries(?):
j_update = {r for r in j1['records'] if r['id'] in
j2.values()}
This doesn't return an error, but it creates an empty set using test json files.
Seems like this should be simple, but tripping over the nesting I think of dictionaries in a list representing the json. Do I need to flatten j2, or is there a simpler dictionary method python has to achieve this?
====edit j1 and j2====
have same structure, use different keys; toy data
j1
{
"records": [{
"field_5": 2329309841,
"field_12": {
"email": "cmix#etest.com"
},
"field_20": {
"last": "Mixalona",
"first": "Clara"
},
"field_28": ["9002329309999", "9002329309112"],
"field_44": ["1002329309832"]
}, {
"field_5": 2329309831,
"field_12": {
"email": "mherbitz345#test.com"
},
"field_20": {
"last": "Herbitz",
"first": "Michael"
},
"field_28": ["9002329309831", "9002329309112", "8002329309999"],
"field_44": ["1002329309832"]
}, {
"field_5": 2329309855,
"field_12": {
"email": "nkatamaran#test.com"
},
"field_20": {
"first": "Noriss",
"last": "Katamaran"
},
"field_28": ["9002329309111", "8002329309112"],
"field_44": ["1002329309877"]
}]
}
j2
{
"records": [{
"id": 2329309831,
"email": {
"email": "mherbitz345#test.com"
},
"name_primary": {
"last": "Herbitz",
"first": "Michael"
},
"assign": ["8003329309831", "8007329309789"],
"hr_id": ["1002329309877"]
}, {
"id": 2329309884,
"email": {
"email": "yinleeshu#test.com"
},
"name_primary": {
"last": "Lee Shu",
"first": "Yin"
},
"assign": ["8002329309111", "9003329309831", "9002329309111", "8002329309999", "8002329309112"],
"hr_id": ["1002329309832"]
}, {
"id": 23293098338,
"email": {
"email": "amlouis#test.com"
},
"name_primary": {
"last": "Maxwell Louis",
"first": "Albert"
},
"assign": ["8002329309111", "8007329309789", "9003329309831", "8002329309999", "8002329309112"],
"hr_id": ["1002329309877"]
}]
}

If you read the json it will output a dict. You are looking for a particular key in the list of the values.
if 'records' in j2:
r = j2['records'][0].get('id', []) # defaults if id does not exist
It it prettier to do a recursive search but i dunno how you data is organized to quickly come up with a solution.
To give an idea for recursive search consider this example
def resursiveSearch(dictionary, target):
if target in dictionary:
return dictionary[target]
for key, value in dictionary.items():
if isinstance(value, dict):
target = resursiveSearch(value, target)
if target:
return target
a = {'test' : 'b', 'test1' : dict(x = dict(z = 3), y = 2)}
print(resursiveSearch(a, 'z'))

You tried:
j_update = {r for r in j1['records'] if r['id'] in j2.values()}
Aside from the r['id'/'field_5] problem, you have:
>>> list(j2.values())
[[{'id': 2329309831, ...}, ...]]
The id are buried inside a list and a dict, thus the test r['id'] in j2.values() always return False.
The basic solution is fairly simple.
First, create a set of j2 ids:
>>> present_in_j2 = set(record["id"] for record in j2["records"])
Then, rebuild the json structure of j1 but without the j1 field_5 that are not present in j2:
>>> {"records":[record for record in j1["records"] if record["field_5"] in present_in_j2]}
{'records': [{'field_5': 2329309831, 'field_12': {'email': 'mherbitz345#test.com'}, 'field_20': {'last': 'Herbitz', 'first': 'Michael'}, 'field_28': ['9002329309831', '9002329309112', '8002329309999'], 'field_44': ['1002329309832']}]}
It works, but it's not totally satisfying because of the weird keys of j1. Let's try to convert j1 to a more friendly format:
def map_keys(json_value, conversion_table):
"""Map the keys of a json value
This is a recursive DFS"""
def map_keys_aux(json_value):
"""Capture the conversion table"""
if isinstance(json_value, list):
return [map_keys_aux(v) for v in json_value]
elif isinstance(json_value, dict):
return {conversion_table.get(k, k):map_keys_aux(v) for k,v in json_value.items()}
else:
return json_value
return map_keys_aux(json_value)
The function focuses on dictionary keys: conversion_table.get(k, k) is conversion_table[k] if the key is present in the conversion table, or the key itself otherwise.
>>> j1toj2 = {"field_5":"id", "field_12":"email", "field_20":"name_primary", "field_28":"assign", "field_44":"hr_id"}
>>> mapped_j1 = map_keys(j1, j1toj2)
Now, the code is cleaner and the output may be more useful for a PUT:
>>> d1 = {record["id"]:record for record in mapped_j1["records"]}
>>> present_in_j2 = set(record["id"] for record in j2["records"])
>>> {"records":[record for record in mapped_j1["records"] if record["id"] in present_in_j2]}
{'records': [{'id': 2329309831, 'email': {'email': 'mherbitz345#test.com'}, 'name_primary': {'last': 'Herbitz', 'first': 'Michael'}, 'assign': ['9002329309831', '9002329309112', '8002329309999'], 'hr_id': ['1002329309832']}]}

Python3 remove entire nested dictionary

{'result':{'result':[
{
'Company':{
'PostAddress':None
},
'ExternalPartnerProperties':None,
'Id':123456,
'Level':'Level1',
'Name':'Name1',
'ParentId':456789,
'State':'InTrial',
'TrialExpirationTime':1435431669
},
{
'Company':{
'PostAddress':None
},
'ExternalPartnerProperties':None,
'Id':575155,
'Level':'Level2',
'Name':'Name2',
'ParentId':456789,
'State':'InTrial',
'TrialExpirationTime':1491590226
},
{
'Company':{
'PostAddress':None
},
'ExternalPartnerProperties':None,
'Id':888888,
'Level':'Level2',
'Name':'Name3',
'ParentId':456789,
'State':'InProduction',
'TrialExpirationTime':1493280310
},
My code:
for i in partner_output['result']['result']:
if "InProduction" in i['State']:
del i['Company'], i['ExternalPartnerProperties'], i['Id'], i['Level'], i['Name'], i['ParentId'], i['State'], i['TrialExpirationTime']
If I do this, then I return the following result
{'result': {'result': [{
'Company':{
'PostAddress':None
},
'ExternalPartnerProperties':None,
'Id':123456,
'Level':'Level1',
'Name':'Name1',
'ParentId':456789,
'State':'InTrial',
'TrialExpirationTime':1435431669
},
{
'Company':{
'PostAddress':None
},
'ExternalPartnerProperties':None,
'Id':575155,
'Level':'Level2',
'Name':'Name2',
'ParentId':456789,
'State':'InTrial',
'TrialExpirationTime':1491590226
},
{},
but the total number of items is still 3 ... the 3rd container is just empty, but still a container. How can I delete the 3rd container all together?
I cannot use:
for i in partner_output['result']['result']:
if "InProduction" in i['State']:
del partner_output['result'][i]
because I get the error:
TypeError: unhashable type: 'dict'
So I don't know what to do now :-(

You can use a list comprehension to replace the whole list, keeping the other items:
partner_output['result']['result'] = [
i for i in partner_output['result']['result']
if i['State'] != "InProduction"
]
Note that the test for the filter has been reversed; you want to keep all items that do not have 'State' set to InProduction. Alternatively, keep values where the state is set to InTrial:
partner_output['result']['result'] = [
i for i in partner_output['result']['result']
if i['State'] == "InTrial"
]
Your second attempt failed because you tried to use i, a reference to a dictionary, as a key in the outer partner_output['result'] dictionary. If you wanted to delete something from the partner_output['result']['result'] list, you'd have to use the integer index (del partner_output['result']['result'][2]), but you can't do that in a loop because that has consequences for the for loop progress across the list.

Bidirectional data structure conversion in Python

Note: this is not a simple two-way map; the conversion is the important part.
I'm writing an application that will send and receive messages with a certain structure, which I must convert from and to an internal structure.
For example, the message:
{
"Person": {
"name": {
"first": "John",
"last": "Smith"
}
},
"birth_date": "1997.01.12",
"points": "330"
}
This must be converted to :
{
"Person": {
"firstname": "John",
"lastname": "Smith",
"birth": datetime.date(1997, 1, 12),
"points": 330
}
}
And vice-versa.
These messages have a lot of information, so I want to avoid having to manually write converters for both directions. Is there any way in Python to specify the mapping once, and use it for both cases?
In my research, I found an interesting Haskell library called JsonGrammar which allows for this (it's for JSON, but that's irrelevant for the case). But my knowledge of Haskell isn't good enough to attempt a port.

That's actually quite an interesting problem. You could define a list of transformation, for example in the form (key1, func_1to2, key2, func_2to1), or a similar format, where key could contain separators to indicate different levels of the dict, like "Person.name.first".
noop = lambda x: x
relations = [("Person.name.first", noop, "Person.firstname", noop),
("Person.name.last", noop, "Person.lastname", noop),
("birth_date", lambda s: datetime.date(*map(int, s.split("."))),
"Person.birth", lambda d: d.strftime("%Y.%m.%d")),
("points", int, "Person.points", str)]
Then, iterate the elements in that list and transform the entries in the dictionary according to whether you want to go from form A to B or vice versa. You will also need some helper function for accessing keys in nested dictionaries using those dot-separated keys.
def deep_get(d, key):
for k in key.split("."):
d = d[k]
return d
def deep_set(d, key, val):
*first, last = key.split(".")
for k in first:
d = d.setdefault(k, {})
d[last] = val
def convert(d, mapping, atob):
res = {}
for a, x, b, y in mapping:
a, b, f = (a, b, x) if atob else (b, a, y)
deep_set(res, b, f(deep_get(d, a)))
return res
Example:
>>> d1 = {"Person": { "name": { "first": "John", "last": "Smith" } },
... "birth_date": "1997.01.12",
... "points": "330" }
...
>>> print(convert(d1, relations, True))
{'Person': {'birth': datetime.date(1997, 1, 12),
'firstname': 'John',
'lastname': 'Smith',
'points': 330}}

Tobias has answered it quite well. If you are looking for a library that ensures the Model Transformation dynamically then you can explore the Python's Model transformation library PyEcore.
PyEcore allows you to handle models and metamodels (structured data model), and gives the key you need for building ModelDrivenEngineering-based tools and other applications based on a structured data model. It supports out-of-the-box:
Data inheritance,
Two-ways relationship management (opposite references),
XMI (de)serialization,
JSON (de)serialization etc
Edit
I have found something more interesting for you with example similar to yours, check out JsonBender.
import json
from jsonbender import bend, K, S
MAPPING = {
'Person': {
'firstname': S('Person', 'name', 'first'),
'lastname': S('Person', 'name', 'last'),
'birth': S('birth_date'),
'points': S('points')
}
}
source = {
"Person": {
"name": {
"first": "John",
"last": "Smith"
}
},
"birth_date": "1997.01.12",
"points": "330"
}
result = bend(MAPPING, source)
print(json.dumps(result))
Output:
{"Person": {"lastname": "Smith", "points": "330", "firstname": "John", "birth": "1997.01.12"}}

Here is my take on this (converter lambdas and dot-based notation idea taken from tobias_k):
import datetime
converters = {
(str, datetime.date): lambda s: datetime.date(*map(int, s.split("."))),
(datetime.date, str): lambda d: d.strftime("%Y.%m.%d"),
}
mapping = [
('Person.name.first', str, 'Person.firstname', str),
('Person.name.last', str, 'Person.lastname', str),
('birth_date', str, 'Person.birth', datetime.date),
('points', str, 'Person.points', int),
]
def covert_doc(doc, mapping, converters, inverse=False):
converted = {}
for keys1, type1, keys2, type2 in mapping:
if inverse:
keys1, type1, keys2, type2 = keys2, type2, keys1, type1
converter = converters.get((type1, type2), type2)
keys1 = keys1.split('.')
keys2 = keys2.split('.')
obj1 = doc
while keys1:
k, *keys1 = keys1
obj1 = obj1[k]
dict2 = converted
while len(keys2) > 1:
k, *keys2 = keys2
dict2 = dict2.setdefault(k, {})
dict2[keys2[0]] = converter(obj1)
return converted
# Test
doc1 = {
"Person": {
"name": {
"first": "John",
"last": "Smith"
}
},
"birth_date": "1997.01.12",
"points": "330"
}
doc2 = {
"Person": {
"firstname": "John",
"lastname": "Smith",
"birth": datetime.date(1997, 1, 12),
"points": 330
}
}
assert doc2 == covert_doc(doc1, mapping, converters)
assert doc1 == covert_doc(doc2, mapping, converters, inverse=True)
This nice things are that you can reuse converters (even to convert different document structures) and that you only need to define non-trivial conversions. The drawback is that, as it is, every pair of types must always use the same conversion (maybe it could be extended to add optional alternative conversions).

You can use lists to describe paths to values in objects with type converting functions, for example:
from_paths = [
(['Person', 'name', 'first'], None),
(['Person', 'name', 'last'], None),
(['birth_date'], lambda s: datetime.date(*map(int, s.split(".")))),
(['points'], lambda s: int(s))
]
to_paths = [
(['Person', 'firstname'], None),
(['Person', 'lastname'], None),
(['Person', 'birth'], lambda d: d.strftime("%Y.%m.%d")),
(['Person', 'points'], str)
]
and a little function to covert from and to (much like tobias suggests but without string separation and using reduce to get values from dict):
def convert(from_paths, to_paths, obj):
to_obj = {}
for (from_keys, convfn), (to_keys, _) in zip(from_paths, to_paths):
value = reduce(operator.getitem, from_keys, obj)
if convfn:
value = convfn(value)
curr_lvl_dict = to_obj
for key in to_keys[:-1]:
curr_lvl_dict = curr_lvl_dict.setdefault(key, {})
curr_lvl_dict[to_keys[-1]] = value
return to_obj
test:
from_json = '''{
"Person": {
"name": {
"first": "John",
"last": "Smith"
}
},
"birth_date": "1997.01.12",
"points": "330"
}'''
>>> obj = json.loads(from_json)
>>> new_obj = convert(from_paths, to_paths, obj)
>>> new_obj
{'Person': {'lastname': u'Smith',
'points': 330,
'birth': datetime.date(1997, 1, 12), 'firstname': u'John'}}
>>> convert(to_paths, from_paths, new_obj)
{'birth_date': '1997.01.12',
'Person': {'name': {'last': u'Smith', 'first': u'John'}},
'points': '330'}
>>>

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Remove duplicate JSON objects from list in python - python

I just use md5 to compare everything. filtered_json = [] md5_list = [] for item in json_fin: md5_result = hashlib.md5(json.dumps(item, separators=(',', ':')).encode("utf-8")).hexdigest() if md5_result not in md5_list: md5_list.append(md5_result) filtered_json.append(item)

Related

Changing value of a value in a dictionary within a list within a dictionary

Create partial dict from recursively nested field list

Using .values() with list of dictionaries?

Python3 remove entire nested dictionary

Bidirectional data structure conversion in Python

Categories

Resources