Python3: Remove duplicates from the dictionary list [duplicate]

I have a list of dicts, and I'd like to remove the dicts with identical key and value pairs.
For this list: [{'a': 123}, {'b': 123}, {'a': 123}]
I'd like to return this: [{'a': 123}, {'b': 123}]
Another example:
For this list: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
I'd like to return this: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Try this:
[dict(t) for t in {tuple(d.items()) for d in l}]
The strategy is to convert the list of dictionaries to a list of tuples where the tuples contain the items of the dictionary. Since the tuples can be hashed, you can remove duplicates using set (using a set comprehension here, older python alternative would be set(tuple(d.items()) for d in l)) and, after that, re-create the dictionaries from tuples with dict.
where:
l is the original list
d is one of the dictionaries in the list
t is one of the tuples created from a dictionary
Edit: If you want to preserve ordering, the one-liner above won't work since set won't do that. However, with a few lines of code, you can also do that:
l = [{'a': 123, 'b': 1234},
     {'a': 3222, 'b': 1234},
     {'a': 123, 'b': 1234}]
seen = set()
new_l = []
for d in l:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)
print(new_l)
Example output:
[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]
Note: As pointed out by @alexis, two dictionaries with the same keys and values might not produce the same tuple. That can happen if their keys were added or removed in a different order. If that applies to your problem, consider sorting d.items() as he suggests.
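The ordering caveat in the note can be sidestepped without sorting. Here is a minimal sketch, assuming every value in the dictionaries is hashable: it uses frozenset(d.items()) as the lookup key, so the order in which keys were inserted does not matter, and it keeps the first occurrence of each dict. The name dedupe_dicts is made up for this illustration.

```python
def dedupe_dicts(dicts):
    """Return the dicts in first-seen order with exact duplicates removed.

    Assumes every value is hashable (no nested dicts or lists).
    """
    seen = set()
    result = []
    for d in dicts:
        # frozenset ignores the order in which the keys were inserted
        key = frozenset(d.items())
        if key not in seen:
            seen.add(key)
            result.append(d)
    return result


data = [{'a': 1, 'b': 2}, {'b': 2, 'a': 1}, {'a': 3, 'b': 2}]
unique = dedupe_dicts(data)
print(unique)  # [{'a': 1, 'b': 2}, {'a': 3, 'b': 2}]
```

The second dict is dropped even though its keys were inserted in a different order, which the plain tuple(d.items()) key would not catch.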

Another one-liner based on list comprehensions:
>>> d = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> [i for n, i in enumerate(d) if i not in d[n + 1:]]
[{'b': 123}, {'a': 123}]
Here since we can use dict comparison, we only keep the elements that are not in the rest of the initial list (this notion is only accessible through the index n, hence the use of enumerate).
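Because that comprehension looks at the rest of the list, it keeps the last occurrence of each duplicate, which is why {'b': 123} comes first in the output. A small variation (a sketch, not part of the original answer) looks at the elements before index n instead and so keeps the first occurrence. It is still quadratic in the list length.

```python
d = [{'a': 123}, {'b': 123}, {'a': 123}]

# Keep an element only if no equal element appears earlier in the list.
first_kept = [item for n, item in enumerate(d) if item not in d[:n]]
print(first_kept)  # [{'a': 123}, {'b': 123}]
```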

If using a third-party package would be okay then you could use iteration_utilities.unique_everseen:
>>> from iteration_utilities import unique_everseen
>>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
>>> list(unique_everseen(l))
[{'a': 123}, {'b': 123}]
It preserves the order of the original list and it can also handle unhashable items like dictionaries by falling back on a slower algorithm (O(n*m), where n is the number of elements in the original list and m the number of unique elements, instead of O(n)). In case both keys and values are hashable, you can use the key argument of that function to create hashable items for the "uniqueness-test" (so that it works in O(n)).
In the case of a dictionary (which compares independent of order) you need to map it to another data-structure that compares like that, for example frozenset:
>>> list(unique_everseen(l, key=lambda item: frozenset(item.items())))
[{'a': 123}, {'b': 123}]
Note that you shouldn't use a simple tuple approach (without sorting) because equal dictionaries don't necessarily have the same order (even in Python 3.7 where insertion order - not absolute order - is guaranteed):
>>> d1 = {1: 1, 9: 9}
>>> d2 = {9: 9, 1: 1}
>>> d1 == d2
True
>>> tuple(d1.items()) == tuple(d2.items())
False
And even sorting the tuple might not work if the keys aren't sortable:
>>> d3 = {1: 1, 'a': 'a'}
>>> tuple(sorted(d3.items()))
TypeError: '<' not supported between instances of 'str' and 'int'
Benchmark
I thought it might be useful to see how the performance of these approaches compares, so I did a small benchmark. The benchmark graphs are time vs. list-size based on a list containing no duplicates (that was chosen arbitrarily, the runtime doesn't change significantly if I add some or lots of duplicates). It's a log-log plot so the complete range is covered.
The absolute times: [plot not reproduced]
The timings relative to the fastest approach: [plot not reproduced]
The second approach from thefourtheye is fastest here. The unique_everseen approach with the key function is in second place; however, it's the fastest approach that preserves order. The other approaches from jcollado and thefourtheye are almost as fast. The approach using unique_everseen without key and the solutions from Emmanuel and Scorpil are very slow for longer lists and scale much worse: O(n*n) instead of O(n). stpk's approach with json isn't O(n*n), but it's much slower than the similar O(n) approaches.
The code to reproduce the benchmarks:
from simple_benchmark import benchmark
import json
from collections import OrderedDict
from iteration_utilities import unique_everseen
def jcollado_1(l):
    return [dict(t) for t in {tuple(d.items()) for d in l}]
def jcollado_2(l):
    seen = set()
    new_l = []
    for d in l:
        t = tuple(d.items())
        if t not in seen:
            seen.add(t)
            new_l.append(d)
    return new_l
def Emmanuel(d):
    return [i for n, i in enumerate(d) if i not in d[n + 1:]]
def Scorpil(a):
    b = []
    for i in range(0, len(a)):
        if a[i] not in a[i+1:]:
            b.append(a[i])
    return b
def stpk(X):
    set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
    return [json.loads(t) for t in set_of_jsons]
def thefourtheye_1(data):
    return OrderedDict((frozenset(item.items()),item) for item in data).values()
def thefourtheye_2(data):
    return {frozenset(item.items()):item for item in data}.values()
def iu_1(l):
    return list(unique_everseen(l))
def iu_2(l):
    return list(unique_everseen(l, key=lambda inner_dict: frozenset(inner_dict.items())))
funcs = (jcollado_1, Emmanuel, stpk, Scorpil, thefourtheye_1, thefourtheye_2, iu_1, jcollado_2, iu_2)
arguments = {2**i: [{'a': j} for j in range(2**i)] for i in range(2, 12)}
b = benchmark(funcs, arguments, 'list size')
# IPython/Jupyter magic; omit it when running outside a notebook
%matplotlib widget
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('ggplot')
mpl.rcParams['figure.figsize'] = '8, 6'
b.plot(relative_to=thefourtheye_2)
For completeness here is the timing for a list containing only duplicates:
# this is the only change for the benchmark
arguments = {2**i: [{'a': 1} for j in range(2**i)] for i in range(2, 12)}
The timings don't change significantly except for unique_everseen without the key function, which in this case is the fastest solution. However, that's just the best case (so not representative) for that function with unhashable values, because its runtime depends on the number of unique values in the list: O(n*m), where m is just 1 in this case, and thus it runs in O(n).
Disclaimer: I'm the author of iteration_utilities.

Other answers would not work if you're operating on nested dictionaries such as deserialized JSON objects. For this case you could use:
import json
set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
X = [json.loads(t) for t in set_of_jsons]
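The set in the snippet above loses the original order and rebuilds every dict from JSON. A sketch of an order-preserving variant follows, assuming all dicts are JSON-serializable and use string keys. The helper name dedupe_nested is hypothetical. It uses the JSON string only as a lookup key and returns the original objects.

```python
import json


def dedupe_nested(dicts):
    """Drop duplicate (possibly nested) dicts, keeping first-seen order.

    Assumes every dict is JSON-serializable with string keys.
    """
    seen = set()
    result = []
    for d in dicts:
        # sort_keys=True makes the string independent of key order
        key = json.dumps(d, sort_keys=True)
        if key not in seen:
            seen.add(key)
            result.append(d)
    return result


X = [{'a': {'b': 1}}, {'a': {'b': 2}}, {'a': {'b': 1}}]
deduped = dedupe_nested(X)
print(deduped)  # [{'a': {'b': 1}}, {'a': {'b': 2}}]
```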

If you are using Pandas in your workflow, one option is to feed a list of dictionaries directly to the pd.DataFrame constructor. Then use drop_duplicates and to_dict methods for the required result.
import pandas as pd
d = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
d_unique = pd.DataFrame(d).drop_duplicates().to_dict('records')
print(d_unique)
[{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Sometimes old-style loops are still useful. This code is a little longer than jcollado's, but very easy to read:
a = [{'a': 123}, {'b': 123}, {'a': 123}]
b = []
for i in range(len(a)):
    if a[i] not in a[i+1:]:
        b.append(a[i])

If you want to preserve the order, then you can do
from collections import OrderedDict
print(list(OrderedDict((frozenset(item.items()), item) for item in data).values()))
# [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]
If the order doesn't matter, then you can do
print(list({frozenset(item.items()): item for item in data}.values()))
# [{'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]  (order may differ)

Not a universal answer, but if your list happens to be sorted by some key, like this:
l=[{'a': {'b': 31}, 't': 1},
{'a': {'b': 31}, 't': 1},
{'a': {'b': 145}, 't': 2},
{'a': {'b': 25231}, 't': 2},
{'a': {'b': 25231}, 't': 2},
{'a': {'b': 25231}, 't': 2},
{'a': {'b': 112}, 't': 3}]
then the solution is as simple as:
import itertools
result = [a[0] for a in itertools.groupby(l)]
Result:
[{'a': {'b': 31}, 't': 1},
{'a': {'b': 145}, 't': 2},
{'a': {'b': 25231}, 't': 2},
{'a': {'b': 112}, 't': 3}]
Works with nested dictionaries and (obviously) preserves order.
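The requirement that the list be sorted (or at least that equal items sit next to each other) matters because itertools.groupby only merges adjacent equal items. A short illustration with made-up data:

```python
import itertools

# Equal dicts that are NOT adjacent end up in separate groups.
unsorted = [{'t': 1}, {'t': 2}, {'t': 1}]
result = [key for key, _group in itertools.groupby(unsorted)]
print(result)  # [{'t': 1}, {'t': 2}, {'t': 1}]
```

The duplicate {'t': 1} survives here, so this approach is only safe when duplicates are guaranteed to be neighbours.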

You can use a set, but you need to turn the dicts into a hashable type.
seq = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
unique = set()
for d in seq:
    t = tuple(d.items())
    unique.add(t)
unique now equals
{(('a', 3222), ('b', 1234)), (('a', 123), ('b', 1234))}
To get dicts back:
[dict(x) for x in unique]

The easiest way is to convert each item in the list to a string, since a dictionary is not hashable. Then you can use set to remove the duplicates.
list_org = [{'a': 123}, {'b': 123}, {'a': 123}]
list_org_updated = [ str(item) for item in list_org]
print(list_org_updated)
["{'a': 123}", "{'b': 123}", "{'a': 123}"]
unique_set = set(list_org_updated)
print(unique_set)
{"{'b': 123}", "{'a': 123}"}
You can use the set, but if you do want a list, then add the following:
import ast
unique_list = [ast.literal_eval(item) for item in unique_set]
print(unique_list)
[{'b': 123}, {'a': 123}]
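One caveat with the string approach: str() reflects the insertion order of the keys, so two dicts that compare equal can still produce different strings and both survive in the set. A small demonstration with made-up data:

```python
d1 = {'a': 1, 'b': 2}
d2 = {'b': 2, 'a': 1}

print(d1 == d2)            # True
print(str(d1) == str(d2))  # False

# Both strings end up in the set, so the duplicate is not removed.
unique_strings = {str(d) for d in [d1, d2]}
print(len(unique_strings))  # 2
```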

input_list = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
# output required => [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]
# code
unique_list = []
for item in input_list:
    if item not in unique_list:
        unique_list.append(item)
print("previous list =", input_list)
print("Updated list =", unique_list)
#output
previous list = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
Updated list = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Here's a quick one-line solution with a doubly-nested list comprehension (based on @Emmanuel's solution).
This uses a single key (for example, a) in each dict as the primary key, rather than checking whether the entire dict matches:
[i for n, i in enumerate(list_of_dicts) if i.get(primary_key) not in [y.get(primary_key) for y in list_of_dicts[n + 1:]]]
It's not what OP asked for, but it's what brought me to this thread, so I figured I'd post the solution I ended up with

Not so short but easy to read:
list_of_data = [{'a': 123}, {'b': 123}, {'a': 123}]
list_of_data_uniq = []
for data in list_of_data:
    if data not in list_of_data_uniq:
        list_of_data_uniq.append(data)
Now list_of_data_uniq contains only unique dicts.

Remove duplications by custom key:
def remove_duplications(arr, key):
    return list({key(x): x for x in arr}.values())
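A usage sketch for the function above, with made-up rows. Note that the dict comprehension lets the last item with a given key win (while keeping the position of the first). The keep_first helper is a hypothetical variant, not from the original answer, that keeps the first item instead.

```python
def remove_duplications(arr, key):
    return list({key(x): x for x in arr}.values())


def keep_first(arr, key):
    """Hypothetical variant that keeps the first item seen for each key."""
    first = {}
    for x in arr:
        # setdefault only stores x if the key is not present yet
        first.setdefault(key(x), x)
    return list(first.values())


rows = [{'id': 1, 'v': 'old'}, {'id': 2, 'v': 'x'}, {'id': 1, 'v': 'new'}]
last_wins = remove_duplications(rows, key=lambda r: r['id'])
first_wins = keep_first(rows, key=lambda r: r['id'])
print(last_wins)   # [{'id': 1, 'v': 'new'}, {'id': 2, 'v': 'x'}]
print(first_wins)  # [{'id': 1, 'v': 'old'}, {'id': 2, 'v': 'x'}]
```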

There are a lot of good examples here that search for duplicate values and keys; below is the way we filter out whole-dictionary duplicates in lists. Use dupKeys = [] if your source data consists of identically formatted dictionaries and you are looking for exact duplicates. Otherwise set dupKeys to the key names of the data you do not want duplicate entries of; it can be 1 to n keys. It ain't elegant, but it works and is very flexible.
import binascii
collected_sensor_data = [{"sensor_id":"nw-180","data":"XXXXXXX"},
{"sensor_id":"nw-163","data":"ZYZYZYY"},
{"sensor_id":"nw-180","data":"XXXXXXX"},
{"sensor_id":"nw-97", "data":"QQQQQZZ"}]
dupKeys = ["sensor_id", "data"]
def RemoveDuplicateDictData(collected_sensor_data, dupKeys):
    checkCRCs = []
    final_sensor_data = []
    if dupKeys == []:
        for sensor_read in collected_sensor_data:
            ck1 = binascii.crc32(str(sensor_read).encode('utf8'))
            if ck1 not in checkCRCs:
                final_sensor_data.append(sensor_read)
                checkCRCs.append(ck1)
    else:
        for sensor_read in collected_sensor_data:
            tmp = ""
            for k in dupKeys:
                tmp += str(sensor_read[k])
            ck1 = binascii.crc32(tmp.encode('utf8'))
            if ck1 not in checkCRCs:
                final_sensor_data.append(sensor_read)
                checkCRCs.append(ck1)
    return final_sensor_data
final_sensor_data = [{"sensor_id":"nw-180","data":"XXXXXXX"},
{"sensor_id":"nw-163","data":"ZYZYZYY"},
{"sensor_id":"nw-97", "data":"QQQQQZZ"}]

If you don't care about scale and crazy performance, here is a simple function:
# Filters dicts with the same value in unique_key
# in: [{'k1': 1}, {'k1': 33}, {'k1': 1}]
# out: [{'k1': 1}, {'k1': 33}]
def remove_dup_dicts(list_of_dicts: list, unique_key) -> list:
    unique_values = list()
    unique_dicts = list()
    for obj in list_of_dicts:
        val = obj.get(unique_key)
        if val not in unique_values:
            unique_values.append(val)
            unique_dicts.append(obj)
    return unique_dicts

Related

Avoid creation of extra variable while updating a dictionary

I have a dictionary with one key-value pair,
dct = {'a': 1}
I want to add more key-value pairs to this dictionary, so, I do,
{dct.update(**i) for i in [{'b': 2}, {'c': 3}, {'d': None}] if any(i.values())}
but the IDE starts suggesting to convert this into a variable, and marks the above line with a yellowish background
var = {dct.update(**i) for i in [{'b': 2}, {'c': 3}, {'d': None}] if any(i.values())}
then I add this variable, but it would go unused, and the IDE starts saying unused variable var.
How do I update the dictionary, without IDE having any issues?

Do it the normal way, without using a set comprehension:
for i in [{'b': 2}, {'c': 3}, {'d': None}]:
    if any(i.values()):
        dct.update(**i)
Since you are not using the resulting set in your code, it's better to keep it simple and avoid unnecessary comprehensions.
Edit
As Mark suggested, if any of your values may be 0, you can do it like this:
for i in [{'b': 2}, {'c': 3}, {'d': None}]:
    if any(v is not None for v in i.values()):
        dct.update(**i)

If you are thinking about this in terms of key/value pairs, you could turn your dicts into key/value pairs and pass them into update as a flattened list:
dct = {'a': 1}
l = [{'b': 2}, {'c': 3}, {'d': None}]
dct.update((k, v) for d in l for k, v in d.items() if v is not None)
print(dct)
# {'a': 1, 'b': 2, 'c': 3}
This is subtly different from your code using any(i.values()) in the case where any of these dicts might have more than one value, like {'e': 100, 'd': None}. Using the above code, this would add e and not d, but using the any approach you would end up adding the d: None key/value pair.
Also, be careful with the construct if any(i.values()): if it is possible that any of the values could be 0, make sure it has the behavior you expect.
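To make the 0 caveat concrete, here is a small comparison with made-up data. any() tests truthiness, so a dict whose only value is 0 is skipped just like one whose only value is None, whereas an explicit "is not None" check keeps the 0.

```python
dct = {'a': 1}
updates = [{'b': 0}, {'c': 3}, {'d': None}]

# any() tests truthiness, so {'b': 0} is skipped along with {'d': None}
by_truthiness = dict(dct)
for u in updates:
    if any(u.values()):
        by_truthiness.update(u)

# Comparing against None explicitly keeps the 0
by_none_check = dict(dct)
by_none_check.update(
    (k, v) for u in updates for k, v in u.items() if v is not None
)

print(by_truthiness)  # {'a': 1, 'c': 3}
print(by_none_check)  # {'a': 1, 'b': 0, 'c': 3}
```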

I have found one way to achieve the same:
dct = {i: j for i, j in zip(['a', 'b', 'c', 'd'], [1, 2, 3, None]) if j}
edit
or something like this,
dct = {'a': 1}
dct.update({i: j for i, j in zip(['b', 'c', 'd'], [2, 3, None]) if j})

Filter key/values in a list of dictionaries

Assuming a list of dicts:
[{'a':3434,'b':23424,'c':3231,'d':24334243},
 {'a':344,'b':234,'c':321,'d':24334},
 {'a':34,'b':2424,'c':31,'d':2434243},...]
Is there a one-liner to filter the list so that each dictionary keeps only certain keys, e.g. ['a','b']?
For instance:
Result = [{'a':3434,'b':23424},
          {'a':344,'b':234},
          {'a':34,'b':2424},...]
Note: my current solution uses for loops and is totally inelegant.

This would be my homemade approach, assuming the original list is named lst:
newLst = [{k: v for k, v in dicts.items() if k in ['a', 'b']} for dicts in lst]

a = [{'a':3434,'b':23424,'c':3231,'d':24334243},{'a':344,'b':234,'c':321,'d':24334},{'a':34,'b':2424,'c':31,'d':2434243}]
r = []
for i in a:
    if ('a' in i) and ('b' in i):
        r.append({'a':i['a'], 'b':i['b']})
print(r)
output
[{'a': 3434, 'b': 23424}, {'a': 344, 'b': 234}, {'a': 34, 'b': 2424}]
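The two answers above behave differently when a dict lacks one of the wanted keys: the comprehension keeps a partial dict, while the loop drops the dict entirely. A sketch of the comprehension wrapped in a reusable helper (the name pick_keys is hypothetical), using a set for the membership test:

```python
def pick_keys(dicts, wanted):
    """Return new dicts containing only the keys listed in `wanted`.

    Keys missing from a dict are silently skipped, so partial dicts are kept.
    """
    wanted = set(wanted)
    return [{k: v for k, v in d.items() if k in wanted} for d in dicts]


rows = [{'a': 3434, 'b': 23424, 'c': 3231}, {'a': 344, 'c': 321}]
picked = pick_keys(rows, ['a', 'b'])
print(picked)  # [{'a': 3434, 'b': 23424}, {'a': 344}]
```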

Why is this dictionary turning into a tuple?

I have a complex dictionary:
l = {10: [{'a':1, 'T':'y'}, {'a':2, 'T':'n'}], 20: [{'a':3,'T':'n'}]}
When I try to iterate over the dictionary, I don't get a dictionary whose values are lists of dictionaries; I get tuples, like so:
for m in l.items():
    print(m)
(10, [{'a': 1, 'T': 'y'}, {'a': 2, 'T': 'n'}])
(20, [{'a': 3, 'T': 'n'}])
But when I just print l I get my original dictionary:
In [7]: l
Out[7]: {10: [{'a': 1, 'T': 'y'}, {'a': 2, 'T': 'n'}], 20: [{'a': 3, 'T': 'n'}]}
How do I iterate over the dictionary? I still need the keys and to process each dictionary in the value list.

There are two questions here. First, you ask why this is turned into a "tuple". The answer is that this is what the .items() method on dictionaries yields: a (key, value) tuple for each pair.
Knowing this, you can then decide how to use this information. You can choose to unpack the tuple into its two parts during iteration:
for k, v in l.items():
    # Now k holds the key and v holds the value,
    # so you can either use the value directly
    print(v[0])
    # or access it using the key
    value = l[k]
    print(value[0])
    # Both yield the same value

When iterating over a dictionary's items, you can unpack each pair into two variables:
for key, value in l.items():
    print(key, value)

I often rely on pprint when processing a nested object to know at a glance what structure I am dealing with.
from pprint import pprint
l = {10: [{'a':1, 'T':'y'}, {'a':2, 'T':'n'}], 20: [{'a':3,'T':'n'}]}
pprint(l, indent=4, width=40)
Output:
{ 10: [ {'T': 'y', 'a': 1},
{'T': 'n', 'a': 2}],
20: [{'T': 'n', 'a': 3}]}
Others have already answered with implementations.

Thanks for all the help. I did figure out how to process this. Here is the implementation I came up with:
for m in l.items():
    k, v = m
    print(f"key: {k}, val: {v}")
    for n in v:
        print(f"key: {n['a']}, val: {n['T']}")
Thanks for everyone's help!

python list of tuples to list of dicts for use by csv.dictwriter

I have this scenario:
x=['a','b','c'] #Header
y=[(1,2,3),(4,5,6)] #data
I need to create the structure below:
[{'a':1, 'b':2, 'c':3}, {'a':4, 'b':5, 'c':6}]
Is there a better way of doing this (the way a Python expert would)?
rows=[]
for row in range(0,len(y)):
    rec={}
    for col in range(0, len(x)):
        rec[x[col]]=y[row][col]
    rows.append(rec)
print(rows)
The above code gives the desired result, but I am looking for a one-liner solution, something like the one below:
rows=list( ( {x[col]:y[row][col]} for row in range(0,len(y)) for col in range(0, len(x)) ) )
output:
[{'a': 1}, {'b': 2}, {'c': 3}, {'a': 4}, {'b': 5}, {'c': 6}]
But this gives a list of individual dicts rather than combined dicts. Any ideas?

You could write a generator that iterates over the data. Then, for each item in the data, use zip to generate an iterable of (header, value) tuples that you pass to dict:
>>> x = ['a','b','c']
>>> y = [(1,2,3),(4,5,6)]
>>> gen = (dict(zip(x, z)) for z in y)
>>> list(gen)
[{'a': 1, 'c': 3, 'b': 2}, {'a': 4, 'c': 6, 'b': 5}]
Update: The example above uses a generator expression instead of a list, since the code writing the CSV only needs one row at a time. Generating the full list would require much more memory with no benefit.
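A sketch of how the generator could feed csv.DictWriter directly. It writes to an in-memory io.StringIO buffer purely for illustration; in real code that would be a file opened with newline=''.

```python
import csv
import io

x = ['a', 'b', 'c']          # header
y = [(1, 2, 3), (4, 5, 6)]   # data

buffer = io.StringIO()  # stands in for a real file opened with newline=''
writer = csv.DictWriter(buffer, fieldnames=x)
writer.writeheader()
# writerows accepts any iterable of dicts, so the rows are built lazily
writer.writerows(dict(zip(x, row)) for row in y)

csv_text = buffer.getvalue()
print(csv_text)
```

Each row dict is created only when the writer asks for it, which matches the memory argument made above.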

Create N new dictionaries based on N keys from an original dictionary

I am new to Python and to this site. I need to create N new dictionaries based on N keys from an original dictionary.
Let's say I have an OriginalDict {a:1, b:2, c:3, d:4, e:5, f:6}. I need to create 6 new dictionaries having as keys (new keys are same) the value of the original. Something like this:
Dict1 {Name:a,...}, Dict2 {Name:b,...}, Dict3 {Name:c,...}.....
Dict6 {Name:f...}
This is my code:
d = {}
for key in OriginalDict:
    d['Name'] = key
I got a new dictionary, but only for the last key.
print(d)
{'Name': 'f'}
I guess that's because the last value in a dictionary overrides the previous one if the keys are the same.
Please advise... :)

If, for example, we put all those dicts in a list, we can use this comprehension:
dicts = [{'Name': k} for k in OriginalDict]
Let's try it out
>>> OriginalDict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6}
>>> dicts = [{'Name': k} for k in OriginalDict]
>>> dicts
[{'Name': 'd'}, {'Name': 'c'}, {'Name': 'a'}, {'Name': 'b'}, {'Name': 'e'}, {'Name': 'f'}]
The statement 6 new dictionaries having as keys (new keys are same) the value of the original seems to contradict your example, at least to me.
In such case we can do
dicts = [{v: k} for k, v in OriginalDict.items()]
Let's try it out:
>>> dicts = [{v: k} for k, v in OriginalDict.items()]
>>> dicts
[{4: 'd'}, {3: 'c'}, {1: 'a'}, {2: 'b'}, {5: 'e'}, {6: 'f'}]

In Python 3.x:
for key in OriginalDict.keys():
    d = {}
    d['Name'] = key
This will give you a new dictionary for every key of the original one. Now, you could save them inside a list or inside another dictionary like:
New_Dicts = []
for key in OriginalDict.keys():
    d = {}
    d['Name'] = key
    New_Dicts.append(d)
Or,
New_Dicts = {}
for i, key in enumerate(OriginalDict.keys()):
    d = {}
    d['Name'] = key
    New_Dicts[i] = d

I think you want to create a generator function, then call it with the dictionary and have it yield your new ones:
orig = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':6}
def make_sub_dictionaries(big_dictionary):
    for key, val in big_dictionary.items():
        yield {'name': key, 'val': val}
# if you want them one by one, call next
first = next(make_sub_dictionaries(orig))
print(first)
# if you want all of them
for d in make_sub_dictionaries(orig):
    print(d)
    # do whatever stuff you need to do

One method is as follows:
OriginalDict = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6}
newDicts = [{'Name':v} for k,v in enumerate(OriginalDict.values())]
which will give you
>>> newDicts
[{'Name': 1}, {'Name': 3}, {'Name': 2}, {'Name': 5}, {'Name': 4}, {'Name': 6}]

First of all, this won't keep the order. You need to use OrderedDict to preserve the order.
my_dict={'a':1, 'b':2, 'c':3, 'd':4, 'e':5, 'f':6}
j=0
for key, value in my_dict.iteritems():
    j +=1
    exec("Dict%s = %s" % (j,{key:value}))
Now when you type.
>>> print Dict1
{'a':1}
>>> print Dict2
{'c':3}
