python list of dictionaries find duplicates based on value


I have a list of dicts:
a =[{'id': 1,'desc': 'smth'},
{'id': 2,'desc': 'smthelse'},
{'id': 1,'desc': 'smthelse2'},
{'id': 1,'desc': 'smthelse3'}]
I would like to go through the list, find the dicts that share the same id value (e.g. id=1), and build a new list of dicts:
b = [{'id': 1, 'desc': ['smth', 'smthelse2', 'smthelse3']},
     {'id': 2, 'desc': 'smthelse'}]

You can try:
import operator, itertools
key = operator.itemgetter('id')
b = [{'id': x, 'desc': [d['desc'] for d in y]}
for x, y in itertools.groupby(sorted(a, key=key), key=key)]

It is better to keep the "desc" values as lists everywhere even if they contain a single element only. This way you can do
for d in b:
    print(d['id'])
    for desc in d['desc']:
        print(desc)
This would work for strings too, just returning individual characters, which is not what you want.
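To see why a bare string is a problem here, the following minimal sketch runs the same kind of loop over both shapes of "desc". The names as_list and as_str are made up for illustration and are not part of the original answer.

```python
# Two hypothetical entries: one stores 'desc' as a list, the other as a bare string.
as_list = {'id': 2, 'desc': ['smthelse']}
as_str = {'id': 2, 'desc': 'smthelse'}

# Iterating over a list yields whole descriptions.
from_list = [desc for desc in as_list['desc']]

# Iterating over a string yields single characters instead.
from_str = [desc for desc in as_str['desc']]

print(from_list)  # ['smthelse']
print(from_str)   # ['s', 'm', 't', 'h', 'e', 'l', 's', 'e']
```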
And now the solution giving you a list of dicts of lists:
a =[{'id': 1,'desc': 'smth'},{'id': 2,'desc': 'smthelse'},{'id': 1,'desc': 'smthelse2'},{'id': 1,'desc': 'smthelse3'}]
c = {}
for d in a:
    c.setdefault(d['id'], []).append(d['desc'])
b = [{'id': k, 'desc': v} for k, v in c.items()]
b is now:
[{'id': 1, 'desc': ['smth', 'smthelse2', 'smthelse3']},
 {'id': 2, 'desc': ['smthelse']}]

from collections import defaultdict
d = defaultdict(list)
for x in a:
    d[x['id']].append(x['desc'])  # group descriptions by id
b = [dict(id=id, desc=desc if len(desc) > 1 else desc[0])
for id, desc in d.items()]
To preserve order:
b = []
for id in (x['id'] for x in a):
    desc = d[id]
    if desc:
        b.append(dict(id=id, desc=desc if len(desc) > 1 else desc[0]))
        del d[id]
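On Python 3.7 and later, plain dicts (and defaultdict) remember insertion order, so a sketch like the following may already give first-seen order without the extra loop. The sample data is reordered purely for illustration, the name grouped is hypothetical, and the desc values are kept as lists throughout.

```python
from collections import defaultdict

# Sample data reordered so that id 2 is seen before id 1.
a = [{'id': 2, 'desc': 'smthelse'},
     {'id': 1, 'desc': 'smth'},
     {'id': 1, 'desc': 'smthelse2'}]

grouped = defaultdict(list)
for x in a:
    grouped[x['id']].append(x['desc'])

# Keys come back in first-seen order (2 before 1) on Python 3.7+.
b = [{'id': k, 'desc': v} for k, v in grouped.items()]
print(b)
# [{'id': 2, 'desc': ['smthelse']}, {'id': 1, 'desc': ['smth', 'smthelse2']}]
```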

Related

Value duplicated in dictionary

The following is my code:
test = [{'name' : 'one'}, {'name' : 'two'}]
a = {}
b = []
c = {}
for i in test:
    c['name'] = i['name']
    b.append(c)
a['items'] = b
print(a)
This produces the following content of dictionary a, which is wrong:
{'items': [{'name': 'two'}, {'name': 'two'}]}
Why does the output dictionary, a, contain the value 'two' twice instead of the value 'one' once and the value 'two' once?

You only created one dict named c, so its 'name' key is overwritten each time through the loop, and b ends up holding the same dict twice. You want a new dict to append to b on each iteration: move c = {} into the loop's body.
for i in test:
    c = {}
    c['name'] = i['name']
    b.append(c)
or
for i in test:
    c = {'name': i['name']}
    b.append(c)
or
b = [{'name': i['name']} for i in test]
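The aliasing described above can be checked directly with the is operator. This is a small sketch; the names shared, buggy and fixed are chosen here for illustration only.

```python
test = [{'name': 'one'}, {'name': 'two'}]

# Buggy version: one dict object is reused and appended twice.
shared = {}
buggy = []
for i in test:
    shared['name'] = i['name']
    buggy.append(shared)

# Fixed version: a fresh dict is built on every iteration.
fixed = [{'name': i['name']} for i in test]

print(buggy[0] is buggy[1])  # True: both entries are the same object
print(fixed[0] is fixed[1])  # False: two independent dicts
print(buggy)                 # [{'name': 'two'}, {'name': 'two'}]
print(fixed)                 # [{'name': 'one'}, {'name': 'two'}]
```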

Uniqify list of dicts based on specific keys - Keep specific occurrences in cases of duplicates

Let's suppose that I have a list of dicts like this:
list = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},
{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},
{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]
I want to uniqify the list of dicts based on key and timestamp.
Specifically, I want one dict per unique key, and when several dicts share a key, I want to keep the one with the most recent timestamp.
Therefore, I want to have the following:
list = [{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212},
{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252}]
How can I efficiently do this?

my_list = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},
{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},
{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]
r = {}
for d in my_list:
    k = d['key']
    if k not in r or r[k]['timestamp'] < d['timestamp']:
        r[k] = d
list(r.values())
output:
[{'key': 1,
'timestamp': 3456789012,
'action': 'like',
'type': 'photo',
'id': 212},
{'key': 2,
'timestamp': 2345678901,
'action': 'like',
'type': 'photo',
'id': 252}]
Here is a simple benchmark of most of the proposed solutions:
from itertools import groupby
import itertools
from operator import itemgetter
from random import choice, randint
from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()
@b.add_function()
def kederrac(lst):
    # Single pass: keep the newest dict seen so far for each key.
    r = {}
    for d in lst:
        k = d['key']
        if k not in r or r[k]['timestamp'] < d['timestamp']:
            r[k] = d
    return list(r.values())
@b.add_function()
def Daweo(lst):
    # Sort descending by (key, timestamp), then take the first of each group.
    s = sorted(lst, key=lambda x:(x['key'],x['timestamp']), reverse=True)
    return [next(g) for k, g in itertools.groupby(s, lambda x:x['key'])]
@b.add_function()
def Jan(lst):
    # Sort by key, then take the max timestamp within each group.
    result = []
    sorted_lst = sorted(lst, key=lambda x: x['key'])
    for k,v in groupby(sorted_lst, key = lambda x: x['key']):
        result.append(max(v, key=lambda x: x['timestamp']))
    return result
@b.add_function()
def Jan_one_line(lst):
    keyfunc = lambda x: x['key']
    return [max(v, key = lambda x: x['timestamp'])
            for k, v in groupby(sorted(lst, key=keyfunc), key=keyfunc)]
@b.add_function()
def gold_cy(lst):
    key = itemgetter('key')
    ts = itemgetter('timestamp')
    def custom_sort(item):
        return (key(item), -ts(item))
    results = []
    for k, v in groupby(sorted(lst, key=custom_sort), key=key):
        results.append(next(v))
    return results
@b.add_arguments('Number of dictionaries in list')
def argument_provider():
    # Yield (size, input list) pairs for sizes 2**2 .. 2**17.
    for exp in range(2, 18):
        size = 2**exp
        yield size, [{'key':choice(range((size // 10) or 2)),
                      'timestamp': randint(1_000_000_000, 10_000_000_000),
                      'action':'like','type':'photo','id':randint(100, 10000)}
                     for _ in range(size)]
r = b.run()
r.plot()
It shows that the simple for-loop solution is the most efficient. This is expected: the loop makes a single O(N) pass over the list, whereas the other solutions call the built-in sorted function, which has O(N log N) time complexity.
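For readers without simple_benchmark installed, a rough comparison can be sketched with the standard library alone. The function names single_pass and sort_then_group, the data size, and the key range are assumptions made for this illustration; timings will vary by machine, so none are shown.

```python
import random
import timeit
from itertools import groupby

def single_pass(lst):
    # O(N): one dict lookup per item, keeping the newest item per key.
    newest = {}
    for d in lst:
        k = d['key']
        if k not in newest or newest[k]['timestamp'] < d['timestamp']:
            newest[k] = d
    return list(newest.values())

def sort_then_group(lst):
    # O(N log N): the sort dominates before groupby can be applied.
    keyfunc = lambda x: x['key']
    return [max(g, key=lambda x: x['timestamp'])
            for _, g in groupby(sorted(lst, key=keyfunc), key=keyfunc)]

random.seed(0)  # fixed seed so repeated runs use the same data
data = [{'key': random.randrange(100), 'timestamp': random.randrange(10**9)}
        for _ in range(10_000)]

for fn in (single_pass, sort_then_group):
    seconds = timeit.timeit(lambda: fn(data), number=20)
    print(f'{fn.__name__}: {seconds:.4f}s for 20 runs')
```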

Another solution with itertools.groupby:
from itertools import groupby
lst = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},
{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},
{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]
result = []
sorted_lst = sorted(lst, key=lambda x: x['key'])
for k,v in groupby(sorted_lst, key = lambda x: x['key']):
    result.append(max(v, key=lambda x: x['timestamp']))
print(result)
Or - if you are into one-liners:
keyfunc = lambda x: x['key']
result = [max(v, key = lambda x: x['timestamp'])
for k, v in groupby(sorted(lst, key=keyfunc), key=keyfunc)]
Additionally, do not name your variables after built-ins such as list or id. id(...) returns the identity of an object: an integer that is unique among the objects alive at the same time.
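A short sketch of what goes wrong when a built-in name is reused. The functions broken and fine are hypothetical and only serve to keep the shadowing local.

```python
def broken():
    list = [3, 1, 2]          # shadows the built-in inside this function
    try:
        return list('abc')    # the local list object is not callable
    except TypeError as exc:
        return type(exc).__name__

def fine():
    items = [3, 1, 2]         # a neutral name leaves the built-in usable
    return list('abc'), items

print(broken())  # TypeError
print(fine())    # (['a', 'b', 'c'], [3, 1, 2])
```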

The easiest way would be to insert the items into a dict and then read all the values back as a list. Also, you should not use list as the name of a variable.
d = {}
for item in lst:
    key = item['key']
    if key not in d or item['timestamp'] > d[key]['timestamp']:
        d[key] = item
list(d.values())

You could do that using itertools.groupby in the following way:
import itertools
lst = [{'key':1,'timestamp':1234567890,'action':'like','type':'photo','id':245},{'key':2,'timestamp':2345678901,'action':'like','type':'photo','id':252},{'key':1,'timestamp':3456789012,'action':'like','type':'photo','id':212}]
s = sorted(lst, key=lambda x:(x['key'],x['timestamp']), reverse=True)
uniq_lst = [next(g) for k, g in itertools.groupby(s, lambda x:x['key'])]
Output:
[{'key': 2, 'timestamp': 2345678901, 'action': 'like', 'type': 'photo', 'id': 252}, {'key': 1, 'timestamp': 3456789012, 'action': 'like', 'type': 'photo', 'id': 212}]
First I sort by (key, timestamp) in reverse order, so elements with the same key are adjacent and the highest timestamp comes first within each key. Then I group elements with the same key and take the first record from every group.

We can use a combination of itertools.groupby and itemgetter. One caveat is that data must be presorted for itertools.groupby to work properly.
from itertools import groupby
from operator import itemgetter
key = itemgetter('key')
ts = itemgetter('timestamp')
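def custom_sort(item):
    # Ascending by key, descending by timestamp within each key.
    return (key(item), -ts(item))
results = []
# data is the list of dicts from the question
for k, v in groupby(sorted(data, key=custom_sort), key=key):
    results.append(next(v))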
[{'id': 212,
'action': 'like',
'key': 1,
'timestamp': 3456789012,
'type': 'photo'},
{'id': 252,
'action': 'like',
'key': 2,
'timestamp': 2345678901,
'type': 'photo'}]
As a side note, don't name variables using built-in names like list or id.
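The presorting caveat mentioned in this answer can be seen in a small sketch. The sample list unsorted is invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

key = itemgetter('key')
unsorted = [{'key': 1, 'timestamp': 10},
            {'key': 2, 'timestamp': 20},
            {'key': 1, 'timestamp': 30}]

# Without sorting, key 1 appears in two separate runs.
keys_unsorted = [k for k, _ in groupby(unsorted, key=key)]

# After sorting by the same key, each key forms exactly one group.
keys_sorted = [k for k, _ in groupby(sorted(unsorted, key=key), key=key)]

print(keys_unsorted)  # [1, 2, 1]
print(keys_sorted)    # [1, 2]
```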

An array into two arrays in fast way. python

I want to split an array into two arrays depending on whether each object has a 'confirmation' param. Is there a faster way than the simple for loop I used? The array has a lot of elements, so I am concerned about performance.
Before
[
    {
        'id':'1'
    },
    {
        'id':'2'
    },
    {
        'id':'3',
        'confirmation':'20',
    },
    {
        'id':'4',
        'confirmation':'10',
    }
]
After
[{'id': 3, 'confirmation': 20}, {'id': 4, 'confirmation': 10}]
[{'id': 1}, {'id': 2}]
Implementation using for loop
$ python3
Python 3.4.3 (default, Nov 17 2016, 01:08:31)
dict1 = {"id":1}
dict2 = {"id":2}
dict3 = {"id":3, "confirmation":20}
dict4 = {"id":4, "confirmation":10}
list = [dict1, dict2, dict3, dict4]
list_with_confirmation = []
list_without_confirmation = []
for d in list:
    if 'confirmation' in d:
        list_with_confirmation.append(d)
    else:
        list_without_confirmation.append(d)
print(list_with_confirmation)
print(list_without_confirmation)
Update 1
This is the result on our real data. (3) is the fastest.
(1) 0.148394346
(2) 0.105772018
(3) 0.0339076519
_list = search()
logger.warning(time.time())  # 1504691716.5748231
list_with_confirmation = []
list_without_confirmation = []
for d in _list:
    if 'confirmation' in d:
        list_with_confirmation.append(d)
    else:
        list_without_confirmation.append(d)
logger.warning(len(list_with_confirmation))  # 69427
logger.warning(time.time())  # 1504691716.7232175 (0.148394346) --- (1)
list_with_confirmation = [d for d in _list if 'confirmation' in d]
list_without_confirmation = [d for d in _list if 'confirmation' not in d]
logger.warning(len(list_with_confirmation))  # 69427
logger.warning(time.time())  # 1504691716.8289895 (0.105772018) --- (2)
lists = ([], [])
[lists['confirmation' in d].append(d) for d in _list]
logger.warning(len(lists[1]))  # 69427
logger.warning(time.time())  # 1504691716.8628972 (0.0339076519) --- (3)
I did not know how to use timeit in my environment. Sorry, this is only a rough benchmark.

List comprehension might be slightly faster:
list_with_confirmation = [d for d in list if "confirmation" in d]
list_without_confirmation = [d for d in list if "confirmation" not in d]
Refer to Why is list comprehension so faster?

Your loop is probably already the fastest way, but you could try another:
lists = ([], [])
for d in source_list:
    lists['confirmation' in d].append(d)
or even:
lists = ([], [])
[lists['confirmation' in d].append(d) for d in source_list]
This way lists[0] will be "without confirmation" and lists[1] will be "with confirmation". Do your own benchmarks.
Side note: don't use list as the name of a list, as it shadows the built-in list type.
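The indexing trick above works because bool is a subclass of int in Python, so False selects index 0 and True selects index 1. A minimal sketch, with sample data mirroring the question and the names without_conf and with_conf chosen for illustration:

```python
source_list = [{'id': 1}, {'id': 2},
               {'id': 3, 'confirmation': 20}, {'id': 4, 'confirmation': 10}]

# The membership test is a bool, which converts to 0 or 1.
print(int('confirmation' in source_list[0]))  # 0
print(int('confirmation' in source_list[2]))  # 1

without_conf, with_conf = [], []
lists = (without_conf, with_conf)
for d in source_list:
    # False -> lists[0] (without), True -> lists[1] (with)
    lists['confirmation' in d].append(d)

print([d['id'] for d in without_conf])  # [1, 2]
print([d['id'] for d in with_conf])     # [3, 4]
```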

If you execute the code below:
dict1 = {"id":1}
dict2 = {"id":2}
dict3 = {"id":3, "confirmation":20}
dict4 = {"id":4, "confirmation":10}
_list = [dict1, dict2, dict3, dict4]
import timeit
def fun(_list):
    list_with_confirmation = []
    list_without_confirmation = []
    for d in _list:
        if 'confirmation' in d:
            list_with_confirmation.append(d)
        else:
            list_without_confirmation.append(d)
    print(list_with_confirmation)
    print(list_without_confirmation)
def my_fun(_list):
    list_with_confirmation = [d for d in _list if 'confirmation' in d]
    list_without_confirmation = [d for d in _list if 'confirmation' not in d]
    print(list_with_confirmation)
    print(list_without_confirmation)
if __name__ == '__main__':
    print(timeit.timeit("fun(_list)", setup="from __main__ import fun, _list",number=1))
    print(timeit.timeit("my_fun(_list)", setup="from __main__ import my_fun, _list",number=1))
You get the following output:
[{'confirmation': 20, 'id': 3}, {'confirmation': 10, 'id': 4}]
[{'id': 1}, {'id': 2}]
5.41210174561e-05
[{'confirmation': 20, 'id': 3}, {'confirmation': 10, 'id': 4}]
[{'id': 1}, {'id': 2}]
2.40802764893e-05
In this run, the list-comprehension version was the faster of the two.
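A single timed call on four elements, with print inside the timed function, is easily dominated by noise. The sketch below shows one way to get a steadier measurement with timeit.repeat. The synthetic data, its size, and the function names with_loop and with_comprehensions are assumptions for illustration; no timings are quoted because they depend on the machine.

```python
import timeit

# Synthetic data (an assumption): every odd id carries a 'confirmation' key.
data = [{'id': n, 'confirmation': n} if n % 2 else {'id': n}
        for n in range(20_000)]

def with_loop(items):
    yes, no = [], []
    for d in items:
        if 'confirmation' in d:
            yes.append(d)
        else:
            no.append(d)
    return yes, no

def with_comprehensions(items):
    yes = [d for d in items if 'confirmation' in d]
    no = [d for d in items if 'confirmation' not in d]
    return yes, no

for fn in (with_loop, with_comprehensions):
    # Take the minimum of several repeats to reduce the effect of noise.
    best = min(timeit.repeat(lambda: fn(data), number=10, repeat=5))
    print(f'{fn.__name__}: {best:.4f}s per 10 calls')
```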

Python - Get dictionary element in a list of dictionaries after an if statement

How can I get a dictionary value in a list of dictionaries, based on the dictionary satisfying some condition? For instance, if one of the dictionaries in the list has the id=5, I want to print the value corresponding to the name key of that dictionary:
list = [{'name': 'Mike', 'id': 1}, {'name': 'Ellen', 'id': 5}]
id = 5
if any(m['id'] == id for m in list):
    print(m['name'])
This won't work because m is not defined outside the generator expression passed to any().

You have a list of dictionaries, so you can use a list comprehension:
[d for d in lst if d['id'] == 5]
# [{'id': 5, 'name': 'Ellen'}]
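If only the first matching dictionary is needed, one alternative is next() over a generator expression with a default value. This is a sketch rather than part of the original answer; the names people, wanted_id, match and missing are illustrative.

```python
people = [{'name': 'Mike', 'id': 1}, {'name': 'Ellen', 'id': 5}]
wanted_id = 5

# next() pulls the first match from the generator;
# the second argument is returned when nothing matches.
match = next((p for p in people if p['id'] == wanted_id), None)
if match is not None:
    print(match['name'])  # Ellen

missing = next((p for p in people if p['id'] == 99), None)
print(missing)  # None
```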

new_list = [m['name'] for m in list if m['id']==5]
print('\n'.join(new_list))

This will be easy to accomplish with a single for-loop:
for d in list:
    if 'id' in d and d['id'] == 5:
        print(d['name'])
There are two key concepts to learn here. The first is that we used a for loop to "go through each element of the list". The second is that we used the in operator to check whether a dictionary has a certain key.

How about the following?
for entry in list:
    if entry['id']==5:
        print(entry['name'])

ChainMap doesn't exist in Python 2, but a simple solution in Python 3 would be to use a ChainMap instead of a list.
import collections
d = collections.ChainMap(*[{'name':'Mike', 'id': 1}, {'name':'Ellen', 'id': 5}])
if 'id' in d:
    print(d['id'])  # prints 1, the value from the first mapping
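Note that a ChainMap resolves each key in the first mapping that contains it, so the snippet above does not appear to search for id 5 at all. A small sketch of that lookup behaviour:

```python
from collections import ChainMap

d = ChainMap({'name': 'Mike', 'id': 1}, {'name': 'Ellen', 'id': 5})

# Lookups stop at the first mapping that contains the key.
print(d['id'])          # 1
print(d['name'])        # Mike

# Ellen's entries are shadowed by Mike's, so 5 is not visible.
print(5 in d.values())  # False
```

For the question as asked, filtering the list of dicts is likely the safer choice.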

You can do it by using the filter function:
lis = [ {'name': 'Mike', 'id': 1}, {'name':'Ellen', 'id': 5}]
result = list(filter(lambda dic:dic['id']==5,lis))[0]['name']
print(result)

remove a dict entry in a list of dict python

How can I remove a key from a dict in a list?
For example:
My_list= [{'ID':0,'Name':'Paul','phone':'1234'},{'ID':1,'Name':'John','phone':'5678'}]
I want to remove the phone key from the dict with ID 1:
My_list= [{'ID':0,'Name':'Paul','phone':'1234'},{'ID':1,'Name':'John'}]
Thanks in advance for your help.

Just iterate through the list, check whether 'ID' equals 1, and if so delete the 'phone' key. This should work:
for d in My_list:
    if d["ID"] == 1:
        del d["phone"]
And finally print the list:
print(My_list)
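If some matching dicts might not have a 'phone' key, del raises KeyError. A hedged variant uses dict.pop with a default instead. The sample data below is altered so that John already lacks the key, purely to illustrate the difference.

```python
My_list = [{'ID': 0, 'Name': 'Paul', 'phone': '1234'},
           {'ID': 1, 'Name': 'John'}]  # John has no 'phone' key here

for d in My_list:
    if d['ID'] == 1:
        # pop() with a default does nothing when the key is absent,
        # whereas del d['phone'] would raise KeyError.
        d.pop('phone', None)

print(My_list)
# [{'ID': 0, 'Name': 'Paul', 'phone': '1234'}, {'ID': 1, 'Name': 'John'}]
```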

When the id you are looking for matches 1, then reconstruct the dictionary excluding the key phone, otherwise use the dictionary as it is, like this
l = [{'ID': 0, 'Name': 'Paul', 'phone': '1234'},
{'ID': 1, 'Name': 'John', 'phone': '5678'}]
k, f = 1, {"phone"}
print([{k: i[k] for k in i.keys() - f} if i["ID"] == k else i for i in l])
# [{'phone': '1234', 'ID': 0, 'Name': 'Paul'}, {'ID': 1, 'Name': 'John'}]
Here, k is the value of ID you are looking for and f is a set of keys which need to be excluded in the resulting dictionary, if the id matches.
