Removing duplicates from a 1-million-element array of dictionaries - Python

I have a HUGE list of objects that looks like this:
output = [
    {
        'name': 'some name',
        'id': '1'
    }
]
clean_list = []
for i in range(len(output)):
    if output[i] not in output[i + 1:]:
        clean_list.append(output[i])
It has 1 million entries, and this is the method I'm currently using. However, it takes hours to complete when the array is this large.
Is there an optimal way to remove duplicates?

There are two issues with the code you propose.
1. speed of "in" test
You wrote
clean_list = []
but you really want
clean_list = set()
That way in completes in O(1) constant
time, rather than O(N) linear time.
The linear probe inside the for
(the "not in output[i + 1:]" scan, which also copies the slice on every pass)
gives the loop O(N^2) quadratic cost.
2. equality test of hashable items
Your source items look like e.g. dict(name='some name', id='1').
You want to turn them into hashable tuples,
e.g. ('some name', '1'),
so you can take advantage of set's O(1) fast lookups.
from pprint import pp
clean = set()
for d in output:
    t = tuple(d.values())
    if t not in clean:
        clean.add(t)
pp(sorted(clean))
But wait! No need for checks, since a set will
reject attempts to add same thing twice.
So this suffices:
clean = set()
for d in output:
    clean.add(tuple(d.values()))
And now it's simple enough that a set comprehension makes sense.
clean = {tuple(d.values())
         for d in output}
Consider uniquifying on just name,
if the id value doesn't matter to you.
tl;dr: Doing ~ 10^6 operations is way better
than doing ~ 10^12 of them.
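The set above holds tuples rather than the original dictionaries, and tuple(d.values()) relies on every dictionary listing its keys in the same order. Here is a minimal sketch, assuming all values are hashable, that keeps the original dicts in first-seen order. The helper name dedupe_dicts and the sample data are made up for illustration.

```python
def dedupe_dicts(records):
    """Return records with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for d in records:
        # Sorting the items makes the key independent of key insertion order.
        key = tuple(sorted(d.items()))
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique


output = [
    {'name': 'some name', 'id': '1'},
    {'id': '1', 'name': 'some name'},   # same content, different key order
    {'name': 'other name', 'id': '2'},
]
clean_list = dedupe_dicts(output)
print(clean_list)
```

Each record is visited once and each set lookup is O(1) on average, so the whole pass stays roughly linear in the number of records.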

For such a large dataset it makes sense to use pandas, something like this:
import pandas as pd
output = [
    {'name': 'some name 1', 'id': '1'},
    {'name': 'some name 2', 'id': '2'},
    {'name': 'some name 3', 'id': '3'},
    {'name': 'some name 2', 'id': '2'}]
clean_list = pd.DataFrame(output).drop_duplicates().to_dict('records')
>>> clean_list
[{'name': 'some name 1', 'id': '1'},
 {'name': 'some name 2', 'id': '2'},
 {'name': 'some name 3', 'id': '3'}]


Behavior of list comprehension with self reference

I'm retrieving a list of (name, id) pairs and I need to make sure there's no duplicate of name, regardless of the id.
# Sample data
filesID = [{'name': 'file1', 'id': '353'},{'name': 'file2', 'id': '154'},{'name': 'file3', 'id': '1874'},{'name': 'file1', 'id': '14'}]
I managed to get the desired output with nested loops:
uniqueFilesIDLoops = []
for pair in filesID:
    found = False
    for d in uniqueFilesIDLoops:
        if d['name'] == pair['name']:
            found = True
    if not found:
        uniqueFilesIDLoops.append(pair)
But I can't get it to work with a list comprehension... Here's what I've tried so far:
uniqueFilesIDComprehension = []
uniqueFilesIDComprehension = [pair for pair in filesID if pair['name'] not in [d['name'] for d in uniqueFilesIDComprehension]]
Outputs :
# Original data
[{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'}, {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]
# Data obtained with list comprehension
[{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'}, {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]
# Data obtained with loops (and desired output)
[{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'}, {'name': 'file3', 'id': '1874'}]
I was thinking that maybe the call to uniqueFilesIDComprehension inside the list comprehension was not updated at each iteration, thus using [] and not finding corresponding values...
You cannot access the contents of a list comprehension while it is being built, because the result is bound to a name only after the whole expression has been evaluated.
The simplest way to remove duplicates would be:
list({el['name']: el for el in filesID}.values())
This creates a dictionary keyed on the name of each element, so every time a duplicate name is encountered the earlier element is overwritten by the new one. Once the dict is built, all you need to do is take its values and convert them to a list.
If you want to keep the first element with each name rather than the last, you can instead build the dictionary in a for loop:
out = {}
for el in filesID:
    if el['name'] not in out:
        out[el['name']] = el
And finally, one thing to consider when implementing any of those solutions - since you do not care about id part, do you really need to extract it?
I'd ask myself if this is not a valid choice as well.
out = {el['name'] for el in filesID}
print(out)
Output: {'file1', 'file3', 'file2'}
I would stick with your original loop, although note that it can be made a little cleaner. Namely, you don't need a flag named found.
uniqueFilesIDLoops = []
for pair in filesID:
    for d in uniqueFilesIDLoops:
        if d['name'] == pair['name']:
            break
    else:
        uniqueFilesIDLoops.append(pair)
You can also use an auxiliary set to simplify detecting duplicate names (since they are str values and therefore hashable).
seen = set()
uniqueFilesIDLoops = []
for pair in filesID:
    if (name := pair['name']) not in seen:
        seen.add(name)
        uniqueFilesIDLoops.append(pair)
Because we've now decoupled the result from the data structure we perform lookups in, the above could be turned into a list comprehension by writing an expression that both returns True when the name is not in the set and adds the name to the set. Something iffy like
seen = set()
uniqueFilesIDLoops = [pair
                      for pair in filesID
                      if (pair['name'] not in seen
                          and (seen.add(pair['name']) or True))]
(seen.add always returns None, which is a falsey value, so seen.add(...) or True is always True.)
List comprehensions are used to create new lists, so the original list is never updated; the assignment causes the variable to refer to the newly created list.
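A small sketch of why the self-referencing comprehension in the question filters nothing: the right-hand side is evaluated in full before the name is rebound, so the condition only ever sees the old, empty list. The variable names unique and result are chosen here for illustration.

```python
filesID = [{'name': 'file1', 'id': '353'},
           {'name': 'file2', 'id': '154'},
           {'name': 'file1', 'id': '14'}]

unique = []
# The right-hand side is evaluated completely first; while it runs,
# the name `unique` still refers to the empty list created above.
result = [pair for pair in filesID
          if pair['name'] not in [d['name'] for d in unique]]

print(unique)             # still [] after the comprehension has run
print(result == filesID)  # True: no pair was filtered out

unique = result           # only now is the name rebound
```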

Can I loop an array inside a dictionary definition in python?

I'm trying to push data to Firebase and I was able to loop through an array and push the information on each loop.
But I need to add some pictures in this (So it'd be like looping an array inside a dictionary definition). I have all the links in an array.
This is my code so far.
def image_upload():
    for i in range(len(excel.name)):
        doc_ref = db.collection('plans').document('doc-name').collection('collection-name').document()
        doc_id = doc_ref.id
        data = {
            'bedroomLocation': excel.bedroomLocation[i],
            'bedrooms': excel.bedrooms[i],
            'brand': excel.brand[i],
            'catalog': excel.catalog[i],
            'category': excel.category[i],
            'code': excel.code[i],
            'depth': excel.depth[i],
            'description': excel.description[i],
            'fullBaths': excel.fullBaths[i],
            'garage': excel.garage[i],
            'garageBays': excel.garageBays[i],
            'garageLocation': excel.garageLocation[i],
            'garageType': excel.garageType[i],
            'date': firestore.SERVER_TIMESTAMP,
            'id': doc_id,
            'halfBaths': excel.halfBaths[i],
            'laundryLocation': excel.laundryLocation[i],
            'name': excel.name[i],
            'onCreated': excel.onCreated[i],
            'productType': excel.productType[i],
            'region': excel.region[i],
            'sqf': excel.sqf[i],
            'state': excel.state[i],
            'stories': excel.stories[i],
            'tags': [excel.tags[i]],
            'width': excel.width[i],
        }
        doc_ref.set(data)
That works fine, but I don't really know how to loop through the array of links.
This is what I tried below the block I copied above.
for j in range(len(excel.gallery)):
    if len(excel.gallery[j]) != 0:
        for k in range(len(excel.gallery[j])):
            data['gallery'] = firestore.ArrayUnion(
                [{'name': excel.gallery[j][k][0], 'type': excel.gallery[j][k][1],
                  'url': excel.gallery[j][k][2]}])
            print(data)
            doc_ref.set(data)
excel.gallery has the same length as excel.name; each position j holds a different number of links, though.
If I declare the gallery inside the data definition, use ArrayUnion and hard-code more than one entry, it works fine, but I need to use that array to push the information to Firebase.
excel.gallery is actually a matrix, not a dictionary. This is one example of its contents: [[['Images/CFK_0004-Craftmark_Oakmont_Elevation_1.jpeg', 'Elevation', 'url example from firebase'], .... and it goes on for each file. I'm testing with 8 images and 2 plans, so my matrix is 2x4 in this case. But a position can be empty if no files match. What I'm looking for is to add all the urls for a plan to its data, either before or after the data is pushed (the order doesn't matter).
This works:
'gallery': firestore.ArrayUnion(
    [{'name': 'Example Name', 'type': 'Elevation',
      'url': 'Example url'},
     {'name': 'Example Name2', 'type': 'First Floor',
      'url': 'Example url2'}])
But I need to populate that field looping through excel.gallery
I'm a little confused when you say, "I have all the links in an array". Could we see that array and the kind of output you are looking for?
Also, assuming that excel.gallery is a dictionary, you could clean up your code substantially using the items() method.
for j, k in excel.gallery.items():
    if k:
        data['gallery'] = firestore.ArrayUnion([{'name': k[0], 'type': k[1], 'url': k[2]}])
        print(data)
        doc_ref.set(data)
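If excel.gallery is instead a nested list, as the question clarifies (one inner list of [name, type, url] triples per plan), one option is to build the whole list for a plan before writing the document. This is only a sketch: build_gallery is a hypothetical helper, the sample data is invented, and the Firestore calls are left out so the snippet runs on its own.

```python
def build_gallery(entries):
    """Turn [[name, type, url], ...] into a list of dicts for one plan."""
    return [{'name': name, 'type': kind, 'url': url}
            for name, kind, url in entries]


# Stand-in for excel.gallery: one inner list per plan, possibly empty.
gallery = [
    [['Images/plan_a_1.jpeg', 'Elevation', 'url-1'],
     ['Images/plan_a_2.jpeg', 'First Floor', 'url-2']],
    [],  # a plan with no matching files
]

for i, entries in enumerate(gallery):
    data = {'name': f'plan {i}'}          # the other fields would go here
    data['gallery'] = build_gallery(entries)
    print(data)
```

Inside the original loop this would presumably become data['gallery'] = build_gallery(excel.gallery[i]), placed before doc_ref.set(data), so that each document is written once with all of its links.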

How to extract dictionaries from a list of dictionaries with a condition

How do I extract entries from JSON based on conditions?
I have a list of dictionaries and need to extract some of them based on several conditions.
Across different fields I need an AND condition.
For several allowed values of the same field I need an OR condition.
subject is Physics or Accounting (OR over the allowed values)
AND
type is Permanent or Guest (OR over the allowed values)
AND
Location is NY
test = [{'id':1,'name': 'A','subject': ['Maths','Accounting'],'type':'Contract', 'Location':'NY'},
{ 'id':2,'name': 'AB','subject': ['Physics','Engineering'],'type':'Permanent','Location':'NY'},
{'id':3,'name': 'ABC','subject': ['Maths','Engineering'],'type':'Permanent','Location':'NY'},
{'id':4,'name':'ABCD','subject': ['Physics','Engineering'],'type':['Contract','Guest'],'Location':'NY'}]
The expected output is [{ 'id':2,'name': 'AB','subject': ['Physics','Engineering'],'type':'Permanent','Location':'NY'},{'id':4,'name':'ABCD','subject': ['Physics','Engineering'],'type':['Contract','Guest'],'Location':'NY'}]
The problem here is mostly that your data is not uniform: sometimes a value is a string, sometimes it's a list. Let's try:
# turns the values into set for easy comparison
def get_set(d, field):
    return {d[field]} if isinstance(d[field], str) else set(d[field])
# we use this to filter
def validate(d):
    # the three lines below correspond to the three conditions listed
    return get_set(d, 'subject').intersection({'Physics', 'Accounting'}) and \
           get_set(d, 'type').intersection({'Permanent', 'Guest'}) and \
           get_set(d, 'Location') == {'NY'}
result = [d for d in test if validate(d)]
Output:
[{'id': 2,
  'name': 'AB',
  'subject': ['Physics', 'Engineering'],
  'type': 'Permanent',
  'Location': 'NY'},
 {'id': 4,
  'name': 'ABCD',
  'subject': ['Physics', 'Engineering'],
  'type': ['Contract', 'Guest'],
  'Location': 'NY'}]
The following simple approach with nested if clauses solves the issue. The AND condition is expressed by nesting the ifs, and each OR condition is simply written with or.
The in operator works on both strings and lists, so it can be used for either kind of value and produces the expected output. BUT on a string, in is a substring test, so this approach assumes there are no values such as 'XYZ Accounting' that merely contain one of the search terms.
result = []
for elem in test:
    # Check Location
    if elem['Location'] == 'NY':
        # Check subject
        subject = elem['subject']
        if ('Accounting' in subject) or ('Physics' in subject):
            # Check type
            elem_type = elem['type']
            if ('Permanent' in elem_type) or ('Guest' in elem_type):
                # Add element to result, because all conditions are true
                result.append(elem)
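Both answers hard-code the three conditions. A possible generalisation, sketched under the assumption that every field holds either a string or a list of strings: allowed values for the same field are OR-ed, and different fields are AND-ed. The names as_set, matches and criteria are invented for this example.

```python
def as_set(value):
    """Wrap a single string in a set; convert a list of strings to a set."""
    return {value} if isinstance(value, str) else set(value)


def matches(record, criteria):
    """True if, for every field, the record shares a value with the allowed set."""
    return all(as_set(record[field]) & allowed
               for field, allowed in criteria.items())


test = [
    {'id': 1, 'name': 'A', 'subject': ['Maths', 'Accounting'], 'type': 'Contract', 'Location': 'NY'},
    {'id': 2, 'name': 'AB', 'subject': ['Physics', 'Engineering'], 'type': 'Permanent', 'Location': 'NY'},
    {'id': 3, 'name': 'ABC', 'subject': ['Maths', 'Engineering'], 'type': 'Permanent', 'Location': 'NY'},
    {'id': 4, 'name': 'ABCD', 'subject': ['Physics', 'Engineering'], 'type': ['Contract', 'Guest'], 'Location': 'NY'},
]

criteria = {
    'subject': {'Physics', 'Accounting'},   # OR within a field
    'type': {'Permanent', 'Guest'},
    'Location': {'NY'},                     # AND across fields
}

result = [d for d in test if matches(d, criteria)]
print([d['id'] for d in result])
```

Because whole values are compared through set intersection, a value such as 'XYZ Accounting' would not match 'Accounting' here, unlike the substring behaviour of in on strings.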

How do I turn list values into an array with an index that matches the other dic values?

Hoping someone can help me out. I've spent the past couple hours trying to solve this, and fair warning, I'm still fairly new to python.
This is a repost of a question I recently deleted. I had misrepresented my code in the last example. The correct example is:
I have a list of dictionaries, each containing a list, that looks similar to:
dic = [
    {
        'name': 'john',
        'items': ['pants_1', 'shirt_2', 'socks_3']
    },
    {
        'name': 'bob',
        'items': ['jacket_1', 'hat_1']
    }
]
I'm using .append for both 'name', and 'items', which adds the dic values into two new lists:
for x in dic:
    dic_name.append(dic['name'])
    dic_items.append(dic['items'])
I need to split the item value using '_' as the delimiter, so I've also split the values by doing:
name, items = ([i if i is None else i.split('_')[0] for i in dic_name],
               [i if i is None else i.split('_')[0] for i in chain(*dic_items)])
None is used in case there is no value. This provides me with a new list for name, items, with the delimiter used. Disregard the fact that I used '_' split for names in this example.
When I use this, the indexes for name and items no longer match. Do I need to put the listed items in an array to match the name index, and if so, how?
Ideally, I want name[0] (which is john), to also match items[0] (as an array of the items in the list, so pants, shirt, socks). This way when I refer to index 0 for name, it also grabs all the values for items as index 0. The same thing regarding the index used for bob [1], which should match his items with the same index.
@avinash-raj, thanks for your patience, as I've had to update my question to reflect more closely the code I'm working with.
I'm reading a little bit between the lines but are you trying to just collapse the list and get rid of the field names, e.g.:
>>> dic = [{'name': 'john', 'items':['pants_1','shirt_2','socks_3']},
...        {'name': 'bob', 'items':['jacket_1','hat_1']}]
>>> data = {d['name']: dict(i.split('_') for i in d['items']) for d in dic}
>>> data
{'bob': {'hat': '1', 'jacket': '1'},
 'john': {'pants': '1', 'shirt': '2', 'socks': '3'}}
Now the data is directly related vs. indirectly related via a common index into 2 lists. If you want the dictionary split out you can always do:
>>> dic_name, dic_items = zip(*data.items())
>>> dic_name
('bob', 'john')
>>> dic_items
({'hat': '1', 'jacket': '1'}, {'pants': '1', 'shirt': '2', 'socks': '3'})
You need a list of dictionaries because the duplicate keys name and items are overwritten:
items = [[i.split('_')[0] for i in d['items']] for d in your_list]
names = [d['name'] for d in your_list] # then grab names from list
Alternatively, you can do this in one line with the built-in zip function and a generator expression, like so:
names, items = zip(*((i['name'], [j.split('_')[0] for j in i['items']]) for i in dic))
From Looping Techniques in the Tutorial:
for name, item in dic.items():
    names.append(name)
    items.append(item)
That will work if your dict is structured as
{'name': [item1]}
In the loop body of
for x in dic:
    dic_name.append(dic['name'])
    dic_items.append(dic['items'])
you'll probably want to access x (to which the items in dic will be assigned in turn) rather than dic.
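A minimal sketch of the corrected loop, using the sample data from the question. Keeping each person's items together as one inner list (rather than flattening them) is what keeps the two lists aligned by index; the '_' split follows the question's intent and the variable names are the question's own.

```python
dic = [
    {'name': 'john', 'items': ['pants_1', 'shirt_2', 'socks_3']},
    {'name': 'bob', 'items': ['jacket_1', 'hat_1']},
]

dic_name = []
dic_items = []
for x in dic:
    # Index the loop variable x, not the list dic itself.
    dic_name.append(x['name'])
    # One inner list per person, so dic_items[i] belongs to dic_name[i].
    dic_items.append([item.split('_')[0] for item in x['items']])

print(dic_name[0], dic_items[0])
print(dic_name[1], dic_items[1])
```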

What is the easiest way to search through a list of dicts in Python?

My database currently returns a list of dicts:
id_list = ({'id': '0c871320cf5111df87da000c29196d3d'},
{'id': '2eeeb9f4cf5111df87da000c29196d3d'},
{'id': '3b982384cf5111df87da000c29196d3d'},
{'id': '3f6f3fcecf5111df87da000c29196d3d'},
{'id': '44762370cf5111df87da000c29196d3d'},
{'id': '4ba0d294cf5111df87da000c29196d3d'})
How can I easily check if a given id is in this list or not?
Thanks.
Here's a one-liner:
if some_id in [d.get('id') for d in id_list]:
    pass
Not very efficient though.
edit -- A better approach might be:
if some_id in (d.get('id') for d in id_list):
    pass
This way, the list isn't generated in full length beforehand.
How can I easily check if a given id is in this list or not?
Make a set:
keys = set(d['id'] for d in id_list)
if some_value in keys:
    pass
Don't ask if this is "efficient" or "best". It involves the standard tradeoff.
Building the set takes time. But the lookup is then instant.
If you do a lot of lookups, the cost of building the set is amortized over each lookup.
If you do few lookups, the cost of building the set may be higher than something like
{'id':some_value} in id_list.
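A short sketch of that tradeoff, using a shortened copy of the question's data. The set is built once in a single pass over id_list, and each later membership test is a hash lookup. The name known_ids and the candidate values are illustrative.

```python
id_list = ({'id': '0c871320cf5111df87da000c29196d3d'},
           {'id': '2eeeb9f4cf5111df87da000c29196d3d'},
           {'id': '3b982384cf5111df87da000c29196d3d'})

# Built once: one pass over id_list.
known_ids = {d['id'] for d in id_list}

# Reused for every lookup: each test is O(1) on average.
for candidate in ('2eeeb9f4cf5111df87da000c29196d3d', 'not-an-id'):
    print(candidate, candidate in known_ids)
```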
If you make a dictionary of your search id, you can test for it directly:
search_dic = {'id': '0c871320cf5111df87da000c29196d3d'}
id_list = ({'id': '0c871320cf5111df87da000c29196d3d'},
{'id': '2eeeb9f4cf5111df87da000c29196d3d'},
{'id': '3b982384cf5111df87da000c29196d3d'},
{'id': '3f6f3fcecf5111df87da000c29196d3d'},
{'id': '44762370cf5111df87da000c29196d3d'},
{'id': '4ba0d294cf5111df87da000c29196d3d'})
if search_dic in id_list:
    print('yes')
any(x.get('id')==given_id for x in id_list)
... returns a boolean. Efficiency? See S.Lott's answer.
You can flatten it with a list comprehension and use in:
id in [d['id'] for d in id_list]
You can also use generator expressions, which have different performance characteristics (and will use less memory if your list is huge):
id in (d['id'] for d in id_list)
