Behavior of a list comprehension with self-reference - python

I'm retrieving a list of (name, id) pairs and I need to make sure there's no duplicate of name, regardless of the id.
# Sample data
filesID = [{'name': 'file1', 'id': '353'},{'name': 'file2', 'id': '154'},{'name': 'file3', 'id': '1874'},{'name': 'file1', 'id': '14'}]
I managed to get the desired output with nested loops :
uniqueFilesIDLoops = []
for pair in filesID:
    found = False
    for d in uniqueFilesIDLoops:
        if d['name'] == pair['name']:
            found = True
    if not found:
        uniqueFilesIDLoops.append(pair)
But I can't get it to work with list comprehension ... Here's what I've tried so far :
uniqueFilesIDComprehension = []
uniqueFilesIDComprehension = [pair for pair in filesID if pair['name'] not in [d['name'] for d in uniqueFilesIDComprehension]]
Outputs :
# Original data
[{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'}, {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]
# Data obtained with list comprehension
[{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'}, {'name': 'file3', 'id': '1874'}, {'name': 'file1', 'id': '14'}]
# Data obtained with loops (and desired output)
[{'name': 'file1', 'id': '353'}, {'name': 'file2', 'id': '154'}, {'name': 'file3', 'id': '1874'}]
I was thinking that maybe the call to uniqueFilesIDComprehension inside the list comprehension was not updated at each iteration, thus using [] and not finding corresponding values...

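You cannot access the contents of a list comprehension while it is being built: the name is only bound to the new list after the whole comprehension has been evaluated. Inside the comprehension, uniqueFilesIDComprehension therefore still refers to the empty list assigned on the previous line.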
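A minimal sketch of that evaluation order, using a shortened version of the sample data (the names `files`, `result` and `before` are made up for illustration). The inner lookup only ever sees the empty list, so nothing is filtered out:

```python
files = [
    {'name': 'file1', 'id': '353'},
    {'name': 'file2', 'id': '154'},
    {'name': 'file1', 'id': '14'},
]

result = []
before = result  # keep a second reference to the original empty list

# While the comprehension runs, `result` is still bound to the empty list,
# so the membership test never finds a previously kept name.
result = [f for f in files if f['name'] not in [d['name'] for d in result]]

print(before)       # the original list is untouched and still empty
print(len(result))  # every element passed the filter, duplicates included
```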
The simplest way to remove duplicates would be:
list({el['name']: el for el in filesID}.values())
This creates a dictionary keyed by the name of each element, so every time a duplicate name is encountered, the stored element is overwritten with the new one. Once the dict is built, all you need to do is take its values and convert them to a list.
If you want to keep the first element with each name, not the last you can instead do it by creating the dictionary in a for loop:
out = {}
for el in filesID:
    if el['name'] not in out:
        out[el['name']] = el
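To make the difference between the two variants concrete, here is a small sketch on the sample data. The names `last_wins` and `first_wins` are only illustrative, and `setdefault` is swapped in as a shorter equivalent of the `if ... not in out` check. The order of the final list assumes dicts preserve insertion order (Python 3.7+).

```python
filesID = [
    {'name': 'file1', 'id': '353'},
    {'name': 'file2', 'id': '154'},
    {'name': 'file3', 'id': '1874'},
    {'name': 'file1', 'id': '14'},
]

# Dict comprehension: a later duplicate overwrites the earlier value.
last_wins = {el['name']: el for el in filesID}

# setdefault only stores the element if the name is not present yet.
first_wins = {}
for el in filesID:
    first_wins.setdefault(el['name'], el)

print(last_wins['file1']['id'])   # id of the last 'file1' entry
print(first_wins['file1']['id'])  # id of the first 'file1' entry
print(list(first_wins.values()))
```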
And finally, one thing to consider when implementing any of those solutions - since you do not care about id part, do you really need to extract it?
I'd ask myself if this is not a valid choice as well.
out = {el['name'] for el in filesID}
print(out)
Output: {'file1', 'file3', 'file2'}

I would stick with your original loop, although note that it can be made a little cleaner. Namely, you don't need a flag named found.
uniqueFilesIDLoops = []
for pair in filesID:
    for d in uniqueFilesIDLoops:
        if d['name'] == pair['name']:
            break
    else:
        uniqueFilesIDLoops.append(pair)
You can also use an auxiliary set to simplify detecting duplicate names (since they are str values and therefore hashable).
seen = set()
uniqueFilesIDLoops = []
for pair in filesID:
    if (name := pair['name']) not in seen:
        seen.add(name)
        uniqueFilesIDLoops.append(pair)
Because we've now decoupled the result from the data structure we perform lookups in, the above could be turned into a list comprehension by writing an expression that both returns True when the name is not in the set and adds the name to the set. Something iffy like
seen = set()
uniqueFilesIDLoops = [pair
                      for pair in filesID
                      if (pair['name'] not in seen
                          and (seen.add(pair['name']) or True))]
(seen.add always returns None, which is a falsey value, so seen.add(...) or True is always True.)
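If the side effect inside the condition feels too clever, one alternative (not from the answer above, just a sketch) is to move the `seen` bookkeeping into a small generator. `unique_by` and its `key` parameter are hypothetical names chosen here.

```python
def unique_by(items, key):
    """Yield the items whose key(item) has not been seen before, in order."""
    seen = set()
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            yield item


filesID = [
    {'name': 'file1', 'id': '353'},
    {'name': 'file2', 'id': '154'},
    {'name': 'file3', 'id': '1874'},
    {'name': 'file1', 'id': '14'},
]

# Keep the first element seen for each name.
uniqueFilesID = list(unique_by(filesID, key=lambda pair: pair['name']))
print(uniqueFilesID)
```

The call site stays a one-liner, and the set is hidden inside the helper instead of leaking into the comprehension.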

List comprehensions are used to create new lists, so the original list is never updated; the assignment causes the variable to refer to the newly created list.

Related

Removing duplicates from a 1 million element array of dictionaries

I have a HUGE list of objects.
That looks like this
output = [
{
'name': 'some name',
'id': '1'
}
]
clean_list = []
for i in range(len(output)):
    if output[i] not in output[i + 1:]:
        clean_list.append(output[i])
The list has 1 million entries, and this is the method I'm currently using. However, it takes hours to complete when the list is this large.
Is there an optimal way to remove duplicates?
There are two issues with the code you propose.
1. speed of "in" test
You wrote
clean_list = []
but you really want
clean_list = set()
That way in completes in O(1) constant
time, rather than O(N) linear time.
The linear probe inside the for
gives the loop O(N^2) quadratic cost.
2. equality test of hashable items
Your source items look like e.g. dict(name='some name', id='1').
You want to turn them into hashable tuples,
e.g. ('some name', '1'),
so you can take advantage of set's O(1) fast lookups.
from pprint import pp
clean = set()
for d in output:
    t = tuple(d.values())
    if t not in clean:
        clean.add(t)
pp(sorted(clean))
But wait! No need for checks, since a set will
reject attempts to add same thing twice.
So this suffices:
clean = set()
for d in output:
    clean.add(tuple(d.values()))
And now it's simple enough that a set comprehension makes sense.
clean = {tuple(d.values())
for d in output}
Consider uniquifying on just name,
if the id value doesn't matter to you.
tl;dr: Doing ~ 10^6 operations is way better
than doing ~ 10^12 of them.
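A set of tuples drops the keys and the original order. If the cleaned result should still be a list of dicts, a variation on the same idea is sketched below, under the assumptions that every value is hashable and that all keys are strings. `dedupe_dicts` is a made-up helper name; sorting the items makes the fingerprint independent of key order.

```python
def dedupe_dicts(records):
    """Return the records without exact duplicates, keeping first occurrences."""
    seen = set()
    unique = []
    for record in records:
        # A sorted tuple of (key, value) pairs is hashable and order-independent.
        fingerprint = tuple(sorted(record.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


output = [
    {'name': 'some name 1', 'id': '1'},
    {'name': 'some name 2', 'id': '2'},
    {'id': '1', 'name': 'some name 1'},  # same content, different key order
]
clean_list = dedupe_dicts(output)
print(clean_list)
```

Each record is hashed once and looked up once, so the work grows roughly linearly with the size of the list.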
For such a large dataset it makes sense to use pandas, something like this:
import pandas as pd
output = [
{'name': 'some name 1','id': '1'},
{'name': 'some name 2','id': '2'},
{'name': 'some name 3','id': '3'},
{'name': 'some name 2','id': '2'}]
clean_list = pd.DataFrame(output).drop_duplicates().to_dict('records')
>>> clean_list
[{'name': 'some name 1', 'id': '1'},
 {'name': 'some name 2', 'id': '2'},
 {'name': 'some name 3', 'id': '3'}]

How to extract the dictionary from list of dictionary with condition

How to extract from the json with condition
I have a list of dictionaries and need to extract the ones that satisfy some conditions.
Conditions on different fields are combined with AND.
Multiple values for the same field are combined with OR.
I need the entries whose subject is Physics or Accounting (OR within the field)
AND
whose type is Permanent or Guest (OR within the field)
AND
whose Location is NY.
test = [{'id':1,'name': 'A','subject': ['Maths','Accounting'],'type':'Contract', 'Location':'NY'},
{ 'id':2,'name': 'AB','subject': ['Physics','Engineering'],'type':'Permanent','Location':'NY'},
{'id':3,'name': 'ABC','subject': ['Maths','Engineering'],'type':'Permanent','Location':'NY'},
{'id':4,'name':'ABCD','subject': ['Physics','Engineering'],'type':['Contract','Guest'],'Location':'NY'}]
Expected output: [{ 'id':2,'name': 'AB','subject': ['Physics','Engineering'],'type':'Permanent','Location':'NY'},{'id':4,'name':'ABCD','subject': ['Physics','Engineering'],'type':['Contract','Guest'],'Location':'NY'}]
The problem here is mostly that your data is not uniform: sometimes a value is a string, sometimes it's a list. Let's try:
# turns the values into set for easy comparison
def get_set(d,field):
    return {d[field]} if isinstance(d[field], str) else set(d[field])
# we use this to filter
def validate(d):
    # the three lines below correspond to the three conditions listed
    return get_set(d,'subject').intersection({'Physics','Accounting'}) and \
           get_set(d,'type').intersection({'Permanent', 'Guest'}) and \
           get_set(d,'Location')=={'NY'}
result = [d for d in test if validate(d)]
Output:
[{'id': 2,
'name': 'AB',
'subject': ['Physics', 'Engineering'],
'type': 'Permanent',
'Location': 'NY'},
{'id': 4,
'name': 'ABCD',
'subject': ['Physics', 'Engineering'],
'type': ['Contract', 'Guest'],
'Location': 'NY'}]
The following simple approach with nested if clauses solves the issue. The AND condition is expressed by nesting the if statements, and each OR condition is written with or.
The in operator works on both string and list values, so it can be used for either and produces the expected output. BUT on a string it performs a substring test, so this approach assumes there are no values such as 'XYZ Accounting' that merely contain one of the search terms.
result = []
for elem in test:
    # Check Location
    if elem['Location'] == 'NY':
        # Check subject
        subject = elem['subject']
        if ('Accounting' in subject) or ('Physics' in subject):
            # Check type
            elem_type = elem['type']
            if ('Permanent' in elem_type) or ('Guest' in elem_type):
                # Add element to result, because all conditions are true
                result.append(elem)
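Both answers hard-code the three conditions. As a sketch only, the same AND-of-ORs rule can be driven by a table of allowed values. `as_set`, `matches` and `criteria` are names invented for this example, and it assumes every field holds either a string or a list of strings. Comparing whole values through sets also avoids accidental substring matches on string values.

```python
def as_set(value):
    """Wrap a single string in a set; turn a list of strings into a set."""
    return {value} if isinstance(value, str) else set(value)


def matches(record, criteria):
    """AND across fields, OR within each field's set of allowed values."""
    return all(as_set(record[field]) & allowed
               for field, allowed in criteria.items())


test = [
    {'id': 1, 'name': 'A', 'subject': ['Maths', 'Accounting'], 'type': 'Contract', 'Location': 'NY'},
    {'id': 2, 'name': 'AB', 'subject': ['Physics', 'Engineering'], 'type': 'Permanent', 'Location': 'NY'},
    {'id': 3, 'name': 'ABC', 'subject': ['Maths', 'Engineering'], 'type': 'Permanent', 'Location': 'NY'},
    {'id': 4, 'name': 'ABCD', 'subject': ['Physics', 'Engineering'], 'type': ['Contract', 'Guest'], 'Location': 'NY'},
]

criteria = {
    'subject': {'Physics', 'Accounting'},
    'type': {'Permanent', 'Guest'},
    'Location': {'NY'},
}

result = [record for record in test if matches(record, criteria)]
print([record['id'] for record in result])
```

Adding a fourth condition then only means adding one more entry to `criteria`.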

Returning a list of dictionaries by iterating over given data

So let's say I've a list of students and there is a dictionary containing some data for each student as given below :
students= [{'student_name': 'name1',
'regNO': '12',
},{'student_name': 'name2',
'regNO': '13',
},{'student_name': 'name3',
'regNO': '14',
}
]
So based on the above data I want to return another list of dictionaries containing data for each student,
I wrote the following code :
res_dict = {}
res_list = []
for student in students:
    res_dict['name']=student['student_name']
    res_list.append(res_dict)
print(res_list)
I was hoping that in the output, for each student , there would be a dictionary with key being 'name' and value being the student name taken from 'students' list. I expected it to be as follows :
[{'name': 'name1'}, {'name': 'name2'}, {'name': 'name3'}]
But the output turned out to be this :
[{'name': 'name3'}, {'name': 'name3'}, {'name': 'name3'}]
Can anyone help me identify the issue in my code ?
A better way to get the desired result is to use a list comprehension:
[{'name': student['student_name']} for student in students]
The issue with your code is that you keep updating the same dict object and appending that same object to the list again on every iteration. Change your code to:
for student in students:
    res_dict = {}  # Create new `dict` object
    res_dict['name'] = student['student_name']
    res_list.append(res_dict)
OR, you may just do:
for student in students:
    res_list.append({'name': student['student_name']})
The subtle point about lists is that append does not copy the item you add. The list just stores a reference to the object, similar to an array of pointers in C. Because you append the same res_dict on every iteration, all the items in the list point to the same object, so print(res_list) gives you [{'name': 'name3'}, {'name': 'name3'}, {'name': 'name3'}]. You can verify this with this small fragment of code:
for i in res_list:
    print(id(i))
You will find the same memory address for all the list elements.
But when you take a copy of the dictionary with d.copy() and append that to res_list, the list elements point to different objects, which you can check with the same id(i) loop shown above.
So finally the corrected code would be
students= [{'student_name': 'name1',
'regNO': '12',
},{'student_name': 'name2',
'regNO': '13',
},{'student_name': 'name3',
'regNO': '14',
}
]
res_dict = {}
res_list = []
for student in students:
    res_dict['name']=student['student_name']
    res_list.append(res_dict.copy())
print(res_list)
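The identity check described above can be put side by side for both versions. This is a small sketch; `aliased`, `copied` and `shared` are illustrative names, and counting distinct `id()` values is just one way to show the difference.

```python
students = [
    {'student_name': 'name1', 'regNO': '12'},
    {'student_name': 'name2', 'regNO': '13'},
    {'student_name': 'name3', 'regNO': '14'},
]

shared = {}
aliased = []  # will hold three references to the same dict
copied = []   # will hold three independent dicts
for student in students:
    shared['name'] = student['student_name']
    aliased.append(shared)
    copied.append(shared.copy())

# Number of distinct objects behind each list's elements.
print(len({id(d) for d in aliased}))
print(len({id(d) for d in copied}))
print(aliased)
print(copied)
```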
Using list comprehension would always expose the contents to be modified.
This is a classic reference issue. In other words you are appending the reference to the same dict object. Thus, when you change the name value on it, for all times it shows up in the list it will reflect its new value. Instead, create a new dict for each iteration and append that :)
Here is a simple way to do it.
n_list = [{'name':i['student_name']} for i in students]
You should reset res_dict = {} to an empty dict within your loop for each new entry.

How do I turn list values into an array with an index that matches the other dic values?

Hoping someone can help me out. I've spent the past couple hours trying to solve this, and fair warning, I'm still fairly new to python.
This is a repost of a question I recently deleted; I misrepresented my code in the previous example. The correct example is:
I have a list of dictionaries that looks similar to:
dic = [
    {
        'name': 'john',
        'items': ['pants_1', 'shirt_2','socks_3']
    },
    {
        'name': 'bob',
        'items': ['jacket_1', 'hat_1']
    }
]
I'm using .append for both 'name', and 'items', which adds the dic values into two new lists:
for x in dic:
    dic_name.append(dic['name'])
    dic_items.append(dic['items'])
I need to split the item value using '_' as the delimiter, so I've also split the values by doing:
name, items = ([i if i is None else i.split('_')[0] for i in dic_name],
               [i if i is None else i.split('_')[0] for i in chain(*dic_items)])
None is used in case there is no value. This provides me with a new list for name, items, with the delimiter used. Disregard the fact that I used '_' split for names in this example.
When I use this, the indexes for name and items no longer match. Do I need to keep each person's items in their own list so they line up with the name index, and if so, how?
Ideally, I want name[0] (which is john), to also match items[0] (as an array of the items in the list, so pants, shirt, socks). This way when I refer to index 0 for name, it also grabs all the values for items as index 0. The same thing regarding the index used for bob [1], which should match his items with the same index.
@avinash-raj, thanks for your patience, as I've had to update my question to reflect the code I'm working with more closely.
I'm reading a little bit between the lines but are you trying to just collapse the list and get rid of the field names, e.g.:
>>> dic = [{'name': 'john', 'items':['pants_1','shirt_2','socks_3']},
...        {'name': 'bob', 'items':['jacket_1','hat_1']}]
>>> data = {d['name']: dict(i.split('_') for i in d['items']) for d in dic}
>>> data
{'bob': {'hat': '1', 'jacket': '1'},
 'john': {'pants': '1', 'shirt': '2', 'socks': '3'}}
Now the data is directly related vs. indirectly related via a common index into 2 lists. If you want the dictionary split out you can always
>>> dic_name, dic_items = zip(*data.items())
>>> dic_name
('bob', 'john')
>>> dic_items
({'hat': '1', 'jacket': '1'}, {'pants': '1', 'shirt': '2', 'socks': '3'})
You need a list of dictionaries because the duplicate keys name and items are overwritten:
items = [[i.split('_')[0] for i in d['items']] for d in your_list]
names = [d['name'] for d in your_list] # then grab names from list
Alternatively, you can do this in one line with the built-in zip method and generators, like so:
names, items = zip(*((i['name'], [j.split('_')[0] for j in i['items']]) for i in dic))
From Looping Techniques in the Tutorial:
for name, item in dic.items():
    names.append(name)
    items.append(item)
That will work if your dict is structured like
{'name': [item1]}
In the loop body of
for x in dic:
    dic_name.append(dic['name'])
    dic_items.append(dic['items'])
you'll probably want to access x (to which the items in dic will be assigned in turn) rather than dic.
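With that change applied, a sketch of the loop that keeps names and items aligned by index could look like this. It assumes every entry has both keys and that each item contains an underscore-separated suffix, as in the question's sample data.

```python
dic = [
    {'name': 'john', 'items': ['pants_1', 'shirt_2', 'socks_3']},
    {'name': 'bob', 'items': ['jacket_1', 'hat_1']},
]

dic_name = []
dic_items = []
for x in dic:
    # Use the loop variable x, not the whole list dic.
    dic_name.append(x['name'])
    # Keep one inner list per person so the indexes stay aligned.
    dic_items.append([item.split('_')[0] for item in x['items']])

print(dic_name[0], dic_items[0])
print(dic_name[1], dic_items[1])
```

Because the items are not flattened with `chain`, `dic_items[i]` always belongs to `dic_name[i]`.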

Python Dictionary - List Nesting

I have a dictionary with a key whose values are stored in a list. I'm starting with the list empty so I can .append values to it, but I can't seem to do this:
>>> myDict = {'Numbers':[]}
>>> myDict['Numbers'[1].append(user_inputs)
This doesn't work and returns an error. How do I refer to the list in myDict so I can append to it?
Also, is it possible to have a dictionary inside a list, with another list inside that? If so, what is the syntax, or can you recommend another way to do this?
>>> myDict2 = {'Names': [{'first name':[],'Second name':[]}]}
Do I change the second nested list to a tuple? Please keep it to Python 2.7.
You get an error because your syntax is wrong. The following appends to the list value for the 'Numbers' key:
myDict['Numbers'].append(user_inputs)
You can nest Python objects arbitrarily; your myDict2 syntax is entirely correct. Only the keys need to be hashable (so a tuple rather than a list), but your keys are all strings:
>>> myDict2 = {'Names': [{'first name':[],'Second name':[]}]}
>>> myDict2['Names']
[{'first name': [], 'Second name': []}]
>>> myDict2['Names'][0]
{'first name': [], 'Second name': []}
>>> myDict2['Names'][0]['first name']
[]
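Appending then works by chaining the same lookups. The values below are placeholders made up for this sketch, and the code uses only syntax that runs on both Python 2.7 and Python 3.

```python
myDict = {'Numbers': []}
myDict['Numbers'].append(42)  # look up the list first, then append to it

myDict2 = {'Names': [{'first name': [], 'Second name': []}]}
# dict key -> list index -> dict key -> list method
myDict2['Names'][0]['first name'].append('Ada')
myDict2['Names'][0]['Second name'].append('Lovelace')

print(myDict)
print(myDict2)
```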
You should access the list with myDict['Numbers']:
>>> myDict['Numbers'].append(user_inputs)
You can have dicts inside of a list.
The only catch is that dictionary keys have to be hashable, so you can't use dicts or lists as keys.
You may want to look into the json library, which supports a mix of nested dictionaries and lists.
In addition, you may also be interested in the setdefault method of the dictionary class.
Format is something like:
new_dict = dict()
some_list = ['1', '2', '3', ...]
for idx, val in enumerate(some_list):
    something = get_something(idx)
    new_dict.setdefault(val, []).append(something)
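Since `get_something` is left undefined above, here is a self-contained sketch of the same `setdefault` pattern with made-up data (`pairs` and `grouped` are illustrative names). It runs on Python 2.7 as well as Python 3.

```python
pairs = [('fruit', 'apple'), ('veg', 'carrot'), ('fruit', 'pear')]

grouped = {}
for category, item in pairs:
    # setdefault returns the existing list for the key, or stores and
    # returns a new empty list the first time the key is seen.
    grouped.setdefault(category, []).append(item)

print(grouped)
```

This removes the need to initialise every key with an empty list up front.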
