Removing duplicates from a list in Python [duplicate] - python

This question already has answers here:
Removing duplicates in lists
(56 answers)
Closed 4 years ago.
How would I use python to check a list and delete all duplicates? I don't want to have to specify what the duplicate item is - I want the code to figure out if there are any and remove them if so, keeping only one instance of each. It also must work if there are multiple duplicates in a list.
For example, in my code below, the list lseparatedOrbList has 12 items - one is repeated six times, one is repeated five times, and there is only one instance of one. I want it to change the list so there are only three items - one of each, and in the same order they appeared before. I tried this:
for i in lseparatedOrbList:
for j in lseparatedOrblist:
if lseparatedOrbList[i] == lseparatedOrbList[j]:
lseparatedOrbList.remove(lseparatedOrbList[j])
But I get the error:
Traceback (most recent call last):
File "qchemOutputSearch.py", line 123, in <module>
for j in lseparatedOrblist:
NameError: name 'lseparatedOrblist' is not defined
I'm guessing because it's because I'm trying to loop through lseparatedOrbList while I loop through it, but I can't think of another way to do it.

Use set():
woduplicates = set(lseparatedOrblist)
Returns a set without duplicates. If you, for some reason, need a list back:
woduplicates = list(set(lseperatedOrblist))
This will, however, have a different order than your original list.

Just make a new list to populate, if the item for your list is not yet in the new list input it, else just move on to the next item in your original list.
for i in mylist:
if i not in newlist:
newlist.append(i)

This should be faster and will preserve the original order:
seen = {}
new_list = [seen.setdefault(x, x) for x in my_list if x not in seen]
If you don't care about order, you can just:
new_list = list(set(my_list))

You can do this like that:
x = list(set(x))
Example: if you do something like that:
x = [1,2,3,4,5,6,7,8,9,10,2,1,6,31,20]
x = list(set(x))
x
you will see the following result:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 31]
There is only one thing you should think of: the resulting list will not be ordered as the original one (will lose the order in the process).

The modern way to do it that maintains the order is:
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(lseparatedOrbList))
as discussed by Raymond Hettinger in this answer. In python 3.5 and above this is also the fastest way - see the linked answer for details. However the keys must be hashable (as is the case in your list I think)
As of python 3.7 ordered dicts are a language feature so the above call becomes
>>> list(dict.fromkeys(lseparatedOrbList))
Performance:
"""Dedup list."""
import sys
import timeit
repeat = 3
numbers = 1000
setup = """"""
def timer(statement, msg='', _setup=None):
print(msg, min(
timeit.Timer(statement, setup=_setup or setup).repeat(
repeat, numbers)))
print(sys.version)
s = """import random; n=%d; li = [random.randint(0, 100) for _ in range(n)]"""
for siz, m in ((150, "\nFew duplicates"), (15000, "\nMany duplicates")):
print(m)
setup = s % siz
timer('s = set(); [i for i in li if i not in s if not s.add(i)]', "s.add(i):")
timer('list(dict.fromkeys(li))', "dict:")
timer('list(set(li))', 'Not order preserving: list(set(li)):')
gives:
3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)]
Few duplicates
s.add(i): 0.008242200000040611
dict: 0.0037373999998635554
Not order preserving: list(set(li)): 0.0029409000001123786
Many duplicates
s.add(i): 0.2839437000000089
dict: 0.21970469999996567
Not order preserving: list(set(li)): 0.102068700000018
So dict seems consistently faster although approaching list comprehension with set.add for many duplicates - not sure if further varying the numbers would give different results. list(set) is of course faster but does not preserve original list order, a requirement here

This should do it for you:
new_list = list(set(old_list))
set will automatically remove duplicates. list will cast it back to a list.

No, it's simply a typo, the "list" at the end must be capitalized. You can nest loops over the same variable just fine (although there's rarely a good reason to).
However, there are other problems with the code. For starters, you're iterating through lists, so i and j will be items not indices. Furthermore, you can't change a collection while iterating over it (well, you "can" in that it runs, but madness lies that way - for instance, you'll propably skip over items). And then there's the complexity problem, your code is O(n^2). Either convert the list into a set and back into a list (simple, but shuffles the remaining list items) or do something like this:
seen = set()
new_x = []
for x in xs:
if x in seen:
continue
seen.add(x)
new_xs.append(x)
Both solutions require the items to be hashable. If that's not possible, you'll probably have to stick with your current approach sans the mentioned problems.

It's because you are missing a capital letter, actually.
Purposely dedented:
for i in lseparatedOrbList: # capital 'L'
for j in lseparatedOrblist: # lowercase 'l'
Though the more efficient way to do it would be to insert the contents into a set.
If maintaining the list order matters (ie, it must be "stable"), check out the answers on this question

for unhashable lists. It is faster as it does not iterate about already checked entries.
def purge_dublicates(X):
unique_X = []
for i, row in enumerate(X):
if row not in X[i + 1:]:
unique_X.append(row)
return unique_X

There is a faster way to fix this:
list = [1, 1.0, 1.41, 1.73, 2, 2, 2.0, 2.24, 3, 3, 4, 4, 4, 5, 6, 6, 8, 8, 9, 10]
list2=[]
for value in list:
try:
list2.index(value)
except:
list2.append(value)
list.clear()
for value in list2:
list.append(value)
list2.clear()
print(list)
print(list2)

In this way one can delete a particular item which is present multiple times in a list : Try deleting all 5
list1=[1,2,3,4,5,6,5,3,5,7,11,5,9,8,121,98,67,34,5,21]
print list1
n=input("item to be deleted : " )
for i in list1:
if n in list1:
list1.remove(n)
print list1

Related

"group" items in list

I have
[string, int,int],[string, int,int]... kind of list that I want to group with a different list
[string1, int1,int1] + [string2, int2,int2] = ["string+string2", int1+int1+int2+int2]
the History goes like I have already made import function that gets me compounds:
ex[Ch3, 15.3107,15.284] kinda like this...
I have a function that gives me:
dictionary{0:"CH3"}
and another that gives me:
List ["CH3",30.594700000000003]
def group_selectec_compounds(numCompound,values)
values can be list of list that have everything
I also have dic made that is something like this {0:["CH4"],...}
numCoumpound should be various variables (I think) or tuple of keys? So I can do the math for the user.
In the end I want something like: ["CH3+CH4",61.573]
it can also be: ["CH3+CH4+H2SO4",138.773]
I would solve this using '+'.join, sum and comprehensions:
>>> data = [['string1', 2, 3], ['string2', 4, 5], ['string3', 6, 7]]
>>> ['+'.join(s for s, _, _ in data), sum(x + y for _, x, y in data)]
['string1+string2+string3', 27]
First, create a dictionary that stores the location of the type:
helper = {}
for elem in lst1:
elemType = str(type(elem))
if elemType not in helper:
helper[elemType] = lst1.index[elem]
Now you have a dictionary that indexes your original list by type, you just have to run the second list and append accordingly:
for elem in lst2:
elemType = str(type(elem))
if elemType not in helper:
#in case list 2 contains a type that list 1 doesn't have
lst1.append(elem)
helper[elemType] = lst1.index[elem]
else:
lst1[helper[elemType]] += elem
Hope this makes sense! I have not vetted this for correctness but the idea is there.
Edit: This also does not solve the issue of list 1 having more than 1 string or more than 1 int, etc., but to solve that should be trivial depending on how you wish to resolve that issue.
2nd Edit: This answer is generic, so it doesn't matter how you order the strings and ints in the list, in fact, lst1 can be [string, int, double] and lst2 can be [int, double, string] and this would still work, so it is robust in case the order of your list changes

Organize/Formatting the python code to one line

Is there any way to rewrite the below python code in one line
for i in range(len(main_list)):
if main_list[i] != []:
for j in range(len(main_list[i])):
main_list[i][j][6]=main_list[i][j][6].strftime('%Y-%m-%d')
something like below,
[main_list[i][j][6]=main_list[i][j][6].strftime('%Y-%m-%d') for i in range(len(main_list)) if main_list[i] != [] for j in range(len(main_list[i]))]
I got SyntaxError for this.
Actually, i'm trying to storing all the values fetched from table into one list. Since the table contains date method/datatype, my requirement needs to convert it to string as i faced with malformed string error.
So my approach is to convert that element of list from datetime.date() to str. And i got it working. Just wanted it to work with one line
Use the explicit for loop. There's no better option.
A list comprehension is used to create a new list, not to modify certain elements of an existing list.
You may be able to update values via a list comprehension, e.g. [L.__setitem__(i, 'some_value') for i in range(len(L))], but this is not recommended as you are using a side-effect and in the process creating a list of None values which you then discard.
You could also write a convoluted list comprehension with a ternary statement indicating when you meet the 6th element in a 3rd nested sublist. But this will make your code difficult to maintain.
In short, use the for loop.
You're getting a syntax error because you're not allowed to perform assignments within a list comprehension. Python forbids assignments because it is discouraging over complex list comprehensions in favour of for loops.
Obviously you shouldn't do this on one line, but this is how to do it:
import datetime
# Example from your comment:
type1 = "some type"
main_list = [[], [],
[[1, 2, 3, datetime.date(2016, 8, 18), type1],
[3, 4, 5, datetime.date(2016, 8, 18), type1]], [], []]
def fmt_times(lst):
"""Format the fourth value of each element of each non-empty sublist"""
for i in range(len(lst)):
if lst[i] != []:
for j in range(len(lst[i])):
lst[i][j][3] = lst[i][j][3].strftime('%Y-%m-%d')
return lst
def fmt_times_one_line(main_list):
"""Format the fourth value of each element of each non-empty sublist"""
return [[] if main_list[i] == [] else [[main_list[i][j][k] if k != 3 else main_list[i][j][k].strftime('%Y-%m-%d') for k in range(len(main_list[i][j]))] for j in range(len(main_list[i])) ] for i in range(len(main_list))]
import copy
# Deep copy needed because fmt_times modifies the sublists.
assert fmt_times(copy.deepcopy(main_list)) == fmt_times_one_line(main_list)
The list comprehension is a functional thing. If you know how map() works in python or javascript then it's the same thing. In a map() or comprehension we generally don't mutate the data we're mapping over (and python discourages attempting it) so instead we recreate the entire object, substituting only the values we wanted to modify.
One line?
main_list = convert_list(main_list)
You will have to put a few more lines somewhere else though:
def convert_list(main_list):
for i, ml in enumerate(main_list):
if isinstance(ml, list) and len(ml) > 0:
main_list[i] = convert_list(ml)
elif isinstance(ml, datetime.date):
main_list[i] = ml.strftime('%Y-%m-%d')
return main_list
You might be able to whack this together with a list comprehension but it's a terrible idea (for reasons better explained in the other answer).

How to count number of unique lists within list?

I've tried using Counter and itertools, but since a list is unhasable, they don't work.
My data looks like this: [ [1,2,3], [2,3,4], [1,2,3] ]
I would like to know that the list [1,2,3] appears twice, but I cant figure out how to do this. I was thinking of just converting each list to a tuple, then hashing with that. Is there a better way?
>>> from collections import Counter
>>> li=[ [1,2,3], [2,3,4], [1,2,3] ]
>>> Counter(str(e) for e in li)
Counter({'[1, 2, 3]': 2, '[2, 3, 4]': 1})
The method that you state also works as long as there are not nested mutables in each sublist (such as [ [1,2,3], [2,3,4,[11,12]], [1,2,3] ]:
>>> Counter(tuple(e) for e in li)
Counter({(1, 2, 3): 2, (2, 3, 4): 1})
If you do have other unhasable types nested in the sub lists lists, use the str or repr method since that deals with all sub lists as well. Or recursively convert all to tuples (more work).
ll = [ [1,2,3], [2,3,4], [1,2,3] ]
print(len(set(map(tuple, ll))))
Also, if you wanted to count the occurences of a unique* list:
print(ll.count([1,2,3]))
*value unique, not reference unique)
I think, using the Counter class on tuples like
Counter(tuple(item) for item in li)
Will be optimal in terms of elegance and "pythoniticity": It's probably the shortest solution, it's perfectly clear what you want to achieve and how it's done, and it uses resp. combines standard methods (and thus avoids reinventing the wheel).
The only performance drawback I can see is, that every element has to be converted to a tuple (in order to be hashable), which more or less means that all elements of all sublists have to be copied once. Also the internal hash function on tuples may be suboptimal if you know that list elements will e.g. always be integers.
In order to improve on performance, you would have to
Implement some kind of hash algorithm working directly on lists (more or less reimplementing the hashing of tuples but for lists)
Somehow reimplement the Counter class in order to use this hash algorithm and provide some suitable output (this class would probably use a dictionary using the hash values as key and a combination of the "original" list and the count as value)
At least the first step would need to be done in C/C++ in order to match the speed of the internal hash function. If you know the type of the list elements you could probably even improve the performance.
As for the Counter class I do not know if it's standard implementation is in Python or in C, if the latter is the case you'll probably also have to reimplement it in C in order to achieve the same (or better) performance.
So the question "Is there a better solution" cannot be answered (as always) without knowing your specific requirements.
list = [ [1,2,3], [2,3,4], [1,2,3] ]
repeats = []
unique = 0
for i in list:
count = 0;
if i not in repeats:
for i2 in list:
if i == i2:
count += 1
if count > 1:
repeats.append(i)
elif count == 1:
unique += 1
print "Repeated Items"
for r in repeats:
print r,
print "\nUnique items:", unique
loops through the list to find repeated sequences, while skipping items if they have already been detected as repeats, and adds them into the repeats list, while counting the number of unique lists.

How to iterate a list while deleting items from list using range() function? [duplicate]

This question already has answers here:
Strange result when removing item from a list while iterating over it
(8 answers)
Closed 7 years ago.
This is the most common problem I face while trying to learn programming in python. The problem is, when I try to iterate a list using "range()" function to check if given item in list meets given condition and if yes then to delete it, it will always give "IndexError". So, is there a particular way to do this without using any other intermediate list or "while" statement? Below is an example:
l = range(20)
for i in range(0,len(l)):
if l[i] == something:
l.pop(i)
First of all, you never want to iterate over things like that in Python. Iterate over the actual objects, not the indices:
l = range(20)
for i in l:
...
The reason for your error was that you were removing an item, so the later indices cease to exist.
Now, you can't modify a list while you are looping over it, but that isn't a problem. The better solution is to use a list comprehension here, to filter out the extra items.
l = range(20)
new_l = [i for i in l if not i == something]
You can also use the filter() builtin, although that tends to be unclear in most situations (and slower where you need lambda).
Also note that in Python 3.x, range() produces a generator, not a list.
It would also be a good idea to use more descriptive variable names - I'll presume here it's for example, but names like i and l are hard to read and make it easier to introduce bugs.
Edit:
If you wish to update the existing list in place, as pointed out in the comments, you can use the slicing syntax to replace each item of the list in turn (l[:] = new_l). That said, I would argue that that case is pretty bad design. You don't want one segment of code to rely on data being updated from another bit of code in that way.
Edit 2:
If, for any reason, you need the indices as you loop over the items, that's what the enumerate() builtin is for.
You can always do this sort of thing with a list comprehension:
newlist=[i for i in oldlist if not condition ]
As others have said, iterate over the list and create a new list with just the items you want to keep.
Use a slice assignment to update the original list in-place.
l[:] = [item for item in l if item != something]
You should look the problem from the other side: add an element to a list when it is equal with "something". with list comprehension:
l = [i for i in xrange(20) if i != something]
you should not use for i in range(0,len(l)):, use for i, item in enumerate(l): instead if you need the index, for item in l: if not
you should not manipulate a structure you are iterating over. when faced to do so, iterate over a copy instead
don't name a variable l (may be mistaken as 1 or I)
if you want to filter a list, do so explicitly. use filter() or list comprehensions
BTW, in your case, you could also do:
while something in list_: list_.remove(something)
That's not very efficient, though. But depending on context, it might be more readable.
The reason you're getting an IndexError is because you're changing the length of the list as you iterate in the for-loop. Basically, here's the logic...
#-- Build the original list: [0, 1, 2, ..., 19]
l = range(20)
#-- Here, the range function builds ANOTHER list, in this case also [0, 1, 2, ..., 19]
#-- the variable "i" will be bound to each element of this list, so i = 0 (loop), then i = 1 (loop), i = 2, etc.
for i in range(0,len(l)):
if i == something:
#-- So, when i is equivalent to something, you "pop" the list, l.
#-- the length of l is now *19* elements, NOT 20 (you just removed one)
l.pop(i)
#-- So...when the list has been shortened to 19 elements...
#-- we're still iterating, i = 17 (loop), i = 18 (loop), i = 19 *CRASH*
#-- There is no 19th element of l, as l (after you popped out an element) only
#-- has indices 0, ..., 18, now.
NOTE also, that you're making the "pop" decision based on the index of the list, not what's in the indexed cell of the list. This is unusual -- was that your intention? Or did you
mean something more like...
if l[i] == something:
l.pop(i)
Now, in your specific example, (l[i] == i) but this is not a typical pattern.
Rather than iterating over the list, try the filter function. It's a built-in (like a lot of other list processing functions: e.g. map, sort, reverse, zip, etc.)
Try this...
#-- Create a function for testing the elements of the list.
def f(x):
if (x == SOMETHING):
return False
else:
return True
#-- Create the original list.
l = range(20)
#-- Apply the function f to each element of l.
#-- Where f(l[i]) is True, the element l[i] is kept and will be in the new list, m.
#-- Where f(l[i]) is False, the element l[i] is passed over and will NOT appear in m.
m = filter(f, l)
List processing functions go hand-in-hand with "lambda" functions - which, in Python, are brief, anonymous functions. so, we can re-write the above code as...
#-- Create the original list.
l = range(20)
#-- Apply the function f to each element of l.
#-- Where lambda is True, the element l[i] is kept and will be in the new list, m.
#-- Where lambda is False, the element l[i] is passed over and will NOT appear in m.
m = filter(lambda x: (x != SOMETHING), l)
Give it a go and see it how it works!

How to remove all duplicate items from a list [duplicate]

This question already has answers here:
Removing duplicates in lists
(56 answers)
Closed 4 years ago.
How would I use python to check a list and delete all duplicates? I don't want to have to specify what the duplicate item is - I want the code to figure out if there are any and remove them if so, keeping only one instance of each. It also must work if there are multiple duplicates in a list.
For example, in my code below, the list lseparatedOrbList has 12 items - one is repeated six times, one is repeated five times, and there is only one instance of one. I want it to change the list so there are only three items - one of each, and in the same order they appeared before. I tried this:
for i in lseparatedOrbList:
for j in lseparatedOrblist:
if lseparatedOrbList[i] == lseparatedOrbList[j]:
lseparatedOrbList.remove(lseparatedOrbList[j])
But I get the error:
Traceback (most recent call last):
File "qchemOutputSearch.py", line 123, in <module>
for j in lseparatedOrblist:
NameError: name 'lseparatedOrblist' is not defined
I'm guessing because it's because I'm trying to loop through lseparatedOrbList while I loop through it, but I can't think of another way to do it.
Use set():
woduplicates = set(lseparatedOrblist)
Returns a set without duplicates. If you, for some reason, need a list back:
woduplicates = list(set(lseperatedOrblist))
This will, however, have a different order than your original list.
Just make a new list to populate, if the item for your list is not yet in the new list input it, else just move on to the next item in your original list.
for i in mylist:
if i not in newlist:
newlist.append(i)
This should be faster and will preserve the original order:
seen = {}
new_list = [seen.setdefault(x, x) for x in my_list if x not in seen]
If you don't care about order, you can just:
new_list = list(set(my_list))
You can do this like that:
x = list(set(x))
Example: if you do something like that:
x = [1,2,3,4,5,6,7,8,9,10,2,1,6,31,20]
x = list(set(x))
x
you will see the following result:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 31]
There is only one thing you should think of: the resulting list will not be ordered as the original one (will lose the order in the process).
The modern way to do it that maintains the order is:
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys(lseparatedOrbList))
as discussed by Raymond Hettinger in this answer. In python 3.5 and above this is also the fastest way - see the linked answer for details. However the keys must be hashable (as is the case in your list I think)
As of python 3.7 ordered dicts are a language feature so the above call becomes
>>> list(dict.fromkeys(lseparatedOrbList))
Performance:
"""Dedup list."""
import sys
import timeit
repeat = 3
numbers = 1000
setup = """"""
def timer(statement, msg='', _setup=None):
print(msg, min(
timeit.Timer(statement, setup=_setup or setup).repeat(
repeat, numbers)))
print(sys.version)
s = """import random; n=%d; li = [random.randint(0, 100) for _ in range(n)]"""
for siz, m in ((150, "\nFew duplicates"), (15000, "\nMany duplicates")):
print(m)
setup = s % siz
timer('s = set(); [i for i in li if i not in s if not s.add(i)]', "s.add(i):")
timer('list(dict.fromkeys(li))', "dict:")
timer('list(set(li))', 'Not order preserving: list(set(li)):')
gives:
3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)]
Few duplicates
s.add(i): 0.008242200000040611
dict: 0.0037373999998635554
Not order preserving: list(set(li)): 0.0029409000001123786
Many duplicates
s.add(i): 0.2839437000000089
dict: 0.21970469999996567
Not order preserving: list(set(li)): 0.102068700000018
So dict seems consistently faster although approaching list comprehension with set.add for many duplicates - not sure if further varying the numbers would give different results. list(set) is of course faster but does not preserve original list order, a requirement here
This should do it for you:
new_list = list(set(old_list))
set will automatically remove duplicates. list will cast it back to a list.
No, it's simply a typo, the "list" at the end must be capitalized. You can nest loops over the same variable just fine (although there's rarely a good reason to).
However, there are other problems with the code. For starters, you're iterating through lists, so i and j will be items not indices. Furthermore, you can't change a collection while iterating over it (well, you "can" in that it runs, but madness lies that way - for instance, you'll propably skip over items). And then there's the complexity problem, your code is O(n^2). Either convert the list into a set and back into a list (simple, but shuffles the remaining list items) or do something like this:
seen = set()
new_x = []
for x in xs:
if x in seen:
continue
seen.add(x)
new_xs.append(x)
Both solutions require the items to be hashable. If that's not possible, you'll probably have to stick with your current approach sans the mentioned problems.
It's because you are missing a capital letter, actually.
Purposely dedented:
for i in lseparatedOrbList: # capital 'L'
for j in lseparatedOrblist: # lowercase 'l'
Though the more efficient way to do it would be to insert the contents into a set.
If maintaining the list order matters (ie, it must be "stable"), check out the answers on this question
for unhashable lists. It is faster as it does not iterate about already checked entries.
def purge_dublicates(X):
unique_X = []
for i, row in enumerate(X):
if row not in X[i + 1:]:
unique_X.append(row)
return unique_X
There is a faster way to fix this:
list = [1, 1.0, 1.41, 1.73, 2, 2, 2.0, 2.24, 3, 3, 4, 4, 4, 5, 6, 6, 8, 8, 9, 10]
list2=[]
for value in list:
try:
list2.index(value)
except:
list2.append(value)
list.clear()
for value in list2:
list.append(value)
list2.clear()
print(list)
print(list2)
In this way one can delete a particular item which is present multiple times in a list : Try deleting all 5
list1=[1,2,3,4,5,6,5,3,5,7,11,5,9,8,121,98,67,34,5,21]
print list1
n=input("item to be deleted : " )
for i in list1:
if n in list1:
list1.remove(n)
print list1

Categories

Resources