I have a tuple called values which contains the following:
('275', '54000', '0.0', '5000.0', '0.0')
I want to change the first value (i.e., 275) in this tuple but I understand that tuples are immutable so values[0] = 200 will not work. How can I achieve this?
It's possible via:
t = ('275', '54000', '0.0', '5000.0', '0.0')
lst = list(t)
lst[0] = '300'
t = tuple(lst)
But if you're going to need to change things, you probably are better off keeping it as a list
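For example, a minimal sketch of just keeping the data as a list (the record name at the end is purely illustrative):
values = ['275', '54000', '0.0', '5000.0', '0.0']  # keep it as a list
values[0] = '200'                                  # in-place assignment works on lists
record = tuple(values)                             # convert only if a tuple is really required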
Depending on your problem slicing can be a really neat solution:
>>> b = (1, 2, 3, 4, 5)
>>> b[:2] + (8,9) + b[3:]
(1, 2, 8, 9, 4, 5)
>>> b[:2] + (8,) + b[3:]
(1, 2, 8, 4, 5)
This lets you add multiple elements or replace a few elements (especially if they are "neighbours"). In the above case, converting to a list is probably more appropriate and readable, even though the slicing notation is much shorter.
Well, as Trufa has already shown, there are basically two ways of replacing a tuple's element at a given index. Either convert the tuple to a list, replace the element and convert back, or construct a new tuple by concatenation.
In [1]: def replace_at_index1(tup, ix, val):
   ...:     lst = list(tup)
   ...:     lst[ix] = val
   ...:     return tuple(lst)
   ...:

In [2]: def replace_at_index2(tup, ix, val):
   ...:     return tup[:ix] + (val,) + tup[ix+1:]
   ...:
So, which method is better, that is, faster?
It turns out that for short tuples (on Python 3.3), concatenation is actually faster!
In [3]: d = tuple(range(10))
In [4]: %timeit replace_at_index1(d, 5, 99)
1000000 loops, best of 3: 872 ns per loop
In [5]: %timeit replace_at_index2(d, 5, 99)
1000000 loops, best of 3: 642 ns per loop
Yet if we look at longer tuples, list conversion is the way to go:
In [6]: k = tuple(range(1000))
In [7]: %timeit replace_at_index1(k, 500, 99)
100000 loops, best of 3: 9.08 µs per loop
In [8]: %timeit replace_at_index2(k, 500, 99)
100000 loops, best of 3: 10.1 µs per loop
For very long tuples, list conversion is substantially better!
In [9]: m = tuple(range(1000000))
In [10]: %timeit replace_at_index1(m, 500000, 99)
10 loops, best of 3: 26.6 ms per loop
In [11]: %timeit replace_at_index2(m, 500000, 99)
10 loops, best of 3: 35.9 ms per loop
Also, performance of the concatenation method depends on the index at which we replace the element. For the list method, the index is irrelevant.
In [12]: %timeit replace_at_index1(m, 900000, 99)
10 loops, best of 3: 26.6 ms per loop
In [13]: %timeit replace_at_index2(m, 900000, 99)
10 loops, best of 3: 49.2 ms per loop
So: If your tuple is short, slice and concatenate.
If it's long, do the list conversion!
It is possible with a one liner:
values = ('275', '54000', '0.0', '5000.0', '0.0')
values = ('300', *values[1:])
I believe this technically answers the question, but don't do this at home. At the moment, all answers involve creating a new tuple, but you can use ctypes to modify a tuple in-memory. Relying on various implementation details of CPython on a 64-bit system, one way to do this is as follows:
import ctypes

def modify_tuple(t, idx, new_value):
    # `id` happens to give the memory address in CPython; you may
    # want to use `ctypes.addressof` instead.
    element_ptr = (ctypes.c_longlong).from_address(id(t) + (3 + idx)*8)
    element_ptr.value = id(new_value)
    # Manually increment the reference count to `new_value` to pretend that
    # this is not a terrible idea.
    ref_count = (ctypes.c_longlong).from_address(id(new_value))
    ref_count.value += 1

t = (10, 20, 30)
modify_tuple(t, 1, 50)   # t is now (10, 50, 30)
modify_tuple(t, -1, 50)  # Will probably crash your Python runtime
As Hunter McMillen mentioned, tuples are immutable, you need to create a new tuple in order to achieve this. For instance:
>>> tpl = ('275', '54000', '0.0', '5000.0', '0.0')
>>> change_value = 200
>>> tpl = (change_value,) + tpl[1:]
>>> tpl
(200, '54000', '0.0', '5000.0', '0.0')
Not that this is superior, but if anyone is curious it can be done on one line with:
values = tuple([200 if i == 0 else v for i, v in enumerate(values)])
(The variable is called values rather than tuple here so that the built-in tuple isn't shadowed.)
You can't modify items in a tuple, but you can modify mutable objects contained in a tuple (for example, if those objects are lists or instances of a class).
For example
my_list = [1,2]
tuple_of_lists = (my_list,'hello')
print(tuple_of_lists) # ([1, 2], 'hello')
my_list[0] = 0
print(tuple_of_lists) # ([0, 2], 'hello')
EDIT: This doesn't work on tuples with duplicate entries yet!!
Based on Pooya's idea:
If you are planning on doing this often (which you shouldn't, since tuples are immutable for a reason), you should do something like this:
def modTupByIndex(tup, index, ins):
    return tuple(tup[0:index]) + (ins,) + tuple(tup[index+1:])

print(modTupByIndex((1, 2, 3), 2, "a"))
Or based on Jon's idea:
def modTupByIndex(tup, index, ins):
    lst = list(tup)
    lst[index] = ins
    return tuple(lst)

print(modTupByIndex((1, 2, 3), 1, "a"))
Based on Jon's and Trufa's ideas:
def modifyTuple(tup, oldval, newval):
    lst = list(tup)
    for i in range(tup.count(oldval)):
        index = lst.index(oldval)
        lst[index] = newval
    return tuple(lst)

print(modifyTuple((1, 1, 3), 1, "a"))
It changes all occurrences of the old value.
You can't. If you want to change it, you need to use a list instead of a tuple.
Note that you could instead make a new tuple that has the new value as its first element.
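For instance, a minimal sketch of that approach:
values = ('275', '54000', '0.0', '5000.0', '0.0')
values = ('200',) + values[1:]  # build a fresh tuple and rebind the name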
First, ask yourself why you want to mutate your tuple. There is a reason why strings and tuples are immutable in Python; if you want to mutate your tuple, it should probably be a list instead.
Second, if you still wish to mutate your tuple, you can convert it to a list, change it, convert it back, and reassign the new tuple to the same variable. This is fine if you are only going to mutate the tuple once. Otherwise I find it counterintuitive: it essentially creates a new tuple, and every time you wish to mutate the tuple you have to perform the conversion again. Also, reading the code raises the question of why you didn't just use a list. Still, it is nice because it doesn't require any library.
I suggest using mutabletuple(typename, field_names, default=MtNoDefault) from mutabletuple 0.2. I personally find this more intuitive and readable: the person reading the code knows that the writer intends to mutate this tuple in the future. The downside compared to the list-conversion method above is that it requires importing an additional package.
from mutabletuple import mutabletuple
myTuple = mutabletuple('myTuple', 'v w x y z')
p = myTuple('275', '54000', '0.0', '5000.0', '0.0')
print(p.v) #print 275
p.v = '200' #mutate myTuple
print(p.v) #print 200
TL;DR: Don't try to mutate a tuple. If you do, and it is a one-time operation, convert the tuple to a list, mutate it, turn the list into a new tuple, and reassign it to the variable holding the old tuple. If you want a tuple, somehow want to avoid a list, and want to mutate more than once, then use mutabletuple.
I've found the best way to edit tuples is to recreate the tuple using the previous version as the base.
Here's an example I used for making a lighter version of a colour (I had it open already at the time):
colour = tuple([c+50 for c in colour])
What it does is go through the tuple 'colour', read each item, do something to it, and finally add it to the new tuple.
So what you'd want would be something like:
values = ('275', '54000', '0.0', '5000.0', '0.0')
values = tuple('200' if i == 0 else v for i, v in enumerate(values))
Or in pseudocode:
tuple = (0, 1, 2)
tuple = iterate through tuple, alter each item as needed
that's the concept.
I'm late to the game, but I think the simplest, most resource-friendly and fastest way (depending on the situation)
is to overwrite the tuple itself, since this removes the need for creating a list and extra variables, and is achieved in one line.
new = 24
t = (1, 2, 3)
t = (t[0],t[1],new)
(1, 2, 24)
But: this is only handy for rather small tuples, and it also ties you to a fixed tuple length; nevertheless, that is the case for tuples most of the time anyway.
So in this particular case it would look like this:
new = '200'
t = ('275', '54000', '0.0', '5000.0', '0.0')
t = (new, t[1], t[2], t[3], t[4])
('200', '54000', '0.0', '5000.0', '0.0')
If you want to do this, you probably don't want to toss a bunch of odd helper functions all over the place and call attention to the fact that you're changing values in something specifically designed not to allow that. Also, we can go ahead and assume you're not worried about efficiency.
t = tuple([new_value if p == old_value else p for p in t])
I did this:
lst = [1, 2, 3, 4, 5]
tup = (lst,)
and to change the contents, just do
lst[0] = 6
Note that (lst) without the trailing comma is not a tuple at all (it's just the list), and even with the comma you are mutating the list inside the tuple, not the tuple itself. Here it is from IDLE:
>>> lst = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> tup = (lst,)
>>> print(tup)
([1, 2, 3, 4, 5, 6, 7, 8, 9],)
>>> lst[0] = 6
>>> print(tup)
([6, 2, 3, 4, 5, 6, 7, 8, 9],)
Note that this does not actually change a tuple: the snippet below uses lists, and the assignment only creates a second name for the same list (not a copy).
>>> lst1 = [20, 30, 40]
>>> lst2 = lst1
>>> lst2
[20, 30, 40]
>>> lst2[1] = 10
>>> print(lst2)
[20, 10, 40]
>>> print(lst1)
[20, 10, 40]
With a real tuple, the item assignment lst2[1] = 10 would raise a TypeError.
How do I remove duplicates from a list, while preserving order? Using a set to remove duplicates destroys the original order.
Is there a built-in or a Pythonic idiom?
Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark
Fastest one:
def f7(seq):
seen = set()
seen_add = seen.add
return [x for x in seq if not (x in seen or seen_add(x))]
Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.
If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/
O(1) insertion, deletion and member-check per operation.
(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
The best solution varies by Python version and environment constraints:
Python 3.7+ (and most interpreters supporting 3.6, as an implementation detail):
First introduced in PyPy 2.5.0, and adopted in CPython 3.6 as an implementation detail, before being made a language guarantee in Python 3.7, plain dict is insertion-ordered, and even more efficient than the (also C implemented as of CPython 3.5) collections.OrderedDict. So the fastest solution, by far, is also the simplest:
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(dict.fromkeys(items)) # Or [*dict.fromkeys(items)] if you prefer
[1, 2, 0, 3]
Like list(set(items)) this pushes all the work to the C layer (on CPython), but since dicts are insertion ordered, dict.fromkeys doesn't lose ordering. It's slower than list(set(items)) (takes 50-100% longer typically), but much faster than any other order-preserving solution (takes about half the time of hacks involving use of sets in a listcomp).
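A rough way to check those relative timings yourself (a sketch using only the standard library; the test data is arbitrary and absolute numbers depend on machine and Python version):
import timeit

items = list(range(1000)) * 2  # sample data with duplicates
print(timeit.timeit(lambda: list(set(items)), number=10_000))
print(timeit.timeit(lambda: list(dict.fromkeys(items)), number=10_000))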
Important note: The unique_everseen solution from more_itertools (see below) has some unique advantages in terms of laziness and support for non-hashable input items; if you need these features, it's the only solution that will work.
Python 3.5 (and all older versions if performance isn't critical)
As Raymond pointed out, in CPython 3.5 where OrderedDict is implemented in C, ugly list comprehension hacks are slower than OrderedDict.fromkeys (unless you actually need the list at the end - and even then, only if the input is very short). So on both performance and readability the best solution for CPython 3.5 is the OrderedDict equivalent of the 3.6+ use of plain dict:
>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]
On CPython 3.4 and earlier, this will be slower than some other solutions, so if profiling shows you need a better solution, keep reading.
Python 3.4 and earlier, if performance is critical and third-party modules are acceptable
As @abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add) mutations in list comprehensions. This is the fastest solution too:
>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]
Just one simple library import and no hacks.
The module adapts the itertools recipe unique_everseen, which looks like:
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element
but unlike the itertools recipe, it supports non-hashable items (at a performance cost; if all elements in iterable are non-hashable, the algorithm becomes O(n²), vs. O(n) if they're all hashable).
Important note: Unlike all the other solutions here, unique_everseen can be used lazily; the peak memory usage will be the same (eventually, the underlying set grows to the same size), but if you don't listify the result, you just iterate it, you'll be able to process unique items as they're found, rather than waiting until the entire input has been deduplicated before processing the first unique item.
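For instance, a small sketch of that lazy usage (process_item is a hypothetical stand-in for whatever you do with each unique value):
from more_itertools import unique_everseen

def process_item(item):
    print("handling", item)  # placeholder for real per-item work

for item in unique_everseen([1, 2, 0, 1, 3, 2]):
    process_item(item)  # runs as soon as each new unique item is seen; no list is built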
Python 3.4 and earlier, if performance is critical and third party modules are unavailable
You have two options:
Copy and paste in the unique_everseen recipe to your code and use it per the more_itertools example above
Use ugly hacks to allow a single listcomp to both check and update a set to track what's been seen:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
at the expense of relying on the ugly hack:
not seen.add(x)
which relies on the fact that set.add is an in-place method that always returns None so not None evaluates to True.
Note that all of the solutions above are O(n) (save calling unique_everseen on an iterable of non-hashable items, which is O(n²), while the others would fail immediately with a TypeError), so all solutions are performant enough when they're not the hottest code path. Which one to use depends on which versions of the language spec/interpreter/third-party modules you can rely on, whether or not performance is critical (don't assume it is; it usually isn't), and most importantly, readability (because if the person who maintains this code later ends up in a murderous mood, your clever micro-optimization probably wasn't worth it).
In CPython 3.6+ (and all other Python implementations starting with Python 3.7+), dictionaries are ordered, so the way to remove duplicates from an iterable while keeping it in the original order is:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.5 and below (including Python 2.7), use the OrderedDict. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5 (when it gained a C implementation; prior to 3.5 it's still the clearest solution, though not the fastest).
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
Not to kick a dead horse (this question is very old and already has lots of good answers), but here is a solution using pandas that is quite fast in many circumstances and is dead simple to use.
import pandas as pd
my_list = [0, 1, 2, 3, 4, 1, 2, 3, 5]
>>> pd.Series(my_list).drop_duplicates().tolist()
# Output:
# [0, 1, 2, 3, 4, 5]
In Python 3.7 and above, dictionaries are guaranteed to remember their key insertion order. The answer to this question summarizes the current state of affairs.
The OrderedDict solution thus becomes obsolete and without any import statements we can simply issue:
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> list(dict.fromkeys(lst))
[1, 2, 3, 4]
sequence = ['1', '2', '3', '3', '6', '4', '5', '6']
unique = []
[unique.append(item) for item in sequence if item not in unique]
unique → ['1', '2', '3', '6', '4', '5']
from itertools import groupby
[ key for key,_ in groupby(sortedList)]
The list doesn't even have to be sorted, the sufficient condition is that equal values are grouped together.
Edit: I assumed that "preserving order" implies that the list is actually ordered. If this is not the case, then the solution from MizardX is the right one.
Community edit: This is however the most elegant way to "compress duplicate consecutive elements into a single element".
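A small illustration of that point (equal values adjacent but unsorted, versus scattered):
from itertools import groupby

grouped = [3, 3, 1, 1, 2, 2]                  # not sorted, but equal values are adjacent
print([key for key, _ in groupby(grouped)])   # [3, 1, 2]

scattered = [3, 1, 3]                         # equal values are not adjacent
print([key for key, _ in groupby(scattered)]) # [3, 1, 3]: only consecutive duplicates collapse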
I think if you want to maintain the order,
you can try this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = list(set(list1))
list2.sort(key=list1.index)
print(list2)
Or similarly you can do this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = sorted(set(list1), key=list1.index)
print(list2)
You can also do this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = []
for i in list1:
    if i not in list2:
        list2.append(i)
print(list2)
It can also be written as this:
list1 = ['b', 'c', 'd', 'b', 'c', 'a', 'a']
list2 = []
[list2.append(i) for i in list1 if i not in list2]
print(list2)
Just to add another (very performant) implementation of such a functionality from an external module [1]: iteration_utilities.unique_everseen:
>>> from iteration_utilities import unique_everseen
>>> lst = [1,1,1,2,3,2,2,2,1,3,4]
>>> list(unique_everseen(lst))
[1, 2, 3, 4]
Timings
I did some timings (Python 3.6) and these show that it's faster than all other alternatives I tested, including OrderedDict.fromkeys, f7 and more_itertools.unique_everseen:
%matplotlib notebook
from iteration_utilities import unique_everseen
from collections import OrderedDict
from more_itertools import unique_everseen as mi_unique_everseen
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def iteration_utilities_unique_everseen(seq):
    return list(unique_everseen(seq))

def more_itertools_unique_everseen(seq):
    return list(mi_unique_everseen(seq))

def odict(seq):
    return list(OrderedDict.fromkeys(seq))
from simple_benchmark import benchmark
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
{2**i: list(range(2**i)) for i in range(1, 20)},
'list size (no duplicates)')
b.plot()
And just to make sure I also did a test with more duplicates just to check if it makes a difference:
import random
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
{2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(1, 20)},
'list size (lots of duplicates)')
b.plot()
And one containing only one value:
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
{2**i: [1]*(2**i) for i in range(1, 20)},
'list size (only duplicates)')
b.plot()
In all of these cases the iteration_utilities.unique_everseen function is the fastest (on my computer).
This iteration_utilities.unique_everseen function can also handle unhashable values in the input (however with an O(n*n) performance instead of the O(n) performance when the values are hashable).
>>> lst = [{1}, {1}, {2}, {1}, {3}]
>>> list(unique_everseen(lst))
[{1}, {2}, {3}]
[1] Disclaimer: I'm the author of that package.
For another very late answer to another very old question:
The itertools recipes have a function that does this, using the seen set technique, but:
Handles a standard key function.
Uses no unseemly hacks.
Optimizes the loop by pre-binding seen.add instead of looking it up N times. (f7 also does this, but some versions don't.)
Optimizes the loop by using ifilterfalse, so you only have to loop over the unique elements in Python, instead of all of them. (You still iterate over all of them inside ifilterfalse, of course, but that's in C, and much faster.)
Is it actually faster than f7? It depends on your data, so you'll have to test it and see. If you want a list in the end, f7 uses a listcomp, and there's no way to do that here. (You can directly append instead of yielding, or you can feed the generator into the list function, but neither one can be as fast as the LIST_APPEND inside a listcomp.) At any rate, usually, squeezing out a few microseconds is not going to be as important as having an easily-understandable, reusable, already-written function that doesn't require DSU when you want to decorate.
As with all of the recipes, it's also available in more-itertools.
If you just want the no-key case, you can simplify it as:
import itertools  # Python 2's ifilterfalse; in Python 3 use itertools.filterfalse

def unique(iterable):
    seen = set()
    seen_add = seen.add
    for element in itertools.ifilterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element
For non-hashable types (e.g. list of lists), based on MizardX's:
def f7_noHash(seq):
    seen = set()
    return [x for x in seq if str(x) not in seen and not seen.add(str(x))]
pandas users should check out pandas.unique.
>>> import pandas as pd
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> pd.unique(lst)
array([1, 2, 3, 4])
The function returns a NumPy array. If needed, you can convert it to a list with the tolist method.
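For example, continuing the snippet above:
>>> pd.unique(lst).tolist()
[1, 2, 3, 4]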
A 5x faster reduce variant, but more sophisticated (in Python 3, reduce must be imported from functools):
>>> l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]
Explanation:
default = (list(), set())
# use list to keep order
# use set to make lookup faster
def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result
>>> reduce(reducer, l, default)[0]
[5, 6, 1, 2, 3, 4]
Here is a simple way to do it:
list1 = ["hello", " ", "w", "o", "r", "l", "d"]
sorted(set(list1), key=list1.index)
That gives the output:
["hello", " ", "w", "o", "r", "l", "d"]
Borrowing the recursive idea used in defining Haskell's nub function for lists, this would be a recursive approach:
def unique(lst):
    return [] if lst == [] else [lst[0]] + unique(list(filter(lambda x: x != lst[0], lst[1:])))
e.g.:
In [118]: unique([1,5,1,1,4,3,4])
Out[118]: [1, 5, 4, 3]
I tried it for growing data sizes and saw sub-linear time-complexity (not definitive, but suggests this should be fine for normal data).
In [122]: %timeit unique(np.random.randint(5, size=(1)))
10000 loops, best of 3: 25.3 us per loop
In [123]: %timeit unique(np.random.randint(5, size=(10)))
10000 loops, best of 3: 42.9 us per loop
In [124]: %timeit unique(np.random.randint(5, size=(100)))
10000 loops, best of 3: 132 us per loop
In [125]: %timeit unique(np.random.randint(5, size=(1000)))
1000 loops, best of 3: 1.05 ms per loop
In [126]: %timeit unique(np.random.randint(5, size=(10000)))
100 loops, best of 3: 11 ms per loop
I also think it's interesting that this could be readily generalized to uniqueness by other operations. Like this:
import operator
def unique(lst, cmp_op=operator.ne):
    return [] if lst == [] else [lst[0]] + unique(list(filter(lambda x: cmp_op(x, lst[0]), lst[1:])), cmp_op)
For example, you could pass in a function that uses the notion of rounding to the same integer as if it was "equality" for uniqueness purposes, like this:
def test_round(x, y):
    return round(x) != round(y)
then unique(some_list, test_round) would provide the unique elements of the list where uniqueness no longer meant traditional equality (which is implied by using any sort of set-based or dict-key-based approach to this problem) but instead meant to take only the first element that rounds to K for each possible integer K that the elements might round to, e.g.:
In [6]: unique([1.2, 5, 1.9, 1.1, 4.2, 3, 4.8], test_round)
Out[6]: [1.2, 5, 1.9, 4.2, 3]
You can reference a list comprehension as it is being built by the symbol '_[1]' (a CPython 2 implementation detail; this does not work in Python 3). For example, the following function unique-ifies a list of elements without changing their order by referencing its own list comprehension.
def unique(my_list):
    return [x for x in my_list if x not in locals()['_[1]']]
Demo:
l1 = [1, 2, 3, 4, 1, 2, 3, 4, 5]
l2 = [x for x in l1 if x not in locals()['_[1]']]
print l2
Output:
[1, 2, 3, 4, 5]
Eliminate the duplicate values in a sequence, but preserve the order of the remaining items, using a general-purpose generator function.
# for hashable items
def remove_duplicates(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

a = [1, 5, 2, 1, 9, 1, 5, 10]
list(remove_duplicates(a))
# [1, 5, 2, 9, 10]
# for unhashable items
def remove_duplicates(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

a = [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 1, 'y': 2}, {'x': 2, 'y': 4}]
list(remove_duplicates(a, key=lambda d: (d['x'], d['y'])))
# [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]
1. These solutions are fine…
For removing duplicates while preserving order, the excellent solution(s) proposed elsewhere on this page:
seen = set()
[x for x in seq if not (x in seen or seen.add(x))]
and variation(s), e.g.:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
are indeed popular because they are simple, minimalistic, and deploy the correct hashing for optimal efficiency. The main complaint about these seems to be that using the invariant None "returned" by the method seen.add(x) as a constant (and therefore excess/unnecessary) value in a logical expression, just for its side effect, is hacky and/or confusing.
2. …but they waste one hash lookup per iteration.
Surprisingly, given the amount of discussion and debate on this topic, there is actually a significant improvement to the code that seems to have been overlooked. As shown, each "test-and-set" iteration requires two hash lookups: the first to test membership x not in seen and then again to actually add the value seen.add(x). Since the first operation guarantees that the second will always be successful, there is a wasteful duplication of effort here. And because the overall technique here is so efficient, the excess hash lookups will likely end up being the most expensive proportion of what little work remains.
3. Instead, let the set do its job!
Notice that the examples above only call set.add with the foreknowledge that doing so will always result in an increase in set membership. The set itself never gets a chance to reject a duplicate; our code snippet has essentially usurped that role for itself. The use of explicit two-step test-and-set code is robbing set of its core ability to exclude those duplicates itself.
4. The single-hash-lookup code:
The following version cuts the number of hash lookups per iteration in half—from two down to just one.
seen = set()
[x for x in seq if len(seen) < len(seen.add(x) or seen)]
I've compared all relevant answers with perfplot and found that,
list(dict.fromkeys(data))
is fastest. This also holds true for small numpy arrays. For larger numpy arrays, pandas.unique is actually fastest.
Code to reproduce the plot:
from collections import OrderedDict
from functools import reduce
from itertools import groupby
import numpy as np
import pandas as pd
import perfplot
from more_itertools import unique_everseen as ue
def dict_fromkeys(data):
    return list(dict.fromkeys(data))

def unique_everseen(data):
    return list(ue(data))

def seen_add(data):
    seen = set()
    seen_add = seen.add
    return [x for x in data if not (x in seen or seen_add(x))]

def ordereddict_fromkeys(data):
    return list(OrderedDict.fromkeys(data))

def pandas_drop_duplicates(data):
    return pd.Series(data).drop_duplicates().tolist()

def pandas_unique(data):
    return pd.unique(data)

def itertools_groupby(data):
    return [key for key, _ in groupby(data)]

def reduce_tricks(data):
    return reduce(
        lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r,
        data,
        ([], set()),
    )[0]

b = perfplot.bench(
    setup=lambda n: np.random.randint(100, size=n).tolist(),
    kernels=[
        dict_fromkeys,
        unique_everseen,
        seen_add,
        ordereddict_fromkeys,
        pandas_drop_duplicates,
        pandas_unique,
        reduce_tricks,
    ],
    n_range=[2**k for k in range(20)],
)
b.save("out.png")
b.show()
If you need a one-liner, then maybe this would help:
reduce(lambda x, y: x + y if y[0] not in x else x, map(lambda x: [x], lst))
...should work, but correct me if I'm wrong (in Python 3, reduce comes from functools).
MizardX's answer gives a good collection of multiple approaches.
This is what I came up with while thinking aloud:
mylist = [x for i,x in enumerate(mylist) if x not in mylist[i+1:]]
You could do a sort of ugly list comprehension hack.
[l[i] for i in range(len(l)) if l.index(l[i]) == i]
Relatively effective approach for sorted numpy arrays:
import numpy as np
b = np.array([1, 3, 3, 8, 12, 12, 12])
np.hstack([b[0], [x[0] for x in zip(b[1:], b[:-1]) if x[0] != x[1]]])
Outputs:
array([ 1, 3, 8, 12])
l = [1,2,2,3,3,...]
n = []
n.extend(ele for ele in l if ele not in set(n))
A generator expression that uses a set look-up to determine whether or not to include an element in the new list. Note, however, that set(n) is rebuilt for every element, so although each membership test against the built set is O(1), constructing it costs O(len(n)) per element and the overall approach is still quadratic.
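A sketch of a variant that keeps one set alive across iterations, so the membership test really is O(1) per element:
l = [1, 2, 2, 3, 3, 1]
n = []
seen = set()
n.extend(seen.add(ele) or ele for ele in l if ele not in seen)
# n is now [1, 2, 3]; seen.add returns None, so `or ele` yields the element itself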
A simple recursive solution:
def uniquefy_list(a):
    return uniquefy_list(a[1:]) if a[0] in a[1:] else [a[0]] + uniquefy_list(a[1:]) if len(a) > 1 else [a[0]]
This will preserve order and run in O(n) time. Basically the idea is to create a hole wherever a duplicate is found and sink it down to the bottom. It makes use of a read and a write pointer. Whenever a duplicate is found, only the read pointer advances, and the write pointer stays on the duplicate entry to overwrite it.
def deduplicate(l):
    count = {}
    (read, write) = (0, 0)
    while read < len(l):
        if l[read] in count:
            read += 1
            continue
        count[l[read]] = True
        l[write] = l[read]
        read += 1
        write += 1
    return l[0:write]
x = [1, 2, 1, 3, 1, 4]
# brute force method
arr = []
for i in x:
    if i not in arr:
        arr.append(i)

# recursive method
tmp = []
def remove_duplicates(j=0):
    if j < len(x):
        if x[j] not in tmp:
            tmp.append(x[j])
        i = j + 1
        remove_duplicates(i)

remove_duplicates()
One liner list comprehension:
values_non_duplicated = [value for index, value in enumerate(values) if value not in values[ : index]]
If you routinely use pandas, and aesthetics is preferred over performance, then consider the built-in function pandas.Series.drop_duplicates:
import pandas as pd
import numpy as np
uniquifier = lambda alist: pd.Series(alist).drop_duplicates().tolist()
# from the chosen answer
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

alist = np.random.randint(low=0, high=1000, size=10000).tolist()
print(uniquifier(alist) == f7(alist))  # True
Timing:
In [104]: %timeit f7(alist)
1000 loops, best of 3: 1.3 ms per loop
In [110]: %timeit uniquifier(alist)
100 loops, best of 3: 4.39 ms per loop
A solution without using imported modules or sets:
text = "ask not what your country can do for you ask what you can do for your country"
sentence = text.split(" ")
noduplicates = [sentence[i] for i in range(0, len(sentence)) if sentence[i] not in sentence[:i]]
print(noduplicates)
Gives output:
['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']
I want to assign a single value to a part of a list. Is there a better solution to this than one of the following?
Maybe most performant but somehow ugly:
>>> l=[0,1,2,3,4,5]
>>> for i in range(2,len(l)): l[i] = None
>>> l
[0, 1, None, None, None, None]
Concise (but I don't know if Python recognizes that no rearrangement of the list elements is necessary):
>>> l=[0,1,2,3,4,5]
>>> l[2:] = [None]*(len(l)-2)
>>> l
[0, 1, None, None, None, None]
Same caveat like above:
>>> l=[0,1,2,3,4,5]
>>> l[2:] = [None for _ in range(len(l)-2)]
>>> l
[0, 1, None, None, None, None]
Not sure if using a library for such a trivial task is wise:
>>> import itertools
>>> l=[0,1,2,3,4,5]
>>> l[2:] = itertools.repeat(None,len(l)-2)
>>> l
[0, 1, None, None, None, None]
The problem that I see with the assignment to the slice (vs. the for loop) is that Python maybe tries to prepare for a change in the length of "l". After all, changing the list by inserting a shorter/longer slice involves copying all elements (that is, all references) of the list AFAIK. If Python does this in my case too (although it is unnecessary), the operation becomes O(n) instead of O(1) (assuming that I only always change a handful of elements).
Timing it:
python -mtimeit "l=[0,1,2,3,4,5]" "for i in range(2,len(l)):" " l[i] = None"
1000000 loops, best of 3: 0.669 usec per loop
python -mtimeit "l=[0,1,2,3,4,5]" "l[2:] = [None]*(len(l)-2)"
1000000 loops, best of 3: 0.419 usec per loop
python -mtimeit "l=[0,1,2,3,4,5]" "l[2:] = [None for _ in range(len(l)-2)]"
1000000 loops, best of 3: 0.655 usec per loop
python -mtimeit "l=[0,1,2,3,4,5]" "l[2:] = itertools.repeat(None,len(l)-2)"
1000000 loops, best of 3: 0.997 usec per loop
Looks like l[2:] = [None]*(len(l)-2) is the best of the options you provided (for the scope you are dealing with).
Note:
Keep in mind that results will vary based on Python version, operating system, other currently running programs, and most of all the size of the list and of the slice to be replaced. For larger slices, the last option (using itertools.repeat) will probably be the most effective, being both easily readable (Pythonic) and efficient.
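A rough sketch for checking that claim on a larger list (the list size and repetition count are arbitrary; absolute numbers will vary):
import timeit

setup = "l = list(range(1_000_000))"
print(timeit.timeit("l[2:] = [None]*(len(l)-2)", setup=setup, number=100))
print(timeit.timeit("l[2:] = itertools.repeat(None, len(l)-2)",
                    setup="import itertools; " + setup, number=100))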
All of your solutions are Pythonic and about equally readable. If you really care about performance and think that it matters in this case, use the timeit module to benchmark them.
Having said that, I would expect that the first solution is almost certainly not the most performant one because it iterates over the list elements in Python. Also, Python doesn't optimize away the list created on the right-hand side of the assignment, but list creation is extremely fast, and in most cases a small temporary list doesn't affect execution at all. Personally, for a short list I'd go with your second solution, and for a longer list I'd go with itertools.repeat().
Note that itertools doesn't really count as a "library", it comes with Python and is so often used that it is essentially part of the language.
I think there's no straight out of the box feature in Python to do this. I like your second approach, but keep in mind that there's a tradeoff between space and time. This is a very good reading recommended by @user4815162342: Python Patterns - An Optimization Anecdote.
Anyhow, if this is an operation you'll be performing eventually in your code, I think your best option is to wrap it inside a helper function:
def setvalues(lst, index=0, value=None):
    for i in range(index, len(lst)):
        lst[i] = value

>>> l = [1, 2, 3, 4, 5]
>>> setvalues(l, index=2)
>>> l
[1, 2, None, None, None]
This has some advantages:
The code is refactored inside a function, so easy to modify if you change your mind about how to perform the action.
You can have several functions that accomplish the same goal, and therefore you can measure their performance.
You can write tests for them.
Every other advantage you can get by refactoring :)
Since IMHO there's no straight Python feature for this action, this is the best workaround I can imagine.
Hope this helps!
Being more Pythonic and being more performant are goals that can sometimes collide. So you're basically asking two questions. If you really need the performance: measure it and take the fastest. In all other cases just go with what is most readable, in other words what is most Pythonic (what is most readable/familiar to other Python programmers).
Personally I think your second solution is quite readable:
>>> l=[0,1,2,3,4,5]
>>> l[2:] = [None]*(len(l)-2)
>>> l
[0, 1, None, None, None, None]
The start of the second line immediately tells me that you're replacing a specific part of the values of the list.
I'd suggest something like this:
>>> l = [0, 1, 2, 3, 4, 5]
>>> n = [None] * len(l)
>>> l[2:] = n[2:]
>>> l
[0, 1, None, None, None, None]
It looks pretty: no explicit loops, no 'if', no comparisons! At the cost of a 2N complexity! (or not?)
EDIT - Editing the original list now.
You forgot this one (IMO, this one is more readable):
>>> l = [i if i < 2 else None for i in range(6)]
>>> l
[0, 1, None, None, None, None]
If preserving the original list object is necessary,
>>> l = range(6)
>>> l
[0, 1, 2, 3, 4, 5]
>>> l[:] = [l[i] if i < 2 else None
... for i in range(len(l))]
>>> l
[0, 1, None, None, None, None]
Timed it, performance is roughly 2.5 times slower than what Inbar got as the fastest method.
I'm sure there's a nice way to do this in Python, but I'm pretty new to the language, so forgive me if this is an easy one!
I have a list, and I'd like to pick out certain values from that list. The values I want to pick out are the ones whose indexes in the list are specified in another list.
For example:
indexes = [2, 4, 5]
main_list = [0, 1, 9, 3, 2, 6, 1, 9, 8]
the output would be:
[9, 2, 6]
(i.e., the elements with indexes 2, 4 and 5 from main_list).
I have a feeling this should be doable using something like list comprehensions, but I can't figure it out (in particular, I can't figure out how to access the index of an item when using a list comprehension).
[main_list[x] for x in indexes]
This will return a list of the objects, using a list comprehension.
def get_items(main_list, indexes):
    t = []
    for i in indexes:
        t.append(main_list[i])
    return t
map(lambda x:main_list[x],indexes)
If you're good with numpy:
import numpy as np
main_array = np.array(main_list) # converting to numpy array
out_array = main_array.take([2, 4, 5])
out_list = out_array.tolist() # if you want a list specifically
I think Yuval A's solution is pretty clear and simple. But if you actually want a one-line list comprehension:
[e for i, e in enumerate(main_list) if i in indexes]
As an alternative to a list comprehension, you can use map with list.__getitem__. For large lists you should see better performance:
import random
n = 10**7
L = list(range(n))
idx = random.sample(range(n), int(n/10))
x = [L[x] for x in idx]
y = list(map(L.__getitem__, idx))
assert all(i==j for i, j in zip(x, y))
%timeit [L[x] for x in idx] # 474 ms per loop
%timeit list(map(L.__getitem__, idx)) # 417 ms per loop
For a lazy iterator, you can just use map(L.__getitem__, idx). Note in Python 2.7, map returns a list, so there is no need to pass to list.
I have noticed that there are two possible ways to do this job: either by a loop/list comprehension or by turning to np.array. I tested the time needed by these two methods, and the results show that when the dataset is large,
[main_list[x] for x in indexes]
is about 3~5 times faster than the np.array(...).take(...) approach.
So if your code is sensitive to computation time, the highest-voted answer is a good choice.
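A small sketch for reproducing that comparison (the list size, sample size and repetition count are arbitrary choices, and absolute numbers depend on the machine):
import random
import timeit
import numpy as np

n = 10**6
main_list = list(range(n))
indexes = random.sample(range(n), n // 10)

print(timeit.timeit(lambda: [main_list[x] for x in indexes], number=10))
print(timeit.timeit(lambda: np.array(main_list).take(indexes), number=10))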
How do I remove duplicates from a list, while preserving order? Using a set to remove duplicates destroys the original order.
Is there a built-in or a Pythonic idiom?
Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark
Fastest one:
def f7(seq):
seen = set()
seen_add = seen.add
return [x for x in seq if not (x in seen or seen_add(x))]
Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.
If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/
O(1) insertion, deletion and member-check per operation.
(Small additional note: seen.add() always returns None, so the or above is there only as a way to attempt a set update, and not as an integral part of the logical test.)
The best solution varies by Python version and environment constraints:
Python 3.7+ (and most interpreters supporting 3.6, as an implementation detail):
First introduced in PyPy 2.5.0, and adopted in CPython 3.6 as an implementation detail, before being made a language guarantee in Python 3.7, plain dict is insertion-ordered, and even more efficient than the (also C implemented as of CPython 3.5) collections.OrderedDict. So the fastest solution, by far, is also the simplest:
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(dict.fromkeys(items)) # Or [*dict.fromkeys(items)] if you prefer
[1, 2, 0, 3]
Like list(set(items)) this pushes all the work to the C layer (on CPython), but since dicts are insertion ordered, dict.fromkeys doesn't lose ordering. It's slower than list(set(items)) (takes 50-100% longer typically), but much faster than any other order-preserving solution (takes about half the time of hacks involving use of sets in a listcomp).
Important note: The unique_everseen solution from more_itertools (see below) has some unique advantages in terms of laziness and support for non-hashable input items; if you need these features, it's the only solution that will work.
Python 3.5 (and all older versions if performance isn't critical)
As Raymond pointed out, in CPython 3.5 where OrderedDict is implemented in C, ugly list comprehension hacks are slower than OrderedDict.fromkeys (unless you actually need the list at the end - and even then, only if the input is very short). So on both performance and readability the best solution for CPython 3.5 is the OrderedDict equivalent of the 3.6+ use of plain dict:
>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]
On CPython 3.4 and earlier, this will be slower than some other solutions, so if profiling shows you need a better solution, keep reading.
Python 3.4 and earlier, if performance is critical and third-party modules are acceptable
As #abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add) mutations in list comprehensions. This is the fastest solution too:
>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]
Just one simple library import and no hacks.
The module is adapting the itertools recipe unique_everseen which looks like:
def unique_everseen(iterable, key=None):
"List unique elements, preserving order. Remember all elements ever seen."
# unique_everseen('AAAABBBCCDAABBB') --> A B C D
# unique_everseen('ABBCcAD', str.lower) --> A B C D
seen = set()
seen_add = seen.add
if key is None:
for element in filterfalse(seen.__contains__, iterable):
seen_add(element)
yield element
else:
for element in iterable:
k = key(element)
if k not in seen:
seen_add(k)
yield element
but unlike the itertools recipe, it supports non-hashable items (at a performance cost; if all elements in iterable are non-hashable, the algorithm becomes O(n²), vs. O(n) if they're all hashable).
Important note: Unlike all the other solutions here, unique_everseen can be used lazily; the peak memory usage will be the same (eventually, the underlying set grows to the same size), but if you don't listify the result, you just iterate it, you'll be able to process unique items as they're found, rather than waiting until the entire input has been deduplicated before processing the first unique item.
Python 3.4 and earlier, if performance is critical and third party modules are unavailable
You have two options:
Copy and paste in the unique_everseen recipe to your code and use it per the more_itertools example above
Use ugly hacks to allow a single listcomp to both check and update a set to track what's been seen:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
at the expense of relying on the ugly hack:
not seen.add(x)
which relies on the fact that set.add is an in-place method that always returns None so not None evaluates to True.
Note that all of the solutions above are O(n) (save calling unique_everseen on an iterable of non-hashable items, which is O(n²), while the others would fail immediately with a TypeError), so all solutions are performant enough when they're not the hottest code path. Which one to use depends on which versions of the language spec/interpreter/third-party modules you can rely on, whether or not performance is critical (don't assume it is; it usually isn't), and most importantly, readability (because if the person who maintains this code later ends up in a murderous mood, your clever micro-optimization probably wasn't worth it).
In CPython 3.6+ (and all other Python implementations starting with Python 3.7+), dictionaries are ordered, so the way to remove duplicates from an iterable while keeping it in the original order is:
>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
In Python 3.5 and below (including Python 2.7), use the OrderedDict. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5 (when it gained a C implementation; prior to 3.5 it's still the clearest solution, though not the fastest).
>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']
Not to kick a dead horse (this question is very old and already has lots of good answers), but here is a solution using pandas that is quite fast in many circumstances and is dead simple to use.
import pandas as pd
my_list = [0, 1, 2, 3, 4, 1, 2, 3, 5]
>>> pd.Series(my_list).drop_duplicates().tolist()
# Output:
# [0, 1, 2, 3, 4, 5]
In Python 3.7 and above, dictionaries are guaranteed to remember their key insertion order. The answer to this question summarizes the current state of affairs.
The OrderedDict solution thus becomes obsolete and without any import statements we can simply issue:
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> list(dict.fromkeys(lst))
[1, 2, 3, 4]
sequence = ['1', '2', '3', '3', '6', '4', '5', '6']
unique = []
[unique.append(item) for item in sequence if item not in unique]
unique → ['1', '2', '3', '6', '4', '5']
from itertools import groupby
[ key for key,_ in groupby(sortedList)]
The list doesn't even have to be sorted, the sufficient condition is that equal values are grouped together.
Edit: I assumed that "preserving order" implies that the list is actually ordered. If this is not the case, then the solution from MizardX is the right one.
Community edit: This is however the most elegant way to "compress duplicate consecutive elements into a single element".
I think if you wanna maintain the order,
you can try this:
list1 = ['b','c','d','b','c','a','a']
list2 = list(set(list1))
list2.sort(key=list1.index)
print list2
OR similarly you can do this:
list1 = ['b','c','d','b','c','a','a']
list2 = sorted(set(list1),key=list1.index)
print list2
You can also do this:
list1 = ['b','c','d','b','c','a','a']
list2 = []
for i in list1:
if not i in list2:
list2.append(i)`
print list2
It can also be written as this:
list1 = ['b','c','d','b','c','a','a']
list2 = []
[list2.append(i) for i in list1 if not i in list2]
print list2
Just to add another (very performant) implementation of such a functionality from an external module1: iteration_utilities.unique_everseen:
>>> from iteration_utilities import unique_everseen
>>> lst = [1,1,1,2,3,2,2,2,1,3,4]
>>> list(unique_everseen(lst))
[1, 2, 3, 4]
Timings
I did some timings (Python 3.6) and these show that it's faster than all other alternatives I tested, including OrderedDict.fromkeys, f7 and more_itertools.unique_everseen:
%matplotlib notebook

from iteration_utilities import unique_everseen
from collections import OrderedDict
from more_itertools import unique_everseen as mi_unique_everseen

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def iteration_utilities_unique_everseen(seq):
    return list(unique_everseen(seq))

def more_itertools_unique_everseen(seq):
    return list(mi_unique_everseen(seq))

def odict(seq):
    return list(OrderedDict.fromkeys(seq))

from simple_benchmark import benchmark

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: list(range(2**i)) for i in range(1, 20)},
              'list size (no duplicates)')
b.plot()
And just to make sure I also did a test with more duplicates just to check if it makes a difference:
import random
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(1, 20)},
              'list size (lots of duplicates)')
b.plot()
And one containing only one value:
b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [1]*(2**i) for i in range(1, 20)},
              'list size (only duplicates)')
b.plot()
In all of these cases the iteration_utilities.unique_everseen function is the fastest (on my computer).
The iteration_utilities.unique_everseen function can also handle unhashable values in the input (however with O(n²) performance instead of the O(n) performance it has when the values are hashable).
>>> lst = [{1}, {1}, {2}, {1}, {3}]
>>> list(unique_everseen(lst))
[{1}, {2}, {3}]
[1] Disclaimer: I'm the author of that package.
For another very late answer to another very old question:
The itertools recipes have a function that does this, using the seen set technique, but:
Handles a standard key function.
Uses no unseemly hacks.
Optimizes the loop by pre-binding seen.add instead of looking it up N times. (f7 also does this, but some versions don't.)
Optimizes the loop by using ifilterfalse (itertools.filterfalse in Python 3), so you only have to loop over the unique elements in Python, instead of all of them. (You still iterate over all of them inside filterfalse, of course, but that's in C, and much faster.)
Is it actually faster than f7? It depends on your data, so you'll have to test it and see. If you want a list in the end, f7 uses a listcomp, and there's no way to do that here. (You can directly append instead of yielding, or you can feed the generator into the list function, but neither one can be as fast as the LIST_APPEND inside a listcomp.) At any rate, usually, squeezing out a few microseconds is not going to be as important as having an easily-understandable, reusable, already-written function that doesn't require DSU when you want to decorate.
As with all of the recipes, it's also available in more-itertools.
If you just want the no-key case, you can simplify it as:
from itertools import filterfalse  # ifilterfalse on Python 2

def unique(iterable):
    seen = set()
    seen_add = seen.add
    for element in filterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element
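For reference, here is a sketch of the full recipe with key support, in the spirit of the unique_everseen recipe shown in the itertools documentation (Python 3 spelling; the comment wording is mine):
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    # Yield unique elements, preserving order; remembers all elements ever seen.
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element
So, for example, list(unique_everseen('ABBCcAD', str.lower)) gives ['A', 'B', 'C', 'D'].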
For no hashable types (e.g. list of lists), based on MizardX's:
def f7_noHash(seq):
    seen = set()
    return [x for x in seq if str(x) not in seen and not seen.add(str(x))]
pandas users should check out pandas.unique.
>>> import pandas as pd
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> pd.unique(lst)
array([1, 2, 3, 4])
The function returns a NumPy array. If needed, you can convert it to a list with the tolist method.
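For example, continuing with the lst from above:
>>> pd.unique(lst).tolist()
[1, 2, 3, 4]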
A 5x faster reduce variant, but more sophisticated:
>>> from functools import reduce  # needed on Python 3
>>> l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]
Explanation:
# use a list to keep order and a set to make lookups fast
default = (list(), set())

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

>>> reduce(reducer, l, default)[0]
[5, 6, 1, 2, 3, 4]
Here is a simple way to do it:
list1 = ["hello", " ", "w", "o", "r", "l", "d"]
sorted(set(list1), key=list1.index)
That gives the output:
["hello", " ", "w", "o", "r", "l", "d"]
Borrowing the recursive idea used in defining Haskell's nub function for lists, this would be a recursive approach:
def unique(lst):
    # list(...) so the recursion also works on Python 3, where filter returns an iterator
    return [] if lst == [] else [lst[0]] + unique(list(filter(lambda x: x != lst[0], lst[1:])))
e.g.:
In [118]: unique([1,5,1,1,4,3,4])
Out[118]: [1, 5, 4, 3]
I tried it for growing data sizes, and the timings grow roughly linearly with input size (not definitive, but it suggests this should be fine for normal data).
In [122]: %timeit unique(np.random.randint(5, size=(1)))
10000 loops, best of 3: 25.3 us per loop
In [123]: %timeit unique(np.random.randint(5, size=(10)))
10000 loops, best of 3: 42.9 us per loop
In [124]: %timeit unique(np.random.randint(5, size=(100)))
10000 loops, best of 3: 132 us per loop
In [125]: %timeit unique(np.random.randint(5, size=(1000)))
1000 loops, best of 3: 1.05 ms per loop
In [126]: %timeit unique(np.random.randint(5, size=(10000)))
100 loops, best of 3: 11 ms per loop
I also think it's interesting that this could be readily generalized to uniqueness by other operations. Like this:
import operator

def unique(lst, cmp_op=operator.ne):
    return [] if lst == [] else [lst[0]] + unique(list(filter(lambda x: cmp_op(x, lst[0]), lst[1:])), cmp_op)
For example, you could pass in a function that uses the notion of rounding to the same integer as if it was "equality" for uniqueness purposes, like this:
def test_round(x,y):
return round(x) != round(y)
Then unique(some_list, test_round) would provide the unique elements of the list, where uniqueness no longer means traditional equality (as is implied by any set-based or dict-key-based approach to this problem) but instead means keeping only the first element that rounds to K, for each possible integer K that the elements might round to, e.g.:
In [6]: unique([1.2, 5, 1.9, 1.1, 4.2, 3, 4.8], test_round)
Out[6]: [1.2, 5, 1.9, 4.2, 3]
You can reference a list comprehension as it is being built via the symbol '_[1]' (a CPython 2 implementation detail; it does not work in Python 3). For example, the following function unique-ifies a list of elements without changing their order by referencing its own list comprehension.
def unique(my_list):
    return [x for x in my_list if x not in locals()['_[1]']]
Demo:
l1 = [1, 2, 3, 4, 1, 2, 3, 4, 5]
l2 = [x for x in l1 if x not in locals()['_[1]']]
print l2
Output:
[1, 2, 3, 4, 5]
Eliminate the duplicate values in a sequence, but preserve the order of the remaining items, using a general-purpose generator function.
# for a hashable sequence
def remove_duplicates(items):
    seen = set()
    for item in items:
        if item not in seen:
            yield item
            seen.add(item)

a = [1, 5, 2, 1, 9, 1, 5, 10]
list(remove_duplicates(a))
# [1, 5, 2, 9, 10]

# for an unhashable sequence (deduplicate by a key function)
def remove_duplicates(items, key=None):
    seen = set()
    for item in items:
        val = item if key is None else key(item)
        if val not in seen:
            yield item
            seen.add(val)

a = [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 1, 'y': 2}, {'x': 2, 'y': 4}]
list(remove_duplicates(a, key=lambda d: (d['x'], d['y'])))
# [{'x': 1, 'y': 2}, {'x': 1, 'y': 3}, {'x': 2, 'y': 4}]
1. These solutions are fine…
For removing duplicates while preserving order, the excellent solution(s) proposed elsewhere on this page:
seen = set()
[x for x in seq if not (x in seen or seen.add(x))]
and variation(s), e.g.:
seen = set()
[x for x in seq if x not in seen and not seen.add(x)]
are indeed popular because they are simple, minimalistic, and deploy the correct hashing for optimal efficiency. The main complaint about these seems to be that using the invariant None "returned" by the method seen.add(x) as a constant (and therefore excess/unnecessary) value in a logical expression, just for its side effect, is hacky and/or confusing.
2. …but they waste one hash lookup per iteration.
Surprisingly, given the amount of discussion and debate on this topic, there is actually a significant improvement to the code that seems to have been overlooked. As shown, each "test-and-set" iteration requires two hash lookups: the first to test membership x not in seen and then again to actually add the value seen.add(x). Since the first operation guarantees that the second will always be successful, there is a wasteful duplication of effort here. And because the overall technique here is so efficient, the excess hash lookups will likely end up being the most expensive proportion of what little work remains.
3. Instead, let the set do its job!
Notice that the examples above only call set.add with the foreknowledge that doing so will always result in an increase in set membership. The set itself never gets a chance to reject a duplicate; our code snippet has essentially usurped that role for itself. The use of explicit two-step test-and-set code is robbing set of its core ability to exclude those duplicates itself.
4. The single-hash-lookup code:
The following version cuts the number of hash lookups per iteration in half—from two down to just one.
seen = set()
[x for x in seq if len(seen) < len(seen.add(x) or seen)]
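A quick check of that one-pass version, using a small example string of my own:
>>> seq = 'abracadabra'
>>> seen = set()
>>> [x for x in seq if len(seen) < len(seen.add(x) or seen)]
['a', 'b', 'r', 'c', 'd']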
I've compared all relevant answers with perfplot and found that,
list(dict.fromkeys(data))
is fastest. This also holds true for small numpy arrays. For larger numpy arrays, pandas.unique is actually fastest.
Code to reproduce the plot:
from collections import OrderedDict
from functools import reduce
from itertools import groupby

import numpy as np
import pandas as pd
import perfplot
from more_itertools import unique_everseen as ue

def dict_fromkeys(data):
    return list(dict.fromkeys(data))

def unique_everseen(data):
    return list(ue(data))

def seen_add(data):
    seen = set()
    seen_add = seen.add
    return [x for x in data if not (x in seen or seen_add(x))]

def ordereddict_fromkeys(data):
    return list(OrderedDict.fromkeys(data))

def pandas_drop_duplicates(data):
    return pd.Series(data).drop_duplicates().tolist()

def pandas_unique(data):
    return pd.unique(data)

def itertools_groupby(data):
    return [key for key, _ in groupby(data)]

def reduce_tricks(data):
    return reduce(
        lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r,
        data,
        ([], set()),
    )[0]

b = perfplot.bench(
    setup=lambda n: np.random.randint(100, size=n).tolist(),
    kernels=[
        dict_fromkeys,
        unique_everseen,
        seen_add,
        ordereddict_fromkeys,
        pandas_drop_duplicates,
        pandas_unique,
        reduce_tricks,
    ],
    n_range=[2**k for k in range(20)],
)
b.save("out.png")
b.show()
If you need a one-liner, then maybe this would help:
reduce(lambda x, y: x + y if y[0] not in x else x, map(lambda x: [x], lst))
... it should work, but correct me if I'm wrong.
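A quick sanity check of my own (note that on Python 3 reduce must be imported from functools):
>>> from functools import reduce
>>> lst = [1, 2, 1, 3, 3, 2, 4]
>>> reduce(lambda x, y: x + y if y[0] not in x else x, map(lambda x: [x], lst))
[1, 2, 3, 4]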
MizardX's answer gives a good collection of multiple approaches.
This is what I came up with while thinking aloud:
mylist = [x for i,x in enumerate(mylist) if x not in mylist[i+1:]]
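Note that, because each element is kept only if it does not appear later in the list, this keeps the last occurrence of each value rather than the first (and it is O(n²)). A quick check with an illustrative list of my own:
>>> mylist = [1, 2, 1, 3, 2]
>>> [x for i, x in enumerate(mylist) if x not in mylist[i+1:]]
[1, 3, 2]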
You could do a sort of ugly list comprehension hack.
[l[i] for i in range(len(l)) if l.index(l[i]) == i]
A relatively effective approach for a _sorted_ numpy array:
b = np.array([1, 3, 3, 8, 12, 12, 12])
np.hstack([b[0], [x[0] for x in zip(b[1:], b[:-1]) if x[0] != x[1]]])
Outputs:
array([ 1, 3, 8, 12])
l = [1,2,2,3,3,...]
n = []
n.extend(ele for ele in l if ele not in set(n))
A generator expression that uses a set to decide whether or not to include an element in the new list. Note, however, that set(n) is re-evaluated for every element of l, so each membership test is preceded by rebuilding the set; keeping one persistent set (as in the seen-set answers above) avoids that cost.
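A sketch of a variant that builds the lookup set only once and keeps it in sync (the names seen and result are my own, not from the answer above):
l = [1, 2, 2, 3, 3]
seen = set()
result = []
# seen.add returns None, so "not seen.add(ele)" is always True and just records the element
result.extend(ele for ele in l if ele not in seen and not seen.add(ele))
# result is now [1, 2, 3]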
A simple recursive solution:
def uniquefy_list(a):
    return uniquefy_list(a[1:]) if a[0] in a[1:] else [a[0]] + uniquefy_list(a[1:]) if len(a) > 1 else [a[0]]
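A quick check with an illustrative list of my own; note that, like the enumerate variant above, this keeps the last occurrence of each duplicate, and it raises an IndexError on an empty list:
>>> uniquefy_list([1, 2, 1, 3])
[2, 1, 3]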
This will preserve order and run in O(n) time. The basic idea is to create a hole wherever a duplicate is found and sink it down to the bottom, using a read pointer and a write pointer: whenever a duplicate is found, only the read pointer advances, while the write pointer stays on the duplicate entry so it can be overwritten.
def deduplicate(l):
    count = {}
    (read, write) = (0, 0)
    while read < len(l):
        if l[read] in count:
            read += 1
            continue
        count[l[read]] = True
        l[write] = l[read]
        read += 1
        write += 1
    return l[0:write]
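A quick usage sketch (note that the function also overwrites the front of the list it is given):
>>> data = [1, 2, 1, 3, 1, 4]
>>> deduplicate(data)
[1, 2, 3, 4]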
x = [1, 2, 1, 3, 1, 4]

# brute force method
arr = []
for i in x:
    if i not in arr:
        arr.append(i)

# recursive method
tmp = []
def remove_duplicates(j=0):
    if j < len(x):
        if x[j] not in tmp:
            tmp.append(x[j])
        i = j + 1
        remove_duplicates(i)

remove_duplicates()
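Both methods end up with the de-duplicated values, in arr and tmp respectively:
>>> arr
[1, 2, 3, 4]
>>> tmp
[1, 2, 3, 4]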
One-liner list comprehension:
values_non_duplicated = [value for index, value in enumerate(values) if value not in values[ : index]]
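For example, with a hypothetical values list:
>>> values = [1, 2, 1, 3, 3, 2, 4]
>>> [value for index, value in enumerate(values) if value not in values[:index]]
[1, 2, 3, 4]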
If you routinely use pandas, and aesthetics are preferred over performance, then consider the built-in pandas.Series.drop_duplicates:
import pandas as pd
import numpy as np

uniquifier = lambda alist: pd.Series(alist).drop_duplicates().tolist()

# from the chosen answer
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

alist = np.random.randint(low=0, high=1000, size=10000).tolist()
print(uniquifier(alist) == f7(alist))  # True
Timing:
In [104]: %timeit f7(alist)
1000 loops, best of 3: 1.3 ms per loop
In [110]: %timeit uniquifier(alist)
100 loops, best of 3: 4.39 ms per loop
A solution without using imported modules or sets:
text = "ask not what your country can do for you ask what you can do for your country"
sentence = text.split(" ")
noduplicates = [sentence[i] for i in range(len(sentence)) if sentence[i] not in sentence[:i]]
print(noduplicates)
Gives output:
['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']