I need to initialize a list of defaultdicts. If they were, say, strings, this would be tidy:
list_of_dds = [string] * n
…but for mutables, you get right into a mess with that approach:
>>> x=[defaultdict(list)] * 3
>>> x[0]['foo'] = 'bar'
>>> x
[defaultdict(<type 'list'>, {'foo': 'bar'}), defaultdict(<type 'list'>, {'foo': 'bar'}), defaultdict(<type 'list'>, {'foo': 'bar'})]
What I do want is an iterable of freshly-minted distinct instances of defaultdicts. I can do this:
list_of_dds = [defaultdict(list) for i in xrange(n)]
but I feel a little dirty using a list comprehension here. I think there's a better approach. Is there? Please tell me what it is.
Edit:
This is why I feel the list comprehension is suboptimal. I'm not usually the premature-optimization type, but I can't bring myself to ignore the speed difference here:
>>> timeit('x=[string.letters]*100', setup='import string')
0.9318461418151855
>>> timeit('x=[string.letters for i in xrange(100)]', setup='import string')
12.606678009033203
>>> timeit('x=[[]]*100')
0.890861988067627
>>> timeit('x=[[] for i in xrange(100)]')
9.716886043548584
Your approach using the list comprehension is correct. Why do you think it's dirty? What you want is a list of things whose length is defined by some base set. List comprehensions create lists based on some base set. What's wrong with using a list comprehension here?
Edit: The speed difference is a direct consequence of what you are trying to do. [[]]*100 is faster, because it only has to create one list. Creating a new list each time is slower, yeah, but you have to expect it to be slower if you actually want 100 different lists.
(It doesn't create a new string each time on your string examples, but it's still slower, because the list comprehension can't "know" ahead of time that all the elements are going to be the same, so it still has to reevaluate the expression every time. I don't know the internal details of the list comp, but it's possible there's also some list-resizing overhead because it doesn't necessarily know the size of the index iterable to start with, so it can't preallocate the list. In addition, note that some of the slowdown in your string example is due to looking up string.letters on every iteration. On my system using timeit.timeit('x=[letters for i in xrange(100)]', setup='from string import letters') instead --- looking up string.letters only once --- cuts the time by about 30%.)
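A quick way to check the name-lookup effect on your own machine (a sketch in Python 3 spelling, where xrange is range and string.letters is string.ascii_letters; the absolute numbers will vary):

from timeit import timeit

# Attribute lookup on the module is repeated on every iteration:
t_attr = timeit('x = [string.ascii_letters for i in range(100)]', setup='import string')

# The name is bound once in setup, so each iteration only does a cheap name lookup:
t_name = timeit('x = [letters for i in range(100)]', setup='from string import ascii_letters as letters')

print(t_attr, t_name)  # t_name should come out noticeably smaller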
The list comprehension is exactly what you should use.
The problem with the list multiplication is that a list containing a single mutable object is created, and then you try to duplicate that object. But the object has no idea what code was used to create it, so nothing you can do with the object itself will do what you actually want, which is to run the creation code N times.
You could use copy.copy or copy.deepcopy to duplicate it, but that puts you right back in the same boat because then the call to copy/deepcopy just becomes the code you need to run N times.
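To make that concrete, here is a minimal sketch: duplicating a prototype still has to run N times, so you end up writing the comprehension anyway, just with extra copying overhead on top:

import copy
from collections import defaultdict

n = 3
proto = defaultdict(list)

# "Duplicating" still means executing the copy n times -- the same shape as
# the plain comprehension, plus the cost of deepcopy:
dds = [copy.deepcopy(proto) for _ in range(n)]

dds[0]['foo'].append('bar')
print(dds[1])  # defaultdict(<class 'list'>, {}) -- the instances are independent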
A list comprehension is a very good fit here. What's wrong with it?
Related
I quite often use set() to remove duplicates from lists. After doing so, I always directly change it back to a list.
a = [0,0,0,1,2,3,4,5]
b = list(set(a))
Why does set() return a set item, instead of simply a list?
type(set(a)) == set # is true
Is there a use for set items that I have failed to understand?
Yes, sets have many uses. They have lots of nice operations documented here which lists don't have. One very useful difference is that membership testing (x in a) can be much faster than for a list.
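For instance, a small illustration of those operations and of the fast membership test:

a = {0, 1, 2, 3, 4, 5}
b = {4, 5, 6}

print(a & b)   # intersection: {4, 5}
print(a | b)   # union: {0, 1, 2, 3, 4, 5, 6}
print(a - b)   # difference: {0, 1, 2, 3}
print(a ^ b)   # symmetric difference: {0, 1, 2, 3, 6}
print(4 in a)  # membership: average O(1) on a set, O(n) on a list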
Okay: set() will always return a set, because set is a data structure in Python, just like list. When you call set() you are creating a set object.
The rest of the information about sets can be found here:
https://docs.python.org/2/library/sets.html
As already mentioned, I won't go into why set does not return a list, but, as you stated:
I quite often use set() to remove duplicates from lists. After doing so, I always directly change it back to a list.
You could use OrderedDict if you really hate going back to changing it to a list:
source_list = [0,0,0,1,2,3,4,5]
from collections import OrderedDict
print(OrderedDict((x, True) for x in source_list).keys())
OUTPUT:
odict_keys([0, 1, 2, 3, 4, 5])
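Worth noting: on Python 3.7+ a plain dict preserves insertion order as well, so the same order-preserving deduplication needs no import at all:

source_list = [0,0,0,1,2,3,4,5]

# dict keys are unique and, from Python 3.7 on, keep insertion order:
print(list(dict.fromkeys(source_list)))  # [0, 1, 2, 3, 4, 5]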
As said before, certain operations are faster on a set than on a list. The Python wiki has a TimeComplexity page that lists the speed of operations for the various data types. Note that if you have only a few elements in your list or set, you will most probably not notice any difference, but with more elements it becomes more important.
Notice, for example, that in-place removal from a list is O(n), meaning that a list that is 10 times longer needs 10 times more time, while for a set, s.difference_update(t) (where s is a set and t is a set holding the element to be removed from s) takes time independent of the number of elements in s.
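A small side-by-side sketch of those two in-place removals:

s = {0, 1, 2, 3, 4, 5}
t = {3}

# In-place removal from a set: cost scales with len(t), not len(s)
s.difference_update(t)
print(s)  # {0, 1, 2, 4, 5}

l = [0, 1, 2, 3, 4, 5]
# In-place removal from a list: O(n) -- scan for the value, then shift the tail
l.remove(3)
print(l)  # [0, 1, 2, 4, 5]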
Is there an operator to remove elements from a List based on the content of a Set?
What I want to do is already possible by doing this:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
new_list = [w for w in words if w not in my_set]
# ["hello", "how", "today", "hello"]
What bothers me with this list comprehension is that for huge collections, it looks less effective to me than the - operator that can be used between two sets. Because in the list comprehension, the iteration happens in Python, whereas with the operator, the iteration happens in C and is more low-level, hence faster.
So is there some way of computing a difference between a List and a Set in a shorter/cleaner/more efficient way than using a list comprehension, like for example:
# I know this is not possible, but does something approaching exist?
new_list = words - my_set
TL;DR
I'm looking for a way to remove all elements present in a Set from a List, that is either:
cleaner (with a built-in perhaps)
and/or more efficient
than what I know can be done with list comprehensions.
Unfortunately, the only answer for this is: No, there is no built-in way, implemented in native code, for this kind of operation.
What bothers me with this list comprehension is that for huge collections, it looks less effective to me than the - operator that can be used between two sets.
I think what’s important here is the “looks” part. Yes, list comprehensions run more within Python than a set difference, but I assume that most of your application actually runs within Python (otherwise you should probably be programming in C instead). So you should consider whether it really matters much. Iterating over a list is fast in Python, and a membership test on a set is also super fast (constant time, and implemented in native code). And if you look at list comprehensions, they are also very fast. So it probably won’t matter much.
Because in the list comprehension, the iteration happens in Python, whereas with the operator, the iteration happens in C and is more low-level, hence faster.
It is true that native operations are faster, but they are also more specialized, limited and allow for less flexibility. For sets, a difference is pretty easy. The set difference is a mathematical concept and is very clearly defined.
But when talking about a “list difference” or a “list and set difference” (or, more generalized, a “list and iterable difference”?) it becomes a lot more unclear. There are a lot of open questions:
How are duplicates handled? If there are two X in the original list and only one X in the subtrahend, should both X disappear from the list? Should only one disappear? If so, which one, and why? (See the sketch after this list for what the comprehension does.)
How is order handled? Should the order be kept as in the original list? Does the order of the elements in the subtrahend have any impact?
What if we want to subtract members based on some other condition than equality? For sets, it’s clear that they always work on the equality (and hash value) of the members. Lists don’t, so lists are by design a lot more flexible. With list comprehensions, we can easily have any kind of condition to remove elements from a list; with a “list difference” we would be restricted to equality, and that might actually be a rare situation if you think about it.
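On the first two points, note that the list comprehension from the question already answers them one particular way: every occurrence of a subtracted element disappears, and the survivors keep their original order. A hypothetical built-in operator would have to pick one such behavior for everyone:

words = ["a", "x", "b", "x", "c"]
remove = {"x"}

# Both "x" occurrences go away; the relative order of the rest is preserved:
print([w for w in words if w not in remove])  # ['a', 'b', 'c']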
It’s maybe more likely to use a set if you need to calculate differences (or even some ordered set). And for filtering lists, it might also be a rare case that you want to end up with a filtered list, so it might be more common to use a generator expression (or the Python 3 filter() function) and work with that later without having to create that filtered list in memory.
What I’m trying to say is that the use case for a list difference is not as clear as that of a set difference. And if there were a use case, it might be a very rare one. In general, I don’t think it’s worth adding complexity to the Python implementation for this, especially when the in-Python alternative, e.g. a list comprehension, is as fast as it already is.
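For completeness, a minimal sketch of the lazy alternative mentioned above: a generator expression filters without ever building the intermediate list in memory.

words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}

# Lazily yields the surviving words one at a time:
filtered = (w for w in words if w not in my_set)

print(list(filtered))  # ['hello', 'how', 'today', 'hello']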
First things first: are you prematurely worrying about an optimisation problem that isn't really an issue? I have to have lists with at least 10,000,000 elements before this operation even gets into the range of taking tenths of a second.
If you're working with large data sets then you may find it advantageous to move to using numpy.
import random
import timeit
r = range(10000000)
setup = """
import numpy as np
l = list({!r})
s = set(l)
to_remove = {!r}
n = np.array(l)
n_remove = np.array(list(to_remove))
""".format(r, set(random.sample(r, 3)))
list_filter = "[x for x in l if x not in to_remove]"
set_filter = "s - to_remove"
np_filter = "n[np.in1d(n, n_remove, invert=True)]"
n = 1
l_time = timeit.timeit(list_filter, setup, number=n)
print("lists:", l_time)
s_time = timeit.timeit(set_filter, setup, number=n)
print("sets:", s_time)
n_time = timeit.timeit(np_filter, setup, number=n)
print("numpy:", n_time)
returns the following results -- with numpy an order of magnitude faster than using sets.
lists: 0.8743789765043315
sets: 0.20703006886620656
numpy: 0.06197169088128707
I agree with poke. Here is my reasoning:
The easiest way to do it would be using a filter:
words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}
new_list = filter(lambda w: w not in my_set, words)
And using Dunes' solution, I get these times:
lists: 0.87401028
sets: 0.55103887
numpy: 0.16134396
filter: 0.00000886 WOW, beats numpy by several orders of magnitude!!!
But wait, we are making a flawed comparison, because the variants are not producing the same thing: the comprehension and the set difference build their result eagerly, filter returns a lazy iterator, and numpy returns an array rather than a list.
If I run Dunes' solution but force each variant to produce an actual list, I get:
lists: 0.86804159
sets: 0.56945663
numpy: 1.19315723
filter: 1.68792561
Now numpy is slightly more efficient than a simple filter, but neither beats the list comprehension, which was the first and most intuitive solution.
I would definitely use a filter over the comprehension, unless I need to use the filtered list more than once (although I could tee it).
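A sketch of that tee option: itertools.tee splits one lazy iterator into several independent ones, at the cost of buffering items that one side has consumed and the other has not.

from itertools import tee

words = ["hello", "you", "how", "are", "you", "today", "hello"]
my_set = {"you", "are"}

filtered = filter(lambda w: w not in my_set, words)
first_pass, second_pass = tee(filtered)

print(list(first_pass))             # ['hello', 'how', 'today', 'hello']
print(sum(1 for _ in second_pass))  # 4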
I'm attempting to add elements from a list of lists into a set. For example if I had
new_list=[['blue','purple'],['black','orange','red'],['green']]
How would I receive
new_set=(['blue','purple'],['black','orange','red'],['green'])
I'm trying to do this so I can use intersection to find out what elements appear in 2 sets. I thought this would work...
results=set()
results2=set()
for element in new_list:
    results.add(element)
for element in new_list2:
    results2.add(element)
results3=results.intersection(results2)
but I keep receiving:
TypeError: unhashable type: 'list'
for some reason.
Convert the inner lists to tuples, as sets only allow you to store hashable (immutable) objects:
In [72]: new_list=[['blue','purple'],['black','orange','red'],['green']]
In [73]: set(tuple(x) for x in new_list)
Out[73]: set([('blue', 'purple'), ('black', 'orange', 'red'), ('green',)])
How would I receive
new_set=(['blue','purple'],['black','orange','red'],['green'])
Well, despite the misleading name, that's not a set of anything, that's a tuple of lists. To convert a list of lists into a tuple of lists:
new_set = tuple(new_list)
Maybe you wanted to receive this?
new_set=set([['blue','purple'],['black','orange','red'],['green']])
If so… you can't. A set cannot contain unhashable values like lists. That's what the TypeError is telling you.
If this weren't a problem, all you'd have to do is write:
new_set = set(new_list)
And anything more complicated you write will have exactly the same problem as just calling set, so there's no tricky way around it.
Of course you can have a set of tuples, since they're hashable. So, maybe you wanted this:
new_set=set([('blue','purple'),('black','orange','red'),('green',)])
That's easy too. Assuming your inner lists are guaranteed to contain nothing but strings (or other hashable values), as in your example, it's just:
new_set = set(map(tuple, new_list))
Or, if you use a sort-based set class, you don't need hashable values, just fully-ordered values. For example:
new_set = sortedset(new_list)
Python doesn't come with such a thing in the standard library, but there are some great third-party implementations you can install, like blist.sortedset or bintrees.FastRBTree.
Of course sorted-set operations aren't quite as fast as hash operations in general, but often they're more than good enough. (For a concrete example, if you have 1 million items in the list, hashing will make each lookup 1 million times faster; sorting will only make it 50,000 times faster.)
Basically, any output you can describe or give an example of, we can tell you how to get that, or that it isn't a valid object you can get… but first you have to tell us what you actually want.
By the way, if you're wondering why lists aren't hashable, it's just because they're mutable. If you're wondering why most mutable types aren't hashable, the FAQ explains that.
Make the element a tuple before adding it to the set:
new_list=[['blue','purple'],['black','orange','red'],['green']]
new_list2=[['blue','purple'],['black','green','red'],['orange']]
results=set()
results2=set()
for element in new_list:
    results.add(tuple(element))
for element in new_list2:
    results2.add(tuple(element))
results3=results.intersection(results2)
print results3
results in:
set([('blue', 'purple')])
Set elements have to be hashable.
For adding lists to a set, use tuples instead.
For adding sets to a set, use frozensets instead.
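A short sketch of both conversions:

s = set()

s.add(tuple(['blue', 'purple']))       # a list becomes a tuple
s.add(frozenset({'black', 'orange'}))  # a set becomes a frozenset

print(s)  # e.g. {('blue', 'purple'), frozenset({'black', 'orange'})} -- order is arbitrary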
When you do something like "test" in a where a is a list, does Python do a sequential search on the list, or does it create a hash table representation to optimize the lookup? In the application I need this for I'll be doing a lot of lookups on the list, so would it be best to do something like b = set(a) and then "test" in b? Also note that the list of values I'll have won't have duplicate data and I don't actually care about the order it's in; I just need to be able to check for the existence of a value.
Also note that the list of values I'll have won't have duplicate data and I don't actually care about the order it's in; I just need to be able to check for the existence of a value.
Don't use a list, use a set() instead. It has exactly the properties you want, including a blazing fast in test.
I've seen speedups of 20x and higher in places (mostly heavy number crunching) where one list was changed for a set.
"test" in a with a list a will do a linear search. Setting up a hash table on the fly would be much more expensive than a linear search. "test" in b on the other hand will do an amoirtised O(1) hash look-up.
In the case you describe, there doesn't seem to be a reason to use a list over a set.
I think it would be better to go with the set implementation. Sets have O(1) average lookup time, while membership tests on lists take O(n). So even in the worst case, you lose nothing by switching to sets.
Further, sets don't allow duplicate values. If your data contains many duplicates, this will make your program slightly more memory efficient as well.
Lists and tuples seem to take the same time, and using in is slow for large data:
>>> import time
>>> t = list(range(0, 1000000))
>>> a=time.time();x = [b in t for b in range(100234,101234)];print(time.time()-a)
1.66235494614
>>> t = tuple(range(0, 1000000))
>>> a=time.time();x = [b in t for b in range(100234,101234)];print(time.time()-a)
1.6594209671
Here is a much better solution: Most efficient way for a lookup/search in a huge list (python)
It's super fast (note that bisect only works on a sorted list):
>>> from bisect import bisect_left
>>> t = list(range(0, 1000000))
>>> a=time.time();x = [t[bisect_left(t,b)]==b for b in range(100234,101234)];print(time.time()-a)
0.0054759979248
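One caveat with the one-liner above: t[bisect_left(t, b)] raises an IndexError when b is larger than every element, because bisect_left returns len(t). A sketch of a safer helper (in_sorted is a made-up name) that guards the index:

from bisect import bisect_left

def in_sorted(sorted_list, value):
    # O(log n) membership test; the list must already be sorted
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

t = list(range(0, 1000000))
print(in_sorted(t, 100234))   # True
print(in_sorted(t, 2000000))  # False, with no IndexError at the right edge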
I don't remember whether I was dreaming or not but I seem to recall there being a function which allowed something like,
foo in iter_attr(array of python objects, attribute name)
I've looked over the docs, but this kind of thing doesn't fall under any obvious listed headers.
Using a list comprehension would build a temporary list, which could eat all your memory if the sequence being searched is large. Even if the sequence is not large, building the list means iterating over the whole of the sequence before in could start its search.
The temporary list can be avoided by using a generator expression:
foo = 12
foo in (obj.id for obj in bar)
Now, as long as an obj with obj.id == 12 occurs near the start of bar, the search will be fast, even if bar is infinitely long.
As @Matt suggested, it's a good idea to use hasattr if any of the objects in bar can be missing an id attribute:
foo = 12
foo in (obj.id for obj in bar if hasattr(obj, 'id'))
Are you looking to get a list of objects that have a certain attribute? If so, a list comprehension is the right way to do this.
result = [obj for obj in listOfObjs if hasattr(obj, 'attributeName')]
You could always write one yourself:
def iterattr(iterator, attributename):
    for obj in iterator:
        yield getattr(obj, attributename)
This will work with anything that iterates, be it a tuple, list, or whatever.
I love Python; it makes stuff like this very simple and no more of a hassle than necessary, and in use stuff like this is hugely elegant.
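A quick usage sketch, with a made-up Obj class standing in for the real objects:

class Obj(object):
    def __init__(self, id):
        self.id = id

bar = [Obj(7), Obj(12), Obj(42)]

# The generator yields attribute values lazily, so `in` stops at the first match:
print(12 in iterattr(bar, 'id'))  # True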
No, you were not dreaming. Python has a pretty excellent list comprehension system that lets you manipulate lists pretty elegantly, and depending on exactly what you want to accomplish, this can be done a couple of ways. In essence, what you're doing is saying "For item in list if criteria.matches", and from that you can just iterate through the results or dump the results into a new list.
I'm going to crib an example from Dive Into Python here, because it's pretty elegant and they're smarter than I am. Here they're getting a list of files in a directory, then filtering the list for all files that match a regular expression criteria.
files = os.listdir(path)
test = re.compile(r"test\.py$", re.IGNORECASE)
files = [f for f in files if test.search(f)]
You could do this without regular expressions, for your example, for anything where your expression at the end returns true for a match. There are other options like using the filter() function, but if I were going to choose, I'd go with this.
Eric Sipple
The function you are thinking of is probably operator.attrgetter. For example, to get a list that contains the value of each object's "id" attribute:
import operator
ids = map(operator.attrgetter("id"), bar)
If you want to check whether the list contains an object with an id == 12, then a neat and efficient (i.e. doesn't iterate the whole list unnecessarily) way to do it is:
any(obj.id == 12 for obj in bar)
If you want to use 'in' with attrgetter, while still retaining lazy iteration of the list:
import operator,itertools
foo = 12
foo in itertools.imap(operator.attrgetter("id"), bar)
What I was thinking of can be achieved using list comprehensions, but I thought that there was a function that did this in a slightly neater way.
i.e. 'bar' is a list of objects, all of which have the attribute 'id'
The mythical functional way:
foo = 12
foo in iter_attr(bar, 'id')
The list comprehension way:
foo = 12
foo in [obj.id for obj in bar]
In retrospect the list comprehension way is pretty neat anyway.
If you plan on searching anything of remotely decent size, your best bet is going to be to use a dictionary or a set. Otherwise, you basically have to iterate through every element of the iterator until you get to the one you want.
If this isn't necessarily performance sensitive code, then the list comprehension way should work. But note that it is fairly inefficient because it goes over every element of the iterator and then goes BACK over it again until it finds what it wants.
Remember, Python's dictionaries and sets are backed by one of the most efficient hash table implementations around. Use it to your advantage.
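One way to act on that advice (a sketch assuming a bar list of objects with an id attribute, as in the question): build a dict keyed by the attribute once, then every later membership test or retrieval is an average O(1) hash lookup.

class Obj(object):
    def __init__(self, id):
        self.id = id

bar = [Obj(7), Obj(12), Obj(42)]

# One O(n) pass builds the index...
by_id = dict((obj.id, obj) for obj in bar)

# ...then each lookup afterwards is an average O(1) hash probe:
if 12 in by_id:
    print(by_id[12].id)  # 12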
I think:
#!/bin/python
bar in dict(Foo)
Is what you are thinking of. When trying to see if a certain key exists within a dictionary in Python (Python's version of a hash table), there are two ways to check: the first is the has_key() method attached to the dictionary, and the second is the example given above. Both return a boolean value.
That should answer your question.
And now, a little off topic, to tie this in to the list comprehension answer previously given (for a bit more clarity): list comprehensions construct a list from a basic for loop with modifiers. As an example (to clarify slightly), here is a way to use the in dict language construct in a list comprehension:
Say you have a two dimensional dictionary foo and you only want the second dimension dictionaries which contain the key bar. A relatively straightforward way to do so would be to use a list comprehension with a conditional as follows:
#!/bin/python
baz = dict([(key, value) for key, value in foo.items() if bar in value])
Note the if bar in value at the end of the statement; this is a modifying clause which tells the list comprehension to only keep those key-value pairs which meet the conditional. In this case baz is a new dictionary which contains only the inner dictionaries from foo which contain bar. (Hopefully I didn't miss anything in that code example... you may have to take a look at the list comprehension documentation found in the docs.python.org tutorials and at secnetix.de; both sites are good references if you have questions in the future.)
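A concrete run of that comprehension, with made-up data, to make the filtering visible:

foo = {
    'first':  {'bar': 1, 'x': 2},
    'second': {'y': 3},
}
bar = 'bar'

baz = dict([(key, value) for key, value in foo.items() if bar in value])
print(baz)  # {'first': {'bar': 1, 'x': 2}}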