How clever is the python interpreter?

Probably a stupid question, but I am wondering in general, and if anyone knows, how much foresight the Python interpreter has, specifically in the field of regular expressions and text parsing.
Suppose my code at some point looks like this:
mylist = ['a', 'b', 'c', ... ]
if 'g' in mylist: print(mylist.index('g'))
Is there any safer way to do this, with a while loop or similar? I mean, will the index be looked up with a second pass from the beginning, or are the two g's (in the above line) the same thing in Python's mind?

It'll do the lookup both times. If it's worth it (for, say, a very big list), use try:
try:
    print(mylist.index('g'))
except ValueError:
    pass

The result of the containment check is not cached, and so the index will need to be discovered anew. And the dynamic nature of Python makes implicit caching of such a thing unreliable since the __contains__() method may mutate the object (although it would be a violation of several programming principles to do so).

Your code will result in two lookups, first to determine if 'g' is in the list and second to find the index. Python won't try to consolidate them into a single lookup. If you're worried about efficiency you can use a dictionary instead of a list which will make both lookups O(1) instead of O(n).
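If you want to see the two scans for yourself, one option is to count equality comparisons. This is only a sketch: Counted is a hypothetical wrapper class invented for the demonstration, and the exact counts assume the target sits at the end of a seven-item list.

```python
class Counted:
    """Hypothetical wrapper that counts how often it is compared."""
    calls = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.calls += 1
        return self.value == getattr(other, "value", other)

    def __hash__(self):
        return hash(self.value)


mylist = [Counted(c) for c in "abcdefg"]
target = Counted("g")  # a distinct object, so only == can match it

# LBYL: membership test first, then a separate index lookup.
Counted.calls = 0
if target in mylist:
    position = mylist.index(target)
lbyl_calls = Counted.calls

# EAFP: a single index lookup guarded by try/except.
Counted.calls = 0
try:
    position = mylist.index(target)
except ValueError:
    position = None
eafp_calls = Counted.calls

print(lbyl_calls, eafp_calls)  # the LBYL version compares twice as often
```

Nothing is remembered between the in test and the index() call, so the list is walked once for each.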

You can easily make a dict to look up. Something like this:
mydict = {k:v for v,k in enumerate(mylist)}
The overhead of creating the dict won't be worthwhile unless you are doing at least a few such lookups on the same list.
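One caveat with the comprehension above: if the list contains duplicates, later positions overwrite earlier ones, whereas list.index() reports the first occurrence. A minimal sketch of an index that keeps the first position instead (the sample list and the name first_index are made up for illustration):

```python
mylist = ['a', 'b', 'c', 'g', 'b']  # sample data with a duplicate 'b'

# The comprehension keeps the LAST position of each duplicate.
last_index = {k: v for v, k in enumerate(mylist)}

# setdefault() only stores a key the first time it is seen,
# which matches what list.index() would return.
first_index = {}
for position, item in enumerate(mylist):
    first_index.setdefault(item, position)

print(last_index['b'], first_index['b'])
print(first_index.get('z'))  # None instead of a ValueError
```

Either dict has to be rebuilt whenever the list changes, which is part of the overhead mentioned above.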

try is the better option for finding the index of an element in a list.
try:
    print(mylist.index('g'))
except ValueError:
    print("value not in list")

Yes, it will be looked up twice; the Python interpreter doesn't cache the results of instructions, though I've been wondering whether that would be possible (for certain things). If this is an issue, you can use sets or dicts, both of which have constant lookup time.
Either way, it seems you are doing LBYL (look before you leap); in Python we tend towards EAFP (easier to ask forgiveness than permission), so it's quite common to wrap such things in try ... except blocks.

Related

Find if any element of list exists in another list in Python

The question is about a quicker, i.e. more Pythonic, way to test whether any of the elements of an iterable exists in another iterable.
What I am trying to do is something like:
if "foo" in terms or "bar" in terms or "baz" in terms:
    pass
But this repeats the 'in terms' clause and bloats the code, especially when we are dealing with many more elements. So I wondered whether there is a better way to do this in Python.

You could also consider in your special case if it is possible to use sets instead of iterables. If both (foobarbaz and terms) are sets, then you can just write
if foobarbaz & terms:
    pass
This isn't particularly faster than your way, but it is smaller code by far and thus probably better for reading.
And of course, not all iterables can be converted to sets, so it depends on your usecase.

Figured out a way, posting here for easy reference. Try this:
if any(term in terms for term in ("foo", "bar", "baz")):
    pass

This is faster than Alfe's answer, since it only tests for an intersection rather than computing it:
if not set(terms).isdisjoint({'foo', 'bar', 'baz'}):
    pass
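For comparison, here are the three approaches from the answers side by side. This is a sketch with made-up sample data; the names terms and wanted are assumptions, and the set-based variants require hashable elements.

```python
terms = ["spam", "bar", "eggs"]   # sample data, assumed
wanted = {"foo", "bar", "baz"}

# any() stops at the first hit; each test is a hash lookup in the set.
with_any = any(term in wanted for term in terms)

# & builds the whole intersection before it is tested for emptiness.
with_intersection = bool(wanted & set(terms))

# isdisjoint() accepts any iterable and builds no result set.
with_isdisjoint = not wanted.isdisjoint(terms)

print(with_any, with_intersection, with_isdisjoint)
```

All three give the same answer; they differ only in how much work is done before that answer is known.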

More elegant / pythonic way to append to an array, or create it

For some reason the following code felt a bit cumbersome to me, given all the syntactic sugar I keep finding in Python, so I thought I'd enquire if there's a better way:
pictures = list_of_random_pictures()
invalid_pictures = {}
for picture in pictures:
    if picture_invalid(picture):
        if not invalid_pictures.get(picture.album.id):
            invalid_pictures[picture.album.id] = []
        invalid_pictures[picture.album.id].append(picture)
So just to clarify, I'm wondering if there's a more readable way to take care of the last 3 lines above. As I'm repeating invalid_pictures[picture.album.id] 3 times, it just seems unnecessary if it's at all avoidable.
EDIT: Just realized my code above will KeyError, so I've altered it to handle that.

There is, in fact, an object in the standard library to deal exactly with that case: collections.defaultdict
It takes as its argument either a type or function which produces the default value for a missing key.
from collections import defaultdict
invalid_pictures = defaultdict(list)
for picture in pictures:
    if picture_invalid(picture):
        invalid_pictures[picture.album.id].append(picture)

Although I do find Joel's answer to be the usual solution, there's an often overlooked feature that is sometimes preferable: dict.setdefault(). It fits when a default value isn't particularly desired, i.e. when inserting the default is something of an exception and collections.defaultdict is suboptimal. Your code, adapted to use setdefault(), would be:
pictures = list_of_random_pictures()
invalid_pictures = {}
for picture in pictures:
    if picture_invalid(picture):
        invalid_pictures.setdefault(picture.album.id, []).append(picture)
This has the downside of creating a lot of (probably unused) lists, though, so in the particular case of setting the default to a list, it's still often better to use a defaultdict and then copy it to a regular dict to strip off the default value. This doesn't particularly apply to immutable types, such as int or str.
For example, PEP 333 requires that the environ variable be a regular dict: not a subclass or an instance of collections.Mapping, but a real, plain ol' dict. Another case is when you aren't so much passing through an iterator and applying it to a flat mapping, but each key needs special handling, such as when applying defaults to a configuration.
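To make the "copy it to a regular dict" step concrete, here is a small sketch. The (album_id, file_name) pairs are stand-in data invented for the example; they replace the picture objects from the question.

```python
from collections import defaultdict

# Stand-in data: (album_id, file_name) pairs, assumed for illustration.
invalid = [(1, "a.jpg"), (2, "b.jpg"), (1, "c.jpg")]

# defaultdict creates the list on first access to a missing key.
grouped = defaultdict(list)
for album_id, name in invalid:
    grouped[album_id].append(name)

# Copying to a plain dict strips the default: missing keys raise KeyError again.
plain = dict(grouped)

# The setdefault() variant builds the same mapping in a plain dict,
# at the cost of constructing a throwaway [] on every call.
with_setdefault = {}
for album_id, name in invalid:
    with_setdefault.setdefault(album_id, []).append(name)

print(plain == with_setdefault)
```

The copy matters because merely reading a missing key from a defaultdict inserts it, which is rarely what a consumer of the finished mapping expects.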

Difference between two "contains" operations for python lists

I'm fairly new to python and have found that I need to query a list about whether it contains a certain item.
The majority of the postings I have seen on various websites (including this similar stackoverflow question) have all suggested something along the lines of
for i in list:
    if i == thingIAmLookingFor:
        return True
However, I have also found from one lone forum that
if thingIAmLookingFor in list:
    # do work
works.
I am wondering if the if thing in list method is shorthand for the for i in list method, or if it is implemented differently.
I would also like to know which, if either, is preferred.

In your simple example it is of course better to use in.
However... in the question you link to, in doesn't work (at least not directly) because the OP does not want to find an object that is equal to something, but an object whose attribute n is equal to something.
One answer does mention using in on a list comprehension, though I'm not sure why a generator expression wasn't used instead:
if 5 in (data.n for data in myList):
    print("Found it")
But this is hardly much of an improvement over the other approaches, such as this one using any:
if any(data.n == 5 for data in myList):
    print("Found it")

The "if x in thing:" format is strongly preferred, not just because it takes less code, but also because it works on other data types and is (to me) easier to read.
I'm not sure how it's implemented, but I'd expect it to be quite a lot more efficient on datatypes that are stored in a more searchable form, e.g. sets or dictionary keys.

The if thing in somelist form is the preferred and fastest way.
Under the hood, that use of the in operator translates to somelist.__contains__(thing), whose implementation is equivalent to: any((x is thing or x == thing) for x in somelist).
Note that the condition tests identity first and then equality.
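The identity-before-equality order is visible with a value that is not equal to itself, such as NaN. A short sketch (the variable names are just for illustration):

```python
nan = float("nan")
somelist = [nan, 1.0]

same_object = nan in somelist            # identity matches, so == is never needed
other_object = float("nan") in somelist  # a new NaN object: identity and == both fail
by_equality_only = any(x == nan for x in somelist)

print(nan == nan)        # False: NaN never compares equal, even to itself
print(same_object)       # True
print(other_object)      # False
print(by_equality_only)  # False: a hand-written == loop disagrees with `in` here
```

So a for loop that uses only == is not always an exact substitute for in.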

for i in list:
    if i == thingIAmLookingFor:
        return True
The above is a terrible way to test whether an item exists in a collection. It returns True from the function, so if you need the test as part of some code you'd need to move this into a separate utility function, or add thingWasFound = False before the loop and set it to True in the if statement (and then break), either of which is several lines of boilerplate for what could be a simple expression.
Plus, if you just use thingIAmLookingFor in list, this might execute more efficiently by doing fewer Python-level operations (it'll need to do the same operations, but maybe in C, as list is a builtin type). But even more importantly, if list is actually bound to some other collection like a set or a dictionary, thingIAmLookingFor in list will use the hash lookup mechanism such types support and be much more efficient, while using a for loop will force Python to go through every item in turn.
Obligatory post-script: list is a terrible name for a variable that contains a list as it shadows the list builtin, which can confuse you or anyone who reads your code. You're much better off naming it something that tells you something about what it means.

List Comprehensions: References to the Components

In sum: I need to write a list comprehension in which I refer to the list that is being created by that list comprehension.
This might not be something you need to do every day, but I don't think it's unusual either.
Maybe there's no answer here. Still, please don't tell me I ought to use a for loop. That might be correct, but it's not helpful. The reason is the problem domain: this line of code is part of an ETL module, so performance is relevant, and so is the need to avoid creating a temporary container, hence my wish to code this step in a list comprehension. If a for loop would work for me here, I would just code one.
In any event, I am unable to write this particular list comprehension. The reason: the expression I need to write has this form:
[ some_function(s) for s in raw_data if s not in this_list ]
In that pseudo-code, "this_list" refers to the list created by evaluating that list comprehension. And that's why I'm stuck: this_list isn't built until my list comprehension is evaluated, and because it isn't yet built by the time I need to refer to it, I don't know how to refer to it.
What I have considered so far (and which might be based on one or more false assumptions, though I don't know exactly where):
Doesn't the Python interpreter have to give this list-under-construction a name? I think so.
That temporary name is probably taken from some bound method used to build my list ('sum'?).
But even if I went to the trouble to find that bound method, and assuming that it is indeed the temporary name used by the Python interpreter to refer to the list while it is under construction, I am pretty sure you can't refer to bound methods directly; I'm not aware of such an explicit rule, but those methods (at least the few that I've actually looked at) are not valid Python syntax. I'm guessing one reason why is so that we do not write them into our code.
So that's the chain of my so-called reasoning, which has led me to conclude, or at least guess, that I have coded myself into a corner. Still, I thought I ought to verify this with the Community before turning around and going in a different direction.

There used to be a way to do this using the undocumented fact that while the list was being built, it could be reached through a hidden local variable named _[1], as locals()['_[1]'].__self__. However, that quit working in Python 2.7 (maybe earlier, I wasn't paying close attention).
You can do what you want in a single list comprehension if you set up an external data structure first. Since all your pseudo code seemed to be doing with this_list was checking it to see if each s was already in it -- i.e. a membership test -- I've changed it into a set named seen as an optimization (checking for membership in a list can be very slow if the list is large). Here's what I mean:
raw_data = [c for c in 'abcdaebfc']
seen = set()
def some_function(s):
    seen.add(s)
    return s
print([some_function(s) for s in raw_data if s not in seen])
# ['a', 'b', 'c', 'd', 'e', 'f']
If you don't have access to some_function, you could put a call to it in your own wrapper function that added its return value to the seen set before returning it.
Even though it wouldn't be a list comprehension, I'd encapsulate the whole thing in a function to make reuse easier:
def some_function(s):
    # do something with or to 's'...
    return s
def add_unique(function, data):
    result = []
    seen = set(result) # init to empty set
    for s in data:
        if s not in seen:
            t = function(s)
            result.append(t)
            seen.add(t)
    return result
print(add_unique(some_function, raw_data))
# ['a', 'b', 'c', 'd', 'e', 'f']
In either case, I find it odd that the list being built in your pseudo code that you want to reference isn't comprised of a subset of raw_data values, but rather the result of calling some_function on each of them -- i.e. transformed data -- which naturally makes one wonder what some_function does such that its return value might match an existing raw_data item's value.

I don't see why you need to do this in one go. Either iterate through the initial data first to eliminate duplicates - or, even better, convert it to a set as KennyTM suggests - then do your list comprehension.
Note that even if you could reference the "list under construction", your approach would still fail because s is not in the list anyway - the result of some_function(s) is.

As far as I know, there is no way to access a list comprehension as it's being built.
As KennyTM mentioned (and if the order of the entries is not relevant), then you can use a set instead. If you're on Python 2.7/3.1 and above, you even get set comprehensions:
{ some_function(s) for s in raw_data }
Otherwise, a for loop isn't that bad either (although it will scale terribly)
l = []
for s in raw_data:
    item = some_function(s)
    if item not in l:
        l.append(item)

Why don't you simply do:
[ some_function(s) for s in set(raw_data) ]
That should do what you are asking for, except when you need to preserve the order of the original list.
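If order does matter, one alternative that none of the answers above use is dict.fromkeys(), which keeps the first occurrence of each key in insertion order on Python 3.7 and later. A sketch, assuming the results are hashable; the upper-casing some_function is a placeholder invented for the example:

```python
def some_function(s):
    # Placeholder transformation, assumed for illustration.
    return s.upper()

raw_data = list("abcdaebfc")

# dict keys are unique and (since Python 3.7) remember insertion order,
# so this drops repeated results while keeping the first-seen order.
unique = list(dict.fromkeys(some_function(s) for s in raw_data))

print(unique)  # ['A', 'B', 'C', 'D', 'E', 'F']
```

Note that this deduplicates the transformed values and calls some_function on every input item, so it does more work than the seen-set versions when the function is expensive.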

A better way to assign list into a var

Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know whether, after splitting the line into a list, there is any better way to assign each list item to a var. I have close to 30 lines assigning index values to vars. Just trying to learn more about Python, that's all...

done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print(data['done'])

What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]

I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, and initialize an instance thereof, then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
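A minimal sketch of the namedtuple idea, assuming every field is an integer. The type name Stats, the field names and the sample line are all made up for illustration; only the done|remaining|200's|404's|size layout comes from the question.

```python
from collections import namedtuple

# Field names loosely follow the question's format comment (assumed names).
Stats = namedtuple("Stats", ["done", "remaining", "ok_200", "not_found_404", "size"])

line = "12|30|10|2|4096\n"  # made-up sample line in the stats format

# Unpacking into Stats raises TypeError if the field count does not match.
stats = Stats(*(int(field) for field in line.strip().split("|")))

print(stats.done, stats.remaining, stats.size)
```

Each value is then reached as stats.done, stats.size and so on, and adding a 31st field means touching one list of names rather than another assignment line.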

Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in values]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
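For reference, the dict comprehension version might look like the sketch below. The five keys are the ones named in the question; the values list is a made-up sample standing in for the split stats line.

```python
keys = ("done", "rema", "succ", "fails", "size")
values = "12|30|10|2|4096".split("|")  # made-up sample line

# zip() pairs each key with its field; the comprehension converts as it goes.
somedict = {key: int(value) for key, value in zip(keys, values)}

print(somedict["done"], somedict["size"])
```

Be aware that zip() silently stops at the shorter of the two sequences, so a line with missing fields yields a smaller dict rather than an error.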

Thanks for all the answers. So here's the summary -
1. Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3] etc.
2. Leoluk's approach was shorter and sweeter, taking advantage of Python's immensely powerful dict comprehensions.
3. Alex's answer was more design oriented. Loved this approach. I know it should be done the way Alex suggested, but a lot of code refactoring needs to take place for that. Not a good time to do it now.
4. townsean - same as 2
I have taken up Leoluk's approach. I am not sure what the speed implication of this is; I have no idea if list/dict comprehensions take a hit on speed of execution. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by - "Premature optimization is the root of all evil"...
