python koan - simplifying this two-liner into one

I want to turn these two lines of code into one.
for n in exceptions:
    my_dict[n] += 1
It's bothering me that such a small statement to increment a dictionary value takes two lines. I'm sure this has bothered someone else too.

for n in exceptions: my_dict[n] += 1
...but there is nothing wrong with the fact that the statement takes two lines. There are Python coding guidelines (PEP8), and they strongly encourage code readability. By putting those statements on a single line, you lower the readability.
From PEP8:
While sometimes it's okay to put an if/for/while with a small body on
the same line, never do this for multi-clause statements. Also avoid
folding such long lines!

my_dict.update({n: my_dict[n] + 1 for n in exceptions})
This essentially uses dict.update to update the dictionary with values from another dictionary, and that other dictionary is built using a dictionary comprehension. Unfortunately, to get the my_dict[n] += 1 effect, we have to read the value explicitly again in the expression.
But I would argue about whether it is really better than using a simple and clear for-loop for this.
On the other hand, as you seem to be just counting exceptions, you might want to consider using a Counter for your dictionary. It's essentially an improved dictionary (meaning that you can use it just like a normal dictionary), but it comes with a few features that make it perfect for counting things. For example, you could simplify the above statement to just this:
my_dict.update({n: 1 for n in exceptions})
Or as you are just adding 1 for each, you can just pass the iterable exceptions directly:
my_dict.update(exceptions)
This also has the benefit that exceptions which do not exist in my_dict yet are automatically initialized. To create a Counter from your existing dictionary, you can just pass the original dictionary to the Counter constructor:
from collections import Counter
my_dict = Counter(my_dict)

Use a collections.Counter:
import collections
my_dict = collections.Counter(exceptions)
or, if my_dict is already defined to be a dict,
my_dict = collections.Counter(my_dict)
my_dict.update(exceptions)
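For illustration, a minimal runnable sketch of that approach (the starting dictionary and exceptions list below are made up):

import collections

# made-up data for illustration
my_dict = {'ValueError': 2, 'KeyError': 1}
exceptions = ['ValueError', 'TypeError']

my_dict = collections.Counter(my_dict)   # wrap the existing dict in a Counter
my_dict.update(exceptions)               # each occurrence in the iterable adds 1

print(my_dict['ValueError'])  # 3
print(my_dict['TypeError'])   # 1 (missing keys are initialized automatically)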


Choose one key arbitrarily in a dictionary without iteration [duplicate]

I just want to make sure that in Python dictionaries there's no way to get just a key (with no specific quality or relation to a certain value) without doing iteration. As far as I can tell, you have to make a list of them by going through the whole dictionary in a loop. Something like this:
list_keys = [k for k in dic.keys()]
The thing is, I just need an arbitrary key if the dictionary is not empty, and I don't care about the rest. I guess iterating over a long dictionary, in addition to creating a long list, just to get one arbitrary key is a whole lot of overhead, isn't it?
Is there a better trick somebody can point out?
Thanks
A lot of the answers here produce a random key but the original question asked for an arbitrary key. There's quite a difference between those two. Randomness has a handful of mathematical/statistical guarantees.
Python dictionaries are not ordered in any meaningful way. So, yes, accessing an arbitrary key requires iteration. But for a single arbitrary key, we do not need to iterate the entire dictionary. The built-in functions next and iter are useful here:
key = next(iter(mapping))
The iter built-in creates an iterator over the keys in the mapping. The iteration order will be arbitrary. The next built-in returns the first item from the iterator. Iterating the whole mapping is not necessary for an arbitrary key.
If you're going to end up deleting the key from the mapping, you may instead use dict.popitem. Here's the docstring:
D.popitem() -> (k, v), remove and return some (key, value) pair as a 2-tuple;
but raise KeyError if D is empty.
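A quick sketch of both built-ins on a made-up dictionary:

mapping = {'a': 1, 'b': 2, 'c': 3}   # made-up dictionary for illustration

key = next(iter(mapping))            # an arbitrary key, without iterating the whole dict
pair = mapping.popitem()             # an arbitrary (key, value) pair, removed from the dict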
You can use random.choice
rand_key = random.choice(dict.keys())
This will only work in Python 2.x; in Python 3.x, dict.keys returns a view object, so you'll have to cast it to a list -
rand_key = random.choice(list(dict.keys()))
So, for example -
import random
d = {'rand1':'hey there', 'rand2':'you love python, I know!', 'rand3' : 'python has a method for everything!'}
random.choice(list(d.keys()))
Output -
rand1
You are correct: there is no way to get a random key from an ordinary dict without using iteration. Even solutions like random.choice must iterate through the dictionary in the background.
However you could use a sorted dict:
from sortedcontainers import SortedDict as sd
d = sd(dic)
i = random.randrange(len(d))
ran_key = d.iloc[i]
More here: http://www.grantjenks.com/docs/sortedcontainers/sorteddict.html
Note that whether or not using something like SortedDict will result in any efficiency gains is going to be entirely dependent upon the actual implementation. If you are creating a lot of SD objects, or adding new keys very often (which have to be sorted), and are only getting a random key occasionally in relation to those other two tasks, you are unlikely to see much of a performance gain.
How about something like this:
import random
arbitrary_key = random.choice(dic.keys())
BTW, your use of a list comprehension there really makes no sense; in Python 2 the two are identical:
dic.keys() == [k for k in dic.keys()]
Check the length of the dictionary like this; this should do it:
import random

if len(yourdict) > 0:
    randomKey = random.sample(yourdict, 1)
    print randomKey[0]
else:
    pass  # do something for the empty-dictionary case
random.sample returns a list; since we passed 1, it returns a list with one key, and we then get the key itself with randomKey[0].

Filter massive linux log files

I have the user enter the items they would like to filter out into a list. From there it filters using:
while knownIssuesCounter != len(newLogFile):
    for line in knownIssues:
        if line in newLogFile[knownIssuesCounter]:
            if line not in issuesFound:
                issuesFoundCounter[line] = 1
                issuesFound.append(line)
                issuesFound.append(knownIssues[line])
            else:
                issuesFoundCounter[line] = issuesFoundCounter[line] + 1
    knownIssuesCounter += 1
I'm running into hundred-megabyte log files and it is taking FOREVER...
Is there a better way to be doing this with Python?
Try changing issuesFound from a list to a set:
issuesFound = set()
and use add instead of append:
issuesFound.add(line)
A big part of the reason your code is so slow is the check if line not in issuesFound:. This requires a linear search through a huge list.
You can fix that by adding a set of seen issues (which is effectively free to search). That reduces your time from O(NM) to O(N).
But really, you can make this even simpler by removing the if entirely.
First, you can generate the issuesFound list after the fact from the keys of issuesFoundCounter. For each line in issuesFoundCounter, you want that line, and then its knownIssues[line]. So:
issuesFound = list(flatten((line, knownIssues[line]) for line in issuesFoundCounter))
(I'm using the flatten recipe from the itertools docs. You can copy that into your code, or you can just write this with itertools.chain.from_iterable instead of flatten.)
And that means you can just search if line not in issuesFoundCounter: instead of in issuesFound:, which is already a dict (and therefore effectively free to search). But if you just use setdefault—or, even simpler, use a defaultdict or a Counter instead of a dict—you can make that automatic.
So, if issuesFoundCounter is a Counter, the whole thing reduces to this:
for newLogLine in newLogFile:
    for line in knownIssues:
        if line in newLogLine:
            issuesFoundCounter[line] += 1
And you can turn that into a generator expression, eliminating the slow-ish explicit looping in Python with faster looping inside the interpreter guts. This is only going to be, say, a fixed 5:1 speedup, as opposed to the linear-to-constant speedup from the first half, but it's still worth considering:
issuesFoundCounter = collections.Counter(line
                                         for newLogLine in newLogFile
                                         for line in knownIssues
                                         if line in newLogLine)
The only problem with this is that the issuesFound list is now in arbitrary order, instead of in the order that the issues are found. If that's important, just use an OrderedCounter instead of a Counter. There's a simple recipe in the collections docs, but for your case, it can be as simple as:
class OrderedCounter(Counter, OrderedDict):
    pass
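Putting the pieces of this answer together, a minimal runnable sketch (the log lines and known-issue mapping below are made up; knownIssues is assumed, as in the question, to map an issue string to its description, and chain.from_iterable stands in for the flatten recipe):

import collections
from itertools import chain

# made-up stand-ins for the question's data
newLogFile = ["ERROR: disk full on /dev/sda1", "WARN: retry", "ERROR: disk full on /dev/sda1"]
knownIssues = {"disk full": "Ticket-123: clean up /var/log"}

issuesFoundCounter = collections.Counter(line
                                         for newLogLine in newLogFile
                                         for line in knownIssues
                                         if line in newLogLine)

# rebuild issuesFound afterwards from the counter's keys
issuesFound = list(chain.from_iterable((line, knownIssues[line])
                                       for line in issuesFoundCounter))

print(issuesFoundCounter)  # Counter({'disk full': 2})
print(issuesFound)         # ['disk full', 'Ticket-123: clean up /var/log']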

How clever is the python interpreter?

Probably a stupid question, but I am wondering in general, and if anyone knows, how much foresight the Python interpreter has, specifically in the field of regular expressions and text parsing.
Suppose my code at some point looks like this:
mylist = ['a', 'b', 'c', ... ]
if 'g' in mylist: print(mylist.index('g'))
Is there any safer way to do this, with a while loop or similar? I mean, will the index be looked up with a second pass from the beginning, or are the two 'g's (in the above line) the same thing in Python's mind?
It'll do the lookup both times. If it's worth it (for, say, a very big list), use try:
try:
    print(mylist.index('g'))
except ValueError:
    pass
The result of the containment check is not cached, and so the index will need to be discovered anew. And the dynamic nature of Python makes implicit caching of such a thing unreliable since the __contains__() method may mutate the object (although it would be a violation of several programming principles to do so).
Your code will result in two lookups, first to determine if 'g' is in the list and second to find the index. Python won't try to consolidate them into a single lookup. If you're worried about efficiency you can use a dictionary instead of a list which will make both lookups O(1) instead of O(n).
You can easily make a dict to look up. Something like this:
mydict = {k:v for v,k in enumerate(mylist)}
The overhead of creating the dict won't be worthwhile unless you are doing several such lookups on the same list.
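For example (the lookup dict maps each item to its index; the list is made up for illustration):

mylist = ['a', 'b', 'g', 'z']                  # made-up list
mydict = {k: v for v, k in enumerate(mylist)}  # item -> index

index = mydict.get('g')                        # single O(1) lookup, None if missing
if index is not None:
    print(index)                               # 2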
A try block is a better option for finding the index of an element in a list:
try:
    print(mylist.index('g'))
except ValueError:
    print("value not in list")
Yeah, it will be looked up twice; the Python interpreter doesn't cache instructions, though I've been wondering whether it's possible (for certain things). If this is an issue, you can use sets or dicts, both of which have constant lookup time.
Either way, it seems you are doing LBYL (look before you leap); in Python we tend to prefer EAFP (easier to ask forgiveness than permission), so it's quite common to wrap such things in try ... except blocks.

Difference between two "contains" operations for python lists

I'm fairly new to python and have found that I need to query a list about whether it contains a certain item.
The majority of the postings I have seen on various websites (including this similar stackoverflow question) have all suggested something along the lines of
for i in list:
    if i == thingIAmLookingFor:
        return True
However, I have also found from one lone forum that
if thingIAmLookingFor in list:
    # do work
works.
I am wondering if the if thing in list method is shorthand for the for i in list method, or if it is implemented differently.
I would also like to know which, if either, is preferred.
In your simple example it is of course better to use in.
However... in the question you link to, in doesn't work (at least not directly) because the OP does not want to find an object that is equal to something, but an object whose attribute n is equal to something.
One answer does mention using in on a list comprehension, though I'm not sure why a generator expression wasn't used instead:
if 5 in (data.n for data in myList):
    print "Found it"
But this is hardly much of an improvement over the other approaches, such as this one using any:
if any(data.n == 5 for data in myList):
    print "Found it"
the "if x in thing:" format is strongly preferred, not just because it takes less code, but it also works on other data types and is (to me) easier to read.
I'm not sure how it's implemented, but I'd expect it to be quite a lot more efficient on datatypes that are stored in a more searchable form. eg. sets or dictionary keys.
The if thing in somelist form is the preferred and fastest way.
Under-the-hood that use of the in-operator translates to somelist.__contains__(thing) whose implementation is equivalent to: any((x is thing or x == thing) for x in somelist).
Note the condition tests identity and then equality.
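A quick sketch of that equivalence, with a made-up list:

somelist = ['a', 'b', 'c']   # made-up list for illustration
thing = 'b'

print(thing in somelist)                                  # True
print(somelist.__contains__(thing))                       # True
print(any((x is thing or x == thing) for x in somelist))  # True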
for i in list:
    if i == thingIAmLookingFor:
        return True
The above is a terrible way to test whether an item exists in a collection. It returns True from the function, so if you need the test as part of some code you'd need to move this into a separate utility function, or add thingWasFound = False before the loop and set it to True in the if statement (and then break), either of which is several lines of boilerplate for what could be a simple expression.
Plus, if you just use thingIAmLookingFor in list, this might execute more efficiently by doing fewer Python-level operations (it'll need to do the same operations, but maybe in C, as list is a builtin type). But even more importantly, if list is actually bound to some other collection like a set or a dictionary, thingIAmLookingFor in list will use the hash lookup mechanism such types support and be much more efficient, while using a for loop will force Python to go through every item in turn.
Obligatory post-script: list is a terrible name for a variable that contains a list as it shadows the list builtin, which can confuse you or anyone who reads your code. You're much better off naming it something that tells you something about what it means.

A better way to assign list into a var

Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know if, after splitting the line into a list, there is any better way to assign each list item to a variable. I have close to 30 lines assigning index values to vars. Just trying to learn more about Python, that's it...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print data['done']
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a delimiter of '|' (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython versions, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, initialize an instance of it, and then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
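A minimal sketch of that suggestion, assuming STATS_FILE and the single-line, '|'-separated format from the question (the field names are the question's own variable names):

import csv
from collections import namedtuple

Stats = namedtuple('Stats', 'done rema succ fails size')

# 'with' ensures the file is closed; csv handles the '|' delimiter
with open(STATS_FILE) as f:
    row = next(csv.reader(f, delimiter='|'))

stats = Stats(*(int(x) for x in row))
print(stats.done, stats.rema)   # qualified names instead of 30 barenames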
Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in values]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
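For reference, the dict-comprehension version that edit alludes to could be as simple as this (same keys and values as above):

somedict = {k: int(v) for k, v in zip(keys, values)}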
Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3], etc.
Leoluk's approach was more short & sweet taking advantage of python's immensely powerful dict comprehensions.
Alex's answer was more design-oriented. Loved this approach. I know it should be done the way Alex suggested, but a lot of code refactoring needs to take place for that. Not a good time to do it now.
townsean - same as 2
I have taken up Leoluk's approach. I am not sure what the speed implications of this are; I have no idea whether list/dict comprehensions take a hit on speed of execution. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by "Premature optimization is the root of all evil"...
