Is overwriting variable names for lengthy operations bad style? - python

I quite often find myself taking several steps to get from my input data to the output I want, e.g. inside functions or loops. To avoid making my lines very long, I sometimes reuse the same variable name across these steps.
One example would be:
df_2 = df_1.loc[df_1['id'] == val]
df_2 = df_2[['c1', 'c2']]
df_2 = df_2.merge(df3, left_on='c1', right_on='c1')
The only alternative I can come up with is:
df_2 = df_1.loc[df_1['id'] == val][['c1', 'c2']]\
    .merge(df3, left_on='c1', right_on='c1')
But none of these options feels really clean. How should these situations be handled?

You can refer to this article, which discusses exactly your question.
The pandas core team now encourages the use of "method chaining".
This is a style of programming in which you chain together multiple
method calls into a single statement. This allows you to pass
intermediate results from one method to the next rather than storing
the intermediate results using variables.
In addition to prettifying the chained code with brackets and indentation, as in #perl's answer, you might also find methods like .query() and .assign() useful for coding in a "method chaining" style.
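As a sketch of what that looks like for the example in the question (same df_1, df3, and val; the @val syntax lets .query() refer to the local variable):
df_2 = (
    df_1
    .query('id == @val')       # row filter, in place of .loc[df_1['id'] == val]
    [['c1', 'c2']]             # column selection
    .merge(df3, on='c1')       # join on the shared column
)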
Of course, there are some drawbacks to method chaining, especially when it's excessive:
"One drawback to excessively long chains is that debugging can be
harder. If something looks wrong at the end, you don't have
intermediate values to inspect."

Just as another option, you can put everything in brackets and then break the lines, like this:
df_2 = (df_1
        .loc[df_1['id'] == val][['c1', 'c2']]
        .merge(df3, left_on='c1', right_on='c1'))
It's generally pretty readable even if you have a lot of lines, and if you want to change the name of the output variable, you only need to change it in one place. So it's a bit less verbose, and a bit easier to change than overwriting the variables.

Related

Optimizing code that contains several for loops and if-else statements in Python

I read that
lis2 = map(str.strip, lis1)
is faster and better written than
lis2 = []
for z in lis1:
    lis2.append(z.strip())
Now, I have the following code:
for item in sel:
    name = item.text
    songs = []
    for song in item.find_next_siblings('div', class_="listalbum-item"):
        if song.find_previous_sibling('div', class_='album') == item:
            if 'www.somesite.com/lyrics' in song.find('a')['href']:
                songs.append([song.text, song.find('a')['href']])
            else:
                songs.append([song.text, 'https://www.somesite.com/' + song.find('a')['href'][3:]])
    album[name] = songs
How can I apply the concept above to that piece of code? To be honest, the first question should be: is it necessary, and is it really possible to optimize that? But anyway, any advice?
Thanks in advance!
You should consider the real difference between the two fragments:
lis2 = map(str.strip, lis1)
And:
lis2 = []
for z in lis1:
    lis2.append(z.strip())
It's fair to say that, to many programmers, the first is more clearly written. After all, it literally says what is being done, not how it's done. Whenever you can do that, it's the better choice, as long as you're not sacrificing anything that's important (like, possibly, speed).
Another way of improving the readability would be to use better names and follow PEP8 coding conventions:
list2 = map(str.strip, list1)
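One caveat worth adding: in Python 3, map returns a lazy iterator rather than a list, so wrap it in list() if an actual list is needed:
list2 = list(map(str.strip, list1))    # Python 3: map is lazy
list2 = [s.strip() for s in list1]     # the equivalent list comprehension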
But whether it's 'better' or 'faster' really depends. The real difference is that the map example leaves it up to the implementation of map to decide how the list is constructed. Inside map, you could find code that just uses a for loop and list.append() as well, but it could also be some really complicated but very efficient code that does it faster or with fewer resources.
The reason people say using something like map is better (and why it's often faster) is that you're leaving the work up to a library function that has been specifically written to keep the implementation details away from you and others that read your code, and to allow you to benefit from improvements to that function over time.
It's quite possible though that, for your specific case, with your specific data, you know how to write something that performs even better. So, if speed is the objective, you may be able to improve on map for your case.
What the 'best' code is for you, depends on how you rank your code. Does readability (for others and your future self) get trumped by speed?
In your case, since you call an external library to search through an HTML document, it's very likely that this will cost orders of magnitude more time than a simple for loop in Python; that much is actually pretty clear.
Have you looked at ways of doing that work more optimally? For example, this bit:
for song in item.find_next_siblings('div', class_="listalbum-item"):
    if song.find_previous_sibling('div', class_='album') == item:
find_next_siblings finds all following siblings of item that are a listalbum-item (apparently, a song), and for each of those you check whether the closest preceding sibling that is an album is item itself.
In other words, it appears that you're trying to loop over all the songs of the album, but do you expect songs in sel that are not on the album? If not, you should be able to optimise this code quite a bit - but it's hard to help improve the code without knowing what the content looks like.
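That said, if albums and their songs really are consecutive siblings (which the backward lookup suggests), one possible sketch is to walk forward and stop at the next album, instead of re-checking the preceding album for every song:
songs = []
for sib in item.find_next_siblings('div'):
    classes = sib.get('class', [])
    if 'album' in classes:            # reached the next album header: stop
        break
    if 'listalbum-item' in classes:   # a song belonging to `item`
        songs.append(sib)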
This can be done as a list comprehension, and will be faster as a list comprehension ... but is certainly not necessarily clearer as a list comprehension:
[[song.text,
  (song.find('a')['href']
   if 'www.somesite.com/lyrics' in song.find('a')['href']
   else 'https://www.somesite.com/' + song.find('a')['href'][3:])]
 for song in item.find_next_siblings('div', class_="listalbum-item")]

How can I make this code more DRY

From web2py example 33 we see:
db.purchase.insert(buyer_id=form.vars.buyer_id,
                   product_id=form.vars.product_id,
                   quantity=form.vars.quantity)
But I think there should be some way to make this less repetitive. Perhaps this?
db.purchase.insert(**{k: getattr(form.vars, k) for k in "buyer_id product_id quantity".split()})
For me, DRY means 1) not repeating actual code, and 2) (and more important) not repeating information; i.e. there should be one place that each item of information exists.
In this instance, you're really just repeating a pattern, and I think that's fine. The second example is much harder to read; why complicate it just to save a few characters?
You could avoid repeating form.vars:
vars = form.vars
db.purchase.insert(
    buyer_id=vars.buyer_id,
    product_id=vars.product_id,
    quantity=vars.quantity)
There is still some repetition, but I think it's better to leave it with some repetition rather than making your code hard to read.
If those three fields are all there is, you could do
db.purchase.insert(**form.vars)
Otherwise, I think the original code is plenty DRY.
But I guess you could do
to_insert = {"buyer_id": form.vars.buyer_id,
             "product_id": form.vars.product_id,
             "quantity": form.vars.quantity}
db.purchase.insert(**to_insert)
which is similar to your second example but more readable and simple (some of the tenets of Python).
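If you do want the comprehension route from the question, pulling the field list onto its own line keeps it readable (web2py's form.vars behaves like a dict, so item access should work here):
fields = ("buyer_id", "product_id", "quantity")
db.purchase.insert(**{k: form.vars[k] for k in fields})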

Difference between two "contains" operations for python lists

I'm fairly new to python and have found that I need to query a list about whether it contains a certain item.
The majority of the postings I have seen on various websites (including this similar stackoverflow question) have all suggested something along the lines of
for i in list:
    if i == thingIAmLookingFor:
        return True
However, I have also found from one lone forum that
if thingIAmLookingFor in list:
    # do work
works.
I am wondering if the if thing in list method is shorthand for the for i in list method, or if it is implemented differently.
I would also like to know which, if either, is preferred.
In your simple example it is of course better to use in.
However... in the question you link to, in doesn't work (at least not directly) because the OP does not want to find an object that is equal to something, but an object whose attribute n is equal to something.
One answer does mention using in on a list comprehension, though I'm not sure why a generator expression wasn't used instead:
if 5 in (data.n for data in myList):
    print "Found it"
But this is hardly much of an improvement over the other approaches, such as this one using any:
if any(data.n == 5 for data in myList):
    print "Found it"
the "if x in thing:" format is strongly preferred, not just because it takes less code, but it also works on other data types and is (to me) easier to read.
I'm not sure how it's implemented, but I'd expect it to be quite a lot more efficient on datatypes that are stored in a more searchable form, e.g. sets or dictionary keys.
The if thing in somelist is the preferred and fastest way.
Under the hood, that use of the in operator translates to somelist.__contains__(thing), whose implementation is equivalent to: any((x is thing or x == thing) for x in somelist).
Note the condition tests identity and then equality.
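A small demonstration of why the identity check comes first, using NaN, which compares unequal to itself:
nan = float('nan')
print(nan == nan)      # False: NaN is not equal to itself
print(nan in [nan])    # True: `in` matches on identity (x is thing) first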
for i in list:
    if i == thingIAmLookingFor:
        return True
The above is a terrible way to test whether an item exists in a collection. It returns True from the function, so if you need the test as part of some code you'd need to move this into a separate utility function, or add thingWasFound = False before the loop and set it to True in the if statement (and then break), either of which is several lines of boilerplate for what could be a simple expression.
Plus, if you just use thingIAmLookingFor in list, this might execute more efficiently by doing fewer Python-level operations (it'll need to do the same operations, but maybe in C, as list is a builtin type). But even more importantly, if list is actually bound to some other collection like a set or a dictionary, thingIAmLookingFor in list will use the hash lookup mechanism such types support and be much more efficient, while using a for loop will force Python to go through every item in turn.
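To make that concrete, a quick sketch with illustrative names:
names = ["ada", "bob", "cid"]    # list: `x in names` scans every element, O(n)
name_set = set(names)            # set: `x in name_set` is a hash lookup, O(1) on average
print("bob" in name_set)         # True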
Obligatory post-script: list is a terrible name for a variable that contains a list as it shadows the list builtin, which can confuse you or anyone who reads your code. You're much better off naming it something that tells you something about what it means.

A better way to assign list values into variables

Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know: after splitting the line into a list, is there any better way to assign each element to a variable? I have close to 30 lines assigning index values to vars. Just trying to learn more about Python, that's all...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print data['done']
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, initialize an instance thereof, and then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
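A minimal sketch of that suggestion, assuming the STATS_FILE and field names from the question (only the first five of the 30 fields are shown, for illustration):
import collections

Stats = collections.namedtuple('Stats', ['done', 'rema', 'succ', 'fails', 'size'])

# the `with` statement closes the file for us, avoiding the descriptor leak mentioned above
with open(STATS_FILE) as f:
    fields = [int(x) for x in f.read().strip().split('|')]

stats = Stats(*fields[:5])    # just the first five fields, for illustration
print(stats.done)             # qualified names instead of dozens of barenames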
Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in values]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3], etc.
Leoluk's approach was more short & sweet taking advantage of python's immensely powerful dict comprehensions.
Alex's answer was more design oriented. Loved this approach. I know it should be done the way Alex suggested but lot of code re-factoring needs to take place for that. Not a good time to do it now.
townsean - same as 2
I have taken up Leoluk's approach. I am not sure what the speed implication of this is; I have no idea if list/dict comprehensions take a hit on speed of execution. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by - "Premature optimization is the root of all evil"...

How should I organise my functions with pyparsing?

I am parsing a file with Python and pyparsing (it's the report file for PSAT in Matlab, but that isn't important). Here is what I have so far. I think it's a mess and would like some advice on how to improve it. Specifically, how should I organise my grammar definitions with pyparsing?
Should I have all my grammar definitions in one function? If so, it's going to be one huge function. If not, how do I break it up? At the moment I have split it at the sections of the file. Is it worth making loads of functions that only ever get called once from one place? Neither really feels right to me.
Should I place all my input and output code in a separate file from the other class functions? It would make the purpose of the class much clearer.
I'm also interested to know if there is an easier way to parse a file, do some sanity checks and store the data in a class. I seem to spend a lot of my time doing this.
(I will accept answers of it's good enough or use X rather than pyparsing if people agree)
I could go either way on using a single big method to create your parser vs. taking it in steps the way you have it now.
I can see that you have defined some useful helper utilities, such as slit ("suppress Literal", I presume), stringtolits, and decimaltable. This looks good to me.
I like that you are using results names, they really improve the robustness of your post-parsing code. I would recommend using the shortcut form that was added in pyparsing 1.4.7, in which you can replace
busname.setResultsName("bus1")
with
busname("bus1")
This can declutter your code quite a bit.
I would look back through your parse actions to see where you are using numeric indexes to access individual tokens, and go back and assign results names instead. Here is one case, where GetStats returns (ngroup + sgroup).setParseAction(self.process_stats). process_stats has references like:
self.num_load = tokens[0]["loads"]
self.num_generator = tokens[0]["generators"]
self.num_transformer = tokens[0]["transformers"]
self.num_line = tokens[0]["lines"]
self.num_bus = tokens[0]["buses"]
self.power_rate = tokens[1]["rate"]
I like that you have Group'ed the values and the stats, but go ahead and give them names, like "network" and "soln". Then you could write this parse action code as (I've also converted to the - to me - easier-to-read object-attribute notation instead of dict element notation):
self.num_load = tokens.network.loads
self.num_generator = tokens.network.generators
self.num_transformer = tokens.network.transformers
self.num_line = tokens.network.lines
self.num_bus = tokens.network.buses
self.power_rate = tokens.soln.rate
Also, a style question: why do you sometimes use the explicit And constructor, instead of using the '+' operator?
busdef = And([busname.setResultsName("bus1"),
              busname.setResultsName("bus2"),
              integer.setResultsName("linenum"),
              decimaltable("pf qf pl ql".split())])
This is just as easily written:
busdef = (busname("bus1") + busname("bus2") +
          integer("linenum") +
          decimaltable("pf qf pl ql".split()))
Overall, I think this is about par for a file of this complexity. I have a similar format (proprietary, unfortunately, so cannot be shared) in which I built the code in pieces similar to the way you have, but in one large method, something like this:
def parser():
    header = Group(...)
    inputsummary = Group(...)
    jobstats = Group(...)
    measurements = Group(...)
    return header("hdr") + inputsummary("inputs") + jobstats("stats") + measurements("meas")
The Group constructs are especially helpful in a large parser like this, to establish a sort of namespace for results names within each section of the parsed data.
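As a tiny, self-contained illustration of that namespacing effect (the two-field grammar here is hypothetical, not taken from the PSAT report):
from pyparsing import Group, Word, alphas, nums

word = Word(alphas)
integer = Word(nums)

# Group plus a results name gives the section its own namespace
section = Group(word("kind") + integer("count"))("network")

result = section.parseString("buses 14")
print(result.network.kind)     # 'buses'
print(result.network.count)    # '14' (still a string unless converted)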
