Why should I convert all strings to constants in Python?

Why should I convert all strings to constants in Python? - python

I don't know if this will be useful to the community or not, as it might be unique to my situation. I'm working with a senior programmer who, in his code, has this peculiar habit of turning all strings into constants before using them. And I just don't get why. It doesn't make any sense to me. 99% of the time, we are gaining no abstraction or expressive power from the conversion, as it's done like this:
URL_CONVERTER = "url_converter"
URL_TYPE_LONG = "url_type_long"
URL_TYPE_SHORT = "url_type_short"
URL_TYPE_ARRAY = [URL_TYPE_LONG, URL_TYPE_SHORT]
for urltype in URL_TYPE_ARRAY:
outside_class.validate(urltype)
Just like that. As a rule, the constant name is almost always simply the string's content, capitalized, and these constants are seldom referenced more than once anyway. Perhaps less than 5% of the constants thus created are referenced twice or more during runtime.
Is this some programming technique that I just don't understand? Or is it just a bad habit? The other programmers are beginning to mimic this (possibly bad) form, and I want to know if there is a reason for me to as well before blindly following.
Thanks!
Edit: Updated the example. In addition, I understand everyone's points, and would add that this is a fairly small shop, at most two other people will ever see anyone's code, never mind work on it, and these are pretty simple one-offs we're building, not complicated workers or anything. I understand why this would be good practice in a large project, but in the context of our work, it comes across as too much overhead for a very simple task.

It helps to keep your code congruent. E.g. if you use URL_TYPE_LONG in both your client and your server and for some reason you need to change its string value, you just change one line. And you don't run the risk of forgetting to change one instance in the code, or to change one string in your code which just hazardly has the same value.
Even if those constants are only referenced once now, who are we to foresee the future...
I think this also arises from a time (when dinosaurs roamed the earth) when you tried to (A) keep data and code seperated and (B) you were concerned about how many strings you had allocated.

Below are a few scenarios where this would be a good practice:
You have a long string that will be used in a lot of places. Thus, you put it in a (presumably shorter) variable name so that you can use it easily. This keeps the lines of your code from becoming overly long/repetitive.
(somewhat similar to #1) You have a long sting that can't fit on a certain line without sending it way off the screen. So, you put the string in a variable to keep the lines concise.
You want to save a string and alter (add/remove characters from) it latter on. Only a variable will give you this functionality.
You want to have the ability to change multiple lines that use the same string by just altering one variable's value. This is a lot easier than having to go through numerous lines looking for occurrences of the string. It also keeps you from possibly missing some and thereby introducing bugs to your code.
Basically, the answer is to be smart about it. Ask yourself: will making the string into a variable save typing? Will it improve efficiency? Will it make your code easier to work with and/or maintain? If you answered "yes" to any of these, then you should do it.

Doing this is a really good idea.. I work on a fairly large python codebase with 100+ other engineers, and vouch for the fact that this makes collaboration much easier.
If you directly used the underlying strings everywhere, it would make it easier for you to make a typo when referencing it in one particular module and could lead to hard-to-catch bugs.
Its easier for modern IDE's to provide autocomplete and refactoring support when you are using a variable like this. You can easily change the underlying identifier to a different string or even a number later; This makes it easier for you to track down all modules referencing a particular identifier.

Related

Abbreviating several functions - guidelines? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
This is probably a really dumb question but I will ask anyway.
There are two ways to present this code:
file = "picture.jpg"
pic = makePicture(file)
show(pic)
or
show(makePicture("picture.jpg"))
This is just an example of how it can be abbreviated and I have seen it with other functions. But it always confuses me when I need to do it. Is there any rule of thumb when combining functions like this? It seems to me you work backwards picking out the functions as you go and ending with either the file or the function that chooses the file (i.e pickAFile()). Does this sound right?
Please keep explanations simple enough that a smart dog could understand.

Chiming in, because I think that style does matter. I would definitely pick show(makePicture("picture.jpg")) if you don't ever reuse "picture.jpg" and makePicture(…). The reason are that:
This is perfectly legible.
This makes the code faster to read (no need to spend more time than needed on it).
If you use variables, you are sending a signal to people reading the code (including you, after some time) that the variables are reused somewhere in the code and that they should better be put in their working (short-term) memory. Our short-term memory is limited (in the 1960s, experiments have shown that one remembers about 7 pieces of information at a time, and some modern experiments came up with lower numbers). So, if the variables are not reused anywhere, they often should be removed so as to not pollute the reader's short-term memory.
I think that your question is very valid and that you should definitely not use intermediate variables here unless they are necessary (because they are reused, or because they help break a complex expression in directly intelligible parts). This practice will make your code more legible and will give you good habits.
PS: As noted by Blender, having many nested function calls can make the code hard to read. If this is the case, I do recommend considering using intermediate variables to hold pieces of information that make sense, so that the function calls do not contain too many levels of nesting.
PPS: As noted by pcurry, nested function calls can also be easily broken down into many lines, if they become too long, which can make the code about as legible as if using intermediate variables, with the benefit of not using any:
print_summary(
energy=solar_panel.energy_produced(time_of_the_day),
losses=solar_panel.loss_ratio(),
output_path="/tmp/out.txt"
)

When you write:
pic = makePicture(file)
You call makePicture with file as its argument and put the output of that function into the variable pic. If all you do with pic is use it as an argument to show, you don't really need to use pic at all. It's just a temporary variable. Your second example does just that and passes the output of makePicture(file) directly as the first argument to show, without using a temporary variable like pic.
Now, if you're using pic somewhere else, there's really no way to get around using it. If you don't reuse the temporary variables, pick whatever way you like. Just make sure it's readable.

It's all at the discretion of the programmer, if you're planning on making a larger program you might want to keep the statements separate so you can refer back to the file.
Readability is always important if you're working with a team of programmers but if this is just something you're doing by yourself, do whatever's most comfortable.

show(makePicture("picture.jpg")) is more readable than the longer version for reasons discussed in other answers. I have also found that trying to eliminate intermediate variables very often results in a better solution. However, there are cases where a descriptive naming of a complex intermediate result will make code more readable.

Reverse Engineer Auto-Generated C?

How easy is it to reverse engineer an auto-generated C code? I am working on a Python project and as part of my work, am using Cython to compile the code for speedup purposes.
This does help in terms of speed, yet, I am concerned that where I work, some people would try to "peek" into the code and figure out what it does.
Cython code is basically an auto-generated C. Is it very hard to reverse engineer it?
Are there any recommendations that would make the code safer and reverse-engineering harder to do? (I assume that with enough effort, everything can be reversed engineered).

Okay -- to attempt to answer your question more directly: most auto-generated C code is fairly ugly, so somebody would need to be fairly motivated to reverse engineer it. At the same time, I don't believe I've never looked at what Cython generates, so I'm not sure how it looks.
In addition, a lot of auto-generated code is done in the form of things like state machine tables, that most programmers find fairly difficult to follow even at best. The tendency (in many cases) is to have a generic framework, with tables of data that the framework more or less "interprets" at run-time. This isn't necessarily impossible to follow, but it's enough different from most typical code that most people will give up on it fairly quickly (and if they do much, they'll typically waste a lot of time looking at the framework instead of the data, which is what really matters in cases like this).
I'll repeat, however, that I'm pretty sure I haven't looked at what Cython produces, so I can't say much about it with any real certainty.
There are (or at least used to be) commercial obfuscators intended to make C source code difficult to understand. I suspect the availability of Perl has taken a lot of the market share from them, but if you look you may still be able to find one and use it.
Absent that, it's not terribly difficult to write an obfuscator of your own, but the degree of effectiveness will probably vary with the amount of effort you're willing to put into it. Just systematically renaming any meaningful variable names into things like _ and __ can do quite a bit (e.g., profit = sales - costs; is a lot more meaningful than _ = _I_ - _i_;). Depending on the machine generated code in question, however, this may not really accomplish much -- obfuscating a generic framework may not make much difference in understanding what your code does -- and if they figure out the procedure you're following, they may be able to simply replicate the correct framework code and transplant the pieces specific to your program into the un-obfuscated framework.

You should really take a look at the code that Cython produces. To help with debugging, for example, it copies the complete Python source code into the generated file, marking each source line before generating C code for it. That makes it very easy to find the code section that you are interested in.
A very nice feature is that you can compile your code with the "-a" (annotate) option, and it will spit out an HTML file next to the C file that contains the annotated Python code. When you click on a line, you'll see the C code for that line. As a bonus, it marks lines that do a lot of Python processing in dark yellow, so that you get a simple indicator where to look for potential optimisations.
There's also special gdb support in Cython now, so you can do Cython source level debugging etc.

Ah, I guess I missed the bit that you were talking about the compiled module, whereas I was only referring to the source code that Cython generates. I agree with Jerry that it will be fairly tricky to extract something useful from the compiled module, as long as you keep the gdb support disabled (the default) and strip the debugging symbols. That is because the C compiler will do lots of inlining of helper functions all over the place and apply various low-level code optimisations, thus making it harder to extract the original macro level code patterns. However, you will see named C-API calls to CPython, and you will also see function names from your own code. Cython isn't specifically designed for code obfuscation, quite the opposite. But readable assembly has certainly never been a design goal.

Check if something is a list

What is the easiest way to check if something is a list?
A method doSomething has the parameters a and b. In the method, it will loop through the list a and do something. I'd like a way to make sure a is a list, before looping through - thus avoiding an error or the unfortunate circumstance of passing in a string then getting back a letter from each loop.
This question must have been asked before - however my googles failed me. Cheers.

To enable more usecases, but still treat strings as scalars, don't check for a being a list, check that it isn't a string:
if not isinstance(a, basestring):
...

Typechecking hurts the generality, simplicity, and maintainability of your code. It is seldom used in good, idiomatic Python programs.
There are two main reasons people want to typecheck:
To issue errors if the caller provides the wrong type.
This is not worth your time. If the user provides an incompatible type for the operation you are performing, an error will already be raised when the compatibility is hit. It is worrisome that this might not happen immediately, but it typically doesn't take long at all and results in code that is more robust, simple, efficient, and easier to write.
Oftentimes people insist on this with the hope they can catch all the dumb things a user can do. If a user is willing to do arbitrarily dumb things, there is nothing you can do to stop him. Typechecking mainly has the potential of keeping a user who comes in with his own types that are drop-in replacements for the ones replaced or when the user recognizes that your function should actually be polymorphic and provides something different that can accept the same operation.
If I had a big system where lots of things made by lots of people should fit together right, I would use a system like zope.interface to make testing that everything fits together right.
To do different things based on the types of the arguments received.
This makes your code worse because your API is inconsistent. A function or method should do one thing, not fundamentally different things. This ends up being a feature not usually worth supporting.
One common scenario is to have an argument that can either be a foo or a list of foos. A cleaner solution is simply to accept a list of foos. Your code is simpler and more consistent. If it's an important, common use case only to have one foo, you can consider having another convenience method/function that calls the one that accepts a list of foos and lose nothing. Providing the first API would not only have been more complicated and less consistent, but it would break when the types were not the exact values expected; in Python we distinguish between objects based on their capabilities, not their actual types. It's almost always better to accept an arbitrary iterable or a sequence instead of a list and anything that works like a foo instead of requiring a foo in particular.
As you can tell, I do not think either reason is compelling enough to typecheck under normal circumstances.

I'd like a way to make sure a is a list, before looping through
Document the function.

Usually it's considered not a good style to perform type-check in Python, but try
if isinstance(a, list):
...
(I think you may also check if a.__iter__ exists.)

Python - Things I shouldn't be doing?

I've got a few questions about best practices in Python. Not too long ago I would do something like this with my code:
...
junk_block = "".join(open("foo.txt","rb").read().split())
...
I don't do this anymore because I can see that it makes code harder to read, but would the code run slower if I split the statements up like so:
f_obj = open("foo.txt", "rb")
f_data = f_obj.read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
I also noticed that there's nothing keeping you from doing an 'import' within a function block, is there any reason why I should do that?

As long as you're inside a function (not at module top level), assigning intermediate results to local barenames has an essentially-negligible cost (at module top level, assigning to the "local" barenames implies churning on a dict -- the module's __dict__ -- and is measurably costlier than it would be within a function; the remedy is never to have "substantial" code at module top level... always stash substantial code within a function!-).
Python's general philosophy includes "flat is better than nested" -- and that includes highly "nested" expressions. Looking at your original example...:
junk_block = "".join(open("foo.txt","rb").read().split())
presents another important issues: when is that file getting closed? In CPython today, you need not worry -- reference counting in practice does ensure timely closure. But most other Python implementations (Jython on the JVM, IronPython on .NET, PyPy on all sorts of backends, pynie on Parrot, Unladen Swallow on LLVM if and when it matures per its published roadmap, ...) do not guarantee the use of reference counting -- many garbage collection strategies may be involved, with all sort of other advantages.
Without any guarantee of reference counting (and even in CPython it's always been deemed an implementation artifact, not part of the language semantics!), you might be exhausting resources, by executing such "open but no close" code in a tight loop -- garbage collection is triggered by scarcity of memory, and does not consider other limited resources such as file descriptors. Since 2.6 (and 2.5, with an "import from the future"), Python has a great solution via the RAII ("resource acquisition is initialization") approach supported by the with statement:
with open("foo.txt","rb") as f:
junk_block = "".join(f.read().split())
is the least-"unnested" way that will ensure timely closure of the file across all compliant versions of Python. The stronger semantics make it preferable.
Beyond ensuring the correct, and prudent;-), semantics, there's not that much to choose between nested and flattened versions of an expression such as this. Given the task "remove all runs of whitespace from the file's contents", I would be tempted to benchmark alternative approaches based on re and on the .translate method of strings (the latter, esp. in Python 2.*, is often the fastest way to delete all characters from a certain set!), before settling on the "split and rejoin" approach if it proves to be faster -- but that's really a rather different issue;-).

First of all, there's not really a reason you shouldn't use the first example - it'd quite readable in that it's concise about what it does. No reason to break it up since it's just a linear combination of calls.
Second, import within a function block is useful if there's a particular library function that you only need within that function - since the scope of an imported symbol is only the block within which it is imported, if you only ever use something once, you can just import it where you need it and not have to worry about name conflicts in other functions. This is especially handy with from X import Y statements, since Y won't be qualified by its containing module name and thus might conflict with a similarly named function in a different module being used elsewhere.

from PEP 8 (which is worth reading anyway)
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants

That line has the same result as this:
junk_block = open("foo.txt","rb").read().replace(' ', '')
In your example you are splitting the words of the text into a list of words, and then you are joining them back together with no spaces. The above example instead uses the str.replace() method.
The differences:
Yours builds a file object into memory, builds a string into memory by reading it, builds a list into memory by splitting the string, builds a new string by joining the list.
Mine builds a file object into memory, builds a string into memory by reading it, builds a new string into memory by replacing spaces.
You can see a bit less RAM is used in the new variation but more processor is used. RAM is more valuable in some cases and so memory waste is frowned upon when it can be avoided.
Most of the memory will be garbage collected immediately but multiple users at the same time will hog RAM.

If you want to know if your second code fragment is slower, the quick way to find out would be to just use timeit. I wouldn't expect there to be that much difference though, since they seem pretty equivalent.
You should also ask if a performance difference actually matters in the code in question. Often readability is of more value than performance.
I can't think of any good reasons for importing a module in a function, but sometimes you just don't know you'll need to do something until you see the problem. I'll have to leave it to others to point out a constructive example of that, if it exists.

I think the two codes are readable. I (and that's just a question of personal style) will probably use the first, adding a coment line, something like: "Open the file and convert the data inside into a list"
Also, there are times when I use the second, maybe not so separated, but something like
f_data = open("foo.txt", "rb").read()
f_data_list = f_data.split()
junk_block = "".join(f_data_list)
But then I'm giving more entity to each operation, which could be important in the flow of the code. I think it's important you are confortable and don't think that the code is difficult to understand in the future.
Definitly, the code will not be (at least, much) slower, as the only "overload" you're making is to asing the results to values.

Encapsulation severely hurts performance?

I know this question is kind of stupid, maybe it's a just a part of writing code but it seems defining simple functions can really hurt performance severely... I've tried this simple test:
def make_legal_foo_string(x):
return "This is a foo string: " + str(x)
def sum_up_to(x):
return x*(x+1)/2
def foo(x):
return [make_legal_foo_string(x),sum_up_to(x),x+1]
def bar(x):
return ''.join([str(foo(x))," -- bar !! "])
it's very good style and makes code clear but it can be three times as slow as just writing it literally. It's inescapable for functions that can have side effects but it's actually almost trivial to define some functions that just should literally be replaced with lines of code every time they appear, translate the source code into that and only then compile. Same I think for magic numbers, it doesn't take a lot of time to read from memory but if they're not supposed to be changed then why not just replace every instance of 'magic' with a literal before the code compiles?

Function call overheads are not big; you won't normally notice them. You only see them in this case because your actual code (x*x) is itself so completely trivial. In any real program that does real work, the amount of time spent in function-calling overhead will be negligably small.
(Not that I'd really recommend using foo, identity and square in the example, in any case; they're so trivial it's just as readable to have them inline and they don't really encapsulate or abstract anything.)
if they're not supposed to be changed then why not just replace every instance of 'magic' with a literal before the code compiles?
Because programs are written for to be easy for you to read and maintain. You could replace constants with their literal values, but it'd make the program harder to work with, for a benefit that is so tiny you'll probably never even be able to measure it: the height of premature optimisation.

Encapsulation is about one thing and one thing only: Readability. If you're really so worried about performance that you're willing to start stripping out encapsulated logic, you may as well just start coding in assembly.
Encapsulation also assists in debugging and feature adding. Consider the following: lets say you have a simple game and need to add code that depletes the players health under some circumstances. Easy, yes?
def DamagePlayer(dmg):
player.health -= dmg;
This IS very trivial code, so it's very tempting to simply scatter "player.health -=" everywhere. But what if later you want to add a powerup that halves damage done to the player while active? If the logic is still encapsulated, it's easy:
def DamagePlayer(dmg):
if player.hasCoolPowerUp:
player.health -= dmg / 2
else
player.health -= dmg
Now, consider if you had neglected to encapulate that bit of logic because of it's simplicity. Now you're looking at coding the same logic into 50 different places, at least one of which you are almost certain to forget, which leads to weird bugs like: "When player has powerup all damage is halved except when hit by AlienSheep enemies..."
Do you want to have problems with Alien Sheep? I don't think so. :)
In all seriousness, the point I'm trying to make is that encapsulation is a very good thing in the right circumstances. Of course, it's also easy to over-encapsulate, which can be just as problematic. Also, there are situations where the speed really truly does matter (though they are rare), and that extra few clock cycles is worth it. About the only way to find the right balance is practice. Don't shun encapsulation because it's slower, though. The benefits usually far outweigh the costs.

I don't know how good python compilers are, but the answer to this question for many languages is that the compiler will optimize calls to small procedures / functions / methods by inlining them. In fact, in some language implementations you generally get better performance by NOT trying to "micro-optimize" the code yourself.

What you are talking about is the effect of inlining functions for gaining efficiency.
It is certainly true in your Python example, that encapsulation hurts performance. But there are some counter example to it:
In Java, defining getter&setter instead of defining public member variables does not result in performance degradation as the JIT inline the getters&setters.
sometimes calling a function repetedly may be better than performing inlining as the code executed may then fit in the cache. Inlining may cause code explosion...

Figuring out what to make into a function and what to just include inline is something of an art. Many factors (performance, readability, maintainability) feed into the equation.
I actually find your example kind of silly in many ways - a function that just returns it's argument? Unless it's an overload that's changing the rules, it's stupid. A function to square things? Again, why bother. Your function 'foo' should probably return a string, so that it can be used directly:
''.join(foo(x)," -- bar !! "])
That's probably a more correct level of encapsulation in this example.
As I say, it really depends on the circumstances. Unfortunately, this is the sort of thing that doesn't lend itself well to examples.

IMO, this is related to Function Call Costs. Which are negligible usually, but not zero. Splitting the code in a lot of very small functions may hurt. Especially in interpreted languages where full optimization is not available.
Inlining functions may improve performance but it may also deteriorate. See, for example, C++ FQA Lite for explanations (“Inlining can make code faster by eliminating function call overhead, or slower by generating too much code, causing instruction cache misses”). This is not C++ specific. Better leave optimizations for compiler/interpreter unless they are really necessary.
By the way, I don't see a huge difference between two versions:
$ python bench.py
fine-grained function decomposition: 5.46632194519
one-liner: 4.46827578545
$ python --version
Python 2.5.2
I think this result is acceptable. See bench.py in the pastebin.

There is a performance hit for using functions, since there is the overhead of jumping to a new address, pushing registers on the stack, and returning at the end. This overhead is very small, however, and even in peformance critial systems worrying about such overhead is most likely premature optimization.
Many languages avoid these issues in small frequenty called functions by using inlining, which is essentially what you do above.
Python does not do inlining. The closest thing you can do is use macros to replace the function calls.
This sort of performance issue is better served by another language, if you need the sort of speed gained by inlining (mostly marginal, and sometimes detremental) then you need to consider not using python for whatever you are working on.

There is a good technical reason why what you suggested is impossible. In Python, functions, constants, and everything else are accessible at runtime and can be changed at any time if necessary; they could also be changed by outside modules. This is an explicit promise of Python and one would need some extremely important reasons to break it.
For example, here's the common logging idiom:
# beginning of the file xxx.py
log = lambda *x: None
def something():
...
log(...)
...
(here log does nothing), and then at some other module or at the interactive prompt:
import xxx
xxx.log = print
xxx.something()
As you see, here log is modified by completely different module --- or by the user --- so that logging now works. That would be impossible if log was optimized away.
Similarly, if an exception was to happen in make_legal_foo_string (this is possible, e.g. if x.__str__() is broken and returns None) you'd be hit with a source quote from a wrong line and even perhaps from a wrong file in your scenario.
There are some tools that in fact apply some optimizations to Python code, but I don't think of the kind you suggested.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.