This may be a subjective question, so I understand if it gets shut down, but it's something I've been wondering about ever since I started to learn Python in a more serious way.
Is there a generally accepted 'best practice' about whether importing an additional module to accomplish a task more cleanly is better than avoiding the call and 'working around it'?
For example, I had some feedback on a script I worked on recently, and the suggestion was that I could have replaced the code below with a glob.glob() call. I avoided this at the time because it meant adding another import that seemed unnecessary to me (and the actual flow of filtering the lines just meshed with my thought process for the task).
headers = []
with open(hhresult_file) as result_fasta:
    for line in result_fasta:
        if line.startswith(">"):
            line = line.split("_")[0]
            headers.append(line.replace(">", ""))
Similarly, I decided to use an os.rename() call later in the script for moving some files rather than import shutil.
Is there a right answer here? Are there any overheads associated with importing additional modules and creating more dependencies (let's say, for instance, that the module wasn't a built-in Python module) vs. writing slightly 'messier' code using modules that are already in your script?
This is quite a broad question, but I'll try to answer it succinctly.
There is no real best practice; however, it is generally a good idea to reuse code that others have already written. If you find a bug in imported code, that's more beneficial than finding one in your own code, because you can submit a ticket to the author and have it fixed for a potentially large group of people.
There are certainly considerations to be made when making additional imports, mostly when they are not part of the Python standard library.
Sometimes adding a package that is a little too 'magical' makes code harder to understand, because it's another library or file that somebody has to look up to see what is going on, versus just a few lines that might not be as sophisticated as the third-party library but get the job done regardless.
If you can get away with not making additional imports, you probably should, but if it would save you substantial amounts of time and headache, it's probably worth importing something that has been pre-written to deal with the problem you're facing.
It's a continual consideration that has to be made.
I know there are dozens of similar questions about Python imports, so I'll try to phrase it a bit differently. I'll spare you the details of the days of desperation that lie behind me, and instead approach the issue from a more general point of view: What does it take to make imports of the form from package.module.submodule import SomeClass work in Python 3.6 in 2018?
When I look into random modules of large Python projects like Django, Tensorflow or Twisted this seems to be the import pattern they all use, probably because it has the two benefits of a) making clear where an object comes from and b) keeping the actual invocation of that object short and clean.
I thought it couldn't be wrong to learn from these projects and tried to emulate this pattern in my own package – and, like others before me, I have run into the hell of circular imports.
Thankfully there are already a lot of posts on this topic, and their recommendations seem to fall into two broad categories:
1) Change the imports
A lot of suggestions go in the direction of using some other form of import syntax. Quite common seems to be the approach of avoiding from X.Y import Z statements and instead relying only on import X.Y, then writing X.Y.Z() in the code. Long story short, I have tried this and rewritten my entire package in that manner, and it didn't change anything. It also looks very unpythonic and ugly to me.
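To make that concrete, here's a minimal sketch of why the plain-import form can survive a circle that the from-import form cannot (a_mod and b_mod are hypothetical names, not from my project): the module name is bound at import time, but its attributes are only looked up at call time.

# a_mod.py (hypothetical) -- imports b_mod, which imports a_mod back
import b_mod   # binds the module object even if b_mod is only
               # part-way through its own import

def run():
    # the attribute lookup is deferred until call time, when both
    # modules have finished executing, so the circle does no harm here
    return b_mod.helper()

# b_mod.py (hypothetical counterpart):
# import a_mod
# def helper():
#     return "ok"

That said, this only helps when the circular reference happens inside functions; module-level references still break either way.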
2) Fix the circular dependencies
That is the second family of recommendations, and I agree entirely that it would be preferable to dodge this issue entirely by employing good design from the start. The problem is that I don't really have a coherent idea how to do this.
My project is certainly not the pinnacle of excellent software design, but it's also not spaghetti code. It's ten modules in a single package (no submodules), around 1000 lines of code in total. Each module has a single responsibility: one has the main function, one is a Flask instance, another one an API client, another one has all the data models, and so on. To me it seems pretty much unavoidable that the imports between these modules (not the classes themselves) sometimes end up being circular. I have of course tried to break these circles (they are usually four or five modules long), but that always leads to other circles, if not immediately, then five commits later.
Now, I'm just a second-year CS student and I'm entirely willing to accept that this might just be a point where I still have much, much more to learn. But as a start, it would be really helpful if someone could explain how this problem is dealt with in some of the larger projects mentioned above. Shouldn't they run into circular imports all the time? How do they prevent contributors from accidentally introducing a circle somewhere? Do they constantly refactor their codebase when they find out that an import isn't possible because it would lead to a circle? What's the general approach here?
I often include this, or something close to it, in Python scripts and IPython notebooks.
import cPickle

def unpickle(filename):
    # pickle data should be opened in binary mode ('rb'), especially
    # for protocols above 0 and on Windows
    with open(filename, 'rb') as f:
        obj = cPickle.load(f)
    return obj
This seems like a common enough use case that the standard library should provide a function that does the same thing. Is there such a function? If there isn't, how come?
Most of the serialization libraries in the stdlib and on PyPI have a similar API. I'm pretty sure it was marshal that set the standard,* and pickle, json, PyYAML, etc. have just followed in its footsteps.
So, the question is, why was marshal designed that way?
Well, you obviously need loads/dumps; you couldn't build those on top of a filename-based function, and to build them on top of a file-object-based function you'd need StringIO, which didn't come until later.
You don't necessarily need load/dump, because those could be built on top of loads/dumps—but doing so could have major performance implications: you can't save anything to the file until you've built the whole thing in memory, and vice-versa, which could be a problem for huge objects.
You definitely don't need a loadf/dumpf function based on filenames, because those can be built trivially on top of load/dump, with no performance implications, and no tricky considerations that a user is likely to get wrong.
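For illustration, a minimal sketch of what such hypothetical loadf/dumpf helpers would look like, built on pickle (the names are mine; no stdlib serialization module actually ships them):

import pickle

def loadf(filename):
    # a trivial filename-based wrapper over the file-object API
    with open(filename, 'rb') as f:
        return pickle.load(f)

def dumpf(obj, filename):
    with open(filename, 'wb') as f:
        pickle.dump(obj, f)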
On the one hand, it would be convenient to have them anyway—and there are some libraries, like ElementTree, that do have analogous functions. It may only save a few seconds and a few lines per project, but multiply that by thousands of projects…
On the other hand, it would make Python larger. Not so much the extra 1K to download and install it if you added these two functions to every module (although that did mean a lot more back in the 1.x days than nowadays…), but more to document, more to learn, more to remember. And of course more code to maintain—every time you need to fix a bug in marshal.dumpf you have to remember to go check pickle.dumpf and json.dumpf to make sure they don't need the change, and sometimes you won't remember.
Balancing those two considerations is really a judgment call, one someone made decades ago and one that probably nobody has revisited since. If you think there's a good case for changing it today, you can always post a feature request on the issue tracker or start a thread on python-ideas.
* Not in the original 1991 version of marshal.c; that just had load and dump. Guido added loads and dumps in 1993 as part of a change whose main description was "Add separate main program for the Mac: macmain.c". Presumably because something inside the Python interpreter needed to dump and load to strings.**
** marshal is used as the underpinnings for things like importing .pyc files. This also means (at least in CPython) that it's not just implemented in C, but statically built into the core of the interpreter itself. I think it actually could be turned into a regular module since the 3.4 import changes, but it definitely couldn't have been back in the early days. So, that's extra motivation to keep it small and simple.
How easy is it to reverse engineer an auto-generated C code? I am working on a Python project and as part of my work, am using Cython to compile the code for speedup purposes.
This does help in terms of speed, yet, I am concerned that where I work, some people would try to "peek" into the code and figure out what it does.
Cython code is basically auto-generated C. Is it very hard to reverse engineer?
Are there any recommendations that would make the code safer and reverse engineering harder to do? (I assume that with enough effort, everything can be reverse engineered.)
Okay -- to attempt to answer your question more directly: most auto-generated C code is fairly ugly, so somebody would need to be fairly motivated to reverse engineer it. At the same time, I don't believe I've ever looked at what Cython generates, so I'm not sure how it looks.
In addition, a lot of auto-generated code takes the form of things like state machine tables, which most programmers find fairly difficult to follow even at best. The tendency (in many cases) is to have a generic framework, with tables of data that the framework more or less "interprets" at run-time. This isn't necessarily impossible to follow, but it's different enough from most typical code that most people will give up on it fairly quickly (and if they do much, they'll typically waste a lot of time looking at the framework instead of the data, which is what really matters in cases like this).
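As a toy illustration of that style (entirely hypothetical, and not what Cython actually emits): the framework below is generic, and all of the real behaviour lives in the data table, which is exactly what makes such code tedious to reverse engineer.

# a toy table-driven recogniser: the framework is generic, and the
# behaviour lives entirely in the transition table
TRANSITIONS = {("start", "a"): "seen_a", ("seen_a", "b"): "accept"}

def run(tokens):
    state = "start"
    for t in tokens:
        # anything not in the table falls into a dead "reject" state
        state = TRANSITIONS.get((state, t), "reject")
    return state == "accept"

print(run("ab"))  # True
print(run("ba"))  # False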
I'll repeat, however, that I'm pretty sure I haven't looked at what Cython produces, so I can't say much about it with any real certainty.
There are (or at least used to be) commercial obfuscators intended to make C source code difficult to understand. I suspect the availability of Perl has taken a lot of the market share from them, but if you look you may still be able to find one and use it.
Absent that, it's not terribly difficult to write an obfuscator of your own, but the degree of effectiveness will probably vary with the amount of effort you're willing to put into it. Just systematically renaming any meaningful variable names into things like _ and __ can do quite a bit (e.g., profit = sales - costs; is a lot more meaningful than _ = _I_ - _i_;). Depending on the machine generated code in question, however, this may not really accomplish much -- obfuscating a generic framework may not make much difference in understanding what your code does -- and if they figure out the procedure you're following, they may be able to simply replicate the correct framework code and transplant the pieces specific to your program into the un-obfuscated framework.
You should really take a look at the code that Cython produces. To help with debugging, for example, it copies the complete Python source code into the generated file, marking each source line before generating C code for it. That makes it very easy to find the code section that you are interested in.
A very nice feature is that you can compile your code with the "-a" (annotate) option, and it will spit out an HTML file next to the C file that contains the annotated Python code. When you click on a line, you'll see the C code for that line. As a bonus, it marks lines that do a lot of Python processing in dark yellow, so that you get a simple indicator where to look for potential optimisations.
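If you build through setuptools rather than calling cython directly, the same annotation can be switched on programmatically -- a minimal sketch, assuming a module named mymodule.pyx:

from setuptools import setup
from Cython.Build import cythonize

setup(
    # annotate=True writes mymodule.html next to the generated C file,
    # with each Python line linked to the C code it produced
    ext_modules=cythonize("mymodule.pyx", annotate=True),
)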
There's also special gdb support in Cython now, so you can do Cython source level debugging etc.
Ah, I guess I missed the bit that you were talking about the compiled module, whereas I was only referring to the source code that Cython generates. I agree with Jerry that it will be fairly tricky to extract something useful from the compiled module, as long as you keep the gdb support disabled (the default) and strip the debugging symbols. That is because the C compiler will do lots of inlining of helper functions all over the place and apply various low-level code optimisations, thus making it harder to extract the original macro level code patterns. However, you will see named C-API calls to CPython, and you will also see function names from your own code. Cython isn't specifically designed for code obfuscation, quite the opposite. But readable assembly has certainly never been a design goal.
I have around 80 lines of a function in a file. I need the same functionality in another file so I am currently importing the other file for the function.
My question is: in terms of running time on a machine, which technique would be better -- importing the whole file and calling the function, or copying the function as-is and running it from the same package?
I know it won't matter much in this case, but I want to learn the principle: if we are making a large project, is it better to import a complete file in Python or just to add the function to the current namespace?
Importing is how you're supposed to do it. That's why it's possible. Performance is a complicated question, but in general it really doesn't matter. People who really, really need performance, and can't be satisfied by just fixing the basic algorithm, are not using Python in the first place. :) (At least not for the tiny part of the project where the performance really matters. ;) )
Importing is good because it helps you manage stuff easily. What if you needed the same function again? Instead of making changes in multiple places, there is just one centralized location: your module.
If the function is small and you won't need it anywhere else, put it in the file itself.
If it is complex and will need to be used again, separate it out and put it inside a module.
Performance should not be your concern here. It should hardly matter. And even if it does, ask yourself - does it matter to you?
Copy/Paste cannot be better. Importing affects load-time performance, not run-time (if you import it at the top-level).
The whole point of importing is to allow code reuse and organization.
Remember too that you can do either
import MyModule
to get the whole file or
from MyModule import MyFunction
for when you only need to reference that one part of the module.
If the two modules are unrelated except for that common function, you may wish to consider extracting that function (and maybe other things that are related to that function) into a third module.
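A minimal sketch of that extraction (the module and function names here are hypothetical):

# common.py -- the hypothetical third module holding the shared code
def shared_function(data):
    # the 80-line function now lives in exactly one place
    return sorted(data)

# both of the original files can then do:
# from common import shared_function
# result = shared_function([3, 1, 2])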
After asking about organising my Python project and then about calling from a parent file in Python, it's occurring to me that it would be so much easier to put all my code in one file (data will be read in externally).
I've always thought that this was bad project organisation but it seems to be the easiest way to deal with the problems I'm thinking I will face. Have I simply gotten the wrong end of the stick with file count or have I not seen some great guide on large (for me) projects?
If you are planning to use any kind of SCM then you are going to be screwed. Having one file is a guaranteed way to have lots of collisions and merges that will be painful to deal with over time.
Stick to conventions and break apart your files, if only to save the guy who will one day have to maintain your code...
If your code is going to work together all the time anyway, and isn't useful separately, there's nothing wrong with keeping everything in one file. I can think of at least one popular package (BeautifulSoup) that does this. It sure makes installation easier.
Of course, if it seems, down the road, that you could use part of your code with another project, or if maintenance starts to be an issue, then worry about organizing your project differently.
It seems to me from the questions you've been asking lately that you're worrying about all of this a bit prematurely. Often, for me, these sorts of issues are better tackled a little later on in the solution. Especially for smaller projects, my goal is to get a solution that is correct, and then optimal.
It's always a now-versus-then argument. If you're under the gun to get it done, do it. Source control will be a problem later; as with many things, there's no black-and-white answer. You need to be responsible both to your deadline and to the long-term maintenance of the code.
If that's the best way to organise it, you're probably doing something wrong.
If it's more than just a toy program or a simple script, then you should break it up into separate files, etc. It's the only sane way of doing it. When your project gets big enough that you need someone else helping on it, then it will make the SCM a whole bunch easier.
Additionally, sooner or later you are going to need to add a separate utility to your project, that is going to need some common code/structures. It's far easier to do this if you have separate source files than if you have just one big one.
Looking at your earlier questions I would say all code in one file would be a good intermediate state on the way to a complete refactoring of your project. To do this you'll need a regression test suite to make sure you don't break the project while refactoring it.
Once all your code is in one file, I suggest iterating on the following:
1) Identify a small group of interdependent classes.
2) Pull those classes into a separate file.
3) Add unit tests for the new separate file.
4) Retest the entire project.
Depending on the size of your project, it shouldn't take too many iterations for you to reach something reasonable.
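For step 3, a minimal regression-test skeleton might look like this (models.py and Record are hypothetical stand-ins for whatever you pull out):

import unittest
from models import Record   # hypothetical module extracted in step 2

class TestRecord(unittest.TestCase):
    def test_construction(self):
        # pin down the current behaviour so later refactoring
        # iterations can be checked against it
        r = Record(name="x")
        self.assertEqual(r.name, "x")

if __name__ == "__main__":
    unittest.main()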
Since Calling from a parent file in Python indicates serious design problems, I'd say that you have two choices.
1) Don't have a library module try to call back to main. You'll have to rewrite things to fix this. [An imported component calling the main program is an improper dependency, and Python doesn't support it because it's a poor design.]
2) Put it all in one file until you figure out a better design with proper one-way dependencies. Then you'll have to rewrite it to fix the dependency problems.
A module (a single file) should be a logical piece of related code. Not everything. Not a single class definition. There's a middle ground of modularity.
Additionally, there should be a proper one-way dependency graph from the main program to components (which do NOT depend on the main program) to utility libraries and what-not (which do not know about the components OR the main program).
Circular (or mutual) dependencies often indicate a design problem. Callbacks are one way out of the problem. Another way is to decompose the circular elements to get a proper one-way graph.
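As a minimal sketch of the callback approach (all names here are hypothetical): the component stays ignorant of the main program by taking the behaviour it needs as an argument instead of importing main.

# component.py -- knows nothing about the main program
def process(data, on_done):
    result = [d.upper() for d in data]
    # the caller supplies the callback, so the dependency points
    # one way only: main -> component
    on_done(result)

# main.py would then do:
# from component import process
# process(["a", "b"], on_done=print)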