I'm trying to write a python script that can combine two python files into a hybrid that overwrites methods from the first file using methods from a second file (those methods can be standalone, or exist inside a class).
For example, say I have original.py:
from someclass import Class0

def fun1(a, b):
    print "fun1", a, b

def fun2(a):
    print "fun2", a

class Class1(Class0):
    def __init__(self, a):
        print "init", a
    def fun3(self, a, b):
        print "fun3", a, b
and new.py:
def fun2(a, b):
    print "new fun2", a, b

class Class1(Class0):
    def fun3(self, a, b):
        print "new fun3"
and after running combine('original.py', 'new.py') I want it to generate a new file that looks as follows:
from someclass import Class0

def fun1(a, b):
    print "fun1", a, b

def fun2(a, b):
    print "new fun2", a, b

class Class1(Class0):
    def __init__(self, a):
        print "init", a
    def fun3(self, a, b):
        print "new fun3"
I'm trying to figure out what's the cleanest way to approach this. I was thinking of using regex at first, but keeping track of indentations and the current level I'm in (a method inside a class would be one level deep, but what if a file has classes inside classes or methods inside other methods, as happens when using decorators) sounds like a disaster waiting to happen. I was also thinking of using the tokenize module to generate a syntax tree, and then trying to navigate to the same element in both trees, replacing portions of one tree with parts of another (again this seems like it could get complicated, since all uses of tokenize I see are for modifying individual python tokens rather than entire methods). What do you guys suggest? It seems like a simple task, but I don't see a clean way of accomplishing it.
Also, I don't care if the second file is complete or valid (note the missing import despite inheritance from Class0), although if making the second file valid would allow me to use some internal python trick (like modifying python import logic to overwrite imported methods using ones from different file and then dumping the imported version from memory into a new file), then I'd be fine with that as well.
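For concreteness, here is the kind of tree-based sketch I have in mind, using the ast module rather than tokenize (untested; it assumes Python 3.9+ for ast.unparse, so both inputs would need to be Python 3 syntax, and it only handles overriding existing defs, not adding new ones -- all helper names are placeholders):

import ast

def _index_defs(body):
    # Map names of top-level function/class definitions to their nodes.
    return {node.name: node for node in body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))}

def _merge(original_body, new_body):
    replacements = _index_defs(new_body)
    merged = []
    for node in original_body:
        if isinstance(node, ast.ClassDef) and node.name in replacements:
            # Recurse into classes so individual methods can be overridden.
            node.body = _merge(node.body, replacements[node.name].body)
            merged.append(node)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name in replacements:
            merged.append(replacements[node.name])
        else:
            merged.append(node)
    return merged

def combine(original_path, new_path, out_path='combined.py'):
    with open(original_path) as f:
        original = ast.parse(f.read())
    with open(new_path) as f:
        new = ast.parse(f.read())
    original.body = _merge(original.body, new.body)
    with open(out_path, 'w') as f:
        f.write(ast.unparse(original))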
EDIT (more explanation about why I'm trying to do this rather than using inheritance):
I guess my question needs to better explain why I'm trying to do this. I have several groups/customers I'm providing my source code to. The functionality needs to differ based on customer requirements (in some cases it's something simple like tweaking the parameters file, like sarnold mentioned; in others the users are asking for features/hooks that are irrelevant to other groups and would only add more confusion to the UI). Additionally, some of the source code is customer-specific and/or proprietary (so while I can share it with some groups, I'm not allowed to share it with others), which is what makes me try to avoid inheritance. I guess I could still rely on normal inheritance as long as it's the subclass version that's proprietary and not the original, but for most features there is currently only one group that doesn't have the privileges (and it's not always the same group for each feature) while the other groups do. So if I were to use inheritance, I'd probably need a bunch of directories/files like "SecretFromA", "SecretFromB", ..., "SecretFromZ", and then for each of the other groups I'd need "from SecretFromZ import *" for every module, whereas with the replacement technique I described I can just add stubs for the functionality I want to filter out. This is also something I'd be using a lot in the future, so as painful as it would be to write this script now, I feel like I will save a lot more later by not having to maintain the plethora of "from * import *" type files that inheritance would force on me (not to mention having to diff each version later in the event that a customer decides to move features around).
Also, in response to comment from sarnold, I guess I was too vague in my description. This is not something that would happen on the fly. I'd generate the new *.py files once (per version/delivery) and give the generated source to the customer. Additionally the patching would only take place if the second file (new.py in my example) actually exists, otherwise the original would be copied over. So unlike with inheritance, my group-specific directories would be relatively small, with the final version being deployed to a separate directory that I wouldn't need to maintain at all aside from delivering it to the customer.
With your new improved description I can propose a solution that I've used in the past, as well as other teams, to great effect: use your source control to manage the differences via branches or new repositories.
First things first, please use a source control system of some sort. The more advanced ones, such as git, hg, Subversion, Bazaar, ClearCase, BitKeeper, Darcs, Perforce, etc., all support some form of merge tracking that can make this specific problem significantly easier to solve. (Whether or not you decide to use branches this way, please do use a source control system.)
You would maintain a main branch (call it trunk or master if you wish) with almost all your development work, bug fixing, and so on. Branch from this all the code specific to given customers. You might prefer to branch based on features rather than on customer names, as you might have ten customers who need SCTP support, and the other twenty customers want TCP support instead. (And you want to charge more for the more exotic protocol -- so just merging it all into a configuration option would be Bad For Business.)
When it comes time to deliver new products, you switch to the specialized branches (or repositories), then use git pull (different repositories), git merge (different branches), svn merge, or similar commands to pull updates from your main development trunk into your specialized branches. Depending on whether you touched code that conflicts, you might have some merge conflicts to repair, or it might go very cleanly. Fix up any merge conflicts, commit to your branch, and move on to the next branch.
You're liable to wind up with a "tree" that looks something like this:
      I--J--K                  supremely magic feature
     /     /
    D--E--F--L--M--N           magic feature
   /     /     /
  A--B--C--G--H--O--P--Q       main trunk
Getting too many branches is one quick path to madness; be sure to commit bug fixes and shared features into the main trunk if you can, and if you can't, as close to it as possible. That will make merging them back into all descendant branches much easier. It is probably best to do the merging as close as possible to when you write the new feature or bug fix, while it is still fresh in your mind.
Maintaining multiple versions of code can drive you insane, though -- when you can make your code more modular and simply deliver different 'modules' of code to your users, you can drastically reduce the complexity you need to manage. Programming to a modularized interface takes discipline, and the amount of up-front work can be prohibitive, but it is often easier to manage one tool with five plugins than the 2**5 == 32 different combinations your customers could pick and choose from among those five features.
You'll come to recognize the sorts of modifications that make most sense in a per-customer branch and the modifications that should be maintained as modules in your trunk. The case isn't always clear-cut: take the approach that leads to lowest amount of overall complexity, as best you can judge.
Django==2.2.5
In the example below are two custom filters and two auxiliary functions.
It is a fake example, not real code.
Two problems with this code:
When a project becomes big I forget what aux functions I have already written. Not to mention team programming. What is the solution here? To organize a separate module for functions that can be imported? And sort them alphabetically?
Some functions from here may be reused outside this package, and some may not. Say, the combine function seems to be reusable, while get_salted_str is definitely for this module only. I think that it is better to distinguish between functions that may be imported and those that may not. Is it better to use an underscore to mark functions that are not meant to be imported? Like this: _get_salted_str. This may ease the first problem a bit.
Does Django style guide or any other pythonic style guide mention solutions to the two above mentioned problems?
My code example:
from django import template

register = template.Library()  # standard setup for custom template filters

def combine(str1, str2):
    return "{}_{}".format(str1, str2)

def get_salted_str(str):
    SALT = "slkdjghslkdjfghsldfghaaasd"
    return combine(str, SALT)

@register.filter
def get_salted_string(str):
    return combine(str, get_salted_str(str))

@register.filter
def get_salted_peppered_string(str):
    salted_str = get_salted_str(str)
    PEPPER = "1234128712908369735619346"
    return "{}_{}".format(PEPPER, salted_str)
When a project becomes big I forget what aux functions I have already written. Not to mention team programming. What is the solution here?
Good documentation and proper modularization.
To organize a separate module for functions that can be imported?
Technically, all functions (except of course nested ones) can be imported. Now I assume you meant: "for functions that are meant to be imported from other modules", but even then, it doesn't mean much - it often happens that a function primarily intended for "internal use" (helper function used within the same module) later becomes useful for other modules.
Also, the proper way to group functions is not based on whether they are for internal or public use (that is handled by prefixing 'internal use only' functions with a single leading underscore), but on how those functions are related.
NB: I use the term "function" because that's how you phrased your question, but this applies to all other names (classes etc).
And sort them alphabetically?
Bad idea IMHO - it doesn't make any sense from a functional point of view, and can cause issues when merging diverging branches.
Some functions from here may be reused outside this package, and some may not. Say, the combine function seems to be reusable, while "get_salted_str" is definitely for this module only. I think that it is better to distinguish between functions that may be imported and those that may not. Is it better to use an underscore to mark functions that are not meant to be imported? Like this: _get_salted_str. This may ease the first problem a bit.
Why would you actually prevent get_salted_str from being imported by another module?
'protected' (single leading underscore) names are for implementation parts that the module's client code should not mess with, nor even be aware of - this is called "encapsulation" - the goal being to allow for implementation changes that won't break the client code.
In your example, get_salted_str() is a template filter, so it's obviously part of your package's public API.
OTOH, combine really looks like an implementation detail - the fact that some unrelated code in another package may need to combine two strings with the same separator seems mostly accidental, and if you expose combine as part of the module's API you cannot freely change its implementation anymore. This is typically an implementation function as far as I can tell from your example (and also it's so trivial that it really doesn't warrant being exposed, as far as I'm concerned).
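To illustrate, a sketch of the example with the implementation details kept internal, following the leading-underscore convention above (the register lines are the standard boilerplate for Django custom filters; the logic itself is unchanged):

from django import template

register = template.Library()

def _combine(str1, str2):
    # leading underscore: implementation detail of this module
    return "{}_{}".format(str1, str2)

def _get_salted_str(value):
    _SALT = "slkdjghslkdjfghsldfghaaasd"
    return _combine(value, _SALT)

@register.filter
def get_salted_string(value):
    # no underscore: part of the module's public API (a template filter)
    return _combine(value, _get_salted_str(value))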
At a more general level: while avoiding duplication is a very laudable goal, you must be careful of overdoing it. Some duplication is actually "accidental" - at some point in time, two totally unrelated parts of the code happen to have a few lines in common, but for totally different reasons, and the forces that may lead to a change in one part of the code are totally unrelated to the other part. So before factoring out seemingly duplicated code, ask yourself whether this code is doing the same thing for the same reasons and whether changing it in one place should affect the other place too.
Does Django style guide or any other pythonic style guide mention solutions to the two above mentioned problems?
This is nothing specific to Django, nor even to Python. Writing well-organized code relies on the same heuristics whatever the language: you want high cohesion (all functions / classes etc in a same module should be related and provide solutions to the same problems) and low coupling (a module should depend on as few other modules as possible).
NB: I'm talking about "modules" here but the same rules hold for packages (a package is kind of a super-module) or classes (a class is a kind of mini-module too - except that you can have multiple instances of it).
Now it must be said that proper modularisation - like proper naming etc - IS hard. It takes time and experience (and a lot of reflection) to develop (no pun intended but...) a "feel" for it, and even then you often find yourself reorganizing things quite a bit during your project's lifetime. And, well, there will almost always be some messy area somewhere, because sometimes finding out where a given feature really belongs is a bit of a wild guess (hint: look for modules or packages named "util" or "utils" or "helpers" - those are usually where the devs regrouped stuff that didn't clearly belong anywhere else).
There are a lot of ways to go about this, so here is the way I always handle this:
1. Reusable functions in a project
First and foremost: Documentation. When working in a big team you definitely need to document reusable functions.
Second, packages. When creating a lot of auxiliary/helper functions, that might have a use outside the current module or app, it can be useful to bundle them all together. I often create a 'base' or 'utils' package in my Django project where I bundle all sorts of functions.
The django.contrib package is a pretty good example of all sorts of helper packages bundled into one.
My rule of thumb is, if I find that I reuse some function/piece of code, I move it to my utils package, and if it's related to something else in that package, I bundle them together. That makes it pretty easy to keep track of all the functions there are.
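For example (module and function names are made up for illustration), a reusable helper might live in such a utils package and be imported where needed:

# myproject/utils/text.py -- hypothetical module collecting string helpers
def combine(str1, str2, sep="_"):
    """Join two strings with a separator; generic enough to be reused."""
    return "{}{}{}".format(str1, sep, str2)

# elsewhere, e.g. in an app's templatetags module:
# from myproject.utils.text import combine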
2. Private functions
Python doesn't really have private members, but the generally accepted way to 'mark' a member as private is to prefix it with an underscore, like _get_salted_str.
3. Style guide
With regards to auxiliary functions, I'm not aware of any style guide.
'Private' members : https://docs.python.org/3/tutorial/classes.html#private-variables
After recently starting a role which essentially incorporates professional software development in python, I have noticed in the code that I am working with that people are tending to read in files as objects instead of variables, and I cannot understand why.
For example, I work with a limited number of raster files. In the parts of the code that deal with these raster files, there might be a class defined explicitly for reading in the rasters. The class takes a file location as an input and will then use the python package rasterio to open, read and access other characteristics of the file, but then stores these as attributes. For further example, we may have something like the following:
class test(object):
    def __init__(self, fileLocation):
        self.fileRead = open(fileLocation)
        self.fileSplit = self.fileRead.read().split()
My instinct would have been to just read the file straight in as a variable and access its properties when I needed them, avoiding the expenditure of extra effort.
I know the idea of classes is to allow for streamlined data handling when dealing with quantities of similar types of data (i.e. student information for students in a school), but this class might be instantiated only once in each run of the parent program. So to me, it seems a bit overkill to go to the effort of creating a class just to hold the information obtained through rasterio, when it would probably be much cleaner just to access the file data as you want it through explicit calls with rasterio.
Now, this sort of way of structuring code appears to be quite prevalent, and I just can't seem to understand why this would be a good way of doing things. As such, I wondered if there is some hidden benefit that I am missing which someone could explain to me? Or else to disregard it, and carry on in the manner outlined.
I have a very simple structure of classes. Classes B and C inherit from A. The functionality is similar in the case of some functions and properties. Classes B and C process the data differently, but they produce the same output (the function creating the output is inside class A). Example of the structure:
Now I need to extend the functionality. I need to add the option for slight differences in the processing and outputting of the data (class X). This option is managed by a configuration file, so I want to keep the old way of downloading, processing and outputting the data as well, e.g.:
option 1 - download the data without threads, using the old way of processing and output
option 2 - download the data without threads, using the new way of processing and output
option 3 - download the data with threads, using the old way of processing and output
option 4 - download the data with threads, using the new way of processing and output
I am not sure how to implement combination of the new processing and outputting the data. I need combination of Class B and X (BX) and Class C and X (CX). I think about these options:
The easiest way but the worst - duplicating some functions in the class B and class A.
Keep the classes A and B and add the combination of BX and AX classes.
Rewrite the classes A and B only for downloading the data. Then add classes for processing the data and then add classes for outputting the data. All classes will share the objects. Looks like the best option but with the most work to do.
Is there any better option (e.g. a design pattern or something like that) for extending the classes in the cleanest way?
It seems to me that you have a Dataflow.
What you have is Get (Download) Data -> Process Data -> Output Data. This basically is a pipeline with three stages.
If you are using objects to model this, you have two types of objects: the ones that perform the operations, and one or more that manage the data flow and the ordering of operations.
Let's use three interfaces (you may use abstract classes to define them, it doesn't matter) to define the steps of the pipeline: IDataDownloader, IDataProcessor and IDataOutputer.
Let's add another object that will represent and manage the pipeline:
(Note: I don't have experience with python, so I will write it in C#, sorry.)
public class Pipeline {
    private IDataDownloader mDataDownloader;
    private IDataProcessor mDataProcessor;
    private IDataOutputer mDataOutputer;

    public void Setup() {
        // use a config file to select between the two ways
        if (Config.UseOldWayDownloader) {
            mDataDownloader = new OldWayDownloader();
        }
        else {
            mDataDownloader = new NewWayDownloader();
        }

        // ....
        // or you can use a service locator or dependency injection that will get
        // a specific downloader based on configuration/setup
        mDataDownloader = Services.Get<IDataDownloader>();
        // ....
    }

    public void Execute() {
        var data = mDataDownloader.Download();
        var processedData = mDataProcessor.Process(data);
        mDataOutputer.Output(processedData);
    }
}
This way you get a nice separation of concerns and a modular system that you can extend. This system is also composable: you can compose its parts in different ways.
It may seem like more work, but this may not be true. Clean designs are often easier to write and extend.
It's true that you may write a little more code, but that code can be written faster because of the clean design, and you save time on debugging. Debugging consumes most of the time we spend writing software, but it's often overlooked by programmers who think productivity can be measured only by lines of code written.
One example of this is loosely typed languages. Quite often you write less code, so it's faster, but it's more prone to errors due to mistakes. When an error occurs you get more (and sometimes harder) debugging, so you waste time.
If you start with a simple proof-of-concept app, they can significantly speed up development. Once you get to the point of actually having robust, tested software, different forces kick in.
In order to ensure that your software is robust you have to write more tests and more checks/assertions to make sure that everything runs smoothly. So in the end you will probably have about the same amount of code, since strong typing does some of the checks for you and you can write fewer tests.
So what's better? Well... it depends.
There are several factors that influence the speed of development: the taste and experience of the programmers with a language, and the application itself. Some applications are easier to write in loosely typed languages, others in strongly typed ones.
Speed can differ, but not always; sometimes it's just the same.
I think that in your case you will waste more time trying to get the hierarchy right and debugging it, even if in the end you have a few lines less code. In the long run, having a design that is harder to understand will slow you down significantly when you need to extend it.
In the pipeline approach adding new ways of downloading, processing or outputting the data is a matter of adding a single class.
Well, to make it a UML answer, I'll add the class diagram:
The Pipeline refers to those three processing classes. They are all abstract, so you will need to implement them differently, and the configuration (file) determines which implementations are assigned.
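Since the question is about Python, here is a rough Python sketch of the same pipeline idea (abstract base classes stand in for the C# interfaces; all class names and the placeholder implementations are illustrative, not part of the original answer):

from abc import ABC, abstractmethod

class DataDownloader(ABC):
    @abstractmethod
    def download(self): ...

class DataProcessor(ABC):
    @abstractmethod
    def process(self, data): ...

class DataOutputter(ABC):
    @abstractmethod
    def output(self, processed): ...

class OldWayDownloader(DataDownloader):
    def download(self):
        return "raw data"        # placeholder for the old download logic

class NewWayProcessor(DataProcessor):
    def process(self, data):
        return data.upper()      # placeholder for the new processing logic

class PrintOutputter(DataOutputter):
    def output(self, processed):
        print(processed)

class Pipeline:
    def __init__(self, downloader, processor, outputter):
        # The concrete classes could be chosen from a configuration file here.
        self.downloader = downloader
        self.processor = processor
        self.outputter = outputter

    def execute(self):
        data = self.downloader.download()
        processed = self.processor.process(data)
        self.outputter.output(processed)

# Usage: compose whichever stages a given configuration calls for.
Pipeline(OldWayDownloader(), NewWayProcessor(), PrintOutputter()).execute()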
I'm teaching myself python (3.x) and I'm trying to understand the use case for classes. I'm starting to understand what they actually do, but I'm struggling to understand why you would use a class as opposed to creating a module with functions.
For example, how does:
class cls1:
    def func1(self, arguments...):
        # do some stuff

obj1 = cls1()
obj2 = cls1()

obj1.func1(arg1, arg2...)
obj2.func1(arg1, arg2...)
Differ from:
# module.py contents
def func1(arguments...):
    # do some stuff

import module

x = module.func1(arg1, arg2...)
y = module.func1(arg1, arg2...)
This is probably very simple but I just can't get my head around it.
So far, I've had quite a bit of success writing python programs, but they have all been pretty procedural, and only importing basic module functions. Classes are my next biggest hurdle.
You use a class if you need multiple instances of it, and you want those instances not to interfere with each other.
A module behaves like a singleton class, so you can have only one instance of it.
EDIT: for example if you have a module called example.py:
x = 0

def incr():
    global x
    x = x + 1

def getX():
    return x
if you try to import this module twice:
>>> import example as ex1
>>> import example as ex2
>>> ex1.incr()
>>> ex1.getX()
1
>>> ex2.getX()
1
This is because the module is imported only once, so ex1 and ex2 point to the same object.
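By contrast, here is a minimal sketch showing that each instance of a class keeps its own independent state (a counterpart to the example.py module above; the class name is mine):

class Example:
    def __init__(self):
        self.x = 0

    def incr(self):
        self.x += 1

    def getX(self):
        return self.x

ex1 = Example()
ex2 = Example()
ex1.incr()
print(ex1.getX())  # 1
print(ex2.getX())  # 0 -- ex2 is a separate instance, unaffected by ex1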
As long as you're only using pure functions (functions that only work on their arguments, always return the same result for the same argument set, don't depend on any global/shared state and don't change anything - neither their arguments nor any global/shared state - IOW functions that don't have any side effects), then classes are indeed of rather limited use. But that's functional programming, and while Python can technically be used in a functional style, it's possibly not the best choice here.
As soon as you have to share state between functions, and especially if some of these functions are supposed to change this shared state, you do have a use for OO concepts. There are mainly two ways to share state between functions: passing the state from function to function, or using globals.
The second solution - global state - is known to be troublesome: first because it makes understanding of the program flow (hence debugging) harder, but also because it prevents your code from being reentrant, which is a definitive no-no for quite a lot of now-common use cases (multithreaded execution, most server-side web application code, etc). Actually it makes your code practically unusable or near-unusable for anything except short, simple one-shot scripts.
The first solution most often implies using half-informal complex datastructures (dicts with a given set of keys, often holding other dicts, lists, lists of dicts, sets etc), correctly initialising them and passing them from function to function - and of course having a set of functions that work on a given datastructure. IOW you are actually defining new complex datatypes (a data structure and a set of operations on that data structure), only using the lowest-level tools the language provides.
Classes are actually a way to define such a data type at a higher level, grouping together the data and the operations. They also offer a lot more, especially polymorphism, which makes for more generic, extensible code, and also easier unit testing.
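As a small illustration of that point (all names here are invented for the example), the same state can be passed around as a loose dict, or grouped with its operations in a class:

# Passing shared state as a plain dict: the structure is implicit and every
# function has to know the right keys.
def make_account(owner):
    return {"owner": owner, "balance": 0}

def deposit(account, amount):
    account["balance"] += amount

# The same thing as a class: data and the operations on it are grouped
# and named, and the type can later be subclassed or swapped out.
class Account:
    def __init__(self, owner):
        self.owner = owner
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount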
Consider you have a file or a database with products, and each product has a product id, price, availability, discount, published-to-web status, and more values. And you have a second file with thousands of products that contains new prices, availability and discounts. You want to update the values and keep control of how many products will be changed, and other stats. You can do it with procedural programming and functional programming, but you will find yourself trying to discover tricks to make it work, and most likely you will be lost in many different lists and sets.
On the other hand, with object-oriented programming you can create a class Product with instance variables for the product id, the old price, the old availability, the old discount, the old published status, and some instance variables for the new values (new price, new availability, new discount, new published status). Then all you have to do is read the first file/database and, for every product, create a new instance of the class Product. Then you can read the second file and find the new values for your product objects. In the end every product of the first file/database will be an object, labeled and carrying both the old values and the new values. It is easier this way to track the changes, make statistics and update your database.
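A minimal sketch of that Product idea (the field names are just examples, not prescribed):

class Product:
    def __init__(self, product_id, price, availability, discount, published):
        self.product_id = product_id
        # values read from the first file/database
        self.old_price = price
        self.old_availability = availability
        self.old_discount = discount
        self.old_published = published
        # values filled in later from the second file
        self.new_price = None
        self.new_availability = None
        self.new_discount = None
        self.new_published = None

    def price_changed(self):
        return self.new_price is not None and self.new_price != self.old_price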
One more example: if you use tkinter, you can create a class for a top-level window, and every time you want to show an information window or an about window (with a custom background color and dimensions) you can simply create a new instance of this class.
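For example, a rough sketch of such a tkinter window class (the class name, colors and sizes are arbitrary):

import tkinter as tk

class InfoWindow(tk.Toplevel):
    def __init__(self, master, text, bg="lightyellow", width=300, height=120):
        super().__init__(master, bg=bg)
        self.geometry("{}x{}".format(width, height))
        tk.Label(self, text=text, bg=bg).pack(expand=True)

root = tk.Tk()
InfoWindow(root, "About this program")                  # one instance
InfoWindow(root, "Some information", bg="lightblue")    # another, customized
root.mainloop()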
For simple things classes are not needed. But for more complex things classes sometimes can make the solution easier.
I think the best answer is that it depends on what your intended object is supposed to be/do. But in general, there are some differences between a class and an imported module which give each of them different features in the current module. The most important one is that classes define objects, which means they have a lot of capabilities that modules don't have - for example special methods like __getattr__, __setattr__, __iter__, etc., and the ability to create many instances and even control the way they are created. But for modules, the documentation describes their use-case perfectly:
If you quit from the Python interpreter and enter it again, the
definitions you have made (functions and variables) are lost.
Therefore, if you want to write a somewhat longer program, you are
better off using a text editor to prepare the input for the
interpreter and running it with that file as input instead. This is
known as creating a script. As your program gets longer, you may want
to split it into several files for easier maintenance. You may also
want to use a handy function that you’ve written in several programs
without copying its definition into each program.
To support this, Python has a way to put definitions in a file and use
them in a script or in an interactive instance of the interpreter.
Such a file is called a module; definitions from a module can be
imported into other modules or into the main module (the collection of
variables that you have access to in a script executed at the top
level and in calculator mode).
I'm trying to develop a system that will allow users to update local, offline databases on their laptops and, upon reconnection to the network, synchronize their dbs with the main, master db.
I looked at MySQL replication, but that documentation focuses on unidirectional syncing. So I think I'm going to build a custom app in python for doing this (bilateral syncing), and I have a couple of questions.
I've read a couple of posts regarding this issue, and one of the items which has been passively mentioned is serialization (which I would be implementing through the pickle and cPickle modules in python). Could someone please tell me whether this is necessary, and the advantages of serializing data in the context of syncing databases?
One of the uses in wikipedia's entry on serialization states it can be used as "a method for detecting changes in time-varying data." This sounds really important, because my application will be looking at timestamps to determine which records have precedence when updating the master database. So, I guess the thing I don't really get is how pickling data in python can be used to "detect changes in time-varying data", and whether or not this would supplement using timestamps in the database to determine precedence or replace this method entirely.
Anyways, high level explanations or code examples are both welcome. I'm just trying to figure this out.
Thanks
how pickling data in python can be used to "detect changes in time-varying data."
Bundling data in an opaque format tells you absolutely nothing about time-varying data, except that it might have possibly changed (but you'd need to check that manually by unwrapping it). What the article is actually saying is...
To quote the actual relevant section (link to article at this moment in time):
Since both serializing and deserializing can be driven from common code, (for example, the Serialize function in Microsoft Foundation Classes) it is possible for the common code to do both at the same time, and thus 1) detect differences between the objects being serialized and their prior copies, and 2) provide the input for the next such detection. It is not necessary to actually build the prior copy, since differences can be detected "on the fly". This is a way to understand the technique called differential execution[a link which does not exist]. It is useful in the programming of user interfaces whose contents are time-varying — graphical objects can be created, removed, altered, or made to handle input events without necessarily having to write separate code to do those things.
The term "differential execution" seems to be a neologism coined by this person, where he described it in another StackOverflow answer: How does differential execution work?. Reading over that answer, I think I understand what he's trying to say. He seems to be using "differential execution" as a MVC-style concept, in the context where you have lots of view widgets (think a webpage) and you want to allow incremental changes to update just those elements, without forcing a global redraw of the screen. I would not call this "serialization" in the classic sense of the word (not by any stretch, in my humble opinion), but rather "keeping track of the past" or something like that. Because this basically has nothing to do with serialization, the rest of this answer (my interpretation of what he is describing) is probably not worth your time unless you are interested in the topic.
In general, avoiding a global redraw is impossible. Global redraws must sometimes happen: for example in HTML, if you increase the size of an element, you need to reflow lower elements, triggering a repaint. In 3D, you need to redraw everything behind what you update. However if you follow this technique, you can reduce (though not minimize) the number of redraws. This technique he claims will avoid the use of most events, avoid OOP, and use only imperative procedures and macros. My interpretation goes as follows:
Your drawing functions must know, somehow, how to "erase" themselves and anything they do which may affect the display of unrelated functions.
Write a side-effect-free paintEverything() script that imperatively displays everything (e.g. using functions like paintButton() and paintLabel()), using nothing but IF macros/functions. The IF macro works just like an if-statement, except...
Whenever you encounter an IF branch, keep track of both which IF statement this was, and the branch you took. "Which IF statement this was" is sort of a vague concept. For example you might decide to implement a FOR loop by combining IFs with recursion, in which case I think you'd need to keep track of the IF statement as a tree (whose nodes are either function calls or IF statements). You ensure the structure of that tree corresponds to the precedence rule "child layout choices depend on this layout choice".
Every time a user input event happens, rerun your paintEverything() script. However because we have kept track of which part of the code depends on which other parts, we can automatically skip anything which did not depend on what was updated. For example if paintLabel() did not depend on the state of the button, we can avoid rerunning that part of the paintEverything() script.
The "serialization" (not really serialization, more like naturally-serialized data structure) comes from the execution history of the if-branches. Except, serialization here is not necessary at all; all you needed was to keep track of which part of the display code depends on which others. It just so happens that if you use this technique with serially-executed "smart-if"-statements, it makes sense to use a lazily-evaluated diff of execution history to determine what you need to update.
However this technique does have useful takeaways. I'd say the main takeaway is: it is also a reasonable thing to keep track of dependencies not just in an OOP-style (e.g. not just widget A depends on widget B), but dependencies of the basic combinators in whatever DSL you are programming in. Also dependencies can be inferred from the structure of your program (e.g. like HTML does).
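As a much-simplified sketch of that dependency-skipping idea (this is not the full IF-branch bookkeeping described above, and every name here is invented): rerun the whole paint script each time, but skip any painter whose inputs have not changed since the previous run:

_last_inputs = {}

def paint_if_changed(name, inputs, painter):
    # repaint only when this painter's inputs differ from the previous run
    if _last_inputs.get(name) != inputs:
        painter(*inputs)
        _last_inputs[name] = inputs

def paint_button(label):
    print("drawing button:", label)

def paint_label(text):
    print("drawing label:", text)

def paint_everything(state):
    paint_if_changed("button", (state["button_label"],), paint_button)
    paint_if_changed("label", (state["label_text"],), paint_label)

state = {"button_label": "OK", "label_text": "hello"}
paint_everything(state)   # first run: draws both widgets
state["label_text"] = "hello again"
paint_everything(state)   # second run: redraws only the label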