How to design a class structure - Python

I have a very simple class structure: classes B and C inherit from A. Some of the functions and properties are similar. Classes B and C process the data differently, but they produce the same output (the function creating the output lives in class A). Example of the structure:
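It looks roughly like this (the method names are only illustrative):

class A:
    def download(self):
        ...                      # shared downloading logic

    def output(self, processed):
        ...                      # shared output logic lives in A

    def run(self):
        self.output(self.process(self.download()))

class B(A):
    def process(self, data):
        ...                      # B-specific processing

class C(A):
    def process(self, data):
        ...                      # C-specific processing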
Now I need to extend the functionality. I need to add an option for slight differences in how the data is processed and output (class X). This option is managed by a configuration file, so I want to keep the old way of downloading, processing and outputting the data as well, e.g.:
option 1 - download the data without threads, using the old way of processing and outputting
option 2 - download the data without threads, using the new way of processing and outputting
option 3 - download the data with threads, using the old way of processing and outputting
option 4 - download the data with threads, using the new way of processing and outputting
I am not sure how to implement the combination of the new processing and outputting of the data. I need a combination of class B and X (BX) and of class C and X (CX). I am thinking about these options:
The easiest way but the worst - duplicating some functions in class B and class A.
Keep classes A and B and add the combined BX and AX classes.
Rewrite classes A and B so they only handle downloading the data. Then add classes for processing the data, and then classes for outputting the data. All classes would share the objects. This looks like the best option, but also the one with the most work.
Is there a better option (e.g. a design pattern or something similar) to extend the classes in the cleanest way?

It seems to me that you have a Dataflow.
What you have is Get (Download) Data -> Process Data -> Output Data. This is basically a pipeline with three stages.
If you use objects to model this, you have two types of objects: the ones that perform the operations, and one or more that manage the data flow and the ordering of the operations.
Let's use three interfaces (you may use abstract classes to define them, it doesn't matter) to define the steps of the pipeline: IDataDownloader, IDataProcessor and IDataOutputer.
Let's add another object that will represent and manage the pipeline:
(Note: I don't have experience with Python, so I will write it in C#, sorry.)
public class Pipeline {
    private IDataDownloader mDataDownloader;
    private IDataProcessor mDataProcessor;
    private IDataOutputer mDataOutputer;

    public void Setup() {
        // use a config file to select between the two ways
        if (Config.UseOldWayDownloader) {
            mDataDownloader = new OldWayDownloader();
        }
        else {
            mDataDownloader = new NewWayDownloader();
        }

        // ....
        // or you can use a service locator or dependency injection that will
        // get a specific downloader based on configuration/setup
        mDataDownloader = Services.Get<IDataDownloader>();
        // ....
    }

    public void Execute() {
        var data = mDataDownloader.Download();
        var processedData = mDataProcessor.Process(data);
        mDataOutputer.Output(processedData);
    }
}
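The same idea translates almost directly into Python. A rough sketch (the concrete class names and config keys below are only illustrative):

from abc import ABC, abstractmethod

class DataDownloader(ABC):
    @abstractmethod
    def download(self): ...

class DataProcessor(ABC):
    @abstractmethod
    def process(self, data): ...

class DataOutputter(ABC):
    @abstractmethod
    def output(self, processed): ...

class Pipeline:
    def __init__(self, downloader, processor, outputter):
        self.downloader = downloader
        self.processor = processor
        self.outputter = outputter

    def execute(self):
        data = self.downloader.download()
        processed = self.processor.process(data)
        self.outputter.output(processed)

class OldWayDownloader(DataDownloader):
    def download(self):
        return "data downloaded the old way"   # placeholder

class NewWayDownloader(DataDownloader):
    def download(self):
        return "data downloaded the new way"   # placeholder

def build_pipeline(config):
    # the config file decides which concrete class fills each slot;
    # the processor/outputter classes would be chosen the same way
    downloader = OldWayDownloader() if config["use_old_downloader"] else NewWayDownloader()
    processor = ...   # e.g. OldProcessor() or NewProcessor()
    outputter = ...   # e.g. OldOutputter() or NewOutputter()
    return Pipeline(downloader, processor, outputter)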
This way you get a nice separation of concerns and a modular system that you can extend. This system is also composable: you can compose its parts in different ways.
It may seem like more work, but this may not be true. Clean designs are often easier to write and extend.
It's true that you may write a little more code, but that code can be written faster because of the clean design, and you can save time on debugging. Debugging consumes most of the time we spend writing software, but it's often overlooked by programmers who think productivity can be measured only by the lines of code written.
One example of this is the case of loosely typed languages. Quite often you write less code, so it's faster, but the code is more prone to errors due to mistakes. In case of an error you get more (and sometimes harder) debugging, so you waste time.
If you start with a simple proof-of-concept app, loosely typed languages will significantly speed up development. Once you get to the point of actually having robust, tested software, different forces kick in.
In order to ensure that your software is robust, you have to write more tests and more checks/assertions to make sure that everything runs smoothly. So in the end you will probably have about the same amount of code, since strong typing does some of the checks for you and lets you write fewer tests.
So what's better? Well... it depends.
There are several factors that will influence the speed of development: the taste and experience of the programmers with a language, and the application itself. Some applications are easier to write in loosely typed languages, others in strongly typed languages.
Speed can differ, but not always; sometimes it's just the same.
I think that in your case you would waste more time trying to get the hierarchy right and debugging it, even if in the end you had a few lines less code. In the long run, a design that is harder to understand will slow you down significantly when you need to extend it.
In the pipeline approach, adding a new way of downloading, processing or outputting the data is a matter of adding a single class.
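For example, building on the Python sketch above, a threaded downloader would just be one more class (the names and the fetch callable are illustrative):

from concurrent.futures import ThreadPoolExecutor

class ThreadedDownloader(DataDownloader):
    def __init__(self, sources, fetch):
        self.sources = sources    # e.g. a list of URLs
        self.fetch = fetch        # callable that downloads one source

    def download(self):
        with ThreadPoolExecutor() as pool:
            return list(pool.map(self.fetch, self.sources))

The Pipeline and the existing classes stay untouched; only the configuration step needs to know about the new class.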

Well, to make it a UML answer, I'll add the class diagram:
The Pipeline refers to those three processing classes. They are all abstract, so you will need to implement them differently, and the configuration will assign them according to what the configuration file says.

Related

Reading in files as objects instead of variables in python

After recently starting a role that essentially involves professional software development in Python, I have noticed that in the code I am working with people tend to read in files as objects instead of variables, and I cannot understand why.
For example, I work with a limited number of raster files. In the parts of the code that deal with these raster files, there might be a class defined explicitly for reading in the rasters. The class takes a file location as an input and then uses the Python package rasterio to open, read and access other characteristics of the file, but stores these as attributes. For further example, we may have something like the following:
class test(object):
    def __init__(self, fileLocation):
        self.fileRead = open(fileLocation)
        self.fileSplit = self.fileRead.read().split()
My instinct would have been to just read the file straight into a variable and access its properties when I needed them, avoiding the extra effort.
I know the idea of classes is to allow for streamlined data handling when dealing with quantities of similar types of data (e.g. student information for students in a school), but this class might be instantiated only once per run of the parent program. So to me it seems a bit of an overkill to go to the effort of creating a class just to hold the information obtained through rasterio, when it would probably be much cleaner to access the file data as you need it through explicit calls to rasterio.
Now, this sort of code structure appears to be quite prevalent, and I just can't understand why it would be a good way of doing things. So I wondered if there is some hidden benefit that I am missing which someone could explain to me, or whether I should disregard it and carry on in the manner outlined.
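For concreteness, the pattern being described looks roughly like this (the class and attribute names are only illustrative, and it assumes rasterio's usual open/read API):

import rasterio

class RasterFile:
    # wraps one raster: open it once, keep the pieces the rest of the code needs
    def __init__(self, file_location):
        with rasterio.open(file_location) as src:
            self.data = src.read(1)                  # first band as an array
            self.crs = src.crs
            self.transform = src.transform
            self.shape = (src.height, src.width)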

Why use python classes over modules with functions?

I'm teaching myself Python (3.x) and I'm trying to understand the use case for classes. I'm starting to understand what they actually do, but I'm struggling to understand why you would use a class as opposed to creating a module with functions.
For example, how does:
class cls1:
    def func1(self, arguments...):
        #do some stuff

obj1 = cls1()
obj2 = cls1()
obj1.func1(arg1, arg2...)
obj2.func1(arg1, arg2...)
Differ from:
#module.py contents
def func1(arguments...):
    #do some stuff

import module
x = module.func1(arg1, arg2...)
y = module.func1(arg1, arg2...)
This is probably very simple but I just can't get my head around it.
So far, I've had quite a bit of success writing python programs, but they have all been pretty procedural, and only importing basic module functions. Classes are my next biggest hurdle.
You use a class if you need multiple instances of it, and you want those instances not to interfere with each other.
A module behaves like a singleton class, so you can have only one instance of it.
EDIT: for example, if you have a module called example.py:
x = 0

def incr():
    global x
    x = x + 1

def getX():
    return x
if you try to import this module twice:
>>> import example as ex1
>>> import example as ex2
>>> ex1.incr()
>>> ex1.getX()
1
>>> ex2.getX()
1
This is because the module is imported only one time, so ex1 and ex2 point to the same object.
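By contrast, a minimal class gives you as many independent instances as you like (a rough class equivalent of the module above):

class Example:
    def __init__(self):
        self.x = 0

    def incr(self):
        self.x = self.x + 1

    def getX(self):
        return self.x

ex1 = Example()
ex2 = Example()
ex1.incr()
print(ex1.getX())   # 1
print(ex2.getX())   # 0 -- ex2 has its own state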
As long as you're only using pure functions (functions that work only on their arguments, always return the same result for the same set of arguments, don't depend on any global/shared state and don't change anything - neither their arguments nor any global/shared state - IOW functions that don't have any side effects), classes are indeed of rather limited use. But that's functional programming, and while Python can technically be used in a functional style, it's possibly not the best choice here.
As soon as you have to share state between functions, and especially if some of these functions are supposed to change this shared state, you do have a use for OO concepts. There are mainly two ways to share state between functions: passing the state from function to function, or using globals.
The second solution - global state - is known to be troublesome: first because it makes understanding the program flow (and hence debugging) harder, but also because it prevents your code from being reentrant, which is a definite no-no for quite a lot of now-common use cases (multithreaded execution, most server-side web application code, etc.). It effectively makes your code unusable or near-unusable for anything except short, simple one-shot scripts.
The first solution most often implies using half-informal complex data structures (dicts with a given set of keys, often holding other dicts, lists, lists of dicts, sets, etc.), correctly initialising them and passing them from function to function - and of course having a set of functions that works on a given data structure. IOW you are actually defining new complex data types (a data structure and a set of operations on that data structure), using only the lowest-level tools the language provides.
Classes are actually a way to define such a data type at a higher level, grouping together the data and the operations. They also offer a lot more, especially polymorphism, which makes for more generic, extensible code, and also easier unit testing.
Consider that you have a file or a database with products, and each product has a product id, price, availability, discount, published-on-web status, and more. And you have a second file with thousands of products that contains new prices, availability and discounts. You want to update the values and keep track of how many products will be changed, along with other stats. You can do it with procedural or functional programming, but you will find yourself inventing tricks to make it work, and most likely you will get lost in many different lists and sets.
On the other hand, with object-oriented programming you can create a class Product with instance variables for the product id, the old price, the old availability, the old discount and the old published status, plus some instance variables for the new values (new price, new availability, new discount, new published status). Then all you have to do is read the first file/database and create a new instance of the class Product for every product. Then you can read the second file and find the new values for your product objects. In the end, every product of the first file/database will be an object carrying both the old values and the new values. It is easier this way to track the changes, compute statistics and update your database.
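A minimal sketch of such a Product class (the attribute and method names are illustrative):

class Product:
    def __init__(self, product_id, price, availability, discount, published):
        self.product_id = product_id
        self.old_price = price
        self.old_availability = availability
        self.old_discount = discount
        self.old_published = published
        # new values filled in when the second file is read
        self.new_price = None
        self.new_availability = None
        self.new_discount = None
        self.new_published = None

    def apply_update(self, price, availability, discount, published):
        self.new_price = price
        self.new_availability = availability
        self.new_discount = discount
        self.new_published = published

    def has_changed(self):
        return (self.new_price, self.new_availability, self.new_discount,
                self.new_published) != (self.old_price, self.old_availability,
                                        self.old_discount, self.old_published)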
One more example: if you use tkinter, you can create a class for a top-level window, and every time you want to show an information window or an about window (with a custom background colour and dimensions) you can simply create a new instance of this class.
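For example (a minimal sketch; the window contents and defaults are illustrative):

import tkinter as tk

class InfoWindow(tk.Toplevel):
    # a reusable top-level window with a custom background and size
    def __init__(self, master, message, bg="#ffffe0", size="300x120"):
        super().__init__(master)
        self.title("Information")
        self.geometry(size)
        self.configure(bg=bg)
        tk.Label(self, text=message, bg=bg).pack(expand=True)

# every info/about popup is just another instance:
# InfoWindow(root, "Settings saved.")
# InfoWindow(root, "About this application...")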
For simple things classes are not needed. But for more complex things classes sometimes can make the solution easier.
I think the best answer is that it depends on what your intended object is supposed to be/do. But in general, there are some differences between a class and an imported module which give each of them different features. The most important is that a class defines objects, which means its instances have a lot of options for acting like objects that modules don't have: special methods like __getattr__, __setattr__, __iter__, etc., and the ability to create many instances and even control the way they are created. For modules, the documentation describes their use case perfectly:
If you quit from the Python interpreter and enter it again, the
definitions you have made (functions and variables) are lost.
Therefore, if you want to write a somewhat longer program, you are
better off using a text editor to prepare the input for the
interpreter and running it with that file as input instead. This is
known as creating a script. As your program gets longer, you may want
to split it into several files for easier maintenance. You may also
want to use a handy function that you’ve written in several programs
without copying its definition into each program.
To support this, Python has a way to put definitions in a file and use
them in a script or in an interactive instance of the interpreter.
Such a file is called a module; definitions from a module can be
imported into other modules or into the main module (the collection of
variables that you have access to in a script executed at the top
level and in calculator mode).

advantages of serializing data during db synchronization

I'm trying to develop a system that will allow users to update local, offline databases on their laptops and, upon reconnection to the network, synchronize their dbs with the main, master db.
I looked at MySQL replication, but that documentation focuses on unidirectional syncing. So I think I'm going to build a custom app in python for doing this (bilateral syncing), and I have a couple of questions.
I've read a couple of posts regarding this issue, and one of the items which has been passively mentioned is serialization (which I would be implementing through the pickle and cPickle modules in python). Could someone please tell me whether this is necessary, and the advantages of serializing data in the context of syncing databases?
One of the uses in wikipedia's entry on serialization states it can be used as "a method for detecting changes in time-varying data." This sounds really important, because my application will be looking at timestamps to determine which records have precedence when updating the master database. So, I guess the thing I don't really get is how pickling data in python can be used to "detect changes in time-varying data", and whether or not this would supplement using timestamps in the database to determine precedence or replace this method entirely.
Anyways, high level explanations or code examples are both welcome. I'm just trying to figure this out.
Thanks
how pickling data in python can be used to "detect changes in time-varying data."
Bundling data in an opaque format tells you absolutely nothing about time-varying data, except that it might have possibly changed (but you'd need to check that manually by unwrapping it). What the article is actually saying is...
To quote the actual relevant section (link to article at this moment in time):
Since both serializing and deserializing can be driven from common code, (for example, the Serialize function in Microsoft Foundation Classes) it is possible for the common code to do both at the same time, and thus 1) detect differences between the objects being serialized and their prior copies, and 2) provide the input for the next such detection. It is not necessary to actually build the prior copy, since differences can be detected "on the fly". This is a way to understand the technique called differential execution[a link which does not exist]. It is useful in the programming of user interfaces whose contents are time-varying — graphical objects can be created, removed, altered, or made to handle input events without necessarily having to write separate code to do those things.
The term "differential execution" seems to be a neologism coined by this person, where he described it in another StackOverflow answer: How does differential execution work?. Reading over that answer, I think I understand what he's trying to say. He seems to be using "differential execution" as a MVC-style concept, in the context where you have lots of view widgets (think a webpage) and you want to allow incremental changes to update just those elements, without forcing a global redraw of the screen. I would not call this "serialization" in the classic sense of the word (not by any stretch, in my humble opinion), but rather "keeping track of the past" or something like that. Because this basically has nothing to do with serialization, the rest of this answer (my interpretation of what he is describing) is probably not worth your time unless you are interested in the topic.
In general, avoiding a global redraw is impossible. Global redraws must sometimes happen: for example in HTML, if you increase the size of an element, you need to reflow lower elements, triggering a repaint. In 3D, you need to redraw everything behind what you update. However, if you follow this technique, you can reduce (though not eliminate) the number of redraws. He claims this technique avoids the use of most events, avoids OOP, and uses only imperative procedures and macros. My interpretation goes as follows:
Your drawing functions must know, somehow, how to "erase" themselves and anything they do which may affect the display of unrelated functions.
Write a side-effect-free paintEverything() script that imperatively displays everything (e.g. using functions like paintButton() and paintLabel()), using nothing but IF macros/functions. The IF macro works just like an if-statement, except...
Whenever you encounter an IF branch, keep track of both which IF statement this was, and the branch you took. "Which IF statement this was" is sort of a vague concept. For example you might decide to implement a FOR loop by combining IFs with recursion, in which case I think you'd need to keep track of the IF statement as a tree (whose nodes are either function calls or IF statements). You ensure the structure of that tree corresponds to the precedence rule "child layout choices depend on this layout choice".
Every time a user input event happens, rerun your paintEverything() script. However because we have kept track of which part of the code depends on which other parts, we can automatically skip anything which did not depend on what was updated. For example if paintLabel() did not depend on the state of the button, we can avoid rerunning that part of the paintEverything() script.
The "serialization" (not really serialization, more like naturally-serialized data structure) comes from the execution history of the if-branches. Except, serialization here is not necessary at all; all you needed was to keep track of which part of the display code depends on which others. It just so happens that if you use this technique with serially-executed "smart-if"-statements, it makes sense to use a lazily-evaluated diff of execution history to determine what you need to update.
However, this technique does have useful takeaways. I'd say the main takeaway is this: it is reasonable to keep track of dependencies not just in an OOP style (e.g. widget A depends on widget B), but also between the basic combinators of whatever DSL you are programming in. Dependencies can also be inferred from the structure of your program (as HTML does).
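A loose sketch of that interpretation (purely illustrative, not the technique's actual implementation): each paint call remembers the inputs it last drew with and is skipped when they haven't changed.

_last_inputs = {}

def paint_if_changed(name, inputs, paint):
    if _last_inputs.get(name) != inputs:
        paint(*inputs)                 # actually redraw this widget
        _last_inputs[name] = inputs    # remember what it depended on

def paint_label(text):
    print("drawing label:", text)

def paint_button(text, enabled):
    print("drawing button:", text, enabled)

def paint_everything(state):
    # rerun on every event; only the parts whose inputs changed are redrawn
    paint_if_changed("label", (state["label_text"],), paint_label)
    paint_if_changed("button", (state["button_text"], state["enabled"]), paint_button)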

generating hybrid *.py file from 2 other files

I'm trying to write a python script that can combine two python files into a hybrid that overwrites methods from the first file using methods from a second file (those methods can be standalone, or exist inside a class).
For example, say I have original.py:
from someclass import Class0

def fun1(a, b):
    print "fun1", a, b

def fun2(a):
    print "fun2", a

class Class1(Class0):
    def __init__(self, a):
        print "init", a

    def fun3(self, a, b):
        print "fun3", a, b
and new.py:
def fun2(a, b):
    print "new fun2", a, b

class Class1(Class0):
    def fun3(self, a, b):
        print "new fun3"
and after running combine('original.py', 'new.py') I want it to generate a new file that looks as follows:
import someclass

def fun1(a, b):
    print "fun1", a, b

def fun2(a, b):
    print "new fun2", a, b

class Class1:
    def __init__(self, a):
        print "init", a

    def fun3(self, a, b):
        print "new fun3"
I'm trying to figure out the cleanest way to approach this. I was thinking of using regexes at first, but keeping track of indentation and the current nesting level (a method inside a class is one level deep, but what if a file has classes inside classes, or methods inside other methods, as happens when using decorators?) sounds like a disaster waiting to happen. I was also thinking of using the tokenize module to generate a syntax tree and then trying to navigate to the same element in both trees, replacing portions of one tree with parts of the other (again, this seems like it could get complicated, since all the uses of tokenize I have seen modify individual Python tokens rather than entire methods). What do you suggest? It seems like a simple task, but I don't see a clean way of accomplishing it.
Also, I don't care if the second file is complete or valid (note the missing import despite the inheritance from Class0), although if making the second file valid would allow me to use some internal Python trick (like modifying Python's import logic to overwrite imported methods with ones from a different file and then dumping the imported version from memory into a new file), then I'd be fine with that as well.
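A rough sketch of what I have in mind with the syntax-tree approach, using the ast module (assuming Python 3.9+ for ast.unparse and that both files parse as Python 3; the combine() signature is hypothetical, it only handles one level of class nesting and it doesn't touch imports):

import ast

def combine(original_path, new_path, out_path):
    with open(original_path) as f:
        original = ast.parse(f.read())
    with open(new_path) as f:
        new = ast.parse(f.read())

    # index top-level functions/classes of the override file by name
    overrides = {node.name: node for node in new.body
                 if isinstance(node, (ast.FunctionDef, ast.ClassDef))}

    merged = []
    for node in original.body:
        override = overrides.get(getattr(node, "name", None))
        if override is None:
            merged.append(node)        # nothing new: keep the original
        elif isinstance(node, ast.ClassDef) and isinstance(override, ast.ClassDef):
            # merge class bodies: swap in methods the new file redefines
            new_methods = {m.name: m for m in override.body
                           if isinstance(m, ast.FunctionDef)}
            node.body = [new_methods.get(m.name, m)
                         if isinstance(m, ast.FunctionDef) else m
                         for m in node.body]
            merged.append(node)
        else:
            merged.append(override)    # whole function replaced

    original.body = merged
    with open(out_path, "w") as f:
        f.write(ast.unparse(original))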
EDIT (more explanation about why I'm trying to do this rather than using inheritance):
I guess my question needs to better explain why I'm trying to do this. I have several groups/customers I'm providing my source code to. The functionality needs to differ based on customer requirements (in some cases it's something simple like tweaking the parameters file, as sarnold mentioned; in others the users are asking for features/hooks that are irrelevant to other groups and would only add more confusion to the UI). Additionally, some of the source code is customer-specific and/or proprietary (so while I can share it with some groups, I'm not allowed to share it with others), which is what makes me try to avoid inheritance. I guess I could still rely on normal inheritance as long as it's the subclass version that's proprietary and not the original, but for most features there is currently only one group that doesn't have the privileges (and it's not always the same group for each feature) while the other groups do. So if I were to use inheritance, I'd probably need a bunch of directories/files like "SecretFromA", "SecretFromB", ..., "SecretFromZ", and then for each of the other groups I'd need "from SecretFromZ import *" for every module, whereas with the replacement technique I described I can just add stubs for the functionality I want to filter out. This is also something I'd be using a lot in the future, so as painful as it would be to write this script now, I feel I will save a lot more later by not having to maintain the plethora of "from * import *" type files that inheritance would force on me (not to mention having to diff each version later if a customer decides to move features around).
Also, in response to sarnold's comment: I guess I was too vague in my description. This is not something that would happen on the fly. I'd generate the new *.py files once (per version/delivery) and give the generated source to the customer. Additionally, the patching would only take place if the second file (new.py in my example) actually exists; otherwise the original would be copied over. So unlike with inheritance, my group-specific directories would be relatively small, with the final version being deployed to a separate directory that I wouldn't need to maintain at all, aside from delivering it to the customer.
With your new, improved description I can propose a solution that I've used in the past, as have other teams, to great effect: use your source control system to manage the differences via branches or separate repositories.
First things first, please use a source control system of some sort. The more advanced ones, such as git, hg, Subversion, Bazaar, ClearCase, BitKeeper, Darcs, Perforce, etc., all support some form of merge tracking that can make this specific problem significantly easier to solve. (Whether or not you decide to use branches this way, please do use a source control system.)
You would maintain a main branch (call it trunk or master if you wish) with almost all your development work, bug fixing, and so on. Branch from this all the code specific to given customers. You might prefer to branch based on features rather than on customer names, as you might have ten customers who need SCTP support, and the other twenty customers want TCP support instead. (And you want to charge more for the more exotic protocol -- so just merging it all into a configuration option would be Bad For Business.)
When it comes time to deliver new products, you switch to the specialized branches (or repositories) and use git pull (different repositories), git merge (different branches), svn merge, or similar commands to pull updates from your main development trunk into your specialized branches. Depending on whether you worked on code that conflicts, you might have some merge conflicts to repair, or it might go very cleanly. Fix up any merge conflicts, commit to your branch, and move on to the next branch.
You're liable to wind up with a "tree" that looks something like this:
        I--J--K               supremely magic feature
       /     /
  D--E--F--L--M--N            magic feature
 /     /     /
A--B--C--G--H--O--P--Q        main trunk
Getting too many branches is one quick path to madness; be sure to commit bug fixes and shared features to the main trunk if you can, and if you can't, as close to it as possible. That will make merging them back into all descendant branches much easier. It is probably best to do the merging as soon as you write the new feature or bug fix, while it is freshest in your mind.
Maintaining multiple versions of code can drive you insane, though -- when you can make your code more modular and simply deliver different 'modules' of code to your users, you can drastically reduce the complexity you need to manage. Programming to a modularized interface takes discipline, and the amount of up-front work can be prohibitive, but it is often easier to manage one tool with five plugins than the 2^5 == 32 different combinations of those five features for your customers who would pick and choose among them.
You'll come to recognize the sorts of modifications that make the most sense in a per-customer branch and the modifications that should be maintained as modules in your trunk. The case isn't always clear-cut: take the approach that leads to the lowest amount of overall complexity, as best you can judge.

Python design question

I'm a C programmer and I'm getting quite good with Python. But I still have some problems getting my mind around the OO awesomeness of Python.
Here is my current design problem:
The end "product" is a JSON data structure created in Python (and passed to Javascript code) containing different types of data like:
{ type: url, {urlpayloaddict} }
{ type: text, {textpayloaddict} }
...
My Javascript knows how to parse and display each type of JSON response.
I'm happy with this design. My question comes from handling this data in the Python code.
I obtain my data from a variety of sources: MySQL, a table lookup, an API call to a web service...
Basically, should I make a superclass responseElement and specialise it for each type of response, then pass around a list of these objects in the Python code, OR should I simply pass around a list of dictionaries that contain the response data in key-value pairs? The two answers lead to significantly different implementations.
I'm a bit unsure whether I'm getting too object-happy?
In my mind, it basically goes like this: you should try to keep things the same where they are the same, and separate them where they're different.
If you're performing the exact same operations on and with the data, and it can all be represented in a common format, then there's no reason to have separate objects for it - translate it into a common format ASAP and Don't Repeat Yourself when it comes to implementing the things that don't differ.
If each type/source of data requires specialized operations specific to it, and there isn't much in the way of overlap between such at the layer your Python code is dealing with, then keep things in separate objects so that you maintain a tight association between the specialized code and the specific data on which it is able to operate.
Do the different response sources represent fundamentally different categories or classes of objects? They don't appear to, the way you've described it.
Thus, various encode/decode functions and passing around only one type seems the best solution for you.
That type can be a dict or your own class, if you have special methods to use on the data (but those methods would then not care what input and output encodings were), or you could put the encode/decode pairs into the class. (Decode would be a classmethod, returning a new instance.)
Your receiver objects (which can perfectly well be instances of different classes, perhaps generated by a Factory pattern depending on the source of incoming data) should all have a common method that returns the appropriate dict (or other directly-JSON'able structure, such as a list that will turn into a JSON array).
Differently from what one answer states, this approach clearly doesn't require higher level code to know what exact kind of receiver it's dealing with (polymorphism will handle that for you in any OO language!) -- nor does the higher level code need to know "names of keys" (as, again, that other answer peculiarly states), as it can perfectly well treat the "JSON'able data" as a pretty opaque data token (as long as it's suitable to be the argument for a json.dumps later call!).
Building up and passing around a container of "plain old data" objects (produced and added to the container in various ways) for eventual serialization (or other such uniform treatment, but you can see JSON translation as a specific form of serialization) is a common OO pattern. No need to carry around anything richer or heavier than such POD data, after all, and in Python using dicts as the PODs is often a perfectly natural implementation choice.
I've had success with the OOP approach. Consider a base class with a "ToJson" method and have each subclass implement it appropriately. Then your higher-level code doesn't need to know any details about how the data was obtained... it just knows it has to call "ToJson" on every object in the list you mentioned.
A dictionary would work too, but it requires your calling code to know the names of keys, etc., and won't scale as well.
OOP I say!
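A minimal sketch of that approach (using to_json_dict as the common method; the class names and keys are illustrative):

import json

class ResponseElement:
    def to_json_dict(self):
        raise NotImplementedError

class UrlResponse(ResponseElement):
    def __init__(self, url, title):
        self.url, self.title = url, title

    def to_json_dict(self):
        return {"type": "url", "data": {"url": self.url, "title": self.title}}

class TextResponse(ResponseElement):
    def __init__(self, text):
        self.text = text

    def to_json_dict(self):
        return {"type": "text", "data": {"text": self.text}}

# the higher-level code only needs the common method:
elements = [UrlResponse("http://example.com", "Example"), TextResponse("hello")]
payload = json.dumps([e.to_json_dict() for e in elements])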
Personally, I opt for the latter (passing around a list of data) wherever and whenever possible. I think OO is often misused/abused for certain things. I specifically avoid wrapping data in an object just for the sake of wrapping it in an object. So this, {'type': 'url', 'data': {some_other_dict}}, is better to me than:
class DataObject:
    def __init__(self):
        self.type = 'url'
        self.data = {some_other_dict}
But, if you need to add specific functionality to this data, like the ability for it to sort its data.keys() and return them as a set, then creating an object makes more sense.
