Here is the context for the question. There is a top-level module (see designs below) containing code, that as the name suggests, manages an ETL pipeline. A directory is created called etl_helpers that organizes several modules responsible for making up the pipeline. The discussion concerns the Python language, which supports OOP as well as Structural/Functional approaches to programming. I hope it is obvious that I left out the directory init files for convenience.
I am interested in opinions on which design adheres best to standard architectural practices and SOLID principles. I understand that this is one of those topics where people may have strong opinions one way or the other. I am interested in those opinions.
Allow me to give my thoughts. First, I don't think there would be much difference if I was using OOP for the functionality, or using a structural programming paradigm. A structural paradigm in my opinion, along the lines of Rich Hickey's comments on simple versus complex, would be a simpler implementation. In this case there is no reason to create a construct with state. So let's assume the code is structural and not OOP.
I would go with Design I. Succinctly stated, Design I supports readability and maintainability at least as well as, if not better than, the other designs.
The directory structure preserves the level of abstraction in the code (readability). Design III puts the three files in one module that sits at the same level of abstraction as the manage_the_etl_pipeline.py module. Design II puts the files one level of abstraction lower, where they belong, but collects them into the same module. In this case, if using OOP, with classes encapsulating the object, my objections would not be as strong.
The goal of the SOLID principles is the creation of mid-level software structures that (Software Architecture: SA Martin):
Tolerate change,
Are easy to understand, and
Are the basis of components that can be used in many software systems.
I think Design I best adheres to these principles.
I could point to the Single Responsibility Principle which is defined as (SA Martin): a module should be responsible to one, and only one, actor. It should satisfy the Liskov Substitution Principle as well. Further, each module in the etl_helpers directory is at the same level of abstraction.
I could also mention that as Dijkstra stressed, at every level, from the smallest function to the largest component, software is like a science and, therefore, is driven by falsifiability. Software architects strive to define modules, components, and services that are easily falsifiable (testable). To do so, they employ restrictive disciplines similar to structured programming, albeit at a much higher level (SA Martin).
One last point. "Everything" in Python is a namespace: Built-in, global, function, enclosing, and user namespaces, e.g. dictionaries, SimpleNamespace, dataclasses. Namespaces group similar "things" together. Creating a Global namespace like Design II and III, confounds it.
I have expressed some of the reasons why Design I might be preferred, but what are the compelling reasons, if there are any, that would suggest another design is superior?
Finally, let me reference an interesting research paper I read recently that seems to support the other designs as anti-patterns: Architecture_Anti-patterns_Automatically.pdf. It can be found on the web.
SEVERAL DESIGNS FOR COMPARISON
DESIGN I:
manage_the_etl_pipeline.py
-- etl_helpers
extract.py
transform.py
load.py
Of course one could also
DESIGN II:
manage_the_etl_pipeline.py
-- etl_helpers
extract_transform_load.py
or probably even:
DESIGN III:
manage_the_etl_pipeline.py
extract_transform_load.py
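To make the comparison a bit more concrete, here is a rough sketch of how Design I might look in the structural (non-OOP) style assumed above. Everything in it is hypothetical: the function names extract, transform, load and run_pipeline, the CSV input, and the list used as a sink are illustrative assumptions, and the four modules are flattened into one listing only so that it runs as-is.

```python
# Sketch of Design I, flattened into one file for illustration only.
# In Design I each helper would live in its own module under etl_helpers/.
import csv
import io


# etl_helpers/extract.py (hypothetical contents)
def extract(source):
    """Read rows from a CSV text stream into a list of dicts."""
    return list(csv.DictReader(source))


# etl_helpers/transform.py (hypothetical contents)
def transform(rows):
    """Return new rows with 'amount' converted to float; input is not mutated."""
    return [{**row, "amount": float(row["amount"])} for row in rows]


# etl_helpers/load.py (hypothetical contents)
def load(rows, sink):
    """Append rows to a sink (a plain list stands in for a database)."""
    sink.extend(rows)
    return len(rows)


# manage_the_etl_pipeline.py (hypothetical contents)
def run_pipeline(source, sink):
    """Compose the three steps; this module only orchestrates."""
    return load(transform(extract(source)), sink)


if __name__ == "__main__":
    sink = []
    count = run_pipeline(io.StringIO("id,amount\n1,9.50\n2,3.25\n"), sink)
    print(count, sink)
```

Because each step is a stateless function, each can be tested in isolation, which is the testability argument made above. Whether the three functions sit in three files or one does not change this code, only where it lives.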
Referred to online literature. Looked in GoF Design Patterns, Software Architecture (Martin), Clean Code (Martin), Clean Code in Python.
I am new to Python with some experience in C++. (Unfortunately, with just two sample points, any pair of features is either uncorrelated or perfectly correlated.) In Python, the elements in the same list can have any types. In C++, the STL containers hold homogeneous types. (I suppose it is possible to mimic the flexibility of Python lists with a vector of void pointers.) The C++ STL facilitates generic programming, but Python lists have far more genericity. What causes this contrast? Is it difficult to design a language with a static type system and have something like a Python list?
More generally, I often have to resist the urge to think "Python has feature A and C++ has feature B and therefore Python does X and C++ does Y." Are there good choices of languages that provide good comparison and contrast with these two so that I can understand what features of programming languages are correlated and what features are orthogonal issues? Does a formal education in computer science teach the analogy of linguistics to programming languages? (If so how can I learn that?)
In Python, the elements in the same list can have any types. In C++, the STL containers hold homogeneous types. [...] The C++ STL facilitates generic programming, but Python lists have far more genericity. What causes this contrast? Is it difficult to design a language with a static type system and have something like a Python list?
The contrast is that Python uses duck typing, so it doesn't really care what you put into the list. On the other hand, C++ templates will figure out what types are being passed, instantiate the "skeleton" code with those types, then type-check the instantiated code.
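A small sketch of the Python side of that contrast. The classes Duck and Robot and the function speak_all are made up for illustration; the point is only that the list accepts anything and the check happens when the method is actually called.

```python
class Duck:
    def speak(self):
        return "Quack"


class Robot:
    def speak(self):
        return "Beep"


def speak_all(things):
    """Call .speak() on each element; no common base class is required."""
    return [thing.speak() for thing in things]


mixed = [Duck(), Robot()]          # unrelated types in one list
print(speak_all(mixed))

try:
    speak_all([Duck(), 42])        # int has no .speak(): fails only at run time
except AttributeError as exc:
    print("runtime error:", exc)
```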
It is possible to have a heterogenous vector/list in C++. std::vector<void*> is one possibility you mentioned. With C++17 and above, a more type-safe alternative is std::vector<std::any> (more type-safe because std::any stores type information).
However, generally there are approaches which are more type-safe and less error-prone. If all you care about is your data, collecting different types of data under a single container, then you can consider the functional programming approach which is to use sum types or tagged unions (e.g. std::vector<std::variant<x, y, z>>). If you care about the element's data members and member functions, then you can consider the OOP approach which uses class-based/inheritance-based polymorphism.
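For comparison, the tagged-union idea can also be sketched in Python itself. This is only an analogue of std::variant, not a translation of it: the names Circle, Rect, Shape and area are hypothetical, and the annotations are enforced by an external static checker (if you run one), not by Python at run time.

```python
from dataclasses import dataclass
from typing import List, Union
import math


@dataclass(frozen=True)
class Circle:
    radius: float


@dataclass(frozen=True)
class Rect:
    width: float
    height: float


# The closed set of alternatives, similar in spirit to std::variant<Circle, Rect>.
Shape = Union[Circle, Rect]


def area(shape: Shape) -> float:
    """Dispatch on the tag (here, the runtime class) of the value."""
    if isinstance(shape, Circle):
        return math.pi * shape.radius ** 2
    if isinstance(shape, Rect):
        return shape.width * shape.height
    raise TypeError(f"not a Shape: {shape!r}")


shapes: List[Shape] = [Circle(1.0), Rect(2.0, 3.0)]
print([area(s) for s in shapes])
```

The list is heterogeneous, but only over a declared set of types, which is roughly the middle ground between a plain Python list and a homogeneous STL container.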
Are there good choices of languages that provide good comparison and contrast with these two so that I can understand what features of programming languages are correlated and what features are orthogonal issues?
I will try to provide an objective answer to this.
Rarely will you see a PL stuff itself with every single feature. It's just unfeasible to maintain and apply in industry. PL maintainers and committees will make decisions and some feature proposals will be kicked out. Different PLs are built differently (e.g. some use the actor model for better concurrency, some use garbage collectors, some are designed more functionally). So just learn and expose yourself to different PLs. Choose from different paradigms (object-oriented, functional, concurrent).
C++ and Python are a good first step. Learning C++, you get an understanding of pointers, OOP, and types. Learning Python, you get a feel for how convenient and quick programming can be. Neither language is without its weaknesses. While coding in Python, you might've realised: "Oh, I don't need to worry about pointers and memory leaks since it's all done under the hood!". But a few moments later: "Aiya, I returned the wrong type in this branch, why didn't Python warn me earlier?"
According to tutorialspoint.com, Python is a functional programming language.
"Some of the popular functional programming languages include: Lisp, Python, Erlang, Haskell, Clojure, etc."
https://www.tutorialspoint.com/functional_programming/functional_programming_introduction.htm
But other sources say Python is an object-oriented programming language (you can create objects in Python).
So is Python both?
If so, if you're trying to program something that requires lots of mathematical computations, would Python still be a good choice (Since functional languages have concurrency, better syntax for math, and higher-level functions)?
Python, like many others, is a multi-paradigm language. You can use it as a fairly strictly imperative language, you can use it in a more object-oriented way, and you can use it in a more functional way. One important thing to note, though, is that functional is generally contrasted with imperative, whereas object-oriented tends to exist at a different level and can be "layered over" a more imperative or a more functional core.
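As an illustration, here is the same small task (summing the squares of the even numbers) written three ways. The function and class names are invented for this sketch, and it is not meant to suggest that one style is the right one.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]


# Imperative: explicit loop and a mutable accumulator.
def sum_even_squares_imperative(values):
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v * v
    return total


# Object-oriented: state and behaviour bundled in a class.
class EvenSquareSummer:
    def __init__(self):
        self.total = 0

    def add(self, value):
        if value % 2 == 0:
            self.total += value * value
        return self


# Functional: composition of filter, map and reduce, with no mutation.
def sum_even_squares_functional(values):
    evens = filter(lambda v: v % 2 == 0, values)
    squares = map(lambda v: v * v, evens)
    return reduce(lambda acc, v: acc + v, squares, 0)


summer = EvenSquareSummer()
for n in numbers:
    summer.add(n)

print(sum_even_squares_imperative(numbers), summer.total,
      sum_even_squares_functional(numbers))
```

Note that the object-oriented version is still imperative inside its method, which is the "layered over" point made above.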
However Python is largely an imperative and object oriented language: much of the builtins and standard library are really built around classes and objects, and it doesn't encourage the sort of thinking which functional languages generally drive the user to.
In fact going through the (fairly terrible) list the article you link to provides, Python lands on the OOP side of more or less all of them:
it doesn't use immutable data much (it's not really possible to define immutable types in pure python, most of the collections are mutable, and the ones which are not are not designed for functional updates)
its execution model is very imperative
it has limited support for parallel programming
its functions very much do have side-effects
flow control is absolutely not done using function calls
it's not a language which encourages recursion
execution order is very relevant and quite strictly defined
Then again, much of the article is nonsense. If that is typical of that site, I'd recommend using something else.
If so, if you're trying to program something very mathematical and computational, would Python still be a good choice
Well Python is a pretty slow language in and of itself, but at the same time it has a very large and strong ecosystem of scientific libraries. It's probably not the premier language for abstract mathematics (it's rather bad at symbolic manipulation) but it tends to be a relatively good glue or prototyping tool.
As functional languages are more suitable for mathematical stuff
Not necessarily. But not knowing what you actually mean by "mathematical stuff" it's hard to judge. Do you mean symbolic manipulations? Statistics? Hard computations? Something else entirely?
I am researching how to implement a DSL in Python and I am looking for a small DSL that is friendly for someone who has no experience with designing and implementing languages. So far, I have reviewed two implementations, Hy and Mochi. Hy is actually a dialect of Lisp and Mochi seems to be very similar to Elixir. Both are too complex for me right now, as my aim is to prototype the language and play around with it in order to find out whether it really helps in solving the problem and fits the style that the problem requires. I am aware that Python has good support via the language tools provided in the standard library. So far I have implemented a dialect of Lisp which is very simple indeed. I did not use the Python AST at all; it is implemented purely via string processing, which is not flexible enough for what I am looking for.
Are there any implementations other than the two languages mentioned above that are small enough to be studied?
What are some good books on this subject (practical, in the sense that they do not stick only to the theoretical and academic aspects)?
What would be a good way into studying the Python AST and using it?
Are there any significant performance issues with languages built upon Python (like Hy), in terms of overhead in the bytecode that is actually produced?
Thanks
You can split the task of creating (yet another!) new language into at least two big steps:
Syntax
Semantics & Interpretation
Syntax
You need to define a grammar for your language, with production rules that specify how to create complex expressions from simple ones.
Example: syntax for LISP:
expression ::= atom | list
atom ::= number | symbol
number ::= [+-]?['0'-'9']+
symbol ::= ['A'-'Z''a'-'z'].*
list ::= '(' expression* ')'
How to read it: an expression is either an atom or a list; an atom is a number or a symbol; a number is... and so on.
Often you will also define some tokenization rules, because most grammars work at the token level, not at the character level.
Once you have defined your grammar, you want a parser that, given a sentence (a program), is able to build the derivation tree, or the abstract syntax tree.
For example, for the expression x=f(y+1)+2, you want to obtain the tree:
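One way to inspect such a tree without writing a parser, assuming Python's own grammar is an acceptable stand-in for the language being designed, is the standard ast module. For this expression it builds an assignment at the root, whose value is an addition of the call f(y+1) and the constant 2. A minimal sketch:

```python
import ast

tree = ast.parse("x=f(y+1)+2")
assign = tree.body[0]              # the single statement: an assignment

print(type(assign).__name__)                # Assign
print(type(assign.value).__name__)          # BinOp  (f(y+1) + 2)
print(type(assign.value.left).__name__)     # Call   (f(y+1))
print(ast.dump(tree))                       # full tree; exact text varies by Python version
```

Walking the nodes this way (or with ast.walk and ast.NodeVisitor) is also a reasonable first step for studying the Python AST in general.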
There are several parsers (LL, LR, recursive descent, ...). You don't necessarily need to write your language parser by yourself, as there are tools that generate the parser from the grammar specification (LEX & YACC, Flex & Bison, JavaCC, ANTLR; also check this list of parsers available for Python).
If you want to skip the step of designing a new grammar, you may want to start from a simple one, like the grammar of LISP. There is even a LISP parser written in Python in the Pyperplan project. They use it for parsing PDDL, which is a domain specific language for planning that is based on LISP.
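To give a feel for how small such a parser can be, here is a hand-written tokenizer plus recursive descent parser for the LISP grammar shown earlier. It is a sketch under some assumptions, not the Pyperplan parser: the function names are my own, lists become Python lists, numbers become int, and any non-numeric token is accepted as a symbol (so it is more permissive than the symbol rule above, which requires a leading letter).

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")


def tokenize(text):
    """Split source text into '(', ')' and atom tokens."""
    return TOKEN.findall(text)


def parse_atom(token):
    """atom ::= number | symbol"""
    if re.fullmatch(r"[+-]?[0-9]+", token):
        return int(token)
    return token  # anything else is kept as a symbol (a plain str)


def parse_expression(tokens, pos=0):
    """expression ::= atom | list. Returns (tree, next_position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        items = []
        pos += 1
        while pos < len(tokens) and tokens[pos] != ")":
            item, pos = parse_expression(tokens, pos)
            items.append(item)
        if pos >= len(tokens):
            raise SyntaxError("missing ')'")
        return items, pos + 1  # skip the closing ')'
    if token == ")":
        raise SyntaxError("unexpected ')'")
    return parse_atom(token), pos + 1


def parse(text):
    """Parse exactly one expression and reject trailing input."""
    tokens = tokenize(text)
    tree, pos = parse_expression(tokens)
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after expression")
    return tree


print(parse("(define (inc x) (+ x 1))"))
# ['define', ['inc', 'x'], ['+', 'x', 1]]
```

Each grammar rule maps to one function, which is what makes recursive descent a friendly starting point before reaching for a parser generator.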
Useful readings:
Book: Compilers: Principles, Techniques, and Tools by Alfred Aho, Jeffrey Ullman, Monica S. Lam, and Ravi Sethi, also known as The Dragon Book (because of the dragon pictured on the cover)
https://en.wikipedia.org/?title=Syntax_(programming_languages)
https://en.wikibooks.org/wiki/Introduction_to_Programming_Languages/Grammars
How to define a grammar for a programming language
https://en.wikipedia.org/wiki/LL_parser
https://en.wikipedia.org/wiki/Recursive_descent_parser
https://en.wikipedia.org/wiki/LALR_parser
https://en.wikipedia.org/wiki/LR_parser
Semantics & Interpretation
Once you have the abstract syntax tree of your program, you want to execute your program. There are several formalisms for specifying the "rules" to execute (pieces of) programs:
Operational semantics: a very popular one. It is classified in two categories:
Small Step Semantics: describe individual steps of computation
Big Step Semantics: describe the overall results of computation
Reduction semantics: a formalism based on lambda calculus
Transition semantics: if you look at your interpreter like a transition system, you can specify its semantics using transition semantics. This is especially useful for programs that do not terminate (i.e. run continuously), like controllers.
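The difference between small-step and big-step semantics is easier to see on a toy language. The sketch below assumes a made-up expression format (an int, or a tuple (op, left, right) with op being "+" or "*"); the names big_step, small_step and trace are mine and not taken from any of the readings.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def big_step(expr):
    """Big-step: map an expression directly to its final value."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    return OPS[op](big_step(left), big_step(right))


def small_step(expr):
    """Small-step: perform exactly one reduction, leftmost operand first."""
    op, left, right = expr
    if not isinstance(left, int):
        return (op, small_step(left), right)
    if not isinstance(right, int):
        return (op, left, small_step(right))
    return OPS[op](left, right)


def trace(expr):
    """Apply small_step repeatedly, recording every intermediate expression."""
    steps = [expr]
    while not isinstance(expr, int):
        expr = small_step(expr)
        steps.append(expr)
    return steps


program = ("+", ("*", 2, 3), ("+", 4, 1))
print(big_step(program))        # 11
for step in trace(program):
    print(step)
```

Both agree on the final value; the small-step version additionally exposes each intermediate state, which is what you need if you want to reason about programs that never finish.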
Useful readings:
Book: A Structural Approach to Operational Semantics by Gordon D. Plotkin
Book: Structure and Interpretation of Computer Programs by Gerald Jay Sussman and Hal Abelson
Book: Semantics Engineering with PLT Redex (SEwPR) by Matthias Felleisen, Robert Bruce Findler, and Matthew Flatt
https://en.wikipedia.org/wiki/Semantics_(computer_science)
https://en.wikipedia.org/wiki/Operational_semantics
https://en.wikipedia.org/wiki/Transition_system
https://en.wikipedia.org/wiki/Kripke_structure_(model_checking)
https://en.wikipedia.org/wiki/Hoare_logic
https://en.wikipedia.org/wiki/Lambda_calculus
You don't really need to know a lot about parsing to write your own language.
I wrote a library that lets you do just that very easily: https://github.com/erezsh/lark
Here's a blog post by me explaining how to use it to write your own language: http://blog.erezsh.com/how-to-write-a-dsl-in-python-with-lark/
I hope you don't mind my shameless plug, but it seems very relevant to your question.
Say I want to write a large application in Groovy, and take advantage of closures, categories and other concepts (that I regularly use to separate concerns). Is there a way to diagram or otherwise communicate in a simple way the architecture of some of this stuff? How do you detail (without verbose documentation) the things that a map of closures might do, for example? I understand that dynamic language features aren't usually recommended on a larger scale because they are seen as complex, but does that have to be the case?
UML isn't too well equipped to handle such things, but you can still use it to communicate your design if you are willing to do some mental mapping. You can find an isomorphism between most dynamic concepts and UML's static object model.
For example you can think of a closure as an object implementing a one method interface. It's probably useful to model such interfaces as something a bit more specific than interface Callable { call(args[0..*]: Object) : Object }.
Duck typing can similarly be thought of as an interface. If you have a method that takes something that can quack, model it as taking an object that is a specialization of the interface interface Quackable { quack() }.
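The same mapping can be written down in code. The sketch below uses Python rather than Groovy, purely as an illustration of the idea: typing.Protocol plays the role of the Quackable interface, and a Callable annotation plays the role of the one-method interface of a closure. Duck, make_greeter and make_it_quack are invented names.

```python
from typing import Callable, Protocol, runtime_checkable


@runtime_checkable
class Quackable(Protocol):
    """The implicit 'can quack' interface, written down explicitly."""

    def quack(self) -> str: ...


class Duck:
    def quack(self) -> str:
        return "Quack!"


def make_greeter(greeting: str) -> Callable[[str], str]:
    """Return a closure; its 'interface' is one method: call(name) -> str."""
    def greet(name: str) -> str:
        return f"{greeting}, {name}"
    return greet


def make_it_quack(thing: Quackable) -> str:
    return thing.quack()


print(make_it_quack(Duck()))
print(isinstance(Duck(), Quackable))   # structural check, no inheritance
print(make_greeter("Hello")("world"))
```

Duck never inherits from Quackable, yet it satisfies it, which is exactly the relationship you would draw in the diagram.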
You can use your imagination for other concepts. Keep in mind that the purpose of design diagrams is to communicate ideas. So don't get overly pedantic about modeling everything 100%. Think about what you want your diagrams to say, make sure that they say that, and eliminate any extraneous detail that would dilute the message. And if you use some concepts that aren't obvious to your target audience, explain them.
Also, if UML really can't handle what you want to say, try other ways to visualize your message. UML is only a good choice because it gives you a common vocabulary so you don't have to explain every concept on your diagram.
If you don't want to generate verbose documentation, a picture is worth a thousand words. I've found tools like FreeMind useful, both for clarifying my ideas and for communicating them to others. And if you are willing to invest in a medium (or at least higher) level of documentation, I would recommend Sphinx. It is pretty easy to use, and although it's oriented towards documentation of Python modules, it can generate completely generic documentation which looks professional and easy on the eye. Your documentation can contain diagrams such as are created using Graphviz.
When using a multi-paradigm language such as Python, C++, D, or Ruby, how much do you mix paradigms within a single application? Within a single module? Do you believe that mixing the functional, procedural and OO paradigms at a fine granularity leads to clearer, more concise code because you're using the right tool for every subproblem, or an inconsistent mess because you're doing similar things 3 different ways?
Different paradigms mix in different ways. For example, using OOP doesn't eliminate the use of subroutines and procedural code from an outside library. It merely moves the procedures around into a different place.
It is impossible to purely program with one paradigm. You may think you have a single one in mind when you program, but that's your illusion. Your resultant code will land along the borders and within the bounds of many paradigms.
I am not sure that I ever think about it like this.
Once you start "thinking in Ruby" the multi-paradigms just merge into ... well, Ruby.
Ruby is object-oriented, but I find that other things such as the functional aspect tend to mean that some of the "traditional" design patterns present in OO languages are just simply not relevant. The iterator is a classic example ... iteration is something that is handled elegantly in Ruby and the heavy-weight OO iteration patterns no longer really apply. This seems to be true throughout the language.
Mixing paradigms has the advantage of letting you express solutions in the most natural and easy way, which is a very good thing when it helps keep your program logic smaller. For example, filtering a list by some criterion is several times simpler to express with a functional solution than with a traditional loop.
On the other hand, to benefit from mixing two or more paradigms, a programmer should be reasonably fluent in all of them. So this is a powerful tool that should be used with care.
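The list-filtering example mentioned above, sketched in Python with a made-up word list and criterion (length greater than 4):

```python
words = ["apple", "fig", "banana", "kiwi", "cherry"]

# Traditional loop: build the result step by step.
long_words_loop = []
for word in words:
    if len(word) > 4:
        long_words_loop.append(word)

# Functional style: state the criterion and let filter() apply it.
long_words_filter = list(filter(lambda word: len(word) > 4, words))

# List comprehension: Python's idiomatic middle ground.
long_words_comp = [word for word in words if len(word) > 4]

print(long_words_loop == long_words_filter == long_words_comp)
```

All three produce the same list; how much shorter the functional form feels depends on the language and on how fluent the reader is with it.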
Different problems require different solutions, but it helps if you solve things the same way in the same layer. Varying too wildly will just confuse you and everyone else in the project.
For C++, I've found that statically typed OOP (use zope.interface in Python) works well for higher-level parts (connecting, updating, signaling, etc.) and functional stuff solves many lower-level problems (parsing, nuts 'n bolts data processing, etc.) more nicely.
And usually, a dynamically typed scripting system is good for selecting and configuring the specific app, game level, whatnot. This may be the language itself (i.e. Python) or something else (an xml-script engine + necessary system for dynamic links in C++).