How do I refactor to avoid passing functions? - python

I currently have code that acquires and manipulates data from multiple sources using pandas DataFrames. The intent is for a user to create an instance of a class (call it dbase) which provides methods to do things like acquire and store data from API queries. I'm doing this by allowing the user to define their own functions to format values in dbase, but I've found that I tend to pass those user-defined functions through several other functions in ways that get confusing. I think this must be an obvious mistake to someone who knows what they're doing but I haven't come up with a better way to give the user control of the data.
The API queries are the worst example right now. Say I want to get a name from a server. Right now I do something like the following, in which the user-defined function for transforming the name gets passed across three other functions before it's called.
# file with code for api interaction
def submitter(this_query, dbase, name_mangler):
new_data = api.submit(this_query)
new_dbase_entry = name_mangler(new_data)
# in reality there is much more complicated data transformation here
dbase.update(new_dbase_entry)
def query_api(dbase, meta, name_mangler):
queries = make_query_strings(dbase, meta)
# using pandas.DataFrame.apply() here to avoid a for loop
queries.apply(lambda x: submitter(x, dbase))
# other file with class definition
from api_code import query_api
class dbase():
__init__():
self.df = pandas.DataFrame()
# data gets moved around between more than one data
# structure in this class, I'm just using a single
# dataframe as a minimal example
def get_remote_data(self, meta, name_mangler):
# in reality there is code here to handle multiple
# cases here rather than a trivial wrapper for another
# function
query_api(self, meta, name_mangler)
def update(self, new_data):
# do consistency checks
# possibly write new dbase entry
A user would then do something like this
import dbase
def custom_mangler(name):
# User determines how to store the name in dbase
# for instance this could take "Grace Hopper" to "hopper"
return(mangled_name)
my_dbase = dbase.dbase()
# meta defines what needs to be queried and how the remote
# data should get processed into dbase
meta = {stuff}
my_dbase.get_remote_data(meta, custom_mangler)
I find it very hard to follow my own code here because the definitions of functions can be widely separated from the first point at which they're called. How should I refactor to address this problem? (and does this approach violate accepted coding patterns for other reasons?)

It's a little hard to infer context from what you've posted, so take this with a grain of salt. The general concepts still apply. Also take a look at https://codereview.stackexchange.com/ as this question might be a better fit for that site.
Two things come to mind.
Try to give your functions/classes/variables better names
Think about orthogonality
Good Names
Consider how this looks from a users perspective. dbase is not a very descriptive name for either the module or the class. meta doesn't tell me at all what the dict should contain. mangler tells me that the string gets changed, but nothing about where the string comes from or how it should be changed.
Good names are hard, but it's worth spending time to make them thoughtful. It's always a trade off between being descriptive and overly verbose. If you can't think of a name that gives clear meaning without taking up too much space, then consider if your API is overly complex. Always consider names from the end users perspective as well as future programmers who will be reading/maintaining your code.
Orthogonality
Following the Unix mantra of "do one thing and do it well", sometimes an API is simpler and more flexible if we separate out different tasks to different functions rather than having one function that does it all.
When writing code, I think "what is the minimum this function needs to do to be useful".
In your example
my_dbase.get_remote_data(meta, custom_mangler)
get_remote_data not only fetches the data, but also processes it. That can be confusing as a user. There's a lot happening behind the scenes in this function that isn't obvious from the function name.
It might be more appropriate to have separate function calls for this. Let's assume that you're querying weather servers about temperature and rainfall.
london_weather_data = weatheraggrigator.WeatherAggrigator()
reports = london_weather_data.fetch_weather_reports(sources=[server_a, server_b])
london_weather_data.process_reports(reports, short_name_formatter)
Yes it's longer to type, but as a user it's a big improvement as I know what I'm getting.
Ultimately you need to decide where to split up tasks. The above may not make sense for your application.

Related

Defensive programming against client code if I myself implement the client code and the Interface?

I'm writing an application in Python. Somewhere, I write a model layer composed with classes, and, any of them, say:
class Orders:
has a bound method whose signature is :
def delete(self, id=None, name=None):
I need id or name not to be None. Then, here is my question:
Since I myself write the model layer and the client code, which would be something like:
model.Orders().delete(id=1234)
Would be good practice to check the parameters inside the delete method with code like:
if not any((id, name)):
raise ValueError('Order Id or Order name must be provided')
?
I think, since I write the two sides, I will always call the delete method providing one of the parameters avoiding checking and saving a lot of code. I expect the model layer will reach 2 or 3 thousand lines, interfacing a database server.
I put a specific example, but I'd appreciate an answer for the general case.
This was discussed here and here.
I would add, however, that since Python does not support mutually exclusive arguments by default, the clean way would be to provide two public interfaces: delete_by_id, delete_by_name; and these ones call a private method _delete. This way the facade to users of the class is clear and enforces the arguments nicely, while from within the class you reuse the code.
In general case, according to me, it is always a good practice to prevent error and misuse in your code. It is true that you can save a lot of coding but what is the price? Someday in the future you will be using this code not remembering to provide one of the two parameters and you can lose hours in debuging. Or maybe you could opensource your code and someone trying to use it may have some problem.

How to determine main program vs class [duplicate]

I have been programming in python for about two years; mostly data stuff (pandas, mpl, numpy), but also automation scripts and small web apps. I'm trying to become a better programmer and increase my python knowledge and one of the things that bothers me is that I have never used a class (outside of copying random flask code for small web apps). I generally understand what they are, but I can't seem to wrap my head around why I would need them over a simple function.
To add specificity to my question: I write tons of automated reports which always involve pulling data from multiple data sources (mongo, sql, postgres, apis), performing a lot or a little data munging and formatting, writing the data to csv/excel/html, send it out in an email. The scripts range from ~250 lines to ~600 lines. Would there be any reason for me to use classes to do this and why?
Classes are the pillar of Object Oriented Programming. OOP is highly concerned with code organization, reusability, and encapsulation.
First, a disclaimer: OOP is partially in contrast to Functional Programming, which is a different paradigm used a lot in Python. Not everyone who programs in Python (or surely most languages) uses OOP. You can do a lot in Java 8 that isn't very Object Oriented. If you don't want to use OOP, then don't. If you're just writing one-off scripts to process data that you'll never use again, then keep writing the way you are.
However, there are a lot of reasons to use OOP.
Some reasons:
Organization:
OOP defines well known and standard ways of describing and defining both data and procedure in code. Both data and procedure can be stored at varying levels of definition (in different classes), and there are standard ways about talking about these definitions. That is, if you use OOP in a standard way, it will help your later self and others understand, edit, and use your code. Also, instead of using a complex, arbitrary data storage mechanism (dicts of dicts or lists or dicts or lists of dicts of sets, or whatever), you can name pieces of data structures and conveniently refer to them.
State: OOP helps you define and keep track of state. For instance, in a classic example, if you're creating a program that processes students (for instance, a grade program), you can keep all the info you need about them in one spot (name, age, gender, grade level, courses, grades, teachers, peers, diet, special needs, etc.), and this data is persisted as long as the object is alive, and is easily accessible. In contrast, in pure functional programming, state is never mutated in place.
Encapsulation:
With encapsulation, procedure and data are stored together. Methods (an OOP term for functions) are defined right alongside the data that they operate on and produce. In a language like Java that allows for access control, or in Python, depending upon how you describe your public API, this means that methods and data can be hidden from the user. What this means is that if you need or want to change code, you can do whatever you want to the implementation of the code, but keep the public APIs the same.
Inheritance:
Inheritance allows you to define data and procedure in one place (in one class), and then override or extend that functionality later. For instance, in Python, I often see people creating subclasses of the dict class in order to add additional functionality. A common change is overriding the method that throws an exception when a key is requested from a dictionary that doesn't exist to give a default value based on an unknown key. This allows you to extend your own code now or later, allow others to extend your code, and allows you to extend other people's code.
Reusability: All of these reasons and others allow for greater reusability of code. Object oriented code allows you to write solid (tested) code once, and then reuse over and over. If you need to tweak something for your specific use case, you can inherit from an existing class and overwrite the existing behavior. If you need to change something, you can change it all while maintaining the existing public method signatures, and no one is the wiser (hopefully).
Again, there are several reasons not to use OOP, and you don't need to. But luckily with a language like Python, you can use just a little bit or a lot, it's up to you.
An example of the student use case (no guarantee on code quality, just an example):
Object Oriented
class Student(object):
def __init__(self, name, age, gender, level, grades=None):
self.name = name
self.age = age
self.gender = gender
self.level = level
self.grades = grades or {}
def setGrade(self, course, grade):
self.grades[course] = grade
def getGrade(self, course):
return self.grades[course]
def getGPA(self):
return sum(self.grades.values())/len(self.grades)
# Define some students
john = Student("John", 12, "male", 6, {"math":3.3})
jane = Student("Jane", 12, "female", 6, {"math":3.5})
# Now we can get to the grades easily
print(john.getGPA())
print(jane.getGPA())
Standard Dict
def calculateGPA(gradeDict):
return sum(gradeDict.values())/len(gradeDict)
students = {}
# We can set the keys to variables so we might minimize typos
name, age, gender, level, grades = "name", "age", "gender", "level", "grades"
john, jane = "john", "jane"
math = "math"
students[john] = {}
students[john][age] = 12
students[john][gender] = "male"
students[john][level] = 6
students[john][grades] = {math:3.3}
students[jane] = {}
students[jane][age] = 12
students[jane][gender] = "female"
students[jane][level] = 6
students[jane][grades] = {math:3.5}
# At this point, we need to remember who the students are and where the grades are stored. Not a huge deal, but avoided by OOP.
print(calculateGPA(students[john][grades]))
print(calculateGPA(students[jane][grades]))
Whenever you need to maintain a state of your functions and it cannot be accomplished with generators (functions which yield rather than return). Generators maintain their own state.
If you want to override any of the standard operators, you need a class.
Whenever you have a use for a Visitor pattern, you'll need classes. Every other design pattern can be accomplished more effectively and cleanly with generators, context managers (which are also better implemented as generators than as classes) and POD types (dictionaries, lists and tuples, etc.).
If you want to write "pythonic" code, you should prefer context managers and generators over classes. It will be cleaner.
If you want to extend functionality, you will almost always be able to accomplish it with containment rather than inheritance.
As every rule, this has an exception. If you want to encapsulate functionality quickly (ie, write test code rather than library-level reusable code), you can encapsulate the state in a class. It will be simple and won't need to be reusable.
If you need a C++ style destructor (RIIA), you definitely do NOT want to use classes. You want context managers.
I think you do it right. Classes are reasonable when you need to simulate some business logic or difficult real-life processes with difficult relations.
As example:
Several functions with share state
More than one copy of the same state variables
To extend the behavior of an existing functionality
I also suggest you to watch this classic video
dantiston gives a great answer on why OOP can be useful. However, it is worth noting that OOP is not necessary a better choice most cases it is used. OOP has the advantage of combining data and methods together. In terms of application, I would say that use OOP only if all the functions/methods are dealing and only dealing with a particular set of data and nothing else.
Consider a functional programming refactoring of dentiston's example:
def dictMean( nums ):
return sum(nums.values())/len(nums)
# It's good to include automatic tests for production code, to ensure that updates don't break old codes
assert( dictMean({'math':3.3,'science':3.5})==3.4 )
john = {'name':'John', 'age':12, 'gender':'male', 'level':6, 'grades':{'math':3.3}}
# setGrade
john['grades']['science']=3.5
# getGrade
print(john['grades']['math'])
# getGPA
print(dictMean(john['grades']))
At a first look, it seems like all the 3 methods exclusively deal with GPA, until you realize that Student.getGPA() can be generalized as a function to compute mean of a dict, and re-used on other problems, and the other 2 methods reinvent what dict can already do.
The functional implementation gains:
Simplicity. No boilerplate class or selfs.
Easily add automatic test code right after each
function for easy maintenance.
Easily split into several programs as your code scales.
Reusability for purposes other than computing GPA.
The functional implementation loses:
Typing in 'name', 'age', 'gender' in dict key each time is not very DRY (don't repeat yourself). It's possible to avoid that by changing dict to a list. Sure, a list is less clear than a dict, but this is a none issue if you include an automatic test code below anyway.
Issues this example doesn't cover:
OOP inheritance can be supplanted by function callback.
Calling an OOP class has to create an instance of it first. This can be boring when you don't have data in __init__(self).
A class defines a real world entity. If you are working on something that exists individually and has its own logic that is separate from others, you should create a class for it. For example, a class that encapsulates database connectivity.
If this not the case, no need to create class
It depends on your idea and design. If you are a good designer, then OOPs will come out naturally in the form of various design patterns.
For simple script-level processing, OOPs can be overhead.
Simply consider the basic benefits of OOPs like reusability and extendability and make sure if they are needed or not.
OOPs make complex things simpler and simpler things complex.
Simply keep the things simple in either way using OOPs or not using OOPs. Whichever is simpler, use that.

Why use python classes over modules with functions?

Im teaching myself python (3.x) and I'm trying to understand the use case for classes. I'm starting to understand what they actually do, but I'm struggling to understand why you would use a class as opposed to creating a module with functions.
For example, how does:
class cls1:
def func1(arguments...):
#do some stuff
obj1 = cls1()
obj2 = cls1()
obj1.func1(arg1,arg2...)
obj2.func1(arg1,arg2...)
Differ from:
#module.py contents
def func1(arguments...):
#do some stuff
import module
x = module.func1(arg1,arg2...)
y = module.func1(arg1,arg2...)
This is probably very simple but I just can't get my head around it.
So far, I've had quite a bit of success writing python programs, but they have all been pretty procedural, and only importing basic module functions. Classes are my next biggest hurdle.
You use class if you need multiple instance of it, and you want that instances don't interfere each other.
Module behaves like a singleton class, so you can have only one instance of them.
EDIT: for example if you have a module called example.py:
x = 0
def incr():
global x
x = x + 1
def getX():
return x
if you try to import these module twice:
import example as ex1
import example as ex2
ex1.incr()
ex1.getX()
1
ex2.getX()
1
This is why the module is imported only one time, so ex1 and ex2 points to the same object.
As long as you're only using pure functions (functions that only works on their arguments, always return the same result for the same arguments set, don't depend on any global/shared state and don't change anything - neither their arguments nor any global/shared state - IOW functions that don't have any side effects), then classes are indeed of a rather limited use. But that's functional programming, and while Python can technically be used in a functional style, it's possibly not the best choice here.
As soon has you have to share state between functions, and specially if some of these functions are supposed to change this shared state, you do have a use for OO concepts. There are mainly two ways to share state between functions: passing the state from function to function, or using globals.
The second solution - global state - is known to be troublesome, first because it makes understanding of the program flow (hence debugging) harder, but also because it prevents your code from being reentrant, which is a definitive no-no for quite a lot of now common use cases (multithreaded execution, most server-side web application code etc). Actually it makes your code practically unusable or near-unusable for anything except short simple one-shot scripts...
The second solution most often implies using half-informal complex datastructures (dicts with a given set of keys, often holding other dicts, lists, lists of dicts, sets etc), correctly initialising them and passing them from function to function - and of course have a set of functions that works on a given datastructure. IOW you are actually defining new complex datatypes (a data structure and a set of operations on that data structure), only using the lowest level tools the language provide.
Classes are actually a way to define such a data type at a higher level, grouping together the data and operations. They also offer a lot more, specially polymorphism, which makes for more generic, extensible code, and also easier unit testing.
Consider you have a file or a database with products, and each product has product id, price, availability, discount, published at web status, and more values. And you have second file with thousands of products that contain new prices and availability and discount. You want to update the values and keep control on how many products will be change and other stats. You can do it with Procedural programming and Functional programming but you will find yourself trying to discover tricks to make it work and most likely you will be lost in many different lists and sets.
On the other hand with Object-oriented programming you can create a class Product with instance variables the product-id, the old price, the old availability, the old discount, the old published status and some instance variables for the new values (new price, new availability, new discount, new published status). Than all you have to do is to read the first file/database and for every product to create a new instance of the class Product. Than you can read the second file and find the new values for your product objects. In the end every product of the first file/database will be an object and will be labeled and carry the old values and the new values. It is easier this way to track the changes, make statistics and update your database.
One more example. If you use tkinter, you can create a class for a top level window and every time you want to appear an information window or an about window (with custom color background and dimensions) you can simply create a new instance of this class.
For simple things classes are not needed. But for more complex things classes sometimes can make the solution easier.
I think the best answer is that it depends on what your indented object is supposed to be/do. But in general, there are some differences between a class and an imported module which will give each of them different features in the current module. Which the most important thing is that class has been defined to be objects, this means that they have a lot of options to act like an object which modules don't have. For example some special attributes like __getattr__, __setattr__, __iter__, etc. And the ability to create a lot of instances and even controlling the way that they are created. But for modules, the documentation describes their use-case perfectly:
If you quit from the Python interpreter and enter it again, the
definitions you have made (functions and variables) are lost.
Therefore, if you want to write a somewhat longer program, you are
better off using a text editor to prepare the input for the
interpreter and running it with that file as input instead. This is
known as creating a script. As your program gets longer, you may want
to split it into several files for easier maintenance. You may also
want to use a handy function that you’ve written in several programs
without copying its definition into each program.
To support this, Python has a way to put definitions in a file and use
them in a script or in an interactive instance of the interpreter.
Such a file is called a module; definitions from a module can be
imported into other modules or into the main module (the collection of
variables that you have access to in a script executed at the top
level and in calculator mode).

How to fetch Riak object, change its value and store it back with all indexes in Python

I am using Riak database to store my Python application objects that are used and processed in parallel by multiple scripts. Because of that, I need to lock them in various places, to avoid being processed by more than one script at once, like that:
riak_bucket = riak_connect('clusters')
cluster = riak_bucket.get(job_uuid).get_data()
cluster['status'] = 'locked'
riak_obj = riak_bucket.new(job_uuid, data=cluster)
riak_obj.add_index('username_bin', cluster['username'])
riak_obj.add_index('hostname_bin', cluster['hostname'])
riak_obj.store()
The thing is, this is quite a bit of code to do one simple, repeatable thing, and given the fact locking occurs quite often, I would like to find a simpler, cleaner way of doing that. I've tried to write a function to do locking/unlocking, like that (for a different object, called 'build'):
def build_job_locker(uuid, status='locked'):
riak_bucket = riak_connect('builds')
build = riak_bucket.get(uuid).get_data()
build['status'] = status
riak_obj = riak_bucket.new(build['uuid'], data=build)
riak_obj.add_index('cluster_uuid_bin', build['cluster_uuid'])
riak_obj.add_index('username_bin', build['username'])
riak_obj.store()
# when locking, return the locked db object to avoid fetching it again
if 'locked' in status:
return build
else:
return
but since the objects are obviously quite different one from another, they've different indexes and so on, I ended up writing a locking function per every object... which is almost as much messy as not having the functions at all and repeating the code.
The question is: is there a way to write a general function to do so, knowing that every object has a 'status' field, that'd lock them in db retaining all indexes and other attributes? Or, perhaps, there is another, easier way I havent thought about?
After some more research, and questions asked on various IRC channels it seems that this is not doable, as there's no way to fetch this kind of metadata about objects from Riak.

Python design question

I'm a C programmer and I'm getting quite good with Python. But I still have some problems getting my mind around the OO awesomeness of Python.
Here is my current design problem:
The end "product" is a JSON data structure created in Python (and passed to Javascript code) containing different types of data like:
{ type:url, {urlpayloaddict) }
{ type:text, {textpayloaddict}
...
My Javascript knows how to parse and display each type of JSON response.
I'm happy with this design. My question comes from handling this data in the Python code.
I obtain my data from a variety of sources: MySQL, a table lookup, an API call to a web service...
Basically, should I make a super class responseElement and specialise it for each type of response, then pass around a list of these objects in the Python code OR should I simply pass around a list of dictionaries that contain the response data in key value pairs. The answer seems to result in significantly different implementations.
I'm a bit unsure if I'm getting too object happy ??
In my mind, it basically goes like this: you should try to keep things the same where they are the same, and separate them where they're different.
If you're performing the exact same operations on and with the data, and it can all be represented in a common format, then there's no reason to have separate objects for it - translate it into a common format ASAP and Don't Repeat Yourself when it comes to implementing things that don't distinguish.
If each type/source of data requires specialized operations specific to it, and there isn't much in the way of overlap between such at the layer your Python code is dealing with, then keep things in separate objects so that you maintain a tight association between the specialized code and the specific data on which it is able to operate.
Do the different response sources represent fundamentally different categories or classes of objects? They don't appear to, the way you've described it.
Thus, various encode/decode functions and passing around only one type seems the best solution for you.
That type can be a dict or your own class, if you have special methods to use on the data (but those methods would then not care what input and output encodings were), or you could put the encode/decode pairs into the class. (Decode would be a classmethod, returning a new instance.)
Your receiver objects (which can perfectly well be instances of different classes, perhaps generated by a Factory pattern depending on the source of incoming data) should all have a common method that returns the appropriate dict (or other directly-JSON'able structure, such as a list that will turn into a JSON array).
Differently from what one answer states, this approach clearly doesn't require higher level code to know what exact kind of receiver it's dealing with (polymorphism will handle that for you in any OO language!) -- nor does the higher level code need to know "names of keys" (as, again, that other answer peculiarly states), as it can perfectly well treat the "JSON'able data" as a pretty opaque data token (as long as it's suitable to be the argument for a json.dumps later call!).
Building up and passing around a container of "plain old data" objects (produced and added to the container in various ways) for eventual serialization (or other such uniform treatment, but you can see JSON translation as a specific form of serialization) is a common OO pattern. No need to carry around anything richer or heavier than such POD data, after all, and in Python using dicts as the PODs is often a perfectly natural implementation choice.
I've had success with the OOP approach. Consider a base class with a "ToJson" method and have each subclass implement it appropriately. Then your higher level code doesn't need to know any detail about how the data was obtained...it just knows it has to call "ToJson" on every object in the list you mentioned.
A dictionary would work too, but it requires your calling code to know names of keys, etc and won't scale as well.
OOP I say!
Personally, I opt for the latter (passing around a list of data) wherever and whenever possible. I think OO is often misused/abused for certain things. I specifically avoid things like wrapping data in an object just for the sake of wrapping it in an object. So this, {'type':'url', 'data':{some_other_dict}} is better to me than:
class DataObject:
def __init__(self):
self.type = 'url'
self.data = {some_other_dict}
But, if you need to add specific functionality to this data, like the ability for it to sort its data.keys() and return them as a set, then creating an object makes more sense.

Categories

Resources