Reading in files as objects instead of variables in Python

After recently starting a role that essentially involves professional software development in Python, I have noticed that the code I work with tends to read files in as objects rather than as variables, and I cannot understand why.
For example, I work with a limited number of raster files. In the parts of the code that deal with these rasters, there might be a class defined explicitly for reading them in. The class takes a file location as an input, uses the Python package rasterio to open and read the file and access its other characteristics, and then stores these as attributes. As a further example, we may have something like the following:
class test(object):
    def __init__(self, fileLocation):
        self.fileRead = open(fileLocation)
        # read() gives the file's contents; the file object itself has no split()
        self.fileSplit = self.fileRead.read().split()
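Concretely, the same pattern applied to rasters might look something like the sketch below (illustrative only; I'm assuming rasterio is installed, and the attribute names are mine, not taken from the actual codebase):
import rasterio

class RasterFile:
    def __init__(self, file_location):
        # open the dataset once and cache the pieces callers usually need
        with rasterio.open(file_location) as src:
            self.data = src.read(1)         # first band as a numpy array
            self.crs = src.crs              # coordinate reference system
            self.transform = src.transform  # affine georeferencing transform
            self.bounds = src.bounds        # spatial extent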
My instinct would have been to read the file straight into a variable and access its properties as I needed them, avoiding the extra effort.
I know the idea of classes is to streamline data handling when dealing with quantities of similar data (e.g. student information for students in a school), but this class might be instantiated only once per run of the parent program. So it seems overkill to me to create a class just to hold the information obtained through rasterio, when it would probably be much cleaner to access the file data as you need it through explicit rasterio calls.
Now, this way of structuring code appears to be quite prevalent, and I just can't seem to understand why it would be a good way of doing things. Is there some hidden benefit that I am missing which someone could explain to me? Or should I disregard it and carry on in the manner outlined?

Related

How do I refactor to avoid passing functions?

I currently have code that acquires and manipulates data from multiple sources using pandas DataFrames. The intent is for a user to create an instance of a class (call it dbase) which provides methods to do things like acquire and store data from API queries. I'm doing this by allowing the user to define their own functions to format values in dbase, but I've found that I tend to pass those user-defined functions through several other functions in ways that get confusing. I think this must be an obvious mistake to someone who knows what they're doing but I haven't come up with a better way to give the user control of the data.
The API queries are the worst example right now. Say I want to get a name from a server. Right now I do something like the following, in which the user-defined function for transforming the name gets passed across three other functions before it's called.
# file with code for api interaction
def submitter(this_query, dbase, name_mangler):
    new_data = api.submit(this_query)
    new_dbase_entry = name_mangler(new_data)
    # in reality there is much more complicated data transformation here
    dbase.update(new_dbase_entry)

def query_api(dbase, meta, name_mangler):
    queries = make_query_strings(dbase, meta)
    # using pandas.DataFrame.apply() here to avoid a for loop
    queries.apply(lambda x: submitter(x, dbase, name_mangler))
# other file with class definition
import pandas
from api_code import query_api

class dbase():
    def __init__(self):
        self.df = pandas.DataFrame()
        # data gets moved around between more than one data
        # structure in this class, I'm just using a single
        # dataframe as a minimal example

    def get_remote_data(self, meta, name_mangler):
        # in reality there is code here to handle multiple
        # cases rather than a trivial wrapper for another
        # function
        query_api(self, meta, name_mangler)

    def update(self, new_data):
        # do consistency checks
        # possibly write new dbase entry
        pass
A user would then do something like this:
import dbase

def custom_mangler(name):
    # User determines how to store the name in dbase
    # for instance this could take "Grace Hopper" to "hopper"
    return mangled_name

my_dbase = dbase.dbase()
# meta defines what needs to be queried and how the remote
# data should get processed into dbase
meta = {stuff}
my_dbase.get_remote_data(meta, custom_mangler)
I find it very hard to follow my own code here because the definitions of functions can be widely separated from the first point at which they're called. How should I refactor to address this problem? (and does this approach violate accepted coding patterns for other reasons?)
It's a little hard to infer context from what you've posted, so take this with a grain of salt. The general concepts still apply. Also take a look at https://codereview.stackexchange.com/ as this question might be a better fit for that site.
Two things come to mind.
Try to give your functions/classes/variables better names
Think about orthogonality
Good Names
Consider how this looks from a user's perspective. dbase is not a very descriptive name for either the module or the class. meta doesn't tell me at all what the dict should contain. mangler tells me that the string gets changed, but nothing about where the string comes from or how it should be changed.
Good names are hard, but it's worth spending time to make them thoughtful. It's always a trade-off between being descriptive and being overly verbose. If you can't think of a name that gives clear meaning without taking up too much space, consider whether your API is overly complex. Always consider names from the end user's perspective, as well as that of the future programmers who will be reading and maintaining your code.
Orthogonality
Following the Unix mantra of "do one thing and do it well", sometimes an API is simpler and more flexible if we separate out different tasks to different functions rather than having one function that does it all.
When writing code, I think "what is the minimum this function needs to do to be useful".
In your example
my_dbase.get_remote_data(meta, custom_mangler)
get_remote_data not only fetches the data, but also processes it. That can be confusing as a user. There's a lot happening behind the scenes in this function that isn't obvious from the function name.
It might be more appropriate to have separate function calls for this. Let's assume that you're querying weather servers about temperature and rainfall.
london_weather_data = weatheraggregator.WeatherAggregator()
reports = london_weather_data.fetch_weather_reports(sources=[server_a, server_b])
london_weather_data.process_reports(reports, short_name_formatter)
Yes it's longer to type, but as a user it's a big improvement as I know what I'm getting.
Ultimately you need to decide where to split up tasks. The above may not make sense for your application.
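As an aside on the original problem of threading name_mangler through several layers: one common alternative (a sketch only, reusing the question's make_query_strings and api placeholders, with an illustrative class name) is to inject the user's function once, at construction time, instead of passing it through every call:
import pandas

class Database:
    def __init__(self, name_formatter):
        self.df = pandas.DataFrame()
        # the user-supplied function is stored once here, instead of being
        # passed down through get_remote_data(), query_api() and submitter()
        self.name_formatter = name_formatter

    def get_remote_data(self, meta):
        queries = make_query_strings(self, meta)  # placeholder from the question
        queries.apply(self._submit)

    def _submit(self, query):
        new_data = api.submit(query)              # placeholder from the question
        self.update(self.name_formatter(new_data))

    def update(self, new_entry):
        pass  # consistency checks, then store
Each method can now find the formatter on self, so no intermediate function needs it in its signature.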

Why use Python classes over modules with functions?

I'm teaching myself Python (3.x) and I'm trying to understand the use case for classes. I'm starting to understand what they actually do, but I'm struggling to understand why you would use a class as opposed to a module with functions.
For example, how does:
class cls1:
    def func1(arguments...):
        #do some stuff

obj1 = cls1()
obj2 = cls1()
obj1.func1(arg1, arg2...)
obj2.func1(arg1, arg2...)
Differ from:
#module.py contents
def func1(arguments...):
    #do some stuff

import module
x = module.func1(arg1, arg2...)
y = module.func1(arg1, arg2...)
This is probably very simple but I just can't get my head around it.
So far, I've had quite a bit of success writing python programs, but they have all been pretty procedural, and only importing basic module functions. Classes are my next biggest hurdle.
You use a class if you need multiple instances of it and you want the instances not to interfere with each other.
A module behaves like a singleton class, so you can only have one instance of it.
EDIT: for example, if you have a module called example.py:
x = 0

def incr():
    global x
    x = x + 1

def getX():
    return x

if you try to import this module twice:
>>> import example as ex1
>>> import example as ex2
>>> ex1.incr()
>>> ex1.getX()
1
>>> ex2.getX()
1
This is because the module is only imported once, so ex1 and ex2 point to the same object.
As long as you're only using pure functions (functions that work only on their arguments, always return the same result for the same argument set, don't depend on any global/shared state, and don't change anything, neither their arguments nor any global/shared state; in other words, functions without side effects), classes are indeed of rather limited use. But that's functional programming, and while Python can technically be used in a functional style, it's possibly not the best choice here.
As soon as you have to share state between functions, and especially if some of those functions are supposed to change that shared state, you have a use for OO concepts. There are mainly two ways to share state between functions: passing the state from function to function, or using globals.
The second solution, global state, is known to be troublesome: first because it makes the program flow (and hence debugging) harder to understand, but also because it prevents your code from being reentrant, which is a definitive no-no for quite a lot of now-common use cases (multithreaded execution, most server-side web application code, etc.). It effectively makes your code unusable or near-unusable for anything except short, simple one-shot scripts.
The first solution most often implies using half-informal complex data structures (dicts with a given set of keys, often holding other dicts, lists, lists of dicts, sets, etc.), correctly initialising them, and passing them from function to function, along with a set of functions that operate on a given data structure. In other words, you are actually defining new complex data types (a data structure and a set of operations on it), using only the lowest-level tools the language provides.
Classes are a way to define such a data type at a higher level, grouping the data and the operations together. They also offer a lot more, especially polymorphism, which makes for more generic, extensible code and easier unit testing.
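To make that concrete, here is a minimal sketch (the names are invented for illustration) of the same little datatype built both ways:
# lowest-level tools: a dict with an agreed set of keys, plus free functions
def make_account(owner):
    return {"owner": owner, "balance": 0}

def deposit(account, amount):
    account["balance"] += amount

# the same datatype at a higher level: data and operations grouped together
class Account:
    def __init__(self, owner):
        self.owner = owner
        self.balance = 0

    def deposit(self, amount):
        self.balance += amount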
Consider that you have a file or a database with products, and each product has a product id, price, availability, discount, published-to-web status, and more. And you have a second file with thousands of products containing new prices, availability and discounts. You want to update the values and keep track of how many products will be changed, among other statistics. You can do it with procedural or functional programming, but you will find yourself inventing tricks to make it work, and most likely you will get lost among many different lists and sets.
On the other hand, with object-oriented programming you can create a class Product with instance variables for the product id, the old price, the old availability, the old discount and the old published status, plus instance variables for the new values (new price, new availability, new discount, new published status). Then all you have to do is read the first file/database and create a new instance of Product for every product. Then you can read the second file and find the new values for your product objects. In the end, every product of the first file/database is an object, labeled and carrying both the old values and the new values. This makes it easier to track the changes, compute statistics and update your database.
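A minimal sketch of such a Product class (the attribute names are illustrative) might be:
class Product:
    def __init__(self, product_id, price, availability, discount, published):
        self.product_id = product_id
        # old values, read from the first file/database
        self.old_price = price
        self.old_availability = availability
        self.old_discount = discount
        self.old_published = published
        # new values, filled in later from the second file
        self.new_price = None
        self.new_availability = None
        self.new_discount = None
        self.new_published = None

    def price_changed(self):
        return self.new_price is not None and self.new_price != self.old_price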
One more example: if you use tkinter, you can create a class for a top-level window, and every time you want an information window or an About window (with a custom background colour and dimensions) to appear, you simply create a new instance of that class.
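For instance, a sketch of such a window class (the colours and sizes are just examples):
import tkinter as tk

class InfoWindow(tk.Toplevel):
    def __init__(self, master, title, message, bg="lightyellow", size="300x120"):
        super().__init__(master, bg=bg)
        self.title(title)
        self.geometry(size)
        tk.Label(self, text=message, bg=bg).pack(padx=20, pady=20)

root = tk.Tk()
InfoWindow(root, "About", "My game, v1.0")               # one instance
InfoWindow(root, "Info", "Game saved!", bg="lightblue")  # another, styled differently
root.mainloop()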
For simple things classes are not needed. But for more complex things classes sometimes can make the solution easier.
I think the best answer is that it depends on what your intended object is supposed to be or do. But in general, there are some differences between a class and an imported module which give each of them different capabilities. The most important is that a class defines objects: instances have many ways of behaving like objects that modules don't have, for example special attributes like __getattr__, __setattr__, __iter__, etc., and the ability to create many instances and even control how they are created. For modules, on the other hand, the documentation describes the use case perfectly:
If you quit from the Python interpreter and enter it again, the definitions you have made (functions and variables) are lost. Therefore, if you want to write a somewhat longer program, you are better off using a text editor to prepare the input for the interpreter and running it with that file as input instead. This is known as creating a script. As your program gets longer, you may want to split it into several files for easier maintenance. You may also want to use a handy function that you've written in several programs without copying its definition into each program.
To support this, Python has a way to put definitions in a file and use them in a script or in an interactive instance of the interpreter. Such a file is called a module; definitions from a module can be imported into other modules or into the main module (the collection of variables that you have access to in a script executed at the top level and in calculator mode).
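A small sketch of the "instances act like objects" point above (the class and data are invented for illustration): every instance carries its own state and can customise built-in behaviour through special methods, which a module cannot do per-instance:
class Deck:
    def __init__(self, cards):
        self._cards = list(cards)

    def __len__(self):               # len(deck) works
        return len(self._cards)

    def __iter__(self):              # for card in deck: ... works
        return iter(self._cards)

deck1 = Deck(["A", "K", "Q"])
deck2 = Deck(["2", "3"])             # a second, independent instance
print(len(deck1), len(deck2))        # 3 2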

Self-modifying Python code to keep track of high scores

I've considered storing the high scores for my game as variables in the code itself, rather than in a text file as I've done so far, because it means fewer additional files are required to run the game and awarding yourself 999999 points becomes harder.
However, this would require me to write self-modifying code to permanently overwrite the global variables representing the scores. I looked into that, and considering that all I really want to do is change global variables, everything I found was too advanced.
I'd appreciate an explanation of how to write self-modifying Python code to do just that, preferably with an example, as that aids understanding.
My first inclination is to say "don't do that". Self-modifying code (in Python or any other language) makes it extremely difficult to maintain a versioned library.
You make a bug fix and need to redistribute: how do you merge the data you stored via self-modification?
It's very hard to authenticate packaging using a hash: once the local version is modified, it's hard to tell which version it originated from, because the SHAs won't match.
It's unsafe: you could save and load a Python class that isn't stored with your package, but if that file is user-writable, a foreign process could add arbitrary Python code to it for you to evaluate. Kind of like SQL injection, but Python style.
Python makes it so trivial to load and dump JSON files that, for simple things, I wouldn't consider anything else. Even CSV files are trivial, can be bound to maps, and can be manipulated more easily as data using your favorite spreadsheet editor.
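For example, a minimal sketch of the JSON route (the file name and structure are illustrative):
import json

SCORES_FILE = "scores.json"

def load_scores():
    try:
        with open(SCORES_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_scores(scores):
    with open(SCORES_FILE, "w") as f:
        json.dump(scores, f, indent=2)

scores = load_scores()
scores["alice"] = max(scores.get("alice", 0), 4200)  # record a new high score
save_scores(scores)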
My suggestion: don't use self-modifying Python unless you just want to experiment. It's not a practical solution in the real world, unless you're working in an embedded environment where disk and memory are at a premium.

Generating a hybrid *.py file from 2 other files

I'm trying to write a Python script that can combine two Python files into a hybrid that overwrites methods from the first file using methods from a second file (those methods can be standalone, or exist inside a class).
For example, say I have original.py:
from someclass import Class0

def fun1(a, b):
    print "fun1", a, b

def fun2(a):
    print "fun2", a

class Class1(Class0):
    def __init__(self, a):
        print "init", a

    def fun3(self, a, b):
        print "fun3", a, b
and new.py:
def fun2(a, b):
    print "new fun2", a, b

class Class1(Class0):
    def fun3(self, a, b):
        print "new fun3"
and after running combine('original.py', 'new.py') I want it to generate a new file that looks as follows:
import someclass

def fun1(a, b):
    print "fun1", a, b

def fun2(a, b):
    print "new fun2", a, b

class Class1:
    def __init__(self, a):
        print "init", a

    def fun3(self, a, b):
        print "new fun3"
I'm trying to figure out the cleanest way to approach this. I thought of using regexes at first, but keeping track of indentation and the current level I'm in (a method inside a class is one level deep, but what if a file has classes inside classes, or methods inside other methods, as happens when using decorators?) sounds like a disaster waiting to happen. I also thought of using the ast module to generate a syntax tree and then trying to navigate to the same element in both trees, replacing portions of one tree with parts of the other (again, this seems like it could get complicated, since most tree-manipulation examples I've seen modify individual nodes rather than entire methods). What do you suggest? It seems like a simple task, but I don't see a clean way of accomplishing it.
Also, I don't care if the second file is complete or valid (note the missing import despite the inheritance from Class0), although if making the second file valid would allow me to use some internal Python trick (like modifying Python's import logic to overwrite imported methods with ones from a different file and then dumping the imported version from memory into a new file), then I'd be fine with that as well.
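For what it's worth, a rough sketch of the ast route (assuming Python 3.9+ for ast.unparse, and that both inputs parse as valid Python): it replaces same-named top-level definitions and recurses into classes, so unmatched methods such as __init__ survive:
import ast

def merge_bodies(orig_body, new_body):
    # index the replacement definitions by name
    new_defs = {n.name: n for n in new_body
                if isinstance(n, (ast.FunctionDef, ast.ClassDef))}
    merged = []
    for node in orig_body:
        name = getattr(node, "name", None)
        if name in new_defs:
            repl = new_defs[name]
            if isinstance(node, ast.ClassDef) and isinstance(repl, ast.ClassDef):
                # recurse so methods absent from the new class are kept
                node.body = merge_bodies(node.body, repl.body)
                merged.append(node)
            else:
                merged.append(repl)
        else:
            merged.append(node)
    return merged

def combine(original_path, new_path):
    with open(original_path) as f:
        orig_tree = ast.parse(f.read())
    with open(new_path) as f:
        new_tree = ast.parse(f.read())
    orig_tree.body = merge_bodies(orig_tree.body, new_tree.body)
    return ast.unparse(orig_tree)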
EDIT (more explanation about why I'm trying to do this rather than using inheritance):
I guess my question needs to better explain why I'm trying to do this. I provide my source code to several groups/customers. The functionality needs to differ based on customer requirements: in some cases it's something simple like tweaking the parameters file, as sarnold mentioned; in others, users ask for features/hooks that are irrelevant to other groups and would only add confusion to the UI. Additionally, some of the source code is customer-specific and/or proprietary (so while I can share it with some groups, I'm not allowed to share it with others), which is what makes me try to avoid inheritance. I could still rely on normal inheritance as long as it's the subclass version that's proprietary rather than the original, but for most features there is currently only one group without the privileges (and it's not always the same group for each feature) while the other groups have them. So if I were to use inheritance, I'd need a bunch of directories/files like "SecretFromA", "SecretFromB", ..., "SecretFromZ", and then for each of the other groups I'd need "from SecretFromZ import *" for every module; with the replacement technique I described, I can just add stubs for the functionality I want to filter out. This is also something I'd be using a lot in the future, so as painful as it would be to write this script now, I feel I'd save much more later by not having to maintain the plethora of "from * import *"-style files that inheritance would force on me (not to mention having to diff each version later if a customer decides to move features around).
Also, in response to sarnold's comment: I guess I was too vague in my description. This is not something that would happen on the fly. I'd generate the new *.py files once (per version/delivery) and give the generated source to the customer. Additionally, the patching would only take place if the second file (new.py in my example) actually exists; otherwise the original would be copied over. So unlike with inheritance, my group-specific directories would be relatively small, with the final version deployed to a separate directory that I wouldn't need to maintain at all, aside from delivering it to the customer.
With your new improved description I can propose a solution that I've used in the past, as well as other teams, to great effect: use your source control to manage the differences via branches or new repositories.
First things first, please use a source control system of some sort. The more advanced ones, such as git, hg, Subversion, Bazaar, ClearCase, BitKeeper, Darcs, Perforce, etc., all support some form of merge tracking that can make this specific problem significantly easier to solve. (Whether or not you decide to use branches this way, please do use a source control system.)
You would maintain a main branch (call it trunk or master if you wish) with almost all your development work, bug fixing, and so on. Branch from this all the code specific to given customers. You might prefer to branch based on features rather than on customer names, as you might have ten customers who need SCTP support, and the other twenty customers want TCP support instead. (And you want to charge more for the more exotic protocol -- so just merging it all into a configuration option would be Bad For Business.)
When it comes time to deliver new products, you switch to the specialized branches (or repositories), then use git pull (across repositories), git merge (across branches), svn merge, or similar commands to pull updates from your main development trunk into your specialized branches. Depending on whether you worked on code that conflicts, you might have some merge conflicts to repair, or it might go very cleanly. Fix up any merge conflicts, commit to your branch, and move on to the next branch.
You're liable to wind up with a "tree" that looks something like this:
    I--J--K                supremely magic feature
   /     /
  D--E--F--L--M--N         magic feature
 /     /     /
A--B--C--G--H--O--P--Q     main trunk
Getting too many branches is one quick path to madness; be sure to commit bug fixes and shared features to the main trunk if you can, and if you can't, as close to it as possible. That will make merging them back into all descendant branches much easier. It is probably best to do the merging close to when you write the new feature or bug fix, while it is freshest in your mind.
Maintaining multiple versions of code can drive you insane, though. When you can make your code more modular and simply deliver different 'modules' of code to your users, you can drastically reduce the complexity you need to manage. Programming to a modularized interface takes discipline, and the amount of up-front work can be prohibitive, but it is often easier to manage one tool with five plugins than the 2^5 == 32 different combinations of those five features for customers who pick and choose among them.
You'll come to recognize the sorts of modifications that make most sense in a per-customer branch and the modifications that should be maintained as modules in your trunk. The case isn't always clear-cut: take the approach that leads to lowest amount of overall complexity, as best you can judge.
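As one small illustration of the "modules instead of merged files" idea (all names here are hypothetical): keep a default implementation of each feature and let a customer-specific module override it when it exists:
import importlib

def load_feature(name, customer=None):
    # try the customer-specific override first, fall back to the shared default
    if customer is not None:
        try:
            return importlib.import_module("features.%s.%s" % (customer, name))
        except ImportError:
            pass
    return importlib.import_module("features.default.%s" % name)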

How can I create global classes in Python (if possible)?

Let's suppose I have several functions for an RPG I'm working on...
def name_of_function():
    action
and wanted to make the axe class (see below) available in each function without having to rewrite it each time. How would I create the class as a global class? I'm not sure if I'm using the correct terminology, but please help. This has always held me back from creating text-based RPG games. An example of a global class would be awesome!
class axe:
    attack = 5
    weight = 6
    description = "A lightweight battle axe."
    level_required = 1
    price = 10
You can't create anything that's truly global in Python - that is, something that's magically available to any module no matter what. But it hardly matters once you understand how modules and importing work.
Typically, you create classes and organize them into modules. Then you import them into whatever module needs them, which adds the class to the module's symbol table.
So for instance, you might create a module called weapons.py, create a WeaponBase class in it, and then Axe and Broadsword classes derived from WeaponBase. Then, in any module that needed to use weapons, you'd put
import weapons
at the top of the file. Once you do this, weapons.Axe returns the Axe class, weapons.Broadsword returns the Broadsword class, and so on. You could also use:
from weapons import Axe, Broadsword
which adds Axe and Broadsword to the module's symbol table, allowing code to do pretty much exactly what you are saying you want it to do.
You can also use
from weapons import *
but this generally is not a great idea, for two reasons. First, it imports everything in the module whether you're going to use it or not (WeaponBase, for instance). Second, you run into all kinds of confusing problems if a function in weapons has the same name as a function in the importing module.
There are a lot of subtleties in the import system. You have to be careful to make sure that modules don't try to import each other, for instance. And eventually your project gets large enough that you don't want to put all of its modules in the same directory, and you'll have to learn about things like __init__.py. But you can worry about that down the road.
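A sketch of that weapons.py layout (the axe stats come from the question; the Broadsword numbers are made up):
# weapons.py
class WeaponBase:
    def __init__(self, attack, weight, description, level_required, price):
        self.attack = attack
        self.weight = weight
        self.description = description
        self.level_required = level_required
        self.price = price

class Axe(WeaponBase):
    def __init__(self):
        super().__init__(5, 6, "A lightweight battle axe.", 1, 10)

class Broadsword(WeaponBase):
    def __init__(self):
        super().__init__(9, 12, "A heavy two-handed blade.", 3, 40)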
i beg to differ with the view that you can't create something truly global in python. in fact, it is easy. in Python 3.1, it looks like this:
def get_builtins():
    """Due to the way Python works, ``__builtins__`` can strangely be either a module or a dictionary,
    depending on whether the file is executed directly or as an import. I couldn’t care less about this
    detail, so here is a method that simply returns the namespace as a dictionary."""
    return getattr(__builtins__, '__dict__', __builtins__)
like a bunch of other things, builtins are one point where Py3 differs in details from the way it used to work in Py2. read the "What's New in Python X.X" documents on python.org for details. i have no idea what the reason for the convention mentioned above might be; i just want to ignore that stuff. i think the above code should work in Py2 as well.
so the point here is there is a __builtins__ thingie which holds a lot of stuff that comes as, well, built-into Python. all the sum, max, range stuff you've come to love. well, almost everything. but you don't need the details, really. the simplest thing you could do is to say
G = get_builtins()
G['G'] = G
G['axe'] = axe
at a point in your code that is always guaranteed to execute. G stands in for the globally available namespace, and since i've registered G itself within G, G now magically transcends its existence into the background of every module. that means you should use it with care. where naming collisions occur between whatever is held in G and a module's namespace, the module's namespace should win (as it gets inspected first). also, be prepared for everybody to jump on you when you tell them you're POLLUTING THE GLOBAL NAMESPACE dammit. i'm really surprised no one has complained about that yet, here.
well, those people would be quite right, in a way. personally, however, this is one of my main application composition techniques: what you do is you take a step away from the all-purpose module (which shouldn't do such a thing) towards a fine-tuned application-specific namespace. your modules are bound not to work outside that namespace, but then they're not supposed to, either. i actually started this style of programming as an experimental rebellion against (1) established views, hah!, and (2) the desperation that befalls me whenever i want to accomplish something less than trivial using Python's import statement. these days, i only use import for standard library stuff and regularly-installed modules; for my own stuff, i use a homegrown system. what can i say, it works!
ah yes, two more points: do yourself a favor, if you like this solution, and write yourself a publish() method or the like that ensures you never publish a name that has already been taken. in most cases, you do not want that.
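a sketch of such a guard (using the G from above; the function name is just the one suggested):
def publish(name, value):
    # refuse to overwrite anything already present in the global namespace
    if name in G:
        raise KeyError("name %r is already taken" % name)
    G[name] = value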
lastly, let me second the first commenter: i have been programming in exactly the style you show above, coz that's what you find in the textbook examples (most of the time using cars, not axes to be sure). for a rather substantial number of reasons, i've pretty much given up on that.
consider this: JSON defines but seven kinds of data: null, true, false, numbers, texts, lists, dictionaries, that's it. i claim you can model any other useful datatype with those.
there is still a lot of justification for things like sets, bags, ordered dictionaries and so on. the claim here is not that it is always convenient or appropriate to fall back on a pure, directly JSON-compatible form; the claim is only that it is possible to simulate. right now, i'm implementing a sparse list for use in a messaging system, and that data type i do implement in classical OOP. that's what it's good for.
but i never define classes that go beyond these generic datatypes. rather, i write libraries that take generic datatypes and that provide the functionality you need. all of my business data (in your case probably representations of players, scenes, implements and so on) go into generic data container (as a rule, mostly dicts). i know there are open questions with this way of architecturing things, but programming has become ever so much easier, so much more fluent since i broke thru the BS that part of OOP propaganda is (apart from the really useful and nice things that another part of OOP is).
oh yes, and did i mention that as long as you keep your business data in JSON-compatible objects you can always write them to and resurrect them from the disk? or send them over the wire so you can interact with remote players? and how incredibly twisted the serialization business can become in classical OOP if you want to do that (read this for the tip of the iceberg)? most of the technical detail you have to know in this field is completely meaningless for the rest of your life.
You can add to (or change) Python's built-in functions and classes by doing either of the following, at least in Python 2.x. Afterwards, whatever you add will be available to all code by default, although not permanently.
Disclaimer: doing this sort of thing can be dangerous due to possible name clashes, and problematic due to the fact that it's extremely non-explicit. But, hey, as they say, we're all adults here, right?
class MyClass: pass

# one way
setattr(__builtins__, 'MyClass', MyClass)

# another way
import __builtin__
__builtin__.MyClass = MyClass
Another way is to create a singleton:
class Singleton(type):
    def __init__(cls, name, bases, dict):
        super(Singleton, cls).__init__(name, bases, dict)
        cls.instance = None

    def __call__(cls, *args, **kwargs):
        # create the single instance on first use, then keep returning it
        if cls.instance is None:
            cls.instance = super(Singleton, cls).__call__(*args, **kwargs)
        return cls.instance

class GlobalClass(object):
    __metaclass__ = Singleton

    def __init__(self):
        print("I am global, and whenever attributes are added in one instance, any other instance will be affected as well.")
