Using Python 2.6, I'm trying to handle tables in a growing variety of formats (xls, csv, shp, json, xml, html table data), and feed the content into an ArcGIS database table (stay with me please, this is more about the python part of the process than the GIS part). In the current design, my base class formats the target database table and populates it with the content of the source format. The subclasses are currently designed to feed the content into a dictionary so that the base class can handle the content no matter what the source format was.
The problem is that my users could be feeding a file or table of any one of these formats into the script, so the subclass would optimally be determined at runtime. I do not know how to do this other than by running a really involved if-elif-elif-... block. The structure kind of looks like this:
class Input:
    def __init__(self, name):  # name is the filename, including path
        self.name = name
        self.ext = name[-3:]
        self.d = {}  # content goes here
        ...          # dictionary content written to database table here
        # each subclass writes to self.d
class xls(Input):
    ...
class xml(Input):
    ...
class csv(Input):
    ...
x = Input("c:\foo.xls")
y = Input("c:\bar.xml")
My understanding of duck-typing and polymorphism suggests this is not the way to go about it, but I'm having a tough time figuring out a better design. Help on that front would help, but what I'm really after is how to turn x.ext or y.ext into the fork at which the subclass (and thus the input-handling) is determined.
If it helps, let's say that foo.xls and bar.xml have the same data, and so x.d and y.d will eventually have the same items, such as {'name':'Somegrad', 'lat':52.91025, 'lon':47.88267}.
This problem is commonly solved with a factory function that knows about subclasses.
import os

input_implementations = {'xls': xls, 'xml': xml, 'csv': csv}

def input_factory(filename):
    ext = os.path.splitext(filename)[1][1:].lower()
    impl = input_implementations.get(ext, None)
    if impl is None:
        raise ValueError('rain fire from the skies: no handler for %r' % ext)
    return impl(filename)
It's harder to do from the base class itself (Input('file.xyz')) because the subclasses aren't defined when Input is defined. You can get tricky, but a simple factory is easy.
How about if each derived class contained a list of possible file extensions that it could parse? Then you could try to match the input file's extension with one of these to decide which subclass to use.
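A minimal sketch of that idea (the EXTENSIONS attribute and the handler_for() helper are names made up for illustration, not from the original code): each subclass advertises the extensions it understands, and a small lookup walks Input.__subclasses__() to pick one.

import os

class Input(object):
    EXTENSIONS = ()          # overridden by each subclass

    def __init__(self, name):
        self.name = name

class xls(Input):
    EXTENSIONS = ('xls', 'xlsx')

class csv(Input):
    EXTENSIONS = ('csv', 'txt')

def handler_for(filename):
    """Pick the Input subclass whose EXTENSIONS include the file's extension."""
    ext = os.path.splitext(filename)[1][1:].lower()
    for cls in Input.__subclasses__():
        if ext in cls.EXTENSIONS:
            return cls(filename)
    raise ValueError('no handler registered for %r' % ext)

New formats then only need a subclass with the right EXTENSIONS tuple; nothing else has to change.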
You're on the right track. Use your subclasses:
x = xls("c:\foo.xls")
y = xml("c:\bar.xml")
Write methods in each subclass to parse the appropriate data type, and use the base class (Input) to write the data to a database.
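A rough sketch of how that division of labor might look (parse(), load() and write_to_table() are illustrative names, not from the original post):

class Input(object):
    def __init__(self, name):
        self.name = name
        self.d = {}

    def load(self):
        self.parse()            # each subclass fills self.d its own way
        self.write_to_table()   # shared database-writing logic lives here

    def write_to_table(self):
        # format the target table and write the contents of self.d
        pass

class xls(Input):
    def parse(self):
        # read the workbook (e.g. with xlrd) and fill self.d
        self.d = {'name': 'Somegrad', 'lat': 52.91025, 'lon': 47.88267}

Combined with a factory like the one above, the caller never needs to know which subclass it got back.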
I am writing a REST API that will store several complex objects to an AWS DynamoDB and then when requested, retrieve them, perform computations on them, and return a result. Here is a bit of extracted, simplified, renamed pseudo code.
from marshmallow import Schema, fields, post_load

class Widget:
    def __init__(self, height, weight):
        self.height = height
        self.weight = weight

class Machine:
    def __init__(self, widgets):
        self.widgets = widgets

    def useful_method(self):
        return "something great"

class WidgetSchema(Schema):
    height = fields.Decimal()
    weight = fields.Decimal()

    @post_load
    def make_widget(self, data, **kwargs):
        return Widget(**data)

class MachineSchema(Schema):
    widgets = fields.List(fields.Nested(WidgetSchema))

    @post_load
    def make_machine(self, data, **kwargs):
        return Machine(**data)
from flask import Flask, request, jsonify
import boto3

app = Flask(__name__)
dynamodb = boto3.resource("dynamodb", ...)

@app.route("/machine/<uuid:machine_id>", methods=['POST'])
def create_machine(machine_id):
    input_json = request.get_json()
    validated_input = MachineSchema().load(input_json)
    # NOTE: validated_input should be a Python dict which
    # contains Decimals instead of floats, for storage in DynamoDB.
    validated_input['id'] = str(machine_id)
    dynamodb.Table('machine').put_item(Item=validated_input)
    return jsonify({"status": "success", "error_message": ""})

@app.route("/machine/<uuid:machine_id>", methods=['GET'])
def get_machine(machine_id):
    result = dynamodb.Table('machine').get_item(Key={'id': str(machine_id)})
    return jsonify(result['Item'])

@app.route("/machine/<uuid:machine_id>/compute", methods=['GET'])
def compute_machine(machine_id):
    result = dynamodb.Table('machine').get_item(Key={'id': str(machine_id)})
    validated_input = MachineSchema().load(result['Item'])
    # NOTE: validated_input should be a Machine object
    # which has made use of the post_load
    return jsonify(validated_input.useful_method())
The issue with this is that I need to have my Marshmallow schema pull double duty. For starters, in the create_machine function, I need the schema to ensure that the user calling my REST API has passed me a properly formed object with no extra fields and all required fields present, etc. I need to make sure I'm not storing invalid junk in the DB, after all. It also needs to recursively crawl the input JSON and translate all of the JSON values to the right type. For example, floats are not supported in Dynamo, so they need to be Decimals as shown here. This is something Marshmallow makes pretty easy. If there were no post_load, this is exactly what would be produced as validated_input.
The second job of the schema is to take the Python object retrieved from DynamoDB, which looks almost exactly like the user input JSON except that the floats are Decimals, and translate it into my Python objects, Machine and Widget. This is where I'll need to read the object again, but this time use the post_load to create objects. In this case, however, I do not want my numbers to be Decimals; I'd like them to be standard Python floats.
I could write two totally different Marshmallow schemas for this and be done with it, clearly. One would have Decimals for the height and weight and one would have just floats. One would have post_loads for every object and one would have none. But writing two nearly identical schemas is a huge pain. My schema definitions are several hundred lines long. Inheriting a DB version with a post_load didn't seem like the right direction either, because I would need to change every fields.Nested to point to the correct class. For example, even if I inherited MachineSchemaDBVersion from MachineSchema and added a post_load, MachineSchemaDBVersion would still reference WidgetSchema, not some DB version of WidgetSchema, unless I overrode the widgets field as well.
I could potentially derive my own Schema class and pass in a flag indicating whether we are in DB mode or not.
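A rough sketch of that flag idea (the for_db flag and the float conversion are just illustrative, not something I actually built):

from marshmallow import Schema, fields, post_load

class WidgetSchema(Schema):
    height = fields.Decimal()
    weight = fields.Decimal()

    def __init__(self, for_db=False, **kwargs):
        super().__init__(**kwargs)
        self.for_db = for_db   # hypothetical switch between DB mode and object mode

    @post_load
    def make_widget(self, data, **kwargs):
        if self.for_db:
            return data        # keep the plain dict of Decimals for DynamoDB
        data = {k: float(v) for k, v in data.items()}  # computations want floats
        return Widget(**data)

The catch is that nested schemas would need the flag propagated to them as well.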
How are people generally handling this issue of wanting to store REST API input more or less directly to a DynamoDB with some validation and then use that data later to construct Python objects for a computation?
One method I have tried is to have my schema always instantiate my Python objects and then dump them to the database using dumps on a fully constructed object. The problem with this is that the computation library's objects, in my example Machine or Widget, do not have all the required fields that I need to store in the database, like IDs, names or descriptions. The objects are made specifically for doing the computations.
I ended up finding a solution to this. Effectively, what I've done is to write the Marshmallow schemas exclusively for translation from DynamoDB into the Python objects. All Schema classes have @post_load methods that translate into the Python objects, and all fields are labeled with the type they need to be in the Python world, not the database world.
When validating the input from the REST API and ensuring that no bad data is allowed to get into the database, I call MySchema().validate(input_json), check to see that there are no errors, and if not, dump the input_json into the database.
This leaves only one extra problem: the input_json needs to be cleaned up for entry into the database, which I was previously doing with Marshmallow. However, this can also easily be done by adjusting my JSON decoder to parse floats as Decimals.
So in summary, my JSON decoder is doing the work of recursively walking the data structure and converting floats to Decimals, separately from Marshmallow. Marshmallow runs validate() on the fields of every object, but the results are only checked for errors. The original input is then dumped into the database.
I needed to add this line (plus its imports) to do the conversion to Decimal.
import decimal
from functools import partial
import flask

app.json_decoder = partial(flask.json.JSONDecoder, parse_float=decimal.Decimal)
My create function now looks like this. Notice how the original input_json, parsed by my updated JSON decoder, is inserted directly into the database, rather than any munged output from Marshmallow.
#app.route("/machine/<uuid:machine_id>", methods=['POST'])
def create_machine(machine_id):
input_json = request.get_json() # Already ready to be DB input as is.
errors = MachineSchema().validate(input_json)
if errors:
return jsonify({"status": "failure",message = dumps(errors)})
else:
input_json['id'] = machine_id
dynamodb.Table('machine').put_item(Item=input_json)
return jsonify({"status", "success", error_message = ""})
I have many classes with class methods. All classes contain some additional "metadata" that I want to move to the DB. I have no idea how to link classes with db entities. It's not a classical Object Relational Mapping, it's Class-With-Code Relational Mapping.
Should I use class name + module name (it should be globally unique, but may change during a refactor)?
Or is a better solution to add some unique field to the class?
Or maybe the best solution is to create a unique enum and put all classes into a dict (enum as key, class as value)?
Example code:
class SampleClass(CalculatorMainClass):
    data_1 = 1000
    data_2 = {
        2010: {
            1: 847,
        },
    }

    @classmethod
    def calculate(cls, input_data):
        pass
I have no idea how to link classes with db entities.
I don't think there is a "best practice" for this, so how you do it is really up to you. Approach #2 seems legit; #3 seems complicated, but maybe you have a need for that. #1 seems brittle if you anticipate changing the names of your classes. I would recommend putting in a unique identifier as a class variable. Perhaps something like:
class SampleClass(CalculatorMainClass):
    _pkey = 1

    @classmethod
    def calculate(cls, input_data):
        pass

    @classmethod
    def metadata(cls):
        # fetch data from the database using cls._pkey
        pass
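If you end up with many such classes, a small registry keyed on that identifier keeps the lookup in one place (a sketch; CLASS_REGISTRY and class_for_pkey are made-up names):

CLASS_REGISTRY = {
    1: SampleClass,
    # 2: AnotherCalculator, ...
}

def class_for_pkey(pkey):
    """Map the identifier stored in a DB row back to the class that holds the code."""
    return CLASS_REGISTRY[pkey]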
I am writing a Django app, which will send some data from the site to a python script to process. I am planning on sending this data as a JSON string (this need not be the case). Some of the values sent over would ideally be class instances; however, this is clearly not possible, so the class name, plus any arguments needed to initialize the class, must somehow be serialized into a JSON value and then deserialized by the python script. This could be achieved with the code below, but it has several problems:
My attempt
I have put all the data needed for each class in a list and used that to initialize each class:
import json

class Class1():
    def __init__(self, *args, **kwargs):
        for k, v in kwargs.items():
            setattr(self, k, v)
        self._others = args

class Bar():
    POTENTIAL_OBJECTS = {"RANGE": range,
                         "Class1": Class1}

    def __init__(self, json_string):
        python_dict = json.loads(json_string)
        for key, value in python_dict.items():
            if isinstance(value, list) and value[0] in Bar.POTENTIAL_OBJECTS:
                setattr(self, key, Bar.POTENTIAL_OBJECTS[value[0]](*value[1], **value[2]))
            else:
                setattr(self, key, value)

example = ('{ "key_1":"Some string", "key_2":["heres", "a", "list"],'
           '"key_3":["RANGE", [10], {}], "key_4":["Class1", ["stuff"], {"stuff2":"x"}] }')
a = Bar(example)
The Problems with my approach
Apart from generally being a bit messy and not particularly elegant, there are other problems. Some of the lists in the JSON object will be generated by the user, and this obviously presents problems if the user uses a key from POTENTIAL_OBJECTS. (In a non-simplified version, Bar will have lots of subclasses, each with its own POTENTIAL_OBJECTS, so keeping track of all the potential values for front-end validation would be tricky.)
My Question
It feels like this must be a reasonably common thing that is needed and there must be some standard patterns or ways of achieving this. Is there a common/better approach/method to achieve this?
EDIT: I have realised that one way round the problem is to make all the keys in POTENTIAL_OBJECTS start with an underscore, and then validate against any underscores in user input at the front end. It still seems like there must be a better way to deserialize JSON into more complex objects than strings/ints/bools/lists etc.
Instead of having one master method to turn any arbitrary JSON into an arbitrary hierarchy of Python objects, the typical pattern would be to create a Django model for each type of thing you are trying to model. Relationships between them would then be modeled via relationship fields (ForeignKey, ManyToMany, etc, as appropriate). For instance, you might create a class Employee that models an employee, and a class Paycheck. Paycheck could then have a ForeignKey field named issued_to that refers to an Employee.
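For example, a minimal sketch of those two models (field names other than issued_to are illustrative):

from django.db import models

class Employee(models.Model):
    name = models.CharField(max_length=100)

class Paycheck(models.Model):
    issued_to = models.ForeignKey(Employee, on_delete=models.CASCADE)
    amount = models.DecimalField(max_digits=10, decimal_places=2)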
Note also that any scheme similar to the one you describe (where user-created JSON is translated directly into arbitrary Python objects) would have security implications, potentially allowing users to execute arbitrary code in the context of the Django server, though if you were to attempt it, the whitelist approach you have started here would be a decent place to start as a way to do it safely.
In short, you're reinventing most of what Django already does for you. The Django ORM features will help you to create models of the specific things you are interested in, validate the data, turn those data into Python objects safely, and even save instances of these models in the database for retrieval later.
That said, if you do want to parse a JSON string directly into an object hierarchy, you would have to do a full traversal instead of just going over the top-level items. To do that, you should look into doing something like a depth-first traversal, creating new model instances at each new node in the hierarchy. If you want to validate these inputs on the front end as well, you'd need to replicate this work in Javascript.
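A depth-first pass over the parsed JSON might look roughly like this (a sketch that keeps your whitelist and your ["TYPE", args, kwargs] convention; build_value is a made-up helper name):

POTENTIAL_OBJECTS = {"RANGE": range, "Class1": Class1}

def build_value(value):
    """Depth-first: rebuild nested ['TYPE', args, kwargs] triples into objects."""
    if isinstance(value, dict):
        return {k: build_value(v) for k, v in value.items()}
    if isinstance(value, list):
        if value and value[0] in POTENTIAL_OBJECTS and len(value) == 3:
            factory = POTENTIAL_OBJECTS[value[0]]
            return factory(*build_value(value[1]), **build_value(value[2]))
        return [build_value(v) for v in value]
    return value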
I want to add getter/setter-like behavior to GAE Model properties. The reason is for cases like turning the value into uppercase before storing it. For a plain Python class, I would do something like:
class Foo(db.Model):
    def get_attr(self):
        return self.something

    def set_attr(self, value):
        self.something = value.upper() if value is not None else None

    attr = property(get_attr, set_attr)
However, the GAE Datastore has its own concept of Property classes. I looked into the documentation and it seems that I could override get_value_for_datastore(model_instance) to achieve my goal. Nevertheless, I don't know what model_instance is or how to extract the corresponding field from it.
Is overriding GAE Property classes the right way to provides getter/setter-like functionality? If so, how to do it?
Added:
One potential issue with overriding get_value_for_datastore that I can think of is that it might not get called until the object is put into the datastore. Hence getting the attribute before storing the object would yield an incorrect value.
Subclassing GAE's Property class is especially helpful if you want more than one "field" with similar behavior, in one or more models. Don't worry, get_value_for_datastore and make_value_from_datastore are going to get called, on any store and fetch respectively -- so if you need to do anything fancy (including but not limited to uppercasing a string, which isn't actually all that fancy;-), overriding these methods in your subclass is just fine.
Edit: let's see some example code (net of imports and main):
class MyStringProperty(db.StringProperty):
    def get_value_for_datastore(self, model_instance):
        vv = db.StringProperty.get_value_for_datastore(self, model_instance)
        return vv.upper()

class MyModel(db.Model):
    foo = MyStringProperty()

class MainHandler(webapp.RequestHandler):
    def get(self):
        my = MyModel(foo='Hello World')
        k = my.put()
        mm = MyModel.get(k)
        s = mm.foo
        self.response.out.write('The secret word is: %r' % s)
This shows you the string's been uppercased in the datastore -- but if you change the get call to a simple mm = my you'll see the in-memory instance wasn't affected.
But, a db.Property instance itself is a descriptor -- wrapping it into a built-in property (a completely different descriptor) will not work well with the datastore (for example, you can't write GQL queries based on field names that aren't really instances of db.Property but instances of property -- those fields are not in the datastore!).
So if you want to work with both the datastore and for instances of Model that have never actually been to the datastore and back, you'll have to choose two names for what's logically "the same" field -- one is the name of the attribute you'll use on in-memory model instances, and that one can be a built-in property; the other one is the name of the attribute that ends up in the datastore, and that one needs to be an instance of a db.Property subclass and it's this second name that you'll need to use in queries. Of course the methods underlying the first name need to read and write the second name, but you can't just "hide" the latter because that's the name that's going to be in the datastore, and so that's the name that will make sense to queries!
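A sketch of that two-name arrangement, sticking with the old db API (stored_name and name are just illustrative names):

class Place(db.Model):
    stored_name = db.StringProperty()   # this is the name the datastore and GQL see

    def _get_name(self):
        return self.stored_name

    def _set_name(self, value):
        self.stored_name = value.upper() if value is not None else None

    # convenient in-memory attribute; queries must still use stored_name
    name = property(_get_name, _set_name)

p = Place()
p.name = 'Somegrad'   # the setter uppercases into p.stored_name
# GQL would be written against the stored name, e.g. WHERE stored_name = 'SOMEGRAD'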
What you want is a DerivedProperty. The procedure for writing one is outlined in that post - it's similar to what Alex describes, but by overriding get instead of get_value_for_datastore, you avoid issues with needing to write to the datastore to update it. My aetycoon library has it and other useful properties included.
I have some software that is heavily dependent on MySQL, and is written in python without any class definitions. For performance reasons, and because the database is really just being used to store and retrieve large amounts of data, I'd like to convert this to an object-oriented python script that does not use the database at all.
So my plan is to export the database tables to a set of files (not many -- it's a pretty simple database; it's big in that it has a lot of rows, but only a few tables, each of which has just two or three columns).
Then I plan to read the data in, and have a set of functions which provide access to and operations on the data.
My question is this:
is there a preferred way to convert a set of database tables to classes and objects? For example, if I have a table which contains fruit, where each fruit has an id and a name, would I have a "CollectionOfFruit" class which contains a list of "Fruit" objects, or would I just have a "CollectionOfFruit" class which contains a list of tuples? Or would I just have a list of Fruit objects?
I don't want to add any extra frameworks, because I want this code to be easy to transfer to different machines. So I'm really just looking for general advice on how to represent data that might more naturally be stored in database tables, in objects in Python.
Alternatively, is there a good book I should read that would point me in the right direction on this?
If the data is a natural fit for database tables ("rectangular data"), why not convert it to sqlite? It's portable -- just one file to move the db around, and sqlite is available anywhere you have python (2.5 and above anyway).
Generally you want your Objects to absolutely match your "real world entities".
Since you're starting from a database, it's not always the case that the database has any real-world fidelity, either. Some database designs are simply awful.
If your database has reasonable models for Fruit, that's where you start. Get that right first.
A "collection" may -- or may not -- be an artificial construct that's part of the solution algorithm, not really a proper part of the problem. Usually collections are part of the problem, and you should design those classes, also.
Other times, however, the collection is an artifact of having used a database, and a simple Python list is all you need.
Still other times, the collection is actually a proper mapping from some unique key value to an entity, in which case, it's a Python dictionary.
And sometimes, the collection is a proper mapping from some non-unique key value to some collection of entities, in which case it's a Python collections.defaultdict(list).
Start with the fundamental, real-world-like entities. Those get class definitions.
Collections may use built-in Python collections or may require their own classes.
There's no "one size fits all" answer for this -- it'll depend a lot on the data and how it's used in the application. If the data and usage are simple enough you might want to store your fruit in a dict with id as key and the rest of the data as tuples. Or not. It totally depends. If there's a guiding principle out there then it's to extract the underlying requirements of the app and then write code against those requirements.
You could have a Fruit class with id and name instance variables, a function to read/write the information from a file, and maybe a class variable to keep track of the number of fruits (objects) created.
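A minimal sketch of that suggestion (the one "id,name" pair per line file format is an assumption):

class Fruit(object):
    count = 0  # class variable: number of Fruit objects created

    def __init__(self, id, name):
        self.id = id
        self.name = name
        Fruit.count += 1

def read_fruit(path):
    """Read one 'id,name' pair per line into Fruit objects."""
    with open(path) as f:
        return [Fruit(*line.strip().split(',', 1)) for line in f]

def write_fruit(path, fruits):
    with open(path, 'w') as f:
        for fruit in fruits:
            f.write('%s,%s\n' % (fruit.id, fruit.name))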
In the simple case, namedtuples can get you started:
>>> from collections import namedtuple
>>> Fruit = namedtuple("Fruit", "name weight color")
>>> fruits = [Fruit(*row) for row in cursor.execute('select * from fruits')]
Fruit is equivalent to the following class:
>>> Fruit = namedtuple("Fruit", "name weight color", verbose=True)
class Fruit(tuple):
    'Fruit(name, weight, color)'

    __slots__ = ()

    _fields = ('name', 'weight', 'color')

    def __new__(cls, name, weight, color):
        return tuple.__new__(cls, (name, weight, color))

    @classmethod
    def _make(cls, iterable, new=tuple.__new__, len=len):
        'Make a new Fruit object from a sequence or iterable'
        result = new(cls, iterable)
        if len(result) != 3:
            raise TypeError('Expected 3 arguments, got %d' % len(result))
        return result

    def __repr__(self):
        return 'Fruit(name=%r, weight=%r, color=%r)' % self

    def _asdict(t):
        'Return a new dict which maps field names to their values'
        return {'name': t[0], 'weight': t[1], 'color': t[2]}

    def _replace(self, **kwds):
        'Return a new Fruit object replacing specified fields with new values'
        result = self._make(map(kwds.pop, ('name', 'weight', 'color'), self))
        if kwds:
            raise ValueError('Got unexpected field names: %r' % kwds.keys())
        return result

    def __getnewargs__(self):
        return tuple(self)

    name = property(itemgetter(0))
    weight = property(itemgetter(1))
    color = property(itemgetter(2))
Another way would be to use the ZODB to directly store objects persistently. The only thing you have to do is to derive your classes from Persistent, and everything from the root object up is then automatically stored in that database as an object. The root object comes from the ZODB connection. There are many backends available and the default is simply a file.
A class could then look like this:
import persistent

class Collection(persistent.Persistent):
    def __init__(self, fruit=None):
        self.fruit = fruit if fruit is not None else []

class Fruit(persistent.Persistent):
    def __init__(self, name):
        self.name = name
Assuming you have the root object you can then do:
fruit = Fruit("apple")
root.collection = Collection([fruit])
and it's stored in the database automatically. You can find it again by simply accessing 'collection' on the root object:
print root.collection.fruit
You can also derive subclasses from e.g. Fruit as usual.
Useful links with more information:
The new ZODB homepage
a ZODB tutorial
That way you still are able to use the full power of Python objects and there is no need to serialize something e.g. via an ORM but you still have an easy way to store your data.
Here are a couple of points for you to consider. If your data is large, reading it all into memory may be wasteful. If you need random access and not just sequential access to your data, then you'll either have to scan (at most) the entire file each time, or read that table into an indexed in-memory structure like a dictionary. A list will still require some kind of scan (straight iteration, or binary search if sorted). With that said, if you don't require some of the features of a DB then don't use one, but if you just think MySQL is too heavy then +1 on the Sqlite suggestion from earlier. It gives you most of the features you'd want while using a database without the concurrency overhead.
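To illustrate the indexed in-memory structure: building a dictionary keyed on id once avoids rescanning the file on every lookup (a sketch; the "id,name" per row layout of fruit.csv is an assumption):

import csv

fruit_by_id = {}
with open('fruit.csv') as f:
    for fruit_id, name in csv.reader(f):
        fruit_by_id[fruit_id] = name    # O(1) random access afterwards

print(fruit_by_id.get('42'))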
Abstract persistence from the object class. Put all of the persistence logic in an adapter class, and assign the adapter to the object class. Something like:
class Fruit(object):
    adapter = None   # set to a concrete adapter (e.g. FruitAdapter) at startup

    @classmethod
    def get(cls, id):
        return cls.adapter.get(id)

    def put(self):
        self.adapter.put(self)

    def __init__(self, id, name, weight, color):
        self.id = id
        self.name = name
        self.weight = weight
        self.color = color

class FruitAdapter(object):
    def get(self, id):
        # retrieve the attributes (name, weight, color) from persistent storage here
        return Fruit(id, name, weight, color)

    def put(self, fruit):
        # insert/update fruit in persistent storage here
        pass

Fruit.adapter = FruitAdapter()
f = Fruit.get(1)
f.name = "lemon"
f.put()
# and so on...
Now you can build different FruitAdapter objects that interoperate with whatever persistence format you settle on (database, flat file, in-memory collection, whatever) and the basic Fruit class will be completely unaffected.
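For instance, a flat-file adapter could slot in without touching Fruit at all (a sketch; the id,name,weight,color CSV layout is an assumption, and put() naively appends rather than updating in place):

import csv

class CsvFruitAdapter(object):
    def __init__(self, path):
        self.path = path

    def get(self, id):
        # linear scan of a small CSV file: id,name,weight,color per row
        with open(self.path) as f:
            for row in csv.reader(f):
                if row[0] == str(id):
                    return Fruit(row[0], row[1], float(row[2]), row[3])
        return None

    def put(self, fruit):
        with open(self.path, 'a') as f:
            csv.writer(f).writerow([fruit.id, fruit.name, fruit.weight, fruit.color])

Fruit.adapter = CsvFruitAdapter('fruit.csv')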