I'm a self-taught programmer and a lot of the problems I encounter come from a lack of formal education (and often also experience).
My question is the following: how do you decide where to store the data that a class or function creates? Here is a simple example:
Case: I have a webshop (SHOP) with a REST api and a product provider (PROVIDER) also with a REST API. I determine the product, I send that data to PROVIDER who sends me back formatted data that can be read by SHOP to make a working product on the webshop. PROVIDER also has a secondary REST api that provides generated images.
What I would come up with:
I'd make three classes: ProductBase, Shop and Provider
ProductBase would be the class from where I instantiate and store the individual product information.
Shop would be where I design the api interactions with the webshop.
Provider same as shop, but for interactions with provider api.
My problem: At some point you're creating data whose concern is not clearly separated. For example: would I store the generated product data (from PROVIDER) in the ProductBase instance I created? It feels like I'm coupling the two classes this way. But if not there, then where?
What if I create product images with PROVIDER and I upload them to SHOP? Do I store the uploaded image-url in PRODUCT? How do you keep track of all this info?
The question I want answered:
I've read a lot on OOP and Design Patterns, and I have adopted a TDD approach, which has greatly helped to improve my code, but I haven't found anything on how to approach the flow of data that is generated at runtime.
What would be a good way to solve above problem(s) and could you explain your rationale for it?
If I understand correctly, your current concern is that you have "raw" product data, which you want to store in objects, and you have "processed" (formatted) product data, which you also want to store in objects. Your question is whether you should mix them.
Let me just first point out the other obvious option. Namely, having two product classes: RawProduct and ProcessedProduct. Which to do?
(Edit: also, to be sure, product data should not be stored in the provider. The provider performs the action of formatting, but the data is product data, not provider data.)
It depends. There are a couple of considerations:
1) In general, in OOP, the idea is to couple actions on data with the data. So if possible, you have some method in ProductBase like "format()", where format will send the object off to the API to get formatted, and store the result in an instance variable. You can then also have a method like "find_image", that goes and fetches the image url from the API and then stores that in a field. An object's data is meant to be dynamic. It is meant to be altered by object methods.
2) If you need version control (if you want the full history of the object's state to be available), then you can't overwrite fields with new data. So either you need to store a history of every object field in the object, or you need to create new objects.
3) Is RAM a concern? I sometimes create dataclasses that store only the final part of an object's life so that I can fit more of the objects into memory.
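To make considerations 1) and 2) concrete, here is a minimal sketch, not a definitive design. The field names, the history list, and the two fake_provider_* functions are assumptions made up for illustration; they stand in for the real PROVIDER REST calls so the example runs without a network.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ProductBase:
    """Holds everything known about one product, raw and processed."""
    raw: Dict[str, Any]
    formatted: Optional[Dict[str, Any]] = None
    image_url: Optional[str] = None
    # Consideration 2: a very simple audit trail of what happened to the object.
    history: List[str] = field(default_factory=list)

    def format(self, format_fn: Callable[[Dict[str, Any]], Dict[str, Any]]) -> None:
        """Pass the raw data to the provider call and keep the result here."""
        self.formatted = format_fn(self.raw)
        self.history.append("formatted")

    def find_image(self, image_fn: Callable[[Dict[str, Any]], str]) -> None:
        """Ask the provider for an image URL and keep it on the product."""
        self.image_url = image_fn(self.raw)
        self.history.append("image_found")


# Hypothetical stand-ins for the PROVIDER REST calls (no network involved).
def fake_provider_format(raw):
    return {"title": raw["name"].title(), "price_cents": round(raw["price"] * 100)}


def fake_provider_image(raw):
    return "https://provider.example/images/" + raw["name"] + ".png"


product = ProductBase(raw={"name": "mug", "price": 4.5})
product.format(fake_provider_format)
product.find_image(fake_provider_image)
```

The product never imports the provider; it only receives a callable. That keeps the data on the product while the knowledge of how to talk to the API stays in the provider code.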
Personally I often find myself creating "RawObject" and "ProcessedObject" classes; it's just easier a lot of the time. But that's probably because I mostly work with document processing, where the split is very clear. Usually you'll just update the object's data.
A benefit of having one object with the full history is that it is much easier to debug. Because the raw data and the API result are in the same object. So you can very easily probe what went wrong. If you start splitting things up it's harder to track. In general, the more information an object has about where it's been, the easier it is to figure out what went wrong with it.
Remember also, since this is a Python question, that Python is multi-paradigm. If you're writing pipeline-style architectures (synchronous, linear processes), then a functional approach can also work well.
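A pipeline-style version could look roughly like this. It is only a sketch: the step functions, the dict keys, and the example URL are invented for illustration, and the provider call is faked.

```python
from functools import reduce
from typing import Any, Callable, Dict

Product = Dict[str, Any]
Step = Callable[[Product], Product]


def add_formatted(product: Product) -> Product:
    # Hypothetical stand-in for the PROVIDER formatting call.
    return {**product, "formatted": {"title": product["name"].title()}}


def add_image_url(product: Product) -> Product:
    # Hypothetical stand-in for uploading an image to SHOP.
    url = "https://shop.example/img/" + product["name"] + ".png"
    return {**product, "image_url": url}


def run_pipeline(product: Product, *steps: Step) -> Product:
    """Apply each step in order; every step returns a new dict."""
    return reduce(lambda acc, step: step(acc), steps, product)


raw = {"name": "mug"}
result = run_pipeline(raw, add_formatted, add_image_url)
```

Because each step returns a new dict instead of mutating its input, the raw data is still available unchanged after the pipeline has run.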
Once your data is stored in a product object, anything can hold a reference to that. So a shop can reference an object and a product can reference the object. Be clear on the difference between "has-a" relationships and "is-a" relationships.
As a caveat: I am an utter novice here. I wouldn't be surprised to learn a) this is already answered, but I can't find it because I lack the vocabulary to describe my problem or b) my question is basically silly to begin with, because what I want to do is silly.
Is there some way to store a reference to a class instance that is defined and kept in active memory and not stored in NDB? I'm trying to write an app that would help manage a number of characters/guilds in an MMO. I have a class, CharacterClass, that includes properties such as armor, name, etc. I define it in main.py as a base Python object, and then define the properties for each of the classes in the game. Each Character, which would be stored in Datastore, would have a property charClass, which would be a reference to one of those instances of CharacterClass. In theory I would be able to do things like
if character.charClass.armor == "Cloth":
while storing the potentially hundreds of unique characters and their specific data in Datastore, but without creating a copy of "Cloth" for every cloth-armor character, or querying Datastore for what kind of armor a mage wears thousands of times a day.
I don't know what kind of NDB property to use in Character to store the reference to the applicable CharacterClass. Or if that's the right way to do it, even. Thanks for taking the time to puzzle through my confused question.
A string is all you need. You just need to fetch the class based on the string value. You could create a custom property that automatically instantiates the class on reference.
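A rough sketch of the string-lookup idea, in plain Python so it runs anywhere. The registry dict, the char_class_key attribute, and the sample classes are assumptions for illustration; in the real app, Character would be a Datastore model and only the string would be persisted.

```python
class CharacterClass:
    """Static game data, defined once in code and shared by all characters."""

    def __init__(self, name, armor):
        self.name = name
        self.armor = armor


# One in-memory instance per game class, keyed by the string that gets stored.
CHARACTER_CLASSES = {
    "mage": CharacterClass("Mage", "Cloth"),
    "warrior": CharacterClass("Warrior", "Plate"),
}


class Character:
    """Stand-in for the Datastore entity; only the string key would be saved."""

    def __init__(self, name, char_class_key):
        self.name = name
        self.char_class_key = char_class_key

    @property
    def charClass(self):
        # Resolve the stored string to the shared in-memory instance.
        return CHARACTER_CLASSES[self.char_class_key]


merlin = Character("Merlin", "mage")
gandalf = Character("Gandalf", "mage")
```

With this, character.charClass.armor == "Cloth" works as in the question, and every mage shares the same CharacterClass instance, so nothing is copied and nothing is queried.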
However, I have a feeling that hard-coding the values in code might be a bit unwieldy. Maybe your character class instances should be datastore entities as well. That means you can adjust these parameters without deploying new code.
If you want these objects in memory then you can pre-cache them on warmup.
Introduction
In Django, when the data you want to display on a template is included in one object, it's f**** easy. To sum up the steps (that everyone knows, actually):
You write the right method to get your object in your model class.
You call this method in your view, passing the result to the template.
You iterate on the result in the template with a for loop, to display your objects in a table, for example.
Now, let's take a more complex situation
Let's say that the data you want to display is widely spread over different objects of different classes. You need to call many methods to get these data.
Once you call these different methods, you get different variables (dissimilar objects, integers, lists of strings, etc.)
Nevertheless, you still want to pass everything to a template and display a pretty table in the end.
The problem is:
If you're passing all the raw objects containing the data you need to your template, it is completely unorganised and you can't iterate on variables in a clean way to get what you need to display your table.
The question is:
How (which structure) and where (models? views?) should I organize my complex data before passing it to a template?
My idea on this (which can be totally wrong):
For each view that needs "spread data" to pass to a template, I could create a method (like viewXXX_organize_data()) in views.py that would take the raw objects and return a data structure with organized data, which would help me display a table by iterating on it.
About the data structure to choose, I compared lists with dictionaries
dictionaries have key so it's cleaner to call {{dict.a-key-name}} rather than {{ tabl.3}} in the template.
lists can be sorted, so when you need to sort by date the elements you want to display, dictionary is not helpful, arghh, stuck again!
What do you think about all that? Thanks for reading this far, and for sharing your thoughts on this!
With your question you are entering a conceptual/architectural domain rather than a "this particular view of the data in my project is hard to represent in the template layer of Django" one. So I will try to give you the bird's-eye view of the problem and you can decide for yourself.
The first philosophy box in the Django template language documentation clearly states that templates should have as little program logic as possible. This indicates that the representation of the data used in the template should be simple and totally adapted to the template you are trying to build (this is my interpretation of it). It follows that you should have a layer responsible for mediating between the representation of your data (models or other sources) and the data that your template needs to achieve the final representation you want your users to see.
This layer can simply stay in your view, in viewXXX_organize_data, or take some other form in a more complex/elaborate architecture (see DCI or Hexagonal).
In your case I would start by doing something like viewXXX_organize_data(), where I would use the most appropriate data structures for the template you are trying to build, while keeping some independence from the way you obtain your data (through models, other services, etc.).
You can even think of not using your model objects directly in the template and instead creating template-specific objects to represent a certain view of the data.
Hope this helps you make a decision. It's not a concrete answer, but it should help you make a decision and then stay coherent throughout your app.
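As a sketch of what such an organizing function might return, one option is a list of dicts: the list keeps an order (so it can be sorted by date) and each dict gives named keys for the template. The function name organize_rows and the sample data are hypothetical, not taken from the question.

```python
from datetime import date
from operator import itemgetter


def organize_rows(orders, customer_names, totals):
    """Merge several sources into one list of dicts, sorted by date."""
    rows = [
        {
            "date": order["date"],
            "customer": customer_names[order["customer_id"]],
            "total": totals[order["id"]],
        }
        for order in orders
    ]
    return sorted(rows, key=itemgetter("date"))


# Hypothetical data standing in for the results of several model methods.
orders = [
    {"id": 1, "customer_id": 10, "date": date(2024, 3, 5)},
    {"id": 2, "customer_id": 11, "date": date(2024, 1, 20)},
]
customer_names = {10: "Ann", 11: "Bob"}
totals = {1: 30, 2: 12}

rows = organize_rows(orders, customer_names, totals)
```

The view would pass rows in the context, and the template can then loop with {% for row in rows %} and read {{ row.customer }} or {{ row.total }} without any further logic.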
In my app, I have the following process:
Get a very long list of people
Create an entity for each person
Send an email to each person (step 2 must be completed before step 3 starts)
Because the list of people is very large, I don't want to put them in the same entity group.
In doing step 3, I can query the list of people like this:
Person.all()
Because of eventual consistency, I might miss some people in step 3. What is a good way to ensure that I am not missing anyone in step 3?
Is there a better solution than this?:
while Person.all().count() < N:
    pass
for p in Person.all():
    pass  # do whatever
EDIT:
Another possible solution came to mind. I could create a linked list of the people. I can store a link to the first one, he can link to the second one, and so on. It seems that the performance would be poor, however, because you'd be doing each get separately and wouldn't have the efficiencies of a query.
UPDATE: I reread your post and saw that you don't want to put them all in the same entity group. I'm not sure how to guarantee strong consistency without doing so. You might want to restructure your data so that you don't have to put them in the same entity group, but in several. Perhaps depending on some aspect of a group of Person entities? (e.g., mailing list they are on, type of email being sent, etc.) Does each Person only contain a name and an email address, or are there other properties involved?
Google suggests a few other alternatives:
If your application is likely to encounter heavier write usage, you may need to consider using other means: for example, you might put recent posts in a memcache with an expiration and display a mix of recent posts from the memcache and the Datastore, or you might cache them in a cookie, put some state in the URL, or something else entirely. The goal is to find a caching solution that provides the data for the current user for the period of time in which the user is posting to your application. Remember, if you do a get, a put, or any operation within a transaction, you will always see the most recently written data.
So it looks like you may want to investigate those possibilities, although I'm not sure how well they would translate to what your app needs.
ORIGINAL POST: Use ancestor queries.
From Google's "Structuring Data for Strong Consistency":
To obtain strongly consistent query results, you need to use an ancestor query limiting the results to a single entity group. This works because entity groups are a unit of consistency as well as transactionality. All data operations are applied to the entire group; an ancestor query won't return its results until the entire entity group is up to date. If your application relies on strongly consistent results for certain queries, you may need to take this into consideration when designing your data model. This page discusses best practices for structuring your data to support strong consistency.
So when you create a Person entity, set a parent for it. I believe you could even just have a specific entity be the "parent" of all the others, and it should give you strong consistency. (Although I like to structure my data a bit with ancestors anyway.)
from google.appengine.ext import db

# Gives you the ancestor key
def ancestor_key(kind, id_or_name):
    return db.Key.from_path(kind, id_or_name)

# Kind is the db model you're using (should be 'Person' in this case) and
# id_or_name should be the key id or name for the parent
new_person = Person(your_params, parent=ancestor_key('Kind', id_or_name))
You could even do queries at that point for all the entities with the same parent, which is nice. But that should help you get more consistent results regardless.
I'm creating a small website with Django, and I need to calculate statistics with data taken from several tables in the database.
For example (nothing to do with my actual models), for a given user, let's say I want all birthday parties he has attended, and people he spoke with in said parties. For this, I would need a wide query, accessing several tables.
Now, from the object-oriented perspective, it would be great if the User class implemented a method that returned that information. From a database model perspective, I don't like at all the idea of adding functionality to a "row instance" that needs to query other tables. I would like to keep all properties and methods in the Model classes relevant to that single row, so as to avoid scattering the business logic all over the place.
How should I go about implementing database-wide queries that, from an object-oriented standpoint, belong to a single object? Should I have an external kinda God-object that knows how to collect and organize this information? Or is there a better, more elegant solution?
I recommend extending Django's Model-Template-View approach with a controller. I usually have a controller.py within my apps which is the only interface to the data sources. So in your above case I'd have something like get_all_parties_and_people_for_user(user).
This is especially useful when your "data taken from several tables in the database" becomes "data taken from several tables in SEVERAL databases" or even "data taken from various sources, e.g. databases, cache backends, external apis, etc.".
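A minimal sketch of what such a controller function could look like. The in-memory lists below are hypothetical stand-ins for the real tables (in a Django app these would be ORM queries), and the returned shape is just one possible choice.

```python
from collections import namedtuple

# Hypothetical stand-ins for rows from two different tables.
Attendance = namedtuple("Attendance", "user party")
Conversation = namedtuple("Conversation", "party speaker listener")

ATTENDANCES = [
    Attendance("ann", "p1"),
    Attendance("ann", "p2"),
    Attendance("bob", "p1"),
]
CONVERSATIONS = [
    Conversation("p1", "ann", "bob"),
    Conversation("p1", "ann", "cy"),
    Conversation("p2", "bob", "ann"),
]


def get_all_parties_and_people_for_user(user):
    """Return {party: [people the user spoke with at that party]}."""
    parties = [a.party for a in ATTENDANCES if a.user == user]
    return {
        party: sorted(
            c.listener
            for c in CONVERSATIONS
            if c.party == party and c.speaker == user
        )
        for party in parties
    }


result = get_all_parties_and_people_for_user("ann")
```

Views call only this function, so swapping the lists for database queries, a cache, or an external API changes nothing for the callers.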
User.get_attended_birthday_parties() or Event.get_attended_parties(user) work fine: it's an interface that makes sense when you use it. Creating an additional "all-purpose" object will not make your code cleaner or easier to maintain.
Today I was refactoring some code and revisited an old friend, an Address class (see below). It occurred to me that, in our application, we don't do anything special with addresses-- no queries, only lightweight validation and frequent serialization to JSON. The only "useful" properties from the developer point-of-view are the label and person.
So, I considered refactoring the Address model to use a custom AddressProperty (see below), which strikes me as a good thing to do, but off the top of my head I don't see any compelling advantages.
Which method would you choose, why and what tradeoffs guide that decision?
# a lightweight Address for form-based CRUD
# many-to-one relationship with Person objects
class Address(db.Model):
    label = db.StringProperty()
    person = db.ReferenceProperty(collection_name='addresses')
    address1 = db.StringProperty()
    address2 = db.StringProperty()
    city = db.StringProperty()
    zipcode = db.StringProperty()

# an alternate representation -- worthwhile? tradeoffs?
class Address(db.Model):
    label = db.StringProperty()
    person = db.ReferenceProperty(collection_name='addresses')
    details = AddressProperty()  # like db.PostalAddressProperty, more methods
Given that you don't need to query on addresses, and given they tend to be fairly small (as opposed to, say, a large binary blob), I would suggest going with the latter. It'll save space and time (fetching it) - the only real downside is that you have to implement the property yourself.
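To give an idea of what "implementing the property yourself" involves, here is a sketch of the value such a property might carry, with the JSON round trip it would need. AddressDetails and its methods are hypothetical names, and this is plain Python rather than an actual db.Property subclass; a real AddressProperty would do this conversion in its datastore hooks.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AddressDetails:
    """All address fields packed into a single value."""
    address1: str
    address2: str
    city: str
    zipcode: str

    def to_json(self) -> str:
        # One string to store, instead of four separate properties.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "AddressDetails":
        return cls(**json.loads(text))


home = AddressDetails("1 Main St", "Apt 2", "Springfield", "12345")
stored = home.to_json()
restored = AddressDetails.from_json(stored)
```

The trade-off is the one described above: the individual fields can no longer be queried, but the whole address is fetched and serialized as one small value.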
If you wanted to be a little different, you could always store the data in one table as structures and then have another table for lookups and metadata.