I'm working on a program which parses data and presents it to users for annotation, for the purpose of generating training data for an ML model. I'm looking for advice on how to structure this module's classes in the most logical way. I'm a bit new to OOP; I find myself at the point of abstracting beyond the typical "vehicle -> car -> car_brand" model of class inheritance. The basic flow of this program is:
Retrieve messy data from external source
Parse data to create local representation which only contains information relevant to this task
Present data to users, who then mark it up with annotations
Generate statistics on those annotations
Should the interactive methods be part of the same class as the cleaned-up data? What about the methods to generate statistics?
I've tried subsuming all of this program's functionality under one class definition, which works fine but seems reductive and probably difficult for others to grasp quickly. Here is how I think the program might be structured (apologies for all the pseudo-code):
class AnnotationData:
    # has methods to retrieve messy data and smooth it into what humans need
    # to see to do this task. Populates class attributes to represent that data.
    ...

class AnnotationMethods(AnnotationData):
    # has methods to interact with data
    ...

class AnnotationStatistics(AnnotationData):
    # has methods to generate statistics on data which has been augmented by humans
    ...

if __name__ == "__main__":
    # create base class
    # populate base class with messy data
    # smooth messy data into human-readable format
    # instantiate AnnotationMethods class
    # human does annotation
    # instantiate AnnotationStatistics class
    # return sweet sweet stats
    ...
Subsuming all of this into a single class works fine. I'm just wondering what the best practice is for divvying up methods which humans interact with from methods which just populate data.
Check out this library: https://github.com/dedupeio/dedupe. It has a workflow similar to yours.
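If you do want to split up your own module, one pattern that fits your flow is to keep the cleaned-up data in one class and pass it into the classes that interact with it or report on it (composition rather than inheritance), so the annotation UI and the statistics both operate on the same instance. A rough sketch, with class and method names invented to mirror your pseudo-code (clean() and fetch_messy_data() are placeholders for your own parsing and retrieval logic):

class AnnotationData:
    """Holds the cleaned-up records plus whatever annotations exist so far."""
    def __init__(self, records):
        self.records = records
        self.annotations = {}

    @classmethod
    def from_source(cls, source):
        # retrieve messy data and keep only what annotators need to see
        return cls([clean(raw) for raw in source])


class Annotator:
    """Interactive part: presents records and stores the human's marks."""
    def __init__(self, data):
        self.data = data

    def annotate(self):
        for i, record in enumerate(self.data.records):
            self.data.annotations[i] = input(f"Label for {record!r}: ")


class AnnotationStatistics:
    """Read-only reporting on the annotated data."""
    def __init__(self, data):
        self.data = data

    def summary(self):
        return {"annotated": len(self.data.annotations),
                "total": len(self.data.records)}


if __name__ == "__main__":
    data = AnnotationData.from_source(fetch_messy_data())
    Annotator(data).annotate()
    print(AnnotationStatistics(data).summary())

This keeps "what the data is" separate from "how a human touches it" and from "what we compute about it", which is essentially the split you are asking about.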
I am working on a data-analysis project based on pandas. The data to be analyzed is collected from application log files. Log entries are based on sessions, which can be of different types (and can have different actions); each session can then have multiple services (also with different types, actions, etc.). I have transformed the log file entries into a pandas dataframe and, based on that, completed all the required calculations. At the moment that's around a few hundred different calculations, which are printed to stdout at the end. If an anomaly is found, it is specifically flagged. So the basic functionality is there, but now that this first phase is done, I'm not happy with the readability of the code, and it seems to me that there must be a way to organize it better.
For example what I have at the moment is:
def build(log_file):
    # build dataframe from log file entries
    return df

def transform(df):
    # transform dataframe (for example based on grouped sessions, services)
    return transformed_df

def calculate(transformed_df):
    # make calculations based on transformed dataframe and print them to stdout
    print(calculation1)
    print(calculation2)
    # etc.
Since there are numerous criteria for filtering the data, there are at least 30-40 different dataframe filters, used in both the calculate and transform functions. In the calculate functions I also have some helper functions that perform tasks applicable to similar session/service types, where the result is based on the dataframe filtered for that specific type. With all these requirements, transformations, and filters, I now have more than 1000 lines of code which, as I said, I feel could be more readable.
My current idea is to perhaps have classes organized like this:
class Session:
    # main class for sessions (it can be inherited by other session types),
    # also with standardized output for calculations
    ...

class Service:
    # main class for services (it can be inherited by other service types),
    # also with standardized output for calculations, etc.
    ...

class Dataframe:
    # dataframe class with filters, etc.
    ...
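To make that a bit more concrete, here is roughly what I imagine; the method names, column names (session_type, status) and the LoginSession subclass are just placeholders for illustration:

import pandas as pd

class Dataframe:
    """Wraps the raw dataframe and centralizes the reusable filters."""
    def __init__(self, df: pd.DataFrame):
        self.df = df

    def by_session_type(self, session_type) -> pd.DataFrame:
        return self.df[self.df["session_type"] == session_type]

    def failed(self) -> pd.DataFrame:
        return self.df[self.df["status"] == "error"]


class Session:
    """Base class for one session type; subclasses override what differs."""
    session_type = None

    def __init__(self, data: Dataframe):
        self.data = data

    def calculations(self) -> dict:
        # standardized output: every session type reports the same keys
        df = self.data.by_session_type(self.session_type)
        return {"count": len(df),
                "errors": (df["status"] == "error").sum()}


class LoginSession(Session):
    session_type = "login"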
But I'm not sure if this is a good approach. I tried searching here, on GitHub, and on various blogs, but I didn't find anything that gives examples of a good way to organize code in more-than-basic pandas projects. I would appreciate any suggestion that points me in the right direction.
I have a bunch of classes that all look somewhat like this:
import pandas as pd
from collections import namedtuple

class FakeAnalysis(Analysis):
    def __init__(self, dbi, metric: str, constraints: dict) -> None:
        super().__init__(dbi)
        self.metric = metric
        self.constraints = constraints.copy()

    def load_data(self) -> pd.DataFrame:
        data = self.dbi.select_data(
            {"val"}, {"period"}, **self.constraints
        )
        return data

    def run(self) -> namedtuple:
        """Do some form of dataframe transformation."""
        data = self.load_data()
        df = data.pivot_table(columns='period', values='val', index='product_tag')
        return namedtuple("res", ['df'])(**{"df": df})
They all take a metric, constraints, and a database interface class (dbi) as __init__ arguments. They all load the necessary data through the dbi and then apply some sort of transformation to the resulting dataframe before returning it as a namedtuple containing the transformed data and any other byproducts (i.e. there could be multiple dataframes).
The question is: what is the best way to unit test such code? The errors are usually the result of a combination of constraints resulting in unexpected data that the code does not know how to deal with. Should I just test each class with randomly generated constraints and see if it crashes? Or should I create a mock database interface which returns fixed data for a few different constraints and ensure the class returns the results expected for just these constraints? The latter doesn't seem of much use to me although it would be more along the lines of unit testing best practice...
Any better ideas?
This is what occurs to me.
You can validate the data first, and not worry about invalid data in your processing.
You can instead deal with invalid data without crashing, using try blocks to generate reasonable output for the user, log errors, or whatever is appropriate.
Unit test what your code does. Make sure it does what it says. Do it by mocking and inspecting mock calls. Use mocks to return invalid data and test that they trigger the invalid-data exceptions you provided.
If you find it difficult to express all the cases that could go wrong (maybe you have to generalize a bit here because you are dealing with very large or infinite possible inputs), it may be useful to stress the thing with lots of randomly generated data, which will show you cases you had not imagined (trust me, this works).
Capture those failures, and keep generating until a reasonable amount of random data (the typical size of your data, or 10x that, or more; you choose) no longer seems to trigger errors. Keep your random tests running, but reduce the number of tries so your tests run fast again while you go on coding the rest of the system.
Of course mock the database access for this.
Any time you find that data errors still happen, you can fix that case and increase the number of random tries to check more thoroughly. This is better than writing lots of specific cases by hand.
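A sketch of both ideas (a mocked dbi plus randomized data), assuming pytest and the FakeAnalysis class from the question; the import path and column names are whatever your project actually uses:

import random
import pandas as pd
from unittest.mock import MagicMock

from analyses import FakeAnalysis   # hypothetical import path


def make_dbi(frame):
    # a mock dbi whose select_data always returns `frame`
    dbi = MagicMock()
    dbi.select_data.return_value = frame
    return dbi


def test_run_pivots_fixed_data():
    frame = pd.DataFrame({
        "product_tag": ["a", "a", "b"],
        "period": [1, 2, 1],
        "val": [10.0, 20.0, 30.0],
    })
    analysis = FakeAnalysis(make_dbi(frame), metric="sales", constraints={"region": "EU"})
    result = analysis.run()
    # inspect the mock call and the shape of the output
    args, kwargs = analysis.dbi.select_data.call_args
    assert kwargs == {"region": "EU"}
    assert result.df.loc["a", 1] == 10.0


def test_run_survives_random_data():
    # fuzz with random frames to surface cases you have not imagined
    for _ in range(100):
        rows = random.randint(1, 5)
        frame = pd.DataFrame({
            "product_tag": random.choices(["a", "b", "c"], k=rows),
            "period": random.choices([1, 2], k=rows),
            "val": [random.random() for _ in range(rows)],
        })
        FakeAnalysis(make_dbi(frame), metric="m", constraints={}).run()   # should not raise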
I understand the workings of OO programming but have little practical experience in actually using it for more than one or two classes. When it comes to practically using it, I struggle with the OO design part. I've come to the following case, which could benefit from OO:
I have a few sets of data from different sources: some from files, others from the internet through an API, and others from yet another source. Some of them are quite alike in the data they contain and some are really different. I want to visualize this data, and since almost all of it is location-based, I plan on doing this on a map (using Folium in Python to create a Leaflet.js-based map) with markers of some sort (with a little bit of information in a popup). In some cases I also want to create a PDF with an overview of the data and save it to disk.
I came up with the following (start of an) idea for the classes (written in Python to show the idea):
class locationData(object):
    # for all the location-based data; will implement coordinates and a name,
    # for example
    ...

class fileData(locationData):
    # for the data that is loaded from disk
    ...

class measurementData(fileData):
    # measurements loaded from disk
    ...

class modelData(fileData):
    # model results loaded from disk
    ...

class VehicleData(locationData):
    # vehicle data loaded from a database
    ...

class terrainData(locationData):
    # some information about, for example, a mountain
    ...

class dataToPdf(object):
    # for writing data to PDFs
    ...

class dataFactory(object):
    # for creating the objects
    ...

class fileDataReader(object):
    # for loading the data that is on disk
    ...

class vehicleDatabaseReader(object):
    # to read the vehicle data from the DB
    ...

class terrainDataReader(object):
    # reads terrain data
    ...

class Data2HTML(object):
    # puts the data in Folium objects
    ...
Considering the data to output, I figured that each data class would render its own data (since it knows what information it has) in, for example, a render() method. The output of the render method (maybe a dict) would then be used by dataToPdf or Data2HTML, although I'm not exactly sure how to do this yet.
Would this be a good start for an OO design? Does anybody have suggestions or improvements?
The other day I described my approach for a similar question; I think you can use it. The best approach would be to have an object that can retrieve and return your data, and another one that can show it as you wish: maybe a map, maybe a graph, and anything else you would like to have.
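For example, here is a rough sketch of that split using your render() idea; the class names, fields, and sample values are only illustrative:

import folium

class TerrainData:
    # one of your location-based data classes; it knows how to render itself
    def __init__(self, name, lat, lon, height):
        self.name = name
        self.lat = lat
        self.lon = lon
        self.height = height

    def render(self) -> dict:
        # each data class decides what is worth showing about itself
        return {"coords": (self.lat, self.lon),
                "popup": f"{self.name}: {self.height} m"}


class Data2HTML:
    # consumes rendered dicts and puts them on a Folium map;
    # it knows nothing about the concrete data classes
    def __init__(self, folium_map):
        self.map = folium_map

    def add(self, rendered: dict) -> None:
        folium.Marker(location=rendered["coords"],
                      popup=rendered["popup"]).add_to(self.map)


m = folium.Map(location=(46.6, 8.0), zoom_start=9)
out = Data2HTML(m)
out.add(TerrainData("Eiger", 46.577, 8.005, 3967).render())
m.save("map.html")

A dataToPdf class could consume the same rendered dicts, so the data classes never need to know where their output ends up.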
What do you think?
Thanks
TensorFlow's scalar/histogram/image_summary functions are very useful for logging data to view with TensorBoard. But I'd like that information printed to the console as well (e.g. if I'm a crazy person without a desktop environment).
Currently, I'm adding the information of interest to the fetch list before calling sess.run, but this seems redundant since I'm already fetching the merged summaries. Fetching the merged summaries returns a protobuf, so I imagine I could scrape it using some generic Python protobuf library, but this seems like a common enough use case that there should be an easier way.
The main motivation here is encapsulation. Let's say I have my model and training script in different files. My model has a bunch of calls to tf.scalar_summary for the information that's useful to log. Ideally, I'd be able to specify whether or not to additionally print this information to the console by changing something in the training script, without changing the model file. Currently, I either pass all of the useful information to the training script (so I can fetch it), or I pepper the model file with calls to tf.Print.
Overall, there isn't first-class support for your use case in TensorFlow, so I would parse the merged summaries back into a tf.Summary() protocol buffer and then filter/print the data as you see fit.
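For example, something along these lines, where merged_summary_op and sess stand in for whatever your training script already has:

summary_str = sess.run(merged_summary_op)   # the serialized protobuf you already fetch
summary = tf.Summary()
summary.ParseFromString(summary_str)
for value in summary.value:
    # scalar summaries carry their number in simple_value;
    # histogram/image summaries use other fields
    if value.HasField('simple_value'):
        print(value.tag, value.simple_value)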
If you come up with a nice pattern, you could then merge it back into TensorFlow itself. I could imagine making this an optional setting on the tf.train.SummaryWriter, but it is probably best to just have a separate class for console-printing out interesting summaries.
If you want to encode into the graph itself which items should be summarized and printed, and which should only be summarized (or to set up a system of different verbosity levels), you could use the collections argument of the summary op constructors to organize different summaries into different groups. E.g. the loss summary could be put in collections [GraphKeys.SUMMARIES, 'ALWAYS_PRINT'], but another summary could be in collection [GraphKeys.SUMMARIES, 'PRINT_IF_VERBOSE'], etc. Then you can have different merge_summary ops for the different types of printing, and control which ones are run via command-line flags.
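A sketch of that idea with the summary API from that era of TensorFlow; loss, weight_norm, train_op, and FLAGS.print_summaries are stand-ins for your own tensors and flag mechanism:

loss_summary = tf.scalar_summary(
    "loss", loss,
    collections=[tf.GraphKeys.SUMMARIES, "ALWAYS_PRINT"])
debug_summary = tf.scalar_summary(
    "weight_norm", weight_norm,
    collections=[tf.GraphKeys.SUMMARIES, "PRINT_IF_VERBOSE"])

all_summaries = tf.merge_all_summaries()                              # written by the SummaryWriter
print_summaries = tf.merge_summary(tf.get_collection("ALWAYS_PRINT"))

# in the training script, choose what to fetch based on a flag
fetches = [train_op, all_summaries]
if FLAGS.print_summaries:
    fetches.append(print_summaries)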
I am programming some kind of simulation with its data organised in a tree. The main object is World which holds a bunch of methods and a list of City objects. Each City object in turn has a bunch of methods and a list of Population objects. Population objects have no method of their own, they merely hold attributes.
My question regards the latter Population objects, which I can either derive from object or create as dictionaries. What is the most efficient way to organise these?
Here are a few cases which illustrate my hesitation:
Saving the Data
I need to be able to save and load the simulation, for which purpose I use the built-in json module (I want the data to be human-readable). Because the program is organised in a tree, saving data at each level can be cumbersome. In this case, the population is best kept as a dictionary appended to a population list as an attribute of a City instance. This way, saving is a mere matter of passing the City instance's __dict__ into json.
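For example, with City stripped down to the relevant bits:

import json

class City:
    def __init__(self, name):
        self.name = name
        self.population = []            # list of plain dicts

city = City("Smallville")
city.population.append({"n": 1000, "b": 100, "d": 100})
print(json.dumps(city.__dict__))
# {"name": "Smallville", "population": [{"n": 1000, "b": 100, "d": 100}]}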
Using the Data
If I want to manipulate the population data, it is easier as a class instance than as a dictionary. Not only is the syntax simple, but I can also enjoy introspection features better while coding.
Performance
I am not sure, finally, as to what is most efficient in terms of resources. An object and a dictionary differ little in the end, since each object has a __dict__ attribute, which can be used to access all its attributes. If I run my simulation with large numbers of City and Population objects, which will use fewer resources: objects or dictionaries?
So again, what is the most efficient way to organise data in a tree? Are dictionaries or objects preferable? Or is there any secret to organising the data trees?
Why not a hybrid dict/object?
class Population(dict):
    def __getattr__(self, key):
        return self[key]

    def __setattr__(self, key, value):
        self[key] = value
Now you can easily access known names via attributes (foo.bar), while still having the dict functionality to easily access unknown names, iterate over them, etc. without the clunky getattr/setattr syntax.
If you want to always initialize them with particular fields, you can add an __init__ method:
    def __init__(self, starting=0, birthrate=100, imrate=10, emrate=10, deathrate=100):
        self.update(n=starting, b=birthrate, i=imrate, e=emrate, d=deathrate)
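A quick illustration of how that reads in use (with the __init__ above added to the class):

import json

pop = Population(starting=1000)
pop.b                    # 100, via __getattr__
pop["growth"] = 2.5      # still a plain dict underneath
sorted(pop)              # ['b', 'd', 'e', 'growth', 'i', 'n']
json.dumps(pop)          # serializes like any other dict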
As you've seen yourself, there is little practical difference. The main difference, in my opinion, is that using individual, hard-coded attributes is slightly easier with objects (no need to quote the name), while dicts easily allow treating all values as one collection (e.g. summing them). This is why I'd go for objects, since the data of the population objects is likely heterogeneous and relatively independent.
I think you should consider using a namedtuple (see the Python docs on the collections module). You get to access the attributes of the Population object by name like you would with a normal class, e.g. population.attribute_name instead of population['attribute_name'] for a dictionary. Since you're not putting any methods on the Population class this is all you need.
For your "saving data" criterion, there's also an _asdict method which returns a dictionary of field names to values that you could pass to json. (You might need to be careful about exactly what you get back from this method depending on which version of Python you're using. Some versions return a dictionary, and some return an OrderedDict. This might not make any difference for your purposes.)
namedtuples are also pretty lightweight, so they also address your "Performance" criterion. However, I'd echo other people's caution in saying not to worry about that; there's going to be very little difference unless you're doing some serious data crunching.
I'd say that in every case a Population is a member of a City, and if it's data only, why not use a dictionary?
Don't worry about performance, but if you really need to know, I think a dict is faster.