OO Design of data to map - python

I understand the workings of OO programming but have little practical experience actually using it for more than one or two classes. When it comes to using it in practice, I struggle with the OO design part. I've come to the following case which could benefit from OO:
I have a few sets of data from different sources: some from file, others from the internet through an API, and others from yet another source. Some of them are quite alike in the data they contain and some are really different. I want to visualize this data, and since almost all of it is location-based I plan on doing this on a map (using Folium in Python to create a Leaflet.js-based map) with markers of some sort (with a little information in a popup). In some cases I also want to create a PDF with an overview of the data and save it to disk.
I came up with the following (start of an) idea for the classes (written in python to show the idea):
class locationData(object):
    """For all the location-based data; will implement coordinates and a name."""

class fileData(locationData):
    """For the data that is loaded from disk."""

class measurementData(fileData):
    """Measurements loaded from disk."""

class modelData(fileData):
    """Model results loaded from disk."""

class VehicleData(locationData):
    """Vehicle data loaded from a database."""

class terrainData(locationData):
    """Some information about, for example, a mountain."""

class dataToPdf(object):
    """For writing data to PDFs."""

class dataFactory(object):
    """For creating the objects."""

class fileDataReader(object):
    """For loading the data that is on disk."""

class vehicleDatabaseReader(object):
    """To read the vehicle data from the DB."""

class terrainDataReader(object):
    """Reads terrain data."""

class Data2HTML(object):
    """Puts the data in Folium objects."""
Considering the data to output, I figured that each data class should render its own data (since it knows what information it has), for example in a render() method. The output of the render method (maybe a dict) would then be used in data2pdf or data2html, although I'm not exactly sure how to do this yet.
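The render() idea above can be sketched roughly as follows. This is only an illustrative assumption of how the pieces might fit together; the class and field names (lat, lon, speed, render_marker) are made up for the example, and a real Data2HTML would build folium.Marker objects instead of dicts.

```python
class LocationData:
    def __init__(self, name, lat, lon):
        self.name = name
        self.lat = lat
        self.lon = lon

    def render(self):
        # Base payload shared by every location-based data set.
        return {"name": self.name, "coordinates": (self.lat, self.lon)}


class VehicleData(LocationData):
    def __init__(self, name, lat, lon, speed):
        super().__init__(name, lat, lon)
        self.speed = speed

    def render(self):
        payload = super().render()
        payload["speed"] = self.speed  # the subclass adds what only it knows
        return payload


class Data2HTML:
    def render_marker(self, data):
        # In a real app this would return folium.Marker(location, popup=...).
        info = data.render()
        popup = ", ".join(f"{k}: {v}" for k, v in info.items()
                          if k != "coordinates")
        return {"location": info["coordinates"], "popup": popup}
```

The key point is that the output classes never inspect the data classes directly; they only consume the dict that render() hands them, so adding a new data type means adding one class with one render() method.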
Would this be a good start for OO design? Does anybody have suggestion or improvements?

The other day I described my approach to a similar question; I think you can use it here. The best approach would be to have one object that can retrieve and return your data, and another that can present it however you wish: maybe a map, maybe a graph, or anything else you would like to have.
What do you think?
Thanks

Related

Making pandas code more readable/better organized

I am working on a data analysis project based on pandas. The data to be analyzed is collected from application log files. Log entries are based on sessions, which can be of different types (and can have different actions); each session can have multiple services (also with different types, actions, etc.). I have transformed the log file entries into a pandas dataframe and then, based on that, completed all the required calculations. At the moment that is around a few hundred different calculations, which are printed to stdout at the end. If an anomaly is found, it is specifically flagged. So the basic functionality is there, but now that this first phase is done, I'm not happy with the readability of the code, and it seems to me that there must be a way to make the code better organized.
For example what I have at the moment is:
def build(log_file):
    # build dataframe from log file entries
    return df

def transform(df):
    # transform dataframe (for example based on grouped sessions, services)
    return transformed_df

def calculate(transformed_df):
    # make calculations based on transformed dataframe and print them to stdout
    print(calculation1)
    print(calculation2)
    # etc.
Since there are numerous criteria for filtering the data, there are at least 30-40 different dataframe filters present. They are used in the calculate and transform functions. In the calculate functions I also have some helper functions which perform tasks that can be applied to similar session/service types, where the result is based on just the filtered dataframe for that specific type. With all these requirements, transformations and filters, I now have more than 1000 lines of code which, as I said, I feel could be more readable.
My current idea is to have perhaps classes organized like this:
class Session:
    """Main class for sessions (can be inherited by other session types), with standardized output for calculations."""

class Service:
    """Main class for services (can be inherited by other service types), with standardized output for calculations, etc."""

class Dataframe:
    """Dataframe class with filters, etc."""
But I'm not sure if this is a good approach. I tried searching here, on GitHub, and on various blogs, but I didn't find anything that provides examples of the best way to organize code in more-than-basic pandas projects. I would appreciate any suggestion that would point me in the right direction.
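One way to tame 30-40 ad-hoc filters, independently of the class layout, is to give each filter a name in a registry so that calculations can refer to them declaratively. The sketch below is an assumption, not your actual code, and it is shown on plain dicts for brevity; in the real project each registered function would return a boolean mask on the DataFrame, and apply_filters would AND the masks together.

```python
FILTERS = {}

def register(name):
    """Decorator that stores a filter function under a readable name."""
    def deco(fn):
        FILTERS[name] = fn
        return fn
    return deco

@register("login_sessions")
def login_sessions(row):
    return row["session_type"] == "login"   # illustrative field names

@register("failed")
def failed(row):
    return row["status"] != "ok"

def apply_filters(rows, *names):
    # AND together the named filters; with pandas this would combine
    # boolean Series instead of testing rows one by one.
    return [r for r in rows if all(FILTERS[n](r) for n in names)]
```

A calculation then reads as apply_filters(rows, "login_sessions", "failed"), which documents itself far better than a chain of inline comparisons repeated in both transform and calculate.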

What is a sensible way to store matrices (which represent images) either in memory or on disk, to make them available to a GUI application?

I am looking for some high level advice about a project that I am attempting.
I want to write a PyQt application (following the model-view pattern) to read in images from a directory one by one and process them. Typically there will be a few thousand .png images (each around 1 megapixel, 16-bit grayscale) in the directory. After being read in, the application will process the integer pixel values of each image in some way, and crucially the result will be a matrix of floats for each. Once processed, the user should be able to go back and explore any of the matrices they choose (or several at once), and possibly apply further processing.
My question is regarding a sensible way to store the matrices in memory and access them when needed. After reading in the raw .png files and obtaining the corresponding matrices of floats, I can see the following options for handling the result:
Simply store each matrix as a numpy array and have every one of them stored in a class attribute. That way they will all be easily accessible to the code when requested by the user, but will this be poor in terms of RAM required?
After processing each, write out the matrix to a text file, and read it back in from the text file when requested by the user.
I have seen examples (see here) of people using SQLite databases to store data for a GUI application (using MVC pattern), and then query the database when you need access to data. This seems like it would have the advantage that data is not stored in RAM by the "model" part of the application (like in option 1), and is possibly more storage-efficient than option 2, but is this suitable given that my data are matrices?
I have seen examples (see here) of people using something called HDF5 for storing application data, and that this might be similar to using a SQLite database? Again, suitable for matrices?
Finally, I see that PyQt has the classes QImage and QPixmap. Do these make sense for solving the problem I have described?
I am a little lost with all the options, and don't want to spend too much time investigating all of them in too much detail so would appreciate some general advice. If someone could offer comments on each of the options I have described (as well as letting me know if any can be ruled out in this situation) that would be great!
Thank you
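One lightweight middle ground between options 1 and 2 (not mentioned in the question, so treat it as an assumption): save each processed matrix in NumPy's binary .npy format and reopen it with mmap_mode="r", so the floats stay on disk until a slice is actually touched. File names and shapes below are illustrative.

```python
import pathlib
import tempfile

import numpy as np

cache = pathlib.Path(tempfile.mkdtemp())  # per-session cache directory

def store(name, matrix):
    # Binary, lossless, and much smaller/faster than a text file.
    np.save(cache / f"{name}.npy", matrix)

def load(name):
    # mmap_mode="r" maps the file instead of reading it into RAM;
    # indexing then pulls in only the pages you touch.
    return np.load(cache / f"{name}.npy", mmap_mode="r")

m = np.arange(12, dtype=np.float64).reshape(3, 4)  # stand-in for one result
store("img_0001", m)
view = load("img_0001")
```

HDF5 (via h5py) gives you the same lazy-access behaviour plus grouping and metadata in a single file, so it scales better once thousands of matrices are involved; the .npy-per-matrix approach is simply the least machinery to get started with.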

How to structure module with interactive methods

I'm working on a program which parses data and presents it to users for annotation for the purpose of generating training data for an ML model. I'm looking for advice on how to structure this module's classes in the most logical way. I'm a bit new to OOP; abstracting beyond the typical "vehicle->car->car_brand" model of class inheritance is sort of where I find myself. The basic flow of this program is:
Retrieve messy data from external source
Parse data to create local representation which only contains information relevant to this task
Present data to users, who then mark it up with annotations
Generate statistics on those annotations
Should the interactive methods be part of the same class as the cleaned-up data? What about the methods to generate statistics?
I've tried subsuming all functionalities of this program under one class definition, which works fine but seems reductive and probably difficult for others to grasp quickly. Here is how I think the program might be structured (apologies for all the pseudo-code):
class AnnotationData:
    """Has methods to retrieve messy data and smooth it into what humans need to see to do this task. Populates class attributes to represent that data."""

class AnnotationMethods(AnnotationData):
    """Has methods to interact with the data."""

class AnnotationStatistics(AnnotationData):
    """Has methods to generate statistics on data which has been augmented by humans."""

if __name__ == "__main__":
    # create base class
    # populate base class with messy data
    # smooth messy data into human-readable format
    # instantiate AnnotationMethods class
    # human does annotation
    # instantiate AnnotationStatistics class
    # return sweet sweet stats
    pass
Subsuming all of this into a single class works fine. I'm just wondering what the best practice is for divvying up methods which humans interact with from methods which just populate data.
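One common alternative to the inheritance layout above is composition: the interactive class and the statistics class each take the data object they work on, instead of being the data object. The sketch below is illustrative only; the names (Annotator, labelled_fraction) and the record shape are assumptions, not an existing API.

```python
class AnnotationData:
    """Smooths messy records down to the fields annotators need to see."""
    def __init__(self, raw_records):
        self.items = [{"text": r.get("text", ""), "label": None}
                      for r in raw_records]


class Annotator:
    """Interactive layer: has-a data object, not is-a."""
    def __init__(self, data):
        self.data = data

    def annotate(self, index, label):
        self.data.items[index]["label"] = label


class AnnotationStatistics:
    """Read-only reporting over the same data object."""
    def __init__(self, data):
        self.data = data

    def labelled_fraction(self):
        items = self.data.items
        done = sum(1 for it in items if it["label"] is not None)
        return done / len(items) if items else 0.0
```

With this layout the statistics code can never accidentally mutate the data, and each class has one reason to change, which tends to be easier for others to grasp than one class that does retrieval, interaction, and reporting.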
Check out this library: https://github.com/dedupeio/dedupe. It has a workflow similar to yours.

store large data python

I am new to Python. Recently I got a project which involves processing a huge amount of health data in XML files.
Here is an example:
In my data there are about 100 documents, and each of them has a different id, origin, type and text. I want to store all of them so that I can train on this dataset. The first idea in my mind was to use two 2D arrays (one storing id and origin, the other storing text). However, I found there are too many features, and I want to know which features belong to each document.
Could anyone recommend the best way to do this?
For scalability, simplicity and maintainability, you should normalize the data, build a database schema, and move it into a database (SQLite, Postgres, MySQL, whatever).
This moves the complicated data logic out of Python, and is typical Model-View-Controller practice.
Creating a Python dictionary and traversing it is quick and dirty, but it will become a huge technical time sink very soon if you want to make practical sense of the data.
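The database suggestion can be sketched with the standard library's sqlite3 module. The schema below (one row per document with id, origin, type and text) is an illustrative guess at the structure described in the question, not the actual data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data

conn.execute("""
    CREATE TABLE documents (
        doc_id   TEXT PRIMARY KEY,
        origin   TEXT,
        doc_type TEXT,
        body     TEXT
    )""")

# Hypothetical sample rows standing in for the ~100 parsed XML documents.
docs = [("d1", "hospital_a", "report", "patient stable"),
        ("d2", "hospital_b", "note", "follow-up needed")]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", docs)

# "Which features belong to which document" becomes a simple lookup:
row = conn.execute(
    "SELECT origin, doc_type, body FROM documents WHERE doc_id = ?",
    ("d2",)).fetchone()
```

From here, pulling a training set is a single SELECT, and new per-document features become new columns (or a separate normalized table) rather than parallel arrays that must be kept in sync by hand.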

Tree of trees? Table of trees? What kind of data structure have I created?

I am creating a python module that creates and operates on data structures to store lots of semantically tagged data and metadata from real experiments. So in an experiment you have:
subjects
treatments
replicates
Enclosing these 3 categories is the experiment, and combinations of the three categories are what I am calling "units". Now there is no inherently correct hierarchy between the 3 (table-like) but for certain analyses it is useful to think of a certain permutation of the 3 as a hierarchy,
e.g. (subjects-->(treatments-->(replicates)))
or
(replicates-->(treatments-->(subjects)))
Moreover, when collecting data, files will be copy-pasted into a folder on a desktop, so data is at least coming in as a tree. I have thought a lot about which hierarchy is "better" but I keep coming up with use cases for most of the 6 possible permutations. I want my module to be flexible in that the user can think of the experiment or collect the data using whatever hierarchy, table, hierarchy-table hybrid makes sense to them.
Also the "units" or (table entries) are containers for arbitrary amounts of data (bytes to Gigabytes, whatever ideally) of any organizational complexity. This is why I didn't think a relational database approach was really the way to go and a NoSQL type solution makes more sense. But then i have the problem of how to order the three categories if none is "correct".
So my question is what is this multifaceted data structure?
Does some sort of fluid data structure or set of algorithms exist to easily inter-convert or produce structured views?
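One way to stay neutral between the six permutations is to keep the units in a flat mapping from (subject, treatment, replicate) tuples to their payloads, and build whatever nesting a given analysis wants on demand. The sketch below is an illustrative assumption about the data layout, not a description of any existing library.

```python
def nest(units, order):
    """Pivot a flat table of units into a nested dict.

    units: dict mapping (subject, treatment, replicate) tuples to payloads.
    order: a permutation such as ("subject", "treatment", "replicate").
    """
    axis = {"subject": 0, "treatment": 1, "replicate": 2}
    tree = {}
    for key, payload in units.items():
        node = tree
        for level in order[:-1]:             # walk/create intermediate levels
            node = node.setdefault(key[axis[level]], {})
        node[key[axis[order[-1]]]] = payload  # leaf holds the unit's data
    return tree

units = {("s1", "t1", "r1"): "data_a",
         ("s1", "t2", "r1"): "data_b"}

by_subject = nest(units, ("subject", "treatment", "replicate"))
by_treatment = nest(units, ("treatment", "subject", "replicate"))
```

Because the flat table is the single source of truth, every hierarchy is just a view, and inter-converting between them is one function call rather than a restructuring of the stored data; HDF5 groups can then mirror whichever view you choose to persist.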
The short answer is that HDF5 addresses these fairly common concerns, and I would suggest it: http://www.hdfgroup.org/HDF5/
In Python, these will help:
http://docs.h5py.org/en/latest/high/group.html
http://odo.pydata.org/en/latest/hdf5.html
