The project: read in 2D data, cluster the data points using different clustering techniques/models, and evaluate how well the clustering has worked.
Since I am unhappy with my project structure so far, and have little experience with project structures, I hope to get some feedback on how to proceed. The structure is as follows:
/2Dclustering
    __init__.py
    __main__.py
    __2dcluster.py
    /cluster_forming
        __init__.py
        __cluster_models.py
    /evaluate_cluster
        __how_good_is_clustering.py
        __choose_the_better_cluster.py
In __main__.py, we read in the input data and create a 2dcluster object using __2dcluster.py, which is then saved as the output. The 2dcluster class uses the functions from cluster_forming and evaluate_cluster to form a cluster and add a metric (i.e. how well did it perform?) to it. In both subfolders (cluster_forming and evaluate_cluster), we just have files with a bunch of functions instead of classes. My questions are:
1.) Does it in general make sense to break everything into so many subfolders?
2.) Would it make sense to have class objects in evaluate_cluster that do the evaluating? I feel like it is a little messy now, but I have no intuition for whether creating classes would over-complicate things.
3.) Is there a sensible way of creating classes that deal with all the subclasses, i.e. a class that just combines other classes, or is this nonsense?
If anyone has an intuition on a structure that would make more sense, I'd be really happy to hear it. As I said, as someone who has never written bigger projects, I am kind of at a loss as to what is considered a clever solution and what is overcomplicating the project. Thanks!
Your project's structure should balance the tension between competing stakeholders' needs.
i) The Coder : This person (probably you) will want to have a number of small, composable functions spread across the codebase where they can be easily tested in isolation, and reused in lots of different ways. The code should be split logically into isolatable, functional blocks that provide clear themes of functionality. Over time, as the codebase gets larger, it pays to split it into ever smaller unit components, to support testing and debugging.
ii) The End-User : This person wants to be able to install your code and be able to run or import it with a minimum of fiddling about. Their priority will be utility, and as such they will want a single point of entry, with a simple interface without having to spend time learning about project structures to get stuff done.
The structure should split the codebase up into different, but meaningful blocks of code, each of which might exist as elements in their own right.
The user should be able to run or access your code via a handful of useful entry points, and the tester/coder should be able to isolate any problem to a discrete function, the fixing of which won't impact anything else in the codebase.
Often, when building a project, it's common to start with a single, monolithic chunk of code, and then over time, split it out into separate units to support maintenance. As a project matures, splitting out commonly reused components into their own utility areas becomes a good strategy - having the foresight to do this from the start is laudable, but not always necessary.
If your project is to do with clustering, then there's likely a workflow that follows the steps outlined: process data, perform clustering, evaluate results - so there's likely going to be a functional split that develops along those lines - but they're all part of a fairly tightly coupled package of functionality, so I'd be tempted to arrange all of these into a single directory - maybe even a single .py file initially, depending on how much code you're likely to generate.
Possibly, if you're going to process data in lots of different ways (i.e. not just for clustering) then there might be a case for developing some utility data-reading/processing package that you can hook in for future processing tasks, which would warrant making a different package, or placing in its own sub-directory but that's highly speculative - and presupposes that you'll be bulking this package out with additional (non clustering) functionality/workflows.
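Purely as an illustration (the module names are made up, not a prescription), that single-directory arrangement might look like:

/2Dclustering
    __init__.py
    __main__.py      # entry point: read the data, run the workflow, save the output
    cluster.py       # the 2dcluster class
    models.py        # the cluster-forming functions
    evaluation.py    # the scoring / comparison functions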
I don't think you need to build your own classes on the fly as proposed; a cluster is just a set of associations between object identifiers and groups. Any clustering can be expressed as a set of tuples, where each tuple associates one index (i) with one group (g), with the i's drawn from the set I (all your data's indices) and the g's drawn from G (the full collection of groups).
One cluster assignment boils down to (i,g) where i ∊ I and g ∊ G
So a full clustering would consist of a list of all [(i,g)] for each i in I, and each associated g in G
Which is likely going to be the same for any cluster/grouping.
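As a minimal sketch (the data, group names and metric are made up), that representation plus an evaluation function needs no custom classes at all:

# A clustering is just a list of (index, group) assignments.
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]   # the 2D data, indexed 0..3
clustering = [(0, "A"), (1, "A"), (2, "B"), (3, "B")]        # (i, g) with i in I and g in G

def groups(clustering):
    """Collect the indices belonging to each group."""
    result = {}
    for i, g in clustering:
        result.setdefault(g, []).append(i)
    return result

# Any evaluation metric is then just a function of the data and the assignments,
# e.g. a (very crude) mean within-group spread:
def mean_spread(points, clustering):
    total, count = 0.0, 0
    for g, idxs in groups(clustering).items():
        cx = sum(points[i][0] for i in idxs) / len(idxs)
        cy = sum(points[i][1] for i in idxs) / len(idxs)
        for i in idxs:
            total += ((points[i][0] - cx) ** 2 + (points[i][1] - cy) ** 2) ** 0.5
            count += 1
    return total / count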
I am reading the source code of a game written in Python, which involves a lot of methods under many classes tangled together. I want to start with a graph that gives an overview of the whole package, something like: Class1.methodA uses Class2.methodA and Class2.methodC; Class2.methodC uses Class2.methodB; and so on, presented as a graph with nodes and arrows so that I can see the dependencies clearly.
I can certainly do that manually, level by level, but that would take a lot of time and I might mess it up when it gets complex.
I've seen a tool called "snakefood" which visualizes dependencies. I tried it but failed (it does not seem to work with Python 3? I am not sure why, and therefore I am also not sure whether it is what I am looking for). Any suggestions?
I implemented a physics simulation in Python (most of the heavy lifting is done in numerical libraries anyways, thus performance is good enough).
Now that the project has grown a bit, I added extra functionality via parameters that do not change during the simulation. With that comes the necessity to have the program do one thing or another based on their values, i.e., quite a few if-else branches scattered around the code.
My question is simple: does Python implement some form of branch prediction? Am I going to hurt performance significantly, or is the interpreter smart enough to see that some parameters never change? Having a constant if-else inside a function that is called a million times, is the conditional evaluated every time, or does some magic happen? When there is no easy way to remove the conditional altogether, is there a way to give the interpreter some hints and favour/emulate branch prediction?
You could in theory benefit here from some JIT functionality that may observe the control flow over time and could effectively suppress never-taken branches by rearranging the code. Some of the Python interpreters contain JIT compilers (I think PyPy does in newer versions, maybe Jython as well), and may be able to do this optimization, but that of course depends on the actual code.
However, the main form of branch prediction is done in HW, and is unrelated to the SW or language constructs used (in Python's case - quite a few levels of abstraction above). This mechanism eventually observes these conditional code paths as branches, and may be able to learn them if they are indeed statically determined. However, like any prediction mechanism, it has limited capacity, and since your code sounds fairly big, it may not be able to accommodate predictions for all these branches. It's still considered quite good, so chances are that the critical ones may work.
Finally, if you really want to optimize your code, you can convert some of these conditions to constants (assigning an argument a constant value instead of parsing the command line), or hide the condition completely with something like __debug__. This way you won't have to worry about predicting them, but can restore the capability with minimal work if you need them in the future.
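A minimal sketch of that last idea (the parameter and function names are made up): make the decision once, when the parameter is set, rather than inside the hot function:

# Instead of re-testing a parameter that never changes inside a hot function ...
def step_slow(state, use_fancy_correction):
    if use_fancy_correction:        # evaluated on every call
        return state * 1.01
    return state

# ... pick the implementation once, when the parameter is fixed:
def make_step(use_fancy_correction):
    if use_fancy_correction:
        def step(state):
            return state * 1.01
    else:
        def step(state):
            return state
    return step

step = make_step(use_fancy_correction=False)   # the million calls now contain no conditional at all

# __debug__ is a special case: CPython removes "if __debug__:" blocks entirely
# when run with -O, so they cost nothing in optimized runs.
if __debug__:
    print("running with debug checks enabled")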
I'm looking for some general advice on how to either re-write application code to be non-naive, or whether to abandon neo4j for another data storage model. This is not only "subjective", as it relates significantly to specific, correct usage of the neo4j driver in Python and why it performs the way it does with my code.
Background:
My team and I have been using neo4j to store graph-friendly data that is initially stored in Python objects. Originally, we were advised by a local/in-house expert to use neo4j, as it seemed to fit our data storage and manipulation/querying requirements. The data are always specific instances of a set of carefully-constructed ontologies. For example (pseudo-data):
Superclass1 -contains-> SubclassA
Superclass1 -implements-> SubclassB
Superclass1 -isAssociatedWith-> Superclass2
SubclassB -hasColor-> Color1
Color1 -hasLabel-> string::"Red"
...and so on, to create some rather involved and verbose hierarchies.
For prototyping, we were storing these data as sequences of grammatical triples (subject->verb/predicate->object) using RDFLib, and using RDFLib's graph-generator to construct a graph.
Now, since this information is just a complicated hierarchy, we just store it in some custom Python objects. We also do this in order to provide an easy API to other devs that need to interface with our core service. We hand them a Python library that is our object model and let them populate it with data, or we populate it and hand it to them for easy reading, and they do what they want with it.
To store these objects permanently, and to hopefully accelerate the writing and reading (querying/filtering) of these data, we've built custom object-mapping code that utilizes the official neo4j python driver to write and read these Python objects, recursively, to/from a neo4j database.
The Problem:
For large and complicated data sets (e.g. 15k+ nodes and 15k+ relations), the object relational mapping (ORM) portion of our code is too slow, and scales poorly. But neither I nor my colleague is an expert in databases or neo4j. I think we're being naive about how to accomplish this ORM. We began to wonder if it even made sense to use neo4j, when more traditional ORMs (e.g. SQLAlchemy) might just be a better choice.
For example, the ORM commit algorithm we have now is a recursive function that commits an object like this (pseudo code):
def commit(obj):
    for childstr in obj:                    # For each child attribute name
        child = getattr(obj, childstr)      # Get the actual child object
        if isinstance(child, <our object base type>):   # Open transaction, make nodes and relationship
            with session.begin_transaction() as tx:
                # Construct Cypher query with:
                #   MERGE object             (make object node)
                #   MERGE child              (make its child node)
                #   MERGE object-[]->child   (create relation)
                tx.run(<All 3 merges>)
            commit(child)                   # Recursively write the child and its children to neo4j
Is it naive to do it like this? Would an OGM library like Py2neo's OGM be better, despite ours being customized? I've seen this and similar questions that recommend this or that OGM method, but in this article, it says not to use OGMs at all.
Must we really just implement every method and benchmark for performance? It seems like there must be some best practices (other than using the batch IMPORT, which doesn't fit our use cases). We've read through articles like those linked, and seen the various tips on writing better queries, but it seems better to step back and examine the case more generally before attempting to optimize code line by line. Although it's clear that we can improve the ORM algorithm to some degree.
Does it make sense to write and read large, deep hierarchical objects to/from neo4j using a recursive strategy like this? Is there something in Cypher, or the neo4j drivers that we're missing? Or is it better to use something like Py2neo's OGM? Is it best to just abandon neo4j altogether? The benefits of neo4j and Cypher are difficult to ignore, and our data does seem to fit well in a graph. Thanks.
It's hard to know without looking at all the code and knowing the class hierarchy, but at the moment I'd hazard a guess that your code is slow in the OGM bit because every relationship is created in its own transaction. So you're doing a huge number of transactions for a larger graph which is going to slow things down.
I'd suggest for an initial import where you're creating every class/object, rather than just adding a new one or editing the relationships for one class, that you use your class inspectors to simply create a graph representation of the data, and then use Cypher to construct it in a lot fewer transactions in Neo4J. Using some basic topological graph theory you could then optimise it by reducing the number of lookups you need to do, too.
You can create a NetworkX MultiDiGraph in your Python code to model the structure of your classes. From there, there are a few different strategies to put the data into Neo4j - I also just found this, but have no idea whether it works or how efficient it is.
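Modelling that intermediate graph might look something like this (illustrative only; the node/edge attribute names are just one possible convention):

import networkx as nx

g = nx.MultiDiGraph()

# Nodes carry the label and any properties you will later want in Neo4j.
g.add_node("obj1", label="Superclass1", name="obj1")
g.add_node("obj2", label="SubclassB", name="obj2")

# Edges carry the relationship type.
g.add_edge("obj1", "obj2", label="implements")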
The most efficient way to query to import your graph will depend on the topology of the graph, and whether it is cyclical or not. Some options are below.
1. Create the Graph in Two Sets of Queries
Run one query for every node label to create every node, and then another to create every edge between every combination of node labels (the efficiency of this will depend on how many different node labels you're using); a rough sketch of this approach follows the list of options below.
2. Starting from the topologically highest or lowest point in the graph, create the graph as a series of paths
If you have lots of different edge labels and node labels, this might involve writing a lot of Cypher logic combining UNWIND and FOREACH (e.g. FOREACH (_ IN CASE WHEN r.label = 'SomeLabel' THEN [1] ELSE [] END | CREATE (n:SomeLabel {node_unique_id: x})-[...]->(...))), but if the graph is very hierarchical you could also use Python to keep track of which nodes already have all their lower nodes and relationships created, and then use that knowledge to limit the size of the paths that get sent to Neo4J in a query.
3. Use APOC to import the whole graph
Another option, which may or may not fit your use case and may or may not be more performant would be to export the graph to GraphML using NetworkX and then use the APOC GraphML import tool.
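As a rough, hedged sketch of option 1 using the official Python driver (the labels, properties, uids and connection details are all made up):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# One list of property dicts per node label, one list of endpoint pairs per edge type.
nodes_by_label = {"Superclass1": [{"uid": 1}, {"uid": 2}],
                  "SubclassB":   [{"uid": 3}]}
edges = {("Superclass1", "implements", "SubclassB"): [{"src": 1, "dst": 3}]}

with driver.session() as session:
    # One query per node label creates all nodes with that label.
    for label, rows in nodes_by_label.items():
        session.run("UNWIND $rows AS row MERGE (n:%s {uid: row.uid}) SET n += row" % label,
                    rows=rows)

    # One query per (source label, relationship type, target label) combination creates all of those edges.
    for (src, rel, dst), rows in edges.items():
        session.run("UNWIND $rows AS row "
                    "MATCH (a:%s {uid: row.src}), (b:%s {uid: row.dst}) "
                    "MERGE (a)-[:%s]->(b)" % (src, dst, rel),
                    rows=rows)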
Again, it's hard to offer a precise solution without seeing all your data, but I hope this is somewhat useful as a steer in the right direction! Happy to help / answer any other questions based on more data.
There is a lot going on here so I'll try to address this in smaller questions
Would an OGM library like Py2neo's OGM be better
With any ORM/OGM library, the reality is that you can always get better performance by bypassing it and delving into the belly of the beast. That is not really the ORM's entire job, though. An ORM is meant to save you time and effort by making relatively efficient DB use easy.
So it depends: if you want the best performance, skip the ORM and invest your time working at as low a level as you can (*requires advanced low-level knowledge of the beast you are working with, and a lot of your time). Otherwise, an ORM library is usually your best bet.
Our code is too slow, and scales poorly
Databases are complex. If at all possible, I would recommend bringing someone (or several someones) on board to be a company-wide database admin/expert. (This is harder when you don't already have one who can vet that new hires actually know what they are talking about.)
Assuming that is not an option, here are some things to consider.
IO is expensive, especially over the network. Minimize data that has to be sent in either direction. (This is why you page returned results: only return the data you need, as you actually need it.)
Caveat to that, creating request connections is very expensive. Minimize calls to the DB. (Have fun balancing the two ^_^) (Note: ORMs usually have built in mechanics to only commit what has changed)
Get to the data you want fast. Create indexes in the database to vastly improve fetch speed. The more unique and consistent the id is, the better.
Caveat, indexes have to be updated on writes that alter a value in them. So indexes reduce write speed and eat more memory to gain read speed. Minimize indexes.
Transactions are a memory operation. Committing a transaction is a disk IO operation. This is why batch jobs are far more efficient.
Caveat, Memory isn't infinite. Keep your jobs a reasonable size.
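For instance (a hedged sketch only; the label, property, data and index syntax depend on your Neo4j version), an index on the lookup key plus one big batched transaction, rather than one transaction per relationship:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
relations = [(1, 3), (1, 4), (2, 5)]    # made-up (parent_uid, child_uid) pairs

with driver.session() as session:
    # An index on the lookup property makes the MATCH/MERGE lookups cheap
    # (at the cost of slightly slower writes to that property).
    session.run("CREATE INDEX ON :Superclass1(uid)")   # Neo4j 3.x syntax

    # Group many small MERGEs into a single transaction: one disk commit instead of thousands.
    with session.begin_transaction() as tx:
        for parent_uid, child_uid in relations:
            # (in practice you would give b a label and an index too)
            tx.run("MATCH (a:Superclass1 {uid: $p}), (b {uid: $c}) "
                   "MERGE (a)-[:HAS_CHILD]->(b)",
                   p=parent_uid, c=child_uid)
        tx.commit()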
As you can probably tell, scaling DB operations to production levels is not fun. It's too easy to burn yourself over-optimizing on any axis, and this is just surface level over simplifications.
For prototyping, we were storing these data as sequences of grammatical triples
Less a question, and more a statement, but different types of databases have different strengths and weaknesses. Schema-less DBs are more specialized for cache stores; graph DBs are specialized for querying based on relationships (edges); relational DBs are specialized for fetching/updating records (tables); and triplestores are more specialized for, well, triples (RDF). (Etc.; there are more types.)
I mention this because it sounds like your data might be mostly "write once, read many". In this case, you probably actually should be using a triplestore. You can use any DB type for anything, but picking the best DB requires you to know how you use your data, and how that use can possibly evolve.
Must we really just implement every method and benchmark for performance?
Well, this is part of why stored procedures are so important. ORMs help abstract this part, and having an in-house domain expert would really help. It could just be that you are pushing the limits of what one machine can do. Maybe you just need to upgrade to a cluster; or maybe you have horrible code inefficiencies that have you touching a node 10k times in one save operation when no (or one) value changed. To be honest though, benchmarking doesn't do much unless you know what you are looking for. For example, the difference between 5 hours and 0.5 seconds can be as simple as creating one index.
(To be fair, while buying bigger and better database servers/clusters may be the inefficient solution, it is sometimes the most cost-effective compared to the salary of one database admin. So, again, it depends on your priorities. And I'm sure your boss would probably prioritize differently from what you'd like.)
TL;DR
You should hire a domain expert to help you.
If that is not an option, go to the bookstore (or Google), pick up Databases 4 Dummies (or hands-on online database tutorial classes), and become the domain expert yourself. (Which you can then use to boost your worth to the company.)
If you don't have time for that, probably your only saving grace would be to just upgrade your hardware to solve the problem with brute force. (*As long as growth isn't exponential)
After using scipy.integrate for a while, I am at the point where I need more functionality, like bifurcation analysis or parameter estimation. This is why I'm interested in using PyDSTool, but from the documentation I can't figure out how to work with ModelSpec, or whether it is actually what will lead me to a solution.
Here is a toy example of what I am trying to do: I have a network with two nodes, both having the same (SIR) dynamics, described by two ODEs, but different initial conditions. The equations are coupled between nodes via the epsilon parameters (see the formula linked below).
The formulas are in this picture for better readability (the 'n' and 'm' are indices, not exponents): http://image.noelshack.com/fichiers/2014/28/1404918182-odes.png (I could not use the upload on Stack Overflow, sadly).
In the two node case my code (using PyDSTool) looks like this:
# multiple SIR metapopulations
# parameter and initial condition definition; a dict is a must
import PyDSTool as pdt

params = {'alpha': 0.7, 'beta': 0.1, 'epsilon1': 0.5, 'epsilon2': 0.5}
ini = {'s1': 0.99, 's2': 1, 'i1': 0.01, 'i2': 0.00}

DSargs = pdt.args(name='SIRtest_multi',
                  ics=ini,
                  pars=params,
                  tdata=[0, 20],
                  # the for-macro generates formulas for s1,s2 and i1,i2;
                  # sum works similar but sums over the expressions in it
                  varspecs={'s[o]': 'for(o,1,2,-alpha*s[o]*sum(k,1,2,epsilon[k]*i[k]))',
                            'i[l]': 'for(l,1,2,alpha*s[l]*sum(m,1,2,epsilon[m]*i[m]))'})

# generator
DS = pdt.Generator.Vode_ODEsystem(DSargs)

# computation, a trajectory object is generated
trj = DS.compute('test')

# extraction of the points for plotting
pts = trj.sample()

# plotting; pylab is imported along with PyDSTool as plt
pdt.plt.plot(pts['t'], pts['s1'], label='s1')
pdt.plt.plot(pts['t'], pts['i1'], label='i1')
pdt.plt.plot(pts['t'], pts['s2'], label='s2')
pdt.plt.plot(pts['t'], pts['i2'], label='i2')
pdt.plt.legend()
pdt.plt.xlabel('t')
pdt.plt.show()
But in my original problem there are more than 1000 nodes, with 5 ODEs for each; every node is coupled to a different number of other nodes, and the epsilon values are not equal for all nodes. So tinkering with this syntax has not led me anywhere near a solution yet.
What I am actually thinking of is a way to construct separate sub-models/solvers(?) for every node, each having its own parameters (the epsilons, since they are different for every node), and then link them to each other. This is the point where I do not know whether this is possible in PyDSTool, and whether it is the right way to handle this kind of problem.
I looked through the examples and the docs of PyDSTool but could not figure out how to do it, so help is much appreciated! If the way I'm trying to do things is unorthodox or plain stupid, you are welcome to make suggestions on how to do it more efficiently. (Which is actually the more efficient/faster/better way to solve problems like this: subdividing it into many small (still not decoupled) models/solvers, or one containing all the ODEs at once?)
(I'm neither a mathematician nor a programmer, but willing to learn, so please be patient!)
The solution is definitely not to build separate simulation models. That won't work because so many variables will be continuously coupled between the sub-models. You absolutely must have all the ODEs in one place together.
It sounds like the solution you need is to use the ModelSpec object constructs. These let you hierarchically build the sub-model definitions out of symbolic pieces. They can have their own "epsilon" parameters, etc. Once you've declared all the pieces, you let PyDSTool make the final strings containing the ODE definitions for you. I suggest you look at the tutorial example at:
http://www.ni.gsu.edu/~rclewley/PyDSTool/Tutorial/Tutorial_compneuro.html
and the provided examples: ModelSpec_test.py, MultiCompartments.py. But, remember that you still have to have a source for the parameters and coupling data (i.e., a big matrix or dictionary loaded from a file) to be able to automate the process of building the model, otherwise you'd still be writing it all out by hand.
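Purely as an illustration of that last point (the coupling dictionary and parameter names are made up, and this uses the plain varspecs-string route from the question rather than ModelSpec, which would assemble these definitions for you), the coupling data has to drive the construction either way:

import PyDSTool as pdt

# Made-up coupling data, e.g. loaded from a file:
# node n is coupled to the listed neighbours with the given epsilon values.
coupling = {1: {2: 0.5},
            2: {1: 0.5}}

params = {'alpha': 0.7, 'beta': 0.1}
ini, varspecs = {}, {}

for n, neighbours in coupling.items():
    ini['s%d' % n] = 0.99 if n == 1 else 1.0
    ini['i%d' % n] = 0.01 if n == 1 else 0.0
    # Build this node's coupling sum from its own neighbour list and epsilons.
    for m, eps in neighbours.items():
        params['eps%d_%d' % (n, m)] = eps
    force = '+'.join('eps%d_%d*i%d' % (n, m, m) for m in neighbours)
    varspecs['s%d' % n] = '-alpha*s%d*(%s)' % (n, force)
    varspecs['i%d' % n] = 'alpha*s%d*(%s)' % (n, force)

DSargs = pdt.args(name='SIR_network', ics=ini, pars=params, tdata=[0, 20],
                  varspecs=varspecs)
DS = pdt.Generator.Vode_ODEsystem(DSargs)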
You have to build some classes for the components that you want to have. You might also create a factory function (compare 'makeSoma' in the neuralcomp.py toolbox) that will take all your sub-components and create an ODE based on summing something up from each of the declared components. At the end, you can refer to the parameters by their position in the hierarchy. One might be 's1.epsilon' while another might be 'i4.epsilon'.
Unfortunately, to build models like this efficiently you will have to learn to do some more complex programming! So start by understanding all the steps in the tutorial. You can contact me directly through the SourceForge support discussions, or by email, once you've got started and have specific questions.
I am working on a long-running Python program (one part of it is a Flask API, and the other a realtime data fetcher).
Both of my long-running processes iterate, quite often (the API one might even do so hundreds of times a second), over large data sets (second-by-second observations of certain economic series, for example 1-5MB worth of data or even more). They also interpolate, compare and do calculations between series, etc.
What techniques, for the sake of keeping my processes alive, can I practice when iterating / passing as parameters / processing these large data sets? For instance, should I use the gc module and collect manually?
UPDATE
I am originally a C/C++ developer and would have NO problem (and would even enjoy) writing parts in C++. I simply have 0 experience doing so. How do I get started?
Any advice would be appreciated.
Thanks!
Working with large datasets isn't necessarily going to cause memory complications. As long as you use sound approaches when you view and manipulate your data, you can typically make frugal use of memory.
There are two concepts you need to consider as you're building the models that process your data.
What is the smallest element of your data that you need access to in order to perform a given calculation? For example, you might have a 300GB text file filled with numbers. If you're looking to calculate the average of the numbers, read one number at a time to calculate a running average. In this example, the smallest element is a single number in the file, since that is the only element of our data set that we need to consider at any point in time.
How can you model your application such that you access these elements iteratively, one at a time, during that calculation? In our example, instead of reading the entire file at once, we'll read one number from the file at a time. With this approach, we use a tiny amount of memory, but can process an arbitrarily large data set. Instead of passing a reference to your dataset around in memory, pass a view of your dataset, which knows how to load specific elements from it on demand (and which can be freed once worked with). This is similar in principle to buffering, and is the approach many iterators take (e.g., xrange, open's file object, etc.).
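A minimal sketch of that running-average example (the file name is made up):

def numbers(path):
    """Yield one number at a time; only one line is ever held in memory."""
    with open(path) as f:          # open() gives a lazy, line-by-line iterator
        for line in f:
            yield float(line)

def running_average(values):
    total, count = 0.0, 0
    for x in values:               # constant memory, regardless of file size
        total += x
        count += 1
    return total / count if count else 0.0

# avg = running_average(numbers("observations.txt"))   # works the same for 1KB or 300GB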
In general, the trick is understanding how to break your problem down into tiny, constant-sized pieces, and then stitching those pieces together one by one to calculate a result. You'll find these tenets of data processing go hand-in-hand with building applications that support massive parallelism, as well.
Looking towards gc is jumping the gun. You've provided only a high-level description of what you are working on, but from what you've said, there is no reason you need to complicate things by poking around in memory management yet. Depending on the type of analytics you are doing, consider investigating numpy which aims to lighten the burden of heavy statistical analysis.
It's hard to say without a real look into your data/algorithms, but the following approaches seem to be universal:
Make sure you have no memory leaks, otherwise they will kill your program sooner or later. Use objgraph for this - a great tool! Read the docs - they contain good examples of the types of memory leaks you can face in a Python program.
Avoid copying data whenever possible. For example, if you need to work with part of a string or do string transformations, don't create a temporary substring - use indexes and stay read-only as long as possible (see the sketch after this list). It can make your code more complex and less "pythonic", but that is the cost of optimization.
Use gc carefully - it can make your process unresponsive for a while while adding no value. Read the docs. Briefly: you should use gc directly only when there is a real reason to do so, like the Python interpreter being unable to free memory after allocating a big temporary list of integers.
Seriously consider rewriting critical parts in C++. Start thinking about this unpleasant idea now, so you are ready to do it when your data becomes bigger. Seriously, it usually ends this way. You can also give Cython a try; it could speed up the iteration itself.
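A small illustration of the "don't copy, use indexes" point (the data here is a made-up bytes buffer):

data = b"timestamp,price\n" * 1000000      # stand-in for a large in-memory buffer

# Copying: every slice allocates a new bytes object.
def first_fields_copying(buf):
    return [line.split(b",")[0] for line in buf.split(b"\n") if line]

# Read-only: memoryview slices reference the same buffer, so no per-line
# copies are made until you actually need a concrete value.
def first_fields_readonly(buf):
    view, fields, start = memoryview(buf), [], 0
    while start < len(buf):
        end = buf.find(b"\n", start)
        if end == -1:
            end = len(buf)
        comma = buf.find(b",", start, end)
        if comma != -1:
            fields.append(view[start:comma])    # a view, not a copy
        start = end + 1
    return fields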