For the last while I've been writing analytical Python code that runs on demand, through queue-based batch processing, when users interact with a front-end tool.
Typically the users set some values in the front end tool that get passed as parameters to the analytical code and they either supply a dataset or choose a subset of data from an overall data source that their company provides.
Typically each analytical model sits in a larger repo amongst other analytical models, so each model usually sits in its own module, and that module exports one function which is the entry point into that model. The models range from simple ones that take on the order of minutes to very complex statistical or machine-learning-based models that take on the order of hours, and they might use combinations of numpy/Pandas/Numba or Dask dataframes.
Now to my question: I've been going back and forth on where I should concentrate my testing efforts for this type of code. Earlier in my career I naively thought that every function should have a unit test, so that my code would have a comprehensive set of tests.
I quickly realised that this was counter-productive, as even a small performance refactor could mean ripping apart, and possibly even throwing away, a lot of the unit tests. So it felt like I should only be writing tests for the main public function of each model. However, this usually caused the opposite problem: for some of the more complex models, edge cases deep in the control flow were now hard to test.
My question then is how should I be aiming to properly test these analytical models? Some people would probably say "Only test public facing functions, if you can't test edge cases through the public facing functions then they should technically not be reachable so don't need to be there". But, I've found, in reality this doesn't quite work.
To provide a simple example, say the particular model calculates a frequency matrix for dropoff/pickup points from a taxi dataset.
import pandas as pd


def _cat(col1, col2):
    # Join two columns into one "x, y" string column
    cat_col = col1.astype(str).str.cat(col2.astype(str), ', ')
    return cat_col


def _make_points_df(taxi_df):
    pickup_points = _cat(taxi_df["pickup_longitude"], taxi_df["pickup_latitude"])
    dropoff_points = _cat(taxi_df["dropoff_longitude"], taxi_df["dropoff_latitude"])
    points_df = pd.DataFrame({"pickup": pickup_points, "dropoff": dropoff_points})
    return points_df


def _points_df_to_freq_mat(points_df):
    # Rows are pickup points, columns are dropoff points, cells are counts
    mat_df = points_df.groupby(['pickup', 'dropoff']).size().unstack(fill_value=0)
    return mat_df


def _validate_taxi_df(taxi_df):
    if type(taxi_df) is not pd.DataFrame:
        raise TypeError(f"taxi_df param must be a pandas dataframe, got: {type(taxi_df)}")

    expected_cols = {
        "pickup_longitude",
        "pickup_latitude",
        "dropoff_longitude",
        "dropoff_latitude",
    }
    if set(taxi_df) != expected_cols:
        raise RuntimeError(
            f"Expected the following columns for taxi_df param: {expected_cols}. "
            f"Got: {set(taxi_df)}"
        )


def calculate_frequency_matrix(taxi_df, long_round=1, lat_round=1):
    """Calculate a dropoff/pickup frequency matrix which tells you the number of times
    passengers have been picked up and dropped from a given discrete point. The
    resolution of these points is controlled by using the long_round and lat_round params

    Parameters
    ----------
    taxi_df : pandas.DataFrame
        Dataframe specifying dropoff and pickup long/lat coordinates
    long_round : int
        Number of decimal places to round the dropoff and pickup longitude values to
    lat_round : int
        Number of decimal places to round the dropoff and pickup latitude values to

    Returns
    -------
    pandas.DataFrame
        Dataframe in matrix format of frequency of dropoff/pickup points

    Raises
    ------
    TypeError : If taxi_df is not a pandas DataFrame
    RuntimeError : If taxi_df does not contain correct columns
    """
    _validate_taxi_df(taxi_df)

    # Work on a copy so the caller's dataframe is not modified
    taxi_df = taxi_df.copy()

    taxi_df["pickup_longitude"] = taxi_df["pickup_longitude"].round(long_round)
    taxi_df["dropoff_longitude"] = taxi_df["dropoff_longitude"].round(long_round)
    taxi_df["pickup_latitude"] = taxi_df["pickup_latitude"].round(lat_round)
    taxi_df["dropoff_latitude"] = taxi_df["dropoff_latitude"].round(lat_round)

    points_df = _make_points_df(taxi_df)
    mat_df = _points_df_to_freq_mat(points_df)

    return mat_df
Taking in a dataframe like
   pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude
0        -73.988129        40.732029         -73.990173         40.756680
1        -73.964203        40.679993         -73.959808         40.655403
2        -73.997437        40.737583         -73.986160         40.729523
3        -73.956070        40.771900         -73.986427         40.730469
4        -73.970215        40.761475         -73.961510         40.755890
5        -73.991302        40.749798         -73.980515         40.786549
6        -73.978310        40.741550         -73.952072         40.717003
7        -74.012711        40.701527         -73.986481         40.719509
Say in terms of a folder structure this code would sit at
analytics/models/taxi_freq/taxi_freq.py
and the
analytics/models/taxi_freq/__init__.py
file would look like
from .taxi_freq import calculate_frequency_matrix
And obviously the private functions in the above code could be split across multiple utility files in analytics/models/taxi_freq/.
Would the consensus be to only test the calculate_frequency_matrix function, or should the "private" helper methods and other utility files/functions within the taxi_freq module also be tested?
As with software development in general, in testing you always have to search for solutions that represent the (ideally optimal) tradeoff between competing goals. One of the primary goals of testing in general, and of unit-testing in particular, is to find bugs (see Myers, Badgett, Sandler: The Art of Software Testing, or Beizer: Software Testing Techniques, but also many others).
In your project you may have a more relaxed position on this, but there are many software projects where it would have serious consequences if implementation level bugs escape to later development phases or even to the field. Some say, your goal should rather be to increase confidence in your code - and this is also true, but confidence can only be a consequence of doing testing right. If you don't test to find bugs, then I will simply not have confidence in your code after you have finished testing.
When finding bugs is a primary goal of unit-testing, attempts to keep unit-test suites completely independent of implementation details are likely to result in ineffective test suites, that is, test suites that are not suited to find all bugs that could be found. Different implementations have different potential bugs. If you don't use unit-testing for finding these bugs, then any other test level (integration, subsystem, system) is definitely less suited for finding them systematically.
For example, think about the different ways to implement a Fibonacci function: as an iterative or recursive function, as a closed form expression (Moivre/Binet), or as a lookup table: The interface is always the same, the possible bugs differ significantly, and so do the unit-testing strategies. There will be a useful set of implementation independent test cases, but these alone will not be sufficient to find all bugs that are likely for the specific implementation.
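To make the Fibonacci point concrete, here is a minimal sketch with two implementations behind the same interface. The names fib_iterative and fib_binet are my own, chosen for illustration. The implementation-agnostic cases pass for both versions; the large-argument case only exposes a weakness of the closed-form version, because double-precision floats cannot represent large odd integers exactly.

```python
import math


def fib_iterative(n: int) -> int:
    """Fibonacci by iteration; exact for any n >= 0 (Python ints are unbounded)."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_binet(n: int) -> int:
    """Fibonacci via the closed form (Moivre/Binet), evaluated in floating point."""
    sqrt5 = math.sqrt(5)
    phi = (1 + sqrt5) / 2
    return round(phi ** n / sqrt5)


# Implementation-agnostic cases: known values that any implementation must return.
KNOWN = {0: 0, 1: 1, 2: 1, 10: 55, 20: 6765}
for impl in (fib_iterative, fib_binet):
    for n, expected in KNOWN.items():
        assert impl(n) == expected

# Implementation-dependent case: F(80) is odd and larger than 2**53,
# so no double can hold it exactly and the closed form must be off.
assert fib_iterative(80) == 23416728348467685
assert fib_binet(80) != fib_iterative(80)
```

A test suite written only against the shared interface would most likely stop at the small known values and never probe the precision limit, which is exactly the bug class that is specific to the closed-form implementation.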
The goal to have an effective test suite therefore is in competition with another goal, namely to have a maintenance friendly test suite. This goal, however, comes in different forms with different consequences: You could demand that the unit-test suite shall not be affected when implementation details change. This is quite tough and IMO puts the secondary goal of maintenance friendly test code above the primary goal of finding bugs.
Meszaros has a more balanced formulation, namely "The effort for changes to the code base shall be commensurate with the effort to maintain the test suite." (see Meszaros: Principles of Test Automation: Ensure Commensurate Effort). That is, small changes to the SUT shall only require small changes to the test suite, while for larger changes to the SUT it is acceptable that the test suite requires equally large changes. (However, for me personally the formulation "the effort for test code maintenance shall be low" is sufficient.)
Conclusion:
For me, as I see finding bugs as the primary goal and test suite maintainability as a secondary goal, this leads to the following consequence: I accept that I have to test also implementation details to find more bugs. But, despite this fact I nevertheless try to keep the maintenance effort low: I do this mostly by applying the following mechanisms that aim at making it simpler to adjust the test suite in case of changes to the SUT:
First, if the goal of a specific test case can be reached by an implementation agnostic test case and an implementation dependent test case, prefer the implementation agnostic test case. In other words, don't make individual test cases unnecessarily implementation dependent.
Second, hide implementation details behind helper functions. There can be helper functions for specific setups, teardowns, assertions etc. This is an extremely powerful mechanism to limit the effect of implementation details within the test suite.
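As a sketch of the second mechanism (this is not code from the question): frequency_counts below is a deliberately simplified, pandas-free stand-in for the system under test, and make_trip and assert_frequency are hypothetical helper names. The point is that only the two helpers know how inputs are laid out and how results are stored, so a change to either layout means touching the helpers rather than every test.

```python
from collections import Counter


def frequency_counts(trips, ndigits=1):
    """Simplified stand-in for the SUT: count (pickup, dropoff) pairs after rounding."""
    counts = Counter()
    for p_lon, p_lat, d_lon, d_lat in trips:
        pickup = (round(p_lon, ndigits), round(p_lat, ndigits))
        dropoff = (round(d_lon, ndigits), round(d_lat, ndigits))
        counts[(pickup, dropoff)] += 1
    return counts


# --- test helpers: the only places that know the input and result layout ---

def make_trip(pickup, dropoff):
    """Build one input record from two (lon, lat) points."""
    return (*pickup, *dropoff)


def assert_frequency(result, pickup, dropoff, expected):
    """Check one cell of the result without exposing its container type to tests."""
    assert result[(pickup, dropoff)] == expected


def test_nearby_points_share_a_cell():
    trips = [
        make_trip((-73.98, 40.73), (-73.99, 40.76)),
        make_trip((-74.01, 40.74), (-74.02, 40.77)),
    ]
    result = frequency_counts(trips, ndigits=1)
    # Both trips round to the same pickup and dropoff points.
    assert_frequency(result, (-74.0, 40.7), (-74.0, 40.8), 2)


test_nearby_points_share_a_cell()
```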
The project: Read in 2D data, cluster datapoints based on different cluster techniques/models, and evaluate how well the clustering has worked.
Since I am unhappy with my project structure so far, and have little experience with project structures, I hope to get some feedback on how to proceed. The structure is as follows:
/2Dclustering
    __init__.py
    __main__.py
    __2dcluster.py
    /cluster_forming
        __init__.py
        __cluster_models.py
    /evaluate_cluster
        __how_good_is_clustering.py
        __choose_the_better_cluster.py
In __main__.py, we read in the input data and create a 2dcluster object using __2dcluster.py, which is then saved as the output. The 2dcluster class uses the functions from cluster_forming and evaluate_cluster to form a cluster and add a metric to it (i.e. how well did it perform?). In both subfolders (cluster_forming and evaluate_cluster), we just have files with a bunch of functions instead of classes. My questions are:
1.) Does it in general make sense to break everything into so many subfolders?
2.) Would it make sense to have classes in evaluate_cluster that do the evaluation? Right now it feels a little messy, but I have no intuition whether creating classes would over-complicate things.
3.) Is there a sensible way of creating classes that deal with all the subclasses, i.e. a class that just combines other classes- or is this nonsense?
If anyone has an intuition on a structure that would make more sense, I'd be really happy to hear it. As I said, as someone who has never written bigger projects, I am kind of at a loss as to what is considered a clever solution and what is overcomplicating the project. Thanks!
Your project's structure should balance the tension between competing stakeholders' needs.
i) The Coder : This person (probably you) will want to have a number of small, composable functions spread across the codebase where they can be easily tested in isolation, and reused in lots of different ways. The code should be split logically into isolatable, functional blocks that provide clear themes of functionality. Over time, as the codebase gets larger, it pays to split it into ever smaller unit components, to support testing and debugging.
ii) The End-User : This person wants to be able to install your code and be able to run or import it with a minimum of fiddling about. Their priority will be utility, and as such they will want a single point of entry, with a simple interface without having to spend time learning about project structures to get stuff done.
The structure should split the codebase up into different, but meaningful blocks of code, each of which might exist as elements in their own right.
The user should be able to run or access your code via a handful of useful entry-points, and the tester/coder should be able to isolate any problem to a discrete function, the fixing of which won't impact anything else in the codebase.
Often, when building a project, it's common to start with a single, monolithic chunk of code, and then over time, split it out into separate units to support maintenance. As a project matures, splitting out commonly reused components into their own utility areas becomes a good strategy - having the foresight to do this from the start is laudable, but not always necessary.
If your project is to do with clustering, then there's likely a workflow that follows the steps outlined: process data, perform clustering, evaluate results - so there's likely going to be a functional split that develops along those lines - but they're all part of a fairly tightly coupled package of functionality, so I'd be tempted to arrange all of these into a single directory - maybe even a single .py file initially, depending on how much code you're likely to generate.
Possibly, if you're going to process data in lots of different ways (i.e. not just for clustering) then there might be a case for developing some utility data-reading/processing package that you can hook in for future processing tasks, which would warrant making a different package, or placing in its own sub-directory but that's highly speculative - and presupposes that you'll be bulking this package out with additional (non clustering) functionality/workflows.
I don't think you need to build your own classes on the fly as proposed. A cluster is just a set of associations between object identifiers and groups. Any clustering can be expressed as a set of tuples, where each tuple associates one index (i) with one group (g), with i drawn from the set I (all your data's indices) and g drawn from the set G (the full collection of groups).
One cluster assignment boils down to (i, g) where i ∊ I and g ∊ G.
So a full clustering would consist of a list [(i, g)] with one entry for each i in I and its associated g in G.
That representation is likely going to be the same for any clustering/grouping.
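A minimal sketch of that representation in plain Python, with no custom classes. The sample assignments and the helper names members_by_group and group_of are made up for illustration.

```python
from collections import defaultdict

# Hypothetical example: six data indices assigned to three groups.
clustering = [(0, "a"), (1, "a"), (2, "b"), (3, "c"), (4, "b"), (5, "a")]


def members_by_group(assignments):
    """Invert a list of (index, group) pairs into {group: [indices]}."""
    groups = defaultdict(list)
    for index, group in assignments:
        groups[group].append(index)
    return dict(groups)


def group_of(assignments):
    """Turn the pairs into an {index: group} lookup."""
    return dict(assignments)


print(members_by_group(clustering))  # {'a': [0, 1, 5], 'b': [2, 4], 'c': [3]}
print(group_of(clustering)[4])       # b
```

Because every clustering technique can emit the same list of pairs, the evaluation functions only need to accept that one shape.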
Introduction
I'd like to know what other topic modellers consider to be an optimal topic-modelling workflow all the way from pre-processing to maintenance. While this question consists of a number of sub-questions (which I will specify below), I believe this thread would be useful for myself and others who are interested to learn about best practices of end-to-end process.
Proposed Solution Specifications
I'd like the proposed solution to preferably rely on R for text processing (but Python is fine also) and topic-modelling itself to be done in MALLET (although if you believe other solutions work better, please let us know). I tend to use the topicmodels package in R, however I would like to switch to MALLET as it offers many benefits over topicmodels. It can handle a lot of data, it does not rely on specific text pre-processing tools and it appears to be widely used for this purpose. However, some of the issues outlined below are also relevant for topicmodels. I'd like to know how others approach topic modelling and which of the below steps could be improved. Any useful piece of advice is welcome.
Outline
Here is how it's going to work: I'm going to go through the workflow which in my opinion works reasonably well, and I'm going to outline problems at each step.
Proposed Workflow
1. Clean text
This involves removing punctuation marks, digits and stop words, stemming words, and other text-processing tasks. Many of these can be done as part of constructing the term-document matrix, through functions such as TermDocumentMatrix from R's tm package.
Problem: This may, however, need to be performed on the text strings directly, using functions such as gsub, in order for MALLET to consume these strings. Performing it on the strings directly is not as efficient, as it involves repetition (e.g. the same word would have to be stemmed several times).
2. Construct features
In this step we construct a term-document matrix (TDM), followed by the filtering of terms based on frequency, and TF-IDF values. It is preferable to limit your bag of features to about 1000 or so. Next go through the terms and identify what requires to be (1) dropped (some stop words will make it through), (2) renamed or (3) merged with existing entries. While I'm familiar with the concept of stem-completion, I find that it rarely works well.
Problem: (1) Unfortunately MALLET does not work with TDM constructs and to make use of your TDM, you would need to find the difference between the original TDM -- with no features removed -- and the TDM that you are happy with. This difference would become stop words for MALLET. (2) On that note I'd also like to point out that feature selection does require a substantial amount of manual work and if anyone has ideas on how to minimise it, please share your thoughts.
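The difference described in (1) is a plain set difference between the two vocabularies. A minimal sketch in Python, with made-up vocabularies and a hypothetical function name extra_stop_words; how the resulting list is then handed to MALLET is not shown here.

```python
def extra_stop_words(original_vocab, curated_vocab):
    """Terms present in the original TDM but dropped from the curated one."""
    return sorted(set(original_vocab) - set(curated_vocab))


# Made-up vocabularies for illustration.
original_vocab = ["account", "the", "login", "please", "password", "thanks"]
curated_vocab = ["account", "login", "password"]

stop_words = extra_stop_words(original_vocab, curated_vocab)
print("\n".join(stop_words))  # one term per line: please, thanks, the
```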
Side note: If you decide to stick with R alone, then I can recommend the quanteda package, which has a function dfm that accepts a thesaurus as one of its parameters. This thesaurus allows you to capture patterns (usually regex) as opposed to the words themselves, so for example you could have a pattern \\bsign\\w*.?ups? that would match sign-up, signed up and so on.
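To see what that pattern does and does not catch, here is a quick check using Python's re module (an assumption on my part: this is not quanteda, and regex dialects can differ in details). The sample strings are made up.

```python
import re

# Same pattern as in the text; a raw string avoids the doubled backslashes R needs.
SIGN_UP = re.compile(r"\bsign\w*.?ups?")

samples = ["sign-up", "signed up", "signups", "signature", "design-up", "signal upgrade"]
for text in samples:
    print(f"{text!r}: {bool(SIGN_UP.search(text))}")
```

Note that `.?` accepts any single character, so the pattern is fairly loose: "signal upgrade" also matches. Anchoring the end with `\b` or restricting the separator to a space or hyphen would tighten it.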
3. Find optimal parameters
This is a hard one. I tend to break data into test-train sets and run cross-validation fitting a model of k topics and testing the fit using held-out data. Log likelihood is recorded and compared for different resolutions of topics.
Problem: Log likelihood does help to understand how good the fit is, but (1) it often tends to suggest that I need more topics than is practically sensible and (2) given how long it generally takes to fit a model, it is virtually impossible to search a grid of values for parameters such as iterations, alpha, burn-in and so on.
Side note: When selecting the optimal number of topics, I generally select a range of topics incrementing by 5 or so as incrementing a range by 1 generally takes too long to compute.
4. Maintenance
It is easy to classify new data into a set of existing topics. However, if you are running it over time, you would naturally expect that some of your topics may cease to be relevant, while new topics may appear. Furthermore, it might be of interest to study the lifecycle of topics. This is difficult to account for, as you are dealing with a problem that requires an unsupervised solution and yet, for it to be tracked over time, you need to approach it in a supervised way.
Problem: To overcome the above issue, you would need to (1) fit new data into an old set of topics, (2) construct a new topic model based on new data (3) monitor log likelihood values over time and devise a threshold when to switch from old to new; and (4) merge old and new solutions somehow so that the evolution of topics would be revealed to a lay observer.
Recap of Problems
String cleaning for MALLET to consume the data is inefficient.
Feature selection requires manual work.
Optimal number of topics selection based on LL does not account for what is practically sensible
Computational complexity does not give the opportunity to find an optimal grid of parameters (other than the number of topics)
Maintenance of topics over time poses challenging issues as you have to retain history but also reflect what is currently relevant.
If you've read this far, I'd like to thank you; this is a rather long post. If you are interested in the subject, feel free to either add more questions in the comments that you think are relevant or offer your thoughts on how to overcome some of these problems.
Cheers
Thank you for this thorough summary!
As an alternative to topicmodels try the package mallet in R. It runs Mallet in a JVM directly from R and allows you to pull out results as R tables. I expect to release a new version soon, and compatibility with tm constructs is something others have requested.
To clarify, it's a good idea for documents to be at most around 1000 tokens long (not vocabulary). Any more and you start to lose useful information. The assumption of the model is that the position of a token within a given document doesn't tell you anything about that token's topic. That's rarely true for longer documents, so it helps to break them up.
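Breaking up long documents can be as simple as slicing the token list. A minimal sketch, assuming the document is already tokenised; split_into_chunks is a hypothetical name and the cut points ignore sentence or paragraph boundaries.

```python
def split_into_chunks(tokens, max_tokens=1000):
    """Split one tokenised document into consecutive pieces of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# A made-up document of 2500 tokens becomes three pseudo-documents.
document = ["token"] * 2500
chunks = split_into_chunks(document)
print([len(chunk) for chunk in chunks])  # [1000, 1000, 500]
```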
Another point I would add is that documents that are too short can also be a problem. Tweets, for example, don't seem to provide enough contextual information about word co-occurrence, so the model often devolves into a one-topic-per-doc clustering algorithm. Combining multiple related short documents can make a big difference.
Vocabulary curation is in practice the most challenging part of a topic modeling workflow. Replacing selected multi-word terms with single tokens (for example by swapping spaces for underscores) before tokenizing is a very good idea. Stemming is almost never useful, at least for English. Automated methods can help vocabulary curation, but this step has a profound impact on results (much more than the number of topics) and I am reluctant to encourage people to fully trust any system.
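A minimal sketch of the underscore trick, applied to the raw text before tokenising. The function name join_multiword_terms and the phrase list are made up; in practice the list would come from your own vocabulary curation.

```python
import re


def join_multiword_terms(text, phrases):
    """Replace each listed phrase with a single underscore-joined token."""
    # Longest phrases first, so "new york city" wins over "new york".
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.split()
        pattern = r"\b" + r"\s+".join(map(re.escape, words)) + r"\b"
        token = "_".join(words)
        # A function replacement avoids any backslash processing of the token.
        text = re.sub(pattern, lambda match, t=token: t, text, flags=re.IGNORECASE)
    return text


text = "I moved to New York City from new york state"
print(join_multiword_terms(text, ["new york", "new york city"]))
# I moved to new_york_city from new_york state
```

Matching is case-insensitive here and the replacement uses the casing of the phrase list, which is a design choice rather than a requirement.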
Parameters: I do not believe that there is a right number of topics. I recommend using a number of topics that provides the granularity that suits your application. Likelihood can often detect when you have too few topics, but after a threshold it doesn't provide much useful information. Using hyperparameter optimization makes models much less sensitive to this setting as well, which might reduce the number of parameters that you need to search over.
Topic drift: This is not a well understood problem. More examples of real-world corpus change would be useful. Looking for changes in vocabulary (e.g. proportion of out-of-vocabulary words) is a quick proxy for how well a model will fit.
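The out-of-vocabulary proxy is cheap to compute. A minimal sketch with a hypothetical function name oov_proportion and made-up data; what proportion should count as "too much drift" is something you would have to calibrate on your own corpus.

```python
def oov_proportion(tokens, vocabulary):
    """Fraction of tokens that do not appear in the model's training vocabulary."""
    if not tokens:
        return 0.0
    vocab = set(vocabulary)
    return sum(token not in vocab for token in tokens) / len(tokens)


# Made-up example: two of the four new tokens were never seen in training.
training_vocab = {"login", "password", "account"}
new_tokens = ["login", "crypto", "wallet", "account"]
print(oov_proportion(new_tokens, training_vocab))  # 0.5
```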
I have a module to test; the module includes a series of functions and simple classes.
I'm wondering whether there have been any attempts (i.e. a package) to automatically:
1) Generate Python test code from an initial Python file containing function definitions.
2) Have this code consist of a list of calls to the functions with random/parametric data as parameters.
It is technically feasible using inspect and Python metaclasses,
though usually limited to numerical functions (e.g. taking numpy arrays),
because string inputs (e.g. a URL) would be impossible to generate randomly (they could only be parametrized).
EDIT: By random, I obviously mean "parametric random".
Suppose we have
def f(x1, x2, x3): ...
For each xi of f:
    if type(xi) is array1D ->
        run these tests: empty array, zeros array, negative array (random),
        positive array (random), high values, low values, integer array, real
        number array, ordered array, equally spaced array, ...
    if type(xi) is int -> test zero, 1, 2, 3, 4, random values, negative values
Do people think such a project is possible using inspect and metaclasses (limited to numpy/numerical items)?
Suppose you have a very large library: things could be done in the background.
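A minimal sketch of the scheme described above, under some loud assumptions: the functions carry type annotations, plain lists stand in for 1D numpy arrays, and the EDGE_CASES catalogue is a small hypothetical sample rather than a complete set. It uses only inspect and itertools from the standard library, no metaclasses.

```python
import inspect
import itertools

# Hypothetical catalogue of edge-case values per annotated type.
EDGE_CASES = {
    int: [0, 1, -1, 2**31],
    float: [0.0, -1.5, 1e300],
    list: [[], [0, 0, 0], [-3, -1], [1, 2, 3]],
}


def candidate_calls(func):
    """Return every combination of edge-case values for func's annotated parameters."""
    params = inspect.signature(func).parameters.values()
    # Raises KeyError for unannotated or unsupported parameter types.
    pools = [EDGE_CASES[p.annotation] for p in params]
    return itertools.product(*pools)


def smoke_test(func):
    """Call func with every combination and collect the inputs that raise."""
    failures = []
    for args in candidate_calls(func):
        try:
            func(*args)
        except Exception as exc:
            failures.append((args, exc))
    return failures


# Example function under test (made up): breaks on an empty list.
def mean_scaled(values: list, factor: int) -> float:
    return sum(values) * factor / len(values)


for args, exc in smoke_test(mean_scaled):
    print(args, type(exc).__name__)
```

This only detects crashes; it says nothing about whether the returned values are correct, which is where property-based tools go further.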
You might be thinking of fuzz testing, where a bunch of garbage data is submitted to a function to see if anything makes it behave badly. It sounds like the Hypothesis library will let you generate different test cases based on some parameters.
I spent some time searching, and it seems this kind of project does not really exist (to my knowledge).
Technically, this is a mix of packages (and issues):
Hypothesis: data generation for inputs, running the code to look for crashes/errors
(without the invariant part of Hypothesis)
Jedi: static analysis of code / inference of types
Type inference is a difficult issue in Python (in general)
implementing type inference
If the type is a number or an array of numbers:
boundaries exist / typical usage is clearly defined
If the type is a string: inference is pretty difficult without human guessing.
The same goes for other types; context guessing is important.
I'm a data analysis student and I'm starting to explore Genetic Algorithms at the moment. I'm trying to solve a problem with GA but I'm not sure about the formulation of the problem.
Basically I have a state of a variable being 0 or 1 (0 it's in the normal range of values, 1 is in a critical state). When the state is 1 I can apply 3 solutions (let's consider Solution A, B and C) and for each solution I know the time that the solution was applied and the time where the state of the variable goes to 0.
So I have for the problem a set of data that have a critical event at 1, the solution applied and the time interval (in minutes) from the critical event to the application of the solution, and the time interval (in minutes) from the application of the solution until the event goes to 0.
I want to use a genetic algorithm to find out which solution is the best, and the fastest, for a critical event. And, if possible, I want to rank the solutions found, so that if in the future one solution can't be applied I can always apply the second best, for example.
I'm thinking of developing the solution in Python since I'm new to GA.
Edit: Specifying the problem (responding to AMack)
Yes, that is more or less it, but with some nuances. For example, function A may be the most suitable for making the variable go to F, but because the variable has other problems as well, more than one solution is applied. So in the data that I receive for an event of V, sometimes 3 or 4 functions are applied, but only 1 or 2 of them are specialized for the problem that I want to analyze. My objective is to build decision support for choosing the solution to use when a given problem appears. But there can be more than one optimal solution, because for some events function A acts very fast, while in other cases of the same event function A doesn't produce a fast response and function C is better. So in the end I want a solution that indicates the best solutions to the problem and not only the fastest one, because the solution that is fastest in the majority of cases is sometimes not the fastest for the same issue with a different background.
I'm unsure of what your question is, but here are the elements you need for any GA:
A population of initial "genomes"
A ranking function
Some form of mutation, crossing over within the genome
and reproduction.
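To show how those elements fit together, here is a minimal sketch of a GA on a toy problem (maximise the number of 1s in a bit string). Everything here is an illustrative assumption, including the string genome, the parameter values, and keeping the single best genome each generation; it is not tailored to the critical-event data in the question.

```python
import random

GENOME_LENGTH = 20


def fitness(genome):
    """Ranking function for the toy problem: count the 1s in the genome."""
    return genome.count("1")


def mutate(genome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in genome
    )


def crossover(mother, father, rng):
    """Single-point crossing over between two parent genomes."""
    point = rng.randrange(1, len(mother))
    return mother[:point] + father[point:]


def evolve(pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # Population of initial random genomes.
    population = [
        "".join(rng.choice("01") for _ in range(GENOME_LENGTH))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # the fitter half reproduces
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   mutation_rate, rng)
            for _ in range(pop_size - 1)
        ]
        population = [ranked[0]] + children  # keep the best genome unchanged
    return max(population, key=fitness)


best = evolve()
print(best, fitness(best))
```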
If a critical event is always the same, your GA should work very well. That being said, if you have a different critical event but the same genome, you will run into trouble. GAs evolve functions towards the best possible solution for a set of conditions. If you constantly run the GA so that it may adapt to each unique situation, you will find a greater degree of adaptability, but have a speed issue.
You have a distinct advantage using Python because string manipulation (what you'll probably use for the genome) is easy, however...
Python is slow.
If the genome is short, the initial population is small, and there are very few generations this shouldn't be a problem. You lose possibly better solutions that way but it will be significantly faster.
have fun...
You should take a look at the GARAGe at Michigan State. They are a GA research group with a fair number of resources in terms of theory, papers, and software that should provide inspiration.
To start, let's make sure I understand your problem.
You have a set of sample data, each element containing a time series of a binary variable (we'll call it V). When V is set to True, a function (A, B, or C) is applied which returns V to its False state. You would like to apply a genetic algorithm to determine which function (or solution) will return V to False in the least amount of time.
If this is the case, I would stay away from GAs. GAs are typically used for some kind of function optimization / tuning. In general, the underlying assumption is that what you permute is under your control during the algorithm's application (i.e., you are modifying parameters used by the algorithm that are independent of the input data). In your case, my impression is that you just want to find out which of your (I assume) static functions perform best in a wide variety of cases. If you don't feel your current dataset provides a decent approximation of your true input distribution, you can always sample from it and permute the values to see what happens; however, this would not be a GA.
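Finding out which static function performs best can be done with plain descriptive statistics. A minimal sketch, using made-up records and the median as one possible summary; rank_solutions is a hypothetical name, and a real analysis would probably also group by the event's context.

```python
from collections import defaultdict
from statistics import median

# Made-up records: (solution applied, minutes until the variable returned to 0).
events = [
    ("A", 12), ("A", 15), ("A", 40),
    ("B", 20), ("B", 22), ("B", 21),
    ("C", 9), ("C", 35), ("C", 30),
]


def rank_solutions(records):
    """Order solutions by median recovery time, fastest first."""
    times = defaultdict(list)
    for solution, minutes in records:
        times[solution].append(minutes)
    return sorted(times, key=lambda solution: median(times[solution]))


print(rank_solutions(events))  # ['A', 'B', 'C']
```

The ranked list also gives the fallback order directly: if the first solution cannot be applied, take the next one.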
Having said all of this, I could be wrong. If anyone has used GAs in verification like this, please let me know. I'd certainly be interested in learning about it.
Many behavioural experimental designs in psychology/neuroscience require conditional branching (e.g. only proceed to the test phase if a requisite performance level has been reached in an initial practice phase). PsychoPy’s Builder view allows one to generate a Python script to run an experiment using largely graphical controls. But it doesn't seem to have built-in support for conditional branching.
Can skipping a particular routine on a given run be implemented in Builder by using Python snippets in a Code component? Or does it require moving to the full Python Coder environment?
The Coder view in PsychoPy gives you full access to the Python programming language and hence you can implement arbitrarily complex experimental designs.
PsychoPy’s graphical Builder view, meanwhile, emphasises ease of use and simplicity over flexibility. One thing it does not cater for directly is conditional branching. It can, however, be hacked to achieve it indirectly.
Let’s say you have a three-phase experiment: a practice block, followed by one of two possible experimental blocks, conditionA or conditionB. After completing the practice block, high-performing subjects are assigned to conditionA, while low-performing subjects are assigned to conditionB.
To implement this in Builder, create three routines to represent the task blocks (Practice, conditionA, and conditionB). Each will also be surrounded by a loop (practice_loop, A_loop, and B_loop, respectively). Also insert a routine between Practice and conditionA (called, say, assignCondition).
In the assignCondition routine, place a Code component. Assume in this case that a performance score counter was maintained in the Practice routine. We can use this to change the number of repetitions of subsequent routines. That is, by setting the repetition number of a loop to zero, we ensure that the routines inside that loop will not be executed. Hence the number of repetitions of these loops will not be a fixed value, but instead a variable (say, repetitionsA and repetitionsB).
In the "Begin Routine" tab of the assignCondition routine's Code component, put a Python snippet like this:
if performanceScore > 25:
    repetitionsA = 50  # run this routine 50 times
    repetitionsB = 0  # don't run this condition at all
else:
    repetitionsA = 0  # vice versa: don't run this
    repetitionsB = 50  # do run this
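If you want to check the branching rule outside PsychoPy, the same logic can be written as a plain function. This is only a sketch: assign_repetitions and its parameters are hypothetical names, not part of PsychoPy, and the threshold of 25 and the 50 repetitions are simply the values from the snippet above.

```python
def assign_repetitions(performance_score, threshold=25, n_reps=50):
    """Return (repetitionsA, repetitionsB) for the two condition loops."""
    if performance_score > threshold:
        return n_reps, 0  # high performers: run conditionA only
    return 0, n_reps      # everyone else: run conditionB only


# In the Code component this would replace the if/else block:
# repetitionsA, repetitionsB = assign_repetitions(performanceScore)
print(assign_repetitions(30))  # (50, 0)
print(assign_repetitions(25))  # (0, 50)
```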
A fuller description of this technique is given by Matt Wall in a blog post here (with an fMRI block design as example, in which the order of blocks needs to be variable):
http://computingforpsychologists.wordpress.com/2013/11/12/how-to-hack-conditional-branching-in-the-psychopy-builder/