Best way to unit test data analysis with database calls - python

I have a bunch of classes that all look somewhat like this:
from collections import namedtuple

import pandas as pd

class FakeAnalysis(Analysis):
    def __init__(self, dbi, metric: str, constraints: dict) -> None:
        super().__init__(dbi)
        self.metric = metric
        self.constraints = constraints.copy()

    def load_data(self) -> pd.DataFrame:
        data = self.dbi.select_data(
            {"val"}, {"period"}, **self.constraints
        )
        return data

    def run(self) -> namedtuple:
        """Do some form of dataframe transformation."""
        data = self.load_data()
        df = data.pivot_table(columns='period', values='val', index='product_tag')
        return namedtuple("res", ['df'])(**{"df": df})
They all take a metric, constraints and a database interface class (dbi) as __init__ arguments. They all load the necessary data by fetching it through the dbi, then apply some sort of transformation to the resulting dataframe before returning it as a namedtuple containing the transformed data and any other byproducts (e.g. there could be multiple dataframes).
The question is: what is the best way to unit test such code? The errors are usually the result of a combination of constraints producing unexpected data that the code does not know how to deal with. Should I just test each class with randomly generated constraints and see if it crashes? Or should I create a mock database interface which returns fixed data for a few different constraints and ensure the class returns the expected results for just those constraints? The latter doesn't seem of much use to me, although it would be more along the lines of unit testing best practice...
Any better ideas?

This is what occurs to me.
You can validate the data first, and not worry about invalid data in your processing.
Alternatively, you can deal with invalid data without crashing: use try blocks to produce reasonable output for the user, or log errors, whatever is appropriate.
Unit test what your code does. Make sure it does what it says. Do it by mocking and inspecting the mock calls. Use mocks to return invalid data and test that it triggers the invalid-data exceptions you provided.
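A minimal sketch of that mocking approach with unittest.mock, assuming FakeAnalysis from the question is importable from a module called analysis (the module name, constraint keys and expected values below are made up):

from unittest.mock import MagicMock

import pandas as pd

from analysis import FakeAnalysis  # hypothetical module holding the class above

def test_run_pivots_loaded_data():
    dbi = MagicMock()
    dbi.select_data.return_value = pd.DataFrame(
        {"product_tag": ["a", "a", "b"],
         "period": [1, 2, 1],
         "val": [10.0, 11.0, 12.0]}
    )
    result = FakeAnalysis(dbi, metric="sales", constraints={"region": "EU"}).run()
    # The dbi was queried exactly once, with the constraints passed through.
    dbi.select_data.assert_called_once_with({"val"}, {"period"}, region="EU")
    # The pivot produced the expected cell.
    assert result.df.loc["a", 1] == 10.0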
If you find it difficult to enumerate all the cases that could go wrong (you may have to generalize a bit here, since the space of possible inputs is very large or infinite), it can be useful to stress the code with lots of randomly generated data; it will surface cases you had not imagined (trust me, this works).
Capture the failures you find and keep generating until a reasonable amount of random data (the typical size of your data, 10x that, or more; you choose) no longer triggers errors. Keep the random tests in place but reduce the number of tries so the suite runs fast again while you go on coding the rest of the system.
Of course, mock the database access for this.
Any time you find that data errors still happen, you can fix that case and increase the random tries to check more thoroughly. This is better than writing lots of specific cases by hand.
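One way to wire up that random-constraints idea, sketched with a trivial stand-in dbi; it again assumes FakeAnalysis is importable, the constraint keys and values are made up, and a real fake would vary the returned frame with the constraints:

import random

import pandas as pd

from analysis import FakeAnalysis  # hypothetical module, as above

class MockDBI:
    """Stand-in dbi: returns a small fixed frame no matter the constraints."""
    def select_data(self, values, periods, **constraints):
        return pd.DataFrame({"product_tag": ["a", "b"], "period": [1, 1], "val": [1.0, 2.0]})

POSSIBLE_CONSTRAINTS = {
    "region": ["EU", "US", None],
    "min_val": [0, 1, 10_000],
    "period": [1, 12, 999],
}

def random_constraints(rng):
    # Pick a random subset of constraint keys, each with a random value.
    keys = rng.sample(list(POSSIBLE_CONSTRAINTS), k=rng.randint(0, len(POSSIBLE_CONSTRAINTS)))
    return {key: rng.choice(POSSIBLE_CONSTRAINTS[key]) for key in keys}

def test_fuzz_constraints():
    rng = random.Random(0)  # seeded, so any failure is reproducible
    for _ in range(200):    # lower the count once things stabilise
        constraints = random_constraints(rng)
        result = FakeAnalysis(MockDBI(), metric="sales", constraints=constraints).run()
        assert result.df is not None, constraints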

Related

Making pandas code more readable/better organized

I am working on a data analysis project based on Pandas. The data to be analyzed is collected from application log files. Log entries are based on sessions, which can be of different types (and can have different actions); each session can then have multiple services (also with different types, actions, etc.). I have transformed the log file entries into a pandas dataframe and, based on that, completed all required calculations. At this point that's around a few hundred different calculations, which are printed to stdout at the end. If an anomaly is found, it is specifically flagged. So the basic functionality is there, but now that this first phase is done, I'm not happy with the readability of the code, and it seems to me that there must be a way to organize it better.
For example what I have at the moment is:
def build(log_file):
    # Build a dataframe from the log file entries.
    return df

def transform(df):
    # Transform the dataframe (for example based on grouped sessions, services).
    return transformed_df

def calculate(transformed_df):
    # Make calculations based on the transformed dataframe and print them to stdout.
    print(calculation1)
    print(calculation2)
    # etc.
Since there are numerous criteria for filtering the data, there are at least 30-40 different dataframe filters present. They are used in the calculate and transform functions. In the calculate functions I also have some helper functions that perform tasks applicable to similar session/service types, where the result is based just on the dataframe filtered for that specific type. With all these requirements, transformations and filters, I now have more than 1000 lines of code which, as I said, I feel could be more readable.
My current idea is to have perhaps classes organized like this:
class Session:
    # Main class for sessions (it can be inherited by other session types),
    # also with standardized output for calculations.

class Service:
    # Main class for services (it can be inherited by other service types),
    # also with standardized output for calculations, etc.

class Dataframe:
    # Dataframe class with filters, etc.
But I'm not sure if this is a good approach. I tried searching here, on GitHub and on various blogs, but I didn't find anything that provides examples of how best to organize code in more-than-basic pandas projects. I would appreciate any suggestion that would point me in the right direction.
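One possible direction for the filter sprawl described above, sketched with made-up column names: register each filter as a small named function and apply them by name, so the calculations read as a description rather than a wall of boolean indexing.

import pandas as pd

# Each filter is a tiny named function: DataFrame -> DataFrame.
FILTERS = {
    "login_sessions": lambda df: df[df["session_type"] == "login"],
    "failed_services": lambda df: df[df["service_status"] != "ok"],
    "long_sessions": lambda df: df[df["duration_s"] > 300],
}

def apply_filters(df: pd.DataFrame, *names: str) -> pd.DataFrame:
    """Apply registered filters in order, by name."""
    for name in names:
        df = FILTERS[name](df)
    return df

# calculate() then reads like a description of the report, e.g.:
# n_failed_logins = len(apply_filters(df, "login_sessions", "failed_services"))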

Tensorflow: lazily retrieve dataset per batch from a MySQL database

Disclaimer: In the past, I've predominantly used PyTorch, hence my reasoning is in accordance with how things are done in PyTorch as well.
I have a large database (MySQL) which I want to load as a dataset. It is not feasible to keep this dataset in memory at all times, hence it needs to be loaded lazily/on demand. My plan is to instantiate a Dataset object from the range of row ids and then retrieve the corresponding rows, much like how you would use file names/paths for large files such as images and then load them on access. The issue with this method is that I can only retrieve one row per worker thread, meaning that I have to issue a SELECT query for each. I found that storing a batch in a table and issuing a JOIN, as if it were a foreign key, is orders of magnitude faster.
My first thought was to apply a map operation over each batch, which would require me to call a function of that kind after I obtain the batch from the dataset. In PyTorch, I would be able to define all this behaviour in a class that inherits from its Dataset class, which I think is a cleaner way to do it and encapsulates this behaviour. Is there any way to do this (neatly) within tensorflow?
Bonus points if someone can conjure up a method that is perfectly encapsulated from the user (the user does not know how the dataset is internally stored and kept track of), yet conforms to the tensorflow API (i.e. a callable class to be used as a generator for tf.data.Dataset.from_generator()).
Edit: In PyTorch, a common implementation is as follows (which I consider to be "neat" and is encapsulated).
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, row_ids):
        # Store row ids, do any pre-processing if necessary.
        ...

    def __getitem__(self, item):
        # From the item (may be several), join all corresponding
        # database rows and apply post-processing.
        ...
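A rough sketch of the from_generator pattern mentioned above, under a couple of assumptions: rows have 16 float features, and fetch_batch stands in for the real temp-table-plus-JOIN query, fabricating rows so the sketch runs.

import numpy as np
import tensorflow as tf

def fetch_batch(ids):
    # Placeholder for the real "store ids in a table and JOIN" query.
    return np.random.rand(len(ids), 16).astype("float32")

class MySQLRowGenerator:
    """Callable wrapper: tf.data never sees how rows are stored or fetched."""
    def __init__(self, row_ids, batch_size):
        self.row_ids, self.batch_size = list(row_ids), batch_size

    def __call__(self):
        # Yield one batch of rows at a time, so only a batch is in memory.
        for start in range(0, len(self.row_ids), self.batch_size):
            yield fetch_batch(self.row_ids[start:start + self.batch_size])

dataset = tf.data.Dataset.from_generator(
    MySQLRowGenerator(range(1, 10_001), batch_size=256),
    output_signature=tf.TensorSpec(shape=(None, 16), dtype=tf.float32),
).prefetch(tf.data.AUTOTUNE)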

Elegant way to check data from a form

I think this is a general question, however, in my case I'm working with PyQt5 and Python 3.
I'm setting up a small piece of software which records data from a measurement device. Before a measurement I have to enter some data, which is then validated by the software for correctness: is a mandatory field filled in, is the input value within the allowed range, or, for example, given a start value A, a stop value B and a step width W, I have to validate that W <= B - A.
My question is: what's the most elegant way of checking the form? I can simply do it one field at a time:
class Form:
    ...
    def check_form(self):
        if self.fieldA.text() == "":
            return False
        if not self.check_range(self.fieldB.text()):
            return False
        # Check fields one by one...
        ...

    def check_range(self, val):
        if float(val) > self.max_val:  # text() returns a string, so convert first
            return False
        else:
            return True
But this isn't really pretty: it repeats code, it's a lot to write, and it's hard to maintain. So my question is: is there a better way? One idea that came up was to define an object (maybe a dict) which contains all the relevant form information, like label, default value, conditions and so on. I could put that form into a JSON file, then just load it and even generate the form when I need it. Some obstacles may come up, especially around handling drop-down lists, but I think this is at least one approach.
Anyway, surely someone else has struggled with this issue before me, so maybe there's a standard, elegant way to solve it.
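A minimal sketch of that table-driven idea in plain Python (the field names, limits and rules below are made up): the rules live in data and one generic loop applies them.

RULES = {
    "start": [("required", lambda v: v != ""),
              ("in range", lambda v: v == "" or 0.0 <= float(v) <= 100.0)],  # defers to "required" when empty
    "stop":  [("required", lambda v: v != "")],
    "step":  [("required", lambda v: v != "")],
}

def check_form(values: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = [f"{field}: {name}"
                for field, rules in RULES.items()
                for name, ok in rules
                if not ok(values.get(field, ""))]
    # Cross-field rule from the question: the step width must fit the interval.
    if not problems and float(values["step"]) > float(values["stop"]) - float(values["start"]):
        problems.append("step: must satisfy W <= B - A")
    return problems

print(check_form({"start": "0", "stop": "10", "step": "2"}))   # []
print(check_form({"start": "0", "stop": "10", "step": "20"}))  # ['step: must satisfy W <= B - A']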
A good UX practice is to tell the user what the software expects instead of making them guess, possibly enter an incorrect value, and then be informed/misinformed by a cryptic error message they don't understand!
In the situation you described, clearly indicate an asterisk (*) next to the input label to mark it as a mandatory field. The software should possibly also come up with a sensible default value. Using the right widget for the particular problem you've described would also be a good idea: instead of three QLineEdit fields that take numerical input as text, I'd just use QSpinBox or QDoubleSpinBox, which are designed to request a value from a well-constrained range.
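A tiny sketch of that spin-box suggestion (the range and step values are placeholders): the widget itself refuses out-of-range input, so there is nothing left to validate for that field.

import sys

from PyQt5.QtWidgets import QApplication, QDoubleSpinBox

app = QApplication(sys.argv)
start = QDoubleSpinBox()
start.setRange(0.0, 100.0)   # out-of-range values simply cannot be entered
start.setSingleStep(0.5)
start.show()
sys.exit(app.exec_())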

Having problems keeping a simulation deterministic with random.Random(0) in python

I have a very large simulation in python with lots of modules. I call a lot of random functions. To keep the same random results I have a variable keep_seed_random.
As so:
import random
keep_seed_random = True
if keep_seed_random is False:
    fixed_seed = random.Random(0)
else:
    fixed_seed = random
Then I use fixed_seed all over the program, such as
fixed_seed.choice(['male', 'female'])
fixed_seed.randint(1, 100)
fixed_seed.gammavariate(3, 3)
fixed_seed.random()
fixed_seed.randrange(20, 40)
and so on...
It used to work well.
But now that the program has grown so large, something else is interfering and the results are no longer identical, even when I choose keep_seed_random = False.
My question is whether there is any other source of randomness in Python that I am missing?
P.S. I import random just once.
EDITED
We have been trying to pinpoint the exact moment when the program went from giving exactly the same results to giving different ones. It seemed to be when we introduced a lot of database reads that have no connection to the random modules.
The results now ALTERNATE between two similar results.
That is, I run main.py once and get a GDP result of 8148.78
I run again and get 7851.49
Again: 8148.78
Again: 7851.49
Also, for the working version before the change: the first run (when we create instances and pickle-save them) gives one result; from the second run onwards the results are the same. So I am guessing it is related to pickle reading/loading.
The question remains!
2nd EDITED
We partially found the problem.
The problem occurs when we create instances, pickle-dump them and then pickle-load them.
We still cannot get exactly the same results between creating and just loading.
However, when loading repeatedly, the results are identical.
Thus, the problem is in PICKLE.
Some randomization may occur when dumping and loading (I guess).
Thanks,
This is difficult to diagnose without a good reproduction case, as @mart0903 mentions. However, in general, there are several possible sources of randomness. A few things come to mind:
If, for example, you are using the multiprocessing and/or subprocess packages to spawn several parallel processes, you may be experiencing a race condition: different processes finish at different times each time you run the program, and perhaps you are combining the results in a way that depends on them executing in a particular order.
Perhaps you are simply looping over a dictionary and expecting the keys to be in a certain order, when in fact dictionaries are not ordered (at least not before Python 3.7). For example, run the following a couple of times in a row (I'm using Python 3.5 in case it matters) and you'll notice that the key-value pairs print out in a different order each time:
if __name__ == '__main__':
    data = dict()
    data['a'] = 6
    data['b'] = 7
    data['c'] = 42
    for key in data:
        print(key + ' : ' + str(data[key]))
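If iteration order does turn out to be the culprit, iterating deterministically removes that source of run-to-run variation, for example:

data = {'a': 6, 'b': 7, 'c': 42}
for key in sorted(data):          # sorted() fixes the visit order
    print(key + ' : ' + str(data[key]))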
You might also be looking at time-stamps to set some value, or perhaps generating a uuid somewhere that you then use in a calculation.
The possibilities could go on, but again, it's difficult to nail down without a simple reproduction case. It may just take some good ol' breakpoints and a lot of stepping through code.
Good luck!

Which strategy would be best: saving a value as a field or just computing it on demand with a method

Which of these two strategies would be better for calculating upvotes/downvotes for a post:
These are model fields:
ups
downs
total
def save(self, *args, **kwargs):  # Grab total value when needed
    self.total = self.ups - self.downs
    super(YourModel, self).save(*args, **kwargs)
Versus:
ups
downs
def total(ups, downs):  # totals will just be computed when needed
    return ups - downs  # versus saved as a column
Is there really any difference? Speed? Style?
Thanks
I would do the latter. Generally, I wouldn't store any data in the database that can be derived from other data, unless the calculation is time-consuming; in this case it is trivial. The reason is that if you store derived data, you introduce the possibility of consistency errors.
Note that I would do the same with class instances: no need to store the total if you can make it a property. Fewer variables means less room for error.
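A minimal sketch of that property approach on a Django model (the ups/downs fields follow the question; the model name is made up):

from django.db import models

class Post(models.Model):
    ups = models.IntegerField(default=0)
    downs = models.IntegerField(default=0)

    @property
    def total(self):
        # Derived on access; nothing to keep in sync in the database.
        return self.ups - self.downs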
I totally agree with @Caludiu. I would go with the second approach, but as always there are pros and cons:
The first approach seems harmless, but it can give you headaches in the future. Think about your application's evolution: what if you want to add more calculations derived from the values in your models? To stay consistent you would have to save them in the database too, and then you would be storing a lot of "duplicated" information. The tables derived from your models won't be normalized, and they will not only grow unnecessarily but also increase the possibility of consistency errors.
On the other hand, if you take the second approach, you won't have any database design problems, but you could end up with some tough Django queries, because you need to do a lot of computation to retrieve the information you want. These kinds of calculations are ridiculously easy as an object method (or message, if you prefer), but when you want to express them in a Django-style query you will see how some things get complicated.
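For instance, if the total is needed inside a query (for ordering or filtering), it can still be computed in the database on the fly with an F expression, using the Post model sketched above:

from django.db.models import F

# Annotate each post with its score and sort by it, all in one query.
top_posts = Post.objects.annotate(score=F('ups') - F('downs')).order_by('-score')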
Again, in my opinion you should take the second approach, but it's on you to make the decision that best fits your needs...
Hope it helps!
